Friday 8 July 2011

Some ponderings on Google's research on inter-language linking (Bengali <-> Swahili, Nepali <-> Marathi)

On the Google Research Blog, the latest post (by ) concerns inter-language linking, i.e. looking at webpages' off-site links which go to a page in another language. From the post:
Most web pages link to other pages on the same web site, and the few off-site links they have are almost always to other pages in the same language. It's as if each language has its own web which is loosely linked to the webs of other languages. However, there are a small but significant number of off-site links between languages. These give tantalizing hints of the world beyond the virtual.
I'm particularly interested in the data on Indian language webpages' inter-language linking, especially as there are some perplexing findings. But let's start with some findings which aren't really that surprising.

One of the features measured is the degree to which webpages in a particular language are "introverted" or "extroverted", where more "introverted" webpage languages have fewer inter-language off-site links. The data are summarised here:

Webpage languages which are higher (on the y-axis) are more introverted; webpage languages which are further to the right (on the x-axis) represent languages with a greater number of total webpages.
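
To make the "introversion" measure concrete, here is a minimal sketch in Python. The Google post does not give an exact formula, so this assumes introversion is simply the share of a language's off-site links that stay within the same language; the function name and the sample links are made up for illustration.

```python
from collections import Counter

def introversion(links):
    """Per source language, the share of off-site links that stay in-language.

    `links` is an iterable of (source_language, target_language) pairs.
    This definition is an assumption, not the formula used in the Google post.
    """
    total = Counter()
    same = Counter()
    for src, dst in links:
        total[src] += 1
        if src == dst:
            same[src] += 1
    return {lang: same[lang] / total[lang] for lang in total}

# Made-up sample: four Hindi off-site links and two English ones.
sample_links = [
    ("hi", "hi"), ("hi", "hi"), ("hi", "hi"), ("hi", "en"),
    ("en", "en"), ("en", "fr"),
]
scores = introversion(sample_links)
print(scores)
```

On this toy input the Hindi pages come out as more "introverted" than the English ones, since a larger share of their off-site links stay within the same language.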

First, a word about the apparently high degree of English-language webpage "extroversion". The relatively high percentage of English-language websites which link to non-English websites is unlikely to represent a high percentage of native English speakers who are linking to non-English websites. Rather, this would seem to simply reflect English's status as a/the world language, so that even sites whose audience may largely consist of non-native English speakers may choose to create English-language websites simply in order to have a larger audience. And I suspect the "extroverted" English-language webpages are of that type: English is the language chosen for this type of website due to its ability to reach a more "universal" audience, but the site itself may have "local" interests, reflected by its linking to non-English language websites.

But it's the Indian languages that I really want to talk about. Given the large number of Hindi speakers, one might at first be surprised at the relatively small number of Hindi-language sites (compared to, say, Japanese). This, I think, is easily explained by the status of English in India, especially amongst people who would be more likely to create and use Internet sites. In other words, many native Hindi speakers would choose to create English- rather than Hindi-language webpages. The high degree of insularity ("introversion") of Hindi-language webpages in terms of inter-language linkage is likely not unconnected. In the context of modern India, choosing to create a Hindi- rather than English-language website is already a more "insular" choice, given the widespread use of English in India itself. Those website content creators who choose Hindi medium over English medium are likely to have more "insular" interests, and thus would not be as likely to link to non-Hindi sites (and even less likely to link to non-Indian language sites).

So, thus far, there isn't really anything terribly surprising about these findings. But when we look at the particular inter-language link connections which are strongest, especially in the case of Indian languages, there are some weird data:

[The arrows indicate directionality of linkage; red connections are stronger than green connections.] As the authors point out:
Surprising links include those from Hindi to Ukrainian, Kurdish to Swedish, Swahili to Tagalog and Bengali, and Esperanto to Polish.
I would add that the Swahili-Bengali and Swahili-Tagalog links are not only strong (red), but also bidirectional (e.g. Swahili pages are linking to Bengali pages, and Bengali pages to Swahili pages). It is hard to think of convincing explanations for the connections between Swahili and Bengali (or Swahili and Tagalog). One possibility comes to mind, which is that, in terms of total Internet representation, the number of pages in Bengali, Swahili, and Tagalog is relatively small. Here the Google researchers' webpage selection criteria are presumably relevant:
The particular choice of pages in our corpus here reflects decisions about what is `important'. For example, in a language with few pages every page is considered important, while for languages with more pages some selection method is required, based on pagerank for example.
This means that for languages with a smaller Internet population, individuals could have a greater effect on the particular inter-language linkages than is the case for languages with larger Internet populations. And thus perhaps the existence of a few creators of Bengali webpage content who happen to live in central eastern Africa could be responsible for some of these unexpected inter-language linkages. I would be curious to know what sort of Bengali sites link to Swahili sites (and vice-versa), to see if this is a plausible idea.
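
The arithmetic behind this small-corpus effect can be sketched in a few lines of Python. All of the numbers here are hypothetical (the Google post does not publish per-language page counts); the point is only that the same handful of pages weighs very differently depending on corpus size.

```python
def link_share(linking_pages, corpus_size):
    """Percentage of a language's pages that carry a given inter-language link."""
    return 100.0 * linking_pages / corpus_size

# Hypothetical: a few authors produce 50 pages linking to another language.
few_authors_pages = 50

small_corpus_share = link_share(few_authors_pages, 1_000)      # small-language corpus
large_corpus_share = link_share(few_authors_pages, 1_000_000)  # large-language corpus

print(f"small corpus: {small_corpus_share:.3f}%")
print(f"large corpus: {large_corpus_share:.3f}%")
```

Under these made-up figures the same 50 pages account for a thousand times larger share of the small corpus than of the large one, which is the sense in which a few individuals could dominate the apparent linkages of a small-language web.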

There is something which worries me about these data though: look at the linkages between the Indo-Aryan languages (Punjabi, Gujarati, Marathi, Bengali, Nepali, Hindi). Punjabi, Gujarati, Marathi, Bengali, and Nepali all have strong bidirectional links with Hindi, which is to be expected given Hindi's status as an Indian lingua franca. Notice however that, other than being linked with Hindi, none of the other Indo-Aryan languages are inter-linked with each other, except for Nepali and Marathi.

In India, there are large Nepali communities in West Bengal and other eastern parts of India. Marathi is spoken in Maharashtra in the far western part of India. I would be unsurprised if there were strong Marathi-Gujarati inter-language linkages (since these two languages are spoken in neighbouring states), or if there were a strong inter-language linkage between Nepali and Bengali. But a Nepali-Marathi link doesn't make sense, at least in the absence of other intra-Indo-Aryan linkages.

There is one property which I can think of which does link Nepali and Marathi, namely the fact that they both are written in Devanagari script (also used for Hindi). Gujarati, Punjabi, and Bengali, on the other hand, are each written in their own scripts (distinct from Devanagari). So I wonder if there is any possibility that the script is creating "false hits" when the off-site link connections for Nepali and Marathi are being computed. 
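
To illustrate how script alone could produce such "false hits", here is a small Python sketch. It is emphatically not Google's method (which is not described in the post); it is a deliberately naive, hypothetical classifier that labels text by its dominant Unicode script, using the first word of each character's Unicode name as the script label.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most common Unicode script label in `text`, or None.

    Naive assumption: the first word of a character's Unicode name
    (e.g. "DEVANAGARI LETTER NA") identifies its script.
    """
    counts = Counter()
    for ch in text:
        # Count letters and combining marks (vowel signs, virama, etc.).
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

# Each language's own name for itself, as sample text.
samples = {
    "Nepali": "नेपाली",
    "Marathi": "मराठी",
    "Hindi": "हिन्दी",
    "Gujarati": "ગુજરાતી",
    "Bengali": "বাংলা",
    "Punjabi": "ਪੰਜਾਬੀ",
}
scripts = {lang: dominant_script(text) for lang, text in samples.items()}
for lang, script in scripts.items():
    print(lang, script)
```

A script-only check like this separates Gujarati, Bengali, and Punjabi (Gurmukhi) cleanly, but it cannot tell Nepali, Marathi, and Hindi apart, since all three come out as Devanagari. Any pipeline step that leaned on script in this way would lump them together.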

That also makes me worry about the other surprising inter-language linkages, such as Bengali-Swahili, Swahili-Tagalog. Not, obviously, that these languages share a common script, but whether some of the apparent connections are artefacts of the algorithm, whether due to use of a common script or some other factor. If they're not simply artefacts, then it certainly would be interesting to find out why, for instance, Bengali-language and Swahili-language webpages are linking to each other.


  1. Script doesn't directly influence Google's judgment of language, as far as I know. However, it seems to me that it's easier to read a related language in a familiar script than a related language in an unfamiliar script. It may well be the scripts as much as anything else that keeps the major Dravidian languages apart.

    What is really a good candidate for a statistical anomaly is the notion that 12% of Belarusan pages link to Armenian pages and 17% link back. Very distant relatives, not geographically close, no common script? There aren't even any non-stop flights from Minsk to Yerevan, which was the next thing I thought of. I don't think it's likely.

  2. I'm not convinced that script doesn't influence Google's judgement of language. If that were true, then what criteria do they use to distinguish Hindi from Urdu? (Note that their data includes Hindi and Urdu as two distinct languages, which makes sense from both orthographic and cultural angles).

    On reading related(/structurally similar) languages in familiar scripts vs. related languages in unfamiliar scripts:

    That may well be true, but I don't think it's the reason behind the apparent connections found in this study in the case of Nepali and Marathi.

    It's possible, in the case of Nepali and Marathi, that it's not the fault of the Google algorithm calculating inter-language connections. Perhaps there are Nepali aggregators which are mistakenly pulling in Marathi content because of the script (and vice-versa).

    I don't know what to say about the Belarusan-Armenian connection. The Swahili-Bengali one is even worse: not related, not geographically close, and no common script.

    So I wonder what's going on.

  3. Thanks for your review of this material. I liked your point about the insularity of Hindi-language websites, and think you're probably on to something with the script connection, especially because I was puzzling over the Hindi-Urdu issue.
    A question related to John's point about reading related languages that use the same script: how similar ARE Nepali and Marathi? I can't read Urdu at all, even though I could get by in conversation, so I get the "familiar language/different script" issue, but I don't find Nepali and Hindi that similar. I can "read" Nepali, in the sense of pronouncing the words, but even as my Hindi gets better, I find that Nepali still stumps me most times. When I transliterate Gujarati into Devanagari, I get a much better sense of meaning than I do from the Devanagari Nepali, so I'd be interested in knowing just how close Nepali and Marathi are.

  4. It's interesting that of the Dravidian languages, all of them are well-linked except Tamil. I would have expected Tamil and Malayalam to have a strong bidirectional link, as they are the most closely related of the family, but instead the only bidirectional link Tamil shares is with Kannada. Perhaps that's because Bangalore city has large populations of both Kannada and Tamil speakers, whereas there's no metro area where Tamil and Malayalam are both spoken.

    And THAT makes me wonder if it's urban centres that are key here... Mumbai draws a lot of migrant Nepali-speaking labor. So perhaps it's govt. and NGO websites in Mumbai that explain the Marathi-Nepali connection? Just speculating

  5. @maxqnz: Marathi and Nepali aren't particularly similar. I mean all of the North Indo-Aryan languages are similar in many respects, but---given this---Nepali and Marathi aren't particularly close. There are various genetic groupings which have been proposed for Indo-Aryan, but in none of these are Nepali and Marathi part of the same group (other than both being "Outer" languages, in contrast to "Central" languages; the latter group consists more or less of Hindi and Punjabi). Nepali is sometimes analysed as a "Northern Indo-Aryan" language and Marathi as a "Southern Indo-Aryan" language.

    Gujarati---though not a "Southern Indo-Aryan" language---is certainly closer linguistically (and geographically) to Marathi than Nepali is.

    [Re: "reading" Nepali - you can't even quite apply the same pronunciation rules to Nepali as to Hindi. There are a number of differences, including the fact that ऐ is /ai/ and औ is /au/ in Nepali, rather than /æ/ (or /ɛ/) and /ɔ/, as in most varieties of Hindi. Further, intervocalic /h/ is dropped in Nepali, and schwas are pronounced in places you wouldn't expect them to be given Hindi pronunciation rules.]

  6. Thanks for the clarification on Nepali - I'd picked up some of those from my Nepali friends, but didn't know about the intervocalic /h/ or the schwas.

  7. @Blaft Publications: Notice that all of the Dravidian languages are strongly and bidirectionally linked to Hindi as well---except for Tamil. At first I put this down to Tamil-Hindi language politics, but---as you note---it's odd that Tamil and Malayalam aren't more strongly linked, given their close genetic and geographic associations.

    The idea of urban centres where both languages are well-represented is interesting: but does it explain the strong Malayalam-Kannada, Malayalam-Telugu, Telugu-Kannada linkages?

    And certainly Gujarati should be well represented in Mumbai, so why no strong Gujarati-Marathi linkage? Sure, Mumbai draws lots of Nepali-speaking labour, but as I understand it there are also lots of poor Bengalis (both West Bengalis & Bangladeshis) in Mumbai as well -- so why no Marathi-Bengali linkage?

    So I still wonder what aspects of these linkages are real and which are artefacts, either of the collection process itself, or some "non-human" Internet component (like overly aggressive content aggregators).

  8. "Malayalam-Kannada, Malayalam-Telugu, Telugu-Kannada". This may be half-baked at best, but is it possible that cinema is part of the connection here? South Indian films have a very strong internet presence, and it seems that most big hits in one Dravidian language get remade in one or more of the others. Certainly if English-language sites devoted to South Indian films are any guide, the sort of connections described above are unsurprising.

  9. Sure, I would not be surprised if some of the connections are cinema-based. But the problem is that that still leaves the gaps unexplained. Why is Tamil so poorly connected? There are obviously lots of hit Tamil films, which get dubbed and/or remade not only into other Dravidian languages, but also into Hindi (Enthiran is but the latest example).

    Re: Nepali "h". I learned Nepali from books before learning from native speakers, and really was confused by this for some time. Prominently, the phrase थाहा छ (equivalent to, and used in the same ways as Hindi मालूम है) is *always* pronounced like "thaa cha" (where थाहा is rendered as monosyllabic!). Also the word बाहुन "brahmin", I still want to pronounce with an "h".

    (Hindi *sometimes* drops intervocalic "h"s, in fast speech, e.g. in words like बहिन, but Nepali *always* drops them.)

  10. @be_slayed: Mangalore might explain Malayalam-Kannada, but yeah, hard to explain the others. maxqnz's cinema explanation is probably better. Then again, Swahili-Tagalog is so goofy, maybe it shouldn't be taken too seriously.

  11. @horshod: Yes, I was thinking about something along those lines. What the explanation for Swahili<->Bengali is though, I don't know.

  12. The first thing that I could think of when I saw the Marathi-Nepali linkage was that the algorithm is messing it up. Whenever I am on a Marathi webpage, Google asks me if I want to translate that page "which is in Hindi". I guess, as Marathi is not one of the languages supported by Google Translate, it is probably reading all the Sanskrit-like (tadbhav) and Sanskrit (tatsam) words in Marathi (and not just the script) to conclude that the page is in Hindi. The same thing might be happening in the research experiment with Marathi and Nepali.

  13. Swahili is a language of a part of Africa which used to have a large Indian commercial community. However, I'd have expected the connection to be from Swahili to a language of the west of India, not Bengali. Or were the Indian merchants in Africa primarily Bengalis?

  14. @Anthony: The Indian communities in Africa are primarily made up of speakers of Western Indian languages (Gujarati, Punjabi, Hindi etc.). The only exception is Mauritius (which isn't mainland Africa anyway), where there are Bihari (Bhojpuri, I believe) speakers. But I don't know of any substantial Bengali communities in Africa.

    So, yeah, it's still a weird connection.

  15. Marathi did not originate entirely from the North Indian language group; it descends from Maharashtri. Also, the original script of Marathi is 'Modi' (मोडी), which is quite similar to the Gujarati script. And I don't think that it is linked with Nepali, as the word "Nepali" is written as
    नेपाली in Nepali
    नेपाळी in Marathi.
    There is no similarity between these languages.
    Nepali is from Pali and Marathi is from Maharashtri.

  16. @Ke!)@र: Nepali isn't from Pali any more than Marathi is from Pali. Both languages come from Apabhramsha. Sure, the script that these languages are written in may have changed at various points in time. But the point is that right now both Nepali and Marathi use Devanagari (there are some minor differences of course).
