Friday, 8 July 2011

Some ponderings on Google's research on inter-language linking (Bengali <-> Swahili, Nepali <-> Marathi)

On the Google Research Blog, the latest post concerns inter-language linking, i.e. webpages' off-site links that go to a page in another language. From the post:
Most web pages link to other pages on the same web site, and the few off-site links they have are almost always to other pages in the same language. It's as if each language has its own web which is loosely linked to the webs of other languages. However, there are a small but significant number of off-site links between languages. These give tantalizing hints of the world beyond the virtual.
I'm particularly interested in the data on Indian language webpages' inter-language linking, especially as there are some perplexing findings. But let's start with some findings which aren't really that surprising.

One of the features measured is the degree to which webpages in a particular language are "introverted" or "extroverted", where more "introverted" webpage languages have fewer inter-language off-site links. The data are summarised here:

Webpage languages which are higher (on the y-axis) are more introverted; webpage languages which are further to the right (on the x-axis) represent languages with a greater number of total webpages.
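To make the "introversion" measure concrete, here is a minimal sketch in Python. It assumes that introversion is simply the share of a language's off-site links which point to pages in the same language; I don't know exactly how the Google post computes it. The function name and the toy data are made up for illustration.

```python
from collections import Counter

def introversion(links):
    """Return, per source language, the share of off-site links that stay
    within the same language.

    `links` is an iterable of (source_language, target_language) pairs,
    one pair per off-site link. This definition is an assumption, not
    necessarily the one used in the Google post.
    """
    total = Counter()
    same = Counter()
    for src, dst in links:
        total[src] += 1
        if src == dst:
            same[src] += 1
    return {lang: same[lang] / total[lang] for lang in total}

# Toy data (invented): four Hindi off-site links, two English ones.
toy_links = [
    ("hi", "hi"), ("hi", "hi"), ("hi", "hi"), ("hi", "en"),
    ("en", "en"), ("en", "hi"),
]
scores = introversion(toy_links)
print(scores)
```

On this toy data Hindi comes out as more introverted than English, since three of its four off-site links stay within Hindi while only one of the two English links stays within English.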

First, a word about the apparently high degree of English-language webpage "extroversion". The relatively high percentage of English-language websites which link to non-English websites is unlikely to represent a high percentage of native English speakers who are linking to non-English websites. Rather, this would seem to simply reflect English's status as a/the world language, so that even sites whose audience may largely consist of non-native English speakers may choose to create English-language websites simply in order to have a larger audience. And I suspect the "extroverted" English-language webpages are of that type: English is the language chosen for this type of website due to its ability to reach a more "universal" audience, but the site itself may have "local" interests, reflected by its linking to non-English language websites.

But it's the Indian languages that I really want to talk about. Given the large number of Hindi speakers, one might at first be surprised at the relatively small number of Hindi-language sites (compared to, say, Japanese). This, I think, is easily explained by the status of English in India, especially amongst people who would be more likely to create and use Internet sites. In other words, many native Hindi speakers would choose to create English- rather than Hindi-language webpages. The high degree of insularity ("introversion") of Hindi-language webpages in terms of inter-language linkage is likely not unconnected. In the context of modern India, choosing to create a Hindi- rather than English-language website is already a more "insular" choice, given the widespread use of English in India itself. Those website content creators who choose Hindi medium over English medium are likely to have more "insular" interests, and thus would not be as likely to link to non-Hindi sites (and even less likely to link to non-Indian language sites).

So, thus far, there isn't really anything terribly surprising about these findings. But when we look at the particular inter-language link connections which are strongest, especially in the case of Indian languages, there are some weird data:

[The arrows indicate directionality of linkage; red connections are stronger than green connections.] As the post's authors point out:
Surprising links include those from Hindi to Ukrainian, Kurdish to Swedish, Swahili to Tagalog and Bengali, and Esperanto to Polish.
I would add that the Swahili-Bengali and Swahili-Tagalog links are not only strong (red), but also bidirectional (i.e. Swahili pages are linking to Bengali pages, and Bengali pages to Swahili pages). It is hard to think of convincing explanations for the connections between Swahili and Bengali (or Swahili and Tagalog). One possibility comes to mind, which is that, in terms of total Internet representation, the number of pages in Bengali, Swahili, and Tagalog is relatively small. Here the Google researchers' webpage selection criteria are presumably relevant:
The particular choice of pages in our corpus here reflects decisions about what is `important'. For example, in a language with few pages every page is considered important, while for languages with more pages some selection method is required, based on pagerank for example.
This means that for languages with a smaller Internet population, individuals could have a greater effect on the particular inter-language linkages than is the case for languages with larger Internet populations. And thus perhaps the existence of a few creators of Bengali webpage content who happen to live in central eastern Africa could be responsible for some of these unexpected inter-language linkages. I would be curious to see what sort of Bengali sites link to Swahili sites (and vice versa), to judge whether this is a plausible idea.

There is something which worries me about these data though: look at the linkages between the Indo-Aryan languages (Punjabi, Gujarati, Marathi, Bengali, Nepali, Hindi). Punjabi, Gujarati, Marathi, Bengali, and Nepali all have strong bidirectional links with Hindi, which is to be expected given Hindi's status as an Indian lingua franca. Notice, however, that apart from their links with Hindi, none of the other Indo-Aryan languages are linked with each other, with one exception: Nepali and Marathi.

In India, there are large Nepali communities in West Bengal and other eastern parts of India. Marathi is spoken in Maharashtra in the far western part of India. I would be unsurprised if there were strong Marathi-Gujarati inter-language linkages (since these two languages are spoken in neighbouring states), or if there were a strong inter-language linkage between Nepali and Bengali. But a Nepali-Marathi link doesn't make sense, at least in the absence of other intra-Indo-Aryan linkages.

There is one property which I can think of which does link Nepali and Marathi, namely the fact that they both are written in Devanagari script (also used for Hindi). Gujarati, Punjabi, and Bengali, on the other hand, are each written in their own scripts (distinct from Devanagari). So I wonder if there is any possibility that the script is creating "false hits" when the off-site link connections for Nepali and Marathi are being computed. 
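To illustrate why script alone could produce this kind of confusion, here is a small Python sketch. It is emphatically not a reconstruction of whatever language-identification method the Google researchers used (I have no idea what that is); it only shows that a classifier which looked at nothing but the Unicode script of the characters could not tell Nepali, Marathi, and Hindi apart, while Bengali, Gujarati, and Punjabi would each be trivially distinguishable. The function name and the sample words (each language's own name for itself) are my own choices.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the script of `text` from the first word of each character's
    Unicode name (e.g. 'DEVANAGARI LETTER MA' -> 'DEVANAGARI').

    A deliberately naive heuristic, for illustration only.
    """
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

# Each language's name for itself, written in its usual script.
samples = {
    "Hindi": "हिन्दी",
    "Marathi": "मराठी",
    "Nepali": "नेपाली",
    "Bengali": "বাংলা",
    "Gujarati": "ગુજરાતી",
    "Punjabi": "ਪੰਜਾਬੀ",
}

scripts = {lang: dominant_script(word) for lang, word in samples.items()}
for lang, script in scripts.items():
    print(f"{lang}: {script}")
```

Hindi, Marathi, and Nepali all come out as DEVANAGARI, whereas Bengali, Gujarati, and Punjabi (Gurmukhi) each get a script of their own. Any real language identifier would of course use more than script, but if script carries a lot of weight, Nepali/Marathi misclassification seems at least conceivable.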

That also makes me worry about the other surprising inter-language linkages, such as Bengali-Swahili and Swahili-Tagalog. These languages obviously do not share a common script, but I do wonder whether some of the apparent connections are artefacts of the algorithm, whether due to use of a common script or some other factor. If they're not simply artefacts, then it certainly would be interesting to find out why, for instance, Bengali-language and Swahili-language webpages are linking to each other.

Sunday, 3 July 2011

What speechitatest you? On engineered language change amongst high schoolers

The latest Saturday Morning Breakfast Cereal, on high school language change:

The type of language change the students are shown undergoing would require more than a source of new lexical items, I would think.

We find morphological change: Wouldsest for 2nd person singular present of "would".

And syntactic change: What speechitated Harvard? for "What did Harvard say?" (note the necessity of do-periphrasis in modern English).

How could a thesaurus (of fake synonyms) drive these sorts of changes? [Of course, under Minimalism, parametric variation, including differences in word order, is theorised to be a reflex of formal features which are borne by lexical items. So perhaps if the thesaurus had some way of encoding abstract syntactic features in such a way that they would be picked up along with the phonological and semantic aspects of the lexical item....]