Tuesday, 3 August 2010

Computational approaches to understanding language evolution [video]

In her recent ILLS talk, Tandy Warnow discusses computational approaches for inferring language evolution and linguistic relationships:

Computational methods for inferring evolutionary histories of languages

A page with links to various publications associated with this project (and the datasets used) is available here: http://www.cs.rice.edu/~nakhleh/CPHL/.

One of the interesting points of this study is the position of Germanic relative to the other branches of Indo-European. Germanic, at least when the morphological data is given more weight, is not particularly closely related to Italic or Celtic, though it shares a number of lexical similarities with these groups. This suggests a later migration of Germanic-speaking peoples into an area where they came into contact with Italo-Celtic speakers. In any case, it's an interesting approach to historical data.

For more ILLS2 videos, see this link: http://ills.linguistics.illinois.edu/current/showcase.html.


  1. First of all, welcome back!

    Second, you might want to embed the picture from the CPHL home page: it's a very nice clear tree for the 24 languages in the IE database.

    Third, the great thing about this particular work is that it uses cladistic methods, but bases them on the output of the comparative method rather than trying to ignore or bypass it. As a simple example, since Armenian erku 'two' is known to be regularly derived from PIE, it is correctly treated as part of the same lexical character as the other IE languages, whereas the usual run of mass-comparison folks would treat it as different because it doesn't look like the rest. Similarly, known borrowings are weeded out rather than allowed to produce bogus relationships.

    Fourth, the latest work shows that the current best tree is consistent with all the data if we assume that there exist just three fairly plausible sets of borrowings hitherto unknown that therefore haven't been removed (the direction of borrowing can't be determined). These are Germanic and Italic, Germanic and Baltic, and Italic and Greco-Armenian (which last sounds strange, but we don't really know where Proto-Italic was spoken). Other possible ways to repair the tree fail because the languages are too separated in time or space or both.

    Finally, the same methodology was applied to the modern West Germanic languages, and what it shows is that they don't form a tree at all, but a random muddle. If we didn't have earlier forms of the languages with fewer borrowings, we would not be able to reconstruct their genetic relationships at all.

  2. Thanks for posting this. I've always wanted to know more about how the Pennsylvania family tree was formed, though admittedly I've been lazy myself in tracking down the articles.

    My only issue with strict Stammbaum attempts at classification is how they handle linguistic convergence phenomena, which have been giving me some nightmares lately when working on the Greek dialects (most issues presently troubling me are conveniently summarized by Garrett 2006 = http://linguistics.berkeley.edu/~garrett/IEConvergence.pdf), and also in the history of the Northwest Semitic languages, with their historical contact with Akkadian and the internal convergence between Aramaic and Hebrew.

    Still, for language families such as IE, with a good spread from areas of high linguistic density to areas of low linguistic density that gives a great 'branchiness', this approach still seems valid overall.

  3. Thanks all.

    @John: I added the tree image, as suggested. With respect to your final point, I imagine the same is true of the modern Indo-Aryan languages. Though, in fact, even with (some) earlier forms of the modern languages, their relationships are not entirely clear (earlier scholarship posited two waves of Indo-Aryan migration in order to account for an apparent inner/outer dichotomy).