Historical linguistics aims to infer the most likely phylogenetic trees of languages from information about their evolutionary relatedness. The available information typically consists of lists of homologous (lexical, phonological, syntactic) features or characters for many different languages: a set of parallel corpora whose compilation represents a paramount achievement in linguistics.
From this perspective, the reconstruction of language trees is just one example of the wider class of so-called inverse problems: starting from present information, which is incomplete and often noisy, one aims to infer the most likely past evolutionary history. A fundamental issue in inverse problems is how to evaluate the inference we make. A standard way of dealing with it is somewhat circular: one generates data with artificial models and then applies a suitable method to infer the model from the generated data. This procedure has an intrinsic limitation: when dealing with real data sets, one typically does not know the real process that produced them, and thus does not know which model is the most suitable for generating the artificial dataset.
Coming back to the problem of tracing the evolutionary history of languages, a possible way out is to compare inferred phylogenies with expert classifications. We have compared the outcome of different phylogeny reconstruction methods with the (partial) Ethnologue classification, focusing in particular on state-of-the-art distance-based methods applied to state-of-the-art linguistic databases.
We show here the reconstruction accuracy achieved by our algorithm SBiX on 48 language families across the world.
This map represents the level of accuracy of the FastSBiX algorithm on several language families throughout the world. The colors encode the value of the Generalized Quartet Distance (GQD), normalized by the corresponding random value, between the tree inferred with the FastSBiX algorithm for each language family included in the ASJP database and the corresponding Ethnologue classification. Blue regions correspond to language families for which the inferred trees strongly agree with the Ethnologue classification, while red regions correspond to poorly reconstructed language families. Yellow marks the families for which a random reconstruction would get a GQD score of zero, meaning that the Ethnologue classification has null resolution (the corresponding tree is a star). Grey areas are those for which no data are present in the databases we use to reconstruct language trees. Asterisks mark regions that include more than one language family. For details, refer to the paper “On the accuracy of language trees” (Pompei et al. 2011).
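To make the quartet-based comparison more concrete, here is a minimal Python sketch of a quartet distance between an inferred tree and a possibly unresolved reference classification. It is an illustration under stated assumptions, not the implementation used for the map: trees are written as nested tuples of leaf names, the function names `clades`, `quartet_split` and `gqd` are hypothetical, and the score is assumed to be the fraction of quartets resolved in the reference that the inferred tree resolves differently. The normalization by the random value mentioned above is not included.

```python
from itertools import combinations


def clades(tree):
    """Return the leaf sets of all internal nodes of a nested-tuple tree."""
    found = []

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf (language name)
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def quartet_split(clade_list, quartet):
    """Return the 2|2 split induced on four leaves, or None for a star."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = clade & q
        if len(inside) == 2:  # this clade separates two leaves from the other two
            return frozenset([inside, q - inside])
    return None


def gqd(inferred, reference):
    """Assumed definition: differently resolved quartets / quartets resolved in reference."""
    inf_clades, ref_clades = clades(inferred), clades(reference)
    leaves = sorted(max(ref_clades, key=len))  # the root clade holds every leaf
    resolved = different = 0
    for quartet in combinations(leaves, 4):
        ref_split = quartet_split(ref_clades, quartet)
        if ref_split is None:
            continue  # the reference says nothing about this quartet
        resolved += 1
        inf_split = quartet_split(inf_clades, quartet)
        if inf_split is not None and inf_split != ref_split:
            different += 1
    # A star-like reference has no resolved quartets; report 0.0 in that case.
    return different / resolved if resolved else 0.0


reference = ((("a", "b"), "c"), ("d", "e"))
inferred = ((("a", "c"), "b"), ("d", "e"))
score = gqd(inferred, reference)
star_score = gqd(inferred, ("a", "b", "c", "d", "e"))
```

In this toy case the two trees disagree on two of the five quartets that the reference resolves, while a star-shaped reference resolves no quartet at all, which is the situation shown in yellow on the map. The brute-force enumeration of quartets is only practical for small families; faster algorithms exist for larger trees.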