Tuesday, July 25, 2017

More on similarities in linguistics

In an earlier blogpost I discussed various reasons for similarity of certain traits in languages. I emphasized four major reasons for similarities, for example, in the lexicon of languages: coincidence, natural reasons, inheritance, and contact (see also List 2014: 55f and Aikhenvald 2007: 5). Despite the problems of distinguishing inherited from borrowed traits, which together I called historical reasons for similarity, controlling for coincidence and history can often be done in a rather straightforward way. Coincidence can be controlled for by applying a frequency criterion: if certain similarities are extremely sporadic, they are usually due to chance. Historical similarities can be detected with the help of classical methods for language comparison. If, using these methods, we know, for example, that two or more languages are genetically related or have been developing in close contact with each other, then we will usually assume that shared traits among them are due to their shared history.

The third group of similarities, on the other hand, which I called natural, is a bit more difficult to interpret, since it is not entirely clear what "natural" means in this context. My earlier example was the word for "mother", which in many languages is expressed as "mama", similar to "father", which is often expressed as "papa", even in languages that we know are not related, or only extremely distantly related (if we assume that language was only invented once). These words consist of sounds that are easy to produce, and will thus be acquired rather early by children.

In the case of "mama" and "papa", we can blame our articulatory apparatus, which makes sounds like [m], [p], and [a] very easy to pronounce for all humans, no matter where and when they are born. Calling this "nature" is probably justified, given that pronounceability is not per se characteristic of language as a general means of complex communication. In sign languages, for example, pronounceability does not play any role, as those languages are never pronounced, but expressed with the help of gestures. But even in sign languages, we find cross-linguistic similarities that seem to be independent of coincidence or history: body parts, for example, are often expressed iconically, e.g., by pointing to them (see Woodward 1993 for details).

However, not all of those similarities between languages that are not due to history or coincidence are necessarily due to our articulation apparatus. We can think of many different reasons for cross-linguistic similarities, such as, for example, innate settings of the human brain, or global similarities of the environment in which humans live. In the past, colleagues have occasionally pointed out to me the heterogeneity of this class of "natural" similarities. When trying to further subdivide them, the former could be called "similarities due to cognition", while the latter could be called "similarities due to environment". But neither of these two groups seems to be quite satisfying, as we do not really know the relation between environment and cognition. We may also assume that there is a certain influence between the two, and depending on where we draw the border, we would either subscribe to a predominantly Aristotelian viewpoint, where we assign the predominant role to the environment, or a Platonic viewpoint, where we assign it to the innate "ideas" which are given to us along with our brain.

As an example for the difficulty of distinguishing different sources of "natural" similarity, let us have a look at how languages of the world express a fixed set of concepts. In a very simplistic view, given only two things we want to express, for instance the concept "hand" and the concept "arm", we can ask whether a given language will use the same or different words as a rule. English, for example, uses two different words, namely hand and arm, and so does German (Hand and Arm), while Russian uses only one word, ruka, to refer to both concepts in most situations (in Russian, there is another word kist', which can be used to denote "hand", but it is rarely used). We can say that Russian ruka is polysemous, since the word form has at least two meanings. A better way of expressing this is to say that Russian colexifies "hand" and "arm" (François 2008), since the term polysemy has a specific usage in linguistics, referring to words expressing multiple meanings that should be "conceptually close" or "developed from semantic change", which is an extremely vague definition that further requires us to know the history of a given word form and the development of its meanings.

Cross-linguistically, the colexification of "arm" and "hand", i.e. the use of a single word to denote both concepts, occurs extremely often in the languages of the world; so often that we can rule out that the use of one word for two concepts is due to coincidence (compare the colexifications of "arm" in the CLICS database by List et al. 2014). Given that the colexification also recurs in different language families spoken in different regions of the world, we can further rule out historical reasons. This leaves us with the heterogeneous class of "natural reasons for similarities". But what kind of natural similarities are we dealing with here? Are they cognitive? They surely are in some sense, as we can say that humans have good reasons to consider the hand and the arm as one continuous part of their body.
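To make the two criteria (frequency and independence of language families) a bit more tangible, here is a minimal Python sketch. It uses only the English, German, and Russian forms mentioned above; the data layout and the function name `colexification_counts` are assumptions made for illustration and do not reflect how CLICS stores or computes its data.

```python
from collections import defaultdict
from itertools import combinations

# Toy word list: (language, family, concept, word form).
# Only the three languages discussed in the text are included.
WORDLIST = [
    ("English", "Indo-European", "hand", "hand"),
    ("English", "Indo-European", "arm", "arm"),
    ("German", "Indo-European", "hand", "Hand"),
    ("German", "Indo-European", "arm", "Arm"),
    ("Russian", "Indo-European", "hand", "ruka"),
    ("Russian", "Indo-European", "arm", "ruka"),
]

def colexification_counts(wordlist):
    """Map each colexified concept pair to the languages and families showing it."""
    # Collect, per language, which concepts each word form expresses.
    forms = defaultdict(lambda: defaultdict(set))
    family_of = {}
    for language, family, concept, form in wordlist:
        forms[language][form].add(concept)
        family_of[language] = family

    counts = defaultdict(lambda: {"languages": set(), "families": set()})
    for language, by_form in forms.items():
        for concepts in by_form.values():
            # A form with two or more concepts colexifies every pair of them.
            for pair in combinations(sorted(concepts), 2):
                counts[pair]["languages"].add(language)
                counts[pair]["families"].add(family_of[language])
    return counts

counts = colexification_counts(WORDLIST)
for pair, hit in counts.items():
    print(pair, sorted(hit["languages"]), sorted(hit["families"]))
```

With a realistic sample, the interesting number is the size of the `families` set: a pair that recurs across many unrelated families is hard to attribute to either chance or shared history.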

But this continuity is also given by the structure of our body, which itself is given independently of our perception. One could argue that our perception is grounded in our bodily experience, but if we look further into other frequent colexifications, e.g. between "dark" and "black" (this occurs in more than 20 language families), as well as "bright" and "white" (this occurs in three language families), our perception depends less on our body and more on the environment in which we experience darkness and brightness, since most humans have eyesight and do not live entirely in caves.

This is a kind of chicken-and-egg problem of which came first, and the more I think about it, the more I prefer to avoid giving any clear-cut preference to either the egg or the hen. We can obviously try to make a more fine-grained distinction between different kinds of non-historical and non-coincidental similarities between languages, but unless psychologists and cognitive scientists solve general problems of perception and environment, it seems that, at least for the moment, "natural similarities" is explicit enough as a term to describe universal patterns in the languages of the world.

  • François, A. (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, M. (ed.): From polysemy to semantic change. Benjamins: Amsterdam. 163-215.
  • List, J.-M., T. Mayer, A. Terhalle, and M. Urban (eds.) (2014) CLICS: Database of Cross-Linguistic Colexifications. Forschungszentrum Deutscher Sprachatlas: Marburg. http://www.webcitation.org/6ccEMrZYM.
  • List, J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, 2393-2400.
  • Woodward, J. (1993) Lexical evidence for the existence of South Asian and East Asian sign language families. Journal of Asian Pacific Communication 4.2: 91-107.

Tuesday, July 18, 2017

Stacking neighbour-nets: ancestors and descendants

Although they are not phylogenetic networks in an evolutionary sense, neighbour-nets are amazingly efficient when it comes to depicting actual ancestor-descendant relationships. Spencer et al. (2004), for example, applied various methods of phylogenetic inference to reconstruct a known phylogeny of manuscripts, i.e. text copies made by scribes based on an original text, or on copies of that text. They found that the neighbour-net algorithm produced a graph that best depicts the actual ancestor (original text) to descendant (text copy made by a scribe) relationships. The reason is relatively simple: neighbour-nets are well-suited to extract and reflect differentiation patterns from a distance matrix.

In this post I will explore this idea in some detail, because I think that it has important practical implications (which I will further elaborate in future posts). If data are available for time slices of the evolutionary history, then it is possible to stack a series of networks that describe each time slice, thus providing a much more comprehensively inferred genealogy.


If a distance matrix reflects exactly the phylogeny (i.e. if the signal from the matrix is trivial), then the neighbour-net looks like a tree, with the ancestors placed seemingly at the internal nodes (medians) of the subtree(s) containing their descendants — that is, the neighbour-net looks like a median network (Figure 1). In fact, this is just mimicry. The ancestors are not actually placed on the internal nodes (medians) but are connected by zero-length edge bundles to the centre of the graph (the roots of their descendants).

Figure 1. Neighbour-net inferred from a perfect distance matrix (from Denk & Grimm 2009)

In reality, distance matrices will not be exact reflections of the phylogeny (the ‘true’ tree) but distorted; and this will be reflected also in the neighbour-net (Fig. 2).

Figure 2. Neighbour-net inferred from an imperfect distance matrix (cf. Denk & Grimm 2009, fig. 1)
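The difference between a perfect and an imperfect matrix can be checked numerically. The sketch below uses the four-point condition, a standard test for whether a distance matrix fits a tree exactly; it is not a method used in this post, and the two small matrices are invented for illustration. The function names are likewise assumptions.

```python
from itertools import combinations

def symmetric(pairs):
    """Turn {(x, y): d} into a nested dict that can be read in both directions."""
    dist = {}
    for (x, y), value in pairs.items():
        dist.setdefault(x, {})[y] = value
        dist.setdefault(y, {})[x] = value
    return dist

def four_point_violation(dist, taxa):
    """Largest deviation from the four-point condition (0 means exactly tree-like)."""
    worst = 0.0
    for a, b, c, d in combinations(taxa, 4):
        sums = sorted([
            dist[a][b] + dist[c][d],
            dist[a][c] + dist[b][d],
            dist[a][d] + dist[b][c],
        ])
        # For distances measured along a tree, the two largest sums are equal.
        worst = max(worst, sums[2] - sums[1])
    return worst

taxa = ["A", "B", "C", "D"]

# Invented example: tree ((A,B),(C,D)) with all branch lengths equal to 1.
tree_like = symmetric({
    ("A", "B"): 2, ("C", "D"): 2,
    ("A", "C"): 3, ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3,
})

# Same matrix, but A and C appear more similar than the tree allows (convergence).
distorted = symmetric({
    ("A", "B"): 2, ("C", "D"): 2,
    ("A", "C"): 2, ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3,
})

print(four_point_violation(tree_like, taxa))
print(four_point_violation(distorted, taxa))
```

A violation of zero corresponds to the situation in Figure 1, where the neighbour-net collapses to a tree; any positive value means the matrix contains incompatible signal, which a neighbour-net displays as box-like structures.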

I superimposed a potentially inferred tree on Figure 2 to highlight some distorting effects. Note that this could be a tree optimised using a distance criterion such as minimum evolution or least-squares, or a tree optimised under maximum likelihood or maximum parsimony (one of the alternative topologies found in the sample of equally parsimonious trees).

The following distorting effects may apply during tree-inference: (i) a misplaced outgroup-inferred ingroup root (due to convergences shared by the outgroup and members of lineage B), (ii) lineage B is dissolved into a grade, although it should be a clade (convergences shared by members of lineages A and B), and (iii) the all-ingroup ancestor is resolved as the sister to lineage A, but not B. The neighbour-net illustrates the uncertainties of placing several taxa (the box-like parts of the graph), while keeping the ancestors equally close to their descendants. This provides information lost (or overlooked) when just inferring a tree (but not by exploratory data analysis of e.g. the bootstrap support patterns).

The basics — why stack networks?

Figure 3 shows a hypothetical evolution of a phylogenetic lineage in a two-dimensional morphospace. The common ancestor (black dot) gives rise to two, morphologically somewhat distinct lineages (bluish vs. reddish coloured). The lineages evolve over time and diverge again. The overall differentiation within the group increases, but eventually the potential niche/morphospace is filled. When looking only at the final situation (i.e. the modern-day situation), we may be tempted to infer wrong relationships based on the morphological distinctness. One blue and one red daughter lineage each evolved into similar niches and obtained somewhat similar morphological character suites, substantially distinct from those of their respective sister taxa. Translating this situation into a distance matrix (or a character matrix) to infer a tree will yield a wrong topology once long-branch attraction steps in: one that recognises one or two blue-red sister pair(s).

Figure 3. Evolution (vertically) and diversification of a lineage in a two-dimensional morphospace (horizontally)

Adding all of the ancestors can help to escape long-branch attraction (Figure 4; see also Wiens 2005). We don’t find red111 as sister to blu11, but still resolve the wrong sister relationship between the overall too-similar endpoints of the converging red and blue sub-lineages (red222 and blu22). The remainder of the red lineage is erroneously dissolved into (A) an ancient sister lineage (red0); (B) a real clade (red1, red11, red111), the red sub-lineage most distinct from the blu lineage; and (C) a grade (red2, red22), “basal” (wrong terminology, but often used) to the blue lineage, collecting the older members of the red sub-lineage evolving towards the morphospace of the blue lineage.

Figure 4. Neighbour-joining tree inferred on a distance matrix exactly reflecting the pairwise distances along both axes in the 2-dimensional morphospace

Only a few, (data-wise) trivial branches of the true tree (broadened green edges) are found in the inferred tree. An obvious defect of this tree is that the phylogenetic distance, the sum of branch lengths between two tips, does not reflect the pairwise distances encoded in the matrix. For instance, the all-ancestor should be equally distant from both of the ancestors of the red and blue lineages (red0, blu0).

The neighbour-net shows something rather different (Figure 5). A large box can be seen, referring to the highly incompatible signal induced by the ancestors and their direct and subsequent descendants.

Figure 5. Neighbour-net splits graph inferred on the same distance matrix

Red222 and blu22, the false sisters, are placed next to each other and share an edge bundle. They are most similar to each other and increasingly distinct from all other taxa included in the analysis. However, their nearest relatives are their actual ancestors (red22 and blu2), which are already quite distinct from each other, and show affinities to further members to their clade but not to the other clade (the topology of the tree from Figure 4 is shown in yellow in Figure 5).

The network includes the true tree in addition to edge bundles referring to wrong alternatives (induced by imperfect data; in this case, convergence due to evolution into a similar morphospace). The phylogenetic distances between two tips (via alternative pathways) reflect much better the pairwise distances (almost exactly). Thus, the neighbour-net is a much more comprehensive display of the actual signals in the imperfect matrix than a tree could ever be (independent of the optimisation criterion used).

The fossils included in our example represent a time sequence, an actual change in time. With that information as background, we can much more easily access the neighbour-net’s structure (Figure 6).

Figure 6. The same neighbour-net with time-slices

Already in the second time-slice the lineage diverged into two distinct branches represented by red0 and blu0. Both evolved (blu0→blu00) and diverged (red1, red2) in the next time slice, and so on. The fact that the neighbour-net is not a phylogenetic graph (in an evolutionary sense) becomes a strength. In any tree, we would need to deduce, or be tempted to deduce, (inclusive) common origins (Hennig’s monophyly) from the clades exhibited in the rooted version of the tree. Here, using the most natural root to root the tree, the oldest representative of the lineage (ancestor), misplaces the two representatives of the second time-slice along with those involved in subsequent radiations.

Stacking networks

There are two simple ways to trace the change in differentiation patterns through time using stacked networks: (A) Generate a series of networks per time-slice and identify the closest relatives in the next (and/or preceding) time-slice, or (B) Generate networks that combine the taxa of two subsequent time-slices.

Figure 7. A sequence of neighbour-nets, with the taxa filtered by age. (These are actual reconstructions inferred by SplitsTree based on the taxon-filtered subsets of the all-inclusive distance matrix.)

My example only includes up to four co-eval taxa, and hence the neighbour-nets are trivial graphs for each time-slice (Figure 7). The connecting lines between the time-slices indicate the closest and next-closest possible descendants (as defined by the smallest and next-smallest distance) of each earlier taxon. The thickness of the connections reflects their absolute similarity: the thickest lines indicate a morphological pairwise distance of 0.13, and the thinnest are distances of > 0.33. The colour indicates whether the connection reflects a true (green) or false (yellow) relationship.
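The connecting lines between time-slices can also be computed rather than drawn by hand. The following is a minimal Python sketch, assuming the distances are available as a lookup table; the function names and all distance values are invented for illustration (only the value 0.13 echoes the text), and the taxon labels are borrowed from the figures.

```python
def pair_distances(table):
    """Store each unordered taxon pair once, keyed by a frozenset."""
    return {frozenset(pair): value for pair, value in table.items()}

def closest_descendants(dist, earlier, later, k=2):
    """For each taxon of the earlier slice, return its k nearest taxa in the later slice."""
    links = {}
    for ancestor in earlier:
        ranked = sorted(later, key=lambda taxon: dist[frozenset((ancestor, taxon))])
        links[ancestor] = [
            (taxon, dist[frozenset((ancestor, taxon))]) for taxon in ranked[:k]
        ]
    return links

# Hypothetical distances between the taxa of two subsequent time-slices.
DIST = pair_distances({
    ("red0", "red1"): 0.13, ("red0", "red2"): 0.15, ("red0", "blu00"): 0.30,
    ("blu0", "red1"): 0.35, ("blu0", "red2"): 0.25, ("blu0", "blu00"): 0.13,
})

links = closest_descendants(DIST, ["red0", "blu0"], ["red1", "red2", "blu00"])
for ancestor, candidates in links.items():
    print(ancestor, candidates)
```

With k=2, each earlier taxon gets a closest and a next-closest candidate, and the stored distance could be mapped to line thickness in a plot. Whether a link is true or false (green or yellow in Figure 7) cannot be derived from the distances; that requires knowing the real genealogy.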

Analysed this way, the matrix's signal appears quite perfect in relation to the true tree. One can trace the increasing diversity within the clade (all descendants of the all-ancestor), as well as the misleading decreasing distance between the blu2/blu22 and red22/red222 lineages. In this case, the same stacking procedure would also work with trees, as the neighbour-nets are trivial and very tree-like. With real-world data, the differences may be more profound.

With more complex or less complete data, we have a higher risk that ancestor-descendant relationships will not be straightforwardly identified by highest-similarity pairs. Missing data, for instance, can result in distances that mask or over-estimate the actual phylogenetic distances between an older and a younger taxon. Using the stacking procedure illustrated in Figure 7, such problems can become visible in the form of related taxa from the same time-slice that are connected to unrelated or distantly related taxa in the preceding or following time-slices.

But how to identify more likely candidates? One possibility is to assess potential phylogenetic relationship by combining taxa of two subsequent time-slices. The connectives between the reconstructions are then straightforward: each taxon is always used in two different reconstructions (Figure 8). This procedure allows us to establish the phylogenetic affinities of a taxon with respect to co-eval and older taxa (potential siblings and ancestors) or co-eval and younger taxa (potential siblings and descendants).

Figure 8. A sequence of neighbour-nets, each one including the taxa of two subsequent time-slices
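Preparing the input for such a series amounts to filtering the all-inclusive distance matrix by pairs of subsequent time-slices. Below is a minimal Python sketch of that filtering step. The morphospace coordinates are invented, and the use of Euclidean distance is an assumption; the network inference itself (done with SplitsTree for the figures) is not covered.

```python
import math

# Hypothetical positions in a two-dimensional morphospace.
COORDS = {
    "anc": (0.0, 0.0),
    "red0": (-1.0, 0.0), "blu0": (1.0, 0.0),
    "red1": (-2.0, 0.5), "red2": (-0.5, -0.5), "blu00": (1.5, 0.5),
}

# Taxa grouped by age, oldest time-slice first.
SLICES = [["anc"], ["red0", "blu0"], ["red1", "red2", "blu00"]]

# All-inclusive distance matrix (Euclidean distance is an assumption).
DIST = {a: {b: math.dist(COORDS[a], COORDS[b]) for b in COORDS} for a in COORDS}

def paired_slice_matrices(dist, slices):
    """Yield (taxa, submatrix) for every pair of subsequent time-slices."""
    for older, younger in zip(slices, slices[1:]):
        taxa = list(older) + list(younger)
        sub = {a: {b: dist[a][b] for b in taxa} for a in taxa}
        yield taxa, sub

windows = list(paired_slice_matrices(DIST, SLICES))
for taxa, sub in windows:
    print(taxa)
```

Each taxon outside the oldest and youngest slices ends up in exactly two submatrices, once together with its potential ancestors and once with its potential descendants.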

Stacking networks — what’s next?

The first thing, obviously, is to test the suggested procedures for real-world data, involving groups with a dense and well-studied fossil record. I will provide a real-world example in my next post using the matrix we put up for our systematic revision of Osmundaceae (Royal Ferns) rhizomes (Bomfleur, Grimm & McLoughlin 2017).

Simulations may help to identify misleads caused by missing data, and the resulting distorted distance matrices, and non-comprehensively sampled (time-wise) phylogenies. They may also be informative regarding whether consensus networks reflecting competing branch support can be used for similar approaches.

Programmers are needed, too. For my graphics, I established the inter-time-slice connectives by hand; but it would be handy to have a programme environment that can do this.


Bomfleur B, Grimm GW, McLoughlin S. 2017. The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5: e3433. Open access: https://peerj.com/articles/3433/

Denk T, Grimm GW. 2009. The biogeographic history of beech trees. Review of Palaeobotany and Palynology 158: 83-100.

Spencer M, Davidson EA, Barbrook AC, Howe CJ. 2004. Phylogenetics of artificial manuscripts. Journal of Theoretical Biology 227: 503-511.

Wiens JJ. 2005. Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Systematic Biology 54: 731-742. Open access: https://academic.oup.com/sysbio/article-lookup/doi/10.1080/10635150500234583

Tuesday, July 11, 2017

The curious case of the word “stemma” — from circles to trees

Each word has its own history, according to a maxim attributed to Jules Gilliéron that makes some historical linguists tremble. One with a curious history is the word stemma (plural stemmata), which we stumble upon when investigating the development of phylogenetic trees.

David has been exploring this question for some time, showing how the origin ultimately lies in the alternative to the hierarchical model of the Aristotelian "scale" offered by the practice (and the metaphors) of genealogies and pedigrees. While dealing with possible influences on 19th-century biology, I have explored a different field, stemmatics ("textual criticism"), which shares with genealogical practices both the tree model and, obviously, the word stemma. As stemmatics is one of the first scientific approaches to the idea, and considering that the now widespread tree is likely a calque of German Baum, itself a calque of stemma, it is worth writing a bit about the history of this word.

Stemma / stemmata

Dictionaries (as well as my queries in Google Books, which would probably fail to impress a reviewer) agree that not even in Romance languages does this Latin word display an uninterrupted tradition from the time of Caesar. It only entered languages such as English and Italian, with the meaning of "genealogical tree; pedigree; nobility", from the mid 16th century on. This date supports the theory that family pedigrees were not commonplace before the 16th century (when they became a true fashion, as in Strein, 1559), despite being drawn since the Middle Ages and always discussed, as in royal disputes or in the case of the genealogies of Jesus found in the Gospels (likely drawn to confirm the messianic claims with Jewish criteria, but assimilated to the European mindset). In short, modern stemmata are mostly a product of Neoclassical fashion, and their popularity was influenced by the same descriptions of Roman pedigrees where the word was learned.

Speaking of pedigrees, this Latin word is of Greek descent, a loanword of στέμμα [stémma], meaning "wreath, crown". This sense was already a development of an original "that which surrounds; circle": in Homer's Iliad, for example, we still find an occurrence in the first sense of "circle" (of warriors, cf. XIII.736); but elsewhere the word refers to a laurel-wreath wound around a staff, mostly in the plural and in relation to the laurel god Apollo (cf. Iliad, I.14, I.28, and I.373). The development is due to the custom of awarding crowning wreaths, with στέμμα deriving from the verb στέφειν [stéphein] ("to encircle, to crown, to wreathe, to tie around") by the addition of the morpheme -μα [-ma], used to form nouns denoting the result of an action, as in the analogous case of γράφω (gráphō, "write") and γράμμα (grámma, "that which is written"). Our word ultimately derives from the Proto-Indo-European root *stebh- "post, stem; to place firmly on; to fasten", related to English "(to) step" and "staff".

Theodosius offers a laurel wreath to the victor;
on the base of the obelisk in the Hippodrome (Istanbul)
[source: Wikipedia]

The "wreath, garland, chaplet" meaning is attested in Ancient Greek literature of all times and genres, such as in tragedy (cf. Euripides, Andromache, 894), comedy (cf. Aristophanes, Wealth, 39), philosophy (cf. Plato, Republic, 617c), and historiography (cf. Thucydides, Peloponnesian War, IV.133). At least one metaphoric usage is attested, in the sense of "web/tangle of life" (cf. Euripides, Orestes, 12), and various inscriptions indicate an additional meaning of "guild" (such as in one "guild of huntsmen" epigraph quoted by Liddell & Scott, 1940). The genealogical meaning is only found in later Greek authors like Plutarch (1st century CE), suggesting that it was imported from Latin.

The Roman meaning developed from the custom of decorating the portraits of one’s ancestors, sometimes in elaborate full-wall genealogies, with laurel wreaths indicating both excellence and nobility (as "noble" pretty much meant "descending from gods"). Domestic cults were central to Roman religion, and this practice seems to have become so widespread in Imperial times that it turned into a banality, with the laurel decoration being decried as a symbol of vanity by poets and philosophers alike. The custom – and the usage of stemma for "genealogical tree" – is mentioned twice by Seneca in essays of utmost importance for Roman Stoicism. In Ad Lucilium Epistulae Morales, XLIV.1, he says:
Si quid est aliud in philosophia boni, hoc est, quod stemma non inspicit. Omnes, si ad originem primam revocantur, a dis sunt. [Philosophy also has this advantage: it does not look at your genealogical tree. Everyone, if we look at their remotest origin, descends from the gods].
A similar reference, with a more detailed description of the practice, is found in his De Beneficiis, III.28:
We all spring from the same source, have the same origin; no man is nobler than another except in so far as the nature of one man is more upright and more capable of good actions. Those who display ancestral busts in their halls [qui imagines in atrio exponunt], and place in the entrance of their houses the names of their family, arranged in a long row and entwined in the multiple ramifications of a genealogical tree [ac multis stemmatum illigata flexuris] – are these not notable rather than noble? Heaven is the one parent of us all, whether from his earliest origin each one arrives at his present degree by an illustrious or obscure line of ancestors. You must not be duped by those who, in making a review of their ancestors, wherever they find an illustrious name lacking, foist in the name of a god. [adapted from the translation of Basore, 1935]

A golden laurel wreath, probably originating from Cyprus, 4th-3rd century BC
[source: Wikipedia]

In matters of phylogenetics, to prove that something existed is usually not enough, as we should try to demonstrate its influence and descent. Both are clear in the case of Seneca: his moral essays were read and copied without interruption in the early years of Christianity, proliferated during the Carolingian Renaissance, and were among the most published works of secular Western literature for centuries. The Wikipedia article on the second essay is well referenced on the matter:
Three translations were made into English during the sixteenth and early seventeenth century. The first translation at all into English was made in 1569 by Nicolas Haward, of books one to three, while the first full translation into English was made in 1578 by Arthur Golding, and the second in 1614 by Thomas Lodge. Roger L'Estrange made a relevant work in 1678, he had been making efforts on Seneca's works since at least 1639. A partial Latin publication of books 1 to 3, being edited by M. Charpentier & F. Lemaistre, was made circa 1860, books 1 to 3 were translated into French by de Wailly, and a translation into English was made by JW. Basore circa 1928-1935.
The new meaning of the word is confirmed by many other authors popular in Medieval times, and especially after the Renaissance, such as Suetonius (cf. Nero, 37; Galba, 2) and Statius (cf. Silvae, 3). Pliny the Elder's Naturalis Historia, obligatory reading for all Western scholars from the Renaissance to at least the 19th century, is another important source. When expounding the history of Roman art and discussing the honor attached to portraits, Pliny mentions that "in ancient times" people took great care over faithful likeness, when "portraits modeled in wax were arranged, each in its separate niche, to be always in readiness to accompany the funeral processions of the family [... while] the pedigree [stemmata] of the individual was traced in lines upon each of these coloured portraits" (XXXV.6, adapted from Bostock, 1855).

The last important source to note is the eighth Satire of Juvenal, on the paradoxes of the Roman aristocracy, where the word stemma, as usual in the plural, is used to open the poem:
Stemmata quid faciunt? quid prodest, Pontice, longo / sanguine censeri, pictos ostendere uultus / maiorum et stantis in curribus Aemilianos / et Curios iam dimidios umeroque minorem / Coruinum et Galbam auriculis nasoque carentem [Genealogies, what are they worth? What is in it for you, Ponticus, in being judged by ancient bloodline, in flaunting the portraits of your ancestors, the Aemilians standing on chariots, only half of the Curii, a Corvinus devoid of shoulders, and a Galba missing ears and nose?]
Sources suggest that the new meaning was well established by the reign of Hadrian (2nd century CE), including the derivative meanings of "high value" (cf. Martial, Satyra, VIII.6) and "antique", as in Prudentius (cf. Liber Cathemerinon, VII.81), a Christian author much read in Medieval times. As already mentioned, the word even found its way back into Greek with the new semantic shift, such as in Plutarch, one of the most popular Greek authors since the Renaissance. In his Numa, 1, we find:
ἔστι δὲ καὶ περὶ τῶν Νομᾶ τοῦ βασιλέως χρόνων, καθ᾽ οὓς γέγονε, νεανικὴ διαφορά, καίπερ ἐξ ἀρχῆς εἰς τοῦτον κατάγεσθαι τῶν στεμμάτων ἀκριβῶς δοκούντων ["There is likewise a vigorous dispute about the time at which King Numa lived, although from the beginning down to him the genealogies seem to be made out accurately"; Perrin, 1914].
It is somewhat ironic that the accusations of futility and uselessness of genealogical trees probably contributed to the Medieval and Renaissance restoration of such practices. Informed about the Roman tradition, and equipped with examples from nobility and religion, people turned genealogy and its trees into a fashion. This helped to lay the ground for the acceptance of the tree model when new scientific endeavors required a better way to describe things, like dog breeds and strawberry varieties, especially when non-ascending genealogies (who descends from whom, instead of who are the ancestors of whom) were already common, and when the concept of the "tree of life" gained a new popularity.

Neptune's genealogy as per Boccaccio.
Paris: Louis Hornken, 1511. [source]

  • Aristophanes (1938) Wealth. The Complete Greek Drama, vol. 2. Eugene O'Neill, Jr. New York: Random House.
  • στέμμα in Autenrieth, Georg (1891) A Homeric Dictionary for Schools and Colleges. New York: Harper and Brothers.
  • στέμμα in Bailly, Anatole (1935) Le Grand Bailly: Dictionnaire grec-français. Paris: Hachette.
  • Euripides (forthcoming) Euripides, with an English translation by David Kovacs. Cambridge MA: Harvard University Press.
  • Euripides (1938) The Complete Greek Drama, edited by Whitney J. Oates and Eugene O'Neill, Jr. in two volumes. New York: Random House.
  • stemma in Lewis, Charlton T; Short, Charles (1879) A Latin Dictionary. Founded on Andrews' edition of Freund's Latin dictionary, revised, enlarged, and in great part rewritten. Oxford: Clarendon Press.
  • στέμμα in Liddell & Scott (1940) A Greek–English Lexicon. Oxford: Clarendon Press.
  • στέμμα in Liddell & Scott (1889) An Intermediate Greek–English Lexicon. New York: Harper & Brothers.
  • Omero (1990) Iliade. Traduzione di Rosa Calzecchi Onesti. Torino: Giulio Einaudi editore.
  • Plato (1903) Platonis Opera, ed. John Burnet. Oxford: Oxford University Press.
  • Pliny the Elder (1855) The Natural History. John Bostock, H.T. Riley. London. Taylor and Francis.
  • Plutarch (1914) Plutarch's Lives, with an English translation by Bernadotte Perrin. Cambridge MA: Harvard University Press; London: William Heinemann.
  • Seneca (1917-1925) Ad Lucilium Epistulae Morales, volume 1-3. Richard M. Gummere. Cambridge MA: Harvard University Press; London: William Heinemann.
  • Seneca, Lucius Annaeus (1928-1935) Moral Essays. Translated by John W. Basore. The Loeb Classical Library. London: W. Heinemann. 3 vols.: Volume III.
  • Statius, P. Papinius (1928) Statius, Vol I. John Henry Mozley. London: William Heinemann; New York: G.P. Putnam's Sons.
  • Strein, Richardus (1559) Gentium et familiarum Romanorum stemmata. Paris[?]: Henr. Stephanus.
  • Suetonius (1889) The Lives of the Twelve Caesars; An English Translation, Augmented with the Biographies of Contemporary Statesmen, Orators, Poets, and Other Associates. J. Eugene Reed (publishing editor), Alexander Thomson. Philadelphia: Gebbie & Co.
  • Thucydides (1942) Historiae in two volumes. Oxford: Oxford University Press.

Tuesday, July 4, 2017

Should we try to infer trees on tree-unlikely matrices?

Spermatophyte morphological matrices that combine extinct and extant taxa notoriously have low branch support, as traditionally established using non-parametric bootstrapping with parsimony as the optimality criterion. Coiro, Chomicki & Doyle (2017) recently published a pre-print showing that this can be overcome to some degree by changing to Bayesian-inferred posterior probabilities. They also highlight the use of support consensus networks for investigating potential conflict in the data. This is a good start for a scientific community that so far has put more of its trust in either (i) direct visual comparison of fossils with extant taxa or (ii) collections of most parsimonious trees inferred from matrices with a high level of probably homoplasious characters and low compatibility. But do those matrices really require or support a tree? Here, I try to answer this question.


Coiro et al. mainly rely on a recent matrix by Rothwell & Stockey (2016), which marks the current endpoint of a long history of putting up and re-scoring morphology-based matrices (Coiro et al.’s fig. 1b). All of these matrices provide, to various degrees, ambiguous signal. This is not overly surprising, as these matrices include a relatively high number of fossil taxa with many data gaps (due to preservation and scoring problems), and combine taxa that perished a hundred million or more years ago with highly derived, possibly distantly related modern counterparts.

Rothwell & Stockey state (p. 929) "As is characteristic for the results from the analysis of matrices with low character state/taxon ratios, results of the bootstrap analysis (1000 replicates) yielded a much less fully resolved tree (not figured)." Coiro et al.’s consensus trees and network based on 10,000 parsimony bootstrap replicates nicely depict this issue, and may explain why Rothwell & Stockey decided against showing those results. When studying an earlier version of their matrix (Rothwell, Crepet & Stockey 2009), they did not provide any support values, citing a paper published in 2006, where the authors state (Rothwell & Nixon 2006, p. 739): “… support values, whether low or high for particular groups, would only mislead the reader into believing we are presenting a proposed phylogeny for the groups in question. Differences among most-parsimonious trees are sufficient to illuminate the points we wish to make here, and support values only provide what we consider to be a false sense of accuracy in these assessments”.

Do the data support a tree?

The problem is not just low support. In fact, the tree shown by Rothwell & Stockey with its “pectinate arrangement” conflicts in parts with the best-supported topology, a problem that also applied to its 2009 predecessor. This general “pectinate” arrangement of a large, low or unsupported grade is not uncommon for strict consensus trees based on morphological matrices that include fossils and extant taxa (see e.g. the more proximal parts of the Tree of Life, e.g. birds and their dinosaur ancestors).

The support patterns indicate that some of the characters are compatible with the tree, but many others are not. Of the 34 internodes (branches) in the shown tree (their fig. 28 shows a strict consensus tree based on a collection of equally parsimonious trees), 12 have lower bootstrap support under parsimony than their competing alternatives (Fig. 1). Support may be generally low for any alternative, but the ones in the tree can be among the worst.
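The comparison between a split in the shown tree and its competing alternatives can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five taxa and four replicate trees are hypothetical, each tree is given simply as a list of splits (one side of each bipartition), and the helper names are made up for this example; it is not the workflow used for Fig. 1.

```python
from collections import Counter

def normalise(side, taxa):
    """Store a split by the side that lacks a fixed reference taxon."""
    side = frozenset(side)
    return taxa - side if min(taxa) in side else side

def compatible(a, b, taxa):
    """Two splits fit on one tree if any pair of their sides is disjoint."""
    return any(not (x & y) for x in (a, taxa - a) for y in (b, taxa - b))

def split_support(replicates, taxa):
    """Percentage of replicate trees that contain each split."""
    counts = Counter(
        s for tree in replicates for s in {normalise(side, taxa) for side in tree}
    )
    return {s: 100.0 * c / len(replicates) for s, c in counts.items()}

def best_competitor(split, support, taxa):
    """Best-supported split that cannot sit on the same tree as `split`."""
    split = normalise(split, taxa)
    rivals = [(bs, s) for s, bs in support.items() if not compatible(split, s, taxa)]
    return max(rivals, key=lambda r: r[0], default=(0.0, None))

# Hypothetical toy data: five taxa, four bootstrap replicate trees.
taxa = frozenset("ABCDE")
replicates = [
    [{"A", "B"}, {"D", "E"}],
    [{"A", "B"}, {"C", "D"}],
    [{"A", "C"}, {"D", "E"}],
    [{"A", "C"}, {"D", "E"}],
]
support = split_support(replicates, taxa)
own = support[normalise({"C", "D"}, taxa)]
rival_bs, rival = best_competitor({"C", "D"}, support, taxa)
print(own, rival_bs, sorted(rival))
```

In this toy sample, a tree showing the split C+D would display a branch that has less support than an incompatible alternative, which is the situation described above for 12 of the 34 internodes.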

The main problem is that the matrix simply does not provide enough tree-like signal to infer a tree. Delta Values (Holland et al. 2002) can be used as a quick estimate for the treelikeness of signal in a matrix. In the case of large all-spermatophyte matrices (Hilton & Bateman 2006; Friis et al. 2007; Rothwell, Crepet & Stockey 2009; Crepet & Stevenson 2010), the matrix Delta Values (mDV) are ≥ 0.3. For comparison, molecular matrices resulting in more or less resolved trees have mDV of ≤ 0.15. The individual Delta Values (iDV), which can be an indicator of how well a taxon behaves during tree inference, go down to 0.25 for extant angiosperms – very distinct from all other taxa in the all-spermatophyte matrices with low proportions of missing data/gaps – and reach values of 0.35 for fossil taxa with long-debated affinities.

The newest 2016 matrix is no exception with an mDV of 0.322 (the highest of all mentioned matrices), and iDVs range between 0.26 (monocots and other extant angiosperms) and 0.39 for Doylea mongolica (a fossil with very few scored characters). In the original tree, Doylea (represented by two taxa) is part of the large grade and indicated as the sister to Gnetidae (or Gnetales) + angiosperms (molecular trees associate the Gnetidae with conifers and Ginkgo). According to the bootstrap analysis, Doylea is closest to the extant Pinales, the modern conifers. Coiro et al. found the same using Bayesian inference. Their posterior probability (PP) of a Doylea-Podocarpus-Pinus clade is 0.54, and Rothwell & Stockey’s Doylea-Gnetidae-angiosperm clade conflicts with a series of splits with PPs up to 0.95.
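For readers who want to see what a Delta Value measures, here is a minimal sketch following the quartet definition of Holland et al. (2002): for every four taxa, the three possible sums of pairwise distances are sorted, and delta is the gap between the two largest sums divided by the gap between the largest and the smallest. The function names and the two 4-taxon distance matrices are assumptions made for illustration; they are not the matrices discussed in this post.

```python
from itertools import combinations

def quartet_delta(d, x, y, u, v):
    """Delta of one quartet: 0 for perfectly tree-like distances, up to 1 otherwise."""
    m1, m2, m3 = sorted((d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]))
    return 0.0 if m3 == m1 else (m3 - m2) / (m3 - m1)

def delta_values(d):
    """Return (matrix delta value, list of per-taxon delta values).

    Requires a symmetric distance matrix with at least four taxa.
    """
    n = len(d)
    totals, counts, quartets = [0.0] * n, [0] * n, []
    for q in combinations(range(n), 4):
        dq = quartet_delta(d, *q)
        quartets.append(dq)
        for taxon in q:  # each taxon averages over the quartets it is part of
            totals[taxon] += dq
            counts[taxon] += 1
    mdv = sum(quartets) / len(quartets)
    idv = [t / c for t, c in zip(totals, counts)]
    return mdv, idv

# Hypothetical distances that fit the tree ((A,B),(C,D)) exactly.
tree_like = [[0, 2, 3, 3], [2, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]
# Hypothetical distances with conflicting signal (three different sums).
conflicting = [[0, 2, 3, 3], [2, 0, 3, 2], [3, 3, 0, 2], [3, 2, 2, 0]]

print(delta_values(tree_like)[0], delta_values(conflicting)[0])
```

The per-taxon averages correspond to what is called iDV above, the overall average to the mDV.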

Figure 1. Parsimony bootstrap network based on 10,000 pseudoreplicate trees
inferred from the matrix of Rothwell & Stockey.
Edges not found in the authors’ tree in red, edges also found in the tree in green.
Extant taxa in blue bold font. The edge length is proportional to the frequency of the
corresponding split (taxon bipartition, branch in a possible tree) in the pseudoreplicate
tree sample. The network includes all edges of the authors’ tree except for
Doylea + Gnetidae + Petriellales + angiosperms vs. all other gymnosperms and
extinct seed plant groups. Such a split also has no bootstrap support (BS < 10)
using least-squares and maximum likelihood optimality criteria.

Do the data require a tree?

As David pointed out in an earlier post, neighbour-nets are not really “phylogenetic networks” in the evolutionary sense. Being unrooted and 2-dimensional, they don’t depict a phylogeny, which has to be a sort of (rooted) tree, a one-dimensional graph with time as the only axis (this includes reticulation networks where nodes can be the crossing point of two internodes rather than their divergence point). The neighbour-net algorithm is an extension into two dimensions of the neighbour-joining algorithm; the latter infers a phylogenetic tree under a distance criterion such as minimum evolution or least-squares (Felsenstein 2004). Essentially, the neighbour-net is a ‘meta-phylogenetic’ graph inferring and depicting the best and second-best alternative for each relationship. Thus, neighbour-nets can help to establish whether the signal from a matrix, treelike or not (as is the case here), supports potential phylogenetic relationships, and to explore the alternatives much more comprehensively than would be possible with a strict-consensus or other tree (Fig. 2).

Figure 2. Neighbour-net based on a mean distance matrix inferred
from the matrix of Rothwell & Stockey.
The distance to the "progymnosperms", a potential ancestral group of the
seed plants, can be taken as a measurement for the derivedness of each
major group. The primitive seed ferns are placed between progymnosperms
 and the gymnosperms connected by partly compatible edge bundles; the
putatively derived "higher seed ferns" isolated between the progymnosperms
and the long-edged angiosperms. Shared edge-bundles and 'neighbourness'
reflect quite well potential phylogenetic relationships and eventual ambiguities,
as in the case of Gnetidae. Colouring as in Figure 1; some taxon names
are abbreviated.
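As a rough illustration of what a mean distance between two rows of a morphological matrix looks like, the sketch below counts the proportion of differing characters among those scored in both taxa. It assumes that "?" and "-" mark missing data or gaps and that each character has a single state per taxon (no polymorphisms); the taxon names and scorings are hypothetical, and the exact distance settings used for Fig. 2 are not reproduced here.

```python
def mean_distance(a, b, missing=("?", "-")):
    """Proportion of differing characters among those scored in both taxa."""
    shared = [(x, y) for x, y in zip(a, b) if x not in missing and y not in missing]
    if not shared:
        return float("nan")  # no character scored in both taxa
    return sum(x != y for x, y in shared) / len(shared)

# Hypothetical scorings: one well-scored extant taxon, two gappy fossils.
matrix = {
    "extant_1": "01011010",
    "fossil_1": "0111??1-",
    "fossil_2": "0?0?????",
}

names = sorted(matrix)
distances = {
    (p, q): mean_distance(matrix[p], matrix[q])
    for i, p in enumerate(names)
    for q in names[i + 1:]
}
print(distances)
```

Note how the distance involving the second fossil rests on two characters only, which is one reason why taxa with very few scored characters are hard to place.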

In addition, neighbour-nets usually are better backgrounds to map patterns of conflicting or partly conflicting support seen in a bootstrap, jackknife or Bayesian-inferred tree sample. In Fig. 3, I have mapped the bootstrap support for alternative taxon bipartitions (branches in a tree) on the background of the neighbour-net in Fig. 2.

Obvious and less-obvious relationships are simultaneously revealed, and their competing support patterns depicted. Based on the graph, we can see that there is a relatively weak primary signal (edge lengths of the neighbour-net) but substantial bootstrap support for the Petriellales (a recently described taxon new to the matrix) as sister to the angiosperms. Several taxa, or groups of closely related taxa, are characterised by long terminal edges/edge bundles, rooting in the boxy central part of the graph. Any alternative relationship of these taxa/taxon groups receives similarly low support, but there are notable differences in the actual values.

There is little signal to place most of the fossil “seed ferns” (extinct seed plants) in relation to the modern groups, and a very ambiguous signal regarding the relationship of the Gnetidae (or Gnetales) with the two main groups of extant seed plants, the conifers (Pinidae; see C. Earle’s gymnosperm database) and angiosperms (for a list and trees, see P. Stevens’ Angiosperm Phylogeny Website).

The Gnetidae is a strongly distinct (also genetically) group of three surviving genera and a persistent source of headaches for plant phylogeneticists. It was placed as sister to the Pinaceae (‘Gnepine’ hypothesis) in early molecular trees (a long-branch attraction artefact); the currently favoured hypothesis (‘Gnetifer’) places the Gnetidae as sister to all conifers (Pinidae) in an all-gymnosperm clade (including Ginkgo and possibly the cycads).

As favoured by the branch support analyses, and contrasting with the preferred 2016 tree, the two Doyleas are placed closest to the conifers, nested within a commonly found group including the modern and ancient conifers and their long-extinct relatives (Cordaitales), and possibly Ginkgo (Ginkgoidae). In the original parsimony strict consensus tree, they are placed in the distal part as sister to a clade of Gnetidae and Petriellales + angiosperms (possibly long-branch attraction). The grade including the ‘primitive seed ferns’ (Elkinsia through Callistophyton), seen also in Rothwell and Stockey’s 2016 tree, may be poorly supported under maximum parsimony (the criterion used to generate the tree), but receives quite high support when using a probabilistic approach such as maximum likelihood bootstrapping and, to some degree, Bayesian inference (Fig. 3; Coiro, Chomicki & Doyle 2017).

Figure 3. Neighbour-net from above used to map alternative support patterns.
Numbers refer to non-parametric bootstrap (BS) support for alternative phylogenetic
splits under three optimality criteria: maximum likelihood (ML) as implemented in
RAxML (using MK+G model), maximum parsimony (MP), and least-squares
(via neighbour-joining, NJ; using PAUP*); and Bayesian posterior probabilities
(using MrBayes 3.2; see Denk & Grimm 2009, for analysis set-up). The circular
arrangement of the taxa allows tracking most edges in the authors’ tree and their,
sometimes better supported, alternatives. The edge lengths provide direct
information about the distinctness of the included taxa to each other; the structure
of the graph informs about how tree-like the signal is regarding possible
phylogenetic relationships or their alternatives. Colouring as in Figure 1;
some taxon names are abbreviated.

Numerous morphological matrices provide non-treelike signals. A tree can be inferred, but its topology may be only one of many possible trees. In the framework of total evidence, this may not be such a big problem, because the molecular partitions will predefine a tree, and fossils will simply be placed in that tree based on their character suites. Without such data, any tree may be biased and a poor reflection of the differentiation patterns.

By not forcing the data into a series of dichotomies, neighbour-nets provide a quick, simple alternative. Unambiguous, well-supported branches in a tree will usually result in tree-like portions of the neighbour-net. Boxy portions in the neighbour-net pinpoint the ambiguous or even problematic signals from the matrix. Based on the graph, one can extract the alternatives worth testing or exploring. Support for the alternatives can be established using traditional branch support measures. Since any morphological matrix will combine those characters that are in line with the phylogeny as well as those that are at odds with it (convergences, character misinterpretations), the focus cannot be to infer a tree, but to establish the alternative scenarios and the support for them in the data matrix.


Coiro M, Chomicki G, Doyle JA. 2017. Experimental signal dissection and method sensitivity analyses reaffirm the potential of fossils and morphology in the resolution of seed plant phylogeny. bioRxiv DOI:10.1101/134262

Crepet WL, Stevenson DM. 2010. The Bennettitales (Cycadeoidales): a preliminary perspective of this arguably enigmatic group. In: Gee CT, ed. Plants in Mesozoic Time: Morphological Innovations, Phylogeny, Ecosystems. Bloomington: Indiana University Press, pp. 215-244.

Denk T, Grimm GW. 2009. The biogeographic history of beech trees. Review of Palaeobotany and Palynology 158: 83-100.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Friis EM, Crane PR, Pedersen KR, Bengtson S, Donoghue PCJ, Grimm GW, Stampanoni M. 2007. Phase-contrast X-ray microtomography links Cretaceous seeds with Gnetales and Bennettitales. Nature 450: 549-552 [all important information needed for this post is in the supplement to the paper; a figure showing the actual full analysis results can be found at figshare]

Hilton J, Bateman RM. 2006. Pteridosperms are the backbone of seed-plant phylogeny. Journal of the Torrey Botanical Society 133: 119-168.

Holland BR, Huber KT, Dress A, Moulton V. 2002. Delta Plots: A tool for analyzing phylogenetic distance data. Molecular Biology and Evolution 19: 2051-2059.

Rothwell GW, Crepet WL, Stockey RA. 2009. Is the anthophyte hypothesis alive and well? New evidence from the reproductive structures of Bennettitales. American Journal of Botany 96: 296–322.

Rothwell GW, Nixon K. 2006. How does the inclusion of fossil data change our conclusions about the phylogenetic history of the euphyllophytes? International Journal of Plant Sciences 167: 737–749.

Rothwell GW, Stockey RA. 2016. Phylogenetic diversification of Early Cretaceous seed plants: The compound seed cone of Doylea tetrahedrasperma. American Journal of Botany 103: 923–937.

Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution DOI:10.1111/2041-210X.12760.

Tuesday, June 27, 2017

Trees do not necessarily help in linguistic reconstruction

In historical linguistics, "linguistic reconstruction" is a rather important task. It can be divided into several subtasks, like "lexical reconstruction", "phonological reconstruction", and "syntactic reconstruction" — it comes conceptually close to what biologists would call "ancestral state reconstruction".

In phonological reconstruction, linguists seek to reconstruct the sound system of the ancestral language or proto-language, the Ursprache that is no longer attested in written sources. The term lexical reconstruction is less frequently used, but it obviously points to the reconstruction of whole lexemes in the proto-language, and requires sub-tasks, like semantic reconstruction where one seeks to identify the original meaning of the ancestral word form from which a given set of cognate words in the descendant languages developed, or morphological reconstruction, where one tries to reconstruct the morphology, such as case systems, or frequently recurring suffixes.

In a narrow sense, linguistic reconstruction only points to phonological reconstruction, which is something like the holy grail of computational approaches, since, so far, no method has been proposed that would convincingly show that one can do without expert insights. Bouchard-Côté et al. (2013) use language phylogenies to climb a language tree from the leaves to the root, using sophisticated machine-learning techniques to infer the ancestral states of words in Oceanic languages. Hruschka et al. (2015) start from sites in multiple alignments of cognate sets of Turkic languages to infer both a language tree and the ancestral states, along with the sound changes that regularly occurred at the internal nodes of the tree. Both approaches show that phylogenetic methods could, in principle, be used to automatically infer which sounds were used in the proto-language; and both approaches report rather promising results.

Neither of the approaches, however, is fully convincing, both for practical and methodological reasons. First, they are applied to language families that are considered to be rather "easy" to reconstruct. The tough cases are larger language families with more complex phonology, like Sino-Tibetan or any of its subbranches, including even shallow families like Sinitic (Chinese), or Indo-European, where the greatest achievements of the classical methods for language comparison have been made.

Second, they rely on a wrong assumption, namely that the sounds used in a set of attested languages are necessarily the pool of sounds that would also be the best candidates for the Ursprache. For example, Saussure (1879) proposed that Proto-Indo-European had at least two sounds that did not survive in any of the descendant languages, the so-called laryngeals, which are nowadays commonly represented as h₁, h₂, and h₃, and which left complex traces in the vocalism and the consonant systems of some Indo-European languages. Ever since then, it has been a standard assumption that the ancestral sounds of a given proto-language need not still be attested in any of its descendants.

A third interesting point, which I consider a methodological problem of the methods, is that both of them are based on language trees, which are either given to the algorithm or inferred during the process. Given that most if not all approaches to ancestral state reconstruction in biology are based on some kind of phylogeny, even if it is a rooted evolutionary network, it may sound strange that I criticize this point. But in fact, when linguists use the classical methods to infer ancestral sounds and ancestral sound systems, phylogenies do not necessarily play an important role.

The reason for this lies in the highly directional nature of sound change, especially in the consonant systems of languages, which often makes it extremely easy to predict the ancestral sound without invoking any phylogeny more complex than a star tree. That is, in linguistics we often have a good idea about directed character-state changes. For example, if a linguist observes a [k] in one set of languages and a [ts] in another set of languages in the same alignment site of multiple cognate sets, then they will immediately reconstruct a *k for the proto-language, since they know that [k] can easily become [ts] but not vice versa. The same holds for many sound correspondence patterns that can be frequently observed among all languages of the world, including cases like [p] and [f], [k] and [x], and many more. Why should we bother about any phylogeny in the background, if we already know that it is much more likely that these changes occurred independently? Directed character-state assessments make a phylogeny unnecessary.
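The logic of reconstructing on a star tree from directed transitions can be made concrete with a small sketch. The transition catalogue below is a toy assumption containing only the three pairs mentioned above (with the directions [p] to [f] and [k] to [x] assumed for illustration), and the function names are made up for this example; a real catalogue would be far larger and weighted.

```python
from collections import deque

# Toy catalogue of preferred sound changes: source sound -> possible outcomes.
TRANSITIONS = {"k": {"ts", "x"}, "p": {"f"}}

def reachable(source, transitions):
    """All sounds that can develop from `source`, including itself."""
    seen, queue = {source}, deque([source])
    while queue:
        sound = queue.popleft()
        for target in transitions.get(sound, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def reconstruct(reflexes, transitions):
    """Proto-sound candidates from which every observed reflex can be derived.

    No language tree is used: each reflex is treated as an independent
    descendant of the proto-sound (a star tree).
    """
    observed = set(reflexes)
    candidates = observed | set(transitions)
    candidates |= {t for targets in transitions.values() for t in targets}
    return sorted(c for c in candidates if observed <= reachable(c, transitions))

print(reconstruct(["k", "ts", "ts"], TRANSITIONS))
print(reconstruct(["ts", "x"], TRANSITIONS))
```

The second call returns a candidate that is not attested in any of the reflexes, which is exactly the situation that methods restricted to the pool of observed sounds cannot handle.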

Sound change in this sense is simply not well treated in any paradigm that assumes some kind of parsimony, as it simply occurs too often independently. The question is less acute with vowels, where scholars have observed cycles of change in ancient languages that are attested in written sources. Even more problematic is the change of tones, where scholars have even less intuition regarding preference directions or preference transitions; and also because ancient data does not describe the tones in the phonetic detail we would need in order to compare it with modern data. In contrast to consonant reconstruction, where we can do almost exclusively without phylogenies, phylogenies may indeed provide some help to shed light on open questions in vowel and tone change.

But one should not underestimate this task, given the systemic pressure that may crucially impact on vowel and tone systems. Since there are considerably fewer empty spots in the vowel and tone space of human languages, it can easily happen that the most natural paths of vowel or tone development (if they exist in the end) are counteracted by systemic pressures. Vowels can be more easily confused in communication, and this holds even more for tones. Even if changes are "natural", they could create conflict in communication, if they produce very similar vowels or tones that are hard to distinguish by the speakers. As a result, these changes could provoke mergers in sounds, with speakers no longer distinguishing them at all; or alternatively, changes that are less "natural" (physiologically or acoustically) could be preferred by a speech community in order to maintain the effectiveness of the linguistic system.

In principle, these phenomena are well-known to trained linguists, although it is hard to find any explicit statements in the literature. Surprisingly, linguistic reconstruction (in the sense of phonological reconstruction) is hard for machines, since it is easy for trained linguists. Every historical linguist has a catalogue of existing sounds in their head as well as a network of preference transitions, but we lack a machine-readable version of those catalogues. This is mainly because transcription systems widely differ across subfields and families, and because no efforts to standardize these transcriptions have been successful so far.

Without such catalogues, however, any efforts to apply vanilla-style methods for ancestral state reconstruction from biology to linguistic reconstruction in historical linguistics will be futile. We do not need the trees for linguistic reconstruction, but the network of potential pathways of sound change.

  • Bouchard-Côté, A., D. Hall, T. Griffiths, and D. Klein (2013): Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11. 4224–4229.
  • Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015): Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology 25.1: 1-9.
  • Saussure, F. (1879): Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Leipzig: Teubner.

Tuesday, June 20, 2017

Cichlids, species and trees

Lake Malawi, in south-eastern Africa, is famous for its large diversity of cichlid fishes. Indeed, it sometimes seems to have more biologists studying these fish than there are actual fish in the lake, even though there are allegedly hundreds of cichlid fish species in that lake. In this sense, it is somewhat similar to Lake Baikal, in southern Siberia, home to the sole species of freshwater seals.

The cichlid biologists are interested in describing the extensive fish diversity, pondering its origin, and thus its contribution to the study of speciation. After all, we are talking about what is usually claimed to be "the most extensive recent vertebrate adaptive radiation". So, we are talking here as much about population genetics as we are about ichthyology.

Inevitably, the genome biologists have been spotted in the vicinity of the lake; and we now have a preliminary report from them:
Milan Malinsky, Hannes Svardal, Alexandra M. Tyers, Eric A. Miska, Martin J. Genner, George F. Turner, Richard Durbin (2017) Whole genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow. BioRxiv 143859.
These authors summarize the situation like this:
We characterize [the] genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. Average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times.
The last sentence seems to be somewhat disingenuous. How could a single tree be expected to describe this scale of biodiversity? Any rapid radiation of diversity is unlikely to be completely tree-like. The increase in diversity can be modeled as a tree, sure, but it is very unlikely that there will be instant separation of the taxa, and so the tree model will be ignoring a large part of the evolutionary action. There will, for example, be ongoing introgression between the diverging taxa, as well as hybridization due to incomplete breeding barriers. These avenues for gene flow can best be modeled as a network, not a tree.

The issue here is that the authors write the paper solely from the perspective of an expected phylogenetic tree, and then feel compelled to explain why they do not produce such a tree. Indeed, the authors present their paper as a study of "violations of the species tree concept".

For data analysis, they proceed as follows:
To obtain a first estimate of between-species relationships we divided the genome into 2543 non-overlapping windows, each comprising 8000 SNPs (average size: 274kb), and constructed a Maximum Likelihood (ML) phylogeny separately for each window, obtaining trees with 2542 different topologies.
So, only two sequence blocks produced the same tree, presumably by random chance. An example "tree" for 12 OTUs is shown in the diagram. It superimposes a possible mitochondrial tree on a summary of the "genome tree".

Example phylogeny from Malinsky (2012)

The authors continue:
The fact that we are using over 25 million variable sites suggests these differences are not due to sampling noise, but reflect conflicting biological signals in the data. For example, gene flow after the initial separation of species can distort the overall phylogeny and lead to intermediate placement of admixed taxa in the tree topology.
Note that gene flow is seen to "distort" the phylogeny rather than being an integral part of it. In this case, "phylogeny" apparently refers solely to the diversification part of evolutionary history, rather than to the whole history.

The ultimate questions from this paper are: "what is a species concept?", and "what is a species tree?". The authors write a lot about species and trees, and yet their data provide very clear evidence that both "species" and "tree" are very restrictive concepts for studying the cichlids of Lake Malawi.

Coincidentally, another recent paper tackles the same problems:
Britta S. Meyer, Michael Matschiner, Walter Salzburger (2017) Disentangling incomplete lineage sorting and introgression to refine species-tree estimates for Lake Tanganyika cichlid fishes. Systematic Biology 66: 531-550.
The authors describe their work, on the same fish group but in a lake further north-west, as follows:
Because of the rapid lineage formation in these groups, and occasional gene flow between the participating species, it is often difficult to reconstruct the phylogenetic history of species that underwent an adaptive radiation. In this study, we present a novel approach for species-tree estimation in rapidly diversifying lineages, where introgression is known to occur, and apply it to a multimarker data set containing up to 16 specimens per species for a set of 45 species of East African cichlid fishes (522 individuals in total), with a main focus on the cichlid species flock of Lake Tanganyika. We first identified, using age distributions of most recent common ancestors in individual gene trees, those lineages in our data set that show strong signatures of past introgression ... We then applied the multispecies coalescent model to estimate the species tree of Lake Tanganyika cichlids, but excluded the lineages involved in these introgression events, as the multispecies coalescent model does not incorporate introgression. This resulted in a robust species tree.
Once again, phylogeny = species tree.

Tuesday, June 13, 2017

Bayesian inference of phylogenetic networks

Over the years, a number of methods have been explored for constructing evolutionary networks, starting with parsimony criteria for optimization, and moving on to likelihood-based inference. However, the development of Bayesian methods has been somewhat delayed by the computational complexities involved.

Network from Radice (2012)

The earliest work on this topic seems to be the thesis of:
Rosalba Radice (2011) A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.
Apparently, the only part of this work to be published has been:
Rosalba Radice (2012) A Bayesian approach to modelling reticulation events with application to the ribosomal protein gene rps11 of flowering plants. Australian & New Zealand Journal of Statistics 54: 401-426.
The method described requires the prior specification of the species tree (phylogeny), and the position and number of the reticulation events. The algorithm was implemented in the R language.

More recently, methods have been developed that infer phylogenies by using (i) incomplete lineage sorting (ILS) to model gene-tree incongruence arising from vertical inheritance, and (ii) introgression / hybridization to model gene-tree incongruence attributable to horizontal gene flow. ILS has been addressed using the multispecies coalescent.

The first of these publications was:
Dingqiao Wen, Yun Yu, Luay Nakhleh (2016) Bayesian inference of reticulate phylogenies under the multispecies network coalescent. PLoS Genetics 12(5): e1006006. [Correction: 2017 PLoS Genetics 13(2): e1006598]
The method requires the set of gene trees as input, along with the number of reticulations. The algorithm was implemented in the PhyloNet package.

In the past few months, two manuscripts have appeared that try to co-estimate the gene trees and the species network, using the original sequence data (assumed to be without recombination) as input:
Dingqiao Wen, Luay Nakhleh (2017) Co-estimating reticulate phylogenies and gene trees from multi-locus sequence data. bioRxiv 095539. [v.2; v.1: 2016]
Chi Zhang, Huw A Ogilvie, Alexei J Drummond, Tanja Stadler (2017) Bayesian inference of species networks from multilocus sequence data. bioRxiv 124982.
The algorithm for the first method has been implemented in the PhyloNet package, while the second has been implemented in the Beast2 package.

Finally, another manuscript describes a method utilizing data based on single nucleotide polymorphisms (SNPs) and/or amplified fragment length polymorphisms (AFLPs), which thus sidesteps the assumption of no recombination:
Jiafan Zhu, Dingqiao Wen, Yun Yu, Heidi Meudt, Luay Nakhleh (2017) Bayesian inference of phylogenetic networks from bi-allelic genetic markers. bioRxiv 143545.
This method has also been implemented in PhyloNet.

Due to the computational complexity of likelihood inference, all of these methods are currently severely restricted in the number of OTUs that can be analyzed, irrespective of whether these involve multiple samples from the same species or not. In this sense, parsimony-based inference or approximate likelihood methods are still useful for constructing evolutionary networks of any size. However, progress is clearly being made to alleviate the computational restrictions.

Tuesday, June 6, 2017

Bears, genomes and gene flow

It has traditionally been assumed that speciation occurs when gene flow between populations ceases. However, nothing in biology ever remains simple — the more we study any biological phenomenon the more complex it becomes. So, speciation with gene flow is becoming a more commonly discussed topic. This is especially so with the advent of genome sequencing, which allows us to study the extent of gene flow in the past, rather than solely in the present.

A case in point is the recent paper by:
Vikas Kumar, Fritjof Lammers, Tobias Bidon, Markus Pfenninger, Lydia Kolter, Maria A. Nilsson and Axel Janke (2017) The evolutionary history of bears is characterized by gene flow across species. Nature Scientific Reports 7: 46487.
This paper considers the evolutionary relationships among seven species of bears, with multiple genome samples from four of those species. The coalescent species tree (based on 18,621 genome fragments > 25 kb), which accounts for incomplete lineage sorting (ILS), is well supported, as shown here.

However, numerous individual genome-fragment trees support alternative topologies. For example, 38% of the trees support a topology where the Asiatic black bear is the sister to the American black - Brown - Polar bear clade. This suggests that there is more than simply ILS that creates the conflicting genome trees.

The authors applied several different data analyses to investigate the possibility of gene flow among the species. They found considerable evidence for gene flow, as shown in the network (the arrow colors represent different analyses).

Indeed, each of the six in-group species could conceivably be connected by gene flow to each of the other five species. The network shows evidence that the Brown, Asiatic and Sloth bears might have all five connections, while the Polar and Sun bears have four, and the American bear has three.
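As a rough sanity check on these counts, here is a minimal Python sketch. It assumes (this is an assumption for illustration, not something stated in the paper) that each connection is an undirected pair of species, so that the per-species counts sum to twice the number of connected pairs. The variable names are purely illustrative.

```python
from math import comb

# Number of gene-flow connections per in-group species, as read from the text above.
connections = {
    "Brown": 5,
    "Asiatic black": 5,
    "Sloth": 5,
    "Polar": 4,
    "Sun": 4,
    "American black": 3,
}

# Each undirected pair is counted once for each of its two species.
total_endpoints = sum(connections.values())
connected_pairs = total_endpoints // 2

# Number of distinct species pairs among the six in-group species.
possible_pairs = comb(len(connections), 2)

print(connected_pairs, "of", possible_pairs, "possible pairs")
```

Under that assumption, the counts correspond to 13 of the 15 possible species pairs.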

As the authors note, some of this potential gene flow cannot have occurred directly between species, because they live in different habitats. Instead, it may be remnants of ancestral gene flow, or gene flow through a vector species. In particular, the strongest signal of gene flow connects the Asiatic black bear with the ancestor of the American black - Brown - Polar bear clade.

Ancestral gene flow is of considerable importance when studying evolution. Charles Darwin was perhaps the first to note (in his notebooks) that we should always treat ancestors as species, not as taxonomic groups, no matter how big the groups of descendants now are. Whole kingdoms and phyla were once a single species, if the contemporary groups are monophyletic.

Tuesday, May 30, 2017

Killer arguments and the nature of proof in historical sciences

A long time ago, somebody told me this joke, which I just found again on the internet in an English version (following jokes.cc.com, with modifications based on my memory):
Teacher: "Four crows are on the fence. The farmer shoots one. How many are left?"
Little Johnny: "None."
Teacher: "Listen carefully: Four crows are on the fence. The farmer shoots one. How many are left?"
Little Johnny: "None."
Teacher: "Can you explain that answer?"
Little Johnny: "One is shot, the others fly away. There are none left."
Teacher: "Well, that isn't the correct answer, but I like the way you think."
Little Johnny: "Teacher, can I ask a question?"
Teacher: "Sure."
Little Johnny: "There are three women in the park. The first one reads a love novel, the second one reads the newspaper, and the third one updates her Facebook profile. Which one of them is married?"
Teacher: "The one reading the newspaper?"
Little Johnny: "No. The one with the wedding ring on, but I like the way you think."
Given the title of this post, you may wonder why I tell you that joke. The reason is that for me, the essence of the joke is expressing the situation we often have in the historical sciences when we talk about "proof", be it of the closer relationship of different species, or the ultimate relationship of languages. Given the evidence we are given, we can reach an awful lot of conclusions in order to arrive at a convincing story, but if we see the wedding ring on somebody's hand, we know the true story no matter what other evidence we are given. The wedding ring in the joke serves as a killer argument — no matter what other evidence we consider, it is much more likely that the person who is married is the one with the ring than anybody else.

We often face similar situations in the historical sciences where we seek some kind of true story behind a couple of facts, when we are given external evidence that is just pointing to the right answer, or — let's be careful — the most probable answer, independent of where the other evidence might point to. We can think of similar situations in crime investigations, where we may think that a large body of evidence convicts some person as a murderer until we see some video proof that reveals the real offender.

That crime investigations have a lot in common with research in the historical sciences has been noted before by many people, notably the famous Umberto Eco (1932-2016), who edited a whole anthology on the role of circumstantial evidence in linguistics, semiotics, and philosophy (Eco and Sebeok 1983) where scholars compared the work of Sherlock Holmes with the work of people in the historical sciences. What Sherlock Holmes and historical linguists (and also evolutionary biologists) have in common is the use of abduction as their fundamental mode of reasoning. The term itself goes back to Charles Sanders Peirce (1839-1914), who distinguished it from deduction and induction:
Accepting the conclusion that an explanation is needed when facts contrary to what we should expect emerge, it follows that the explanation must be such a proposition as would lead to the prediction of the observed facts, either as necessary consequences or at least as very probable under the circumstances. A hypothesis then, has to be adopted, which is likely in itself, and renders the facts likely. This step of adopting a hypothesis as being suggested by the facts, is what I call abduction. I reckon it as a form of inference, however problematical the hypothesis may be held. (Peirce 1931/1958: 7.202)
Our problem in the historical sciences is that we are searching for an original situation: what was the case a long time ago, based on general knowledge about (evolutionary or historical) processes and the results of this situation. When Sherlock Holmes looks at a crime scene, he sees the results of an action and uses his knowledge of human behaviour to find the one who was responsible for the crime. When doctors listen to the heartbeat of patients who are short of breath, they try to find out what causes their disease by making use of their knowledge about symptoms and the diseases that could have caused them. When linguists look at words from different languages, they make use of their knowledge of processes of language change and language contact in order to work out why those languages are so similar.

Like medical practitioners or crime investigators, we have our general schema, our protocol, which we use to carry out our investigations. Biologists search for similar DNA sequences, linguists look for similar sound sequences. In most cases, this works fine, although we are usually left with uncertainties and things that do not really seem to add up. As long as we can quietly follow the protocol, we are fine; and even if the results of our research do not necessarily last for a long time, being superseded by more recent research, we usually have the impression that we did the best we could, given the complex circumstances with their complex circumstantial evidence. But once in a while, we uncover evidence similar to video proof in a crime investigation, or to the wedding ring in the Little Johnny joke: evidence that is so striking that we have to put our protocol to one side and just accept that there is only one solution, no matter what the rest of our evidence or our protocol might point to.

In 1879, Ferdinand de Saussure (1857-1913) predicted two consonantal sounds in Proto-Indo-European based on circumstantial evidence (Saussure 1879). In 1927, Jerzy Kuryłowicz (1895-1978) was able to show that one of the sounds was still pronounced in Hittite, an Indo-European language that was not known during Saussure's time (Lehmann 1992: 33), and had just been deciphered. While Saussure followed protocol in his investigation, Kuryłowicz provided the video proof, and only since then has Saussure's hypothesis become communis opinio in historical linguistics.

I assume that nobody will doubt the existence of different kinds of proof, different qualities of proof, in historical disciplines. If we are left with nothing else but our protocol, we can derive certain conclusions, but we can easily abandon our protocol once we have been presented with those killer arguments, that specific kind of proof that is so striking that we do not need to bother to have a look at any alternative facts again. I do not know of any similar examples in biology, but in linguistics (and in crime investigation, at least judging from the crime novels I have read), it is obvious that our evidence can not only be ranked, but that there is also a steep gradient between the standard evidence we use to make most of our arguments and those killer arguments that are so striking that no doubt is left.

In the short story The Adventure of the Beryl Coronet, Sherlock Holmes says:
[When] you have excluded the impossible, whatever remains, however improbable, must be the truth.
But this is only partially true, as in Sherlock Holmes' cases the truth is usually (but not always!) presented in such a form that it does not leave any place for doubt. Sherlock Holmes is a genius at finding the wedding rings on the fingers of his witnesses. As historical scientists, we are often much less lucky, but probably also less talented than Mr. Holmes. We are thus left with the fundamental problem of not knowing how to find the killer evidence, or how to quantify the doubt in those cases where we just follow the general protocol of our discipline.

  • Eco, U. and T. Sebeok (1983) The Sign of Three. Dupin, Holmes, Peirce. Indiana University Press: Bloomington.
  • Lehmann, W. (1992) Historical linguistics. An Introduction. Routledge: London.
  • Peirce, C. (1931/1958) Collected Papers of Charles Sanders Peirce. Harvard University Press: Cambridge, Mass.
  • Saussure, F. (1879) Mémoire sur le système primitif des voyelles dans les langues indo-européennes. Teubner: Leipzig.

Tuesday, May 23, 2017

A test case for phylogenetic methods and stemmatics: the Divine Comedy

In a previous post I gave an outline of stemmatics, and briefly touched on the adoption and advantages of phylogenetic methods for textual criticism (On stemmatics and phylogenetic methods). Here I present the results of an empirical investigation I have been conducting, in which such methods are used to study some philological dilemmas of a cornerstone work in textual criticism, Dante Alighieri's Divine Comedy. I am reproducing parts of the text and the results of a paper still under review; the NEXUS file for this research is available on GitHub.

Before describing the analysis, I discuss the work and its tradition, as well as some of the open questions concerning its textual criticism. This should not only allow the main audience of this blog to understand (and perhaps question) my work, but it is also a way to familiarize you with the kind of research conducted in stemmatics. After all, the first step is the recensio, a deep review of all information that can be gathered about a work.

The Divine Comedy

The Divine Comedy is an Italian medieval poem, and one of the most successful and influential medieval works. It is written in a rigid structure that, when compared to other works, guaranteed it a certain resistance to copy errors, as most changes would be immediately evident. It is composed of three canticas (Inferno, Purgatory, and Paradise) comprising 100 cantos, the first of which were written in 1306-07, with the work completed not long before the death of the author in 1321. Written mostly during Dante's exile from his home city, Florence (Tuscany), it was, like many works of the time, published as the author wrote it, and not only upon completion. In fact, it is even possible, while not proven, that the author changed some cantos and published revisions, thus being himself the source of unresolvable differences.

No original manuscript has survived, but scholarship has traced the development of the tradition from copies and historical research. The poem is one of the most copied works of the Middle Ages, with more than 600 known complete copies, besides 200 partial and fragmentary witnesses. For comparison, there are around 80 copies of Chaucer's Canterbury Tales, which is itself a successful work by medieval standards.

Commercial enterprises soon developed to meet the market demand created by its success. In terms of geographical diffusion, quantitative data suggest that, before the Black Death ravaged the city of Florence in 1348, scribal activity was more intense in Tuscany than in Northern Italy, where the author had died. Among the hypotheses for its textual evolution, the results of my investigation support the widespread hypothesis that Dante published his work with Florentine orthography in Northern Italy. That is, the first copies adopted Northern orthographic standards, which would then revert to Tuscan customs, with occasional misinterpretations, when the work found its way back to Florence. These essentials of the transmission must be considered when curating a critical edition, as the less numerous Northern manuscripts, albeit with an adapted orthography, can in general be assumed to be closer to the archetype (if there ever was one to speak of) than Florentine ones.

The tradition is characterized by intentional contamination, as the work soon became a focus of politics and grammatical prescriptivism. Errors and contamination have already been demonstrated in the earliest securely dated manuscript, the Landiano of 1336 (cf. Shaw, 2011), and can already be identified in the first commentaries dating from the 1320s (such as in the one by Jacopo Alighieri, the author's son).

Critical studies

Here are some details about previous studies. I have included considerable stemmatic information, but I also include a biological analogy to help non-experts make sense of it.

The first critical editions date from the 19th century, but a stemmatic approach would only be advanced at the end of that century, by Michele Barbi. Facing the problem of applying Lachmann's method to a long text with a massive tradition, in 1891 Barbi proposed his list of around 400 loci (samples of the text), inviting scholars to contribute the readings in the manuscripts they had access to. His project, which intended to establish a complete genealogy without the need for a full collatio, had disappointing results, with only a handful of responses. Mario Casella would later (1921) conduct the first formal stemmatic study on the poem, grouping some older manuscripts in two families, α and β, of unequal number of witnesses but equal value for the emendatio. His two families are not rooted at a higher level, but he observed that they share errors supporting the hypothesis of a common ancestor, likely copied by a Northern scribe.

Casella's stemma, reproduced from Shaw (2011).

Forty years later, Giorgio Petrocchi proposed to overcome the problem of the large stemma by employing only witnesses dating from before the editorial activity of Giovanni Boccaccio, whose alterations and influence were considered to be too pervasive. Petrocchi defended a cut-off date of 1355 as being necessary for a stemmatic approach that would otherwise have been impossible, given the level of contamination of later copies. He offset the restriction in the number of witnesses by expanding the collatio to the entire text, criticizing Barbi's loci as subjective selections for which there was no proof of sufficiency.

Making use of analogies with biology, we may say that Barbi proposed to establish a tree from a reduced number of "proteins" for all possible "taxa". Casella considered this to be impracticable and, selecting a few representative "fossils", built a tree from a large number of phenotypic characteristics. Finally, Petrocchi produced a network while considering the entire "genome" for all "fossils" dated from before an event that, while well-supported in theory (we could compare its effects to a profound climate change), was nonetheless arbitrary.

Petrocchi's stemma, reproduced from Shaw (2011).

Questions about Petrocchi's methodology and assumptions were soon raised, particularly regarding the proclaimed influence of Boccaccio, without quantitative proofs either that his editions were as influential as asserted or that all later witnesses were superfluous for stemmatics. Later research focused on questioning his stemma: for example, the absence of consensus about the relationship between the Ash and Ham manuscripts, the supposedly weak demonstration of the polytomy of Mad, Rb, and Urb (the "Northern manuscripts"), and the dating of Gv (likely copied fifty to a hundred years after Petrocchi's assumption). Evidence was presented that Co, a key manuscript in his stemma, could not be an ancestor of Lau (its copyist was still active in the 15th century), and that Ga contained disjunctive errors not found in its supposed descendants. Abusing the biological analogy once more, the dating of his "fossils" was in some cases plainly wrong.

Federico Sanguineti presented an alternative stemma in 2001, arguing that a rigorous application of stemmatics would reveal errors in Petrocchi. To that end, he decided to resurrect Barbi's loci and trace the first complete genealogy, without arbitrary and a priori decisions about the usefulness of the textual witnesses. Sanguineti defended the suggestion that, after this proper recensio, a small number of manuscripts (which he eventually set to seven) would be sufficient for emendation. His stemma, described as "optimistic in its elegance and minimalism" (Shaw 2011), resulted in a critical edition that relied heavily on a single manuscript, Urb, the only witness of his β family (as Rb was displaced from the proximity it had in Petrocchi's stemma, and Mad was excluded from the analysis). Keeping with the biological analogy, he proposed building a tree from an extremely reduced number of "proteins", but for all "taxa". In the end, however, the reduced number of "proteins" was considered only for seven "taxa", selected mostly due to their age.

Sanguineti's stemma, reproduced from Shaw (2011).

The edition of Sanguineti was attacked by critics, who challenged the limited number of manuscripts used in the emendatio, the position of Rb, the high value attributed to LauSC, and the unparalleled importance of Urb, all resulting in an unexpected Northern coloring to the language of a Florentine writer. Regarding his methodology, reviewers pointed out that stemmatic principles had not been followed strictly, as the elimination was not restricted to descripti, but extended to branches that were considered to be too contaminated.

The digital edition of Prue Shaw (2011) was developed as a project for phylogenetic testing of Sanguineti's assumptions. Her edition includes complete manuscript transcriptions, and the transcriptions include all of the layers of revision of each manuscript (original readings and corrections by later hands), and are complemented by high-quality reproductions of the manuscripts. After testing the validity of Sanguineti's method and stemma, Shaw concluded that his claims do not "stand up to close scrutiny", and that the entire edition is compromised, because Rb "is shown unequivocally to be a collaterale of Urb, and not a member of α as [Sanguineti] maintains".

Applying phylogenetic methods

With the goal of following and, to a large extent, replicating Shaw (2011), I have analyzed signals of phylogenetic proximity for validating stemmatic hypotheses, produced both a computer-generated and a computer-assisted phylogeny (equivalent to a stemma), and evaluated the performance of such phylogenies with methods of ancestral state reconstruction.

I wanted to investigate the proximity of witnesses and the statistical support for the published stemmas. After experiments with rooted graphs, I made a decision to use NeighborNets, in which splits are indicative of observed divergences and edge lengths are proportional to the observed differences. These unrooted split networks were preferable because they facilitated visual investigation, and also provided results for the subsequent steps. These involved exploring the topology and evaluating potential contaminations, guiding the elimination of taxa whose data would be redundant for establishing prior hypotheses on genealogical relationships. Analyses were conducted using all manuscript layers and critical editions, both with and without bootstrapping, thus obtaining results supported in terms of inferred trees as well as of character data.
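NeighborNet works from a matrix of pairwise distances between taxa. As a minimal sketch of what "observed differences" can mean for textual witnesses, the Python snippet below computes a normalized Hamming distance (the proportion of compared loci with different readings). This is an illustration only, not the code or necessarily the exact distance used in the analysis: the sigla are borrowed from this post, but the character states, the `?` marker for missing readings, and the function names are all invented for the example.

```python
from itertools import combinations

# Hypothetical readings at five loci for four witnesses; "?" marks a missing reading.
# The sigla come from the post, but the character states are invented.
readings = {
    "Urb": "AABA?",
    "Rb": "AABAB",
    "Ash": "BABBA",
    "Ham": "BABBB",
}


def hamming_distance(a: str, b: str, missing: str = "?") -> float:
    """Proportion of loci at which two witnesses differ, skipping missing readings."""
    compared = [(x, y) for x, y in zip(a, b) if missing not in (x, y)]
    if not compared:
        return 0.0
    differences = sum(1 for x, y in compared if x != y)
    return differences / len(compared)


def distance_matrix(data: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise distances for every unordered pair of witnesses."""
    return {
        (w1, w2): hamming_distance(data[w1], data[w2])
        for w1, w2 in combinations(data, 2)
    }


for pair, dist in distance_matrix(readings).items():
    print(pair, round(dist, 2))
```

In this toy data set, Urb and Rb agree at every locus they share, as do most readings of Ash and Ham, so a distance-based method would place each pair close together. Bootstrapping, in the same spirit, would resample the loci with replacement and recompute the matrix for each replicate.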

NeighborNet of the manuscripts and revisions from my data, generated with SplitsTree
(Huson & Bryant 2006)

The analysis confirmed most of the conclusions of Shaw (2011): there are no doubts about the proximity and distinctiveness of Ash and Ham, with Sanguineti's hypothesis (in which they are collaterals) better supported than Petrocchi's hypothesis (in which the first is an ancestor of the second). The proximity of Mart and Triv was confirmed; but the position of the ancestors postulated by Petrocchi and Sanguineti should be questioned in the face of the signals they share with LauSC, perhaps because of contamination. The most important finding, in line with Shaw and in contrast with the fundamental assumption of Sanguineti, is the clear demonstration of the relationship between Rb and Urb.

The relationship analyses allowed the generation of trees for further evaluation. Although the goal was a full Bayesian tree inference, I discarded that option because, without a careful and demanding selection of priors, it would yield flawed results. I therefore decided to build trees using both stochastic inference and user design (i.e., manually). This postponed more complex topology analyses to future research, but generated the structures needed for the subsequent investigation steps; both trees are included in the datafile.

The second tree (shown below), which allows polytomies and which I constructed manually, tries to combine the findings of Petrocchi and Sanguineti by resolving their differences with the support of the relationship analyses. Using Petrocchi's edition as a gold standard, and considering only single-hypothesis reconstructions, parsimonious ancestral state reconstruction agrees with 9,016 characters (79.9%). When multiple hypotheses are considered instead, the reconstructions agree with 10,226 characters (90.7%). Cases of disagreement were manually analyzed and, as expected, most resulted from readings supported by the tradition but refuted by Petrocchi on exegetic grounds.
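To make the idea of parsimonious ancestral state reconstruction more concrete, here is a minimal Python sketch of the bottom-up pass of Fitch's algorithm on a strictly binary tree. Several things here are assumptions for illustration: the post does not say which parsimony algorithm was used, the tree shape and readings are invented, polytomies are not handled, and the reading of "single hypothesis" (exactly one state reconstructed at the root) versus "multiple hypotheses" (the gold reading is one of several reconstructed states) is only one plausible interpretation.

```python
# Hypothetical binary tree as nested tuples; leaves are witness sigla.
tree = (("Urb", "Rb"), (("Ash", "Ham"), "Triv"))

# Invented readings of the five witnesses at two loci.
locus_1 = {"Urb": "x", "Rb": "x", "Ash": "y", "Ham": "y", "Triv": "x"}
locus_2 = {"Urb": "x", "Rb": "y", "Ash": "y", "Ham": "y", "Triv": "x"}


def fitch_states(node, leaf_states):
    """Bottom-up Fitch pass: set of most parsimonious states at this node."""
    if isinstance(node, str):
        return {leaf_states[node]}
    child_sets = [fitch_states(child, leaf_states) for child in node]
    common = set.intersection(*child_sets)
    # Keep the shared states if there are any; otherwise take the union.
    return common if common else set.union(*child_sets)


def agreement(root_states, gold_reading):
    """Classify how a root reconstruction relates to the gold-standard reading."""
    if root_states == {gold_reading}:
        return "single"
    if gold_reading in root_states:
        return "multiple"
    return "disagree"


for name, locus in [("locus_1", locus_1), ("locus_2", locus_2)]:
    root = fitch_states(tree, locus)
    print(name, sorted(root), agreement(root, "x"))
```

With these invented data, the first locus yields a single reconstructed reading at the root, while the second remains ambiguous between two readings, which is the kind of case that only counts as agreement when multiple hypotheses are allowed.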

My proposed tree for the manuscripts selected by Sanguineti,
generated with PhyD3 (Kreft et al., 2017).

This tree suggests that, in general, Petrocchi's network is better supported than the tree by Sanguineti, as phylogenetic principles lead us to expect: the first was built considering statistical properties and using all of the available data, while the second relied on many intuitions and hypotheses that were never really tested. In particular, it supports the findings of Shaw and, as such, allows us to indicate the critical edition of Petrocchi as the best one. Even more importantly, however, it is further evidence of the usefulness of phylogenetic methods, when appropriately used, in stemmatics.


Alagherii, Dantis (2001) Comedìa. Edited by Federico Sanguineti. Firenze: Edizioni del Galluzzo.

Alighieri, Dante (1994) La Commedia Secondo L’antica Vulgata: Introduzione. Edited by Giorgio Petrocchi. Opere di Dante Alighieri v. 1. Firenze: Le Lettere.

Huson, Daniel H.; Bryant, David (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23: 254–267.

Inglese, Giorgio (2007) Inferno, Revisione del testo e commento. Roma: Carocci.

Kreft, Lukasz; Botzki, Alexander; Coppens, Frederik; Vandepoele, Klaas; Van Bel, Michiel (2017) PhyD3: a Phylogenetic Tree Viewer with Extended PhyloXML Support for Functional Genomics Data Visualization. BioRxiv. Doi: 10.1101/107276.

Leonardi, Anna M.C. (1991) Introduzione. In: La Divina Commedia, by Dante Alighieri. Milano: Arnoldo Mondadori Editore.

Shaw, Prue (2011) Commedia: a Digital Edition. Birmingham: Scholarly Digital Editions.

Trovato, Paolo (2016) Metodologia editoriale per la Commedia di Dante Alighieri. Ferrara. https://www.youtube.com/watch?v=BfKUOAR9PXA. Date of access: March 19, 2017.