Monday, September 22, 2014

Reducing networks to trees

I have commented before about the perceived tendency to resist thinking about evolutionary relationships as networks (Resistance to network thinking), and even to present reticulating evolutionary relationships as trees rather than as networks (The dilemma of evolutionary networks and Darwinian trees). Charles Darwin seems to be the guilty party in starting this phenomenon.

This behavior becomes particularly obvious when we consider family genealogies. A good example appears when we consider the family relationships of the Olympian gods of Ancient Greece. Several illustrations of these relationships are gathered together on the Olympian Gods Family Tree web page.

Noteworthy is the particularly frisky nature of Zeus, who "got around a bit", to put it mildly. As shown in the first diagram, Zeus was the offspring of Cronus and Rhea. However, he then fathered children with at least nine people, including two of his own sisters, an aunt, a first cousin, and several first cousins once removed, among others. This creates the complex network shown.

However, not everyone wants to draw family genealogies as reticulating networks. After all, they are usually called "family trees". As shown by the examples below, the most common way to reduce a network to a tree is simply to repeat people's names as often as necessary. That is, rather than have them appear once (representing their birth) with multiple reticulating connections representing their reproductive relationships, they appear repeatedly, once for their birth and once for each relationship, so that there are no reticulations. I will leave it to you to count how often Zeus appears in each of these so-called family trees.

Clearly, this is misleading, and it makes no sense to obscure the fact that a so-called tree is actually a reticulate network. If relationships are reticulate then it is best to illustrate them that way, rather than to disguise the networks as trees.
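
The name-repetition trick described above can be made concrete with a minimal Python sketch. It assumes a deliberately tiny subset of the Olympian genealogy (the `PARENTS` table is an illustration, not a transcription of any of the diagrams), and the helper names `network_nodes` and `tree_appearances` are hypothetical. Each person is counted once in the network, but once per recorded birth plus once per distinct partner in the repeated-name "tree".

```python
from collections import Counter

# Illustrative subset only (an assumption for this sketch, not the full
# diagram): child -> (father, mother).
PARENTS = {
    "Zeus": ("Cronus", "Rhea"),
    "Hera": ("Cronus", "Rhea"),
    "Demeter": ("Cronus", "Rhea"),
    "Ares": ("Zeus", "Hera"),
    "Persephone": ("Zeus", "Demeter"),
}


def network_nodes(parents):
    """In the reticulate network, every person appears exactly once."""
    people = set(parents)
    for father, mother in parents.values():
        people.update((father, mother))
    return people


def tree_appearances(parents):
    """Count the name copies needed when reticulations are removed by
    repetition: one copy per recorded birth, plus one per distinct partner."""
    counts = Counter(parents.keys())   # one copy for each recorded birth
    couples = set(parents.values())    # distinct (father, mother) pairs
    for father, mother in couples:
        counts[father] += 1
        counts[mother] += 1
    return counts


counts = tree_appearances(PARENTS)
print(len(network_nodes(PARENTS)), "names in the network")
print(sum(counts.values()), "name copies in the repeated-name tree")
print("Zeus appears", counts["Zeus"], "times")
```

Even in this small subset, seven names in the network become eleven name copies in the tree, and Zeus alone needs three of them (one birth plus two partners).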

Wednesday, September 17, 2014

Using data-display networks to assess evolutionary inferences

Phylogenetic networks are of two types: those that produce direct evolutionary inferences about gene flow (e.g. hybridization networks, HGT networks), and those that display multiple patterns in multivariate datasets without any necessary evolutionary implications. The latter (called data-display networks) can be used both a priori as tools for exploratory data analysis (EDA), and a posteriori as a means of evaluating (or cross-checking) the support for inferences derived from other analyses (such as evolutionary networks).

Here, I present an example of the a posteriori usage.

The data and initial analysis come from:
Fu Q, Meyer M, Gao X, Stenzel U, Burbano HA, Kelso J, Pääbo S. (2013) DNA analysis of an early modern human from Tianyuan Cave, China. Proceedings of the National Academy of Sciences of the USA 110: 2223-2227.
They describe their genome data and evolutionary analysis like this:
We have extracted DNA from a 40,000-year-old anatomically modern human from Tianyuan Cave outside Beijing, China.
To investigate the relationship of the Tianyuan individual to present-day populations, we compared it to chromosome 21 sequences from 11 present-day humans from different parts of the world (a San, a Mbuti, a Yoruba, a Mandenka, and a Dinka from Africa; a French and a Sardinian from Europe; a Papuan, a Dai, and a Han from Asia; and a Karitiana from South America) and a Denisovan individual, each sequenced to 24- to 33-fold genomic coverage. Denisovans are an extinct group of Asian hominins related to Neandertals [and used as an outgroup]. In the combined dataset, 86,525 positions variable in at least one individual are of high quality in all 13 individuals.
To more accurately gauge how the population from which the Tianyuan individual is derived was related to Eurasian populations, while taking gene flow between populations into account, we used a recent approach that estimates a maximum-likelihood tree of populations and then identifies relationships between populations that are a poor fit to the tree model and that may be due to gene flow [using the TreeMix program] ... The maximum-likelihood tree [reproduced above] shows that the branch leading to the Tianyuan individual is long, due to its lower sequence quality. However, among Eurasian populations, Tianyuan clearly falls with Asian rather than European populations (bootstrap support 100%). The strongest signal not compatible with a bifurcating tree is an inferred gene-flow event that suggests that 6.7% of chromosome 21 in the Papuan individual is derived from Denisovans ... When this is taken into account, the Tianyuan individual appears ancestral to all Asian individuals studied. We note, however, that the relationship of the Tianyuan and Papuan individuals is not resolved (bootstrap support 31%).
Setting aside the faux pas about the Tianyuan individual being "ancestral" to the others (it is shown in the tree-based figure as the sister group, not the ancestor), most of the other interpretations can be assessed by looking at the multivariate data independently of any evolutionary inference. This can be done using the pairwise nucleotide differences among the samples (provided in Table 1 of the paper) and a NeighborNet data-display network, as shown in the splits graph below.

[Note: the authors refer to their figure as a "tree", although it is an introgression network.] We can note the following points, some of which support the authors' conclusions and some of which do not:
  • All terminal edges in the network are long, and so there is actually not much genomic information on chromosome 21 about relationships.
  • The network splits do roughly match the tree splits, and so the network apparently does reflect some evolutionary information.
  • The identified gene flow from the Denisovan to the Papuan is represented by a clear split in the network. The weight (0.7335) makes it the fifth largest non-trivial split. That is, it is larger than some of the splits that purportedly represent tree-like evolution.
  • The largest split (weight = 2.8942) separates the non-African samples from the African samples + Denisovan outgroup, which does accord with the postulated dispersal of humans out of Africa.
  • The second (1.1459) and third (0.8073) largest splits are near the root of the tree.
  • The European split is the fourth largest (0.7670). The South American sample is included with the Asian group, reflecting the idea that the native people of the Americas migrated there from Asia across the Bering Strait.
  • The relationships among the Asian samples in the network do not all match those in the tree. Notably, the Han+Dai split (0.5124) is smaller than the Han+Karitiana split (0.6292), and yet the former appears in the tree with 100% bootstrap support.
  • The Han+Dai+Karitiana split is well supported (0.4450), but the Han+Dai+Karitiana+Papuan split is not (0.0152), as reflected in the 31% bootstrap value for the latter in the tree.
  • The Han+Dai+Karitiana+Papuan+Tianyuan split is not displayed in the network, although it has a long edge in the tree. The closest network split, as displayed, includes the Denisovan sample. Thus, the network emphasizes the reticulate Denisovan-Papuan relationship at the expense of showing all of the tree-like relationships among the Asian samples.
  • The Tianyuan edge is not long in the network whereas it is long in the tree. This is likely to be because of uncertainty in its placement in the tree, rather than poor sequence quality, as claimed by the authors.
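
The ranking claims in the list above can be checked directly from the split weights quoted there. The following Python sketch uses only those nine weights; the dictionary labels and the helper name `rank_splits` are my own shorthand, and the two splits near the root are not named here because the list gives only their weights.

```python
# Split weights exactly as quoted in the list above; the labels are shorthand.
SPLIT_WEIGHTS = {
    "non-African | African+Denisovan": 2.8942,
    "second largest (near root)": 1.1459,
    "third largest (near root)": 0.8073,
    "European": 0.7670,
    "Denisovan+Papuan": 0.7335,
    "Han+Karitiana": 0.6292,
    "Han+Dai": 0.5124,
    "Han+Dai+Karitiana": 0.4450,
    "Han+Dai+Karitiana+Papuan": 0.0152,
}


def rank_splits(weights):
    """Map each split label to its rank (1 = largest weight)."""
    ordered = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return {name: rank for rank, (name, _) in enumerate(ordered, start=1)}


ranks = rank_splits(SPLIT_WEIGHTS)
for name, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(f"{rank}. {name}: {SPLIT_WEIGHTS[name]:.4f}")
```

Among the quoted weights, the Denisovan+Papuan split comes fifth, ahead of every split among the Asian samples, and the Han+Dai split is indeed smaller than the Han+Karitiana one.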

Thus, the data-display network questions some of the details of the authors' evolutionary network. However, it does support placing the Tianyuan sample with the Asian ones, as well as possible gene flow from the Denisovan sample to the Papuan one.
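
One way to see why a single tree cannot show every split found in the network is the standard pairwise compatibility test: two splits can coexist in a tree only if at least one of the four intersections of their sides is empty. Here is a minimal Python sketch, using the 13 samples named in the quoted description of the data; the function name `compatible` is hypothetical.

```python
# The 13 samples named in the quoted description of the dataset.
TAXA = frozenset([
    "San", "Mbuti", "Yoruba", "Mandenka", "Dinka",
    "French", "Sardinian",
    "Papuan", "Dai", "Han", "Karitiana",
    "Tianyuan", "Denisovan",
])


def compatible(side_a, side_b, taxa=TAXA):
    """Return True if two splits (each given by one of its sides) can both
    appear in the same tree, i.e. at least one of the four pairwise
    intersections of their sides is empty."""
    a, b = frozenset(side_a), frozenset(side_b)
    a_rest, b_rest = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))


# Han+Dai and Han+Karitiana overlap only partially, so no tree shows both.
print(compatible({"Han", "Dai"}, {"Han", "Karitiana"}))
# Han+Dai is nested inside Han+Dai+Karitiana, so a tree can show both.
print(compatible({"Han", "Dai"}, {"Han", "Dai", "Karitiana"}))
# The Denisovan+Papuan split conflicts with grouping the Papuan with Asians.
print(compatible({"Denisovan", "Papuan"},
                 {"Han", "Dai", "Karitiana", "Papuan"}))
```

A tree-building method has to choose one of each incompatible pair, whereas a splits graph can display both along with their weights.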

It thus seems to be a valuable procedure to cross-check any evolutionary analysis with a data-display network. As I have noted before (Networks and bootstraps as tree-support criteria; How networks differ from bootstrapped trees), bootstrap values on a tree are insufficient as a means of assessing the robustness of evolutionary diagrams.