Monday, October 5, 2015

A network of blood types

The relationship between phenotypes and allele frequencies is often introduced in textbooks using the example of human blood type. There are three alleles for the blood-type gene (IA, IB, IO), and these produce four phenotypes (A, B, AB, and O) since IA and IB are co-dominant and IO is recessive.
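
The mapping from three alleles to four phenotypes can be written out explicitly. The following is a minimal sketch in Python, assuming random mating within a single population (that is, Hardy-Weinberg proportions). The function name and the example allele frequencies are made up for illustration and are not taken from any dataset.

```python
def abo_phenotype_frequencies(p_a, p_b, p_o):
    """Expected ABO phenotype frequencies under Hardy-Weinberg proportions.

    p_a, p_b, p_o are the frequencies of the IA, IB and IO alleles
    (hypothetical parameter names); they must sum to 1.
    """
    if abs(p_a + p_b + p_o - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return {
        # Genotypes IA/IA and IA/IO both show phenotype A (IO is recessive).
        "A": p_a ** 2 + 2 * p_a * p_o,
        # Genotypes IB/IB and IB/IO both show phenotype B.
        "B": p_b ** 2 + 2 * p_b * p_o,
        # IA and IB are co-dominant, so IA/IB shows phenotype AB.
        "AB": 2 * p_a * p_b,
        # Only the IO/IO homozygote shows phenotype O.
        "O": p_o ** 2,
    }


# Illustrative allele frequencies only, not real data.
example = abo_phenotype_frequencies(0.3, 0.1, 0.6)
print(example)
```

Note that a population with no IB allele (p_b = 0) necessarily has no B or AB phenotypes, whatever the other two frequencies are.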

The proportions of these phenotypes vary among human ethnic groups, and this variation provides one of the simplest demonstrations that human inter-breeding does not occur at random — that is, Hardy-Weinberg equilibrium is not maintained at a global scale. This can be pictured using a phylogenetic network.

The data come from Racial and Ethnic Distribution of ABO Blood Types. As usual, the phylogenetic network is being used as a form of exploratory data analysis. I first used the Manhattan distance to calculate the dissimilarity between each pair of ethnic groups, based on the frequencies of the four blood phenotypes. This was followed by a Neighbor-net analysis to display the between-group distances as a phylogenetic network. So, ethnic groups that are closely connected in the network are similar to each other based on the relative frequencies of their blood types, and those that are further apart are progressively more different from each other.
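
The distance step can be sketched in a few lines of Python. This is only an illustration of the Manhattan distance itself: the group names and percentages below are invented, not taken from the dataset, and the Neighbor-net step is not shown (it is usually run in a dedicated program such as SplitsTree, with a distance matrix like this one as input).

```python
from itertools import combinations

PHENOTYPES = ("O", "A", "B", "AB")

# Hypothetical phenotype percentages for three made-up groups.
groups = {
    "Group1": {"O": 50, "A": 35, "B": 10, "AB": 5},
    "Group2": {"O": 52, "A": 34, "B": 11, "AB": 3},
    "Group3": {"O": 100, "A": 0, "B": 0, "AB": 0},
}


def manhattan(x, y):
    """Sum of absolute differences across the four phenotype frequencies."""
    return sum(abs(x[k] - y[k]) for k in PHENOTYPES)


def distance_matrix(profiles):
    """Return the sorted group names and a symmetric dict of pairwise distances."""
    names = sorted(profiles)
    dist = {(name, name): 0 for name in names}
    for a, b in combinations(names, 2):
        d = manhattan(profiles[a], profiles[b])
        dist[(a, b)] = d
        dist[(b, a)] = d
    return names, dist


names, dist = distance_matrix(groups)
for a, b in combinations(names, 2):
    print(a, b, dist[(a, b)])
```

In this toy example the first two groups end up close together, while the third (which has only the O phenotype) is far from both.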

You will note that very few of the ethnic peoples that are either geographically or historically related have similar distributions of blood types. Indeed, only the Irish and the Scots are closely related both in history and in the network. So, at a global scale, breeding occurs almost entirely within ethnic groups and not between them. Widespread modern migration has not yet obscured this pattern.

There is, however, a broad range of phenotypic variation in blood type. For example, the bottom right-hand part of the network shows those ethnic groups that are dominated by the O phenotype, the top-right is dominated by type A, and the bottom-left by type B.

Of particular interest are those groups for which the B allele has not been recorded in the dataset (i.e. the B and AB phenotypes are absent). These include the Australian Aborigines, the Bororo and Peruvian Indians from South America, the Shompen Nicobars from the Indian Ocean, and the Blackfoot and Navajo peoples from North America. The Maoris and Mayans also have a very low frequency of this allele. The Bororo, Peruvian Indian and Shompen peoples also seem to lack the A allele, and it is extremely rare in the Mayans. No group lacks the O allele, but its frequency is lowest in the people from the Grand Andaman islands in the Indian Ocean.
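
Statements about alleles here are inferences from phenotype frequencies. One simple way to make such an inference is the square-root method usually attributed to Bernstein, which is sketched below in Python. It assumes Hardy-Weinberg proportions within the group, which may not hold for any real sample; the function name and the example frequencies are hypothetical, and this is not necessarily how the allele statements above were derived.

```python
from math import sqrt


def estimate_abo_alleles(freq_a, freq_b, freq_ab, freq_o):
    """Rough allele-frequency estimates from ABO phenotype proportions.

    Assumes Hardy-Weinberg proportions within the group. The inputs are
    proportions (not percentages) and must sum to 1; freq_ab is used only
    in that check.
    """
    if abs(freq_a + freq_b + freq_ab + freq_o - 1.0) > 1e-6:
        raise ValueError("phenotype proportions must sum to 1")
    # O phenotype = IO/IO homozygotes, so freq(IO) = sqrt(O).
    r = sqrt(freq_o)
    # B + O = (freq(IB) + freq(IO))^2 = (1 - freq(IA))^2.
    p = 1.0 - sqrt(freq_b + freq_o)
    # A + O = (freq(IA) + freq(IO))^2 = (1 - freq(IB))^2.
    q = 1.0 - sqrt(freq_a + freq_o)
    return {"IA": p, "IB": q, "IO": r}


# Illustrative proportions only, not real data.
print(estimate_abo_alleles(0.45, 0.13, 0.06, 0.36))
```

With real data the three estimates will generally not sum exactly to 1, because observed samples deviate from the assumed proportions.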

Wednesday, September 30, 2015

Are networks actually used to explore reticulate histories?

A look at the modern literature clearly shows that many, if not most, researchers do not use network methods when exploring reticulate evolutionary histories. As examples of the range of possible approaches, I will briefly discuss two papers from a recent journal issue.

Archaic introgression
Pengfei Qin and Mark Stoneking (2015) Denisovan ancestry in East Eurasian and Native American populations. Molecular Biology and Evolution 32: 2665-2674.
The data used for this study of archaic introgression in hominids were genome-wide SNPs from 2,493 modern humans, plus a chimpanzee and two fossils, one from the only known Denisovan individual and one from a Neandertal. The data were reduced to f4 summary statistics, which assess the correlation between the allele frequency differences of two pairs of populations. (If populations A and B are consistent with forming a clade with respect to populations C and D, then the f4 statistic is expected to be 0.) The proportions of introgressions between populations were then calculated as the ratios between selected f4 statistics. Finally, the results of the series of calculations were presented as an admixture (or introgression) network.
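
To make the f4 statistic concrete, here is a minimal Python sketch of the usual definition: the average, over SNPs, of the product of the allele-frequency differences for the two pairs of populations. The function name and the toy frequencies are made up for illustration. This is not the authors' pipeline, and it omits the standard errors that a real analysis would need.

```python
def f4(freq_a, freq_b, freq_c, freq_d):
    """f4(A, B; C, D): mean over SNPs of (a - b) * (c - d).

    Each argument is a list of allele frequencies for one population,
    with one entry per SNP and the same SNP order in all four lists.
    """
    n = len(freq_a)
    if n == 0 or not (n == len(freq_b) == len(freq_c) == len(freq_d)):
        raise ValueError("need four non-empty lists of equal length")
    products = (
        (a - b) * (c - d)
        for a, b, c, d in zip(freq_a, freq_b, freq_c, freq_d)
    )
    return sum(products) / n


# Toy frequencies for two SNPs (hypothetical, not real data).
pop_a = [0.2, 0.6]
pop_b = [0.1, 0.4]
pop_c = [0.5, 0.7]
pop_d = [0.3, 0.4]
print(f4(pop_a, pop_b, pop_c, pop_d))
```

A value near 0 means that the differences between A and B are uncorrelated with those between C and D, as expected if each pair forms a clade. The introgression proportions are then ratios of two such values; which populations enter the numerator and the denominator depends on the study design, and is not reproduced here.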

There are design problems with this experiment, but at least the authors do use an explicit method to produce the introgression pattern for their phylogenetic network. They do, however, draw the network manually.

The obvious experimental problem is lack of replication, which is a basic requirement of traditional science. In this case, the work is ostensibly about archaic introgression, but there is no replication of the Denisovan, Neandertal or chimpanzee samples, which are the key ones for quantifying archaic patterns. Mind you, there are only a couple of bones of the Denisovan, so the lack of replication is hardly surprising, however regrettable it may be.

There are also technical problems, such as the artifactual arch pattern in the PCA plot (see Distortions and artifacts in Principal Components Analysis analysis of genome data).

Finally, note that the "introgression" arrows in the network do not point from the ostensible source but always from a sister taxon of that source. This is basically the argument that we cannot know ancestors, and so we must represent them as sister taxa to their putative descendants in an evolutionary diagram.

Yeast recombination
Baojun Wu, Adnan Buljic and Weilong Hao (2015) Extensive horizontal transfer and homologous recombination generate highly chimeric mitochondrial genomes in yeast. Molecular Biology and Evolution 32: 2559-2570.
The authors studied aligned sequences of 40 mitochondrial genomes from yeasts, and report "extensive, homologous-recombination-mediated, mitochondrial-to-mitochondrial HGT, leading to genomes that are highly chimeric." Recombination was evaluated using various methods from the RDP4 program. Horizontal gene transfer (HGT) was evaluated by comparing different mitochondrial genome regions (introns as well as exons). No phylogenetic network was presented to summarize the phylogenetic relationships, just a long series of incongruent gene (or locus) trees.

The lack of a network summary in HGT studies is quite common. This is in spite of the availability of programs to evaluate HGT and display the results. The focus in such studies seems to be on mechanisms rather than on the phylogenetic history.

The general experimental issue with the study of HGT is that the evidence for it is solely inference from incongruence: (i) incongruent gene trees must be the result of either incomplete lineage sorting (ILS), gene duplication-loss (DL) or gene flow, and (ii) if it is the latter and the taxa are not closely related, then it is called HGT. This is not particularly strong evidence, especially when ILS and DL are not explicitly evaluated. These days, there are several methods available for doing this.