Monday, July 30, 2012

Some odd network definitions and terms


The word "network" is an over-loaded term, and "network analysis" means different things to different people. There are many specific forms of network analysis used in diverse studies, such as epidemiology, metabolic pathways, phylogenetics, the internet, transportation systems, electrical circuits, project plans, etc. Here, I am going to ignore all of these quantitative ideas.

In his book A Dictionary of the English Language (1755), Samuel Johnson defined a network as:
"Nétwork. n.s. [net and work.]  Any thing reticulated or decussated, at equal distances, with interstices between the intersections."
Biologists, mathematicians and computer scientists have all found this definition to be less than helpful. Still, it was a start.

Oddly, in biology the word "network" has been used to refer to an unrooted tree. This usage arose in the early days of cladistics, from the idea that an unrooted tree represents a set of rooted trees (one potential root per edge in the tree). This usage is usually credited to James S. Farris (1970, Methods for computing Wagner trees. Systematic Zoology 19: 83-92):
"Trees are directed entities in which the root is presumed to represent a point chronologically prior to any descendent point ... If the root is not specified, we have an "undirected tree," or a network. A network with a certain set of nodes may correspond to a wide class of trees with the same nodes, each tree differing from the others in the class only in the position of its root."
Other people have also used the word network with various meanings. The Online Etymology Dictionary (2010) has this to say about the history of the various uses of the word:
Network
"net-like arrangement of threads, wires, etc.," 1560, from net (n.) + work (n.). Extended sense of "any complex, interlocking system" is from 1839 (originally in reference to transport by rivers, canals, and railways). Meaning "broadcasting system of multiple transmitters" is from 1914; sense of "interconnected group of people" is from 1947. The verb, in reference to computers, is from 1972; in reference to persons, it is attested from 1980s.
The idea of trees as line graphs is usually credited to Arthur Cayley (1857, On the theory of the analytical forms called trees. Philosophical Magazine 13: 172-176). Cayley defined his terms with reference to a set of illustrations:
"The inspection of these figures will at once show what is meant by ... the terms root, branches (which may be either main branches, intermediate branches, or free branches), and knots (which may be either the root itself, or proper knots, or the extremities of the free branches)."


It is not clear to me at what point "knots" became "nodes" in mathematical usage, or "branches" became "edges", but biologically "nodes" is more accurate than "knots", whereas "branches" is more accurate than "edges". Cayley continued to use the term "knots" in his subsequent four papers on trees (1859, 1875, 1881, 1889).

Wednesday, July 25, 2012

Are mathematical constraints biologically realistic?


Mathematicians and other computational scientists have produced their own definitions of phylogenetic networks, independently of biologists. For the evolutionary type of phylogenetic network, the definition usually looks something like this:

A phylogenetic network is a rooted, directed graph (consisting of nodes, plus edges that connect each parent node to its child nodes) such that:
(1) There is exactly one node having indegree 0, the root
        - all other nodes have indegree 1 or 2
(2) All nodes with indegree 1 have outdegree 2 or 0
        - nodes with outdegree 2 are tree nodes
        - nodes with outdegree 0 are leaves, distinctly labelled
(3) The root has outdegree 2, and
(4) Nodes with indegree 2 have outdegree 1, called reticulation nodes.
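The four conditions are simple degree constraints, so they can be checked mechanically. Here is a minimal Python sketch, assuming the network is supplied as a list of (parent, child) edges. The function name `check_network` and the small example network are hypothetical illustrations of my own, not taken from any published program, and the sketch does not test for directed cycles or for distinct leaf labels.

```python
from collections import defaultdict


def check_network(edges):
    """Classify each node under conditions (1)-(4) of the definition.

    `edges` is a list of (parent, child) pairs. Returns a dict mapping each
    node to 'root', 'tree', 'leaf' or 'reticulation', and raises ValueError
    if a node has a combination of degrees that the definition forbids.
    """
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
        nodes.update((parent, child))

    # Condition (1): exactly one node with indegree 0.
    roots = [n for n in nodes if indeg[n] == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")

    kinds = {}
    for n in nodes:
        i, o = indeg[n], outdeg[n]
        if i == 0 and o == 2:        # condition (3)
            kinds[n] = "root"
        elif i == 1 and o == 2:      # condition (2), tree node
            kinds[n] = "tree"
        elif i == 1 and o == 0:      # condition (2), leaf
            kinds[n] = "leaf"
        elif i == 2 and o == 1:      # condition (4)
            kinds[n] = "reticulation"
        else:
            raise ValueError(f"node {n!r}: indegree {i}, outdegree {o} not allowed")
    return kinds


# Hypothetical example: leaves a, b, c, where h is a hybrid of u and v.
example = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
           ("v", "h"), ("v", "c"), ("h", "b")]
kinds = check_network(example)
```

In this example `h` is reported as a reticulation node, and a node with three children (a polytomy) would be rejected, which is exactly the kind of restriction discussed below.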

An obvious question of interest is how (or whether?) this definition connects to what biologists have in mind when they use the term "phylogenetic network". Clearly, this definition places considerable restrictions on the networks that will be inferred by any mathematical algorithm, which in turn affects their use as models for biological inference.

The first thing to note is that unrooted networks are excluded, because the graph is directed. Thus, many (if not most) of the phylogenetic networks that have appeared in the literature are excluded from the discussion. Furthermore, a tree is considered to have all internal nodes with indegree 1 and outdegree 2 (i.e. no reticulation nodes), and we know this to be biologically unrealistic, in general. (Otherwise, this blog would be redundant!)

Biologically, the other parts of the definition imply:
One node of indegree 0
- the network has no previous ancestry that is to be inferred
Nodes with outdegree 0 are labelled
- observed (contemporary) taxa occur only at the leaves
All nodes with indegree 2 have outdegree 1
- reticulation and speciation cannot occur simultaneously
No nodes with indegree >2
- reticulation events cannot involve input from more than 2 parents simultaneously
No nodes with outdegree >2
- speciation involves only two children at a time.

These do not appear to be onerous biological restrictions. Indeed, the first two have been standard characteristics of tree-building for several decades. The other three are also logical extensions of the restrictions that have previously been placed on trees. However, phylogenetic history is unlikely to have been as simple as implied by these features. Thus, biologists will need to keep a careful eye on whether the simplifications are affecting the networks inferred for their particular group of organisms.

Other restrictions

In addition to the restrictions created by the definition, other topological restrictions have been used to make the mathematical inference algorithms computationally tractable. Thus, only certain sub-families of possible networks are considered by most of the computer programs. These include:
  • tree-child network, tree-sibling network
  • level-k network, galled tree
  • binary input trees for hybridization and HGT networks
  • binary characters for recombination networks.
These restrictions may be unrelated to each other; so, we can consider them separately.

Tree-child, Tree-sibling

In a tree-child network, every internal node has at least one child node that is a tree node
- i.e. a reticulation event cannot be followed immediately by another reticulation event
In a tree-sibling network, every reticulation node has at least one sibling node that is a tree node
- i.e. a parent cannot be directly involved in two separate reticulation events
Note that every tree-child network will also be a tree-sibling network, but not vice versa.
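Both properties can be tested directly from the edge list. The following is a rough Python sketch under two assumptions of my own: the network is given as (parent, child) pairs, and a child or sibling counts as a "tree node" whenever it is not a reticulation node (so leaves qualify). The function names and the two small networks are hypothetical.

```python
def build(edges):
    """Return (children, parents) lookup tables for a list of (parent, child) edges."""
    children, parents = {}, {}
    for p, c in edges:
        children.setdefault(p, set()).add(c)
        parents.setdefault(c, set()).add(p)
    return children, parents


def is_reticulation(node, parents):
    # A reticulation node is one with two parents.
    return len(parents.get(node, ())) == 2


def is_tree_child(edges):
    """Every internal node has at least one child that is not a reticulation."""
    children, parents = build(edges)
    return all(
        any(not is_reticulation(c, parents) for c in kids)
        for kids in children.values()
    )


def is_tree_sibling(edges):
    """Every reticulation node has at least one sibling that is not a reticulation."""
    children, parents = build(edges)
    for node, pars in parents.items():
        if len(pars) != 2:
            continue
        # Siblings are the other children of either parent.
        siblings = set().union(*(children[p] for p in pars)) - {node}
        if all(is_reticulation(s, parents) for s in siblings):
            return False
    return True


# Hypothetical network with one hybrid node h: tree-child (and so tree-sibling).
one_hybrid = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
              ("v", "h"), ("v", "c"), ("h", "b")]

# Hypothetical network in which both children of w are hybrids:
# tree-sibling, but not tree-child.
two_hybrids = [("r", "s"), ("r", "p2"), ("s", "p1"), ("s", "w"),
               ("p1", "a"), ("p1", "h1"), ("w", "h1"), ("w", "h2"),
               ("p2", "h2"), ("p2", "c"), ("h1", "b1"), ("h2", "b2")]
```

The second network shows why the inclusion only runs one way: node `w` has two hybrid children, which breaks the tree-child condition, yet each hybrid still has a leaf as a sibling.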

Algorithmically, these two restrictions may involve the addition of extra tree nodes to an inferred network, in order to satisfy the restrictions. Biologically, the question is whether real networks are this simple. Arenas et al. (2008) simulated data under the coalescent with recombination, and found that even at small recombination rates most of the networks produced were already more complex than a tree-sibling network. On the other hand, Arenas et al. (2010) analyzed real population-level data from the PopSet and Polymorphix databases using the TCS program, and found that >98% of the resulting networks could be characterized as tree-sibling. So, there is cause for optimism, in the sense that the "optimum" networks algorithmically are not necessarily complex, at least for closely related organisms (i.e. within species).

Level-k network, Galled tree

A network has level k if each tangled part of the network (i.e. each biconnected component) contains at most k reticulation nodes (see this previous post). This is a generalization of the older notion of galled trees, in which reticulation cycles do not overlap (i.e. do not share edges or nodes), as galled trees are level-1 networks. Level-k networks can also be seen as a generalization of networks with k reticulation nodes, although there may be a difference between a network with minimum level and one with a minimum number of reticulations.
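To make the idea concrete, here is a small Python sketch that computes the level of a network from its edge list. It is an illustration under my own assumptions, not the method of any particular program: edges are (parent, child) pairs with no parallel edges, biconnected components are found on the underlying undirected graph with a standard depth-first search, and a reticulation node is counted in a component only when both of its incoming edges lie in that component. The function names are hypothetical, and the recursion is only suitable for small networks.

```python
from collections import defaultdict


def biconnected_components(edges):
    """Biconnected components of the underlying undirected graph,
    each returned as a set of the original (parent, child) edges."""
    adj = defaultdict(list)
    for p, c in edges:
        adj[p].append((c, (p, c)))
        adj[c].append((p, (p, c)))

    depth, low = {}, {}
    stack, components = [], []

    def dfs(node, parent_edge, d):
        depth[node] = low[node] = d
        for nbr, edge in adj[node]:
            if edge == parent_edge:
                continue
            if nbr not in depth:
                stack.append(edge)
                dfs(nbr, edge, d + 1)
                low[node] = min(low[node], low[nbr])
                if low[nbr] >= depth[node]:
                    # node separates the subtree below nbr: close a component
                    comp = set()
                    while True:
                        e = stack.pop()
                        comp.add(e)
                        if e == edge:
                            break
                    components.append(comp)
            elif depth[nbr] < depth[node]:
                # edge back to an ancestor closes a cycle
                stack.append(edge)
                low[node] = min(low[node], depth[nbr])

    for start in list(adj):
        if start not in depth:
            dfs(start, None, 0)
    return components


def network_level(edges):
    """Largest number of reticulation nodes in any biconnected component."""
    level = 0
    for comp in biconnected_components(edges):
        indeg = defaultdict(int)
        for _, child in comp:
            indeg[child] += 1
        level = max(level, sum(1 for k in indeg.values() if k == 2))
    return level


# Hypothetical galled tree: a single reticulation cycle r-u-h-v.
galled = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
          ("v", "h"), ("v", "c"), ("h", "b")]

# Hypothetical network whose two reticulation cycles share the edge s-w.
overlapping = [("r", "s"), ("r", "p2"), ("s", "p1"), ("s", "w"),
               ("p1", "a"), ("p1", "h1"), ("w", "h1"), ("w", "h2"),
               ("p2", "h2"), ("p2", "c"), ("h1", "b1"), ("h2", "b2")]
```

Here `network_level(galled)` is 1 and `network_level(overlapping)` is 2. A network with two separate, non-overlapping cycles would also have two reticulation nodes but still be level-1, which is the difference between level and reticulation number noted above.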

Algorithmically, these restrictions have been used to guide the search for (or choice of) the "optimal" inferred network. Biologically, these notions do not seem to have been investigated, but basically they restrict how complex inferred reticulation histories can be. In particular, they restrict the complexity of any given subset of each network. It has been noted that optimizing k can easily lead to networks that look biologically unrealistic (Huson et al. 2011).

Binary input

The requirements for binary input trees and binary characters are restrictions that have been applied in the past, because they greatly reduce the complexity of the input to the network algorithms, but they are now being relaxed. Effectively, the restrictions are to fully dichotomous trees and SNP characters. These are not unusual restrictions in evolutionary analysis, but they are obviously unrealistic.

As I noted in an earlier blog post, non-binary data often reflect uncertainty in the input, rather than a strictly bifurcating history, and this is not taken into account in the network inference if the input is restricted to a binary state. In particular, it may be unnecessarily hard to construct a network (because not all of the data signals relate to reticulation), and the resulting networks may have far too many reticulation nodes.

Conclusion

The extent to which we can use these topologically restricted families of mathematical networks as a basis for reconstructing biological histories is still an open question. Clearly, much more work is needed to understand the connections between the mathematical restrictions and the requirements of biological modelling.

References

M. Arenas, M. Patricio, D. Posada, G. Valiente (2010) Characterization of phylogenetic networks with NetTest. BMC Bioinformatics 11: 268.

M. Arenas, G. Valiente, D. Posada (2008) Characterization of reticulate networks based on the coalescent with recombination. Molecular Biology and Evolution 25: 2517-2520.

D. H. Huson, R. Rupp, C. Scornavacca (2011) Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press.

Monday, July 23, 2012

Charles Darwin's unpublished tree sketches, Part 3


In two previous blog posts I discussed those unpublished tree sketches housed among Charles Darwin's manuscript notes (Part 1) and those contained in his letters (Part 2). In this new post I point out that there is technically one other "unpublished" empirical phylogenetic tree.

In his Monograph on the Sub-class Cirripedia, published in two volumes (1851 and 1854) by the Ray Society, Darwin provided a comprehensive taxonomic revision and classification of the known species of barnacles. Since he had already developed his ideas about evolution, and the relationship between taxonomy and phylogeny, when he conducted the barnacle work, it has been assumed that the classification was produced within a modern phylogenetic context. So, when Darwin makes explicit comments in the Monograph about the relationships between the barnacle taxa it has been assumed that this refers to phylogenetic relationships. However, Darwin never published an explicit phylogenetic tree of this, or any other, taxonomic group.

Nevertheless, in 1973 Michael Ghiselin and Linda Jaffe did attempt to uncover Darwin's implicit phylogeny of the barnacles (Phylogenetic classification in Darwin's monograph on the sub-class Cirripedia. Systematic Zoology 22: 132-140). This was done by constructing the tree based on Darwin's written descriptions of relationships (i.e. the words were turned into a picture). The resulting picture shows both the phylogeny of the genera and their classification.


This therefore counts as an unpublished empirical tree, but this time contained among Darwin's published works, rather than his notebooks and letters.

Wednesday, July 18, 2012

The first gene transfer (HGT) network (1910)


I have previously noted in this blog that the first two published phylogenetic networks (by Buffon in 1755 and Duchesne in 1766) were hybridization networks. This leads to the obvious question: what was the first phylogenetic network illustrating horizontal gene transfer (HGT)?

This depends, of course, on exactly how one defines "HGT". If we require explicit reference to genes, then this must post-date the origin of our current understanding of genetics and the nature of genetic material. The first description of HGT is usually credited to Victor Freeman (1951), which thus sets an earliest possible date. However, it will take quite some bibliographic investigation to work out who first illustrated this with a phylogenetic network (none of the earliest reports were concerned with phylogeny). [See the later post The first HGT network.]

However, if we consider HGT to be a subset of genome transfer (or genome fusion), which is the horizontal transfer of an entire organismal genome, then a much earlier date becomes possible. This is because the idea of endosymbiosis, which posits eukaryote organelles as the acquisition of different bacterial genomes, dates back more than a century.

For example, Constantin Mereschkowsky (1905) developed his symbiogenesis theory with the explicit goal of explaining the evolutionary development of land plants from algae-like forms of life, postulating that chloroplasts originated as symbiotic blue-green algae. Richard Altmann (1890) had already proposed (indirectly) that what we now call mitochondria are also symbionts.

Mereschkowsky (1910) then took this idea further, and developed a scenario for the origin of the nucleus and cytoplasm from two kinds of organisms and two kinds of protoplasm, called mycoplasm and amoeboplasm. Each kind of protoplasm had an origin in different historical epochs. He illustrated this two-stage symbiosis idea with an explicit network, which appears on page 366 of his paper.


Mereschkowsky's own interpretation of this diagram as a genome-transfer network thus seems clear enough, even though he makes no explicit reference to a genome.

References

Altmann R. (1890) Die Elementarorganismen und ihre Beziehungen zu den Zellen. Veit, Leipzig.

Freeman V.J. (1951) Studies on the virulence of bacteriophage-infected strains of Corynebacterium diphtheriae. Journal of Bacteriology 61: 675–688.

Mereschkowsky C. (1905) Über Natur und Ursprung der Chromatophoren im Pflanzenreiche. Biologisches Centralblatt 25: 593–604.

Mereschkowsky C. (1910) Theorie der zwei Plasmaarten als Grundlage der Symbiogenese, einer neuen Lehre von der Entstehung der Organismen. Biologisches Centralblatt 30: 278–303, 321–347, 353–367.

Monday, July 16, 2012

Phylogenetic network of the FIFA World Cup


Since this is post #50 in this blog, I thought I might try something ambitious, just to celebrate.

There have been several attempts to provide visualizations of the relative success of the different national teams at the FIFA World Cup competitions. This is quite a complex task, because there have been 19 competitions so far, and at least 74 teams have competed in the finals at least once. The relationships between these teams represent a network within each competition, based on their relative success at the games they play, and this network changes through time across the various competitions. Here, I review some of the previous network analyses, and then I present a combined analysis of all of the competitions based on a phylogenetic network.

Background to the Association Football World Cup

The Fédération Internationale de Football Association (FIFA) World Cup™ competition has been played every 4 years since 1930, except 1942 and 1946. Teams qualify for places in the finals by playing against other teams within defined geographical regions: Europe, Africa, Asia, Oceania, North+Central America, and South America. Most of the teams qualify for the finals by succeeding within their region, but the remainder qualify in a subsequent inter-region competition. The host nation(s) automatically qualify.

The number of teams competing has changed dramatically over the years (13-204), as has the number of teams accepted into the finals (13-32). Here is a summary of the finalists at the time of the 2010 competition (yellow represents previous finals participations and red the 2010 one). It also shows the seven countries who have won the competition.


Given the 80 years over which the competitions have been held, there have been some changes in the political entities that the teams represent. Confusion over this issue affects some of the graphs shown below. FIFA officially attributes the various results as follows:
(i) all West Germany results go to Germany (leaving 1 finals result for East Germany);
(ii) all Yugoslavia and Serbia & Montenegro results are attributed to Serbia (since the break-up, both Croatia and Slovenia have reached the finals independently);
(iii) all Czechoslovakia results are attributed to both the Czech Republic and Slovakia;
(iv) all USSR results go to Russia (only Ukraine has reached the finals independently).

The results available for analysis are for the finals only, at the end of which FIFA provides an ordering of these teams based on their success in the finals. The full data are presented at the official FIFA site, and a summary is reproduced at Wikipedia.

In the data, some of the "zero" results are attributable to the team not competing in that year's World Cup competition at all, while others are attributable to the team not getting to the finals. Only the Brazilian and French teams have competed in all 19 editions, and only the Brazilians have made it to the finals every time.

The format of the finals competition has also changed over the years, at least partly in response to the increasing number of teams involved. Nevertheless, an official ranking of all finals teams has been produced for each edition. Otherwise: the ball is round, a team has 11 players (plus a substitute or three), and a match takes 90 minutes (possibly with extra time, and maybe a bizarre lottery called a "penalty shoot-out").

Previous Analyses

The finals competition is usually considered to be the most widely viewed sporting event in the world, surpassing even the Olympic Games. Not unexpectedly, there is now an enormous internet presence before, during and after each Cup, and some of the web sites have rather impressive data visualizations. These consist of (i) pre-competition viewing information, team analyses and result forecasts, (ii) competition game presentations, and (iii) post-game summaries and incredibly detailed de-constructions of each game (with every move made by each player and its effect on the outcome).

Some of the data visualizations for the 2010 competition have been collected for viewing at:

An example of a network analysis used as a team summary is illustrated here, in which directed line graphs (from 2010 Football World Cup Graphs) show the pattern of ball movement, averaged across the first-round games from the 2010 finals. The nodes of the graphs represent the players and the arrows represent the ball passes, with the size and colour of the arrows representing the number of passes between players.


Note that the German team mainly builds their attacks from their defenders (notably #17 and #16), which is a strategy they have been successfully using for many decades (they are, on average, the most consistently successful team in World Cup history). The English team works mainly from the midfield, and concentrates their attack through player #10. (Note that although #16 receives the ball frequently from the defenders he usually just passes it back.) Sadly, this concentration presents an easily predicted strategy, and it focuses play precisely where the Germans are concentrating their defensive work (on the field, German #16 plays near English #4). So, if these two teams were to meet, and played in a similar manner to what is shown in the networks, the outcome is easy to forecast. (The Germans won 4-1 when the two teams met in the second round of the competition.) Forecasting is not always this easy, of course.

This is a type of social network, and it is amenable to examination using the standard network summaries for each player. For example, closeness centrality (the summed shortest pathlengths to all other nodes, measuring how easy it is to reach a given node in the network) measures how well connected a player is in the team; and betweenness centrality (the number of inter-node shortest paths on which a node lies, measuring the extent to which a node lies on a path to other nodes) measures how the ball flow between players depends on each other player. Both teams have relatively evenly distributed centrality values, in this example, so that no single player can be said to be a "key" player for either team.
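For readers who want to see how these summaries are calculated, here is a minimal Python sketch for an unweighted, undirected, connected graph stored as an adjacency dictionary. Several things here are assumptions of mine rather than anything used for the graphs above: passes are treated as undirected and unweighted, closeness is reported in the common reciprocal form (n - 1) divided by the summed shortest path lengths so that larger values mean better connected, and betweenness shares tied shortest paths fractionally using a Brandes-style accumulation. The four-player example is hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def degree(adj):
    """Number of edges incident to each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def closeness(adj):
    """(n - 1) / (sum of shortest path lengths); assumes a connected graph."""
    n = len(adj)
    return {v: (n - 1) / sum(bfs_distances(adj, v).values()) for v in adj}


def betweenness(adj):
    """Number of shortest paths between other pairs of nodes that pass
    through each node, with tied shortest paths shared fractionally."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Hypothetical team in which every pass goes through player "10".
passes = {"10": ["4", "7", "9"], "4": ["10"], "7": ["10"], "9": ["10"]}
```

In this extreme example player "10" has betweenness 3 (one for each pair of team-mates) and everyone else has 0, which is the kind of "key player" pattern that neither real team shows.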

Another form of social network connects the players and their country teams and club teams (from FIFA 2010 World Cup as Networks). In this example, the nodes represent the countries in the 2010 finals, and two countries are connected if they have players who share the same club in which they play. Node size represents (closeness?) centrality in the network.


Centrality in an organizational social network has been linked to team and individual performance, because the players could transfer knowledge from different clubs to their own country teams. In this example, their large centrality might have been a contributor to the Netherlands' success, as they appeared in the final. However, they then lost to the Spanish team, who have a very low centrality (most of their players play in Spain).

In addition to this sort of analysis, there have also been attempts to summarize and visualize the competition results from the entire 19-cup history. For example, there is an interactive set of bar charts at: FIFA World Cup Statistics with Tableau.

However, a network can try to summarize the information in a single diagram, rather than a set of diagrams. This has been tried by:
Ulrik Brandes (2006) Centrality: Concepts and Methods. NetSci 2006 Workshop, 16-19 May 2006, Bloomington, Indiana, USA.
The two networks shown here summarize the data from the matches at the World Cup finals from 1930–2002. The nodes represent the teams, and the arrows represent the results of all of the matches played between each pair of teams (the arrows point to the winner).

Betweenness centrality

Note that the German team, with the greatest average success across all competitions, has the greatest betweenness centrality (defined above), rather than the Brazilian team, who have won the most Cups (4 to the Germans' 3, at the time of the graph).

Closeness centrality

There is no single "most central" team for closeness centrality, but instead a group of those teams who had appeared most often in the finals, to that time (the unlabelled teams include France, Hungary, former Yugoslavia and Czechoslovakia).

Node-degree centrality is not shown here (the number of incident edges to a node, measuring how well-connected each node is), but Brazil is most central based on that measure, having appeared in all of the Cup finals and thus having played against more teams than anyone else.

Another interesting attempt to simultaneously view the entire dataset using networks is provided by:
Adel Ahmed, Xiaoyan Fu, Seok-Hee Hong, Quan Hoang Nguyen, Kai Xu (2010) Visual analysis of history of World Cup: a dynamic network with dynamic hierarchy and geographic clustering. Pages 25-39 in M.L. Huang, Q.V. Nguyen, K. Zhang (eds) Visual Information Communication. Springer, New York.
The two network visualizations shown here summarize the data from the matches at the World Cup finals from 1930–2006. As above, in the network for each year the nodes represent the teams, and the arrows represent the results of all of the matches. The network summaries are based on node-degree centrality, as the more successful teams play in more games. Unfortunately, the dataset used separates the results for "Germany" and "West Germany" (contrary to FIFA), thus reducing the apparent success of the German team.

The first graph displays the centrality values as a wheel, with the size of each node representing the value. The yearly values are arranged in concentric circles (coloured by year), with 1930 in the centre, and the countries are represented by the spokes (indicated by their flags, and grouped into their geographical regions). This reveals the change in centrality value for each team through time. The German, Brazilian and Italian teams, for example, each have an almost continuous series of nodes. This construction can be viewed more clearly in the animation of the graph provided by the authors.


The next graph arranges the teams in concentric circles based on groupings of their centrality values (they are grouped based on the range of values across the 18 Cups), with the team having the highest value (Brazil) in the centre. All of the games played are represented as connecting lines. This graph thus super-imposes the results for all 18 competitions (i.e. it is the union of the separate networks for each year). The size of the nodes represents the largest centrality value observed for each team.


The authors present an animation of this graph, in order to show the change in the 18 component networks through time. Each competition network forms a slice that can be viewed separately.

These networks provide no particularly deep insight into the history of the World Cup, in the sense that they summarize only patterns that are already obvious in the data. Nevertheless, they are effective summaries of a complex time-series of dynamic networks.

Phylogenetic Analysis

A phylogenetic analysis seeks to uncover the historical patterns associated with a group of objects for which multi-variable data have been collected. It is thus related to other multivariate analysis techniques, such as ordination and clustering, as well as to line-graph visualization techniques.

The network analysis assumes, of course, that the data have been formed by some historical process(es), and it produces a visualization that places objects with similar histories near each other in the network. The World Cup data are thus ideal for this type of analysis.

For my analysis, the FIFA rank-order data for each Cup were range-scaled to vary from 1 (last in the order) to 2 (first in the order), to deal with the varying number of finalists. Absence from the finals was coded as 0 (which could be due to not competing that year, or to competing but not qualifying for the finals).
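As a sketch of what this coding might look like in Python, assuming a simple linear rescaling of the ranks (the exact formula is my illustration here; the function name `scale_rank` and the example values are hypothetical):

```python
def scale_rank(rank, n_finalists):
    """Range-scale a finals rank to the interval [1, 2].

    rank 1 (first) maps to 2.0 and rank n_finalists (last) maps to 1.0.
    A team absent from the finals is passed as None and is coded as 0.0.
    """
    if rank is None:
        return 0.0
    return 1.0 + (n_finalists - rank) / (n_finalists - 1)


# Hypothetical team: winner of a 13-team finals, absent from the next one,
# then placed 16th in a 32-team finals.
scores = [scale_rank(1, 13), scale_rank(None, 16), scale_rank(16, 32)]
```

Scaling within each Cup means that finishing last among 13 finalists and last among 32 finalists both score 1, so that the scores are comparable across competitions of different sizes.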

For those teams that have changed through time (listed above), I have followed (i) and (ii). For (iii), I have attributed all results to Czech/Slovakia, since the Czech Republic and Slovakia have never been in the finals together. [It is unnecessary to have the results duplicated, since the two countries would be almost perfectly correlated.] For (iv) I have also attributed all results to Russia/Ukraine, since they have never been in the finals together.

The similarity among the 19 scores for each pair of teams was calculated using the Steinhaus dissimilarity. The Steinhaus dissimilarity ignores "negative matches", as discussed in a previous blog post, so that two teams are not considered to be similar just because they were both absent from the finals in the same years. This is important, because (a) there are another c. 130 teams who have always been absent from the finals (and would then need to be accommodated), and (b) we would need to somehow account for the two different reasons for being absent from the finals.
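For non-negative scores, the Steinhaus dissimilarity (also known as the Bray-Curtis dissimilarity) is one minus twice the sum of the pairwise minima, divided by the sum of all of the scores for the two teams. A minimal Python sketch follows; the function name and the example scores are hypothetical, and this is not the code used for the analysis here.

```python
def steinhaus(x, y):
    """Steinhaus (Bray-Curtis) dissimilarity between two non-negative vectors.

    Years in which both teams score 0 add nothing to either the numerator
    or the denominator, so shared absences are ignored.
    """
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("both teams have no finals appearances")
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1.0 - 2.0 * shared / total


# Hypothetical scaled scores for two teams over three Cups.
team_x = [2.0, 0.0, 1.5]
team_y = [1.0, 0.0, 1.5]
d = steinhaus(team_x, team_y)

# Adding a further Cup from which both teams were absent changes nothing.
d_extra = steinhaus(team_x + [0.0], team_y + [0.0])
```

Two teams with identical records have dissimilarity 0, and two teams that never appeared in the same finals have dissimilarity 1, no matter how many Cups they both missed.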

A Neighbor-net analysis was used to display the between-team similarities as a phylogenetic network. This decomposes the similarities into a series of bi-partitions of the teams, and then tries to display as many of these bi-partitions as possible in two dimensions. Each bi-partition represents the division of the teams into two sub-groups, where the data indicate that the two sub-groups differ in some way. That is, countries that are closely connected in the network are similar to each other based on their World Cup results, and those that are further apart are progressively more different from each other.


To interpret the graph, it can be noted that the biggest split (i.e. best supported by the data) separates North Korea, Greece, Algeria, New Zealand and Honduras into a partition apart from the other teams. Inspection of the original data shows that these five teams all appeared in the 2010 finals and did poorly, while not appearing in most of the other finals.

The network has two main bi-partitions of interest, and the split that separates each sub-group is highlighted in red or blue in the graph. This pattern of two bi-partitions thus creates four quadrants in the network. The lower-left quadrant (from Romania clockwise round to Serbia/Montenegro) contains those teams who have been successful on most of those occasions when they have appeared in the finals (e.g. they have made it to the quarter-finals). Note that the most successful teams (Brazil, Germany, Italy) do not stand out in this group. The left quadrant (Mexico to South Korea) contains those teams whose finals results have varied from very good to very poor. The upper quadrant (Ecuador to Iran) contains those teams who have usually been moderately successful whenever they have qualified for the finals (e.g. they have made it to the second round). The right quadrant (Jamaica to Norway) contains those teams who have usually been unsuccessful when they have appeared in the finals (e.g. they have been eliminated in the first round).

This phylogenetic network thus provides a very effective summary of the main features of the World Cup results when averaged over all of the competitions.

If we want an alternative network summary that emphasizes the success of the most successful teams, then it would have to include the "negative matches", because one of the main indicators of a team's success is the fact that they have appeared in most of the finals (i.e. they have few zeroes, indicating absence). The similarity measure that includes these, but is otherwise equivalent to the Steinhaus dissimilarity, is the Manhattan distance. Note that this analysis treats all absences from the finals as equivalent, and (arbitrarily) includes only those teams who have made it to the finals at least once.
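The relationship between the two measures can be shown in a few lines of Python. For non-negative scores, the Steinhaus dissimilarity is simply the Manhattan distance divided by the sum of all of the scores of the two teams, and it is this division that removes the influence of shared absences. The four teams below are hypothetical, chosen only to make the contrast obvious.

```python
def manhattan(x, y):
    """Sum of absolute differences between two score vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def steinhaus(x, y):
    """Manhattan distance rescaled by the total of both score vectors."""
    return manhattan(x, y) / (sum(x) + sum(y))


# Hypothetical scaled scores over four Cups.
rare_a = [1.0, 0.0, 0.0, 0.0]    # one poor appearance
rare_b = [0.0, 1.0, 0.0, 0.0]    # one poor appearance, in a different year
strong = [2.0, 2.0, 2.0, 2.0]    # always first
regular = [1.0, 1.0, 1.0, 1.0]   # always present, always last

# Manhattan: the two rarely-present teams look close (2.0), because their
# shared absences count as agreement; strong vs regular is further apart (4.0).
m_rare = manhattan(rare_a, rare_b)
m_present = manhattan(strong, regular)

# Steinhaus: the two rarely-present teams are maximally different (1.0),
# while the two ever-present teams are fairly similar (about 0.33).
s_rare = steinhaus(rare_a, rare_b)
s_present = steinhaus(strong, regular)
```

So the Manhattan distance groups teams partly by how often they were absent, which is what pulls the regular finalists together in the network.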


This network basically highlights those teams who have appeared in the semi-finals at least once (I have labelled only these teams on the network). It does, however, also strongly emphasize the most successful teams, with most of the winners at the very top of the graph. Uruguay is separated from the other winners because it has rarely done well since its early wins in 1930 and 1950. The other (unlabelled) teams in the upper part of the network are Mexico, Romania, Scotland (top to bottom on the left) and Switzerland (on the right), who regularly have made it to the quarter-finals but never to the semi-finals.

This phylogenetic network thus provides an effective summary of the successful teams when averaged over all of the competitions.

Note that the two phylogenetic networks succeed in a very different manner from the other types of network, as shown above. Here the summary is based on calculating the similarity of the teams' results and then displaying this as a network, whereas in the other, more traditional, approach the networks are first derived as direct displays of the data and then their centrality is calculated and displayed.

Furthermore, the traditional networks consist only of edges connecting "observed" nodes, whereas the phylogenetic networks have many extra "inferred" nodes and edges. These inferred nodes are designed to support the display of sets of incompatible bi-partitions — they are not intended to be hypothetical extra teams!

These are thus quite different approaches to the same visualization problem. They decompose the problem in different ways and produce different visualizations. They seem to be equally effective, however.

Wednesday, July 11, 2012

Evolutionary trees: old wine in new bottles?


Biblical scripture tells us about the dangers of putting new wine into old bottles (Matthew 9:17, Mark 2:22, Luke 5:37). Here, I will say a few words about the equal folly of putting old wine into new bottles.

A number of authors have considered the iconography used to display biological (and other) relationships, and noted that it both reflects the way biologists think and influences their ongoing thought processes (Brace 1981; Stevens 1984; O'Hara 1992; Stevens 1994; Clark 2001; Ragan 2009; Gontier 2011; Tassy 2011). In particular, several of these authors have noted that the Darwinian revolution did not much change the way that many biologists thought about biological relationships, although it did to some extent change the way they presented them. That is, they practiced the art of putting old wine (their old way of thinking, based on a linear series of relationships) into new bottles (an evolutionary tree, allegedly based on genealogical history).

Changing iconography

The following series of icons illustrates this historical process. We started with the image of a series of steps ascending from "lower" forms to "higher" ones, an idea apparently originating with Aristotle, with ourselves near the top (e.g. Llull 1304, which illustrates the Ladder of the Intellect).

From Llull (1304).
Note that Homo is between the plant & animal steps and the sky, angel & god steps.

We then converted this to the formal Scala Naturae, or Great Chain of Being (e.g. Bonnet 1745), which eliminates the non-living forms and non-earthly beings and thus restricts itself to observed biological phenomena.

From Bonnet (1745).
Note that Homo is at the top of the chain.

Lamarck (1809) challenged this simple linear series, using instead a branching diagram to represent transformations among biological forms, with each form transforming into another form through time. However, Lamarck's version of evolution was still essentially a transformation series among organisms, and thus his diagram was simply a slightly branched version of the Scala Naturae.

Darwin (1859) then challenged the very idea of transformation series, insisting upon both the origin of new biological forms and the extinction of some of the old forms. He used a bush as his icon, but we have always referred to it as a tree. Indeed, it is worth noting that Darwin never explicitly refers to any of his diagrams as a "tree" (see previous blog post), referring instead to "descent with modification". The diagrams were intended to represent his ideas on continuity of descent through time, and the role of speciation in increasing biodiversity and extinction in counter-balancing this (see Kutschera 2011). Their primary purpose was not the representation of phylogenetic history.

From Darwin (1859).

Darwin's "Tree of Life" metaphor is thus quite independent of his diagrams, both published and unpublished (see Penny 2011). However, it is the one paragraph of the Origin containing this poetic imagery that seems to have inspired people. Indeed, Wallace (1855), the co-discoverer of "Darwinian evolution", had already used a very similar image, when he noted: "the analogy of a branching tree [is] the best mode of representing the natural arrangement of species ... a complicated branching of the lines of affinity, as intricate as the twigs of a gnarled oak ... we have only fragments of this vast system, the stem and main branches being represented by extinct species of which we have no knowledge, while a vast mass of limbs and boughs and minute twigs and scattered leaves is what we have to place in order, and determine the true position each originally occupied with regard to the others".

This theoretical icon was taken up by empirical scientists such as Mivart (1865) and Haeckel (1866), who then used a bush as the icon for their presentation of explicit hypotheses of genealogical relationship.

From Mivart (1865).
Note that Homo is on a side-branch.

From Haeckel (1866).
Note that Homo is on a side-branch.

However, this lead was quickly abandoned by many people, including notably Haeckel himself (1874), who subsequently drew trees with a distinct central trunk. They thus, in effect, re-drew the Scala Naturae as a tree rather than as a chain, the only important difference being that some of the forms appear on side-branches rather than along the main trunk. We might call this icon an implicit Scala Naturae rather than an explicit one. *

From Haeckel (1874).
Note that Homo is at the top of the central trunk.

From Smallwood et al. (1948).
Note that Homo is at the top of the central trunk.

This approach can be used to put any chosen organism at the crown of the tree, not just human beings, as illustrated by Scott (1986). This is the fundamental difference from a chain — a chain is linked so that there are only two possible ends, but a tree can be drawn so that any part of the tree is at the crown.

From Scott (1986).

This sequence of icons shows you that, indeed, one can (metaphorically) put old wine in new bottles — the Scala Naturae can be put into an evolutionary tree. This thinking can be detected throughout modern evolutionary publications (Nee 2005). Unfortunately, it re-inforces the view that evolution is a purposeful and goal-directed process, which runs counter to current scientific understanding.

In this sense, we have lost much of what Darwin tried to tell us back in 1859: that the history of life is a multi-stemmed bush, not a single-stemmed tree.

Effect on evolutionary biology

Baum and Smith (2012) have noted the following:

"We do not know why it should be so, but we have learned from working with thousands of students that, without contrary training, people tend to have a one-dimensional and progressive view of evolution. We tend to tell evolution as a story with a beginning, a middle, and an end. Against that backdrop, phylogenetic trees are challenging; they are not linear but branching and fractal, with one beginning and many equally valid ends. Tree thinking is, in short, counterintuitive."

That is, the problem illustrated above is still widespread.

The effects of the distorted iconography manifest themselves in several ways in modern evolutionary biology. This topic has received considerable attention in the literature, and there are a number of very readable expositions on various parts of it (e.g. O'Hara 1992, 1997; Krell and Cranston 2004; Baum et al. 2005; Crisp and Cook 2005; Gregory 2008; Omland et al. 2008; Sandvik 2009; MacDonald and Wiley 2012). Some of the main points of potential bias are:
(i) presenting a sequence of contemporary taxa so that a main axis passes through the diagram (as I have illustrated above); (ii) the left–right ordering of the taxa at the tips (which is mistakenly interpreted as representing an evolutionary series); (iii) the selective pruning of side branches (thus making one line of evolution more prominent); (iv) the use of paraphyletic groups; and (v) the differential resolution of branches.

In particular, in a bush the relationships among monophyletic groups (clades) are equal, in the sense that each clade is the sister to some other clade and vice versa. Thus, clades cannot be "basal" or "crown", because each single clade branches from some other single clade, rather than each clade being a side-branch from a main stem. Logically, at each speciation event two new species arise, rather than one species producing an extra offshoot species. There is no main stem in an evolutionary tree, but instead there is a series of branches leading to a series of twigs, even if some of the branches do have more twigs than others.
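The points about tip ordering and sister clades can be made concrete with a small sketch. It assumes a rooted tree written as nested tuples of tip names (a representation chosen here purely for illustration; the function names are hypothetical). Two drawings with very different left-to-right tip orders turn out to be the same tree once the order of the branches at each node is ignored.

```python
def canonical(tree):
    """Order-independent form of a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return tree
    # A frozenset discards the left-right order of the sister branches.
    return frozenset(canonical(child) for child in tree)


def tip_order(tree):
    """Left-to-right order of the tips as the tree would be drawn."""
    if isinstance(tree, str):
        return [tree]
    return [tip for child in tree for tip in tip_order(child)]


# The same relationships drawn two ways (illustrative taxa only).
drawing_1 = ("fish", ("frog", ("lizard", "human")))
drawing_2 = ((("human", "lizard"), "frog"), "fish")
# A genuinely different set of relationships, for contrast.
other_tree = ("frog", ("fish", ("lizard", "human")))

print(tip_order(drawing_1))
print(tip_order(drawing_2))
print(canonical(drawing_1) == canonical(drawing_2))
print(canonical(drawing_1) == canonical(other_tree))
```

The first drawing reads as a series leading up to humans, and the second reads as a series leading away from them, yet they encode identical sister-group relationships. Only the third tree differs in content.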

Networks

My interest in raising these issues here is in considering whether these apparently widespread problems are likely to affect the use of reticulating phylogenetic networks as much as they do dichotomous phylogenetic trees. Since I recognize two forms of phylogenetic network being used in practice, there are two situations to evaluate when considering an answer to this question.

Unrooted networks
If the icons used for displaying biological relationships are non-genealogical "webs, nets, maps, or other basically horizontal, planar, reticulating structures" (Stevens 1984), as discussed in a previous blog post, then there seems to be little likelihood of linear distortions being produced. This is a simple by-product of the unrooted nature of the diagrams and the consequent lack of direction in the relationships (i.e. the relationships between taxa are symmetrical).

Rooted networks
Any rooted, and thus explicitly directional, diagram can be subjected to linear distortions by the simple expedient of emphasizing some directions at the expense of others. In this sense, rooted reticulating networks and non-reticulating trees are prone to the same potential problems. It might be argued that reticulations make it harder to emphasize a single direction, particularly if there are many reticulations, but this seems to be a weak argument. We will thus need to guard ourselves against the implicit Scala Naturae when employing evolutionary networks (e.g. hybridization networks, HGT networks, recombination networks; see previous blog post) just as much as when employing evolutionary trees.

* Footnote: In Haeckel's defence, I should quote from Richards (2011): "Haeckel regarded these two types of diagrams as having different purposes. The first represented ... a proper stem-tree, one highly branched. The latter diagram simply looked back from a given organism — in this case man — to its lineal progenitors. It's as if one began with the first kind of tree and traced back the series of man's direct ancestors— and this would result in that second kind of tree. Haeckel had not precipitously regressed, within the space of a few years, into a dogmatic teleologist."

References

Baum D.A., Smith S.D. (2012) Tree Thinking: An Introduction to Phylogenetic Biology. Roberts & Company Publishers, Greenwood Village, CO.

Baum D.A., Smith S.D., Donovan S.S. (2005) The tree-thinking challenge. Science 310: 979-980.

Brace C.L. (1981) Tales of the phylogenetic woods: the evolution and significance of evolutionary trees. American Journal of Physical Anthropology 56: 411-429.

Clark C.A. (2001) Evolution for John Doe: pictures, the public, and the Scopes trial debate. Journal of American History 87: 1275-1303.

Crisp M.D., Cook L.G. (2005) Do early branching lineages signify ancestral traits? Trends in Ecology and Evolution 20: 122-128.

Gontier N. (2011) Depicting the Tree of Life: the philosophical and historical roots of evolutionary tree diagrams. Evolution: Education and Outreach 4: 515-538.

Gregory T.R. (2008) Understanding evolutionary trees. Evolution: Education and Outreach 1: 121-137.

Krell F.-T., Cranston P.S. (2004) Which side of a tree is more basal? Systematic Entomology 29: 279-281.

Kutschera U. (2011) From the scala naturae to the symbiogenetic and dynamic tree of life. Biology Direct 6: 33.

MacDonald T., Wiley E.O. (2012) Communicating phylogeny: evolutionary tree diagrams in museums. Evolution: Education and Outreach 5: 14-28.

Nee S. (2005) The great chain of being. Nature 435: 429.

O'Hara R.J. (1992) Telling the tree: narrative representation and the study of evolutionary history. Biology and Philosophy 7: 135-160.

O'Hara R.J. (1997) Population thinking and tree thinking in systematics. Zoologica Scripta 26: 323-329.

Omland K.E., Cook L.G., Crisp M.D. (2008) Tree thinking for all biology: the problem with reading phylogenies as ladders of progress. BioEssays 30: 854-867.

Penny D. (2011) Darwin’s theory of descent with modification, versus the biblical Tree of Life. PLoS Biology 9: e1001096.

Ragan M. (2009) Trees and networks before and after Darwin. Biology Direct 4: 43.

Richards R.J. (2011) Images of evolution. American Scientist 99: 165-167.

Sandvik H. (2009) Anthropocentrisms in cladograms. Biology and Philosophy 24: 425-440.

Stevens P.F. (1984) Metaphors and typology in the development of botanical systematics 1690-1960, or the art of putting new wine in old bottles. Taxon 33: 169-211.

Stevens P.F. (1994) The Development of Biological Systematics: Antoine-Laurent de Jussieu, Nature, and the Natural System. Columbia Uni. Press, New York.

Tassy P. (2011) Trees before and after Darwin. Journal of Zoological Systematics and Evolutionary Research 49: 89-101.

Wallace A.R. (1855) On the law which has regulated the introduction of new species. Annals and Magazine of Natural History (n.s.) 16: 184-196.

Image Sources

Bonnet C. (1745) Traité d'Insectologie, première partie. Durand, Paris.

Darwin C. (1859) On the Origin of Species. John Murray, London.

Haeckel E. (1866) Generelle Morphologie der Organismen. Reimer, Berlin.

Haeckel E. (1874) Anthropogenie oder Entwickelungsgeschichte des Menschen. Engelmann, Leipzig.

Lamarck J.-B. (1809) Philosophie Zoologique. Dentu et l'Auteur, Paris.

Llull R. (1304, published 1512) Liber de Ascensu et Descensu Intellectus. Jorge Costilla, Valencia.

Mivart St G. (1865) Contributions towards a more complete knowledge of the axial skeleton in the Primates. Proceedings of the Zoological Society of London 33: 545-592.

Scott J.A. (1986) The Butterflies of North America: a Natural History and Field Guide. Stanford Uni. Press, Stanford.

Smallwood W.M., Reveley I.L., Bailey G.A., Dodge R.A. (1948) Elements of Biology. Allyn & Bacon, Boston.

Monday, July 9, 2012

Eurovision Song Contest 2012: a network analysis


Some time ago I considered the Eurovision Song Contest for 2006, and provided a network analysis of the voting patterns among the countries. It is now time to consider the 2012 contest, especially since the local (i.e. Swedish) contestant won. (I am not sure that I would have voted for this song, myself.) There are a number of interesting similarities and differences between the two years.


The data analysis is the same as last time, using the Steinhaus dissimilarity (or Bray-Curtis similarity), so that "negative matches" do not create similarity due to which songs are not voted for (since we have no score for those songs), and using a NeighborNet analysis to display the phylogenetic network. The previous post can be consulted for a detailed description of the rationale, as well as for an introduction to the contest for the uninitiated.

Countries that are closely connected in the network are similar to each other based on their voting patterns, and those that are further apart are progressively more different from each other. I have used the same country colour codes as last time: red represents the countries from Northern Europe and around the Baltic; green is for the countries from Central-Eastern Europe; orange is for the countries of Western Europe; blue is for the countries from Southern Europe and the Middle East; and purple is for the countries from the former Soviet Union, in Far Eastern Europe.


The same strong geographical patterns in the voting are present in 2012 as in 2006. Particularly noteworthy is the divide between Western+Northern Europe (on the left of the graph) and Eastern Europe (on the right). This is so despite the fact that the make-up of the competing countries is somewhat different between the two years.

This is related to the recent success of songs from eastern Europe, in a contest that used to be completely dominated by western and northern Europe. Former countries such as Yugoslavia and the U.S.S.R., which would each have been represented by a single song that was voted for by very few countries, are now represented by many songs, with many extra countries to vote for them. There are more countries on the right part of the graph than on the left.

Some of the "outlying" countries in the 2006 analysis (i.e. not located with their geographical compatriots) are again outliers in 2012, notably Lithuania, Malta and Switzerland. This replicated pattern may bear looking into by sociologists.

There are, not unexpectedly, some countries that are "outliers" in 2012 but not in 2006, such as Cyprus and Greece, and ones that were "outliers" in 2006 but not in 2012, such as Romania and Ukraine. Interestingly, the other 2006 "outliers" did not compete in 2012 (Andorra, Armenia, Monaco and Poland).

There are also countries represented in 2012 that did not compete in 2006, thus allowing some of the previous voting predictions to be tested. Intriguingly, some of these countries also have unexpected placements in the network: (i) Hungary and Slovakia, from Central-Eastern Europe, voted with the Northern Europe - Baltic countries; and (ii) Azerbaijan, from the former Soviet Union, voted with the Central-Eastern Europe countries. However, Italy and San Marino, from Southern Europe, did vote with at least one of their geographical compatriots.

This set of comparisons suggests that unusual voting patterns may be associated with countries that compete in the contest only sporadically. That is, only regular participants vote in a predictable way. This hypothesis is testable by examining more years.

The repeated presence of strong geographical patterns in the Eurovision voting behaviour has been noted by previous commentators (see the previous post for references). It would thus be of interest to see how well such patterns can be displayed by a phylogenetic network analysis of the combined data for several contests. This analysis would also examine the suggested inconsistent voting of countries who compete only sporadically.

Wednesday, July 4, 2012

Time inconsistency in evolutionary networks


The temporal ordering of the nodes (and branches) is usually treated as an important feature in an evolutionary network of biological organisms, because the order must be time consistent (Baroni et al. 2004, 2006; Moret et al. 2004). That is, for reticulation events the "horizontal" gene flow can only occur between species that are contemporaries. So, speciation events occur successively but reticulation events occur instantaneously (Sang and Zhong 2000).

For example, it would be unrealistic to hypothesize either a hybridization or a horizontal gene transfer event between a species and one of its own ancestors. Furthermore, each reticulation event must not only be consistent on its own but must be consistent in relation to all of the other events.

Mathematically, inconsistency creates directed pseudo-cycles in the network graph, so that it is not acyclic, as required for an evolutionary history (see previous blog post). Time consistency is thus seen as a useful means of validating a network as a potential biological history, and can even be used as a criterion for choosing among otherwise equally optimal networks.
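As a rough illustration of how such a check might work, here is a minimal sketch under a simplified formalization assumed for this example: every speciation (tree) edge must run from an earlier node to a later one, and every reticulation edge must join nodes that exist at the same time. Under that assumption, a network is time consistent if, after merging all nodes joined by reticulation edges into single time points, the tree edges contain no directed cycle. The function name and the edge-list input format are hypothetical, not taken from any of the cited papers.

```python
from collections import defaultdict, deque
from itertools import chain


def is_time_consistent(tree_edges, reticulation_edges):
    """Return True if times can be assigned so that tree edges go forward
    in time and reticulation edges join contemporaries (assumed model)."""
    nodes = {n for edge in chain(tree_edges, reticulation_edges) for n in edge}
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Nodes joined by a reticulation edge must share one time point.
    for u, v in reticulation_edges:
        parent[find(u)] = find(v)

    successors = defaultdict(set)
    indegree = {find(n): 0 for n in nodes}
    for u, v in tree_edges:
        cu, cv = find(u), find(v)
        if cu == cv:
            return False  # an ancestor would be contemporary with its descendant
        if cv not in successors[cu]:
            successors[cu].add(cv)
            indegree[cv] += 1

    # Kahn's algorithm: every merged time point is ordered iff there is no cycle.
    queue = deque(c for c, d in indegree.items() if d == 0)
    ordered = 0
    while queue:
        c = queue.popleft()
        ordered += 1
        for nxt in successors[c]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return ordered == len(indegree)


# Hybrid h formed from two contemporary lineages x and y.
ok_tree = [("root", "x"), ("root", "y"), ("x", "a"), ("y", "b"), ("h", "c")]
ok_retic = [("x", "h"), ("y", "h")]

# Hybrid h formed from x and from y, where y is a descendant of x.
bad_tree = [("root", "x"), ("x", "y"), ("y", "a"), ("h", "c")]
bad_retic = [("x", "h"), ("y", "h")]

print(is_time_consistent(ok_tree, ok_retic))
print(is_time_consistent(bad_tree, bad_retic))
```

The second example corresponds to the unrealistic case described above, where a species would have to exchange genes with one of its own ancestors. The same check also catches sets of events that are each plausible on their own but cannot all be ordered in time together.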

However, evolutionary analysis is not applied only to biological organisms. It has also been applied to the study of languages (Atkinson & Gray 2005) and to cultural objects (Collard et al. 2006). Indeed, Darwin himself recognized early on that it would be important to show that language (a characteristic solely of humans) had a natural origin and that it develops in a genealogical fashion (ie. it has a pedigree).

Thus, both language and cultural objects have an historical component that can be studied, and both can fit into an evolutionary framework of variation + transmission + selection (Dagg 2011). Moreover, the evolutionary history also consists of both vertical and horizontal transmission. This means that the same data-analysis techniques can potentially be applied to biology, language and culture (Heggarty et al. 2010; Gray et al. 2010).

The issue that I wish to raise here is that time consistency is not a requirement of the evolution of either language or cultural objects, the way that it is for biological organisms. Organisms store the information (that is vertically and horizontally transmitted) in genes that they carry with them, which is what restricts reticulation to occurring only between contemporaries. However, language and culture store their "information" externally, either in the minds of people or in permanent or semi-permanent records (either written or pictorial). Thus, the information available for horizontal transmission can come from the distant past, as well as from the present. *

It is important to note that for language and culture the biological ideas of vertical and horizontal transmission of genetic information need modification (Cavalli-Sforza and Feldman 1981). Vertical (or descending) transmission still involves faithful copying of the information (with perhaps some losses or minor modifications). Lateral transfer, however, can be either horizontal transmission (between contemporary generations) or oblique transmission (between different generations), and it is the latter that allows time-travel of information.

Lateral transfer in this context may be a form of hybridization, in which new concepts are added from elsewhere (e.g. synonymous words), but is more likely to be a form of HGT, in which concepts are simply replaced with something from elsewhere (e.g. a new word effectively replaces an old word). Recombination, in which concepts are mutually exchanged, may be rather rare.

As an illustration, Dagg (2011) provides some interesting examples of lateral transfer in the parts of mouse traps. For example, he notes that: "Torsion power may have been transmitted laterally from Egyptian torsion traps to prefabricated dead-fall traps." These traps need not be contemporaneous, because the ideas being transferred may be from pictures or descriptions of old traps rather than from concurrently existing traps. (Joachim Dagg also has a couple of blog posts where he further discusses the evolution of mouse traps: post 1 and post 2.)

As an alternative example, Johnson et al. (1989) provide an evolutionary network showing the history of the various software (mostly) and hardware components of the revolutionary Xerox 8010 "Star" Information System (ie. computer), introduced in April 1981. Note that almost all of the lateral transfer events (single arrows; mostly hybridization) are time inconsistent. To quote the authors: "Although Star was conceived as a product in 1975 and was released in 1981, many of the ideas that went into it were born in projects dating back over three decades."

Fig. 8 – How systems influenced later systems.
This graph summarizes how various systems related to Star have influenced one another over the years. Time progresses downwards. Double arrows indicate direct successors (i.e., follow-on versions). Many "influence arrows" are due to key designers changing jobs or applying concepts from their graduate research to products.

The implications of time-travelling laterally transferred information for network construction methods may be unfortunate, in the sense that evolutionary networks in biology may be quite different from those for language and culture, with the latter pair requiring somewhat different methods. At a minimum, the requirements for choosing among alternative networks will be different.

A quick look at the current literature involving network analysis of languages and cultural artifacts shows an almost universal use of unrooted graphs, most often a Neighbor-Net, Reduced-Median or Median-Joining network. Such networks cannot directly represent evolutionary history because there is no time direction in the graph. This type of analysis thus neatly side-steps the issue of representing time-travelling information in an evolutionary diagram; and it suggests that social scientists have not yet considered the consequences of the potential lack of time consistency in their data.

* Footnote: I suppose that I should be precise, and note that a modern gene bank does allow genetic information to time travel, as well.

References

Atkinson QD, Gray RD (2005) Curious parallels and curious connections — phylogenetic thinking in biology and historical linguistics. Systematic Biology 54: 513-526.

Baroni M, Semple C, Steel M (2004) A framework for representing reticulate evolution. Annals of Combinatorics 8: 391–408.

Baroni M, Semple C, Steel M (2006) Hybrids in real time. Systematic Biology 55: 46–56.

Cavalli-Sforza LL, Feldman MW (1981) Cultural Transmission and Evolution. Princeton University Press, Princeton.

Collard M, Shennan SJ, Tehrani JJ (2006) Branching, blending, and the evolution of cultural similarities and differences among human populations. Evolution and Human Behavior 27: 169–184.

Dagg JL (2011) Exploring mouse trap history. Evolution: Education and Outreach 4: 397–414.

Gray RD, Bryant D, Greenhill SJ (2010) On the shape and fabric of human history. Philosophical Transactions of the Royal Society of London series B 365: 3923-3933.

Heggarty P, Maguire W, McMahon A (2010) Splits or waves? Trees or webs? How divergence measures and network analysis can unravel language histories. Philosophical Transactions of the Royal Society of London series B 365: 3829-3843.

Johnson J, Roberts TL, Verplank W, Smith DC, Irby C, Beard M, Mackey K (1989) The Xerox "Star": a retrospective. IEEE Computer 22: 11-29.

Moret BME, Nakhleh L, Warnow T, Linder CR, Tholse A, Padolina A, Sun J, Timme R (2004) Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1: 13–23.

Sang T, Zhong Y (2000) Testing hybridization hypotheses based on incongruent gene trees. Systematic Biology 49: 422–434.

Sunday, July 1, 2012

Amazon is trying to tell us something


If one goes to the Humour section of the Books department at www.amazon.co.uk, and does a search for "phylogeny", one gets the following search results:

Inferring Phylogenies by Joseph Felsenstein
Advances in Sponge Science: Phylogeny, Systematics, Ecology edited by Mikel Becerro
Mathematics of Evolution and Phylogeny edited by Olivier Gascuel

If one goes to the Humor & Entertainment section of the Books department at www.amazon.com, and performs the same search, one gets the following search results:

Computational Paleontology edited by Ashraf M.T. Elewa
Evolutionary Biology: Concepts, Molecular and Morphological Evolution edited by Pierre Pontarotti
Lecture Notes in Computer Science 1075 - Proceedings of CPM 1996
Lecture Notes in Bioinformatics 2066 - Proceedings of JOBIM 2000
Lecture Notes in Bioinformatics 2452 - Proceedings of WABI 2002
Lecture Notes in Bioinformatics 3678 - Proceedings of RECOMB-CG 2005
Lecture Notes in Bioinformatics 4205 - Proceedings of RECOMB-CG 2006
Lecture Notes in Bioinformatics 4751 - Proceedings of RECOMB-CG 2007
Lecture Notes in Bioinformatics 5267 - Proceedings of RECOMB-CG 2008
Lecture Notes in Bioinformatics 5817 - Proceedings of RECOMB-CG 2009
Lecture Notes in Bioinformatics 6398 - Proceedings of RECOMB-CG 2010
Lecture Notes in Bioinformatics 5542 - Proceedings of ISBRA 2009
Lecture Notes in Bioinformatics 7292 - Proceedings of ISBRA 2012

None of these books is in any way humorous (at least not intentionally), and so either (i) the cataloguing schemes used by the various Amazon stores leave something to be desired, or (ii) Amazon is telling us that we look funny to the rest of the world.

Perhaps the difference between the two lists has something to do with the American Amazon insisting that "Humor" is also "Entertainment"? I guess that we should be grateful that there are not more books on these lists.