Wednesday, August 26, 2015

Request for datasets

During one of the discussion sessions at the recent Phylogenetic Network Workshop in Singapore, the need for "gold standard" empirical datasets was reiterated, to aid the development and validation of algorithms for phylogenetic networks.

The current collection of such datasets is located on this blog, at:
However, it is still quite a small database, as so far it has been based solely on my own ability to locate suitable datasets that are freely available (see the comments in Public availability of phylogenetic data).

I would therefore like to remind everyone that if you have, or know of, suitable empirical datasets then please contact me.

The database is currently hierarchically arranged as follows:

Datasets where the history is a tree
  Datasets where the history is known from experimentation
  Datasets where the history is known from retrospective observation
Datasets where the history is reticulated
  Datasets where the history is known from experimentation
  Datasets where the reticulation is inferred
    Lateral Gene Transfer

The basic requirement for a "gold standard" dataset that contains one or more reticulations (i.e. there is gene flow) is that the evidence for the reticulation(s) is independent of the dataset itself. That is, there should be either experimental data, or at least another independent dataset, confirming the gene flow. This is a tough criterion to meet, particularly for lateral gene transfer, but it is necessary for quality.

Finally, the database requires the processed data (e.g. a multiple sequence alignment), rather than the original raw data (see the comments in Releasing phylogenetic data).

Monday, August 24, 2015

Spinach and the iron fallacy

A few weeks ago, the Natural History Apostilles blog ran a series of posts on the origins of the well-known spinach-is-rich-in-iron fallacy. The story is more complex than expected. The usual explanation is that spinach was incorrectly claimed to be rich in iron because of a misplaced decimal point in a set of comparative data. In fact, this explanation itself seems to be untrue (read the posts).

In the blog posts, Joachim Dagg traced the origins of the alleged explanation, in detail, looking at (almost) all of the relevant historical data. One of the earliest sources of data on spinach turns out to be itself something of a mystery:
Thomas Richardson (1848) Beiträge zur chemischen Kenntnis der Vegetabilien. Annalen der Chemie und Pharmacie LXVII Bd. 3.
This was a single-page fold-out table (without page number) included at the end of volume 67 of the journal. In modern electronic copies, it has been erroneously attached to the last article in that issue.

The table contains values for a range of compounds in the ash produced from a variety of plants and their parts. These data are ripe for a visualization.

As usual, we can use a phylogenetic network as a form of exploratory data analysis, to compare all of the plants in a single diagram. I first normalized the data (since the compounds have very different ranges), and then used the Manhattan distance to calculate the dissimilarity of the plants based on their constituents. This was followed by a Neighbor-net analysis to display the between-plant distances as a phylogenetic network. So, plants (or their parts) that are closely connected in the network are similar to each other in their chemistry, and those that are further apart are progressively more different from each other.
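For readers who want to try this themselves, the first two steps (normalization and Manhattan distance) can be sketched in a few lines of Python. This is only a minimal illustration, not the exact procedure used here: the post does not specify the normalization method, so rescaling each compound to the range 0 to 1 is an assumption, the function names are hypothetical, and the toy values are invented and are not Richardson's data. The resulting distance matrix would then be passed to a program that implements Neighbor-net (SplitsTree is one such program).

```python
from typing import Dict, List

Table = Dict[str, List[float]]


def normalize_columns(table: Table) -> Table:
    """Rescale each column (compound) to the range 0-1 across all samples.

    Assumption: range scaling; other normalizations are equally plausible.
    """
    names = list(table)
    n_cols = len(table[names[0]])
    lows = [min(table[n][j] for n in names) for j in range(n_cols)]
    highs = [max(table[n][j] for n in names) for j in range(n_cols)]
    scaled: Table = {}
    for n in names:
        scaled[n] = [
            # A constant column carries no information, so map it to 0.
            (table[n][j] - lows[j]) / (highs[j] - lows[j])
            if highs[j] > lows[j] else 0.0
            for j in range(n_cols)
        ]
    return scaled


def manhattan(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(table: Table) -> Dict[str, Dict[str, float]]:
    """Pairwise Manhattan distances between all samples."""
    names = list(table)
    return {p: {q: manhattan(table[p], table[q]) for q in names} for p in names}


# Toy values, invented for illustration only (NOT Richardson's data).
toy: Table = {
    "plant_A": [10.0, 0.5, 200.0],
    "plant_B": [20.0, 1.0, 100.0],
    "plant_C": [30.0, 1.5, 300.0],
}

scaled = normalize_columns(toy)
dist = distance_matrix(scaled)

for name, row in dist.items():
    print(name, [round(row[other], 3) for other in dist])
```

Normalizing first matters because the Manhattan distance simply adds up the per-compound differences, so without it the compound with the largest numerical range would dominate the result.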

As you can see, spinach is not particularly unusual in its chemical constituents. Indeed, it is radish, leek and asparagus that are the most unusual.