Sunday, December 25, 2016

James Bond, alcoholic

Merry Christmas to everyone. As usual for this blog at this time of year, for your Christmas reading we will take a look at a particular aspect of human consumption, in this case alcohol.

James Bond was created in 1953 by Ian Fleming (who also created Chitty-Chitty-Bang-Bang, The Magical Car), and over a 14-year period there was a series of 12 novels and two short-story collections. The rights to the character were purchased for the film world in the 1960s, so that over the past 50 years we have had a franchise of 24 official films, plus two other licensed ones (Casino Royale in 1967, and Never Say Never Again in 1983).

Actually, the first licensed Bond film was a long-forgotten one made for CBS TV in 1954. This was a 1-hour version of Casino Royale, starring Barry Nelson as Bond, Peter Lorre as Le Chiffre, and Linda Christian as a renamed Vesper Lynd (see Barry Nelson - den bortglömde Bond).

This movie infographic (excluding the 2015 film, and the unofficial films) is from The Economist.

The Bond character

James Bond has been portrayed in films officially by six different actors, but the character remains essentially the same, although somewhat different from the one depicted in the books.

In early 1997, the monthly magazine Men's Health published an article in which doctors and psychologists commented on the life and lifestyle of the Bond character, the world's most un-secret secret agent (see Sprit, kvinnor och cigarretter tog livet av James Bond). The results were not good — Bond was either dead or close to it, as he was a paranoid, impotent alcoholic.

Bond's psychological profile was that of an emotionally stunted psychopath of type A who suffers from post-traumatic stress. According to Fleming's books, Bond was orphaned at age 11 (his parents died in a mountaineering accident), he lost his virginity in a brothel in Paris at 16, and killed his first mistress the following year. An ideal man to be a licensed assassin.

His massive daily alcohol consumption (all carefully documented in both the books and films) makes him a category 3 alcoholic. This means that he couldn't possibly have done his actual job competently; and it should also have led to violent temper outbursts (which may explain the government-sanctioned killing sprees). The liquor should also have led to a shrinking of his genitals, and have damaged his liver to the extent that it could no longer break down estrogen, so that he started to develop breasts and become impotent. His well-documented sexual excesses would also make him a prime candidate for sexually transmitted diseases. On top of this, the books (but not the films) also document a comprehensive smoking habit.

Bond was, of course, a form of wish-fulfillment for his creator, Ian Fleming, who was also a heavy drinker and smoker. He died of a heart attack at age 56, an age that Bond himself could not possibly have out-lived. Bond was more in danger from his own lifestyle than from SMERSH, or anyone else bent on world domination.

Bond is thus more a collection of memes than an actual character. This infographic is from the GBShowPlates website, and summarizes Bond's lifestyle.

The Bond drinks

Just about every aspect of Bond's career has been analyzed, and ranked, from the music to the cars to the watches, and most especially the women (the so-called "Bond girls"). However, much of the interest seems to lie in the booze, which is what we will look at here.

Along with coffee (and, once, tea), Bond has consumed copious amounts of alcohol, which he tends to drink alone, or in private settings. He is also what is known as a "label drinker", in that the brand is at least as important as the bottle's contents. This is a gift for the liquor industry, which, along with the car industry, is perpetually looking for opportunities for "brand placement" in films and sporting events. Fleming was chastised for introducing this into his books, but he simply replied that it was an attempt to round out the character.

The novels have received special medical attention from Graham Johnson, Indra Neil Guha, and Patrick Davies (2013. Were James Bond’s drinks shaken because of alcohol induced tremor? British Medical Journal 347: f7255). They recorded every drink consumed in every book, calculated the number of alcohol units involved, and then converted that to daily intake (since the books are quite clear about their time span).

Their results are summarized in this infographic, from their article.

Basically, the medical results were as before:
Across 12 of the 14 books, 123.5 days were described, though Bond was unable to consume alcohol for 36 days because of external pressures (admission to hospital, incarceration, rehabilitation). During this time he was documented as consuming 1150.15 units of alcohol. Taking into account days when he was unable to drink, his average alcohol consumption was 92 units a week (1150 units over 87.5 days). Inclusion of the days incarcerated brings his consumption down to 65.2 units a week. His maximum daily consumption was 49.8 units (From Russia with Love day 3). He had 12.5 alcohol free days out of the 87.5 days on which he was able to drink.
Furthermore, when we plotted Bond's alcohol consumption over time, his intake dropped in the middle of his career but gradually increased towards the end. This consistent but variable lifetime drinking pattern has been reported in patients with alcoholic liver disease.
UK NHS [National Health Service] recommendations for alcohol consumption state that an adult male should drink no more than 21 units a week, with no more than 4 units on any one day, and at least two alcohol free days a week. James Bond's drinking habits are well in excess of each of these three parameters. This level of consumption makes him a category 3 drinker (>60 g alcohol / day) and therefore in the highest risk group for malignancies, depression, hypertension, and cirrhosis. He is also at high risk of suffering from sexual dysfunction, which would considerably affect his womanising.
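The arithmetic in the quoted passage can be checked directly. The following is a minimal sketch, not the authors' own calculation. It assumes the UK convention that one unit corresponds to 8 g of pure alcohol (a figure not stated in the quote), and all variable names are hypothetical.

```python
# Figures taken from the quoted passage.
TOTAL_UNITS = 1150.15      # units of alcohol documented in the books
DAYS_DESCRIBED = 123.5     # days described across the books
DAYS_UNABLE = 36           # days on which Bond could not drink

# Assumption (not in the quote): one UK unit is 8 g of pure alcohol.
GRAMS_PER_UNIT = 8

days_able = DAYS_DESCRIBED - DAYS_UNABLE             # 87.5 days
weekly_when_able = TOTAL_UNITS / days_able * 7       # about 92 units a week
weekly_overall = TOTAL_UNITS / DAYS_DESCRIBED * 7    # about 65.2 units a week
grams_per_day = TOTAL_UNITS / days_able * GRAMS_PER_UNIT

# Category 3 is defined in the quote as more than 60 g of alcohol per day.
is_category_3 = grams_per_day > 60

print(f"{weekly_when_able:.1f} units/week on days he could drink")
print(f"{weekly_overall:.1f} units/week over all days described")
print(f"{grams_per_day:.0f} g/day, category 3: {is_category_3}")
```

Under the 8 g assumption, the daily intake comes out well above the 60 g threshold, which is consistent with the classification given in the quote.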
Analyzing the films is more difficult. A number of people have tackled this task, including Nerdist, The Grocer, Atomic Martinis (now defunct, but repeated on the website of the world's only James Bond Museum, in Sweden), and David Leigh. The basic problem seems to be whether the alcohol is "spotted either in hand, glass or in the background". Also, "The major problem is 007’s frequent enjoyment of multiple bottles of champagne, or portions of bottles of liquor ... it is often impossible to determine exactly how many separate drinks came from a given bottle."

The following infographic (not including the 2015 movie or the unofficial films) is derived from one produced at Buddy Loans. However, some of the people at Reddit were not happy with the original, so it was redesigned, as shown here.

The people at Nerdist took the data from this film infographic, converted it from units of alcohol to grams of alcohol, and then used this to estimate Bond’s total alcohol content. This yields a Blood Alcohol Content of 3.7%. "While some humans have survived a BAC of past 1%, it generally holds that anything past 0.5% will either kill you or leave you seriously poisoned. Therefore ... Bond’s tipsy tally is enough to put a man past a safe limit seven times over."

At The Grocer, they have also pointed out the relative booziness of the various Bond incarnations, by calculating the average intake per film by each actor, in units of alcohol:
Sean Connery
George Lazenby
Roger Moore
Timothy Dalton
Pierce Brosnan
Daniel Craig
Finally, we need a phylogenetic network, of course. I collated the presence/absence of each drink type for each book and movie (excluding the 2015 film) from the book by David Leigh (2012. The Complete Guide to the Drinks of James Bond, 2nd edition. Kindle), and then updated this where it clearly disagrees with other sources. (For example, no mention is made of sherry, and yet it is involved in one of the most popular Bond scenes from the film version of Diamonds are Forever.) I then analyzed the data using a NeighborNet. (James Bond Memes has tried an ordination analysis of the same data source.)
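The data preparation step can be sketched as follows. This is only an illustration: the works and drink types below are invented placeholder values, not the data collated from Leigh's book, and the use of a Hamming-style distance is an assumption, since the distance measure used for the analysis is not stated here. A pairwise distance matrix of this kind is the sort of input that NeighborNet implementations work from.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) table: one row per work,
# one column per drink type. These values are placeholders only.
DRINKS = ["champagne", "vodka martini", "bourbon", "sake"]
PRESENCE = {
    "Work A": [1, 1, 0, 0],
    "Work B": [1, 1, 1, 0],
    "Work C": [1, 0, 0, 1],
}


def hamming_distance(a, b):
    """Proportion of drink types on which two works disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(table):
    """Symmetric pairwise distances, keyed by (work, work) tuples."""
    dist = {(name, name): 0.0 for name in table}
    for a, b in combinations(table, 2):
        d = hamming_distance(table[a], table[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


matrix = distance_matrix(PRESENCE)
for a, b in combinations(PRESENCE, 2):
    print(a, b, matrix[(a, b)])
```

Other distances for binary data (for example, ones that ignore shared absences) would be equally defensible choices.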

The books are shown in red, and the early films starring Connery and Lazenby are shown in blue (including Connery's later Never Say Never Again). These books and films are almost all at the top and right of the network, indicating that they have a distinct collection of drink types compared to the later films. I suspect that this reflects increasing use of "product placements" in the films. The only book plus movie combination that has similar drinks is You Only Live Twice. Interestingly, the Skyfall movie (from 2012) seems to return to the drinks genre of the earlier works, even though the alcohol consumption is much higher. The most unusual works were the Goldfinger and On Her Majesty's Secret Service books, where a number of drink styles were consumed that appeared nowhere else in the canon.

As noted by Johnson et al. (quoted above):
Despite his alcohol consumption, [Bond] is still described as being able to carry out highly complicated tasks and function at an extraordinarily high level. This is likely to be pure fiction.

Tuesday, December 20, 2016

Isogloss maps are hypergraphs are bipartite networks

Linguists are a very special people. They are very proud, especially when biologists tell them how to do phylogenetic analyses; but their pride is often also justified, as many phylogenetic concepts were initially or independently developed by linguists, be it the family tree model, proposed years before Darwin's (1859) tree by Čelakovský (1853), or even the cladistic principle of synapomorphies, which are called "exclusively shared innovations" in linguistics (see Brugmann 1884).

Linguists also invented one interesting kind of data-display which so far has never been used by biologists (at least as far as I know): maps of isogloss boundaries. The term "isogloss" is an unfortunate term, as it has multiple usages in linguistics, and its history seems to go back to a naive borrowing from chemistry (but I have not really followed the literature here). On most occasions, it just means "shared trait". That is, it denotes a feature shared between two or more languages; and given that languages may share many different features, isoglosses for a group of related languages may yield a very complex type of data. Isoglosses are somehow related to the wave theory, the arch-enemy of the family tree in linguistics, which I described as a mystical theory some time ago, since it never really made it to a clear-cut model that could be formalized (The Wave Theory: the predecessor of network thinking in historical linguistics).

Some linguists, nevertheless, insist that the waves that are the core of the wave theory are nothing other than isoglosses. More specifically, the waves represent innovations that contribute to the separation of languages (a change in pronunciation of a word here, a change in grammar there), but which are not transmitted vertically — they spread across the speakers of a language and may even cross linguistic borders. One early visualization of these waves can be found in Bloomfield (1933), as shown here:

What Bloomfield essentially does here is pick certain traits of Indo-European languages, calling them isoglosses, and arrange them on a quasi-geographic map of Indo-European languages in such a way that all languages sharing a trait are inside one of these isogloss boundaries.

Only recently did I realise what this actually means, when I found the "Bible of Network Theory" by Newman (2010) and started reading at a random page, which, as it turned out, treated hypergraphs. Hypergraphs, as I learned from Newman, are graphs in which one edge can connect more than two nodes, and Newman used exactly the same visualization for these hyperedges as Bloomfield had done in 1933. Bloomfield, of course, could not have known that he was proposing a rather complex network structure.

Even more interesting than the complex graph structure is that hypergraphs can be likewise displayed as bipartite networks, in which we distinguish two fundamental kinds of nodes, and in which connections are only allowed between nodes of different kinds, without losing any information. In order to do so, one just converts all hyperedges into a node that connects to all nodes (languages in our case) to which the edges connect in the hypergraph. In the same way that Bloomfield labeled the hyperedges in his legend, we can label the isogloss nodes that connect to the languages. The following image shows the resulting bipartite network for Bloomfield's hypergraph:

If you now ask what this tells us after all, I will disappoint you — so far it does not tell us anything, it is just a display of data in a different fashion. Note, however, that hypergraph visualization is not a trivial problem, and if you have enclaves not sharing a trait, it may even be impossible to visualize hypergraphs in a two-dimensional space by just using one line that connects to all nodes. Bipartite networks are easier to handle in this regard. Even more importantly, however, bipartite graphs are also easy to handle algorithmically, and biologists are currently developing new methods to handle them (Corel et al. 2016).
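The conversion from a hypergraph to a bipartite network is simple enough to sketch in a few lines. The example below is an illustration under stated assumptions: it uses only two isoglosses, with the memberships as they are described in this post, rather than Bloomfield's full data set, and the function names are hypothetical.

```python
# A hypergraph as a mapping from each hyperedge (isogloss) to the set of
# nodes (language groups) it encloses. Only two isoglosses are included,
# with memberships as described in this post.
HYPERGRAPH = {
    "sibilants for velars": {"Balto-Slavic", "Indo-Iranian", "Armenian", "Albanian"},
    "past e-": {"Greek", "Indo-Iranian"},
}


def to_bipartite(hypergraph):
    """Turn every hyperedge into a node of its own, linked to each member."""
    return sorted(
        (isogloss, language)
        for isogloss, languages in hypergraph.items()
        for language in languages
    )


def to_hypergraph(edges):
    """Invert the conversion: collect the languages linked to each isogloss."""
    hypergraph = {}
    for isogloss, language in edges:
        hypergraph.setdefault(isogloss, set()).add(language)
    return hypergraph


edges = to_bipartite(HYPERGRAPH)
for isogloss, language in edges:
    print(f"{isogloss} -- {language}")

# The round trip recovers the original hypergraph, so nothing is lost.
print(to_hypergraph(edges) == HYPERGRAPH)
```

Every edge in the result joins an isogloss node to a language node and never two nodes of the same kind, which is what makes the network bipartite. A language that takes part in several isoglosses, such as Indo-Iranian here, simply ends up with several edges.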

If we visualize the Bloomfield data in a bipartite network using network visualization software such as Cytoscape, we can conveniently explore the data, and arrange the nodes in order to search for patterns in the isoglosses. The following visualization, for example, shows that Bloomfield chose the data well in order to illustrate the amount of conflicting, apparently non-tree-like, signal in Indo-European languages (remember that linguists tend to dislike trees, but not necessarily in a productive way), as the data describes more of a circular structure than a strict hierarchy.

In order to really interpret this kind of data, however, we should not forget that this is still a data-display network. It is by no means a phylogenetic analysis, as we only show how a certain amount of data selected by a scholar is distributed over the given language groups. A true phylogenetic analysis will need to interpret these data, making bold claims about the history of those shared traits.

The existence of sibilants (s-like sounds, like [s, z, ʃ, ʒ]) for certain velar sounds (k-like sounds, like [k, g, x]), for example, is a trait shared by Balto-Slavic, Indo-Iranian, Armenian, and Albanian, but this does not mean that they all inherited it from a common ancestor, as the process of palatalization, by which velar sounds turn into affricates and fricatives (compare French cent, which was pronounced kentum in Latin), is very frequent in the languages of the world, and may well reflect independent evolution.

Apart from independent development, which would actually force us to revise our network by deleting the respective edges (because the traits are not homologous in the strict sense), we may also have to deal with differential loss. This quite likely happened with the shared feature labeled as "past e-" in the network, referring to the past tense in Ancient Greek and Indo-Iranian, which was augmented by the prefix e-.

A further reason for those commonalities labelled as isoglosses by linguists may also be simple lateral transfer due to language contact.

Proponents of the wave theory have taken this kind of data as proof that the family tree model is essentially wrong. While I would agree that the family tree model shows only a certain aspect of language evolution, and may therefore be boring at times (and even wrong, if we do not manage to correctly interpret the nature of shared traits), I have a hard time understanding why linguists still insist that isogloss maps are an alternative model of language evolution. They are surely not, in the same way in which splits graphs are not phylogenetic networks, as David emphasized in a recent blogpost.

Unless we add the missing time dimension and analyse how the shared traits originated, isogloss maps and hypergraphs will remain nothing more than an interesting form of data visualization. Given the recent research on bipartite networks, however, we may have some hope that the mysterious waves in historical linguistics may not only find a formal model of representation, but even bring us to the point where we gain new insights into the history of our languages.

  • Bloomfield, L. (1973 [1933]) Language. Allen & Unwin: London.
  • Brugmann, K. (1884) Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [Questions regarding the closer relationship of the Indo-European languages]. Internationale Zeitschrift für allgemeine Sprachwissenschaft 1. 228-256.
  • Čelakovský, F. (1853) Čtení o srovnavací mluvnici slovanské [Lectures on comparative grammar of Slavic]. V komisí u F. Řivnáče: Prague.
  • Corel, E., P. Lopez, R. Méheust, and E. Bapteste (2016) Network-thinking: graphs to analyze microbial complexity and evolution. Trends Microbiol. 24.3: 224-237.
  • Darwin, C. (1859) On the origin of species by means of natural selection, or, the preservation of favoured races in the struggle for life. John Murray: London.
  • Newman, M. (2010) Networks. An Introduction. Oxford University Press: Oxford.

Tuesday, December 13, 2016

Motivations for producing the earliest pedigrees

The stemmata in ancient Roman houses (depicting portraits of ancestors) were used to assert the nobility of the nobles by right of family descent — stemmata distinguished between the patrician class (those with noble ancestry) and plebeians (commoners). It is therefore unsurprising that the Medieval nobility subsequently started to produce diagrams, as their way of illustrating their own succession in unambiguous terms (although it was not until much later that genealogies became common).

For example, as discussed in my post on The first royal pedigree, the earliest known illustration of a family tree is from c.1000 CE (see Schmid 1994), in which Cunigunde of Luxembourg's ancestry is traced in a tree-like manner to include the emperor Charlemagne (Charles the Great), thus legitimizing her claim to being of royal descent — she married Henry, Duke of Bavaria, in 999 CE, and he became King Henry II of Germany in 1002, at which point she became Queen consort of Germany (1002-1024).

However, pedigrees were also produced for the opposite purpose — to try to prevent marriages, for example on the basis that they violated church law. The earliest known such case involved the marriage, in 1043 CE, of King Henry III of Germany (1016-1056, later Emperor Heinrich of the Holy Roman Empire) to Agnès of Poitou (1025-1077).

Heinrich was briefly (1036-1038) married to Gunhilda of Denmark. After her death, for political reasons he wanted to remarry with someone from France. He chose the young daughter of Duke William V of Aquitaine. She thus became Queen consort of Germany (1043-1056) and then Empress consort of the Holy Roman Empire (1046-1056); from 1056-1061 she acted as regent of the Holy Roman Empire during the minority of her son Henry IV.

The official basis for objecting to this marriage was that the bride's and groom's maternal great-grandmothers were half-sisters, so that Henry and Agnes were third cousins. Moreover, on Henry's father's side they were also fourth cousins once removed. This is illustrated in the following genealogy from Michel Parisse (2004).

Note that Henry III appears twice, once as the son of his father and once as the son of his mother, thus simplifying the network to a tree; this is a point that I have commented on before.

The person formally objecting to this marriage was Siegfried of Gorze, who researched the family history and drew the first version of the pedigree. As discussed by Bouchard (2001):
Abbot Siegfried of the reformed monastery at Gorze wrote very shortly before [the marriage] to his friend Abbot Poppo of Stablo [or Stavelot], who possessed the confidence and respect of Henry, urging him even at the eleventh hour, and at risk of a possible loss of the king's favor, to do all that he possibly could to prevent it. Neither Poppo, nor Bishop Bruno of Toul (later Pope Leo IX), to whom Siegfried addresses still more severe reproaches, nor Henry himself, paid much heed to these representations.
Henry apparently rebutted Siegfried's claim by (falsely) claiming that the pedigree was at fault (i.e. the great-grandmothers were not half-sisters). Nevertheless, various published versions of Siegfried's pedigree continued to appear over the subsequent 500 years (see Gädeke 1992). You can read Siegfried's original Latin letters (without the accompanying family tree) in the paper by Michel Parisse (2004). Jean-Baptiste Piggin has a transcription of the genealogy, taken from an early 11th century book (see the blog post: Two medieval drawings).

Part of the issue here is the change in the church laws relating to consanguinity (the degrees of relationship within which marriage was uncanonical), which had occurred during the first half of the ninth century. At that time, the number of forbidden degrees was increased from four to seven, and the method of calculating those degrees was also changed. These two changes are illustrated here (from Bumke 1991).

So, the church councils held at Rome (during the first half of the eighth century) forbade marriage only between: siblings; parents and offspring; grandparents with grandchildren; a man and his niece (but not a woman and her nephew!); and first cousins. However, the canonical changes during the subsequent century forbade everything out to sixth cousin. The reasoning behind these extreme changes is not fully understood.
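The two counting methods are not spelled out above. As they are usually described, the older (Roman) method adds up the generations from each person to the common ancestor, while the newer (canonical) method counts only the longer of the two lines. The sketch below assumes those definitions, and that a marriage was forbidden when its degree was at or within the stated limit; the function names are hypothetical.

```python
def roman_degree(gen_a, gen_b):
    """Older method (assumed): sum of generations up to the common ancestor."""
    return gen_a + gen_b


def canonical_degree(gen_a, gen_b):
    """Newer method (assumed): generations along the longer line only."""
    return max(gen_a, gen_b)


def cousin_generations(n, removed=0):
    """Generations to the common ancestor for nth cousins, `removed` times removed."""
    return n + 1, n + 1 + removed


def forbidden(gen_a, gen_b, method, limit):
    """True if the relationship falls at or within the forbidden degree."""
    return method(gen_a, gen_b) <= limit


first = cousin_generations(1)    # (2, 2)
second = cousin_generations(2)   # (3, 3)
third = cousin_generations(3)    # (4, 4), e.g. Henry and Agnes
sixth = cousin_generations(6)    # (7, 7)

# Older rule: four degrees, counted the Roman way.
print(forbidden(*first, roman_degree, 4))       # first cousins
print(forbidden(*second, roman_degree, 4))      # second cousins
# Newer rule: seven degrees, counted the canonical way.
print(forbidden(*third, canonical_degree, 7))   # third cousins
print(forbidden(*sixth, canonical_degree, 7))   # sixth cousins
```

With these assumptions the older rule stops at first cousins, and the newer rule reaches sixth cousins, matching the summary above. The sketch treats both sexes alike, so it does not capture the asymmetry between niece and nephew in the older rules.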

Needless to say, these new laws of consanguinity created an impossible situation when, as Bumke (1991) puts it:
in the course of the tenth and the first half of the eleventh century a small number of royal and princely families, already connected by marriage ties in the past, emerged and ruled most of western and central Europe.
Under the new rules, it would not take long for a restricted group of people to become too closely related to inter-marry at all — royalty could not marry royalty. So, Henry set a precedent for his kin when he managed to bypass the new rules, which the aristocracy were likely to ignore anyway. These rules remained in force until 1215 (the Fourth Lateran Council), when the degrees were reduced again to four, but still counted in the "new" way.

As a final note, this sort of religious interference was not always unsuccessful. For example, in the early 1100s Henry I of England suggested marrying one of his (illegitimate) daughters to William de Warenne (2nd Earl of Surrey), but was dissuaded by Archbishop Anselm of Canterbury, who pointed out the prohibited degrees involved. Shortly afterwards, Bishop Ivo of Chartres successfully intervened in the proposed marriage of another of Henry's (illegitimate) daughters to Hugh fitz Gervaise of Châteauneuf-en-Thymerais.


Constance Brittain Bouchard (2001) Those of My Blood: Constructing Noble Families in Medieval Francia. University of Pennsylvania Press, Philadelphia.

Joachim Bumke (1991) Courtly Culture: Literature and Society in the High Middle Ages. University of California Press, Berkeley.

Nora Gädeke (1992) Zeugnisse bildlicher Darstellung der Nachkommenschaft Heinrichs I. Arbeiten zur Frühmittelalterforschung 22. De Gruyter, Berlin.

Michel Parisse (2004) Sigefroid, abbé de Gorze, et le mariage du roi Henri III avec Agnès de Poitou (1043). Un aspect de la réforme lotharingienne. Revue du Nord 356-357: 543-566.

Karl Schmid (1994) Ein verlorenes Stemma Regum Franciae. Zugleich ein Beitrag zur Entstehung und Funktion karolingischer (Bild-)Genealogien in salisch-staufischer Zeit. Frühmittelalterliche Studien 28: 196-225.

Tuesday, December 6, 2016

Why are splits graphs still called phylogenetic networks?

This is an issue that has long concerned me, and which I think causes a lot of confusion among biologists. A phylogenetic tree is usually a clear concept — to a biologist, it is a diagram that displays a hypothesis of evolutionary history. The expectation, then, is that a phylogenetic network does the same thing for reticulate evolutionary histories. However, this is not true of splits graphs; and so there is potential confusion.

Mathematically, of course, a phylogenetic tree is a directed acyclic line graph. It is usually constructed, in practice, by first producing an undirected graph based on some pattern-analysis procedure, and then nominating one of the nodes or edges as the root (say, by specifying an outgroup). So, the mathematics is not really connected to the biological interpretation. To a mathematician, the tree is a set of nodes connected by directed edges, and the nodes could represent anything at all, as could the edges. It is the biologist who artificially imposes the idea that the nodes represent real historical organisms connected by the flow of evolution — ancestors connected to descendants by evolutionary events.

A phylogenetic network should logically be a generalization of this idea of a phylogenetic tree, adding the possibility of evolutionary relationships due to gene flow, in addition to the ancestor-descendant relationships. This can be done, but it is only partly done by splits graphs.

That is, a splits graph generalizes the idea of an undirected line graph (an unrooted tree), but not a directed acyclic graph (a rooted tree). It follows the same logic of using a pattern-analysis procedure to produce an undirected graph, although the graph can have reticulations, and thus is a network rather than necessarily being a bifurcating tree. However, it is not straightforward to specify a root in a way that will turn this into an acyclic graph. So, in general it does not represent a phylogeny.
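The difference between rooting a tree and rooting a network can be illustrated with a small sketch. The two graphs below are invented toy examples (not real splits graphs), and the function name is hypothetical. The function directs every edge away from a chosen root by breadth-first search, and reports any edge whose direction the root does not determine.

```python
from collections import deque


def orient_from_root(adjacency, root):
    """Direct edges away from `root`; return (directed, undetermined) edges."""
    parent = {root: None}
    directed, undetermined = [], []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency[node]):
            if neighbour == parent[node]:
                continue  # the edge we arrived by
            if neighbour in parent:
                # Already reached by another path: this edge closes a cycle.
                edge = tuple(sorted((node, neighbour)))
                if edge not in undetermined:
                    undetermined.append(edge)
                continue
            parent[neighbour] = node
            directed.append((node, neighbour))
            queue.append(neighbour)
    return directed, undetermined


# Toy unrooted tree: internal nodes x and y, leaves O (outgroup), A, B, C.
tree = {
    "O": {"x"}, "A": {"x"}, "x": {"O", "A", "y"},
    "y": {"x", "B", "C"}, "B": {"y"}, "C": {"y"},
}
# Toy network: a single box (a four-node cycle).
box = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}

tree_directed, tree_left = orient_from_root(tree, "x")
box_directed, box_left = orient_from_root(box, "a")
print(tree_left)  # nothing left over: the root fixes every direction
print(box_left)   # one edge whose direction the root does not fix
```

For the tree, choosing a root determines the direction of every edge. For the box, one edge is left over: it could be directed either way without creating a directed cycle, but nothing in the graph itself says which way is correct.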

Indeed, splits graphs are simply one form of multivariate pattern analysis, along with clustering and ordination techniques, which are familiar as data-display methods in phenetics (see Morrison D.A. 2014. Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296-312). In this sense, it makes no difference whatsoever what the data represent — they can be data used for phylogenetics, or they could be any other form of multivariate data. Indeed, this point is illustrated in many of the posts in this blog, which can be accessed in the Analyses page.

So, unlike unrooted trees, unrooted splits graphs are not a route to producing a phylogenetic diagram. Mind you, they are a very useful form of multivariate data analysis in their own right, and I value them highly as a form of exploratory data analysis. But that doesn't make them phylogenetic networks in the biological sense.

So, isn't it about time we stopped calling splits graphs "phylogenetic networks"? They aren't, to a biologist, so why call them that?

Tuesday, November 29, 2016

The origin of an idea: reducing networks to trees

I have written a number of times in this blog about the strong tendency for people to present reticulating evolutionary relationships as trees rather than as networks. This involves them somehow reducing complex networks to bifurcating trees.

When referring to a "family tree", the most common way to reduce a network to a tree is simply to repeat people's names as often as necessary. That is, rather than have them appear once (representing their birth) with multiple reticulating connections representing their reproductive relationships, they appear repeatedly, once for their birth and once for each relationship, so that there are no reticulations. I presented a number of online examples of this process in my posts on Reducing networks to trees and on Thoroughbred horses and reticulate pedigrees.

Recently, Jean-Baptiste Piggin has pointed out that this approach has a very long history, dating back to what seems to be the first pictorial representation of a genealogy.

In an earlier post (The first infographic was a genealogy) I described Piggin's work on what he calls the Great Stemma, a diagram from c. 400 CE (Late Antiquity) representing the genealogy of Jesus as presented in the New Testament. In a recent update, Piggin reports:
The Great Stemma contains 13 doppelganger or fetches, that is to say, simultaneous appearances of the same person in two places, e.g. Hezron [as a] child, and separately as an ancestor of Jesus. This graphic method simplifies the layout, but forced the Late Antiquity reader to mentally register these virtual "hyperlinks".
If you view his diagram of the Great Stemma (Touring the Reconstruction), you can see on an overlay a set of links connecting the multiple appearances of the following people:
Athaliah, Gershon, Hezron, Judah, Kohath, Leah, Levi, Mahalath, Merari, Perez, Rachel, Rebekah, and Timna.

This repetition simplifies what is a rather complex diagram, which actually shows a network of family relationships. There is still one reticulation in the diagram, however, because it depicts Jesus' ancestry as described in the New Testament by both Matthew (labeled Filum C in the schematic below) and Luke (labeled Filum D), and these differ regarding the descendants of David (but not his ancestors).

The diagram contains more than just a genealogy (represented by Filum A-D), as it also displays other references from the Bible (indicated in yellow). Piggin is still working on his reconstruction (there are no known copies of the original, only later hand copies), and he continues to make discoveries.

Of especial interest in the genealogies is that Piggin now reconstructs the Great Stemma as having a strictly grid-like arrangement of the people, as discussed in his blog post Secret of the oldest infographic revealed: a grid. The placements of the lineages in the Stemma, and the connections between the people, are not always obvious to modern eyes (see my post on How confusing were the first written genealogies?), since we are used to the modern version of a "family tree"; it took another millennium after the Stemma to settle on the modern version. However, the use of a regular grid-like arrangement in the Stemma seems surprisingly modern by comparison.

Unfortunately, this arrangement seems to have become corrupted in the subsequent hand-made copies, suggesting that the scribes did not always appreciate the grid's organizational importance.

Tuesday, November 22, 2016

Once more on artificial intelligence and machine learning

In an earlier blog post, I expressed my scepticism regarding the scientific value of non-transparent machine learning approaches, which only provide a result but no transparent explanation of how they arrive at their conclusion. I am aware that I run the risk of giving the impression of abusing this blog for my own agenda, against artificial intelligence and machine learning approaches in the historical sciences, by bringing the problem up again. However, a recent post in Nature News (Castelvecchi 2016) further substantiates my original scepticism, providing some interesting new perspectives on the scientific and the practical consequences, so I could not resist mentioning it in my post for this month.

Deep learning approaches in research on artificial intelligence and machine learning go back to the 1950s, and have now become so successful that they are starting to play an increasingly important role in our daily lives, be it that they are used to recommend to us yet another book that somebody has bought along with the book we just want to buy, or that they allow us to take a little nap while driving fancy electronic cars and saving carbon footprints for our next round-the-world trip. The same holds, of course, also for science, and in particular for biology, where neural networks have been used for tasks like homolog detection (Bengio et al. 1990) or protein classification (Leslie et al. 2004). This is true even more for linguistics, where a complete subfield, usually called natural language processing, has emerged (see Hladka and Holub 2015 for an overview), in which algorithms are trained for various tasks related to language, ranging from word segmentation in Chinese texts (Cai and Zhao 2016) to the general task of morpheme detection, which seeks to find the smallest meaningful units in human languages (King 2016).

In the post by Castelvecchi, I found two aspects that triggered my interest. Firstly, the author emphasizes that answers that can be easily and often accurately produced by machine learning approaches do not automatically provide real insights, quoting Vincenzo Innocente, a physicist at CERN, saying:
As a scientist ... I am not satisfied with just distinguishing cats from dogs. A scientist wants to be able to say: "the difference is such and such." (Vincenzo Innocente, quoted by Castelvecchi 2016: 22)
This expresses precisely (and much more transparently) what I tried to emphasize in the former blog post, namely, that science is primarily concerned with the questions why? and how?, and only peripherally with the question what?

The other interesting aspect is that these apparently powerful approaches can, in fact, be easily fooled. Given that they are trained on certain data, and that the trainers usually do not know which aspects of the training data effectively trigger a given classification, one can in turn use algorithms to generate data that will fool an application, forcing it to give false responses. Castelvecchi mentions an experiment by Mahendran and Vedaldi (2015) which illustrates how "a network might see wiggly lines and classify them as a starfish, or mistake black-and-yellow stripes for a school bus" (Castelvecchi 2016: 23).

Putting aside the obvious consequences that arise from abusing the neural networks that are used in our daily lives, this problem is surely not unknown to us as human beings. We can likewise be easily deceived by our expectations, be it in daily life or in science. This, finally, brings us back to networks and trees, as we all know how difficult it is at times to see the forest behind the tree that our software gives us, or the tree inside the forest of incompletely sorted lineages.

  • Bengio, Y., S. Bengio, Y. Pouliot, and P. Agin (1990): A neural network to detect homologies in proteins. In: Touretzky, D. (ed.) Advances in Neural Information Processing Systems 2. Morgan-Kaufmann, pp. 423-430.
  • Cai, D. and H. Zhao (2016) Neural word segmentation learning for Chinese. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 409-420.
  • Castelvecchi, D. (2016): Can we open the black box of AI? Nature 538: 20-23.
  • Hladka, B. and M. Holub (2015) A gentle introduction to machine learning for natural language processing: how to start in 16 practical steps. Lang. Linguist. Compass 9.2: 55-76.
  • King, D. (2016) Evaluating sequence alignment for learning inflectional morphology. In: Proceedings of the 14th Annual SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 49–53.
  • Leslie, C., E. Eskin, A. Cohen, J. Weston, and W. Noble (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20.4: 467-476.
  • Mahendran, A. and A. Vedaldi (2015) Understanding deep image representations by inverting them. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp. 5188-5196.

Tuesday, November 15, 2016

Grape harvest dates as proxies for global warming

Phenological patterns are often highly correlated with temperatures. As noted by Chuine et al. (2004):
Biological and documentary proxy records have been widely used to reconstruct temperature variations to assess the exceptional character of recent climate fluctuations. Grape-harvest dates, which are tightly related to temperature, have been recorded locally for centuries in many European countries. These dates may therefore provide one of the longest uninterrupted series of regional temperature anomalies (highs and lows).
Harvest dates of grapes in western Europe (used for wine-making) are of especial interest because they constitute long phenological records, as a result of the fact that the harvest dates are usually officially decreed, based on the ripeness of the grapes. In other words, we have historical records for many locations over many years.

Daux et al. (2012) have compiled many of these records into a publicly accessible database archived at the World Data Center for Paleoclimatology.

This database comprises time series for 380 locations, mainly from France (93% of the data) as well as from Germany, Switzerland, Italy, Spain and Luxembourg. The series have variable lengths up to 479 years, with the oldest harvest date being for 1354 CE in Burgundy. The series are grouped into 27 regions "according to their location, to geomorphological and geological criteria, and to past and present grape varieties." These regions are shown in the map.

Normally, such data would simply be graphed as a time series for each region. However, as usual in this blog, we can examine these data using a phylogenetic network, to perform an exploratory data analysis. The difficulty is that most of the data are actually "missing", because most of the time series have time gaps or cover only short periods. So, to create a more complete dataset I have extracted the data for the years 1800-1880, inclusive, because for this period 17 of the regions have a mostly complete series.

Two of the time series are shown in the first graph. This shows that the two time series are highly correlated, as are most of them. In this case, the correlation coefficient is 0.87.

I then used the Gower distance to calculate the similarity of the different years and regions, based on the harvest dates (the Gower measure is needed in order to deal with the fact that some of the data are still missing). This was followed by a neighbor-net analysis to display the between-region and the between-year similarities as two phylogenetic networks.
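To make the role of the Gower measure concrete, here is a minimal sketch of how it can handle missing values when all variables are numeric: each year contributes the absolute difference scaled by that year's observed range, and years where either region has no record are simply skipped. The region names and harvest dates below are hypothetical, and this is only an illustration of the general idea, not the software or settings used for the analysis in this post.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing harvest date


def column_ranges(rows: Sequence[Row]) -> List[float]:
    """Observed range (max - min) of each variable, ignoring missing values."""
    ranges = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        ranges.append(max(seen) - min(seen) if seen else 0.0)
    return ranges


def gower_distance(a: Row, b: Row, ranges: Sequence[float]) -> Optional[float]:
    """Mean range-scaled absolute difference over variables present in both rows."""
    total, used = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None or r == 0:
            continue  # this variable cannot be compared for this pair
        total += abs(x - y) / r
        used += 1
    return total / used if used else None  # None if nothing is comparable


# Hypothetical harvest dates (day of year) for three regions over four years.
regions = {
    "A": [270, 265, None, 280],
    "B": [272, 268, 275, None],
    "C": [260, None, 262, 270],
}
rng = column_ranges(list(regions.values()))
d_ab = gower_distance(regions["A"], regions["B"], rng)
d_ac = gower_distance(regions["A"], regions["C"], rng)
```

In this toy example, regions A and B can be compared on only two of the four years, and the distance is averaged over those two alone, so pairs with many gaps rest on less evidence than pairs with complete series.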

Only the first network is shown here. Regions that are closely connected in the network are similar to each other based on the variation in their harvest dates through time, and those that are further apart are progressively more different from each other.

Many of the patterns here are to be expected, based on the geographical proximities of the regions, but some are not. For example, Ile de France, Champagne and Vendée - Poitou Charente are all in northern France (see the map) while Bordeaux is in the south-west, and the Rhone Valley regions are in the south-east. As Le Roy Ladurie & Baulant (1980) have noted, the vineyards of northern and central France are in a different climatic zone from the wine regions of southern France (to the south of the Geneva parallel) and those of western France (west of the Chateau-du-Loire meridian).

Similarly, at the other end of the network, the Lower Loire region is not geographically located near any of the associated regions in the network. Possibly the most unexpected pattern, then, is the network separation of the Upper and Lower regions of the Loire Valley, which are the two regions whose time series are graphed above.

Clearly, the network is displaying only quite small differences between the time series. That is, the time patterns are very consistent across the regions, which does indeed make them useful for studying past temperature patterns.


Isabel Chuine, Pascal Yiou, Nicolas Viovy, Bernard Seguin, Valérie Daux, Emmanuel Le Roy Ladurie (2004) Grape ripening as a past climate indicator. Nature 432: 289-290.

V. Daux, I. Garcia de Cortazar-Atauri, P. Yiou, I. Chuine, E. Garnier, E. Le Roy Ladurie, O. Mestre, J. Tardaguila (2012) An open-database of grape harvest dates for climate research: data description and quality assessment. Climate of the Past 8: 1403-1418.

Emmanuel Le Roy Ladurie and Micheline Baulant (1980) Grape harvests from the fifteenth through the nineteenth centuries. Journal of Interdisciplinary History 10: 839-849.

Tuesday, November 8, 2016

Drawing family trees as trees

In a previous blog post (Who first drew a family tree as a tree?), I pointed out that one of the candidates for drawing the first family tree as a tree (as opposed to a stick diagram) is Giovanni Boccaccio, in his Genealogia Deorum Gentilium (On the Genealogy of the Gods of the Gentiles) of 1370 CE.

However, there are arguments against this attribution. For example, Boccaccio's original pedigree was: (1) not about real people; (2) more like a vine than a tree; and (3) not rooted at the bottom. The first version of his pedigree that was actually tree-like and rooted at the bottom was in the Italian translation from 1547 CE (and again in the 1554 edition).

Recently, Jean-Baptiste Piggin has indicated in his blog that he is looking for the Oldest family tree. He writes:
What I am looking for here is the earliest example of a thing named "family tree" or "albero genealogico" or "Stammbaum" or "arbre de famille" ... these things had unwitting precursors in previous centuries. There were even 12th-century artists who took pre-existing stemmata and flipped them upside down to depict them as trees. But these were experiments or flukes, not genealogical trees as a general cultural phenomenon.
The conscious idea of presenting a complete family line connected by a woody trunk first shows up in southern German woodcuts in the late 15th century ... The tree as a recognizable category of art, a product where artist and customer know what to expect, only shows up later in the sixteenth century. It looks semi-natural, has a bottom root and clearly tiered generations.
In his blog post Piggin mentions various attempts (at drawing pedigrees) between their first known appearance in c. 1000 CE (see The first royal pedigree) and the late 1500s, when Scipione Ammirato (an Italian writer and historian) set up a cottage industry producing family trees for the nobility.

Highlights of the history of tree-like pedigree diagrams, as currently known, include (with links to copies of the diagrams):

1370 Boccaccio – first pedigree drawn as a vine, with the root at the top
1475 Rodericus (Der Spiegel des Menschlichen Lebens) – multiple intertwining vines
1492 Conrad Bote (Cronecken der Sassen) – first tree, using family shields in place of names
1515 Albrecht Dürer (Ehrenpforte, engraving) – unbranched woody vine
1536 Robert Peril (Family Tree of the House of Habsburg, engraving) – tree, with people along the trunk only, not on the branches
1547 Boccaccio – first version of his pedigree drawn as a tree
1576 Scipione Ammirato – first of his trees, with people along the trunk as well as the branches. Ammirato's first tree is shown above.

The 12th century pedigree that Piggin refers to, and dismisses as a candidate for a real tree, is discussed in his blog post on the Erlangen tree. This pedigree is from one of the copies of the Ekkehardi Chronicon Universale (Chronicle of Ekkehard of Aura, or Chronicle of Frutolf), drawn in c. 1140. The pedigree itself is based on the one shown in my post on The first royal pedigree, except that Cunigunde of Luxembourg (the focus of that earlier pedigree) is strangely absent. The version of interest is shown below, from the Universitätsbibliothek Erlangen-Nürnberg (manuscript 406, referred to as the Erlangen Codex, page 204v).

What is unique about this version of the pedigree is that it has been turned upside down, so that the root is at the bottom, making it look more tree-like. (See also my post on Does it matter which way up a tree is drawn?) As Piggin notes (NB: he uses the word "stemma" to refer to the early versions of pedigrees, with the names in roundels, connected by lines):
Other manuscripts of the Ekkehard Chronicle present the Stemma of Cunigunde more or less faithfully, but the scribe-artist of the Erlangen codex decided to have some fun with it. He inverted it, and drew the figure of Arnulph at the left and Arnulph's saintly mother Begga at right. [Arnulf is the person named at the root of the pedigree.]
What change in medieval culture had made this startling inversion of the stemma not just possible, but acceptable to the customer, probably the Cistercian Monastery of Heilsbronn in Germany, which became the long-term owner of this codex? Is this quirky conversion on an artist's desk the precise moment when the family tree, later to become a prestigious badge of nobility, was invented?
As I have already pointed out, inverted stemmata made to resemble trees with roots in soil are a rarity before the 16th century. It was 16th-century scholars like Scipione Ammirato who deserve the credit as the true originators of the family tree, not the medieval artists who created trees of ancestry more or less by fluke.

Tuesday, November 1, 2016

Phylogenies everywhere

Once you have seen a phylogenetic tree, it is difficult not to see them everywhere.

As a first example, this figure is from Alexander J. Hetherington, Christopher M. Berry, and Liam Dolan (2016) Networks of highly branched stigmarian rootlets developed on the first giant trees. Proceedings of the National Academy of Sciences of the USA 113: 6695-6700.

The authors refer to this forest of trees as a "network", but they also note that "stigmarian rootlets branch in a strictly dichotomous manner through multiple orders of branching", and so there are no reticulations.

This next example is taken from the web, from somewhere in Reddit, I believe. The author refers to it as "Geological Phylogenetics".

Thanks to Luay Nakhleh for drawing my attention to the first example.

Tuesday, October 25, 2016

Sound change as systemic evolution

I have been discussing the peculiarities of sound change in linguistics in a range of blog posts in the past (see Alignments and Phylogenetic Reconstruction, Directional Processes in Language Change, Productive and Unproductive Analogies). My core message was that it is really difficult to find an analogy with biology, as sound change is not the simple mutation of one sound in a certain word, but the regular modification of all sounds of all words in the lexicon which occur in a specific contextual slot.

Scholars have tried to model this as concerted evolution (Hruschka et al. 2015). But the analogy with biology does not sound very convincing, as the change concerns the production of speech rather than its product. By this, I mean that sound change concerns the abstract system by which speakers produce the words of their language. Think of speakers in comic books who lose a tooth in some fight. Often, in order to show how their speech suffers from this loss, writers illustrate this by replacing certain "s" sounds in the speech of the victims with a "th" (in German, it would be an "f"). They do this in order to illustrate that with a lost tooth, it is "very difficult to thpeak". In the same way, writers imitate speech of people suffering from speech impediments like sigmatism (lisp). The loss of a tooth changes all "s"es in a person's language. Sound change, at least one type of sound change, is identical with this.

In a recent talk I gave with Nathan Hill at a conference in Poznań, we found a way to demonstrate this on actual language data. In this talk, we used data from eight Burmish languages (a language family spoken mainly in the South-West of China and in Myanmar), which we coded for partial cognates (as these languages contain many compounds). We aligned these cognate sets automatically, and then searched for recurring patterns in the alignments. One needs to keep in mind that our words in linguistics are extremely short, and we have no more than five sounds per alignment in our data, which translates to five sites in an alignment in biology.

While biology knows certain contextual patterns like hydrophilic stretches in alignments (as already demonstrated in the famous ClustalW software, compare Thompson et al. 1994), the context in which a sound occurs in language evolution is even more important. We can say, for example, that the beginning of a word or morpheme is usually the most stable part, where sounds change much more slowly than in the other parts (at the end of a word or of a syllable). We thus concentrated only on the first sound of each word and looked at the patterns of sounds we could find there.

Those patterns in our data usually look like this:

Cognate set L1 L2 L3 L4 L5 L6 L7 L8
word 1 p p p Ø f f Ø p
word 2 p Ø p p Ø f p p
word 3 k Ø k s k Ø k
word 4 Ø k Ø s Ø s k
... ... ... ... ... ... ... ... ...

Note that the symbol "Ø" in this context denotes missing data, as we did not find a cognate set in the given language. As always, most of our data is patchy, and we have to deal with that. You can see that when looking only at the first sound in each alignment, we find quite a degree of variation; and if you look at all the data, you can see some things that seem to be structured, but the amount of complexity is still immense. You may see this from the following plot, showing only some 100 of the more than 300 patterns we created (coloured cells represent not necessarily the same sound, but one of ten different sound classes to which the more than 50 different sounds in our data belong):

Sound patterns (initial consonant) in the aligned cognate sets of the Burmish languages

Interestingly, however, most of the variation can be reduced quite efficiently with the help of network techniques. Since we are dealing with systemic evolution, it is straightforward to group our more than 300 alignments into groups that evolve in an identical manner. At least this is what our linguistic theory predicts, and what linguists have been studying for the last 200 years. When looking at the patterns I gave above, you can see that we can easily group the four words into two groups:
Cognate set L1 L2 L3 L4 L5 L6 L7 L8
word 1 p p p Ø f f Ø p
word 2 p Ø p p Ø f p p
- - - - - - - - -
word 3 k Ø k s k Ø k
word 4 Ø k Ø s Ø s k

Essentially, the two groups reflect only two patterns, if we disregard the gaps and merge them into one row each:
Cognate set L1 L2 L3 L4 L5 L6 L7 L8
word 1 / word 2 p p p p f f p p
- - - - - - - - -
word 3 / word 4 k k k s k s k

What is important when grouping two alignments into one pattern is to make sure that they do not contain any conflicting positions. This can be checked in a rather straightforward manner by constructing a network from the data. In this network, the nodes are the alignment sites (word 1, word 2, etc. in our examples), and links are drawn between nodes if two sites are not in conflict with each other. If we use this criterion of compatibility on our data, we obtain the following network:

Compatibility network of the sites in our aligned cognate sets

In the network, I further coloured the nodes according to the overall similarity of sounds present in them. The legend gives capital letters for major sound classes, in order to facilitate seeing the structure.
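The compatibility criterion behind such a network can be sketched in a few lines. In this hypothetical example (four made-up languages rather than our eight, and invented patterns), two patterns count as compatible when no language shows two different sounds, with the gap symbol "Ø" matching anything. The function names are my own for illustration and are not taken from the code used in the study.

```python
from itertools import combinations

GAP = "Ø"  # missing-data symbol, as in the tables above


def compatible(a, b):
    """Patterns conflict only where both have a sound and the sounds differ."""
    return all(x == y or GAP in (x, y) for x, y in zip(a, b))


def compatibility_network(patterns):
    """Return the set of edges linking every pair of non-conflicting patterns."""
    return {
        (m, n)
        for (m, pa), (n, pb) in combinations(patterns.items(), 2)
        if compatible(pa, pb)
    }


# Hypothetical initial-sound patterns across four languages.
patterns = {
    "w1": ["p", "p", "Ø", "f"],
    "w2": ["p", "Ø", "p", "f"],
    "w3": ["k", "Ø", "k", "s"],
    "w4": ["Ø", "k", "k", "s"],
}
edges = compatibility_network(patterns)
```

Note that compatibility defined this way is not transitive once gaps are involved: a very gappy pattern can be compatible with two patterns that conflict with each other, which is why the raw network can still contain conflicts.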

This network itself, however, does not tell us how to group the data into classes that correspond to one identical process of systemic evolution, as we can still see many conflicts. In order to solve this, we need to carry out a specific partitioning analysis that cuts the network into an ideally minimal number of cliques. Why cliques? Because a clique will represent patterns in our data that do not show any conflicts in their sounds, and this is exactly what we want to see: those patterns that behave identically, without exceptions.

The problem of finding the minimal clique partition of a network is, unfortunately, a hard one (see Bhasker and Samad 1991), so we needed to use some approximate shortcuts. Nevertheless, with a very simple procedure of clique partitioning, we succeeded in reducing the 317 cognate sets that we selected for our study down to 35 groups that covered 74% of the data (234 cognate sets), with a minimal size of 2 alignments per group. The "manual" inspection by the Burmish expert in our team (that is Nathan Hill) showed that many of these patterns correspond to what experts assume was one single sound in the ancestral Proto-Burmish language.
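As an illustration of what an approximate shortcut might look like, here is a greedy sketch: each pattern joins the first existing group whose merged consensus it does not contradict, and otherwise starts a new group. This is an assumption on my part about one simple heuristic, not a description of the exact procedure we used, and the toy patterns are hypothetical. Because the consensus holds the single non-gap sound seen so far in each language, fitting the consensus is the same as being compatible with every member, so each group is a clique.

```python
GAP = "Ø"  # missing-data symbol


def compatible(a, b):
    """Patterns conflict only where both have a sound and the sounds differ."""
    return all(x == y or GAP in (x, y) for x, y in zip(a, b))


def merge(a, b):
    """Fill the gaps of pattern a with the sounds of a compatible pattern b."""
    return [y if x == GAP else x for x, y in zip(a, b)]


def greedy_partition(patterns):
    """Greedy clique partition: not guaranteed to find the minimal number of groups."""
    groups = []
    for name, pat in patterns.items():
        for group in groups:
            if compatible(group["consensus"], pat):
                group["consensus"] = merge(group["consensus"], pat)
                group["members"].append(name)
                break
        else:  # no existing group fits, so open a new one
            groups.append({"consensus": list(pat), "members": [name]})
    return groups


# Hypothetical initial-sound patterns across four languages.
patterns = {
    "w1": ["p", "p", "Ø", "f"],
    "w2": ["p", "Ø", "p", "f"],
    "w3": ["k", "Ø", "k", "s"],
    "w4": ["Ø", "k", "k", "s"],
}
groups = greedy_partition(patterns)
```

The result of such a greedy pass depends on the order in which the patterns are visited, which is one reason why it can only approximate the minimal partition.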

But to just illustrate more closely what I mean by reducing patterns to unique groups, look at the following pattern, which shows different nasal sounds in the data:

Nasal sounds in the Burmish data

And then at another pattern, showing s-sounds:

S-sounds in the Burmish data

I think (at least I hope) that the amount of regularity we find here is enough to demonstrate what is meant by the regularity of sound change in linguistics: sound change is in some sense just like losing a tooth, but for a complete population of speakers, not just one speaker, as the population starts to change all sounds occurring in a certain environment to some other sound.

Our results are not perfect: the 26% of unique patterns, for example, are something we will need to look into in more detail in the near future. A quick check showed that they may result from errors in the cognate annotation, but also from peculiarities in the data, and even simply from sounds that are rare in the languages under investigation.

We are currently looking into these issues, trying to refine our approach. I realized, for example, that the minimal clique coverage problem has been studied before by other researchers, and I found a rather large amount of Russian literature on the topic (see, for example, Bratceva and Čerenin 1994 and Ryzhkov 1975), but those approaches do not seem to have been thoroughly studied in the Western literature. We also know that at some point we need to relax our approach, allowing for some exceptions — we know that systemic sound change processes are easily overridden by language-specific factors, be it lateral transfer, or pragmatics in a larger sense (think of Bob Dylan, talking of "the words I never KNOWED" in order to make sure the word rhymes with "ROAD", or the form "wanna" as a shortcut for "want to").

Not all cases in which speakers changed the pronunciation of sounds have systemic reasons, and we are still far from actually understanding the systemic reasons that lead to the regular aspects of sound change. What we can show, however, is that sound change is really something peculiar in language evolution, with no real counterpart in biology. At least, I do not know of any case where a set of 300 alignments could be reduced to some 35 largely identical patterns. This shows, on the other hand, that the classical biological approaches that try to model each site of an alignment independently are definitely not what we need in order to model sound change realistically. The assumption of independence of sites in an alignment is already problematic in biology. In linguistics, at least in the cases illustrated above, it seems to be just as useless as tossing a coin to predict the weather in a desert: it is too much of an effort with very poor results to be expected.

  • Bhasker, J. and T. Samad (1991): The clique-partitioning problem. Computers & Mathematics with Applications 22.6. 1 - 11.
  • Bratceva, E. and V. Čerenin (1994): Otyskanie vsex naimen’šix pokrytij grafa klikami [Searching all minimal clique coverages of a graph]. Žurnal Vyčislitel’noj Matematiki i Matematičeskoj Fisiki [Journal of Computational Mathematics and Physics] 34.8-9. 1272-1292.
  • Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015): Detecting regular sound changes in linguistics as events of concerted evolution. Curr. Biol. 25.1. 1-9.
  • Ryzhkov, A. (1975): Partitioning a graph into the minimal number of complete subgraphs. Cybernetics 11.6. 939-943. Original article: Рыжков А. П., Разбиение графа на минимальное число полных подграфов .. 90-96. Kybernetika 1975. 6.
  • Thompson, J., D. Higgins, and T. Gibson (1994): CLUSTAL W. Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22.22. 4673–4680.

Tuesday, October 18, 2016

The Genome Cellar is no such thing

In an earlier blog post, I noted that The Music Genome Project is no such thing. The use of the word "genome" in this context is an analogy, in which the musical characteristics are seen as producing a sort of genetic fingerprint. However, this is a false analogy, because the data used for the Music Genome Project are actually phenotypic, not genotypic. Indeed, music has no analog of a genotype.

In a similar vein, the data used for The Genome Cellar are phenotypic, not genotypic, and so this is also a false analogy.

The Genome Cellar is the database used by the Next Glass app. This app was released in November 2014, and a concurrent press release explained the concept:
Next Glass is the breakthrough app that uses science and machine learning software to provide accurate, personalized recommendations to consumers. Next Glass has analyzed tens of thousands of bottles of wine and beer with a mass spectrometer and stores the "DNA" of each product in its Genome Cellar™, which combines with users' Taste Profiles™ to provide product-specific recommendations.
So, the beer / wine data in the Genome Cellar are peaks in a mass spectrometer output. This is made clear in another press release:
Next Glass has developed the world’s first Genome Cellar, an extensive database that contains the chemical makeup – or "DNA" – of tens of thousands of wines and beers. By looking at each bottle on a molecular level, Next Glass defines a unique taste profile for every bottle by analyzing thousands of chemical elements.
This procedure will, indeed, provide a unique fingerprint for each alcoholic product, but it will be a phenotypic one not a genotypic one. Genetics is often chemistry but not all chemistry is genetics.

The idea of the Next Glass app is the same as that for the Music Genome Project — to use the fingerprint of currently liked products (music or wines / beers) to make recommendations for other products that might appeal to the customer. This approach can be expected to work for alcoholic beverages, because the subjective preferences will be based to some extent on the sensory components of the chemical makeup. If you document enough of the chemistry then you are bound to include a large proportion of the sensory part.

Anyway, you can see a short video about the laboratory here.

Finally, you might like to compare this approach with that of WineFriend, which tries to assess your taste in wine with multiple-choice questions, instead of complex chemistry. WineFriend:
uses a simple eight question taste survey that gives insights into a customer's thresholds for sweet, sour, bitterness and intensity of flavour. It then creates a profile which enables it to select wines that are tailored to the individual customer's tastes.
No mention of genomes here.

Tuesday, October 11, 2016

Changes in Playboy's women through 60 years

It has long been known that ideas about female attractiveness, and concern with body weight among young women, are closely related to exposure to mass media images (see the review by Spettigue & Henderson 2004). The print media are particularly involved in this issue, not least the so-called "men's magazines", such as Playboy. It therefore created a great deal of media interest when it was announced in October 2015 that Playboy would no longer feature nude centerfolds (known as Playmates).

Indeed, Playboy has often been claimed as a purveyor of the US society's image of the "ideal woman", although this is surely media exaggeration. Playboy, whether we love it or hate it, has simply portrayed females that the editors thought would sell magazines at the time. Nevertheless, the magazine's choice of models has been used in the professional medical and psychological literature as representative of a prevalent cultural idealization of an ultra-slender female body shape (eg. Garner et al. 1980; Wiseman et al. 1992; Szabo 1996; Spitzer et al. 1999; Katzmarzyk & Davis 2001; Pettijohn & Jungeberg 2004).

It therefore comes as no surprise that the magazine's database of model statistics was subjected to scrutiny in the online media after the 2015 announcement, particularly with regard to how things had changed during the magazine's 62 years. Sadly, some of this analysis was quite poor (eg. Playboy's image of the ideal woman sure has changed). Here, I try to correct this by presenting a more thorough study of the available data.

The data I have used covers all of the Playmates of the Month that have appeared in the US edition of the magazine since its inception. This is contained in a searchable version of the pmstats.txt file that has been maintained by Jim Dean, Johnny Corvin and Doug Ewell, as currently available on Peggy Wilkins' website. This file is an updated compilation of the so-called "vital statistics" of the Playmates from December 1953 to February 2016, inclusive, as reported in Playboy, sometimes supplemented from other available sources.

Note, especially, that the data are basically self-reported by the Playmates. Some of the information has been questioned at various times, notably where it seems to contradict the associated photographic evidence. As a reputable scientist, I should probably have personally checked all of this evidence, but I have not done so (you can do so yourself, based on whatever photos you can find on the internet). I have simply assumed that, at a minimum, the information presents whatever the Playmates thought was a desirable public image at the time of publication.

There are 753 records in the dataset, separately including twins and triplets appearing in the same magazine issue, as well as multiple appearances by the same woman in different issues. The data include: magazine issue month; Playmate name, birth date and birth location; height in inches and weight in pounds; breast, waist and hip dimensions in inches; and photographer name. From this information, for each Playmate I calculated their age at the time of publication, along with standard measurements for determining whether a body is healthy or not: Body Mass Index (BMI), for body size (ie. underweight, normal weight, overweight, obese), and Waist to Hip Ratio (WHR), for body curvaceousness.
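For readers who want to reproduce these derived measurements, here is a minimal sketch of the arithmetic, starting from the units used in the file (inches and pounds). BMI is weight in kilograms divided by the square of height in metres, and WHR is simply waist divided by hips. The example record is hypothetical, and counting age in completed years is an assumption for illustration.

```python
from datetime import date

LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms
IN_TO_M = 0.0254       # exact definition of the inch in metres


def bmi(weight_lb, height_in):
    """Body Mass Index in kg/m^2, from pounds and inches."""
    return (weight_lb * LB_TO_KG) / (height_in * IN_TO_M) ** 2


def whr(waist_in, hip_in):
    """Waist to Hip Ratio (dimensionless)."""
    return waist_in / hip_in


def age_in_years(birth, issue):
    """Completed years between birth date and magazine issue date."""
    years = issue.year - birth.year
    if (issue.month, issue.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached in the issue year
    return years


# Hypothetical record: 66 inches, 115 pounds, 24-inch waist, 36-inch hips.
example_bmi = bmi(115, 66)
example_whr = whr(24, 36)
example_age = age_in_years(date(1950, 6, 15), date(1972, 3, 1))
```

For this made-up record the BMI comes out just above 18.5, which shows how close to the lower limit of the healthy range a typical set of reported measurements sits.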


As is usual in this blog, the data can be summarized using a phylogenetic network as a form of exploratory data analysis (see How to interpret splits graphs).

I first range-standardized the data (so that all of the measurements are compared on the same scale), and log-transformed the BMI and WHR measurements (because otherwise these ratios will have non-linear relationships to the other variables). I then used the Manhattan distance to calculate the similarity of the different publication years and birth locations, based on the Playmates' body dimensions. This was followed by a neighbor-net analysis to display the between-year and the between-location similarities as two phylogenetic networks.
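The preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the per-year values are hypothetical, I assume the years are summarized by their mean measurements, and I apply the log transform to the two ratios before rescaling every variable to the range 0 to 1. The actual calculations for this post were done with standard software, not with this code.

```python
import math


def range_standardize(values):
    """Rescale values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant variable carries no information
    return [(v - lo) / (hi - lo) for v in values]


def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical per-year means: (height_in, bust_in, bmi, whr).
years = {
    1960: (65.0, 36.0, 19.5, 0.68),
    1980: (66.0, 35.0, 18.0, 0.68),
    2000: (67.0, 34.0, 18.5, 0.70),
}
LOG_COLUMNS = {2, 3}  # positions of the ratio variables (BMI and WHR)

columns = []
for k, col in enumerate(zip(*years.values())):
    col = [math.log(v) for v in col] if k in LOG_COLUMNS else list(col)
    columns.append(range_standardize(col))

# Transpose back so that each year maps to its standardized measurements.
scaled = dict(zip(years, zip(*columns)))
d_1960_2000 = manhattan(scaled[1960], scaled[2000])
```

Because every variable is rescaled to the same 0 to 1 range, each one can contribute at most 1 to the distance, so no single measurement dominates simply because of its units.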

The network of relationships among the years is shown first. Years that are closely connected in the network are similar to each other based on the body dimensions of their Playmates, and those that are further apart are progressively more different from each other.

The network shows that there has been a strong and consistent change in Playmate age, size and shape through time. In the graph there is a simple gradient through time from top-right to bottom-left: the 1950s and 1960s are intermingled at the top, with the 1970s below them, the 1980s and 1990s below that, and the 2000s and 2010s intermingled at the bottom.

So, it will be worth looking at time graphs of the individual measurements. Let's start with age.

This does not show a particularly consistent trend, but the average age of the models does increase from 21 to 24 years from beginning to end of the time period.

The next graph shows that the reported height of the Playmates also increases across the 62 years, by 2.5" on average. There is almost no change in average weight across the decades (and so the graph is not shown).

However, far more notable is the relationship between height and weight, as expressed by the BMI, which is shown in the next graph. This does not show a linear trend at all, but a distinctly curved one. That is, the size of Playmates definitely changed through time, becoming thinner for the first 40 years, but then thickening up again for the next 20 years.

This trend has not been discussed in the professional literature, as far as I can determine, perhaps because previous assessments have been based only on a relatively short period of time, not the full six decades. Note that the bottom point of the curve occurs around 1997, and that by 2016 the BMI measurements had returned to the 1975 level (40 years earlier). I wonder whether they would return to the 1950s level in another 20 years?

More importantly, given that Playmates are to one degree or another reflecting a contemporary societal image of a desirable woman, we can note that 48% of these models are classified as being underweight. The lower limit of a healthy BMI is 18.5, as shown in the next graph, which also shows the boundaries between Mild thinness (17-18.5), Moderate thinness (16-17) and Severe thinness (<16).
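As a rough sketch of how these categories can be applied, the following computes BMI from a weight in pounds and a height in inches, using the standard imperial conversion factor of 703, and assigns the thinness classes listed above. The function names are hypothetical, and the handling of values that fall exactly on a boundary is an assumption (each boundary value is placed in the less severe class).

```python
def bmi_from_imperial(weight_lb, height_in):
    """BMI from pounds and inches, using the standard factor of 703."""
    return 703.0 * weight_lb / height_in ** 2

def thinness_category(bmi):
    """Classify a BMI value using the boundaries 16, 17 and 18.5."""
    if bmi < 16:
        return "Severe thinness"
    if bmi < 17:
        return "Moderate thinness"
    if bmi < 18.5:
        return "Mild thinness"
    return "Not underweight"

# Hypothetical example: a reported 110 lb at 66 inches tall.
example_bmi = bmi_from_imperial(110, 66)
example_class = thinness_category(example_bmi)
```

For the hypothetical example, the BMI is a little under 18, which falls in the Mild thinness class.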

Clearly, during the period 1975-1995 the vast majority of the models reported being underweight, while in the 1950s and 1960s very few of them did. This situation has improved recently, with roughly half being underweight during the past 20 years. Also, several of the reported body sizes are very unhealthy. However, perhaps the BMI values below 16 are unreliable, in the sense that such a person is not likely to be very photogenic.

We can now move on to the circumferences of the models. The next graph shows the time trend for the reported circumference at breast level. This shows the biggest and most consistent change of all, with a dramatic reduction in bustiness.

Indeed, chest sizes of >36" have hardly been reported since the start of 1990, and yet in the early years a buxom 36-24-36 figure was the most common claim by the Playmates. Interestingly, very few of the models have claimed a chest size of 33" (as opposed to 32" or 34"); is this some sort of superstition?

The other large and consistent change in circumference is for waist size, as shown in the next graph. This shows the opposite trend, with an increase in average reported size of 2" across the 60 years.

There was a slight but not consistent reduction in hip circumference through time (and so the graph is not shown). This means that the WHR, the measure of curvaceousness, changed greatly through time, as shown in the next graph. So, with the waists reportedly becoming larger, there was apparently a very large reduction in the curvaceousness of the models through time.
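The arithmetic behind this is simple, and can be sketched as follows. The WHR is the waist circumference divided by the hip circumference, so a larger waist with roughly unchanged hips pushes the ratio towards 1 (fewer curves). The second set of measurements below is purely hypothetical, chosen only to mimic a 2" larger waist and slightly smaller hips.

```python
def waist_hip_ratio(waist_in, hip_in):
    """WHR: waist circumference divided by hip circumference."""
    return waist_in / hip_in

# The classic 36-24-36 claim mentioned in the text.
early = waist_hip_ratio(24, 36)

# Hypothetical later figure: waist 2" larger, hips 1" smaller.
later = waist_hip_ratio(26, 35)

# A positive change means the ratio has moved towards 1.
change = later - early
```

Here the ratio rises from about 0.67 to about 0.74, which illustrates the direction of the shift rather than its actual measured size.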

Note that the reduction in BMI was apparently achieved in spite of an increase in waist size — the BMI reduction seems to be related to the increase in average reported height without an increase in weight, and partly to the decrease in chest size.

When combined with the reduction in breast circumference, this means that the Playmates of the 21st century have been a very different shape from those of the mid 20th century. They are taller, with smaller breasts and larger waists, and thus have fewer curves.

We can end this discussion by considering where these Playmates were born. Most of them reported being born in the USA (83%). This means that we can consider how the various states compare in producing nude models. Obviously, more models are likely to come from the most populous states, and so we need to standardize the data by dividing by the population size of each state (as estimated for 2015 in Wikipedia), to yield the number of Playmates per million people in each state.
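The standardization amounts to a one-line calculation, sketched here with made-up counts and populations (none of these figures come from the actual data set, and the state names are placeholders).

```python
def per_million(count, population):
    """Number of Playmates per million inhabitants."""
    return count * 1_000_000 / population

# Hypothetical (count, population) pairs, for illustration only.
states = {
    "State A": (2, 500_000),
    "State B": (40, 20_000_000),
}

rates = {name: per_million(c, p) for name, (c, p) in states.items()}
ranked = sorted(rates, key=rates.get, reverse=True)
```

In this made-up case the small state ranks first (4 per million versus 2 per million) with only two models, which is why rates based on small populations need to be read with care.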

Apparently, Hawaii and California are more likely than the other states to produce models who are prepared to take their clothes off in public, while Delaware and Vermont have not yet done so, at least as far as Playboy is concerned. The apparently large value for Washington DC represents only 2 models from a relatively small population.

We can also consider whether the dimensions of the models vary in any consistent way between the states. This can be done with a phylogenetic network, as discussed above. In the following network, states that are closely connected are similar to each other based on the body dimensions of their Playmates, and those that are further apart are progressively more different from each other.

There appear to be no consistent patterns here.

So, we can finish by considering the countries from which the remaining 17% of the models originated. Once again, the data are standardized, to yield the number of Playmates per million people in each country (or province, for Canada). The apparently large value for Malta represents one set of twins from a relatively small population.

There have been a relatively large number of models from Scandinavia (Norway, Denmark and Sweden). This presumably represents the number of females whose body shape matches the image required by the Playboy editors, as much as the willingness of Scandinavians to disrobe publicly. However, it is notable that the rate of models from Norway is double the rates for Denmark and Sweden.


Garner DM, Garfinkel P, Schwartz D, Thompson M (1980) Cultural expectations of thinness in women. Psychological Reports 47: 484-491.

Katzmarzyk PT, Davis C (2001) Thinness and body shape of Playboy centerfolds from 1978 to 1998. International Journal of Obesity 25: 590-592.

Pettijohn TF, Jungeberg BJ (2004) Playboy Playmate curves: changes in facial and body feature preferences across social and economic conditions. Personality and Social Psychology Bulletin 30: 1186-1197.

Spettigue W, Henderson KA (2004) Eating disorders and the role of the media. Canadian Child and Adolescent Psychiatry Review 13: 16-19.

Spitzer BL, Henderson KA, Zivian MT (1999) Gender differences in population versus media body sizes: a comparison over four decades. Sex Roles 40: 545-565.

Szabo CP (1996) Playboy centrefolds and eating disorders - from male pleasure to female pathology. South African Medical Journal 86: 838-839.

Wiseman CV, Gray JJ, Mosimann JE, Ahrens AH (1992) Cultural expectations of thinness in women: an update. International Journal of Eating Disorders 11: 85-89.