The Traveler’s Restaurant Process — A Better Description of the Dirichlet Process for Partitioning Sets


I. “Have Any of These People Ever Been to a Chinese Restaurant?”

The Dirichlet process is a stochastic process that can be used to partition a set of elements into a set of subsets. In biological modeling, it is commonly used to assign elements to groups, such as molecular sequence sites to distinct rate categories. Very often, an intuitive explanation of how it works invokes the “Chinese Restaurant Process” analogy. I have always found this analogy very jarring and confusing, as, while it is a good description of how the Dirichlet process works, it is a terrible description of how Chinese restaurants, or, indeed, any other type of restaurant run by (and patronized by) humans[1], actually work. As Dr. Emily Jane McTavish says, “Have any of these people ever been to a Chinese restaurant?” Indeed. In fact, one may wonder if any of them have been to any restaurant outside of a Kafkaesque performance-art-themed one.

II. The “Traveler’s Restaurant Process” Analogy

I believe a far more intuitive analogy might be given by the “Traveler’s Restaurant [Selection] Process”. A common dictum in traveler lore, guides, advice, and general wisdom is that, when picking a restaurant for a meal, one should prefer establishments that appear to be more popular or more frequented by locals. And so we can imagine, instead, describing how a group of N travelers distribute themselves among the various restaurants at a food court. As with the original analogy, the travelers enter one by one, with the food court initially empty. (Yes, this is still a weakness in the analogy that lends a surreal contrivance to the whole tale, but maybe this can be alleviated somewhat by imagining that (a) it is 1 am; (b) the travelers have just arrived and are suffering from jet lag; and (c) they wander down to the food court after checking in at their respective hotels/motels/pensions, and variances in processing times lead them to straggle in individually. As for a food court being open at 1 am: in a number of parts of the world, this is pretty commonplace, I assure you!) The first traveler finds an empty food court and picks a restaurant at random. The next traveler enters the food court and, looking around, sees just one restaurant occupied. It is possible that she makes a beeline for that restaurant, but maybe she is feeling cranky or does not particularly like the other traveler (or other travelers in general), and heads off to a different restaurant. And so on with the next traveler, and the next, and the next, until the last one, each of whom makes an individual decision whether to try a new, empty restaurant (if they are feeling particularly adventurous or misanthropic or introverted, or have squabbled with or are otherwise disdainful of everyone else) or, at the other extreme, head for the most crowded restaurant (if they are sticking to the tried-and-true traveler’s restaurant-selection dictum or are feeling particularly sociable), or anything in between. At the end of it all, we take the clusters of travelers, as they have distributed themselves into individual restaurants, as the partition of the original set of travelers.

Makes (better) sense?

Maybe not!

III. Working Through the Process

Emily also says that simply walking through the model expressions and formulas often yields much better intuition than contrived and twisted analogies and metaphors. And maybe she is right!

So, let us take a set of $N$ elements: $t_1, t_2, …, t_N$. We can imagine each element to be a traveler from the “Traveler’s Restaurant Process” analogy, or vice versa, or anything else you prefer. Each element is assigned to a group in turn, creating new groups as needed. With no existing groups, the first element is automatically assigned to a new group. Well, technically, the first element is assigned to a new group with probability given by:

\frac{\alpha}{\alpha + n - 1}, \qquad (1)

where $\alpha$ is known as the concentration parameter, and $n$ is the 1-based index of the element (i.e., the first element has $n=1$, the second has $n=2$, the third has $n=3$, and so on). For the first element, $n=1$, and the above expression evaluates to 1.0, so the first element is always assigned to a new group. The more general case, applicable to all subsequent elements up to and including the last one, is that the $n^{\text{th}}$ element is:

  • Either assigned to a new group, $g_*$, with probability given by expression (1) above,
  • Or assigned to one of the existing groups, $g_i$, with probability:
    \frac{|g_i|}{\alpha + n - 1}, \qquad (2)

where $\alpha$ is the concentration parameter and $n$ is the 1-based index of the element, as previously described, and $|g_i|$ is the number of elements in group $g_i$.
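To make this assignment step concrete, here is a minimal sketch in Python of how a single element might be assigned, given the sizes of the existing groups (the function name and signature are mine, purely for illustration):

import random

def assign_element(group_sizes, alpha):
    """Pick a group for the next element under the Dirichlet process.

    group_sizes: list of |g_i| values for the existing groups (possibly empty).
    alpha: the concentration parameter.
    Returns an index into group_sizes, or len(group_sizes) to open a new group.
    """
    # Each existing group g_i gets weight |g_i|; the potential new group gets
    # weight alpha. Sampling proportionally to these weights is equivalent to
    # using expressions (1) and (2), since they share the same denominator.
    weights = list(group_sizes) + [alpha]
    return random.choices(range(len(weights)), weights=weights)[0]

So, for example, assign_element([4, 2, 1], alpha=1.0) assigns the eighth element to a brand-new group (returning index 3) with probability 1/8, and to the largest existing group (index 0) with probability 4/8.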

IV. How/Why Does it Work?

How do we know that each element will definitely be assigned to a group? We can reassure ourselves of this by checking that the probabilities of being assigned either to a new group or to an existing group sum to one. Let us say that we are dealing with the $n^{\text{th}}$ element out of $N$ elements, where $1 \leq n \leq N$. As described above, the probability that this element will be assigned to a new group is given by expression (1). On the other hand, the probability of being assigned to any one of the existing groups is given by:

\sum_{i=1}^{M}\frac{|g_i|}{\alpha + n - 1}, \qquad (3)

where $M$ is the number of existing groups, $g_1, g_2, …, g_M$. Now, while we do not know in this general case exactly how the previous $n-1$ elements are distributed amongst the various $M$ existing groups, we do know that, regardless, there must be a total of $n-1$ elements across all groups. So, then, the above expression reduces to:

\frac{\sum_{i=1}^{M} |g_i|}{\alpha + n - 1} = \frac{n-1}{\alpha + n - 1}. \qquad (4)

So the probability of being assigned either to a new group or to an existing group is given by the sum of expressions (1) and (4), which is:

\frac{\alpha}{\alpha + n - 1} + \frac{n-1}{\alpha + n - 1} = \frac{\alpha + n - 1}{\alpha + n - 1} = 1.

And thus we are assured that every element will be assigned to a group, whether a new one or an existing one, and we can also sleep at night knowing that the distribution over partitions under the Dirichlet process is a proper probability distribution, as it sums to 1.0.
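If the algebra alone does not help you sleep, a quick numeric check (a throwaway sketch; the group sizes and alpha below are arbitrary) shows the same thing:

# Probabilities of assigning the next element, for an arbitrary configuration
# of existing groups, sum to one.
alpha = 1.5
group_sizes = [4, 2, 1]                              # n - 1 = 7 elements placed so far
n = sum(group_sizes) + 1                             # index of the element being assigned
denom = alpha + n - 1
p_new = alpha / denom                                # expression (1)
p_existing = [size / denom for size in group_sizes]  # expression (2), one term per group
print(p_new + sum(p_existing))                       # prints 1.0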

V. The (Anti-?)Concentration Parameter

We have mentioned and used the concentration parameter, $\alpha$, above without actually explaining it. Basically, the concentration parameter determines how “clumpy” the process is. Unfortunately, contrary to what its name implies, at low values the process is more “clumpy”, yielding partitions where elements tend to be grouped together more frequently, resulting in larger and correspondingly fewer subsets, while at high values the process is less “clumpy”, yielding partitions where elements tend to be more dispersed, resulting in smaller and correspondingly more subsets. Yeah, it really should have been called the “anti-concentration” or “dispersion” parameter. Or perhaps folks should just stick to the more neutral and less evocative, but non-misleading, standard term: the “scaling” parameter.

(See updates here and here that clear up this issue! Basically, the “concentration” parameter gets its name not from the way it concentrates elements, as my naive and incorrect intuition led me to think, but rather from how increasing it leads the distribution of values across the subsets to converge on the base distribution! What values? What base distribution? EXACTLY! Those concepts do not really enter into any of the analogies we have discussed so far, or, indeed, even into the way we have explained the process with reference to the equations and model. Only when considering a version of the Dirichlet process where our elements are not just anonymous exchangeable atoms, but value-bearing elements that we want to cluster based on their values, does the base distribution enter the picture, and only then does the “concentration” parameter do its concentrating, the higher it gets!)

VI. In Action

This Gist wraps up all this logic in a script that you can play with to get a feel for how different concentration parameters work.

With low values of the concentration parameter, we pretty much get all the elements packed into a single set:

# python -n100 -v0 -a 0.01
Mean number of subsets per partition: 1.04
  Mean number of elements per subset: 9.8

On the other hand, with high values of the scaling parameter, the trend is toward each element being in its own set, with nearly as many subsets in the partition as there are elements in the full set:

# python -n100 -v0 -a 100
Mean number of subsets per partition: 9.65
  Mean number of elements per subset: 1.03944444444

And, of course, moderate values result in something in between:

# python -n100 -v0 -a 1
Mean number of subsets per partition: 3.03
  Mean number of elements per subset: 3.92166666667

# python -n100 -v0 -a 5
Mean number of subsets per partition: 5.76
  Mean number of elements per subset: 1.88055555556

# python -n100 -v0 -a 10
Mean number of subsets per partition: 7.01
  Mean number of elements per subset: 1.49297619048

VII. The Script

The script to run this is shown below. If you are interested in downloading and using it, please visit:
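In the meantime, a minimal sketch along the same lines is given below. The -n, -a, and -v flags follow the example invocations above; everything else (the per-partition element count of 10, the --num-elements flag, the exact summary statistics) is an assumption of this sketch and not necessarily what the original script does.

#!/usr/bin/env python
"""Simulate random set partitions under the Dirichlet process."""

import argparse
import random


def sample_partition(num_elements, alpha, rng):
    """Partition num_elements elements using the Dirichlet process."""
    groups = []  # each group is a list of element indices
    for element in range(num_elements):
        # Existing group g_i is chosen with weight |g_i|; a new group with weight alpha.
        weights = [len(g) for g in groups] + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(groups):
            groups.append([element])        # open a new group
        else:
            groups[choice].append(element)  # join an existing group
    return groups


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("-n", type=int, default=100,
                        help="number of replicate partitions to simulate")
    parser.add_argument("-a", type=float, default=1.0,
                        help="concentration (scaling) parameter, alpha")
    parser.add_argument("-v", type=int, default=0,
                        help="verbosity level")
    parser.add_argument("--num-elements", type=int, default=10,
                        help="number of elements in each partition")
    args = parser.parse_args()

    rng = random.Random()
    subsets_per_partition = []
    elements_per_subset = []
    for rep in range(args.n):
        groups = sample_partition(args.num_elements, args.a, rng)
        subsets_per_partition.append(len(groups))
        elements_per_subset.extend(len(g) for g in groups)
        if args.v > 0:
            print("Replicate {}: subset sizes = {}".format(rep + 1, [len(g) for g in groups]))

    print("Mean number of subsets per partition: {}".format(
        sum(subsets_per_partition) / float(len(subsets_per_partition))))
    print("  Mean number of elements per subset: {}".format(
        sum(elements_per_subset) / float(len(elements_per_subset))))


if __name__ == "__main__":
    main()

Saved as, say, dp_partition.py (a made-up name), it can be run as python dp_partition.py -n100 -v0 -a 5; the exact numbers will of course vary from run to run.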


  1. Interestingly enough, I can imagine a restaurant that was patronized by cats working pretty much exactly like the traditional analogy, given the way some cats tend to be clumpers and others loners. So, if you don’t like the “Traveler’s Restaurant Process” analogy, please feel free to use the “Cat Restaurant Process” analogy, or, better yet, the “Cat Room/Furniture Occupation Process” analogy.
  2. Update 2017-07-25:
    So it seems that the concentration parameter gets its name not from the fact that it (inversely) controls the concentration or clustering of elements, but because of its relationship to the base distribution of the Dirichlet process. The base distribution is something that we have not talked about, and I will cover it in a future post once I understand it well enough to talk about it in a useful and interesting way! But for the time being, it is sufficient to say that it is the distribution over the elements themselves before they are assigned into groups. The concentration parameter is so called because, as it increases in value, it increasingly “concentrates” the values of elements on the base distribution (while at the same time increasingly dispersing the elements into distinct groups). So the name only seems counter-intuitive because of wrong intuition: it is not referring to how “concentrated” the elements are in terms of the subsets of the partition, but to how closely the distribution of element values resembles the base distribution. More details on this can be found here, where the concentration parameter is described as a prior belief in the base distribution.
  3. Update 2017-07-26:
    So, here is a GREAT explanation of what is going on. Summarizing: if we understand the Dirichlet process using the Chinese Restaurant Process or the Traveler’s Restaurant Process or the Hair-Clog Process etc. etc., we are ignoring the base distribution, as we do not care about the values of the elements that are clustered, or the distribution of those values within each subset. Only with reference to, e.g., the Polya Urn Model, where each element has a value associated with it and is assigned to a group based on how that value relates to the distribution of values associated with each group, while the parameters of the distribution of each “urn” are drawn from a base distribution, do things make sense. As the concentration parameter increases, the elements spread out across more and more subsets/urns/tables, and since the parameters of each subset/urn/table’s value distribution are sampled from the base distribution, the base distribution itself gets more and more (and better and better) sampled: each subset/urn/table represents an independent sample from the base distribution. Thus the distribution across all subsets/urns/tables converges on the base distribution. Conversely, as the concentration parameter decreases, the elements cluster into fewer groups, and with fewer groups we get a more limited sampling of the base distribution. (A small sketch of this value-bearing version of the process is given just after this list.)
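As a rough illustration of that last point, here is a minimal sketch of the value-bearing version of the process, assuming, purely for illustration, a standard normal base distribution and narrowly normal values within each group (none of this comes from the linked explanation itself):

import random

def sample_values(num_elements, alpha, rng=random):
    """Dirichlet process mixture sketch: each group's mean is drawn from a
    standard normal base distribution; each element's value is drawn from a
    narrow normal centered on its group's mean."""
    group_sizes = []   # |g_i| for each existing group
    group_means = []   # the parameter of each group, drawn from the base distribution
    values = []
    for _ in range(num_elements):
        weights = group_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(group_sizes):
            # New group: its parameter is a fresh, independent draw from the base.
            group_means.append(rng.gauss(0.0, 1.0))
            group_sizes.append(1)
        else:
            group_sizes[k] += 1
        values.append(rng.gauss(group_means[k], 0.1))
    return values, group_means

# With a large alpha, many groups are created, so the group means (and hence the
# element values) sample the base distribution densely; with a small alpha, only
# a handful of draws from the base distribution are ever made.
values, means = sample_values(1000, alpha=50.0)
print("{} groups created".format(len(means)))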

“Pre-Columbian Mycobacterial Genomes Reveal Seals As A Source Of New World Human Tuberculosis”

When, in 1994, definitive evidence of tuberculosis in humans was reported from pre-Columbian America, it was startling. Conventional understanding had pegged tuberculosis as part of the new, exotic, and (to immunologically naive populaces) deadly menagerie of pathogens brought over to the Americas by Europeans. While there had been earlier suggestions of pre-Columbian tuberculosis in the Americas, these were based on lesions on bones, which were ambiguous. Unlike previous cases, however, the Chiribaya mummy from 1000-1300 CE in Peru was shown beyond doubt to have been exposed to tuberculosis:

In the mummy’s right lung and a lymph node, the scientists found scars of disease. These were small, calcified lesions typical of tuberculosis. Extracting fragments from the tissue, molecular biologists isolated genetic material betraying the presence of Mycobacterium tuberculosis.

To find evidence of tuberculosis here, some half a millennium before contact, seems to suggest that tuberculosis had always been present in the Americas, and that the increased incidence (to put it mildly) in post-contact Native American populations was the result of contextual changes (such as high population densities, changes in diet or lifestyle, physiologies or immune systems compromised or changed by other diseases or factors, etc.) rather than the “virgin soil” phenomenon. It also led to other questions: given that tuberculosis was thought to have originated some 8000 years ago in the Middle East, when bovine tuberculosis jumped to humans with the domestication of cattle, how did it get to the Americas before not just Old World people, but Old World cattle, got there?

Two decades later, the mystery was solved. Leveraging advances not just in molecular sequencing technologies but also in theoretical and computational analytic methods, it was found that the strain of tuberculosis found in pre-Columbian humans in the Americas was an entirely different one from the one that wreaked appalling havoc after the arrival of the Europeans. As can clearly be seen from the phylogenetic trees, the pre-Columbian tuberculosis was probably a completely independent zoonosis event, jumping to humans from seals. The strain of tuberculosis that was endemic to the Americas did not confer any immunity to the new strain brought over by the Europeans, and hence the “virgin soil” pandemic syndrome that proved so devastating. The seals themselves were thought to have picked it up from hosts in Africa, and brought it over to the Americas through oceanic migration/dispersal.

While seals may seem a strange source given the agricultural practices with which we are currently familiar (resulting in a lot of skepticism in comments in the popular press), many pre-Columbian populations at coastal sites had marine-based resource-exploitation economies that included strong interaction with a range of marine animals, including seals. In fact, a tantalizing clue to the marine origin (one that can only be recognized as a clue with hindsight) is that much of the early evidence for pre-Columbian tuberculosis came from sites close to the coast or sea, as with the Chiribaya mummy.

All in all, a fascinating historical detective story, with many twists and turns, and with a conclusion that is a great example of the power of modern molecular phylogenetics to peer into the past at the micro- as well as the macro-scale.

    Bos, K. I. et al. (2014). Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis. Nature, 514(7523), 494-497. doi:10.1038/nature13591.

Multispecies Coalescent Species Delimitation: Conflating Populations with Species in the Grey Zone (Evolution 2017 Talk)


The always fantastic Evolution meetings were a blast. So many great talks, and, perhaps more importantly, great catching up with so many friends, collaborators, and colleagues!

I presented a talk on our PNAS paper showing how the Multispecies Coalescent model, when used for “species” delimitation, actually delimits Wright-Fisher populations.

Titled “Multispecies Coalescent Species Delimitation: Conflating Populations with Species in the Grey Zone”, the entire talk can be viewed here:

The slides are available here:

“Phylogenomics reveals rapid, simultaneous diversification of three major clades of Gondwanan frogs at the Cretaceous–Paleogene boundary”

Some nice work that ties the timing of the radiation of three independent lineages of frogs, constituting the majority of modern living frogs, to about the time the major groups of dinosaurs took a hit (literally and figuratively!). Compelling and interesting story, with lots of intriguing follow-up questions. A more general article covering the findings is available here.

Feng, Y.-J., Blackburn, D. C., Liang, D., Hillis, D. M., Wake, D. B., Cannatella, D. C., & Zhang, P. (2017). Phylogenomics reveals rapid, simultaneous diversification of three major clades of Gondwanan frogs at the Cretaceous–Paleogene boundary. PNAS, published ahead of print July 3, 2017. doi:10.1073/pnas.1704632114.

Evolution of Bioluminescence in Millipedes

Walk deep into a rainforest at night. Switch off your headlamps. And wait with open eyes. At first, it is so pitch black that you cannot see your own hand if you wave it in front of your nose (as Bilbo might have said). As your eyes get accustomed to the darkness, you will realize one thing. Everything glows. Everything. There is a fine fuzzy layer of bioluminescent fungus covering dead leaves and the bark of trees, so you can almost make out the forest as if someone has traced it out in ghostly yellow-green outline. Little ghostly yellow-green mushroom-cap clusters are found here and there. And in certain places, much brighter spots are moving very slowly: millipedes crawling on vegetation. For years I’ve wondered about the functional utility of these different forms of bioluminescence. The mushroom caps? Surely it cannot be to attract insects? Or to warn off predators (after all, why visually call attention to yourself in the first place when nobody can see you)? My (admittedly, in those days, naive) literature searches yielded nothing. I speculated that the only explanation that made sense was that the bioluminescence was a side effect of some other metabolic process, and in a place where visual channels were not economical for predators to exploit, there was no cost to it. It seems that I was totally wrong! But that is the fantastically cool thing about science. It is precisely when you are wrong that you learn the most!