# The Traveler’s Restaurant Process — A Better Description of the Dirichlet Process for Partitioning Sets

[latexpage]

## I. “Have Any of These People Ever Been to a Chinese Restaurant?”

The Dirichlet process is a stochastic process that can be used to partition a set of elements into a set of subsets. In biological modeling, it is commonly used to assign elements into groups, such as molecular sequence sites into distinct rate categories. Very often, an intuitive explanation as to how it works invokes the “Chinese Restaurant Process” analogy. I have always found this analogy very jarring and confusing, as, while it is a good description of how the Dirichlet process works, it is a terrible description of how Chinese restaurants, or, indeed, any other type of restaurant run by (and patronized by) humans[1], works. As Dr. Emily Jane McTavish says, “Have any of these people ever been to a Chinese restaurant?” Indeed. In fact, one may wonder if any of them have been to any restaurant outside of a Kafkaesque performance-art themed one.

## II. The “Traveler’s Restaurant Process” Analogy

I believe a far more intuitive analogy might be given by the “Traveler’s Restaurant [Selection] Process”. A common dictum in traveler lore, guides, advice, and general wisdom is that when picking a restaurant for a meal to prefer establishments that appear to be more popular or frequented with locals. And so, we can imagine instead, describing how a group of N travellers distribute themselves among the various restaurants at a food court. As with the original analogy, we have the travelers entering the establishment one-by-one, with the food court initially empty. (Yes, this is still a weakness in the analogy that lends a surreal contrivance to the whole tale, but maybe this can be alleviated somewhat by imagining that it is (a) 1 am in the morning; (b) the travelers have just arrived and are suffering from jet lag; and (c) wander down to the food court after checking in their respective hotels/motels/pensions, and variances in processes times lead them to come in straggling individually. As for the food court open at 1 am — in a number of parts of the world, this is pretty commonplace, I assure you!). The first traveler finds an empty food court, and picks a restaurant at random. The next traveler enters the food court, and looking around, sees just one restaurant occupied. It is possible that she makes a beeline for that restaurant, but maybe she is feeling cranky or does not particular like the other traveler (or other travelers in general), and heads off to a different restaurant. And so on with the next traveler, and the next, and the next, until the last one, each of whom makes an individual decision whether to try a new, empty restaurant (if they are feeling particularly adventurous or misanthrophic or introverted or have squabbled with or are otherwise disdainful of everyone else) or else, at the other extreme, head for the most crowded restaurant (if they are sticking to the true and tried traveler’s restaurant selection dictum or are feeling particular sociable), or everything in between. At the end of it all we take the clusters of travelers as they have distributed themselves into individual restaurants as the partition of the original the set of travelers.

Makes (better) sense?

Maybe not!

## III. Working Through the Process

Emily also says that many times just walking through the model expressions and formulas often yields much better intuition than contrived and twisted analogies and metaphors. And maybe she is right!

So, let us take a set of $N$ elements: $t_1, t_2, …, t_N$. We can imagine each element to be a traveler from the “Traveler’s Restaurant Process” analogy, or vice versa, or anything else you prefer. Each element is assigned to a group in turn, creating new groups as needed. With no existing groups, the first element is automatically assigned to a new group. Well, technically, the first element is assigned to new group with probability given by:

\begin{align}
\frac{\alpha}{\alpha + n – 1},
\end{align}

where $\alpha$ is known as the concentration parameter, and $n$ is is the 1-based index of the element (i.e., the first element has $n=1$, the second has $n=2$, the third has $n=3$, and so on). The first element has $n=1$, and the above expression evaluates to 1.0, so is always assigned a new group. The more general case, applicable to all subsequent elements up to and including the last one, is that the $n^{\text{th}}$ element is:

• Either assigned a new group, $g_*$ with probability according to expression (1) above,
• Or assigned to one of the existing groups, $g_i$, with probability:
\begin{align}
\frac{|g_i|}{\alpha + n – 1},
\end{align}

where $\alpha$ is the concentration parameter and $n$ is is the 1-based index of the element, as previously described, and $|g_i|$ is the number of elements in group $g_i$.

## IV. How/Why Does it Work?

How do we know that each element will definitely be assigned to a group? We can check that the probabilities of either being assigned to a new group or existing group sum to one to reassure ourselves of this. Let us say that we are dealing with $n^{\text{th}}$ element out of $N$ elements, where $1 \leq n \leq N$. As described above, the probability that this element will be assigned to a new group is given by expression (1). On the other hand, the probability of being assigned to any one of the existing groups is given by:

\begin{align}
\sum_{i=1}^{M}\frac{|g_i|}{\alpha + n – 1}
\end{align}

where $M$ is the number of existing groups, $g_1, g_2, …, g_M$. Now, while we do not know in this general case exactly how the previous $n-1$ elements are distributed amongst the various $M$ existing groups, we do know that, regardless, there must be a total of $n-1$ elements across all groups. So, then, the above expression reduces to:

\begin{align}
\frac{\sum_{i=1}^{M} |g_i|}{\alpha + n – 1} = \frac{n-1}{\alpha + n – 1}.
\end{align}

So the probability of either being assigned to a new group or being assigned to an existing group is given by the sum of Expression (1) and (4), which is:

\begin{align}
\frac{\alpha}{\alpha + n – 1} + \frac{n-1}{\alpha + n – 1} &= \frac{\alpha + n – 1}{\alpha + n – 1} \\
&= 1.
\end{align}

And thus, we are assured of all elements being assigned to a group, whether a new one or an existing one, as well as being able to sleep at night knowing that the distribution of partitions under the Dirichlet process is a proper probability as it sums to 1.0.

## V. The (Anti-?)Concentration Parameter

We have mentioned and used the concentration parameter, $\alpha$, above, without actually explaining it. Basically, the concentration parameter, as its name implies, determines how “clumpy” the process is. Unfortunately, contrary to what its name implies, at low values the process is more “clumpy” — i.e., yielding partitions where elements tend to be grouped together more frequently, resulting in larger and correspondingly fewer subsets — while at high values the process is less “clumpy” — yielding partitions where elements tend to be more dispersed, resulting in smaller and correspondingly more subsets. Yeah, it really should have been called the “anti-concentration” or “dispersion” parameter. Or perhaps folks should just stick to using the more neutral and less evocative, but non-misleading, standard term: the “scaling” parameter.

(See updates here and here that clear up this issue! Basically, the “concentration” parameter gets its name not due to way it concentrates elements, as naive and incorrect intuition led me to think, but rather due to how increasing it leads to the distribution of values across the subsets converging on the base distribution! What values? What base distribution? EXACTLY! Those concepts do not really enter into any of the analogies we have discussed so far, or, indeed, even in the way we have explained the process with reference to the equations and model. Only when considering a version of the DP process where our elements are not just anonymous exchangeable atoms, but value-bearing elements that we want to cluster based on values, does the base distribution enter the picture, and only then does the “concentration” parameter do its concentrating the higher it gets!)

### VI. In Action

This Gist wraps up all this logic in a script that you can play with to get a feel for how different concentration parameters work.

With low values of the concentration parameter, we pretty much get all the elements packed into a single set:

# python dirichlet_partition.py -n100 -v0 -a 0.01
Mean number of subsets per partition: 1.04
Mean number of elements per subset: 9.8


On the other hand, with high values of the scaling parameter, we the trend is toward each element being in its own set, with nearly as many subsets in the partition as there are elements in the full set:

# python dirichlet_partition.py -n100 -v0 -a 100
Mean number of subsets per partition: 9.65
Mean number of elements per subset: 1.03944444444


And, of course, moderate values result in something in between:

# python dirichlet_partition.py -n100 -v0 -a 1
Mean number of subsets per partition: 3.03
Mean number of elements per subset: 3.92166666667

# python dirichlet_partition.py -n100 -v0 -a 5
Mean number of subsets per partition: 5.76
Mean number of elements per subset: 1.88055555556

# python dirichlet_partition.py -n100 -v0 -a 10
Mean number of subsets per partition: 7.01
Mean number of elements per subset: 1.49297619048