Hello, Science!:
Statistics

  • If pigs can whistle, then horses can fly... or... De Finetti's Theorem

    We talked about the concept of exchangeability a few weeks ago. I presented three urn models, two of which defined exchangeable probability distributions. The main idea I wanted to convey using the first and second urn models was that exchangeability is not just independence: urn model 2 does not look exchangeable at first sight but some simple algebra convinced us that it is.

    Today I want to continue this discussion with a theorem that is strongly related to exchangeability, called De Finetti's theorem. The theorem goes as follows: say we have an infinite sequence of binary random variables X_1, X_2, ... (say the colors of the balls in our urn model) and this sequence is exchangeable; then there exists a probability distribution P(θ) for a random variable θ taking values in [0, 1] such that

    p(X_1 = x_1, ..., X_n = x_n) = ∫ θ^k (1 - θ)^(n - k) dP(θ)    (where k = x_1 + ... + x_n)

    In other words, if our sequence of random variables is exchangeable, then it really is a sample from a mixture model with mixture distribution P(θ): first draw θ from P(θ), then draw each X_i independently as a Bernoulli(θ) variable.

    This kind of theorem reminds me of a quote I once heard in a complexity theory talk: if pigs can whistle, then horses can fly. What do I mean by this? Assuming that a set of random variables is exchangeable doesn't sound like a very big assumption: e.g. if you plan to cluster something like the MNIST digit recognition dataset, there is nothing a priori that distinguishes one image from another; in other words, you could assume they are exchangeable. However, once you make the exchangeability assumption, De Finetti's theorem suddenly says that your datapoints really are part of a hierarchical Bayesian model. In other words: by making the relatively innocent exchangeability assumption, De Finetti's theorem imposes the graphical model structure upon you. A weak assumption implies a rather strong consequence.

    There is a catch however: the theorem does not specify what kind of random variable θ is, nor what its distribution function is. As a matter of fact, θ could be an infinite-dimensional random variable (which would lead us into the realm of nonparametric Bayesian models such as the Dirichlet Process, but that's for another time). For simple models such as urn model 2 we can compute the distribution though: θ turns out to be a Beta distributed random variable.
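
    To make this concrete, here is a small numerical sketch (the urn composition r = 3, b = 2, the example sequence and the helper name urn_prob are my own choices, not from the post) checking that urn model 2 behaves exactly like a Beta mixture of coin flips:

    import math
    import numpy as np

    r, b = 3, 2            # initial number of red and blue balls (hypothetical)
    seq = [1, 0, 0, 1, 1]  # 1 = red, 0 = blue

    def urn_prob(seq, red, blue):
        """Probability of this exact colour sequence under urn model 2:
        after each draw the ball goes back plus one extra of the same colour."""
        p = 1.0
        for x in seq:
            p *= (red if x == 1 else blue) / (red + blue)
            if x == 1:
                red += 1
            else:
                blue += 1
        return p

    # De Finetti / mixture view: theta ~ Beta(r, b), then i.i.d. Bernoulli(theta) draws.
    # Approximate the integral of theta^k (1 - theta)^(n - k) against the Beta(r, b)
    # density with a simple Riemann sum on a grid.
    theta = np.linspace(1e-6, 1 - 1e-6, 200000)
    beta_norm = math.gamma(r) * math.gamma(b) / math.gamma(r + b)
    beta_density = theta**(r - 1) * (1 - theta)**(b - 1) / beta_norm
    k, n = sum(seq), len(seq)
    mixture_prob = np.sum(theta**k * (1 - theta)**(n - k) * beta_density) * (theta[1] - theta[0])

    print(urn_prob(seq, 3, 2))  # direct urn calculation, ~0.0238
    print(mixture_prob)         # Beta-mixture calculation; the two agree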

    All in all, De Finetti's theorem is an interesting theorem, but it is not clear to me how much practical value it has. There are many extensions and modifications of the theorem; for more information check

    • The Concept of Exchangeability - Jose Bernardo
    • Exchangeability and Related Topics - David Aldous

  • What is... Exchangeability?

    Talking about exchangeability, a friend once commented that exchangeability is "too simple to understand". On the one hand, it is true that the statement of exchangeability (see below) sounds somewhat trivial; on the other hand, I found that I had absolutely no intuition as to why it is important for machine learning. So after some reading, I present my take on the concept of exchangeability.

    What is Exchangeability?

    Scenario 1. Imagine we have an urn with r red balls and b blue balls. We draw 3 balls from the urn as follows: we pick a random ball, write down its color and put it back in the urn before drawing a new ball. We introduce 3 random variables: A, B, C which denote the color of the first, second and third ball. It is not hard to see that p(A=r, B=b, C=b) = p(A=b, B=r, C=b); in other words, we can exchange the values of the random variables without changing the joint probability. Intuitively, the reason we can exchange the observations is that our random variables are IID (independent and identically distributed).

    Scenario 2. We again pick 3 balls from an urn with r red and b blue balls. We still pick a random ball and note its color, but we put two balls of that color back in the urn. It may not be obvious that the sequence A=r, B=b, C=b has the same probability as the sequence A=b, B=b, C=r, since the individual probabilities of picking the red ball first or last are completely different: r/[r+b] when it is the first ball versus r/[r+b+2] when it is the last ball (since two blue balls were added in the meantime). Writing down the equations makes it clear that the two sequences are equi-probable:

    p(A=r, B=b, C=b) = r/[r+b] * b/[r+b+1] * (b+1)/[r+b+2]
    p(A=b, B=b, C=r) = b/[r+b] * (b+1)/[r+b+1] * r/[r+b+2]

    Both products equal r * b * (b+1) / ([r+b] * [r+b+1] * [r+b+2]).

    It is trivial to generalize this expression to longer sequences. Again, it doesn't matter in what order we pick the balls; the only thing that matters is how many red and how many blue balls we pick. This is reflected in the formula in the sense that the denominator of the probability of a sequence only depends on how long the sequence is. The numerator only needs to know how many balls of each color there are. In our example, it only needs to know that there is a first and second blue ball (contributing b * (b+1) to the numerator) and a first red ball (contributing r).
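
    Spelled out, the generalization reads as follows: a sequence containing k red and m blue draws, in any order, has probability

    [r * (r+1) * ... * (r+k-1)] * [b * (b+1) * ... * (b+m-1)] / ([r+b] * [r+b+1] * ... * [r+b+k+m-1]).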

    Scenario 3. Both examples above were exchangeable since reordering the values of the random variables didn't change the probability. Let us consider a similar setup where exchangeability does not apply anymore. We again use the urn scheme with r red balls and b blue balls. However, now when we pick a red ball we note its color and simply put it back, but when we pick a blue ball we note its color and put two back. It is easy to see that we cannot exchange the values of the random variables anymore, since

    p(A=r, B=b) = r/[r+b] * b/[r+b]

    while

    p(A=b, B=r) = b/[r+b] * r/[r+b+1].

    I think the following definition of exchangeability now becomes much more intuitive; we say a set of n random variables Y_1, ..., Y_n is exchangeable under a distribution p iff for any permutation pi of the integers 1..n

    p(Y_1 = y_1, ..., Y_n = y_n) = p(Y_1 = y_pi(1), ..., Y_n = y_pi(n)).
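
    As a quick sanity check of this definition, here is a small Python sketch (the urn composition r = 3, b = 2 and the helper name seq_prob are my own choices) that computes the probability of every ordering of one red and two blue draws under scenario 2 and under scenario 3:

    from itertools import permutations

    def seq_prob(seq, red, blue, add_red, add_blue):
        """Probability of drawing the colours in `seq` in that exact order.
        After a red draw we add `add_red` extra red balls to the urn,
        after a blue draw we add `add_blue` extra blue balls."""
        p = 1.0
        for c in seq:
            p *= (red if c == 'r' else blue) / (red + blue)
            if c == 'r':
                red += add_red
            else:
                blue += add_blue
        return p

    base = ('r', 'b', 'b')
    # Scenario 2: one extra ball of the drawn colour; scenario 3: an extra ball only after blue.
    for name, add_red, add_blue in [("scenario 2", 1, 1), ("scenario 3", 0, 1)]:
        probs = {perm: seq_prob(perm, 3, 2, add_red, add_blue)
                 for perm in set(permutations(base))}
        print(name, probs)

    # Scenario 2 prints the same probability for every ordering (exchangeable);
    # scenario 3 prints three different numbers (not exchangeable).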

    Properties of Exchangeability

    Let us now briefly discuss some consequences of exchangeability, as this will allow us to see why it is such an important concept. First, we compute the marginal probability of the second draw, p(B=r), under the different scenarios. Under scenario 1 this is trivial: just before the second draw the content of our urn is exactly as it was when we started, hence p(B=r) = r/(r+b). Under scenario 2, after some simple algebra we find that p(B=r) = p(A=b, B=r) + p(A=r, B=r) = r/(r+b). Now here is the exciting part: we shouldn't have done all the algebra; if we are convinced that the random variables are exchangeable under the distribution of scenario 2, we could have acted as if we were computing the marginal probability of the first draw. Formally, since p(A=b, B=r) = p(A=r, B=b), substituting this into the expression for p(B=r) lets us marginalize out the second draw, leaving p(A=r) = r/(r+b). This property - changing the order around - is incredibly useful when computing probabilities.
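
    Written out as one chain of equalities (the exchangeability step is the second one):

    p(B=r) = p(A=r, B=r) + p(A=b, B=r)
           = p(A=r, B=r) + p(A=r, B=b)
           = p(A=r)
           = r/(r+b)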

    More abstractly, here is one way to think of exchangeable sequences. In scenario 2, if a friend just drew a ball from the urn, didn't show it to us and put one extra ball of that color back in the urn, this is not going to make a difference to the probability of our next draw. However, in scenario 3 above, whether someone drew a ball before us is very important: it drastically changes the probabilities for our next draw. I think this is a very important distinction that sets exchangeable and non-exchangeable distributions apart.

    Although exchangeability and IID variables look very similar, they are not exactly the same. From scenario 1 above, it is easy to see that IID random variables are exchangeable. The converse is not true: in scenario 2, p(A=b, B=r) is not equal to p(A=b) p(B=r), and thus the random variables are not independent.

    Exchangeability and Machine Learning

    Exchangeable distributions are very common in machine learning. The most famous modelling assumption for text processing is just exchangeability: the bag of words model. This modelling assumption states that the probability of a text document depends only on word counts and not on word order. This is exactly the same model as scenario 1 above, except that instead of red and blue balls we now have words from a fixed vocabulary. Is this a realistic assumption, one may ask? It certainly is not! We don't expect natural language to be exchangeable: the probability of using the word "States" should certainly depend on the word in front of it (e.g. it is higher if that word is "United"). But who cares, the bag of words assumption works incredibly well...
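
    As a tiny illustration of what "depends only on word counts" means (a sketch with made-up sentences):

    from collections import Counter

    # Two "documents" with the same words in a different order get exactly
    # the same bag-of-words representation: word order is thrown away.
    doc1 = "the united states of america".split()
    doc2 = "america of the united states".split()

    print(Counter(doc1) == Counter(doc2))  # True: identical word counts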

    There are many other exchangeable distributions in common use in machine learning: the Dirichlet Process and its Chinese Restaurant Process equivalent are exchangeable distributions, and the Indian Buffet Process is an exchangeable distribution on binary matrices. Non-exchangeable distributions are also common: many Markov models (e.g. Hidden Markov Models) aren't exchangeable.

    I hope this little overview of exchangeability was useful. I left one important concept out of our discussion so far: De Finetti's theorem. This is a very important theorem that applies to exchangeable sequences and I will discuss it in a future post.

  • Is Bayesian Statistics Easier?

    Many people have debated the merits of Bayesian statistics versus Frequentist statistics: very recently there was this discussion in Bayesian Analysis by Gelman. It’s a fascinating discussion and I’m just learning the ins and outs of the different arguments myself.

    First, I want to point people to this (short) but very insightful article by Tony O’Hagan. There are many ways of looking at the difference between Bayesian and Frequentist methods and this article makes one particular (a bit philosophical) viewpoint very clear.

    I’m certainly not knowledgeable enough to take a stand in this debate, but there is one thing I do have a strong opinion about: Bayesian methods are much easier to understand! In most statistics 101 classes we are taught a bit of point estimation (e.g. estimating a mean, a variance), which I think is understandable, but then people get into hypothesis testing and suddenly have to reason about statements like the following (quoting Wikipedia here): “Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?”. I certainly find this a really complicated notion to reason about. The Bayesian alternative would say something like: “What is the probability of generating the observed value for this particular model?”. That is the same kind of statement as “What is the probability of generating a five with a fair die?”. I honestly think statistics would be a lot more understandable (and hence I think more enjoyable) if we taught Bayesian statistics before getting into more advanced frequentist methods. I really wonder whether there are schools out there where the undergrad stats curriculum is based on the Bayesian approach?

    EDIT - I fixed the link to O'Hagan's article.

  • Dirichlet Distributions and Entropy

    In between all the Netflix excitement I managed to read a paper that was mentioned by David Blei during his machine learning summer school talk: Entropy and Inference, Revisited by Ilya Nemenman, Fariel Shafee and William Bialek from NIPS 2002. This paper discusses some issues that arise when learning Dirichlet distributions. Since Dirichlet distributions are so common, I think these results should be more widely known.

    [Given that I’ve been reading a lot of natural language processing papers where Dirichlet distributions are quite common, I’m surprised I haven’t run into this work before.]

    First, a little bit of introduction on Dirichlet distributions. The Dirichlet distribution is a "distribution over distributions"; in other words, a draw from a Dirichlet distribution is a vector of positive real numbers that sum up to one. The Dirichlet distribution is parameterized by a vector of positive real numbers which captures the mean and variance of the Dirichlet distribution. It is often very natural to work with a slightly constrained version called the symmetric Dirichlet distribution: in this case all entries of the parameter vector are the same number. This implies that the mean of the Dirichlet is the uniform distribution, and the variance is captured by the magnitude of the parameter vector. Let us denote with beta the parameter of the symmetric Dirichlet. Then, when beta is small, samples from the Dirichlet will have high variance, while when beta is large, samples from the Dirichlet will have small variance. The plot below illustrates this idea for a Dirichlet with 1000 dimensions: the top plot has a very small beta and hence a draw from this Dirichlet has only a few non-negligible entries (hence high variance), while for the Dirichlet with beta = 1 all entries of the sample have roughly the same magnitude (about 0.001).

    [Figure: draws from a 1000-dimensional symmetric Dirichlet for a very small beta (top) and for beta = 1 (bottom).]

    Another way to approach the effect of beta is to look at the entropy of a sample from the Dirichlet distribution, denoted by S in the images below. The entropy of a Dirichlet draw is high when beta is large. More specifically, the entropy is upper bounded by ln(D), where D is the dimensionality of the Dirichlet: as beta approaches infinity, the Dirichlet distribution concentrates on the completely uniform discrete distribution, whose entropy is ln(D). When beta approaches 0, a draw from the Dirichlet approaches a delta peak on a random entry, which is a distribution with entropy 0. The key problem the authors want to address is that when learning an unknown distribution with a Dirichlet prior, beta pretty much fixes the allowed shapes, while we might not have a good reason a priori to believe that what we want to learn looks like either of these extremes.

    The way the authors try to give insight into this problem is by computing the entropy of a random draw from a Dirichlet distribution. In equations, if we denote with S[q] = -sum_i q_i ln(q_i) the entropy of a draw q, then S is a random variable with distribution

    P(S | beta) = ∫ dq P_beta(q) δ(S - S[q]),

    where P_beta(q) is the symmetric Dirichlet density with parameter beta.

    Computing the full distribution is hard, but the authors give a method to compute its mean and variance. The following picture shows the mean and variance of the entropy for draws from a Dirichlet distribution. A bit of notation: K is the dimensionality of the Dirichlet distribution, xi is the mean entropy as a function of beta, and sigma is the variance of the entropy as a function of beta.

    [Figure: the mean entropy xi(beta) and the entropy variance sigma(beta) of Dirichlet draws, for several dimensionalities K.]

    As you can see from this plot, the distribution of the entropy of Dirichlet draws is extremely peaked for even moderately large K. The authors give a detailed analysis of what this implies, but the main take-away message is this: as you change beta, the expected entropy of the implied Dirichlet draws varies smoothly; however, because the distribution of the entropy around that expected value is so sharply peaked, the a priori choice of beta almost completely fixes the entropy.
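
    This concentration is easy to see in a quick simulation (a sketch; the dimensionality D = 1000, the beta grid and the 500 samples per beta are my own choices, not the paper's):

    import numpy as np

    D, n_samples = 1000, 500
    rng = np.random.default_rng(0)

    def entropy(q):
        q = q[q > 0]                   # ignore numerically zero entries
        return -np.sum(q * np.log(q))

    for beta in [0.01, 0.1, 1.0, 10.0]:
        draws = rng.dirichlet([beta] * D, size=n_samples)
        ents = np.array([entropy(q) for q in draws])
        print(f"beta={beta:6.2f}  mean entropy={ents.mean():6.3f}  "
              f"std={ents.std():.4f}  (ln D = {np.log(D):.3f})")

    # Small beta gives low-entropy, spiky draws; large beta pushes the entropy
    # towards ln(D). In every case the standard deviation is tiny compared to the
    # mean, which is exactly the "beta fixes the entropy" effect described above.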

    This is problematic, as it means that unless our distribution is sampled almost completely, the estimate of the entropy is dominated by the choice of the prior parameter beta. So how can we fix this? The authors suggest a scheme which I don't completely understand, but which boils down to a mixture of Dirichlet distributions, obtained by specifying a prior distribution on beta as well.

    This mixture idea ties in with something we did in our EMNLP 09 paper: when we were training our part-of-speech tagger, we had to choose a prior for the distribution which specifies what words are generated for a particular part-of-speech tag. We know that we have part-of-speech tag classes that generate very few words (e.g. determiners, attributes, …) and a few classes that generate a lot of words (e.g. nouns, adjectives, …). At first, we chose a simple Dirichlet distribution (with fixed beta) as our prior and although the results were reasonable, we did run into the effect explained above: if we set beta to be large, we got very few states in the iHMM, each of which outputs a lot of words. This is good for capturing nouns and verbs but not for the other classes. Conversely, when we chose a small beta we got a lot of states in the iHMM, each generating only a few words. Our next idea was to put a Gamma prior on beta; this helped a bit, but it still assumed that there is only one value of beta (or one type of entropy distribution) that we want to learn. This is again unreasonable in the context of natural language. Finally, we chose to put a Dirichlet process prior on beta (with a Gamma base measure). This essentially allows different states in the iHMM to have different beta's (but we only expect to see a few distinct beta's).

    “Entropy and Inference, Revisited” is one of those papers with a lot of intuitive explanations; hopefully it helps you make the right choice for priors in your next paper as well.

  • Stats Clinic

    We are starting a new machine learning related initiative in Cambridge: the walk-in Stats Clinic. The goal of our consulting service is to help applied researchers with their questions in statistics or machine learning. Starting October 21 from 16:30-18:00, we will be in meeting room 5 at the Centre for Mathematical Sciences every other week.

  • To R or not to R

    At the moment 90% of the research in our lab is done using Matlab. Some of us have tried Python but were disappointed by how hard it was to get decent (read: native) numerical routines on par with Matlab's. I've been developing on the .NET platform, with a little bit of Java development (for Amazon's Elastic MapReduce) on the side. We've had a few discussions in our lab recently about switching to R for our machine learning research. A decision hasn't been made and we're wondering what the larger machine learning community thinks about this issue. Here's a list of pros and cons that I've come up with:

    Pros:

    • great support for statistical functions
    • great support from the statistical community
    • plotting as good as (if not better than) Matlab's
    • reasonably fast (check out Dave's benchmarking)
    • free (very important!): if we want to run a job on 100 machines (e.g. in the cloud), I believe you currently need a Matlab licence for each one

    Cons:

    • prehistoric interface that doesn't compare to modern IDEs
    • debugging doesn't seem up to Matlab's standards
    • not much support in the machine learning community (?)
    • learning curve

    The jury is still out on this one... I really wonder how many machine learners out there already use R?