Hello, Science!: iHMM

  • ICML & UAI 2008 Accepted Papers

    The accepted papers for ICML 2008 and UAI 2008 are online!

    As one of my own papers got accepted, I am preparing a post with some more background information and maybe some extra plots and pictures.

  • Dirichlet Distributions and Entropy

    In between all the Netflix excitement I managed to read a paper that was mentioned by David Blei during his machine learning summer school talk: Entropy and Inference, Revisited by Ilya Nemenman, Fariel Shafee and William Bialek from NIPS 2002. This paper discusses some issues that arise when learning Dirichlet distributions. Since Dirichlet distributions are so common, I think these results should be more widely known.

    [Given that I’ve been reading a lot of natural language processing papers where Dirichlet distributions are quite common, I’m surprised I haven’t run into this work before.]

    First, a little bit of introduction on Dirichlet distributions. The Dirichlet distribution is a “distribution over distributions”; in other words, a draw from a Dirichlet distribution is a vector of positive real numbers that sum to one. The Dirichlet distribution is parameterized by a vector of positive real numbers which determines its mean and variance. It is often very natural to work with a slightly constrained version called the symmetric Dirichlet distribution: in this case all entries of the parameter vector are the same number. This implies that the mean of the Dirichlet is the uniform distribution, while the variance is controlled by the magnitude of the parameter vector. Let us denote the parameter of the symmetric Dirichlet by beta. When beta is small, samples from the Dirichlet will have high variance, while when beta is large, samples will have small variance. The plot below illustrates this idea for a Dirichlet with 1000 dimensions: in the top plot beta is very small and hence a draw from this Dirichlet has only a few nonzero entries (high variance), while in the plot with beta = 1 all entries of the sample have roughly the same magnitude (about 0.001).

    [Figure: two draws from a 1000-dimensional symmetric Dirichlet, one with a very small beta (a few large entries) and one with beta = 1 (all entries around 0.001).]
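    To get a feel for this effect yourself, here is a minimal sketch (Python/numpy; the dimensionality and the two beta values are arbitrary choices of mine) that draws from a symmetric Dirichlet and counts how many entries carry noticeable mass:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    D = 1000  # dimensionality of the Dirichlet, matching the plot above

    for beta in [0.01, 1.0]:
        # One draw from a symmetric Dirichlet with concentration beta.
        q = rng.dirichlet(np.full(D, beta))
        # Count the entries that exceed the uniform value 1/D.
        big = np.sum(q > 1.0 / D)
        print(f"beta = {beta}: {big} entries above 1/D, largest entry = {q.max():.4f}")
    ```

    For a small beta only a handful of entries exceed 1/D, while for beta = 1 a large fraction of them do and no single entry dominates.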

    Another way to look at the effect of beta is through the entropy of a sample from the Dirichlet distribution, denoted by S in the figures below. The entropy of a Dirichlet draw is high when beta is large. More specifically, the entropy is upper bounded by ln(D), where D is the dimensionality of the Dirichlet: as beta approaches infinity, the Dirichlet distribution concentrates on the completely uniform discrete distribution, whose entropy is exactly ln(D). When beta approaches 0, a draw from the Dirichlet approaches a delta peak on a random entry, which is a distribution with entropy 0. The key problem the authors want to address is that when learning an unknown distribution with a Dirichlet prior, beta pretty much fixes the allowed shapes, while a priori we might not have a good reason to believe that the distribution we want to learn looks like either of these extremes.

    The way the authors try to give insight into this problem is by computing the entropy of a random draw of a Dirichlet distribution. In equations, if we denote with S the entropy, it will be a random variable with distribution

    $P(S \mid \beta) = \int \delta\!\left(S - S[q]\right)\, P_\beta(q)\, dq, \qquad \text{where } S[q] = -\sum_{i=1}^{D} q_i \ln q_i .$

    Computing the full distribution is hard, but the authors give a method to compute its mean and variance. The following picture shows these quantities for draws from a Dirichlet distribution. A bit of notation: K is the dimensionality of the Dirichlet distribution, xi is the mean entropy as a function of beta, and sigma is the standard deviation of the entropy as a function of beta.

    [Figure: the mean entropy xi and the standard deviation sigma of the entropy of Dirichlet draws, as a function of beta, for dimensionality K.]

    As you can see from this plot, the distribution of the entropy of Dirichlet draws is extremely peaked even for moderately large K. The authors give a detailed analysis of what this implies, but the main take-away message is this: as you change beta, the entropy of the implied Dirichlet draws varies smoothly; however, because the distribution of the entropy is so sharply peaked around its mean, the a priori choice of beta almost completely fixes the entropy of the draws.
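    To see how dramatic this concentration is, here is a quick Monte Carlo check (my own sanity check, not the authors’ analytic formulas; the grid of betas and the sample size are arbitrary):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    K = 1000          # dimensionality of the Dirichlet
    n_samples = 500   # Monte Carlo draws per value of beta

    for beta in [0.01, 0.1, 1.0, 10.0]:
        q = rng.dirichlet(np.full(K, beta), size=n_samples)
        # Entropy of each draw, using the convention 0 * ln 0 = 0.
        S = -np.sum(q * np.log(np.where(q > 0, q, 1.0)), axis=1)
        print(f"beta = {beta:>5}: mean entropy = {S.mean():.3f}, "
              f"std = {S.std():.4f}  (ln K = {np.log(K):.3f})")
    ```

    The standard deviation of the entropy comes out tiny compared to ln K for every beta, which is exactly the peakedness that makes the prior choice of beta so influential.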

    This peakedness is problematic: unless our distribution is sampled almost completely, the estimate of its entropy is dominated by the choice of the prior beta. So how can we fix this? The authors suggest a scheme which I don’t completely understand, but which boils down to a mixture of Dirichlet distributions, obtained by putting a prior distribution on beta as well.

    This mixture idea ties in with something we did in our EMNLP 09 paper: when training our part-of-speech tagger we had to choose a prior for the distribution which specifies what words are generated by a particular part-of-speech tag. We know that some part-of-speech classes generate very few words (e.g. determiners, attributes, …) while a few classes generate a lot of words (e.g. nouns, adjectives, …). At first we chose a simple Dirichlet distribution (with fixed beta) as our prior, and although the results were reasonable, we ran into exactly the effect explained above: if we set beta to be large, we got very few states in the iHMM, with each state outputting a lot of words. This is good for capturing nouns and verbs but not for other classes. Conversely, when we chose a small beta we got a lot of states in the iHMM, each generating only a few words. Our next idea was to put a Gamma prior on beta; this helped a bit, but it still assumed that there is only one value of beta (or one type of entropy distribution) to learn, which is again unreasonable in the context of natural language. Finally, we put a Dirichlet process prior on beta (with a Gamma base measure). This essentially allows different states in the iHMM to have different betas, while we still expect to see only a few distinct beta values.
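    For completeness, here is a minimal sketch of just that last prior (not our actual sampler; the DP concentration and the Gamma hyperparameters are placeholder values): each iHMM state gets a beta drawn from a Dirichlet process with a Gamma base measure, which we can simulate with the Chinese restaurant process so that states tend to reuse a few distinct beta values.

    ```python
    import numpy as np

    def sample_state_betas(n_states, alpha=1.0, gamma_shape=1.0, gamma_scale=1.0, rng=None):
        """Draw one beta per state from a DP(alpha, Gamma) prior via the
        Chinese restaurant process."""
        rng = rng if rng is not None else np.random.default_rng()
        betas, counts = [], []          # distinct beta values and their usage counts
        assignment = np.empty(n_states, dtype=int)
        for s in range(n_states):
            # Join an existing beta with probability proportional to its count,
            # or draw a fresh beta from the Gamma base measure w.p. prop. to alpha.
            probs = np.array(counts + [alpha], dtype=float)
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(betas):
                betas.append(rng.gamma(gamma_shape, gamma_scale))
                counts.append(1)
            else:
                counts[k] += 1
            assignment[s] = k
        return np.array(betas)[assignment]

    print(sample_state_betas(20, rng=np.random.default_rng(2)))
    ```

    Running this for 20 states typically gives only a handful of distinct betas, which is exactly the behaviour we wanted: big-vocabulary and small-vocabulary classes can live under different betas without us fixing the number of betas in advance.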

    “Entropy and Inference, Revisited” is one of those papers with a lot of intuitive explanations; hopefully it helps you make the right choice for priors in your next paper as well.

  • The Infinite Hidden Markov Model

    At ICML two weeks ago I presented some of our work on the infinite hidden Markov model (also known as the iHMM or HDP-HMM). Together with a result from Emily Fox, I believe we have come full circle and it is time for a little summary.

    Hidden Markov models (HMMs) have been the workhorse of the discrete-time, discrete-state time series community for over 40 years. We have seen applications in speech, robotics, natural language processing, vision, and so on. Sometimes domain knowledge can be used to choose the dimensionality of the latent state space, but very frequently there is no reason to prefer one dimensionality over another. Moreover, maybe we explicitly want to model the uncertainty in the dimensionality of the latent state space! Hence variational Bayes, BIC, or other model selection techniques do not apply, and one option is to turn to infinite capacity models.

    In 2002, Matt Beal, Zoubin Ghahramani and Carl Rasmussen introduced a first version of an infinite capacity model which they called the iHMM. Their proposal is a procedure based on a two-level hierarchy of urns which generates a potentially infinite transition matrix. Although it is an extremely elegant proposal, it turned out to be pretty hard to come up with a correct and efficient sampling scheme to learn this model. The model has three parameters: one to control the number of states, one to control the sparsity of the transition matrix and one to control the self-transition probability.

    Then, in 2006, Yee Whye Teh, Mike Jordan, Matt Beal and Dave Blei introduced the Hierarchical Dirichlet Process (HDP). The HDP is a very natural prior for building hierarchies of clusters: in other words, for clustering datapoints in different groups so that clusters are shared between groups. Although the HDP is technically a bit involved, it is theoretically well founded, and the 2006 paper introduced a relatively efficient Gibbs sampling scheme. The same paper also built an infinite capacity hidden Markov model on top of the HDP. This model is very close to the original iHMM but has only two parameters: the self-transition control was left out. That is unfortunate, because for many applications of HMMs (e.g. changepoint detection) we know a priori that the chain tends to stay in the same state for long stretches. Finally, this "version 2.0" of the iHMM still used only a Gibbs sampling scheme for inference, which is also unfortunate as we have come to understand that Gibbs sampling can perform poorly in time series models.

    Enter ICML 2008: the first iHMM-related result was a paper by Emily Fox, Erik Sudderth, Michael Jordan, and Alan Willsky called "An HDP-HMM for Systems with State Persistence". In this paper, the "version 2.0" iHMM model was extended to include the self-transition parameter again. As you can read in their paper, this makes a very big difference: combined with some other neat ideas, they were able to present some impressive performance on speaker diarization. Next was our (Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani) paper called "Beam Sampling for the Infinite Hidden Markov Model". The main idea in our work is to use a slice sampling scheme to adaptively truncate the infinite model to a finite model. This then allows us to run a dynamic program and resample the whole state sequence at once. Although we've only presented the beam sampler in the context of the iHMM, the technique is applicable to most nonparametric constructions I know of. One thing I hope will happen is that someone will use the slice sampling technique to cut nonparametric models down to finite models and then apply our best inference machinery (variational, BP, EP, ...) to the finite model. This would allow us to leverage all the knowledge we've built up for finite graphical models and use it in the nonparametric setting.
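    To make the core idea a bit more concrete, here is a minimal sketch of the slice-truncated forward-filtering backward-sampling sweep (a simplification of the beam sampler: I assume the transition matrix has already been instantiated for K states, and I leave out the step where the HDP representation grows whenever the slice variables allow previously unrepresented states):

    ```python
    import numpy as np

    def beam_ffbs(pi0, pi, like, s_old, rng):
        """One beam-sampling sweep for a single sequence.
        pi0:   initial state distribution, shape (K,)
        pi:    transition matrix, shape (K, K)
        like:  observation likelihoods p(y_t | s_t), shape (T, K)
        s_old: current state sequence, used to sample the slice variables
        """
        T, K = like.shape
        # Slice variables: one uniform per transition in the current sequence.
        u = np.empty(T)
        u[0] = rng.uniform(0.0, pi0[s_old[0]])
        for t in range(1, T):
            u[t] = rng.uniform(0.0, pi[s_old[t - 1], s_old[t]])
        # Forward filtering: each transition probability is replaced by the
        # indicator that it exceeds the slice variable, which truncates the
        # (potentially infinite) model to finitely many reachable states.
        alpha = np.zeros((T, K))
        alpha[0] = like[0] * (pi0 > u[0])
        alpha[0] /= alpha[0].sum()
        for t in range(1, T):
            alpha[t] = like[t] * (alpha[t - 1] @ (pi > u[t]))
            alpha[t] /= alpha[t].sum()
        # Backward sampling: draw a whole new state sequence in one pass.
        s = np.empty(T, dtype=int)
        s[-1] = rng.choice(K, p=alpha[-1])
        for t in range(T - 2, -1, -1):
            w = alpha[t] * (pi[:, s[t + 1]] > u[t + 1])
            s[t] = rng.choice(K, p=w / w.sum())
        return s
    ```

    In the full sampler this sweep alternates with resampling the transition matrix, the emission parameters and the hyperparameters, and the number of instantiated states grows or shrinks as the slice variables require.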

    Stepping back and taking a bird's eye view on things: I think that after these four iHMM papers we now have a strong platform to start using the iHMM as a building block in more complicated models. At the NPBayes workshop we've already seen some cool results by Emily, who uses the iHMM to switch between different auto-regressive processes and linear dynamical systems. Another trivial extension which we've implemented is the IO-iHMM: condition the iHMM on a certain input sequence. Finale and I are currently experimenting with this setup for learning infinite POMDPs; more about that in a future post. I've been promised that more interesting extensions are in the making... let's see what NIPS brings!

    As I have to write up a first-year report (Cambridge's idea of a thesis proposal), I have been thinking very hard about a short, self-contained and easy explanation of the iHMM: a description of the model that everyone can use and implement without having to work through the whole HDP formalism. I will definitely post the outcome on my webpage early September. At least one cool spin-off has come out of this report: based on a suggestion by Radford Neal, we've come up with a different dynamic programming sampling scheme built on the embedded HMM. As it is more incremental work, I will probably only describe it in my first-year report and on this blog, so stay tuned...

  • Parallel Random Number Generators

    We recently got a paper accepted at EMNLP where we elaborate on a point that was touched upon in Mark Johnson’s “Why doesn’t EM find good HMM POS-taggers?”: if we are going to do completely unsupervised POS-tagging, how can non-parametric Bayesian methods help us? There were two parts to our paper: extending the iHMM so it can handle a medium-sized text corpus, and an evaluation of how well the iHMM states correspond to POS tags. I’ll blog about our findings later, but one thing that came up during this research was the following: because of the size (1,000,000 datapoints) of the corpus we used (the Wall Street Journal section of the Penn Treebank), we wanted to parallelize the forward-filtering backward-sampling (FFBS) part of the inference. Because all the sentences are independent given the transition matrix and output parameters, we can easily run the FFBS for different sentences in parallel.
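    In code, the parallel structure is embarrassingly simple. Here is a sketch in Python with multiprocessing rather than the .NET threads we actually used; `ffbs_sentence` is just a placeholder for whatever single-sentence FFBS routine you have:

    ```python
    import numpy as np
    from multiprocessing import Pool

    def ffbs_sentence(args):
        """Placeholder: resample the hidden states of one sentence given the
        current transition matrix and emission parameters."""
        sentence, pi, emission, seed = args
        rng = np.random.default_rng(seed)
        # ... forward filter over the sentence, then backward sample ...
        return rng.integers(0, pi.shape[0], size=len(sentence))  # dummy states

    def resample_all(sentences, pi, emission, n_workers=4):
        # Given pi and the emission parameters, the sentences are conditionally
        # independent, so each FFBS sweep can run in its own worker process.
        # How to pick the per-worker seeds well is exactly the question below.
        jobs = [(s, pi, emission, i) for i, s in enumerate(sentences)]
        with Pool(n_workers) as pool:
            return pool.map(ffbs_sentence, jobs)

    if __name__ == "__main__":
        pi = np.full((5, 5), 0.2)  # toy 5-state transition matrix
        sentences = [["the", "dog", "barks"], ["a", "cat", "sleeps", "here"]]
        print(resample_all(sentences, pi, emission=None))
    ```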

    One thing I started thinking about is the following: imagine I run on a machine with 4 cores and I start a thread for each core to process a chunk of the data. Since the FFBS is a sampling algorithm, each of these 4 threads needs to access a random number generator (RNG). I know of two ways to go about this:

    1. We use one RNG and access it from the 4 threads. Since the RNGs I use are stateful, we need to lock the RNG when a thread queries it so its internal state doesn’t get messed up. Locking an RNG can incur a large overhead and easily becomes the bottleneck of the parallelization.
    2. Use one RNG for each thread. How do we initialize them though? I use the .NET System.Random RNG, which uses a time-dependent seed. I don’t think this is a good idea: the resolution of this time-dependent seed is too coarse, and since my 4 threads start at almost the same time, they might all end up using the same seed. An alternative is to initialize a global RNG to produce the seeds for the thread-local RNGs.

    The latter alternative in (2) is what we used, but it does worry me a little bit: how can I know that the streams of random numbers in different threads are going to be ‘good enough’ (whatever that means)? As a non-expert on RNGs I’d like to know whether I’m going to get into trouble with these kinds of approaches. Matlab has a way to generate 3 statistically independent streams using the same RNG with the same seed. This seems like a step in the right direction, but it doesn’t solve my quad-core problem, let alone the case where we run our algorithm on many more cores in Amazon Elastic MapReduce …
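    For what it is worth, here is how the seeding problem can be handled today in Python with numpy’s SeedSequence machinery; this illustrates the idea of spawning statistically independent child streams from one root seed, and is not something the .NET System.Random API offers:

    ```python
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def worker(rng, n):
        # Each worker owns its own generator, so no locking is needed.
        return rng.standard_normal(n).sum()

    # One root seed; spawn() derives independent child seed sequences, so four
    # workers that start at exactly the same instant never share a seed.
    root = np.random.SeedSequence(12345)
    rngs = [np.random.default_rng(child) for child in root.spawn(4)]

    with ThreadPoolExecutor(max_workers=4) as ex:
        print(list(ex.map(worker, rngs, [1000] * 4)))
    ```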

    Any creative solutions out there?