Hello, Science! [Search results for Machine Learning]

  • Graduate School in Cambridge

    Someone mails me to ask what it is like to do a PhD in machine learning in Cambridge. With my first year of graduate school behind me I have a thing or two to say about the program here. So to all aspiring Cambridge graduate students: listen up!

    Overall I think Cambridge is an amazing place to be. It's like Disneyland for academics. There is so much going on that every day I am disappointed about the talks I had to miss because another one was happening at the same time. There are so many machine learning people around that I haven't even met them all. The ones I meet with frequently are

    • David MacKay's group
    • Microsoft Research Cambridge
    • the Computer Lab
    • the Statistics Laboratory
    • the Engineering Department
    • the European Bioinformatics Institute

    Needless to say there is plenty of opportunity to talk to people about machine learning. Each group has a different emphasis, and if you're interested in applying you might want to evaluate first what kind of machine learning you want to do. If pure machine learning or neuroscience is your thing, you'd probably fit in well with David's or our group. If you are more into applications, the other engineering or Computer Lab groups would be a better fit. If learning theory is your thing, you'll also find people in the Computer Lab who do this kind of stuff. If you're more into statistical theory, the Statistics Laboratory would probably suit you well. Finally, if you don't mind a 45-minute train ride to London, you will find yourself having access to the Gatsby Unit, the UCL machine learning guys and probably many others I haven't met (Imperial College,... ?)

    Life in Cambridge is good too; the colleges will provide you with a place to stay, friendly people to have dinner with, social events and tons of sports to choose from. Cambridge is also very friendly towards foreigners and I've noticed that in grad school probably more than 70% of the people are internationals.

    As far as I am concerned, there are two big turn-offs in Cambridge. The first is the application process itself. The whole thing is completely outdated and you will find yourself being accepted by a professor in February but only getting official letters about it in June or so. The whole process is slow, and when you get all your US offers mid-April you will find yourself completely in the dark about your Cambridge status. I also found funding much more fragmented than in the US; a lot of people get funding offers only a few weeks before the academic year starts (we're talking September here). The other thing is graduate-level classes: Cambridge does not have graduate-level classes at the level that a big American university would organize them. Given my interest in machine learning I've been to a couple of statistics classes in the Part III maths program, which I found to be comparable to the lower-level graduate classes in the US. So my advice for anyone who hasn't taken any graduate-level machine learning classes is to first go to the US for a one- or two-year Master's program, or check out the programs at UCL or Edinburgh, and then apply to Cambridge.

  • London Machine Learning Meetup... and joining a startup.

    Three months ago I decided to end my employment at Microsoft. I had a great time, met loads of interesting people and will definitely miss being in the Cambridge research lab (as well as the occasional trip to Seattle).
    Next up I am joining a startup company called Rangespan. I met the founders of Rangespan at Silicon Milkroundabout in April this year. When I got talking to them it was immediately obvious there was a hell of a lot of machine learning to be done there.
    Rangespan is essentially an e-commerce platform that organizes a marketplace for suppliers to sell products to big retailers. On one hand Rangespan is trying to build the biggest catalog in the world. We're doing this by fusing many different information sources (supplier feeds, manufacturer feeds, crowdsourced data, crawled data). The three big learning problems in this space are:

    1. Matching: out of millions of product descriptions (semi-structured & textual), how do you recognize which descriptions talk about the same product? Record linkage is the keyword here, but making it work at large scale and within very tight precision and recall margins is a challenge.
    2. Reconciliation: when you have 100 different descriptions of a product and you want to construct the best possible description, how do you do that? I think the speech recognition and machine translation communities have a few neat tricks up their sleeves to be applied here: in the abstract, if we think of each of the 100 descriptions as the output of a noisy channel with the unknown ground-truth description as the input, the challenge is to invent a prior and a model for the noisy channel. Machine learning baby!
    3. Information extraction: the data we are gathering is only semi-structured and most often marketing text. The question now becomes: how do we figure out from a USB stick description that the product has a "4GB" capacity? How do we figure out that "4GB" is the same as "four gigabytes", and so on? There's loads of great work on information extraction which we can build on; the good news is that our domain is restricted (physical goods), the bad news is that the domain is quite large (thousands of categories, thousands of attributes).
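    To make the attribute normalization problem concrete, here is a toy sketch (the patterns and vocabulary are made up for illustration; the real problem spans thousands of attributes and far messier text):

```python
import re

# Hypothetical normalizer: maps free-text capacity mentions to a
# canonical (value, unit) pair. Patterns are illustrative only.
WORD_NUMBERS = {"one": 1, "two": 2, "four": 4, "eight": 8, "sixteen": 16}

def normalize_capacity(text):
    """Extract a storage capacity like '4GB' or 'four gigabytes'."""
    text = text.lower()
    # Numeric form: "4GB", "4 gb", "4 gigabytes"
    m = re.search(r"(\d+)\s*(gb|gigabytes?)", text)
    if m:
        return (int(m.group(1)), "GB")
    # Spelled-out form: "four gigabytes"
    m = re.search(r"([a-z]+)\s+(gb|gigabytes?)", text)
    if m and m.group(1) in WORD_NUMBERS:
        return (WORD_NUMBERS[m.group(1)], "GB")
    return None

print(normalize_capacity("SuperFast USB stick, 4GB, black"))
print(normalize_capacity("four gigabytes of storage"))
```

    Both calls map to the same canonical value (4, "GB"), which is exactly the kind of canonicalization the extraction pipeline needs before matching and reconciliation can work.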
    Besides the catalog, Rangespan is also building an order pipeline. One crucial component of that pipeline is the system that assigns retailers' orders to suppliers. The idea here is that for each product we have a pretty good view of the supply curve (which supplier sells how many items at what price). We also have a pretty good view of the demand curve: by aggregating many different retailers we can predict what the demand for a product will be. This allows us to make very quantitative decisions about how to choose the price for that product to optimize metrics like user experience. In this area a mix of statistics (time series analysis), decision theory, econometrics and optimization are crucial ingredients... together with a ton of data, of course.
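    A minimal sketch of the order assignment idea (numbers are made up, and the real system optimizes much richer metrics than unit price): given a supply curve for one product, fill a retailer's order from the cheapest offers first.

```python
def assign_order(supply, quantity):
    """Fill an order of `quantity` units from a list of supplier
    offers (supplier, unit_price, stock), cheapest first.
    Returns a list of (supplier, units) allocations."""
    allocations = []
    for supplier, price, stock in sorted(supply, key=lambda o: o[1]):
        if quantity == 0:
            break
        take = min(stock, quantity)
        allocations.append((supplier, take))
        quantity -= take
    return allocations

# A hypothetical supply curve for one product.
supply = [("acme", 9.99, 5), ("globex", 8.49, 3), ("initech", 10.50, 20)]
print(assign_order(supply, 10))  # [('globex', 3), ('acme', 5), ('initech', 2)]
```

    In practice the decision-theoretic machinery sits on top of this: predicted demand and the supply curve together determine both the price offered to retailers and the split across suppliers.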
    If this all sounds like music to your ears, then let us know. We're hiring!
    Joining a startup also means I am now working in London. One of the coolest things I discovered about being in London is the great number of meetups people are organizing, from MongoDB office hours (thank you for all the help 10gen!) to the Hadoop and Big Data meetups. Although big data is crucial at Rangespan, we think clever algorithms (as in machine learning) are just as crucial for building successful insights from data. As such we are kicking off the Machine Learning Meetup as of January 2012. We'll put some fun talks together with the London machine learning community and probably do some study groups on the Stanford online learning courses.
    If you live close by and haven't yet signed up, you can find all meetup details here.

  • Machine Learning lecture series on videolectures.net

    Our machine learning group organizes a series called the Advanced Tutorial Lecture Series on Machine Learning where we get to see some amazing researchers talk about recent developments in Machine Learning. Karsten, a postdoc in our group, has been so kind as to film all the lectures, and slowly but surely these videos are finding their way onto videolectures.net.

    Currently on videolectures.net are videos on:

    • Applications of Group Theory to Machine Learning by Risi Kondor; this math-heavy lecture was cool in that Risi explained how tools from abstract algebra and harmonic analysis can be used in Machine Learning too.
    • Spectral Clustering by Arik Azran; our local spectral methods expert shared his insight into how spectral methods work.
    Coming up soon is a lecture by David MacKay on error correcting codes and how they can be decoded using local message passing algorithms. David is a very good speaker and his lecture is certainly in my top 5 of excellent machine learning talks.

  • UCI Website Revamped

    First of all, happy new 2008 to all readers!

    UCI hosts a famous collection of datasets at the UCI Machine Learning Repository. Recently, they have completely updated their webpage and are starting to offer new datasets. This is a great service to the machine learning community, but I would like to see us take this one step further: we should match this repository of datasets with a repository of algorithms. This would not only allow us to compare algorithms, but also give us a lot of intuition about the nature of the datasets: a well-understood algorithm that does great on - say - digit classification but performs really poorly on the Wisconsin breast cancer dataset teaches us something about the nature of the data. A recent JMLR paper* calling for more open source machine learning software mentions a project at the University of Toronto called Delve that was meant to do exactly this. Unfortunately, the project seems to be dead as of 2003 or so.

    * The Need for Open Source Software in Machine Learning - Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson; Journal of Machine Learning Research; 8(Oct):2443--2466, 2007.

  • Silicon Minds

    The guys at Microsoft Research announced a very exciting competition: the Silicon Minds challenge. The goal of the competition is to foster novel ideas in the area of game AI.

    Many years ago I wrote a computer game called De Profundis where I was in charge (among other things) of the game AI. Having since moved on to become an AI researcher, I find it interesting to reminisce and draw some connections.

    On one hand, the game AI field is the perfect arena for AI researchers to try out new ideas. For AI researchers working on agents, planning and human interaction (speech, NLP), I could imagine it would be extremely valuable to interact with MMORPGs (Massively Multiplayer Online Role Playing Games). I don't know whether anyone in the research community has ever done this before, but having an unlimited source of humans to interact with seems like quite the experimental setup. This also applies to virtual worlds like Second Life, of course. AI research has contributed to the game AI field as well, so let me highlight two recent projects:

    1. The University of Alberta games group: these guys do some amazing work on several kinds of games. As far as I understand it, most of their efforts are focused on games where the mathematics is in some sense understood: chess, poker,... What I mean by "the mathematics is understood" is that with infinite computational capabilities, we would be able to solve these games. The U of A group also does some work on AI for real-time strategy games (e.g. Age of Empires). A mathematical analysis of these games is much harder (if possible at all). The AI necessary for these games is much closer to what I would think of as strong AI.
    2. The Applied Games Group at Microsoft Research: the organizers of the Silicon Minds challenge have developed a few innovations for game AI themselves. Their machine learning approach to inferring gamer skills (known as TrueSkill) is used by the Xbox Live service. They have also enhanced Forza Motorsport with a reinforcement learning agent that learns to drive from observing human drivers.

    Unfortunately, the game AI field has very special requirements that prohibit the use of many innovations from the research community. First and foremost, game AI is supposed to make games more fun. More sophisticated agents do not necessarily mean more fun: one can spend a large amount of time making opponents (in first person shooters or racing games) smarter, but if that means the player always loses, he or she might not enjoy the game that much. Also, games are big business, and game engineers want to understand the behavior of their agents. It is unacceptable to release an agent out in the open which in the middle of a battle starts to act weird. Hence, game engineers often limit the intelligence of agents to (prehistoric?!) methods such as rule-based systems and (heuristic) search because they can understand the behavior and debug it more easily. (It would be unfair not to give credit to the people who have applied reinforcement learning and neural networks to games; afaik mostly in the area of racing games.) To get a rough idea about what is hot-or-not in the game AI field, take a look at AI Wisdom.
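    To see why rule-based agents are so attractive to game engineers, here is a minimal (entirely hypothetical) rule list for a shooter opponent. The first matching rule wins, so every decision traces back to exactly one line: easy to understand, easy to debug.

```python
# A toy rule-based agent: an ordered list of (condition, action) pairs.
RULES = [
    (lambda s: s["health"] < 20,   "retreat"),
    (lambda s: s["enemy_visible"], "attack"),
    (lambda s: s["ammo"] == 0,     "find_ammo"),
    (lambda s: True,               "patrol"),     # default behaviour
]

def decide(state):
    """Return the action of the first rule whose condition holds."""
    for condition, action in RULES:
        if condition(state):
            return action

print(decide({"health": 80, "enemy_visible": True, "ammo": 5}))   # attack
print(decide({"health": 10, "enemy_visible": True, "ammo": 5}))   # retreat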

    One could say: who cares what technique you use; rules and search work incredibly well! Very true. In my humble opinion, the AI/machine learning community has sometimes over-focused on new algorithms and models and too little on building intelligent solutions. Although in fields like robotics, biology and vision our machine learning tools have had a huge impact, I think there are many fields where the AI community does not have a good understanding of how to integrate all our tools into a large working system. Hence, Silicon Minds looks like a promising challenge and I am very excited to see what people come up with.

  • To R or not to R

    At the moment 90% of the research in our lab is done using Matlab. Some of us have tried Python but were disappointed by how hard it was to get decent (read: native) numerical routines on par with Matlab's. I've been developing on the .NET platform with a little bit of Java development (for Amazon's Elastic MapReduce) on the side. We've had a few discussions in our lab recently about switching to R for our machine learning research. A decision hasn't been made and we're wondering what the larger machine learning community thinks about this issue. Here's a list of pros and cons that I've come up with.

    Pros:

    • great support for statistical functions
    • great support from the statistical community
    • plotting as good as (if not better than) Matlab's
    • reasonably fast (check out Dave's benchmarking)
    • free (very important!): if we want to run a job on 100 machines (e.g. in the cloud), I believe you currently need a Matlab licence for each one

    Cons:

    • pre-historic interface, doesn't compare to modern IDEs
    • debugging doesn't seem up to Matlab standards
    • not much support in the machine learning community (?)
    • learning curve

    The jury is still out on this one... I really wonder how many machine learners out there already use R?

  • Machine Learning Summer School 2008

    Perfect: the website for the next machine learning summer school is up and running. It will be organized on Porquerolles Island, right off the coast of France. The confirmed speaker list is still a work in progress but looks promising already:

    • Shai Ben-David (University of Waterloo) "Theoretical Foundations of Clustering"
    • Stephane Canu (INSA de Rouen) "Introduction to Kernel Methods"
    • Manuel Davy (INRIA) "Parametric and Non-parametric Bayesian Learning"
    • Pierre Del Moral (Universite de Bordeaux) "On the Foundations and the Applications of i-MCMC Methods"
    • Isabelle Guyon (ClopiNet)
    • Yann LeCun (New York university) "Supervised and Unsupervised Learning with Energy-Based Models"
    • Rich Sutton (University of Alberta) "Reinforcement Learning and Knowledge Representation"
    • Patrick Wolfe (Harvard University) "Overcomplete Representations with Incomplete Data: Learning Bayesian Models for Sparsity"
    I'll definitely try to apply and hope for the best!

  • What is... Exchangeability?

    Talking about exchangeability, a friend once commented that exchangeability is "too simple to understand". While it is true that the statement of exchangeability (see below) sounds somewhat trivial, I found that I had absolutely no intuition as to why it is important for machine learning. So after some reading, I present my take on the concept of exchangeability.

    What is Exchangeability?

    Scenario 1. Imagine we have an urn with r red balls and b blue balls. We draw 3 balls from the urn as follows: we pick a random ball, write down its color and put it back in the urn before drawing a new ball. We introduce 3 random variables: A, B, C which denote the color of the first, second and third ball. It is not hard to see that p(A=r, B=b, C=b) = p(A=b, B=r, C=b); in other words, we can exchange the values of the random variables without changing the joint probability. Intuitively, the reason we can exchange the observations is that our random variables are IID (independent and identically distributed).
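    Scenario 1 is easy to sanity-check with exact arithmetic: under with-replacement (IID) draws, every reordering of the same colours gets the same probability.

```python
from fractions import Fraction
from itertools import permutations

def seq_prob_with_replacement(seq, r, b):
    """Probability of a colour sequence when each ball is replaced
    after being drawn (scenario 1: IID draws)."""
    p = Fraction(1)
    for colour in seq:
        p *= Fraction(r if colour == "r" else b, r + b)
    return p

r, b = 3, 5
base = seq_prob_with_replacement(("r", "b", "b"), r, b)
# Every reordering of the same multiset of colours is equally probable.
assert all(seq_prob_with_replacement(perm, r, b) == base
           for perm in permutations(("r", "b", "b")))
print(base)  # 3/8 * 5/8 * 5/8 = 75/512
```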

    Scenario 2. We again pick 3 balls from an urn with r red and b blue balls. We still pick a random ball and note its color, but now we put two balls of that color back in the urn. It may not be obvious that the sequence A=r, B=b, C=b has the same probability as the sequence A=b, B=b, C=r, since the individual probabilities of picking the red ball first or last are completely different: r/(r+b) when it is the first ball versus r/(r+b+2) when it is the last ball (since two blue balls were added in the meantime). Writing down the equations makes it clear that the two sequences are equiprobable:

    p(A=r, B=b, C=b) = r/(r+b) * b/(r+b+1) * (b+1)/(r+b+2)
    p(A=b, B=b, C=r) = b/(r+b) * (b+1)/(r+b+1) * r/(r+b+2)

    Both products contain exactly the same factors, only in a different order.

    It is trivial to generalize this expression to longer sequences. Again, it doesn't matter in what order we pick the balls; the only thing that matters is how many red and how many blue balls we pick. This is reflected in the formula in the sense that the denominator of the probability of a sequence only depends on how long the sequence is. The numerator only needs to know how many balls of each color were drawn. In our example, it only needs to know that there is a first and a second blue ball (contributing b*(b+1) to the numerator) and a first red ball (contributing r).
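    Scenario 2 (a Pólya urn) is just as easy to verify numerically; this little sketch computes the probability of a colour sequence and checks that every reordering gets the same value:

```python
from fractions import Fraction
from itertools import permutations

def polya_seq_prob(seq, r, b):
    """Probability of a colour sequence under scenario 2: after each
    draw, two balls of the drawn colour go back into the urn."""
    reds, blues, p = r, b, Fraction(1)
    for colour in seq:
        total = reds + blues
        if colour == "r":
            p *= Fraction(reds, total)
            reds += 1
        else:
            p *= Fraction(blues, total)
            blues += 1
    return p

r, b = 3, 5
base = polya_seq_prob(("r", "b", "b"), r, b)
# Order doesn't matter: only the colour counts do.
assert all(polya_seq_prob(perm, r, b) == base
           for perm in permutations(("r", "b", "b")))
print(base)  # 3/8 * 5/9 * 6/10 = 1/8
```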

    Scenario 3. Both examples above were exchangeable since reordering the values of the random variables didn't change the probability. Let us consider a similar setup where exchangeability does not apply anymore. We again use the urn scheme with r red balls and b blue balls. However, now when we pick a red ball we note its color and simply put it back, but when we pick a blue ball we note its color and put two back. It is easy to see that we cannot exchange the values of the random variables anymore since

    p(A=r, B=b) = r/(r+b) * b/(r+b)

    while

    p(A=b, B=r) = b/(r+b) * r/(r+b+1)

    I think the following definition of exchangeability now becomes much more intuitive; we say a set of n random variables Y_1, ..., Y_n is exchangeable under a distribution p iff for any permutation pi of the integers 1..n

    p(Y_1 = y_1, ..., Y_n = y_n) = p(Y_1 = y_pi(1), ..., Y_n = y_pi(n))

    Properties of Exchangeability

    Let us now briefly discuss some consequences of exchangeability, as they will allow us to see why it is such an important concept. First, we compute the marginal probability of the second draw, p(B=r), under the different scenarios. Under scenario 1 this is trivial: just before the second draw the content of our urn is exactly as it was when we started, hence p(B=r) = r/(r+b). Under scenario 2, after some simple algebra we find that p(B=r) = p(A=b, B=r) + p(A=r, B=r) = r/(r+b). Now here is the exciting part: we shouldn't have done all the algebra; if we are convinced that the random variables are exchangeable under the distribution of scenario 2, we could have acted as if we were computing the marginal probability for the first draw. Formally, since p(A=b, B=r) = p(A=r, B=b), substituting this into the expression gives p(B=r) = p(A=r, B=b) + p(A=r, B=r) = p(A=r) = r/(r+b): we have simply marginalized out the second draw. This property - being able to change the order around - is incredibly useful when computing probabilities.

    More abstractly, here is one way to think of exchangeable sequences. In scenario 2, if a friend just drew a ball from the urn, didn't show it to us and put one extra ball back in the urn, this is not going to make a difference to the probability of our next draw. However, in scenario 3 above, whether someone drew a ball before us is very important: it drastically changes the probabilities for our next draw. I think this is a very important distinction that sets exchangeable and non-exchangeable distributions apart.

    Although exchangeability and IID variables look very similar, they are not exactly the same. From scenario 1 above, it is easy to see that IID random variables are exchangeable. The converse is not true: in scenario 2, p(A=b, B=r) is not equal to p(A=b) p(B=r), and thus the random variables are exchangeable but not independent.
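    This non-independence is easy to check with exact arithmetic (here with r = 3 red and b = 5 blue balls under the scenario 2 urn):

```python
from fractions import Fraction

r, b, n = 3, 5, 8  # red balls, blue balls, total

# Scenario 2 (Pólya urn): after a draw, two balls of that colour go back.
p_A_b_B_r = Fraction(b, n) * Fraction(r, n + 1)      # p(A=b, B=r)
p_A_b = Fraction(b, n)                               # p(A=b)
# p(B=r) = p(A=r, B=r) + p(A=b, B=r)
p_B_r = Fraction(r, n) * Fraction(r + 1, n + 1) + p_A_b_B_r

assert p_B_r == Fraction(r, n)      # the marginal matches the first draw...
assert p_A_b_B_r != p_A_b * p_B_r   # ...yet the draws are not independent
print(p_A_b_B_r, p_A_b * p_B_r)
```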

    Exchangeability and Machine Learning

    Exchangeable distributions are very common in machine learning. The most famous modelling assumption for text processing is just exchangeability: the bag-of-words model. This modelling assumption states that the probability of a text document depends only on word counts and not on word order. This is exactly the same model as scenario 1 above except that instead of red and blue balls, we now have words from a fixed vocabulary. Is this a realistic assumption, one may ask? It certainly is not! We don't expect natural language to be exchangeable: the probability of using the word "States" should certainly depend on the word in front of it (e.g. higher if that word is "United"). But who cares, the bag-of-words assumption works incredibly well...
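    The bag-of-words assumption is one line of Python: two documents with the same word counts are indistinguishable, whatever the word order.

```python
from collections import Counter

# Bag-of-words: a document is reduced to its word counts, so any
# reordering of the words yields exactly the same representation.
doc1 = "the united states of the states".split()
doc2 = "states the of united states the".split()

assert Counter(doc1) == Counter(doc2)
print(Counter(doc1))
```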

    There are many other exchangeable distributions in common use in machine learning: the Dirichlet Process and its Chinese Restaurant Process equivalent are exchangeable distributions, and the Indian Buffet Process is an exchangeable distribution on binary matrices. Non-exchangeable distributions are also common: many Markov models (e.g. Hidden Markov Models) aren't exchangeable.

    I hope this little overview of exchangeability was useful. I left one important concept out of our discussion so far: De Finetti's theorem. This is a very important theorem that applies to exchangeable sequences and I will discuss it in a future post.

  • Machine Learning News

    From Machine Learning (Theory); a new machine learning mailing list has been created here.

  • Hello, World!

    Welcome to my blog; I am planning to post discussions on machine learning, functional programming (F#) and how it can be used in machine learning, fields related to machine learning such as numerical optimization, statistics,...

    Stay tuned.

  • Popularizing Machine Learning

    Three weeks ago I gave a presentation on probabilistic modelling to an audience of data mining practitioners. These people know about machine learning, they use it every day, but their main concern is to analyze data: real data!

    The experience taught me a valuable lesson which I hadn’t come across by interacting with the academic community: Probabilistic models are hard. Even for people that are very close to the machine learning community (data mining), probabilistic models are a very different (new?) way of thinking.

    The whole idea of building a generative story for your data and then using Bayes' rule to “invert” the model given some dataset has become second nature to me. Nonetheless, I (we?) shouldn’t forget that it took statistics a long time to think about modelling in this sense. Hence I now realize that for outsiders the framework of probabilistic modelling is a highly non-trivial concept to grasp.

    In this context I am obliged to share a blog post by Jeff Moser. I have never seen such a great explanation of a non-trivial probabilistic model that is deployed at very large scale on Xbox Live: “Computing Your Skill”, a description of the TrueSkill ranking model. Very very well done Jeff!

  • ECML–PKDD 2010 Highlights

    The city of Barcelona just hosted ECML/PKDD this year and I had the opportunity to go down and check out the latest and greatest, mostly from the European machine learning community. My personal highlights:

    • The conference for me started with a good tutorial by Francis Bach and Guillaume Obozinski. They gave an overview of sparsity: in particular, various methods that use l1 regularization to induce sparsity.

    The first invited speaker was Hod Lipson from Cornell (as far as I know one of the few people in our field who has given a TED talk). The main portion of Hod’s talk was about his work on symbolic regression. The idea is the following: consider this dataset

      [figure: a scatter plot of noisy sample points]

    We can apply our favourite regression method, say a spline, to these points and perform accurate interpolation, perhaps even some extrapolation if we chose the right model. But the regression function would not give us much insight into why the data looks the way it does. In symbolic regression, the idea is that we try to come up with a symbolic formula which interpolates the data. In the picture above, the formula that generated the data was EXP(-x)*SIN(PI()*x)+RANDBETWEEN(-0.001,0.001) (in Excel). Hod and his graduate students have built a very cool (and free!) app called Eureqa which uses a genetic programming methodology to find a good symbolic expression for a specific dataset. Hod showed us how his software can recover the Hamiltonian and Lagrangian from the measurements of a double pendulum. Absolutely amazing!
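    Here's a small sketch of the point (I'm regenerating the kind of data that Excel formula produces): a generic regression fit interpolates the points nicely, but its fitted coefficients tell us nothing about the exp/sin structure of the generator.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 40)
y = np.exp(-x) * np.sin(np.pi * x) + rng.uniform(-0.001, 0.001, x.size)

# A generic regression fit (a degree-11 polynomial in a scaled domain)
# interpolates the data well...
fit = np.polynomial.Polynomial.fit(x, y, deg=11)
residual = np.max(np.abs(fit(x) - y))
print(f"max interpolation error: {residual:.5f}")
assert residual < 0.05
# ...but the 12 fitted coefficients carry no hint of the symbolic
# form exp(-x)*sin(pi*x) that actually generated the data.
```

    Symbolic regression searches over formula space instead of coefficient space, which is why it can hand back an interpretable expression rather than an opaque curve.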

    Another noteworthy invited speaker was Jurgen Schmidhuber. He tried to convince us that we need to extend the reinforcement learning paradigm. The idea is that besides optimizing the amount of long-term reward it gets from a teacher, an agent would also try to collect “internal reward”. The internal reward is defined as follows: as the agent learns, it builds a better model of the world. Another way to look at this learning is that the agent just knows how to “compress” better. The reduction in representation size the agent gains from a particular impression is what Jurgen calls the “internal reward”. In other words, it is the difference between the number of bits needed to represent your internal model before and after an impression.

    E.g. you listen to a new catchy song: Jurgen says that you think it’s catchy because you’ve never heard anything like it before; it is surprising and hence helps you learn a great deal. This in turn means you’ve just upgraded the “compression algorithm” in your brain, and the amount of improvement is now reflected in you experiencing “internal reward”. Listening to a song you’ve heard a million times before doesn’t help compression at all; hence, no internal reward.
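    One crude way to play with the idea (my own toy approximation, with zlib standing in for the brain's compressor): measure how many extra compressed bytes an impression costs. Familiar impressions compress into almost nothing; novel ones cost a lot.

```python
import zlib

def compression_cost(memory, impression):
    """Extra compressed bytes needed to encode a new impression on top
    of existing memory: a crude stand-in for 'surprise'."""
    before = len(zlib.compress(memory))
    after = len(zlib.compress(memory + impression))
    return after - before

memory = b"la la la " * 100                      # songs heard a million times
catchy = b"a melody full of surprising new intervals and rhythms"

novel_cost = compression_cost(memory, catchy)    # novel: costs many bytes
familiar_cost = compression_cost(memory, b"la la la " * 10)  # almost free
print(novel_cost, familiar_cost)
assert familiar_cost < novel_cost
```

    Of course this only captures the "surprise" half of the story; Jurgen's internal reward is about the improvement in the compressor itself, which a fixed algorithm like zlib can't model.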

    I really like the idea about this internal reward a lot, and as far as I understand it would be very easy to test. Unfortunately, I did not see any convincing experiments so allow me to be sceptical …

    The main conference was cool and I’ve met some interesting people working on things like probabilistic logic, a topic I desperately need to learn more about. Gjergji gave a talk about our work on crowdsourcing (more details in a separate post). Some things I marked for looking into are:

    • Sebastian Riedel, Limin Yao and Andrew McCallum – “Modeling Relations and Their Mentions Without Labeled Text”: this paper is about how to improve information extraction methods which bootstrap from existing knowledge bases using constraint driven learning techniques.
    • Wannes Meert, Nima Taghipour & Hendrik Blockeel – “First Order Bayes Ball”: a paper on how to use the Bayes ball algorithm to figure out which nodes not to ground before running lifted inference.
    • Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Thibault Helleputte, Pierre Dupont – “Expectation Propagation for Bayesian Multi-task Feature Selection”: a paper on how to run EP for spike-and-slab models.
    • Edith Law, Burr Settles, and Tom Mitchell – “Learning to Tag from Open Vocabulary Labels”: a nice paper on how to deal with tags: they use topic models to do dimensionality reduction on free text tags and then use that in a maximum entropy predictor to tag new music.

    I enjoyed most of the industry day as well: I found it quite amusing that the Microsoft folks essentially gave all of Bing’s secrets away in one afternoon: Rakesh Agrawal mentioned the secret sauce behind Bing’s ranker (a neural net) whereas Thore Graepel explained the magic behind the advertisement selection mechanism (probit regression). Videos of these talks should be on videolectures.net soon.

    One last rant: the proceedings are published by Springer, who ask me to pay for the proceedings?! I’m still trying to figure out what value they’ve added to the camera-ready copy we sent them a few months ago…

  • The Singularity

    Although I consider myself a machine learning researcher, I have to say I got all excited about this field because of its relation to Artificial Intelligence. (I consider machine learning to be more about solving "small" practical problems like information retrieval, machine translation and biological data analysis, whilst AI is the grand search for making machines actually understand the world: giving machines a deeper semantic understanding of the world.)

    This post is about a group of enthusiasts called singularitarians, whom on one hand I admire but on the other hand approach with skepticism. Singularitarians are people who believe that (soon?) there will be technological breakthroughs enabling the creation of smarter-than-human intelligence. The point in time at which this will occur is called the singularity. What I think is great about this movement is that there are singularitarians who try to design architectures that make strong AI possible. This is absolutely fantastic, and it is sad that there are so few opportunities in the academic world for this kind of blue-sky research. The reason I am skeptical at points is that a lot of singularity talk ignores the software side of things: there is much ado about peripheral issues and little research that gives us insight into how a conscious machine should work. E.g., Ray Kurzweil (a brilliant thinker by the way!) wrote a book about how available computing power will soon match the brain's computational power but hardly gives any insight into what to do with so much hardware.

    IEEE Spectrum has a special issue on the singularity with articles by both singularitarians and skeptics. A highly enjoyable read!

  • Machine Learning for Bug Discovery

    One of the reasons I have this blog is to talk about some of my own research. I intend to give some more background on things that haven't made it into a paper, put some more experiments or references online, answer questions people have about my work,... I hope that by putting these discussions online, I can catch the attention of Google's indexer so that this information is available to anyone who might ever work on similar things.

    Today we (Finale Doshi and I) want to talk about a small side project of ours. Early this year we ran into a pretty cool project at the University of Wisconsin - Madison called the Cooperative Bug Isolation (CBI) project. In a nutshell, this project aims to find software bugs by first instrumenting the source code with little probes. These probes count when certain events - such as the execution of a particular line of code - occur. The CBI software collects these probe counts, and remembers whether the program exited successfully or whether the program crashed. For example: imagine we wrote the following program:

    function foo()
        a = read_line()
        if a < 10 then
            print "smaller than 10"
            a = 10 / a - a
        else
            print "at least 10"
        return a

    Although this program is completely trivial, it has a potential division by zero. The CBI software will take a piece of source code as above and add some probes, say the log(...) lines below:

    function foo()
        a = read_line()
        log(a == 0, probe_1)
        if a < 10 then
            print "smaller than 10"
            a = 10 / a - a
            log(a == 0, probe_2)
        else
            print "at least 10"
            log(a == 0, probe_3)
        return a

    Note that there is a whole science to which probes are useful and where to put them, but the main point is that each probe is a counter, which gets incremented every time its first argument is true. As we mentioned above, when a program executes, various probes get incremented, and at the end of the execution the counts are stored in a file, in addition to a flag that says whether the program exited gracefully or whether it crashed. Let's suppose that foo() is called several times in a larger program. After running that program a couple of times, we might get data that looks like this:

    Run   Probe 1   Probe 2   Probe 3   Exit Flag
    1     0         1         2         Good
    2     1         2         2         Fail
    3     0         5         1         Good
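    To make the mechanics concrete, here is a minimal Python sketch of the probe idea (my own illustration, not the actual CBI instrumentation; the log() helper and the probe names are hypothetical):

    ```python
    # Sketch only: each probe is a counter incremented whenever its
    # predicate is true; a run ends with a Good/Fail exit flag.
    from collections import Counter

    probes = Counter()

    def log(predicate, probe_name):
        """Increment the probe's counter when the predicate holds."""
        if predicate:
            probes[probe_name] += 1

    def foo(a):
        log(a == 0, "probe_1")
        if a < 10:
            print("smaller than 10")
            a = 10 / a - a          # potential division by zero when a == 0
            log(a == 0, "probe_2")
        else:
            print("at least 10")
            log(a == 0, "probe_3")
        return a

    # One "run": call foo a few times, then record counts plus an exit flag.
    exit_flag = "Good"
    try:
        for value in [3, 12, 0]:
            foo(value)
    except ZeroDivisionError:
        exit_flag = "Fail"
    print(dict(probes), exit_flag)  # → {'probe_1': 1} Fail
    ```

    Note how the crashing run leaves probe_1 incremented, which is exactly the kind of correlation the mining step looks for.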

    The CBI project has collected thousands of traces for different open source projects. The goal now is to mine these huge datasets to discover bugs in the source code. Although there are many ways of going about it, one way is to make the following assumption: when we run a program, we are executing a sequence of complex but fairly atomic "usage patterns". For example, imagine we are executing a web browser: when we press the homepage button, a series of functions gets executed (and the corresponding counters are incremented). However, when we press the back button, a different (but maybe overlapping) sequence of functions will be executed. We call these sequences "usage patterns".

    In this paper, some Programming Language and Machine Learning people in Madison (David Andrzejewski, Anne Mulhern, Ben Liblit, and Jerry Zhu) developed a statistical model (called DeltaLDA) based on Latent Dirichlet Allocation to learn two sets of usage patterns at the same time: one set of "good" usage patterns which both successful and failed runs use, and an extra set of "bad" usage patterns which only occur in failed runs. After learning the DeltaLDA model, one can go back and look at the bad usage patterns and see which counters are expressed. Since the counters in bad usage patterns correlate with programs crashing, these could then be indicative of actual bugs in the source code. This turns out to be a reasonable idea; read the DeltaLDA paper for details on how it performs in practice.

    When we decided to work with this dataset, one obvious extension of the DeltaLDA model was to try and make the whole thing non-parametric. In the original DeltaLDA paper, one has to specify the number of usage patterns beforehand. We think this is unrealistic and allowing the number of usage patterns to adapt according to the data seems to be the right way to go. So we developed a model called the DeltaHDP which is exactly this: a non-parametric extension of DeltaLDA using the Hierarchical Dirichlet Process framework. On some of the smaller datasets which we tried, we found that the DeltaHDP behaved exactly as one would expect: it finds similar usage patterns as DeltaLDA but you don't have to specify the number of usage patterns beforehand. One other nice thing about the DeltaHDP is that the Gibbs sampler, which is fairly straightforward to implement, performs well.

    Another unrealistic aspect of both the LDA- and HDP-type models is that the corresponding generative model does not correspond to the usage pattern story above. LDA/HDP define each run of a program to be a mixture of usage patterns, while we really want each run to be an integer combination of usage patterns. The reason is that we believe that each time a usage pattern is executed, it increments its counters more or less deterministically - counters do not randomly fail to increment. In the end, we came up with a model based on the Indian Buffet Process. Check out our poster for more information on the model. It turned out the inference was a bit of a nightmare, and we never found a good sampling mechanism which could scale to very large datasets. However, the results on smaller datasets suggested that the model was OK, and we got similar results to the DeltaHDP. In addition, the IBP returns more information: it can tell you exactly how many times specific usage patterns were executed, whereas the DeltaHDP only returns the proportion with which each usage pattern was executed.
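    To illustrate the difference, here is a toy NumPy sketch (my own, with made-up pattern and loading matrices) of the integer-combination story: each run's probe counts are a deterministic integer combination of usage-pattern count vectors, rather than a mixture over them.

    ```python
    # Toy sketch of the generative story (an illustration, not the actual
    # IBP model): probe counts = integer loadings times pattern vectors.
    import numpy as np

    # Two hypothetical usage patterns over three probes: executing a
    # pattern once increments a fixed set of probes.
    patterns = np.array([[0, 1, 1],    # "good" pattern
                         [1, 2, 0]])   # "bad" pattern (probe 1 fires)

    # How many times each run executed each pattern (integer loadings).
    loadings = np.array([[2, 0],   # run 1: good pattern twice
                         [1, 1],   # run 2: both patterns once
                         [3, 0]])  # run 3: good pattern three times

    counts = loadings @ patterns   # deterministic probe counts per run
    print(counts)
    ```

    Recovering the loadings and patterns from the observed counts is then the (hard) inference problem the IBP-based model tackles.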

    In the end it was a fun project. We enjoyed working on this data, had a lot of fun coming up with the models, and we learned a lot about non-parametric methods. Unfortunately, Finale and I both have other priorities, and we think the Cooperative Bug Isolation project deserves 100% of one's attention to make progress on using statistical models for bug isolation. However, we think CBI is a great project, and we will be more than happy to give some more background on our models and answer questions about our experiences.

  • From Undirected Grad to Informaniac

    What drives a man to change the name of his blog? A trademark dispute? A mid-life crisis? None of those actually!

    After three amazing years, I am trading the machine learning group for an exciting new adventure with Microsoft FUSE Labs. As I write this, my laptop is crunching away at the last few experiments to be included in my thesis, while a printed version of the text awaits final editing.

    It has been an interesting few months as I debated between a postdoc and the amazing opportunities available in industry; in the end I decided to join FUSE Labs as an applied researcher. FUSE Labs is an incubation lab at Microsoft that sits in between Microsoft Research and the product groups. My new job allows me to collaborate with people (in or outside of Microsoft) on research while giving me the freedom to also spend time building working prototypes and demos. As far as I am concerned: the perfect mix of science and engineering.

    Although I will soon be a graduate student no more, I don’t want this blog to die; rather, I want to take it to the next level. The foreseeable future will involve more experiments in machine learning, data mining and information processing, and I should be able to report much of it to everyone who cares to listen!

    welcome to www.informaniac.net and stay tuned …

  • Machine Learning Summer School Videos Online

    For the people who missed the Machine Learning Summer School here in Cambridge this year, the videos are now available at videolectures.net! My personal favorites: David MacKay’s Information Theory, David Blei’s Topic Models, Iain Murray’s Markov Chain Monte Carlo and Tom Minka’s Approximate Inference. Enjoy …

  • NIPS 2008 Accepted Papers Out!

    Hot from the presses, the full list of NIPS 2008 accepted papers is out. I quickly browsed through the papers and can say I am looking forward to these:

    Cognitive Science

    • Analyzing human feature learning as nonparametric Bayesian inference, J. Austerweil, T. Griffiths
    • Depression: an RL formulation and a behavioural test, Q. Huys, J. Vogelstein, P. Dayan
    • Modeling human function learning with Gaussian processes, T. Griffiths, C. Lucas, J. Williams, M. Kalish
    • Modeling the effects of memory on human online sentence processing with particle filters, R. Levy, F. Reali, T. Griffiths

    Neuroscience

    • Characterizing neural dependencies with Poisson copula models, P. Berkes, F. Wood, J. Pillow
    • Dependent Dirichlet Process Spike Sorting, J. Gasthaus, F. Wood, D. Gorur, Y. Teh

    Machine Learning

    • Gates, T. Minka, J. Winn: the next big thing in graphical models?
    • Non-stationary dynamic Bayesian networks, J. Robinson, A. Hartemink
    • Nonparametric Bayesian Learning of Switching Linear Dynamical Systems, E. Fox, E. Sudderth, M. Jordan, A. Willsky: some of which we've seen at the NPBayes 2008 workshop
    • The Mondrian Process, D. Roy, Y. Teh: some of which we've seen at the NPBayes 2008 workshop

  • Machine Learning Summer School in Cambridge!

    The machine learning summer school 2009 will be organized in Cambridge (UK) from August 29 to September 10, 2009. On this website we will soon be posting the list of speakers, the application procedure and other practical information.

    See you in Cambridge!

  • Stats Clinic

    We are starting a new machine learning related initiative in Cambridge: the walk-in Stats Clinic. The goal of our consulting service is to help applied researchers with their questions in statistics or machine learning. Starting October 21 from 16:30-18:00, we will be in meeting room 5 at the Centre for Mathematical Sciences every other week.

  • PQL–A Probabilistic Query Language

    At MSR and Bing, when we do machine learning on smaller datasets (say anything below 100GB) we often use relational databases and SQL. Throw in a little bit of Excel and R and you’ve got yourself a very powerful platform for exploratory data analysis.

    After the exploratory phase, we often build statistical models (adPredictor, TrueSkill, Matchbox, …) to discover more complex structures in the data. Infer.Net helps us prototype these graphical models, but unfortunately it forces you to work in a mode where you first create a binary that performs inference, suck out all data to your machine, run inference locally and then write all inference results back to the DB. My local machine is way slower than the machines which run our DB or our local compute cluster so ideally I’d like to have a platform which computes “close” to the data.

    The Probabilistic Query Language (or PQL) is a language/tool which I designed two years ago, during an internship with Ralf Herbrich and Thore Graepel, where we had the following goals in mind:

    • Allow for rapid prototyping of graphical models in a relational environment
    • The focus should be on specifying models, not algorithms
    • It should enable large scale execution and bring the computation to the data, rather than the data to the computation

    Using SQL Server, DryadLinq (Map-Reduce for .NET) and Infer.Net, I built a prototype of PQL and tested it on some frequently used models at Microsoft. In this post I want to introduce the PQL language and give a few examples of graphical models in PQL.


    Let’s start with a very simple example where we have a DB with a table containing people’s info and a table with records describing doctor visits for those people. Assume the following relational schema

    [Schema diagram of the People and DrVisits tables]

    We assume that people have an unknown weight, and when they go to the doctor, she measures this weight. Depending on the time of day (after a heavy lunch, say), this estimate could be off a bit. A statistical model to capture these assumptions is to introduce a random variable for the weight of each person in the People table, put a prior on this variable and connect it with the observations in the DrVisits table. So how do we write such a model in PQL?

    PQL is very much like SQL but with two extra keywords: AUGMENT and FACTOR. AUGMENT allows us to add random variables to the DB schema. In the example above we would write

    People = AUGMENT DB.People ADD weight FLOAT

    This essentially defines a “plate” in graphical model speak: for each row in the People table, a random variable over the real numbers called weight is defined.

    The FACTOR keyword in PQL allows us to introduce factors between random variables, as well as any other variables in the DB schema. FACTOR follows the relational SQL syntax to specify exactly how to connect variables. To specify a normal prior on the weight variable we could write

    FACTOR Normal(p.weight | 75.0,25.0) FROM People p

    This introduces a normal factor for each row in the People table (the FROM People p part). The final component of our program connects the random variable with observations. In this case, we use the familiar SQL JOIN syntax to specify how to connect rows from the People table to the rows in the DrVisits table. In PQL we write

    FACTOR Normal(v.weight | p.weight, 1.0)
    FROM People p
    JOIN DrVisit v ON p.id = v.personid

    Except for the first line this is exactly SQL; instead of performing a query, the FACTOR statement describes the “probabilistic augmentation” of the DB schema.

    For the example above, this is it: the PQL program contains five lines of code and can be sent to the DB. Inference is performed using EP or variational Bayes, and can be run either within the database (this part was implemented by Tina Palla, who was an intern with us) or on the DryadLinq cluster.
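    To give a feel for what inference computes for this model, here is a small Python sketch (my own illustration, not part of PQL; it assumes the second argument of Normal is a variance) of the closed-form conjugate-Gaussian update that EP or VB would recover exactly in this simple case:

    ```python
    # Sketch: posterior over a person's weight given a N(75, 25) prior
    # and doctor measurements y_i ~ N(weight, 1). For a Gaussian mean
    # with known noise, the posterior is available in closed form.
    def gaussian_posterior(prior_mean, prior_var, obs, obs_var):
        """Conjugate update: precisions add, means combine precision-weighted."""
        precision = 1.0 / prior_var + len(obs) / obs_var
        mean = (prior_mean / prior_var + sum(obs) / obs_var) / precision
        return mean, 1.0 / precision

    # Three hypothetical doctor-visit measurements for one person.
    mean, var = gaussian_posterior(75.0, 25.0, [72.1, 73.4, 72.8], 1.0)
    print(round(mean, 2), round(var, 3))  # → 72.8 0.329
    ```

    The posterior mean sits close to the average of the observations because the measurement noise (variance 1) is much smaller than the prior variance (25).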


    Another example of PQL is the program to describe the TrueSkill ranking system. In this example we assume two-player games, stored using a table of players (called Players) and a table of game outcomes (called PlayerGames). Each game played generates two rows in the PlayerGames table: one for the winner and one for the loser, with a score column specifying who won and who lost. The PQL program for TrueSkill is written below

    Players = AUGMENT DB.Players ADD skill FLOAT;
    PlayerGames = AUGMENT DB.PlayerGames ADD performance FLOAT;

    FACTOR Normal(p.skill | 25.0, 20.0) FROM Players p;

    FACTOR Normal(pg.performance | p.skill, 0.1)
    FROM PlayerGames pg
    JOIN Players p ON pg.player_id = p.player_id;

    FACTOR IsGreater(pgb.performance, pga.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 0;

    FACTOR IsGreater(pga.performance, pgb.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 2;
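    The generative story this program encodes can be sketched in a few lines of Python (my own illustration, again assuming the second Normal argument is a variance): skills generate noisy performances, and the IsGreater factors assert that the winner's performance exceeded the loser's.

    ```python
    # Sketch of the TrueSkill generative story, not the inference:
    # performance = skill + Gaussian noise; higher performance wins.
    import random

    random.seed(0)

    def sample_game(skill_a, skill_b, perf_var=0.1):
        """Sample one game outcome: the noisier-performing player wins."""
        perf_a = random.gauss(skill_a, perf_var ** 0.5)
        perf_b = random.gauss(skill_b, perf_var ** 0.5)
        return "a" if perf_a > perf_b else "b"

    # With a clearly stronger player, the IsGreater constraint is
    # satisfied in that player's favour in almost every game.
    wins = sum(sample_game(30.0, 20.0) == "a" for _ in range(1000))
    print(wins)
    ```

    Inference runs this story in reverse: given the observed outcomes, it updates the posterior over each player's skill.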

    There are a lot of features in PQL I haven’t covered in this blog post (like using random variables in a WHERE clause to create mixture models) but I wanted to give you a flavour of what we’ve been working on so far.

    While working on PQL I learned a lot about the state of the art in probabilistic databases and statistical relational learning. Compared to this academic work, I think PQL does not add many theoretical contributions; our goal is to design a tool that takes statistical relational learning out of the laboratory and into the hands of data mining practitioners.