Hello, Science!:
Machine Learning

  • PQL–A Probabilistic Query Language

    At MSR and Bing, when we do machine learning on smaller datasets (say anything below 100GB) we often use relational databases and SQL. Throw in a little bit of Excel and R and you’ve got yourself a very powerful platform for exploratory data analysis.

    After the exploratory phase, we often build statistical models (adPredictor, TrueSkill, Matchbox, …) to discover more complex structures in the data. Infer.Net helps us prototype these graphical models, but unfortunately it forces you to work in a mode where you first create a binary that performs inference, pull all the data out to your machine, run inference locally and then write all the inference results back to the DB. My local machine is way slower than the machines which run our DB or our local compute cluster, so ideally I’d like to have a platform which computes “close” to the data.

    The Probabilistic Query Language (or PQL) is a language/tool which I designed two years ago, during an internship with Ralf Herbrich and Thore Graepel, where we had the following goals in mind:

    • Allow for rapid prototyping of graphical models in a relational environment
    • The focus should be on specifying models, not algorithms
    • It should enable large scale execution and bring the computation to the data, rather than the data to the computation

    Using SQL Server, DryadLinq (Map-Reduce for .NET) and Infer.Net I built a prototype of PQL and tested it on some frequently used models at Microsoft. In this post I want to introduce the PQL language and give a few examples of graphical models in PQL.


    Let’s start with a very simple example where we have a DB with a table containing people’s info and a table with records describing doctor visits for those people. Assume the following relational schema

    [Figure: relational schema with a People table and a DrVisits table]

    We assume that people have an unknown weight and when they go to the doctor, she measures this weight. Depending on the time of day (after a heavy lunch), this measurement could be off a bit. A statistical model to capture these assumptions is to introduce a random variable for the weight of each person in the People table, put a prior on this variable and connect it with the observations in the DrVisits table. So how do we write such a model in PQL?

    PQL is very much like SQL but with two extra keywords: AUGMENT and FACTOR. AUGMENT allows us to add random variables to the DB schema. In the example above we would write

    People = AUGMENT DB.People ADD weight FLOAT

    This essentially defines a “plate” in graphical model speak: for each row in the People table, a random variable over the real numbers called weight is defined.

    The FACTOR keyword in PQL allows us to introduce factors between random variables as well as any other variables in the DB schema. FACTOR follows the relational SQL syntax to specify exactly how to connect variables. To specify a normal prior on the weight variable we could write

    FACTOR Normal(p.weight | 75.0,25.0) FROM People p

    This introduces a normal factor for each row in the People table (the FROM People p part). The final component of our program connects the random variable with observations. In this case, we use the familiar SQL JOIN syntax to specify how to connect rows from the People table to the rows in the DrVisits table. In PQL we write

    FACTOR Normal(v.weight | p.weight, 1.0)
    FROM People p
    JOIN DrVisit v ON p.id = v.personid

    Except for the first line this is exactly SQL; instead of executing a query, the FACTOR statement describes the “probabilistic augmentation” of the DB schema.

    For the example above, that’s it: the PQL program contains five lines of code and can be sent to the DB. Inference is performed using EP or variational Bayes, and can be run either within the database (this was implemented by Tina Palla, who was an intern with us) or on the DryadLinq cluster.


    Another example of PQL is the program describing the TrueSkill ranking system. In this example we assume two-player games stored using a table of players (called Players) and a table of game outcomes (called PlayerGames). Each game played generates two rows in the PlayerGames table: one for the winner and one for the loser, with a score column specifying who won and who lost. The PQL program for TrueSkill is written below

    Players = AUGMENT DB.Players ADD skill FLOAT;
    PlayerGames = AUGMENT DB.PlayerGames ADD performance FLOAT;

    FACTOR Normal(p.skill | 25.0, 20.0) FROM Players p;

    FACTOR Normal(pg.performance | p.skill, 0.1)
    FROM PlayerGames pg
    JOIN Players p ON pg.player_id = p.player_id;

    FACTOR IsGreater(pgb.performance, pga.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 0;

    FACTOR IsGreater(pga.performance, pgb.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 2;

    There are a lot of features in PQL I haven’t covered in this blog post (like using random variables in a WHERE clause to create mixture models) but I wanted to give you a flavour of what we’ve been working on so far.

    While working on PQL I learned a lot about the state of the art in probabilistic databases and statistical relational learning. I think compared to this academic work, PQL does not add many theoretical contributions; our goal is to design a tool which takes statistical relational learning out of the laboratory and into the hands of data mining practitioners.

  • Machine Learning Summer School Application Process Opens

    Just to let all of you know that the application process for the summer school has opened. More info here. The deadline for applications is June 1. The list of confirmed speakers is pretty a-ma-zing if you ask me.

    Hopefully I get to meet some of you in September!

  • ICML & UAI 2008 Accepted Papers

    The accepted papers for ICML 2008 and UAI 2008 are online!

    As one of my own papers got accepted, I am preparing a post with some more background information and maybe some extra plots and pictures.

  • Popularizing Machine Learning

    Three weeks ago I gave a presentation on probabilistic modelling to an audience of data mining practitioners. These people know about machine learning, they use it every day, but their main concern is to analyze data: real data!

    The experience taught me a valuable lesson which I hadn’t come across by interacting with the academic community: Probabilistic models are hard. Even for people that are very close to the machine learning community (data mining), probabilistic models are a very different (new?) way of thinking.

    The whole idea of building a generative story for your data and then using Bayes’ rule to “invert” the model given some dataset has become second nature to me. Nonetheless, I (we?) shouldn’t forget that it took statistics a long time to think about modelling in this sense. Hence I now realize that for outsiders the framework of probabilistic modelling is a highly non-trivial concept to grasp.

    In this context I am obliged to share a blog post by Jeff Moser. I have never seen such a great explanation of a non-trivial probabilistic model that is deployed in very large scale on XBox Live: “Computing Your Skill”, a description of the TrueSkill ranking model. Very very well done Jeff!

  • Machine Learning lecture series on videolecture.net

    Our machine learning group organizes a series called the Advanced Tutorial Lecture Series on Machine Learning where we get to see some amazing researchers talk about recent developments in Machine Learning. Karsten, a postdoc in our group, has been so kind as to film all the lectures and slowly but surely these videos are finding their way onto videolectures.net.

    Currently on videolectures.net are videos on:

    • Applications of Group Theory to Machine Learning by Risi Kondor; this math heavy lecture was cool in that Risi explained how tools from abstract algebra and harmonic analysis can be used in Machine Learning too,
    • Spectral Clustering by Arik Azran; our local spectral methods expert shared his insight into how spectral methods work.

    Coming up soon is a lecture by David MacKay on error correcting codes and how they can be decoded using local message passing algorithms. David is a very good speaker and his lecture is certainly in my top 5 of excellent machine learning talks.

  • Machine Learning Summer School 2008

    Perfect: the website for the next machine learning summer school is up and running. It will be organized on Porquerolles Island, right off the coast of France. The confirmed speaker list is still a work in progress but looks promising already:

    • Shai Ben-David (University of Waterloo) "Theoretical Foundations of Clustering"
    • Stephane Canu (INSA de Rouen) "Introduction to Kernel Methods"
    • Manuel Davy (INRIA) "Parametric and Non-parametric Bayesian Learning"
    • Pierre Del Moral (Universite de Bordeaux) "On the Foundations and the Applications of i-MCMC Methods"
    • Isabelle Guyon (ClopiNet)
    • Yann LeCun (New York university) "Supervised and Unsupervised Learning with Energy-Based Models"
    • Rich Sutton (University of Alberta) "Reinforcement Learning and Knowledge Representation"
    • Patrick Wolfe (Harvard University) "Overcomplete Representations with Incomplete Data: Learning Bayesian Models for Sparsity"

    I'll definitely try to apply and hope for the best!

  • Silicon Minds

    The guys at Microsoft Research announced a very exciting competition: the silicon minds challenge. The goal of the competition is to foster novel ideas in the area of game AI.

    Many years ago I wrote a computer game called De Profundis where I was in charge (among other things) of the game AI. Moving on to become an AI researcher it is interesting to reminisce and draw some connections.

    On one hand, the game AI field is the perfect arena for AI researchers to try out new ideas. For AI researchers working on agents, planning and human interaction (speech, NLP) I could imagine it would be extremely valuable to interact with MMORPG's (Massive Multiplayer Online Role Playing Games). I don't know whether anyone in the research community has ever done this before, but having an unlimited source of humans to interact with seems like quite the experimental setup. This also applies to virtual worlds like Second Life of course. AI research has contributed to the game AI field of course, so let me highlight two recent projects:

    1. The University of Alberta games group: these guys do some amazing work on several kinds of games. As far as I understand it, most of their efforts are focused on games where the mathematics are in some sense understood: chess, poker,... What I mean by "the mathematics are understood" is that with infinite computational capabilities, we would be able to solve these games. The U of A group also does some work on AI for real time strategy games (e.g. Age of Empires). A mathematical analysis of these games is much harder (if possible at all). The AI necessary for these games is much closer to what I would think of as strong AI.
    2. The Applied Games Group at Microsoft Research: the organizers of the silicon minds challenge have developed a few innovations for game AI themselves. Their machine learning approach to inferring gamer skills (known as TrueSkill) is used by the XBox Live service. They have also enhanced Forza Motorsport with a reinforcement learning agent that learns to drive from observing human drivers.

    Unfortunately, the game AI field has very special requirements that prohibit the use of many innovations from the research community. First and foremost, game AI is supposed to make games more fun. More sophisticated agents do not necessarily mean more fun: one can spend a large amount of time making opponents (in first person shooters or racing games) smarter, but if that means the player always loses, he or she might not enjoy the game that much. Also, games are big business, and game engineers want to understand the behavior of their agents. It is unacceptable to release an agent out in the open which in the middle of a battle starts to act weird. Hence, game engineers often limit the intelligence of agents to (pre-historic ?!?) methods such as rule based systems and (heuristic) search because they can understand the behavior and debug it more easily. (It would be unfair to forget to give credit to the people that have applied reinforcement learning and neural networks to games; afaik mostly in the areas of racing games.) To get a rough idea about what is hot-or-not in the game AI field, take a look at AI Wisdom.

    One could say: who cares what technique to use, rules and search work incredibly well! Very true. In my humble opinion, the AI/machine learning community has sometimes focused too much on new algorithms and models and too little on building intelligent solutions. Although in fields like robotics, biology and vision our machine learning tools have had a huge impact, I think there are many fields where the AI community does not have a good understanding of how to integrate all our tools to make a large working system. Hence, silicon minds looks like a promising challenge and I am very excited to see what people come up with.

  • UCI Website Revamped

    First of all, happy new 2008 to all readers!

    UCI hosts a famous collection of datasets at the UCI Machine Learning Repository. Recently, they have completely updated their webpage and are starting to offer new datasets. This is a great service to the machine learning community, but I would like to see us take this one step further: we should match this repository of datasets with a repository of algorithms. This would not only allow us to compare algorithms, but also give us a lot of intuition about the nature of the datasets: a well-understood algorithm that does great on - say - digit classification but performs really poorly on the Wisconsin breast cancer dataset teaches us something about the nature of the data. A recent JMLR paper* calling for more open source machine learning software mentions a project at the University of Toronto called Delve that was meant to do exactly this. Unfortunately, the project seems to be dead as of 2003 or so.

    * The Need for Open Source Software in Machine Learning - Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson; Journal of Machine Learning Research; 8(Oct):2443--2466, 2007.

  • Machine Learning News

    From Machine Learning (Theory); a new machine learning mailing list has been created here.

  • Wanted: PostDocs

    ... in our very own Cambridge group. More info here.

  • Machine Learning Summer School in Cambridge!

    The machine learning summer school 2009 will be organized in Cambridge (UK) from August 29 to September 10, 2009. On this website we will soon be posting the list of speakers, the application procedure and other practical information.

    See you in Cambridge!

  • London Machine Learning Meetup... and joining a startup.

    Three months ago I decided to end my employment at Microsoft. I had a great time, met loads of interesting people and will definitely miss being in the Cambridge research lab (as well as the occasional trip to Seattle).
    Next up I am joining a startup company called Rangespan. I met the founders of Rangespan at Silicon Milkroundabout in April this year. When I got talking to them it was immediately obvious there was a hell of a lot of machine learning to be done there.
    Rangespan is essentially an e-commerce platform which organizes a marketplace for suppliers to sell products to big retailers. On one hand, Rangespan is trying to build the biggest catalog in the world. We're doing this by fusing many different information sources (supplier feeds, manufacturer feeds, crowdsourced data, crawled data). The three big learning problems in this space are:

    1. Matching: out of millions of product descriptions (semi-structured & textual), how do you recognize which descriptions talk about the same product? Record linkage is the keyword here, but making it work at large scale, with very little margin for error in precision and recall, is a challenge.
    2. Reconciliation: when you have 100 different descriptions of a product, and you want to construct the best possible description of the product, how do you do that? I think the speech recognition and machine translation communities have a few neat tricks up their sleeves to be applied here: in the abstract, if we think of each of the 100 descriptions as the output of a noisy channel with the unknown ground truth description as the input, the challenge is to invent a prior and a model for the noisy channel. Machine learning baby!
    3. Information extraction: the data we are gathering is only semi-structured and most often marketing text. The question becomes: how do we figure out from a USB stick description that the product has a "4GB" capacity? How do we figure out that "4GB" is the same as "four gigabytes", etc.? There's loads of great work on information extraction which we can build on; the good news is that our domain is restricted (physical goods), the bad news is that the domain is quite large (1000's of categories, 1000's of attributes). A toy sketch of this kind of attribute normalization is given below.
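
    To give a flavour of the kind of normalization involved, here is a toy sketch in Python (purely illustrative; the function name capacity_gb, the tiny word list and the regex are all made up for this example, and real product text is far messier):

    import re

    # Hypothetical toy normalizer: map "4GB", "4 gb" or "four gigabytes"
    # to a canonical capacity attribute expressed in gigabytes.
    WORD_NUMBERS = {"one": 1, "two": 2, "four": 4, "eight": 8, "sixteen": 16}
    UNIT_FACTORS = {"gb": 1, "gigabyte": 1, "gigabytes": 1,
                    "tb": 1024, "terabyte": 1024, "terabytes": 1024}

    def capacity_gb(text):
        m = re.search(r"(\d+|[a-z]+)\s*(gb|tb|gigabytes?|terabytes?)\b", text.lower())
        if not m:
            return None
        qty, unit = m.groups()
        value = int(qty) if qty.isdigit() else WORD_NUMBERS.get(qty)
        return None if value is None else value * UNIT_FACTORS[unit]

    print(capacity_gb("SuperFast USB stick 4GB, red"))         # 4
    print(capacity_gb("usb drive - four gigabytes capacity"))  # 4
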
    Besides the catalog, Rangespan is also building an order pipeline. One crucial component of that pipeline is the system that assigns retailers' orders to suppliers. The idea here is that for each product we have a pretty good view of the supply curve (which suppliers sell how many items at what price). We also have a pretty good view of the demand curve: by aggregating many different retailers we can predict what the demand for a product will be. This allows us to make very quantitative decisions about how to choose the price for that product to optimize metrics like user experience, etc. In this area a mix of statistics (time series analysis), decision theory, econometrics and optimization are crucial ingredients... together with a ton of data of course.
    If this all sounds like music to your ears, then let us know. We're hiring!
    Joining a startup also meant I am now working in London. One of the coolest things I discovered about being in London is the great number of meetups people are organizing. From MongoDB office hours (thank you for all the help 10gen!) to the Hadoop and Big Data meetups. Although big data is crucial at Rangespan, we think clever algorithms (as in machine learning) are as crucial to building successful insights from data. As such we are kicking off the Machine Learning Meetup as of January 2012. We'll put some fun talks together with the London machine learning community and probably do some study groups on the Stanford online learning courses.
    If you live close by and haven't yet signed up, you can find all the meetup details here.

  • Machine Learning Summer School Videos Online

    For the people who missed the Machine Learning Summer School here in Cambridge this year, the videos are now available at videolectures.net! My personal favorites: David MacKay’s Information Theory, David Blei’s Topic Models, Iain Murray’s Markov Chain Monte Carlo and Tom Minka’s Approximate Inference. Enjoy …

  • Dirichlet Distributions and Entropy

    In between all the Netflix excitement I managed to read a paper that was mentioned by David Blei during his machine learning summer school talk: Entropy and Inference, Revisited by Ilya Nemenman, Fariel Shafee and William Bialek from NIPS 2002. This paper discusses some issues that arise when learning Dirichlet distributions. Since Dirichlet distributions are so common, I think these results should be more widely known.

    [Given that I’ve been reading a lot of natural language processing papers where Dirichlet distributions are quite common, I’m surprised I haven’t run into this work before.]

    First a little bit of introduction on Dirichlet distributions. The Dirichlet distribution is a “distribution over distributions”; in other words, a draw from a Dirichlet distribution is a vector of positive real numbers that sum up to one. The Dirichlet distribution is parameterized by a vector of positive real numbers which captures the mean and variance of the distribution. It is often very natural to work with a slightly constrained version called the symmetric Dirichlet distribution: in this case all entries of the parameter vector are the same number. This implies that the mean of the Dirichlet is the uniform distribution and the variance is captured by the magnitude of the parameter vector. Let us denote the parameter of the symmetric Dirichlet by beta. When beta is small, samples from the Dirichlet will have high variance, while when beta is large, samples from the Dirichlet will have small variance. The plot below illustrates this idea for a Dirichlet with 1000 dimensions: the top plot has very small beta and hence a draw from this Dirichlet has only a few nonzero entries (hence high variance), while for the Dirichlet with beta = 1 all entries of the sample have roughly the same magnitude (about 0.001).

    [Figure: draws from a 1000-dimensional symmetric Dirichlet, for a very small beta (top) and for beta = 1 (bottom)]

    Another way to approach the effect of beta is to look at the entropy of a sample from the Dirichlet distribution, denoted by S in the figures below. The entropy of a Dirichlet draw is high when beta is large. More precisely, it is upper bounded by ln(D), where D is the dimensionality of the Dirichlet: as beta approaches infinity, the Dirichlet concentrates on the completely uniform discrete distribution. When beta approaches 0, a draw from the Dirichlet approaches a delta peak on a random entry, which is a distribution with entropy 0. The key problem the authors want to address is that, when learning an unknown distribution with a Dirichlet prior, beta pretty much fixes the allowed shapes, while we might not have a good reason a priori to believe that what we want to learn is going to look like either one of these extremes.
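
    To see this concentration numerically, one can sample symmetric Dirichlet draws for a few values of beta and look at the spread of their entropies. A minimal numpy sketch (the dimensionality D = 1000, the betas and the number of samples are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    D = 1000                                   # dimensionality of the Dirichlet
    for beta in [0.01, 0.1, 1.0, 10.0]:
        q = rng.dirichlet(np.full(D, beta), size=2000)
        # entropy (in nats) of each sampled distribution; clip to avoid log(0)
        S = -np.sum(q * np.log(np.clip(q, 1e-300, None)), axis=1)
        print(f"beta={beta:<5}  mean entropy={S.mean():.3f}  std={S.std():.4f}  ln(D)={np.log(D):.3f}")

    For every beta the standard deviation of the entropy comes out tiny compared to the full ln(D) range it could in principle span, which is exactly the effect the paper analyzes.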

    The way the authors try to give insight into this problem is by computing the entropy of a random draw of a Dirichlet distribution. In equations, if we denote with S the entropy, it will be a random variable with distribution

    [Equation: the distribution of the entropy S induced by a symmetric Dirichlet prior with parameter beta]

    Computing the full distribution is hard, but the authors give a method to compute its mean and variance. The following picture shows the mean and variance of the entropy for draws from a Dirichlet distribution. A bit of notation: K is the dimensionality of the Dirichlet distribution, xi is the mean entropy (as a function of beta) and sigma is the variance of the entropy (also as a function of beta).

    [Figure: the mean entropy xi(beta) and the variance sigma(beta) of the entropy of Dirichlet draws, for several values of K]

    As you can see from this plot, the entropy of Dirichlet draws is extremely concentrated for even moderately large K. The authors give a detailed analysis of what this implies, but the main take-away message is this: as you change beta, the entropy of the implied Dirichlet draws varies smoothly; however, because the distribution of the entropy is so tightly peaked, the a priori choice of beta almost completely fixes the entropy.

    This is problematic as it means that unless our distribution is sampled almost completely, the estimate of the entropy is dominated by the choice of our prior beta. So how can we fix this? The authors suggest a scheme which I don’t completely understand but which boils down to a mixture of Dirichlet distributions, obtained by specifying a prior distribution on beta as well.

    This mixture idea ties in with something we did in our EMNLP 09 paper: when we were training our part-of-speech tagger we had to choose a prior for the distribution which specifies what words are generated for a particular part-of-speech tag. We know that we have part-of-speech tag classes that generate very few words (e.g. determiners, attributes, …) and a few classes that generate a lot of words (e.g. nouns, adjectives, …). At first, we chose a simple Dirichlet distribution (with fixed beta) as our prior and although the results were reasonable, we did run into the effect explained above: if we set beta to be large, we got very few states in the iHMM where each state outputs a lot of words. This is good to capture nouns and verbs but not good for other classes. Conversely, when we chose a small beta we got a lot of states in the iHMM each generating only a few words. Our next idea was to put a Gamma prior on beta; this helped a bit, but still assumed that there is only one value of beta (or one type of entropy distribution) which we want to learn. This is again unreasonable in the context of natural language. Finally, we chose to put a Dirichlet process prior on beta (with a Gamma base measure). This essentially allows different states in the iHMM to have different beta’s (but we only expect to see a few discrete beta’s).

    “Entropy and Inference, Revisited” is one of those papers with a lot of intuitive explanations; hopefully it helps you make the right choice for priors in your next paper as well.

  • May All Your Wishes Come True

    I am trying to catch up with a stack of 109 papers which are on my “To Read” list. Today I read a very enjoyable paper called May all your wishes come true: A study of wishes and how to recognize them by some of my former UW-Madison friends and colleagues Andrew B. Goldberg, Nathanael Fillmore, David Andrzejewski, Zhiting Xu, Bryan Gibson and Xiaojin Zhu.

    The story is as follows: in 2007, the Times Square Alliance asked people to send their wishes to a Virtual Wishing Wall website with the promise to print these wishes on confetti and drop them from the sky at midnight on December 31, 2007. The UW-Madison people got a hold of this corpus of wishes and did some statistical analysis.

    A few highlights:

    • wishes follow Zipf’s law
    • topic wise, US people wish more for religion
    • topic wise, non-US people wish more for love, peace and travel
    • scope wise, US people wish to their country and family more frequently
    • scope wise, non-US people wish to the world more frequently

    After the statistical analysis, the authors wanted to come up with a classifier that could detect wishes. Their approach consists of extracting wish templates in an unsupervised fashion. The idea is quite clever (and I believe more widely applicable): people often wished for “world peace” and “I wish for world peace”, or “health and happiness” and “I wish for health and happiness”. In other words, there is quite a bit of redundancy in how the wish is expressed. By matching similar wishes, the authors extract the “redundant” part and assume it is a template.

    Matching wishes can be done as follows: you create a bipartite graph and loop over all wishes; if a wish contains another wish (“I wish for world peace” contains “world peace”), you create a content node on the left with the content part (e.g. “world peace”), a template node on the right with the redundancy (e.g. “I wish for”) and you connect these with a directed edge from the content node to the template node. When you’re done, you loop over all template nodes and see whether any of them contain content words. If so, you introduce an edge from the template node to the content node. Finally, you score template nodes by their indegree minus their outdegree. I will refer you to table 4 of the paper to see that the results of this relatively simple algorithm are quite good.
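
    A rough sketch of that matching-and-scoring procedure, reconstructed from the description above (this is not the authors' code; the toy wishes and the crude substring containment test are just for illustration):

    from collections import defaultdict

    wishes = [                      # toy examples
        "world peace",
        "i wish for world peace",
        "health and happiness",
        "i wish for health and happiness",
        "i wish for a puppy",
    ]

    in_deg, out_deg = defaultdict(int), defaultdict(int)
    contents = set()

    # Content -> template edges: if wish b contains wish a, the part of b
    # surrounding a is treated as a candidate template.
    for a in wishes:
        for b in wishes:
            if a != b and a in b:
                template = b.replace(a, "_").strip()
                contents.add(a)
                in_deg[template] += 1

    # Template -> content edges: a "template" that still contains content
    # words is probably not a real template.
    for template in in_deg:
        for c in contents:
            if c in template:
                out_deg[template] += 1

    # Score templates by indegree minus outdegree.
    scores = {t: in_deg[t] - out_deg[t] for t in in_deg}
    for t, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(s, t)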

    The reason I believe this is more widely applicable is that we can imagine using this technique for extracting templates from web queries. People often query for facts, e.g. kids of Michael Jackson. Using the algorithm from this paper, we might be able to extract the most common templates, which in turn would give us a good indication of what kinds of questions search engines should be trying to answer. Just an idea …

  • To R or not to R

    At the moment 90% of the research in our lab is done using Matlab. Some of us have tried Python but were disappointed by how hard it was to get decent (read: native) numerical routines on par with Matlab. I've been developing on the .NET platform with a little bit of Java development (for Amazon's Elastic Map-Reduce) on the side. We've had a few discussions in our lab recently about switching to R for our machine learning research. A decision hasn't been made and we're wondering what the larger machine learning community thinks about this issue. Here's a list of pros and cons that I've come up with:

    Pros:

    • great support for statistical functions
    • great support from the statistical community
    • good (if not better) plotting than Matlab
    • reasonably fast (check out Dave's benchmarking)
    • free (very important!): if we want to run a job on 100 machines (e.g. in the cloud), I believe you currently need a Matlab licence for each one

    Cons:

    • pre-historic interface, doesn't compare to modern IDE's
    • debugging doesn't seem up to Matlab standards
    • not much support in machine learning community (?)
    • learning curve

    The jury is still out on this one... I really wonder how many machine learners out there already use R?

  • AIStats is coming to Europe

    That’s right, the conference that only happened every two years in the US is now going to be switching between the US and Europe on an annual basis. The conference website is up and running here.

  • Testing Markov Chain Monte Carlo Methods

    The projects I was involved in this year all used Markov Chain Monte Carlo (MCMC) methods for inference. MCMC methods are nice because they are so widely applicable. However, I often find them much harder to debug than variational methods. When coding up a new MCMC algorithm I use the following "procedure" to debug my code:

    1. Fix parameters in the model and generate a lot of data. Initialise the sampler with all hidden variables set to the values used to generate the data. Run the sampler and check that it doesn't move too much. It is crucial to generate a lot of data so there is little uncertainty in the parameters and there is little reason for the sampler to move around. This test is really to check that if the inference algorithm is working, it should recognize that it has found a solution.
    2. Fix parameters in the model and generate a lot of data. Initialise the sampler with random hidden variables. Run the sampler and check that it discovers the parameters you used to generate the data. Again, this would work if there is enough data so there is little uncertainty in the posterior distribution of the parameters. This test is to check the search abilities of the inference algorithm.
    3. Finally, I apply the model and algorithm to real data. At this point, I will probably dig up Bayesian Data Analysis and use more principled ways to assess convergence.
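
    To make checks 1 and 2 concrete, here is a minimal sketch on a deliberately trivial model: a Gaussian with unknown mean, a flat prior and a random-walk Metropolis sampler. All numbers are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    true_mu = 3.0
    data = rng.normal(true_mu, 1.0, size=10_000)   # lots of data -> peaked posterior

    def log_post(mu):                              # flat prior on mu
        return -0.5 * np.sum((data - mu) ** 2)

    def run_sampler(mu_init, n_iter=5000, step=0.05):
        mu, trace = mu_init, []
        for _ in range(n_iter):
            prop = mu + rng.normal(0.0, step)      # random-walk Metropolis proposal
            if np.log(rng.uniform()) < log_post(prop) - log_post(mu):
                mu = prop
            trace.append(mu)
        return np.array(trace)

    # Check 1: initialise at the generating value; the chain should barely move.
    t1 = run_sampler(true_mu)
    print("check 1: mean", t1.mean(), "std", t1.std())

    # Check 2: initialise far away; the chain should find the generating value.
    t2 = run_sampler(10.0)
    print("check 2: mean of last samples", t2[-1000:].mean())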

    In variational methods one can check that a lower bound increases between iterations; if it doesn't, something is wrong. Are there similar kinds of checks for sampling algorithms? Do people have some good suggestions for debugging MCMC methods?

  • Machine Learning for Bug Discovery

    One of the reasons I have this blog is to talk about some of my own research. I intend to give some more background on things that haven't made it into a paper, put some more experiments or references online, answer questions people have about my work,... I hope that by putting these discussions online, I can catch the attention of Google's indexer so that this information is available for anyone who might ever work on similar things.

    Today we (Finale Doshi and I) want to talk about a small side project of ours. Early this year we ran into a pretty cool project at the University of Wisconsin - Madison called the Cooperative Bug Isolation (CBI) project. In a nutshell, this project aims to find software bugs by first instrumenting the source code with little probes. These probes count when certain events - such as the execution of a particular line of code - occur. The CBI software collects these probe counts, and remembers whether the program exited successfully or whether the program crashed. For example: imagine we wrote the following program:

    function foo()
        a = read_line()
        if a < 10 then
            print "smaller than 10"
            a = 10 / a - a
        else
            print "at least 10"
        return a

    Although this program is completely trivial, it has a potential division by zero. The CBI software will take a piece of source code as above and add some probes, say the log(...) lines below:

    function foo()
        a = read_line()
        log( a == 0, probe_1)
        if a < 10 then
            print "smaller than 10"
            a = 10 / a - a
            log( a == 0, probe_2)
        else
            print "at least 10"
            log( a == 0, probe_3)
        return a

    Note that there is a whole science to what probes are useful and where to put them, but the main point is that each probe is a counter, which gets augmented every time its first argument is true. As we mentioned above, when a program executes, various probes get augmented and at the end of the execution the counts are stored in a file, in addition to a flag that says whether the program exited gracefully or whether it crashed. Let's suppose that foo() is called several times in a larger program. After running that program a couple of times, we might get data that looks like this:

    Run    Probe 1    Probe 2    Probe 3    Exit Flag
    1      0          1          2          Good
    2      1          2          2          Fail
    3      0          5          1          Good

    The CBI project has collected thousands of traces for different open source projects. The goal now is to mine these huge datasets to discover bugs in the source code. Although there are many ways of going about it, one way is to make the following assumption: if we run a program, we are executing a sequence of complex but fairly atomic "usage patterns". For example, imagine we are executing a web browser: when we press the homepage button a series of functions gets executed (and the corresponding counters are augmented). However, when we press the back button a different (but maybe overlapping) sequence of functions will be executed. We call these sequences "usage patterns".

    In this paper, some Programming Language and Machine Learning people in Madison (David Andrzejewski, Anne Mulhern, Ben Liblit, and Jerry Zhu) developed a statistical model (called DeltaLDA) based on Latent Dirichlet Allocation to learn two sets of usage patterns at the same time: one set of "good" usage patterns which both successful and failed runs use, and an extra set of "bad" usage patterns which only occur in failed runs. After learning the DeltaLDA model, one can go back and look at the bad usage patterns and see which counters are expressed. Since the counters in bad usage patterns correlate with programs crashing, these could then be indicative of actual bugs in the source code. It turns out that this is a reasonable idea; read the DeltaLDA paper for details on how it performs in practice.

    When we decided to work with this dataset, one obvious extension of the DeltaLDA model was to try and make the whole thing non-parametric. In the original DeltaLDA paper, one has to specify the number of usage patterns beforehand. We think this is unrealistic and allowing the number of usage patterns to adapt according to the data seems to be the right way to go. So we developed a model called the DeltaHDP which is exactly this: a non-parametric extension of DeltaLDA using the Hierarchical Dirichlet Process framework. On some of the smaller datasets which we tried, we found that the DeltaHDP behaved exactly as one would expect: it finds similar usage patterns as DeltaLDA but you don't have to specify the number of usage patterns beforehand. One other nice thing about the DeltaHDP is that the Gibbs sampler, which is fairly straightforward to implement, performs well.

    Another unrealistic artifact of both the LDA/HDP type models is that the corresponding generative model does not correspond to the usage pattern story above. LDA/HDP define each run of a program to be a mixture of usage patterns, while we really want each run to be an integer combination of usage patterns. The reason is that we believe that each time a usage pattern is executed, it augments the counter more or less deterministically - counters do not randomly fail to increment. In the end, we came up with a model based on the Indian Buffet Process. Check out our poster for more information on the model. It turned out the inference was a bit of a nightmare and we never found a good sampling mechanism which could scale to very large datasets. However, the results on smaller datasets suggested that the model was OK and we got similar results to the DeltaHDP. In addition, the IBP model returns more information: it can tell you exactly how many times specific usage patterns were executed, whereas the DeltaHDP only returns the proportions with which the usage patterns were executed.
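
    To illustrate the difference between the two generative stories, here is a toy sketch of the "integer combination" view (completely made-up numbers and a simple Poisson observation model; this only shows the structure we had in mind, not our actual model):

    import numpy as np

    rng = np.random.default_rng(0)
    n_runs, n_probes, n_patterns = 5, 6, 3

    # Each usage pattern has its own expected probe counts.
    pattern_rates = rng.gamma(2.0, 2.0, size=(n_patterns, n_probes))

    # For every run: which patterns were used, and how many times each
    # (a stand-in for the IBP-style representation described above).
    active = rng.integers(0, 2, size=(n_runs, n_patterns))
    times = active * rng.integers(1, 4, size=(n_runs, n_patterns))

    # Observed probe counts are an integer combination of the patterns'
    # counts, rather than a mixture over patterns as in LDA/HDP.
    counts = rng.poisson(times @ pattern_rates)
    print(counts)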

    In the end it was a fun project. We enjoyed working on this data, had a lot of fun coming up with the models and we learned a lot about non-parametric methods. Unfortunately, both Finale and I have different priorities and we think the cooperative bug isolation project deserves 100% of one's attention to make progress on using statistical models for bug isolation. However, we think CBI is a great project, and we will be more than happy to give some more background on our models and answer questions about our experiences.

  • Parallel Random Number Generators

    We recently got a paper accepted at EMNLP where we elaborate on a point that was touched upon in Mark Johnson’s “Why doesn’t EM find good HMM POS-taggers?”: if we are going to do completely unsupervised POS-tagging, how can non-parametric Bayesian methods help us? There were two parts to our paper: extending the iHMM to be able to handle a medium sized text corpus, and an evaluation of the results to see how the iHMM states correspond to POS tags. I’ll blog about our findings later, but one thing that came up during this research was the following: because of the size (1,000,000 datapoints) of the corpus we used (the Wall Street Journal section of the Penn Treebank), we wanted to parallelize the forward-filtering backward-sampling (FFBS) part of the inference. Because all the sentences are independent given the transition matrix and output parameters, we can easily run the FFBS in parallel.

    One thing I started thinking about is the following: imagine I run on a machine with 4 cores and I start a thread for each core to process a chunk of the data. Since the FFBS is a sampling algorithm, each of these 4 threads needs to access a random number generator (RNG). I know of two ways to go about this:

    1. We use one RNG and access it from the 4 threads. Since the RNG’s I use are stateful, we need to lock them when we query them for a number so their internal state doesn’t get messed up. Locking an RNG can incur a large overhead and easily becomes the bottleneck of the parallelization.
    2. Use one RNG for each thread. How do we initialize them though? I use the .NET System.Random RNG, which uses a time-dependent random seed. I don’t think this is a good idea: because the resolution of this time-dependent seed is too coarse and my 4 threads start at almost the same time, they might all end up using the same seed. An alternative is to initialize a global RNG to produce the seeds for the thread-local RNG’s.

    The second alternative in (2) is what we used but it does worry me a little bit: how can I know that the streams of random numbers in different threads are going to be ‘good enough’ (whatever that means). As a non-expert on RNG’s I’d like to know whether I’m going to get into trouble with these kinds of approaches. Matlab has a way to generate 3 statistically independent streams using the same RNG with the same seed. This seems like a step in the right direction, but it doesn’t solve my quad core problem, let alone when we run our algorithm on many more cores in Amazon Elastic Map Reduce …
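
    For what it’s worth, some numerical libraries now support the global-seed-feeding-per-thread-generators pattern directly. As a sketch of the idea in Python/numpy (not the .NET code we used), SeedSequence can spawn statistically independent child seeds, one per worker:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    master = np.random.SeedSequence(12345)
    rngs = [np.random.default_rng(s) for s in master.spawn(4)]   # one RNG per worker

    def sample_chunk(rng, n):
        # stand-in for the per-sentence FFBS work: just draw some numbers
        return rng.normal(size=n).sum()

    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(sample_chunk, rngs, [100_000] * 4))
    print(results)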

    Any creative solutions out there?