Hello, Science!:
Machine Learning

  • PQL–A Probabilistic Query Language

    At MSR and Bing, when we do machine learning on smaller datasets (say, anything below 100GB), we often use relational databases and SQL. Throw in a little bit of Excel and R and you’ve got yourself a very powerful platform for exploratory data analysis.

    After the exploratory phase, we often build statistical models (adPredictor, TrueSkill, Matchbox, …) to discover more complex structures in the data. Infer.NET helps us prototype these graphical models, but unfortunately it forces you into a mode where you first build a binary that performs inference, pull all the data onto your machine, run inference locally, and then write all the inference results back to the DB. My local machine is way slower than the machines that run our DB or our local compute cluster, so ideally I’d like to have a platform which computes “close” to the data.

    The Probabilistic Query Language (or PQL) is a language/tool which I designed two years ago, during an internship with Ralf Herbrich and Thore Graepel, where we had the following goals in mind:

    • Allow for rapid prototyping of graphical models in a relational environment
    • The focus should be on specifying models, not algorithms
    • It should enable large scale execution and bring the computation to the data, rather than the data to the computation

    Using SQL Server, DryadLinq (Map-Reduce for .NET) and Infer.NET, I built a prototype of PQL and tested it on some frequently used models at Microsoft. In this post I want to introduce the PQL language and give a few examples of graphical models in PQL.


    Let’s start with a very simple example where we have a DB with a table containing people’s info and a table with records describing doctor visits for those people. Assume the following relational schema

    [Figure: relational schema with a People table and a DrVisits table.]

    We assume that people have an unknown weight, and when they go to the doctor, she measures this weight. Depending on the time of day (say, after a heavy lunch), the measurement could be off a bit. A statistical model that captures these assumptions is to introduce a random variable for the weight of each person in the People table, put a prior on this variable, and connect it with the observations in the DrVisits table. So how do we write such a model in PQL?

    PQL is very much like SQL but with two extra keywords: AUGMENT and FACTOR. AUGMENT allows us to add random variables to the DB schema. In the example above we would write

    People = AUGMENT DB.People ADD weight FLOAT

    This essentially defines a “plate” in graphical-model speak: for each row in the People table, a real-valued random variable called weight is defined.

    The FACTOR keyword in PQL allows us to introduce factors between random variables as well as any other variables in the DB schema. FACTOR follows relational SQL syntax to specify exactly how to connect variables. To specify a normal prior on the weight variable we could write

    FACTOR Normal(p.weight | 75.0,25.0) FROM People p

    This introduces a normal factor for each row in the People table (the FROM People p part). The final component of our program connects the random variables with the observations. In this case, we use the familiar SQL JOIN syntax to specify how to connect rows from the People table to rows in the DrVisits table. In PQL we write

    FACTOR Normal(v.weight | p.weight, 1.0)
    FROM People p
    JOIN DrVisits v ON p.id = v.personid

    Except for the first line this is exactly SQL; instead of performing a query, the FACTOR statement describes the “probabilistic augmentation” of the DB schema.
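
    For intuition, the generative story encoded by these three statements can be sketched in a few lines of Python. This is purely illustrative and not part of PQL; the table sizes are made up, and I am reading the second argument of Normal as a variance, which is an assumption on my part.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative stand-ins for the People and DrVisits tables.
    n_people, n_visits = 3, 6

    # AUGMENT + prior factor: one latent weight per row of People,
    # drawn from Normal(75, 25) read as mean 75, variance 25 (std 5).
    weight = rng.normal(75.0, np.sqrt(25.0), size=n_people)

    # JOIN factor: each doctor visit is a noisy measurement of the
    # corresponding person's weight, with noise variance 1.
    personid = rng.integers(0, n_people, size=n_visits)
    measured_weight = rng.normal(weight[personid], np.sqrt(1.0))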

    For the example above, that is it: the PQL program contains five lines of code and can be sent to the DB. Inference is performed using expectation propagation (EP) or variational Bayesian inference, and it can be run either within the database (this was implemented by Tina Palla, who was an intern with us) or on the DryadLinq cluster.
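
    In fact, for this particular model the per-person posterior is available in closed form (conjugate Normal updates), which gives a feel for what the EP or variational inference would return. A minimal sketch, again assuming the Normal arguments are a mean and a variance:

    import numpy as np

    def posterior_weight(measurements, prior_mean=75.0, prior_var=25.0, noise_var=1.0):
        # Exact posterior over one person's weight: a Normal prior combined with
        # independent Normal measurements of known noise variance.
        y = np.asarray(measurements, dtype=float)
        post_precision = 1.0 / prior_var + len(y) / noise_var
        post_mean = (prior_mean / prior_var + y.sum() / noise_var) / post_precision
        return post_mean, 1.0 / post_precision

    # e.g. one person with three recorded doctor visits
    print(posterior_weight([78.2, 77.5, 79.0]))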


    Another example of PQL is the program describing the TrueSkill ranking system. In this example we assume two-player games stored using a table of players (called Players) and a table of game outcomes (called PlayerGames). Each game played generates two rows in the PlayerGames table: one for the winner and one for the loser, with a score column specifying who won and who lost. The PQL program for TrueSkill is written below

    Players = AUGMENT DB.Players ADD skill FLOAT;
    PlayerGames = AUGMENT DB.PlayerGames ADD performance FLOAT;

    FACTOR Normal(p.skill | 25.0, 20.0) FROM Players p;

    FACTOR Normal(pg.performance | p.skill, 0.1)
    FROM PlayerGames pg
    JOIN Players p ON pg.player_id = p.player_id;

    FACTOR IsGreater(pgb.performance, pga.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 0;

    FACTOR IsGreater(pga.performance, pgb.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 2;
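
    To give a feel for what inference does with the IsGreater factors: EP handles the constraint "winner performance > loser performance" by moment-matching a truncated Gaussian. Below is a sketch of that single update in Python, using the standard truncated-Gaussian formulas; this is my own illustration under the same variance-reading assumption as above, not PQL's implementation.

    import numpy as np
    from scipy.stats import norm

    def isgreater_update(mean_diff, var_diff):
        # Moment-match the Gaussian belief over d = winner_perf - loser_perf
        # against the constraint d > 0 (the IsGreater factor).
        sigma = np.sqrt(var_diff)
        t = mean_diff / sigma
        v = norm.pdf(t) / norm.cdf(t)   # mean shift of the truncated Gaussian
        w = v * (v + t)                 # relative reduction in variance
        return mean_diff + sigma * v, var_diff * (1.0 - w)

    # e.g. two previously unseen players: skill variance 20.0 and performance
    # noise 0.1 give a prior performance difference of N(0, 2 * (20.0 + 0.1)).
    print(isgreater_update(0.0, 2 * (20.0 + 0.1)))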

    There are a lot of features in PQL I haven’t covered in this blog post (like using random variables in a WHERE clause to create mixture models) but I wanted to give you a flavour of what we’ve been working on so far.

    While working on PQL I learned a lot about the state of the art in probabilistic databases and statistical relational learning. Compared to this academic work, I think PQL does not add many theoretical contributions; our goal is to design a tool which takes statistical relational learning out of the laboratory and into the hands of data mining practitioners.

  • Machine Learning Summer School Application Process Opens

    Just to let all of you know that the application process for the summer school has opened. More info here. The deadline for applications is June 1. The list of confirmed speakers is pretty a-ma-zing if you ask me.

    Hopefully I get to meet some of you in September!

  • ICML & UAI 2008 Accepted Papers

    The accepted papers for ICML 2008 and UAI 2008 are online!

    As one of my own papers got accepted, I am preparing a post with some more background information and maybe some extra plots and pictures.

  • Popularizing Machine Learning

    Three weeks ago I gave a presentation on probabilistic modelling to an audience of data mining practitioners. These people know about machine learning (they use it every day), but their main concern is to analyze data: real data!

    The experience taught me a valuable lesson which I hadn’t come across while interacting with the academic community: probabilistic models are hard. Even for people who are very close to the machine learning community (data mining), probabilistic models are a very different (new?) way of thinking.

    The whole idea of building a generative story for your data and then using Bayes’ rule to “invert” the model given some dataset has become second nature to me. Nonetheless, I (we?) shouldn’t forget that it took statistics a long time to think about modelling in this sense. Hence I now realize that for outsiders the framework of probabilistic modelling is a highly non-trivial concept to grasp.

    In this context I am obliged to share a blog post by Jeff Moser. I have never seen such a great explanation of a non-trivial probabilistic model that is deployed at very large scale on Xbox Live: “Computing Your Skill”, a description of the TrueSkill ranking model. Very, very well done, Jeff!

  • Machine Learning lecture series on videolectures.net

    Our machine learning group organizes a series called the Advanced Tutorial Lecture Series on Machine Learning, where we get to see some amazing researchers talk about recent developments in machine learning. Karsten, a postdoc in our group, has been so kind as to film all the lectures, and slowly but surely these videos are finding their way onto videolectures.net.

    Currently on videolectures.net are videos on:

    • Applications of Group Theory to Machine Learning by Risi Kondor; this math-heavy lecture was cool in that Risi explained how tools from abstract algebra and harmonic analysis can be used in machine learning too.
    • Spectral Clustering by Arik Azran; our local spectral-methods expert shared his insight into how spectral methods work.

    Coming up soon is a lecture by David MacKay on error-correcting codes and how they can be decoded using local message-passing algorithms. David is a very good speaker and his lecture is certainly in my top 5 of excellent machine learning talks.

  • Machine Learning Summer School 2008

    Perfect: the website for the next machine learning summer school is up and running. It will be organized on Porquerolles Island, just off the coast of France. The confirmed speaker list is still a work in progress but looks promising already:

    • Shai Ben-David (University of Waterloo) "Theoretical Foundations of Clustering"
    • Stephane Canu (INSA de Rouen) "Introduction to Kernel Methods"
    • Manuel Davy (INRIA) "Parametric and Non-parametric Bayesian Learning"
    • Pierre Del Moral (Universite de Bordeaux) "On the Foundations and the Applications of i-MCMC Methods"
    • Isabelle Guyon (ClopiNet)
    • Yann LeCun (New York university) "Supervised and Unsupervised Learning with Energy-Based Models"
    • Rich Sutton (University of Alberta) "Reinforcement Learning and Knowledge Representation"
    • Patrick Wolfe (Harvard University) "Overcomplete Representations with Incomplete Data: Learning Bayesian Models for Sparsity"

    I'll definitely try to apply and hope for the best!

  • Silicon Minds

    The guys at Microsoft Research announced a very exciting competition: the silicon minds challenge. The goal of the competition is to foster novel ideas in the area of game AI.

    Many years ago I wrote a computer game called De Profundis, where I was in charge of (among other things) the game AI. Having since moved on to become an AI researcher, it is interesting to reminisce and draw some connections.

    On the one hand, the game AI field is the perfect arena for AI researchers to try out new ideas. For AI researchers working on agents, planning and human interaction (speech, NLP), I could imagine it would be extremely valuable to interact with MMORPGs (Massively Multiplayer Online Role Playing Games). I don't know whether anyone in the research community has ever done this before, but having an unlimited source of humans to interact with seems like quite the experimental setup. This also applies to virtual worlds like Second Life, of course. AI research has contributed to the game AI field, of course, so let me highlight two recent projects:

    1. The University of Alberta games group: these guys do some amazing work on several kinds of games. As far as I understand it, most of their efforts are focussed on games where the mathematics are in some sense understood: chess, poker, ... What I mean by "the mathematics are understood" is that with infinite computational capabilities, we would be able to solve these games. The U of A group also does some work on AI for real-time strategy games (e.g. Age of Empires). A mathematical analysis of these games is much harder (if possible at all). The AI necessary for these games is much closer to what I would think of as strong AI.
    2. The Applied Games Group at Microsoft Research: the organizers of the silicon minds challenge have developed a few innovations for game AI themselves. Their machine learning approach to inferring gamer skills (known as TrueSkill) is used by the Xbox Live service. They have also enhanced Forza Motorsport with a reinforcement learning agent that learns to drive from observing human drivers.

    Unfortunately, the game AI field has very special requirements that prohibit the use of many innovations from the research community. First and foremost, game AI is supposed to make games more fun. More sophisticated agents do not necessarily mean more fun: one can spend a large amount of time making opponents (in first-person shooters or racing games) smarter, but if that means the player always loses, he or she might not enjoy the game that much. Also, games are big business, and game engineers want to understand the behavior of their agents. It is unacceptable to release an agent out into the open which, in the middle of a battle, starts to act weird. Hence, game engineers often limit the intelligence of agents to (pre-historic?!) methods such as rule-based systems and (heuristic) search, because they can understand the behavior and debug it more easily. (It would be unfair not to give credit to the people who have applied reinforcement learning and neural networks to games; afaik mostly in the area of racing games.) To get a rough idea of what is hot-or-not in the game AI field, take a look at AI Wisdom.

    One could say: who cares which technique you use, rules and search work incredibly well! Very true. In my humble opinion, the AI/machine learning community has sometimes over-focussed on new algorithms and models and too little on building intelligent solutions. Although in fields like robotics, biology and vision our machine learning tools have had a huge impact, I think there are many fields where the AI community does not have a good understanding of how to integrate all our tools into a large working system. Hence, silicon minds looks like a promising challenge and I am very excited to see what people come up with.

  • UCI Website Revamped

    First of all, happy new 2008 to all readers!

    UCI hosts a famous collection of datasets at the UCI Machine Learning Repository. Recently, they have completely updated their webpage and are starting to offer new datasets. This is a great service to the machine learning community, but I would like to see us take this one step further: we should match this repository of datasets with a repository of algorithms. This would not only allow us to compare algorithms, but also give us a lot of intuition about the nature of the datasets: a well-understood algorithm that does great on, say, digit classification but performs really poorly on the Wisconsin breast cancer dataset teaches us something about the nature of the data. A recent JMLR paper* calling for more open source machine learning software mentions a project at the University of Toronto called Delve that was meant to do exactly this. Unfortunately, the project seems to have been dead since 2003 or so.

    * Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Léon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson. The Need for Open Source Software in Machine Learning. Journal of Machine Learning Research, 8(Oct):2443–2466, 2007.

  • Machine Learning News

    From Machine Learning (Theory): a new machine learning mailing list has been created here.

  • Wanted: PostDocs

    ... in our very own Cambridge group. More info here.

  • Machine Learning Summer School in Cambridge!

    The Machine Learning Summer School 2009 will be organized in Cambridge (UK) from August 29 to September 10, 2009. On this website we will soon be posting the list of speakers, the application procedure, and other practical information.

    See you in Cambridge!