Hello, Science!:

  • The Elements of Statistical Learning

    The Elements of Statistical Learning is an absolute classic for anyone wanting to do statistics/machine learning/data mining. I had read that the second edition was out and was debating whether I should spend the money on this new edition. Via John Cook I learned that the book is out as a PDF (on the authors' website). DOUBLE WIN: a) I’ve already paid once and get the upgrade for free, b) I now have a way to electronically search the book.

    I also found out today that Koller and Friedman have just released their much-anticipated book Probabilistic Graphical Models from MIT Press. At a lengthy 1208 pages, it should provide enough reading for a few nights!

  • Gold's Theorem

    After seeing this amazing talk by Josh Tenenbaum on videolectures.net, I started reading up on some very cool work at the intersection of machine learning and cognitive science. This brought me to read up on Gold's theorem and the poverty of the stimulus. Very roughly, Gold's theorem says that a learner (be it a child or a computer) cannot "learn" a language by only hearing sentences that belong to the language to be learned. Some people use this theorem to make the following argument: a toddler only ever hears sentences from the language she is learning; she never gets to hear "wrong" sentences (ones not in the language). Hence, since by Gold's theorem the toddler cannot learn the language from this input alone, language must be innate: language abilities must be wired into our brains in some way. Gold's Theorem and Cognitive Science by Kent Johnson is a very enjoyable read for more background on Gold's theorem and how it applies to the question of language acquisition.

    Johnson's paper mentions something that I had never thought about: according to Morgan, a child acquires language after hearing about 4 million sentences. Now think about how many sentences we have access to for training our NLP algorithms: orders of magnitude more than a person ever gets to hear, and yet I would say we are far from building a computer system that can manipulate language as accurately as humans. From a Bayesian perspective, this could translate into assuming that children start learning language from a really good prior. If the Bayesian way is the right way to look at this question, I really wonder how humans acquire this prior: how much is wired into our brains, and how much is shaped by our sensory system?

  • PQL–A Probabilistic Query Language


    At MSR and Bing, when we do machine learning on smaller datasets (say anything below 100GB) we often use relational databases and SQL. Throw in a little bit of Excel and R and you’ve got yourself a very powerful platform for exploratory data analysis.

    After the exploratory phase, we often build statistical models (adPredictor, TrueSkill, Matchbox, …) to discover more complex structure in the data. Infer.Net helps us prototype these graphical models, but unfortunately it forces you to work in a mode where you first create a binary that performs inference, pull all the data down to your machine, run inference locally and then write all the inference results back to the DB. My local machine is way slower than the machines that run our DB or our local compute cluster, so ideally I’d like to have a platform which computes “close” to the data.

    The Probabilistic Query Language (or PQL) is a language/tool which I designed two years ago, during an internship with Ralf Herbrich and Thore Graepel, where we had the following goals in mind:

    • Allow for rapid prototyping of graphical models in a relational environment
    • The focus should be on specifying models, not algorithms
    • It should enable large scale execution and bring the computation to the data, rather than the data to the computation

    Using SQL Server, DryadLinq (Map-Reduce for .NET) and Infer.Net, I built a prototype of PQL and tested it on some frequently used models at Microsoft. In this post I want to introduce the PQL language and give a few examples of graphical models in PQL.


    Let’s start with a very simple example where we have a DB with a table containing people’s info and a table with records describing doctor visits for those people. Assume the following relational schema

    [Figure: relational schema for the People and DrVisits tables]
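
    Since the original schema figure isn't reproduced here, below is a minimal SQL sketch of what the two tables could look like; any column names and types not used in the PQL queries below are assumptions.

    -- Hypothetical schema; the real tables may have different columns/types.
    CREATE TABLE People (
        id   INT PRIMARY KEY,
        name NVARCHAR(100)          -- hypothetical person-level attribute
    );

    CREATE TABLE DrVisits (
        id        INT PRIMARY KEY,
        personid  INT REFERENCES People(id),
        weight    FLOAT,            -- the weight measured at this visit
        visitdate DATETIME          -- hypothetical
    );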

    We assume that each person has an unknown weight and that, when they go to the doctor, she measures this weight. Depending on the time of day (say, after a heavy lunch), this estimate could be off a bit. A statistical model to capture these assumptions is to introduce a random variable for the weight of each person in the People table, put a prior on this variable and connect it with the observations in the DrVisits table. So how do we write such a model in PQL?

    PQL is very much like SQL but with two extra keywords: AUGMENT and FACTOR. AUGMENT allows us to add random variables to the DB schema. In the example above we would write

    People = AUGMENT DB.People ADD weight FLOAT

    This essentially defines a “plate” in graphical model speak: for each row in the People table, a random variable over the real numbers called weight is defined.

    The FACTOR keyword in PQL allows us to introduce factors between random variables as well as any other variables in the DB schema. FACTOR follows the relational SQL syntax to specify exactly how to connect variables. To specify a normal prior on the weight variable we could write

    FACTOR Normal(p.weight | 75.0,25.0) FROM People p

    This introduces a normal factor for each row in the People table (the FROM People p part). The final component of our program connects the random variable with observations. In this case, we use the familiar SQL JOIN syntax to specify how to connect rows from the People table to the rows in the DrVisits table. In PQL we write

    FACTOR Normal(v.weight | p.weight, 1.0)
    FROM People p
    JOIN DrVisit v ON p.id = v.personid

    Except for the first line, this is exactly SQL; but instead of running a query, the FACTOR statement describes the “probabilistic augmentation” of the DB schema.

    For the example above, this is it: the PQL program contains five lines of code and can be sent to the DB, which will run inference using EP or variational Bayesian inference. The inference itself can be run either within the database (this was implemented by Tina Palla, who was an intern with us) or on the DryadLinq cluster.
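
    For reference, here are those five lines assembled into the complete program for the weight example:

    People = AUGMENT DB.People ADD weight FLOAT

    FACTOR Normal(p.weight | 75.0,25.0) FROM People p

    FACTOR Normal(v.weight | p.weight, 1.0)
    FROM People p
    JOIN DrVisit v ON p.id = v.personid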


    Another example of PQL is the program describing the TrueSkill ranking system. In this example we assume two-player games, stored using a table of players (called Players) and a table of game outcomes (called PlayerGames). Each game played generates two rows in the PlayerGames table: one for the winner and one for the loser, with a score column specifying who won and who lost. The PQL program for TrueSkill is written below

    Players = AUGMENT DB.Players ADD skill FLOAT;
    PlayerGames = AUGMENT DB.PlayerGames ADD performance FLOAT;

    FACTOR Normal(p.skill | 25.0, 20.0) FROM Players p;

    FACTOR Normal(pg.performance | p.skill, 0.1)
    FROM PlayerGames pg
    JOIN Players p ON pg.player_id = p.player_id;

    FACTOR IsGreater(pgb.performance, pga.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 0;

    FACTOR IsGreater(pga.performance, pgb.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 2;

    There are a lot of features in PQL I haven’t covered in this blog post (like using random variables in a WHERE clause to create mixture models) but I wanted to give you a flavour of what we’ve been working on so far.

    While working on PQL I learned a lot about the state of the art in probabilistic databases and statistical relational learning. I think that, compared to this academic work, PQL does not add many theoretical contributions; our goal is to design a tool which takes statistical relational learning out of the laboratory and into the hands of data mining practitioners.

  • Machine Learning Summer School Application Process Opens

    Just to let all of you know that the application process for the summer school has opened. More info here. The deadline for applications is June 1. The list of confirmed speakers is pretty a-ma-zing if you ask me.

    Hopefully I get to meet some of you in September!

  • ICML & UAI 2008 Accepted Papers

    The accepted papers for ICML 2008 and UAI 2008 are online!

    As one of my own papers got accepted, I am preparing a post with some more background information and maybe some extra plots and pictures.

  • Math.Net Numerics

    I’ve been a fan of doing numerical computation on the .NET platform for a very long time. This interest landed me an internship at Microsoft Research with Don Syme’s team in 2007, where we investigated F#’s suitability for scientific computing. After the internship, I joined the open source community, helping out with writing a kick-ass numerical library for the .NET platform.

    Today, I am quite proud to announce that we are releasing the final beta of our open source project: Math.Net Numerics. Moreover, with this announcement we are also kicking off a competition to find the fastest implementation of matrix multiplication in purely managed code. The winner of this competition will receive $1500 and we will integrate their code into our open source codebase. I’m excited to see some creative coding in the next few weeks!

  • BBC 4 podcasts

    I just discovered some really good podcasts from BBC 4 which I think some people here might enjoy.

    • A Brief History of Mathematics: a discussion about some great mathematicians …
    • In Our Time: very general, one-hour discussions with a few experts. I really enjoyed the episode on “Random & Pseudorandom” from January 13.

  • Microsoft Research PhD Scholarships

    A new round of Microsoft Research PhD scholarships is being organized. I’ve enjoyed being on the scholarship for the past two years: it’s been a great opportunity to meet new researchers at Microsoft and other students at the PhD event organized by MSR Cambridge.

    For all you upcoming machine learning rock-stars out there: talk to your advisors, they will have to apply for you but you can probably help them a bit!

  • ggplot2 and Subway


    The following article caught my eye a few weeks ago: Subway Set to Overtake McD's in Omnipresence. As I am trying to learn a little bit of ggplot2 (and loving it so far!) I thought it would be fun to try and create some visuals to go with this claim.

    I used one of Microsoft’s restaurant datasets and did a simple substring match on “subway”, returning the latitude and longitude of each match. Using the following lines of ggplot2 and R code

    library(ggplot2)
    library(maps)  # provides map() for the US state outlines

    states <- data.frame(map("state", plot=FALSE)[c("x","y")])
    colnames(states) <- c("Lon","Lat")
    # 'subway' is the data frame of matched Subway locations (columns Lon, Lat)
    ggplot(states, aes(x=Lon, y=Lat)) + geom_path() +
      geom_point(alpha=0.6, size=0.3, data=subway)

    we get a cool picture showing all of the metropolitan areas of the United States.

    [Figure: US map with Subway locations plotted]


    If you click on the image to zoom in you will be able to discern major highways as well. Subway is literally everywhere.

  • Popularizing Machine Learning

    Three weeks ago I gave a presentation on probabilistic modelling to an audience of data mining practitioners. These people know about machine learning, they use it every day, but their main concern is to analyze data: real data!

    The experience taught me a valuable lesson which I hadn't picked up from interacting with the academic community: probabilistic models are hard. Even for people who are very close to the machine learning community (data mining practitioners), probabilistic models are a very different (new?) way of thinking.

    The whole idea of building a generative story for your data and then using Bayes' rule to “invert” the model given some dataset has become second nature to me. Nonetheless, I (we?) shouldn’t forget that it took statistics a long time to start thinking about modelling in this sense. Hence I now realize that for outsiders the framework of probabilistic modelling is a highly non-trivial concept to grasp.

    In this context I am obliged to share a blog post by Jeff Moser. I have never seen such a great explanation of a non-trivial probabilistic model deployed at very large scale on Xbox Live: “Computing Your Skill”, a description of the TrueSkill ranking model. Very, very well done Jeff!

  • The Unreasonable Effectiveness of Data

    Alon Halevy, Peter Norvig, and Fernando Pereira (from the Google) just wrote an intriguing article called “The Unreasonable Effectiveness of Data” in IEEE Intelligent Systems. They argue a few points: 1) use more data to increase the performance of our learning algorithms, and don’t make your models too complex; 2) the semantic web is not the right approach: it’s too expensive to “label” the web with its semantics, so we need to learn it. However, as a non-parametric Bayes person, this is what got me very excited:

    So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.
