Hello, Science!

  • PQL – A Probabilistic Query Language

    At MSR and Bing, when we do machine learning on smaller datasets (say anything below 100GB) we often use relational databases and SQL. Throw in a little bit of Excel and R and you’ve got yourself a very powerful platform for exploratory data analysis.

    After the exploratory phase, we often build statistical models (adPredictor, TrueSkill, Matchbox, …) to discover more complex structure in the data. Infer.Net helps us prototype these graphical models, but unfortunately it forces you into a mode where you first create a binary that performs inference, pull all the data onto your machine, run inference locally and then write all the inference results back to the DB. My local machine is way slower than the machines which run our DB or our compute cluster, so ideally I’d like to have a platform which computes “close” to the data.

    The Probabilistic Query Language (or PQL) is a language/tool which I designed two years ago, during an internship with Ralf Herbrich and Thore Graepel, where we had the following goals in mind:

    • Allow for rapid prototyping of graphical models in a relational environment
    • The focus should be on specifying models, not algorithms
    • It should enable large scale execution and bring the computation to the data, rather than the data to the computation

    Using SQL Server, DryadLINQ (MapReduce for .NET) and Infer.Net I built a prototype of PQL and tested it on some frequently used models at Microsoft. In this post I want to introduce the PQL language and give a few examples of graphical models in PQL.


    Let’s start with a very simple example where we have a DB with a table containing people’s info and a table with records describing doctor visits for those people. Assume the following relational schema:

    [Schema diagram: People(id, …) and DrVisits(personid, weight, …)]

    We assume that people have an unknown weight and when they go to the doctor, she measures this weight. Depending on the time of day (after a heavy lunch), this measurement could be off a bit. A statistical model to capture these assumptions is to introduce a random variable for the weight of each person in the People table, put a prior on this variable and connect it with the observations in the DrVisits table. So how do we write such a model in PQL?

    PQL is very much like SQL but with two extra keywords: AUGMENT and FACTOR. AUGMENT allows us to add random variables to the DB schema. In the example above we would write

    People = AUGMENT DB.People ADD weight FLOAT

    This essentially defines a “plate” in graphical model speak: for each row in the People table, a random variable over the real numbers called weight is defined.

    The FACTOR keyword in PQL allows us to introduce factors between random variables as well as any other variables in the DB schema. FACTOR follows relational SQL syntax to specify exactly how to connect variables. To specify a normal prior on the weight variable we could write

    FACTOR Normal(p.weight | 75.0,25.0) FROM People p

    This introduces a normal factor for each row in the People table (the FROM People p part). The final component of our program connects the random variable with observations. In this case, we use the familiar SQL JOIN syntax to specify how to connect rows from the People table to the rows in the DrVisits table. In PQL we write

    FACTOR Normal(v.weight | p.weight, 1.0)
    FROM People p
    JOIN DrVisits v ON p.id = v.personid

    Except for the first line this is exactly SQL; instead of running a query, the FACTOR statement describes the “probabilistic augmentation” of the DB schema.

    For the example above, this is it: the PQL program contains five lines of code and can be sent to the DB. It will then run inference using expectation propagation (EP) or variational Bayesian inference. The inference itself can run either within the database (this was implemented by Tina Palla, who was an intern with us) or on the DryadLinq cluster.
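
    For intuition, here is a minimal Python sketch (not PQL) of the posterior that inference computes for this weight model. It assumes the second parameter of Normal is a variance (the PQL convention isn’t stated above) and uses the exact conjugate Normal-Normal update instead of EP or VB:

```python
# Hedged sketch: the exact posterior over one person's weight, given
# their doctor-visit measurements. Assumes Normal(mean, variance)
# parameterization, matching Normal(p.weight | 75.0, 25.0) and
# Normal(v.weight | p.weight, 1.0) in the PQL program above.

def posterior_weight(observations, prior_mean=75.0, prior_var=25.0, noise_var=1.0):
    """Exact Normal-Normal conjugate posterior: (mean, variance)."""
    n = len(observations)
    post_prec = 1.0 / prior_var + n / noise_var  # precisions add
    post_mean = (prior_mean / prior_var + sum(observations) / noise_var) / post_prec
    return post_mean, 1.0 / post_prec

# three measured weights pull the posterior away from the 75.0 prior mean
mean, var = posterior_weight([80.0, 82.0, 81.0])
```

    With more visits the posterior variance shrinks, which is exactly what connecting the plate of observations to the single weight variable buys you.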


    Another example of PQL is the program describing the TrueSkill ranking system. In this example we assume two-player games stored using a table of players (called Players) and a table of game outcomes (called PlayerGames). Each game played generates two rows in the PlayerGames table: one for the winner and one for the loser, with a score column specifying who won and who lost. The PQL program for TrueSkill is written below

    Players = AUGMENT DB.Players ADD skill FLOAT;
    PlayerGames = AUGMENT DB.PlayerGames ADD performance FLOAT;

    FACTOR Normal(p.skill | 25.0, 20.0) FROM Players p;

    FACTOR Normal(pg.performance | p.skill, 0.1)
    FROM PlayerGames pg
    JOIN Players p ON pg.player_id = p.player_id;

    FACTOR IsGreater(pgb.performance, pga.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 0;

    FACTOR IsGreater(pga.performance, pgb.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 2;
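
    To make the model concrete, here is a hedged Python sketch of the generative story this program encodes: a latent skill per player, a noisy per-game performance, and the constraint that the winner’s performance is greater. The variance parameterization is an assumption on my part:

```python
# Hedged sketch of the TrueSkill generative model above: skill ~
# Normal(25.0, 20.0), performance ~ Normal(skill, 0.1), and the
# IsGreater factor says the winner's performance exceeds the loser's.
import random

def simulate_game(skill_a, skill_b, perf_var=0.1):
    """Return (score_a, score_b): 2 for the winner, 0 for the loser,
    mirroring the score values in the WHERE clauses of the PQL program."""
    perf_a = random.gauss(skill_a, perf_var ** 0.5)
    perf_b = random.gauss(skill_b, perf_var ** 0.5)
    return (2, 0) if perf_a > perf_b else (0, 2)
```

    Inference runs this story in reverse: given the observed score columns, EP sends messages through the IsGreater factors to update the posterior over each player’s skill.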

    There are a lot of features in PQL I haven’t covered in this blog post (like using random variables in a WHERE clause to create mixture models) but I wanted to give you a flavour of what we’ve been working on so far.

    While working on PQL I learned a lot about the state of the art in probabilistic databases and statistical relational learning. Compared to this academic work, PQL does not add many theoretical contributions; our goal is to design a tool which takes statistical relational learning out of the laboratory and into the hands of data mining practitioners.

  • London Machine Learning Meetup... and joining a startup.

    Three months ago I decided to end my employment at Microsoft. I had a great time, met loads of interesting people and will definitely miss being in the Cambridge research lab (as well as the occasional trip to Seattle).
    Next up I am joining a startup company called Rangespan. I met the founders of Rangespan at Silicon Milkroundabout in April this year. When I got talking to them it was immediately obvious there was a hell of a lot of machine learning to be done there.
    Rangespan is essentially an e-commerce platform which organizes a marketplace for suppliers to sell products to big retailers. On the one hand, Rangespan is trying to build the biggest catalog in the world. We're doing so by fusing many different information sources (supplier feeds, manufacturer feeds, crowdsourced data, crawled data). The three big learning problems in this space are:

    1. Matching: out of millions of product descriptions (semi-structured & textual), how do you recognize which descriptions talk about the same product? Record linkage is the keyword here, but making it work at large scale and under very tight precision and recall requirements is a challenge.
    2. Reconciliation: when you have 100 different descriptions of a product and you want to construct the best possible description of the product, how do you do that? I think the speech recognition and machine translation communities have a few neat tricks up their sleeves to be applied here: in the abstract, if we think of each of the 100 descriptions as the output of a noisy channel with the unknown ground truth description as the input, the challenge is to invent a prior and a model for the noisy channel. Machine learning baby!
    3. Information extraction: the data we are gathering is only semi-structured and most often marketing text. The question now becomes: how do we figure out from a USB stick description that the product has a "4GB" capacity? How do we figure out that "4GB" is the same as "four gigabytes", and so on? There's loads of great work on information extraction which we can build on; the good news is that our domain is restricted (physical goods), the bad news is that the domain is still quite large (thousands of categories, thousands of attributes).
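
    The noisy-channel view of reconciliation can be sketched in a few lines: score each candidate "true" description by how plausibly it could have been corrupted into every observed description. The channel model below (log-likelihood proportional to negative edit distance) is purely illustrative, not what Rangespan actually runs:

```python
# Hedged sketch of noisy-channel reconciliation: pick the candidate
# truth that best explains all observed descriptions under a toy
# channel model based on edit distance.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def reconcile(observed, candidates, noise=0.5):
    """Return the candidate maximizing the sum of channel log-likelihoods
    (a flat prior over candidates is assumed for simplicity)."""
    def log_score(truth):
        return sum(-noise * edit_distance(truth, o) for o in observed)
    return max(candidates, key=log_score)
```

    A real channel model would be learned from data and a real prior would favour well-formed descriptions, but the decision rule stays the same: argmax of prior times channel likelihoods.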
    Besides the catalog, Rangespan is also building an order pipeline. One crucial component of that pipeline is the system that assigns retailers' orders to suppliers. The idea here is that for each product we have a pretty good view of the supply curve (which supplier sells how many items at what price). We also have a pretty good view of the demand curve: by aggregating many different retailers we can predict what the demand for a product will be. This allows us to make very quantitative decisions about how to choose the price for that product to optimize metrics like user experience. In this area a mix of statistics (time series analysis), decision theory, econometrics and optimization are crucial ingredients… together with a ton of data, of course.
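
    The pricing decision above can be sketched as a tiny optimization: given a supplier cost and an estimated demand curve, pick the price maximizing expected profit over a grid. The linear demand curve here is a stand-in assumption for illustration only:

```python
# Hedged sketch of the pricing decision: grid search for the price that
# maximizes expected profit (price - cost) * demand(price). The demand
# function is an illustrative assumption, not an estimated curve.

def best_price(cost, demand, prices):
    """demand: maps a price to expected units sold."""
    return max(prices, key=lambda p: (p - cost) * demand(p))

# illustrative linear demand: 100 units at price 0, none at price 50
demand = lambda p: max(0.0, 100.0 - 2.0 * p)
price = best_price(cost=10.0, demand=demand, prices=[p / 2 for p in range(0, 101)])
```

    In practice both curves are uncertain forecasts, which is where the time series analysis and decision theory mentioned above come in.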
    If this all sounds like music to your ears, then let us know. We're hiring!
    Joining a startup also meant I am now working in London. One of the coolest things I discovered about being in London is the great number of meetups people are organizing. From MongoDB office hours (thank you for all the help 10gen!) to the Hadoop and Big Data meetups. Although big data is crucial at Rangespan, we think clever algorithms (as in machine learning) are as crucial to building successful insights from data. As such we are kicking off the Machine Learning Meetup as of January 2012. We'll put some fun talks together with the London machine learning community and probably do some study groups on the Stanford online learning courses.
    If you live close by and haven't yet signed up, you can find all meetup details here.

  • From Undirected Grad to Informaniac

    What drives a man to change the name of his blog? A trademark dispute? A mid-life crisis? None of those actually!

    After three amazing years, I am trading the wonderful machine learning group for an exciting new adventure with Microsoft FUSE Labs. As I write this, my laptop is crunching away at the last few experiments to be included in my thesis while a printed version of my thesis text is awaiting final editing.

    It has been an interesting few months as I debated moving into a postdoc or pursuing one of the amazing opportunities available in industry; in the end I decided to join FUSE Labs as an applied researcher. FUSE Labs is an incubation lab at Microsoft that sits in between Microsoft Research and the product groups. My new job allows me to collaborate with people (in or outside of Microsoft) on research while giving me the freedom to also spend time building working prototypes and demos. As far as I am concerned: the perfect mix of science and engineering.

    Although I will soon be a graduate student no more, I don’t want this blog to die but rather take it to the next level. The foreseeable future will involve more experiments on machine learning, data mining and information processing and I should be able to report much of this to everyone who cares to listen!

    welcome to www.informaniac.net and stay tuned …