Hello, Science!:
Large Scale ML

  • PQL–A Probabilistic Query Language

    At MSR and Bing, when we do machine learning on smaller datasets (say anything below 100GB) we often use relational databases and SQL. Throw in a little bit of Excel and R and you’ve got yourself a very powerful platform for exploratory data analysis.

    After the exploratory phase, we often build statistical models (adPredictor, TrueSkill, Matchbox, …) to discover more complex structures in the data. Infer.Net helps us prototype these graphical models, but unfortunately it forces you into a mode where you first create a binary that performs inference, pull all the data onto your machine, run inference locally, and then write all the inference results back to the DB. My local machine is way slower than the machines that run our DB or our local compute cluster, so ideally I’d like a platform which computes “close” to the data.

    The Probabilistic Query Language (or PQL) is a language/tool which I designed two years ago, during an internship with Ralf Herbrich and Thore Graepel, where we had the following goals in mind:

    • Allow for rapid prototyping of graphical models in a relational environment
    • The focus should be on specifying models, not algorithms
    • It should enable large scale execution and bring the computation to the data, rather than the data to the computation

    Using SQL Server, DryadLinq (Map-Reduce for .NET) and Infer.Net, I built a prototype of PQL and tested it on some frequently used models at Microsoft. In this post I want to introduce the PQL language and give a few examples of graphical models in PQL.


    Let’s start with a very simple example where we have a DB with a table containing people’s info and a table with records describing doctor visits for those people. Assume the following relational schema:

    [Figure: relational schema with a People table (including an id column) and a DrVisits table (including personid and weight columns)]

    We assume that each person has an unknown weight and that when they go to the doctor, she measures this weight. Depending on the time of day (say, right after a heavy lunch), this measurement could be off a bit. A statistical model to capture these assumptions is to introduce a random variable for the weight of each person in the People table, put a prior on this variable and connect it with the observations in the DrVisits table. So how do we write such a model in PQL?

    PQL is very much like SQL but with two extra keywords: AUGMENT and FACTOR. AUGMENT allows us to add random variables to the DB schema. In the example above we would write

    People = AUGMENT DB.People ADD weight FLOAT

    This essentially defines a “plate” in graphical model speak: for each row in the People table, a random variable over the real numbers called weight is defined.

    The FACTOR keyword in PQL allows us to introduce factors between random variables, as well as between random variables and other columns in the DB schema. FACTOR follows the relational SQL syntax to specify exactly how to connect variables. To specify a normal prior on the weight variable we could write

    FACTOR Normal(p.weight | 75.0, 25.0) FROM People p

    This introduces a normal factor for each row in the People table (the FROM People p part). The final component of our program connects the random variable with observations. In this case, we use the familiar SQL JOIN syntax to specify how to connect rows from the People table to the rows in the DrVisits table. In PQL we write

    FACTOR Normal(v.weight | p.weight, 1.0)
    FROM People p
    JOIN DrVisit v ON p.id = v.personid

    Except for the first line, this is exactly SQL; instead of executing a query, the FACTOR statement describes the “probabilistic augmentation” of the DB schema.

    For the example above, this is it: the complete PQL program is five lines of code and can be sent to the DB. Inference is performed using expectation propagation (EP) or variational Bayesian inference, and it can run either within the database (this part was implemented by Tina Palla, who was an intern with us) or on the DryadLinq cluster.


    Another example of PQL is the program that describes the TrueSkill ranking system. In this example we assume two-player games stored using a table of players (called Players) and a table of game outcomes (called PlayerGames). Each game played generates two rows in the PlayerGames table: one for the winner and one for the loser, with a score column specifying who is the winner and who is the loser. The PQL program for TrueSkill is written below

    Players = AUGMENT DB.Players ADD skill FLOAT;
    PlayerGames = AUGMENT DB.PlayerGames ADD performance FLOAT;

    FACTOR Normal(p.skill | 25.0, 20.0) FROM Players p;

    FACTOR Normal(pg.performance | p.skill, 0.1)
    FROM PlayerGames pg
    JOIN Players p ON pg.player_id = p.player_id;

    FACTOR IsGreater(pgb.performance, pga.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 0;

    FACTOR IsGreater(pga.performance, pgb.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 2;
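
    Reading the two FACTOR statements together: the WHERE clauses appear to use score = 0 to mark the loser’s row and score = 2 to mark the winner’s row, so each IsGreater factor asserts that the winner’s performance exceeds the loser’s (assuming IsGreater(x, y) means x > y).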

    There are a lot of features in PQL I haven’t covered in this blog post (like using random variables in a WHERE clause to create mixture models) but I wanted to give you a flavour of what we’ve been working on so far.

    While working on PQL I learned a lot about the state of the art in probabilistic databases and statistical relational learning. Compared to this academic work, I think PQL does not add many theoretical contributions; our goal is to design a tool that takes statistical relational learning out of the laboratory and into the hands of data mining practitioners.

  • The Unreasonable Effectiveness of Data

    Alon Halevy, Peter Norvig, and Fernando Pereira (from Google) just wrote an intriguing article called “The Unreasonable Effectiveness of Data” in IEEE Intelligent Systems. They argue a few points: 1) use more data to increase the performance of our learning algorithms, and don’t make your models too complex; 2) the semantic web is not the right approach: it’s too expensive to “label” the web with its semantics, so we need to learn it. However, for a non-parametric Bayes person, this is what got me most excited:

    So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.

  • The Elastic Compute Cloud (Amazon EC2)

    Today I ran into this interesting offering from Amazon called the Elastic Compute Cloud. The idea is that you create an image that consists of your application, libraries and data and upload it to Amazon. They will run the application for you and charge you for the computing resources that you've consumed. I was a little surprised to see how cheap it actually is: $0.10 for every hour that you run a process that uses at most 1.7 GB of memory and 160 GB of storage, plus an extra $0.10 per gigabyte you transfer into their service and an extra $0.18 per gigabyte you transfer out of their service.
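
    As a (purely hypothetical) back-of-the-envelope example: a job that occupies one instance for 24 hours and transfers 10 GB in and 5 GB out would cost roughly 24 × $0.10 + 10 × $0.10 + 5 × $0.18 = $4.30.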

    I think this is going to be a very useful platform for machine learning research. EC2 means that large scale map-reduce isn't just for Google anymore. Although I haven't tried it myself, creating the EC2 images should be straightforward as it is based on Xen virtualization, the virtualization platform out of our very own Cambridge computer lab.
    I wonder why anyone would want to spend time and money on maintaining their own cluster when there is such a cheap alternative available. As soon as I get the chance, I will run an experiment on Amazon EC2 and report on the experience.

    PS Data Wrangling has some good posts on how to use Amazon EC2.

  • Parallel Random Number Generators

    We recently got a paper accepted at EMNLP where we elaborate on a point that was touched upon in Mark Johnson’s “Why doesn’t EM find good HMM POS-taggers?”: if we are going to do completely unsupervised POS-tagging, how can non-parametric Bayesian methods help us? There were two parts to our paper: extending the iHMM so it can handle a medium-sized text corpus, and an evaluation of the results to see how the iHMM states correspond to POS tags. I’ll blog about our findings later, but one thing that came up during this research was the following: because of the size (1,000,000 data points) of the corpus we used (the Wall Street Journal section of the Penn Treebank), we wanted to parallelize the forward-filtering backward-sampling (FFBS) part of the inference. Because all the sentences are independent given the transition matrix and output parameters, we can easily run the FFBS in parallel.

    One thing I started thinking about is the following: imagine I run on a machine with 4 cores and I start a thread for each core to process a chunk of the data. Since the FFBS is a sampling algorithm, each of these 4 threads needs to access a random number generator (RNG). I know of two ways to go about this:

    1. We use one RNG and access it from the 4 threads. Since the RNGs I use are stateful, we need to lock them when we query them for a number so their internal state doesn’t get messed up. Locking an RNG can incur a large overhead and easily becomes the bottleneck of the parallelization.
    2. Use one RNG for each thread. How do we initialize them, though? I use the .NET System.Random RNG, which uses a time-dependent seed. I don’t think this is a good idea because the resolution of this time-dependent seed is too coarse; since my 4 threads start at almost the same time, they might all end up using the same seed. An alternative is to initialize a global RNG to produce the seeds for the thread-local RNGs (see the sketch after this list).
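
    To make that last alternative concrete, here is a minimal sketch of the idea in Java rather than .NET (the thread pool, chunk count and master seed are all made up for illustration; this is not the code we actually used): a single master RNG is touched only once, on the main thread, to draw one seed per worker, and each worker then samples exclusively from its own thread-local RNG, so no locking is needed.

    import java.util.Random;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical illustration: per-thread RNGs seeded from one master RNG.
    public class PerThreadRngDemo {
        public static void main(String[] args) throws InterruptedException {
            final int numThreads = 4;

            // The global RNG is used only here, on the main thread, to draw seeds.
            Random seeder = new Random(42);
            long[] seeds = new long[numThreads];
            for (int i = 0; i < numThreads; i++) {
                seeds[i] = seeder.nextLong();
            }

            ExecutorService pool = Executors.newFixedThreadPool(numThreads);
            for (int t = 0; t < numThreads; t++) {
                final long seed = seeds[t];
                final int chunk = t;
                pool.submit(() -> {
                    // Each thread owns its RNG: no shared state, no locking.
                    Random rng = new Random(seed);
                    double sum = 0.0;
                    for (int i = 0; i < 1_000_000; i++) {
                        sum += rng.nextGaussian(); // stand-in for the per-sentence FFBS sampling
                    }
                    System.out.printf("chunk %d: mean draw = %.5f%n", chunk, sum / 1_000_000);
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }

    Fixing the master seed also makes the whole run reproducible regardless of thread scheduling, but it does not by itself guarantee that the per-thread streams are statistically independent, which is exactly the worry below.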

    The second alternative in (2) is what we used, but it does worry me a little bit: how can I know that the streams of random numbers in different threads are going to be ‘good enough’ (whatever that means)? As a non-expert on RNGs I’d like to know whether I’m going to get into trouble with these kinds of approaches. Matlab has a way to generate 3 statistically independent streams using the same RNG with the same seed. This seems like a step in the right direction, but it doesn’t solve my quad-core problem, let alone when we run our algorithm on many more cores in Amazon Elastic MapReduce …

    Any creative solutions out there?
