Hello, Science!

  • The Elements of Statistical Learning

    The Elements of Statistical Learning is an absolute classic for anyone wanting to do statistics/machine learning/data mining. I read that the second edition was out and was debating whether I should spend the money on the new edition. Via John Cook I learned that the book is available as a PDF from the authors' website. DOUBLE WIN: a) I’ve already paid once and get the upgrade for free, b) I now have a way to search the book electronically.

    I also found out today that Koller and Friedman have just released their much-anticipated book Probabilistic Graphical Models from MIT Press. At a lengthy 1208 pages, this should provide enough reading for a few nights!

  • Gold's Theorem

    After seeing this amazing talk by Josh Tenenbaum on videolectures.net, I started reading up on some very cool stuff at the intersection of machine learning and cognitive science. This brought me to read about Gold's theorem and the poverty of the stimulus. Very roughly, Gold's theorem says that a learner (be it a child or a computer) cannot "learn" a language by only seeing sentences from the language she has to learn. Some people use this theorem to make the following argument: a toddler only hears sentences from the language she is learning; she never gets to hear "wrong" sentences (as in, not in the language). Hence, since by Gold's theorem this toddler cannot learn the language, it must be innate: language abilities must be wired into our brains in some way. Gold's Theorem and Cognitive Science, by Kent Johnson, is a very enjoyable read for more background on Gold's theorem and how it applies to the question of language acquisition.

    Johnson's paper mentions something that I had never thought about: according to Morgan, a child acquires language after hearing about 4 million sentences. Now think about how many sentences we have access to for training our NLP algorithms. This is orders of magnitude more than a person ever gets to hear, and yet I would say we are far from building a computer system that can manipulate language as accurately as humans. From a Bayesian perspective, this could translate into assuming that children have a really good prior from which they start when learning language. If the Bayesian way is the right way to look at this question, I really wonder how humans acquire this prior: how much is wired up in our brains, how much is influenced by our sensory system, ... ?

  • PQL–A Probabilistic Query Language

    At MSR and Bing, when we do machine learning on smaller datasets (say anything below 100GB) we often use relational databases and SQL. Throw in a little bit of Excel and R and you’ve got yourself a very powerful platform for exploratory data analysis.

    After the exploratory phase, we often build statistical models (adPredictor, TrueSkill, Matchbox, …) to discover more complex structures in the data. Infer.Net helps us prototype these graphical models, but unfortunately it forces you to work in a mode where you first create a binary that performs inference, suck out all data to your machine, run inference locally and then write all inference results back to the DB. My local machine is way slower than the machines which run our DB or our local compute cluster so ideally I’d like to have a platform which computes “close” to the data.

    The Probabilistic Query Language (or PQL) is a language/tool which I designed two years ago, during an internship with Ralf Herbrich and Thore Graepel, where we had the following goals in mind:

    • Allow for rapid prototyping of graphical models in a relational environment
    • The focus should be on specifying models, not algorithms
    • It should enable large scale execution and bring the computation to the data, rather than the data to the computation

    Using SQL Server, DryadLinq (Map-Reduce for .NET) and Infer.Net I built a prototype of PQL and tested it on some frequently used models at Microsoft. In this post I want to introduce the PQL language and give a few examples of graphical models in PQL.


    Let’s start with a very simple example where we have a DB with a table containing people’s info and a table with records describing doctor visits for those people. Assume the following relational schema

    [Figure: relational schema showing the People and DrVisits tables]

    We assume that people have an unknown weight, and when they go to the doctor, she measures this weight. Depending on the time of day (after a heavy lunch, say), this measurement could be off a bit. A statistical model to capture these assumptions is to introduce a random variable for the weight of each person in the People table, put a prior on this variable and connect it with the observations in the DrVisits table. So how do we write such a model in PQL?

    PQL is very much like SQL but with two extra keywords: AUGMENT and FACTOR. AUGMENT allows us to add random variables to the DB schema. In the example above we would write

    People = AUGMENT DB.People ADD weight FLOAT

    This essentially defines a “plate” in graphical model speak: for each row in the People table, a random variable over the real numbers called weight is defined.

    The FACTOR keyword in PQL allows us to introduce factors between random variables as well as any other variables in the DB schema. FACTOR follows the relational SQL syntax to specify exactly how to connect variables. To specify a normal prior on the weight variable we could write

    FACTOR Normal(p.weight | 75.0,25.0) FROM People p

    This introduces a normal factor for each row in the People table (the FROM People p part). The final component of our program connects the random variable with observations. In this case, we use the familiar SQL JOIN syntax to specify how to connect rows from the People table to the rows in the DrVisits table. In PQL we write

    FACTOR Normal(v.weight | p.weight, 1.0)
    FROM People p
    JOIN DrVisits v ON p.id = v.personid

    Except for the first line this is exactly SQL; instead of performing a query, the FACTOR statement describes the "probabilistic augmentation" of the DB schema.

    For the example above, this is it: the whole PQL program contains five lines of code and can be sent to the DB. It will run inference by performing expectation propagation (EP) or variational Bayesian inference. The inference itself can be run either within the database (this was implemented by Tina Palla, who was an intern with us) or on the DryadLinq cluster.
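
    The weight model above is just a conjugate normal-normal model, so for a single person we can sketch the posterior that inference has to recover in a few lines of Python. This is my own illustrative sketch, not PQL's implementation, and it assumes the second argument of Normal(· | m, v) is a variance.

```python
def posterior_weight(observations, prior_mean=75.0, prior_var=25.0, noise_var=1.0):
    """Posterior over one person's weight given noisy doctor-visit
    measurements, via the conjugate normal-normal update."""
    precision = 1.0 / prior_var + len(observations) / noise_var
    mean = (prior_mean / prior_var + sum(observations) / noise_var) / precision
    return mean, 1.0 / precision

# Three visits: the posterior mean sits between the prior mean (75)
# and the average measurement (about 81.3).
mean, var = posterior_weight([82.0, 80.5, 81.5])
```

    Because everything is Gaussian here, both EP and variational inference would recover exactly this closed-form posterior.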


    Another example of PQL is the program describing the TrueSkill ranking system. In this example we assume two-player games stored using a table of players (called Players) and a table of game outcomes (called PlayerGames). Each game played generates two rows in the PlayerGames table: one for the winner and one for the loser, with a score column specifying who won and who lost. The PQL program for TrueSkill is written below

    Players = AUGMENT DB.Players ADD skill FLOAT;
    PlayerGames = AUGMENT DB.PlayerGames ADD performance FLOAT;

    FACTOR Normal(p.skill | 25.0, 20.0) FROM Players p;

    FACTOR Normal(pg.performance | p.skill, 0.1)
    FROM PlayerGames pg
    JOIN Players p ON pg.player_id = p.player_id;

    FACTOR IsGreater(pgb.performance, pga.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 0;

    FACTOR IsGreater(pga.performance, pgb.performance)
    FROM PlayerGames pga
    JOIN PlayerGames pgb ON pga.game_id = pgb.game_id
    WHERE pga.player_id < pgb.player_id AND pga.score = 2;
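
    For intuition, here is the generative story this PQL program encodes, sketched as a small Python simulation (my own illustrative code, not part of PQL): skills are drawn from the Normal(25, 20) prior, per-game performances from Normal(skill, 0.1), and the IsGreater factors assert that the winner's performance exceeds the loser's. The score convention (2 = win, 0 = loss) is taken from the WHERE clauses above; treating the second Normal parameter as a variance is an assumption.

```python
import math
import random

def play_game(rng, game_id, player_ids=("a", "b"),
              mu=25.0, sigma2=20.0, beta2=0.1):
    """Sample one game: draw a skill per player, a noisy performance per
    player, and emit the two PlayerGames rows the PQL program expects."""
    perfs = {}
    for pid in player_ids:
        skill = rng.gauss(mu, math.sqrt(sigma2))         # Normal(25, 20) prior
        perfs[pid] = rng.gauss(skill, math.sqrt(beta2))  # Normal(skill, 0.1)
    winner = max(perfs, key=perfs.get)
    return [{"game_id": game_id, "player_id": pid, "performance": perf,
             "score": 2 if pid == winner else 0}         # 2 = win, 0 = loss
            for pid, perf in perfs.items()]

rows = play_game(random.Random(0), game_id=1)
# By construction the IsGreater factors hold: the winner out-performed the loser.
assert max(rows, key=lambda r: r["performance"])["score"] == 2
```

    Inference in TrueSkill runs this story in reverse: given the score columns, it infers posteriors over the unobserved skill variables.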

    There are a lot of features in PQL I haven’t covered in this blog post (like using random variables in a WHERE clause to create mixture models) but I wanted to give you a flavour of what we’ve been working on so far.

    While working on PQL I learned a lot about the state of the art in probabilistic databases and statistical relational learning. Compared to this academic work, PQL does not add many theoretical contributions; our goal is to design a tool which takes statistical relational learning out of the laboratory and into the hands of data mining practitioners.

  • Machine Learning Summer School Application Process Opens

    Just to let all of you know that the application process for the summer school has opened. More info here. The deadline for applications is June 1. The list of confirmed speakers is pretty a-ma-zing if you ask me.

    Hopefully I get to meet some of you in September!

  • ICML & UAI 2008 Accepted Papers

    The accepted papers for ICML 2008 and UAI 2008 are online!

    As one of my own papers got accepted, I am preparing a post with some more background information and maybe some extra plots and pictures.

  • Math.Net Numerics

    I’ve been a fan of doing numerical computation on the .NET platform for a very long time. This interest landed me an internship at Microsoft Research with Don Syme’s team in 2007, where we investigated F#’s suitability for scientific computing. After the internship, I joined the open source community, helping out with writing a kick-ass numerical library for the .NET platform.

    Today, I am quite proud to announce that we are releasing the final beta of our open source project: Math.Net Numerics. Moreover, with this announcement, we are also kicking off a competition to find the fastest implementation of matrix multiplication in purely managed code. The winner of this competition will receive $1,500 and we will integrate their code into our open source codebase. I’m excited to see some creative coding in the next few weeks!
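
    To make the competition concrete, here is the naive O(n^3) baseline, sketched in Python purely for illustration (actual entries have to be managed .NET code). Beating it is mostly about memory behavior: loop reordering, blocking for cache, and avoiding bounds checks rather than clever math.

```python
def matmul(a, b):
    """Naive triple-loop multiply of two matrices stored as lists of rows:
    the baseline any competition entry should comfortably beat."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must agree"
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for t in range(k):
                s += a[i][t] * b[t][j]   # dot product of row i and column j
            out[i][j] = s
    return out

matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19.0, 22.0], [43.0, 50.0]]
```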

  • BBC 4 podcasts

    I just discovered some really good podcasts from BBC 4 which I think some people here might enjoy.

    • A Brief History of Mathematics: a discussion about some great mathematicians …
    • In Our Time: very general, one-hour discussions with a few experts. I really enjoyed the episode on “Random & Pseudorandom” from January 13.

  • Microsoft Research PhD Scholarships

    A new round of Microsoft Research PhD scholarships is being organized. I’ve enjoyed being on the scholarship for the past two years: it’s been a great opportunity to meet new researchers at Microsoft and other students at the PhD event organized by MSR Cambridge.

    For all you upcoming machine learning rock-stars out there: talk to your advisors, they will have to apply for you but you can probably help them a bit!

  • ggplot2 and Subway

    The following article caught my eye a few weeks ago: Subway Set to Overtake McD's in Omnipresence. As I am trying to learn a little bit of ggplot2 (and loving it so far!) I thought it would be fun to try and create some visuals to go with this claim.

    I used one of Microsoft’s restaurant datasets and did a simple substring match on “subway”, returning the latitude and longitude of each match. Using the following lines of R and ggplot2 code

    library(maps)     # provides map()
    library(ggplot2)

    states <- data.frame(map("state", plot=FALSE)[c("x","y")])
    colnames(states) <- c("Lon","Lat")
    # trailing "+" keeps the ggplot expression in a single statement
    ggplot(states, aes(x=Lon, y=Lat)) + geom_path() +
      geom_point(alpha=0.6, size=0.3, data=subway)

    we get a cool picture showing all of the metropolitan areas of the United States.

    [Figure: map of the US state outlines with a dot for every Subway location]


    If you click on the image to zoom in you will be able to discern major highways as well. Subway is literally everywhere.

  • Popularizing Machine Learning

    Three weeks ago I gave a presentation on probabilistic modelling to an audience of data mining practitioners. These people know about machine learning, they use it every day, but their main concern is to analyze data: real data!

    The experience taught me a valuable lesson which I hadn’t come across by interacting with the academic community: probabilistic models are hard. Even for people who are very close to the machine learning community (data mining), probabilistic models are a very different (new?) way of thinking.

    The whole idea of building a generative story for your data and then using Bayes' rule to “invert” the model given some dataset has become second nature to me. Nonetheless, I (we?) shouldn’t forget that it took statistics a long time to think about modelling in this sense. Hence I now realize that for outsiders the framework of probabilistic modelling is a highly non-trivial concept to grasp.

    In this context I am obliged to share a blog post by Jeff Moser. I have never seen such a great explanation of a non-trivial probabilistic model that is deployed at very large scale on Xbox Live: “Computing Your Skill”, a description of the TrueSkill ranking model. Very, very well done, Jeff!

  • The Unreasonable Effectiveness of Data

    Alon Halevy, Peter Norvig, and Fernando Pereira (from the Google) just wrote an intriguing article called The Unreasonable Effectiveness of Data in IEEE Intelligent Systems. They argue a few points: 1) use more data to increase the performance of our learning algorithms, and don’t make your models too complex; 2) the semantic web is not the right approach: it’s too expensive to “label” the web with its semantics, so we need to learn it. However, for a non-parametric Bayes person, this is what got me very excited:

    So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.

  • Machine Learning lecture series on videolectures.net

    Our machine learning group organizes a series called the Advanced Tutorial Lecture Series on Machine Learning where we get to see some amazing researchers talk about recent developments in machine learning. Karsten, a postdoc in our group, has been so kind as to film all the lectures, and slowly but surely these videos are finding their way onto videolectures.net.

    Currently on videolectures.net are videos on:

    • Applications of Group Theory to Machine Learning by Risi Kondor; this math heavy lecture was cool in that Risi explained how tools from abstract algebra and harmonic analysis can be used in Machine Learning too,
    • Spectral Clustering by Arik Azran; our local spectral methods expert shared his insight into how spectral methods work.

    Coming up soon is a lecture by David MacKay on error correcting codes and how they can be decoded using local message passing algorithms. David is a very good speaker and his lecture is certainly in my top 5 of excellent machine learning talks.

  • If pigs can whistle, then horses can fly... or... De Finetti's Theorem

    We talked about the concept of exchangeability a few weeks ago. I presented three urn models, two of which defined exchangeable probability distributions. The main idea I wanted to convey using the first and second urn models was that exchangeability is not just independence: urn model 2 does not look exchangeable at first sight but some simple algebra convinced us that it is.

    Today I want to continue this discussion with a theorem that is strongly related to exchangeability: De Finetti's theorem. The theorem goes as follows: say we have an infinite sequence of binary random variables X1, X2, ... (say the colors of the balls in our urn model) and this sequence is exchangeable; then there exists a random variable theta with some probability distribution P(theta)

    such that

    p(X1, ..., Xn) = integral of [ product over i of p(Xi | theta) ] dP(theta).

    In other words, if our sequence of random variables is exchangeable, then it really is a sample from a mixture model with mixing distribution P(theta).

    This kind of theorem reminds me of a quote I once heard in a complexity theory talk: if pigs can whistle, then horses can fly. What do I mean by this? Assuming that a set of random variables is exchangeable doesn't sound like a very big assumption: e.g. if you plan to cluster something like the MNIST digit recognition dataset, there is nothing a priori that distinguishes one image from another; i.e. you could assume they are exchangeable. However, once you make the exchangeability assumption, De Finetti's theorem suddenly says that your datapoints really are part of a hierarchical Bayesian model. In other words: by making the relatively innocent exchangeability assumption, De Finetti's theorem imposes the graphical model structure upon you. A weak assumption implies a rather strong consequence.

    There is a catch, however: the theorem does not specify what kind of random variable theta is, nor what its distribution P(theta) looks like. As a matter of fact, theta could be an infinite-dimensional random variable (which would lead us into the realm of nonparametric Bayesian models such as the Dirichlet Process, but that's for another time). For simple models such as urn model 2 we can compute the distribution, though: theta turns out to be a Beta-distributed random variable.
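
    For urn model 2 we can check this claim numerically. Below is a small Python sketch of my own: the probability of a particular draw sequence from the urn equals the corresponding Beta-Bernoulli mixture probability, using the identity that the integral of theta^k (1-theta)^(n-k) against a Beta(r, b) density is B(r+k, b+n-k) / B(r, b).

```python
import math

def polya_seq_prob(r, b, k, n):
    """Probability of one particular sequence of n draws containing k reds,
    from an urn starting with r red and b blue balls that gains one extra
    ball of the drawn colour after each draw (urn model 2)."""
    num = 1.0
    for i in range(k):
        num *= r + i          # red count grows r, r+1, ...
    for j in range(n - k):
        num *= b + j          # blue count grows b, b+1, ...
    den = 1.0
    for t in range(n):
        den *= r + b + t      # urn size grows by one each draw
    return num / den

def beta_mixture_prob(r, b, k, n):
    """The same probability via De Finetti's mixture: theta ~ Beta(r, b),
    draws i.i.d. Bernoulli(theta) given theta."""
    beta_fn = lambda x, y: math.gamma(x) * math.gamma(y) / math.gamma(x + y)
    return beta_fn(r + k, b + n - k) / beta_fn(r, b)

assert math.isclose(polya_seq_prob(2, 3, 2, 5), beta_mixture_prob(2, 3, 2, 5))
```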

    All in all, De Finetti's theorem is an interesting theorem, but it is not clear to me how much practical value it has. There are many extensions and modifications of the theorem; for more information check

    • The Concept of Exchangeability - Jose Bernardo
    • Exchangeability and Related Topics - David Aldous

  • Machine Learning Summer School 2008

    Perfect: the website for the next machine learning summer school is up and running. It will be organized on Porquerolles Island, just off the coast of France. The confirmed speaker list is still a work in progress but looks promising already:

    • Shai Ben-David (University of Waterloo) "Theoretical Foundations of Clustering"
    • Stephane Canu (INSA de Rouen) "Introduction to Kernel Methods"
    • Manuel Davy (INRIA) "Parametric and Non-parametric Bayesian Learning"
    • Pierre Del Moral (Universite de Bordeaux) "On the Foundations and the Applications of i-MCMC Methods"
    • Isabelle Guyon (ClopiNet)
    • Yann LeCun (New York University) "Supervised and Unsupervised Learning with Energy-Based Models"
    • Rich Sutton (University of Alberta) "Reinforcement Learning and Knowledge Representation"
    • Patrick Wolfe (Harvard University) "Overcomplete Representations with Incomplete Data: Learning Bayesian Models for Sparsity"

    I'll definitely try to apply and hope for the best!

  • The Elastic Compute Cloud (Amazon EC2)

    Today I ran into this interesting offering from Amazon called the Elastic Compute Cloud. The idea is that you create an image that consists of your application, libraries and data and upload it to Amazon. They will run the application for you and charge you for the computing resources that you've consumed. I was a little surprised to see how cheap it actually is: $0.10 for every hour that you run a process that consumes no more than 1.7 GB of memory and 160 GB of storage; an extra $0.10 per gigabyte you transfer into their service and an extra $0.18 per gigabyte you transfer out of their service.

    I think this is going to be a very useful platform for machine learning research. EC2 means that large scale map-reduce isn't just for Google anymore. Although I haven't tried it myself, creating the EC2 images should be straightforward, as it is based on Xen virtualization: the virtualization platform out of our very own Cambridge computer lab.

    I wonder why anyone would want to spend time and money on maintaining their own cluster when there is such a cheap alternative available. As soon as I get the chance, I will run an experiment on Amazon EC2 and report on the experience.

    PS Data Wrangling has some good posts on how to use Amazon EC2.

  • Silicon Minds

    The guys at Microsoft Research announced a very exciting competition: the silicon minds challenge. The goal of the competition is to foster novel ideas in the area of game AI.

    Many years ago I wrote a computer game called De Profundis, where I was in charge of (among other things) the game AI. Having since moved on to become an AI researcher, I find it interesting to reminisce and draw some connections.

    On one hand, the game AI field is the perfect arena for AI researchers to try out new ideas. For AI researchers working on agents, planning and human interaction (speech, NLP), I imagine it would be extremely valuable to interact with MMORPGs (Massive Multiplayer Online Role Playing Games). I don't know whether anyone in the research community has done this before, but having an unlimited source of humans to interact with seems like quite the experimental setup. This also applies to virtual worlds like Second Life, of course. AI research has contributed to the game AI field too, so let me highlight two recent projects:

    1. The University of Alberta games group: these guys do some amazing work on several kinds of games. As far as I understand it, most of their efforts are focused on games where the mathematics is in some sense understood: chess, poker, ... By "the mathematics is understood" I mean that with infinite computational capabilities we would be able to solve these games. The U of A group also does some work on AI for real-time strategy games (e.g. Age of Empires). A mathematical analysis of these games is much harder (if possible at all). The AI necessary for these games is much closer to what I would think of as strong AI.
    2. The Applied Games Group at Microsoft Research: the organizers of the Silicon Minds challenge have developed a few innovations for game AI themselves. Their machine learning approach to inferring gamer skill (known as TrueSkill) is used by the Xbox Live service. They have also enhanced Forza Motorsport with a reinforcement learning agent that learns to drive from observing human drivers.

    Unfortunately, the game AI field has very special requirements that prohibit the use of many innovations from the research community. First and foremost, game AI is supposed to make games more fun. More sophisticated agents do not necessarily mean more fun: one can spend a large amount of time making opponents (in first person shooters or racing games) smarter, but if that means the player always loses, he or she might not enjoy the game that much. Also, games are big business, and game engineers want to understand the behavior of their agents. It is unacceptable to release an agent out into the open which, in the middle of a battle, starts to act weird. Hence, game engineers often limit the intelligence of agents to (pre-historic ?!?) methods such as rule based systems and (heuristic) search because they can understand the behavior and debug it more easily. (It would be unfair not to give credit to the people who have applied reinforcement learning and neural networks to games; afaik mostly in the area of racing games.) To get a rough idea about what is hot-or-not in the game AI field, take a look at AI Wisdom.

    One could say: who cares what technique you use; rules and search work incredibly well! Very true. In my humble opinion, the AI/machine learning community has sometimes over-focused on new algorithms and models and too little on building intelligent solutions. Although in fields like robotics, biology and vision our machine learning tools have had a huge impact, I think there are many fields where the AI community does not have a good understanding of how to integrate all our tools into a large working system. Hence, Silicon Minds looks like a promising challenge and I am very excited to see what people come up with.

  • UCI Website Revamped

    First of all, happy new 2008 to all readers!

    UCI hosts a famous collection of datasets at the UCI Machine Learning Repository. Recently, they have completely updated their webpage and are starting to offer new datasets. This is a great service to the machine learning community, but I would like to see us take this one step further: we should match this repository of datasets with a repository of algorithms. This would not only allow us to compare algorithms, but also give us a lot of intuition about the nature of the datasets: a well-understood algorithm that does great on - say - digit classification but performs really poorly on the Wisconsin breast cancer dataset teaches us something about the nature of the data. A recent JMLR paper* calling for more open source machine learning software mentions a project at the University of Toronto called Delve that was meant to do exactly this. Unfortunately, the project seems to have been dead since 2003 or so.

    * The Need for Open Source Software in Machine Learning - Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson; Journal of Machine Learning Research; 8(Oct):2443--2466, 2007.

  • What is... Exchangeability?

    Talking about exchangeability, a friend once commented that exchangeability is "too simple to understand". While it is true that the statement of exchangeability (see below) sounds somewhat trivial, I found that I had absolutely no intuition as to why it is important for machine learning. So after some reading, I present my take on the concept of exchangeability.

    What is Exchangeability?

    Scenario 1. Imagine we have an urn with r red balls and b blue balls. We draw 3 balls from the urn as follows: we pick a random ball, write down its color and put it back in the urn before drawing a new ball. We introduce 3 random variables: A, B, C which denote the color of the first, second and third ball. It is not hard to see that p(A=r, B=b, C=b) = p(A=b, B=r, C=b); in other words, we can exchange the values of the random variables without changing the joint probability. Intuitively, the reason we can exchange the observations is that our random variables are IID (independent and identically distributed).

    Scenario 2. We again pick 3 balls from an urn with r red and b blue balls. We still pick a random ball and note its color, but now we put two balls of that color back in the urn. It may not be obvious that the sequence A=r, B=b, C=b has the same probability as the sequence A=b, B=b, C=r, since the individual probabilities of picking the red ball first or last are completely different: r/(r+b) when it is the first ball versus r/(r+b+2) when it is the last ball (since two blue balls were added in the meantime). Writing down the equations makes it clear that the two sequences are equiprobable:

    p(A=r, B=b, C=b) = r/(r+b) * b/(r+b+1) * (b+1)/(r+b+2)
    p(A=b, B=b, C=r) = b/(r+b) * (b+1)/(r+b+1) * r/(r+b+2)

    It is trivial to generalize this expression to longer sequences. Again, it doesn't matter in what order we pick the balls; the only thing that matters is how many red and how many blue balls we pick. This is reflected in the formula in the sense that the denominator of the probability of a sequence only depends on how long the sequence is. The numerator only needs to know how many balls of each color were drawn. In our example: it only needs to know that there is a first and a second blue ball (contributing b * (b+1) to the numerator) and a first red ball (contributing r).
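
    The order-invariance is easy to confirm with exact arithmetic. A quick Python check of urn model 2 (my own sketch):

```python
from fractions import Fraction
from itertools import permutations

def sequence_prob(colors, r, b):
    """Exact probability of drawing the given colour sequence from an urn
    with r red and b blue balls, returning two balls of the drawn colour
    each time (urn model 2)."""
    counts = {"r": r, "b": b}
    p = Fraction(1)
    for c in colors:
        p *= Fraction(counts[c], counts["r"] + counts["b"])
        counts[c] += 1  # one extra ball of the drawn colour
    return p

# Every ordering of {one red, two blue} draws has the same probability.
probs = {sequence_prob(seq, r=4, b=2) for seq in permutations("rbb")}
assert probs == {Fraction(1, 14)}
```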

    Scenario 3. Both examples above were exchangeable since reordering the values of the random variables didn't change the probability. Let us consider a similar setup where exchangeability does not apply anymore. We again use the urn scheme with r red balls and b blue balls. However, now when we pick a red ball we note its color and simply put it back, but when we pick a blue ball we note its color and put two back. It is easy to see that we cannot exchange the values of the random variables anymore since

    p(A=r, B=b) = r/(r+b) * b/(r+b)

    while

    p(A=b, B=r) = b/(r+b) * r/(r+b+1).

    I think the following definition of exchangeability now becomes much more intuitive: we say a set of n random variables Y1, ..., Yn is exchangeable under a distribution p iff for any permutation pi of the integers 1..n

    p(Y1 = y1, ..., Yn = yn) = p(Y1 = y_pi(1), ..., Yn = y_pi(n)).

    Properties of Exchangeability

    Let us now briefly discuss some consequences of exchangeability, as they show why it is such an important concept. First, we compute the marginal probability of the second draw, p(B=r), under the different scenarios. Under scenario 1 this is trivial: just before the second draw the content of our urn is exactly as it was when we started, hence p(B=r) = r/(r+b). Under scenario 2, after some simple algebra we find that p(B=r) = p(A=b, B=r) + p(A=r, B=r) = r/(r+b). Now here is the exciting part: we shouldn't have done all the algebra; if we are convinced that the random variables are exchangeable under the distribution of scenario 2, we could have acted as if we were computing the marginal probability for the first draw. Formally, since p(A=b, B=r) = p(A=r, B=b), substituting this into the expression for p(B=r) gives p(A=r, B=b) + p(A=r, B=r) = p(A=r) = r/(r+b): we have effectively marginalized out the second draw. This property - being able to reorder the draws - is incredibly useful when computing probabilities.

    More abstractly, here is one way to think of exchangeable sequences. In scenario 2, if a friend just drew a ball from the urn, didn't show it to us and put one extra ball back in the urn, this is not going to make a difference to the probability of our next draw. However, in scenario 3, whether someone drew a ball before us is very important: it drastically changes the probabilities for our next draw. I think this is a very important distinction that sets exchangeable and non-exchangeable distributions apart.

    Although exchangeability and IID variables look very similar, they are not exactly the same. From scenario 1 above, it is easy to see that IID random variables are exchangeable. The converse is not true: in scenario 2, p(A=b, B=r) is not equal to p(A=b) p(B=r), and thus the random variables are not independent.
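
    Scenario 2 makes this distinction concrete: computed exactly, the joint probability is symmetric under swapping the two draws, yet it does not factorize into the product of the marginals. A small Python verification (my own sketch):

```python
from fractions import Fraction

def joint(first, second, r, b):
    """p(A=first, B=second) under urn model 2: draw a ball, then return
    two balls of the drawn colour before the second draw."""
    counts = {"r": r, "b": b}
    p = Fraction(counts[first], counts["r"] + counts["b"])
    counts[first] += 1
    p *= Fraction(counts[second], counts["r"] + counts["b"])
    return p

r, b = 3, 2
# Exchangeable: swapping the two outcomes leaves the joint unchanged ...
assert joint("b", "r", r, b) == joint("r", "b", r, b)
# ... but not independent: the joint differs from the product of marginals.
p_A_b = Fraction(b, r + b)
p_B_r = joint("b", "r", r, b) + joint("r", "r", r, b)
assert p_B_r == Fraction(r, r + b)          # the marginal trick from above
assert joint("b", "r", r, b) != p_A_b * p_B_r
```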

    Exchangeability and Machine Learning

    Exchangeable distributions are very common in machine learning. The most famous modelling assumption for text processing is exactly exchangeability: the bag-of-words model. This modelling assumption states that the probability of a text document depends only on word counts and not on word order. This is exactly the same model as scenario 1 above, except that instead of red and blue balls, we now have words from a fixed vocabulary. Is this a realistic assumption, one may ask? It certainly is not! We don't expect natural language to be exchangeable: the probability of using the word "States" should certainly depend on the word in front of it (i.e. be higher if that word is "United"). But who cares, the bag-of-words assumption works incredibly well...
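
    The bag-of-words claim is easy to make concrete: under a unigram model the document probability is a product of per-word probabilities, so any reordering scores the same. A toy sketch in Python (the vocabulary and probabilities here are made up purely for illustration):

```python
import math
from collections import Counter

def bag_of_words_prob(doc, unigram):
    """Probability of a document under a unigram bag-of-words model:
    a product over word counts, so word order is irrelevant."""
    counts = Counter(doc.lower().split())
    return math.prod(unigram[w] ** c for w, c in counts.items())

# A tiny made-up unigram distribution over a three-word vocabulary.
unigram = {"the": 0.5, "united": 0.2, "states": 0.3}
# Word order is ignored: both orderings get the same probability (0.2 * 0.3).
assert bag_of_words_prob("united states", unigram) == \
       bag_of_words_prob("states united", unigram)
```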

    There are many other exchangeable distributions in common use in machine learning: the Dirichlet Process and its Chinese Restaurant Process equivalent are exchangeable distributions, and the Indian Buffet Process is an exchangeable distribution on binary matrices. Non-exchangeable distributions are also common: many Markov models (e.g. Hidden Markov Models) aren't exchangeable.

    I hope this little overview of exchangeability was useful. I left one important concept out of our discussion so far: De Finetti's theorem. This is a very important theorem that applies to exchangeable sequences and I will discuss it in a future post.

  • Machine Learning News

    From Machine Learning (Theory); a new machine learning mailing list has been created here.

  • Wanted: PostDocs

    ... in our very own Cambridge group. More info here.
