Hello, Science!:
Conference Reports

  • ECML–PKDD 2010 Highlights

    The city of Barcelona just hosted ECML–PKDD this year and I had the opportunity to go down and check out the latest and greatest, mostly from the European machine learning community. My personal highlights:

    • The conference for me started with a good tutorial by Francis Bach and Guillaume Obozinski. They gave an overview of sparsity: in particular, the various methods that use l1 regularization to induce sparsity.
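
    (As a toy illustration of the “l1 induces sparsity” point, and not something from the tutorial itself, here is a minimal Python sketch using scikit-learn’s Lasso; the synthetic data and the penalty weight alpha=0.1 are made up.)

      import numpy as np
      from sklearn.linear_model import Lasso

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 20))
      w_true = np.zeros(20)
      w_true[:3] = [2.0, -1.5, 0.5]              # only 3 features actually matter
      y = X @ w_true + 0.1 * rng.normal(size=200)

      lasso = Lasso(alpha=0.1).fit(X, y)         # l1-penalized least squares
      print(np.count_nonzero(lasso.coef_))       # most coefficients come out exactly zero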

    The first invited speaker was Hod Lipson from Cornell (and, as far as I know, one of the few people in our field who has given a TED talk). The main portion of Hod’s talk was about his work on symbolic regression. The idea is this: consider the following dataset

      [figure: the example dataset, a noisy, decaying oscillation]

    We can apply our favourite regression method, say a spline, to these points and perform accurate interpolation, perhaps even some extrapolation if we choose the right model. However, the regression function would not give us much insight into why the data looks the way it does. In symbolic regression, the idea is that we try to come up with a symbolic formula which interpolates the data. In the picture above, the formula that generated the data was EXP(-x)*SIN(PI()*x)+RANDBETWEEN(-0.001,0.001) (in Excel). Hod and his graduate students have built a very cool (and free!) app called Eureqa which uses a genetic programming methodology to find a good symbolic expression for a specific dataset. Hod showed us how his software can recover the Hamiltonian and Lagrangian from the measurements of a double pendulum. Absolutely amazing!
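
    (For the curious, here is a rough Python translation of that Excel formula; the x-range and reading RANDBETWEEN as small uniform noise are my own assumptions. The spline fit illustrates the point above: it interpolates the points nicely but carries no hint of the underlying exp/sin form.)

      import numpy as np
      from scipy.interpolate import UnivariateSpline

      x = np.linspace(0.0, 4.0, 60)
      y = np.exp(-x) * np.sin(np.pi * x) + np.random.uniform(-0.001, 0.001, size=x.shape)

      spline = UnivariateSpline(x, y, s=0)       # interpolating spline: fits the points...
      print(spline(2.5))                         # ...but offers no symbolic insight into them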

    Another noteworthy invited speaker was Jürgen Schmidhuber. He tried to convince us that we need to extend the reinforcement learning paradigm: instead of only optimizing the long-term reward it gets from a teacher, the agent would also try to collect “internal reward”. The internal reward is defined as follows: as the agent learns, it builds a better model of the world. Another way to look at this learning is that the agent simply gets better at “compressing” its experience. The reduction in the size of this representation gained from a particular impression is what Jürgen calls the “internal reward”; in other words, it is the difference between the number of bits needed to represent your internal model before and after an impression.
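
    (In symbols, as my own paraphrase rather than Jürgen’s exact notation: if \ell_t(h) denotes the number of bits the agent’s model needs at time t to encode the history h of impressions seen so far, then the internal reward for the latest impression is roughly

      r^{\mathrm{int}}_t \;=\; \ell_{t-1}(h_{\le t}) \;-\; \ell_t(h_{\le t}),

    i.e. the number of bits saved on the same data by the improved “compressor”.)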

    For example, you listen to a new, catchy song: Jürgen says you find it catchy because you’ve never heard anything like it before; it is surprising and hence helps you learn a great deal. This in turn means you’ve just upgraded the “compression algorithm” in your brain, and the amount of improvement is reflected in you experiencing “internal reward”. Listening to a song you’ve heard a million times before doesn’t improve your compression at all; hence, no internal reward.

    I like this idea of internal reward a lot, and as far as I understand it would be fairly easy to test. Unfortunately, I did not see any convincing experiments, so allow me to remain sceptical …

    The main conference was cool and I met some interesting people working on things like probabilistic logic, a topic I desperately need to learn more about. Gjergji gave a talk about our work on crowdsourcing (more details in a separate post). Some things I marked for looking into are:

    • Sebastian Riedel, Limin Yao and Andrew McCallum – “Modeling Relations and Their Mentions Without Labeled Text”: this paper is about how to improve information extraction methods which bootstrap from existing knowledge bases using constraint driven learning techniques.
    • Wannes Meert, Nima Taghipour & Hendrik Blockeel – “First-Order Bayes-Ball”: a paper on how to use the Bayes ball algorithm to figure out which nodes not to ground before running lifted inference.
    • Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Thibault Helleputte, Pierre Dupont – “Expectation Propagation for Bayesian Multi-task Feature Selection”: a paper on how to run EP for spike-and-slab models (see the short note after this list).
    • Edith Law, Burr Settles, and Tom Mitchell – “Learning to Tag from Open Vocabulary Labels”: a nice paper on how to deal with tags: they use topic models to do dimensionality reduction on free text tags and then use that in a maximum entropy predictor to tag new music.
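
    (A quick reminder of what a spike-and-slab prior is: it puts each weight either exactly at zero or under a broad distribution, which is what induces sparsity. A generic form, not necessarily the exact parametrisation used in the paper, is

      p(w_i) \;=\; (1 - \pi)\,\delta_0(w_i) \;+\; \pi\,\mathcal{N}(w_i \mid 0, \sigma^2),

    where \pi is the prior inclusion probability, \delta_0 the point mass at zero (the “spike”) and the Gaussian the “slab”.)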

    I enjoyed most of the industry day as well. I found it quite amusing that the Microsoft folks essentially gave all of Bing’s secrets away in one afternoon: Rakesh Agrawal mentioned the secret sauce behind Bing’s ranker (a neural net), whereas Thore Graepel explained the magic behind the advertisement selection mechanism (probit regression). Videos of these talks should be on videolectures.net soon.

    One last rant: the proceedings are published by Springer, who then ask me to pay for them?!?! I’m still trying to figure out what value they’ve added to the camera-ready copy we sent them a few months ago …

  • ACL 09 & EMNLP 09

    ACL-IJCNLP 2009 and EMNLP 2009 have just finished here in Singapore. As an outsider to the field I had a hard time following many talks, but I nonetheless enjoyed the conference. The highlight for me was the talk by Richard Sproat, who wondered whether there exists a statistical test to check if a series of symbol sequences is actually a language. If such a test existed, we could use it to decide whether the set of symbols known as the Indus Valley Script is actually a language. Very fascinating stuff: I immediately bought “Lost Languages” by Andrew Robinson to learn more about the history of deciphering dead languages.

    The conference had some very cool papers; the first one I really liked was Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling by Daichi Mochihashi et al. They build on the work of Yee Whye Teh and Sharon Goldwater, who showed that Kneser-Ney language modelling is really an approximate version of a hierarchical Pitman-Yor language model (HPYLM). The HPYLM starts from a unigram model over a fixed dictionary and hence doesn’t accommodate out-of-vocabulary words. Daichi et al. extended the HPYLM so that the base distribution is now a character infinity-gram that is itself an HPYLM (over characters); they call this model the nested HPYLM, or NPYLM. There is no need for a vocabulary of words in the NPYLM; rather, the base distribution is a distribution over arbitrarily long strings. In addition, the model performs automatic word segmentation (a rough sketch of the predictive distribution follows after the segmentation example below). The results are really promising: from their paper, consider the following unsegmented English text

    lastly,shepicturedtoherselfhowthissamelittlesisterofhersw
    ould,intheafter-time,beherselfagrownwoman;andhowshe
    wouldkeep,throughallherriperyears,thesimpleandlovingh
    eartofherchildhood:andhowshewouldgatheraboutherothe
    rlittlechildren,andmaketheireyesbrightandeagerwithmany
    astrangetale,perhapsevenwiththedreamofwonderlandoflo
    ngago:andhowshewouldfeelwithalltheirsimplesorrows,an
    dfindapleasureinalltheirsimplejoys,rememberingherownc
    hild-life,andthehappysummerdays. […]

    When the NPYLM is trained on this data, the following is found

    last ly, she pictured to herself how this same little sister of her s would, inthe after - time, be herself agrown woman ; and how she would keep, through allher ripery ears, the simple and loving heart of her child hood : and how she would gather about her other little children,and make theireyes bright and eager with many a strange tale, perhaps even with the dream of wonderland of longago : and how she would feel with all their simple sorrow s, and a pleasure in all their simple joys, remember ing her own child - life, and thehappy summerday s.
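
    (Very roughly, and using the standard hierarchical Pitman-Yor notation rather than anything specific to this paper: the HPYLM predicts a word w after a context h by recursively backing off to the shortened context h',

      P(w \mid h) \;=\; \frac{c_{hw} - d\, t_{hw}}{\theta + c_{h\cdot}} \;+\; \frac{\theta + d\, t_{h\cdot}}{\theta + c_{h\cdot}}\, P(w \mid h'),

    where c and t are the customer and table counts of the Chinese restaurant representation and d, \theta are the discount and strength parameters. In a standard HPYLM the recursion bottoms out at a uniform distribution over a fixed vocabulary; in the NPYLM it bottoms out in a character-level HPYLM instead, which is what lets the model assign probability to arbitrarily long, previously unseen strings and hence discover the word segmentation.)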

    Another one was “A note on the implementation of Hierarchical Dirichlet Processes” by Phil Blunsom et al. The authors discuss how previously used approximations in the collapsed Gibbs sampler for the HDP can turn out to be quite bad, and they propose a more efficient, exact alternative. A few other papers I really enjoyed:
    • Minimized Models for Unsupervised Part-of-Speech Tagging by Sujith Ravi et al.
    • Polylingual Topic Models by David Mimno et al.
    • Graphical Models over Multiple Strings by Markus Dreyer and Jason Eisner
    • Bayesian Learning of a Tree Substitution Grammar by Matt Post and Daniel Gildea

  • NPBayes workshop at ICML

    Yesterday, between ICML and UAI, there was the nonparametric Bayes (NPBayes) workshop organized by Yee Whye Teh, Romain Thibaux, Athanasios Kottas, Zoubin Ghahramani & Michael Jordan. The program was packed with very interesting talks, which, for your convenience, were recorded and will be put online at videolectures.net soon.

    Just after lunch, there was a panel discussion on software for NPBayes. I thought much of the discussion also applied to graphical models in general. The main contenders for general software are (with main pros and cons):

    • the Hierarchical Bayes Compiler by fellow blogger Hal Daumé III. Pros: the software is freely available, under continuing development, has quite a large feature set (including NPBayes components) and a group of active users to show that it actually works. Cons: the language itself is more limited than some of its contenders.
    • Church by Noah Goodman, Vikash Mansinghka, Dan Roy, Keith Bonawitz & Josh Tenenbaum. Pros: very flexible language. Cons: no software available yet.
    • A proposal by Max Welling: Max and one of his students are working on a library for fast inference and are planning to add a graphical UI to design graphical models visually.
    • Infer.NET by Microsoft Research Cambridge. Pros: integrates with many programming languages through .NET, a great variety of inference algorithms. Cons: no free software available yet (a release is scheduled for later this year).

    The workshop ended with a panel discussion on the future and prospects of NPBayes. Here are some of the questions and answers I can remember off the top of my head

    • David Sontag: we’ve seen many Markov chain Monte Carlo algorithms and some mean-field methods for NPBayes. What are the prospects of using different inference algorithms? More specifically, the marginal polytope has taught us that belief propagation, Kikuchi approximations, etc. give us approximations that have a different flavour than mean-field methods. Answers:
      • I think everyone agreed that there is a lot of room to explore these algorithms.
    • Myself: how should we position infinite capacity models compared to finite capacity models: is the main motivation to be able to model uncertainty in the model capacity or are there other reasons to strongly favour NPBayes? Answers:
      • Eric Xing: NPBayes is also interesting because you are solving one problem while for model selection you have to solve many problems at once.
      • David Blei: NPBayes also makes it easier to define prior distributions over combinatorial objects.
    • Vikash Mansinghka: asked about the role of consistency of infinite capacity models, i.e. how much we should care about exchangeability and such. A long discussion followed, from which I just recall that the statisticians on the panel all agreed that this is crucial and that one should really spend time proving these properties.

    Finally, the panel was asked how they see the future of NPBayes. Some answers:

    • Zoubin Ghahramani: believes we will see applications of NPBayes on very large problems,
    • Yee Whye Teh: believes we will see more interesting use of optimization in NPBayes,
    • Lawrence Carin: thinks we will have priors adapted to our custom applications, not just use the few building blocks (DP, IBP,... ) we have now,
    • David Blei: believes we will see nonparametric distributions over more complicated combinatorial objects.