ACL-IJCNLP 2009 and EMNLP 2009 have just finished here in Singapore. As an outsider to the field I had a hard time following many talks but nonetheless enjoyed the conference. The highlight for me was the talk by Richard Sproat who wondered whether there exists a statistical test to check if a series of symbol sequences is actually a language? If this test would exist, we could use it to decide whether the set of symbols known as the Indus Valey Script is actually a language. Very fascinating stuff: I immediately bought “Lost Languages” by Andrew Robinson to learn more about the history of deciphering dead languages.
The paper had some very cool papers; the first one I really liked was Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling by Daichi Mochihashi et al. They build on the work of Yee Whye Teh and Sharon Goldwater who showed that Kneser-Ney language modelling is really an approximate version of a hierarchical Pitman-Yor based language model (HPYLM). The HPYLM starts from a unigram model over a fixed dictionary and hence doesn’t accommodate for out of vocabulary words. Daichi et al extended the HPYLM so that the base distribution is now an character infinity-gram that is itself an HPYLM (over characters). They call this model the nested HPYLM or NPYLM. There is no need for a vocabulary of words in the NPYLM, rather, the HPYLM base distribution is a distribution over arbitrary long strings. In addition the model will perform automatic word segmentation. The results are really promising: from their paper, consider the following unsegmented English text
lastly,shepicturedtoherselfhowthissamelittlesisterofhersw
ould,intheafter-time,beherselfagrownwoman;andhowshe
wouldkeep,throughallherriperyears,thesimpleandlovingh
eartofherchildhood:andhowshewouldgatheraboutherothe
rlittlechildren,andmaketheireyesbrightandeagerwithmany
astrangetale,perhapsevenwiththedreamofwonderlandoo
ngago:andhowshewouldfeelwithalltheirsimplesorrows,an
dndapleasureinalltheirsimplejoys,rememberingherownc
hild-life,andthehappysummerdays. […]
When the NPYLM is trained on this data, the following is found
last ly, she pictured to herself how this same little sis-
ter of her s would, inthe after - time, be herself agrown woman ; and how she would keep, through allher ripery ears, the simple and loving heart of her child hood : and how she would gather about her other little children,and make theireyes bright and eager with many a strange tale, perhaps even with the dream of wonderland of longago : and how she would feel with all their simple sorrow s, and a pleasure in all their simple joys, remember ing her own child - life, and thehappy summerday s.
- Minimized Models for Unsupervised Part-of-Speech Tagging by Sujith Ravi et al.
- Polylingual Topic Models by David Mimno et al.
- Graphical Models over Multiple Strings by Markus Dreyer and Jason Eisner
- Bayesian Learning of a Tree Substitution Grammar by Matt Post and Daniel Gildea