Paper at DIR 2011 — Semi-Supervised Priors for Microblog Language Identification

Simon Carter, Manos Tsagkias, and Wouter Weerkamp

20 January 2011
Keywords: paper, dir, language identification

Abstract

Offering access to information in microblog posts requires successful language identification. Language identification on sparse and noisy data can be challenging. In this paper we explore the performance of a state-of-the-art n-gram-based language identifier, and we introduce two semi-supervised priors to enhance performance at the microblog post level: (i) a blogger-based prior, using previous posts by the same blogger, and (ii) a link-based prior, using the pages linked to from the post. We test our models on five languages (Dutch, English, French, German, and Spanish), using a set of 1,000 tweets per language. Results show that our priors improve accuracy, but that there is still room for improvement.
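To make the idea of a blogger-based prior concrete, the following is a minimal sketch, not the implementation from the paper. It assumes a character trigram model with add-one smoothing as the content-based identifier and a log-linear combination of content score and prior; the abstract specifies neither. All function names and the `weight` parameter are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, padded string."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count table per language from {language: [texts]}."""
    return {lang: Counter(g for t in texts for g in char_ngrams(t, n))
            for lang, texts in samples.items()}


def content_scores(text, models, n=3):
    """Length-normalised log-likelihood of the text under each language model."""
    grams = char_ngrams(text, n)
    # Shared vocabulary size, plus one slot for unseen n-grams (add-one smoothing).
    vocab = len({g for counts in models.values() for g in counts}) + 1
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        log_likelihood = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
        scores[lang] = log_likelihood / len(grams)
    return scores


def blogger_prior(previous_posts, models, n=3):
    """Assumed prior: smoothed share of each language among a blogger's earlier
    posts, where each earlier post is labelled by the content model alone."""
    votes = Counter()
    for post in previous_posts:
        scores = content_scores(post, models, n)
        votes[max(scores, key=scores.get)] += 1
    total = sum(votes.values()) + len(models)
    return {lang: (votes[lang] + 1) / total for lang in models}


def identify(text, models, prior=None, weight=0.5, n=3):
    """Pick the best-scoring language, optionally shifted by a log prior."""
    scores = content_scores(text, models, n)
    if prior is not None:
        scores = {lang: s + weight * math.log(prior[lang]) for lang, s in scores.items()}
    return max(scores, key=scores.get)
```

In use, `train` would be called once on labelled text per language, `blogger_prior` once per blogger on that blogger's earlier posts, and `identify` on each new post with the resulting prior. A link-based prior could be built the same way by passing the text of linked pages instead of earlier posts. The prior matters most for very short or ambiguous posts, where the content score alone separates the languages poorly.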

References

[1] Simon Carter, Manos Tsagkias, and Wouter Weerkamp. 2011. Semi-Supervised Priors for Microblog Language Identification. In DIR 2011: Dutch-Belgian Information Retrieval Workshop, Amsterdam (pp. 12-15). University of Amsterdam, Information and Language Processing group.