Offering access to information in microblog posts requires successful language identification. Language identification on sparse and noisy data can be challenging. In this paper we explore the performance of a state-of-the-art n-gram-based language identifier, and we introduce two semi-supervised priors to enhance performance at the microblog post level: (i) a blogger-based prior, using previous posts by the same blogger, and (ii) a link-based prior, using the pages linked to from the post. We test our models on five languages (Dutch, English, French, German, and Spanish), with a set of 1,000 tweets per language. Results show that our priors improve accuracy, but that there is still room for improvement.
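The n-gram-based identifier referenced in the abstract can be illustrated with a minimal sketch: build a character n-gram frequency profile per language from training text, then score a new post by how well its n-grams match each profile. This is an assumption-laden toy version (tiny invented training sentences, simple summed-frequency scoring), not the paper's actual model or corpus:

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Overlapping character n-grams, with space padding at the edges.
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile(texts, n=3):
    # Relative-frequency profile of character n-grams over a corpus.
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def identify(text, profiles, n=3):
    # Score each language by the summed relative frequency of the
    # text's n-grams in that language's profile; return the argmax.
    grams = char_ngrams(text, n)
    scores = {
        lang: sum(prof.get(g, 0.0) for g in grams)
        for lang, prof in profiles.items()
    }
    return max(scores, key=scores.get)

# Toy training data, for illustration only.
profiles = {
    "en": profile(["the quick brown fox jumps over the lazy dog",
                   "this is an english sentence about things"]),
    "nl": profile(["de snelle bruine vos springt over de luie hond",
                   "dit is een nederlandse zin over dingen"]),
}

print(identify("the dog is lazy", profiles))  # "en" with these toy profiles
print(identify("de hond is lui", profiles))   # "nl" with these toy profiles
```

The priors described in the paper would enter at scoring time, e.g. by combining the post's own score with evidence from the blogger's earlier posts or from linked pages; the exact combination used in the paper is not reproduced here.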
[1] Simon Carter, Manos Tsagkias, and Wouter Weerkamp. 2011. Semi-Supervised Priors for Microblog Language Identification. In DIR 2011: Dutch-Belgian Information Retrieval Workshop, Amsterdam, pp. 12-15. University of Amsterdam, Information and Language Processing group.