Paper at TREC 2009 — The University of Amsterdam at TREC 2009

Abstract

We describe the participation of the University of Amsterdam’s ILPS group in the blog, web, entity, and relevance feedback tracks at TREC 2009. Our main preliminary conclusions are as follows. For the Blog track, we find that for top stories identification a blogs-to-news approach outperforms a simple news-to-blogs approach. This is interesting, as the blogs-to-news approach starts with no input except for a date, whereas the news-to-blogs approach also has news headlines as input. For the web track, we find that spam is an important issue in the ad hoc task and that Wikipedia-based heuristic optimization approaches help to boost retrieval performance; these approaches are assumed to potentially reduce the spam in top-ranked documents. For the diversity task, we explored different methods. Initial results show that clustering and a topic model-based approach perform similarly, and both perform better than a query log-based approach. Our performance in the Entity track was downright disappointing: the use of co-occurrence models led to poor results, and an initial analysis shows that while our approach is able to find correct entity names, we fail to find homepages for these entities. For the relevance feedback track, we find that a topical diversity approach provides good feedback documents. Further, we find that our relevance feedback algorithm seems to help most when there are sufficient relevant documents available.

References

[1] Krisztian Balog, Marc Bron, Jiyin He, Katja Hofmann, Edgar Meij, Maarten de Rijke, Manos Tsagkias, and Wouter Weerkamp. The University of Amsterdam at TREC 2009: Blog, Web, Entity and Relevance Feedback. In TREC 2009 Working Notes. NIST, November 2009.

For our experimental evaluation, we use data from Twitter, Digg, Delicious, the New York Times Community, Wikipedia, and the blogosphere to generate query models. We show that different query models, based on different data sources, provide complementary information and manage to retrieve different social media utterances from our target index. As a consequence, data fusion methods manage to significantly boost retrieval performance over individual approaches. Our graph-based term selection method is shown to help improve both effectiveness and efficiency.

47.89% similar — Paper at WSDM 2011 — Linking online news and social media
Recent years have witnessed a persistent interest in generating pseudo test collections, both for training and evaluation purposes. We describe a method for generating queries and relevance judgments for microblog search in an unsupervised way. Our starting point is this intuition: tweets with a hashtag are relevant to the topic covered by the hashtag and hence to a suitable query derived from the hashtag. Our baseline method selects all commonly used hashtags, and all associated tweets as relevance judgments; we then generate a query from these tweets. Next, we generate a timestamp for each query, allowing us to use temporal information in the training process. We then enrich the generation process with knowledge derived from an editorial test collection for microblog search.
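
The hashtag intuition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's method: the function name `build_pseudo_collection`, the `min_tweets` threshold standing in for "commonly used", the tiny stopword list, and the choice of the most frequent terms as the query are all hypothetical. Timestamps and the enrichment with an editorial test collection are left out.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny illustrative stopword list, not the one used in the paper.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "for", "on", "at", "rt"}


def build_pseudo_collection(tweets, min_tweets=2, query_len=3):
    """Build pseudo queries and relevance judgments from hashtags.

    tweets: iterable of (tweet_id, text) pairs.
    Returns {hashtag: (query_terms, relevant_tweet_ids)}.
    """
    # Group tweets by the hashtags they contain.
    by_tag = defaultdict(list)
    for tweet_id, text in tweets:
        for tag in set(re.findall(r"#(\w+)", text.lower())):
            by_tag[tag].append((tweet_id, text))

    collection = {}
    for tag, tagged in by_tag.items():
        # Keep only hashtags used in at least min_tweets tweets (assumed threshold).
        if len(tagged) < min_tweets:
            continue
        counts = Counter()
        for _, text in tagged:
            # Drop hashtags so the query is built from the remaining tweet text.
            words = re.findall(r"[a-z]+", re.sub(r"#\w+", " ", text.lower()))
            counts.update(w for w in words if w not in STOPWORDS)
        # Assumption: the query is simply the most frequent remaining terms.
        query = [term for term, _ in counts.most_common(query_len)]
        # Every tweet carrying the hashtag is treated as relevant to the query.
        collection[tag] = (query, sorted(tid for tid, _ in tagged))
    return collection


if __name__ == "__main__":
    sample = [
        (1, "Huge quake hits the coast #earthquake"),
        (2, "Quake damage reported on coast #earthquake"),
        (3, "Lunch time #food"),
    ]
    print(build_pseudo_collection(sample, min_tweets=2, query_len=2))
```

In the sample run, the `#food` hashtag is dropped because it occurs only once, while both `#earthquake` tweets become the relevance judgments for a query built from their shared terms.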

41.54% similar — Paper at SIGIR 2013 — Pseudo Test Collections for Training and Tuning Microblog Rankers
The advent of social media has established a symbiotic relationship between social media and online news. This relationship can be leveraged for tracking news content, and predicting behavior with tangible real-world applications, e.g., online reputation management, ad pricing, news ranking, and media analysis. In this thesis we focus on tracking news content in social media, and predicting user behavior.

In the first part, we develop methods for tracking content that build upon and extend practices in Information Retrieval. We begin by discovering social media posts that discuss a news article yet do not provide a hyperlink to it. Our methods model news articles using several channels of information, either endogenous or exogenous to the article. These models are then used to query an index of social media posts. During this process we found that the query models are close in size to the documents to be retrieved, violating a standard assumption of language

41.03% similar — Ph.D. Thesis — Mining Social Media: Tracking Content and Predicting Behavior
Offering access to information in microblog posts requires successful language identification. Language identification on sparse and noisy data can be challenging. In this paper we explore the performance of a state-of-the-art n-gram-based language identifier, and we introduce two semi-supervised priors to enhance performance at the microblog post level: (i) a blogger-based prior, using previous posts by the same blogger, and (ii) a link-based prior, using the pages linked to from the post. We test our models on five languages (Dutch, English, French, German, and Spanish), and a set of 1,000 tweets per language. Results show that our priors improve accuracy, but that there is still room for improvement.

39.02% similar — Paper at DIR 2011 — Semi-Supervised Priors for Microblog Language Identification
Online news agents provide commenting facilities for readers to express their views with regard to news stories. The number of user-supplied comments on a news article may be indicative of its importance or impact. We report on exploratory work that predicts the comment volume of news articles prior to publication using five feature sets. We address the prediction task as a two-stage classification task: a binary classification identifies articles with the potential to receive comments, and a second binary classification receives the output from the first step to label articles as having “low” or “high” comment volume. The results show solid performance for the former task, while performance degrades for the latter.

38.76% similar — Paper at CIKM 2009 — Predicting the Volume of Comments on Online News Stories
To make up for this discrepancy, we consider distributions that emerge from sampling without replacement: the central and non-central hypergeometric distributions. We present two retrieval models that build on top of these distributions: a log odds model and a Bayesian model where document parameters are estimated using the Dirichlet compound multinomial distribution. We analyse the behavior of our new models using a corpus of news articles and blog posts and find that for the task of republished article finding, where we deal with queries whose length approaches the length of the documents to be retrieved, models based on distributions associated with sampling without replacement outperform traditional models based on multinomial distributions.
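
To make the difference between the two sampling schemes concrete, the sketch below compares the probability of drawing a query from a document with replacement (multinomial) and without replacement (central multivariate hypergeometric). It is a minimal illustration of the underlying distributions only, not the log odds or Bayesian retrieval models described above; the function names and the toy document are hypothetical.

```python
from collections import Counter
from math import comb, factorial, prod


def multinomial_prob(query, doc):
    """P(query | doc) when query terms are sampled WITH replacement."""
    q, d = Counter(query), Counter(doc)
    n_doc = sum(d.values())
    # Multinomial coefficient: n! / (k_1! * k_2! * ...).
    coeff = factorial(sum(q.values()))
    for count in q.values():
        coeff //= factorial(count)
    return coeff * prod((d[term] / n_doc) ** count for term, count in q.items())


def hypergeometric_prob(query, doc):
    """P(query | doc) when query terms are sampled WITHOUT replacement."""
    q, d = Counter(query), Counter(doc)
    n_query, n_doc = sum(q.values()), sum(d.values())
    if n_query > n_doc:
        return 0.0  # cannot draw more terms than the document contains
    # Number of ways to pick the query's term counts from the document's counts.
    ways = prod(comb(d[term], count) for term, count in q.items())
    return ways / comb(n_doc, n_query)


if __name__ == "__main__":
    doc = ["a", "a", "b", "c"]
    long_query = ["a", "a", "b"]       # query length close to document length
    impossible_query = ["a", "a", "a"]  # uses "a" more often than the document does

    print(multinomial_prob(long_query, doc), hypergeometric_prob(long_query, doc))
    print(multinomial_prob(impossible_query, doc), hypergeometric_prob(impossible_query, doc))
```

When the query is almost as long as the document, the multinomial still assigns probability to a query that uses a term more often than the document contains it, whereas sampling without replacement assigns it zero.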

38.26% similar — Paper at SIGIR 2011 — Hypergeometric Language Models for Republished Article Finding
We propose a retrieval model for searching microblog posts for a given topic of interest. We develop a language modeling approach tailored to microblogging characteristics, where redundancy-based IR methods cannot be used in a straightforward manner. We enhance this model with two groups of quality indicators: textual and microblog-specific. Additionally, we propose a dynamic query expansion model for microblog post retrieval. Experimental results on Twitter data reveal the usefulness of Boolean search, and demonstrate the utility of quality indicators and query expansion in microblog search.

36.17% similar — Paper at ECIR 2011 — Incorporating Query Expansion and Quality Indicators in Searching Microblog Posts

Paper at TREC 2009 — The University of Amsterdam at TREC 2009

Blog, Web, Entity, and Relevance Feedback

Krisztian Balog, Marc Bron, Jiyin He, Katja Hofmann, Edgar Meij, Maarten de Rijke, Manos Tsagkias, and Wouter Weerkamp

University of Amsterdam

26 October 2009

Keywords: paper, trec
