Over the years I have worked on a wide range of projects. Here I list a bunch of recent ones that I’m particularly fond of:
- 904Labs self-learning search engine
- Predicting IMDB movie ratings using social media
904Labs self-learning search engine
904Labs self-learning search engine (904SLS) is the first commercially available self-learning search engine for e-commerce and content publishers. It learns from user behavior to optimize search results and increase conversion.
904SLS covers all core aspects of search: from pre-search (e.g., query understanding), through automatic document augmentation, to post-search (e.g., result re-ranking). The engine updates in near real-time, so it can quickly adapt to swift changes in user search behavior without retraining. 904SLS is available as software-as-a-service or as a virtual machine, and it can re-use existing search infrastructure if it is Lucene-based, so that no re-indexing is required.
The amount of engineering that our team has put in, given the timeframe and human resources, is, in my opinion, mind-blowing. We’ve had to solve core machine learning limitations (e.g., delayed feedback), scalability challenges (e.g., minimizing latency without harming effectiveness), and domain-dependent challenges (e.g., how much search is e-commerce search?), and also to set up a full-blown infrastructure with experimental, staging, and production environments so that our customers have no downtime when we push new features to production. The results have been rewarding: 904SLS outperforms state-of-the-art learning-to-rank systems on academic datasets, and it has been shown to lift revenue by 30% on live e-commerce shops.
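To make the idea of adapting to user behavior without a full retraining cycle more concrete, here is a minimal sketch of an online pairwise update for a linear re-ranker. It is a generic illustration and not the algorithm used in 904SLS. The function names, the feature vectors, the learning rate, and the "a click beats the results skipped above it" heuristic are all assumptions made for this example.

```python
# Illustrative sketch only: a linear re-ranker that learns from a single click.
# Names, features, and the learning rate are assumptions, not 904SLS internals.

def score(weights, features):
    """Linear relevance score of one result."""
    return sum(w * x for w, x in zip(weights, features))


def rerank(weights, candidates):
    """Sort (doc_id, feature_vector) pairs by descending score."""
    return sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)


def update_from_click(weights, ranked, clicked_id, learning_rate=0.1):
    """Perceptron-style pairwise update.

    Every result shown above the clicked one is treated as skipped, and the
    weights are nudged so that the clicked result scores higher than it.
    Returns a new weight list; the input list is left untouched.
    """
    clicked = next(feats for doc_id, feats in ranked if doc_id == clicked_id)
    new_weights = list(weights)
    for doc_id, feats in ranked:
        if doc_id == clicked_id:
            break  # results below the click carry no signal in this heuristic
        new_weights = [
            w + learning_rate * (c - s)
            for w, c, s in zip(new_weights, clicked, feats)
        ]
    return new_weights
```

Each click changes the weights immediately, so the next query is already ranked with the updated model. For example, with weights `[1.0, 0.0]` and two results `("a", [1.0, 0.0])` and `("b", [0.0, 1.0])`, result `a` is ranked first; a single click on `b` with `learning_rate=0.6` is enough to move `b` to the top.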
Streamwatchr
Streamwatchr listened to the world and tried to understand what was happening. It monitored Twitter to find out what music people were listening to. Streamwatchr offered real-time insights into music listening behavior around the world. It ran from 2013 until 2016.
The millions of tweets and the hundreds of thousands of artists that Streamwatchr has listened to over the years have been distilled to a handful of noteworthy factoids:
- tweets listened: 438,225,941
- artists seen: 660,941
- most popular song: Passenger – Let Her Go, 196,986 times
- most sung along song: John Legend – All of Me, 541 times
The engineering behind Streamwatchr was impressive. Streamwatchr collected music-related tweets, extracted songs and artists, and mapped these to MusicBrainz, a music database, and to YouTube video clips, all in real-time. All information shown on Streamwatchr was refreshed with every single tweet, from popularity charts and music analytics to trending music and song recommendations.
Streamwatchr used a stack of Python and MongoDB for analytics and recommendations. Most of the magic behind real-time updates relied on clever data structures and algorithms that minimized the number of updates needed.
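As an illustration of what "minimizing the number of updates" can look like, here is a minimal sketch of a popularity chart that is maintained incrementally, one tweet at a time. It is not Streamwatchr's actual code; the class name `LiveChart`, its methods, and the chart size `k` are assumptions made for this example.

```python
from collections import Counter


class LiveChart:
    """Top-k play chart updated incrementally (hypothetical example class)."""

    def __init__(self, k=10):
        self.k = k
        self.counts = Counter()  # play count for every song seen so far
        self.chart = []          # current top-k songs, most played first

    def add_play(self, song):
        """Record one play; return True if the chart may have changed."""
        self.counts[song] += 1
        if song not in self.chart:
            if len(self.chart) < self.k:
                self.chart.append(song)
            elif self.counts[song] > self.counts[self.chart[-1]]:
                self.chart[-1] = song  # overtakes the song in last place
            else:
                return False  # common case: the chart needs no refresh
        # Re-sort at most k entries instead of every song ever seen.
        self.chart.sort(key=lambda s: -self.counts[s])
        return True
```

Most plays are for songs far below the chart threshold, so they only cost a counter increment and one comparison, and the displayed chart is refreshed only when `add_play` returns `True`. For example, with `k=2` and the plays `a`, `b`, `c`, `c`, the first `c` leaves the chart at `["a", "b"]`, and the second one changes it to `["c", "a"]`.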
Periscope
Periscope was a project that highlighted which topics on Twitter would go viral before they did. It was meant to help media analysts stay ahead of the curve and to help average users keep the clutter away.
(Section is to be expanded.)
Predicting IMDB movie ratings
Part of the Artificial Intelligence Master’s programme in which I used to teach was a one-month hands-on project. I supervised Andrei Oghina and Mathias Breuss on the question of whether we can predict a movie’s IMDB rating from social media before the movie is out.
Andrei and Mathias collected tweets about movies and their respective YouTube trailers, and they extracted several features which were used for training a rating classifier. We found that we are able to predict a movie’s IMDB rating to within ±0.25. That is, if a movie gets an average IMDB rating of 8, our prediction would fall between 7.75 and 8.25. This is quite impressive for predicting a movie’s success before it is even released.
Our work was published at ECIR 2012, and it received the best paper award.