we write about the things we build and the things we consume
Leaf is our tool for tracking and visualising what’s cool on Twitter for any given topic and it’s been evolving fast recently.
Originally conceived as a way of tracking events on Twitter and finding out the most popular topics, Leaf has since been used for The Archers 60th anniversary, monitoring the Arab Spring in Egypt, the royal wedding and many other things. It listens to streams of tweets that contain given keywords, then uses a variety of methods to track, rank and display the most intriguing phrases. For example, if "#London2012" were the tracked keyword, you could use Leaf to monitor that stream and compare the most popular sports or related phrases, and the results wouldn’t be affected by everyone tweeting about Justin Bieber going swimming.
what is interesting?
To help find the coolest phrases, Leaf also analyses all of the tweets in the stream and suggests the best sub-phrases nestled within, so that a user can select the ones they wish to track and display. Unsurprisingly this is quite hard, especially as we can process dozens of tweets every second for a popular topic, so we have to do this quickly and without using much memory. Focusing on short, 2-4 word phrases seemed to be the way to go. We increased the scores for phrases of this length and constructed a Bloom filter to determine whether a phrase was in Wikipedia, along with a few other tricks. This worked remarkably well, but the remaining problem was editorial rather than technical. Seeing that "Boris Johnson" is being mentioned a lot while tracking "Olympics" seems peculiar but isn’t very meaningful without context. Seeing that "Boris Johnson got stuck on an Olympics zip line" tells the reader why and is more interesting, so we started favouring longer phrases instead.
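To make the short-phrase approach concrete, here is a minimal Python sketch of the idea: pull 2-4 word phrases out of each tweet and give a bonus to any phrase that a Bloom filter says is a known title. It is an illustration under stated assumptions, not Leaf's actual code. The names BloomFilter, ngrams, score_phrases and title_bonus are hypothetical, as are the filter size, the hash scheme and the bonus value.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        # Sizes are illustrative assumptions, not tuned values.
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, min_len=2, max_len=4):
    """Yield every phrase of min_len..max_len consecutive words, lowercased."""
    words = text.lower().split()
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def score_phrases(tweets, known_titles, title_bonus=2.0):
    """Count candidate phrases, weighting those found in known_titles."""
    scores = Counter()
    for tweet in tweets:
        for phrase in ngrams(tweet):
            scores[phrase] += title_bonus if phrase in known_titles else 1.0
    return scores
```

In use, the filter would be loaded once with page titles (for example `titles.add("boris johnson")`) and then passed to `score_phrases` along with a batch of tweets. The appeal of a Bloom filter here is that a very large set of titles fits in a fixed, small amount of memory, at the cost of an occasional false positive.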
Of course, once we have filtered for the type of phrases we want, what sets each phrase apart is how popular it is: the number of times we’ve seen it. This works well enough when we’re only dealing with short timescales, but if an event runs for many hours or days then you end up with a high score table of the best phrases since tracking began. Leaf wants to know what’s happening now though, so we have to somehow get rid of the old phrases without stopping any up-and-coming ones. We tried a few approaches. The ideal way would be to record the count at different times and calculate how it’s changing, but this was too expensive. We settled on an ever-decreasing counter for each phrase, reduced according to how long ago the phrase was last seen and nudged back up each time we see the phrase again.
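A counter of this kind can be sketched in a few lines of Python. The post doesn’t say how the reduction is calculated, so this sketch assumes exponential decay with a configurable half-life, applied lazily using the time each phrase was last seen. DecayingCounter, half_life_seconds and the pruning threshold are hypothetical names and values chosen for illustration.

```python
import math


class DecayingCounter:
    """Per-phrase score that fades over time and rises on each sighting."""

    def __init__(self, half_life_seconds=600.0):
        # Assumption: exponential decay; a score halves every half-life.
        self.decay_rate = math.log(2) / half_life_seconds
        self.entries = {}  # phrase -> (score, time last seen)

    def _decayed(self, score, last_seen, now):
        return score * math.exp(-self.decay_rate * (now - last_seen))

    def see(self, phrase, now):
        """Decay the stored score up to now, then add one for this sighting."""
        score, last_seen = self.entries.get(phrase, (0.0, now))
        self.entries[phrase] = (self._decayed(score, last_seen, now) + 1.0, now)

    def score(self, phrase, now):
        """Return the phrase's score as of now without modifying it."""
        if phrase not in self.entries:
            return 0.0
        score, last_seen = self.entries[phrase]
        return self._decayed(score, last_seen, now)

    def top(self, now, limit=10):
        """Return phrases ranked by their current decayed score."""
        ranked = sorted(self.entries, key=lambda p: self.score(p, now), reverse=True)
        return ranked[:limit]

    def prune(self, now, threshold=0.01):
        """Forget phrases whose score has faded below threshold."""
        self.entries = {
            phrase: entry
            for phrase, entry in self.entries.items()
            if self._decayed(entry[0], entry[1], now) >= threshold
        }
```

Only one number and one timestamp are stored per phrase, so this stays much cheaper than keeping a history of counts, and a phrase that stops being mentioned sinks down the table on its own while a new one can overtake it after only a few sightings.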
We’re always trying to make our projects even better, so where’s Leaf heading next? We’ve been experimenting with making Leaf fully autonomous, tracking the phrases that it thinks are the most interesting and popular at the moment. It works great so far but still needs a bit of supervision to stop the less savoury things on Twitter slipping through. Oli has also been using atlas to make Leaf a bit smarter; perhaps you’ll hear about that sometime soon.