James Kember

time to write an algorithm…

obtaining more data from twitter

Earlier this week Max talked a little about Twitter favourites. We worked on it together, so I’m going to give a bit of a low-down on one particular aspect: writing an algorithm to prioritise tweets.

We’ve created an algorithm that prioritises tweets so that our applications can focus more heavily on the popular ones and track how they change over time.

This involved calculating a score for each tweet based on three aspects:

  • How recent it is
  • How many users are following the tweeter
  • How many times it has been retweeted so far
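The exact formula isn’t given here, so what follows is only a rough sketch of how the three aspects above might be combined into a single number. Everything specific in it is an assumption made for illustration: the `TweetScore` class, the weights, the log damping of the big counts and the six-hour half-life.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical scoring sketch. The weights, the log damping and the
// half-life are illustrative assumptions, not the formula we actually used.
final class TweetScore {
    static final double HALF_LIFE_HOURS = 6.0;  // assumed: recency halves every 6 hours
    static final double FOLLOWER_WEIGHT = 1.0;  // assumed
    static final double RETWEET_WEIGHT = 2.0;   // assumed

    static double score(Instant createdAt, long followers, long retweets, Instant now) {
        // Age in hours, clamped at zero in case of clock skew.
        double ageHours = Math.max(0, Duration.between(createdAt, now).toMinutes()) / 60.0;
        // Exponential decay: 1.0 for a brand new tweet, 0.5 after one half-life.
        double recency = Math.pow(0.5, ageHours / HALF_LIFE_HOURS);
        // log1p keeps accounts with millions of followers from swamping everything else.
        double popularity = FOLLOWER_WEIGHT * Math.log1p(followers)
                          + RETWEET_WEIGHT * Math.log1p(retweets);
        return recency * popularity;
    }
}
```

Passing `now` in as a parameter, rather than reading the clock inside the method, keeps the score repeatable when you want to compare runs.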

Data from Twitter involves some quite big numbers, so we had to find a way of priority-sorting tweets without blowing up the system.

Originally we loaded all tweets into memory and sorted them there, but this was too memory intensive. Tom pointed me towards Guava’s MinMaxPriorityQueue, which is a nice way of continuing to do the sorting in the application while only holding on to the items that we wanted.
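To show the idea of keeping only the best items while streaming through the rest, here is a minimal sketch. It is not our code: to stay dependency-free it swaps in a plain `java.util.PriorityQueue` used as a min-heap, where MinMaxPriorityQueue would get the same effect through its maximum-size setting. The `TopK` and `ScoredTweet` names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Stand-in sketch using only java.util: keeps the k highest-scoring tweets
// while streaming, so at most k items are held in memory at any time.
final class TopK {
    // Hypothetical holder type for this example.
    static final class ScoredTweet {
        final long id;
        final double score;
        ScoredTweet(long id, double score) { this.id = id; this.score = score; }
    }

    static List<ScoredTweet> topK(Iterable<ScoredTweet> tweets, int k) {
        // Min-heap on score: the head is always the weakest tweet we hold.
        PriorityQueue<ScoredTweet> heap = new PriorityQueue<>(
                Math.max(1, k), Comparator.comparingDouble((ScoredTweet t) -> t.score));
        for (ScoredTweet t : tweets) {
            if (k <= 0) break;
            if (heap.size() < k) {
                heap.add(t);
            } else if (t.score > heap.peek().score) {
                heap.poll();  // evict the current weakest
                heap.add(t);
            }
        }
        // Return the survivors with the highest score first.
        List<ScoredTweet> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((ScoredTweet t) -> t.score).reversed());
        return result;
    }
}
```

Each tweet costs at most one heap insertion and one removal, so the work grows with the number of tweets while the memory stays fixed at k.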

measuring the unmeasurable

One of the most important things when writing an algorithm like this is being able to get feedback from it. If you have no idea what you want to achieve, it becomes very difficult to achieve it!

The problem in our case was that finding out whether the tweets we were looking up had been a success or a failure required a call to Twitter, which we could not make because of request limits.

In the end we decided on a measure: counting the number of tweets in a store that we were continuing to track and that were above a certain threshold of followers and retweets. This gives a rough idea of how successful the algorithm is, but because the store is also populated by other means the measure is not entirely accurate.
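As a rough illustration, a count like this could be as simple as the sketch below. The `TrackedTweet` type, the in-memory list standing in for the store and the threshold parameters are all hypothetical, and it assumes a tweet has to clear both thresholds to be counted.

```java
import java.util.List;

// Hypothetical sketch of the measure: count the tracked tweets that clear
// both thresholds. The store is modelled as a plain in-memory list.
final class TrackingMeasure {
    // Hypothetical record of a tweet we are still tracking.
    static final class TrackedTweet {
        final long followers;
        final long retweets;
        TrackedTweet(long followers, long retweets) {
            this.followers = followers;
            this.retweets = retweets;
        }
    }

    static long countAbove(List<TrackedTweet> store, long minFollowers, long minRetweets) {
        return store.stream()
                .filter(t -> t.followers >= minFollowers && t.retweets >= minRetweets)
                .count();
    }
}
```

Comparing this count before and after a change to the scoring gives a direction of travel rather than an exact figure, for the reason given above.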

In hindsight we could have taken a random sample to test against, but this presents its own problems and could give wildly varying results when tested multiple times.

final thoughts

It’s always better to work out the measure of success before spending any real time or effort on an algorithm, especially when it is quite difficult to define what success is. This will be a lesson for any future work we take on that involves lots of data.

If you have some experience with measuring stats for tasks involving lots of data it would be great to hear from you. Get in touch via Twitter or through the comment box below.
