Thursday, July 25, 2013

How Well Can We Predict Guy Kawasaki’s Top 5 Tweets of the Day

My goal when I started this blog was to create a Twitter account that would share a small number of Guy Kawasaki’s best tweets instead of the huge volume of tweets I would receive if I followed him.  After 4 months of work, @TT_GuyKawasaki, a Twitter account that retweets Guy Kawasaki’s top 5 tweets of the day, is finally live!  Because I just started running the account, I don’t yet know how well it will predict Kawasaki’s best tweets.  However, we can run the model on historical data and see how well it performs.  Even though I am initially using a simple linear regression, the model predicts whether a tweet is one of the top 5 of the day with up to 95% accuracy.  It looks like we’re off to a good start!

For the first version of the prediction model, I chose a simple linear regression.  Although I could have used several different models to predict Kawasaki’s top tweets, I started with the simple linear model because of its good accuracy and ease of implementation.  In the future, I will switch to other models if I find that they increase prediction accuracy sufficiently.
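The post does not spell out the regression's inputs, outputs, or decision rule, so the following is only a minimal sketch under stated assumptions: the model predicts a tweet's final retweet count from its retweet count a few minutes after posting, and a tweet is flagged as a top tweet when the prediction reaches a cutoff.  The data, the function names, and the cutoff value are all made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def flag_top_tweets(early_counts, slope, intercept, threshold):
    """Return indices of tweets whose predicted final retweets reach the threshold."""
    return [
        i for i, x in enumerate(early_counts)
        if slope * x + intercept >= threshold
    ]


# Made-up example data: retweets shortly after posting, and final retweets.
early_retweets = [2, 5, 1, 9, 4, 7, 3, 8]
final_retweets = [10, 24, 6, 41, 19, 33, 15, 37]

slope, intercept = fit_line(early_retweets, final_retweets)

# Hypothetical cutoff for "top tweet"; the post does not give one.
TOP_TWEET_THRESHOLD = 30
flagged = flag_top_tweets(early_retweets, slope, intercept, TOP_TWEET_THRESHOLD)
print(flagged)
```

A threshold rule is used here because it can be applied to each tweet as soon as its early retweet count is known, without waiting for the rest of the day's tweets.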

A good way to measure the prediction model’s accuracy is to determine the percentage of tweets that it classified correctly and incorrectly.  The tweets that were classified incorrectly are then split into false positives and false negatives.  False positives are tweets that were categorized as top tweets even though they weren’t top tweets, and false negatives are tweets that should have been categorized as top tweets but were not.  The chart below shows the percentage of tweets that were correctly categorized in green, false positives in yellow, and false negatives in red.  The bars show the model’s accuracy if it uses the number of retweets 5, 15, 25, and 35 minutes after the tweet is posted to predict whether the tweet is a top tweet.


Overall, the model does a very good job of predicting the top tweets correctly.  Using the number of retweets 5 minutes after the tweet was posted, the model predicted 91% of the tweets correctly.  This means that the model will incorrectly classify a tweet only about once every two days.  If the model uses the number of retweets 35 minutes after the tweet was posted, the accuracy increases to 95%, or only one mistake every four days.
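To make the three categories concrete, here is a small sketch of how the percentages could be computed over all tweets.  The function name and the tweet IDs are hypothetical, and the numbers are made up rather than taken from the charts.

```python
def classification_breakdown(actual_top, predicted_top, total_tweets):
    """Percentages of all tweets that are correct, false positives, and false negatives."""
    actual = set(actual_top)
    predicted = set(predicted_top)
    false_positives = len(predicted - actual)  # flagged, but not really top tweets
    false_negatives = len(actual - predicted)  # real top tweets that were missed
    correct = total_tweets - false_positives - false_negatives
    return {
        "correct": 100.0 * correct / total_tweets,
        "false_positives": 100.0 * false_positives / total_tweets,
        "false_negatives": 100.0 * false_negatives / total_tweets,
    }


# Made-up example: 20 tweets identified by index, 5 of which are real top tweets.
breakdown = classification_breakdown(
    actual_top=[0, 1, 2, 3, 4],
    predicted_top=[0, 1, 2, 3, 7],
    total_tweets=20,
)
print(breakdown)
```

Note that the "correct" share includes every ordinary tweet that was correctly left out, which is why this measure tends to look flattering when top tweets are rare.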

The results of the model look so good that it is worth raising the bar for accuracy.  Instead of judging the model’s accuracy by its ability to classify all tweets, let’s take a look at how good it is at classifying only the top tweets.  The chart below shows the top tweets that were predicted correctly, along with the false positives and false negatives, if we use the number of retweets 5, 15, 25, and 35 minutes after the tweet is posted for prediction.


Once we zoom in on only the top tweets, the model doesn’t look as good as it did in the previous chart.  If we use the number of retweets 5 minutes after the tweet was posted for prediction, the model will correctly predict 75% of the top tweets.  The model will also falsely flag additional tweets as top tweets, amounting to 25% of the number of top tweets.  If we use the number of retweets 35 minutes after the tweet was posted, the percentage of correctly identified top tweets increases to 84%, and the percentage of false positives decreases to only 9% of all top tweets.  This more nuanced view of the model’s performance will be a better baseline for comparing other models in the future.
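The stricter view above changes the denominator: everything is expressed relative to the number of real top tweets rather than all tweets.  A minimal sketch, with a hypothetical function name and made-up tweet indices:

```python
def top_tweet_rates(actual_top, predicted_top):
    """Express hits, misses, and false alarms as a percentage of the real top tweets."""
    actual = set(actual_top)
    predicted = set(predicted_top)
    n_top = len(actual)
    return {
        "identified": 100.0 * len(actual & predicted) / n_top,
        "false_negatives": 100.0 * len(actual - predicted) / n_top,
        "false_positives": 100.0 * len(predicted - actual) / n_top,
    }


# Made-up example: 4 real top tweets, 3 of them found, plus 1 false alarm.
rates = top_tweet_rates(actual_top=[0, 1, 2, 3], predicted_top=[0, 1, 2, 9])
print(rates)
```

Because false positives are divided by the number of real top tweets, the identified and false positive percentages do not have to add up to 100.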

This new view of the data exposes an interesting trend in how false positives and false negatives change the longer we wait to sample the number of retweets.  Although the number of false positives keeps decreasing the longer we wait, the number of false negatives seems to plateau after 25 minutes.  This appears to corroborate the conclusion we reached in a previous blog post that the optimum amount of time to wait to predict retweets is 25 minutes.

These results do come with two caveats.  First, we are using historical data in this analysis.  Although history can repeat itself, I am not sure that historical data will be representative of future behavior.  Second, we used the same data both to create the model and to test its predictions.  Using the same dataset for both purposes gives us a bit of an unfair advantage.  A better comparison would be to use one dataset to create the model and then apply it to another dataset.  Because we don’t have a lot of data yet, we couldn’t perform this type of comparison.
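The fairer comparison described above could look something like the sketch below: fit the model on the older tweets and score it only on the newer ones.  Everything here is an assumption for illustration, including the made-up data, the 75/25 split, the function names, and the cutoff used to define a top tweet.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def chronological_split(records, train_fraction=0.75):
    """Split time-ordered records into an older training part and a newer test part."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


# Made-up (early retweets, final retweets) pairs, oldest tweet first.
records = [(2, 10), (5, 24), (1, 6), (9, 41), (4, 19), (7, 33), (3, 15), (8, 37)]
train, test = chronological_split(records)

# Fit only on the older tweets.
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# Hypothetical cutoff: a tweet counts as "top" if it ends with at least 30 retweets.
THRESHOLD = 30
hits = sum(
    (slope * early + intercept >= THRESHOLD) == (final >= THRESHOLD)
    for early, final in test
)
holdout_accuracy = hits / len(test)
print(holdout_accuracy)
```

Splitting by time rather than at random mirrors how the live account works, since it always has to predict tweets it has not seen before.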

Even with these caveats, I think this analysis is a very good start toward understanding how well the @TT_GuyKawasaki account will predict Kawasaki’s top tweets.  Stay tuned; in a few weeks we will perform a similar analysis with the data collected from running the account.

4 comments:

  1. This looks really interesting. I think I might change my following from the original Guy to your 'streamlined' version... Any way you could do the same on Google+ ... As much as I like some of what Guy says he takes up most of my G+ page every day :-(

    1. That's a very good idea! I'll look into the G+ API and see how easy it would be to port the code. In the meantime, are there any other Twitter accounts for which you would like a 'streamlined' version?

  2. Really love how you've used mathematical modelling in your blog. I've just caught up on all your posts and think you've done something really special here using science. Thanks!

  3. Thank you Zuleyka! It's very interesting to see what you can do with a little science and whole lot of data from Twitter. Thanks for sharing on Google+!
