Sunday, May 19, 2013

Predicting Guy Kawasaki’s Retweets is (not that) Hard


Prediction doesn’t have to be hard.  We do it every day.  We become amateur meteorologists when we look out the window to help us predict the day’s weather.  This seemingly simple act is firmly based in reason.  A very good predictor of the weather in a few hours is the weather right now.

Similarly, a very good predictor of the number of retweets that a tweet will receive in a few minutes is the number of retweets that it has right now.  In this post, we’ll see how the number of retweets that a tweet receives soon after Guy Kawasaki posts it is a very good predictor of the final number of retweets the tweet will receive.

First though, we need to agree on how long we will wait before measuring the final number of retweets that a tweet receives.  In the two-week sample I used for this analysis, I found that 95% of tweets stop receiving retweets within 12 hours of being posted.  So, the number of retweets that a tweet has received 12 hours after it is posted will be our benchmark for the final number of retweets.
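The 12-hour benchmark can be reproduced in spirit with a short sketch.  This is not the code used for the analysis: the function name `settle_time_hours` and the delay values below are made up for illustration, and the nearest-rank percentile is just one reasonable way to define "95% of tweets have stopped receiving retweets".

```python
import math

def settle_time_hours(last_retweet_hours, coverage=0.95):
    """Return the smallest observed delay by which `coverage` of tweets
    had already received their last retweet (nearest-rank percentile)."""
    ordered = sorted(last_retweet_hours)
    # Number of tweets that must be covered, rounded up to a whole tweet.
    rank = math.ceil(coverage * len(ordered))
    return ordered[rank - 1]

# Hypothetical data: hours between posting and the last retweet, one per tweet.
sample = [0.5, 1, 1.5, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11, 12, 30]

horizon = settle_time_hours(sample)
print(f"95% of tweets were done being retweeted within {horizon} hours")
```

With real data, `sample` would hold one delay per tweet in the two-week window, and the returned value would play the role of the 12-hour cutoff.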

The second question we need to answer is how good the number of retweets that a tweet receives a few minutes after Kawasaki posts it is at predicting the tweet's final number of retweets.  A good way to visualize this relationship is to plot the number of retweets a tweet receives soon after posting versus its final number of retweets.

The left graph below shows, for a two week sample, the number of retweets each tweet received 5 minutes after Kawasaki posted it compared to the number of retweets it received 12 hours after posting.  The right graph below shows, for the same sample, the number of retweets 35 minutes after posting versus 12 hours after posting.  Both graphs include a blue best-fit line drawn through the middle of the points.

Both graphs show that the number of retweets that a tweet receives just a few minutes after posting is a good predictor of the tweet's final number of retweets.  We know that one is a good predictor of the other because tweets with more early retweets also tend to have more final retweets.  The main difference between the graphs is that the points on the right graph are closer to the blue best-fit line.

The closer the points are to a best-fit line, the better one variable is at predicting the other.  If getting closer to the best-fit line makes us better at predicting the final number of retweets, then why don’t we just wait longer than 35 minutes to make the prediction?
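One way to put a number on "closer to the best-fit line" is the coefficient of determination, R².  The post itself only compares the graphs by eye, so R² is my choice of measure here, and the retweet counts below are invented for illustration rather than taken from the sample.  The sketch fits an ordinary least-squares line and reports how much of the variation in final retweets the line explains.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys):
    """Share of the variance in ys explained by the best-fit line (0 to 1)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical counts for six tweets (not the real sample).
final_12h = [40, 55, 80, 120, 150, 200]   # retweets 12 hours after posting
at_5_min = [3, 9, 6, 15, 10, 22]          # retweets 5 minutes after posting
at_35_min = [12, 17, 25, 37, 44, 61]      # retweets 35 minutes after posting

r2_5 = r_squared(at_5_min, final_12h)
r2_35 = r_squared(at_35_min, final_12h)
print(f"R^2 at 5 minutes:  {r2_5:.2f}")
print(f"R^2 at 35 minutes: {r2_35:.2f}")
```

A value nearer 1 means the points sit nearer the line, so in this made-up data the 35-minute counts are the better predictor, which matches what the right graph shows.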

A trade-off exists between the time we are willing to wait for a prediction of the total number of retweets and the prediction's accuracy.  For example, it would not be very helpful to wait 11 hours to predict the number of retweets a tweet will have at 12 hours.  In the next post, we will discuss how long we’re willing to wait for a prediction, and how much prediction quality we really need in order to show that predicting Kawasaki's retweets isn't really that hard.
