Sunday, June 30, 2013

There’s More Than One Way to Predict Guy Kawasaki’s Retweets

As with tying a tie, cooking an egg, or skinning a cat, there is more than one way to predict the final number of retweets for one of Guy Kawasaki’s tweets.  Unlike those three tasks, though, predictions can be judged objectively to determine which method is best.  In this post I’ll compare three statistical tools, simple regression, the lasso (or L1), and random forests, to determine which is best at predicting retweets.  I found that the random forest predicted the data best, with an average error of only 19%.  The lasso and simple regression performed similarly to each other, with errors of 46% and 36% respectively.

The chart below shows how well each of the tools performed by plotting the actual number of retweets of each tweet on the x-axis and the predicted number of retweets on the y-axis.  Each point is color coded in green, yellow, or red depending on the accuracy of its prediction.



The simple regression is our baseline tool. It uses the number of retweets 25 minutes after Kawasaki posts a tweet to predict the tweet’s final number of retweets.  I picked the 25-minute mark because in a previous post we found that 25 minutes was the optimum time to sample.  The regression does a pretty good job of predicting the data.  On average, each prediction has an error of only 36% (see footnote 1).  The graph above also shows how well the regression performed.  We see mostly green and yellow dots, and only a few red dots.
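As a rough illustration of this baseline (not the code behind the results in this post), a one-variable least-squares fit can be sketched in a few lines of Python. The retweet counts below are made up for the example, and the function names are hypothetical.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: retweets 25 minutes after posting, and final retweets.
retweets_25min = [4, 10, 15, 22, 30]
final_retweets = [9, 21, 31, 45, 61]

intercept, slope = fit_simple_regression(retweets_25min, final_retweets)


def predict_final(retweets_at_25min):
    """Predict the final retweet count from the 25-minute count."""
    return intercept + slope * retweets_at_25min


print(predict_final(20))
```

Once fitted, the whole model is just the two numbers `intercept` and `slope`, so a prediction is a single multiplication and addition.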

The second model that we tried was the lasso.  Like a regression, the lasso assumes a linear relationship between variables.  However, the lasso usually does a better job of separating the signal from the noise in the data by selecting only the variables that are truly significant.  I threw a lot more data at the lasso than I did at the simple regression.  The following table shows all the data I used with the lasso:

| Data | Sample Frequency | Description |
|---|---|---|
| Number of retweets | 5, 15, 25 min. after posting | Retweets after a certain time |
| Difference in retweets | 5, 15, 25 min. after posting | Increase in retweets during a certain time |
| Audience size | 5, 15, 25 min. after posting | Number of users that follow those who retweeted the tweet |
| Difference in audience size | 5, 15, 25 min. after posting | Increase in the number of users that follow those who retweeted the tweet |
| Number of followers | At time of posting | Number of people following @GuyKawasaki when the tweet was posted |
| Source | At time of posting | Tool used to post the tweet |
| Date, week day, hour, minute, second | At time of posting | Time information on when the tweet was posted |

The lasso picked four of these variables as being significant: the number of retweets 5, 15, and 25 minutes after the tweet was posted, and the difference between the number of retweets 25 and 15 minutes after the tweet was posted.  This result is surprising not because of what the lasso selected, but because of what it didn't select.  There doesn't seem to be a strong relationship between the final number of retweets and either the time of day a tweet is posted or the tool used to post it.  In addition, the chart above suggests that the lasso did about as well as the simple regression at predicting the final number of retweets.  Actually, it did a little worse, with an average error of 46%.
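To give a feel for how the lasso drops variables, here is a minimal sketch of a textbook coordinate-descent lasso in plain Python. It is not the code used for this post: the data is made up, there are only two variables (retweets at 25 minutes and the hour of posting), and the penalty `alpha` is an arbitrary value chosen for the illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(rows, targets, alpha, n_iter=100):
    """Minimise (1/2n) * squared error + alpha * sum(|coef|) by coordinate descent."""
    n, p = len(rows), len(rows[0])
    # Centre the columns and the target so the intercept can be added back later.
    col_means = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(targets) / n
    X = [[r[j] - col_means[j] for j in range(p)] for r in rows]
    y = [t - y_mean for t in targets]
    coefs = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of variable j with the residual that ignores j
            z = 0.0    # sum of squares of variable j
            for i in range(n):
                others = sum(X[i][k] * coefs[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            coefs[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    intercept = y_mean - sum(c * m for c, m in zip(coefs, col_means))
    return intercept, coefs


# Made-up rows: [retweets 25 min after posting, hour of posting].
rows = [[2, 13], [6, 10], [10, 13], [14, 13], [18, 10], [22, 13]]
final_retweets = [6, 11, 22, 30, 35, 46]

intercept, coefs = fit_lasso(rows, final_retweets, alpha=3.0)
print(coefs)  # the second coefficient (hour of posting) is exactly 0.0
```

In this toy data the hour of posting has only a weak relationship with the final count, so the penalty pushes its coefficient all the way to zero while the retweet coefficient is only shrunk slightly. That is the sense in which the lasso "selects" variables.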

The third tool I used is called a random forest.  Random forests are different from the simple regression and the lasso because instead of relying on linear relationships to predict the final number of retweets, random forests use a combination of decision trees.  Decision trees use a series of conditional statements, such as "is the number of retweets 25 minutes after the tweet was posted more or less than 10?", to make a decision.  Random forests create many decision trees from random sections of the data and use the average of the values generated by all the trees as the prediction.  This tool worked much better than the other two, predicting the data with an error rate of only 19%.
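The idea can be sketched with a heavily simplified forest in plain Python. This is only an illustration under several assumptions: the data is made up, each split looks at a single randomly chosen variable, the trees are kept very shallow, and the trees' predictions are combined by averaging, which is the usual choice for regression forests. A real random forest library does considerably more.

```python
import random


def sse(values):
    """Sum of squared errors around the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(rows, targets, rng, depth=0, max_depth=3):
    """Grow a small regression tree; a leaf is just the mean of its targets."""
    leaf_value = sum(targets) / len(targets)
    if depth >= max_depth or len(set(targets)) == 1:
        return leaf_value
    feature = rng.randrange(len(rows[0]))  # random variable for this split
    best = None
    # Try every observed value except the largest, so both sides are non-empty.
    for threshold in sorted({row[feature] for row in rows})[:-1]:
        left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
        right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, threshold)
    if best is None:  # every row has the same value for this variable
        return leaf_value
    threshold = best[1]
    left = [(r, t) for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [t for _, t in left],
                           rng, depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [t for _, t in right],
                            rng, depth + 1, max_depth),
    }


def predict_tree(node, row):
    """Follow the conditional statements down to a leaf."""
    while isinstance(node, dict):
        side = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Fit each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all the trees."""
    predictions = [predict_tree(tree, row) for tree in forest]
    return sum(predictions) / len(predictions)


# Made-up rows: [retweets 25 min after posting, hour of posting].
rows = [[2, 13], [6, 10], [10, 13], [14, 13], [18, 10], [22, 13]]
final_retweets = [6, 11, 22, 30, 35, 46]

forest = fit_forest(rows, final_retweets)
print(predict_forest(forest, [12, 11]))
```

Because every leaf is an average of observed retweet counts, a forest like this can never predict a value outside the range it saw in training, which is one practical difference from the two linear models.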

Although the random forest doesn't tell us exactly how it made every decision, it does tell us how influential each of the variables was in making the decision, as shown in the graph below.  The graph shows us that the four most important variables are the same ones the lasso selected.  The random forest also finds the size of the audience to which the tweet was exposed and the number of followers at the time the tweet was posted to be slightly influential.


The three tools proved to be very helpful in predicting the number of retweets; none of them were completely off the mark.  However, which one you use for prediction might depend not only on its accuracy but also on how easy it is for you to run in your application.  A regression and the lasso give you a simple equation to run the prediction, while the random forest requires a computer to run the prediction model.


1 The error is calculated as the average, over all data points, of the absolute value of the difference between the prediction and the actual number of retweets, divided by the actual number of retweets.
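Assuming the ratio is taken relative to the actual number of retweets, the error in footnote 1 could be computed as in the short Python sketch below. The function name and the numbers are hypothetical and are not taken from the data set used in this post.

```python
def mean_relative_error(predicted, actual):
    """Average of |prediction - actual| / actual over all tweets."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    ratios = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return sum(ratios) / len(ratios)


# Made-up predictions and actual final retweet counts for three tweets.
predicted = [110, 45, 200]
actual = [100, 50, 250]

print(f"{mean_relative_error(predicted, actual):.0%}")  # prints 13%
```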
