Like many things in life, such as tying a tie, cooking an egg, and skinning a cat, there are different ways to predict the final number of retweets for one of Guy Kawasaki’s tweets.
However, unlike those three tasks, prediction offers objective ways to determine which method is best. In this post I’ll compare three statistical tools, namely simple regression, the lasso (or L1 regularization), and random forests, to determine which is best at predicting retweets. I found that the random forest predicted the data best, with an average error of only 19%. The lasso and simple regression did similarly well, with errors of 46% and 36% respectively.
The chart below shows how well each of the tools performed by plotting the actual number of retweets of each tweet on the x-axis and the predicted number of retweets on the y-axis. Each prediction is color coded green, yellow, or red depending on its accuracy.
The simple regression is our baseline tool and uses the number of retweets 25 minutes after Kawasaki posts a tweet to predict the tweet’s final number of retweets. I picked the 25-minute mark because in a previous post we found that 25 minutes was the optimum time to sample.
The regression does a pretty good job of predicting the data. On average, each prediction has an error of only 36%^{1}. The graph above also shows how well the regression performed: we see mostly green and yellow dots, and only a few red dots.
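As a rough illustration of what a baseline like this involves, here is a minimal least-squares sketch in plain Python. The function names and the numbers are made up for the example (they are not the data from this post); the only idea taken from the text is using the retweet count at 25 minutes as the single predictor.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Spread of x, and co-variation of x with y, around their means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted final retweets for a tweet with x retweets at 25 minutes."""
    return intercept + slope * x


# Hypothetical training data, for illustration only
retweets_at_25_min = [4, 10, 16, 22, 30]
final_retweets = [10, 22, 34, 46, 62]

intercept, slope = fit_simple_regression(retweets_at_25_min, final_retweets)
estimate = predict(intercept, slope, 20)
```

The result is a single equation, which is part of what makes this tool easy to reuse elsewhere.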
The second model that we tried was the lasso. The lasso shares some of a regression’s characteristics, such as assuming a linear relationship between variables. However, the lasso usually does a better job of separating the signal from the noise in the data by selecting only the variables that are truly significant. I threw a lot more data at the lasso than I did at the simple regression. The following table shows all the data I used with the lasso:
| Data | Sample Frequency | Description |
| --- | --- | --- |
| Number of retweets | 5, 15, 25 min. after posting | Retweets after a certain time |
| Difference in retweets | 5, 15, 25 min. after posting | Increase in retweets during a certain time |
| Audience size | 5, 15, 25 min. after posting | Number of users that follow those who retweeted the tweet |
| Difference in audience size | 5, 15, 25 min. after posting | Increase in the number of users that follow those who retweeted the tweet |
| Number of followers | At time of posting | Number of people following @GuyKawasaki when the tweet was posted |
| Source | At time of posting | Tool used to post the tweet |
| Date, week day, hour, minute, second | At time of posting | Time information on when the tweet was posted |

The lasso picked four of these variables as significant: the number of retweets 5, 15, and 25 minutes after the tweet was posted, and the difference between the number of retweets 25 and 15 minutes after the tweet was posted. This result is surprising not because of what the lasso selected, but because of what it didn't select. There doesn't seem to be a strong relationship between the time of day a tweet is posted, or the tool used to post it, and the number of retweets.
In addition, the chart above suggests that the lasso did about as well as the simple regression at predicting the final number of retweets. Actually, it did a little worse, with an average error of 46%.
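To show how the lasso can drop a variable entirely, here is a small coordinate-descent sketch in plain Python. It is a generic textbook formulation, not necessarily the software used for this post, and the penalty value `alpha`, the function names, and the toy data are all assumptions made for the example.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; anything within +/- penalty becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_lasso(X, y, alpha, n_iter=200):
    """Minimize (1/2n) * squared error + alpha * sum(|w|) by coordinate descent."""
    n, p = len(X), len(X[0])
    # Center the columns and the target so the intercept can be recovered later
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [yi - y_mean for yi in y]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Residual when every variable except j is used
                partial = yc[i] - sum(Xc[i][k] * w[k] for k in range(p) if k != j)
                rho += Xc[i][j] * partial
            rho /= n
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    intercept = y_mean - sum(w[j] * col_means[j] for j in range(p))
    return intercept, w


# Hypothetical data: column 0 drives the target, column 1 is unrelated to it
X = [[1, 1], [2, 0], [3, 0], [4, 1]]
y = [3, 6, 9, 12]
intercept, weights = fit_lasso(X, y, alpha=0.5)
```

In this toy case the unrelated column gets a weight of exactly zero, while the useful column keeps a (slightly shrunken) weight. That is the same kind of selection described above, where the time and source variables were left out.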
The third tool I used is called a random forest. Random forests are different from the simple regression and the lasso because, instead of relying on linear relationships to predict the final number of retweets, they use a combination of decision trees. Decision trees use a series of conditional statements, such as "is the number of retweets 25 minutes after the tweet was posted more or less than 10?", to make a decision. Random forests create many decision trees from random subsets of the data and use the median value generated by all the trees as the prediction. This tool worked much better than the other two, predicting the data with an error rate of only 19%.
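The idea can be sketched in a few dozen lines of plain Python. This is a simplified illustration, not the implementation used for this post: the tree depth, the number of trees, the helper names, and the data are all assumptions. Trees are grown on bootstrap samples with a random choice of variables at each split, and the forest combines them with the median as described above (many implementations average instead).

```python
import random
from statistics import mean, median


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, rng, depth=0, max_depth=3, min_leaf=2):
    """Grow a regression tree on rows of (features, target)."""
    targets = [t for _, t in rows]
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return mean(targets)  # leaf: predict the average target
    n_features = len(rows[0][0])
    # Only a random subset of the variables is considered at each split
    candidates = rng.sample(range(n_features), max(1, n_features // 2))
    best = None
    for f in candidates:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [r for r in rows if r[0][f] <= threshold]
            right = [r for r in rows if r[0][f] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([t for _, t in left]) + sse([t for _, t in right])
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:
        return mean(targets)
    _, f, threshold, left, right = best
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree(left, rng, depth + 1, max_depth, min_leaf),
        "right": build_tree(right, rng, depth + 1, max_depth, min_leaf),
    }


def predict_tree(tree, features):
    """Follow the conditional statements down to a leaf value."""
    while isinstance(tree, dict):
        go_left = features[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree


def build_forest(rows, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap sample
        forest.append(build_tree(sample, rng))
    return forest


def predict_forest(forest, features):
    """Combine the trees with the median of their predictions."""
    return median(predict_tree(t, features) for t in forest)


# Hypothetical rows: (retweets at 5, 15, 25 min), final retweets
rows = [
    ((1, 3, 5), 12), ((2, 5, 9), 20), ((3, 8, 14), 31), ((4, 10, 18), 40),
    ((6, 15, 26), 58), ((8, 20, 35), 77), ((10, 26, 45), 98), ((12, 31, 54), 118),
]
forest = build_forest(rows)
prediction = predict_forest(forest, (5, 12, 22))
```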
Although the random forest doesn't tell us exactly how it made every decision, it does tell us how influential each of the variables was in making the decision, as shown in the graph below.
The graph shows us that the four most important variables are the same ones the lasso selected. The random forest also finds the size of the audience to which the tweet was exposed and the number of followers at the time the tweet was posted to be slightly influential.
The three tools proved to be very helpful in predicting the number of retweets; none of them were completely off the mark. However, which one you use for prediction might depend not only on its accuracy but also on how easy it is for you to run in your application. A regression and the lasso give you a simple equation to run the prediction, while the random forest requires a computer to run the prediction model.
^{1} The error is calculated as the average, over all data points, of the ratio of the absolute difference between the prediction and the actual number of retweets to the actual number of retweets.
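A minimal Python sketch of this error measure, assuming the ratio is taken against the actual retweet count for each tweet. The function name and the numbers are made up for illustration.

```python
def mean_relative_error(predicted, actual):
    """Average of |prediction - actual| / actual over all tweets."""
    # Assumption: each difference is divided by the actual retweet count
    ratios = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return sum(ratios) / len(ratios)


# Hypothetical predictions and actual final retweet counts
predicted = [12, 45, 80]
actual = [10, 50, 100]
error = mean_relative_error(predicted, actual)  # roughly 0.167, i.e. about 17%
```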