In a recent Google+ thread, Kawasaki was asked whether he had deleted some of his tweets during the tragic Boston Marathon bombings. In no uncertain terms, Kawasaki responded that he not only “…did not delete any tweets…” on that day, but also that he couldn't “…remember the last time [he] deleted a tweet”. Although it is not clear who is responsible for Kawasaki’s missing tweets, the fact is that some of his tweets do go missing. I define a missing tweet as one that was posted by the @GuyKawasaki Twitter account but, 12 hours later, can no longer be accessed through either the Twitter website or the Twitter API. In the two-week sample of 1,310 @GuyKawasaki tweets used in this analysis [1], 66% of all tweets went missing.
This blog is about finding trends in data, so I thought that a close look at these missing tweets could help us determine the cause of their disappearance. I was not alone. In my previous post, Max Christian Hansen, who originally asked Kawasaki whether he was deleting his tweets, asked me whether the missing tweets could be due to Kawasaki’s practice of reposting the same tweet multiple times. Under that theory, only the most recent repost would be visible while the previous reposts would go missing. To my surprise, his hunch was correct: whether or not a tweet had been reposted could explain whether it would go missing.
The chart below shows the percentage of tweets that go missing by each tweet’s repost count. The repost count describes a tweet's position in a set of reposted tweets. For example, tweets with a repost count of 0 were only posted once. Tweets with a repost count of 1 were the first of multiple tweets with the same tweet string. Tweets with a repost count of 2 were the second of multiple tweets with the same tweet string, and so on. Looking at the chart below, we can see that tweets with a repost count of 0, those that were only posted once, very rarely, if ever, go missing. The dark grey bar that shows the percentage of tweets that don't go missing covers most of the column.
In contrast, the tweets with a repost count of 1, those that were the first of a series of reposted tweets, will almost always go missing. The same applies to tweets that were the 2nd and 3rd in a string of reposted tweets. However, if we look at the 4th tweet in a string of reposted tweets we see that the trend reverses and barely any go missing.
The vast majority of tweets that are reposted are reposted 4 times, as seen by the column widths in the chart above. The width of each column is based on the number of tweets in that category. The tiny width of the column for tweets that were reposted 5 times or more shows that very few tweets are reposted that often. Therefore, most reposted tweets are reposted 4 times, and the 4th tweet in a string of reposted tweets rarely goes missing. We can then conclude that whether or not a tweet goes missing can be explained to a large degree by its order in a string of reposted tweets.
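The repost count and the two quantities the chart encodes (column width and share missing) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample data are made up, and it assumes reposts are detected by an exact match on the tweet string.

```python
from collections import Counter, defaultdict

def repost_counts(texts):
    """Return the repost count of each tweet, given texts in posting order.

    0 means the text was posted only once; k >= 1 means the tweet was
    the k-th posting of a text that was posted more than once.
    """
    totals = Counter(texts)   # how many times each text appears overall
    seen = Counter()          # how many times each text has appeared so far
    counts = []
    for text in texts:
        if totals[text] == 1:
            counts.append(0)
        else:
            seen[text] += 1
            counts.append(seen[text])
    return counts

def missing_share_by_group(groups, missing_flags):
    """Map each group to (number of tweets, percent of them missing)."""
    tally = defaultdict(lambda: [0, 0])   # group -> [tweets, missing tweets]
    for group, is_missing in zip(groups, missing_flags):
        tally[group][0] += 1
        tally[group][1] += int(is_missing)
    return {g: (n, 100.0 * miss / n) for g, (n, miss) in sorted(tally.items())}

# Made-up sample, not the @GuyKawasaki data.
texts = ["A", "B", "A", "C", "A", "B"]
missing = [True, True, True, False, False, False]

counts = repost_counts(texts)
summary = missing_share_by_group(counts, missing)
for count, (n_tweets, pct_missing) in summary.items():
    print(f"repost count {count}: {n_tweets} tweets, {pct_missing:.0f}% missing")
```

In each result, the first number corresponds to the column width in the chart and the second to the share of the column that is missing.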
A second interesting trend related to the missing tweets has to do with the number of retweets a tweet receives. I found that the more retweets a tweet receives, the less likely it is to go missing. The chart below shows the percentage of tweets that go missing by the number of retweets a tweet had received 15 minutes after being posted. For example, the leftmost point shows that of all the tweets that had received 0 retweets 15 minutes after being posted, 60% went missing. The blue trend line shows that a tweet with 13 or fewer retweets 15 minutes after being posted has about a 70% chance of going missing. Once a tweet crosses the 13-retweet barrier, the odds drop very quickly. By the time a tweet reaches 19 retweets or more, it has almost no chance of going missing. In short, tweets with more early retweets are less likely to go missing.
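The post does not say how the 15-minute retweet count was read off the collected data. One possible approach, assuming each tweet has a time-ordered list of periodic snapshots, is to take the latest snapshot at or before the 15-minute mark. The `retweets_at` helper and the snapshot values below are hypothetical.

```python
from bisect import bisect_right

def retweets_at(snapshots, seconds=15 * 60):
    """Return the retweet count from the latest snapshot at or before `seconds`.

    `snapshots` is a list of (seconds_since_post, retweet_count) pairs,
    sorted by time. Returns None if no snapshot is early enough.
    """
    times = [t for t, _ in snapshots]
    i = bisect_right(times, seconds)   # number of snapshots with time <= seconds
    return snapshots[i - 1][1] if i else None

# Made-up snapshots for a single tweet.
snapshots = [(45, 0), (450, 6), (900, 14), (1350, 21)]
count_15min = retweets_at(snapshots)
print(count_15min)
```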
Finally, as described in a previous post but included here for completeness, there is also a relationship between the time of day a tweet is posted and whether or not it goes missing. The chart below shows the percentage of tweets that go missing by the hour when the tweet was posted. You can see that the hour when a tweet was posted tells us a lot about whether it will go missing. Between midnight and 5 a.m. Pacific Time, all tweets go missing. Then, during the next two hours, we see a big decrease in the probability that a tweet will go missing. Across the day, the probability of going missing increases slowly, with the exception of a few sharp decreases like the one at 6 p.m. Finally, between 10 p.m. and midnight, a tweet’s probability of going missing increases significantly, to 92% on average.
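Grouping by hour depends on converting each timestamp to Pacific Time first. Here is a minimal sketch of that step, assuming the timestamps are timezone-aware UTC datetimes; the function names and sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")

def pacific_hour(created_at_utc):
    """Return the hour of day (0-23) in Pacific Time for an aware datetime."""
    return created_at_utc.astimezone(PACIFIC).hour

def missing_rate_by_hour(tweets):
    """Map each Pacific hour to the percent of its tweets that went missing.

    `tweets` is an iterable of (created_at_utc, is_missing) pairs.
    """
    totals, missing = Counter(), Counter()
    for created_at, is_missing in tweets:
        hour = pacific_hour(created_at)
        totals[hour] += 1
        missing[hour] += int(is_missing)
    return {h: 100.0 * missing[h] / totals[h] for h in sorted(totals)}

# Made-up sample. On these dates Pacific Time is UTC-7 (daylight time).
tweets = [
    (datetime(2013, 3, 25, 8, 30, tzinfo=timezone.utc), True),    # 1:30 a.m.
    (datetime(2013, 3, 25, 9, 0, tzinfo=timezone.utc), True),     # 2:00 a.m.
    (datetime(2013, 3, 25, 13, 15, tzinfo=timezone.utc), False),  # 6:15 a.m.
    (datetime(2013, 3, 25, 13, 45, tzinfo=timezone.utc), True),   # 6:45 a.m.
]
rates = missing_rate_by_hour(tweets)
print(rates)
```

Using a named time zone rather than a fixed offset keeps the hour correct on both sides of a daylight saving change.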
So, does this new insight tell us whether Kawasaki is being dishonest when he says he doesn’t delete his posts, or whether Twitter is to blame for the missing tweets? If this were the TV show Mythbusters, we would rule this one "Inconclusive". Although some of the trends above, especially the repost count, explain to a large degree whether a tweet will go missing or not, none is a smoking gun with someone’s fingerprints on it. In future posts we will explore other ways to analyze Kawasaki’s Twitter data to see if we can find a culprit in the case of the missing tweets.
[1] The data used in this analysis was collected between March 21, 2013 and April 6, 2013. I used the Twitter API to collect the data and sampled for a new tweet every 45 seconds.