Reconstructing Twitter's Firehose
How to reconstruct over 99% of Twitter's firehose for any time period
Author: Jason Michael Baumgartner (Owner of Pushshift.io)
E-mail: jason@pushshift.io
Twitter: @jasonbaumgartner
Since there has been a lot of interest in the methodology behind this, I've decided to
write a blog entry that details not only the technical aspects, but also gives some insight
into the history of Twitter's ID creation scheme and some interesting tidbits on
distributed computing in general. I hope that this article provides you with valuable
insight into the many nuances of data science and data collection strategies. So without
further ado, let's dive in!
On any given day, thousands of tweets are created every second on Twitter's social
media platform. Twitter offers developers many different API endpoints that can be used
to retrieve data, but there is no specific endpoint that allows reconstructing the
thousands of tweets posted every second. Twitter does sell premium data services,
including their much-coveted "firehose" stream. Twitter's firehose is a complete
stream of all tweets made on their platform and is only available to a few businesses at
an extraordinary price. This paper details methods for reconstructing portions of
Twitter's firehose for research purposes.
The ability to ingest tweets for a specific time period can be invaluable for research.
Whether your end goal is to gather a small sample of tweets between two points in time
or to ingest as much data from Twitter as possible, the technical methods covered in this
paper will give you the ability to ingest various percentages of tweets (anywhere
between 1% and 99% of all tweets between two points in time). The techniques outlined
below will give you flexibility for ingesting a chosen percentage of all publicly available
tweets.
Before diving into the technical details of how to implement this process, let's start by
reviewing some history. Twitter arrived on the social media scene in late March of
2006. During Twitter's infancy, there wasn't much data volume -- perhaps a couple of
tweets every minute. At the beginning, Twitter used a sequential id scheme for its
tweets and user ids. Since all of the ids were sequential, it was easy to traverse the
space of possible tweets: all one had to do was request ids in order (1, 2, 3, 4, etc.).
This worked well at the time but also presented Twitter with one of its first problems --
the ability to create ids for tweets and users in a distributed fashion without depending
on one centralized service to assign ids. To better understand the problem, we first
have to understand one of the main technical issues with a quickly growing platform
such as Twitter.
Within a year of its launch, Twitter grew quickly, and that rapid growth required adding
servers to handle the increasing load. During the period when Twitter ids were
sequential, a single service was responsible for generating and assigning ids for new
tweets and users. This presented a single point of failure in their backend. Twitter
increased its capacity to handle the growing volume of tweets and new users by adding
more servers to process data. As more servers were added, their engineering team
realized that they needed to decentralize id creation. If the service that handled id
creation went offline, every server responsible for handling incoming tweets would be
unable to assign a unique id to the tweet it was processing. Their engineering team
needed to find a decentralized and scalable solution for assigning ids to new incoming
tweets.
There are a couple of methods for assigning unique ids to new objects in a decentralized
way. Twitter could have elected to simply use a UUID scheme
(https://en.wikipedia.org/wiki/Universally_unique_identifier) that would assign each new
tweet a random id. By using a large number of bits, the chance of possible id collisions
would be virtually zero. However, Twitter wanted to find a way of assigning numeric ids
to new objects that didn't involve hashes and that also preserved some type of order,
thereby making ids sortable chronologically.
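As a quick aside, here is a minimal Python sketch using the standard uuid module that illustrates why random UUIDs, while collision-resistant, carry no ordering information:

import uuid

# Version 4 UUIDs are generated from random bits, so two ids created back to back
# say nothing about which one was created first -- unique, but not time-sortable.
first = uuid.uuid4()
second = uuid.uuid4()
print(first)
print(second)
print(first < second)  # arbitrary: roughly a 50/50 chance either way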
Enter the Snowflake
In 2010, Twitter engineers devised a method to decentralize id creation while also
preserving order (maintaining the ability to sort tweets chronologically). They announced
this effort, which they named Snowflake, in a post on their engineering blog.
Basically, the method they came up with works like this:
By using the new Snowflake algorithm, each server would be able to generate a unique
id on its own without the need to communicate with other servers. They did this by
breaking up a tweet id into several components -- epoch millisecond timestamp,
datacenter id, server id and sequence id. Each id created by the snowflake algorithm
uses epoch time in milliseconds in the high order bits while also using the datacenter id,
server id and sequence number in the low order bits.
Woah, wait a second -- high order, low order -- what? You've lost me.
(Further reading: https://en.wikipedia.org/wiki/Bit_numbering#Most_significant_bit)
I’ll begin by giving a detailed explanation using examples to help clarify what all this
means. Let's first talk about numbers and how machines see them. Any integer can be
represented by bits in a computer (as long as you have enough bits to represent really
large numbers). Let's take the integer 183. When we talk about integers, we usually
refer to them using base 10 math. However, computers do computation using
base 2 mathematics -- more popularly known as binary.
Getting back to the number 183 -- this number can be expressed in binary as 10110111.
In this example, it required 8 numbers in base 2 (binary) to represent a 3 digit number in
base 10. So how is a number like 183 transformed into binary? How does 10110111
represent 183? Starting on the right side (low order), the first digit is 1. A binary
number is basically a bunch of switches where 1 represents "ON", 0 represents “OFF”
and each slot represents a power of 2. The first far right number is 2 raised to the 0
(which is equal to one), the second number moving to the left is 2 raised to the 1 (which
is 2), the next is 2 raised to the 2, which is four, etc. Basically, it looks like this (the top
being the far right number and each one below it being the number to the left):
Binary Addition
Bit | Contribution
1 | 2^0 * 1 = 1
1 | 2^1 * 1 = 2
1 | 2^2 * 1 = 4
0 | 2^3 * 0 = 0
1 | 2^4 * 1 = 16
1 | 2^5 * 1 = 32
0 | 2^6 * 0 = 0
1 | 2^7 * 1 = 128
Adding up all the numbers on the far right (1,2,4,16,32,128) leads to 183. This is binary
arithmetic. They say there are 10 types of people that understand binary -- and you are
now in the group that does!
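You can check this arithmetic for yourself in a Python interpreter (the leading '0b' simply marks the value as binary):

>>> bin(183)
'0b10110111'
>>> int('10110111', 2)
183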
The digits to the right are considered low order bits and the ones towards the left are
considered high order bits. In computer jargon, high order bits are also referred to as
MSB (most significant bits) and these numbers represent the higher values. Great, so
how does this relate to Twitter ids?
Remember, Twitter had a couple major goals when creating their Snowflake algorithm --
every server could create ids without conflicting with another server and all ids created
would preserve some type of ordering (in this case, temporal). Let's look at an actual
Twitter ID and show which parts represent what. I'll use a tweet id from one of my
tweets: 1100125195476631553. I will use Python commands to help illustrate examples
and give you the opportunity to try them out for yourself. This id contains all the
information needed to tell you when it was created down to the millisecond and which
datacenter and server created it, and what sequence id value it represents. In order to
do this, we first have to convert it to its binary equivalent (brace yourself):
>>> bin(1100125195476631553)
'0b111101000100011011100000100101011100110101111000000000000001'
Twitter stores the following information in the binary representation of an id:
The first 12 binary digits (the least significant bits, or LSB -- the numbers furthest to the
right) represent the sequence id. In this example, the sequence id is ‘000000000001’, or
simply 1.
The next 5 binary numbers contain the server id for the tweet.
‘11000’ = 24
The next 5 binary numbers contain the data center id for the tweet:
‘01011’ = 11
The remaining high-order binary digits contain the time in milliseconds of the tweet.
However, there is a small catch -- an offset needs to be applied. The offset used is the
millisecond timestamp of when Snowflake officially started: 1288834974657. So in order
to get the time in milliseconds of when the tweet was made, you take the value of those
remaining binary digits and add it to this offset.
‘11110100010001101110000010010101110011’ = 262290285939 + 1288834974657 =
1551125260596
The epoch time of creation for this tweet is 1551125260.596 or (UTC) Monday, February
25, 2019 8:07:40 PM (and .596 milliseconds)
Here is the tweet ID again with each part separated out and labeled:
| Time Component | DC | SID | Sequence ID|
|11110100010001101110000010010101110011|01011|11000|000000000001|
DC = Datacenter id
SID = Server id
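To make the breakdown above concrete, here is a minimal Python sketch that extracts the same four components using bit shifts and masks (the function name and structure are just for illustration; the field widths and offset come straight from the walkthrough above):

TWITTER_EPOCH_MS = 1288834974657  # Snowflake's starting offset, from the walkthrough above

def decompose_tweet_id(tweet_id: int) -> dict:
    """Split a Snowflake tweet id into its timestamp, datacenter, server and sequence parts."""
    sequence_id   = tweet_id & 0xFFF                       # lowest 12 bits
    server_id     = (tweet_id >> 12) & 0x1F                # next 5 bits
    datacenter_id = (tweet_id >> 17) & 0x1F                # next 5 bits
    timestamp_ms  = (tweet_id >> 22) + TWITTER_EPOCH_MS    # remaining high-order bits plus the offset
    return {"timestamp_ms": timestamp_ms, "datacenter_id": datacenter_id,
            "server_id": server_id, "sequence_id": sequence_id}

# Matches the worked example above: timestamp 1551125260596 ms, datacenter 11, server 24, sequence 1
print(decompose_tweet_id(1100125195476631553))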
Hopefully this removes some of the mystery behind Twitter tweet ids and gives you a
better understanding of how they are created. The reason the time component is
located in the higher order bits is to keep the ids sortable by time. By having the time
component in the most significant bits, temporal order is preserved and tweet ids can be
sorted chronologically. It’s important that you understand how Twitter creates their tweet
ids before moving on -- I encourage you to reread this section if this still feels a bit
unclear.
Using Snowflake to our advantage
Understanding how tweet ids are created is the first step in helping to reduce the id
space that needs to be scanned in order to find actual tweets for any given time range.
For each millisecond in the timeline, how many tweet ids are possible? We know that
the lowest 22 bits contain the sequence id, server id and data center id. If we knew
nothing about Twitter's infrastructure, we'd have to check every combination. How many
combinations are there with 22 bits? There are a total of 4,194,304 possible values that
can be represented with 22 bits! Let’s do some quick math to see if it is feasible to
gather tweets by scanning such a large space.
Twitter's statuses lookup API endpoint allows for a total of 1,200 API calls every 15
minutes. Each call allows the user to pass 100 ids, for a total of 120,000 id requests
every 15 minutes using both APP auth (300 requests) and USER auth (900 requests).
Scanning every possible value would therefore require roughly 525 minutes of API calls
-- and that's just to check one millisecond of actual timeline data! It is completely
infeasible given those API limits. But can we do better? If you're reading this,
you know we can!
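If you want to verify that estimate, here is the back-of-the-envelope arithmetic as a short Python snippet (using the rate limits described above):

ids_per_millisecond = 2 ** 22      # every sequence id / server id / datacenter id combination
ids_per_call        = 100          # statuses/lookup accepts up to 100 ids per call
calls_per_window    = 1200         # 300 app-auth + 900 user-auth calls per 15-minute window

calls_needed   = ids_per_millisecond / ids_per_call   # ~41,943 API calls
windows_needed = calls_needed / calls_per_window      # ~35 fifteen-minute windows
print(windows_needed * 15)                            # ~525 minutes -- for a single millisecond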
(Note: For the remainder of this article, I will treat the bit space representing the server id and data center
id as one entity by combining them. Realistically, we aren’t concerned with what data center was used, we
mainly care about combinations and permutations. This is best done by treating these two different ids
conceptually as one “machine id.” Recall that 5 bits are used for the data center id and 5 bits are used for
the server id -- so if we combine them both, we’ll treat those ten bits as one entity and refer to this as the
machine id.)
Taking advantage of Twitter’s Snowflake scheme
When I ran an analysis on millions of tweets, I made a surprising observation -- Twitter
doesn’t use that many machines to create their ids. In fact, the bulk of ids are created by
20 or fewer machines. This reduces the space tremendously. Now we only have 20 *
the number of possible sequence ids that can be represented by 12 bits (remember, the
sequence id is the first 12 bits). How many possible values are there in 12 bits? There
are 4,096 possible values to check. We need to check all 4,096 for each machine (20).
That means the total space is now just 81,920 instead of 4,194,304. We’ve reduced the
id space that we have to check by a factor of over 50!
Unfortunately, there is still a problem -- it will still require over 10 minutes to scan that
space for each millisecond of time. In order to scan an entire second of actual history, it
will require over 10,000 minutes. That means it will take a week of constant API calls
just to reconstruct one second of Twitter firehose data. We’ve gone from the realm of
impossible to absurd -- but can we do better? In fact, we can!
After analyzing millions of tweets to see what sequence ids are most used, I found that
most tweets only use a sequence ID of 0! The next largest batch of tweets uses a
sequence ID of 1. In fact, over 99% of all tweets use a sequence id of 10 or less! Here
is a breakdown of sequence id usage when I checked one million tweets:
The main sequence ids
Seq ID | % of Tweets | Cumulative %
0 0.49815 0.49815
1 0.26266 0.76080
2 0.12046 0.88126
6 0.04343 0.92469
5 0.02783 0.95252
3 0.02588 0.97839
7 0.00747 0.98586
4 0.00735 0.99321 ← over 99%!
8 0.00304 0.99625
10 0.00109 0.99734
What this shows is that almost half of all tweets have a sequence id of 0. Over 75% of
all tweets have a sequence id of 0 or 1. Over 90% of all tweets have a sequence id of
0, 1, 2 or 6. (These numbers may change with a larger sample -- I suspect that sequence
ids are handed out sequentially -- although I have seen very high sequence ids in the
past, so there may be more fine-tuning possible.)
This is amazing because this means that, theoretically, you could get almost half of all
available tweets by just checking tweets with a sequence id of 0! How does this reduce
the space of ids that we need to check?
20 machine ids * 1 sequence id = 20 total id checks for each millisecond of time.
With 120,000 possible id requests every 15 minutes (using one dev account), we can
scan 6,000 milliseconds of actual timeline data (6 seconds). That doesn't sound like a
lot, but it is equal to 576 seconds of timeline data each day -- almost ten minutes of
timeline data at close to a 50% sample of all tweets! We've gone from impossible to
absurd to manageable!
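The same budget arithmetic, with the reduced search space (20 machine ids, sequence id 0 only) and a single dev account, can be sketched like this:

machine_ids_in_use   = 20      # observed machine ids
sequence_ids_checked = 1       # sequence id 0 only (~50% of tweets)
ids_per_call         = 100
calls_per_window     = 1200    # per 15 minutes
windows_per_day      = 24 * 4

ids_per_ms      = machine_ids_in_use * sequence_ids_checked        # 20 id checks per millisecond
ms_per_window   = calls_per_window * ids_per_call / ids_per_ms     # 6,000 ms of timeline per window
seconds_per_day = ms_per_window * windows_per_day / 1000           # 576 seconds of timeline per day
print(ms_per_window, seconds_per_day)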
For many projects, you wouldn’t need to collect a 50% sample. What if we wanted a
10% sample for a specific range of time? Let’s take a look and see what percentage of
tweets are processed by each machine id:
The first 20 machine ids
Machine ID | % of Tweets | Cumulative %
332 0.09057 0.09057
335 0.05927 0.14984
361 0.05724 0.20707
363 0.05452 0.26159
381 0.05406 0.31565
364 0.05378 0.36944
382 0.05344 0.42288
372 0.05341 0.47629
362 0.05263 0.52892
376 0.05247 0.58139
375 0.05174 0.63312
350 0.05145 0.68458
365 0.04923 0.73380
325 0.04076 0.77456
347 0.03860 0.81317
326 0.03855 0.85172
336 0.03748 0.88920
333 0.03660 0.92581
327 0.03576 0.96157
342 0.03453 0.99610 ← over 99%!
For the time range I sampled, 10% of tweets were handled by machine id 332. 15%
were handled by 2 machine ids. 20% were handled by 3 machine ids. In fact, over 99%
of all tweets were handled by twenty machine ids in total. If we wanted a 10% sample,
let’s assume we would need to scan the first five machine ids. How does that reduce the
space of ids that we need to check?
120,000 id checks against 5 machine ids with sequence 0 should give us around a 15%
sample (remember, around 49% of all tweets have a sequence id of 0). That means we
can now sample four times faster if we want a ~ 15% sample. We’re now approaching
35 minutes of timeline data for each day of processing (using one dev account)!
If ~99% of all tweets were desired, it would require 200 id lookups per millisecond of
timeline data. Using the maximum number of requests for one developer account, it
would be possible to scan 6 seconds of timeline data every 150 minutes (a little less
than one minute of timeline data per day of requests). If multiple developer accounts
were used, or an application asked for user authorization to scan more ids (900
additional requests per 15 minute window for each user that authorized the app), it
could be feasible to reconstruct large portions of the timeline by making parallel
requests and cycling through user keys. An application that collected 100 user
authorizations could scan close to 100 minutes of timeline data per day. Taken to an
extreme, if an application had ~1,650 users authorizing the app to make requests, the
entire firehose itself (~99% sample rate) could be reconstructed in real-time. Such a
system could be designed to make requests from multiple servers and guided by a
master control program to reconstruct Twitter’s entire firehose!
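As a rough model of how this scales with the size of the authorization pool (the per-window limits are the ones described above; actual throughput will vary with how the app and user windows are scheduled):

def timeline_seconds_per_day(num_users, id_checks_per_ms=200,
                             app_calls=300, user_calls=900, ids_per_call=100):
    """Rough estimate of timeline seconds reconstructable per day at ~99% coverage."""
    calls_per_window = app_calls + num_users * user_calls      # per 15-minute window
    ms_per_window = calls_per_window * ids_per_call / id_checks_per_ms
    return ms_per_window * 96 / 1000                           # 96 fifteen-minute windows per day

print(timeline_seconds_per_day(1))     # ~57.6 seconds of timeline per day with a single account
print(timeline_seconds_per_day(100))   # grows roughly linearly with the size of the auth pool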
(Author’s note: It is my belief that if enough university students and researchers worked together, we
could reproduce the firehose at a 99+% sample rate and make the data available to all researchers -- that
would be amazingly cool! Getting ~1,700 people (a few extra as a buffer for incomplete calls, etc.) to
participate wouldn’t be an insurmountable task and if there is interest in doing this project, I would be
happy to help write the code to manage everything including the distribution of the stream to various
universities, etc. We would only need to get around 1,700 users to authorize the app to make status
lookups on their behalf -- the application would not require any sensitive permissions. My e-mail is
jason@pushshift.io if you are interested in kick-starting such an adventure. Creating a “decahose” stream
(~10%) would only require around 50-75 users to participate! A 50% stream would require a few hundred
people.)
Keep in mind that while ingesting this data, you will also capture retweets, which contain
the original tweet. That means the likelihood of capturing a tweet that was popular
during the time period approaches 100% the more often it was retweeted.
Doing an initial pre-scan to determine what machine ids were used
One important step in this process is doing an initial pre-scan for the time period you are
interested in ingesting. The purpose of the pre-scan is to determine which machine ids
were in use during a specific time period. Twitter has added and removed machines over
the years, so the machine ids in use change over time. The pre-scan should cycle over all
possible ten-bit values that make up the machine id (5 bits are reserved for the server
id and 5 bits are reserved for the data-center id). There are 1,024 possible machine ids,
so the pre-scan will need to iterate over all of them for the time range that will be
ingested. This pre-scan does not have to collect a large number of tweets, but it should
use at a minimum 1,200 API calls (15 minutes' worth of API calls for the statuses lookup
endpoint) to find all the machine ids that were in use during the timespan that will be
ingested.
1,200 API calls will allow for 120,000 id checks. With this amount, 117 milliseconds of
time can be scanned checking all possible machine ids.
(Note: I will add this functionality to the firehose ingest script and will call the method “prescan.” The
prescan method will accept a start period, end period and number of checks. The function will then space
out the requests between the start period and end period to determine which machine ids were in play
during the time range. I will include this function in the eventual Github commit for the firehose
reconstruction code.)
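Since the author's prescan method isn't published yet, here is a rough sketch of what it might look like, built by running the Snowflake layout in reverse. The lookup_statuses() helper is hypothetical (a thin wrapper around the statuses/lookup endpoint), and rate-limit handling is omitted:

TWITTER_EPOCH_MS = 1288834974657

def candidate_ids_for_millisecond(epoch_ms, sequence_id=0):
    """Build one candidate tweet id for each of the 1,024 possible machine ids in a given millisecond."""
    timestamp_part = (epoch_ms - TWITTER_EPOCH_MS) << 22
    for machine_id in range(1024):                    # 5 datacenter bits + 5 server bits combined
        yield timestamp_part | (machine_id << 12) | sequence_id

def prescan(start_ms, end_ms, samples, lookup_statuses):
    """Probe evenly spaced milliseconds in [start_ms, end_ms) and report which machine ids produced tweets."""
    seen = set()
    step = max((end_ms - start_ms) // samples, 1)
    for ms in range(start_ms, end_ms, step):
        ids = list(candidate_ids_for_millisecond(ms))
        for i in range(0, len(ids), 100):             # statuses/lookup accepts 100 ids per call
            for tweet in lookup_statuses(ids[i:i + 100]):
                seen.add((int(tweet["id"]) >> 12) & 0x3FF)   # recover the combined 10-bit machine id
    return sorted(seen)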
Example of ingesting the timeline during a major event
New Year’s Day in Tokyo
To check how many tweets can be captured each second using this method, the
firehose reconstruction script was executed using ten sequence ids (0-9). The following
machine ids (20 total) were in play during New Year’s Day in Tokyo (2019):
335, 325, 342, 326, 347, 382, 361, 365, 363, 362, 336, 375, 364, 333, 381, 327, 350, 372, 332, 376
Tokyo is GMT +9, which translates to an epoch time of 1546268400. I started the script
to begin one second before the new year. Checking 10 sequence ids across 20 machine
ids requires 200 id checks for each millisecond of timeline data, which means two API
calls are necessary to scan each millisecond. With a limit of 1,200 API calls per
15 minutes, 600 milliseconds of timeline data can be captured every 15 minutes. Here is
the frequency of tweets captured each second using this technique:
(Author Note: I'm still collecting this data. Any seconds listed are complete. I will add more data as it
becomes available. Each sample below is one complete second. This should represent between
95%-99% of the full firehose. Times below are in UTC -- Tokyo is GMT +9.)
Tweets | Time Period
9,568 Mon Dec 31 14:59:59 +0000 2018
19,339 Mon Dec 31 15:00:00 +0000 2018
30,478 Mon Dec 31 15:00:01 +0000 2018
33,919 Mon Dec 31 15:00:02 +0000 2018
33,422 Mon Dec 31 15:00:03 +0000 2018
26,623 Mon Dec 31 15:00:04 +0000 2018
22,286 Mon Dec 31 15:00:05 +0000 2018
17,009 Mon Dec 31 15:00:06 +0000 2018
18,347 Mon Dec 31 15:00:07 +0000 2018
17,079 Mon Dec 31 15:00:08 +0000 2018
15,988 Mon Dec 31 15:00:09 +0000 2018
15,714 Mon Dec 31 15:00:10 +0000 2018
15,961 Mon Dec 31 15:00:11 +0000 2018
15,639 Mon Dec 31 15:00:12 +0000 2018
Another example of ingesting timeline data during a major event
New Year’s Day in NYC
Using the same methodology that was used to capture tweets during New Year's Day in
Tokyo, I changed the time period to capture tweets for New Year's Day in New York City.
I scanned 10 sequence ids after running a prescan to determine which machine ids
were in play during this time period. The following machine ids were active:
332, 375, 325, 350, 347, 376, 365, 361, 336, 333, 372, 362, 327, 342, 382, 381, 326, 335, 364, 363
New York City is GMT -5 during this time. There is a large difference in the volume of
tweets between these two events. Below is the data for each second during the New
Year's event in NYC.
Tweets | Time Period
3,367 Tue Jan 01 04:59:59 +0000 2019
5,921 Tue Jan 01 05:00:00 +0000 2019
4,884 Tue Jan 01 05:00:01 +0000 2019
5,027 Tue Jan 01 05:00:02 +0000 2019
4,851 Tue Jan 01 05:00:03 +0000 2019
4,779 Tue Jan 01 05:00:04 +0000 2019
4,680 Tue Jan 01 05:00:05 +0000 2019
4,499 Tue Jan 01 05:00:06 +0000 2019
4,411 Tue Jan 01 05:00:07 +0000 2019
4,380 Tue Jan 01 05:00:08 +0000 2019
4,314 Tue Jan 01 05:00:09 +0000 2019
4,261 Tue Jan 01 05:00:10 +0000 2019
4,382 Tue Jan 01 05:00:11 +0000 2019
4,257 Tue Jan 01 05:00:12 +0000 2019
Statistical Probability of ingesting a tweet retweeted N times
(Edit note: Add some examples / chart of the probability of ingesting a tweet that was retweeted given a
specific sampling percentage vs. how many times a tweet was retweeted.)
For example, if one ingests a 10% random sample and a tweet is retweeted 10 times,
the probability of ingesting that tweet is (1 - (.90^10)) = ~ 65% chance of getting that
tweet.
Generalized, where X is the sample rate (as a fraction) and N is the number of times
something was retweeted:
P = 1 - (1 - X)^N
Example
X = .25 (25% random sample stream)
N = 5 (retweeted at least 5 times)
1 - (1 - .25)^5 = 76.27% chance of ingesting the original tweet
Working backwards -- To have a 95% probability of ingesting something from a 25%
stream, you would need it to occur at least log(0.05) / log(1 - 0.25) = 10.4 times.
(Author note: A big thank you to Xanda Schofield Twitter:@XandaSchofield for help with the math.)
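These calculations are easy to reproduce in Python:

import math

def capture_probability(sample_rate, times_retweeted):
    """Probability that at least one copy of a tweet shows up in a random sample."""
    return 1 - (1 - sample_rate) ** times_retweeted

print(capture_probability(0.10, 10))          # ~0.65 -- the 10% sample example above
print(capture_probability(0.25, 5))           # ~0.7627 -- the 25% sample example above
print(math.log(0.05) / math.log(1 - 0.25))    # ~10.4 occurrences for a 95% chance at 25%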
Use Cases:
Temporal Analysis of Significant Events
A significant event happens and a researcher wishes to find the first tweet mentioning
an earthquake, terrorist attack, etc. One could use the Twitter search API, but the public
search endpoint only covers roughly the last 10 days of history. The method outlined in
this paper would allow researchers to go back to historical events and find the first
tweets that mention something. (This could be especially useful for earthquakes --
grabbing a five-minute window of tweets to capture every tweet that mentions feeling
the ground shake.)
Complete sample to better understand SPAM / BOT Retweet activity
President Trump posts a tweet -- how was it retweeted during the first few minutes
afterwards? This method would help detect bot-like activity on Twitter by scanning the
time blocks right after a specific account tweets to see whether the same accounts are
active each time.
Creating a real-time stream
One could create self-adapting code that utilizes all available authorizations to stream
the firehose at X% (depending on the number of authorized accounts in the rotation pool).
With one dev account, it should be possible to get a real-time stream with around 1% of
all tweets, which could easily scale up as more authorizations are added to the pool. If
approximately 50-75 users gave the app authorization to use the statuses/lookup
endpoint, a real-time decahose (10%) would be possible.
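Here is a minimal sketch of the credential-rotation idea. The lookup_statuses() helper is hypothetical (a thin wrapper around the statuses/lookup endpoint), and a real implementation would also need to track each token's 15-minute rate-limit window:

import itertools

def stream_sample(tokens, id_batches, lookup_statuses):
    """Round-robin batches of candidate ids across a pool of authorized tokens.

    tokens          -- credentials for every user/app that has authorized the application
    id_batches      -- iterator yielding lists of up to 100 candidate tweet ids
    lookup_statuses -- hypothetical wrapper around the statuses/lookup endpoint
    """
    token_pool = itertools.cycle(tokens)
    for batch in id_batches:
        token = next(token_pool)    # spread calls across the pool to multiply the rate limits
        for tweet in lookup_statuses(token, batch):
            yield tweet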
Actual Code Examples
How to break apart a Twitter ID and view its components using Python:
(Edit Note: Add this code to Github)
#!/usr/bin/env python3