Social media data is constantly changing. Whether it's Twitter, Facebook, Instagram, YouTube, forums, blogs, or news, people are sharing on social media and you need to listen.
This data is important. The demographic and engagement data you can compute for business or research purposes can be extremely valuable. Not to mention the inbound sales leads and the support you can provide your customers.
Problems arise when you need to access this data in near real time. We're talking about a lot of data. Twitter currently publishes about 500M tweets per day and Facebook is at more than 2B. More than 500 hours of video are uploaded to YouTube every minute. The Spinn3r firehose is nearly 300GB per day and rising.
This is where a firehose API comes in handy. Unlike search APIs or static web data, a firehose will push content to the user 24/7 with very low latency (usually seconds, or at most a few minutes).
Twitter essentially coined the term 'firehose' as applied to streaming data APIs, though the word had been used colloquially in the tech industry for years.
However, multiple companies now have their own firehose products and also license them to 3rd parties. This includes Facebook with its Topic Data API, as well as Google, which provides firehose APIs to users of YouTube and Google+.
Besides Twitter and social media data, other sources of data are available via firehose APIs, including:
RSS and blog data feeds
Activity data (click streams)
Traditionally, a number of data providers have offered only search APIs. While flexible, these don't work very well for bulk access to data.
These follow a traditional query/response model with users executing search queries and then receiving responses from the API in the form of JSON, XML or other machine readable formats.
These usually support advanced boolean queries such as
microsoft OR google
but also fielded queries like
microsoft AND source_followers:>100000
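To make the semantics of a fielded query concrete, here is a minimal sketch in Python of what a search backend effectively evaluates for a query like the one above. The record layout and field names here are hypothetical, not a real API schema:

```python
# Sketch of evaluating the fielded boolean query
# "microsoft AND source_followers:>100000" against post records.
# The "text" and "source_followers" fields are hypothetical names.

def matches(post):
    """True if the post mentions 'microsoft' AND its source
    has more than 100,000 followers."""
    return ("microsoft" in post["text"].lower()
            and post["source_followers"] > 100000)

posts = [
    {"text": "Microsoft ships a new SDK", "source_followers": 250000},
    {"text": "Microsoft earnings call",   "source_followers": 4200},
    {"text": "Google I/O highlights",     "source_followers": 900000},
]

hits = [p for p in posts if matches(p)]
# Only the first post satisfies both clauses.
```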
Additionally, they can support features like aggregations where you can compute analytics while executing your query. These usually show up as a set of buckets with stats per bucket.
For example, you could run an aggregation by time to count the number of posts per hour, or another aggregation by gender to see the male/female breakdown.
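The bucket-with-stats model can be sketched in plain Python. Assume each post record carries an ISO timestamp and a gender field (both hypothetical names, since the real schema varies by provider):

```python
from collections import Counter

# Sketch of two aggregations over a result set: posts per hour,
# and a male/female breakdown. Field names are hypothetical.
posts = [
    {"ts": "2017-01-05T10:12:00", "gender": "male"},
    {"ts": "2017-01-05T10:47:00", "gender": "female"},
    {"ts": "2017-01-05T11:03:00", "gender": "male"},
]

# Aggregation by time: bucket on the hour prefix of the timestamp.
per_hour = Counter(p["ts"][:13] for p in posts)

# Aggregation by gender: one bucket per distinct value.
per_gender = Counter(p["gender"] for p in posts)
```

Each aggregation comes back as a set of buckets with a count per bucket, exactly the shape described above.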
[Figure: Data volumes on social media after the 2016 US Election]
This works well when you only want to receive a small fraction of the original data, but breaks down when you want more than about 30% of the original data stream. At that point you're going to want to switch to a firehose.
Remember that we're still talking about a massive amount of data. Anywhere from 100GB to 500GB per day.
This is why filtering the firehose can come in handy. A decent firehose API will support filtering whereby you can limit the inbound data based on coarse grained filters.
For example, you might only want a specific language. If you only care about French, it doesn't make a ton of sense to download (or pay for) other languages that you will never use.
When using a search API, the data is simply stored in the cloud. Your service provider stores the data for you, and you receive only the content you request via search (and pay for).
However, for a firehose, this content needs to be sent to your datacenter and your application.
But what happens during failure scenarios? If you have a production application you will be processing data for years and intermittent failures are to be expected.
In this situation your firehose needs resume and recovery support.
This needs to be proven and well tested, ideally with a crash-only philosophy, so that when you go offline, recovery is easy.
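The crash-only idea can be sketched as a consumer that persists its position after every item, and on startup simply resumes from the last checkpoint. This is an illustrative sketch, not any particular client library; `fetch` semantics and the checkpoint file name are assumptions:

```python
# Sketch of a crash-only consumer: persist the last processed
# position after each item; on startup, resume from the checkpoint.
# The checkpoint file name and stream shape are hypothetical.
import json
import os

CHECKPOINT = "checkpoint.json"

def load_position():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["position"]
    return 0  # first run: start at the beginning of the window

def save_position(position):
    # Write-then-rename so a crash never leaves a torn checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"position": position}, f)
    os.replace(tmp, CHECKPOINT)

def consume(items):
    pos = load_position()
    for item in items[pos:]:
        # ... process item ...
        pos += 1
        save_position(pos)
    return pos
```

If the process crashes and restarts, `consume` picks up exactly where it left off, with no special shutdown path needed.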
With Spinn3r, we keep 5 days of data ready for recovery. It has been very rare for us to use this feature, but we've had customers with serious datacenter failures who needed to replay their data.
We’ve been in business for ten years now and have plenty of stories under our belt.
One customer had a full HBase failure and lost about 48 hours of data which required a restore from backups.
There are a number of great technologies that can be used to process the firehose.
They generally fall into three categories: stream processing, full-text search, and batch processing.
The stream processing model generally allows companies to compute statistics on the fly while streaming the data, without necessarily keeping the data for more than a few minutes.
This can yield massive infrastructure savings. Why spend $50k per month on hardware when you can just keep a small window of content?
Apache Spark Streaming, Storm, and Esper are all solid choices here. They work by keeping a small buffer in memory and building up stats within a time interval. When the interval is over, the data is flushed, making way for new data.
For example, you could keep a running count of hits for a specific hashtag, then write the stats out to an analytics database like KairosDB.
This would dramatically compress the data, allowing you to store less than 1% of the original content. The downside is that you won't be able to run historical queries or any queries outside of your original pre-computed metrics.
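The buffer-then-flush cycle described above can be sketched without any framework. In a real deployment a system like Spark Streaming or Storm would manage the windows, and the flushed stats would go to a store like KairosDB; the writer callback here is a hypothetical stand-in:

```python
from collections import Counter

class TumblingWindow:
    """Count hashtag hits within a fixed interval, then flush the
    buffer so only a tiny summary of the stream is retained."""

    def __init__(self, interval_secs, write_stats):
        self.interval = interval_secs
        self.write_stats = write_stats  # e.g. a KairosDB writer
        self.window_start = None
        self.counts = Counter()

    def on_post(self, ts, hashtags):
        if self.window_start is None:
            self.window_start = ts
        # Interval over: flush the stats and start a fresh buffer.
        if ts - self.window_start >= self.interval:
            self.write_stats(self.window_start, dict(self.counts))
            self.counts.clear()
            self.window_start = ts
        self.counts.update(hashtags)

flushed = []
w = TumblingWindow(60, lambda start, stats: flushed.append(stats))
w.on_post(0,  ["#news"])
w.on_post(30, ["#news", "#tech"])
w.on_post(65, ["#tech"])  # crosses the 60s boundary: flush
```

Only the per-window summaries survive; the posts themselves are discarded as soon as each interval closes.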
There’s still a use case for using firehose APIs to replicate data to 3rd party full-text search clusters.
We still have customers running under this model, and they usually host their own Solr or Elasticsearch install.
The upside here is that you can combine the firehose with your own search data.
Of course, the downside is that you have to pay for a lot of hardware to host your own search index. And remember, we're talking about a lot of data, so even just 30 days of content can cost you tens of thousands of dollars per month.
Batch processing allows you to efficiently execute large data processing jobs over a massive amount of data. We actually provide customers with an Apache Spark/Hive install and host the data on our own Elasticsearch cluster. At the time of this writing, this is more than 150TB of content.
Batch processing is usually much better than running queries one at a time and can handle more data than the streaming model.
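As a toy illustration of the batch model, a single pass over archived content can compute a metric across the whole corpus. Plain Python stands in for a Spark or Hive job here, and the record fields are hypothetical:

```python
import io
import json

# Toy stand-in for a batch job: one pass over archived JSON-lines
# content, aggregating a metric across the whole corpus. A real
# deployment would express this as a Spark or Hive job instead.
archive = io.StringIO("\n".join(json.dumps(p) for p in [
    {"lang": "en", "likes": 10},
    {"lang": "fr", "likes": 3},
    {"lang": "en", "likes": 7},
]))

likes_by_lang = {}
for line in archive:
    post = json.loads(line)
    likes_by_lang[post["lang"]] = (
        likes_by_lang.get(post["lang"], 0) + post["likes"]
    )
```

The same scan-and-aggregate shape scales from this in-memory toy to a cluster job over terabytes, which is the appeal of the batch model.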
The only problem is that real-time querying isn't very efficient. Usually, combining all three approaches (streaming, search, and batch) works best.
Whether your business needs a search API or a firehose really depends on the requirements of your application.
What’s clear though is that for production applications you should spend a significant amount of time making the right decision as data, in bulk, can be quite expensive.
At Spinn3r we’ve tried to make it possible for our customers to work with much more data at more reasonable pricing but at the end of the day working with terabytes of content isn’t always inexpensive.