Spinn3r

Trusted by Top Tier Startups

Spinn3r is powering startups that have raised in excess of $100M in VC funding. We've been put through the wringer a number of times, and have had extensive external audits by companies performing migrations to our infrastructure. We've migrated three large crawlers over to Spinn3r, saving massive amounts of cash and infrastructure headaches for the companies involved.

Industry Standardization

It's interesting that most of the industry is standardizing around our infrastructure. Why wouldn't they? We've been in production for over three years and have been burned in by plenty of other startups.

Research Program

By far, our most successful program over the last year has been our research program.

We're pretty excited to make this announcement public now.

We have researchers at Harvard, Carnegie Mellon, Stanford, Caltech, University of Maryland Baltimore County, University of Washington, University of Southern California, Nanyang Technological University, University of Edinburgh, National Institute of Informatics in Japan, University of Hannover, Stony Brook University, and on and on.

Basically, if you're a PhD researching the blogosphere, you're probably using Spinn3r.

Published Research

Cornell recently launched a Memetracker powered by Spinn3r, and we're really excited about it!

TextMap is another search engine that uses Spinn3r. Their paper, "Large-Scale Sentiment Analysis for News and Blogs," from the 2007 International Conference on Weblogs and Social Media (ICWSM), does a good job of explaining their system.

A number of our customers are performing entity extraction and sentiment analysis, and this space is going to be rapidly maturing in the next few years.

Sponsoring ICWSM 2009

We're sponsoring ICWSM 2009 with a web crawling corpus representing a four month crawl powered by Spinn3r.

We provided them with four months of data, which came to nearly 400GB uncompressed. It's turning out to be a huge success, with more than 150 researchers requesting access. We've also extended them access to Spinn3r, and will continue to do so for the foreseeable future.

Papers

The following papers have been released using our architecture and crawl data. If your paper isn't listed here please contact us and we'll make sure it is added.

Specifications and Architectures of Federated Event-Driven Systems

Specifying the Personal Information Broker Data Acquisition: Data can be acquired from multiple sources – currently we use Spinn3r, later we will also acquire IEM, Twitter, Technorati, etc. Each of these acquisitions is specified differently. Acquisition of Spinn3r data, referenced in Figure 3 step 1, is achieved through changing URL arguments in a manner defined by Spinn3r. Thus, the specification is unique to Spinn3r. While that particular specification cannot be reused, using the compositional approach, exchanging Spinn3r for Twitter, a news feed, or an instant messaging account while maintaining the integrity of the composition is trivial. The specifications for all of these information interfaces are very different; a notation that allows the description of composite applications must account for this.
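As a rough illustration of the compositional idea in this excerpt, the sketch below puts each source behind the same small interface, so a source driven by URL arguments can be exchanged for another without touching the rest of the pipeline. Everything named here is hypothetical: the endpoint https://feed.example.invalid/items, the "after" and "limit" arguments, and the helper functions are placeholders, not Spinn3r's actual API and not the paper's notation.

```python
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlencode

# A source is anything that returns a list of items when called.
Source = Callable[[], List[str]]


def build_request_url(base: str, args: Dict[str, str]) -> str:
    """Build a request URL whose behaviour is controlled by its arguments."""
    return f"{base}?{urlencode(args)}"


def url_argument_source(
    base: str, args: Dict[str, str], fetch: Callable[[str], List[str]]
) -> Source:
    """Wrap a URL-argument-driven feed behind the common Source interface.

    `fetch` is injected so the sketch runs offline; a real one would do HTTP.
    """
    def source() -> List[str]:
        return fetch(build_request_url(base, args))
    return source


def static_source(items: Iterable[str]) -> Source:
    """A stand-in for any other feed (news, messaging, and so on)."""
    frozen = list(items)
    return lambda: list(frozen)


def compose(sources: Iterable[Source]) -> List[str]:
    """Merge items from all sources; the composition ignores how each is specified."""
    merged: List[str] = []
    for source in sources:
        merged.extend(source())
    return merged


def fake_fetch(url: str) -> List[str]:
    """Hypothetical fetcher that just echoes the URL it was asked for."""
    return [f"item from {url}"]


crawl = url_argument_source(
    "https://feed.example.invalid/items",  # placeholder endpoint
    {"after": "2008-08-01", "limit": "10"},  # placeholder arguments
    fake_fetch,
)
other = static_source(["item from another feed"])
items = compose([crawl, other])
```

Swapping `crawl` for a different source changes only the line that constructs it; `compose` stays the same.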

Blogs as Predictors of Movie Success

In this work, we attempt to assess if blog data is useful for prediction of movie sales and user/critics ratings. Here are our main contributions:

• We evaluate a comprehensive list of features that deal with movie references in blogs (a total of 120 features) using the full spinn3r.com blog data set for 12 months.

• We find that aggregate counts of movie references in blogs are highly predictive of movie sales but not predictive of user and critics ratings.

• We identify the most useful features for making movie sales predictions using correlation and KL divergence as metrics and use clustering to find similarity between the features.

• We show, using time series analysis as in (Gruhl, D. et. al. 2005), that blog references generally precede movie sales by a week and thus weekly sales can be predicted from blog references in the preceding weeks.

• We confirm low correlation between blog references and first week movie sales reported by (Mishne, G. et. al. 2006) but we find that (a) budget is a better predictor for the first week; (b) subsequent weeks are much more predictive from blogs (with up to 0.86 correlation).
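The claim that blog references lead sales by about a week can be checked with a lagged correlation. The sketch below is only an illustration of that idea, not the authors' method: it computes a plain Pearson correlation between references in week t and sales in week t + lag, and the weekly numbers are made up for the example.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(
    references: Sequence[float], sales: Sequence[float], lag: int
) -> float:
    """Correlate references in week t with sales in week t + lag."""
    if lag < 0 or lag >= len(references):
        raise ValueError("lag must be between 0 and len(references) - 1")
    if lag == 0:
        return pearson(references, sales)
    return pearson(references[:-lag], sales[lag:])


# Made-up weekly counts in which sales roughly follow references a week later.
weekly_references = [10, 40, 25, 15, 8, 5]
weekly_sales = [3, 12, 41, 26, 14, 9]

same_week = lagged_correlation(weekly_references, weekly_sales, lag=0)
one_week_lead = lagged_correlation(weekly_references, weekly_sales, lag=1)
```

On this toy data the one-week lag gives a noticeably higher correlation than the same-week comparison, which is the pattern the excerpt describes.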

Data and Features

The data set we used for this paper is the spinn3r.com blog data set from Nov. 2007 until Nov. 2008. This data set includes practically all the blog posts published on the web in this period (approximately 1.5 TB of compressed XML).

Blogvox2: A modular domain independent sentiment analysis system

Bloggers make a huge impact on society by representing and influencing the people. Blogging by nature is about expressing and listening to opinion. Good sentiment detection tools, for blogs and other social media, tailored to politics can be a useful tool for today’s society. With the elections around the corner, political blogs are vital to exerting and keeping political influence over society. Currently, no sentiment analysis framework that is tailored to Political Blogs exist. Hence, a modular framework built with replicable modules for the analysis of sentiment in blogs tailored to political blogs is thus justified.

Spinn3r (http://tailrank.com) provided live spam-resistant and high performance spider dataset to us. We tested our framework on this dataset since it was live feeds and we wanted to test our performance of sentiment analysis on these dataset for performance analysis and testing. We periodically pinged the online api for the current dataset of all the rss feeds. Although we had different domains that were provided to us, we chose the political domain for consistency with our other results.

Meme-tracking and the Dynamics of the News Cycle

Tracking new topics, ideas, and “memes” across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abrupt spikes in the appearance of particular named entities. However, these approaches are less well suited to the identification of content that spreads widely and then fades over time scales on the order of days — the time scale at which we perceive news and events.

Dataset description. Our dataset covers three months of online mainstream and social media activity from August 1 to October 31 2008 with about 1 million documents per day. In total it consist of 90 million documents (blog posts and news articles) from 1.65 million different sites that we obtained through the Spinn3r API [27]. The total dataset size is 390GB and essentially includes complete online media coverage: we have all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums and other media sites. From the dataset we extracted the total 112 million quotes and discarded those with L < 4, M < 10, and those that fail our single-domain test with ε = .25. This left us with 47 million phrases out of which 22 million were distinct. Clustering the phrases took 9 hours and produced a DAG with 35,800 non-trivial components (clusters with at least two phrases) that together included 94,700 nodes (phrases).
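The filtering step in this excerpt can be sketched in a few lines. The excerpt does not define its symbols, so the code assumes that L is the number of words in a phrase, M is the number of times it is mentioned, and the single-domain test rejects a phrase when a fraction ε or more of its mentions come from one domain. These readings, and the function and parameter names, are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def keep_phrase(
    phrase: str,
    mention_domains: List[str],
    min_words: int = 4,        # assumed meaning of L
    min_mentions: int = 10,    # assumed meaning of M
    epsilon: float = 0.25,     # assumed single-domain threshold
) -> bool:
    """Return True if a quoted phrase survives all three filters.

    `mention_domains` holds one entry per mention: the domain it appeared on.
    """
    if len(phrase.split()) < min_words:
        return False
    if not mention_domains or len(mention_domains) < min_mentions:
        return False
    # Share of mentions contributed by the single most frequent domain.
    top_count = Counter(mention_domains).most_common(1)[0][1]
    return top_count / len(mention_domains) < epsilon


spread_out = [f"site{i}.example" for i in range(12)]
concentrated = ["spam.example"] * 6 + [f"site{i}.example" for i in range(6)]

kept = keep_phrase("a quote with five words", spread_out)
dropped = keep_phrase("a quote with five words", concentrated)
```

A phrase that is long and frequent enough is still dropped when half of its mentions sit on one domain, as in the `concentrated` case.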

Flash Floods and Ripples: The Spread of Media Content through the Blogosphere

This paper is based on the Spinn3r data set (ICWSM 2009), which consists of web feeds collected during a two month period in 2008. The data set includes posts from blogs as well as other data sources like news feeds. We discuss our methodology for cleaning up the data and extracting posts of popular blog domains for the study. Because the Spinn3r data set spans multiple blog domains and language groups, this gives us a unique opportunity to study the link structure and the content sharing patterns across multiple blog domains. For a representative type of content that is shared in the blogosphere, we focus on videos of the popular web-based broadcast media site, YouTube.

Our analysis, based on 8.7 million blog posts by 1.1 million blogs across 15 major blog hosting sites, reveals a number of interesting findings. First, the network structure of blogs shows a heavy-tailed degree distribution, low reciprocity, and low density. Although the majority of the blogs connect only to a few others, certain blogs connect to thousands of other blogs. These high-degree blogs are often content aggregators, recommenders, and reputed content producers. In contrast to other online social networks, most links are unidirectional and the network is sparse in the blogosphere. This is because links in social networks represent friendship where reciprocity and mutual friends are expected, while blog links are used to reference information from other data sources.
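For readers unfamiliar with the two network measures mentioned here, the sketch below computes them with their common textbook definitions on a tiny made-up link graph: density as the share of possible directed links that exist, and reciprocity as the share of links whose reverse link also exists. The paper may define them slightly differently, and the blog names are placeholders.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (linking blog, linked blog)


def clean_edges(edges: Iterable[Edge]) -> Set[Edge]:
    """Drop duplicate links and self-links."""
    return {(src, dst) for src, dst in edges if src != dst}


def density(edges: Iterable[Edge], node_count: int) -> float:
    """Fraction of the node_count * (node_count - 1) possible directed links present."""
    if node_count < 2:
        raise ValueError("need at least two nodes")
    return len(clean_edges(edges)) / (node_count * (node_count - 1))


def reciprocity(edges: Iterable[Edge]) -> float:
    """Fraction of links (u, v) for which the reverse link (v, u) also exists."""
    edge_set = clean_edges(edges)
    if not edge_set:
        raise ValueError("need at least one link")
    mutual = sum(1 for src, dst in edge_set if (dst, src) in edge_set)
    return mutual / len(edge_set)


# Made-up graph: blogs a and b link to each other; the other links are one-way.
links = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]

link_density = density(links, node_count=4)   # 4 of 12 possible links
link_reciprocity = reciprocity(links)          # 2 of 4 links are mutual
```

In a real blog graph with over a million nodes, both values would be far smaller than in this toy example, which is what the excerpt means by low reciprocity and low density.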

Identifying Personal Stories in Millions of Weblog Entries

Stories of people’s everyday experiences have long been the focus of psychology and sociology research, and are increasingly being used in innovative knowledge-based technologies. However, continued research in this area is hindered by the lack of standard corpora of sufficient size and by the costs of creating one from scratch. In this paper, we describe our efforts to develop a standard corpus for researchers in this area by identifying personal stories in the tens of millions of blog posts in the ICWSM 2009 Spinn3r Dataset. Our approach was to employ statistical text classification technology on the content of blog entries, which required the creation of a sufficiently large set of annotated training examples. We describe the development and evaluation of this classification technology and how it was applied to the dataset in order to identify nearly a million personal stories.

In this paper, we describe our efforts to overcome the limitations of our previous story collection research using new technologies and by capitalizing on the availability of a new weblog dataset. In 2009, the 3rd International AAAI Conference on Weblogs and Social Media sponsored the ICWSM 2009 Data Challenge to spur new research in the area of weblog analysis. A large dataset was released as part of this challenge, the ICWSM 2009 Spinn3r Dataset (ICWSM, 2009), consisting of tens of millions of weblog entries collected and processed by Spinn3r.com, a company that indexes, interprets, filters, and cleanses weblog entries for use in downstream applications. Available to all researchers who agree to a dataset license, this corpus consists of a comprehensive snapshot of weblog activity between August 1, 2008 and October 1, 2008. Although this dataset was described as containing 44 million weblog entries when it was originally released, the final release of this dataset actually consists of 62 million entries in Spinn3r.com’s XML format.

SentiSearch: Exploring Mood on the Web

Given an accurate mood classification system, one might imagine it to be simple to configure the classifier as a search filter, thus creating a mood-based retrieval system. However, the challenge lies in the fact that in order to classify the mood for a potential result, the entire content of that page must be downloaded and analyzed. Much like a typical web-based retrieval system, to avoid this cost, pages could be crawled and their mood indexed along with the representation stored for search indexing. Alternatively, the presence of a massive dataset from www.spinn3r.com enabled the ESSE system to be built, performing mood classification and result filtering on the fly (Burton et al. 2009). Because the dataset (including textual content), search system, and mood classification system all exist on the same server, the filtering retrieval system was made possible. The dataset not only allows access to the content of a blog post (beyond the summary and title typically made available through search APIs) but the closed nature of the dataset allows for experimentation while still being vast enough to provide breadth and depth of topical coverage.

Event Intensity Tracking in Weblog Collections

The data provided for ICWSM 2009 came from a weblog indexing service Spinn3r (http://spinn3r.com). This included 60 million postings spanned over August and September 2008. Some meta-data is provided by Spinn3r.

Each post comes with Spinn3r’s pre-determined language tag. Around 24 million posts are in English, 20 million more are labeled as ‘U’, and the remaining 16 million are comprised of 27 other languages (Fig. 3). The languages are encoded in ISO 639 two-letter codes (ISO 639 Codes, 2009). Other popular languages include Japanese (2.9 million), Chinese/Japanese/Korean (2.7 million) and Russian (2.5 million). The second largest label is U unknown. This data could potentially hold posts in languages not yet seen or posts in several languages. Our present work, including additional dataset analysis presented next, is limited to the English posts unless otherwise specified. In future work we plan to also consider other languages represented in the dataset.

Quantification of Topic Propagation using Percolation Theory: A study of the ICWSM Network

Our research is the first attempt to give an accurate measure for the level of information propagation. This paper presents ‘SugarCube’, a model designed to tackle part of this problem by offering a mathematically precise solution for the quantification of the level of topic propagation. The paper also covers the application of SugarCube in the analysis of the social network structure of the ICWSM/Spinn3r dataset (ICWSM 2009). It presents threshold values for the communities found within the collection, and paves the way for the measurement of topic propagation within those communities. Not only can SugarCube quantify the proliferation level of topics, but it also helps to identify ‘heavily-propagated’ or Global topics. This novel approach is inspired by Percolation Theory and its application in Physics (Efros 1986).