Spinn3r

Domo Arigato Mr Roboto!

Spinn3r is used by search engines with more than 500M combined page views per month, dozens of startups, and more than 100 PhDs performing research on the blogosphere and social networks.

If you're allowing Google, Yahoo, Microsoft, etc. to crawl your site, we would like access as well.

What is Spinn3r?

Spinn3r is a web service that crawls on behalf of dozens of companies, researchers, and web startups.

If you're indexing the blogosphere, you should probably be using Spinn3r. We provide raw access to every blog post being published, in real time. We provide the data so you can focus on building your application or mashup.

Why are you reading my site?

Spinn3r indexes your site on behalf of our users so that your content can be included in their applications. Our users include search engines, analytics services, competitive intelligence services, etc.

Are you wasting my bandwidth?

Spinn3r uses very little bandwidth to monitor your site. We request each page only once and cache it after we've fetched it.

How do I verify that the robot visiting my website is Spinn3r?

First, it will have a User-Agent of:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1; aggregator:Spinn3r (Spinn3r 3.1); http://spinn3r.com/robot) Gecko/20021130

Second, we support robot DNS verification.

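When you have an HTTP log entry with our user agent, perform a reverse DNS lookup on the raw IP address. The returned hostname should be under spinn3r.com, and a forward lookup of that hostname should return the same IP address.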

For example:

%shell% nslookup 64.34.195.138
Non-authoritative answer:
138.195.34.64.in-addr.arpa name = robot32.spinn3r.com.

%shell% nslookup robot32.spinn3r.com
Non-authoritative answer:
Name: robot32.spinn3r.com
Address: 64.34.195.138
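
The same two lookups can be scripted. The following is a minimal Python sketch, not an official Spinn3r tool: the function name verify_robot_ip and its parameters are hypothetical, and it assumes every robot host resolves to a name under spinn3r.com, as in the example above. The resolver functions can be swapped out so the logic can be checked without network access.

```python
import socket


def verify_robot_ip(ip, domain="spinn3r.com", reverse=None, forward=None):
    """Return True if `ip` reverse-resolves to a host under `domain`
    and that host forward-resolves back to the same `ip`.

    `reverse` and `forward` are optional resolver callables (hypothetical
    hooks for testing); by default the system resolver is used.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        # Step 1: reverse lookup (PTR record), e.g. robot32.spinn3r.com.
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse DNS entry

    # The hostname must be the domain itself or a subdomain of it.
    # A plain substring test would accept spinn3r.com.evil.example.
    if host != domain and not host.endswith("." + domain):
        return False

    try:
        # Step 2: forward lookup must include the original address.
        return ip in forward(host)
    except OSError:
        return False
```

For example, verify_robot_ip("64.34.195.138") performs both lookups against live DNS. The forward check matters because anyone who controls the reverse zone for their own IP range can make it claim an arbitrary hostname.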

Can I tell Spinn3r to stop reading my site?

We currently monitor the XML feeds syndicated from your weblog. If you don't want us to index your feeds (or your HTML), let us know. Most people want us to index their site, so this rarely happens.

Spinn3r also supports robots.txt, so you can block us this way as well.
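
As a rough illustration, the Python sketch below checks a robots.txt rule with the standard library's urllib.robotparser. The user-agent token "Spinn3r" is an assumption taken from the User-Agent string shown earlier; this page does not state the exact token the robot matches, and example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only the (assumed) Spinn3r token.
ROBOTS_TXT = """\
User-agent: Spinn3r
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The assumed Spinn3r token is denied everywhere on the site.
spinn3r_allowed = parser.can_fetch("Spinn3r", "http://example.com/feed.xml")

# Other robots are unaffected because no rule applies to them.
other_allowed = parser.can_fetch("SomeOtherBot", "http://example.com/feed.xml")

print(spinn3r_allowed, other_allowed)
```

Changing "Disallow: /" to a narrower path would block only that part of the site.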

Why is Spinn3r requesting XML files that don't exist on my server?

We use web standards as much as possible to find the feeds that exist on your site. Unfortunately, many websites break web standards in ways that can confuse robots. We try to verify that your weblog software is configured correctly by requesting additional files. We try to avoid downloading the entire file, and we use conditional GETs to avoid wasting bandwidth. The main drawback of this approach is that it generates 404 errors in your logs, but we only do this once per week.

Does Spinn3r index my feed?

If your site offers an RSS feed, we try to find it and index it. If not, we try to analyze your HTML instead. If you want to influence how Spinn3r sees your site, the best way is to publish an RSS feed with full content (including all HTML from your posts).
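
One common, standards-based way for a robot to find a feed is autodiscovery through link elements with rel="alternate" in the page head. The Python sketch below illustrates that technique only; this page does not say that Spinn3r uses exactly this mechanism, and the names FeedLinkFinder and find_feeds are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" ...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

A page that advertises its feed this way does not rely on a robot guessing feed URLs.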

How does Spinn3r attempt to minimize my bandwidth usage?

  1. Compression. We use gzip compression to reduce the number of bytes transferred between our servers and yours. This usually results in significant bandwidth savings.
  2. Only fetch when your weblog has changed. We use the If-Modified-Since and ETag HTTP headers to prevent duplicate downloads. Not every weblog system supports these standards in all scenarios.
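
The two techniques above can be sketched from the client side in Python. This is an illustrative sketch under stated assumptions, not Spinn3r's actual implementation: the function names and the cache entry layout are hypothetical, and header lookups are case-sensitive for brevity. In HTTP, a stored ETag value is sent back in the If-None-Match request header, and a stored Last-Modified value is sent back in If-Modified-Since. A server that supports them answers 304 Not Modified with no body when nothing has changed.

```python
import gzip


def build_request_headers(cache_entry):
    """Build request headers for a (possibly) previously fetched URL."""
    # Always advertise gzip support so the server may compress the body.
    headers = {"Accept-Encoding": "gzip"}
    if cache_entry:
        if cache_entry.get("etag"):
            headers["If-None-Match"] = cache_entry["etag"]
        if cache_entry.get("last_modified"):
            headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers


def apply_response(status, headers, body, cache_entry):
    """Return the cache entry to keep after a response arrives."""
    if status == 304:
        # Not modified: no body was transferred, so reuse the cached copy.
        return cache_entry
    if status != 200:
        raise ValueError(f"unexpected status {status}")
    if headers.get("Content-Encoding", "").lower() == "gzip":
        body = gzip.decompress(body)
    # Remember the validators for the next conditional request.
    return {
        "etag": headers.get("ETag"),
        "last_modified": headers.get("Last-Modified"),
        "body": body,
    }
```

On the first fetch, cache_entry is None and only Accept-Encoding is sent. Later fetches carry the validators, so an unchanged feed costs only a small 304 response.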