With the release of Spinn3r 3.0, we have decided to share key statistics on our crawler with the public, including language breakdown data, content management distribution, and so on.
These stats are updated live 24/7 and recomputed on the fly. There are additional metrics available to our paying customers via our admin console.
Breakdown of languages across all content in the blogosphere. We determine the language of each post computationally from its content, rather than relying on the language configured for the weblog, which might be incorrect.
Posts with an 'unknown' language are almost certainly short posts of fewer than about 200 characters.
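As a rough illustration of content-based language detection, here is a minimal sketch. It is not Spinn3r's actual algorithm, which is not described here. It assumes a simple stopword-counting technique, tiny hypothetical word lists, and a hypothetical MIN_CHARS cutoff chosen to mirror the "about 200 characters" observation above.

```python
# Illustrative stopword lists (assumption); a real detector would use
# much richer statistical models and many more languages.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "es": {"el", "la", "y", "es", "de", "que", "en", "los"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
}

# Hypothetical cutoff: posts shorter than this are not classified.
MIN_CHARS = 200


def detect_language(text):
    """Guess a language code from content, or 'unknown' if there is too little signal."""
    if len(text) < MIN_CHARS:
        return "unknown"
    words = text.lower().split()
    # Score each language by how many of its stopwords occur in the post.
    scores = {
        lang: sum(1 for word in words if word in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Under these assumptions, a short post falls through to 'unknown' regardless of its content, which is consistent with the pattern noted above.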
Weblog Hosting Provider Breakdown
Breakdown of posts across major weblog hosting providers. We separate these from other hosting providers which, while they might host weblogs, might not have users actively participating in the core of the blogosphere.
Right now this is implemented with link pattern matching. For example, a blog is identified as WordPress if its URL contains 'wordpress.com'. Of course, this method is prone to error: it doesn't correctly identify Movable Type, WordPress, or any other standalone blog hosted on its own domain.
Further, TypePad.com is probably undercounted as well, as most users there use domain masking.
We expect to release a patch for this soon that produces more precise values and identifies these blog hosts more accurately.
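The link pattern matching described above can be sketched roughly as follows. This is a minimal illustration, not the production implementation: the HOST_PATTERNS table and the classify_host function are hypothetical names, and only the 'wordpress.com' pattern comes from the text. The sketch also matches on the hostname rather than anywhere in the URL, which is a slightly stricter variant of a plain substring check.

```python
from urllib.parse import urlparse

# Hypothetical pattern table: hostname suffix -> provider label.
# Only 'wordpress.com' is taken from the text; the rest are illustrative.
HOST_PATTERNS = {
    "wordpress.com": "WordPress",
    "typepad.com": "TypePad",
    "livejournal.com": "LiveJournal",
}


def classify_host(url):
    """Return a provider label for the URL, or 'unknown' if no pattern matches."""
    hostname = (urlparse(url).hostname or "").lower()
    for suffix, provider in HOST_PATTERNS.items():
        # Match the bare domain or any subdomain of it.
        if hostname == suffix or hostname.endswith("." + suffix):
            return provider
    return "unknown"
```

A URL such as http://example.wordpress.com/2008/01/post is labeled WordPress, while a standalone WordPress or Movable Type blog on its own domain comes back as 'unknown'. That is exactly the source of error noted above.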
ALL Hosting Provider Breakdown
All weblog hosts, including MSN Live Spaces, MySpace, and LiveJournal. They might not qualify as traditional blogs, so we decided to break them out into a dedicated metric.
Raw number of RSS and Atom feeds being indexed by Spinn3r. This is directly correlated with the number of posts Spinn3r sees in both the permalink API and the feed API. This number might fluctuate with the raw ping rate at any given time, as well as with the weekly update cycle for pinged feeds.
Feed Content Performance
Number of feed items per hour indexed and made available through the feed API. You might see less content from the raw API, because we do not include posts in the API if they have been registered by one of our customers but not yet approved as non-spam content.