Spinn3r

Technical Specifications

Spinn3r has been in production for over three years. It has been extended to support all versions of RSS and all significant crawling and indexing standards, as well as newer, niche, and experimental standards.

We also go beyond the standards and index RSS feeds and web content that do not conform to them. For example, our RFC 822 date parser handles localized month names (such as Ene for Enero, Spanish for January), which are not part of the RFC.
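
The lenient date handling described above could be approximated as follows. This Python sketch is not Spinn3r's parser: the LOCALIZED_MONTHS table and both function names are hypothetical, and a real table would need to cover far more languages. It simply rewrites known localized month abbreviations to their English forms and then defers to the standard library's RFC 822 parser.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime

# Hypothetical, deliberately tiny mapping from localized month abbreviations
# to the English abbreviations that RFC 822 requires.
LOCALIZED_MONTHS = {
    "ene": "Jan",  # Enero (Spanish)
    "abr": "Apr",  # Abril (Spanish)
    "ago": "Aug",  # Agosto (Spanish)
    "dic": "Dec",  # Diciembre (Spanish)
}

# Any standalone three-letter word is a candidate month abbreviation.
_THREE_LETTER_WORD = re.compile(r"\b([^\W\d_]{3})\b")


def normalize_months(value: str) -> str:
    """Replace known localized month abbreviations with English ones."""
    def swap(match):
        word = match.group(1)
        return LOCALIZED_MONTHS.get(word.lower(), word)
    return _THREE_LETTER_WORD.sub(swap, value)


def parse_lenient_rfc822(value: str) -> datetime:
    """Parse an RFC 822 style date that may use a localized month name."""
    return parsedate_to_datetime(normalize_months(value))
```

For example, parse_lenient_rfc822("Tue, 15 Ene 2008 10:30:00 +0100") yields a timezone-aware datetime for 15 January 2008. Abbreviations that mean different things in different languages would need extra care that this sketch does not attempt.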

XML, RSS, and Atom Specifications

  • RSS (all RSS specifications)
    • RSS 0.90
    • RSS 0.91 Netscape
    • RSS 0.91 Userland
    • RSS 0.92
    • RSS 0.93
    • RSS 0.94
    • RSS 1.0
    • RSS 2.0
  • Atom
    • Atom 0.3
    • Atom 1.0 (RFC 4287)
    • Atom (generic, for partially broken feeds)
  • RSS autodiscovery
    • Aggressive RSS discovery: tests all /xml, rss, and feed links
    • Vendor specific aggregator link detection
  • mod_xhtml, RSS and Atom content (encoded, XML, or HTML)
  • Original item publication times in either RSS or Atom including atom:published and dc:created
  • Atom and RSS author metadata
  • wfw:comments
  • XHTML 1.0 output for content extract encoding
  • UTF-8 encoding for all content
  • Atom threading (RFC 4685)
  • Feedburner RSS extensions
  • Both RFC 822 and ISO 8601 time formats. This includes support for ISO 8601 timezones, and additional localized month names not present in the original specification but used in practice.
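
The RSS autodiscovery items in the list above can be illustrated with a short Python sketch. It is an assumption about how such discovery might work, not a description of Spinn3r's implementation: FeedLinkFinder, discover_feeds, and the HINTS tuple are hypothetical names. Declared feeds come from link elements with rel="alternate" and a feed MIME type; the "aggressive" candidates are anchors whose href contains /xml, rss, or feed, and these would still need to be fetched and tested.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
HINTS = ("/xml", "rss", "feed")  # substrings taken from the bullet above


class FeedLinkFinder(HTMLParser):
    """Collect declared feed links and heuristic feed candidates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.declared = []    # <link rel="alternate" type="application/...+xml">
        self.candidates = []  # <a href> values that merely look like feeds

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        href = attr.get("href", "")
        if not href:
            return
        url = urljoin(self.base_url, href)
        if tag == "link":
            rels = attr.get("rel", "").lower().split()
            if "alternate" in rels and attr.get("type", "").lower() in FEED_TYPES:
                self.declared.append(url)
        elif tag == "a" and any(hint in href.lower() for hint in HINTS):
            self.candidates.append(url)


def discover_feeds(html, base_url):
    """Return (declared_feeds, candidate_links) found in an HTML page."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.declared, finder.candidates
```

Keeping declared feeds separate from heuristic candidates lets a crawler trust the former and verify the latter before indexing them.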

Protocol Robustness

  • All HTML entities (to prevent feeds from breaking)
  • Whitespace correction and content filtering for partially broken RSS and Atom feed support
  • Monitored accuracy detection. Feeds in Spinn3r are constantly monitored for missing or dropped items and for feeds that fall behind.
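
To show why HTML entity support matters for feed robustness, here is a minimal Python sketch. XML defines only five named entities, so a feed containing something like &nbsp; or &eacute; fails in a strict XML parser. The function name replace_html_entities is hypothetical, and this regular-expression approach is one possible technique rather than the one Spinn3r necessarily uses; it rewrites HTML-only named entities as numeric character references, which any XML parser accepts.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself predefines; these must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(xml_text):
    """Rewrite HTML-only named entities as numeric character references."""
    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names as-is
        return "&#%d;" % name2codepoint[name]
    return _NAMED_ENTITY.sub(swap, xml_text)


def parse_lenient(xml_text):
    """Parse feed XML after neutralizing HTML entities."""
    return ET.fromstring(replace_html_entities(xml_text))
```

With this in place, a fragment such as <title>Caf&eacute;&nbsp;&amp; bar</title> parses cleanly instead of raising an "undefined entity" error.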

Microformats and Additional Specs

  • hAtom
  • rel-tag
  • Tag support including Technorati tags, dc:subject, mod_taxonomy (RSS 1.0)
  • robots-nocontent
  • Google ad section targeting

Computed Metadata

  • mathematically computed language classification
  • codepage detection for all multibyte languages
  • n-gram language detection for European languages
  • internal spam probability detection
  • content extraction or chrome/template removal
  • inbound link count
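
The n-gram language detection item above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Spinn3r's classifier: the SAMPLES texts are made-up training data, the function names are hypothetical, and cosine similarity over character trigram counts is just one common way to compare n-gram profiles. A production system would train on large corpora per language.

```python
import math
from collections import Counter


def trigrams(text):
    """Count character trigrams in lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical, tiny training samples; real profiles need far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro duerme al sol",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def detect_language(text):
    """Return the language code whose profile is most similar to the text."""
    query = trigrams(text)
    scores = {lang: cosine(query, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)
```

Character n-grams work well for European languages because common function words and suffixes produce distinctive trigram distributions even in short texts.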

Robots

  • robot DNS verification
  • robust aggregator identification (User-Agent + reverse/forward DNS)
  • cyclical (30-minute) indexing
  • robots.txt
  • Weblog ping support (pingomatic, weblogs.com, blo.gs, etc.)
  • Integrated duplicate detection (verifies that the same content is not indexed multiple times)
  • Full HTML indexing/fetching via permalink API
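
The reverse/forward DNS check mentioned in the list above is commonly done in two steps: resolve the client IP to a hostname, confirm the hostname belongs to the robot's domain, then resolve that hostname forward and confirm it maps back to the same IP. The Python sketch below assumes this general technique; verify_robot, its parameters, and the example domain are hypothetical. The resolver functions are injectable so the logic can be exercised without network access.

```python
import socket


def verify_robot(ip, allowed_suffixes, reverse=None, forward=None):
    """Return True if `ip` reverse-resolves into an allowed domain and that
    hostname forward-resolves back to the same IP address."""
    # Default resolvers use the system DNS; tests can pass in fakes.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # The hostname must equal an allowed domain or be a subdomain of one.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        return ip in forward(host)
    except OSError:
        return False
```

A User-Agent string alone is trivial to forge, so the forward confirmation is what makes the identification robust: an attacker can control the reverse record for their own IP but not the forward record of the robot's domain.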

HTTP Protocol Support

  • If-Modified-Since
  • If-None-Match (ETag support)
  • gzip content encoding
  • connection and read timeout detection for slow server mitigation
  • DNS caching
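
The HTTP features listed above combine naturally into a conditional GET. The following Python sketch shows one way a polite crawler might use them with the standard library; it is not Spinn3r's client, and the function names and the ExampleBot/0.1 user agent are hypothetical. The request carries the validators saved from the previous fetch, a 304 response means nothing changed, and a single timeout bounds both connecting and reading.

```python
import gzip
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None,
                              user_agent="ExampleBot/0.1"):
    """Build a GET request that revalidates a previously fetched resource."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", user_agent)
    req.add_header("Accept-Encoding", "gzip")
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def decode_body(body, content_encoding):
    """Decompress the body if the server sent it gzip-encoded."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10.0):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
            return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, etag, last_modified
        raise
```

Storing the returned ETag and Last-Modified values and passing them to the next call keeps repeated polling cheap for both the crawler and the origin server.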