Indexing the full blogosphere in real time is hard enough without dealing with spammers.
Ideally, you could focus on building your application without constantly fighting a war of attrition to keep spam from corrupting your work.
Spinn3r has integrated complex spam-blocking technology that lets you develop more naive algorithms without having to worry about the complexity of fighting spammers.
How does it work?
We combine link analysis and text analysis to prevent spam from being indexed. If your content seems too much like spam and you have insufficient link connectivity to the A-list portion of the blogosphere, your content is simply not indexed.
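The rule described above can be sketched roughly as follows. This is only an illustration, not Spinn3r's actual implementation: the names `Post`, `should_index`, `spam_score`, and `trusted_inlinks`, as well as both threshold values, are assumptions made up for this example.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real values and scoring models are not
# described in the text above.
SPAM_SCORE_THRESHOLD = 0.8   # text-analysis score at or above this "seems like spam"
MIN_TRUSTED_INLINKS = 1      # fewer inbound A-list links than this is "insufficient"


@dataclass
class Post:
    url: str
    spam_score: float     # assumed output of a text classifier, 0.0 (clean) to 1.0 (spam)
    trusted_inlinks: int  # assumed count of inbound links from A-list blogs


def should_index(post: Post) -> bool:
    """Reject a post only when BOTH signals are bad, as the text describes."""
    looks_spammy = post.spam_score >= SPAM_SCORE_THRESHOLD
    poorly_connected = post.trusted_inlinks < MIN_TRUSTED_INLINKS
    return not (looks_spammy and poorly_connected)


if __name__ == "__main__":
    candidates = [
        Post("http://spam.example/buy-now", spam_score=0.95, trusted_inlinks=0),
        Post("http://known-blog.example/odd-post", spam_score=0.95, trusted_inlinks=3),
        Post("http://new-blog.example/hello", spam_score=0.10, trusted_inlinks=0),
    ]
    for post in candidates:
        print(post.url, "index" if should_index(post) else "skip")
```

Requiring both signals to fail means that a well-connected blog with an unusual post, or a brand-new blog with clean text, would still be indexed under this sketch.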
It's not for lack of trying. We receive more than 100k pings per second from spammers trying to be indexed within Spinn3r.
The level of complexity in spam is rising. In late 2007 we posted an extensive analysis of a spam campaign that was making the rounds of the blogosphere by taking advantage of WordPress vulnerabilities.
This attack was very advanced in both its scope and its technical nature. It consisted of a large number of doorway pages that redirected to another site, which tried to install malware on the victims' computers. ... We're still analyzing the source of inbound links to the doorway pages in this attack. The attacker used vulnerable blog sites to link to .edu domains which hosted the actual content. This analysis is made more difficult by the fact that the sources of the links often clean up the offending data before we can analyze it.
The sophistication of attackers and the complexity of their attacks will always increase. The trick is to build on a platform which has spam protection integrated from the beginning.