Content Sourcing
Ever since this post, the idea of content sourcing has been bugging me. The “silver bullet” for splogs is content theft. For example, in Doc Searls’ original post on that topic, he was reading quickly on some unknown blog, and failed to realize immediately that the words he read there were in fact Dave Winer’s. Now, there are two problems with this phenomenon – what Doc charitably calls blogs “of unclear provenance.”
First, it’s a simple violation of copyright. Now complaints about copyright tend to cause snickers in some quarters, but this is exactly the kind of abuse that authors need protection against. Dave Winer’s commentary isn’t just being copied without his permission – it’s being monetized, and without even acknowledging that Dave Winer was the author.
Second, and more important, I think, is the realization that content can be lifted arbitrarily from anywhere, and used to “cloak” a splog (“Spam Blog”). As any SEO black-hat worth their salt will tell you, doorway pages full of links are increasingly hard to get noticed and indexed by the search engines. Splogs need to look like real blogs, or at least like real content, in order to have any hope of attracting visitors through searching.
The problem is exacerbated by the casual conventions of the blogosphere. I know for several of my past posts, the entire post was quoted in someone else’s blog. That doesn’t bother me, and in fact it’s probably a good thing, as the posts in question were points I wanted distributed as broadly as possible. I’m not blogging here for the ad revenue though, so I’m different than others who do, and can see that this practice may be a problem for some. As it is, though, part of the culture of blogs is linking, and attribution, but also “fair use” republishing of important or salient parts of articles from elsewhere. That being the case, as long as content can be trivially copied from legitimate sites, it hardly matters how sophisticated the content analyzer crawling a splog is, as the content is a perfect copy of legitimate content from somewhere else. Keyword mapping, Bayesian filtering – all that is futile if you’re just scanning Dave Winer’s commentary taken from scriptingnews.com.
That’s a feature of the system that can be and will be exploited heavily. Say I’m a splogger. I’ve got a tool that takes keywords I enter and finds blog posts (via one or more search engines) with content related to those keywords. The tool picks several dozen recent posts from the thousands available, combines them into a blog, and adorns the aggregated content with my ads, and links to my offer pages. It costs me virtually nothing – the domain name is less than $10, the blog hosting is free – and it is so automated that it’s nearly as simple as entering the keywords to start with and pressing the “Submit” button. Sure, it’s wholesale copyright violation, but when’s the last time you heard of someone catching trouble for that on a blog? The original authors can complain, and I’ll drop their content from my splog – there are a million others to copy from just as easily.
As Doc Searls’ experience points out, it can be hard for a human to detect when this happens. For a crawler that’s analyzing the content to identify splogs, this makes the situation nearly hopeless, at least in terms of content analysis. The identifying characteristics of a splog in this case no longer can be found in the content – it’s full of rich, stolen text – but rather in the sites and URLs it points to. Content theft, then, is the Vulcan Nerve Pinch the black-hats can use whenever needed against the system. Just scrounge up some fresh content from somebody else, and the whole splog detection framework collapses, like the hapless victim in Mr. Spock’s grip.
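To make the "look at where it points, not what it says" idea concrete, here is a minimal sketch in Python. It ignores the page text entirely and only measures what share of a page's outbound links land on hosts that have already been flagged. The function names, the `flagged_hosts` set, and the example hosts are all hypothetical, and where such a flag list would come from is left open; this is an illustration of the idea, not a working splog detector.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outbound_hosts(html, own_host):
    """Return the hosts a page links out to, skipping relative and self links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    hosts = []
    for href in parser.hrefs:
        host = urlparse(href).hostname  # None for relative links
        if host and host != own_host:
            hosts.append(host)
    return hosts


def flagged_link_ratio(html, own_host, flagged_hosts):
    """Fraction of outbound links that point at hosts on a (hypothetical) flag list."""
    hosts = outbound_hosts(html, own_host)
    if not hosts:
        return 0.0
    flagged = sum(1 for host in hosts if host in flagged_hosts)
    return flagged / len(hosts)
```

A page stuffed with perfectly good stolen prose would still score high here if most of its links lead to the same handful of offer pages. The obvious weakness is that the score is only as good as the flag list behind it.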
Should we, then, throw our hands up, and declare defeat? No, but there are two things I think need to be pursued in addition to (or maybe instead of) enhanced content analysis for blogs. First, we should look at semantics that can be embedded in (X)HTML that identify (tag) parts of the content that are “original” – copyrighted by the author of the post – and also those parts that are quoted or cited from elsewhere. I’ve done a quick tour with the search engine about this kind of markup, and besides some interesting ideas that use META tags – which are inadequate as they apply to the whole page – I haven’t found a framework for doing this. It’s not a complex concept, however. (If anyone reading has experience with an existing framework that does this, please email me to let me know, thanks.)
The second idea that should be pursued in tandem is the enhancement of our publishing tools to make the “ownership/quotation” markup simple and quick for users to work into the process of creating content (writing a post). Something like a style tag that doesn’t just visually identify quoted material, but semantically identifies it as external content, along with proper source attribution. Properly marked up posts, then, can be quickly sorted through to determine what parts of the content are original and which are imported. That’s not a panacea, but it would be a good step forward. Clear assertions about what content is being produced vs. what is not will be a powerful asset in filtering splogs from the content stream.
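As a rough illustration of what sorting through marked-up posts could look like, the Python sketch below leans on one hook HTML already has: the `cite` attribute on `blockquote` and `q` elements, which holds the source URL of a quotation. It assumes, purely for the sake of the example, that publishing tools wrapped every imported passage in such an element; the class and function names are made up here, and this is not an existing framework.

```python
from html.parser import HTMLParser


class ProvenanceSplitter(HTMLParser):
    """Separates text inside <blockquote>/<q> from the rest of a post."""

    QUOTE_TAGS = {"blockquote", "q"}

    def __init__(self):
        super().__init__()
        self.original = []  # text outside any quote element
        self.quoted = []    # (cite URL or None, text) pairs
        self._stack = []    # cite values of the quote elements currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.QUOTE_TAGS:
            self._stack.append(dict(attrs).get("cite"))

    def handle_endtag(self, tag):
        if tag in self.QUOTE_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._stack:
            # Attribute the text to the innermost open quote element.
            self.quoted.append((self._stack[-1], text))
        else:
            self.original.append(text)


def split_provenance(html):
    """Return (original_texts, quoted_pairs) for a post's HTML."""
    parser = ProvenanceSplitter()
    parser.feed(html)
    parser.close()
    return parser.original, parser.quoted


def original_fraction(html):
    """Share of a post's characters that sit outside quote elements."""
    original, quoted = split_provenance(html)
    own = sum(len(text) for text in original)
    borrowed = sum(len(text) for _, text in quoted)
    total = own + borrowed
    return own / total if total else 0.0
```

Fed a post with two paragraphs of commentary around one `<blockquote cite="...">`, `split_provenance` returns the commentary in the first list and the quoted passage, paired with its source URL, in the second. Text copied without any markup is counted as original, so the measure only means something alongside tools and conventions that make honest markup the default.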
As an aside, those who monitor the tech side of the blogosphere will remember the minor uproar started by Mark Cuban’s protest against splogs a couple of weeks ago. There was a distinct surge in the splog traffic, but it was nothing that unusual, given what I can see looking back a ways at the logs for weblogs.com. What was different this time, though, was that the bad guys tried something beyond the usual keywords (“loans”, “hair loss”, etc.), and tried using some new terms, like “Jeff Jarvis” and “Scoble”. Some splogs simply used the names of bloggers as so much flypaper to catch users who were looking for the bloggers, in hopes that the odd one might click through an ad. Others appropriated whole posts, even large collections of posts by popular bloggers, creating a sort of counterfeit blog. These tactics have been used before, but it was interesting to note that the blogosphere really got upset this time because they started seeing splogs regularly in their normal searches.
I can’t think that this tactic was particularly effective from a commercial standpoint – it seems unlikely that people searching for “Scoble” are good clickthrough candidates. It seems more like the black-hats having some fun at the expense of popular bloggers. If so, it may have backfired, as discussions and efforts around the splog issue have definitely picked up momentum in the last couple of weeks.


