While some things may change, others always remain the same. The recurring debate of whether you should use partial or full RSS feeds has been revisited time and time again, because if you have a blog (and it’s successful, usually), you’re also probably a victim of scraping.
Because it’s so easy to grab data from an RSS feed, scraper blogs will keep cropping up, and people will always try to make a quick buck off your hard work and effort. This is why many bloggers provide partial feeds, though people who subscribe to hundreds of feeds (like me) find this very inconvenient.
A lot of bloggers don’t realize that there are ways to keep scrapers from stealing the content off their sites even with full feeds enabled.
Here are a few things you can do that are relatively easy:
Report it to the search engines. Stealing content without authorization is copyright infringement, and the Digital Millennium Copyright Act gives you a way to act on it. You can file DMCA complaints with any search engine as long as you provide sufficient information to support the claim of theft. Furthermore, anyone using AdSense on their sites can have their account terminated if Google finds them responsible for content theft. You can report such violations through Google’s DMCA AdSense page.
Prevent hotlinking of your images. There are plenty of sites that discuss the methods for doing this, but a simple .htaccess file in your images directory can be sufficient, especially if you serve the scrapers different image content.
Here’s some code to add to your .htaccess file (via Jackol’s htaccess cheatsheet):
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com/.*$ [NC]
RewriteRule \.(gif|jpg)$ - [F]
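Be sure to replace mydomain.com with your own domain. Hotlinked requests for the specified file types will then fail, so a broken image shows up on the scraper’s page. Alternatively, you can serve an image of your own (get as creative as you want) by replacing the last RewriteRule line with the two lines below. The extra condition guards against the replacement image being redirected to itself if it falls under the same rules:
RewriteCond %{REQUEST_URI} !dontsteal\.gif$
RewriteRule \.(gif|jpg)$ http://www.mydomain.com/dontsteal.gif [R,L]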
Another good in-depth tutorial on setting up this .htaccess file, or alternatively, a PHP file, can be found on A List Apart.
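If you want to sanity-check which requests the hotlinking rules would turn away before you upload the file, here is a rough Python sketch that mirrors the same logic. It only approximates what Apache does and is no substitute for testing on your own server. The function name is_blocked is made up for illustration, mydomain.com is the same placeholder as above, and the patterns assume the dots are escaped with backslashes.

```python
import re

# Assumption: these patterns mirror the .htaccess rules, with mydomain.com
# as a placeholder and the dots escaped.
ALLOWED_REFERER = re.compile(r"^http://(www\.)?mydomain\.com/.*$", re.IGNORECASE)  # [NC]
PROTECTED_FILE = re.compile(r"\.(gif|jpg)$")  # case-sensitive, like the RewriteRule


def is_blocked(path: str, referer: str) -> bool:
    """Hypothetical helper: True if the rules would refuse this request."""
    if not PROTECTED_FILE.search(path):
        return False  # RewriteRule pattern does not match, nothing happens
    if referer == "":
        return False  # first RewriteCond: requests with no referer pass
    if ALLOWED_REFERER.match(referer):
        return False  # second RewriteCond: your own pages may embed the image
    return True  # any other referer is treated as a hotlink


if __name__ == "__main__":
    for ref in ("", "http://www.mydomain.com/post/1", "http://scraper.example/post"):
        print(repr(ref), is_blocked("images/photo.jpg", ref))
```

Note that, as written, the pattern only lets through http:// referers. If your blog is served over https, its own pages would be treated as hotlinkers too, so you would likely want ^https?:// at the start of the second RewriteCond pattern.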
Contact the scraper directly. In some cases, they will tell you what you already know: they stole the content from your RSS feed. Some even have the audacity to ask you to link to them (in a mutual relationship that will benefit all parties). Your goal is to get them to stop using your content and remove it altogether. Nobody wants to be slammed with a duplicate content penalty. Surprisingly, people do comply with this request, so it doesn’t hurt to try.
Does anyone have any other suggestions for how to prevent your content from being plagiarized?
[Thanks, Steve!]

Hey Tamar, I forgot to mention a blog post I did on this a while back, “Content Thieves”.
This is a touchy issue.
I publish my blog entries under a Creative Commons license that is visible on every page of my blog. I use partial feeds, not because I am against full feeds, but because the way the templates work on my blog makes it easier to break a post into pieces to fit in the advertising blocks.
I was recently the victim of blog scraping and it took me a few days to find out and clean up the mess. In my case I would not have minded if the content had been used under the terms of my license, but it was not.
If you want to use someone’s content just ask! You would be amazed at how many folks will say, “yes, just give me proper credit and link backs.”
I wrote a short article on the release of the Wii News channel but I could not get screen caps at the time. I went to Flickr, found some good photos and contacted the owner of those photos. In exchange for proper credit he allowed me to use the images.
Just a heads up, you can also use DMCA notices against the hosts of the scraper provided that they are hosted within the U.S., which most are. I usually consider that or AdSense to be my first stop, whichever is more relevant and would do the most damage.
Great article though, I’m glad to see that you and others like you are spreading awareness of this issue!
Let me know if I can be of any help.
Copyrighted works (both All Rights Reserved and Creative Commons) can be registered with Numly. Numly Numbers are used for verification purposes and can be embedded in your content.
Scrapers are not removing these numbers, which allows readers to track down the real owner of the content.
Hi Jonathan: I took the AdSense approach. The particular site in question that Steve brought to my attention is trying to offer a “free bundle” of tools that can scrape content off a site in 2 minutes. And my requests to remove the content were ignored; in fact, later that day, they published THIS article to their blog (ironic, isn’t it?)
Steve: We aren’t published under CC at all. This is an “All rights reserved” company blog. I certainly don’t mind my blog posts being quoted (with proper credit), but publishing content verbatim is definitely not something most bloggers will approve of. I certainly don’t.
Tamar,
This is why Technorati is your friend. If you embed an invisible link back to your blog inside your article, it is very likely that you will see your scraped content providing links back to you when doing a Technorati search.
I’m often reminded of a certain quote in American history, “John Marshall has made his decision; now let him enforce it!”
It is great to have a starting point to combat scraping.
Does any of this apply to websites (in addition to blogs)?
Include lots of links back to your other content, tag pages etc
Have a clear license and other legal information in the footer of each post
Use a GPL license and let them use the content. It is all links, and if you can’t beat them, you might as well benefit from it.
The problem is that most scrapers are grabbing content from Technorati or news feeds which is a pain, because those services strip out the links.
But only do this if every single bit of your content is 100% your own, and you have rights to distribute it in this fashion.
If you are using anyone else’s content, such as photos, video etc, possibly the biggest problem is Google Reader, which is effectively designed to create splogs, or aggregated shared content, depending on your point of view.
My favorite trick is to set up a Google Alert for the name of my blog; I’ve caught a lot of stuff that way!
I can’t believe how little shame people have. I always like the image changing trick; that way you can spoil their scraping page with relative ease.
Nice tips, thanks guys.
I did the image changing trick… all images on the splog were plastered with 10e20 copyright images. :)
The others are quite useful as well.
I actually like Google’s shared reader — I don’t know if I would categorize that as ‘scraping’ — I found a lot of great blogs that way. The links do point back to the original blog and I end up subscribing to the ones I like!
I guess I am scraping your titles in my personal aggregator thingie. Would you like me to remove your records?
infectious: The concern isn’t about aggregating content and linking to the site. Your aggregator does just that. What people are doing is taking images and text verbatim and putting it on their own blogs while claiming it as their own. This is the distinction I had hoped to make. There is nothing wrong with your content and you have built a rather cool application there. :)
Looks like that particular “auto-scraping” blog no longer scrapes your content!
Yup — I noticed that this morning. They didn’t answer the “please remove” email, so I suppose the hotlinking image threw them off. In any event, all’s well that ends well.
OK, good. Thanks, I enjoy reading here at 10e20.com. :)