Duplication of already-published content is rampant online. Automated scrapers take unique web content and republish it on another website, hoping search engines will index the copy and send traffic their way, with visitors eventually clicking on the ads and other links published throughout the site or blog.
When you discover content scraped from your blog on another website, it’s common to feel powerless and angry. However, there are many proactive and reactive steps you can take to discourage content scraping.
Pre-emptive Action – Pinging
It’s a good idea to let search engines find and index your content first before distributing it out on an RSS feed. This sends a signal to search engines that you are the original source before they find it on other sites.
To speed up this process you can use a tool like Ping-O-Matic to notify search engines and feed directories that your content has been updated. This can also be done automatically in WordPress with a ping list.
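Under the hood, Ping-O-Matic and the WordPress ping list use the XML-RPC weblogUpdates.ping method. As a rough sketch of what gets sent (the endpoint URL, blog name, and blog URL below are placeholders, and the live request is commented out):

```python
import xmlrpc.client

def build_ping_request(blog_name, blog_url):
    """Build the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")

# Sending the ping would look like this (left commented to avoid a live call;
# rpc.pingomatic.com is Ping-O-Matic's documented XML-RPC endpoint):
#   server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")
#   response = server.weblogUpdates.ping("My Blog", "https://myblog.example.com/")

body = build_ping_request("My Blog", "https://myblog.example.com/")
print("<methodName>weblogUpdates.ping</methodName>" in body)  # True
```

WordPress builds and sends the same kind of request for every URL in its Update Services list whenever you publish a post.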
Monitor Incoming Links
Monitoring incoming links is a good way to determine whether scrapers have begun to visit your site. Scrapers also consume a lot of bandwidth and may slow your site down for actual users, which is another telltale sign of constant scraping.
Visit blogs that have linked to your posts to determine whether they are built out of content scraped from other websites. There are several WordPress plugins that let you monitor links, and you can also view them in the WordPress Dashboard. If you don’t use the WordPress platform, you can use Yahoo Site Explorer instead. Simply go to siteexplorer.search.yahoo.com, type in your blog’s URL, and click ‘Explore URL’.
Next, select ‘Inlinks’ and then ‘Entire Site’ from the second drop-down menu. This will display all incoming links from other websites to all the pages on your site. Regularly checking these not only helps you monitor spammers, but is also a great way to keep track of inbound links and spot traffic increases.
Keep in mind that Yahoo only shows a sampling of the links pointing to your site. While it reveals more than Google, it’s still nowhere near complete.
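If you run your own server, another way to spot scraper links (and scraper traffic) is to scan your access logs for unfamiliar referrers. A minimal sketch, assuming the common Apache/Nginx “combined” log format — the sample log lines and domains are made up for illustration:

```python
import re
from collections import Counter

# Matches the referrer field of the "combined" log format:
# ... "GET /post HTTP/1.1" 200 1234 "http://referrer.example/" "UserAgent"
REFERRER_RE = re.compile(r'"[^"]*" \d{3} \S+ "([^"]*)"')

def count_referrers(log_lines, own_domain):
    """Count external referrer URLs, skipping empty entries and self-referrals."""
    counts = Counter()
    for line in log_lines:
        m = REFERRER_RE.search(line)
        if not m:
            continue
        ref = m.group(1)
        if ref and ref != "-" and own_domain not in ref:
            counts[ref] += 1
    return counts

sample = [
    '1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /post1 HTTP/1.1" 200 512 "http://scraperblog.example/post1" "Mozilla/5.0"',
    '1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /post2 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [01/Jan/2024:00:00:02 +0000] "GET /post1 HTTP/1.1" 200 512 "https://myblog.example.com/" "Mozilla/5.0"',
]
print(count_referrers(sample, "myblog.example.com"))
# Counter({'http://scraperblog.example/post1': 1})
```

A referring domain that shows up constantly but sends no real readers is a candidate for a closer look — and, if it turns out to be a scraper, for the .htaccess block described below.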
Use CopyScape or a Similar Plagiarism Engine
CopyScape searches the web for copies of your pages. You can use their free search tool (and then display a ‘Protected by CopyScape’ banner on your website) or their Premium service, which costs 5 cents per page and provides more detailed analytics and records of content plagiarized from your site.
Besides CopyScape, Plagiarism Checker is another free content-checking service (you can input blocks of content or a URL). Additionally, if you are fairly sure your content is getting scraped, simply typing the title of the blog post in question into a search engine can bring up scraped copies in the results.
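You can also do a crude version of this check yourself: take a distinctive sentence from one of your posts and test whether it appears on a suspect page. A minimal sketch — the HTML-stripping is deliberately rough, and fetching the suspect page is left commented out:

```python
import re

def contains_phrase(page_html, phrase):
    """Check whether a distinctive phrase from your post appears in a page,
    ignoring HTML tags and differences in whitespace and case."""
    text = re.sub(r"<[^>]+>", " ", page_html)      # strip tags (rough)
    text = re.sub(r"\s+", " ", text).lower()
    phrase = re.sub(r"\s+", " ", phrase).lower()
    return phrase in text

# Fetching a suspect page with the standard library would look like:
#   from urllib.request import urlopen
#   html = urlopen("http://suspect.example/post").read().decode("utf-8", "replace")

html = "<p>My   unique opening  sentence, scraped verbatim.</p>"
print(contains_phrase(html, "my unique opening sentence"))  # True
```

Services like CopyScape do essentially this at scale, with far better text extraction and fuzzy matching for partially rewritten copies.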
Modify the .htaccess file
Besides blocking entire domains, you can also block individual IP addresses in the .htaccess file (on Apache servers). Note that with Apache’s 2.2-style access control you need an ‘order’ directive first; without it, the final ‘allow from all’ line overrides the deny rules. The block will look something like this:
order allow,deny
deny from 192.168.44.201
deny from 184.108.40.206
deny from 172.16.7.92
allow from all
Use the rel=”canonical” Attribute
If content scrapers are grabbing your entire source code when they scrape your content, they may scrape your meta data too. Using the rel=”canonical” attribute sends a proper signal of attribution back to the originating source in the eyes of Google. If you’re using WordPress, you can have this meta data automatically inserted into each post on your blog.
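For reference, the canonical tag sits in each page’s head section and points at the original URL, so a scraper that copies your markup wholesale copies the attribution along with it (the domain and path here are placeholders):

```html
<link rel="canonical" href="https://www.mydomain.com/blog/original-post/" />
```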
Use a CAPTCHA
CAPTCHAs can help prevent scraping and spam comments on your blog posts. According to the Sitescraper.net blog, “A typical way to prevent automated scrapers is by forcing users to pass a CAPTCHA. For example Google does this when it gets too many search requests from the same IP within a timeframe. To avoid this, the scraper could proceed slowly, but they probably can’t afford to wait. To speed up this rate they may purchase multiple anonymous proxies to mask their IP, but that is expensive – 10 anonymous proxies will sell for ~$30 / month.”
Richard also goes on to mention that some scrapers may outsource CAPTCHA solving to workers on services like Amazon Mechanical Turk. However, this is more costly, and simply using an anti-spam tool like the Akismet plugin or a CAPTCHA like reCAPTCHA can help cut down on the majority of spam and content scraping.
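The rate-limiting idea described in the quote — challenge an IP with a CAPTCHA once it makes too many requests in a short window — can be sketched as follows. This is an illustrative outline only; the threshold, window, and class name are made up, and a real deployment would wire this into your web framework:

```python
import time
from collections import defaultdict, deque

class CaptchaGate:
    """Flag IPs that exceed `limit` requests per `window` seconds,
    so the application can serve them a CAPTCHA challenge."""

    def __init__(self, limit=30, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def needs_captcha(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        q.append(now)
        while q and now - q[0] > self.window:  # drop requests outside the window
            q.popleft()
        return len(q) > self.limit

gate = CaptchaGate(limit=5, window=10.0)
# A slow human stays under the limit; a scraper hammering the site trips it.
print([gate.needs_captcha("1.2.3.4", now=float(t)) for t in range(6)])
# [False, False, False, False, False, True]
```

This is exactly the trade-off the quote describes: a scraper can evade the gate by slowing down or spreading requests across many proxy IPs, but both options cost them time or money.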
Contact the Scraper to Take Down Stolen Content
If these proactive measures fail to stop spammers from scraping your content, you will have to contact the owner of the offending website directly and ask them to remove it. Find the owner of the website using the WhoIs.com database. Then contact them by email or postal mail, referencing the exact content being scraped and requesting its removal:
“Dear Sir or Madam,
I have come upon your website/blog, www.domain.com, and have noticed that you have duplicated original content from my website, www.mydomain.com. This duplicated content is located at: www.domain.com/blog/post1. Please remove this content immediately, as I am its original author.”
Going into more detail may be necessary, depending on the extent of the scraping. Stay professional and do not resort to threats or name-calling. If no response is received, there are several actions you can take: pursuing legal action, commenting on the offending blog to tell its readers the content is stolen, publishing a post on your own blog to announce the scraping and confront it head on, or filing a Digital Millennium Copyright Act (DMCA) complaint with the scraper’s web host or with the search engines.
Are There Benefits to Content Scraping?
Yes, scraping can come with benefits. Some scrapers link back to the original content (potentially increasing the inbound-link value of your website), and if the scraper site gets steady traffic, those links may also drive page views and new visitors to your site. Beyond the SEO value, a link back to your site is a strong indicator to a search engine that you were the original content creator.
Overall, content scraping may not be that big of an issue in the grand scheme of things. According to Matt Cutts, head of Google’s webspam team: “There are some people who really hate scrapers and try to crack down on them and try to get every single one deleted or kicked off their web host. I tend to be the sort of person who doesn’t really worry about it, because the vast, vast, vast majority of the time, it’s going to be you that come up tops, not the scraper. If the guy is scraping and scrapes the content that has a link to you, he’s linking to you, so worst case, it won’t hurt, but in some weird cases, it might actually help a little bit.”
The debate over whether allowing or blocking the scraping of your content is more beneficial will never be fully settled, but you now have the information to decide for yourself.