
Content Scraping: Prevention, Repercussions, and…Benefits?

Duplication of already-published content is rampant online. Automatic scrapers take unique published web content and republish it on another website, hoping the copy gets indexed by search engines so that traffic flows to their site and visitors eventually click on the ads and other links published throughout the site or blog.

When you discover scraped content from your blog on another website, it’s common to feel powerless and angry. However, there are many steps you can take, both proactively and retroactively, to reduce content scraping.

Pre-emptive Action – Pinging

It’s a good idea to let search engines find and index your content before you distribute it through an RSS feed. This signals to search engines that you are the original source before they find the content on other sites.

To speed up this process, you can use a tool like Ping-O-Matic to notify search engines and feed sites that your content has been updated and is ready. This can also be done automatically in WordPress with a ping list.
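For readers who want to script the ping themselves, here is a minimal sketch in Python. It assumes the service accepts the widely used weblogUpdates.ping XML-RPC call and that the endpoint is http://rpc.pingomatic.com/; both are assumptions to verify against the service's own documentation. The function names and the blog name and URL are placeholders.

```python
import xmlrpc.client

# Assumed endpoint for Ping-O-Matic's XML-RPC interface; verify before use.
PING_ENDPOINT = "http://rpc.pingomatic.com/"


def build_ping_payload(blog_name, blog_url):
    """Return the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps((blog_name, blog_url), methodname="weblogUpdates.ping")


def send_ping(blog_name, blog_url, endpoint=PING_ENDPOINT):
    """Send the ping over the network and return the service's response."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(blog_name, blog_url)


if __name__ == "__main__":
    # Print the request body only; call send_ping() to actually notify the service.
    print(build_ping_payload("My Blog", "http://www.mydomain.com/"))
```

Building the payload separately from sending it makes it easy to inspect what would be sent before any network traffic happens.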

Monitor Incoming Links

Monitoring incoming links is a good way to determine whether scrapers have begun to visit your site. Scrapers can also consume a lot of bandwidth and may slow your site down for actual users, which is another noticeable sign of constant scraping.

Visit blogs that have linked to your posts to determine if they are blogs made out of content scraped from other websites. There are several WordPress plugins that let you monitor links and you can also view them in the WordPress Dashboard. If you don’t use the WordPress platform, you can also use Yahoo Site Explorer. Simply go to siteexplorer.search.yahoo.com and type your blog’s URL. Click on ‘Explore URL’.

Next, select ‘Inlinks’ and then ‘Entire Site’ from the second drop down menu. This will display all incoming links from other websites to all the pages on your site. Regularly checking these not only helps you monitor spammers, but is also a great way to keep track of inbound links to determine traffic increases.

Keep in mind that Yahoo only shows a sampling of the links that point to your site. While it reveals more than Google, the list is still nowhere near complete.
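Heavy bandwidth use from a single visitor can also be spotted in your server's access log. The sketch below is an illustration only: it assumes each log line starts with the client IP address (as in the common Apache log layout), and the function names and the threshold are made up for the example.

```python
from collections import Counter


def requests_per_ip(log_lines):
    """Count requests per client IP, assuming the IP is the first field of each line."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def heavy_hitters(log_lines, threshold):
    """Return (ip, count) pairs with at least `threshold` requests, busiest first."""
    return [(ip, n) for ip, n in requests_per_ip(log_lines).most_common() if n >= threshold]
```

An address that requests far more pages than any human reader would is a candidate for a closer look, not proof of scraping; search engine crawlers will show up in the same list.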

Use CopyScape or a Similar Plagiarism Engine

CopyScape will search for copies of your web pages online. You can use their free search tool (and then display a ‘Protected by CopyScape’ banner on your website) or their Premium Service, which is 5 cents per page and has more detailed analytics and records of plagiarized content from your site.

Besides CopyScape, Plagiarism Checker is another free content checking service (you can input blocks of content or a URL). Additionally, if you are fairly sure your content is getting scraped, simply typing in the title of the blog post in question can bring up scraped copies in the search engine results.
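Once you have a suspect page in hand, a rough overlap score can help confirm that it is a copy. The sketch below compares runs of consecutive words ("shingles") between two texts. It is a simple illustration, not how CopyScape or Plagiarism Checker work internally, and the function names and the default shingle size of five words are assumptions chosen for the example.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences of length `size` in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap_ratio(original, suspect, size=5):
    """Fraction of the original's shingles that also appear in the suspect text."""
    original_shingles = shingles(original, size)
    if not original_shingles:
        return 0.0
    return len(original_shingles & shingles(suspect, size)) / len(original_shingles)
```

A ratio close to 1.0 means nearly every five-word run in your post appears in the other page; unrelated pages on the same topic usually score close to zero.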

Modify the .htaccess file

Adding rules to your .htaccess file to block well-known scraping and spam bots is a fairly easy, proactive way to keep scrapers away from your site altogether. Common scraping bots include MailWolf and GetWeb, which can be blocked along with several others using this code from JavaScript Kit. Turning on rewriting with the RewriteEngine On directive and adding these rules can cut off the majority of them. The list continues to grow and more are added to JavaScript Kit, so bookmark it and check back regularly.

Besides blocking websites, you can also block IP addresses in the .htaccess file. The command will look something like this:

# Allow everyone except the addresses listed below
order allow,deny
deny from 192.168.44.201
deny from 224.39.163.12
deny from 172.16.7.92
allow from all
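If the list of blocked addresses grows, the block above can be generated rather than typed by hand. This is a small sketch under that assumption; deny_block is a hypothetical helper name, and the output mirrors the directive style used in this post.

```python
import ipaddress


def deny_block(addresses):
    """Build an 'order allow,deny' block with one 'deny from' line per valid address."""
    lines = ["order allow,deny"]
    for address in addresses:
        # ip_address raises ValueError on malformed input, so typos fail early.
        lines.append(f"deny from {ipaddress.ip_address(address)}")
    lines.append("allow from all")
    return "\n".join(lines)


if __name__ == "__main__":
    print(deny_block(["192.168.44.201", "172.16.7.92"]))
```

Validating each address before writing it out avoids putting a mistyped line into a file that the web server reads on every request.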

Use rel=”canonical”

If content scrapers grab your entire source code when they scrape your content, they may copy the tags in your page head too. A rel=”canonical” link element that points to your own URL then tells Google which page is the original source, even when it appears on the scraped copy. If you’re using WordPress, you can have this tag automatically inserted into each post on your blog.
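To check whether a page (your own, or a scraped copy) carries a canonical link, the HTML can be inspected with Python's standard library. This is a minimal sketch: CanonicalFinder and find_canonical are names invented for the example, and fetching the page is left out so the code stays self-contained.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag encountered."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

If a scraped copy returns your own URL here, the scraper has carried your attribution along with the content.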

Use a CAPTCHA

CAPTCHAs can help prevent scraping and spam comments on your blog posts. According to the Sitescraper.net blog, “A typical way to prevent automated scrapers is by forcing users to pass a CAPTCHA. For example Google does this when it gets too many search requests from the same IP within a timeframe. To avoid this, the scraper could proceed slowly, but they probably can’t afford to wait. To speed up this rate they may purchase multiple anonymous proxies to mask their IP, but that is expensive – 10 anonymous proxies will sell for ~$30 / month.”

Richard, the post’s author, also goes on to mention that some scrapers may outsource CAPTCHA solving to workers such as those on Amazon Mechanical Turk. However, this is more costly, and simply using anti-spam tools like the Akismet plugin or reCAPTCHA can help cut down on the majority of spam and content scraping.
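The idea in the quote, challenging an IP address once it sends too many requests within a timeframe, can be sketched as a sliding-window counter. This is an illustration only, not how Google or any plugin implements it; the class name, the limits, and passing the current time in as a parameter are choices made for the example.

```python
from collections import defaultdict, deque


class RequestThrottle:
    """Track recent request times per IP and flag IPs that exceed a limit."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)

    def needs_captcha(self, ip, now):
        """Record a request at time `now` (seconds) and report whether the IP is over the limit."""
        timestamps = self._history[ip]
        # Drop requests that have fallen out of the time window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        timestamps.append(now)
        return len(timestamps) > self.max_requests
```

For example, RequestThrottle(3, 60) lets an address make three requests in any 60-second window and asks for a CAPTCHA on the fourth; a scraper that slows down enough stays under the limit, which is exactly the trade-off the quote describes.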

Contact Scraper to Take Down Stolen Content

If these proactive actions fail to prevent spammers from scraping your content, you will have to contact the owner of the website posting your content directly and ask them to remove it. Find the owner of the website using the WhoIs.com database. Contact them by email or postal mail, referencing the exact content being scraped and requesting its removal:

“Dear Sir or Madam,

I have come upon your website/blog, www.domain.com, and have noticed that you have duplicated original content that was located on my website, www.mydomain.com. This duplicated content is located at: www.domain.com/blog/post1. Please remove this content immediately, as it was created by me.

Sincerely,

Name”

Going into more detail may be necessary, depending on the level of scraping. Stay professional and do not resort to threats or name-calling. If no response is received, there are several actions that can be taken: pursuing legal action, such as filing a Digital Millennium Copyright Act (DMCA) complaint; commenting on the stolen post to tell other readers it is stolen; and publishing a post on your own blog to announce the scraped content and confront it head on.

Are There Benefits to Content Scraping?

Yes, scraping can come with benefits—some scrapers give links back to the original content (thus possibly increasing inbound link value to your website) and if the site gets steady traffic, this may also drive page views and return users to your site. In addition to SEO value, a link back to your site is a great indicator to a search engine that you were the original content creator.

Overall, content scraping may not be that big of an issue in the grand scheme of things. According to Matt Cutts, Head of Search Quality at Google, “There are some people who really hate scrapers and try to crack down on them and try to get every single one deleted or kicked off their web host. I tend to be the sort of person who doesn’t really worry about it, because the vast, vast, vast majority of the time, it’s going to be you that come up tops, not the scraper. If the guy is scraping and scrapes the content that has a link to you, he’s linking to you, so worst case, it won’t hurt, but in some weird cases, it might actually help a little bit.”

Conclusion:

The debate over whether allowing or blocking content scraping on your site is more beneficial will always be out, but now you have the information to decide for yourself.

Comments

  1. Andrei B says:

    “commenting on the stolen blog to tell other readers it is stolen” – do You really think your comment would be approved ???

    • Jordan Kasteler says:

      If it’s on auto-approval, yes. People who automatically scrape sites probably automate most other things.

  2. IncrediBILL says:

    That quote from Matt Cutts is idiotic. Of course Matt doesn’t worry about scrapers, he doesn’t earn a living from his online content like webmasters do, he earns his living working for Google.

    • Jordan Kasteler says:

      I’m convinced Matt “doesn’t worry” about anything. Seems to be his templated answer to all SEO questions.

  3. Avinash Kaushik had a great post about tracking content scraped from his website. I’ll try and find it..I can not find it for the life of me.. but it is a tool which embeds a piece of tracking code into the content someone copies from your website and you can tell how many times this content was viewed, and the resulting actions.. clicks on links etc.

    Its a pretty cool tool to measure virality of content. I’ll keep looking for the link!

    • Jordan Kasteler says:

      Awesome! Let me know if you find it.

  4. Robert says:
  5. Gareth says:

    @Robert – thanks for that, interesting

    You could also just add a long number to the bottom of the post like “j213jjsdjbc” – this would show all the scraped pages when you searched on it.

    Also be sure to use Yoast’s plugin (if you’re using WP) to add your links to the bottom of each footer in your RSS feed. Let those bots build links for you!

  6. Check out these guys http://www.scrapestopper.com They are able to stop any type of scrape attack I do not know how they do it but they stop all forms of scraping…They have a trial period Awesome and there system is so easy worth a look at..

  7. Michael says:

    Hello! Thanks for the great posting on scraping. I hadn’t thought about the positive ramifications of those scraping my articles, and though I have never taken action, I guess I should see it as more of a positive than a negative.

    Thanks for sharing your guidance!

    Michael
    CCO OutMaturity

  8. I see more negative than positive

    Lets see what scraping really is….
    Scraping is a process where a computer programmed by a human mindlessly extracts data for personal gain.

    The personal gain is to profit from your content…

    I can easily take any ecommerce website I can create a website in real time take data from that ecommerce site and put this up on my site and take 5 cents of each product the other sites sells..

    So a user does a search and really in the back end I am searching a competitors site taking the products and reducing the price and then putting the content up on my site, the images magically also are saved locally to my site.

    I do not need a database or even a warehouse as I can get the manufacturer to ship directly to the customer and with PayPal’s multi-vendor transaction system I can pay the manufacturers/vendors in real time so I do not need any accounting system, I just get the commission..

    My poor competitor has done all the work for me and they lose as I am always cheaper no matter what they do…

    This is a basic example of scraping for personal gain.

    Check out these guys http://www.scrapestopper.com they seem to be able to stop all types of scraping.

    Brett Wraight
    Senior Software Engineer
    Golabs.com

  9. Kenny says:

    What about people who copy your entire article word for word and create a page / group whatever on Facebook – no link back to your site. Who does that benefit? Certainly not me, so I don’t see why I should “not worry” about it. As you can imagine Facebook ranks rather well and in some cases the page can outperform your own. Another problem I’ve found is that Facebook are very slow to move in taking down stolen content.

  10. Yes I agree Kenny Scraping is a real issue and it is hard to stop however scrapestopper.com seem be able to stop a scrape attack on a site….Check them out Brett W

  11. Kenny says:

    Looking at that, I don’t think it would prevent human “scrapers” from using copy and paste. I’m a tight Scotsman and therefore unwilling to part with cash, so I will be coding my own solution to keep on top of my copyrighted material :-)

  12. Hi Kenny…Human scrapers can be restricted by limiting page views, I think that large sites with tens of thousands of pages would be challenging for a human to copy and paste. It really is the automated scrapes that are the issue….Also formatted data with pics can also be a challenge to get it in a format that can be formatted to database fields…Contact Scrapestopper and let “me” know if you would like a free version for a extended period since you are a tight Scotsman and I am a Aussie with a Scottish heritage.