Prevent scraping to protect your data
Blocking scraping: Different takes on protecting your data
By: Martin Zetterlund, Partner at Sentor Managed Security Services
Web scraping (also called screen scraping, data scraping or just scraping) is a constantly growing phenomenon on the Internet. A majority of people regard data published on the net as free to use under any circumstances. This becomes a problem if you regard your data as property yet still want to publish it for the benefit of your users. In this article you will learn about the different kinds of data scrapers and different approaches to protecting your business from malicious data scraping.
Web scrapers can be divided into three categories:
- Manual scrapers
- Scripted scrapers
- Bots
Manual scrapers are people who download data by hand and/or use it in direct breach of your site's terms and conditions. They can be single individuals or groups of people, such as a call centre using the site commercially.
To get large amounts of data quickly, or to perform transactions automatically, it is far more convenient for scrapers to use a script or a program than to work manually. Scripted web scrapers can use a single IP or multiple IPs, making it seem that they are in fact a group of legitimate users.
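To illustrate how low the barrier is, a scripted scraper can be only a few lines of code. The sketch below is a hypothetical Python example (the URL, CSS selector and third-party libraries requests and BeautifulSoup are illustrative assumptions, not a reference to any real site) showing how easily a script can pose as an ordinary browser:

```python
# Hypothetical sketch of a scripted scraper (illustrative only; the URL
# and the ".listing" selector are made up for this example).
import time
import requests
from bs4 import BeautifulSoup

for page in range(1, 100):
    resp = requests.get(
        f"https://example.com/listings?page={page}",
        headers={"User-Agent": "Mozilla/5.0"},  # masquerade as a normal browser
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    for row in soup.select(".listing"):
        print(row.get_text(strip=True))
    time.sleep(1)  # throttle requests to stay under naive rate limits
```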
Bots are divided into benevolent bots and malign bots. Benevolent bots are the ones you want on your site - like search engine bots - which index your site and give you search engine ranking. They adhere to robots.txt, giving you the opportunity to decide whether you want their attention or not.
Malign bots do not adhere to robots.txt. They decrease your site's performance and cause you to lose control over your data, since you have no way of knowing how the scraper behind the bot will use it.
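The robots.txt protocol is entirely voluntary: a benevolent bot checks it before fetching a page, while a malign bot simply ignores it. As a minimal sketch of what that check looks like, Python's standard library can evaluate the same rules a well-behaved crawler would (example.com is a placeholder):

```python
# Minimal sketch: how a well-behaved bot consults robots.txt before crawling.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# Returns False if robots.txt disallows this path for this user agent.
print(rp.can_fetch("Googlebot", "https://example.com/private/report"))
```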
Blocking scraping - traditional methods
Blocking scraping is by no means a trivial task, even though it may seem so at a glance. Here are four traditional anti-scraping methods, all of which have drawbacks when used to prevent scraping.
- Rate limiting
- Captcha tests
- Obfuscating source code
- Blacklists

Blacklists consisting of IPs known to scrape the site are not really a method in themselves, since you still need to detect a scraper before you can blacklist it. Even so, blacklisting is a blunt weapon, because IPs tend to change over time, and in the end you will block legitimate users. If you still decide to implement blacklists, you should have a procedure to review them on at least a monthly basis.
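A minimal sketch of such a review procedure, assuming an in-memory list and a 30-day expiry (both of which are illustrative choices, not recommendations), might look like this:

```python
# Sketch of a blacklist with built-in review: entries expire after 30 days,
# so stale IPs do not block legitimate users forever.
import time

BLOCK_TTL = 30 * 24 * 3600   # seconds; expire/review entries monthly
blacklist = {}               # ip -> timestamp when the block was applied

def block(ip):
    blacklist[ip] = time.time()

def is_blocked(ip):
    blocked_at = blacklist.get(ip)
    if blocked_at is None:
        return False
    if time.time() - blocked_at > BLOCK_TTL:
        del blacklist[ip]    # entry expired; the IP may have changed hands
        return False
    return True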
To rate limit an IP means that you only allow it a certain number of searches in a fixed timeframe before blocking it. This may seem like a sure way to stop the worst offenders, but in reality it is not. The problem is that a large proportion of your users are likely to come through proxy servers or large corporate gateways shared with thousands of other users. If you rate limit a proxy's IP, that limit will trigger easily when different users behind the proxy use your site. Benevolent bots may also run at higher rates than normal users, triggering your limits.
One solution is of course to use whitelists, but the problem is that you continually need to compile and maintain them manually, since IP addresses change over time. Needless to say, data scrapers will simply lower their rates or distribute their searches over more IPs once they realise you are rate limiting certain addresses.
For rate limiting to be effective without penalising heavy legitimate users, we usually recommend investigating everyone who exceeds the rate limit before blocking them.
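For illustration, a basic sliding-window rate limiter per IP might look like the sketch below (the window and threshold values are assumptions for the example). Note that it inherits the shared-proxy problem described above, which is why exceeding the limit should trigger investigation rather than an automatic block:

```python
# Sketch of a sliding-window rate limiter per IP. WINDOW and MAX_SEARCHES
# are illustrative values, not tuning advice.
import time
from collections import defaultdict, deque

WINDOW = 60        # seconds
MAX_SEARCHES = 30  # searches allowed per IP within the window

requests_by_ip = defaultdict(deque)

def allow(ip):
    now = time.time()
    q = requests_by_ip[ip]
    while q and now - q[0] > WINDOW:
        q.popleft()          # drop requests that fell out of the window
    if len(q) >= MAX_SEARCHES:
        return False         # candidate for investigation, not an instant block
    q.append(now)
    return True
```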
Captcha tests are a common way of trying to block scraping. The idea is to display an image containing text and numbers that a human can read but a machine cannot. This method has two obvious drawbacks. Firstly, captcha tests may be annoying to users if they have to solve more than one. Secondly, a web scraper can simply solve the test manually and then let the script run. On top of this, several large users of captcha tests have had their implementations compromised.
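One common mitigation for the annoyance problem is to challenge a session only after it exceeds a request threshold. Below is a minimal sketch of that gating logic; the threshold, the in-memory session store and the helper names are assumptions, and actual captcha verification is left to whatever provider is used. As noted above, the scheme still falls to a scraper who solves the test once by hand and then lets the script run on the verified session:

```python
# Sketch: only show a captcha once a session exceeds a request threshold.
THRESHOLD = 20
sessions = {}  # session_id -> {"count": int, "verified": bool}

def needs_captcha(session_id):
    s = sessions.setdefault(session_id, {"count": 0, "verified": False})
    s["count"] += 1
    return s["count"] > THRESHOLD and not s["verified"]

def mark_verified(session_id):
    # Called after the captcha provider confirms the user's answer.
    sessions[session_id]["verified"] = True
```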
Some solutions try to obfuscate the HTML source code to make it harder for machines to read. The problem with this method is that if a web browser can understand the obfuscated code, so can any other program. Obfuscating source code may also interfere with how search engines see and treat your website, so if you decide to implement it, do so with great care.
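As a simple example of the weakness, consider one common obfuscation: emitting text as HTML character references so that naive scrapers grepping the raw source miss it. A browser renders the page normally, but a single standard-library call reverses the trick:

```python
# Obfuscating text as HTML character references - and why it fails.
import html

def entity_encode(text):
    return "".join(f"&#{ord(c)};" for c in text)

price = "$199"
print(entity_encode(price))                   # &#36;&#49;&#57;&#57;
print(html.unescape(entity_encode(price)))    # "$199" - trivially reversed
```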
Blocking scraping - an efficient method
As you can see, the methods above all prove unsatisfactory in some way. The truth is that the only way to block scraping efficiently is to have a managed security system watching over your web site 24/7. To handle screen scraping properly at a larger site, you need a sophisticated system that performs a number of tests on every request and analyses usage patterns in real time. In addition, you need a 24/7 team to monitor the alerts generated by the system and apply blocks on scrapers.
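The core of such a system is scoring each request against several weak signals at once and escalating suspicious clients to human analysts rather than blocking blindly. The sketch below is an illustrative toy, not Sentor's actual system; every signal, weight and helper name in it is an assumption made for the example:

```python
# Toy sketch of real-time request scoring: combine weak signals into a
# score and raise an alert for a human analyst instead of auto-blocking.
SUSPICION_THRESHOLD = 5

def score_request(req):
    score = 0
    if not req.get("accepts_cookies"):           score += 2  # browsers keep cookies
    if not req.get("loads_static_assets"):       score += 2  # bots often skip images/CSS
    if req.get("requests_per_minute", 0) > 30:   score += 2  # unusually high rate
    if req.get("user_agent", "") == "":          score += 3  # missing UA string
    return score

def alert_soc(req):
    # Hypothetical hook: notify the 24/7 monitoring team for investigation.
    print("ALERT: suspicious client", req.get("ip"))

def handle(req):
    if score_request(req) >= SUSPICION_THRESHOLD:
        alert_soc(req)
```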
Learn more about Sentor's automated anti-scraping surveillance network, ASSASSIN, which prevents scraping.