Detect scraping with hidden content
Data seeding - detecting copyright infringement or plagiarism
An old and, to some extent, reliable way of detecting and tracking scraping is to seed the data, either with easily identifiable items that let you trace where the data ends up or with bogus records that destroy the value of the data.
Map makers can tell their maps from others
This technique has a long history; map makers, for example, have used it since very early days. Their problem was essentially the same as that of today's data owners: they spent large amounts of time and money gathering the data needed to compile an accurate map, yet a competitor could simply buy one copy, make a few alterations and claim the result as their own.
To at least find out who was stealing from them, the map makers started adding bogus information to their maps, such as villages or lakes that did not exist. When a new map came out, they could check it for these specific places and determine whether it was built on original data or was simply a copy of their work.
Seeding online data
Today, when a large proportion of valuable data is available online for anyone to download, this is still a viable concept. Enforcing the terms and conditions of use for a public database is hard, but seeding may be a way of finding at least some of the people stealing the data.
Tagging data with honeytokens
There are two main strategies for seeding data. The first is to insert easily identifiable records into the dataset: entries that are unique, such as a very odd name in a database of phone numbers. The same odd name can later be used to identify your data in a suspected dataset. Even more effective is to add information that refers back to you, for example by combining the odd name with a phone number you control. That way you get a receipt whenever someone uses the data.
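The tagging strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`name`, `phone`), the function names `make_honeytoken`, `seed_dataset` and `find_honeytokens`, and the idea of deriving the odd name from a private secret are all hypothetical choices, not part of any particular product.

```python
import hashlib


def make_honeytoken(secret: str, index: int, controlled_phone: str) -> dict:
    """Build a reproducible, unlikely-looking record (hypothetical layout).

    The name is derived from a private secret, so the owner can regenerate
    the same honeytokens later without storing them next to the real data.
    """
    digest = hashlib.sha256(f"{secret}:{index}".encode()).hexdigest()
    odd_name = f"Xq{digest[:6]} Vz{digest[6:12]}"
    # controlled_phone is assumed to be a number the data owner controls.
    return {"name": odd_name, "phone": controlled_phone}


def seed_dataset(records: list, tokens: list) -> list:
    """Return a new list with the honeytokens mixed in among the real records."""
    return sorted(records + tokens, key=lambda r: r["name"])


def find_honeytokens(suspect: list, tokens: list) -> list:
    """Return the honeytokens whose names show up in a suspected dataset."""
    suspect_names = {r["name"].strip().lower() for r in suspect}
    return [t for t in tokens if t["name"].lower() in suspect_names]


if __name__ == "__main__":
    real = [{"name": "Alice Smith", "phone": "+1-555-0111"}]
    tokens = [make_honeytoken("my-private-secret", i, "+1-555-0100") for i in range(2)]
    published = seed_dataset(real, tokens)
    # Any hit in a third-party dataset suggests it was copied from ours.
    print(len(find_honeytokens(published, tokens)), "honeytokens found")
```

Matching on the name alone keeps the check simple; a real deployment would probably also watch for calls to the controlled number, as described above.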
Providing fake data to scrapers
The other way of seeding is to insert bogus data in order to destroy the value of a dataset. In the phone book case this could mean, for example, swapping digits in the phone numbers or randomizing names and phone numbers. The problem is that you need sophisticated means of detecting that someone is scraping your site, since you would not want to give the bogus data to any of your real customers.
In conclusion, seeding may be effective in stopping certain types of data abuse, but it will not do so by itself. You will still need proper detection, and probably legal assistance, in order to catch the offenders.
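As a rough sketch of the digit-swapping idea, the Python below serves real records to ordinary visitors and altered phone numbers to a client that has already been flagged. The names `scramble_phone` and `serve_records` and the `is_suspected_scraper` flag are assumptions made for illustration; how that flag gets set is exactly the detection problem mentioned above and is not covered here.

```python
import random


def scramble_phone(phone: str, rng: random.Random) -> str:
    """Swap two digit positions in a phone number, leaving punctuation in place."""
    chars = list(phone)
    digit_positions = [i for i, c in enumerate(chars) if c.isdigit()]
    if len(digit_positions) >= 2:
        i, j = rng.sample(digit_positions, 2)
        # If both positions hold the same digit, the number is unchanged.
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def serve_records(records: list, is_suspected_scraper: bool, seed: int = 0) -> list:
    """Return real data to normal visitors and bogus phone numbers to scrapers.

    is_suspected_scraper is assumed to come from a separate detection system.
    """
    if not is_suspected_scraper:
        return records
    rng = random.Random(seed)
    return [{**r, "phone": scramble_phone(r["phone"], rng)} for r in records]


if __name__ == "__main__":
    book = [{"name": "Alice Smith", "phone": "+1-555-0123"}]
    print(serve_records(book, is_suspected_scraper=False))
    print(serve_records(book, is_suspected_scraper=True))
```

Swapping digits keeps the bogus numbers plausible at a glance, which is the point: the scraper should not notice that the data has lost its value.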