Sentor

Detect scraping with hidden content

Data seeding - detecting copyright infringement or plagiarism

An old and reasonably reliable way of detecting and tracking scraping is to seed the data, either with easily identifiable items that let you trace where the data ends up, or with bogus records that destroy the value of the data.

Map makers know their maps from others

This technique has a long history; map makers, for example, have used it since very early days. Their problem was essentially the same as that of today's data owners: they spent large amounts of time and money gathering the data to compile an accurate map, yet a competitor could easily buy one copy, make a few alterations and claim it as their own.

In order to at least find out who was stealing from them, the map makers started to add bogus information to their maps, for example villages or lakes that did not exist. When a new map came out, they could check for these specific places and determine whether it was built on original data or was simply a copy of their work.

Seeding online data

Today, when a large proportion of valuable data is available online for anyone to download, this is still a viable concept. Enforcing terms and conditions of use for a public database is hard, but seeding may be a way of at least finding some of the people stealing the data.

Tagging data with honeytokens

There are two main strategies for seeding data. The first is to insert easily identifiable, unique records into the dataset, such as a very odd name in a database of phone numbers. The same odd name can later be used to identify your data in a suspected dataset. Even more effective is to add information that refers back to you, for example by combining the odd name with a phone number that you control; that way you get a receipt whenever someone uses the data.
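As a rough illustration of the tagging idea, the sketch below derives an odd, reproducible name from a secret and pairs it with a phone number you control, then looks for that record in a suspected copy. The function names, the "Zyx" prefix, the record layout and the example values are all assumptions made for this sketch, not part of any particular product.

```python
import hashlib

def make_honeytoken(secret: str, label: str, controlled_phone: str) -> dict:
    """Build a fake phone book record with an odd, reproducible name.

    Hypothetical helper: the name is derived from a secret so that only the
    data owner can regenerate it, and the phone number is one the owner controls.
    """
    digest = hashlib.sha256(f"{secret}:{label}".encode("utf-8")).hexdigest()
    # An unusual prefix plus a few hex characters is unlikely to match a real name.
    name = "Zyx" + digest[:8]
    return {"name": name, "phone": controlled_phone}

def find_honeytokens(suspect_records, honeytokens):
    """Return the records in a suspected dataset whose names match a honeytoken."""
    seeded_names = {token["name"] for token in honeytokens}
    return [rec for rec in suspect_records if rec.get("name") in seeded_names]

# Example: seed one record, then check a suspected copy for it.
token = make_honeytoken("my-secret", "phonebook-2008", "555-0100")
suspect = [{"name": "Alice Smith", "phone": "555-0001"}, dict(token)]
matches = find_honeytokens(suspect, [token])
print(len(matches))
```

Matching on the name alone keeps the check simple; a real dataset would probably need several seeded records, since a single one is easy to remove or miss.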

Providing fake data to scrapers

The other way of seeding is to insert bogus data in order to destroy the value of a dataset. In the phone book case this could mean, for example, swapping digits in the phone numbers or randomizing names and phone numbers. The problem is that you need sophisticated means of detecting that someone is scraping your site, since you would not want to give bogus data to any of your real customers.
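The digit-swapping variant could look something like the sketch below. It assumes that scraper detection happens elsewhere and arrives as a simple boolean flag; the function names and record layout are hypothetical.

```python
import random

def scramble_phone(phone: str, rng: random.Random) -> str:
    """Swap two digit positions in a phone number, leaving separators in place."""
    positions = [i for i, ch in enumerate(phone) if ch.isdigit()]
    if len(positions) < 2:
        return phone  # nothing to swap
    i, j = rng.sample(positions, 2)
    chars = list(phone)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

def serve_records(records, is_suspected_scraper: bool, seed: int = 0):
    """Return real records to normal clients and scrambled copies to suspected scrapers.

    The is_suspected_scraper flag is an assumption: it stands in for whatever
    detection mechanism the site actually uses.
    """
    if not is_suspected_scraper:
        return records
    rng = random.Random(seed)
    # Build new dicts so the real data is never modified in place.
    return [{**rec, "phone": scramble_phone(rec["phone"], rng)} for rec in records]

# Example: a real customer and a flagged client request the same record.
book = [{"name": "Alice Smith", "phone": "555-0123"}]
print(serve_records(book, is_suspected_scraper=False))
print(serve_records(book, is_suspected_scraper=True))
```

Note that a swap of two identical digits leaves a number unchanged, so some records may come through intact. Everything here depends on the accuracy of the detection step, which is the hard part.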

More information

In conclusion, seeding may be effective at stopping certain types of data abuse, but it will not do so by itself. You will still need proper detection, and probably legal assistance, in order to catch the offenders.

Facts about web scraping

Like the evil one, data scraping has many names. Below is a list of expressions that all mean roughly the same as "data scraping".

  • Web scraping
  • Screen scraping
  • Page scraping
  • Html scraping
  • Scrapping


Wikipedia on fictitious entry

"Copyright traps are deliberately erroneous entries in a work inserted to facilitate detection of copyright infringement or plagiarism."

See also the following terms:

  • Canary trap
  • Honeytoken
  • Trap street


© Sentor 2008.