http://example.com/page?sessionid=3B95930229709341E9D8D7C24510E383
http://example.com/page?sessionid=D27E522CBFBFE72457F5479117E3B7D0
There exist countermeasures that allow bots to bypass such protection, ranging from registering hashes of already visited webpages to a smarter implementation of URL management.
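As an illustration, the first countermeasure can be sketched as follows; this is a minimal example of our own (the function and set names are not taken from any existing crawler), in which the bot stores a hash of every fetched page and skips any URL whose content it has already seen, regardless of the session identifier appended to the URL.

import hashlib

seen_hashes = set()

def is_new_page(content: bytes) -> bool:
    # Hash the page body: two URLs that differ only by their session id
    # yield the same digest, so the page is crawled only once.
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True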
Random Texts. A different solution, which is more effective than the others once deployed, lies in the random generation of text in a webpage. With a simple server-side script, a page can be generated with random text and random links that all reference the same page. Every time this page is hit, the bot treats what looks to it like a different webpage, over and over, since each visit comes through a different link with a different page name.
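To make the mechanism concrete, here is a minimal sketch of such a server-side script, written with Flask purely for illustration (the route and helper names are hypothetical): every generated page contains random words and random links, yet all of those links point back to the same handler.

import random
import string
from flask import Flask

app = Flask(__name__)

def random_word(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))

@app.route("/trap/<name>")
def trap(name):
    # Random paragraph followed by random links, all served by this handler.
    text = " ".join(random_word(random.randint(3, 10)) for _ in range(200))
    links = " ".join(
        '<a href="/trap/{0}">{1}</a>'.format(random_word(), random_word())
        for _ in range(10)
    )
    return "<html><body><p>{0}</p>{1}</body></html>".format(text, links)

From the bot's point of view, each link leads to a page with a new name and new content, so a naive crawler never runs out of pages to visit.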
But whatever the level of sophistication of the crawler trap, it can only generate pages that follow the strategies for which it has been designed, which means that each crawler trap has specific characteristics. Either the webpage is created automatically on the fly with fully random content, or it is randomly derived from an existing webpage. In other words, the crawler can end up working on totally fake platforms, where the pages are similar to the original site in terms of shape, HTML architecture, and even some keywords or parts of the content.
It is this last category of crawler trap that matters to us. Indeed, involuntary traps can be avoided by a bot that is aware of such objects, and intentional traps based on URL management can be bypassed with a specific management of already visited pages. But the last technique, based on random text generation, cannot be handled in the same way and requires a specific approach. This is the purpose of our paper.
From an operational point of view, we visit websites page after page. The main difficulty lies in the fact that the trap generates random pages once the bot has visited a certain number of pages (depending on the speed of the visit, the number of visits, the order in which links are visited, etc.). In a way, we can only base our decisions on the already visited pages and the ones we are going to visit next. This is the operational context in which we operate. Our whole approach is not to avoid being detected by the website (crawling performance would be degraded too much) but to detect whenever the website starts setting a trap. For this reason, we need to measure the distance between regular and irregular webpages.
3 USE OF DISTANCES
The objective is to measure the distance between several data sets in order to distinguish between several families. Section 3.1 defines precisely what we are really trying to measure and the approach that drives our forthcoming analyses. Subsequently, we present some of the existing distances in Section 3.2 and then discuss the contributions of our distance in answering our problem.
3.1 Approach to Resolve the Problem
From a generic point of view, we have two groups: pages extracted from regular websites and pages generated by crawler traps. Within each group, it is possible to find different families. In the case of regular websites, this means that all the pages of a family come from the same website. The same applies to pages generated by different crawler traps.
In our case, we are trying to detect whenever a webpage generated by a crawler trap appears in a set of regular webpages. According to our operational context, we aggregate webpages on the fly, which means we cannot know the full dataset in advance. We therefore have to check how different a new webpage is from the set of already visited webpages. This can be done by computing a distance D between the new webpage and the current set. In that sense, for a family F composed of n samples, ∀i, j ∈ N^{+*} such that 1 ≤ i ≤ j ≤ n, we model the distance between two samples s_i and s_j by the random variable X = D(s_i, s_j). Since all samples belong to the same family, there is no reason for the distance between two samples to differ from the distance to a third one, that is to say ∀i, j, k ∈ N^{+*} where i ≠ j ≠ k we have D(s_i, s_j) ≈ D(s_i, s_k).
For optimization purposes, we cannot compute a distance between the new webpage and every page of the current set. Since the distance between webpages of a single family is supposed to be the same, we can consider the mean (expected value) as a good estimator. This mean is computed on the fly, updated with each new value that belongs to the family. Our detector system is thus based on the fact that, for a new sample s and the mean m of a family, we have D(s, m) ≤ ε, where ε is distinctive for each family.
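A minimal sketch of this detector, assuming a generic distance function D between two webpages and an empirically chosen threshold ε (both placeholders, not specified at this point), could look as follows.

class FamilyDetector:
    # Flags a new page whose distance to the family deviates from the
    # running mean of previously observed distances by more than epsilon.

    def __init__(self, distance, epsilon):
        self.distance = distance   # callable D(page_a, page_b)
        self.epsilon = epsilon     # per-family tolerance
        self.reference = None      # first visited page, used as representative
        self.mean = 0.0            # running mean of observed distances
        self.count = 0

    def accept(self, page):
        # Returns True if `page` is considered part of the current family.
        if self.reference is None:
            self.reference = page
            return True
        d = self.distance(self.reference, page)
        if self.count > 0 and abs(d - self.mean) > self.epsilon:
            return False           # suspected crawler-trap page
        self.count += 1
        self.mean += (d - self.mean) / self.count  # on-the-fly mean update
        return True

During a crawl, accept would be called on every newly fetched page, and a rejected page would stop the exploration of the corresponding branch.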
The challenge is therefore to define a distance D that respects the property defined below. It means that our distance must be discriminating, in the sense that each family must have its own mean distance. It should also be accurate, which means that the stan-