follow a Zipf distribution. Figure 2 only shows the
distributions of external scripts and all the scripts,
since the distribution of embedded scripts is very
similar to the latter. Only a small number of sites use
more than 300 scripts, so a large portion of the tail
of the distribution can be omitted.
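One common way to check a Zipf-like shape is to fit a straight line to the rank/size plot on log-log axes. The following is a minimal sketch of that check, not the procedure used for Figure 2; the function name `zipf_exponent` and the sample data are hypothetical.

```python
import math

def zipf_exponent(counts):
    """Estimate s in count ~ C / rank**s by least squares on the log-log rank/size plot."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying distribution; s is its magnitude.
    return -sxy / sxx

# Hypothetical per-site script counts that follow count = 300 / rank exactly.
sample = [300 / rank for rank in range(1, 51)]
exponent = zipf_exponent(sample)
```

An exponent close to 1 on real data would be consistent with a classic Zipf distribution; a least-squares fit of this kind is only a rough diagnostic.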
We have also found both sites that invoke the
same scripts many times and sets of sites that invoke
the same scripts. In the latter case, the pages of such
a set of sites usually have the same content (that is,
they are replicated content).
Furthermore, we have studied which script files
are invoked most often. There is a group of 63
JavaScript files whose names appear more than
1,000 times. Table 3 shows the top ten of that group,
with the number of occurrences (calls) and the
number of domains:
Table 3: Most frequently invoked script files.

File name                  Calls     Domains
show_ads.js                36,248    18,443
urchin.js                  33,898    32,873
AC_RunActiveContent.js     29,746    28,526
swfobject.js               19,918    18,887
prototype.js                9,029     8,837
mootools.js                 8,483     8,032
jquery.js                   8,207     7,851
caption.js                  5,947     5,916
scriptaculous.js            5,172     5,028
funciones.js                4,578     4,419
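The two columns of Table 3 can be obtained by counting, for each file name, every invocation and every distinct domain. The sketch below assumes the crawl has already been reduced to (domain, script source) pairs; the function `script_usage`, the record layout and the sample domains are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath

def script_usage(records):
    """records: iterable of (domain, script_src) pairs, one per external script tag."""
    calls = defaultdict(int)
    domains = defaultdict(set)
    for domain, src in records:
        # Keep only the file name: drop host, directories and query string.
        name = posixpath.basename(urlparse(src).path)
        if not name:
            continue
        calls[name] += 1
        domains[name].add(domain)
    rows = [(name, calls[name], len(domains[name])) for name in calls]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

# Hypothetical sample records.
records = [
    ("a.example", "http://ads.example/show_ads.js"),
    ("a.example", "http://ads.example/show_ads.js"),
    ("b.example", "/js/jquery.js?v=1"),
    ("c.example", "js/jquery.js"),
    ("c.example", "http://stats.example/urchin.js"),
]
top = script_usage(records)
```

Here `show_ads.js` has two calls from a single domain, which is the same pattern as its row in Table 3, where calls far exceed domains.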
We have also tried to group those libraries by
functionality and to count the number of domains
that use libraries of each functionality. Table 4
shows the results:
Table 4: Functionality of the most common scripts.

Functionality                                Domains    Calls
Management of Flash and active content       51,895     59,713
Visit count and generation of statistics     39,354     41,777
Content dynamization with AJAX               28,819     41,767
Content rendering and image treatment        22,185     24,598
Menu generation                               4,714      5,198
Data treatment and validation                 4,376      6,586
We conclude that, although there are many
libraries on the Web, they can be grouped into a
small number of functionalities (generation of
statistics, AJAX, Flash, image treatment, etc.).
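The grouping behind Table 4 can be sketched as a lookup from file name to functionality followed by an aggregation. The mapping below is an illustrative assumption, since the text does not list which file was assigned to which group; `group_by_functionality` and the sample records are hypothetical as well.

```python
from collections import defaultdict

# Illustrative mapping only; the actual assignment of files to groups is not given.
FUNCTIONALITY = {
    "swfobject.js": "Flash and active content",
    "AC_RunActiveContent.js": "Flash and active content",
    "urchin.js": "Visit count and statistics",
    "prototype.js": "AJAX",
    "mootools.js": "AJAX",
    "jquery.js": "AJAX",
}

def group_by_functionality(records, mapping=FUNCTIONALITY):
    """records: iterable of (domain, file_name). Returns {group: (domains, calls)}."""
    calls = defaultdict(int)
    domains = defaultdict(set)
    for domain, name in records:
        group = mapping.get(name)
        if group is None:
            continue  # file not assigned to any functionality
        calls[group] += 1
        domains[group].add(domain)
    return {group: (len(domains[group]), calls[group]) for group in calls}

# Hypothetical sample records.
records = [
    ("a.example", "jquery.js"),
    ("a.example", "prototype.js"),
    ("b.example", "jquery.js"),
    ("b.example", "swfobject.js"),
    ("c.example", "unknown.js"),
]
groups = group_by_functionality(records)
```

A domain that loads several libraries of the same group is counted once as a domain but once per library as a call, which is one reason the Calls column can exceed the Domains column.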
An interesting aspect of script usage is how
scripts create URLs dynamically. One case that
crawlers rarely handle well is a URL whose
parameter values are injected with JavaScript,
as in the following code:
location.href = "http://www.tienda.es/prod?id=" + id;
In many cases, crawlers will take the
inner expression shown below to be a URL:
"http://www.tienda.es/prod?id=" + id
Or if they perform a tokenization, it could be:
http://www.tienda.es/prod?id=
Strictly speaking, the result of the tokenization is a
URL, but since the product identifier is missing, it
will probably not lead the crawler to the expected
resource. The problem is that the best values for the
parameters cannot be guessed without interpreting
the JavaScript, although other techniques could be researched.
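Without interpreting the script, a crawler can at least tell a complete URL literal from one that is concatenated with a variable. The sketch below is one possible heuristic, not a technique described in this study; the pattern `URL_LITERAL`, the function `extract_urls` and the second sample URL are hypothetical.

```python
import re

# A quoted string literal starting with http:// or https://, optionally followed by "+".
URL_LITERAL = re.compile(r"""(["'])(https?://[^"']*)\1(\s*\+)?""")

def extract_urls(js_source):
    """Return (url, kind) pairs; kind is 'potential' when the literal is concatenated."""
    results = []
    for match in URL_LITERAL.finditer(js_source):
        url = match.group(2)
        # A trailing "+" means a value is appended at run time, so the URL is incomplete.
        kind = "potential" if match.group(3) else "complete"
        results.append((url, kind))
    return results

code = '''
location.href = "http://www.tienda.es/prod?id=" + id;
window.location = 'http://www.tienda.es/ofertas.html';
'''
found = extract_urls(code)
```

The first literal is reported as potential because of the trailing concatenation, so a crawler could set it aside for extra processing instead of requesting the truncated address.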
Table 5 shows the number of URLs we found in
typical redirection statements, as well as the number
of simple "potential" URLs that have parameters
injected by means of JavaScript, as in the previous
example. It also shows how many of them are
considered "well formed" by the algorithm of the
open-source crawler Nutch, which first uses
regular expressions to detect potential URLs and
then applies a filter to discard invalid ones.
Table 5: Finding complete and potential URLs.

Name              Number    Pass Nutch filter    %
Complete URLs     41,716    41,590               99.7
Potential URLs     8,709     8,581               98.5
As shown in Table 5, we have found 8,709
potential URLs that could be completed with some
extra processing. However, conventional crawlers
would treat them as valid URLs, although they
actually point to error pages or uninteresting pages.
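The reason truncated URLs survive filtering is that a purely syntactic check has no way of knowing a parameter value is missing. The sketch below illustrates this with a simple well-formedness test; it is an assumption-laden stand-in and not Nutch's actual filter, and `is_well_formed` is a hypothetical name.

```python
from urllib.parse import urlparse

def is_well_formed(url):
    """Syntactic check only: http(s) scheme, a host name, no spaces, quotes or brackets."""
    if any(ch in url for ch in ' "\'<>'):
        return False
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

candidates = [
    "http://www.tienda.es/prod?id=",         # truncated: the id value is missing
    '"http://www.tienda.es/prod?id=" + id',  # raw expression mistaken for a URL
]
passed = [url for url in candidates if is_well_formed(url)]
```

Only the raw expression is rejected. The truncated URL passes even though it lacks the product identifier, which matches the high pass rate of potential URLs in Table 5.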
4.2 Web Forms
Forms are the main entry point to the server-side
Hidden Web, so we need to study them thoroughly.
We have found 188,712 forms in 124,865 domains
(21.6%), following a power-law distribution. Of these
forms, 122,417 (64.9%) submit their request by POST
and 48,443 (25.7%) use GET. A further 17,779 forms
(9.4% of all forms, or 14.2% relative to the number of
domains) do not specify a method, so they default to
GET as well.
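Counting forms by method has to account for the HTML rule that a form without a method attribute is submitted with GET. The sketch below uses Python's standard `html.parser` module; the class `FormMethodCounter` and the sample page are hypothetical and not the tooling used in this study.

```python
from collections import Counter
from html.parser import HTMLParser

class FormMethodCounter(HTMLParser):
    """Counts forms by effective HTTP method; a missing method attribute means GET."""

    def __init__(self):
        super().__init__()
        self.methods = Counter()
        self.unspecified = 0

    def handle_starttag(self, tag, attrs):
        if tag != "form":
            return
        method = dict(attrs).get("method")
        if not method:
            # No method attribute (or an empty one): the browser falls back to GET.
            self.unspecified += 1
            method = "get"
        self.methods[method.strip().upper()] += 1

# Hypothetical sample page with three forms.
page = """
<form action="/login" method="post"><input type="password" name="pw"></form>
<form action="/search"><input type="text" name="q"></form>
<form action="/filter" method="GET"></form>
"""
counter = FormMethodCounter()
counter.feed(page)
```

Keeping `unspecified` as a separate tally makes it possible to report explicit GET forms and defaulted GET forms separately, as the text does.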
Table 6 shows the use of password fields in
forms. The percentages are relative to the number of
pages with forms. These fields are often associated
with authentication, registration or password-change tasks.
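Password fields can be detected by counting the input elements of type password inside each form. This is a minimal sketch under the assumption that forms are not nested; the class `PasswordFormFinder` and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class PasswordFormFinder(HTMLParser):
    """Records, for each form in document order, how many password inputs it contains."""

    def __init__(self):
        super().__init__()
        self.password_fields = []  # one counter per form
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._in_form = True
            self.password_fields.append(0)
        elif tag == "input" and self._in_form:
            if (dict(attrs).get("type") or "").lower() == "password":
                self.password_fields[-1] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False

# Hypothetical sample page with three forms.
page = """
<form action="/login"><input type="text" name="u"><input type="password" name="p"></form>
<form action="/signup"><input type="PASSWORD" name="p1"><input type="password" name="p2"></form>
<form action="/search"><input type="text" name="q"></form>
"""
finder = PasswordFormFinder()
finder.feed(page)
forms_with_password = sum(1 for count in finder.password_fields if count > 0)
```

The per-form counts allow forms to be bucketed by their number of password fields.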