Figure 17: Download speed for the test on 05/01/2016.
Figure 18: Average download speed for the daily periodic snapshots during August 2016.
submitted a set of periodic snapshots. We executed 31 daily web crawling sessions, each kept alive for one hour (from 21:00 to 22:00) and using the best setup of the software, the same as in Tests 1 and 2, i.e., 8 nodes and 2 agents per node. In total, we downloaded about 15TB of data (on average, 484GB per snapshot) and saved around 3.3TB on the storage (on average, 111GB per snapshot). Figure 18 reports the download speed for each snapshot. We observed that it oscillated between 0.96 and 1.07Gbps, with an average value around 1.00Gbps (the bottleneck due to the firewall) and a variance around 0.0005Gbps². The successfully completed tests and the high performance demonstrate the good reliability of the software.
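As a quick illustration of how these aggregate figures can be checked, the sketch below recomputes the mean and variance from per-snapshot speeds; the values in speeds_gbps are hypothetical placeholders, not the measured data from the test.

    # Minimal sketch (not the project's code): verifying the aggregate
    # statistics reported for the 31 daily snapshots. In practice the
    # speeds would be parsed from the crawler's log files.
    from statistics import mean, pvariance

    # Hypothetical per-snapshot average download speeds in Gbps,
    # oscillating between 0.96 and 1.07 as observed in the test.
    speeds_gbps = [0.96, 1.02, 1.07, 0.99, 1.01]  # one value per snapshot

    avg = mean(speeds_gbps)       # expected around 1.00 Gbps
    var = pvariance(speeds_gbps)  # expected around 0.0005 Gbps^2

    print(f"average speed: {avg:.2f} Gbps, variance: {var:.4f} Gbps^2")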
5 CONCLUSIONS AND FUTURE WORK
To summarize, with this work we obtained a solid product for web crawling activities. The tool is fully integrated into the ENEAGRID infrastructure through the Web Crawling Project. It enables collaborative work thanks to its Virtual Lab, which provides a graphical web application for using all of the instruments integrated in the infrastructure, including remotely. In addition to the web crawling software, the Virtual Lab offers several post-crawling solutions, i.e., for indexing, querying, displaying and clustering the web data, so as to provide a complete product. The experimental results confirm the high quality of the web crawling software configuration in terms of efficiency, robustness and reliability.
In the future we plan to improve the indexing process. Currently, this task is performed sequentially by a single machine (the front end). Initial statistics for this operation show an average of 4 hours of indexing (on a single node) for each hour of crawling (on 8 nodes). By parallelizing the process (by means of the other 8 nodes and/or a large-memory machine, as sketched below), we expect to reach the goal of a "few minutes" of indexing per hour of crawling.
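A minimal sketch of one possible parallelization, assuming the crawled output can be partitioned by file: each of 8 workers indexes an independent shard instead of the front end processing everything sequentially. The function index_shard and the paths are hypothetical stand-ins for the project's actual indexing routine.

    # Sketch of the proposed parallel indexing, not the project's code:
    # the crawled files are split into shards and indexed concurrently,
    # one worker per node/core. index_shard() is a placeholder for the
    # existing single-node indexing step.
    from multiprocessing import Pool
    from pathlib import Path

    NUM_WORKERS = 8  # e.g., the 8 crawling nodes mentioned above

    def index_shard(files):
        """Index one shard of crawled files (placeholder body)."""
        for f in files:
            pass  # invoke the real indexer on f here

    def parallel_index(crawl_dir):
        files = sorted(Path(crawl_dir).glob("*"))
        # Round-robin partition of the files into NUM_WORKERS shards.
        shards = [files[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
        with Pool(NUM_WORKERS) as pool:
            pool.map(index_shard, shards)

    if __name__ == "__main__":
        parallel_index("/path/to/crawl/output")

Because each shard is independent, the expected speed-up is roughly linear in the number of workers, provided the shared storage can sustain the concurrent reads.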
ACKNOWLEDGEMENTS
The computing resources and the related technical support used for this work have been provided by the CRESCO/ENEAGRID High Performance Computing infrastructure and its staff (Ponti et al., 2014). The CRESCO/ENEAGRID High Performance Computing infrastructure is funded by ENEA, the Italian National Agency for New Technologies, Energy and Sustainable Economic Development, and by Italian and European research programmes; see http://www.cresco.enea.it/english for information.
REFERENCES
Boldi, P., Marino, A., Santini, M., and Vigna, S. (2016).
BUbiNG: Massive Crawling for the Masses. CoRR,
abs/1601.06919.
Mariano, A., D'Amato, G., Ambrosino, F., Aprea, G., Colavincenzo, A., Fina, M., Funel, A., Guarnieri, G., Palombi, F., Pierattini, S., Ponti, G., Santomauro, G., Bracco, G., and Migliori, S. (2016). Fast Access to Remote Objects 2.0: a renewed gateway to ENEAGRID distributed computing resources. PeerJ Preprints, 4:e2537v1.
Ponti, G., Palombi, F., Abate, D., Ambrosino, F., Aprea, G., Bastianelli, T., Beone, F., Bertini, R., Bracco, G., Caporicci, M., Calosso, B., Chinnici, M., Colavincenzo, A., Cucurullo, A., Dangelo, P., De Rosa, M., De Michele, P., Funel, A., Furini, G., Giammattei, D., Giusepponi, S., Guadagni, R., Guarnieri, G., Italiano, A., Magagnino, S., Mariano, A., Mencuccini, G., Mercuri, C., Migliori, S., Ornelli, P., Pecoraro, S., Perozziello, A., Pierattini, S., Podda, S., Poggi, F., Quintiliani, A., Rocchi, A., Scio, C., Simoni, F., and Vita, A. (2014). The role of medium size facilities in the HPC ecosystem: The case of the new CRESCO4 cluster integrated in the ENEAGRID infrastructure. In Proceedings of the 2014 International Conference on High Performance Computing and Simulation (HPCS), pages 1030–1033.