A Collaborative Environment for Web Crawling and Web Data Analysis in ENEAGRID

Giuseppe Santomauro¹, Giovanni Ponti¹, Fiorenzo Ambrosino¹, Giovanni Bracco², Antonio Colavincenzo¹, Matteo De Rosa³, Agostino Funel¹, Dante Giammattei³, Guido Guarnieri¹ and Silvio Migliori⁴

¹ENEA - C.R. Portici, DTE-ICT-HPC, P.le E. Fermi, 1 - 80055 Portici, NA, Italy
²ENEA - C.R. Frascati, DTE-ICT-HPC, Via E. Fermi, 45 - 00040 Frascati, Roma, Italy
³ENEA - C.R. Portici, DTE-ICT-RETE, P.le E. Fermi, 1 - 80055 Portici, NA, Italy
⁴ENEA - Sede Legale, DTE-ICT, Lungotevere Thaon di Revel, 76 - 00196 Roma, Italy
Keywords: Web Crawling, Big Data, Text Data, Data Mining, Networking.
Abstract: In this document we provide an overview of the development and integration in ENEAGRID of tools able to download data from the Web (Web Crawling), manage and display large amounts of data (Big Data), and extract relevant hidden information from the data (Data Mining). We collected all these instruments in the so-called Web Crawling Project. The corresponding environment, a Virtual Laboratory, offers the possibility to use all the integrated tools remotely through a simple graphical interface. A detailed description of the developed web application is illustrated. Finally, some experimental results on the behaviour of the Web Crawling tool are reported.
1 INTRODUCTION
Over the last 20 years, the Internet has shown exponential growth. It is certainly the world's largest data source, as confirmed by the increasing number of new documents and contents uploaded every day. Web data have characteristics that require a considerable effort of analysis and organization. They have three main features. The first, due to the fast development of the Web, is the sheer amount of data. The second is that the data are stored in a distributed fashion across different sites, sometimes geographically very far apart. The last feature regards their form: in many cases, because of their textual nature, data are unstructured (e.g. doc, pdf, plain text) or semi-structured (e.g. HTML, XML, JSON).
This is the scenario in which the most popular search engines, such as Google, Bing and Yahoo, operate every day and deal with this huge amount of data. They need to periodically download the content of a wide portion of the Web in order to update their indexes and thus return the most recent results to queries. There are many other contexts in which the information retrieved from Web data is very important, such as: Cyber Security, to protect against the criminal or unauthorized use of electronic data; Business Intelligence, to devise a successful market strategy before competitors; or Academic Research, to analyse the historical evolution of the Web over the years, from its origins to the present.
It is easy to see that downloading a large amount of data from the Internet within a tolerable elapsed time becomes a critical point in all the above application fields. This task is commonly known as Web Crawling.
At the same time, whoever decides to crawl content from the Web needs suitable data management strategies in order to tackle the problem of storing, copying, compressing and displaying a huge amount of data. For this reason, all these operations have to be handled in a Big Data context.
A last consideration regards the knowledge that can be extracted from Web data. Since, in many cases, data sources are in textual (unstructured or semi-structured) form, it is natural to apply Data Mining techniques for text data for advanced analysis. Specifically, in order to extract relevant and accurate
information, and thus to obtain quality results and high-value content, such techniques have to take into account both syntactic and semantic aspects of text data. It is also important to consider clustering techniques in order to identify homogeneous groups, information classes and patterns in the data. The remarkable amount of data, combined with the algorithmic aspects, makes the analysis a truly costly task from a computational point of view. For this reason it was important to conduct the research at an advanced computing centre such as the Portici ENEA Research Centre, which, hosting the CRESCO/ENEAGRID infrastructure (Ponti et al., 2014), offers the possibility to run and test the algorithms.
The paper is organized as follows: in Section 2 we provide a general survey of Web Crawling tools; in Section 3 we describe the integration of the Web Crawling Project in ENEAGRID; in Section 4 we report the results of some Web Crawling sessions performed on ENEAGRID; finally, in Section 5 we summarize the main results and outline some possible future improvements.
2 WEB CRAWLING TOOLS
Generally, a crawling technique systematically and automatically analyses the content of a network in order to search for documents to download. Web crawlers, namely the programs designed to crawl, are based on a list of URLs: they make a text copy of all visited pages and save them in an index. When a crawler parses a URL, all the hyperlinks in the document are identified and then added to the list of URLs to visit. This process can be stopped manually or when a specified number of connections is reached. A minimal sketch of this loop is given below.
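As a purely conceptual illustration of the loop just described (and not the code of any of the crawlers discussed in this paper), the following Python sketch maintains a frontier of URLs, fetches each page, extracts its hyperlinks and stops after a fixed number of connections; the regular expression, the seed URL and the page limit are arbitrary choices made only for the example.

```python
import re
import urllib.request
from collections import deque

def crawl(seeds, max_connections=100):
    """Toy breadth-first crawler: fetch pages, store their text, enqueue new links."""
    frontier = deque(seeds)          # URLs still to visit
    visited = set()                  # URLs already fetched (avoid repeated visits)
    index = {}                       # URL -> raw page content (the "text copy")
    link_re = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)

    while frontier and len(visited) < max_connections:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                page = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # unreachable or non-text resource: skip it
        visited.add(url)
        index[url] = page
        for link in link_re.findall(page):   # newly discovered URLs go to the frontier
            if link not in visited:
                frontier.append(link)
    return index

if __name__ == "__main__":
    pages = crawl(["https://www.example.com/"], max_connections=10)
    print(len(pages), "pages fetched")
```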
With the aim of integrating some web crawling tools in ENEAGRID, several products have been investigated. At the state of the art, web crawler tools can be divided into two main categories. The first one contains tools that only download raw web pages (“mere crawling”), whereas the second class contains software that, in addition to mere crawling facilities, provides data structuring or data pre-analysis. We will refer to this second family of tools, which also embed post-crawling instruments (e.g. for data parsing), as “new generation crawling”.
Among the “mere crawling” tools, the following ones have been taken into account:
Heritrix: it is written in Java; it respects the exclusion guidelines of the META robots tags; it collects material at a measured and adaptive pace, with a low probability of influencing the normal activity of a website; its interface is accessible through a web browser; it is scalable, but not dynamically scalable.
Nutch: it is based on Lucene and Java; it is coded entirely in Java, but data is written in language-independent formats; it has a highly modular architecture that allows developers to create plug-ins for media-type parsing, data retrieval, querying, and clustering; it is accessible from the terminal.
BUbiNG: it is an open-source, fully distributed Java software; it was born in an academic environment; it is built upon the experience with UbiCrawler and on the last ten years of research on web crawling.
Among the “new generation crawling” instruments, we investigated the following ones:
Scrapy: it is written in Python; it can be used to extract data through APIs; it returns semi-structured outputs (JSON, XML, CSV, etc.); it is accessible from the terminal.
OpenWebSpyder: it supports multi-threading; it offers some search engine functions; it is accessible from web browsers.
OpenSearchServer: it performs various functions, such as crawling, parsing, analysis, classification and learning, on different data sources (Web, file system, database and XML files).
It was also important that the selected solution satisfy the state of the art in Web Crawling, both in terms of methodologies and in terms of the rules that constitute best practice in this field, without causing problems to the visited sites. Indeed, there are three main issues to be taken into account: the first one is technical and concerns the so-called Denial of Service; the other two are of a legislative nature and concern the safeguarding of the confidentiality of personal information (Privacy) and the protection of the intellectual property of the data (Copyright).
It is also for these reasons that, after some tests and experimentation to establish which web crawler was the most suitable for integration in ENEAGRID, we decided to adopt the BUbiNG software. In the following, we describe in detail the main features of this web crawling software, explaining the reasons for our choice.
2.1 Software Solution: BUbiNG
Among the tools considered above, the software BUbiNG (Boldi et al., 2016), version 0.9.11, was chosen as the main Web crawler; it was properly installed, configured and tested on ENEAGRID.

Figure 1: Portici ENEA Research Centre Network Schema.

This software allows the simultaneous execution of multiple agents (i.e., Java instances that each play the role of a crawler). The agents communicate with each other, ensuring that the same pages are not visited more than once and balancing the computational load. BUbiNG also allows setting up all configuration options in a single file, and many of its parameters can be modified at runtime. Among the main parameters that can be set up are the filters. For example, a filter can be applied to:
all ready URLs, to decide whether to fetch them;
all fetched resources, to decide whether to parse them;
all fetched resources, to decide whether to store them.
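The three stages can be pictured as independent predicates evaluated on each URL or fetched resource. The following Python fragment is only a conceptual sketch of that idea, not BUbiNG's actual filter classes or configuration syntax; the .it-domain restriction simply mirrors the filter used for the experiments in Section 4, and the content-type and size thresholds are invented for the example.

```python
from urllib.parse import urlparse

# Conceptual stand-ins for stage-specific crawler filters (illustrative only).
def fetch_filter(url: str) -> bool:
    """Decide whether a ready URL should be fetched at all (e.g. only the .it domain)."""
    host = urlparse(url).hostname or ""
    return host.endswith(".it")

def parse_filter(url: str, content_type: str) -> bool:
    """Decide whether a fetched resource should be parsed for new links."""
    return content_type.startswith("text/html")

def store_filter(url: str, content_length: int) -> bool:
    """Decide whether a fetched resource should be stored on disk."""
    return content_length < 10 * 1024 * 1024   # skip resources larger than 10 MB
```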
Another important setting concerns the number of threads for each inner task, i.e. for DNS resolution, fetching and parsing of the web pages. Further, we can also set up the size of the inner cache memories, more precisely:
the Workbench, which contains approximately all the URLs that will be retrieved in the near future;
the Sieve, which keeps track of the URLs not yet visited;
the Bloom Filter, used to detect duplicate downloaded pages;
the URL cache, which saves the 128-bit MurmurHash3 fingerprints of newly visited URLs.
Also, the initial seed from which a web crawling session starts is passed to the software as a parameter, more precisely as the path of the text file where the seed URLs are saved. Finally, BUbiNG saves the content downloaded from the Web in compressed .warc.gz files. This data compression is very important because it saves up to around 80% of the real size of the downloaded data.
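The .warc.gz archives produced by the crawler can later be inspected with standard WARC tooling. As a hedged example, the short Python sketch below iterates over the records of one archive using the third-party warcio library; warcio is our own assumption (it is not part of the toolchain described in this paper) and the file name is made up for the example.

```python
# pip install warcio   (third-party library, assumed here only for illustration)
from warcio.archiveiterator import ArchiveIterator

def list_responses(warc_path):
    """Print URL and payload size for every HTTP response stored in a .warc.gz file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":            # skip request/metadata records
                url = record.rec_headers.get_header("WARC-Target-URI")
                payload = record.content_stream().read()
                print(url, len(payload), "bytes")

if __name__ == "__main__":
    list_responses("snapshot-00000.warc.gz")   # hypothetical file name
```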
3 INTEGRATION IN ENEAGRID
In order to integrate in ENEAGRID tools able to download data from the Web, manage and display a large amount of data, and extract significant information from it, we created the Web Crawling Project. One of the aims of the project was also to use ENEAGRID for experimental tests other than classic high performance computing tasks. A secondary goal was to set up a project that involves all the main hardware and software components of the facility: with the Web Crawling Project we had to consider, and harmonize with each other, several HPC nodes, the (inner and outer) networking system, the storage apparatus and the job scheduler.
3.1 Hardware Solutions
First of all, as hardware resources, we considered a set of 9 machines dedicated exclusively to the Web Crawling activity: 8 of them are used for computation and 1 as a front end. The servers are nodes of the CRESCO2 cluster, which belongs to the Portici ENEA Research Centre and is integrated in ENEAGRID. Each of them mounts 2 Xeon Quad-Core Clovertown E5345 processors and 16 GByte of RAM. Suitable configurations, also on the networking side, were carried out to appropriately confine the network traffic. In Figure 1 we report the network schema of the Portici ENEA Research Centre. The nodes are linked by 5 InfiniBand (5 × 1 Gbps) cables to an Enterasys 3100 switch. From there, the network traffic passes (with 4 × 10 Gbps bandwidth) across the campus router, an Enterasys Matrix X4. The traffic is then routed (1 Gbps in I/O) through the Centre firewall and the external router, a Cisco 7606. Finally, with a 1 × 10 Gbps connection, the traffic reaches the GARR NA6 Point of Presence and then the Internet.
According to the ENEAGRID guidelines, we integrated the web crawling software on the infrastructure. More specifically, the BUbiNG binaries and all the related dependencies were properly installed and configured in the project area of the geographically distributed filesystem AFS (the installed version is 1.6.7-1). For data storage, since the web crawler quickly downloads a large amount of data, we decided to allocate a wide space on the high performance distributed parallel clustered filesystem GPFS (version 4.2.2). Finally, a set of bash scripts was written to simplify the submission of the software's commands to the resource scheduler LSF (version 7.0.3); an illustrative example of such a submission is sketched below.
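As a hedged illustration of how a crawling session might be handed to LSF (not the actual project scripts, whose contents are not reported in the paper), the following Python fragment composes and submits a bsub command line; the queue name, wrapper script and paths are hypothetical.

```python
import subprocess

def submit_crawl(session_name, seed_file, config_file, nodes=8, runtime="00:15"):
    """Build and submit an LSF job that launches a BUbiNG crawling session."""
    cmd = [
        "bsub",
        "-J", session_name,                                 # job name
        "-q", "webcrawl",                                   # hypothetical LSF queue
        "-n", str(nodes),                                   # number of slots requested
        "-W", runtime,                                      # wall-clock limit (hh:mm)
        "-o", f"/gpfs/webcrawl/logs/{session_name}.out",    # hypothetical log path
        "/afs/enea.it/project/webcrawl/bin/run_bubing.sh",  # hypothetical wrapper script
        config_file, seed_file,
    ]
    subprocess.run(cmd, check=True)

# Example call (all paths are placeholders):
# submit_crawl("snapshot-2016-08-01", "seeds_it.txt", "best16agents.properties")
```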
3.2 Virtual Laboratory
All the Web Crawling instruments are collected in an ENEAGRID Virtual Laboratory, named Web Crawling. The goal of the laboratory is to make the crawling tools available to a wide scientific community. In particular, as for all the virtual labs developed on the infrastructure, we created a public website where users can authenticate and then access an ad hoc graphical interface to use the installed software intuitively. The scientific community of ENEAGRID users interested in using the Web Crawling tools can browse to the address http://www.afs.enea.it/project/webcrawl/index.php.
The site is composed of four main pages: a welcome page (Fig. 2), a software area access page (Fig. 3), a contacts page and a reserved documents area page. Once a user enters his or her ENEAGRID credentials on the software page of the site (Fig. 3), a graphical mask (i.e., a web application) starts on the front end machine, and the user can begin to use the Web Crawling tools. The authentication procedure is managed by an NX client installed locally on the user machine, while the interface is based on the FARO 2.0 technology (Mariano et al., 2016).
3.3 Web Application
The web application is a customized version of FARO 2.0, developed on top of the JavaFX framework, that exploits WebViews for its interface implementation and its operational logic. A second way to access the web application is by means of FARO 2.0 itself, which has a dedicated button on its Virtual Labs page (Fig. 4) to directly start the web crawling interface.
Figure 5 shows the interface welcome page. By clicking on the Application button on the top bar, different tabs for different operations are displayed:
In the first tab (Snapshot), Web Crawling sessions can be submitted, with the option of performing single or periodic snapshots (Fig. 6). The user can specify a session name, the number of snapshots, the running time of each of them, the starting date and the execution frequency for periodic snapshots, and can select the parameter setting to be used and the initial seed of URLs;
In the second tab (New Configuration), the user can create a new setting for the BUbiNG software (Fig. 7). By default, each software parameter is set up with the corresponding value of the best configuration (16 agents). The user can also load an old setting, modify it and save it as a new configuration or overwrite the old one;

Figure 2: Homepage of Web Crawling Virtual Lab site.
In the third tab (New Initial Seed), a new list of URLs from which to start the crawling can be created (Fig. 8). As for the configuration, the user can load an old list, modify it and save it as a new list or overwrite the old one;
In the fourth tab (Monitoring), the network traffic generated by a running snapshot, selectable from the list of those currently running, can be monitored (Fig. 9). This tab is useful to know in real time the amount of downloaded data, per agent or in total, and to detect possible performance anomalies;
In the fifth tab (Statistics), the user can view some statistics on the performed snapshots, such as the amount of data downloaded, the number of visited pages and the downloading speed (Fig. 10);
In the sixth and last tab (Analysis & Display), some tools for the post-crawling process are available (Fig. 11).
For each terminated snapshot, three main instruments are available for the analysis, the display and the clustering of the web data. From here, the user can start/stop:
a Solr session, run as a web server, for the indexing, analysis and searching of the web data;
an OpenWayback session, run as a web server, for browsing the HTML pages;
a Carrot2 session, run as an application, for the clustering of the data.

Figure 3: Software page of Web Crawling Virtual Lab site.
Figure 4: FARO2 Virtual Labs page.
Figure 5: GUI welcome page.
Solr is based on Lucene and its graphical interface can be viewed through a browser. It extracts and indexes many metadata from each web document (URL, crawling date, text content, position inside the source file, etc.). Figure 12 reports a screenshot of the software after a text search for the Italian word terremoto (earthquake) on a snapshot restricted to the Italian .it domain.
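Since Solr exposes its search functions over a simple HTTP API, a search like the terremoto query in Figure 12 can also be issued programmatically. The sketch below is a hedged example using Python's standard library; the host, port, core name and field name are assumptions, not the actual settings of the laboratory.

```python
import json
import urllib.parse
import urllib.request

def solr_search(keyword, rows=10,
                base_url="http://localhost:8983/solr/webcrawl"):   # core name is hypothetical
    """Query a Solr core via its /select endpoint and return the matching documents."""
    params = urllib.parse.urlencode({
        "q": f"content:{keyword}",    # 'content' as field name is an assumption
        "rows": rows,
        "wt": "json",
    })
    with urllib.request.urlopen(f"{base_url}/select?{params}") as response:
        result = json.load(response)
    return result["response"]["docs"]

if __name__ == "__main__":
    for doc in solr_search("terremoto"):
        print(doc.get("id"))
```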
The OpenWayback tool, whose graphical interface is, once again, viewable via browser, indexes the HTML pages by URL and crawling date. Figure 13 reports a screenshot of the software while it displays one of the documents just retrieved with Solr.
Finally, Carrot2, which has an independent interface, is an open source search results clustering engine and is well integrated with Solr. Among the main features of Carrot2, it should be underlined that it is able to create clusters over different types of documents (including web pages), using various clustering techniques (K-means, Lingo, STC, etc.), and that it returns the results in different forms, including graphical ones.

Figure 6: GUI snapshot tab.
Figure 7: GUI new configuration tab.
Figure 14 shows a screenshot of the software at the end of a clustering execution. In the top left it can be seen that the chosen data source is a collection of web documents indexed by Solr and that the selected algorithm is K-means. The software applied the algorithm to the collection by first filtering the documents with a keyword provided by the user (in this example, the Italian word repubblica, i.e. republic); it then returned a partition of the selected subset of files, tagging each part with a label (the resulting groups are the clusters). Finally, on the right side some algorithm parameters are displayed; they can be modified to obtain a different result.
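To make the clustering step concrete, the fragment below sketches the same kind of keyword-filtered K-means partitioning on a small list of texts. It is only an illustration of the technique using scikit-learn, not Carrot2's API; the documents, keyword and number of clusters are made up for the example.

```python
# pip install scikit-learn   (illustrative substitute for Carrot2's built-in K-means)
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "la repubblica italiana e la sua costituzione",
    "elezioni del presidente della repubblica",
    "previsioni meteo per il fine settimana",
    "storia della repubblica romana antica",
]

# 1) keyword filter, as done in the interface before clustering
keyword = "repubblica"
selected = [d for d in documents if keyword in d]

# 2) vectorize the filtered texts and partition them with K-means
vectors = TfidfVectorizer().fit_transform(selected)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for text, label in zip(selected, labels):
    print(label, text)
```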
Figure 8: GUI new initial seed tab.
Figure 9: GUI monitoring tab.
4 TESTS AND EXPERIMENTAL RESULTS
Once BUbiNG was installed, a set of snapshots with the same running time (15 min) but with various choices of the software parameters was submitted. The aim of this parameter tuning was to find the best configuration for the crawler. At the end of this operation, after more than 40 sessions, we observed that the best set-up for the software, measured as the amount of data downloaded per time unit, was obtained by using 8 nodes and 16 agents (2 agents per node), reaching about 1 Gbps. The best measured speed was thus of the same order as the bandwidth of the network used for the Web crawling activity; indeed, the bottleneck of the Portici ENEA Research Centre network is the firewall, which limits the access to 1 Gbps. In order to stress the BUbiNG crawler we decided to perform some large Web crawling sessions, taking into account the following aspects:
we selected some periods with low network traffic in the Portici ENEA Research Centre;
we informed the network managers, in order to prevent conflicts with other activities and/or the full saturation of the available bandwidth;
we used all the available hardware resources (8 nodes) and the optimal setting of the software, in order to maximize the downloading speed;
we monitored the snapshots during their running time.

Figure 10: GUI statistics tab.
Figure 11: GUI analysis, display and clustering tab.
Two kinds of snapshot sessions were considered: in the first one the execution time was very long, in order to evaluate the (very slow) performance decay and the robustness of the software; in the second one we tested the crawler by submitting scheduled periodic snapshots, to verify a uniform behaviour over time, i.e. the reliability of the software. The material harvested from the Internet during the tests was also useful to create a first database on which to apply the data mining tools. For all the tests we noted that no interference with the normal network activity was reported. Further, all the snapshot sessions were performed by using, as initial seed, a set of Italian URLs only, and by applying a filter to download only Italian sites, i.e. only web pages of the .it domain.

Figure 12: Solr interface.
Figure 13: OpenWayback interface.
4.1 Tests for Long-time Snapshots
For the first kind of tests, we considered some days of the year with low activity at the Portici ENEA Research Centre, i.e. around the 2015 Christmas holidays. More specifically, we performed the following three long-time snapshots:
Test 1 on 10/12/2015, from 09:30 to 17:30 (8 h);
Test 2 on 28/12/2015, from 12:00 to 22:00 (10 h);
Test 3 on 05/01/2016, from 01:00 to 11:00 (10 h).
In Table 1 we report the statistics about the three sessions, extracted from the log files. By observing the table we can remark that:
Table 1: Long-time session performance for Tests 1, 2 and 3.

| Starting date [dd/mm/yy] | Starting time [hh:mm] | Running time [hours] | Agents [#] | Saved compressed data [GB] | Downloaded data [TB] | Downloaded resources [#pages] | Data speed [Mb/s] | Resources speed [#pages/s] |
|---|---|---|---|---|---|---|---|---|
| 10/12/2015 | 09:30 | 8 | 16 | 522.38 | 2.94 | 66,806,790 | 850 | 2305 |
| 28/12/2015 | 12:00 | 10 | 16 | 580.28 | 3.23 | 71,667,304 | 740 | 1959 |
| 05/01/2016 | 01:00 | 10 | 8 | 581.89 | 3.27 | 75,587,287 | 756 | 2048 |
Figure 14: Carrot2 interface.
a) in absolute value, Test 1 (exploiting 16 agents) reached the best performance (850 Mb/s);
b) for the same running time (10 h), Test 3 (756 Mb/s), which used 8 agents, was slightly better than Test 2 (740 Mb/s), which exploited 16 agents;
c) the ratio between the compressed data saved on the storage (column 5) and the real amount of downloaded data (column 6) is always around 17-18%, consistent with the roughly 80% space saving mentioned in Section 2.1; a quick check on the Test 1 figures is sketched below.
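As a quick sanity check on Table 1, the following lines recompute the compression ratio and the average download speed of Test 1 from the tabulated values (taking 1 TB = 1000 GB and 1 byte = 8 bits; the small gap with respect to the 850 Mb/s in the table is compatible with rounding and with TB/TiB conventions).

```python
# Test 1 values taken from Table 1
saved_gb = 522.38          # compressed data on storage [GB]
downloaded_tb = 2.94       # raw downloaded data [TB]
running_hours = 8

compression_ratio = saved_gb / (downloaded_tb * 1000)
avg_speed_mbps = downloaded_tb * 1000 * 8 * 1000 / (running_hours * 3600)

print(f"compressed/raw ratio: {compression_ratio:.1%}")   # ~17.8%
print(f"average speed: {avg_speed_mbps:.0f} Mb/s")        # ~817 Mb/s (table: 850 Mb/s)
```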
In Figures 15, 16 and 17 we report the plots of the downloading data speed versus time, for Test 1, 2 and 3 respectively. We should point out that the real speed was a little greater than the one reported in the graphs, because the web data are compressed before being written. As we expected, during the crawling the downloading speed showed an initial peak (up to the limit of 1 Gbps) and then settled around a slightly lower, roughly constant value (700-800 Mb/s). This phenomenon of performance decrease is common to all web crawlers. The better initial behaviour is explained by the inner management of the cache system: at the beginning, all the crawler cache memories (e.g. for the URLs to visit, the URLs just downloaded, etc.) are empty; when they begin to fill up, the downloading speed begins to decrease. We observed that the performance decay is only 20-30% with respect to the best initial behaviour. This is a good result, because the decrement is regular, constant and limited over a long time, which confirms the high performance of the software. Further, in Figure 15, around 13:30 we observed for a few minutes a quick drop of the downloading speed, followed by a fast return to the previous state. We related this event to a momentary reset of a networking device; this unexpected phenomenon, which happened during Test 1, let us verify the good robustness of the BUbiNG software.

Figure 15: Downloading speed for the test on 10/12/2015.
4.2 Tests for Periodic Snapshots
Through the massive crawling of the first kind of tests we verified some quality characteristics of the web crawler, i.e. its efficiency and robustness. With the second type of tests, instead, we assessed the reliability of the software. First, we considered a long period of the year when the network conditions were constant and the activity was low, namely August 2016. During this month, we submitted a set of periodic snapshots. We executed 31 daily web crawling sessions, each of them kept alive for one hour (from 21:00 to 22:00) and using the best set-up of the software, the same as in Tests 1 and 2, i.e. 8 nodes and 2 agents per node. In total, we downloaded about 15 TB of data (on average, 484 GB per snapshot) and saved around 3.3 TB on the storage (on average, 111 GB per snapshot). Figure 18 reports the average download speed of each snapshot. We observed that the speeds oscillated between 0.96 and 1.07 Gbps, with an average value around 1.00 Gbps (the bottleneck due to the firewall) and a variance around 0.0005 Gbps². The successfully completed tests and the high performance demonstrated the good reliability of the software.

Figure 16: Downloading speed for the test on 28/12/2015.
Figure 17: Downloading speed for the test on 05/01/2016.
Figure 18: Downloading avg speed for daily periodic snapshots during August 2016.
5 CONCLUSIONS AND FUTURE WORK
To summarize, with this work we obtained a good product for the web crawling activity. The tool is completely integrated in the ENEAGRID infrastructure through the Web Crawling Project. It allows collaborative work, thanks to its Virtual Lab, which provides a graphical web application to use all the instruments integrated in the infrastructure, also remotely. In addition to the web crawling software, the Virtual Lab offers some post-crawling solutions, i.e. for indexing, querying, displaying and clustering the web data, so as to form a complete product. The experimental results confirm the high quality of the chosen configuration of the web crawling software in terms of efficiency, robustness and reliability.
For the future, we plan to improve the indexing process. Currently, this task is performed sequentially by a single machine (the front end). Preliminary statistics on this operation show an average of 4 hours of indexing (on a single node) for each hour of crawling (on 8 nodes). By parallelizing the process (by means of the other 8 nodes and/or a large-memory machine), we expect to reach the goal of a few minutes of indexing per hour of crawling.
ACKNOWLEDGEMENTS
The computing resources and the related technical support used for this work have been provided by the CRESCO/ENEAGRID High Performance Computing infrastructure and its staff (Ponti et al., 2014). The CRESCO/ENEAGRID High Performance Computing infrastructure is funded by ENEA, the Italian National Agency for New Technologies, Energy and Sustainable Economic Development, and by Italian and European research programmes; see http://www.cresco.enea.it/english for information.
REFERENCES
Boldi, P., Marino, A., Santini, M., and Vigna, S. (2016). BUbiNG: Massive Crawling for the Masses. CoRR, abs/1601.06919.

Mariano, A., D'Amato, G., Ambrosino, F., Aprea, G., Colavincenzo, A., Fina, M., Funel, A., Guarnieri, G., Palombi, F., Pierattini, S., Ponti, G., Santomauro, G., Bracco, G., and Migliori, S. (2016). Fast Access to Remote Objects 2.0: a renewed gateway to ENEAGRID distributed computing resources. PeerJ Preprints, 4:e2537v1.

Ponti, G., Palombi, F., Abate, D., Ambrosino, F., Aprea, G., Bastianelli, T., Beone, F., Bertini, R., Bracco, G., Caporicci, M., Calosso, B., Chinnici, M., Colavincenzo, A., Cucurullo, A., Dangelo, P., De Rosa, M., De Michele, P., Funel, A., Furini, G., Giammattei, D., Giusepponi, S., Guadagni, R., Guarnieri, G., Italiano, A., Magagnino, S., Mariano, A., Mencuccini, G., Mercuri, C., Migliori, S., Ornelli, P., Pecoraro, S., Perozziello, A., Pierattini, S., Podda, S., Poggi, F., Quintiliani, A., Rocchi, A., Scio, C., Simoni, F., and Vita, A. (2014). The role of medium size facilities in the HPC ecosystem: The case of the new CRESCO4 cluster integrated in the ENEAGRID infrastructure. pages 1030-1033.