also allow us to map agrifood businesses to regions, as
the Facebook APIs offer location-based search.
Finally, another future activity is to apply the
methods we developed for classifying agrifood sites
to all sectors, in order to generalise the tools we
developed and to fully classify the .it web.
6 CONCLUSIONS
This paper has covered the design and
implementation of a web classification system
focusing on .it web sites. The goal has been to
create a classification system able to continuously
classify a large number of constantly changing web
sites. The outcome is that we can correctly assign a
category to domain names with an overall F1 score of
over 80%. This is a great step ahead with respect to
commercial classification services, which produce poor
results as reported in Table 1 even though they use
broader categories, which eases the classification task,
whereas in this work we have used very
specific categories. This work has been used in the
context of the Expo 2015 Universal Exposition to classify
the agribusiness sites active on .it and divide them
into sub-categories. The system has been operational
for some months, and we are extending it in order to
use it for categorising non-agrifood domains.
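To make the F1 figure above concrete, the following is a minimal sketch of how per-category precision, recall and F1 can be computed for single-label predictions, together with a macro average. It is an illustration only: the category names and labels are hypothetical, the helper names `f1_per_category` and `macro_f1` are not part of our system, and macro averaging is an assumption about how an "overall" score might be aggregated.

```python
from collections import Counter


def f1_per_category(true_labels, predicted_labels):
    """Return {category: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted category gains a false positive
            fn[truth] += 1  # the true category gains a false negative
    scores = {}
    for cat in set(true_labels) | set(predicted_labels):
        predicted = tp[cat] + fp[cat]
        actual = tp[cat] + fn[cat]
        precision = tp[cat] / predicted if predicted else 0.0
        recall = tp[cat] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[cat] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-category F1 values (assumed aggregation)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical manually assigned categories versus classifier output.
truth = ["wine", "wine", "dairy", "dairy", "pasta", "pasta"]
guess = ["wine", "dairy", "dairy", "dairy", "pasta", "wine"]

scores = f1_per_category(truth, guess)
for category in sorted(scores):
    p, r, f = scores[category]
    print(f"{category}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1 = {macro_f1(scores):.2f}")
```

A macro average weights every category equally, so small sub-categories count as much as large ones; a micro or weighted average would be an equally plausible choice when categories differ widely in size.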
In terms of original contributions, our system is a
step forward with respect to commercial
classification systems, which fall short when classifying
non-English or less popular web sites. All the
software is based on freely available tools and
libraries, and its internals have been explained in this
paper, making the system open and extensible,
unlike commercial systems, which do not explain
how or how often they classify sites.
ACKNOWLEDGEMENTS
Our thanks to Prof. Giuseppe Attardi
<attardi@di.unipi.it> for his help and suggestions
throughout this work. In addition, we would like to thank
our colleagues who assisted us during the manual
domain classification.
REFERENCES
BrightCloud Inc., 2014. BrightCloud Web Classification
Service, http://www.brightcloud.com/pdf/BCSS-WCS-DS-us-021814-F.pdf.
SimilarWeb Inc, 2014. Our Data & Methodology,
http://www.similarweb.com/downloads/our-data-methodology.pdf.
AOL Inc, 2015. Open Directory Project (ODP),
http://dmoz.org.
Blocksi SAS, 2014. Blocksi Manager for Cloud Filtering,
http://www.blocksi.net.
zvelo Inc., 2014. Website Classification,
https://zvelo.com/website-classification/.
Sun, A., Lim, E., 2002. Web classification using support
vector machine, Proc. of the 4th international workshop
on Web information and data management (WIDM
’02).
Dumais, S., Chen, H., 2000. Hierarchical classification of
Web content, Proc. of the 23rd ACM SIGIR conference
on Research and development in information retrieval
(SIGIR ’00).
Zhang, Y., Zincir-Heywood, N., and Milios, E.,
2003. Summarizing web sites automatically, Proc. of
AI’03.
Soumen, C., Van den Berg, M., and Dom, B., 1999.
Focused crawling: a new approach to topic-specific
Web resource discovery, Computer Networks 31.11
(1999): 1623-1640.
Chandra, C., et al., 1997. Web search using automatic
classification, Proc. of the Sixth International
Conference on the World Wide Web.
Jung-Jin, L., et al., 2009. Novel web page classification
techniques in contextual advertising, Proc. of the
eleventh international workshop on Web information
and data management. ACM.
Xiaoguang, Q., and Davison, B. D., 2009. Web page
classification: Features and algorithms, ACM
Computing Surveys (CSUR) 41.2 (2009):12.
Dou, S., et al., 2004. Web-page classification through
summarization, Proc. of the 27th annual international
ACM SIGIR conference on Research and development
in information retrieval. ACM.
Hwanjo, Y., Han, J., and Chen-Chuan Chang, K., 2002.
PEBL: positive example based learning for web page
classification using SVM, Proc. of the eighth ACM
SIGKDD international conference on Knowledge
discovery and data mining, ACM.
Ji-bin, Z., et al., 2010. A Web Site Classification Approach
Based On Its Topological Structure, Int. J. of Asian
Lang. Proc. 20.2 (2010):75-86.
Soumen, C., Dom, B., and Indyk, P., 1998. Enhanced
hypertext categorization using hyperlinks, ACM
SIGMOD Record. Vol. 27. No. 2. ACM.
Attardi, G., Gulli, A., and Sebastiani, F., 1999. Automatic
Web page categorization by link and context analysis,
Proc. of THAI. Vol. 99. No. 99.
Vapnik, V., 1998. Statistical learning theory. Adaptive
and learning systems for signal processing,
communications, and control, John Wiley & Sons.
Joachims, T., 1998. Text categorization with support vector
machines: Learning with many relevant features.
Springer Berlin Heidelberg.
Sun, A., Ee-Peng, L., and Wee-Keong, N., 2002. Web
classification using support vector machine, Proc. of the