5 CONCLUSION AND FUTURE WORK
This work is part of the Klá research project, which is under
development at the Computing Research Center of the Costa Rica
Institute of Technology. Its website is www.kla.ic-itcr.ac.cr.
Klá focuses on characterizing the Costa Rican web based on its
structure, composition and clustering. This knowledge can be applied
to develop several tools for improving the web experience. A new
search engine has been developed using MS .NET technologies. It will
be available soon at www.kla.ac.cr.
This paper has shown that the Costa Rican web has around seven
hundred thousand documents. Most of them are static and relatively
new. Almost 82% of the documents are HTML text. The distributions of
incoming links, outgoing links, PageRank, hubrank and authorityrank
follow the expected power laws.
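
As an illustration of what "follows a power law" means in practice, the sketch below estimates the exponent of a degree distribution with a least-squares fit on a log-log scale. This is only a rough, commonly used approximation and is not necessarily the procedure used in this study; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def estimate_power_law_exponent(degrees):
    """Estimate x in frequency(k) ~ k^(-x) by a least-squares line fit
    of log(frequency) against log(degree). Rough sketch, not a rigorous
    estimator; zero degrees are ignored because log(0) is undefined."""
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degree values")
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying distribution;
    # the exponent is reported as a positive number.
    return -sxy / sxx


# Toy data (hypothetical): the number of pages with in-degree k is 10000 / k^2.
toy_in_degrees = [1] * 10000 + [2] * 2500 + [4] * 625 + [5] * 400 + [10] * 100
exponent = estimate_power_law_exponent(toy_in_degrees)
```

On real crawl data the tail of the distribution is noisy, so the fitted value depends on which degree range is included in the fit.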
A clustering of 16 sites of the Costa Rican web according to their
popularity was also presented. This technique builds a classic vector
for each site and then applies HAC or k-means for clustering.
Currently, we are working on a symbolic representation for web sites,
using histograms (Rodríguez-Rojas, 2000) to measure the frequency of
terms along different axes of a web document: text, title, links and
bold. Symbolic representations of web objects are not new; Schenker
et al. have applied a graph representation to the clustering of web
documents (Bunke, 2003).
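
The histogram-based representation mentioned above can be pictured with a minimal sketch: each site is described by one relative-frequency histogram of terms per axis (text, title, links, bold). The function names, the toy sites and, in particular, the L1 distance used to compare two sites are assumptions made for illustration; they are not the symbolic data analysis method of Rodríguez-Rojas (2000).

```python
from collections import Counter

# The four axes of a web document named in the text.
AXES = ("text", "title", "links", "bold")


def term_histogram(terms):
    """Relative frequency of each (lower-cased) term; empty dict if no terms."""
    counts = Counter(t.lower() for t in terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


def symbolic_site(axis_terms):
    """Build one histogram per axis from a dict mapping axis name to terms."""
    return {axis: term_histogram(axis_terms.get(axis, [])) for axis in AXES}


def histogram_distance(h1, h2):
    """L1 distance between two histograms (an assumed, illustrative choice)."""
    terms = set(h1) | set(h2)
    return sum(abs(h1.get(t, 0.0) - h2.get(t, 0.0)) for t in terms)


def site_distance(s1, s2):
    """Average of the per-axis histogram distances."""
    return sum(histogram_distance(s1[a], s2[a]) for a in AXES) / len(AXES)


# Hypothetical toy sites.
site_a = symbolic_site({"text": ["Costa", "Rica", "web", "web"], "title": ["web"]})
site_b = symbolic_site({"text": ["noticias"], "title": ["noticias"]})
distance_ab = site_distance(site_a, site_b)
```

A pairwise distance of this kind could then be passed to HAC in place of the distance between classic site vectors.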
ACKNOWLEDGEMENTS
The author would like to thank several people who helped in the
construction of this paper: Dr. Oldemar Rodríguez-Rojas for his
guidance in the very first experiments on symbolic web clustering,
Dr. Carlos Castillo for his help with the WIRE system, and
Dr. Francisco J. Torres-Rojas for his support in the establishment of
the Klá research project.
REFERENCES
Baeza-Yates R., Poblete B. and Saint-Jean F. (2003). Evolution of the
Chilean Web: 2001-2002 (In Spanish). In Proceedings of the Jornadas
Chilenas de Computación. Chillán, Chile, November 2003.
Boley D., Gini M., Gross R., Han S., Hastings K., Karypis G., Kumar V.,
Moore J. and Mobasher B. (1999). Partitioning-Based Clustering for
Web Document Categorization. In Decision Support Systems. 1999.
Bunke H., Last M., Schenker A. and Kandel A. (2003). A comparison of
two novel algorithms for clustering web documents. In Proceedings of
the 2nd International Workshop on Web Document Analysis (WDA). 2003.
Buntine W., Perttu S. and Tuulos V. (2004). Using Discrete PCA on Web
Pages. In ECML/PKDD 2004, Proceedings of the Workshop on Statistical
Approaches for Web Mining. Pisa, Italy, September, 2004.
Castillo C. (2004). Effective Web Crawling. PhD Thesis,
University of Chile, 2004.
Chakrabarti S. (2003). Mining the Web. Morgan Kaufmann Publishers,
2003.
COSTA RICA NIC: Domain name services for Costa Rica.
http://www.nic.cr.
He X., Zha H., Ding C. and Simon H. (2001). Web Document Clustering
Using Hyperlink Structures. Technical Report CSE-01-006, Dept. of
Computer Science and Engineering, Pennsylvania State University, 2001.
Jain A.K., Murty M.N. and Flynn P.J. (1999). Data Clustering: A
Review. In ACM Computing Surveys. September, 1999.
Kleinberg J. (1999). Authoritative sources in a hyperlinked
environment. In Journal of the ACM. 46(5):604-632, 1999.
LGL: Large Graph Layout. http://apropos.icmb.utexas.edu/lgl/.
NETCRAFT. http://news.netcraft.com/.
ODP: Open Directory Project. http://dmoz.org.
Page L., Brin S., Motwani R. and Winograd T. (1998). The PageRank
citation ranking: bringing order to the Web. Technical report,
Stanford Digital Library Technologies Project, 1998.
Predisoft: Scientific Prediction. http://www.predisoft.com.
Rodríguez-Rojas O. (2000). Classification and Linear Models in
Symbolic Data Analysis. PhD Thesis, University of Paris IX-Dauphine,
2000.
Shen D., Chen Z., Yang Q., Zeng H., Zhang B., Lu Y. and Ma W. (2004).
Web-page Classification through Summarization. In SIGIR 2004,
Proceedings of the ACM Conference on Research & Development on
Information Retrieval. South Yorkshire, United Kingdom, July, 2004.
WIRE: Web Information Retrieval Environment.
http://www.cwr.cl/projects/WIRE/.