are executed with and without integration of tags.
2 SOCIAL CLASSIFICATION
The evolution of the World Wide Web led to the rise
and growth of new concepts like Web 2.0 and the
Social Web, in which users have access to a set of
applications that allow them to interact with each
other by easily publishing, editing and sharing
content (for example, blogs, wikis, and video and
photo sharing systems).
However, this massive user participation creates a
growing flow of information, which in turn requires
new ways to retrieve information (Lee et al., 2009).
The dynamics that arise among Web 2.0 users
naturally provide interesting ways to help
organize information by creating folksonomies. The
term folksonomy (Wal, 2007) was coined by
Thomas Vander Wal and derives from the
combination of the terms folk and taxonomy.
Folksonomies naturally arise when a set of users,
interested in some information, decide to describe it
through comments, or by attributing tags (Snuderl,
2008), providing important elements to categorize
that information. The power that resides in creating a
folksonomy is visible in initiatives like the one
carried out by the Library of Congress or the
steve.museum research project (Trant, 2008).
The Library of Congress launched a pilot project
on Flickr, a popular photo sharing website, which
consisted of an open invitation to the general public
to tag and describe two sets of approximately 3000
historical photographs (Springer et al., 2008). The
initiative was a success, generating a massive and
growing movement typical of Web 2.0
communities.
The steve.museum research project is another
example which relies on cooperation between
museum professionals and other entities who believe
social tagging may provide new ways to describe
and access cultural object collections, besides
promoting visitor interaction.
According to Trant (2008), during the
implementation of the steve.museum project prototype,
an analysis of the tags assigned by ordinary
museum users showed that they did not match the terms
used by museum professionals. To narrow the gap
between professional language and everyday
language, social tagging was used as a promising
addition to museum records, since its terminology is
usable in some kinds of searches (although this
possibility still has to be verified by a large-scale
study) (Trant, 2008).
In fact, “[it] is still uncertain that [a] new
folksonomy will replace traditional hierarchy but
now that all users have the power to classify
according to their own language, research will never
be the same” (Dye, 2006).
Still, Trant (2008) reports that the general
opinion among museum professionals is that the
tags attributed by users may be interesting, even
though their pertinence may require validation.
However, self-normalization theories state that
folksonomic tags will self-regulate: the collective
vocabulary will become more consistent over time,
without the need for externally imposed control
(Trant, 2009).
The initiatives conducted in these two projects
demonstrate an awareness of the potential of the
collective intelligence that emerges from a
folksonomy.
3 k-means ALGORITHM
The k-means algorithm was the starting point for
this investigation, especially because of its simplicity
and efficiency (Feldman and Sanger, 2007;
Theodoridis and Koutroumbas, 2009). Its time
complexity per iteration is, in the worst case, O(kn)
for n documents and k clusters, but the number of
iterations is generally quite small.
The k-means algorithm (MacQueen, 1967)
partitions an initial set of documents
(each represented as a vector) into k
clusters. The algorithm starts by selecting k random
seeds and then calculates the distance from each
document to every seed, assigning each document to
its nearest seed. When all clusters are formed, the
new centroids are computed as the mean of the document
vectors in each cluster. Each document is then
reassigned to its nearest centroid. These two steps
are repeated until convergence is achieved, that is,
until no assignment changes.
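
The steps just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function names (kmeans, squared_distance, mean_vector), the max_iter and seed parameters, the use of squared Euclidean distance, and the handling of empty clusters are all assumptions, since the text does not fix them.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(documents, k, max_iter=100, seed=0):
    """Cluster document vectors into k groups.

    Returns (labels, centroids), where labels[i] is the index of the
    cluster assigned to documents[i].
    """
    rng = random.Random(seed)
    # Select k distinct documents at random as the initial seeds.
    centroids = [list(doc) for doc in rng.sample(documents, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every document to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(doc, centroids[j]))
            for doc in documents
        ]
        # Convergence: no assignment changed since the last iteration.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the documents assigned to it.
        for j in range(k):
            members = [d for d, lab in zip(documents, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[j] = mean_vector(members)
    return labels, centroids
```

Calling kmeans(vectors, k) on a list of equal-length numeric vectors returns one cluster label per document together with the final centroids. Each iteration computes one distance per document and per centroid, which corresponds to the O(kn) cost per iteration mentioned above.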
Despite its efficiency, the random choice of
seeds may lead to poor clusterings. To overcome
this weakness, Arthur and Vassilvitskii (2007)
proposed the k-means++ algorithm, which chooses the
seeds according to specific probabilities. The
resulting clustering is, in expectation,
O(log k)-competitive with the optimal one, and the
experimental results show a reduction in the number
of iterations until convergence is achieved. However,
the number of clusters must still be specified in
advance, and this parameter greatly
influences the quality of the formed clusters.
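
The seeding rule of Arthur and Vassilvitskii (2007) can be illustrated with a short Python sketch: the first seed is drawn uniformly at random, and each further seed is drawn with probability proportional to the squared distance between a document and its nearest already chosen seed. The names kmeanspp_seeds and squared_distance, the seed parameter, and the uniform fallback used when every document coincides with a chosen seed are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(documents, k, seed=0):
    """Choose k initial seeds with squared-distance weighting."""
    rng = random.Random(seed)
    # The first seed is drawn uniformly at random.
    seeds = [list(rng.choice(documents))]
    # d2[i]: squared distance from document i to its nearest chosen seed.
    d2 = [squared_distance(doc, seeds[0]) for doc in documents]
    while len(seeds) < k:
        if sum(d2) == 0:
            # Assumption: if every document coincides with a seed,
            # fall back to a uniform draw.
            idx = rng.randrange(len(documents))
        else:
            # Draw a document with probability proportional to d2.
            idx = rng.choices(range(len(documents)), weights=d2, k=1)[0]
        seeds.append(list(documents[idx]))
        # Update the nearest-seed distances with the new seed.
        d2 = [
            min(old, squared_distance(doc, seeds[-1]))
            for old, doc in zip(d2, documents)
        ]
    return seeds
```

Documents that lie far from every seed chosen so far are more likely to be picked next, so the seeds tend to be spread across the collection. The returned seeds can then replace the purely random seeds in the first step of k-means.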