which applies Dice’s coefficient to Rand’s method
in order to overcome some defects of Rand’s method.
Of the three types of web search (Broder, 2002),
we chose queries from the informational class,
which is regarded as the most suitable one for
blog search (Gilad and Maarten, 2006). We
chose one query each from the movie, music, and book
categories. First, ‘X-Japan,’ the name of one of the
most famous rock bands in Japan, is the query from
music. Second, ‘Cha T.H.,’ one of the most famous
actors in Korea, is the query from movies. Lastly,
‘Ekuni Gaori,’ one of the most famous Japanese
novelists, is the query from books.
We tested 50 sets of blog data retrieved from the
Naver search engine, without any editing.
Although our prototype system could run its
algorithm on all blog pages on the web, we
limited this test to 50 pages; we will test more
pages later to obtain a more reliable result. Before
the prototype system performed its task, we built an
ideal clustered set by hand. After the prototype
system created its result set, we compared it with
the ideal set using CSIM.
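The comparison with the hand-made ideal set can be illustrated with a small sketch. It assumes, based only on the description of CSIM as Dice's coefficient applied to Rand's pair-based method, that the score is computed over pairs of items that share a cluster; the exact formula used in the evaluation may differ. The function names and the toy data below are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered item pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair one canonical (smaller, larger) form.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_dice_similarity(result, ideal):
    """Dice's coefficient over the co-clustered pairs of two clusterings.

    Assumption: this is only one plausible reading of CSIM, not the
    confirmed definition used in the paper.
    """
    result_pairs = co_clustered_pairs(result)
    ideal_pairs = co_clustered_pairs(ideal)
    total = len(result_pairs) + len(ideal_pairs)
    if total == 0:
        # Both clusterings contain only singletons: treat them as identical.
        return 1.0
    return 2 * len(result_pairs & ideal_pairs) / total


# Hypothetical toy data: six blog posts identified by integer ids.
ideal = [{1, 2, 3}, {4, 5}, {6}]
result = [{1, 2}, {3, 4, 5}, {6}]
score = pairwise_dice_similarity(result, ideal)
print(round(score, 3))
```

On this toy data each clustering has four co-clustered pairs and two of them are shared, so the score is 0.5; two identical clusterings score 1.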
Table 1: The evaluation result.
Query word    X-Japan   Cha T.H.   Ekuni Gaori
CSIM value    0.857     0.711      0.805
Since the CSIM values, which range from 0.711 to
0.857, are fairly close to 1, we conclude that the
prototype system is successful in clustering blog
information. Although the test was performed on
only 50 sets of blog data, the system grouped the
data that should be grouped, so we expect the
results to become more accurate as the example set
grows. We also expect that this system will be able
to offer useful and distinctive information to users
and to companies that want to know the public’s
response to their products or image.
5 CONCLUSIONS
In this paper, we discussed a blog search algorithm
that considers the characteristics of blog content,
based on the assumption that the resulting blog
classification can provide more valuable information
to users. We also built a simple prototype to
evaluate our algorithm. For this system, we tried to
identify the features of blogs and the problems of
general search engines, and then to find a solution
that could solve those problems to some extent. We
decided to use the concept of K-means as the
classification method, and developed our own
algorithm to adapt K-means to blog information. As
shown in section 4, our algorithm and system
provide certain benefits to users through clustered
groups. They may not satisfy all users, but they can
give users additional useful data and suggest a
new approach in the blog search engine field.
Several issues remain for future research. As
described in section 2.2, there were three important
issues in building an algorithm with K-means, and we
do not think that the solution suggested in this
paper is the only possible one. We will therefore
look for a better solution, one that extracts a
better weight from the blog and chooses a better K
and critical point. In addition, we will study other
classification methods that may fit blog search
more closely. Finally, a variety of search algorithms
and methods are now used in search engines. Since
our final goal is to present the best blog search
algorithm, we will also study other search
mechanisms, including classification.
ACKNOWLEDGEMENTS
This research was financially supported by the
Ministry of Knowledge Economy (MKE) and the Korea
Industrial Technology Foundation (KOTEF) through the
Human Resource Training Project for Strategic Technology.
REFERENCES
Aixin, S., Maggy, S., Ying, L. 2007. Blog Classification
Using Tags: An Empirical Study. In ICADL 2007.
Bloglines: http://www.bloglines.com/.
Blogpulse: http://www.blogpulse.com/.
BLOGRANGER: http://ranger.labs.goo.ne.jp/.
BlogWatcher: http://blogwatcher.pi.titech.ac.jp/.
Broder, A. 2002. A Taxonomy of Web Search. In SIGIR
Forum.
Chung, Y.M., Lee, J.Y. 2001. A corpus-based approach to
comparative evaluation of statistical term association
measures. In J. of the American Society for
Information Science and Technology.
Fujiki, T., Nanno, T., Suzuki, Y., Okumura, M. 2004.
Identification of Bursts in a Document Stream. In First
International Workshop on Knowledge Discovery
2004.
Fujimura, K., Toda, H., Inoue, T., Hiroshima, N., Kataoka,
R., Sugizaki, M. 2006. BLOGRANGER – A multi-faceted
Blog Search Engine. In WWW 2006.
Gilad, M., Maarten, R. 2006. A Study of Blog Search. In
ECIR 2006. LNCS 3936.
Google: http://www.google.com/.
Kumar, R., Novak, J., Raghavan, P., Tomkins, A. 2003.
On the bursty evolution of blogspace. In WWW’03:
ICEIS 2009 - International Conference on Enterprise Information Systems