VISUALIZATION OF AND RETRIEVAL OF BACKGROUND
INFORMATION RELATING TO WORDS IN WEB DOCUMENTS
A Visualization Interface based on Vector Representation
Kouji Shimatsuka and Tatsuhiro Yonekura
Graduate School of Science and Engineering, Ibaraki University, Ibaraki, Japan
Keywords: Visualization, Multimedia, User Interface, Text Mining, Vector Space Model.
Abstract: When people encounter unfamiliar words, they often use tools such as search engines to obtain background
information on these words. However, the semantic content of words can be complex, and it is not always
possible to understand the meaning of words from textual information alone. In this paper we quantify the
semantic content of words by means of a simple and convenient text-based method whereby the semantic
content is constructed from linguistic, visual and auditory characteristic values. Using characteristic vectors
generated in this way, users are able to visually check and search for background information on unfamiliar
terms in a web document.
1 INTRODUCTION
The Web was originally used for the exchange of
text-based information, but with the growth of the
Internet, a vast and rich collection of multimedia
information such as images, music and video content
is now available online. As a result, the Web has
become an extremely useful tool for looking up the
meaning of words. When people encounter
unfamiliar words, they often use tools such as search
engines to obtain background information on these
words. However, the semantic content of words can
be complex, and it is not always possible to
understand the meaning of words from textual
information alone. In this paper we quantify the
semantic content of words by means of a simple and
convenient text-based method whereby the semantic
content is constructed from linguistic, visual and
auditory characteristic values. Using characteristic
vectors generated in this way, users are able to
visually check and search for background
information on unfamiliar terms in a web document.
2 RELATED RESEARCH
Semantic visualization is currently an active area
of research.
Words can be related in many different ways:
parent-child and sibling relationships can be
determined in some cases, and counterfactual
relationships are exhibited in others. However,
most existing techniques concentrate only on
linguistic characteristics. This is because it is
generally not possible to handle the weighting of
words together with the weighting of images and
audio in a simple, systematic manner.
3 VECTOR SPACE MODEL
A vector space model (Salton and McGill, 1983) is a
search model in which a document is represented by
a vector whose elements are the weightings applied
to the search terms.
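The representation described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses raw term counts as the weightings; the function names, the sample documents and the use of cosine similarity to compare vectors are illustrative assumptions, not part of the method proposed in this paper.

```python
from collections import Counter
import math


def build_vocabulary(documents):
    """Return a sorted list of every distinct term across the documents."""
    return sorted({term for doc in documents for term in doc.split()})


def to_vector(document, vocabulary):
    """Represent a document as a vector with one weight per vocabulary term.

    Assumption: the weight is simply the raw count of the term.
    """
    counts = Counter(document.split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents, used only for illustration.
docs = ["web image search", "web music search", "vector space model"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

In this sketch, documents that share terms produce vectors that point in similar directions, while documents with no terms in common produce orthogonal vectors. Replacing the raw counts with other weightings changes only `to_vector`.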
3.1 Weighting of Indexing Terms
3.1.1 TF-IDF
In this paper, weightings are determined by using the
TF-IDF method, which is widely used for the
weighting of indexing terms. The frequency at
which an indexing term t
i
appears in a set of
documents d
j
is called the term frequency (TF) and
is expressed as tf
ij
. The inverse document frequency
(IDF) is used to express the specificity of these
terms. The inverse document frequency idf is
defined as follows: