UdM, and 10,000 Thai entries from Kasetsart Univer-
sity; some were bilingual: 70,000 Japanese-English
entries, plus 10,000 Japanese-French entries in J.
Breen's JDICT XML format, 8,000 Japanese-Thai entries
in SAIKAM XML format, and 120,000 English-
Japanese entries in KDD-KATE LISP format. Finally,
there were 50,000 French-English-Malay entries in
FeM XML format. The authors defined a macrostructure,
with a set of monolingual dictionaries of word
senses, called "lexies", linked together through a set
of interlingual links, called "axies". In the next step,
the "raw dictionaries" were transformed into a "lexical
soup" in the intermediate DML format (Mangeot-Lerebours,
2001), which comprised an XML schema and
a namespace. The star-like structure thus created made
it easier to add new languages.
Breen (2004) built a multilingual dictionary, JM-
dict, with Japanese as the pivot language and trans-
lations in several other languages. The project extended
an earlier Japanese-English dictionary project,
EDICT (Electronic Dictionary) (Breen, 1995), which
began in the early 1990s and grew to 50,000 entries
by the late 1990s. Yet its structure was found
to be inadequate to represent the orthographical com-
plexities of the language, as many Japanese words can
be written with alternative kanji and kana and may
have alternative pronunciations. Kanji characters came
from ancient Chinese, and the kana syllabaries were
derived from them; in modern usage the two scripts
mark different parts of speech. For French translations they
used two projects: 17,500 entries from Dictionnaire
français-japonais (Desperrier, 2002) and 40,500 en-
tries from French-Japanese Complementation Project
at http://francais.sourceforge.jp/. For German trans-
lations they used the WaDokuJT Project (Apel, 2002).
XML (Extensible Markup Language) was used to
format the file on account of the flexibility it pro-
vides. The JMdict XML structure contained an ele-
ment type, <entry>, which in turn contained the sequence
number, the kanji word, the kana word, general
information, and translation information. The translation
part consisted of one or more sense elements.
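For illustration, a simplified entry might look as follows (the element names follow the published JMdict DTD; the particular word and sequence number are invented for this example):

    <entry>
      <ent_seq>1000001</ent_seq>
      <k_ele><keb>入り口</keb></k_ele>        <!-- kanji form -->
      <r_ele><reb>いりぐち</reb></r_ele>      <!-- kana reading -->
      <sense>
        <gloss>entrance; entry</gloss>
        <gloss xml:lang="fre">entrée</gloss>
      </sense>
    </entry>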
The combining rules were used to weed out unnecessary
entries. The rule stated, in short: treat each entry as
a triplet of kanji, kana, and senses; if, for any two or
more entries, two or more members of the triplet are
the same, combine them into one entry. Thus, when
entries are merged, differing kanji and kana are included
as alternative forms, while entries that differ in
sense are recorded as senses of a polysemous
word. The entry also stored
information regarding the meanings of the word in
different languages. The JMdict file contained over
99,300 entries in both English and Japanese, while
83,500 keywords/phrases had German translations,
58,000 had French translations, 4,800 had Russian
translations, and 530 had Dutch translations. A set
of 4,500 Spanish translations was being prepared.
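As a minimal sketch of the combining rule described above (the triplet representation comes from the text; the greedy merging strategy in Python is our assumption, not Breen's actual implementation):

    # Each raw entry is a (kanji, kana, senses) triplet; senses is a
    # tuple of glosses so that the member is hashable and comparable.
    def combine(entries):
        merged = []  # each merged entry keeps sets of alternative members
        for kanji, kana, senses in entries:
            for e in merged:
                # Count how many of the three members match.
                same = sum([kanji in e["kanji"],
                            kana in e["kana"],
                            senses in e["senses"]])
                if same >= 2:                # two or more agree: combine
                    e["kanji"].add(kanji)    # differing kanji: alt. form
                    e["kana"].add(kana)      # differing kana: alt. form
                    e["senses"].add(senses)  # differing sense: polysemy
                    break
            else:
                merged.append({"kanji": {kanji}, "kana": {kana},
                               "senses": {senses}})
        return merged

For example, ("入口", "いりぐち", ("entrance",)) and ("入り口", "いりぐち", ("entrance",)) agree on kana and senses, so they collapse into a single entry with two alternative kanji forms.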
3 MULTILINGUAL LEXICON
GENERATION
Since its humble beginnings in 2001, Wikipedia has
emerged as a huge online resource attracting over
684 million visitors yearly by 2008. There are more
than 75,000 active contributors working on more
than 10,000,000 articles in more than 250 languages
(Wikipedia, August 3, 2008). Each Wikipedia page
has links to pages on the same topic in other languages.
These interlanguage links can be extracted and
combined in the form of 7-tuples, which
are entries in the lexicon, detailing a word in English
and its translations in the six other languages. The
aim was to extract as many such 7-tuples as possible.
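A single lexicon entry can thus be pictured as follows (the choice and order of the six translation languages here are purely illustrative):

    # One lexicon entry: an English headword plus six translations.
    entry = ("moon",   # English
             "lune",   # French
             "Mond",   # German
             "luna",   # Spanish
             "луна",   # Russian
             "maan",   # Dutch
             "月")     # Japanese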
3.1 Web Crawler
A web crawler is a computer program that follows
links on web pages to automatically collect data (hypertext)
from the internet. We use it here to move from
one Wikipedia article to another, collecting the above-mentioned
tuples of word/phrase translations in the
process.
Our version of the web crawler takes the starting
page as an input from the user. It visits the given
page, extracts all the links on that page, and appends
them to a list. Then it repeats the process for
each link collected earlier, and visits them one by one,
extracting the links and once again appending them
to the list. Appending new links at the end (i.e., making
the list a queue) ensures that the search method adopted
is Breadth-First Search (BFS). In our context, following
BFS will explore a number of related concepts
consecutively, while Depth-First Search (DFS) would
drift away from any given topic. There are also technical
aspects related to the memory use of each approach,
but we will not discuss them here.
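A minimal sketch of such a breadth-first crawler is given below (the Wikipedia URL pattern, the page limit, and the use of Python's standard library are simplifying assumptions, not the exact implementation used in the experiments):

    import re
    from collections import deque
    from urllib.request import urlopen

    def crawl(start_url, max_pages=100):
        """Visit pages in breadth-first order: the FIFO queue makes the
        crawler finish all links of one page before going deeper."""
        queue = deque([start_url])
        visited = set()
        while queue and len(visited) < max_pages:
            url = queue.popleft()          # FIFO pop gives BFS order
            if url in visited:
                continue
            visited.add(url)
            html = urlopen(url).read().decode("utf-8", errors="ignore")
            # Collect internal article links and append them to the queue.
            for path in re.findall(r'href="(/wiki/[^":#?]+)"', html):
                queue.append("https://en.wikipedia.org" + path)
        return visited

Replacing popleft() with pop() (turning the queue into a stack) would yield the depth-first behaviour discussed above.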
The BFS approach was used in the following ex-
periments. With the BFS capability thus incorpo-
rated, other lists were defined that would keep track of
all the web pages that had already been visited, thus
keeping the code from revisiting them and repeatedly
extracting the same 7-tuples into the lexicon. Apart
from ensuring that there was no redundancy within
the lexicon, the code also discarded tuples that were
either purely numeric or had a null entry
for any of the seven languages.
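The bookkeeping just described might be sketched as follows (the exact predicates, e.g. what counts as "purely numeric", are our assumptions):

    def accept(tuple7, seen):
        """Admit a 7-tuple into the lexicon only if it is new, has no
        null entry, and is not purely numeric."""
        if tuple7 in seen:
            return False        # already extracted: redundant
        if any(not word for word in tuple7):
            return False        # null entry for some language
        if all(word.isdigit() for word in tuple7):
            return False        # purely numeric entry
        seen.add(tuple7)
        return True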
The program picks up a URL from the top of the
queue, expands it in terms of URLs by exploring new