AUTOMATIC MULTILINGUAL LEXICON GENERATION USING
WIKIPEDIA AS A RESOURCE
Ahmad R. Shahid and Dimitar Kazakov
Department of Computer Science, The University of York, YO10 5DD, York, U.K.
Keywords:
Multilingual Lexicons, Web Crawler, Wikipedia, Natural Language Processing, Web Mining, Data Mining.
Abstract:
This paper proposes a method for creating a multilingual dictionary by taking the titles of Wikipedia pages
in English and then finding the titles of the corresponding articles in other languages. The creation of such
multilingual dictionaries has become possible as a result of the exponential increase in the amount of multilingual
information on the web. Wikipedia is a prime example of such a multilingual source of information on any
conceivable topic, edited by its readers. Here, a web crawler has been used to traverse
Wikipedia, following the links on a given page. The crawler extracts the title of each page along with the titles of the
corresponding pages in the other target languages. The result is a set of words and phrases that are translations
of each other. For efficiency, the URLs are organized using hash tables. A lexicon has been constructed which
contains 7-tuples corresponding to 7 different languages, namely: English, German, French, Polish, Bulgarian,
Greek and Chinese.
1 INTRODUCTION
The main goal of this project is to draw attention to, and demonstrate the feasibility of, creating a multilingual dictionary using Wikipedia. Here, English, German, French and Polish were chosen for their wealth of information, and another three languages were added to demonstrate that our program can also handle different writing systems and alphabets: Greek, Bulgarian and Chinese in this case. The technique can be
applied to a number of other online resources where
versions of the same article appear in different lan-
guages; one such example is the Southeast European
Times news site (http://www.setimes.com/). In this
case, and many others, the use of crawlers is unavoid-
able as an off-line version of the resource is not at
hand.
2 LITERATURE REVIEW
There have been efforts in the past to build multi-
lingual dictionaries with varying degrees of success,
and the ones we know of are only extensions of bilin-
gual dictionaries already available. Yet none of them
had tried to use Wikipedia as a potential source of lexical information. Only very recently have there been attempts to tap into the multilingual dimension of Wikipedia, which has been used to identify named entities (Richman and Schone, 2008).
Lafourcade (1997) carried out two multilingual
dictionary construction projects: French-English-Malay (FeM),
and French-English-Thai (FeT). The FeM data was
created by crossing French-English and English-Malay lexical resources, with French as the source language. For
each word in French, one or several meanings in the
target language were grouped together in so-called
blocks. Thus there were equivalent blocks for En-
glish, Malay and Thai. In the case that more than
one translation existed, one entry was restricted to just
one meaning, with extra entries for extra meanings.
Sometimes only one entry was used even in the case
of several alternative meanings, leaving it to the dis-
cretion of the lexicographers to decide which alterna-
tive meaning to use. Two kinds of dictionaries were
targeted: the general dictionary with about 20,000 en-
tries and a Computer Science domain specific dictio-
nary with 5,000 entries.
Boitet et al. (2002) worked on the PAPIL-
LON project. The project covered seven different
languages: English, French, Japanese, Thai, Lao,
Vietnamese, and Malay. They started with open
source data, known as “raw dictionaries”. Some of
them were monolingual: 4,000 French entries from
UdM, and 10,000 Thai entries from Kasetsart Univer-
sity; some were bilingual: 70,000 Japanese-English
entries, plus 10,000 Japanese-French entries in J.
Breen’s JDICT XML format, 8,000 Japanese-Thai en-
tries in SAIKAM XML format, and 120,000 English-
Japanese entries in KDD-KATE LISP format. Finally,
there were 50,000 French-English-Malay entries in
FeM XML format. The authors defined a macrostruc-
ture, with a set of monolingual dictionaries of word
senses, called “lexies” linked together through a set
of interlingual links, called “axies”. In the next step
the “raw dictionaries” were transformed into a “lexical soup” in the intermediate DML format (Mangeot-Lerebours, 2001), which comprised the XML schema and the namespace. The star-like structure
thus created made it easier to add new languages.
Breen (2004) built a multilingual dictionary, JM-
dict, with Japanese as the pivot language and trans-
lations in several other languages. The project was
an extension of an earlier Japanese-English dictio-
nary project (EDICT: Electronic Dictionary) (Breen,
1995), which began in the early 1990s and grew to 50,000 entries by the late 1990s. Yet its structure was found
to be inadequate to represent the orthographical com-
plexities of the language, as many Japanese words can
be written with alternative kanji and kana and may
have alternative pronunciations. Kanji came from ancient Chinese, and kana was derived from it; in modern use they are used for different parts of speech. For French translations they
used two projects: 17,500 entries from Dictionnaire
français-japonais (Desperrier, 2002) and 40,500 en-
tries from French-Japanese Complementation Project
at http://francais.sourceforge.jp/. For German trans-
lations they used WaDokuJT Project (Apel, 2002).
XML (Extensible Markup Language) was used to
format the file on account of the flexibility it pro-
vides. The JMdict XML structure contained an <entry> element, which in turn contained the sequence number, the kanji word, the kana word, information, and translation information. The translation
part consisted of one or more sense elements. The
combining rules were used to weed out unnecessary
entries. In short, the rule stated: treat each entry as
a triplet of kanji, kana and senses; if for any two or
more entries, two or more members of the triplet are
the same, combine them into one entry. Thus if the
kanji and kana in different entries are included as al-
ternative forms, and if they differ in sense, they are
included as polysemous words. The entry also stored
information regarding the meanings of the word in
different languages. The JMdict file contained over
99,300 entries in both English and Japanese, while
83,500 keywords/phrases had German translations,
58,000 had French translations, 4,800 had Russian
translations, and 530 had Dutch translations. A set
of 4,500 Spanish translations was being prepared.
3 MULTILINGUAL LEXICON
GENERATION
Since its humble beginnings in 2001, Wikipedia has
emerged as a huge online resource attracting over
684 million visitors yearly by 2008. There are more
than 75,000 active contributors working on more
than 10,000,000 articles in more than 250 languages
(Wikipedia, August 3, 2008). Each Wikipedia page
has links to pages on the same topic in other languages. The titles of these pages were extracted and combined in the form of 7-tuples, which are the entries in the lexicon, each detailing a word in English and its translations in the six other languages. The
aim was to extract as many such 7-tuples as possible.
3.1 Web Crawler
A web crawler is a computer program that follows
links on web pages to automatically collect data (hy-
pertext) from the web. We use it here to move from
one Wikipedia article to another, collecting the above
mentioned tuples of word/phrase translations in the
process.
Our version of the web crawler takes the starting
page as an input from the user. It visits the given
page, and extracts all the links on that page and ap-
pends them to a list. Then it repeats the process for
each link collected earlier, and visits them one by one,
extracting the links and once again appending them
to the list. Putting new links at the end (i.e., making the list a queue) ensures that the search method adopted is Breadth First Search (BFS). In our context, BFS explores a number of related concepts consecutively, whereas Depth First Search (DFS) would drift off any given topic. There may be technical as-
pects related to the use of memory by each approach
but we will not discuss them here.
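The paper does not state the implementation language of the crawler; as a rough, hedged sketch of this breadth-first traversal in Python, the loop below keeps a FIFO queue of article URLs and a set of visited pages. The link pattern and the use of urllib are simplifications for illustration, not the actual crawler code.

```python
import re
import urllib.request
from collections import deque

# Simplified pattern for internal article links; the real Wikipedia markup is richer.
ARTICLE_LINK = re.compile(r'href="(/wiki/[^":#?]+)"')

def bfs_crawl(start_url, max_pages=100):
    """Breadth-first traversal of Wikipedia articles starting from start_url."""
    queue = deque([start_url])      # FIFO queue => breadth-first order
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8")
        except OSError:
            continue                # skip pages that cannot be fetched
        for path in ARTICLE_LINK.findall(html):
            # Appending at the tail keeps related topics close together (BFS);
            # a stack (DFS) would instead drift away from the starting topic.
            queue.append("https://en.wikipedia.org" + path)
    return visited
```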
The BFS approach was used in the following ex-
periments. With the BFS capability thus incorpo-
rated, other lists were defined that would keep track of
all the web pages that have already been visited, thus keeping the code from revisiting them and repeatedly extracting the same 7-tuples into the lexicon. Apart from ensuring that there was no redundancy within the lexicon, entries were also rejected if they were purely numeric or had a null entry for any of the seven languages.
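The filtering just described can be read as a simple predicate over candidate entries; the sketch below assumes (our assumption, not the paper's data structure) that each candidate is a 7-tuple of title strings, with an empty string or None marking a missing language version.

```python
def is_valid_entry(entry, n_languages=7):
    """Reject candidate 7-tuples with missing or purely numeric titles."""
    if len(entry) != n_languages:
        return False
    if any(not title for title in entry):              # null entry for some language
        return False
    if any(title.replace(" ", "").isdigit() for title in entry):
        return False                                   # purely numeric titles (e.g. years)
    return True

def build_lexicon(candidates):
    """Keep only valid entries and drop duplicates."""
    return sorted(set(e for e in candidates if is_valid_entry(e)))
```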
The program picks up a URL from the top of the
queue, expands the URL list by exploring new
links on the given URL, and then extracts titles in the given languages.

Figure 1: A Snapshot of the Lexicon.

An essential first step was to sep-
arate the tracking of the URLs from the code itself.
Thus a database in Access was created, comprising
28 different tables. One table stored all the URLs to
be visited, and the rest implemented a hash table, to
store the already visited URLs, indexed by the first
character of the page title. 26 different tables were
created for the 26 different letters in the English al-
phabet. Another table was created, URLExtra, which
would store all the URLs with Wikipedia page titles
starting with anything but English letters, including
numbers.
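A rough in-memory analogue of these 27 hash-table buckets (26 letters plus URLExtra), with Python sets standing in for the Access tables; this is an illustration only, since the actual implementation uses a database.

```python
from string import ascii_uppercase

# One bucket per English letter, plus "URLExtra" for titles starting with
# anything else (digits, non-Latin characters, punctuation).
visited_tables = {letter: set() for letter in ascii_uppercase}
visited_tables["URLExtra"] = set()

def bucket_name(title):
    first = title[:1].upper()
    return first if first in visited_tables else "URLExtra"

def mark_visited(title):
    visited_tables[bucket_name(title)].add(title)

def already_visited(title):
    return title in visited_tables[bucket_name(title)]
```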
The program starts with a user-provided URL,
then looks for potential URLs to be used for extrac-
tion of titles. These URLs are added to the queue. For
each page that is visited and removed from the queue,
the URLs of several pages (contained in that page)
are typically added. Thus the number of URLs to be
searched rises quickly and might yield a huge list. In
order to avoid such a scenario and keep the size of
the URL list within reasonable limits, an upper limit
was set on the number of URLs held at a time. Similarly, a lower limit was set so that the program could start looking for more URLs before it ran out of them. The lower and upper limits were set to 50 and 1,000 respectively. Thus, barring the exhaustion of all potential URLs, the program would never run out of URLs to search.
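Read as pseudocode, the two limits (50 and 1,000, as reported above) might interact as in the sketch below; expand_links and extract_titles are hypothetical helpers standing in for the link-expansion and title-extraction steps described earlier, not functions from the paper.

```python
LOWER_LIMIT = 50     # start looking for more URLs below this point
UPPER_LIMIT = 1000   # never let the list of pending URLs grow beyond this

def crawl_once(queue, expand_links, extract_titles):
    """Process one pending URL, topping up the queue only when it runs low."""
    url = queue.popleft()
    entry = extract_titles(url)          # the 7-tuple for this page, if complete
    if len(queue) < LOWER_LIMIT:
        for link in expand_links(url):
            if len(queue) >= UPPER_LIMIT:
                break                    # respect the upper bound on pending URLs
            queue.append(link)
    return entry
```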
3.2 Results
Figure 1 shows a snapshot of the lexicon in a table. UTF-8 was used as the character encoding, which makes it possible to write the characters of the non-English languages.
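A minimal sketch of writing such entries out in UTF-8; the tab-separated layout and the language codes are illustrative assumptions, not the paper's actual file format.

```python
LANGUAGES = ["en", "de", "fr", "pl", "bg", "el", "zh"]

def save_lexicon(entries, path="lexicon.tsv"):
    """Write one 7-tuple per line, tab-separated, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\t".join(LANGUAGES) + "\n")
        for entry in entries:
            out.write("\t".join(entry) + "\n")

# usage: save_lexicon(lexicon_entries), where lexicon_entries is a list of 7-tuples of titles
```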
Figure 2: The English Histogram (number of words per entry).
A total of 5 runs were carried out, yielding 8,748 entries, of which 5,006 were unique (57% of the total). In order to get that many entries, the program visited 726,715 English-language articles, not every one of which was unique: despite the checks on revisiting pages, some of them were re-visited due to a bug in the code. Only a little more than 1% of the visited articles had corresponding pages in all the other six languages. The crawler nonetheless had to visit more than a quarter of all Wikipedia articles in English.
Using the least-prolific language as the pivot language would have saved us a lot of time in searching for lexical entries, since most pages in the least-frequently occurring language would probably have corresponding pages in the other languages; English was nevertheless chosen as the pivot language. The reason is that English, with its very familiar alphabet, made building the hash tables straightforward. Also, we were more interested in English
in the first instance, and the other languages were incorporated at a later stage.
Looking at Figure 2, one can see that unigrams make up the bulk of entries (2,485, almost 50% of the total), followed by bigrams (1,659, 33% of the total).
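The counts behind Figure 2 can be reproduced by measuring the length, in space-separated words, of each English entry; a small sketch, assuming simple whitespace tokenisation:

```python
from collections import Counter

def length_histogram(english_titles):
    """Map n (words per title) to the number of entries of that length."""
    return Counter(len(title.split()) for title in english_titles)

# e.g. length_histogram(["fire fighter", "rock music", "Wikipedia"])
# -> Counter({2: 2, 1: 1})
```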
In terms of semantics, the resulting lexicon is a
mix of toponyms, names of famous people, names of
languages, and general concepts, such as “rock mu-
sic” and “fire fighter”, among others.
4 USES OF A MULTILINGUAL
DICTIONARY
A multilingual dictionary of this kind can help lexicographers build traditional dictionaries, serving as a starting point for their work. Another im-
portant use is for Cross-Lingual Information Retrieval
(CLIR). Pirkola (1998) compared the performance of
translated Finnish queries against English documents
to the performance of original English queries against
English documents, by using a general dictionary and
a domain specific dictionary (a medical dictionary in
this case). It was found that a cross-lingual IR sys-
tem based on Machine Readable Dictionary (MRD)
translation was able to achieve the performance level
of monolingual IR.
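In its simplest dictionary-based form, query translation replaces each source-language term with a target-language equivalent wherever the lexicon covers it; the sketch below merely illustrates that idea (it is not Pirkola's method), assuming the lexicon has been loaded into a dictionary from English terms to per-language translations.

```python
def translate_query(query, lexicon, target="de"):
    """Word-by-word query translation; terms missing from the lexicon pass through."""
    out = []
    for term in query.lower().split():
        translations = lexicon.get(term, {})
        out.append(translations.get(target, term))
    return " ".join(out)

# e.g. translate_query("rock music", {"rock": {"de": "Rock"}, "music": {"de": "Musik"}})
# -> "Rock Musik"
```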
The lexicon, containing more than 5,000 entries, is available at http://www-users.cs.york.ac.uk/ahmad/index.htm for free use under the GNU Free Documentation License.
5 FUTURE WORK
A useful extension would be to create domain-specific dictionaries based on the categories defined within Wikipedia, where each article belongs to one or more categories. A domain could be defined by a set of categories, and only the articles belonging to it would then be used to build the lexicon for that particular domain, as sketched below.
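A sketch of that filtering step, under the assumption that the crawler were extended to record, for each 7-tuple, the Wikipedia categories of its English article (something the current version does not do):

```python
def domain_lexicon(entries_with_categories, domain_categories):
    """Keep the entries whose English article falls in at least one domain category."""
    domain = set(domain_categories)
    return [entry for entry, categories in entries_with_categories
            if domain & set(categories)]

# usage: domain_lexicon(tagged_entries, ["Music genres"])
```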
ACKNOWLEDGEMENTS
This research was carried out in the period November 2007 to July 2008, and was partly sponsored by the Higher
Education Commission in Pakistan.
REFERENCES
Apel, U. (2002). WaDokuJT - A Japanese-German Dictio-
nary Database. In Papillon 2002 Seminar, Tokyo.
Boitet, C., Mangeot-Lerebours, M., and Serasset, G. (2002).
The PAPILLON Project: Cooperatively Building
a Multilingual Lexical Data-base to Derive Open
Source Dictionaries & Lexicons. In Proceedings of
the 2nd Workshop NLPXML 2002, Post COLING 2002
Workshop, Taipei.
Breen, J. (1995). Building an Electronic Japanese-English
Dictionary. In Japanese Studies Association of Aus-
tralia Conference.
Breen, J. (2004). JMdict: a Japanese-Multilingual Dictio-
nary. In Coling 2004 Workshop on Multilingual Lin-
guistic Resources, pages 71–78, Geneva.
Desperrier, J.-M. (2002). Analysis of the Results of a
Collaborative Project for the Creation of a Japanese-
French Dictionary. In Papillon 2002 Seminar, Tokyo.
Lafourcade, M. (1997). Multilingual Dictionary Construc-
tion and Services Case Study with the Fe* Projects. In
Proc. PACLING’97, pages 173–181.
Mangeot-Lerebours, M. (2001). Environnements Centralisés et Distribués pour Lexicographes et Lexicologues en Contexte Multilingue. PhD thesis, Université Joseph Fourier.
Pirkola, A. (1998). The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross-Language
Information Retrieval. In Proceedings of the 21st
annual international ACM SIGIR conference on Re-
search and development in information retrieval,
pages 55–63, Melbourne.
Richman, A. and Schone, P. (2008). Mining Wiki
Resources for Multilingual Named Entity Recogni-
tion. In Proceedings of ACL-08: HLT, pages 1–9,
Columbus, Ohio, USA.