Effects of Comparable Corpora on Cross-language
Information Retrieval
Fatiha Sadat
University of Quebec in Montreal, Computer Science Department
201 President Kennedy avenue, Montreal, QC, Canada
Abstract. This paper presents an approach to learning bilingual terminology
from scarce resources in order to translate and expand terms from a source
language to a target language and to retrieve documents across languages. A
bilingual lexicon extracted from comparable corpora provides a valuable
resource for enriching existing bilingual dictionaries and thesauri. A linear
combination of the bilingual terminology extracted from comparable corpora,
readily available bilingual dictionaries and transliteration is proposed for
Cross-Language Information Retrieval. An application to the Japanese-English
language pair shows that the proposed combination yields better translations
and improves the effectiveness of information retrieval across languages.
1 Introduction
Large text corpora represent a crucial resource for bilingual terminology acquisition
and for the enrichment of multilingual lexical resources. Moreover, in recent years
non-aligned comparable corpora have been the object of studies and research related to
natural language processing and information retrieval (Dagan and Itai 1994; Dejean et al.
2002; Diab and Finch 2000; Fung 2000; Koehn and Knight 2002; Nakagawa 2000;
Peters and Picchi 1995; Rapp 1999; Shahzad et al. 1999; Tanaka and Iwasaki
1996), because of their availability and easy accessibility through the World Wide
Web.
In the present paper, our goal is to learn translation lexicons using scarce
resources, i.e. readily available resources, possibly obtained through the Internet.
We are concerned with exploiting news articles as comparable corpora in order to
translate terms in a source language into any specified target language. Our
preliminary study is conducted on the Japanese-English language pair using
general-domain comparable corpora and could be extended to other languages and
domains. Evaluations were conducted on Cross-Language Information Retrieval (CLIR)
using NTCIR¹, a large-scale test collection, for the (Japanese, English) language
pair. CLIR consists of retrieving documents written in one language using query
terms in another language.
¹ http://research.nii.ac.jp/ntcir/
Sadat F.
Effects of Comparable Corpora on Cross-Language Information Retrieval.
DOI: 10.5220/0003029200530059
In Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science (ICEIS 2010), page
ISBN: 978-989-8425-13-3
Copyright © 2010 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
The remainder of the present paper is organized as follows: Section 2 presents an
overview of the proposed approach for bilingual terminology acquisition from
comparable corpora. The linear combination with dictionary-based translation and
transliteration is presented in Section 3. Experiments and evaluations in CLIR are
discussed in Section 4. Section 5 concludes the present paper.
2 An Overview of the Proposed Approach on Comparable
Corpora
Unlike parallel texts, which are clearly defined as translated texts, monolingual
data vary widely in their degree of non-parallelness. This variation can be
manifested in the topic, the domain, the authors, the time period, etc. Comparable
corpora are collections of texts in two or more languages that can be contrasted
because of their common features. We rely on such comparable corpora for the
extraction of bilingual terminology in order to enrich existing bilingual
dictionaries and thesauri and to retrieve documents across different languages.
In the present study, we follow the model proposed by Dejean et al. (2002), Fung
(2000) and Rapp (1999). First, word frequencies and context word frequencies in
surrounding positions (here a three-word window) are estimated following
statistics-based metrics. Context vectors are then constructed for each term in the
source language and the target language. We use the log-likelihood ratio
(Dunning 1993), as expressed in equation (1):
$$
\mathrm{LLR}(w_i, w_j) = K_{11}\log\frac{K_{11}N}{C_1 R_1} + K_{12}\log\frac{K_{12}N}{C_1 R_2} + K_{21}\log\frac{K_{21}N}{C_2 R_1} + K_{22}\log\frac{K_{22}N}{C_2 R_2} \tag{1}
$$
where
$C_1 = K_{11} + K_{12}$, $C_2 = K_{21} + K_{22}$,
$R_1 = K_{11} + K_{21}$, $R_2 = K_{12} + K_{22}$,
$N = K_{11} + K_{12} + K_{21} + K_{22}$,
$K_{11}$ = frequency of common occurrences of word $w_i$ and word $w_j$,
$K_{12}$ = corpus frequency of word $w_i$ - $K_{11}$,
$K_{21}$ = corpus frequency of word $w_j$ - $K_{11}$,
$K_{22} = N - K_{11} - K_{12} - K_{21}$.
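As an illustration of equation (1), the following is a minimal Python sketch, not the implementation used in the experiments. The function name `llr` and its parameter names are hypothetical. It assumes that N is the total number of observations in the contingency table, that K22 is the remainder of the table once K11, K12 and K21 are known, and that a cell with a zero count contributes zero to the sum.

```python
import math


def llr(k11, freq_i, freq_j, n):
    """Log-likelihood ratio of equation (1) for a word pair (w_i, w_j).

    k11    -- frequency of common occurrences of w_i and w_j
    freq_i -- corpus frequency of w_i
    freq_j -- corpus frequency of w_j
    n      -- total number of observations (assumption: N is known globally)
    """
    k12 = freq_i - k11
    k21 = freq_j - k11
    k22 = n - k11 - k12 - k21  # assumption: remainder of the 2x2 table
    c1, c2 = k11 + k12, k21 + k22
    r1, r2 = k11 + k21, k12 + k22

    def term(k, c, r):
        # Convention (assumption): an empty cell contributes 0 to the sum.
        return k * math.log(k * n / (c * r)) if k > 0 else 0.0

    return (term(k11, c1, r1) + term(k12, c1, r2)
            + term(k21, c2, r1) + term(k22, c2, r2))
```

With these conventions, two words that co-occur exactly as often as chance predicts receive a score of zero, and the score grows as the observed co-occurrence departs from independence.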
Next, context vectors of the target words are translated using a preliminary seed
lexicon. We consider all translation candidates, keeping the same context frequency
value as the source term. This step requires a seed lexicon, which will be enriched
using the bootstrapping approach proposed in this paper.
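The translation step above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `translate_context_vector`, the dictionary-based data layout and the romanized example words are hypothetical. Every translation candidate of a context word inherits the same frequency value; when several context words share a candidate their values are summed, and context words missing from the seed lexicon are dropped. The last two choices are assumptions, since the text does not specify them.

```python
from collections import defaultdict


def translate_context_vector(vector, seed_lexicon):
    """Map a context vector into the other language with a seed lexicon.

    vector       -- dict: context word -> context frequency value
    seed_lexicon -- dict: word -> list of translation candidates
    """
    translated = defaultdict(float)
    for word, value in vector.items():
        # All candidates are kept, each with the original frequency value.
        for candidate in seed_lexicon.get(word, []):
            translated[candidate] += value
    return dict(translated)


# Hypothetical example with romanized Japanese context words.
example = translate_context_vector(
    {"ginkou": 3.0, "kawa": 1.0, "unknown": 5.0},
    {"ginkou": ["bank"], "kawa": ["river", "bank"]},
)
```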
Similarity vectors are constructed for each pair of source term and target term using
the cosine metric (Salton and McGill, 1983), as expressed in equation (2):
$$
\mathrm{Similarity}(w_i, w_j) = \frac{\sum_k v_{ik}\, v_{jk}}{\sqrt{\sum_k v_{ik}^2 \, \sum_k v_{jk}^2}} \tag{2}
$$
where $v_{ik}$ represents the co-occurrence frequency of the source term $w_i$ with
term $w_k$ in its context vector, and $v_{jk}$ represents the co-occurrence frequency
of the target term $w_j$ with term $w_k$ in its context vector.
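Equation (2) can be sketched in Python over sparse context vectors as follows. The function name `cosine_similarity` and the dictionary representation are assumptions made for illustration; returning zero for an empty vector is also a convention chosen here.

```python
import math


def cosine_similarity(v_i, v_j):
    """Cosine of equation (2) for two sparse context vectors (dicts)."""
    shared = set(v_i) & set(v_j)
    dot = sum(v_i[k] * v_j[k] for k in shared)
    norm_i = math.sqrt(sum(x * x for x in v_i.values()))
    norm_j = math.sqrt(sum(x * x for x in v_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # convention (assumption): empty vectors match nothing
    return dot / (norm_i * norm_j)
```

Only context words present in both vectors contribute to the numerator, so both vectors need to be expressed in the same language before they are compared.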
Similarity vectors are thus constructed to yield a probabilistic translation model
$P_{comp}(t|s)$ for bilingual terminology extraction from comparable corpora.
3 Linear Combination of Different Translation Models
Combining different models has shown success in previous research (Dejean et al.
2002). We propose a combined probabilistic translation model involving comparable
corpora, readily available bilingual dictionaries, and transliteration for the
special phonetic or spelling representation of the Japanese language, represented by
the katakana alphabet.
Fig. 1 presents an overview of the proposed approach in CLIR, which combines
different translation models: comparable corpora, bilingual dictionaries and
transliteration.
General-purpose dictionaries are a basic source of translations and can be exploited
for bilingual terminology extraction. The proposed dictionary-based translation model
is derived directly from readily available bilingual dictionaries, by considering all
translation candidates and their associated phrases for each source entry.
Transliteration is the phonetic or spelling representation of one language using the
alphabet of another language. Foreign words and loanwords written in a special
phonetic alphabet (here Japanese katakana) require romanization or transliteration
(Knight and Graehl 1998). Japanese vocabulary is frequently imported from other
languages, primarily (but not exclusively) from English. Katakana, the special
phonetic alphabet, is used to write foreign words and loanwords, for example names of
persons and other terms.
Finally, translation alternatives are ranked according to the combined probability. A
fixed number of top-ranked translation candidates are selected for each source term
and misleading candidates are discarded.
The English word ‘computer’ is transliterated in Japanese katakana as
‘コンピューター’, ‘engineer’ is transliterated as ‘エンジニアー’, and
‘space shuttle’ is transliterated as ‘スペースシャトル’. Named entities such as
proper names of foreign (non-Japanese) persons, locations and organizations are also
transliterated in Japanese. For example, the named entity ‘Bill Clinton’ is
transliterated in Japanese as ‘ビルクリントン’.
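Terms written in katakana, which are the candidates for transliteration, can be recognized by their script. The Python sketch below is an illustration only and is not part of the tools used in this study. The function name `is_katakana` is hypothetical, and the test relies on the Unicode Katakana block (U+30A0 to U+30FF), which also contains the long-vowel mark ‘ー’.

```python
def is_katakana(term):
    """Return True if every character of term is in the Unicode Katakana block."""
    return bool(term) and all("\u30a0" <= ch <= "\u30ff" for ch in term)


# The transliterated examples from the text are all pure katakana.
katakana_examples = ["コンピューター", "エンジニアー", "スペースシャトル", "ビルクリントン"]
all_katakana = all(is_katakana(t) for t in katakana_examples)
```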
Therefore, the combined probabilistic model involves the probability distributions
derived from the comparable corpora $P_{comp}(t|s)$, readily available bilingual
dictionaries $P_{dict}(t|s)$ and the transliteration model $P_{translit}(t|s)$, as
expressed in equation (3):
$$
P(t|s) = \alpha_1 P_{comp}(t|s) + \alpha_2 P_{dict}(t|s) + \alpha_3 P_{translit}(t|s) \tag{3}
$$
Parameters $\alpha_1$ to $\alpha_3$ are model-dependent and represent the importance
of each translation strategy, with $\sum_{i=1}^{3} \alpha_i = 1$.
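The linear combination of equation (3), followed by the selection of a fixed number of top-ranked candidates, might be sketched in Python as follows. The function name `combine_models`, the default weights and the value of `top_k` are hypothetical and chosen only for illustration; the weights actually used are not stated here. A candidate absent from one model is assumed to have probability zero under that model.

```python
def combine_models(p_comp, p_dict, p_translit, alphas=(0.4, 0.4, 0.2), top_k=3):
    """Rank translation candidates t of one source term s by equation (3).

    Each p_* argument is a dict: candidate t -> P(t|s) under that model.
    alphas holds (alpha_1, alpha_2, alpha_3) and must sum to 1.
    """
    a1, a2, a3 = alphas
    if abs(a1 + a2 + a3 - 1.0) > 1e-9:
        raise ValueError("alpha weights must sum to 1")
    candidates = set(p_comp) | set(p_dict) | set(p_translit)
    scores = {
        t: a1 * p_comp.get(t, 0.0)
        + a2 * p_dict.get(t, 0.0)
        + a3 * p_translit.get(t, 0.0)
        for t in candidates
    }
    # Highest combined probability first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


best = combine_models(
    {"computer": 0.6, "machine": 0.4},
    {"computer": 1.0},
    {"computer": 1.0},
    alphas=(0.5, 0.3, 0.2),
    top_k=1,
)
```

Candidates that fall below the cut-off are discarded, which corresponds to removing misleading candidates after ranking.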
Fig. 1. Overview of the proposed approach combining different translation models in CLIR.
4 Experiments and Evaluation in CLIR
Experiments have been carried out to measure the improvement of our proposal on
bilingual Japanese-English tasks in CLIR, i.e. Japanese queries to retrieve English
documents.
4.1 Linguistic Resources
A collection of news articles from Mainichi Newspapers (1998-1999) for Japanese and
Mainichi Daily News (1998-1999) for English was considered as comparable corpora,
because of the common time period. Moreover, documents of the NTCIR-2 test
collection were considered as comparable corpora in order to cope with special
features of the test collection during evaluations.
The morphological analyzers ChaSen² version 2.2.9 (Matsumoto et al. 1997) for
Japanese texts and OAK³ (Sekine 2001) for English texts were used in linguistic
pre-processing.
The EDR (EDR 1996) and EDICT⁴ bilingual Japanese-English dictionaries were used in
translation.
KAKASI⁵, a free language processing inverter available on the Internet, was used in
the transliteration of Japanese terms written in katakana to English. Corrections to
the transliteration were completed manually by a native Japanese speaker.
NTCIR-2 (Kando 2001), a large-scale test collection, was used to evaluate the
proposed strategies in CLIR.
The SMART information retrieval system (Salton 1971), which is based on the vector
model, was used to retrieve English documents.
4.2 Results and Discussion
Content words (nouns, verbs, adjectives, adverbs) were extracted from English and
Japanese corpora. In addition, foreign words (mostly represented in katakana) were
extracted from Japanese texts. Thus, context vectors were constructed for Japanese
and English terms. Similarity vectors were constructed for Japanese-English pairs of
terms.
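The construction of context vectors from a tokenized corpus can be sketched in Python as follows. This is a simplified illustration, not the pipeline used in the experiments: the function name `build_context_vectors` is hypothetical, the input is assumed to be a flat list of content words already produced by a morphological analyzer, and the three-word window is interpreted as three positions on each side of the term, which is an assumption.

```python
from collections import Counter, defaultdict


def build_context_vectors(tokens, window=3):
    """Count, for each term, the words occurring within +/- window positions."""
    vectors = defaultdict(Counter)
    for pos, term in enumerate(tokens):
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for ctx in range(start, end):
            if ctx != pos:
                vectors[term][tokens[ctx]] += 1
    return vectors


vectors = build_context_vectors(["a", "b", "c", "d", "e"], window=1)
```

The raw counts obtained this way supply the co-occurrence frequencies from which association scores such as the log-likelihood ratio are computed.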
We conducted experiments and evaluations on the monolingual and bilingual tasks of
the NTCIR test collection.
Table 1. Results and evaluations of different translation models and their combination.

                                                                     % Difference (Improvement)
Translation Model                     Avg. Precision  % Monolingual  vs. ME    vs. DT    vs. SCC
ME (Monolingual English)              0.2683          100            -         -         -
DT (Dictionary and Transliteration)   0.2279          84.94          -15.05    -         -
SCC (Comparable Corpora)              0.1417          52.81          -47.18    -37.82    -
DT&SCC (Linear Combination)           0.2366          88.18          -11.81    +3.82     +66.97
² http://chasen.aist-nara.ac.jp/
³ http://nlp.cs.nyu.edu/oak/
⁴ http://www.csse.monash.edu.au/~jwb/wwwjdic.htm
⁵ http://kakasi.namazu.org/
Topics 0101 to 0149 were considered, and key terms contained in the title <TITLE>,
description <DESCRIPTION> and concept <CONCEPT> fields were used to generate 49
queries in Japanese and English.
Results of the different translation models and their combination are given in
Table 1. Evaluations were based on the average precision, the difference in average
precision relative to the monolingual counterpart, and the improvement over the
individual translation models.
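For readers unfamiliar with the measure, average precision for a single query can be sketched in Python as follows. The function name `average_precision` is hypothetical, and the sketch assumes the non-interpolated variant averaged over all relevant documents; the exact variant reported by the evaluation tools used in this study is not stated, so this is an illustration rather than the evaluation procedure itself.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Non-interpolated average precision for one query (assumed variant)."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Averaging this value over all queries gives one figure per retrieval run.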
The combined dictionary-based and transliteration model ‘DT’ achieved 84.94% of the
monolingual retrieval performance, while the comparable corpora-based model ‘SCC’
achieved a lower average precision, with 52.81% of the monolingual retrieval, that
is 37.82% below ‘DT’. The proposed combination of comparable corpora, bilingual
dictionaries and transliteration, ‘DT&SCC’, showed the best cross-language
performance in terms of average precision, with 88.18% of the monolingual
counterpart, +3.82% compared to the dictionary-based method and +66.97% compared to
the comparable corpora model taken alone.
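The relative figures can be recomputed from the average precision values of Table 1. The short Python sketch below is only an arithmetic check; the helper names `percent_of` and `percent_change` are hypothetical.

```python
# Average precision values as reported in Table 1.
avg_precision = {"ME": 0.2683, "DT": 0.2279, "SCC": 0.1417, "DT&SCC": 0.2366}


def percent_of(model, baseline):
    """Average precision of model as a percentage of the baseline."""
    return 100.0 * avg_precision[model] / avg_precision[baseline]


def percent_change(model, baseline):
    """Relative difference of model with respect to the baseline, in percent."""
    return percent_of(model, baseline) - 100.0


for model in ("DT", "SCC", "DT&SCC"):
    print(model, round(percent_of(model, "ME"), 2))
print(round(percent_change("DT&SCC", "DT"), 2),
      round(percent_change("DT&SCC", "SCC"), 2))
```

The recomputed ratios agree with the percentages in Table 1 up to rounding in the last digit.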
5 Conclusions
We investigated the extraction of bilingual terminology from comparable corpora,
with an application to the Japanese-English language pair. A combined model
involving comparable corpora, readily available bilingual dictionaries and
transliteration was found to be effective and could be used to enrich bilingual
lexicons and thesauri. Most of the selected terms were considered as translation
candidates or expansion terms in CLIR. Exploiting different translation models
proved to be effective.
Ongoing research is focused on the transliteration of the special phonetic alphabet
(katakana in the case of Japanese) and on phrasal translation in CLIR.
References
1. Dagan, I., Itai, A. Word Sense Disambiguation using a Second Language Monolingual
Corpus. Computational Linguistics 20(4): 563-596. (1994)
2. Dejean, H., Gaussier, E., Sadat, F. An Approach based on Multilingual Thesauri and Model
Combination for Bilingual Lexicon Extraction. In Proceedings of COLING’02, Taiwan, pp
218-224. (2002)
3. Diab, M., Finch, S. A Statistical Word-Level Translation Model for Comparable Corpora.
In Proceedings of the Conference on Content-based Multimedia Information Access RIAO.
(2000)
4. Dunning, T. Accurate Methods for the Statistics of Surprise and Coincidence.
Computational Linguistics 19(1): 61-74. (1993)
5. EDR. Japan Electronic Dictionary Research Institute, Ltd. EDR electronic dictionary
version 1.5 technical guide. Technical report TR2-007, Japan Electronic Dictionary Research
Institute, Ltd. (1996)
6. Fung, P. A Statistical View of Bilingual Lexicon Extraction: From Parallel Corpora to
Non-Parallel Corpora. In Jean Véronis, Ed. Parallel Text Processing. (2000)
7. Kando, N. Overview of the Second NTCIR Workshop. In Proceedings of the Second
NTCIR Workshop on Research in Chinese and Japanese Text Retrieval and Text
Summarization, Tokyo. (2001)
8. Knight, K., Graehl, J. Machine Transliteration. Computational Linguistics 24(4). (1998)
9. Koehn, P., Knight, K. Learning a Translation Lexicon from Monolingual Corpora. In
Proceedings of ACL-02 Workshop on Unsupervised Lexical Acquisition. (2002)
10. Matsumoto, Y., Kitauchi, A., Yamashita, T., Imaichi, O., Imamura, T. Japanese
morphological analysis system ChaSen manual. Technical report NAIST-IS-TR97007, NAIST.
(1997)
11. Nakagawa, H. Disambiguation of Lexical Translations Based on Bilingual Comparable
Corpora. In Proceedings of LREC2000, Workshop of Terminology Resources and
Computation WTRC2000, pp 33-38. (2000)
12. Peters, C., Picchi, E. Capturing the Comparable: A System for Querying Comparable Text
Corpora. In Proceedings of the Third International Conference on Statistical Analysis of
Textual Data, pp 255-262. (1995)
13. Rapp, R. Automatic Identification of Word Translations from Unrelated English and
German Corpora. In Proceedings of EACL’99. (1999)
14. Salton, G. The SMART Retrieval System, Experiments in Automatic Documents
Processing. Prentice-Hall, Inc., Englewood Cliffs, NJ. (1971)
15. Salton, G., McGill, M. J. Introduction to Modern Information Retrieval. New York,
McGraw-Hill. (1983)
16. Sekine, S. OAK System Manual. New York University. (2001)
17. Shahzad, I., Ohtake, K., Masuyama, S., Yamamoto, K. Identifying Translations of
Compound Using Non-aligned Corpora. In Proceedings of Workshop MAL, pp 108-113. (1999)
18. Tanaka, K., Iwasaki, H. Extraction of Lexical Translations from Non-Aligned Corpora. In
Proceedings of COLING’96. (1996)