Topics 0101 to 0149 were considered, and key terms from the title <TITLE>,
description <DESCRIPTION> and concept <CONCEPT> fields were used to
generate 49 queries in Japanese and English.
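As an illustration only, query generation from the three topic fields could be sketched as follows. The topic record, its closing tags, the helper names `extract_fields` and `build_query`, and the simple word-splitting rule are all assumptions made for this sketch; they are not the procedure used in the experiments, and Japanese text would need a proper morphological analyzer rather than a regular expression.

```python
import re

# Hypothetical topic record. The tag names follow the field names quoted
# in the text; the closing tags and the placeholder content are assumptions.
SAMPLE_TOPIC = """
<TOPIC>
<NUM>0101</NUM>
<TITLE>example title terms</TITLE>
<DESCRIPTION>example description terms</DESCRIPTION>
<CONCEPT>example, concept, terms</CONCEPT>
</TOPIC>
"""

FIELDS = ("TITLE", "DESCRIPTION", "CONCEPT")


def extract_fields(topic_text, fields=FIELDS):
    """Return the raw text of each requested field (empty string if absent)."""
    out = {}
    for name in fields:
        match = re.search(rf"<{name}>(.*?)</{name}>", topic_text, flags=re.DOTALL)
        out[name] = match.group(1).strip() if match else ""
    return out


def build_query(topic_text):
    """Collect lower-cased terms from all fields, keeping first-seen order."""
    field_text = extract_fields(topic_text)
    terms = []
    for name in FIELDS:
        for token in re.findall(r"\w+", field_text[name].lower()):
            if token not in terms:  # drop duplicates across fields
                terms.append(token)
    return terms


query = build_query(SAMPLE_TOPIC)
print(query)
```

Keeping one query per topic matches the count given above: 49 topics yield 49 queries per language.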
The results and performance of the different translation models and their
combination are described in Table 1. Evaluation was based on average precision,
the difference in average precision relative to the monolingual counterpart, and
the improvement over the monolingual counterpart.
The combined dictionary-based and transliteration model ‘DT’ reached 84.94% of the
monolingual retrieval in average precision, while the comparable corpora-based
model ‘SCC’ scored lower than both the monolingual retrieval and ‘DT’, with
52.81% of the monolingual retrieval. The proposed combination of comparable
corpora, bilingual dictionaries and transliteration, ‘DT&SCC’, showed the best
performance in terms of average precision with 88.18% of the monolingual counterpart,
+3.82% compared to the dictionary-based method and +66.97% compared to the
comparable corpora model taken alone.
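The relative gains quoted above can be checked with a short calculation. This is a minimal sketch that assumes the three figures are percentages of the monolingual average precision; the function name `relative_gain` is introduced here for illustration only. Because it starts from the rounded percentages rather than the raw average precision values, its results may differ from the reported +3.82% and +66.97% in the second decimal place.

```python
# Percent-of-monolingual average precision, as quoted in the text.
PCT_OF_MONO = {"DT": 84.94, "SCC": 52.81, "DT&SCC": 88.18}


def relative_gain(model, baseline, scores=PCT_OF_MONO):
    """Relative change of `model` over `baseline`, in percent."""
    return (scores[model] / scores[baseline] - 1.0) * 100.0


gain_over_dt = relative_gain("DT&SCC", "DT")
gain_over_scc = relative_gain("DT&SCC", "SCC")
print(f"DT&SCC vs DT:  {gain_over_dt:+.2f}%")
print(f"DT&SCC vs SCC: {gain_over_scc:+.2f}%")
```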
5 Conclusions
We investigated the extraction of bilingual terminology from comparable
corpora, applied to the Japanese-English language pair. A combined model
involving comparable corpora, readily available bilingual dictionaries and
transliteration was found to be very effective and could be used to enrich
bilingual lexicons and thesauri. Most of the selected terms were considered as
translation candidates or expansion terms in CLIR. Exploiting different
translation models proved effective. Ongoing research is focused on the
transliteration of special phonetic alphabets, katakana in the case of Japanese,
and on phrasal translation in CLIR.
References
1. Dagan, I., Itai, A. Word Sense Disambiguation using a Second Language Monolingual
Corpus. Computational Linguistics 20(4): 563-596. (1994)
2. Dejean, H., Gaussier, E., Sadat, F. An Approach based on Multilingual Thesauri and Model
Combination for Bilingual Lexicon Extraction. In Proceedings of COLING’02, Taiwan, pp.
218-224. (2002)
3. Diab, M., Finch, S. A Statistical Word-Level Translation Model for Comparable Corpora.
In Proceedings of the Conference on Content-based Multimedia Information Access RIAO.
(2000)
4. Dunning, T. Accurate Methods for the Statistics of Surprise and Coincidence.
Computational Linguistics 19(1): 61-74. (1993)
5. EDR. Japan Electronic Dictionary Research Institute, Ltd. EDR electronic dictionary
version 1.5 technical guide. Technical report TR2-007, Japan Electronic Dictionary Research
Institute, Ltd. (1996)
6. Fung, P. A Statistical View of Bilingual Lexicon Extraction: From Parallel Corpora to
Non-Parallel Corpora. In Jean Véronis, Ed. Parallel Text Processing. (2000)