words in transliterated texts that are similar enough to
allow for a fuzzy string matching approach to align-
ing texts. We used edit distance as a way to measure
how similar two strings were and calculated a score
that would also take word length into consideration.
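As an illustration of the idea, the following minimal sketch computes the edit distance between two strings and divides it by the length of the longer string. The exact length-dependent score used in the study is not reproduced here, so this normalization and the function names `edit_distance` and `normalized_score` are assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_score(a: str, b: str) -> float:
    """Assumed normalization: distance divided by the length of the longer word.

    0.0 means identical strings, 1.0 means nothing in common.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / longest
```

With this normalization, a single differing character counts for less in a long word than in a short one, which is one simple way of taking word length into consideration.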
Since the Amharic and English news texts in this
study are comparable rather than parallel (that is, they
are not direct translations of one another), the algorithm
did not use document length as a feature in the alignment process.
We believe that under these circumstances, the doc-
ument length could be more confusing than helpful
for the alignment process. When aligning potentially
parallel data, however, the length could be an important
feature.
The content of the body of the text (the news ar-
ticle) could of course also be used in the alignment
process instead of using the title only. The body of the
text often contains many occurrences of names and numbers
that would help the alignment. When analyzing the
text body, resources such as lexica (e.g. for word-by-word
translation), stop word lists (to remove non-content-bearing
words that appear in many of the articles),
morphological analysis or stemming (to consider the
root word only) may be used.
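A rough sketch of such body-text preprocessing is given below. It is shown for English text only, the stop word list is a tiny hypothetical one, and the Jaccard overlap used to compare two articles is a technique chosen for this example rather than something evaluated in the study.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}


def content_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic words and digit runs, drop stop words."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [t for t in tokens if t not in stop_words]


def overlap(tokens_a, tokens_b):
    """Jaccard overlap between the two token sets (0.0 if either is empty)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Lowercasing would not be appropriate for case-sensitive transliterations, so the Amharic side would need its own tokenization before a comparison of this kind.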
Machine learning could also be used to improve
the fuzzy matching by finding more likely character
substitutions from known matching word pairs, and
assigning them a different weight when calculating
the word matching score. It would also be possible
to incorporate an improved number and date conver-
sion that would allow several different formats (digit
or text representation) for these items.
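One way such weighting might look is sketched below, under strong simplifying assumptions: substitutions are counted only from equal-length known pairs compared position by position, and the most frequent substitution receives a hypothetical minimum cost `floor`. The function names and the cost formula are illustrative and not part of the existing algorithm.

```python
from collections import Counter


def learn_substitution_costs(pairs, floor=0.2):
    """Assign lower costs to character substitutions seen in known matching pairs.

    Assumption: only equal-length pairs are used, compared position by position.
    """
    counts = Counter()
    for source, target in pairs:
        if len(source) != len(target):
            continue
        for s, t in zip(source, target):
            if s != t:
                counts[(s, t)] += 1
    if not counts:
        return {}
    most_common = max(counts.values())
    # The most frequent substitution costs `floor`; rarer ones approach 1.0.
    return {pair: 1.0 - (1.0 - floor) * n / most_common
            for pair, n in counts.items()}


def weighted_edit_distance(a, b, costs):
    """Edit distance where substitutions listed in `costs` are cheaper than 1.0."""
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        current = [float(i)]
        for j, ch_b in enumerate(b, start=1):
            sub = 0.0 if ch_a == ch_b else costs.get((ch_a, ch_b), 1.0)
            current.append(min(previous[j] + 1.0,        # deletion
                               current[j - 1] + 1.0,     # insertion
                               previous[j - 1] + sub))   # substitution or match
        previous = current
    return previous[-1]
```

Note that the learned costs are directional in this sketch: a substitution observed from source to target is not automatically made cheaper in the opposite direction.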
In any case, when used as a semi-automatic
tool, the existing algorithm gives an acceptable
performance in relation to the amount of work that
would otherwise be required to align the news items
manually. As an example, the parameter settings 0.15
and 0.5 find 225 suggested matches, of which 127
are correctly aligned news items (see Table 1 above).
The 127 correctly aligned texts are reasonably close to
the estimated total number of possible matches (10%
of the 1219 Amharic articles in the test set), and at
the same time the amount of manual work required to
identify the 98 incorrectly aligned news pairs in this
group is substantially less than what would have been
required to do it completely from scratch without the
help of the system.
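The figures in this example can be checked with a few lines of arithmetic. The variable names are chosen for this sketch; the numbers are those quoted in the text.

```python
suggested = 225              # matches suggested by the system
correct = 127                # suggestions that are correctly aligned
amharic_articles = 1219      # Amharic articles in the test set
estimated_match_rate = 0.10  # estimated share of articles that have a match

incorrect = suggested - correct
precision = correct / suggested
estimated_possible = estimated_match_rate * amharic_articles

print(f"incorrect pairs to discard: {incorrect}")                  # 98
print(f"precision of suggestions: {precision:.3f}")                # 0.564
print(f"estimated possible matches: {estimated_possible:.1f}")     # 121.9
```

A human reviewer therefore inspects 225 suggested pairs, a little more than half of which are correct, instead of searching among all 1219 Amharic articles.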
It is our hope that by demonstrating the feasibil-
ity of our approach, we will inspire additional work
on creating parallel corpora and other linguistic re-
sources for Amharic as well as for many more of the
world's "low density languages".
REFERENCES
Alemu, A., Asker, L., and Eriksson, G. (2003). An empirical
approach to building an Amharic treebank. In
Proceedings of TLT-2003.
Alemu, A., Asker, L., and Eriksson, G. (2004). Building an
Amharic lexicon from parallel texts. In Proceedings of
First Steps for Language Documentation of Minority
Languages: Computational Linguistic Tools for Mor-
phology, Lexicon and Corpus Compilation, a Work-
shop at LREC-2004.
Bendersky, E. (2004). Levenshtein dis-
tance algorithm: Perl implementation.
http://www.merriampark.com/ldperl.htm, accessed
Jan 31, 2004.
Chen, J. and Jian-Jun, N. (2000). Automatic construction
of parallel English-Chinese corpus for cross-language
information retrieval. In Proceedings of the Sixth Con-
ference on Applied Natural Language Processing.
GlobalReach (2004). Global internet statistics (by language).
http://global-reach.biz/globstats/index.php3.
Hulth, A. (2004). Combining Machine Learning and Nat-
ural Language Processing for Automatic Keyword Ex-
traction. Doctoral Dissertation, Department of Com-
puter and Systems Sciences, Stockholm University.
Hwa, R., Resnik, P., Weinberg, A., and Kolak, O. (2002).
Evaluating translational correspondence using anno-
tation projection. In Proceedings of ACL-02.
JuneCalends (2004). 7000 years calendar v1.4.1.
http://www.junecalends.com/7000.html, accessed Jan
31, 2004.
Ma, X. and Liberman, M. (1999). BITS: A method for bilingual
text search over the web. In Proceedings of Machine
Translation Summit VII.
Resnik, P. (1998). Parallel strands: A preliminary investi-
gation into mining the web for bilingual text. In Pro-
ceedings of the Third Conference of the Association
for Machine Translation in the Americas, AMTA-98.
Resnik, P. (1999). Mining the web for bilingual text. In 37th
Annual Meeting of the Association for Computational
Linguistics (ACL’99).
Resnik, P. and Smith, N. A. (2003). The web as a parallel
corpus. Computational Linguistics, 29(3).
Riloff, E., Schafer, C., and Yarowsky, D. (2002). In-
ducing information extraction systems for new lan-
guages via cross-language projection. In Proceedings
of COLING-02.
Yacob, D. (1996). System for Ethiopic
Representation in ASCII (SERA).
http://www.abyssiniacybergateway.net/fidel/.
Yang, C. C. and Li, K. W. (2002). Mining English/Chinese
parallel documents from the world wide web. In Proceedings
of the 11th International World Wide Web
Conference.
Yarowsky, D., Ngai, G., and Wicentowski, R. (2001). Inducing
multilingual text analysis tools via robust projection
across aligned corpora. In Proceedings of HLT-01.
WEBIST 2005 - WEB INTERFACES AND APPLICATIONS