why all the pairs that matched only on tool words disappeared from the alignment;
289 subtitle pairs are affected by this cutoff.
Table 3. Impact of reducing the tool words’ weight.
γ     #C     #I    #Tot   Rec.    Prec.   Fm.
1.0   1119    94   1213   0.820   0.923   0.868
0.9   1097   134   1231   0.804   0.891   0.845
0.8   1097   134   1231   0.804   0.891   0.845
0.7   1097   134   1231   0.804   0.891   0.845
0.6   1097   133   1230   0.804   0.892   0.846
0.5   1097   133   1230   0.804   0.892   0.846
0.4   1056   171   1227   0.774   0.861   0.815
0.3   1044   189   1233   0.765   0.847   0.804
0.2   1040   192   1232   0.762   0.844   0.801
0.1   1039   194   1233   0.762   0.843   0.800
0.0    869    55    951   0.657   0.942   0.774
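As a reading aid for Table 3, the sketch below shows how the Prec. and Fm. columns can be recomputed from the counts. It assumes the standard definitions (precision = correct pairs / proposed pairs, recall = correct pairs / reference pairs, and a balanced F-measure), which agree with the γ = 1.0 row. The function names and the `n_reference` parameter are hypothetical; the number of reference pairs is not stated in this section, so recall is left parameterised.

```python
def precision(n_correct: int, n_proposed: int) -> float:
    """Share of proposed aligned pairs that are correct (#C / #Tot)."""
    return n_correct / n_proposed if n_proposed else 0.0


def recall(n_correct: int, n_reference: int) -> float:
    """Share of reference pairs that were found.

    n_reference is an assumed input: the size of the reference
    alignment is not given in this section.
    """
    return n_correct / n_reference if n_reference else 0.0


def f_measure(prec: float, rec: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0


# Row gamma = 1.0 of Table 3: #C = 1119, #Tot = 1213, Rec. = 0.820
p = precision(1119, 1213)
print(round(p, 3))                        # 0.923
print(round(f_measure(0.923, 0.820), 3))  # 0.868
```

Under these assumptions, the reported precision and F-measure for γ = 1.0 are reproduced from the counts and the reported recall.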
Running the developed alignment method on the whole corpus (40 movies:
43013 English subtitles and 42306 French subtitles) yields 37625 aligned pairs.
6 Conclusion and Perspectives
Working on parallel movie corpora is a good challenge on the way towards realistic
machine translation applications. Indeed, movie corpora include many common
expressions, hesitations, coarse words, etc. Training a translation decoding system on these
corpora will lead to machine translation systems for spontaneous speech. The first results are
very encouraging, and the method can be used to build automatically aligned corpora. Tests
have been conducted on a corpus of 40 movies, which corresponds to 43013 English
subtitles and 42306 French subtitles. By setting γ to 1 and α_FM to 9, we obtained 37625
aligned pairs with a precision of 92.3%. This result is competitive with
the state of the art in noisy corpus alignment [8]. However, we have to pursue our efforts
to increase the precision and thereby reduce the noise in the parallel corpora. Many
movies are available on the Internet, and the results of the automatic alignment encourage
us to enlarge our parallel corpus, which is crucial for the translation decoding process.
This work can be considered a first stage towards real-time machine translation of
subtitles.
References
1. Koehn, P.: Europarl: A multilingual corpus for evaluation of machine translation. In: MT
SUMMIT, Thailand (2005)
2. Vandeghinste, V., Sang, E.K.: Using a parallel transcript/subtitle corpus for sentence
compression. In: LREC, Lisbon, Portugal (2004)
3. Mangeot, M., Giguet, E.: Multilingual aligned corpora from movie subtitles. Technical report,
LISTIC (2005)
4. Melamed, I.D.: A geometric approach to mapping bitext correspondence. In Brill, E., Church,
K., eds.: Proceedings of the Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics, Somerset, New Jersey (1996) 1–12
5. Moore, R.C.: Fast and accurate sentence alignment of bilingual corpora. In: Proceedings of
the Association for Machine Translation in the Americas Conference. (2002) 135–144