Authors: May Myo Zin ; Teeradaj Racharak and Nguyen Minh Le

Affiliation: School of Information Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan

Keyword(s): Neural Machine Translation, Myanmar Word Segmentation, Parallel Corpus Creation, Back-translation, Siamese-BERT Network.

Abstract: When dealing with low-resource languages such as Myanmar, training machine translation systems on additional pseudo-parallel data is often an effective approach. Because a pseudo-parallel corpus is generated by back-translating target-side monolingual text into the source language, it can contain considerable noise, including translation errors and weakly paired sentences, and therefore requires cleaning. In this paper, we propose a noisy parallel-sentence filtering system called Construct-Extract, based on cosine similarity and cross-lingual sentence embeddings from Siamese BERT-Networks. The proposed system filters out noisy sentences by extracting high-scoring sentence pairs from the constructed pseudo-parallel data, yielding better synthetic parallel data. As part of the proposed system, we also introduce an unsupervised Myanmar sub-word segmenter to improve the quality of current English-Myanmar translation models, which can serve as backward systems for back-translation but often suffer from Myanmar word segmentation errors. Experiments show that the proposed Myanmar word segmentation helps the backward system construct more accurate back-translated pseudo-parallel data, and that using our extracted pseudo-parallel corpus improves the performance of English-Myanmar translation systems in both directions.
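
The extraction step described in the abstract, scoring each pseudo-parallel pair by the cosine similarity of its sentence embeddings and keeping only the high-scoring pairs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `extract_pairs`, the default threshold of 0.8, and the assumption that embeddings have already been computed by some cross-lingual encoder are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_u * norm_v)


def extract_pairs(
    pairs: Sequence[Pair],
    src_embeddings: Sequence[Sequence[float]],
    tgt_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[Pair, float]]:
    """Keep only the pairs whose embedding similarity reaches the threshold."""
    kept = []
    for pair, u, v in zip(pairs, src_embeddings, tgt_embeddings):
        score = cosine_similarity(u, v)
        if score >= threshold:
            kept.append((pair, score))
    return kept
```

In practice the embeddings would come from a cross-lingual sentence encoder applied to both sides of each pair, and the threshold would be tuned on held-out data rather than fixed in advance.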

CC BY-NC-ND 4.0

Paper citation in several formats:
Zin, M.; Racharak, T. and Le, N. (2021). Construct-Extract: An Effective Model for Building Bilingual Corpus to Improve English-Myanmar Machine Translation. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART; ISBN 978-989-758-484-8; ISSN 2184-433X, SciTePress, pages 333-342. DOI: 10.5220/0010318903330342

@conference{icaart21,
author={May Myo Zin and Teeradaj Racharak and Nguyen Minh Le},
title={Construct-Extract: An Effective Model for Building Bilingual Corpus to Improve English-Myanmar Machine Translation},
booktitle={Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2021},
pages={333-342},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010318903330342},
isbn={978-989-758-484-8},
issn={2184-433X},
}

TY - CONF

JO - Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Construct-Extract: An Effective Model for Building Bilingual Corpus to Improve English-Myanmar Machine Translation
SN - 978-989-758-484-8
IS - 2184-433X
AU - Zin, M.
AU - Racharak, T.
AU - Le, N.
PY - 2021
SP - 333
EP - 342
DO - 10.5220/0010318903330342
PB - SciTePress