Constructing High Quality Bilingual Corpus using Parallel Data from the Web
Sai Cheok, Lap Hoi, Su-Kit Tang, Rita Tse
2022
Abstract
A natural language machine translation system requires a high-quality bilingual corpus to translate efficiently and with high accuracy. In this paper, we propose a method for constructing a bilingual corpus from parallel data on the Web, which significantly speeds up corpus construction. The method has four phases. Parallel data is first pre-processed and refined into three data sets for training a CNN model. The trained model is then used to select and classify future parallel data, which is added to the bilingual corpus. The training results show that the test accuracy reached 98.46%. Furthermore, precision, recall and F1-score are all greater than 0.9, outperforming RNN and LSTM models.
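As a reading aid for the metrics quoted in the abstract, the following is a minimal sketch of how precision, recall and F1-score are conventionally computed for a binary classifier that labels a sentence pair as parallel (1) or not (0). The function name `precision_recall_f1` and the toy labels are hypothetical illustrations; they are not taken from the paper and do not reproduce its data or results.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class.

    y_true and y_pred are equal-length sequences of labels.
    This is a generic textbook definition, not the authors' code.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical toy labels: 1 = parallel sentence pair, 0 = not parallel.
    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 1, 0]
    p, r, f = precision_recall_f1(y_true, y_pred)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

F1 is the harmonic mean of precision and recall, so a value above 0.9 requires both of them to be high at the same time.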
Paper Citation
in Harvard Style
Cheok S., Hoi L., Tang S. and Tse R. (2022). Constructing High Quality Bilingual Corpus using Parallel Data from the Web. In Proceedings of the 7th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, ISBN 978-989-758-564-7, pages 127-132. DOI: 10.5220/0010997000003194
in Bibtex Style
@conference{iotbds22,
author={Sai Cheok and Lap Hoi and Su-Kit Tang and Rita Tse},
title={Constructing High Quality Bilingual Corpus using Parallel Data from the Web},
booktitle={Proceedings of the 7th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS},
year={2022},
pages={127-132},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010997000003194},
isbn={978-989-758-564-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 7th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS
TI - Constructing High Quality Bilingual Corpus using Parallel Data from the Web
SN - 978-989-758-564-7
AU - Cheok S.
AU - Hoi L.
AU - Tang S.
AU - Tse R.
PY - 2022
SP - 127
EP - 132
DO - 10.5220/0010997000003194