Constructing High Quality Bilingual Corpus using Parallel Data from the Web

Sai Cheok, Lap Hoi, Su-Kit Tang, Rita Tse

2022

Abstract

A natural language machine translation system requires a high-quality bilingual corpus to translate efficiently and with high accuracy. In this paper, we propose a method for constructing a bilingual corpus from parallel data on the Web, which significantly speeds up construction. The method has four phases. Parallel data is first pre-processed and refined into three sets of data for training a CNN model. Using the trained model, future parallel data can then be selected, classified and added to the bilingual corpus. In training, the test accuracy reached 98.46%. Furthermore, the precision, recall and F1-score are all greater than 0.9, outperforming RNN and LSTM models.
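For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how precision, recall and F1-score are computed for a binary classifier. The function name `precision_recall_f1` and the label lists are hypothetical illustrations, not the paper's code or data; label 1 is assumed to mean "sentence pair accepted into the corpus".

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical labels (assumption): 1 = good parallel pair, 0 = rejected pair.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so it only exceeds 0.9 when both of the other two scores are high.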

Paper Citation


in Harvard Style

Cheok S., Hoi L., Tang S. and Tse R. (2022). Constructing High Quality Bilingual Corpus using Parallel Data from the Web. In Proceedings of the 7th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, ISBN 978-989-758-564-7, pages 127-132. DOI: 10.5220/0010997000003194


in Bibtex Style

@conference{iotbds22,
author={Sai Cheok and Lap Hoi and Su-Kit Tang and Rita Tse},
title={Constructing High Quality Bilingual Corpus using Parallel Data from the Web},
booktitle={Proceedings of the 7th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS},
year={2022},
pages={127-132},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010997000003194},
isbn={978-989-758-564-7},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 7th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS
TI - Constructing High Quality Bilingual Corpus using Parallel Data from the Web
SN - 978-989-758-564-7
AU - Cheok S.
AU - Hoi L.
AU - Tang S.
AU - Tse R.
PY - 2022
SP - 127
EP - 132
DO - 10.5220/0010997000003194