This is the code for our LoResMT 2021 paper, *Morphologically-Guided Segmentation for Translation of Agglutinative Low-Resource Languages*. This repository contains the implementations of the subword segmentation algorithms we used, the cleaned Quechua dataset, and an Indonesian dataset, along with the pipeline for running our model.
```sh
pip install -r requirements.txt
```
- PRPE originally comes from *Semi-automatic Quasi-morphological Word Segmentation for Neural Machine Translation*.
- The base code for PRPE was taken from https://github.com/zuters/prpe.
- Samples of the heuristics, separated out from the main algorithm for convenience, can be accessed below:
  - Quechua Heuristic
  - Indonesian Heuristic
- The generic heuristic (the general parameters of PRPE) can be found here.
- The data we cleaned is found in `data/cleaned_source`. `test.es.txt` and `test.qz.txt` were created by randomly shuffling all of the parallel lines (a sketch of this joint shuffle appears after this list). The source data from Annette Rios can be found here.
- The Religious, News, and General Indonesian-English datasets from *Benchmarking Multidomain English-Indonesian Machine Translation* can be found at their repository here.
- The Religious and Magazine data from *Neural machine translation with a polysynthetic low resource language* can be found here.
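For reference, the test split described above was produced by shuffling source and target lines together so the parallel alignment is preserved. Below is a minimal Python sketch of that kind of joint shuffle; the function name, file paths, and seed are illustrative, not the exact script we used:

```python
import random

def shuffled_parallel_split(src_path, tgt_path, out_src, out_tgt, seed=0):
    """Shuffle a parallel corpus while keeping line pairs aligned."""
    with open(src_path, encoding="utf-8") as f:
        src_lines = f.readlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = f.readlines()
    # Zip the two sides so each (source, target) pair moves as a unit.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # fixed seed for reproducibility
    with open(out_src, "w", encoding="utf-8") as f_src, \
         open(out_tgt, "w", encoding="utf-8") as f_tgt:
        for s, t in pairs:
            f_src.write(s)
            f_tgt.write(t)
```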
Our entire pipeline can be run with:

```sh
python pipeline.py
```
The pipeline can take in several flags:
- `--src_segment_type` and `--tgt_segment_type` can be `none`, `bpe`, `unigram`, `prpe`, `prpe_bpe`, or `prpe_multiN` (where N is the number of iterations).
- `--model_type` can be `rnn` (aka LSTM) or `transformer`. Defaults to LSTM.
- `--in_lang` specifies the input language to be translated. We used `qz` for Quechua and `id` for Indonesian. Defaults to Quechua.
- `--out_lang` specifies the output language to be translated to. We used `es` for Spanish and `en` for English. Defaults to Spanish.
- `--domain` specifies the name of the dataset to be used, which should be located in `data/` under the same name. Defaults to `religious`.
- A dataset folder should include:
  - `train.{in_lang}.txt`, `validate.{in_lang}.txt`, `test.{in_lang}.txt`, `train.{out_lang}.txt`, `validate.{out_lang}.txt`, `test.{out_lang}.txt`
  - Example: `train.qz.txt`, `train.es.txt` for Quechua-Spanish translation (see the layout sketch after this flag list).
- `--train_steps` specifies how many steps the model should be trained for. Default is 100,000.
- `--save_steps` specifies how often the trained model is saved. Default is every 10,000 steps.
- `--validate_steps` specifies how often the model should be evaluated against the validation set. Default is every 2,000 steps.
- `--batch_size` is the batch size for training. Default is 64.
- `--filter_too_long` specifies the maximum token length of a line in the training set; any line exceeding this value is filtered out. Default is no filtering.
- `--src_token_lang` and `--tgt_token_lang` specify the tokenization language Moses uses. We use `es` for both languages in QZ-ES, and `en` for ID-EN.
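For the default Quechua-Spanish pair with `--domain religious`, the dataset layout implied by the file list above would look like this (illustrative):

```
data/
└── religious/
    ├── train.qz.txt
    ├── validate.qz.txt
    ├── test.qz.txt
    ├── train.es.txt
    ├── validate.es.txt
    └── test.es.txt
```

Putting the flags together, a full invocation might look like the following (all values are illustrative choices from the options above):

```sh
python pipeline.py \
    --src_segment_type prpe_bpe \
    --tgt_segment_type bpe \
    --model_type transformer \
    --in_lang qz \
    --out_lang es \
    --domain religious \
    --train_steps 100000 \
    --batch_size 64
```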
The pipeline will automatically test the model after training is finished and output BLEU and chrF scores.
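For reference, the same metrics can be computed standalone with the `sacrebleu` library. This is a hedged sketch: the file names are hypothetical, and it is not necessarily how `pipeline.py` computes its scores internally.

```python
# Illustrative standalone scoring with sacrebleu; the pipeline already
# reports these scores automatically after training.
import sacrebleu

# Hypothetical paths: system output vs. the Spanish test reference.
with open("output.es.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("data/religious/test.es.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```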