Author: @dataturks
Crazy technology house
A complete guide to learning how to translate between any two languages
This hands-on tutorial will show you how to translate a given language into any target language. Our task is inspired entirely by an open-source library whose PyTorch implementation is available in Python: OpenNMT (Open-Source Neural Machine Translation). It is designed to support the research of deep-learning enthusiasts, so that they can implement their ideas in machine translation, summarization, image-to-text conversion, morphology, and other fields.
Although Google Translate, Microsoft, and others provide efficient translation systems, they are either not open source or are available only under restrictive licenses. There are other libraries, such as the TensorFlow seq2seq model, but these exist only as research code.
OpenNMT is not only open source, but also provides well-documented, modular, and readable code, with which models can be trained quickly and run efficiently.
I'll explain in more detail how to set up the library and how to use the toolkit to train your own translation system. This article shows how to generate Hindi translations from given English text.
Open NMT Architecture Overview
OpenNMT is based on the research of Guillaume Klein et al. Relevant information can be found here: http://aclweb.org/anthology/p.
The white paper reveals more about its architecture:
OpenNMT is a complete library for training and deploying neural machine translation models. The system is the successor to seq2seq-attn developed at Harvard, and has been completely rewritten for improved efficiency, readability, and generality. It includes vanilla NMT models along with support for attention, gating, stacking, input feeding, regularization, beam search, and more.
The main system is implemented in the Lua/Torch mathematical framework and can be easily extended using Torch's standard internal neural network components. Adam Lerer of Facebook Research has also extended it to support the Python/PyTorch framework, with the same API.
Setting up the required modules
To train your own custom translation system, the main required package is essentially PyTorch, in which the OpenNMT model is implemented.
First, of course, clone the OpenNMT-py repository:

```shell
git clone https://github.com/OpenNMT/OpenNMT-py
cd OpenNMT-py
```
A requirements.txt file collects all the necessary packages:

```
six
tqdm
torch>=0.4.0
git+https://github.com/pytorch/text
future
```
As PyTorch has been evolving continuously, we recommend pinning to PyTorch v0.4 to ensure stable performance of the code base.
Run the following command to automatically collect the required dependencies:
pip install -r requirements.txt
Collecting datasets
The dataset consists of a parallel corpus of source- and target-language files, each containing one sentence per line, with tokens separated by spaces.
For this tutorial, we use a parallel corpus of English and Hindi sentences stored in separate files. The data was collected and consolidated from a variety of sources, then shuffled to create the following file sets:
- src-train.txt: training file containing 10,000 English (source-language) sentences
- tgt-train.txt: training file containing 10,000 Hindi (target-language) sentences
- src-val.txt: validation data containing 1,000 English (source-language) sentences
- tgt-val.txt: validation data containing 1,000 Hindi (target-language) sentences
- src-test.txt: test/evaluation data consisting of 1,000 English (source-language) sentences
- tgt-test.txt: test/evaluation data consisting of 1,000 Hindi (target-language) sentences
All of the above files are placed in the data/ directory.
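The split above can be produced with a short script. Here is a minimal sketch (the helpers `split_parallel_corpus` and `write_split` are my own, not part of OpenNMT) that shuffles aligned sentence pairs together, so the source and target files stay in sync, and writes one sentence per line:

```python
import random

def split_parallel_corpus(src_sentences, tgt_sentences,
                          n_train=10000, n_val=1000, n_test=1000, seed=42):
    """Shuffle aligned sentence pairs and split them into train/val/test sets."""
    pairs = list(zip(src_sentences, tgt_sentences))
    # Shuffle the pairs (not each list separately) so alignment is preserved.
    random.Random(seed).shuffle(pairs)
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

def write_split(pairs, src_path, tgt_path):
    """Write a list of (source, target) pairs, one sentence per line."""
    with open(src_path, "w", encoding="utf-8") as fs, \
         open(tgt_path, "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            fs.write(src + "\n")
            ft.write(tgt + "\n")
```

For example, `write_split(train, "data/src-train.txt", "data/tgt-train.txt")` produces the first two files in the list above.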
Note: in this tutorial we use only a small amount of data for illustration and experimentation. However, it is recommended to use a large corpus with millions of sentences, so that the model learns a large vocabulary of unique words and produces translations closer to human quality.
Validation data is used to evaluate the model at each step in order to identify the convergence point. It usually contains at most 5,000 sentences.
This is an example of how text data is arranged in the corresponding file:
Source file:

```
They also bring out a number of Tamil weekly newspapers.
They are a hard-working people and most of them work as labourers.
Tamil films are also shown in the local cinema halls.
There are quite a large number of Malayalees living here.
```

Target file:

```
तमिल भाषा में वे अनेक समाचार पत्र व पत्रिकाएं भी निकालते हैं .
ये लोग काफी परिश्रमी हैं , अधिकांश लोग मजदूरी करते हैं .
स्थानीय सिनेमा हालों में तमिल चलचित्रों का प्रदर्शन अक्सर किया जाता है .
मलयालम लोगों की बहुत बडी संख्या है .
```
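Because the corpus is aligned line by line, a useful sanity check before preprocessing is to verify that both files have the same number of lines and that no line is empty. A minimal sketch (`check_parallel_alignment` is a hypothetical helper, not part of the toolkit):

```python
def check_parallel_alignment(src_lines, tgt_lines):
    """Return (ok, message) for a line-aligned parallel corpus.

    Line i of the source file must correspond to line i of the target
    file, so the line counts must match and no line may be empty.
    """
    if len(src_lines) != len(tgt_lines):
        return False, "line count mismatch: %d vs %d" % (len(src_lines), len(tgt_lines))
    for i, (s, t) in enumerate(zip(src_lines, tgt_lines)):
        if not s.strip() or not t.strip():
            return False, "empty sentence at line %d" % (i + 1)
    return True, "ok"
```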
Preprocessing text data
Preprocess the training and validation data, extract training features, and generate the model's vocabulary files by executing the following command:
```shell
python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
```
Training the translator model
The main command for training is very easy to use: it takes the preprocessed data files and a save path as input.

```shell
python train.py -data data/demo -save_model demo-model
```

A summary of the default model it builds:

```
NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(20351, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(20570, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(1000, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
      (softmax): Softmax()
      (tanh): Tanh()
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=500, out_features=20570, bias=True)
    (1): LogSoftmax()
  )
)
```
The above command runs a default model consisting of a two-layer LSTM with 500 hidden units for both the encoder and the decoder. To use a GPU and speed up training, pass the -gpuid parameter (for example, -gpuid 1 to use GPU 1).
The default model trains for 100,000 iterations, saving a checkpoint every 5,000 iterations. If the model converges and validation accuracy plateaus earlier, you can stop training and use one of the previously saved checkpoints.
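The idea of stopping at the convergence point can be sketched as a small early-stopping check over the validation accuracy recorded at each checkpoint. This is illustrative only; `find_convergence_step` is my own helper, not an OpenNMT function:

```python
def find_convergence_step(val_accuracies, patience=3, min_delta=0.1):
    """Return the index of the best checkpoint, stopping the scan once no
    improvement of at least min_delta is seen for `patience` checkpoints."""
    best_idx, best_acc, stale = 0, float("-inf"), 0
    for i, acc in enumerate(val_accuracies):
        if acc > best_acc + min_delta:
            # Meaningful improvement: remember this checkpoint.
            best_idx, best_acc, stale = i, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break  # accuracy has plateaued; stop scanning
    return best_idx
```

With a checkpoint saved every 5,000 iterations, checkpoint index `k` corresponds roughly to iteration `(k + 1) * 5000`.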
Translate your own data
You can run inference on unseen source-language (English) text and generate predicted translations by executing the following command:
```shell
python translate.py -model demo-model_XYZ.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
```
The command generates the translated output and saves the predictions to a file named pred.txt.
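To inspect the results, you can pair each line of data/src-test.txt with the corresponding line of pred.txt, since the two files are aligned line by line. A minimal sketch (`pair_predictions` and `show_predictions` are my own helpers):

```python
def pair_predictions(src_lines, pred_lines):
    """Pair each source sentence with the model's predicted translation."""
    if len(src_lines) != len(pred_lines):
        raise ValueError("source and prediction files are not aligned")
    return list(zip(src_lines, pred_lines))

def show_predictions(src_lines, pred_lines, limit=3):
    """Print the first few source/prediction pairs for manual inspection."""
    for src, pred in pair_predictions(src_lines, pred_lines)[:limit]:
        print("SOURCE    :", src)
        print("PREDICTED :", pred)
```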
The model was trained for 10,000 iterations on an NVIDIA GeForce GPU with 2 GB of memory. Training on a CPU would be computationally expensive; a high-end GPU is recommended to train on large amounts of data at reasonable speed.
Sample predictions from the model
The following are several examples of Hindi translations generated by the trained model for the corresponding English sentences.
```
SOURCE    : Trees are capable of absorbing more of carbon dioxide, thus maintaining equilibrium in the air composition
PREDICTED : पेडों में कार्बन डाईआक्साइड के बुरे लोग इस प्रकार पेड - पौधे का प्रयोग करने के लिए मौजूद हैं.

SOURCE    : He has hope that the gods will throw good things from the heavens, upon them
PREDICTED : वे उमीद है कि वे घर से कुछ नहीं बची हैं

SOURCE    : The Buddhist temple , the Dalai Lama Palace and dispensary of Tibet are tourist attractions here
PREDICTED : यहां का बौद्ध मंदिर दलाई लामा का आवास तथा तिब्बती औषधालय स्थानिय लोगो में मिलता है .

SOURCE    : He lets the hair grow long.
PREDICTED : वह अपने बढा लेता है .
```
As shown above, the predictions are not good enough, because there is too little training data. To perform translation close to the real world, the model must be trained on a large vocabulary and roughly a million sentences, but this significantly increases hardware requirements and training time.
Evaluating your trained model
The BLEU (Bilingual Evaluation Understudy) score is an evaluation metric for machine translation systems that compares generated sentences with reference sentences.
A perfect match yields a BLEU score of 1.0, while a complete mismatch yields a BLEU score of 0.0.
BLEU is a common metric for evaluating translation models because it is language independent, easy to interpret, and correlates highly with human evaluation.
The BLEU score was introduced by Kishore Papineni et al. in the paper "BLEU: a Method for Automatic Evaluation of Machine Translation".
The BLEU score is computed by matching n-grams in the candidate translation against n-grams in the reference text. Word order is not considered in this comparison.
So how do we define an n-gram? A 1-gram (unigram) is each individual token, while a 2-gram (bigram) is each pair of adjacent tokens.
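To make this concrete, here is a minimal pure-Python sketch of n-gram extraction and the clipped ("modified") n-gram precision on which BLEU is built. A full BLEU implementation additionally combines several n-gram orders and applies a brevity penalty; the function names here are my own:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the reference, which is why word order
    beyond the n-gram window does not affect the score."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0
```

For example, a candidate identical to the reference scores 1.0 at every n-gram order, and a candidate sharing no tokens with the reference scores 0.0.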
Links are provided below.