Machine Translation: Training a Translation Model with OpenNMT

Author: @dataturks

Original: https://hackernoon.com/natural

A complete guide to building a translation system for any two languages

This hands-on tutorial will help you learn how to translate a given language into any target language. The work here is fully built on an open-source library known as OpenNMT (Open Neural Machine Translation), whose PyTorch implementation, OpenNMT-py, is available in Python. It is designed to be research friendly for deep learning enthusiasts, so they can try out their ideas in machine translation, summarization, image-to-text conversion, morphology, and other fields.

Although Google Translate, Microsoft, and others provide efficient translation systems, they are either not open source or available only under restrictive licenses. There are other libraries, such as the tensorflow-seq2seq models, but those exist only as research code.

OpenNMT is not only open source but also provides well-documented, modular, and readable code for training models quickly and running them efficiently.


I'll explain in more detail how to set up the library and how to use the toolkit to train your own translation system. This tutorial shows how to generate Hindi translations from given English text.

OpenNMT Architecture Overview

OpenNMT is based on the research of Guillaume Klein et al. The paper can be found here: http://aclweb.org/anthology/p


The paper describes the architecture in more detail:

OpenNMT is a complete library for training and deploying neural machine translation models. The system is the successor to seq2seq-attn developed at Harvard, and has been completely rewritten for improved efficiency, readability, and generalizability. It includes vanilla NMT models along with support for attention, gating, stacking, input feeding, regularization, beam search, and more.

The main system is implemented in the Lua/Torch mathematical framework and can be easily extended using Torch's standard neural network components. Adam Lerer of Facebook Research also extended it to support the Python/PyTorch framework with the same API.

Setting up the required modules

To train your own custom translation system, the main package required is PyTorch, in which the OpenNMT model is implemented.

First, of course, clone the OpenNMT-py repository:

git clone https://github.com/OpenNMT/OpenNMT-py
cd OpenNMT-py

A requirements.txt file collects all the necessary packages:

six
tqdm
torch>=0.4.0
git+https://github.com/pytorch/text
future

As PyTorch continues to evolve, we recommend pinning to PyTorch v0.4 to ensure stable performance of the code base.

Run the following command to automatically install the required dependencies:

pip install -r requirements.txt
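As a quick sanity check (not part of the official setup steps), you can verify that PyTorch is importable and reports the expected version:

python -c "import torch; print(torch.__version__)"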

Collecting data sets

The dataset consists of a parallel corpus of source- and target-language files, each containing one sentence per line, with tokens separated by spaces.

For this tutorial, we use a parallel corpus of English and Hindi sentences stored in separate files. The data was collected and merged from a variety of sources, then shuffled to create the following file sets:

  • src-train.txt: training file containing 10,000 English (source-language) sentences
  • tgt-train.txt: training file containing 10,000 Hindi (target-language) sentences
  • src-val.txt: validation data consisting of 1,000 English (source-language) sentences
  • tgt-val.txt: validation data consisting of 1,000 Hindi (target-language) sentences
  • src-test.txt: test evaluation data consisting of 1,000 English (source-language) sentences
  • tgt-test.txt: test evaluation data consisting of 1,000 Hindi (target-language) sentences

All of the above files are placed in the data/ directory.

Note: In this tutorial, we use only a small amount of data for illustration and experimentation. However, it is recommended to use a large corpus with millions of sentences, so the model can learn a large vocabulary of unique words and produce translations closer to human quality.

Validation data is used to evaluate the model at each step and identify the convergence point. It typically contains up to 5,000 sentences.
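If your corpus starts as two line-aligned files, a minimal sketch like the following can produce the split above. The input file names english.txt and hindi.txt are hypothetical; adjust them to your own corpus:

# A minimal sketch for splitting an aligned parallel corpus into the file sets above.
# Assumes english.txt and hindi.txt are line-aligned (hypothetical file names).
import random

with open("english.txt", encoding="utf-8") as f_src, \
     open("hindi.txt", encoding="utf-8") as f_tgt:
    pairs = list(zip(f_src, f_tgt))

random.seed(42)          # make the shuffle reproducible
random.shuffle(pairs)

splits = {"train": pairs[:10000],
          "val": pairs[10000:11000],
          "test": pairs[11000:12000]}

for name, subset in splits.items():
    with open(f"data/src-{name}.txt", "w", encoding="utf-8") as out_src, \
         open(f"data/tgt-{name}.txt", "w", encoding="utf-8") as out_tgt:
        for src_line, tgt_line in subset:
            out_src.write(src_line)
            out_tgt.write(tgt_line)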

Here is an example of how the text data is arranged in the corresponding files:

Source file:
They also bring out a number of Tamil weekly newspapers.
They are a hard working people and most of them work as labourers.
Tamil films are also shown in the local cinema halls.
There are quite a large number of Malayalees living here.

Target file:
तमिल भाषा में वे अनेक समाचार पत्र पत्रिकाएं भी निकालते हैं .
ये लोग काफी परिश्रमी हैं , अधिकांश लोग मजदूरी करते हैं .
स्थानीय सिनेमा हालों में तमिल चलचित्रों का प्रदर्शन अक्सर किया जाता है .
मलयालम लोगों की बहुत बडी संख्या है .

Preprocessing text data

Run the following command to preprocess the training and validation data, extract training features, and generate vocabulary files for the model:

python preprocess.py \
    -train_src data/src-train.txt \
    -train_tgt data/tgt-train.txt \
    -valid_src data/src-val.txt \
    -valid_tgt data/tgt-val.txt \
    -save_data data/demo
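If preprocessing succeeds, the data/ directory should contain serialized training and validation sets plus a vocabulary file, all named with the demo prefix passed to -save_data (exact file names vary across OpenNMT-py versions).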

Training the translation model

The main training command is very easy to use. It takes the preprocessed data files and a save prefix for the model as input.

A summary of the default model used is as follows:

NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(20351, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(20570, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(1000, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
      (softmax): Softmax()
      (tanh): Tanh()
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=500, out_features=20570, bias=True)
    (1): LogSoftmax()
  )
)
Start training with the following command:

python train.py -data data/demo -save_model demo-model

The above command runs the default model, consisting of a two-layer LSTM with 500 hidden units for both the encoder and the decoder. To use a GPU and speed up training, specify the -gpuid parameter (for example, -gpuid 1 to use GPU 1).

The default model trains for 100,000 steps, saving a checkpoint every 5,000 steps. If the model converges and validation accuracy plateaus earlier, you can stop training and use the most recently saved checkpoint.

Translate your own data


You can run inference on unseen source-language (English) text and generate predicted translations by executing the following command:

python translate.py -model demo-model_XYZ.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose

The command will generate the translated output and save the predictions to a file named pred.txt.
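As an optional illustration (not part of the toolkit), a short Python snippet can print the predictions next to their source sentences for quick inspection:

# Print the first few predictions alongside their source sentences.
with open("data/src-test.txt", encoding="utf-8") as src_f, \
     open("pred.txt", encoding="utf-8") as pred_f:
    for i, (src, pred) in enumerate(zip(src_f, pred_f)):
        print("SOURCE   :", src.strip())
        print("PREDICTED:", pred.strip())
        if i == 4:  # stop after five pairs
            break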

The model was trained for 10,000 steps on an NVIDIA GeForce GPU with 2 GB of memory. Training on a CPU is computationally expensive; a high-end GPU is recommended for training on large amounts of data at reasonable speed.

Sample prediction of the model

The following are a few examples of Hindi translations generated for the corresponding English sentences after training the model.

SOURCE    : Trees are capable of absorbing more of carbon dioxide, thus maintaining equilibrium in the air composition
PREDICTED : पेडों में कार्बन डाईआक्साइड के बुरे लोग इस प्रकार पेड - पौधे का प्रयोग करने के लिए मौजूद हैं .

SOURCE    : He has hope that the gods will throw good things from the heavens, upon them
PREDICTED : वे उमीद है कि वे घर से कुछ नहीं बची हैं

SOURCE    : The Buddhist temple , the Dalai Lama Palace and dispensary of Tibet are tourist attractions here
PREDICTED : यहां का बौद्ध मंदिर दलाई लामा का आवास तथा तिब्बती औषधालय स्थानिय लोगो में मिलता है .

SOURCE    : He lets the hair grow long.
PREDICTED : वह अपने बढा लेता है .

As shown above, the predictions are not good enough, because of the limited training data. To produce translations close to real-world quality, the model needs to be trained on a large vocabulary and roughly a million sentences, though this significantly increases hardware requirements and training time.

Evaluating your trained model

BLEU (Bilingual Evaluation Understudy) is an evaluation metric for machine translation systems that compares generated sentences against reference sentences.

A perfect match yields a BLEU score of 1.0, while a complete mismatch yields 0.0.

BLEU is a common metric for evaluating translation models because it is language independent, easy to interpret, and correlates strongly with human evaluation.

The BLEU score was introduced by Kishore Papineni et al. in the paper "BLEU: a Method for Automatic Evaluation of Machine Translation".

The BLEU score is computed by matching n-grams in the candidate translation against n-grams in the reference text. Word order is not considered in this comparison.

So what is an n-gram? A 1-gram (unigram) is a single token, while a bigram (2-gram) is a pair of adjacent words.
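As an illustration (not part of the OpenNMT toolkit), a corpus-level BLEU score can be computed with NLTK's implementation. This sketch assumes nltk is installed, whitespace tokenization, and the tutorial's file names for the reference and predicted translations:

# Compute corpus-level BLEU over the test set predictions using NLTK.
from nltk.translate.bleu_score import corpus_bleu

# corpus_bleu expects, for each hypothesis, a list of tokenized references.
with open("data/tgt-test.txt", encoding="utf-8") as ref_f:
    references = [[line.split()] for line in ref_f]
with open("pred.txt", encoding="utf-8") as hyp_f:
    hypotheses = [line.split() for line in hyp_f]

print("Corpus BLEU:", corpus_bleu(references, hypotheses))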

