Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

This paper describes in detail Google’s newest machine translation system, which is based on deep learning (Neural Machine Translation, NMT). It puts most of its weight on the techniques and tricks used to build a production system.

Three inherent weaknesses of Neural Machine Translation that GNMT aims to address:

  1. slower training and inference speed
  2. ineffectiveness in dealing with rare words
  3. occasional failure to translate all words in the source sentence



The main components of the GNMT model:

  1. 8-layer LSTM RNNs (in both the encoder network and the decoder network)
  2. residual connections
  3. attention (attention network)

deep stacked LSTM

Since [1] showed that each additional layer reduces perplexity by nearly 10%, this model stacks 8 LSTM layers in both the encoder and the decoder. The first encoder layer is bidirectional, which gives the best possible context at each point in the encoder network; this is also done in [2]. To allow for maximum possible parallelization during computation, bidirectional connections are used only for the bottom encoder layer; all other encoder layers are unidirectional.
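To make the layer layout concrete, here is a toy sketch of an encoder whose bottom layer is bidirectional and whose remaining layers are unidirectional. Everything in it is an assumption for illustration: `run_rnn` is a scalar tanh recurrence standing in for an LSTM layer, the weights `w` and `u` are arbitrary, and the forward and backward states are summed to keep the toy scalar (a real model would more likely concatenate them).

```python
import math

def run_rnn(inputs, w=0.8, u=0.5):
    # Hypothetical stand-in for one unidirectional LSTM layer:
    # a scalar recurrence h_t = tanh(w * x_t + u * h_{t-1}).
    h = 0.0
    outputs = []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        outputs.append(h)
    return outputs

def bidirectional_layer(inputs):
    # Run once left-to-right and once right-to-left, then pair the states
    # so every position sees both past and future context.
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))

def encoder(inputs, num_layers=8):
    # Bottom layer is bidirectional. The two directions are summed here
    # only to keep the toy scalar (an assumption of this sketch).
    states = [f + b for f, b in bidirectional_layer(inputs)]
    # All remaining layers are unidirectional.
    for _ in range(num_layers - 1):
        states = run_rnn(states)
    return states

a = [1.0, 0.5, -0.3]
b = [1.0, 0.5, 0.9]  # same as a except for the last token
```

With a purely unidirectional layer, the state at position 0 cannot see the last token, so `run_rnn(a)[0]` equals `run_rnn(b)[0]`. With the bidirectional bottom layer, `encoder(a)[0]` and `encoder(b)[0]` differ, because the backward pass carries information from the end of the sentence.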

attention network

The attention function is a feed-forward neural network with one hidden layer, learned jointly with the rest of the model.
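A minimal sketch of what an attention function with one hidden layer can look like. The additive form used here (a tanh hidden layer followed by a dot product with a vector `v`) is an assumption, as are all names and the toy weights; the notes above only say that the function is a network with one hidden layer.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_score(query, key, W_q, W_k, v):
    # One hidden layer: hidden = tanh(W_q @ query + W_k @ key),
    # then a scalar score = v . hidden.
    hidden = []
    for row_q, row_k in zip(W_q, W_k):
        pre = sum(w * q for w, q in zip(row_q, query))
        pre += sum(w * k for w, k in zip(row_k, key))
        hidden.append(math.tanh(pre))
    return sum(vi * h for vi, h in zip(v, hidden))

def attend(query, encoder_outputs, W_q, W_k, v):
    # Score every encoder position, normalize, and build the context
    # vector as the weighted average of the encoder outputs.
    scores = [attention_score(query, x, W_q, W_k, v) for x in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * x[d] for w, x in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context

# Toy, hand-picked parameters (hypothetical; a real model learns these).
W_q = [[0.5, -0.3], [0.1, 0.8]]
W_k = [[0.2, 0.4], [-0.6, 0.9]]
v = [1.0, -1.0]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.3, 0.7]  # stands in for the decoder state

weights, context = attend(query, encoder_outputs, W_q, W_k, v)
```

The weights sum to 1, so the context vector is a convex combination of the encoder outputs.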

residual connections

Simple stacked LSTM layers work well up to 4 layers, barely with 6 layers, and very poorly beyond 8 layers. Residual connections between the layers, following [3], are what make the deeper LSTM stack trainable.
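The idea of a residual connection is that each layer's input is added to its output before being passed to the next layer. A toy sketch, assuming a hypothetical `toy_layer` (an elementwise tanh) in place of a real LSTM layer:

```python
import math

def toy_layer(x, weight):
    # Hypothetical stand-in for the output of one LSTM layer.
    return [math.tanh(weight * xi) for xi in x]

def plain_stack(x, weights):
    # Each layer only sees the previous layer's output.
    for w in weights:
        x = toy_layer(x, w)
    return x

def residual_stack(x, weights):
    # Each layer's input is added back to its output, so the signal has
    # a direct path through the whole stack.
    for w in weights:
        out = toy_layer(x, w)
        x = [o + xi for o, xi in zip(out, x)]
    return x

weights = [0.5] * 8          # 8 layers with arbitrary toy weights
plain = plain_stack([1.0], weights)
residual = residual_stack([1.0], weights)
```

In this toy the plain stack squashes the input to nearly zero after 8 layers, while the residual stack keeps it. It only illustrates the wiring; it is not evidence about training behaviour.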

Other techniques

  1. sub-word units (wordpieces)
  2. beam search


The wordpiece technique was originally developed to solve a Japanese/Korean segmentation problem for the Google speech recognition system in [4]. The wordpiece model is data-driven and is guaranteed to generate a deterministic segmentation for any possible sequence of characters. It is similar to the method used in [5] to deal with rare words in Neural Machine Translation.

An example of wordpiece segmentation:

  • Word: Jet makers feud over seat width with big orders at stake
  • Wordpieces: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake

In effect, wordpieces just give the model a finer-grained unit, between single characters and whole words, from which it can learn the subtleties of morphology.
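A small sketch of how a sentence could be split into wordpieces once a vocabulary exists. The greedy longest-match-first strategy and the hand-written `VOCAB` are assumptions made for illustration; the notes above do not say how the real vocabulary is learned or how segmentation is carried out.

```python
def segment(sentence, vocab):
    # Split on whitespace, mark each word start with "_", then greedily
    # take the longest vocabulary entry that matches at each position.
    pieces = []
    for word in sentence.split():
        token = "_" + word
        start = 0
        while start < len(token):
            end = len(token)
            while end > start and token[start:end] not in vocab:
                end -= 1
            if end == start:
                # No vocabulary entry matches: fall back to one character
                # so that every input still gets a segmentation.
                end = start + 1
            pieces.append(token[start:end])
            start = end
    return pieces

def join(pieces):
    # Undo the segmentation: concatenate and turn "_" back into spaces.
    return "".join(pieces).replace("_", " ").strip()

# Hypothetical toy vocabulary covering only the example sentence.
VOCAB = {"_J", "et", "_makers", "_fe", "ud", "_over", "_seat", "_width",
         "_with", "_big", "_orders", "_at", "_stake"}

sentence = "Jet makers feud over seat width with big orders at stake"
pieces = segment(sentence, VOCAB)
print(" ".join(pieces))
# _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
```

Because the word-start marker is kept in the pieces, `join(pieces)` recovers the original sentence without ambiguity.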

Beam search typically keeps 8-12 hypotheses, but the paper finds that using fewer (4 or 2) has only a slight negative effect on BLEU scores.
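For reference, a minimal sketch of plain max-probability beam search over a toy next-token model. The `toy_model` table, the `</s>` end marker and the function names are all hypothetical; the point is only to show what "keeping N hypotheses" means.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    # Each hypothesis is (token tuple, summed log-probability).
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + lp))
        # Keep only the beam_size best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)

def toy_model(prefix):
    # Hypothetical next-token distribution; any unknown prefix ends.
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy_best = beam_search(toy_model, beam_size=1, max_len=5)[0]
beam_best = beam_search(toy_model, beam_size=2, max_len=5)[0]
```

In this toy, a beam of 1 commits to "a" and ends with a sequence of probability 0.3, while a beam of 2 finds "b z" with probability 0.36. Real models are less adversarial, which fits the observation that small beams cost little.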

Refinements to the pure max-probability based beam search algorithm:

  1. a coverage penalty [6]
  2. length normalization

Without some form of length normalization, regular beam search will favor shorter results over longer ones on average, since a negative log-probability is added at each step, yielding lower (more negative) scores for longer sentences.
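Both refinements can be sketched as a rescoring function. The formulas follow the scoring function as I recall it from the GNMT paper: the log-probability is divided by a length penalty lp(Y) = ((5 + |Y|) / 6)^alpha, and a coverage penalty beta * sum_i log(min(sum_j p_ij, 1)) is added. The default `alpha` and `beta` values, the small floor that avoids log(0), and the example numbers are assumptions of this sketch.

```python
import math

def length_penalty(length, alpha):
    # lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def coverage_penalty(attention, beta):
    # attention[j][i] is the attention probability of target word j on
    # source word i. Source words that receive little total attention
    # contribute a negative term.
    num_source = len(attention[0])
    total = 0.0
    for i in range(num_source):
        covered = sum(row[i] for row in attention)
        # The 1e-9 floor is not part of the formula; it only avoids log(0).
        total += math.log(min(max(covered, 1e-9), 1.0))
    return beta * total

def rescore(log_prob, length, attention, alpha=0.2, beta=0.2):
    return log_prob / length_penalty(length, alpha) + coverage_penalty(attention, beta)

# Length normalization: a 4-word and a 10-word hypothesis (toy numbers).
short_raw, long_raw = -4.0, -4.3
short_norm = short_raw / length_penalty(4, alpha=0.2)
long_norm = long_raw / length_penalty(10, alpha=0.2)

# Coverage: two source words, two target words.
full = [[1.0, 0.0], [0.0, 1.0]]      # every source word is attended to
partial = [[0.9, 0.1], [0.9, 0.1]]   # the second source word is mostly ignored
```

On raw log-probability the short hypothesis wins; after normalization the longer one scores higher. The `full` attention matrix gets no coverage penalty, while `partial` is penalized for neglecting a source word.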


Training and inference tricks:

  1. low-precision arithmetic at inference, made possible by constraints applied during training

  2. train first with Adam, then switch to standard SGD
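A rough sketch of why constraining values during training helps low-precision inference: if values are known to stay inside a fixed range, they can be mapped to small integers with a bounded rounding error. The clipping range `delta`, the 8-bit width and the function names are all assumptions here, not details taken from the notes above.

```python
def clip(values, delta):
    # Constrain every value to [-delta, delta], as a training-time
    # constraint might.
    return [max(-delta, min(delta, v)) for v in values]

def quantize(values, delta, bits=8):
    # Map the clipped range onto signed integers (-127..127 for 8 bits).
    levels = 2 ** (bits - 1) - 1
    scale = delta / levels
    return [round(v / scale) for v in clip(values, delta)], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

values = [0.03, -0.5, 0.999, 3.7, -12.0]
delta = 1.0
quantized, scale = quantize(values, delta)
restored = dequantize(quantized, scale)
clipped = clip(values, delta)
```

For values that were already inside the range, the round trip loses at most half a quantization step. Values outside the range are where the error comes from, which is the motivation for keeping them bounded during training.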


