Title: Neural Machine Translation by Jointly Learning to Align and Translate
Authors: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
Published: 1st September 2014 (Monday) @ 16:33:02
Link: http://arxiv.org/abs/1409.0473v7

Abstract

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.


Neural Machine Translation by Jointly Learning to Align and Translate - Notes

An explanation of (soft) attention as devised in Dzmitry Bahdanau, KyungHyun Cho and Yoshua Bengio’s 2015 ICLR paper Neural Machine Translation by Jointly Learning to Align and Translate.

  • NMT uses encoder-decoder end-to-end approach
  • fixed-length vector is bottleneck in enc-dec paradigm: propose (soft) attention
  • MT performance on par with SotA phrase-based English-French system
  • soft-alignments agree with intuition

Introduction

  • encoder-decoder jointly trained to maximise probability of correct translation given source sentence
  • input sentence length degrades translation quality (esp. for sentences longer than those in the training corpus)
  • propose: align and translate jointly - soft-search the source sentence for relevant information
    • predict based on context vectors associated with source positions and all previously generated target words
    • encodes input to sequence of vectors (matrix) and selects subset of these adaptively when decoding
    • improvements more apparent with longer sentences

Background: NMT

  • translation: find $\arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$ with source and target sentences $\mathbf{x}$ and $\mathbf{y}$, and $p(\mathbf{y} \mid \mathbf{x})$ learned from a parallel corpus.
  • SotA improved by adding neural components to phrase-based method e.g. score phrase pairs in phrase table (Cho et al. 2014a) or re-rank candidate translations (Sutskever et al. 2014)
  • [explains Sutskever et al. 2014 seq2seq modelling with RNNs in terms of predicting a sequence of output symbols (words) over time; each prediction is conditioned on the previously generated symbols (the ground-truth previous words under teacher forcing), the decoder hidden state and the context vector - see the formulas after this list]
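
For reference, the generic RNN encoder-decoder (Section 2 of the paper, the setup that attention later modifies) reads the source into a fixed-length summary and factorises the translation probability as:

```latex
% Encoder reads x = (x_1, ..., x_{T_x}) into hidden states and a fixed-length summary c
h_t = f(x_t, h_{t-1}), \qquad c = q\big(\{h_1, \dots, h_{T_x}\}\big)

% Decoder predicts each target word given c and all previously predicted words
p(\mathbf{y}) = \prod_{t=1}^{T_y} p\big(y_t \mid \{y_1, \dots, y_{t-1}\}, c\big),
\qquad
p\big(y_t \mid \{y_1, \dots, y_{t-1}\}, c\big) = g(y_{t-1}, s_t, c)
```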

Learning to Align and Translate

  • encoder: bidirectional RNN
  • decoder emulates searching through source sentence during translation
  • each conditional probability is defined as $p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)$ with decoder state $s_i = f(s_{i-1}, y_{i-1}, c_i)$, so the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$
    • context vector $c_i$ is a weighted sum of the annotations $h_j$ weighted by attention values $\alpha_{ij}$: $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$
    • Each annotation $h_j$ is a latent representation of the whole input sequence produced by the encoder with focus on the context around the $j$-th word
    • attention values $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$ are a softmax (over the input sequence positions $j$) of alignment scores (energies) $e_{ij} = a(s_{i-1}, h_j)$, scoring how well the inputs around position $j$ and the output at position $i$ match, given the previous (decoder) RNN hidden state $s_{i-1}$ and the $j$-th latent representation (“annotation” $h_j$) of the input sentence (see the sketch after this list)
    • alignment model is a feedforward NN trained jointly with the other components
    • alignment is not considered a latent variable; a soft alignment is computed and trained with backpropagation from the cost function
    • soft alignment interpretation: $c_i$ is the expected annotation over possible alignments, i.e. $\alpha_{ij}$ is the probability that target word $y_i$ is aligned to source (input) word $x_j$
    • attention: allows selective retrieval of information from source sequence of annotations (latent states)
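
A minimal numpy sketch of one decoder time step of this attention mechanism, using the additive alignment model $a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$ from Appendix A of the paper; the dimensions and the random toy inputs below are illustrative, not the paper's actual settings, and only the forward computation is shown (in the paper these parameters are trained jointly with the rest of the network):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(s_prev, H, W_a, U_a, v_a):
    """One soft-attention step (additive alignment).

    s_prev : (n,)      previous decoder hidden state s_{i-1}
    H      : (T_x, 2n) annotations h_1 .. h_{T_x} from the bidirectional encoder
    W_a    : (p, n), U_a : (p, 2n), v_a : (p,)   alignment-model parameters
    Returns the context vector c_i and the attention weights alpha_i.
    """
    # Energies e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), one per source position j.
    e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a   # (T_x,)
    alpha = softmax(e)                            # attention weights over source positions
    c = alpha @ H                                 # context vector: weighted sum of annotations
    return c, alpha

# Toy usage with random parameters.
rng = np.random.default_rng(0)
n, p, T_x = 4, 3, 5
c, alpha = attention_step(rng.normal(size=n), rng.normal(size=(T_x, 2 * n)),
                          rng.normal(size=(p, n)), rng.normal(size=(p, 2 * n)),
                          rng.normal(size=p))
print(alpha.sum())  # ~1.0: the weights form a distribution over source positions
```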

Encoder: Bidirectional RNN for Annotating Sequences

Annotations for each word are obtained by concatenating the forward and backward hidden states from the Bi-RNN (latent summaries of the preceding and following words, respectively)
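
In the paper's notation, each annotation stacks the forward and backward encoder states for the $j$-th word:

```latex
h_j = \Big[\, \overrightarrow{h}_j^{\top} \; ; \; \overleftarrow{h}_j^{\top} \,\Big]^{\top}
% \overrightarrow{h}_j summarises x_1 .. x_j, \overleftarrow{h}_j summarises x_j .. x_{T_x}
```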

Experiment Settings

Dataset

  • ACL WMT ‘14 English-French parallel corpora (see paper for data details)
  • for comparison, the RNN Encoder-Decoder of Cho et al. 2014a is trained and reported under the same conditions (ceteris paribus)
  • vocabulary: the 30,000 most frequent words of each language; any other word is mapped to <UNK> for training

Models

  • Two models: RNNencdec from Cho et al. 2014a (enc and dec 1000 units each) and RNNsearch (theirs; enc 1000 units for forward, 1000 for backward; dec 1000 units)
  • a multilayer network with a single maxout hidden layer computes the conditional probability of each target word (see the sketch after this list)
  • train each model twice: (1) with sentences of up to 30 words; (2) …up to 50 words
  • optimisation: SGD + Adadelta
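
A rough numpy sketch of that output layer: maxout units take a max over pairs of pre-activations built from the decoder state, the previous target word embedding and the context vector, followed by a softmax over the target vocabulary. The parameter names echo the deep-output parametrisation in Appendix A of the paper, but the shapes, toy inputs and exact wiring (e.g. which decoder state is fed in) are illustrative assumptions:

```python
import numpy as np

def maxout_output(s, Ey_prev, c, U_o, V_o, C_o, W_o):
    """Conditional distribution over target words via a single maxout hidden layer.

    s       : (n,)   decoder hidden state
    Ey_prev : (m,)   embedding of the previous target word
    c       : (2n,)  context vector from the attention mechanism
    U_o : (2l, n), V_o : (2l, m), C_o : (2l, 2n), W_o : (K, l)
    Returns a probability vector over the K target words.
    """
    t_pre = U_o @ s + V_o @ Ey_prev + C_o @ c      # (2l,) pre-activations
    t = t_pre.reshape(-1, 2).max(axis=1)           # maxout: max over consecutive pairs -> (l,)
    logits = W_o @ t                               # (K,)
    z = logits - logits.max()
    return np.exp(z) / np.exp(z).sum()             # softmax over the target vocabulary

# Toy usage with random parameters.
rng = np.random.default_rng(1)
n, m, l, K = 4, 3, 5, 7
probs = maxout_output(rng.normal(size=n), rng.normal(size=m), rng.normal(size=2 * n),
                      rng.normal(size=(2 * l, n)), rng.normal(size=(2 * l, m)),
                      rng.normal(size=(2 * l, 2 * n)), rng.normal(size=(K, l)))
print(probs.sum())  # ~1.0
```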

See also Appendices A and B.

Results

  • BLEU: RNNsearch outperforms RNNencdec in all cases
  • as high as the phrase-based SMT system (Moses) when evaluated on sentences containing only known words; NB Moses additionally uses a monolingual corpus

Include Figure 2: Performance versus sentence length

  • RNNsearch-50 shows no performance deterioration even for sentences of up to 60 words; RNNsearch-30 is also more robust than RNNencdec (its BLEU degrades more slowly with sentence length)
  • RNNsearch-30 outperforms RNNencdec-50 (attention is useful)

Qualitative Analysis: Alignment

Include Figure 3: Alignment (attention) heatmaps

  • Attention is useful for non-trivial, non-monotonic translations e.g. [European Economic Area] → [zone économique européenne]
  • soft alignment is more useful than a hard alignment e.g. for the l’ in [the man] to [l’homme], since lookahead (to the following word, man) must be employed to determine whether le, la, les or l’ is appropriate (again non-monotonicity)
    • it also naturally handles source and target phrases of different lengths, without requiring that some words map to or from nothing ([NULL])

Qualitative Analysis: Long Sentences

RNNsearch does not require encoding a long sentence into a fixed-length vector perfectly, but only accurately encoding the parts of the input sentence that surround a particular word.

Example

English Source:

An admitting privilege is the right of a doctor to admit a patient to a hospital or a medical centre to carry out a diagnosis or a procedure, based on his status as a health care worker at a hospital

French Output from RNNencdec-50:

Un privilège d’admission est le droit d’un médecin de reconnaître un patient à l’hôpital ou un centre médical d’un diagnostic ou de prendre un diagnostic en fonction de son état de santé.

French Output from RNNsearch-50:

Un privilège d’admission est le droit d’un médecin d’admettre un patient à un hôpital ou un centre médical pour effectuer un diagnostic ou une procédure, selon son statut de travailleur des soins de santé à l’hôpital.

[Add Example 2]

Related Work

Learning to Align

Graves 2013 used alignment for handwriting synthesis, using a mixture of Gaussian kernels to compute annotation weights, but such that the location (of the alignment) moves monotonically. This is limiting for translation, which requires non-monotonic alignment. A drawback of the approach proposed here is that it computes an annotation weight for every combination of source and output (translated) words, i.e. it is of quadratic complexity. This is not limiting for sentence translation (15-40 words) but may be for other applications.

Neural Networks for Machine Translation

Bengio et al. (2003) introduced neural network language models, which led to neural networks being used in machine translation. Until now, however, their role was largely limited to substituting or augmenting components of statistical machine translation (SMT) systems. This paper provides a genuine end-to-end NMT system.

Conclusion

  • Cho et al. 2014b and Pouget-Abadie et al. 2014 provide empirical evidence of long sentences being a problem for preexisting NMT systems

Appendix A: Model Architecture

Forthcoming

References and Notes