LSTMs + Grammar as a Foreign Language
A short explanation of long short-term memory networks (LSTMs), a form of recurrent neural network (RNN), and a breakdown of Vinyals et al. (2015) Grammar as a Foreign Language, which uses LSTMs with attention to perform syntactic constituency parsing.
In their paper Grammar as a Foreign Language from 2015, Oriol Vinyals and colleagues1 use a sequence-to-sequence model with attention to accomplish syntactic constituency parsing, a problem in computational linguistics (or natural language processing, NLP) wherein for a given input text, we would like to return the grammatical structure of that text as a syntax tree or similar.
The approach they used is useful in demonstrating how LSTMs work.
They devise a model structure which they refer to as the LSTM+A Parsing Model (“A” for Attention) and begin by recapping Sepp Hochreiter and Jürgen Schmidhuber’s famous Long short-term memory network from 19972.
LSTMs
LSTMs are a form of recurrent neural network, meaning that they maintain a hidden state vector which is carried forward and used, together with each new input, to compute an output at every time step of a sequence.
The LSTM network has inputs $x_t$, hidden states $h_t$ and memory states $m_t$, indexed over time $t \in 1, \dots, T$. Given the input sequence $(x_1, \dots, x_T)$, the LSTM computes the sequence of hidden states $(h_1, \dots, h_T)$ and memory states $(m_1, \dots, m_T)$ as follows.
\[\begin{aligned} i_{t} &= \sigma \left(W_{1} x_{t}+W_{2} h_{t-1}\right) \\ i_{t}^{\prime} &=\tanh \left(W_{3} x_{t}+W_{4} h_{t-1}\right) \\ f_{t} &= \sigma \left(W_{5} x_{t}+W_{6} h_{t-1}\right) \\ o_{t} &= \sigma \left(W_{7} x_{t}+W_{8} h_{t-1}\right) \\ m_{t} &=m_{t-1} \odot f_{t}+i_{t} \odot i_{t}^{\prime} \\ h_{t} &=m_{t} \odot o_{t} \end{aligned}\]The operator $\odot$ denotes element-wise multiplication, the matrices $W_{1}, \dots, W_{8}$ and the vector $h_{0}$ are the parameters of the model, and the sigmoid, $\sigma$, and hyperbolic tangent, $\tanh$, nonlinearities are computed element-wise.
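To make the update rule concrete, here is a minimal NumPy sketch of a single LSTM time step that follows the equations above directly. As in the equations, bias terms are omitted (practical implementations such as PyTorch's nn.LSTM include them), and the dimensions, initialization and dictionary of matrices $W_1, \dots, W_8$ are illustrative choices rather than anything from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, m_prev, W):
    """One LSTM time step, mirroring the equations above.
    W is a dict holding the matrices W1..W8 (no biases, as in the paper's notation)."""
    i_t       = sigmoid(W["W1"] @ x_t + W["W2"] @ h_prev)   # input gate
    i_t_prime = np.tanh(W["W3"] @ x_t + W["W4"] @ h_prev)   # candidate / intermediate quantity
    f_t       = sigmoid(W["W5"] @ x_t + W["W6"] @ h_prev)   # forget gate
    o_t       = sigmoid(W["W7"] @ x_t + W["W8"] @ h_prev)   # output gate
    m_t = m_prev * f_t + i_t * i_t_prime                    # memory / cell state
    h_t = m_t * o_t                                         # hidden state
    return h_t, m_t

# Illustrative sizes: 10-D inputs, 20-D hidden and memory states.
d_x, d_h = 10, 20
rng = np.random.default_rng(0)
W = {f"W{k}": 0.1 * rng.standard_normal((d_h, d_x if k in (1, 3, 5, 7) else d_h))
     for k in range(1, 9)}

h, m = np.zeros(d_h), np.zeros(d_h)                 # h_0 and m_0
xs = [rng.standard_normal(d_x) for _ in range(5)]   # a length-5 input sequence
hidden_states = []
for x in xs:
    h, m = lstm_step(x, h, m, W)
    hidden_states.append(h)
```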
Intuition
Personally, I think this is more intuitively broken down into three types of quantity.
1. Multiplicative gating components:
\[\begin{aligned} i_{t} &= \sigma \left(W_{1} x_{t}+W_{2} h_{t-1}\right) \\ f_{t} &= \sigma \left(W_{5} x_{t}+W_{6} h_{t-1}\right) \\ o_{t} &= \sigma \left(W_{7} x_{t}+W_{8} h_{t-1}\right) \\ \end{aligned}\]These are vectors with elements in the range $[0, 1]$, as we can see from the fact that they result from a sigmoid activation function, and they multiply the substantive quantities element-wise to allow varying amounts of them through to subsequent time steps.
The substantive quantities are the rest.
2. The intermediate quantity, $i_t^\prime$:
\[i_{t}^{\prime} =\tanh \left(W_{3} x_{t}+W_{4} h_{t-1}\right)\]This is a single intermediate quantity that results from the input at time $t$, $x_t$, and the previous hidden state, $h_{t-1}$. It contributes to the contemporaneous memory (cell) state, $m_t$, directly and to the contemporaneous hidden state, $h_t$, indirectly via the former, as shown in the last two equations below.
By contrast, we can see that it is computed using a hyperbolic tangent activation function, returning elements with values in the range $[-1, 1]$.
3. The focal outputs for a given timestep, the hidden state, $h_t$ and memory - or cell - state, $m_t$:
\[\begin{aligned} m_{t} &=m_{t-1} \odot f_{t} + i_{t} \odot i_{t}^{\prime} \\ h_{t} &=m_{t} \odot o_{t} \end{aligned}\]These are the quantities of interest: the hidden state, $h_t$, is the output of the model at time $t$ and may be used to perform classification or another task, while the memory (cell) state, $m_t$, carries information forward; both inform the rest of the sequence, since LSTMs, and RNNs in general, are autoregressive.
We can see that the current cell state is the sum of the previous cell state gated by the $f_t$ gating vector, $m_{t-1} \odot f_{t}$, and the intermediate quantity, $i_{t}^{\prime}$, gated by $i_{t}$.
The current hidden state is straightforwardly a gated version of the cell state with the $o_t$ gating vector.
For more intuition on LSTMs consult Chris Olah’s Understanding LSTM Networks, Andrej Karpathy’s The Unreasonable Effectiveness of Recurrent Neural Networks and Afshine and Shervine Amidi’s Recurrent Neural Networks cheatsheet. For some more concrete details, see the PyTorch documentation for LSTM modules.
Deep LSTMs
In a deep LSTM, each subsequent layer uses the $h$-sequence of the previous layer as its input sequence $x$, as shown in the figure below3.
Information Flow in a Deep Recurrent Neural Network [Image Source: Figure 1. Ahuja and Morency (2017) Lattice Recurrent Unit: Improving Convergence and Statistical Efficiency for Sequence Modeling]
Note that multi-layer LSTMs differ from regular fully-connected neural networks / multilayer perceptrons (MLPs) in that, in addition to the connections between layers, each layer has recurrent connections to itself across time steps.
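As a quick sketch of this stacking, assuming PyTorch (the layer count, sizes and random inputs here are illustrative), each layer simply consumes the full hidden-state sequence produced by the layer below:

```python
import torch
import torch.nn as nn

# Three stacked LSTM layers: layer l+1 takes layer l's h-sequence as its input sequence.
# (nn.LSTM(..., num_layers=3) wires up exactly this internally.)
layers = nn.ModuleList([
    nn.LSTM(input_size=10 if l == 0 else 20, hidden_size=20, batch_first=True)
    for l in range(3)
])

x = torch.randn(4, 7, 10)                # batch of 4 sequences, 7 time steps, 10-D inputs
h_seq = x
for lstm in layers:
    h_seq, (h_T, m_T) = lstm(h_seq)      # h_seq: (4, 7, 20), fed on to the next layer
# h_seq now holds the topmost layer's hidden state at every time step
```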
Grammar as a Foreign Language
Now we’ll continue discussing the application of deep LSTMs to syntactic structure parsing from Vinyals and colleagues’ paper.
They use a deep LSTM on the input sentence (sequence) which defines a distribution over output sequences:
\[\begin{aligned} P(B \mid A) &=\prod_{t=1}^{T_{B}} P\left(B_{t} \mid A_{1}, \ldots, A_{T_{A}}, B_{1}, \ldots, B_{t-1}\right) \\ & \equiv \prod_{t=1}^{T_{B}} \operatorname{softmax}\left(W_o \cdot h_{T_{A}+t}\right)^{\top} \delta_{B_{t}} \end{aligned}\]The above equation assumes a deep LSTM whose input sequence is $x= \left(A_{1}, \ldots, A_{T_{A}}, B_{1}, \ldots, B_{T_{B}}\right)$, so $h_{t}$ denotes the $t$-th element of the $h$-sequence of the topmost LSTM layer, as per the figure above.
The matrix $W_o$ contains vector representations of each parsing symbol in the output vocabulary, and $\delta_{B_t}$ is a Kronecker delta (indicator) vector which selects the $B_t$-th element of the softmax probability distribution.
The authors use different LSTMs (“different sets of LSTM parameters”) for the input and output sequences so this model constitutes a classic encoder-decoder architecture4, where the input is encoded by one sequence model (one LSTM in this case) and then the representation(s) produced are decoded by another sequence model (another LSTM with the same architecture in this case).
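As a rough PyTorch sketch of this encoder-decoder setup (and of the training loss described just below), not the authors' implementation: two separate 3-layer LSTMs are used for the input and output sequences, the decoder is run with teacher forcing, and the Kronecker-delta selection becomes a gather of the log-probability assigned to each correct output symbol. The vocabulary sizes and hidden size echo the paper, but the names, toy batch shapes and the treatment of the first output symbol as a given start token are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V_in, V_out, d = 90_000, 128, 256            # input vocab, output symbols, hidden size

embed_in  = nn.Embedding(V_in, d)
embed_out = nn.Embedding(V_out, d)
encoder = nn.LSTM(d, d, num_layers=3, batch_first=True)   # one set of LSTM parameters...
decoder = nn.LSTM(d, d, num_layers=3, batch_first=True)   # ...and a separate set
W_o = nn.Linear(d, V_out, bias=False)                     # output-symbol representations

def log_prob_of_output(A, B):
    """log P(B | A): encode the sentence A, then decode the linearized parse B."""
    _, state = encoder(embed_in(A))                  # final (h, m) states summarize A
    h_seq, _ = decoder(embed_out(B[:, :-1]), state)  # teacher forcing on B_1 .. B_{T_B - 1}
    log_p = F.log_softmax(W_o(h_seq), dim=-1)        # (batch, T_B - 1, V_out)
    # "Kronecker delta": pick out the log-probability of the actual next symbol B_t
    picked = log_p.gather(-1, B[:, 1:].unsqueeze(-1)).squeeze(-1)
    return picked.sum(dim=-1)                        # sum over time = log of the product

A = torch.randint(0, V_in, (2, 12))      # toy batch of input token ids
B = torch.randint(0, V_out, (2, 9))      # toy batch of linearized-parse symbol ids
loss = -log_prob_of_output(A, B).mean()  # average negative log-probability, as in training
```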
They train the model via stochastic gradient descent (SGD), maximizing the average over the training set of the log probability of the correct output sequence given the input sequence. As an instance of a recurrent neural network, their LSTM computes the loss for a single training sample as the sum of the per-time-step losses. In general:
\[\mathcal{L}(\hat{y}, y) = \sum_{t = 1}^{T} \mathcal{L}(\hat{y}_t, y_t)\]
Attention
Vinyals and co. supplement their LSTM model with attention, adopting the approach taken in the landmark paper Neural machine translation by jointly learning to align and translate by Dzmitry Bahdanau and colleagues5.
They denote the encoder LSTM hidden states as $(h_1, \dots, h_{T_A})$ and denote the decoder LSTM states instead as $(d_1, \dots, d_{T_B}) := (h_{T_A + 1}, \dots, h_{T_A + T_B})$.
The attention mechanism computes an attention vector, $a^t$, at each output time step $t$ over the input positions $\left(1, \ldots, T_{A}\right)$, and this vector weights a linear combination of the encoder hidden states to provide an additional decoder state vector, $d_t’$.
\[\begin{aligned} u_{i}^{t} &=v^{T} \tanh \left(W_{1}^{\prime} h_{i}+W_{2}^{\prime} d_{t}\right) \\ a_{i}^{t} &=\operatorname{softmax}\left(u_{i}^{t}\right) \\ d_{t}^{\prime} &=\sum_{i=1}^{T_{A}} a_{i}^{t} h_{i} \end{aligned}\]The vector $v$ and matrices $W_{1}^{\prime}, W_{2}^{\prime}$ are learnable parameters of the model.
The vector $u^{t}$ has length $T_{A}$ and its $i$-th item contains a score of how much attention should be put on the $i$-th hidden encoder state $h_{i}$. These scores are normalized by softmax to create the attention mask $a^{t}$ over encoder hidden states.
Clarification of Attention Mechanism
Note that what is represented in the equations, $u_i^t$ and $a_i^t$, are single elements of the attention vector before and after softmax normalization.
It is these elements (indexed over $i$) that are formed by multiplying the encoder hidden state, $h_i$, by $W_1’$ and summing this with a linear transformation of the current decoder state, $W_2’ \cdot d_t$, to yield a 256-D vector. This vector is then passed element-wise through a $\tanh$ activation and the resulting vector is multiplied by $v^T$ to give a scalar, which is the single entry $u_i^t$ of the score vector $u^t$.
Perhaps misleadingly, the softmax is applied over the index of encoder hidden states, $i$, so the second equation might be better expressed as $a^t = \operatorname{softmax}(u^t)$, or even $a^t = \operatorname{softmax}_i(u^t_i)$.
They concatenate the basic and attention-based hidden states, $d_{t}$ and $d_{t}^{\prime}$, to give the decoder hidden state and use this to make predictions and feed into the next time step of the recurrent model.
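Putting these steps into code, here is a minimal PyTorch sketch of the attention computation for a single decoder time step; the hidden size of 256 matches the paper, but the parameter names, toy tensors and unbatched treatment are simplifications of my own.

```python
import torch
import torch.nn as nn

d = 256                              # hidden size used in the paper

W1p = nn.Linear(d, d, bias=False)    # W'_1, applied to each encoder state h_i
W2p = nn.Linear(d, d, bias=False)    # W'_2, applied to the current decoder state d_t
v   = nn.Parameter(torch.randn(d))   # the vector v

def attend(H, d_t):
    """H: (T_A, d) encoder hidden states; d_t: (d,) decoder state at output step t."""
    u_t = torch.tanh(W1p(H) + W2p(d_t)) @ v   # (T_A,) scores u_i^t, one per encoder state
    a_t = torch.softmax(u_t, dim=0)           # attention mask a^t over i = 1 .. T_A
    d_t_prime = a_t @ H                       # weighted sum of encoder states
    return torch.cat([d_t, d_t_prime])        # concatenation used for prediction / next step

H   = torch.randn(12, d)                  # toy encoder states for a 12-token input
d_t = torch.randn(d)                      # toy decoder state
state_for_prediction = attend(H, d_t)     # shape (2 * d,)
```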
Attention Matrix During Syntactic Constituency Parsing [Image Source: Figure 4. Vinyals et al. (2015) Grammar as a foreign language.]
The attention mechanism can be seen to move when a terminal node (word / leaf) is consumed for the resolution of a subtree (phrase, or grammatical constituent).
From the attention matrix, where each column is the attention vector over the inputs, it is clear that the model focuses quite sharply on one word as it produces the parse tree. It is also clear that the focus moves from the first word to the last monotonically, and steps to the right deterministically when a word is consumed.
…This stack procedure is learned from data (as all the parameters are randomly initialized), but is not quite a simple stack decoding. Indeed, at the input side, if the model focuses on position $i$, that state has information for all words after $i$ (since we also reverse the inputs). It is worth noting that, in some examples (not shown here), the model does skip words.
Grammar as a Foreign Language - Details
In order to apply the model above, sequence output is needed, but syntactic constituencies are tree structures. However, we can represent trees using opening and closing tags, as in the HTML document object model for example. The authors “linearize” the parse tree in depth-first traversal order in this way, as shown.
Example parsing task and its linearization [Image Source: Figure 2. Vinyals et al. (2015) Grammar as a foreign language.]
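To make the linearization concrete, here is a small Python sketch; the nested-tuple tree encoding and the helper name are my own illustrative choices, and the toy sentence is meant to mirror the kind of example shown in the figure. Words are dropped from the output, since they are recovered from the input sentence rather than predicted.

```python
# A tiny constituency tree as nested tuples: (label, children...), where a
# preterminal holds its word as a single string child.
tree = ("S",
        ("NP", ("NNP", "John")),
        ("VP", ("VBZ", "has"), ("NP", ("DT", "a"), ("NN", "dog"))),
        (".", "."))

def linearize(node):
    """Depth-first traversal emitting opening/closing tags for each constituent
    and the POS tag (only) for each preterminal."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [label]                       # preterminal: emit its POS tag
    out = [f"({label}"]                      # opening tag on the way down
    for child in children:
        out.extend(linearize(child))
    out.append(f"){label}")                  # closing tag on the way back up
    return out

print(" ".join(linearize(tree)))
# (S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S
```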
- Their 3-layer, 256-unit (per layer) LSTM has a vocabulary of 90,000 and outputs 128 symbols corresponding to entities in the linearized parse tree
- Dropout was used between the LSTMs when training on small datasets
- POS-tags were all normalized to the same symbol, which surprisingly improved results (usually having POS-tags in the training data improves performance in constituency parsing)
- They reversed the input sentences since this helped performance (i.e. performance helped this since sentences input the reversed They)
- They used 512-D pretrained word2vec word embeddings, which remained trainable during training, and used <UNK> for all out-of-vocabulary words
The authors use two training datasets:
- The WSJ dataset of 40K sentences.
- A new ~11M parsed sentence dataset (250M tokens) constructed by combining (a) the OntoNotes corpus (version 5), the English Web Treebank and the corrected Question Treebank with (b) a silver dataset obtained by parsing (crawled) news with the BerkeleyParser and ZPar and selecting only those data with the same parse tree across parsers (then re-sampling this to match the sentence-length distribution of WSJ) to produce ~11M silver sentences. The 90K golden sentences + 11M silver sentences are referred to as the high-confidence corpus.
NB The BerkeleyParser corpus in the results is a 7M-sentence corpus produced by parsing news only with the BerkeleyParser.
Grammar as a Foreign Language - Results
Vinyals and co. report results for the baseline LSTM without attention, the model with attention (LSTM+A) and an ensemble of five of their models with attention.
Results from Vinyals et al. (2015) Grammar as a foreign language
- The LSTM+A model produced very few malformed trees, i.e. unbalanced trees recovered from the linearized output (1.5% or 0.8% depending on WSJ versus high-confidence training)
- Sentence length does not degrade performance of the LSTM+A model as much as the BerkeleyParser, and even the LSTM baseline degrades only slightly more than it
- Beam size being lowered from 10 to 2 degrades performance only slightly (e.g. F1 drops by 0.2 on the dev set for the LSTM+A)
- Dropout significantly helped the LSTM+A model when training on the small WSJ corpus (a roughly 2-point F1 gap between LSTM+A and LSTM+A+D)
- Caveat: Their model generalises worse to the WEB and QTB datasets, as compared to the BerkeleyParser (which was possibly also trained on large datasets)
- Their model parses WSJ faster than previous approaches, at 120 sentences/second for sentences of any length (beam size 1; c.f. 100 sentences/second previously for sentences under 40 words)
~
Here’s a bonus track.
References and Notes
1. Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773–2781, 2015. https://arxiv.org/pdf/1412.7449.pdf
2. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. https://www.bioinf.jku.at/publications/older/2604.pdf
3. Figure taken from Figure 1 of Ahuja and Morency (2017). Lattice Recurrent Unit: Improving Convergence and Statistical Efficiency for Sequence Modeling. https://arxiv.org/pdf/1710.02254.pdf
4. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014. https://papers.nips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf
5. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. https://arxiv.org/pdf/1409.0473.pdf