## Introduction
Consider speech recognition. We have a dataset of audio clips and corresponding transcripts. Unfortunately, we don’t know how the characters in the transcript align to the audio. This makes training a speech recognizer harder than it might at first seem.
Without this alignment, the simple approaches aren’t available to us. We could devise a rule like “one character corresponds to ten inputs”. But people’s rates of speech vary, so this type of rule can always be broken. Another alternative is to hand-align each character to its location in the audio. From a modeling standpoint this works well — we’d know the ground truth for each input time-step. However, for any reasonably sized dataset this is prohibitively time consuming.
This problem doesn’t just turn up in speech recognition. We see it in many other places. Handwriting recognition from images or sequences of pen strokes is one example. Action labelling in videos is another.
Handwriting recognition: The input can be $(x, y)$ coordinates of a pen stroke or pixels in an image.

Connectionist Temporal Classification (CTC) is a way to get around not knowing the alignment between the input and the output. As we’ll see, it’s especially well suited to applications like speech and handwriting recognition.

To be a bit more formal, let’s consider mapping input sequences $X = [x_1, x_2, \ldots, x_T]$, such as audio, to corresponding output sequences $Y = [y_1, y_2, \ldots, y_U]$, such as transcripts. We want to find an accurate mapping from $X$’s to $Y$’s.

There are challenges which get in the way of us using simpler supervised learning algorithms. In particular:

- Both $X$ and $Y$ can vary in length.
- The ratio of the lengths of $X$ and $Y$ can vary.
- We don’t have an accurate alignment (correspondence of the elements) of $X$ and $Y$.

The CTC algorithm overcomes these challenges. For a given $X$ it gives us an output distribution over all possible $Y$’s. We can use this distribution either to *infer* a likely output or to assess the *probability* of a given output.

Not all ways of computing the loss function and performing inference are tractable. We’ll require that CTC do both of these efficiently.

**Loss Function:** For a given input, we’d like to train our model to maximize the probability it assigns to the right answer. To do this, we’ll need to efficiently compute the conditional probability $p(Y \mid X)$. The function $p(Y \mid X)$ should also be differentiable, so we can use gradient descent.

**Inference:** Naturally, after we’ve trained the model, we want to use it to infer a likely $Y$ given an $X$. This means solving

$$ Y^* \enspace =\enspace {\mathop{\text{argmax}}\limits_{Y}} \enspace p(Y \mid X). $$

Ideally $Y^*$ can be found efficiently. With CTC we’ll settle for an approximate solution that’s not too expensive to find.

## The Algorithm

The CTC algorithm can assign a probability for any $Y$ given an $X$.
The key to computing this probability is how CTC thinks about alignments between inputs and outputs. We’ll start by looking at these alignments and then show how to use them to compute the loss function and perform inference.

### Alignment

The CTC algorithm is *alignment-free* — it doesn’t require an alignment between the input and the output. However, to get the probability of an output given an input, CTC works by summing over the probability of all possible alignments between the two. We need to understand what these alignments are in order to understand how the loss function is ultimately calculated.

To motivate the specific form of the CTC alignments, first consider a naive approach. Let’s use an example. Assume the input has length six and $Y = [c, a, t]$. One way to align $X$ and $Y$ is to assign an output character to each input step and collapse repeats.

![](https://distill.pub/2017/ctc/assets/naive_alignment.svg)

This approach has two problems.

- Often, it doesn’t make sense to force every input step to align to some output. In speech recognition, for example, the input can have stretches of silence with no corresponding output.
- We have no way to produce outputs with multiple characters in a row. Consider the alignment [h, h, e, l, l, l, o]. Collapsing repeats will produce “helo” instead of “hello”.

To get around these problems, CTC introduces a new token to the set of allowed outputs. This new token is sometimes called the *blank* token. We’ll refer to it here as $\epsilon$. The $\epsilon$ token doesn’t correspond to anything and is simply removed from the output.

The alignments allowed by CTC are the same length as the input. We allow any alignment which maps to $Y$ after merging repeats and removing $\epsilon$ tokens. If $Y$ has two of the same character in a row, then a valid alignment must have an $\epsilon$ between them. With this rule in place, we can differentiate between alignments which collapse to “hello” and those which collapse to “helo”.

Let’s go back to the output [c, a, t] with an input of length six. Here are a few more examples of valid and invalid alignments.

![](https://distill.pub/2017/ctc/assets/valid_invalid_alignments.svg)

The CTC alignments have a few notable properties. First, the allowed alignments between $X$ and $Y$ are monotonic. If we advance to the next input, we can keep the corresponding output the same or advance to the next output. A second property is that the alignment of $X$ to $Y$ is many-to-one. One or more input elements can align to a single output element but not vice-versa. This implies a third property: the length of $Y$ cannot be greater than the length of $X$.
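As a minimal sketch, the collapsing rule can be written in a few lines of Python (merge repeats first, then drop the blank; the helper names are illustrative):

```python
EPSILON = "ε"  # stand-in for the blank token

def collapse(alignment):
    """Map an alignment (one token per input step) to an output sequence:
    merge repeated tokens, then remove blanks."""
    merged = []
    previous = None
    for token in alignment:
        if token != previous:
            merged.append(token)
        previous = token
    return [t for t in merged if t != EPSILON]

print(collapse(["h", "h", "e", EPSILON, "l", "l", EPSILON, "l", "o"]))  # ['h', 'e', 'l', 'l', 'o']
print(collapse(["h", "h", "e", "l", "l", "l", "o"]))                    # ['h', 'e', 'l', 'o']
```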
### Loss Function

The CTC alignments give us a natural way to go from probabilities at each time-step to the probability of an output sequence.

![](https://distill.pub/2017/ctc/assets/full_collapse_from_audio.svg)

To be precise, the CTC objective for a single $(X, Y)$ pair is

$$ p(Y \mid X) \;\; = \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^T \; p_t(a_t \mid X). $$

The CTC conditional probability marginalizes over the set of valid alignments, computing the probability of a single alignment step-by-step.

Models trained with CTC typically use a recurrent neural network (RNN) to estimate the per time-step probabilities, $p_t(a_t \mid X)$. An RNN usually works well since it accounts for context in the input, but we’re free to use any learning algorithm which produces a distribution over output classes given a fixed-size slice of the input.

If we aren’t careful, the CTC loss can be very expensive to compute. We could try the straightforward approach and compute the score for each alignment, summing them all up as we go. The problem is there can be a massive number of alignments: for a $Y$ of length $U$ without any repeat characters and an $X$ of length $T$, the size of the set is ${T + U \choose T - U}$. For $T=100$ and $U=50$ this is almost $10^{40}$. For most problems the straightforward approach would be much too slow.
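To make the objective concrete, here is a brute-force sketch that literally sums the product of per-step probabilities over every alignment which collapses to the target. It assumes index 0 is the blank and is only feasible for tiny inputs and alphabets:

```python
import itertools

EPSILON = 0  # assume index 0 is the blank token

def collapse(alignment):
    """Merge repeats, then remove blanks."""
    merged = [token for token, _ in itertools.groupby(alignment)]
    return tuple(t for t in merged if t != EPSILON)

def ctc_loss_brute_force(probs, target):
    """Sum p(A | X) over every alignment A that collapses to `target`.

    probs: T x V nested list of per-time-step distributions p_t(a | X).
    target: sequence of label indices (no blanks).
    """
    T, V = len(probs), len(probs[0])
    total = 0.0
    for alignment in itertools.product(range(V), repeat=T):
        if collapse(alignment) == tuple(target):
            p = 1.0
            for t, a in enumerate(alignment):
                p *= probs[t][a]
            total += p
    return total
```

The dynamic programming algorithm described next computes the same value without enumerating alignments.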
Thankfully, we can compute the loss much faster with a dynamic programming algorithm. The key insight is that if two alignments have reached the same output at the same step, then we can merge them.

Since we can have an $\epsilon$ before or after any token in $Y$, it’s easier to describe the algorithm using a sequence which includes them. We’ll work with the sequence

$$ Z \enspace =\enspace [\epsilon, ~y_1, ~\epsilon, ~y_2,~ \ldots, ~\epsilon, ~y_U, ~\epsilon] $$

which is $Y$ with an $\epsilon$ at the beginning, end, and between every character.
Let’s call $\alpha$ the score of the merged alignments at a given node. More precisely, $\alpha_{s, t}$ is the CTC score of the subsequence $Z_{1:s}$ after $t$ input steps. As we’ll see, the final CTC score, $P(Y \mid X)$, is computed from the $\alpha$’s at the last time-step. As long as we know the values of $\alpha$ at the previous time-step, we can compute $\alpha_{s, t}$. There are two cases.

**Case 1:** In this case, we can’t jump over $z_{s-1}$, the previous token in $Z$. The first reason is that the previous token can be an element of $Y$, and we can’t skip elements of $Y$. Since every element of $Y$ in $Z$ is followed by an $\epsilon$, we can identify this when $z_{s} = \epsilon$. The second reason is that we must have an $\epsilon$ between repeat characters in $Y$. We can identify this when $z_s = z_{s-2}$.

To ensure we don’t skip $z_{s-1}$, we can either be there at the previous time-step or have already passed through at some earlier time-step. As a result there are two positions we can transition from.
$$ \alpha_{s, t} \; = \; (\alpha_{s-1, t-1} + \alpha_{s, t-1}) \cdot p_t(z_{s} \mid X) $$

The first factor is the CTC probability of the two valid subsequences after $t-1$ input steps; the second factor, $p_t(z_{s} \mid X)$, is the probability of the current character at input step $t$.
**Case 2:** In the second case, we’re allowed to skip the previous token in $Z$. We have this case whenever $z_{s-1}$ is an $\epsilon$ between unique characters. As a result there are three positions we could have come from at the previous step.

$$ \alpha_{s, t} \; = \; (\alpha_{s-2, t-1} + \alpha_{s-1, t-1} + \alpha_{s, t-1}) \cdot p_t(z_{s} \mid X) $$

As before, the first factor is the CTC probability of the valid subsequences after $t-1$ input steps, and the second factor, $p_t(z_{s} \mid X)$, is the probability of the current character at input step $t$.
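Putting the two cases together, here is a minimal NumPy sketch of the forward pass. It works in log-space (which also anticipates the numerical stability advice in the Practitioner’s Guide below), and the names are illustrative rather than canonical:

```python
import numpy as np

def ctc_forward(log_probs, target, blank=0):
    """Compute log p(Y | X) with the CTC dynamic program.

    log_probs: T x V array of per-time-step log-probabilities log p_t(a | X).
    target: list of label indices (no blanks).
    """
    T = log_probs.shape[0]
    # Z = [blank, y_1, blank, y_2, ..., y_U, blank]
    Z = [blank]
    for y in target:
        Z += [y, blank]
    S = len(Z)

    alpha = np.full((S, T), -np.inf)
    alpha[0, 0] = log_probs[0, blank]      # start on the initial blank...
    if S > 1:
        alpha[1, 0] = log_probs[0, Z[1]]   # ...or on the first label

    for t in range(1, T):
        for s in range(S):
            # Case 1: stay on the same state or come from the previous state.
            candidates = [alpha[s, t - 1]]
            if s - 1 >= 0:
                candidates.append(alpha[s - 1, t - 1])
            # Case 2: also allowed to skip a blank between distinct labels.
            if s - 2 >= 0 and Z[s] != blank and Z[s] != Z[s - 2]:
                candidates.append(alpha[s - 2, t - 1])
            alpha[s, t] = np.logaddexp.reduce(candidates) + log_probs[t, Z[s]]

    # Two valid final states: the last blank or the last label.
    if S == 1:
        return alpha[0, T - 1]
    return np.logaddexp(alpha[S - 1, T - 1], alpha[S - 2, T - 1])
```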
There are two valid starting nodes and two valid final nodes since the $\epsilon$ at the beginning and end of the sequence is optional. The complete probability is the sum of the two final nodes.

Now that we can efficiently compute the loss function, the next step is to compute a gradient and train the model. The CTC loss function is differentiable with respect to the per time-step output probabilities since it’s just sums and products of them. Given this, we can analytically compute the gradient of the loss function with respect to the (unnormalized) output probabilities and from there run backpropagation as usual.

For a training set $\mathcal{D}$
, the model’s parameters are tuned to minimize the negative log-likelihood

$$ \sum_{(X, Y) \in \mathcal{D}} -\log\; p(Y \mid X) $$

instead of maximizing the likelihood directly.

### Inference

After we’ve trained the model, we’d like to use it to find a likely output for a given input. More precisely, we need to solve

$$ Y^* \enspace = \enspace {\mathop{\text{argmax}}\limits_{Y}} \enspace p(Y \mid X). $$

One heuristic is to take the most likely output at each time-step. This gives us the alignment with the highest probability:

$$ A^* \enspace = \enspace {\mathop{\text{argmax}}\limits_{A}} \enspace \prod_{t=1}^{T} \; p_t(a_t \mid X). $$

We can then collapse repeats and remove $\epsilon$ tokens to get $Y$.
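A sketch of this heuristic, assuming an array of per-step distributions with the blank at index 0:

```python
import numpy as np

def best_path_decode(probs, blank=0):
    """Greedy heuristic: take the argmax at each time-step, then collapse."""
    best = np.argmax(probs, axis=1)           # most likely token per time-step
    output, prev = [], None
    for token in best:
        if token != prev and token != blank:  # merge repeats, drop blanks
            output.append(int(token))
        prev = token
    return output
```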
For many applications this heuristic works well, especially when most of the probability mass is allotted to a single alignment. However, this approach can sometimes miss easy to find outputs with much higher probability. The problem is that it doesn’t take into account the fact that a single output can have many alignments.

Here’s an example. Assume the alignments [a, a, $\epsilon$] and [a, a, a] individually have lower probability than [b, b, b]. But the sum of their probabilities is actually greater than that of [b, b, b]. The naive heuristic will incorrectly propose $Y =$ [b] as the most likely hypothesis. It should have chosen $Y =$ [a]. To fix this, the algorithm needs to account for the fact that [a, a, a] and [a, a, $\epsilon$] collapse to the same output.

We can use a modified beam search to solve this. Given limited computation, the modified beam search won’t necessarily find the most likely $Y$.
It does, at least, have the nice property that we can trade off more computation (a larger beam-size) for an asymptotically better solution.

A regular beam search computes a new set of hypotheses at each input step. The new set of hypotheses is generated from the previous set by extending each hypothesis with all possible output characters and keeping only the top candidates.

![](https://distill.pub/2017/ctc/assets/beam_search.svg)

A standard beam search algorithm with an alphabet of $\{\epsilon, a, b\}$ and a beam size of three.

We can modify the vanilla beam search to handle multiple alignments mapping to the same output. In this case, instead of keeping a list of alignments in the beam, we store the output prefixes after collapsing repeats and removing $\epsilon$ characters. At each step of the search we accumulate scores for a given prefix based on all the alignments which map to it.

A proposed extension can map to two output prefixes if the character is a repeat. This is shown at $T=3$ in the figure above where ‘a’ is proposed as an extension to the prefix [a]. Both [a] and [a, a] are valid outputs for this proposed extension.

When we extend [a] to produce [a, a], we only want to include the part of the previous score for alignments which end in $\epsilon$. Remember, the $\epsilon$ is required between repeat characters. Similarly, when we don’t extend the prefix and produce [a], we should only include the part of the previous score for alignments which don’t end in $\epsilon$.

Given this, we have to keep track of two probabilities for each prefix in the beam: the probability of ending in $\epsilon$ and the probability of not ending in $\epsilon$. When we rank the hypotheses at each step before pruning the beam, we’ll use their combined scores.
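This bookkeeping is compact but easy to get wrong. Below is a sketch of the modified (prefix) beam search without a language model; the two scores per prefix are kept in log-space and the function and variable names are illustrative:

```python
import collections
import math

def log_sum(*xs):
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def prefix_beam_search(log_probs, beam_size=10, blank=0):
    """log_probs: T x V per-step log-probabilities log p_t(a | X)."""
    T, V = len(log_probs), len(log_probs[0])
    # Each prefix keeps (log p ending in blank, log p ending in non-blank).
    beam = {(): (0.0, -math.inf)}
    for t in range(T):
        next_beam = collections.defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (p_b, p_nb) in beam.items():
            for c in range(V):
                p = log_probs[t][c]
                if c == blank:
                    # A blank never changes the prefix.
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (log_sum(nb_b, p_b + p, p_nb + p), nb_nb)
                elif prefix and c == prefix[-1]:
                    # Repeat character: extending requires a blank in between,
                    # so only the "ends in blank" mass carries over...
                    nb_b, nb_nb = next_beam[prefix + (c,)]
                    next_beam[prefix + (c,)] = (nb_b, log_sum(nb_nb, p_b + p))
                    # ...while the "ends in non-blank" mass collapses back
                    # into the same prefix.
                    sb_b, sb_nb = next_beam[prefix]
                    next_beam[prefix] = (sb_b, log_sum(sb_nb, p_nb + p))
                else:
                    nb_b, nb_nb = next_beam[prefix + (c,)]
                    next_beam[prefix + (c,)] = (nb_b, log_sum(nb_nb, p_b + p, p_nb + p))
        # Prune to the best prefixes by combined score.
        beam = dict(sorted(next_beam.items(),
                           key=lambda kv: log_sum(*kv[1]),
                           reverse=True)[:beam_size])
    best_prefix, _ = max(beam.items(), key=lambda kv: log_sum(*kv[1]))
    return list(best_prefix)
```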
In some problems, such as speech recognition, incorporating a language model over the outputs significantly improves accuracy. We can include the language model as a factor in the inference problem:

$$ Y^* \enspace = \enspace {\mathop{\text{argmax}}\limits_{Y}} \enspace p(Y \mid X) \cdot p(Y)^\alpha \cdot L(Y)^\beta $$

Here $p(Y \mid X)$ is the CTC conditional probability, $p(Y)^\alpha$ is the language model probability, and $L(Y)^\beta$ is the word insertion bonus. The function $L(Y)$ computes the length of $Y$ in terms of the language model tokens. With a word-based language model $L(Y)$ counts the number of words in $Y$. If we use a character-based language model then $L(Y)$ counts the number of characters in $Y$. The language model scores are only included when a prefix is extended by a character (or word) and not at every step of the algorithm. This causes the search to favor shorter prefixes, as measured by $L(Y)$, since they don’t include as many language model updates. The word insertion bonus helps with this. The parameters $\alpha$ and $\beta$ are usually set by cross-validation.
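In log-space this combined objective is just a weighted sum. A tiny sketch, with placeholder weights and assuming a non-empty hypothesis:

```python
import math

def combined_score(log_p_ctc, log_p_lm, length, alpha=0.8, beta=1.0):
    """log of p(Y | X) * p(Y)^alpha * L(Y)^beta; alpha and beta are tuned."""
    return log_p_ctc + alpha * log_p_lm + beta * math.log(length)
```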
The language model scores and word insertion term can be included in the beam search. Whenever we propose to extend a prefix by a character, we can include the language model score for the new character given the prefix so far.

## Properties of CTC

We mentioned a few important properties of CTC so far. Here we’ll go into more depth on what these properties are and what trade-offs they offer.

### Conditional Independence

One of the most commonly cited shortcomings of CTC is the conditional independence assumption it makes.

![](https://distill.pub/2017/ctc/assets/conditional_independence.svg)

Graphical model for CTC.

The model assumes that every output is conditionally independent of the other outputs given the input. This is a bad assumption for many sequence-to-sequence problems.

Say we had an audio clip of someone saying “triple A”. Another valid transcription could be “AAA”. If the first letter of the predicted transcription is ‘A’, then the next letter should be ‘A’ with high probability and ‘r’ with low probability. The conditional independence assumption does not allow for this.

![](https://distill.pub/2017/ctc/assets/triple_a.svg)

If we predict an ‘A’ as the first letter then the suffix ‘AA’ should get much more probability than ‘riple A’. If we predict ‘t’ first, the opposite should be true.

In fact, speech recognizers using CTC don’t learn a language model over the output nearly as well as models which are conditionally dependent. However, a separate language model can be included and usually gives a good boost to accuracy.

The conditional independence assumption made by CTC isn’t always a bad thing. Baking in strong beliefs over output interactions makes the model less adaptable to new or altered domains. For example, we might want to use a speech recognizer trained on phone conversations between friends to transcribe customer support calls. The language in the two domains can be quite different even if the acoustic model is similar. With a CTC acoustic model, we can easily swap in a new language model as we change domains.

### Alignment Properties

The CTC algorithm is *alignment-free*. The objective function marginalizes over all alignments. While CTC does make strong assumptions about the form of alignments between $X$ and $Y$, the model doesn’t say how probability should be distributed amongst them. In some problems CTC ends up allocating most of the probability to a single alignment. However, this isn’t guaranteed. One way to force a single alignment is to replace the sum in the objective with a max:

$$ p(Y \mid X) \enspace = \enspace \max_{A \in \mathcal{A}_{X,Y}} \enspace \prod_{t=1}^T \; p(a_t \mid X). $$

As mentioned before, CTC only allows monotonic alignments. In problems such as speech recognition this may be a valid assumption. For other problems like machine translation, where a future word in a target sentence can align to an earlier part of the source sentence, this assumption is a deal-breaker.

Another important property of CTC alignments is that they are many-to-one. Multiple inputs can align to at most one output. In some cases this may not be desirable. We might want to enforce a strict one-to-one correspondence between elements of $X$ and $Y$. Alternatively, we may want to allow multiple output elements to align to a single input element. For example, the characters “th” might align to a single input step of audio; a character-based CTC model would not allow that.

The many-to-one property implies that the output can’t have more time-steps than the input. (If $Y$ contains a run of $r$ repeated characters, aligning that run alone requires at least $2r - 1$ input steps, since an $\epsilon$ is needed between each pair of repeats.) This is usually not a problem for speech and handwriting recognition since the input is much longer than the output. However, for other problems where $Y$ is often longer than $X$
, CTC just won’t work.

## CTC in Context

In this section we’ll discuss how CTC relates to other commonly used algorithms for sequence modeling.

### HMMs

At first glance, a Hidden Markov Model (HMM) seems quite different from CTC. But the two algorithms are actually quite similar. Understanding the relationship between them will help us understand what advantages CTC has over HMM sequence models and give us insight into how CTC could be changed for various use cases.

Let’s use the same notation as before: $X$ is the input sequence and $Y$ is the output sequence with lengths $T$ and $U$ respectively. We’re interested in learning $p(Y \mid X)$.

One way to simplify the problem is to apply Bayes’ Rule:

$$ p(Y \mid X) \; \propto \; p(X \mid Y) \; p(Y). $$

The $p(Y)$ term can be any language model, so let’s focus on $p(X \mid Y)$. Like before we’ll let $\mathcal{A}$ be a set of allowed alignments between $X$ and $Y$. Members of $\mathcal{A}$ have length $T$. Let’s otherwise leave $\mathcal{A}$ unspecified for now; we’ll come back to it later. We can marginalize over alignments to get

$$ p(X \mid Y)\; = \; \sum_{A \in \mathcal{A}} \; p(X, A \mid Y). $$

To simplify notation, let’s remove the conditioning on $Y$; it will be present in every $p(\cdot)$.
With two assumptions we can write down the standard HMM:

$$ p(X) \enspace = \enspace \sum_{A \in \mathcal{A}} \; \prod_{t=1}^T \; p(x_t \mid a_t) \; p(a_t \mid a_{t-1}) $$

where $p(x_t \mid a_t)$ is the emission probability and $p(a_t \mid a_{t-1})$ is the transition probability. The first assumption is the usual Markov property: the state $a_t$ is conditionally independent of all historic states given the previous state $a_{t-1}$. The second is that the observation $x_t$ is conditionally independent of everything else given the current state $a_t$.

Now we can take just a few steps to transform the HMM into CTC and see how the two models relate. First, let’s assume that the transition probabilities $p(a_t \mid a_{t-1})$ are uniform. This gives

$$ p(X) \enspace \propto \enspace \sum_{A \in \mathcal{A}} \enspace \prod_{t=1}^T \; p(x_t \mid a_t). $$

There are only two differences between this equation and the CTC loss function. The first is that we are learning a model of $X$ given $Y$ as opposed to $Y$ given $X$. The second is how the set $\mathcal{A}$ is produced. Let’s deal with each in turn.

The HMM can be used with discriminative models which estimate $p(a \mid x)$.
To do this, we apply Bayes’ rule and rewrite the model as

$$ p(X) \enspace \propto \enspace \sum_{A \in \mathcal{A}} \enspace \prod_{t=1}^T \; \frac{p(a_t \mid x_t)\; p(x_t)}{p(a_t)}. $$

If we assume a uniform prior over the states and condition on all of $X$ instead of a single element at a time, we arrive at

$$ p(X) \enspace \propto \enspace \sum_{A \in \mathcal{A}} \enspace \prod_{t=1}^T \; p(a_t \mid X). $$
The above equation is essentially the CTC loss function, assuming the set $\mathcal{A}$ is the same. In fact, the HMM framework does not specify what $\mathcal{A}$ should consist of. This part of the model can be designed on a per-problem basis. In many cases the model doesn’t condition on $Y$ and the set $\mathcal{A}$ consists of all possible length $T$ sequences from the output alphabet. In this case, the HMM can be drawn as an ergodic state transition diagram in which every state connects to every other state. The figure below shows this model with the alphabet, or set of unique hidden states, as $\{a, b, c\}$.

In our case the transitions allowed by the model are strongly related to $Y$. We want the HMM to reflect this. One possible model could be a simple linear state transition diagram. The figure below shows this with the same alphabet as before and $Y =$ [a, b]. Another commonly used model is the Bakis or left-right HMM. In this model any transition which proceeds from the left to the right is allowed.
Ergodic HMM: Any node can be either a starting or final state.
In CTC we augment the alphabet with $\epsilon$ and the HMM model allows a subset of the left-right transitions. The CTC HMM has two start states and two accepting states.
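A small sketch of the CTC state graph and its allowed transitions for a given label sequence (the helper below is illustrative, not a full HMM implementation):

```python
def ctc_transitions(Y, blank="ε"):
    """Build the CTC state sequence Z and its allowed transitions."""
    Z = [blank]
    for y in Y:
        Z += [y, blank]
    edges = []
    for s in range(len(Z)):
        edges.append((s, s))                  # self-loop
        if s + 1 < len(Z):
            edges.append((s, s + 1))          # advance one state
        if s + 2 < len(Z) and Z[s + 2] != blank and Z[s + 2] != Z[s]:
            edges.append((s, s + 2))          # skip the blank between distinct labels
    return Z, edges

# For Y = [a, b]: states 0 and 1 are the two start states,
# and the last two states are the two accepting states.
Z, edges = ctc_transitions(["a", "b"])
```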
One possible source of confusion is that the HMM model differs for any unique $Y$. This is in fact standard in applications such as speech recognition. The state diagram changes based on the output $Y$. However, the functions which estimate the observation and transition probabilities are shared.

Let’s discuss how CTC improves on the original HMM model. First, we can think of the CTC state diagram as a special case HMM which works well for many problems of interest. Incorporating the blank as a hidden state in the HMM allows us to use the alphabet of $Y$ as the other hidden states. This model also gives a set of allowed alignments which may be a good prior for some problems.

Perhaps most importantly, CTC is discriminative. It models $p(Y \mid X)$ directly, an idea that’s been important in the past with other discriminative improvements to HMMs. Discriminative training lets us apply powerful learning algorithms like the RNN directly towards solving the problem we care about.
### Encoder-Decoder Models
The encoder-decoder is perhaps the most commonly used framework for sequence modeling with neural networks. These models have an encoder and a decoder. The encoder maps the input sequence $X$ into a hidden representation. The decoder consumes the hidden representation and produces a distribution over the outputs. We can write this as

$$ \begin{aligned} H\enspace &= \enspace\textsf{encode}(X) \\[.5em] p(Y \mid X)\enspace &= \enspace \textsf{decode}(H). \end{aligned} $$

The $\textsf{encode}(\cdot)$ and $\textsf{decode}(\cdot)$ functions are typically RNNs. The decoder can optionally be equipped with an attention mechanism. The hidden state sequence $H$ has the same number of time-steps as the input, $T$. Sometimes the encoder subsamples the input. If the encoder subsamples the input by a factor $s$ then $H$ will have $T/s$ time-steps.

We can interpret CTC in the encoder-decoder framework. This is helpful to understand the developments in encoder-decoder models that are applicable to CTC and to develop a common language for the properties of these models.

**Encoder:** The encoder of a CTC model can be just about any encoder we find in commonly used encoder-decoder models. For example, the encoder could be a multi-layer bidirectional RNN or a convolutional network. There is a constraint on the CTC encoder that doesn’t apply to the others: the input length cannot be sub-sampled so much that $T/s$ is less than the length of the output.

**Decoder:** We can view the decoder of a CTC model as a simple linear transformation followed by a softmax normalization. This layer should project all $T$ steps of the encoder output $H$ into the dimensionality of the output alphabet.
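A sketch of this view in PyTorch, where the “decoder” really is just a per-step linear projection on top of the encoder (the layer sizes and module names below are arbitrary):

```python
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    def __init__(self, num_features=80, hidden=256, alphabet_size=29):
        super().__init__()
        # Encoder: a multi-layer bidirectional RNN over the input sequence X.
        self.encoder = nn.GRU(num_features, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # "Decoder": a linear projection into the output alphabet (incl. blank).
        self.decoder = nn.Linear(2 * hidden, alphabet_size)

    def forward(self, x):                 # x: (batch, T, num_features)
        h, _ = self.encoder(x)            # H: (batch, T, 2 * hidden)
        return self.decoder(h).log_softmax(dim=-1)  # per-step log p_t(a | X)
```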
We mentioned earlier that CTC makes a conditional independence assumption over the characters in the output sequence. This is one of the big advantages that other encoder-decoder models have over CTC — they can model the dependence over the outputs. However in practice, CTC is still more commonly used in tasks like speech recognition as we can partially overcome the conditional independence assumption by including an external language model.

## Practitioner’s Guide

So far we’ve mostly developed a conceptual understanding of CTC. Here we’ll go through a few implementation tips for practitioners.

**Software:** Even with a solid understanding of CTC, the implementation is difficult. The algorithm has several edge cases and a fast implementation should be written in a lower-level programming language. Open-source software tools make it much easier to get started:

- Baidu Research has open-sourced [warp-ctc](https://github.com/baidu-research/warp-ctc). The package is written in C++ and CUDA. The CTC loss function runs on either the CPU or the GPU. Bindings are available for Torch, TensorFlow and [PyTorch](https://github.com/awni/warp-ctc).
- TensorFlow has built in [CTC loss](https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss) and [CTC beam search](https://www.tensorflow.org/api_docs/python/tf/nn/ctc_beam_search_decoder) functions for the CPU.
- Nvidia also provides a GPU implementation of CTC in [cuDNN](https://developer.nvidia.com/cudnn) versions 7 and up.
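As an illustration of the kind of interface these packages expose, here is a minimal usage sketch with PyTorch’s built-in `nn.CTCLoss`; the shapes and sizes below are made up for the example:

```python
import torch
import torch.nn as nn

T, N, C, U = 50, 16, 28, 10   # input steps, batch size, alphabet size (incl. blank), target length
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)  # per-step log p_t(a | X)
targets = torch.randint(1, C, (N, U))                                    # label indices; 0 is the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back through the per-step log-probabilities
```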
**Numerical Stability:** Computing the CTC loss naively is numerically unstable. One method to avoid this is to normalize the $\alpha$’s at each time-step. The original publication has more detail on this including the adjustments to the gradient. In practice this works well enough for medium length sequences but can still underflow for long sequences. A better solution is to compute the loss function in log-space with the log-sum-exp trick. When computing the sum of two probabilities in log space, use the identity

$$ \log(e^a + e^b) = \max\{a, b\} + \log(1 + e^{-|a-b|}) $$

Most programming languages have a stable function to compute $\log(1 + x)$ when $x$ is close to zero.
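A tiny sketch of that identity, using Python’s `math.log1p` as the stable `log(1 + x)`:

```python
import math

def log_add(a, b):
    """Stable log(exp(a) + exp(b)) for log-space probabilities."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```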
**Beam Search:** There are a couple of good tips to know about when implementing and using the CTC beam search.

The correctness of the beam search can be tested as follows:

1. Run the beam search algorithm on an arbitrary input.
2. Save the inferred output $\bar{Y}$ and the corresponding score $\bar{c}$.
3. Compute the actual CTC score $c$ for $\bar{Y}$.
4. Check that $\bar{c} \approx c$, with the former no greater than the latter. As the beam size increases the inferred output $\bar{Y}$ may change, but the two numbers should grow closer.

A common question when using a beam search decoder is the size of the beam to use. There is a trade-off between accuracy and runtime. We can check if the beam size is in a good range. To do this, first compute the CTC score for the inferred output, $c_i$. Then compute the CTC score for the ground truth output, $c_g$. If the two outputs are the same then the beam size is in a good range. If $c_g \lt c_i$ then the inferred output has a higher probability than the ground truth and the beam search is working as it should; any error is more likely due to the model than the search. If $c_i \ll c_g$ then the ground truth is much more probable than the inferred output, which means the beam search failed to find it; in that case a larger beam size may be warranted.