## Introduction
Consider speech recognition. We have a dataset of audio clips and corresponding transcripts. Unfortunately, we don’t know how the characters in the transcript align to the audio. This makes training a speech recognizer harder than it might at first seem.
Without this alignment, the simple approaches aren’t available to us. We could devise a rule like “one character corresponds to ten inputs”. But people’s rates of speech vary, so this type of rule can always be broken. Another alternative is to hand-align each character to its location in the audio. From a modeling standpoint this works well — we’d know the ground truth for each input time-step. However, for any reasonably sized dataset this is prohibitively time consuming.
This problem doesn’t just turn up in speech recognition. We see it in many other places. Handwriting recognition from images or sequences of pen strokes is one example. Action labelling in videos is another.
Handwriting recognition: The input can be $(x, y)$ coordinates of a pen stroke or pixels in an image.

Connectionist Temporal Classification (CTC) is a way to get around not knowing the alignment between the input and the output. As we’ll see, it’s especially well suited to applications like speech and handwriting recognition.

To be a bit more formal, let’s consider mapping input sequences $X = [x_1, x_2, \ldots, x_T]$, such as audio, to corresponding output sequences $Y = [y_1, y_2, \ldots, y_U]$, such as transcripts. We want to find an accurate mapping from $X$’s to $Y$’s.

There are challenges which get in the way of us using simpler supervised learning algorithms. In particular:

- Both $X$ and $Y$ can vary in length.
- The ratio of the lengths of $X$ and $Y$ can vary.
- We don’t have an accurate alignment (correspondence of the elements) of $X$ and $Y$.

The CTC algorithm overcomes these challenges. For a given $X$ it gives us an output distribution over all possible $Y$’s. We can use this distribution either to *infer* a likely output or to assess the *probability* of a given output.

Not all ways of computing the loss function and performing inference are tractable. We’ll require that CTC do both of these efficiently.

**Loss Function:** For a given input, we’d like to train our model to maximize the probability it assigns to the right answer. To do this, we’ll need to efficiently compute the conditional probability $p(Y \mid X)$. The function $p(Y \mid X)$ should also be differentiable, so we can use gradient descent.

**Inference:** Naturally, after we’ve trained the model, we want to use it to infer a likely $Y$ given an $X$. This means solving

$$ Y^* \enspace =\enspace {\mathop{\text{argmax}}\limits_{Y}} \enspace p(Y \mid X). $$

Ideally $Y^*$ can be found efficiently. With CTC we’ll settle for an approximate solution that’s not too expensive to find.

## The Algorithm

The CTC algorithm can assign a probability for any $Y$ given an $X$.
The key to computing this probability is how CTC thinks about alignments between inputs and outputs. We’ll start by looking at these alignments and then show how to use them to compute the loss function and perform inference.

### Alignment

The CTC algorithm is *alignment-free* — it doesn’t require an alignment between the input and the output. However, to get the probability of an output given an input, CTC works by summing over the probability of all possible alignments between the two. We need to understand what these alignments are in order to understand how the loss function is ultimately calculated.

To motivate the specific form of the CTC alignments, first consider a naive approach. Let’s use an example. Assume the input has length six and $Y = [c, a, t]$. One way to align $X$ and $Y$ is to assign an output character to each input step and collapse repeats.

![](https://distill.pub/2017/ctc/assets/naive_alignment.svg)

This approach has two problems.

- Often, it doesn’t make sense to force every input step to align to some output. In speech recognition, for example, the input can have stretches of silence with no corresponding output.
- We have no way to produce outputs with multiple characters in a row. Consider the alignment [h, h, e, l, l, l, o]. Collapsing repeats will produce “helo” instead of “hello”.

To get around these problems, CTC introduces a new token to the set of allowed outputs. This new token is sometimes called the *blank* token. We’ll refer to it here as $\epsilon$. The $\epsilon$ token doesn’t correspond to anything and is simply removed from the output.

The alignments allowed by CTC are the same length as the input. We allow any alignment which maps to $Y$ after merging repeats and removing $\epsilon$ tokens. If $Y$ has two of the same character in a row, then a valid alignment must have an $\epsilon$ between them. With this rule in place, we can differentiate between alignments which collapse to “hello” and those which collapse to “helo”.

Let’s go back to the output [c, a, t] with an input of length six. Here are a few more examples of valid and invalid alignments.

![](https://distill.pub/2017/ctc/assets/valid_invalid_alignments.svg)

The CTC alignments have a few notable properties. First, the allowed alignments between $X$ and $Y$ are monotonic. If we advance to the next input, we can keep the corresponding output the same or advance to the next output. A second property is that the alignment of $X$ to $Y$ is many-to-one. One or more input elements can align to a single output element but not vice-versa. This implies a third property: the length of $Y$ cannot be greater than the length of $X$.
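As a minimal sketch, the collapsing rule can be written in a few lines of Python (merge repeats first, then drop the blank; the helper names are illustrative):

```python
EPSILON = "ε"  # stand-in for the blank token

def collapse(alignment):
    """Map an alignment (one token per input step) to an output sequence:
    merge repeated tokens, then remove blanks."""
    merged = []
    previous = None
    for token in alignment:
        if token != previous:
            merged.append(token)
        previous = token
    return [t for t in merged if t != EPSILON]

print(collapse(["h", "h", "e", EPSILON, "l", "l", EPSILON, "l", "o"]))  # ['h', 'e', 'l', 'l', 'o']
print(collapse(["h", "h", "e", "l", "l", "l", "o"]))                    # ['h', 'e', 'l', 'o']
```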
### Loss Function

The CTC alignments give us a natural way to go from probabilities at each time-step to the probability of an output sequence.

![](https://distill.pub/2017/ctc/assets/full_collapse_from_audio.svg)

To be precise, the CTC objective for a single $(X, Y)$ pair is

$$ p(Y \mid X) \;\; = \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^T \; p_t(a_t \mid X). $$

The CTC conditional probability marginalizes over the set of valid alignments, computing the probability of a single alignment step-by-step.

Models trained with CTC typically use a recurrent neural network (RNN) to estimate the per time-step probabilities, $p_t(a_t \mid X)$. An RNN usually works well since it accounts for context in the input, but we’re free to use any learning algorithm which produces a distribution over output classes given a fixed-size slice of the input.

If we aren’t careful, the CTC loss can be very expensive to compute. We could try the straightforward approach and compute the score for each alignment, summing them all up as we go. The problem is there can be a massive number of alignments: for a $Y$ of length $U$ without any repeat characters and an $X$ of length $T$, the size of the set is ${T + U \choose T - U}$. For $T=100$ and $U=50$ this is almost $10^{40}$. For most problems the straightforward approach would be much too slow.
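To make the objective concrete, here is a brute-force sketch that literally sums the product of per-step probabilities over every alignment which collapses to the target. It assumes index 0 is the blank and is only feasible for tiny inputs and alphabets:

```python
import itertools

EPSILON = 0  # assume index 0 is the blank token

def collapse(alignment):
    """Merge repeats, then remove blanks."""
    merged = [token for token, _ in itertools.groupby(alignment)]
    return tuple(t for t in merged if t != EPSILON)

def ctc_loss_brute_force(probs, target):
    """Sum p(A | X) over every alignment A that collapses to `target`.

    probs: T x V nested list of per-time-step distributions p_t(a | X).
    target: sequence of label indices (no blanks).
    """
    T, V = len(probs), len(probs[0])
    total = 0.0
    for alignment in itertools.product(range(V), repeat=T):
        if collapse(alignment) == tuple(target):
            p = 1.0
            for t, a in enumerate(alignment):
                p *= probs[t][a]
            total += p
    return total
```

The dynamic programming algorithm described next computes the same value without enumerating alignments.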
Thankfully, we can compute the loss much faster with a dynamic programming algorithm. The key insight is that if two alignments have reached the same output at the same step, then we can merge them.

Since we can have an $\epsilon$ before or after any token in $Y$, it’s easier to describe the algorithm using a sequence which includes them. We’ll work with the sequence

$$ Z \enspace =\enspace [\epsilon, ~y_1, ~\epsilon, ~y_2,~ \ldots, ~\epsilon, ~y_U, ~\epsilon] $$

which is $Y$ with an $\epsilon$ at the beginning, end, and between every character.
Let’s call $\alpha$ the score of the merged alignments at a given node. More precisely, $\alpha_{s, t}$ is the CTC score of the subsequence $Z_{1:s}$ after $t$ input steps. As we’ll see, the final CTC score, $P(Y \mid X)$, is computed from the $\alpha$’s at the last time-step. As long as we know the values of $\alpha$ at the previous time-step, we can compute $\alpha_{s, t}$. There are two cases.

**Case 1:** In this case, we can’t jump over $z_{s-1}$, the previous token in $Z$. The first reason is that the previous token can be an element of $Y$, and we can’t skip elements of $Y$. Since every element of $Y$ in $Z$ is followed by an $\epsilon$, we can identify this when $z_{s} = \epsilon$. The second reason is that we must have an $\epsilon$ between repeat characters in $Y$. We can identify this when $z_s = z_{s-2}$.

To ensure we don’t skip $z_{s-1}$, we can either be there at the previous time-step or have already passed through at some earlier time-step. As a result there are two positions we can transition from.
$$ \alpha_{s, t} \; = \; (\alpha_{s-1, t-1} + \alpha_{s, t-1}) \cdot p_t(z_{s} \mid X) $$

The first factor is the CTC probability of the two valid subsequences after $t-1$ input steps; the second factor, $p_t(z_{s} \mid X)$, is the probability of the current character at input step $t$.
**Case 2:** In the second case, we’re allowed to skip the previous token in $Z$. We have this case whenever $z_{s-1}$ is an $\epsilon$ between unique characters. As a result there are three positions we could have come from at the previous step.

$$ \alpha_{s, t} \; = \; (\alpha_{s-2, t-1} + \alpha_{s-1, t-1} + \alpha_{s, t-1}) \cdot p_t(z_{s} \mid X) $$

As before, the first factor is the CTC probability of the valid subsequences after $t-1$ input steps, and the second factor, $p_t(z_{s} \mid X)$, is the probability of the current character at input step $t$.
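Putting the two cases together, here is a minimal NumPy sketch of the forward pass. It works in log-space (which also anticipates the numerical stability advice in the Practitioner’s Guide below), and the names are illustrative rather than canonical:

```python
import numpy as np

def ctc_forward(log_probs, target, blank=0):
    """Compute log p(Y | X) with the CTC dynamic program.

    log_probs: T x V array of per-time-step log-probabilities log p_t(a | X).
    target: list of label indices (no blanks).
    """
    T = log_probs.shape[0]
    # Z = [blank, y_1, blank, y_2, ..., y_U, blank]
    Z = [blank]
    for y in target:
        Z += [y, blank]
    S = len(Z)

    alpha = np.full((S, T), -np.inf)
    alpha[0, 0] = log_probs[0, blank]      # start on the initial blank...
    if S > 1:
        alpha[1, 0] = log_probs[0, Z[1]]   # ...or on the first label

    for t in range(1, T):
        for s in range(S):
            # Case 1: stay on the same state or come from the previous state.
            candidates = [alpha[s, t - 1]]
            if s - 1 >= 0:
                candidates.append(alpha[s - 1, t - 1])
            # Case 2: also allowed to skip a blank between distinct labels.
            if s - 2 >= 0 and Z[s] != blank and Z[s] != Z[s - 2]:
                candidates.append(alpha[s - 2, t - 1])
            alpha[s, t] = np.logaddexp.reduce(candidates) + log_probs[t, Z[s]]

    # Two valid final states: the last blank or the last label.
    if S == 1:
        return alpha[0, T - 1]
    return np.logaddexp(alpha[S - 1, T - 1], alpha[S - 2, T - 1])
```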
There are two valid starting nodes and two valid final nodes since the $\epsilon$ at the beginning and end of the sequence is optional. The complete probability is the sum of the two final nodes.

Now that we can efficiently compute the loss function, the next step is to compute a gradient and train the model. The CTC loss function is differentiable with respect to the per time-step output probabilities since it’s just sums and products of them. Given this, we can analytically compute the gradient of the loss function with respect to the (unnormalized) output probabilities and from there run backpropagation as usual.

For a training set $\mathcal{D}$
, the model’s parameters are tuned to minimize the negative log-likelihood

$$ \sum_{(X, Y) \in \mathcal{D}} -\log\; p(Y \mid X) $$

instead of maximizing the likelihood directly.

### Inference

After we’ve trained the model, we’d like to use it to find a likely output for a given input. More precisely, we need to solve

$$ Y^* \enspace = \enspace {\mathop{\text{argmax}}\limits_{Y}} \enspace p(Y \mid X). $$

One heuristic is to take the most likely output at each time-step. This gives us the alignment with the highest probability:

$$ A^* \enspace = \enspace {\mathop{\text{argmax}}\limits_{A}} \enspace \prod_{t=1}^{T} \; p_t(a_t \mid X). $$

We can then collapse repeats and remove $\epsilon$ tokens to get $Y$.
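A sketch of this heuristic, assuming an array of per-step distributions with the blank at index 0:

```python
import numpy as np

def best_path_decode(probs, blank=0):
    """Greedy heuristic: take the argmax at each time-step, then collapse."""
    best = np.argmax(probs, axis=1)           # most likely token per time-step
    output, prev = [], None
    for token in best:
        if token != prev and token != blank:  # merge repeats, drop blanks
            output.append(int(token))
        prev = token
    return output
```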
For many applications this heuristic works well, especially when most of the probability mass is allotted to a single alignment. However, this approach can sometimes miss easy to find outputs with much higher probability. The problem is that it doesn’t take into account the fact that a single output can have many alignments.

Here’s an example. Assume the alignments [a, a, $\epsilon$] and [a, a, a] individually have lower probability than [b, b, b]. But the sum of their probabilities is actually greater than that of [b, b, b]. The naive heuristic will incorrectly propose $Y =$ [b] as the most likely hypothesis. It should have chosen $Y =$ [a]. To fix this, the algorithm needs to account for the fact that [a, a, a] and [a, a, $\epsilon$] collapse to the same output.

We can use a modified beam search to solve this. Given limited computation, the modified beam search won’t necessarily find the most likely $Y$.
It does, at least, have the nice property that we can trade off more computation (a larger beam-size) for an asymptotically better solution.

A regular beam search computes a new set of hypotheses at each input step. The new set of hypotheses is generated from the previous set by extending each hypothesis with all possible output characters and keeping only the top candidates.

![](https://distill.pub/2017/ctc/assets/beam_search.svg)

A standard beam search algorithm with an alphabet of $\{\epsilon, a, b\}$ and a beam size of three.

We can modify the vanilla beam search to handle multiple alignments mapping to the same output. In this case, instead of keeping a list of alignments in the beam, we store the output prefixes after collapsing repeats and removing $\epsilon$ characters. At each step of the search we accumulate scores for a given prefix based on all the alignments which map to it.

A proposed extension can map to two output prefixes if the character is a repeat. This is shown at $T=3$ in the figure above where ‘a’ is proposed as an extension to the prefix [a]. Both [a] and [a, a] are valid outputs for this proposed extension.

When we extend [a] to produce [a, a], we only want to include the part of the previous score for alignments which end in $\epsilon$. Remember, the $\epsilon$ is required between repeat characters. Similarly, when we don’t extend the prefix and produce [a], we should only include the part of the previous score for alignments which don’t end in $\epsilon$.

Given this, we have to keep track of two probabilities for each prefix in the beam: the probability of ending in $\epsilon$ and the probability of not ending in $\epsilon$. When we rank the hypotheses at each step before pruning the beam, we’ll use their combined scores.
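This bookkeeping is compact but easy to get wrong. Below is a sketch of the modified (prefix) beam search without a language model; the two scores per prefix are kept in log-space and the function and variable names are illustrative:

```python
import collections
import math

def log_sum(*xs):
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def prefix_beam_search(log_probs, beam_size=10, blank=0):
    """log_probs: T x V per-step log-probabilities log p_t(a | X)."""
    T, V = len(log_probs), len(log_probs[0])
    # Each prefix keeps (log p ending in blank, log p ending in non-blank).
    beam = {(): (0.0, -math.inf)}
    for t in range(T):
        next_beam = collections.defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (p_b, p_nb) in beam.items():
            for c in range(V):
                p = log_probs[t][c]
                if c == blank:
                    # A blank never changes the prefix.
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (log_sum(nb_b, p_b + p, p_nb + p), nb_nb)
                elif prefix and c == prefix[-1]:
                    # Repeat character: extending requires a blank in between,
                    # so only the "ends in blank" mass carries over...
                    nb_b, nb_nb = next_beam[prefix + (c,)]
                    next_beam[prefix + (c,)] = (nb_b, log_sum(nb_nb, p_b + p))
                    # ...while the "ends in non-blank" mass collapses back
                    # into the same prefix.
                    sb_b, sb_nb = next_beam[prefix]
                    next_beam[prefix] = (sb_b, log_sum(sb_nb, p_nb + p))
                else:
                    nb_b, nb_nb = next_beam[prefix + (c,)]
                    next_beam[prefix + (c,)] = (nb_b, log_sum(nb_nb, p_b + p, p_nb + p))
        # Prune to the best prefixes by combined score.
        beam = dict(sorted(next_beam.items(),
                           key=lambda kv: log_sum(*kv[1]),
                           reverse=True)[:beam_size])
    best_prefix, _ = max(beam.items(), key=lambda kv: log_sum(*kv[1]))
    return list(best_prefix)
```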
In some problems, such as speech recognition, incorporating a language model over the outputs significantly improves accuracy. We can include the language model as a factor in the inference problem:

$$ Y^* \enspace = \enspace {\mathop{\text{argmax}}\limits_{Y}} \enspace p(Y \mid X) \cdot p(Y)^\alpha \cdot L(Y)^\beta $$

Here $p(Y \mid X)$ is the CTC conditional probability, $p(Y)^\alpha$ is the language model probability, and $L(Y)^\beta$ is the word insertion bonus. The function $L(Y)$ computes the length of $Y$ in terms of the language model tokens. With a word-based language model $L(Y)$ counts the number of words in $Y$. If we use a character-based language model then $L(Y)$ counts the number of characters in $Y$. The language model scores are only included when a prefix is extended by a character (or word) and not at every step of the algorithm. This causes the search to favor shorter prefixes, as measured by $L(Y)$, since they don’t include as many language model updates. The word insertion bonus helps with this. The parameters $\alpha$ and $\beta$ are usually set by cross-validation.
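In log-space this combined objective is just a weighted sum. A tiny sketch, with placeholder weights and assuming a non-empty hypothesis:

```python
import math

def combined_score(log_p_ctc, log_p_lm, length, alpha=0.8, beta=1.0):
    """log of p(Y | X) * p(Y)^alpha * L(Y)^beta; alpha and beta are tuned."""
    return log_p_ctc + alpha * log_p_lm + beta * math.log(length)
```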
The language model scores and word insertion term can be included in the beam search. Whenever we propose to extend a prefix by a character, we can include the language model score for the new character given the prefix so far.

## Properties of CTC

We mentioned a few important properties of CTC so far. Here we’ll go into more depth on what these properties are and what trade-offs they offer.

### Conditional Independence

One of the most commonly cited shortcomings of CTC is the conditional independence assumption it makes.

![](https://distill.pub/2017/ctc/assets/conditional_independence.svg)

Graphical model for CTC.

The model assumes that every output is conditionally independent of the other outputs given the input. This is a bad assumption for many sequence-to-sequence problems.

Say we had an audio clip of someone saying “triple A”. Another valid transcription could be “AAA”. If the first letter of the predicted transcription is ‘A’, then the next letter should be ‘A’ with high probability and ‘r’ with low probability. The conditional independence assumption does not allow for this.

![](https://distill.pub/2017/ctc/assets/triple_a.svg)

If we predict an ‘A’ as the first letter then the suffix ‘AA’ should get much more probability than ‘riple A’. If we predict ‘t’ first, the opposite should be true.

In fact, speech recognizers using CTC don’t learn a language model over the output nearly as well as models which are conditionally dependent. However, a separate language model can be included and usually gives a good boost to accuracy.

The conditional independence assumption made by CTC isn’t always a bad thing. Baking in strong beliefs over output interactions makes the model less adaptable to new or altered domains. For example, we might want to use a speech recognizer trained on phone conversations between friends to transcribe customer support calls. The language in the two domains can be quite different even if the acoustic model is similar. With a CTC acoustic model, we can easily swap in a new language model as we change domains.

### Alignment Properties

The CTC algorithm is *alignment-free*. The objective function marginalizes over all alignments. While CTC does make strong assumptions about the form of alignments between $X$ and $Y$, the model doesn’t say how probability should be distributed amongst them. In some problems CTC ends up allocating most of the probability to a single alignment. However, this isn’t guaranteed. One way to force a single alignment is to replace the sum in the objective with a max:

$$ p(Y \mid X) \enspace = \enspace \max_{A \in \mathcal{A}_{X,Y}} \enspace \prod_{t=1}^T \; p(a_t \mid X). $$

As mentioned before, CTC only allows monotonic alignments. In problems such as speech recognition this may be a valid assumption. For other problems like machine translation, where a future word in a target sentence can align to an earlier part of the source sentence, this assumption is a deal-breaker.

Another important property of CTC alignments is that they are many-to-one. Multiple inputs can align to at most one output. In some cases this may not be desirable. We might want to enforce a strict one-to-one correspondence between elements of $X$ and $Y$. Alternatively, we may want to allow multiple output elements to align to a single input element. For example, the characters “th” might align to a single input step of audio; a character-based CTC model would not allow that.

The many-to-one property implies that the output can’t have more time-steps than the input. (If $Y$ contains a run of $r$ repeated characters, aligning that run alone requires at least $2r - 1$ input steps, since an $\epsilon$ is needed between each pair of repeats.) This is usually not a problem for speech and handwriting recognition since the input is much longer than the output. However, for other problems where $Y$ is often longer than $X$
, CTC just won’t work.

## CTC in Context

In this section we’ll discuss how CTC relates to other commonly used algorithms for sequence modeling.

### HMMs

At first glance, a Hidden Markov Model (HMM) seems quite different from CTC. But the two algorithms are actually quite similar. Understanding the relationship between them will help us understand what advantages CTC has over HMM sequence models and give us insight into how CTC could be changed for various use cases.

Let’s use the same notation as before: $X$ is the input sequence and $Y$ is the output sequence with lengths $T$ and $U$ respectively. We’re interested in learning $p(Y \mid X)$.

One way to simplify the problem is to apply Bayes’ Rule:

$$ p(Y \mid X) \; \propto \; p(X \mid Y) \; p(Y). $$

The $p(Y)$ term can be any language model, so let’s focus on $p(X \mid Y)$. Like before we’ll let $\mathcal{A}$ be a set of allowed alignments between $X$ and $Y$. Members of $\mathcal{A}$ have length $T$. Let’s otherwise leave $\mathcal{A}$ unspecified for now; we’ll come back to it later. We can marginalize over alignments to get

$$ p(X \mid Y)\; = \; \sum_{A \in \mathcal{A}} \; p(X, A \mid Y). $$

To simplify notation, let’s remove the conditioning on $Y$; it will be present in every $p(\cdot)$.
With two assumptions we can write down the standard HMM:

$$ p(X) \enspace = \enspace \sum_{A \in \mathcal{A}} \; \prod_{t=1}^T \; p(x_t \mid a_t) \; p(a_t \mid a_{t-1}) $$

where $p(x_t \mid a_t)$ is the emission probability and $p(a_t \mid a_{t-1})$ is the transition probability. The first assumption is the usual Markov property: the state $a_t$ is conditionally independent of all historic states given the previous state $a_{t-1}$. The second is that the observation $x_t$ is conditionally independent of everything else given the current state $a_t$.

Now we can take just a few steps to transform the HMM into CTC and see how the two models relate. First, let’s assume that the transition probabilities $p(a_t \mid a_{t-1})$ are uniform. This gives

$$ p(X) \enspace \propto \enspace \sum_{A \in \mathcal{A}} \enspace \prod_{t=1}^T \; p(x_t \mid a_t). $$

There are only two differences between this equation and the CTC loss function. The first is that we are learning a model of $X$ given $Y$ as opposed to $Y$ given $X$. The second is how the set $\mathcal{A}$ is produced. Let’s deal with each in turn.

The HMM can be used with discriminative models which estimate $p(a \mid x)$.
To do this, we apply Bayes’ rule and rewrite the model as

$$ p(X) \enspace \propto \enspace \sum_{A \in \mathcal{A}} \enspace \prod_{t=1}^T \; \frac{p(a_t \mid x_t)\; p(x_t)}{p(a_t)}. $$

If we assume a uniform prior over the states and condition on all of $X$ instead of a single element at a time, we arrive at

$$ p(X) \enspace \propto \enspace \sum_{A \in \mathcal{A}} \enspace \prod_{t=1}^T \; p(a_t \mid X). $$
The above equation is essentially the CTC loss function, assuming the set $\mathcal{A}$ is the same. In fact, the HMM framework does not specify what $\mathcal{A}$ should consist of. This part of the model can be designed on a per-problem basis. In many cases the model doesn’t condition on $Y$ and the set $\mathcal{A}$ consists of all possible length $T$ sequences from the output alphabet. In this case, the HMM can be drawn as an ergodic state transition diagram in which every state connects to every other state. The figure below shows this model with the alphabet, or set of unique hidden states, as $\{a, b, c\}$.

In our case the transitions allowed by the model are strongly related to $Y$. We want the HMM to reflect this. One possible model could be a simple linear state transition diagram. The figure below shows this with the same alphabet as before and $Y =$ [a, b]. Another commonly used model is the Bakis or left-right HMM. In this model any transition which proceeds from the left to the right is allowed.
Ergodic HMM: Any node can be either a starting or final state.
In CTC we augment the alphabet with $\epsilon$ and the HMM model allows a subset of the left-right transitions. The CTC HMM has two start states and two accepting states.
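A small sketch of the CTC state graph and its allowed transitions for a given label sequence (the helper below is illustrative, not a full HMM implementation):

```python
def ctc_transitions(Y, blank="ε"):
    """Build the CTC state sequence Z and its allowed transitions."""
    Z = [blank]
    for y in Y:
        Z += [y, blank]
    edges = []
    for s in range(len(Z)):
        edges.append((s, s))                  # self-loop
        if s + 1 < len(Z):
            edges.append((s, s + 1))          # advance one state
        if s + 2 < len(Z) and Z[s + 2] != blank and Z[s + 2] != Z[s]:
            edges.append((s, s + 2))          # skip the blank between distinct labels
    return Z, edges

# For Y = [a, b]: states 0 and 1 are the two start states,
# and the last two states are the two accepting states.
Z, edges = ctc_transitions(["a", "b"])
```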
One possible source of confusion is that the HMM model differs for any unique $Y$. This is in fact standard in applications such as speech recognition. The state diagram changes based on the output $Y$. However, the functions which estimate the observation and transition probabilities are shared.

Let’s discuss how CTC improves on the original HMM model. First, we can think of the CTC state diagram as a special case HMM which works well for many problems of interest. Incorporating the blank as a hidden state in the HMM allows us to use the alphabet of $Y$ as the other hidden states. This model also gives a set of allowed alignments which may be a good prior for some problems.

Perhaps most importantly, CTC is discriminative. It models $p(Y \mid X)$ directly, an idea that’s been important in the past with other discriminative improvements to HMMs. Discriminative training lets us apply powerful learning algorithms like the RNN directly towards solving the problem we care about.
### Encoder-Decoder Models
The encoder-decoder is perhaps the most commonly used framework for sequence modeling with neural networks. These models have an encoder and a decoder. The encoder maps the input sequence $X$ into a hidden representation. The decoder consumes the hidden representation and produces a distribution over the outputs. We can write this as

$$ \begin{aligned} H\enspace &= \enspace\textsf{encode}(X) \\[.5em] p(Y \mid X)\enspace &= \enspace \textsf{decode}(H). \end{aligned} $$

The $\textsf{encode}(\cdot)$ and $\textsf{decode}(\cdot)$ functions are typically RNNs. The decoder can optionally be equipped with an attention mechanism. The hidden state sequence $H$ has the same number of time-steps as the input, $T$. Sometimes the encoder subsamples the input. If the encoder subsamples the input by a factor $s$ then $H$ will have $T/s$ time-steps.

We can interpret CTC in the encoder-decoder framework. This is helpful to understand the developments in encoder-decoder models that are applicable to CTC and to develop a common language for the properties of these models.

**Encoder:** The encoder of a CTC model can be just about any encoder we find in commonly used encoder-decoder models. For example, the encoder could be a multi-layer bidirectional RNN or a convolutional network. There is a constraint on the CTC encoder that doesn’t apply to the others: the input length cannot be sub-sampled so much that $T/s$ is less than the length of the output.

**Decoder:** We can view the decoder of a CTC model as a simple linear transformation followed by a softmax normalization. This layer should project all $T$ steps of the encoder output $H$ into the dimensionality of the output alphabet.
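A sketch of this view in PyTorch, where the “decoder” really is just a per-step linear projection on top of the encoder (the layer sizes and module names below are arbitrary):

```python
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    def __init__(self, num_features=80, hidden=256, alphabet_size=29):
        super().__init__()
        # Encoder: a multi-layer bidirectional RNN over the input sequence X.
        self.encoder = nn.GRU(num_features, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # "Decoder": a linear projection into the output alphabet (incl. blank).
        self.decoder = nn.Linear(2 * hidden, alphabet_size)

    def forward(self, x):                 # x: (batch, T, num_features)
        h, _ = self.encoder(x)            # H: (batch, T, 2 * hidden)
        return self.decoder(h).log_softmax(dim=-1)  # per-step log p_t(a | X)
```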
We mentioned earlier that CTC makes a conditional independence assumption over the characters in the output sequence. This is one of the big advantages that other encoder-decoder models have over CTC — they can model the dependence over the outputs. However in practice, CTC is still more commonly used in tasks like speech recognition as we can partially overcome the conditional independence assumption by including an external language model.

## Practitioner’s Guide

So far we’ve mostly developed a conceptual understanding of CTC. Here we’ll go through a few implementation tips for practitioners.

**Software:** Even with a solid understanding of CTC, the implementation is difficult. The algorithm has several edge cases and a fast implementation should be written in a lower-level programming language. Open-source software tools make it much easier to get started:

- Baidu Research has open-sourced [warp-ctc](https://github.com/baidu-research/warp-ctc). The package is written in C++ and CUDA. The CTC loss function runs on either the CPU or the GPU. Bindings are available for Torch, TensorFlow and [PyTorch](https://github.com/awni/warp-ctc).
- TensorFlow has built in [CTC loss](https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss) and [CTC beam search](https://www.tensorflow.org/api_docs/python/tf/nn/ctc_beam_search_decoder) functions for the CPU.
- Nvidia also provides a GPU implementation of CTC in [cuDNN](https://developer.nvidia.com/cudnn) versions 7 and up.
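As an illustration of the kind of interface these packages expose, here is a minimal usage sketch with PyTorch’s built-in `nn.CTCLoss`; the shapes and sizes below are made up for the example:

```python
import torch
import torch.nn as nn

T, N, C, U = 50, 16, 28, 10   # input steps, batch size, alphabet size (incl. blank), target length
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)  # per-step log p_t(a | X)
targets = torch.randint(1, C, (N, U))                                    # label indices; 0 is the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back through the per-step log-probabilities
```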
**Numerical Stability:** Computing the CTC loss naively is numerically unstable. One method to avoid this is to normalize the $\alpha$’s at each time-step. The original publication has more detail on this including the adjustments to the gradient. In practice this works well enough for medium length sequences but can still underflow for long sequences. A better solution is to compute the loss function in log-space with the log-sum-exp trick. When computing the sum of two probabilities in log space, use the identity

$$ \log(e^a + e^b) = \max\{a, b\} + \log(1 + e^{-|a-b|}) $$

Most programming languages have a stable function to compute $\log(1 + x)$ when $x$ is close to zero.
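A tiny sketch of that identity, using Python’s `math.log1p` as the stable `log(1 + x)`:

```python
import math

def log_add(a, b):
    """Stable log(exp(a) + exp(b)) for log-space probabilities."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```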
**Beam Search:** There are a couple of good tips to know about when implementing and using the CTC beam search.

The correctness of the beam search can be tested as follows:

1. Run the beam search algorithm on an arbitrary input.
2. Save the inferred output $\bar{Y}$ and the corresponding score $\bar{c}$.
3. Compute the actual CTC score $c$ for $\bar{Y}$.
4. Check that $\bar{c} \approx c$, with the former no greater than the latter. As the beam size increases the inferred output $\bar{Y}$ may change, but the two numbers should grow closer.

A common question when using a beam search decoder is the size of the beam to use. There is a trade-off between accuracy and runtime. We can check if the beam size is in a good range. To do this, first compute the CTC score for the inferred output, $c_i$. Then compute the CTC score for the ground truth output, $c_g$. If the two outputs are the same then the beam size is in a good range. If $c_g \lt c_i$ then the inferred output has a higher probability than the ground truth and the beam search is working as it should; any error is more likely due to the model than the search. If $c_i \ll c_g$ then the ground truth is much more probable than the inferred output, which means the beam search failed to find it; in that case a larger beam size may be warranted.