The Hierarchical Softmax
The Hierarchical Softmax is useful for efficient classification as it has logarithmic, rather than linear, time complexity in the number of output classes.
Small additions or modifications to the end of the article will be made when time allows, in order to include remarks on the use of Huffman coding with the Hierarchical Softmax and to draw links between the concepts introduced here and, for example, Noise Contrastive Estimation (NCE). This was written on 29th March 2021, though published later.
The softmax classifier generalises logistic regression from a binary to a multi-class setting. Given a vector of logit scores, $\mathbf{z} = (z_1, \dots, z_n)$, it outputs a probability distribution over the $n$ classes,

$$\operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}.$$

The softmax should arguably be called the softargmax as it computes a smooth approximation to the (one-hot encoded) $\operatorname{argmax}$ of the logits.
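As a minimal sketch (using NumPy; the function name and example logits are my own illustrative choices), the standard numerically stable way to compute the softmax subtracts the maximum logit before exponentiating:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Map a vector of logits to a probability distribution.

    Subtracting the maximum logit leaves the result unchanged
    (it cancels in numerator and denominator) but avoids overflow
    when exponentiating large scores.
    """
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)                            # approx. [0.659 0.242 0.099]
print(p.sum())                      # 1.0
print(np.argmax(z), np.argmax(p))   # both 0: a smooth "argmax" that preserves the ranking of the logits
```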
Remark on Energy and the Boltzmann Distribution
It is interesting to note that the softmax is identical to the Boltzmann distribution,

$$p_i = \frac{e^{-E_i/(kT)}}{\sum_j e^{-E_j/(kT)}},$$

with the energies set to $E_i = -z_i$ and $kT = 1$. Here $E_i$ is the energy of state $i$, $T$ is the temperature and $k$ is the Boltzmann constant.
The fact that energy can be related to probabilities via the Boltzmann distribution is worth remembering in the context of Energy-Based Models (EBMs).
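To make the correspondence concrete, here is a small numerical check (a sketch with illustrative values, assuming $E_i = -z_i$ and $kT = 1$ as above) that the Boltzmann distribution over these energies coincides with the softmax of the logits:

```python
import numpy as np

def boltzmann(energies: np.ndarray, kT: float = 1.0) -> np.ndarray:
    """Boltzmann distribution: p_i proportional to exp(-E_i / kT)."""
    weights = np.exp(-energies / kT)
    return weights / weights.sum()

z = np.array([2.0, 1.0, 0.1])

# Softmax of the logits, computed inline for self-containedness.
exp_z = np.exp(z - z.max())
softmax_z = exp_z / exp_z.sum()

# With energies E_i = -z_i and kT = 1, the two distributions agree.
print(np.allclose(boltzmann(-z), softmax_z))  # True
```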
Classification problems occur in language when models must generate probability distributions over a vocabulary of words, for example to predict the next word in language modelling. Typical vocabulary sizes are on the order of tens of thousands to millions of words.
When constructing language models, we might employ recurrent neural networks2 (RNNs) or transformer-based architectures3. In this case, the logit scores would be computed using a similarity score between the final layer output, $\mathbf{h} \in \mathbb{R}^d$, and a learned output embedding, $\mathbf{w}_i \in \mathbb{R}^d$, for each word $i$ in the vocabulary $V$.
Remark on Similarity Scores
A common choice for similarity is the dot product, $z_i = \mathbf{w}_i^\top \mathbf{h}$; Transformers use a scaled variant of the dot product4.
For the very large numbers of output classes encountered in language modelling, the softmax becomes very computationally expensive: in order to compute the probability of any single word being the next word in a sequence, we have to compute the denominator, and the denominator requires the logit of every word in the vocabulary. The logits are in effect the result of a matrix-vector multiplication between the word embedding matrix, $W \in \mathbb{R}^{|V| \times d}$, and the final layer output, $\mathbf{h}$, so the cost is $O(|V| \, d)$, i.e. linear in the vocabulary size.
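The following sketch illustrates the point; the vocabulary size, hidden dimension and random parameters are illustrative stand-ins rather than values from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_dim = 10_000, 256                # illustrative sizes only
W = rng.normal(size=(vocab_size, hidden_dim))       # output word embedding matrix
h = rng.normal(size=hidden_dim)                     # final layer output for the current context

logits = W @ h                # O(|V| * d): one dot-product similarity score per word
logits -= logits.max()        # numerical stability
probs = np.exp(logits)
probs /= probs.sum()          # the denominator sums over the whole vocabulary

# Even if we only care about the probability of a single candidate word,
# the normalisation above forces us to compute all |V| logits.
print(probs.shape)            # (10000,)
```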
Hierarchical Softmax
Frederic Morin and Yoshua Bengio (2005)6 proposed the hierarchical softmax classifier, an approximation to the softmax which is logarithmic-time in the vocabulary size, yielding an exponential speed-up:
The basic idea is to form a hierarchical description of a word as a sequence of $O(\log |V|)$ decisions, and to learn to take these probabilistic decisions instead of directly predicting each word's probability.
They extend the idea of Goodman7 to define a clustering partition of the vocabulary into word classes, decomposing the probability of a word $w$ given its context as

$$P(w \mid \text{context}) = P(c(w) \mid \text{context}) \, P(w \mid c(w), \text{context}),$$

where $c(w)$ is the class of $w$.
They also note that generalization could be better for choices of word clusters that “make sense”, i.e. those for which it is easier to learn the second term, $P(w \mid c(w), \text{context})$.
They follow this up with a concrete example8:

If $c$ can take $\sqrt{|V|}$ values and we have $\sqrt{|V|}$ classes [clusters] with $\sqrt{|V|}$ words in each class, then instead of doing normalization over $|V|$ choices we only need to do two normalizations, each over $\sqrt{|V|}$ choices.

Performing a one-level decomposition therefore provides at most a $\sqrt{|V|}$ speed-up; decomposing further into a binary tree of depth $O(\log |V|)$ is what yields the exponential speed-up of the hierarchical softmax.
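A minimal sketch of this one-level, class-based decomposition follows. The class assignment here is random (so certainly not a clustering that “makes sense”), and all names and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_dim = 10_000, 64
n_classes = int(np.sqrt(vocab_size))                          # ~100 classes of ~100 words each
word_to_class = rng.integers(0, n_classes, size=vocab_size)   # arbitrary clustering for illustration

W_class = rng.normal(size=(n_classes, hidden_dim))            # class parameter vectors
W_word = rng.normal(size=(vocab_size, hidden_dim))            # word parameter vectors

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def two_level_prob(word: int, h: np.ndarray) -> float:
    """P(word | h) = P(c(word) | h) * P(word | c(word), h)."""
    c = word_to_class[word]
    p_class = softmax(W_class @ h)[c]                 # normalise over ~sqrt(|V|) classes
    members = np.flatnonzero(word_to_class == c)      # words sharing the class
    p_word_given_class = softmax(W_word[members] @ h)[np.where(members == word)[0][0]]
    return p_class * p_word_given_class

h = rng.normal(size=hidden_dim)
# Two normalisations over ~sqrt(|V|) terms each, instead of one over |V|.
print(two_level_prob(word=42, h=h))
```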
Another important idea of this paper is to reuse the same model (i.e. the same parameters) for all those decisions (otherwise a very large number of models would be required and the whole model would not fit in computer memory), using a special symbolic input that characterizes the nodes in the tree of the hierarchical decomposition.
Finally, we use prior knowledge in the WordNet lexical reference system to help define the hierarchy of word classes.
There is an analogy to binary search in an ordered list: word clusters that “make sense” play a role similar to the ordering of the list, which is what allows binary search to locate an element in logarithmic time.
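Pushing the decomposition all the way to a binary tree gives the hierarchical softmax proper: each word is a leaf, and its probability is the product of binary (sigmoid) decisions along the root-to-leaf path, so only $O(\log |V|)$ parameter vectors are touched per word. The sketch below uses a complete binary tree over word indices for simplicity (word2vec instead builds a Huffman tree); the heap-style node indexing and all names are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_dim = 16, 8               # tiny sizes so the sanity check below is cheap
depth = int(np.ceil(np.log2(vocab_size)))    # O(log |V|) decisions per word

# One parameter vector per internal node of a complete binary tree (|V| - 1 of them).
node_vectors = rng.normal(size=(vocab_size - 1, hidden_dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_prob(word: int, h: np.ndarray) -> float:
    """P(word | h) as a product of left/right decisions along the root-to-leaf path."""
    prob, node = 1.0, 0                       # start at the root (internal node 0)
    for d in reversed(range(depth)):
        go_right = (word >> d) & 1            # bit d of the word index chooses the branch
        p_right = sigmoid(node_vectors[node] @ h)
        prob *= p_right if go_right else (1.0 - p_right)
        node = 2 * node + 1 + go_right        # descend to the child (heap-style indexing)
    return prob

h = rng.normal(size=hidden_dim)
# The sigmoid decisions at each node guarantee the leaf probabilities sum to one,
# with no normalisation over the full vocabulary.
print(sum(word_prob(w, h) for w in range(vocab_size)))  # ~1.0
```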
Tomas Mikolov and colleagues encountered this problem when using neural networks to compute word embeddings10. They used the pretext task11 of either predicting a word in a window given its context, the Continuous Bag of Words model (CBOW), or predicting the context words given an input word, the Skip-gram model.
Graphical representation of the CBOW model and Skip-gram model. In the CBOW model, the distributed representations of context (or surrounding words) are combined to predict the word in the middle. In the Skip-gram model, the distributed representation of the input word is used to predict the context. Image source: Exploiting Similarities among Languages for Machine Translation12.
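As a rough sketch of the two pretext tasks (the sentence, window size and pair format are illustrative choices, not taken from the papers), both models slice a text window into (input, target) pairs:

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

def skipgram_pairs(tokens, window):
    """Skip-gram: predict each context word from the centre word."""
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield centre, tokens[j]          # (input word, context word to predict)

def cbow_pairs(tokens, window):
    """CBOW: predict the centre word from its combined context."""
    for i, centre in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        yield context, centre                    # (context words, centre word to predict)

print(list(skipgram_pairs(sentence, window))[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
print(next(iter(cbow_pairs(sentence, window))))
# (['quick', 'brown'], 'the')
```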
Further Reading
- Sebastian Ruder’s (https://ruder.io/) second post in a larger series https://ruder.io/word-embeddings-softmax/index.html. See also the first (https://ruder.io/word-embeddings-1/) and third (https://ruder.io/secret-word2vec/) in this series.
- Nice image illustrating Hierarchical Softmax and up-to-date papers on it at Papers with Code: https://paperswithcode.com/method/hierarchical-softmax.
- Lei Mao, Hierarchical Softmax: https://leimao.github.io/article/Hierarchical-Softmax/.
Related Concepts
- Lei Mao Noise Contrastive Estimation. https://leimao.github.io/article/Noise-Contrastive-Estimation/.
- Lei Mao Entropy, Perplexity and Its Applications. https://leimao.github.io/blog/Entropy-Perplexity/
References and Notes
- See Baydin, Pearlmutter, Radul and Siskind (2018) Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research. https://jmlr.org/papers/v18/17-468.html. ↩
- See Chris Olah’s fantastic blog post, Understanding LSTM Networks, for an explanation of long short-term memory networks, the most widely-used variant of RNNs, which uses gating to overcome the vanishing gradient problem. ↩
- Read the seminal Vaswani et al. (2017) Attention is All You Need https://arxiv.org/pdf/1706.03762.pdf for full details, Jay Alammar’s The Illustrated Transformer for a readable explanation or The Annotated Transformer for details on implementation. ↩
- In their seminal paper on Transformers, Ashish Vaswani and colleagues used scaled dot-product attention, scaling the dot products by $1/\sqrt{d_k}$, where $d_k$ is the feature size of the keys, to prevent the dot products from growing in magnitude and thus pushing the softmax values to regions where the gradients are small and learning via backpropagation is difficult. See section 3.2.1 in Vaswani et al. (2017) Attention is All You Need https://arxiv.org/pdf/1706.03762.pdf. ↩
- Morin and Bengio (2005) Hierarchical probabilistic neural network language model. Aistats. https://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf. ↩
- Goodman, J. (2001). Classes for fast maximum entropy training. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). ↩
- I have tweaked the quote slightly to align it with the notation I introduced. ↩
- A tree is balanced if the following is true when applied recursively to subtrees: (1) The left and right subtrees’ heights differ by at most one, AND (2) The left subtree is balanced, AND (3) The right subtree is balanced. See https://stackoverflow.com/questions/8015630/definition-of-a-balanced-tree. ↩
- Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013) Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781. ↩
- Pretext tasks are dummy tasks whose predictions are not important in themselves but rather guide a model to learn useful feature representations. This is now commonly discussed in the context of self-supervised learning. See Ishan Misra’s Self-Supervised Learning - Pretext Tasks from Deep Learning DS-GA 1008 · Spring 2020 · NYU Center for Data Science, taught by Yann LeCun & Alfredo Canziani. ↩
- Mikolov, Le and Sutskever (2013) Exploiting Similarities among Languages for Machine Translation. https://arxiv.org/abs/1309.4168v1. ↩