Title: Enriching Word Vectors with Subword Information
Authors: Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov
Published: 15 July 2016
Link: http://arxiv.org/abs/1607.04606v2
Abstract
Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
Introduces an elaboration of Mikolov's skip-gram model (I believe the densely connected one from 2013) that incorporates sub-word information, in order to better model morphologically rich languages (e.g. German or Turkish) and to enable generation of word representations for out-of-vocabulary (OOV) words.
Authors train representations for character n-grams (e.g. for n = 3 to 6) and construct each word representation as a simple vector sum of these sub-word representations. This performs well with less data and on morphologically rich languages, including for the language modelling task.
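The n-gram extraction and summing scheme can be sketched as follows. This is a minimal illustration, not the paper's implementation: the boundary markers '<' and '>' and the inclusion of the full word as its own unit follow the paper, but the random vectors here stand in for the learned embeddings, and all names (char_ngrams, word_vector, etc.) are made up for this sketch.

```python
import random

def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams from a word wrapped in the
    boundary markers '<' and '>', as described in the paper."""
    w = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    grams.add(w)  # the full word (with markers) is also a unit
    return grams

random.seed(0)
DIM = 5        # toy dimensionality; the paper uses e.g. 300
_table = {}    # n-gram -> vector

def vector(gram):
    # Lazily assign a random vector per n-gram. In the real model
    # these vectors are learned with the skip-gram objective.
    if gram not in _table:
        _table[gram] = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    return _table[gram]

def word_vector(word):
    # Word representation = elementwise sum of its n-gram vectors.
    # This works even for words never seen during training, since
    # their n-grams are likely shared with known words.
    vecs = [vector(g) for g in char_ngrams(word)]
    return [sum(components) for components in zip(*vecs)]
```

For example, `char_ngrams("where")` includes "<wh", "her", "ere" and the full unit "<where>", so the OOV word "wherever" still shares many sub-word vectors with "where".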
fastText paper (1 of 2)