Title: Enriching Word Vectors with Subword Information
Authors: Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov
Published: 15 July 2016
Link: http://arxiv.org/abs/1607.04606v2
Abstract
Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
Introduces an elaboration of Mikolov's skip-gram model (I believe the densely connected one from 2013) that incorporates sub-word information, in order to better model morphologically rich languages (e.g. German or Turkish) and to enable generation of word representations for out-of-vocabulary (OOV) words.
Authors train representations for character n-grams (e.g. for n = 3 to 6) and construct each word representation as a simple vector sum of these sub-word representations. This performs well with less data and on morphologically rich languages, including for the language modelling task.
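The n-gram extraction and summing scheme can be sketched as follows. This is a minimal illustration, not the paper's implementation: the boundary markers '<' and '>' and the inclusion of the full word as its own unit follow the paper, but the random vectors here stand in for the learned embeddings, and all names (char_ngrams, word_vector, etc.) are made up for this sketch.

```python
import random

def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams from a word wrapped in the
    boundary markers '<' and '>', as described in the paper."""
    w = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    grams.add(w)  # the full word (with markers) is also a unit
    return grams

random.seed(0)
DIM = 5        # toy dimensionality; the paper uses e.g. 300
_table = {}    # n-gram -> vector

def vector(gram):
    # Lazily assign a random vector per n-gram. In the real model
    # these vectors are learned with the skip-gram objective.
    if gram not in _table:
        _table[gram] = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    return _table[gram]

def word_vector(word):
    # Word representation = elementwise sum of its n-gram vectors.
    # This works even for words never seen during training, since
    # their n-grams are likely shared with known words.
    vecs = [vector(g) for g in char_ngrams(word)]
    return [sum(components) for components in zip(*vecs)]
```

For example, `char_ngrams("where")` includes "<wh", "her", "ere" and the full unit "<where>", so the OOV word "wherever" still shares many sub-word vectors with "where".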
fastText paper (1 of 2)