Datasets / Corpora

SemCor: semantically annotated English corpus - see SemCor – sense-tagged English corpus (Sketch Engine)

The SemCor corpus is an English corpus with semantically annotated texts. The semantic analysis was done manually with WordNet 1.6 senses (SemCor version 1.6) and later automatically mapped to WordNet 3.0 senses (SemCor version 3.0). The corpus consists of 352 texts from the Brown Corpus.

This sense-tagged corpus SemCor 3.0 was automatically created from SemCor 1.6 by mapping WordNet 1.6 to WordNet 3.0 senses. SemCor 1.6 was created and is property of Princeton University. The automatic mapping was performed by Rada Mihalcea (rada@cs.unt.edu).

The corpus also has multi-word expressions (MWEs) marked with an underscore (_), e.g. manor_house. These multi-word units were annotated by Siva Reddy.
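
The underscore convention is easy to handle with plain string processing. A minimal sketch (the function names are my own, not part of SemCor's tooling):

```python
def is_mwe(token: str) -> bool:
    """A SemCor token is a multi-word expression if it contains an underscore."""
    return "_" in token

def split_mwe(token: str) -> list[str]:
    """Split an underscore-joined multi-word expression into its words."""
    return token.split("_")

print(is_mwe("manor_house"))     # True
print(split_mwe("manor_house"))  # ['manor', 'house']
```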

Part-of-speech tagset

SemCor was tagged by TreeTagger using the Penn Treebank tagset.

WordNet

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.

WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.

Structure of WordNet

  • the main relation among words in WordNet is synonymy (e.g. between the words shut and close or car and automobile)
  • Synonym: two or more words that denote the same concept and are interchangeable in many contexts
  • Synsets: Synonyms are grouped into unordered sets: synsets
  • 117,000 synsets
  • synsets are linked to other synsets by conceptual relations (few links per synset)
  • Glosses: synsets contain brief definitions: glosses
  • Word forms with several distinct meanings are represented in as many distinct synsets
    • each form-meaning pair in WordNet is unique
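
The bullet points above can be mirrored in a small data structure. A toy sketch of a synset as described (not the real WordNet API; for that see e.g. the NLTK `wordnet` interface):

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    """Toy synset: an unordered set of synonymous lemmas plus a gloss.

    Illustrative only; real WordNet synsets also carry POS, sense keys, etc.
    """
    lemmas: frozenset[str]                       # synonyms grouped together
    gloss: str                                   # brief definition
    hypernyms: list["Synset"] = field(default_factory=list)  # links to other synsets

# A word form with several distinct meanings appears in as many distinct
# synsets: each form-meaning pair is unique.
bank_river = Synset(frozenset({"bank"}),
                    "sloping land beside a body of water")
bank_money = Synset(frozenset({"bank", "depository_financial_institution"}),
                    "a financial institution that accepts deposits")

print(bank_river.gloss)  # same form 'bank', two distinct synsets
print(bank_money.gloss)
```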

Synset Relations in WordNet

Hyperonymy: super- to sub-ordinate relation

  • most frequently encoded relation among synsets is the super-subordinate relation - also called hyperonymy, hyponymy or ISA relation
    • links more general synsets like {furniture, piece_of_furniture} to increasingly specific ones like {bed} and {bunkbed}
  • thus WordNet states that the category furniture includes bed, which in turn includes bunkbed; conversely, concepts like bed and bunkbed make up the category furniture
  • All noun hierarchies ultimately go up to the root node {entity}
  • the hyponymy relation in WordNet is transitive: if an armchair is a kind of chair, and if a chair is a kind of furniture, then an armchair is a kind of furniture
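
Transitivity falls out naturally if hypernym links are followed up to the root. A toy sketch over the furniture example (the dictionary is illustrative, not real WordNet data):

```python
# Toy hypernym links (child -> parent), mirroring the examples above.
HYPERNYM = {
    "bunkbed": "bed",
    "bed": "furniture",
    "armchair": "chair",
    "chair": "furniture",
    "furniture": "entity",   # all noun hierarchies end at the root {entity}
}

def is_a(concept: str, category: str) -> bool:
    """Transitive hyponymy: follow hypernym links until the root is reached."""
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        if concept == category:
            return True
    return False

print(is_a("armchair", "furniture"))  # True: armchair -> chair -> furniture
print(is_a("armchair", "entity"))     # True: every noun reaches the root
print(is_a("furniture", "bed"))       # False: the relation is not symmetric
```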

Types versus Instances:

  • WordNet distinguishes:
    • Types - common nouns; and
    • Instances - specific persons, countries and geographic entities
    • thus armchair is a type of chair, while Barack Obama is an instance of a president
  • Instances are always leaf (terminal) nodes in their hierarchies

Meronymy: part-whole relation

  • meronymy, the part-whole relation, holds between synsets like {chair} and its parts {back, backrest}, {seat} and {leg}
  • parts are inherited from their superordinates: if a chair has legs, then an armchair has legs as well
  • parts are not inherited “upward” as they may be characteristic only of specific kinds of things rather than the class as a whole
    • for example, chairs and kinds of chairs have legs, but not all kinds of furniture have legs
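
Downward-only inheritance of parts can be sketched in a few lines: parts are collected along the hypernym chain, so {armchair} picks up the legs of {chair}, while nothing propagates upward to {furniture} (toy data, illustrative only):

```python
# Direct part lists, as in {chair} -> {leg}, {seat}, {back, backrest}.
PARTS = {"chair": {"leg", "seat", "back"}}
# Toy hypernym links (child -> parent).
HYPERNYM = {"armchair": "chair", "chair": "furniture"}

def all_parts(concept: str) -> set[str]:
    """Parts are inherited downward: collect them along the hypernym chain."""
    parts: set[str] = set()
    while concept is not None:
        parts |= PARTS.get(concept, set())
        concept = HYPERNYM.get(concept)
    return parts

print(all_parts("armchair"))   # inherits leg/seat/back from chair
print(all_parts("furniture"))  # empty: parts are not inherited upward
```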

Verb Synsets:

  • verb synsets are arranged into hierarchies as well
  • verbs towards the bottom of the trees - troponyms - express increasingly specific manners characterizing an event
    • as in {communicate}-{talk}-{whisper}
  • the specific manner expressed depends on the semantic field
    • volume (as in the example above) is just one dimension along which verbs can be elaborated
    • others are speed (move-jog-run) or intensity of emotion (like-love-idolize)
  • verbs describing events that necessarily and unidirectionally entail one another are linked: {buy}-{pay}, {succeed}-{try}, {show}-{see}, etc.

Adjective Synsets:

  • adjectives are organized in terms of antonymy
  • pairs of direct antonyms (like wet-dry and young-old) reflect the strong semantic contrast of their members
  • each of these polar adjectives in turn is linked to a number of “semantically similar” ones:
    • dry is linked to parched, arid, desiccated and bone-dry and wet to soggy, waterlogged, etc.
  • semantically similar adjectives are “indirect antonyms” of the central member of the opposite pole.
  • relational adjectives - pertainyms - point to the nouns they are derived from (criminal-crime)

Adverb Synsets:

  • only a few adverbs in WordNet (hardly, mostly, really, etc.) as the majority of English adverbs are straightforwardly derived from adjectives via morphological affixation (surprisingly, strangely, etc.)

Cross-POS relations:

  • the majority of WordNet’s relations connect words from the same part of speech (POS)
    • i.e. WordNet really consists of four sub-nets: nouns, verbs, adjectives and adverbs
  • few cross-POS pointers
  • Cross-POS relations include the “morphosemantic” links that hold among semantically similar words sharing a stem with the same meaning:
    • observe (verb), observant (adjective), observation, observatory (nouns)
  • In many of the noun-verb pairs the semantic role of the noun with respect to the verb has been specified:
    • {sleeper, sleeping_car} is the LOCATION for {sleep} and {painter} is the AGENT of {paint}, while {painting, picture} is its RESULT

Polysemy vs Homonymy:

  • Polysemy: the coexistence of many possible meanings for a word or phrase.
  • Homonymy: Two or more words having the same lexical form (spelling) but different, unrelated meanings
    • for example, pole1 (round bar) and pole2 (either of the two points at which the axis of a circle cuts the surface of a sphere) 
    • cf homograph: each of two or more words spelled the same but not necessarily pronounced the same and having different meanings and origins
    • cf homophone: each of two or more words having the same pronunciation but different meanings, origins, or spelling, for example new and knew

See Homonym > Related terms (Wikipedia)

Coarse- and Fine-grained Word Senses

There is a coarse-to-fine mapping for word senses (provided either by WordNet or by Navigli’s group).

An example of the difference in granularity is school (coarse sense ID: school.n.h.01):

"school.n.h.01": [
        {
            "school.n.01": "an educational institution"
        },
        {
            "school.n.02": "a building where young people receive education"
        },
        {
            "school.n.03": "the process of being formally educated at a school"
        },
        {
            "school.n.04": "a body of creative artists or writers or thinkers linked by a similar style or by similar teachers"
        },
        {
            "school.n.05": "the period of instruction in a school; the time period when school is in session"
        },
        {
            "school.n.06": "an educational institution's faculty and students"
        }
    ]
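
Given a mapping with this JSON shape, the lookup usually needed in practice is the inverse one, from a fine sense ID to its coarse cluster. A sketch assuming exactly the structure shown above (abbreviated here):

```python
import json

# Coarse-to-fine mapping in the shape shown above (only two senses kept).
COARSE_TO_FINE = json.loads("""
{
  "school.n.h.01": [
    {"school.n.01": "an educational institution"},
    {"school.n.02": "a building where young people receive education"}
  ]
}
""")

def fine_to_coarse(mapping: dict) -> dict:
    """Invert the mapping: each fine sense ID -> its coarse sense ID."""
    inverse = {}
    for coarse_id, fine_senses in mapping.items():
        for entry in fine_senses:          # each entry is {fine_id: gloss}
            for fine_id in entry:
                inverse[fine_id] = coarse_id
    return inverse

print(fine_to_coarse(COARSE_TO_FINE)["school.n.02"])  # school.n.h.01
```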

SemEval

SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series.

See also the coarse-grained SemEval-2007 task organised by Navigli (the page linked is now hosted on Sapienza’s website).

Word Sense Disambiguation - Questions and Curiosities

How do you represent the senses in a network architecture?

In neural Word Sense Disambiguation architectures, how is the target represented in the network? Specifically, I’m confused by the fact that - naively - one would think that every word in the vocabulary needs a set of possible senses, so we would have to retain a huge set of targets corresponding to the senses, where most of the target classes are not valid for a given word.

Answer | Phind v8 Model

In neural Word Sense Disambiguation (WSD) architectures, the target is typically represented by a one-hot encoded vector that corresponds to the correct sense of the ambiguous word in the given context. This vector is part of the training data and is used to guide the learning process of the neural network. However, it’s important to note that the size of this vector is not equal to the size of the vocabulary, but rather to the number of possible senses for each word. This is because not all words have the same number of senses, and not all senses are valid for a given word.

For example, consider a word that has three senses: ‘sense1’, ‘sense2’, and ‘sense3’. The target for this word would be a vector of length 3, with a 1 at the index corresponding to the correct sense and 0s elsewhere. If ‘sense2’ is the correct sense, the target vector would be [0, 1, 0].

The challenge you mentioned - that every word in the vocabulary needs a set of possible senses, leading to a large number of target classes - is indeed a common issue in WSD. However, it’s important to note that the number of target classes is not equal to the size of the vocabulary. Instead, it’s equal to the number of possible senses across all words, which is typically much smaller than the size of the vocabulary.
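
One common way to reconcile a global sense inventory with per-word validity is to keep one output unit per sense but mask out the senses the target lemma cannot take, so prediction only ranges over valid candidates. A hypothetical sketch (sense IDs and the inventory are invented for illustration):

```python
# Global sense inventory (hypothetical IDs); each word lists its valid senses.
SENSES = ["bank.n.01", "bank.n.02", "school.n.01", "school.n.02", "school.n.03"]
VALID = {"bank": {"bank.n.01", "bank.n.02"},
         "school": {"school.n.01", "school.n.02", "school.n.03"}}

def one_hot_target(word: str, gold_sense: str) -> list[int]:
    """One-hot training target over the global inventory for the gold sense."""
    assert gold_sense in VALID[word], "gold sense must be valid for the word"
    return [1 if s == gold_sense else 0 for s in SENSES]

def mask(word: str) -> list[int]:
    """1 for senses the word can take, 0 elsewhere.  At prediction time the
    model's scores are multiplied by (or restricted to) this mask, so a
    sense that is invalid for the target word can never be chosen."""
    return [1 if s in VALID[word] else 0 for s in SENSES]

print(one_hot_target("school", "school.n.02"))  # [0, 0, 0, 1, 0]
print(mask("bank"))                             # [1, 1, 0, 0, 0]
```

The masking step is what keeps the large global output layer manageable in practice: most words have only a handful of valid senses, so the effective classification problem per instance stays small.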

There are several strategies to handle this issue:

  1. Sense Inventory: WSD models often use a predefined sense inventory, such as WordNet, which provides a list of possible senses for each word. This allows the model to focus on a manageable number of senses. However, this approach has limitations because the sense inventory may not cover all possible senses of a word, and it may not accurately capture the nuances of a word’s meaning.

  2. Contextualized Embeddings: Another approach is to use contextualized word embeddings, such as those generated by transformer models like BERT. These embeddings capture the context in which a word is used, which can help the model distinguish between different senses of a word. However, this approach requires a large amount of training data and computational resources.

  3. Memory Enhancement Mechanism: This approach stores the gloss knowledge of previously identified words into a memory, and further utilizes it to assist the disambiguation of the next target word. This can help the model leverage knowledge from previous words to disambiguate the current word.

In summary, while it’s true that the target in WSD architectures can be a large vector due to the number of possible senses for each word, this doesn’t mean that every word in the vocabulary needs a set of possible senses. The actual number of target classes is typically much smaller than the size of the vocabulary, and there are several strategies to handle this issue.