Title: BERTScore: Evaluating Text Generation with BERT
Authors: Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi
Published: 21st April 2019 (Sunday) @ 23:08:53
Link: http://arxiv.org/abs/1904.09675v3

Abstract

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.


  • “BERTSCORE computes the similarity of two sentences as a sum of cosine similarities between their tokens’ embeddings.” [boldface mine] - see Computing BERTScore
    • Compute a matrix of cosine similarities between the tokens of the two sentences (reference and candidate) using BERT embeddings (pre-normalized, so each entry is just an inner product)
    • Match tokens greedily based on maximum similarity
    • Compute “precision” and “recall”, and from them an F1 score - see the definitions in Computing BERTScore and the usage sketch after this list
      • Question: Do these correspond to true precision and recall?
    • Rescaling: they observe their BERTScores are compressed in a subset of the range in which cosine similarity lies: “We address this by rescaling BERTSCORE with respect to its empirical lower bound b as a baseline.”
      • “We compute b using Common Crawl monolingual datasets.”
  • Addresses two problems:
      1. Paraphrases - BLEU and METEOR are based on surface-level (token) candidate-reference overlap and penalize paraphrases that are semantically closer to the reference
      • example: given the reference “people like foreign cars”, BLEU and METEOR (Banerjee & Lavie, 2005) incorrectly give a higher score to “people like visiting places abroad” than to “consumers prefer imported cars”
      2. n-gram models fail to accommodate semantically relevant word permutations, e.g. in long-distance dependencies
      • BLEU will only mildly penalize swapping of cause and effect clauses (e.g. A because B instead of B because A)
      • this semantic distinction is better captured by BERTScore
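
In code, the whole pipeline is essentially a one-liner with the bert-score package released alongside the paper. The call below is a usage sketch under the assumption that the package exposes a `score(candidates, references, ...)` function with `lang` and `rescale_with_baseline` arguments returning per-sentence tensors (check its README for the exact signature):

```python
# pip install bert-score
from bert_score import score

candidates = ["consumers prefer imported cars"]
references = ["people like foreign cars"]

# Returns per-sentence precision, recall, and F1 tensors; rescale_with_baseline
# applies the empirical-baseline rescaling described above (an assumed flag name).
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(P[0].item(), R[0].item(), F1[0].item())
```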

Computing BERTScore

Similarity Measure The vector representation allows for a soft measure of similarity instead of exact-string (Papineni et al., 2002) or heuristic (Banerjee & Lavie, 2005) matching. The cosine similarity of a reference token $x_i$ and a candidate token $\hat{x}_j$ is $\frac{x_i^\top \hat{x}_j}{\|x_i\|\,\|\hat{x}_j\|}$. We use pre-normalized vectors, which reduces this calculation to the inner product $x_i^\top \hat{x}_j$. While this measure considers tokens in isolation, the contextual embeddings contain information from the rest of the sentence.
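
As a small illustration of this step, assuming the token embeddings have already been L2-normalized (function and variable names here are mine, not the reference implementation's), the whole similarity matrix is a single matrix product:

```python
import numpy as np

def similarity_matrix(ref_emb: np.ndarray, cand_emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between reference and candidate tokens.

    ref_emb:  (len_ref, dim) contextual embeddings with L2-normalized rows
    cand_emb: (len_cand, dim) contextual embeddings with L2-normalized rows
    Because the rows are unit-length, the inner product equals cosine similarity.
    """
    return ref_emb @ cand_emb.T  # shape (len_ref, len_cand)
```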

BERTSCORE

The complete score matches each token in $x$ to a token in $\hat{x}$ to compute recall, and each token in $\hat{x}$ to a token in $x$ to compute precision. We use greedy matching to maximize the matching similarity score, where each token is matched to the most similar token in the other sentence. We combine precision and recall to compute an F1 measure. For a reference $x$ and candidate $\hat{x}$, the recall, precision, and F1 scores are:

$$R_{\text{BERT}} = \frac{1}{|x|}\sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j, \qquad P_{\text{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^\top \hat{x}_j, \qquad F_{\text{BERT}} = 2\,\frac{P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$
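
Given such a similarity matrix, the greedy matching above reduces to row and column maxima. A minimal sketch (the helper name is illustrative):

```python
import numpy as np

def bert_score_from_sim(sim: np.ndarray) -> tuple[float, float, float]:
    """sim[i, j]: cosine similarity of reference token i and candidate token j."""
    recall = float(sim.max(axis=1).mean())     # best candidate match per reference token
    precision = float(sim.max(axis=0).mean())  # best reference match per candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```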

Importance Weighting

Previous work on similarity measures demonstrated that rare words can be more indicative for sentence similarity than common words (Banerjee & Lavie, 2005; Vedantam et al., 2015). BERTSCORE enables us to easily incorporate importance weighting. We experiment with inverse document frequency (idf) scores computed from the test corpus. Given $M$ reference sentences $\{x^{(i)}\}_{i=1}^{M}$, the idf score of a word-piece token $w$ is

$$\mathrm{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}\left[w \in x^{(i)}\right],$$

where $\mathbb{I}[\cdot]$ is an indicator function. We do not use the full tf-idf measure because we process single sentences, where the term frequency (tf) is likely 1. For example, recall with idf weighting is

$$R_{\text{BERT}} = \frac{\sum_{x_i \in x} \mathrm{idf}(x_i)\, \max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j}{\sum_{x_i \in x} \mathrm{idf}(x_i)}.$$

Because we use reference sentences to compute idf, the idf scores remain the same for all systems evaluated on a specific test set. We apply plus-one smoothing to handle unknown word pieces.
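
A sketch of the idf weights and idf-weighted recall; the exact form of the plus-one smoothing is one plausible reading of the paper's one-line description, and all names are illustrative:

```python
import math
from collections import Counter

import numpy as np

def idf_weights(reference_corpus: list[list[str]]) -> dict[str, float]:
    """idf(w) = -log((1/M) * #{references containing w}), computed over the test-set references."""
    M = len(reference_corpus)
    doc_freq = Counter(w for ref in reference_corpus for w in set(ref))
    return {w: -math.log(df / M) for w, df in doc_freq.items()}

def idf_weighted_recall(sim: np.ndarray, ref_tokens: list[str], idf: dict[str, float], M: int) -> float:
    """sim[i, j]: similarity of reference token i and candidate token j.

    Unknown word pieces fall back to a plus-one-smoothed weight, -log(1 / (M + 1)) (an assumption).
    """
    w = np.array([idf.get(t, -math.log(1.0 / (M + 1))) for t in ref_tokens])
    return float((w * sim.max(axis=1)).sum() / w.sum())
```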