Available Metrics

When creating an evaluation run, these five metrics are available out of the box:
  • Bleu - Measures the quality of a translation
  • Rouge - Measures the quality of a summary or translation
  • Meteor - Measures translation quality using semantic matching
  • Cosine Similarity - Assesses similarity between two texts by measuring the distance between their embeddings in vector space
  • F1 Score - Measures token-level precision and recall
[Screenshot: evaluation metrics selection interface]

The fields available for comparison are defined by the dataset's schema. For example, summarization datasets offer document, summary, and expected_summary as comparison choices.

Bleu

Library: nltk.translate.bleu_score.sentence_bleu
Non-configurable parameters:
  • weights - 0.25 for all n-grams
  • tokenizer - nltk.tokenize.word_tokenize
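
A minimal sketch of what this configuration corresponds to in NLTK; the example strings are illustrative, the uniform 1- to 4-gram weights are an interpretation of "0.25 for all n-grams", and the exact call made by the platform may differ:

```python
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

# word_tokenize needs NLTK's punkt tokenizer data: nltk.download("punkt")
reference = "the quick brown fox jumps over the lazy dog"   # illustrative ground truth
candidate = "the quick brown fox jumped over the lazy dog"  # illustrative model output

score = sentence_bleu(
    [word_tokenize(reference)],         # list of tokenized references
    word_tokenize(candidate),           # tokenized candidate
    weights=(0.25, 0.25, 0.25, 0.25),   # equal weight for 1- to 4-grams
)
print(score)
```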

Rouge

Library: rouge_score.rouge_scorer
Configurable parameters:
  • score_types: List[str] - defines which ROUGE variants are returned. Defaults to ["rouge1", "rouge2", "rougeL"]
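
A rough sketch of how score_types maps onto the underlying library; the example texts are illustrative and the stemming behavior shown is the library default, not something documented above:

```python
from rouge_score import rouge_scorer

score_types = ["rouge1", "rouge2", "rougeL"]    # default score_types
scorer = rouge_scorer.RougeScorer(score_types)  # use_stemmer defaults to False

# score(target, prediction) returns one Score(precision, recall, fmeasure)
# tuple per requested ROUGE variant.
scores = scorer.score(
    "the cat sat on the mat",   # illustrative reference summary
    "the cat is on the mat",    # illustrative generated summary
)
print(scores["rougeL"].fmeasure)
```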

Meteor

Library: nltk.translate.meteor_score
Non-configurable parameters:
  • stemmer - PorterStemmer
  • wordnet - nltk.corpus.wordnet
  • alpha = 0.9, beta = 3.0, gamma = 0.5
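
A minimal sketch of the equivalent NLTK call; the example texts are illustrative, alpha/beta/gamma are the fixed values listed above, and PorterStemmer plus nltk.corpus.wordnet are the library's own defaults:

```python
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

# Requires nltk.download("punkt") and nltk.download("wordnet").
reference = word_tokenize("the cat sat on the mat")   # illustrative reference
hypothesis = word_tokenize("the cat is on the mat")   # illustrative output

# stemmer=PorterStemmer() and wordnet=nltk.corpus.wordnet are the defaults.
score = meteor_score([reference], hypothesis, alpha=0.9, beta=3.0, gamma=0.5)
print(score)
```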

Cosine Similarity

Library: sklearn.metrics.pairwise.cosine_similarity
Non-configurable parameters:
  • embedding model - sentence-transformers/all-MiniLM-L12-v2
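
A sketch of how the two pieces fit together, assuming the texts are embedded with the model above and then compared with scikit-learn; the example strings are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Embed both texts with the fixed embedding model, then compare the vectors.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
embeddings = model.encode([
    "the cat sat on the mat",   # illustrative expected answer
    "the cat is on the mat",    # illustrative generated answer
])

score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(score)
```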

F1 Score

Matching algorithm: the ground truth and predicted answer are lowercased and tokenized, then tokens are matched exactly, ignoring token order.
Non-configurable parameters:
  • tokenizer - nltk.tokenize.word_tokenize
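
An illustrative sketch of this token-level F1, not the platform's exact implementation; the helper name and example strings are hypothetical:

```python
from collections import Counter
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's punkt tokenizer data: nltk.download("punkt")
def token_f1(prediction: str, ground_truth: str) -> float:
    # Lowercase and tokenize both sides, then count overlapping tokens
    # without regard to order.
    pred_tokens = word_tokenize(prediction.lower())
    truth_tokens = word_tokenize(ground_truth.lower())
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat is on the mat", "the cat sat on the mat"))
```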