Evaluating with Objective Metrics

Transformer Lab provides a suite of industry-standard objective metrics, powered by DeepEval, for evaluating model outputs. Here's an overview of the available metrics:

  • ROUGE: Evaluates text similarity based on overlapping word sequences
  • BLEU: Measures the quality of machine-translated text by comparing it with reference translations
  • Exact Match: Checks for perfect string matches between output and expected output
  • Quasi Exact Match: Similar to exact match but allows for minor variations in capitalization and whitespace
  • Quasi Contains: Checks if the expected output is contained within the model output, allowing for the same minor variations (see the sketch after this list)
  • BERTScore: Uses contextual BERT embeddings to compute semantic similarity between the model output and the expected output
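
To make the difference between the three string-matching metrics concrete, here is a minimal Python sketch of the kind of normalization they describe. The helper names and the exact normalization rules are illustrative assumptions, not the plugin's implementation:

```python
def normalize(text: str) -> str:
    # Lower-case and collapse whitespace so minor formatting
    # differences do not count as mismatches.
    return " ".join(text.lower().split())

def exact_match(output: str, expected: str) -> bool:
    # Perfect string match, no normalization.
    return output == expected

def quasi_exact_match(output: str, expected: str) -> bool:
    # Match after normalizing capitalization and whitespace.
    return normalize(output) == normalize(expected)

def quasi_contains(output: str, expected: str) -> bool:
    # Expected output appears anywhere inside the model output,
    # again after normalization.
    return normalize(expected) in normalize(output)

print(exact_match("Paris", "paris"))                      # False
print(quasi_exact_match("Paris ", "paris"))               # True
print(quasi_contains("The capital is Paris.", "paris"))   # True
```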

Dataset Requirements

To perform these evaluations, you'll need to upload a dataset with the following required columns (a sketch of preparing such a file follows the list):

  • input: The prompt/query given to the LLM
  • output: The actual response generated by the LLM
  • expected_output: The ideal or reference response
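
For example, a dataset with these columns can be prepared as a CSV file before uploading. The rows and filename below are illustrative; only the column names are required:

```python
import pandas as pd

# Each row pairs a prompt with the model's actual answer and the
# reference answer the metrics compare against.
rows = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "expected_output": "Paris is the capital of France.",
    },
    {
        "input": "Translate 'good morning' to Spanish.",
        "output": "Buenos días",
        "expected_output": "Buenos días",
    },
]

# The required column names: input, output, expected_output.
pd.DataFrame(rows).to_csv("objective_eval_dataset.csv", index=False)
```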

Step-by-Step Evaluation Process

1. Download the Plugin

Navigate to the Plugins section and install the DeepEval Objective Metrics plugin.

(Screenshot: Download Plugin)

2. Create Evaluation Task

Configure your evaluation task with these settings:

a) Basic Configuration

  • Provide a name for your evaluation task
  • Select the desired evaluation metrics from the Tasks tab

b) Plugin Configuration

  • Set the sample fraction, i.e. the portion of the uploaded dataset to evaluate (see the sketch below)
  • Select your evaluation dataset

(Screenshot: Create Task)
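
As a rough illustration of what the sample fraction controls, the sketch below draws a random fraction of the dataset rows. This is an assumption about the behavior (a fraction of 1.0 evaluates every row), not the plugin's actual sampling code:

```python
import pandas as pd

# Dataset file from the earlier example.
df = pd.read_csv("objective_eval_dataset.csv")

# A sample fraction of 0.5 evaluates a random half of the rows.
sample_fraction = 0.5
subset = df.sample(frac=sample_fraction, random_state=42)
print(f"Evaluating {len(subset)} of {len(df)} rows")
```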

3. Run the Evaluation

Click the Queue button to start the evaluation process. Monitor progress through the View Output option.

(Screenshot: Run Evaluation)

4. Review Results

After completion, you can:

  • View the evaluation scores directly in the interface
  • Access the Detailed Report for in-depth analysis
  • Download the complete evaluation report

(Screenshot: View Results)
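
If you'd like to analyze the downloaded report outside the interface, a short script can summarize it once you know its format. The filename and column names below (metric, score) are hypothetical; inspect the actual report and adjust accordingly:

```python
import pandas as pd

# Hypothetical filename and columns; confirm the structure of the
# report you downloaded before relying on this.
report = pd.read_csv("evaluation_report.csv")
print(report.groupby("metric")["score"].describe())
```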