
LLM-as-Judge Evaluations

Transformer Lab provides an easy way to integrate the LLM-as-judge suite of metrics from DeepEval to evaluate model outputs across multiple dimensions. Here's an overview of the available metrics (a short scoring sketch follows the list):

  • Bias: Measures the presence of unfair prejudices or discriminatory content in model outputs
  • Toxicity: Evaluates the presence of harmful, offensive, or inappropriate content
  • Faithfulness: Assesses how well the output aligns with and stays true to the provided context
  • Hallucination: Checks for fabricated or incorrect information not supported by the context
  • Answer Relevancy: Measures how well the output addresses the input query
  • Contextual Precision: Evaluates the accuracy of information used from the provided context
  • Contextual Recall: Assesses how comprehensively the output covers relevant information from the context
  • Contextual Relevancy: Measures how well the output relates to and uses the given context
  • Custom (GEval): Allows creation of custom evaluation criteria by providing specific evaluation guidelines
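The metrics above are provided by the DeepEval library, which Transformer Lab wraps for you. For orientation only, here is a minimal sketch of how DeepEval scores a single example with two of these metrics; the judge model name, threshold, and example text are assumptions, not Transformer Lab defaults.

```python
# Minimal sketch of scoring one example with DeepEval metrics.
# Assumes `deepeval` is installed and credentials for the judge model
# (here a placeholder OpenAI model name) are configured.
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",           # prompt given to the LLM
    actual_output="Paris is the capital of France.",  # response generated by the LLM
    expected_output="Paris",                          # reference answer
    retrieval_context=["Paris is the capital and largest city of France."],
)

relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini")
bias = BiasMetric(threshold=0.5, model="gpt-4o-mini")

for metric in (relevancy, bias):
    metric.measure(test_case)                         # judge LLM scores the test case
    print(type(metric).__name__, metric.score, metric.reason)
```

The plugin handles this wiring for you across every row of your dataset; the sketch is only meant to show what each metric returns (a score plus a reason).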

Dataset Requirements

To perform these evaluations, you'll need to upload a dataset with the following required columns (an example follows the list):

  • input: The prompt/query given to the LLM
  • output: The actual response generated by the LLM
  • expected_output: The ideal or reference response
  • context: Supporting context (optional for plugins that don't require it)
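For illustration, a minimal dataset with these columns could be written as a CSV like this; the file name and example row are hypothetical.

```python
# Hypothetical example of preparing an evaluation dataset CSV
# with the columns the plugin expects (standard library only).
import csv

rows = [
    {
        "input": "What is the capital of France?",
        "output": "Paris is the capital of France.",
        "expected_output": "Paris",
        "context": "Paris is the capital and largest city of France.",
    },
]

with open("eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["input", "output", "expected_output", "context"]
    )
    writer.writeheader()
    writer.writerows(rows)
```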

Step-by-Step Evaluation Process

1. Download the Plugin

Navigate to the Plugins section and install the DeepEval Evaluations (LLM-as-Judge) plugin.

Download Plugin

2. Create Evaluation Task

Configure your evaluation task with these settings:

a) Basic Configuration

  • Provide a name for your evaluation task
  • Select the desired evaluation metrics from the Tasks tab

b) Plugin Configuration

  • Choose a judge model (select 'local' to use Transformer Lab's local model)
  • Set the sample fraction for evaluation
  • For GEval tasks:
    • Specify the Criteria Name and Description
    • Choose between context-dependent or context-independent evaluation
  • Select your evaluation dataset

Create Task

3. Run the Evaluation

Click the Queue button to start the evaluation process. Monitor progress through the View Output option.

Run Evaluation

4. Review Results

After completion, you can:

  • View the evaluation scores directly in the interface
  • Access the Detailed Report for in-depth analysis
  • Download the complete evaluation report

View Results

Note: You can also create custom evaluation metrics with GEval by providing a criteria name and description. The judge LLM uses that description to evaluate each output and returns both a score and the reasoning behind it; a minimal sketch is shown below.

View Results
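For reference, a GEval metric in DeepEval is essentially a name plus a free-form criteria description that the judge LLM scores against. The sketch below is illustrative only; the criteria text, evaluation parameters, and judge model name are assumptions.

```python
# Minimal sketch of a custom GEval metric in DeepEval.
# The criteria wording and judge model name are placeholders.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually consistent "
        "with the expected output."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    model="gpt-4o-mini",
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)  # score plus the judge's reasoning
```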