
LLM-as-Judge Evaluations

Transformer Lab provides an easy way to integrate the LLM-as-judge suite of metrics from DeepEval to evaluate model outputs across multiple dimensions. Here's an overview of the available metrics (a short scoring sketch follows the list):

  • Bias: Measures the presence of unfair prejudices or discriminatory content in model outputs
  • Toxicity: Evaluates the presence of harmful, offensive, or inappropriate content
  • Faithfulness: Assesses how well the output aligns with and stays true to the provided context
  • Hallucination: Checks for fabricated or incorrect information not supported by the context
  • Answer Relevancy: Measures how well the output addresses the input query
  • Contextual Precision: Evaluates the accuracy of information used from the provided context
  • Contextual Recall: Assesses how comprehensively the output covers relevant information from the context
  • Contextual Relevancy: Measures how well the output relates to and uses the given context
  • Custom (GEval): Allows creation of custom evaluation criteria by providing specific evaluation guidelines
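The metrics above are provided by the DeepEval library, which Transformer Lab wraps for you. For orientation only, here is a minimal sketch of how DeepEval scores a single example with two of these metrics; the judge model name, threshold, and example text are assumptions, not Transformer Lab defaults.

```python
# Minimal sketch of scoring one example with DeepEval metrics.
# Assumes `deepeval` is installed and credentials for the judge model
# (here a placeholder OpenAI model name) are configured.
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",           # prompt given to the LLM
    actual_output="Paris is the capital of France.",  # response generated by the LLM
    expected_output="Paris",                          # reference answer
    retrieval_context=["Paris is the capital and largest city of France."],
)

relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini")
bias = BiasMetric(threshold=0.5, model="gpt-4o-mini")

for metric in (relevancy, bias):
    metric.measure(test_case)                         # judge LLM scores the test case
    print(type(metric).__name__, metric.score, metric.reason)
```

The plugin handles this wiring for you across every row of your dataset; the sketch is only meant to show what each metric returns (a score plus a reason).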

Dataset Requirements

To perform these evaluations, you'll need to upload a dataset with the following required columns (an example follows the list):

  • input: The prompt/query given to the LLM
  • output: The actual response generated by the LLM
  • expected_output: The ideal or reference response
  • context: Supporting context (optional for plugins that don't require it)
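For illustration, a minimal dataset with these columns could be written as a CSV like this; the file name and example row are hypothetical.

```python
# Hypothetical example of preparing an evaluation dataset CSV
# with the columns the plugin expects (standard library only).
import csv

rows = [
    {
        "input": "What is the capital of France?",
        "output": "Paris is the capital of France.",
        "expected_output": "Paris",
        "context": "Paris is the capital and largest city of France.",
    },
]

with open("eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["input", "output", "expected_output", "context"]
    )
    writer.writeheader()
    writer.writerows(rows)
```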

Step-by-Step Evaluation Process

1. Download the Plugin

Navigate to the Plugins section and install the DeepEval Evaluations (LLM-as-Judge) plugin.

Download Plugin

2. Create Evaluation Task

Configure your evaluation task with these settings:

a) Basic Configuration

  • Provide a name for your evaluation task
  • Select the desired evaluation metrics from the Tasks tab

b) Plugin Configuration

  • Choose a judge model (select 'local' to use Transformer Lab's local model)
  • Set the sample fraction for evaluation
  • For GEval tasks:
    • Specify the Criteria Name and Description
    • Choose between context-dependent or context-independent evaluation
  • Select your evaluation dataset

Create Task

3. Run the Evaluation

Click the Queue button to start the evaluation process. Monitor progress through the View Output option.

Run Evaluation

4. Review Results

After completion, you can:

  • View the evaluation scores directly in the interface
  • Access the Detailed Report for in-depth analysis
  • Download the complete evaluation report

View Results

Note: You can also create custom evaluation metrics with GEval by providing a criteria name and description. The judge LLM uses that description to evaluate each output and returns both a score and the reasoning behind it; a minimal sketch is shown below.

View Results
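For reference, a GEval metric in DeepEval is essentially a name plus a free-form criteria description that the judge LLM scores against. The sketch below is illustrative only; the criteria text, evaluation parameters, and judge model name are assumptions.

```python
# Minimal sketch of a custom GEval metric in DeepEval.
# The criteria wording and judge model name are placeholders.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually consistent "
        "with the expected output."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    model="gpt-4o-mini",
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)  # score plus the judge's reasoning
```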