LLM-as-Judge Evaluations
Transformer Lab provides a simple way to integrate the LLM-as-Judge suite of metrics from DeepEval to evaluate model outputs across multiple dimensions. Here's an overview of the available metrics, followed by a short sketch of how one of them works under the hood:
- Bias: Measures the presence of unfair prejudices or discriminatory content in model outputs
- Toxicity: Evaluates the presence of harmful, offensive, or inappropriate content
- Faithfulness: Assesses how well the output aligns with and stays true to the provided context
- Hallucination: Checks for fabricated or incorrect information not supported by the context
- Answer Relevancy: Measures how well the output addresses the input query
- Contextual Precision: Evaluates the accuracy of information used from the provided context
- Contextual Recall: Assesses how comprehensively the output covers relevant information from the context
- Contextual Relevancy: Measures how well the output relates to and uses the given context
- Custom (GEval): Allows creation of custom evaluation criteria by providing specific evaluation guidelines
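Under the hood, these metrics come from DeepEval's Python library; the plugin wires them up for you. As a point of reference, here is a minimal sketch of scoring a single response with one of them directly (the example strings and threshold are illustrative assumptions):

```python
# Minimal DeepEval sketch: score one response for answer relevancy.
# The prompt, response, and threshold below are made-up examples.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",           # the prompt given to the LLM
    actual_output="The capital of France is Paris.",  # the LLM's actual response
)

metric = AnswerRelevancyMetric(threshold=0.7)  # judge defaults to a hosted model unless you pass model=...
metric.measure(test_case)
print(metric.score, metric.reason)  # numeric score plus the judge's reasoning
```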
Dataset Requirements
To perform these evaluations, you'll need to upload a dataset with the following required columns (a minimal example file is sketched after the list):
- `input`: The prompt/query given to the LLM
- `output`: The actual response generated by the LLM
- `expected_output`: The ideal or reference response
- `context`: Supporting context (optional for plugins that don't require it)
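For example, a minimal dataset file with these columns could be put together like this (the file name and rows are purely illustrative):

```python
# Build a tiny example dataset with the four required columns and save it as CSV.
# The rows and the file name "eval_dataset.csv" are made up for illustration.
import pandas as pd

rows = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "expected_output": "Paris",
        "context": "France is a country in Western Europe. Its capital is Paris.",
    },
]

pd.DataFrame(rows).to_csv("eval_dataset.csv", index=False)
```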
Step-by-Step Evaluation Process
1. Download the Plugin
Navigate to the Plugins section and install the DeepEval Evaluations (LLM-as-Judge) plugin.

2. Create Evaluation Task
Configure your evaluation task with these settings:
a) Basic Configuration
- Provide a name for your evaluation task
- Select the desired evaluation metrics from the Tasks tab
b) Plugin Configuration
- Choose a judge model (select 'local' to use Transformer Lab's local model)
- Set the sample fraction for evaluation
- For GEval tasks:
  - Specify the Criteria Name and Description
  - Choose between context-dependent or context-independent evaluation
- Select your evaluation dataset
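For context, the judge model you choose here plays the role of the model that a DeepEval metric uses for scoring. A hedged sketch (the threshold and model name are assumptions, not plugin defaults):

```python
# Hedged sketch: every DeepEval LLM-as-judge metric accepts a judge model.
# Passing a string selects a hosted model by name; the plugin's 'local' option
# presumably substitutes Transformer Lab's local model via DeepEval's
# custom-model hook instead.
from deepeval.metrics import BiasMetric

metric = BiasMetric(
    threshold=0.5,        # assumed example threshold
    model="gpt-4o-mini",  # illustrative judge model name
)
```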

3. Run the Evaluation
Click the Queue button to start the evaluation process. Monitor progress through the View Output option.
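Conceptually, the queued job walks through your dataset, builds one test case per row, and scores it with the metrics you selected. A simplified sketch of that idea using DeepEval directly (not the plugin's actual code; the file name and sample fraction are assumptions):

```python
# Illustrative only: score a sampled fraction of the dataset with two metrics.
# This approximates what the queued evaluation does, not the plugin's real code.
import pandas as pd
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, ToxicityMetric
from deepeval.test_case import LLMTestCase

df = pd.read_csv("eval_dataset.csv").sample(frac=0.5)  # sample fraction from the task config

test_cases = [
    LLMTestCase(
        input=row["input"],
        actual_output=row["output"],
        expected_output=row["expected_output"],
        retrieval_context=[row["context"]],  # context-dependent metrics read from here
    )
    for _, row in df.iterrows()
]

evaluate(test_cases=test_cases, metrics=[FaithfulnessMetric(), ToxicityMetric()])
```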

4. Review Results
After completion, you can:
- View the evaluation scores directly in the interface
- Access the Detailed Report for in-depth analysis
- Download the complete evaluation report

Note: You can also create custom evaluation metrics using GEval by providing a description of your evaluation criteria. The judge LLM uses that description to formulate its judgment and returns both a score and a reason for each output.
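For reference, the Criteria Name and Description you enter map onto DeepEval's GEval metric roughly as sketched below; the criteria text is an invented example and the plugin's exact wiring may differ:

```python
# GEval sketch: the name and criteria come from your task configuration;
# the values used here are illustrative examples only.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",  # your Criteria Name
    criteria="Determine whether the actual output is factually consistent with the expected output.",  # your Criteria Description
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)  # score plus the judge's reason
```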
