Building and Evaluating a RAG Pipeline in Transformer Lab
Retrieval-Augmented Generation (RAG) combines the power of retrieval systems with generative AI to create more accurate, factual, and contextually relevant responses. In this hands-on tutorial, we'll walk through building and evaluating a complete RAG pipeline in Transformer Lab using documentation files as our knowledge base.
What We'll Build
In this tutorial, we will:
- Use three .md documents from the Transformer Lab documentation
- Generate a RAG Q&A dataset from these documents
- Fine-tune the BAAI/bge-base-en-v1.5 embedding model
- Compare RAG results between the pre-trained and fine-tuned embedding models
- Evaluate performance using contextual precision and answer relevancy metrics
Let's get started!
Step 1: Upload Your Documents
First, we'll upload three markdown documentation files from the Transformer Lab project.
- Navigate to the Documents tab in Transformer Lab
- Create a new folder called rag
- Upload the following three .md files from our documentation:
  - docs.md
  - scratch.md
  - raw_text.md

These files contain detailed information about the three synthesizer plugins (Generate from Documents, Raw Text and Scratch) in Transformer Lab.
Step 2: Generate a RAG Q&A Dataset
Next, we'll create a dataset of questions and answers based on our documentation files.
- Navigate to the Generate tab
- Select the Generate Dataset with QA Pairs for RAG Evaluation plugin
- Configure the plugin task:
  - Documents: rag
  - Number of QA pairs: 20
  - Generation Model: GPT-4o-mini
- Generate the Q&A pairs automatically

The generated dataset will contain questions that span different aspects of the documentation.
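Conceptually, this generation step boils down to prompting the generation model with each document chunk and asking for question-answer pairs grounded in that chunk. The sketch below illustrates the idea with the OpenAI Python SDK and gpt-4o-mini; the prompt wording, output schema, and chunking are assumptions for illustration, not the plugin's actual implementation.

```python
# Illustrative sketch only -- the plugin's real prompts and output format may differ.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_qa_pairs(chunk: str, n_pairs: int = 2) -> list[dict]:
    """Ask the generation model for Q&A pairs grounded in one document chunk."""
    prompt = (
        f"Write {n_pairs} question-answer pairs that can be answered solely from the "
        "text below. Reply with a JSON array of objects with keys "
        '"question", "answer", and "context" (the supporting passage).\n\n'
        f"Text:\n{chunk}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # The model is asked for plain JSON; production code would validate or repair the output.
    return json.loads(response.choices[0].message.content)


# Run over chunks taken from docs.md, scratch.md and raw_text.md (placeholders here).
chunks = ["...chunk from docs.md...", "...chunk from scratch.md..."]
dataset = [pair for chunk in chunks for pair in generate_qa_pairs(chunk)]
```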
Step 3: Fine-tune an Embedding Model
Now we'll fine-tune the BAAI/bge-base-en-v1.5 embedding model on our documentation.
- Go to the Train tab
- Select the Embedding Model Trainer plugin
- Configure the fine-tuning parameters:
  - Dataset: Your generated RAG Q&A dataset
  - Dataset Type: single sentences
  - Loss Function: DenoisingAutoEncoderLoss
  - Text Column Name: context
- Start the fine-tuning process

We're fine-tuning on our specific documentation domain to improve retrieval performance on Transformer Lab-related queries.
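For intuition, DenoisingAutoEncoderLoss on single sentences corresponds to the TSDAE recipe from the sentence-transformers library: each context sentence is corrupted (words are deleted) and the encoder is trained to reconstruct the original, adapting the embeddings to the domain without labeled pairs. Below is a minimal standalone sketch of that recipe; the exported file name, batch size, and epoch count are placeholder assumptions, not what the plugin uses internally.

```python
# Minimal TSDAE-style sketch with sentence-transformers; not the plugin's internal code.
import json
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Single sentences from the "context" column of the generated dataset
# (hypothetical export path; adjust to wherever your dataset lives).
rows = json.load(open("rag_qna_dataset.json"))
train_sentences = [row["context"] for row in rows]

# The dataset wrapper produces (noisy sentence, original sentence) pairs;
# the loss trains the encoder to reconstruct the original from the noisy input.
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="BAAI/bge-base-en-v1.5", tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    show_progress_bar=True,
)
model.save("bge-base-en-v1.5-finetuned")
```

With only 20 context sentences this run finishes quickly, which also hints at why the fine-tuned model can end up worse than the base model (see the results analysis below).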
Step 4: Select Your Embedding Model
After fine-tuning, we'll test both the original and fine-tuned models:
- Navigate to the Foundation tab
- First, select the original "BAAI/bge-base-en-v1.5" model from the dropdown
- We'll run tests with this, then switch to our fine-tuned model later
By selecting the embedding model in the Foundation tab, we tell the system which embeddings to use for our RAG pipeline.
Step 5: Configure the Model Server
Let's run the model server with our selected embedding model:
- Ensure the original "BAAI/bge-base-en-v1.5" model is selected in the Foundation tab
- Run the model server that the RAG pipeline will use
- Wait for confirmation that the server is running successfully
The model server needs to be running for the RAG pipeline to generate embeddings for our documents.
Step 6: Generate Answers Using RAG with the Pre-trained Model
Now we'll test our RAG pipeline using the original pre-trained embedding model:
- Go to the Plugins section
- Select the "RAG Batched Outputs Generator" plugin
- Select the dataset generated in Step 2 and run the task (the plugin automatically uses the BAAI/bge-base-en-v1.5 model selected in the Foundation tab)
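
For each question, the generator conceptually does three things: embed the question, retrieve the most similar document chunks, and prepend them to the prompt the LLM answers from. A minimal sketch of that retrieval step with sentence-transformers (not the plugin's actual code, and with placeholder chunks) looks like this:

```python
# Retrieval sketch only; the plugin handles chunking, prompting, and generation for you.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Chunks taken from the three uploaded markdown files (placeholders here).
chunks = [
    "...chunk from docs.md...",
    "...chunk from scratch.md...",
    "...chunk from raw_text.md...",
]
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True, normalize_embeddings=True)


def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, chunk_embeddings, top_k=top_k)[0]
    return [chunks[hit["corpus_id"]] for hit in hits]


# The retrieved chunks become the context block of the generation prompt.
context = "\n\n".join(retrieve("How does the Raw Text plugin chunk documents?"))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: ..."
```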

Step 7: Switch to the Fine-tuned Model and Compare Results
Now let's repeat the process with our fine-tuned embedding model:
- Return to the Foundation tab
- Select your fine-tuned version of "BAAI/bge-base-en-v1.5"
- Restart the model server with this new model
- Run the same queries through the "RAG Batched Outputs Generator" plugin
- Compare the results from both models
This comparison will help us understand how fine-tuning improves or degrades retrieval quality for our specific documentation domain.
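If you want a quick sanity check outside Transformer Lab before running the full evaluation, one simple proxy is retrieval hit rate: for each generated question, does the chunk it was written from appear among the top-k retrieved chunks? A hedged sketch, assuming the question/context column names from the generated dataset and the output path saved in the Step 3 sketch above:

```python
# Optional sanity check; not part of the Transformer Lab workflow itself.
from sentence_transformers import SentenceTransformer, util

pairs = [{"question": "...", "context": "..."}]  # rows from the generated Q&A dataset
corpus = ["...chunk 1...", "...chunk 2..."]      # the document chunks used for retrieval


def hit_rate(model_path: str, top_k: int = 3) -> float:
    """Fraction of questions whose source context is retrieved within the top_k chunks."""
    model = SentenceTransformer(model_path)
    corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
    hits = 0
    for pair in pairs:
        q_emb = model.encode(pair["question"], convert_to_tensor=True, normalize_embeddings=True)
        results = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
        if any(corpus[r["corpus_id"]] == pair["context"] for r in results):
            hits += 1
    return hits / len(pairs)


print("base:      ", hit_rate("BAAI/bge-base-en-v1.5"))
print("fine-tuned:", hit_rate("bge-base-en-v1.5-finetuned"))  # path from the Step 3 sketch
```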

Step 8: Evaluate Performance
Finally, let's quantitatively evaluate both models:
- Go to the Plugins section
- Select the "DeepEval Evaluations (LLM-as-Judge)" plugin
- Create a task for each RAG output generation run:
  - Task 1: Pre-trained model results
  - Task 2: Fine-tuned model results
- Configure the evaluation:
  - Metrics: "Contextual Precision" and "Answer Relevancy"
  - Dataset: Results from the RAG outputs
- Run both evaluation tasks and compare their results

The evaluation results will show us how fine-tuning affects:
- Contextual Precision: How accurately the retrieved content matches the query context
- Answer Relevancy: How relevant the generated answers are to the original questions
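
Both metrics come from the DeepEval library, which uses an LLM as a judge to score each test case. The plugin wires this up for you, but a standalone sketch of the same two metrics looks roughly like this (the example strings are placeholders, and DeepEval needs an LLM API key, e.g. an OpenAI key, for the judge):

```python
# Standalone DeepEval sketch; the plugin builds these test cases from the RAG outputs for you.
from deepeval.metrics import AnswerRelevancyMetric, ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How does the Raw Text plugin split documents?",          # question from the dataset
    actual_output="It splits the raw text into chunks before ...",  # answer produced by the RAG run
    expected_output="The plugin chunks the raw text and ...",       # reference answer from the dataset
    retrieval_context=["...retrieved chunk 1...", "...retrieved chunk 2..."],
)

for metric in (AnswerRelevancyMetric(threshold=0.7), ContextualPrecisionMetric(threshold=0.7)):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score, metric.reason)
```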
Results Analysis
The specific results will vary with your fine-tuning parameters, the size of the dataset, and the documentation content. In our run the fine-tuned model actually scores lower: fine-tuning on only 20 Q&A pairs is far too little data and degraded the embedding model rather than improving it.
Conclusion
In this tutorial, we've built a complete RAG pipeline using Transformer Lab documentation, fine-tuned an embedding model, and quantitatively compared performance between pre-trained and fine-tuned models.
This approach demonstrates how domain-specific fine-tuning can affect RAG performance for specialized knowledge bases. By following these steps, you can create and evaluate your own custom RAG solutions for any domain-specific use case.