# Creating a Generation Plugin Script
This guide explains how to adapt your existing dataset generation scripts to work with Transformer Lab using the `tlab_gen` decorator class. By integrating with Transformer Lab, your generation scripts gain progress tracking, parameter management, dataset creation, and automatic upload capabilities with minimal code changes.
## What is `tlab_gen`?

`tlab_gen` is a decorator class that helps integrate your generation script with Transformer Lab's job management system. It provides:
- Argument parsing and configuration management
- Model loading for different providers (local, OpenAI, Claude, etc.)
- Dataset generation and storage
- Automatic dataset upload to Transformer Lab
- Progress tracking and reporting
- Job status management
## Getting Started
### 1. Import the decorator
Add this import to your generation script:

```python
from transformerlab.sdk.v1.generate import tlab_gen
```
### 2. Decorate your main generation function

Wrap your main generation function with the `job_wrapper` decorator:
```python
@tlab_gen.job_wrapper(
    wandb_project_name="my_gen_project",  # Optional: Set custom Weights & Biases project name
    manual_logging=False  # Optional: Set to True for manual metric logging
)
def generate_dataset():
    # Your generation code here
    pass
```
The decorator parameters include:

- `progress_start` and `progress_end`: Define the progress range (typically 0-100)
- `wandb_project_name`: Optional custom name for your Weights & Biases project
- `manual_logging`: Set to `True` for generation scripts without automatic logging integration
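For example, to map a script's progress onto the job's 0-100 range, pass the progress parameters named above (a minimal sketch):

```python
@tlab_gen.job_wrapper(progress_start=0, progress_end=100)
def generate_dataset():
    # Your generation code here
    pass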
Note: There is also an async version of the job wrapper for functions that need to run asynchronously. Use it by changing `@tlab_gen.job_wrapper` to `@tlab_gen.async_job_wrapper`.
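For example, an asynchronous generation function would look like this (a sketch):

```python
@tlab_gen.async_job_wrapper()
async def generate_dataset():
    # await your asynchronous generation calls here
    ...
```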
### 3. Use helper methods

Replace parts of your code with `tlab_gen` helper methods:

- For generation model loading: `tlab_gen.load_evaluation_model()`
- For saving datasets: `tlab_gen.save_generated_dataset(df)`
- For progress tracking: `tlab_gen.progress_update(progress)`
- For generating output file paths: `tlab_gen.get_output_file_path()`
- For generating expected outputs: `tlab_gen.generate_expected_outputs(inputs)`
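For instance, a script that drives the model directly might combine two of these helpers (a sketch; what `load_evaluation_model()` returns depends on the configured provider):

```python
# Load whichever generation model the job is configured with (local, OpenAI, Claude, ...)
model = tlab_gen.load_evaluation_model()

# Ask Transformer Lab where generated files should be written
output_path = tlab_gen.get_output_file_path()
```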
## Complete Example

Here's how a typical dataset generation script can be adapted to use `tlab_gen`:
```python
import pandas as pd

from transformerlab.sdk.v1.generate import tlab_gen


@tlab_gen.job_wrapper()
def generate_dataset():
    # 1. Initialize data list
    data = []

    # 2. Generate inputs
    input_prompts = [
        "Explain the concept of recursion in programming.",
        "What is the difference between machine learning and deep learning?",
        "How does a transformer neural network work?"
    ]

    # 3. Generate expected outputs using a model
    expected_outputs = tlab_gen.generate_expected_outputs(
        input_prompts,
        task="Create educational content about programming concepts",
        scenario="You are a programming tutor creating explanations",
        output_format="Clear, concise explanation with examples"
    )

    # 4. Create dataset entries
    for i, (prompt, response) in enumerate(zip(input_prompts, expected_outputs)):
        data.append({
            "id": i,
            "prompt": prompt,
            "response": response,
            "category": "programming_education"
        })

        # Update progress
        progress = int((i + 1) / len(input_prompts) * 100)
        tlab_gen.progress_update(progress)

    # 5. Convert to DataFrame
    df = pd.DataFrame(data)

    # 6. Save and upload the generated dataset
    output_file, dataset_name = tlab_gen.save_generated_dataset(
        df,
        additional_metadata={"purpose": "educational content", "domain": "programming"}
    )

    print(f"Dataset generated and saved as '{dataset_name}'")
    return True


# Call the function
generate_dataset()
```
## Key Features

### Saving Generated Datasets

`tlab_gen` provides an easy way to save datasets and automatically upload them to Transformer Lab:
```python
output_file, dataset_name = tlab_gen.save_generated_dataset(
    df,  # DataFrame containing the generated data
    additional_metadata={"domain": "finance", "quality": "high"},  # Optional metadata
    dataset_id="custom_dataset_id"  # Optional custom dataset ID
)
```
The method:
- Saves the DataFrame to a JSON file
- Creates and saves metadata about the generation
- Uploads the dataset to Transformer Lab
- Returns the file path and dataset name
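The returned values can be used for logging or downstream steps, for example:

```python
output_file, dataset_name = tlab_gen.save_generated_dataset(df)
print(f"Saved {len(df)} rows to {output_file} and uploaded as '{dataset_name}'")
```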
### Generating Expected Outputs

The `generate_expected_outputs` method helps create output responses for given inputs using a local model running in Transformer Lab:
```python
expected_outputs = tlab_gen.generate_expected_outputs(
    input_values=["What is Python?", "Explain variables."],
    task="Create educational content",
    scenario="You are a programming tutor",
    input_format="Questions about programming concepts",
    output_format="Clear, concise explanations with examples"
)
```
This automatically:
- Formats appropriate prompts based on the task and scenario
- Uses the configured model to generate responses
- Updates progress during generation
- Returns a list of generated outputs
### Parameter Access

Parameters are automatically loaded from the Transformer Lab configuration. You can access them in several ways:

- Direct access: `tlab_gen.params.<parameter_name>`
- Safe access with a default: `tlab_gen.params.get(<parameter_name>, <default_value>)`
Common parameters include:

- `tlab_gen.params.model_name`: Model to use for generation
- `tlab_gen.params.dataset_name`: Dataset to use
- `tlab_gen.params.experiment_name`: Name of the experiment
- `tlab_gen.params.run_name`: Name for the run
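For example (assuming the parameters listed above are set by the job; `temperature` is a hypothetical optional parameter):

```python
# Required parameters supplied by the Transformer Lab job configuration
model_name = tlab_gen.params.model_name
experiment = tlab_gen.params.experiment_name

# Optional parameter with a safe default ("temperature" is hypothetical)
temperature = float(tlab_gen.params.get("temperature", 0.7))
```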
### Progress Reporting

Keep users informed about generation progress:

```python
# Update progress (0-100)
tlab_gen.progress_update(75)  # 75% complete
```

The progress update also checks whether the job has been asked to stop and raises a `KeyboardInterrupt` if so.
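Because `progress_update` can raise `KeyboardInterrupt` when a stop is requested, long-running loops can use `try`/`finally` to clean up on interruption (a sketch; `generate_one` and `cleanup` are hypothetical):

```python
results = []
try:
    for i, prompt in enumerate(prompts):  # prompts: your list of inputs
        results.append(generate_one(prompt))  # hypothetical per-prompt generation
        tlab_gen.progress_update(int((i + 1) / len(prompts) * 100))
finally:
    cleanup()  # hypothetical cleanup; runs even if the job is stopped mid-run
```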
## Best Practices

- **Error Handling**: While the decorator handles basic error reporting, include try/except blocks for specific operations (see the sketch after this list)
- **Parameter Access**: Always use `.get()` with sensible defaults for optional parameters
- **Dataset Structure**: Design your DataFrame with clear, consistent fields for better compatibility
- **Progress Updates**: Provide regular progress updates, especially for long-running generations
- **Metadata**: Include helpful metadata about the generation process and dataset characteristics
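For example, isolating a fragile step so one failure does not abort the whole job (a sketch, reusing `input_prompts` from the complete example):

```python
try:
    expected_outputs = tlab_gen.generate_expected_outputs(
        input_prompts,
        task="Create educational content about programming concepts"
    )
except Exception as exc:
    print(f"Output generation failed: {exc}")
    expected_outputs = [""] * len(input_prompts)  # fall back to empty responses
```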
## Summary

By following this guide, you can quickly adapt your existing dataset generation scripts to work within the Transformer Lab ecosystem, gaining parameter management, progress tracking, dataset upload capabilities, and integrated logging with minimal code changes.