# Creating a Generation Plugin Script
This guide explains how to adapt your existing dataset generation scripts to work with Transformer Lab using the `tlab_gen` decorator class. By integrating with Transformer Lab, your generation scripts gain progress tracking, parameter management, dataset creation, and automatic upload capabilities with minimal code changes.
## What is `tlab_gen`?

`tlab_gen` is a decorator class that helps integrate your generation script with Transformer Lab's job management system. It provides:
- Argument parsing and configuration management
- Model loading for different providers (local, OpenAI, Claude, etc.)
- Dataset generation and storage
- Automatic dataset upload to Transformer Lab
- Progress tracking and reporting
- Job status management
## Getting Started
### 1. Import the decorator
Add this import to your generation script:

```python
from transformerlab.sdk.v1.generate import tlab_gen
```
### 2. Decorate your main generation function

Wrap your main generation function with the `job_wrapper` decorator:
```python
@tlab_gen.job_wrapper(
    wandb_project_name="my_gen_project",  # Optional: Set custom Weights & Biases project name
    manual_logging=False  # Optional: Set to True for manual metric logging
)
def generate_dataset():
    # Your generation code here
    pass
```
The decorator parameters include:

- `progress_start` and `progress_end`: Define the progress range (typically 0-100)
- `wandb_project_name`: Optional custom name for your Weights & Biases project
- `manual_logging`: Set to `True` for generation scripts without automatic logging integration
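For example, to map a script's progress onto the job's 0-100 range, pass the progress parameters named above (a minimal sketch):

```python
@tlab_gen.job_wrapper(progress_start=0, progress_end=100)
def generate_dataset():
    # Your generation code here
    pass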
Note: There is also an async version of the job wrapper for functions that need to run asynchronously. Use it by changing `@tlab_gen.job_wrapper` to `@tlab_gen.async_job_wrapper`.
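For example, an asynchronous generation function would look like this (a sketch):

```python
@tlab_gen.async_job_wrapper()
async def generate_dataset():
    # await your asynchronous generation calls here
    ...
```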
### 3. Use helper methods

Replace parts of your code with `tlab_gen` helper methods:

- For generation model loading: `tlab_gen.load_evaluation_model()`
- For saving datasets: `tlab_gen.save_generated_dataset(df)`
- For progress tracking: `tlab_gen.progress_update(progress)`
- For generating output file paths: `tlab_gen.get_output_file_path()`
- For generating expected outputs: `tlab_gen.generate_expected_outputs(inputs)`
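For instance, a script that drives the model directly might combine two of these helpers (a sketch; what `load_evaluation_model()` returns depends on the configured provider):

```python
# Load whichever generation model the job is configured with (local, OpenAI, Claude, ...)
model = tlab_gen.load_evaluation_model()

# Ask Transformer Lab where generated files should be written
output_path = tlab_gen.get_output_file_path()
```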
## Complete Example

Here's how a typical dataset generation script can be adapted to use `tlab_gen`:
```python
import pandas as pd

from transformerlab.sdk.v1.generate import tlab_gen


@tlab_gen.job_wrapper()
def generate_dataset():
    # 1. Initialize data list
    data = []

    # 2. Generate inputs
    input_prompts = [
        "Explain the concept of recursion in programming.",
        "What is the difference between machine learning and deep learning?",
        "How does a transformer neural network work?"
    ]

    # 3. Generate expected outputs using a model
    expected_outputs = tlab_gen.generate_expected_outputs(
        input_prompts,
        task="Create educational content about programming concepts",
        scenario="You are a programming tutor creating explanations",
        output_format="Clear, concise explanation with examples"
    )

    # 4. Create dataset entries
    for i, (prompt, response) in enumerate(zip(input_prompts, expected_outputs)):
        data.append({
            "id": i,
            "prompt": prompt,
            "response": response,
            "category": "programming_education"
        })

        # Update progress
        progress = int((i + 1) / len(input_prompts) * 100)
        tlab_gen.progress_update(progress)

    # 5. Convert to DataFrame
    df = pd.DataFrame(data)

    # 6. Save and upload the generated dataset
    output_file, dataset_name = tlab_gen.save_generated_dataset(
        df,
        additional_metadata={"purpose": "educational content", "domain": "programming"}
    )

    print(f"Dataset generated and saved as '{dataset_name}'")
    return True


# Call the function
generate_dataset()
```
## Key Features

### Saving Generated Datasets

`tlab_gen` provides an easy way to save datasets and automatically upload them to Transformer Lab:
```python
output_file, dataset_name = tlab_gen.save_generated_dataset(
    df,  # DataFrame containing the generated data
    additional_metadata={"domain": "finance", "quality": "high"},  # Optional metadata
    dataset_id="custom_dataset_id"  # Optional custom dataset ID
)
```
The method:
- Saves the DataFrame to a JSON file
- Creates and saves metadata about the generation
- Uploads the dataset to Transformer Lab
- Returns the file path and dataset name
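The returned values can be used for logging or downstream steps, for example:

```python
output_file, dataset_name = tlab_gen.save_generated_dataset(df)
print(f"Saved {len(df)} rows to {output_file} and uploaded as '{dataset_name}'")
```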
### Generating Expected Outputs

The `generate_expected_outputs` method helps create output responses for given inputs using a local model running in Transformer Lab:
```python
expected_outputs = tlab_gen.generate_expected_outputs(
    input_values=["What is Python?", "Explain variables."],
    task="Create educational content",
    scenario="You are a programming tutor",
    input_format="Questions about programming concepts",
    output_format="Clear, concise explanations with examples"
)
```
This automatically:
- Formats appropriate prompts based on the task and scenario
- Uses the configured model to generate responses
- Updates progress during generation
- Returns a list of generated outputs
### Parameter Access

Parameters are automatically loaded from the Transformer Lab configuration. You can access them in several ways:

- Direct access: `tlab_gen.params.<parameter_name>`
- Safe access with a default: `tlab_gen.params.get(<parameter_name>, <default_value>)`
Common parameters include:

- `tlab_gen.params.model_name`: Model to use for generation
- `tlab_gen.params.dataset_name`: Dataset to use
- `tlab_gen.params.experiment_name`: Name of the experiment
- `tlab_gen.params.run_name`: Name for the run
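For example (assuming the parameters listed above are set by the job; `temperature` is a hypothetical optional parameter):

```python
# Required parameters supplied by the Transformer Lab job configuration
model_name = tlab_gen.params.model_name
experiment = tlab_gen.params.experiment_name

# Optional parameter with a safe default ("temperature" is hypothetical)
temperature = float(tlab_gen.params.get("temperature", 0.7))
```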
### Progress Reporting

Keep users informed about generation progress:

```python
# Update progress (0-100)
tlab_gen.progress_update(75)  # 75% complete
```

The progress update also checks whether the job has been asked to stop and raises a `KeyboardInterrupt` if so.
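Because `progress_update` can raise `KeyboardInterrupt` when a stop is requested, long-running loops can use `try`/`finally` to clean up on interruption (a sketch; `generate_one` and `cleanup` are hypothetical):

```python
results = []
try:
    for i, prompt in enumerate(prompts):  # prompts: your list of inputs
        results.append(generate_one(prompt))  # hypothetical per-prompt generation
        tlab_gen.progress_update(int((i + 1) / len(prompts) * 100))
finally:
    cleanup()  # hypothetical cleanup; runs even if the job is stopped mid-run
```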
## Best Practices

- **Error Handling**: While the decorator handles basic error reporting, include try/except blocks for specific operations (see the sketch after this list)
- **Parameter Access**: Always use `.get()` with sensible defaults for optional parameters
- **Dataset Structure**: Design your DataFrame with clear, consistent fields for better compatibility
- **Progress Updates**: Provide regular progress updates, especially for long-running generations
- **Metadata**: Include helpful metadata about the generation process and dataset characteristics
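For example, isolating a fragile step so one failure does not abort the whole job (a sketch, reusing `input_prompts` from the complete example):

```python
try:
    expected_outputs = tlab_gen.generate_expected_outputs(
        input_prompts,
        task="Create educational content about programming concepts"
    )
except Exception as exc:
    print(f"Output generation failed: {exc}")
    expected_outputs = [""] * len(input_prompts)  # fall back to empty responses
```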
## Summary

By following this guide, you can quickly adapt your existing dataset generation scripts to work within the Transformer Lab ecosystem, gaining parameter management, progress tracking, dataset upload capabilities, and integrated logging with minimal code changes.