Experiments
Gluepy supports autonomous, long-running experimentation where an AI coding agent iteratively improves a DAG’s work tasks while evaluation tasks remain frozen. This ensures all improvements are genuine and metrics cannot be gamed.
Overview
The experiment system adds three concepts to Gluepy:
- EvaluationTask – a Task subclass for evaluation steps that must remain untouched during experiments.
- eval_tasks – a new attribute on DAG that separates evaluation from work.
- Structured metrics output – a machine-readable block printed after evaluation, enabling agents to parse results.
An AI agent (Claude Code, Cursor, etc.) drives the experiment loop: it modifies work task code, runs the DAG, reads the metrics, and decides whether to keep or revert changes.
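The keep-or-revert decision at the end of each loop iteration can be sketched as a pure comparison against the best metrics seen so far. This is an illustrative sketch, not Gluepy API: the function name `should_keep` and the `lower_is_better` flag are assumptions standing in for the direction the agent infers from the evaluation task's docstring.

```python
def should_keep(baseline: dict, candidate: dict, metric: str,
                lower_is_better: bool = True) -> bool:
    """Decide whether a code change improved the target metric.

    `lower_is_better` reflects the direction documented in the
    EvaluationTask docstring (e.g. MAPE: lower is better).
    """
    if metric not in candidate:
        return False  # evaluation produced no value: revert the change
    if metric not in baseline:
        return True   # first successful measurement becomes the baseline
    if lower_is_better:
        return candidate[metric] < baseline[metric]
    return candidate[metric] > baseline[metric]
```
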
Adding eval_tasks to a DAG
Add an eval_tasks list to your DAG alongside the existing tasks:
```python
from gluepy.exec import DAG, EvaluationTask
from gluepy.ops import default_mlops

from .tasks import PrepareDataTask, TrainModelTask, PredictTask


class ComputeAccuracyTask(EvaluationTask):
    """Computes forecast accuracy metrics.

    Metrics:
        forecast_mape (float): Mean Absolute Percentage Error.
            Primary measure of forecast accuracy. Lower is better.
        forecast_bias (float): Systematic directional error.
            Closer to 0.0 is better.
    """

    label = "compute_accuracy"

    def run(self):
        # ... compute metrics ...
        default_mlops.log_metric("forecast_mape", mape_score)
        default_mlops.log_metric("forecast_bias", mean_bias)


class ForecastDAG(DAG):
    label = "forecast"
    tasks = [PrepareDataTask, TrainModelTask, PredictTask]
    eval_tasks = [ComputeAccuracyTask]
```
If eval_tasks is omitted, the DAG behaves exactly as before – fully backward compatible.
Writing EvaluationTask Subclasses
EvaluationTask is a subclass of Task. The key contract is:

- Call default_mlops.log_metric(key, value) for every metric computed.
- Document each metric and its significance in the class docstring.

The agent reads docstrings to understand what “improvement” means – no explicit primary_metric tagging is needed.
Write descriptive docstrings that explain:

- What each metric measures
- Which direction is “better” (lower/higher/closer to a target)
- How metrics relate to each other
- Any hard constraints (e.g., “must be 1.0”)
```python
class ValidateOutputTask(EvaluationTask):
    """Validates ETL output correctness.

    Metrics:
        row_count_match (float): 1.0 if the output row count matches
            the expected count, 0.0 otherwise. Must be 1.0.
        processing_time_seconds (float): Wall-clock time for the
            pipeline. Lower is better.
    """

    label = "validate_output"

    def run(self):
        # ... validation logic ...
        default_mlops.log_metric("row_count_match", 1.0 if valid else 0.0)
        default_mlops.log_metric("processing_time_seconds", elapsed)
```
CLI Flags
The dag command supports several flags for evaluation control:
```shell
# Run work tasks + evaluation tasks (default)
python manage.py dag forecast

# Skip evaluation tasks
python manage.py dag forecast --skip-eval

# Run only evaluation tasks on an existing run
python manage.py dag forecast --eval-only --retry <run_folder>

# Compare metrics across runs
python manage.py dag forecast --compare <run_folder_1> <run_folder_2>
```
--eval-only requires --retry because evaluation tasks need an existing run’s output to evaluate.
--skip-eval and --eval-only are mutually exclusive.
--compare reads metrics.json from each run folder and prints a TSV table for easy comparison.
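The comparison step can be approximated in a few lines of Python. This is a hedged sketch, not the built-in implementation: it assumes metrics.json is a flat `{metric_name: value}` mapping, which may differ from the actual file layout.

```python
import json
from pathlib import Path


def compare_runs(run_folders):
    """Read metrics.json from each run folder and build a TSV table.

    Assumes metrics.json is a flat {metric_name: value} mapping;
    columns are run folder names, rows are metric names.
    """
    metrics_by_run = {
        Path(folder).name: json.loads((Path(folder) / "metrics.json").read_text())
        for folder in run_folders
    }
    names = sorted({name for run in metrics_by_run.values() for name in run})
    lines = ["\t".join(["metric", *metrics_by_run])]
    for name in names:
        row = [name] + [str(run.get(name, "")) for run in metrics_by_run.values()]
        lines.append("\t".join(row))
    return "\n".join(lines)
```
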
Structured Metrics Output
After evaluation tasks run, the DAG runner prints a machine-readable block to stdout:
```
=== GLUEPY METRICS ===
metric:forecast_mape=12.5
metric:forecast_bias=-0.03
=== END METRICS ===
```
This uses print() (not logger.info()) so agents can reliably extract metrics with grep "^metric:" regardless of logging configuration.
Metrics are also persisted to metrics.json in the run folder for programmatic access and the --compare feature.
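For agents or scripts that capture stdout directly instead of grepping, the block is easy to parse. A minimal sketch (the function name `parse_metrics` is illustrative, and it assumes all metric values are floats, as in the example above):

```python
import re

METRIC_LINE = re.compile(r"^metric:(?P<key>[^=]+)=(?P<value>.+)$")


def parse_metrics(stdout: str) -> dict:
    """Extract the structured metrics block from DAG stdout.

    Only lines between the GLUEPY METRICS markers are considered,
    mirroring what `grep "^metric:"` would capture.
    """
    metrics = {}
    inside = False
    for line in stdout.splitlines():
        if line.strip() == "=== GLUEPY METRICS ===":
            inside = True
        elif line.strip() == "=== END METRICS ===":
            inside = False
        elif inside:
            m = METRIC_LINE.match(line.strip())
            if m:
                metrics[m.group("key")] = float(m.group("value"))
    return metrics
```
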
Running Experiments with Claude Code
After installing the skill (python manage.py skill claude), use the /experiment command:
```
/experiment forecast "improve the MAPE score"
```
The agent will:
1. Verify git is available and the project is a git repository
2. Run a setup checklist with you (frozen paths, baseline metrics)
3. Create an experiment branch
4. Enter an autonomous loop: modify code, run the DAG, evaluate, keep or revert
For Cursor users, run python manage.py skill cursor and use the same /experiment command.
Customizing Frozen Paths
During setup, the agent agrees with you on which paths are frozen (cannot be modified). The defaults are:
- Evaluation task modules (files containing EvaluationTask subclasses)
- setup.py, pyproject.toml
- Dockerfile*, .dockerignore
- .github/workflows/
Everything else is fair game, including configs, dependencies, tests, and DAG definitions.
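One way an agent can enforce the agreed list is a simple glob check before touching a file. This sketch uses `fnmatch` against repo-relative paths; the `pipelines/evaluation/*.py` pattern is a hypothetical stand-in for wherever a given project keeps its EvaluationTask modules.

```python
from fnmatch import fnmatch

# Default frozen patterns from the setup checklist; the evaluation
# module location is project-specific and shown here as an assumption.
FROZEN_PATTERNS = [
    "pipelines/evaluation/*.py",  # hypothetical eval task modules
    "setup.py",
    "pyproject.toml",
    "Dockerfile*",
    ".dockerignore",
    ".github/workflows/*",
]


def is_frozen(path: str) -> bool:
    """Return True if the given repo-relative path may not be modified."""
    return any(fnmatch(path, pattern) for pattern in FROZEN_PATTERNS)
```
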
Interpreting experiment_log.tsv
The experiment log tracks all experiments as a TSV file:
| Column | Description |
|---|---|
| experiment_id | Sequential integer |
| timestamp | ISO 8601 timestamp |
| run_id | Gluepy run ID |
| metrics | All metrics as semicolon-separated key=value pairs |
| status | |
| description | What was changed |
| commit | Git commit hash |
| duration_seconds | Wall-clock time for the run |
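Because the metrics column packs all metrics into semicolon-separated key=value pairs, analyzing the log usually starts by expanding it. A minimal sketch (the function name is illustrative; it assumes metric values are floats and that the file uses a header row, per the columns above):

```python
import csv
import io


def parse_experiment_log(tsv_text: str):
    """Parse experiment_log.tsv rows, expanding the semicolon-separated
    metrics column into a dict per experiment."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        pairs = (p for p in row["metrics"].split(";") if p)
        row["metrics"] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
        rows.append(row)
    return rows
```
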
Tips for Writing DAGs That Experiment Well
Clear metric docstrings: The agent’s decision quality depends on understanding what metrics mean.
Fast execution: Shorter DAG runs mean more experiments per hour. Consider using data subsets during experimentation.
Deterministic evaluation: Non-deterministic evaluations add noise. Average over multiple samples if needed.
Modular work tasks: Smaller, focused tasks are easier for the agent to modify without breaking things.
Good defaults: Start with reasonable hyperparameters so the agent has a solid baseline to improve from.