Directed Acyclic Graphs (DAG)

As a framework that structures workflows and pipelines, Gluepy’s key components are the DAG and Task. A DAG is your directed-acyclic-graph, or “pipeline” and its made up of one or more Task instances that holds the logic of that step in your pipeline.

DAG

from gluepy.exec import DAG
from .tasks import (
    DataTask, ForecastTrainingTask, ForecastTestTask, OutputFormatTask
)


class TrainingDAG(DAG):
    label = "training"
    tasks = [
        DataTask, ForecastTrainingTask, ForecastTestTask, OutputFormatTask
    ]

DAG Registry

All DAGs that are part of a module registered in INSTALLED_MODULES will automatically be added to the DAG Registry located at gluepy.exec.dags.REGISTRY. This REGISTRY is the source of truth of all available DAGs that is available to run through the Gluepy Commands and the dag command.

Task

A Task is a single step in a DAG, and holds the logic and code related to that step. It has an entrypoint method run() that must be defined as part of a Task to not raise an error.

Tasks intentionally do not accept any keyword arguments in the run() method, this is to ensure that every Task in your DAG do not have in-memory dependencies from previous Task earlier in the DAG, and that each Task can be retried independently at failure without the need to rerun the full DAG.

from gluepy.exec import Task
from gluepy.files.data import data_manager


class ForecastTrainingTask(Task):
    label = "forecast-training"

    def run(self):
        """Entrypoint to our task"""
        df = data_manager.read("training.parquet")
        # Add additional logic to train model...

Task Registry

All Tasks that are part of a module registered in INSTALLED_MODULES will automatically be added to the Task Registry located at gluepy.exec.tasks.REGISTRY. This REGISTRY is the source of truth of all available Tasks that is available to run through the Gluepy Commands and the dag command.