Directed Acyclic Graphs (DAG)
As a framework for structuring workflows and pipelines, Gluepy is built around two key components: the DAG and the Task.
A DAG is your directed acyclic graph, or “pipeline”. It is made up of one or more Tasks, each of which holds the
logic for one step in your pipeline.
DAG
from gluepy.exec import DAG
from .tasks import (
    DataTask, ForecastTrainingTask, ForecastTestTask, OutputFormatTask
)


class TrainingDAG(DAG):
    label = "training"
    tasks = [
        DataTask, ForecastTrainingTask, ForecastTestTask, OutputFormatTask
    ]
DAG Registry
All DAGs that are part of a module registered in INSTALLED_MODULES are automatically added to the DAG Registry, located at
gluepy.exec.dags.REGISTRY. This REGISTRY is the source of truth for all DAGs that can be run through the Gluepy Commands and the
dag command.
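To illustrate the idea of a label-keyed registry, here is a minimal sketch in plain Python. It is not Gluepy's implementation: the DAG base class, the module-level REGISTRY dict, the __init_subclass__ hook, and the get_dag() helper are all assumptions made for illustration, and the module discovery through INSTALLED_MODULES is left out entirely.

```python
# Hypothetical stand-in for a label-keyed registry; not Gluepy's actual code.
REGISTRY = {}


class DAG:
    """Minimal illustrative base class (assumption, not gluepy.exec.DAG)."""

    label = None
    tasks = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register every subclass under its label as soon as it is defined.
        if cls.label is None:
            raise ValueError(f"{cls.__name__} must define a label")
        if cls.label in REGISTRY:
            raise ValueError(f"Duplicate DAG label: {cls.label!r}")
        REGISTRY[cls.label] = cls


class TrainingDAG(DAG):
    label = "training"
    tasks = []


def get_dag(label):
    """Look up a DAG class by label, as a command-line runner might."""
    try:
        return REGISTRY[label]
    except KeyError:
        raise LookupError(f"No DAG registered with label {label!r}") from None
```

In this sketch, defining the class is enough to make it discoverable: get_dag("training") returns TrainingDAG, and an unknown label raises LookupError.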
Task
A Task is a single step in a DAG, and holds the logic and code related to that step. Its entrypoint is the run() method, which
every Task must define; a Task without it raises an error.
The run() method intentionally accepts no keyword arguments. This ensures that no Task in your DAG has an
in-memory dependency on a Task earlier in the DAG, so that each Task can be retried independently on failure without the
need to rerun the full DAG.
from gluepy.exec import Task
from gluepy.files.data import data_manager


class ForecastTrainingTask(Task):
    label = "forecast-training"

    def run(self):
        """Entrypoint to our task"""
        df = data_manager.read("training.parquet")
        # Add additional logic to train model...
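The reason run() takes no arguments can be sketched without Gluepy at all. In the plain-Python example below, every name is hypothetical: the Task base class, the STORAGE directory, the JSON files, and the run_dag() helper stand in for Gluepy's Task, data manager, and runner, and do not reflect their actual behavior. Each task writes its output to storage and the next task reads it back, so a later step can be rerun on its own.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical storage location standing in for a data manager backend.
STORAGE = Path(tempfile.mkdtemp())


class Task:
    """Illustrative base class (assumption, not gluepy.exec.Task)."""

    label = None

    def run(self):
        raise NotImplementedError("Task subclasses must define run()")


class DataTask(Task):
    label = "data"

    def run(self):
        # Persist output instead of returning it to the next task.
        rows = [{"x": 1, "y": 2}, {"x": 2, "y": 4}]
        (STORAGE / "training.json").write_text(json.dumps(rows))


class ForecastTrainingTask(Task):
    label = "forecast-training"

    def run(self):
        # Read input from storage; nothing is passed in as an argument.
        rows = json.loads((STORAGE / "training.json").read_text())
        slope = sum(r["y"] for r in rows) / sum(r["x"] for r in rows)
        (STORAGE / "model.json").write_text(json.dumps({"slope": slope}))


def run_dag(tasks, start_at=0):
    """Run tasks in order, optionally resuming from a later step."""
    for task_cls in tasks[start_at:]:
        task_cls().run()


tasks = [DataTask, ForecastTrainingTask]
run_dag(tasks)              # full run
run_dag(tasks, start_at=1)  # retry only the second task, reusing stored data
```

Because ForecastTrainingTask depends only on what is in storage, the second call succeeds without rerunning DataTask.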
Task Registry
All Tasks that are part of a module registered in INSTALLED_MODULES are automatically added to the Task Registry, located at
gluepy.exec.tasks.REGISTRY. This REGISTRY is the source of truth for all Tasks that can be run through the Gluepy Commands and the
dag command.