Directed Acyclic Graphs (DAG)
As a framework that structures workflows and pipelines, Gluepy’s key components are the DAG and Task.
A DAG
is your directed-acyclic-graph, or “pipeline” and its made up of one or more Task
instances that holds the
logic of that step in your pipeline.
DAG
from gluepy.exec import DAG
from .tasks import (
DataTask, ForecastTrainingTask, ForecastTestTask, OutputFormatTask
)
class TrainingDAG(DAG):
label = "training"
tasks = [
DataTask, ForecastTrainingTask, ForecastTestTask, OutputFormatTask
]
DAG Registry
All DAGs that are part of a module registered in INSTALLED_MODULES
will automatically be added to the DAG Registry located at
gluepy.exec.dags.REGISTRY
. This REGISTRY
is the source of truth of all available DAGs that is available to run through the Gluepy Commands and the
dag command.
Task
A Task
is a single step in a DAG, and holds the logic and code related to that step. It has an entrypoint method run()
that must be
defined as part of a Task
to not raise an error.
Tasks intentionally do not accept any keyword arguments in the run()
method, this is to ensure that every Task
in your DAG
do not have
in-memory dependencies from previous Task
earlier in the DAG
, and that each Task
can be retried independently at failure without the
need to rerun the full DAG
.
from gluepy.exec import Task
from gluepy.files.data import data_manager
class ForecastTrainingTask(Task):
label = "forecast-training"
def run(self):
"""Entrypoint to our task"""
df = data_manager.read("training.parquet")
# Add additional logic to train model...
Task Registry
All Tasks that are part of a module registered in INSTALLED_MODULES
will automatically be added to the Task Registry located at
gluepy.exec.tasks.REGISTRY
. This REGISTRY
is the source of truth of all available Tasks that is available to run through the Gluepy Commands and the
dag command.