Writing your first Gluepy app, part 2

In the previous step, Writing your first Gluepy app, part 1, we set up our project, created our forecaster module, and ran our first SampleDAG. Next, we will write some code of our own, a simple forecaster training step, to introduce how Gluepy helps you access your data.

In this tutorial we will:

Create a GetTrainingDataTask that generates some sample training data for us.
Create a TrainingTask that uses xgboost to train a regression model.

Generate Training Data

The first step is to modify our forecaster/tasks.py file and add the following class:

from datetime import datetime
from gluepy.exec import Task
from gluepy.files.data import data_manager
import pandas as pd


class GetTrainingDataTask(Task):
    label = "get-data"

    def run(self):
        """Entrypoint to our task"""
        # Generating a sample dataframe to later
        # be used for training. In a real project
        # you would likely instead read data from a
        # pre-existing dataset.
        df = pd.DataFrame({
            "units": [10, 5, 3, 11, 8],
            "date": [
                datetime(2024, 1, 1),
                datetime(2024, 1, 2),
                datetime(2024, 1, 3),
                datetime(2024, 1, 4),
                datetime(2024, 1, 5),
            ],
            "article_id": [1, 1, 1, 1, 1]
        })

        # Write dataset to run_folder.
        data_manager.write("training.csv", df, index=False)

This does a few simple things:

Define a new Task subclass whose business logic lives in the run() method.
Generate mock data as a pandas DataFrame that will later be used for training.
Use the data_manager to store the dataframe in the run’s Run Folder.
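
One detail is worth keeping in mind before the training step reads this file back: CSV does not preserve column dtypes. The sketch below uses plain pandas and an in-memory buffer rather than data_manager, on the assumption (suggested by the "pandas" logger in the output, but not confirmed here) that data_manager delegates CSV handling to pandas. It shows that the date column no longer comes back as a datetime column unless it is parsed explicitly.

```python
import io
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    "units": [10, 5, 3],
    "date": [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)],
    "article_id": [1, 1, 1],
})

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Plain read: the date column is not restored as a datetime column.
buffer.seek(0)
plain = pd.read_csv(buffer)
is_datetime_plain = pd.api.types.is_datetime64_any_dtype(plain["date"])

# Read with parse_dates: the datetime dtype is restored.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])
is_datetime_parsed = pd.api.types.is_datetime64_any_dtype(parsed["date"])
```

This is why a task that consumes training.csv has to convert the date column itself before using it. Whether data_manager.read forwards keyword arguments such as parse_dates is not something this tutorial confirms.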

Next up, we need to add this new Task to our SampleDAG within our forecaster/dags.py file.

from gluepy.exec import DAG
from .tasks import GetTrainingDataTask


class SampleDAG(DAG):
    label = "sample"
    tasks = [GetTrainingDataTask]

Now let’s try running our SampleDAG again with the dag command to ensure it is still working.

$ python manage.py dag sample
INFO 2024-06-25 12:48:44,061 dag - ---------- Started task 'BootstrapTask'
DEBUG 2024-06-25 12:48:44,062 tasks -
         Run ID: e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4
         Run Folder: runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4

DEBUG 2024-06-25 12:48:44,062 local - Writing file to path '/demo/data/runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4/context.yaml'.
INFO 2024-06-25 12:48:44,062 dag - ---------- Completed task 'BootstrapTask' in 0.000991 seconds
INFO 2024-06-25 12:48:44,062 dag - ---------- Started task 'GetTrainingDataTask'
INFO 2024-06-25 12:48:44,064 pandas - Writing file to path 'runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4/training.csv'.
DEBUG 2024-06-25 12:48:44,102 local - Writing file to path '/demo/data/runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4/training.csv'.
INFO 2024-06-25 12:48:44,102 dag - ---------- Completed task 'GetTrainingDataTask' in 0.039749 seconds

Training our Machine Learning Model

Next, it is time to create our TrainingTask. First, we must install some additional dependencies that we will use to train our machine learning model.

$ pip install scikit-learn xgboost

After installing xgboost and scikit-learn, we can add our TrainingTask class to forecaster/tasks.py.

import os
import io
import pickle
import xgboost as xgb
from gluepy.exec import Task
from gluepy.conf import default_context
from gluepy.files.data import data_manager
from gluepy.files.storages import default_storage
import pandas as pd


class TrainingTask(Task):
    label = "training"

    def run(self):
        # Read the training dataset previously generated in
        # ``GetTrainingDataTask``. The path is automatically
        # resolved relative to the run_folder to ensure data versioning
        # and isolation of output between executions.
        df: pd.DataFrame = data_manager.read("training.csv")
        df["date"] = df["date"].astype("category")

        # Train our machine learning model.
        model = xgb.XGBRegressor(enable_categorical=True)
        model.fit(df[["date", "article_id"]], df["units"])

        # Serialize the model and store it in the run_folder so it
        # can be loaded later when we want to do inference.
        stream = io.BytesIO()
        pickle.dump(model, stream)
        stream.seek(0)
        default_storage.touch(
            os.path.join(default_context.gluepy.run_folder, "model.pkl"), stream
        )
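
The stream.seek(0) call in the task above is easy to overlook. The following sketch uses only the standard library and a plain dictionary as a stand-in for the trained model (an assumption made so the example runs without xgboost). It shows what happens when the buffer is not rewound, and how the stored bytes can later be turned back into an object.

```python
import io
import pickle

# Stand-in for a trained model; any picklable object behaves the same way.
model = {"weights": [0.5, 1.5], "bias": 2.0}

stream = io.BytesIO()
pickle.dump(model, stream)

# After dump() the cursor sits at the end of the buffer, so a reader
# that consumes the stream from its current position sees nothing.
unread = stream.read()

# Rewinding makes the full payload available again.
stream.seek(0)
payload = stream.read()

# Later, for inference: wrap the stored bytes and unpickle them.
restored = pickle.load(io.BytesIO(payload))
```

How the bytes are read back from Gluepy storage is not covered in this part of the tutorial, so the sketch stops at the raw payload. As with any pickle file, only load model files that come from a source you trust.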

We must then add our new TrainingTask to our SampleDAG:

from gluepy.exec import DAG
from .tasks import GetTrainingDataTask, TrainingTask


class SampleDAG(DAG):
    label = "sample"
    tasks = [GetTrainingDataTask, TrainingTask]
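
The order of the tasks list matters here, because TrainingTask reads the training.csv file that GetTrainingDataTask writes. The sketch below is a toy model of running tasks in list order. MiniTask, MiniDAG, and execute() are hypothetical names invented for illustration; they are not Gluepy classes and say nothing about how Gluepy is implemented internally.

```python
class MiniTask:
    """Hypothetical stand-in for a task: subclasses implement run()."""

    label = "task"

    def run(self, log):
        raise NotImplementedError


class GetData(MiniTask):
    label = "get-data"

    def run(self, log):
        log.append("get-data ran")


class Train(MiniTask):
    label = "training"

    def run(self, log):
        log.append("training ran")


class MiniDAG:
    """Hypothetical stand-in for a DAG: an ordered list of task classes."""

    tasks = [GetData, Train]

    def execute(self):
        log = []
        for task_cls in self.tasks:
            # Instantiate each task class and run it, in list order.
            task_cls().run(log)
        return log


result = MiniDAG().execute()
```

Swapping the two entries in tasks would make the training step run before any data exists, which is the situation the ordering in SampleDAG avoids.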

Finally, we can try executing our updated SampleDAG using the dag command.

$ python manage.py dag sample
INFO 2024-06-25 13:10:37,903 dag - ---------- Started task 'BootstrapTask'
DEBUG 2024-06-25 13:10:37,903 tasks -
         Run ID: c29b8b49-dee9-4984-8ccc-860651780054
         Run Folder: runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054

DEBUG 2024-06-25 13:10:37,904 local - Writing file to path '/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/context.yaml'.
INFO 2024-06-25 13:10:37,904 dag - ---------- Completed task 'BootstrapTask' in 0.001035 seconds
INFO 2024-06-25 13:10:37,904 dag - ---------- Started task 'GetTrainingDataTask'
INFO 2024-06-25 13:10:37,905 pandas - Writing file to path 'runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv'.
DEBUG 2024-06-25 13:10:37,906 local - Writing file to path '/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv'.
INFO 2024-06-25 13:10:37,906 dag - ---------- Completed task 'GetTrainingDataTask' in 0.002413 seconds
INFO 2024-06-25 13:10:37,906 dag - ---------- Started task 'TrainingTask'
DEBUG 2024-06-25 13:10:37,906 local - Reading file from path '/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv'.
INFO 2024-06-25 13:10:37,906 pandas - Reading file from path 'training.csv'.
INFO 2024-06-25 13:10:37,945 dag - ---------- Completed task 'TrainingTask' in 0.038396 seconds

Note the following:

Gluepy automatically assigns a unique Run ID to this training run of our model.
Gluepy automatically assigns a unique Run Folder to version and isolate the Context and all our output.
Gluepy automatically serializes and saves our Context to context.yaml within the run folder.
Even though we only tell our code to read and write training.csv, Gluepy automatically resolves the path to /demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv.
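
To make the path handling concrete, here is a minimal sketch that rebuilds the path from the log output using only the standard library. The helpers make_run_folder and resolve are hypothetical and exist only for this illustration; the layout (runs/year/month/day/run id under a /demo/data storage root) is read off the logs above and is not a statement about Gluepy's internals.

```python
import posixpath
from datetime import date


def make_run_folder(run_id: str, day: date) -> str:
    """Hypothetical helper: build 'runs/<year>/<month>/<day>/<run_id>'."""
    # Month and day are not zero-padded, matching the log output.
    return posixpath.join(
        "runs", str(day.year), str(day.month), str(day.day), run_id
    )


def resolve(storage_root: str, run_folder: str, relative_path: str) -> str:
    """Hypothetical helper: expand a run-relative path to a full path."""
    return posixpath.join(storage_root, run_folder, relative_path)


run_folder = make_run_folder(
    "c29b8b49-dee9-4984-8ccc-860651780054", date(2024, 6, 25)
)
full_path = resolve("/demo/data", run_folder, "training.csv")
print(full_path)
```

Because every run gets its own folder, two executions that both write training.csv never overwrite each other's output.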

We have now created a set of custom Task classes that use the data_manager object, described in Data and DataFrames, to read and write data. Next, we will see how we can use Context to parameterize our project and make it easier to configure in the future.