Writing your first Gluepy app, part 2
In the previous step Writing your first Gluepy app, part 1, we setup our Project, created our forecaster
module, and tried running our first SampleDAG
.
Next up, we will write some of our own code that introduce the topics of how Gluepy help you accessing your data by writing a simple forecaster training step.
In this tutorial we will:
Create a
GetTrainingDataTask
that generate some sample training data for us.Create a
TrainingTask
that usexgboost
to train a regression model.
Generate Training Data
The first step we want to do is to modify our forecaster/tasks.py
file and add the following class:
from datetime import datetime
from gluepy.exec import Task
from gluepy.files.data import data_manager
import pandas as pd
class GetTrainingDataTask(Task):
label = "get-data"
def run(self):
"""Entrypoint to our task"""
# Generating a sample dataframe to later
# be used for training. In a real project
# you would likely instead read data from a
# pre-existing dataset.
df = pd.DataFrame({
"units": [10, 5, 3, 11, 8],
"date": [
datetime(2024, 1, 1),
datetime(2024, 1, 2),
datetime(2024, 1, 3),
datetime(2024, 1, 4),
datetime(2024, 1, 5)
],
"article_id": [1, 1, 1, 1, 1]
})
# Write dataset to run_folder.
data_manager.write("training.csv", df, index=False)
This does a few simple things:
Define a new Task instance where the business logic lives in the
run()
method.Generate mock data as a Pandas Dataframe that will later be used for training.
Use the data_manager to store the dataframe in the run’s Run Folder.
Next up, we need to add this new Task to our SampleDAG
within our forecaster/dags.py
file.
from gluepy.exec import DAG
from .tasks import GetTrainingDataTask
class SampleDAG(DAG):
label = "sample"
tasks = [GetTrainingDataTask]
Now let’s try running our SampleDAG
again with the dag command to ensure it is still working.
$ python manage.py dag sample
INFO 2024-06-25 12:48:44,061 dag - ---------- Started task 'BootstrapTask'
DEBUG 2024-06-25 12:48:44,062 tasks -
Run ID: e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4
Run Folder: runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4
DEBUG 2024-06-25 12:48:44,062 local - Writing file to path '/demo/data/runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4/context.yaml'.
INFO 2024-06-25 12:48:44,062 dag - ---------- Completed task 'BootstrapTask' in 0.000991 seconds
INFO 2024-06-25 12:48:44,062 dag - ---------- Started task 'GetTrainingDataTask'
INFO 2024-06-25 12:48:44,064 pandas - Writing file to path 'runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4/training.csv'.
DEBUG 2024-06-25 12:48:44,102 local - Writing file to path '/demo/data/runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4/training.csv'.
INFO 2024-06-25 12:48:44,102 dag - ---------- Completed task 'GetTrainingDataTask' in 0.039749 seconds
Training our Machine Learning Model
Next up, it is time to create our TrainingTask
. First, we must first install some additional dependencies that we will use to train our
machine learning model.
$ pip install scikit-learn xgboost
After installing xgboost
and scikit-learn
we can create our TrainingTask
class.
import os
import io
import pickle
import xgboost as xgb
from gluepy.exec import Task
from gluepy.conf import default_context
from gluepy.files.data import data_manager
from gluepy.files.storages import default_storage
import pandas as pd
class TrainingTask(Task):
label = "training"
def run(self):
# Read the training dataset previous generated in
# ``GenerateTrainingDataTask``. The path is automatically
# formatted to read from the run_folder to ensure data versioning
# and isolation of output between executions.
df: pd.DataFrame = data_manager.read("training.csv")
df["date"] = df["date"].astype("category")
# Train our machine learning model.
model = xgb.XGBRegressor(enable_categorical=True)
model.fit(df[["date", "article_id"]], df["units"])
# Store model to disk to later be used when we want
# to do inference.
stream = io.BytesIO()
pickle.dump(model, stream)
stream.seek(0)
default_storage.touch(
os.path.join(default_context.gluepy.run_folder, "model.pkl"), stream
)
We must then add our new TrainingTask
to our SampleDAG
:
from gluepy.exec import DAG
from .tasks import GetTrainingDataTask, TrainingTask
class SampleDAG(DAG):
label = "sample"
tasks = [GetTrainingDataTask, TrainingTask]
Finally, we can try executing our updated SampleDAG
using the dag command.
$ python manage.py dag sample
INFO 2024-06-25 13:10:37,903 dag - ---------- Started task 'BootstrapTask'
DEBUG 2024-06-25 13:10:37,903 tasks -
Run ID: c29b8b49-dee9-4984-8ccc-860651780054
Run Folder: runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054
DEBUG 2024-06-25 13:10:37,904 local - Writing file to path '/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/context.yaml'.
INFO 2024-06-25 13:10:37,904 dag - ---------- Completed task 'BootstrapTask' in 0.001035 seconds
INFO 2024-06-25 13:10:37,904 dag - ---------- Started task 'GetTrainingDataTask'
INFO 2024-06-25 13:10:37,905 pandas - Writing file to path 'runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv'.
DEBUG 2024-06-25 13:10:37,906 local - Writing file to path '/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv'.
INFO 2024-06-25 13:10:37,906 dag - ---------- Completed task 'GetTrainingDataTask' in 0.002413 seconds
INFO 2024-06-25 13:10:37,906 dag - ---------- Started task 'TrainingTask'
DEBUG 2024-06-25 13:10:37,906 local - Reading file from path '/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv'.
INFO 2024-06-25 13:10:37,906 pandas - Reading file from path 'training.csv'.
INFO 2024-06-25 13:10:37,945 dag - ---------- Completed task 'TrainingTask' in 0.038396 seconds
Note the following:
Gluepy automatically assign a unique Run ID to this training run of our model.
Gluepy automatically assign a unique Run Folder to version and isolate the Context and all our output.
Gluepy automatically serialize and save our Context to
context.yaml
within the run folder.Even though we only tell our code to read and write
training.csv
, Gluepy automatically format the path to/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv
We have now created a set of custom Task instances that use the data_manager
object and Data and DataFrames to read and write data. Next up we will see how we can use Context to parameterize our project and make it easier to configure in the future.