Writing your first Gluepy app, part 4

In the previous parts of this tutorial (Writing your first Gluepy app, part 1, Writing your first Gluepy app, part 2 and Writing your first Gluepy app, part 3), we set up a project that contains a DAG that trains a machine learning model on some sample data.

Next up, we’ll talk about the CLI and commands.

Gluepy comes bundled with commands for basic tasks such as running your DAG with the dag command, but there may be situations where you want to add functionality or scripts to your project that do not fit into the concept of a DAG or Task. For example, you may want to write a command that copies a run folder, or a command that takes a trained .pkl model file and deploys it to a registry.

In this final step of the tutorial, we will introduce custom Gluepy Commands by writing one that copies the output of a previous run to a new location, to simulate a deployment to production.

Reviewing the default CLI command

If you recall from Writing your first Gluepy app, part 1, when we created our forecaster module using the startmodule command, it generated a file at forecaster/commands.py that looks like this:

import click
from gluepy.commands import cli


@cli.command()
def sample():
    click.echo("Sample command called")

What happens here is the following:

The command uses Click under the hood for CLI-related logic such as adding options, groups of commands, help text and more.
All commands in Gluepy that are served through manage.py are part of the gluepy.commands.cli group. You must add a command to gluepy.commands.cli using the @cli.command() decorator.
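
To make the registration mechanism concrete, here is a minimal sketch that uses plain Click only. The cli group below is a local stand-in for gluepy.commands.cli, and the greet command and its --name option are hypothetical names chosen for illustration. In a real project you would import the group from Gluepy rather than define it yourself.

```python
import click
from click.testing import CliRunner


# Stand-in for gluepy.commands.cli: an ordinary Click group (assumption
# for illustration only; a Gluepy project imports the real group instead).
@click.group()
def cli():
    """Hypothetical root group that collects all commands."""


# The decorator registers the function on the group under its own name.
@cli.command()
@click.option("--name", default="world", help="Who to greet.")
def greet(name):
    """Print a greeting (hypothetical example command)."""
    click.echo(f"Hello, {name}")


# Invoke the group the way a shell would, without leaving Python.
runner = CliRunner()
result = runner.invoke(cli, ["greet", "--name", "Gluepy"])
print(result.output, end="")
```

The docstring and the help text on the option are what Click shows when the command is called with --help.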

This command can be called using:

$ python manage.py sample
Sample command called

Creating a custom CLI command

Now let’s modify this sample command so that it receives a path to a run folder and copies the .pkl model file that we created in Writing your first Gluepy app, part 2 to a /data/production directory to simulate a deployment. In a real project, you may instead deploy the model to something like MLflow.

import os
import click
from gluepy.commands import cli
from gluepy.files.storages import default_storage


@cli.command()
@click.argument("run_folder")
def deploy(run_folder):
    # Copy the trained model from the run folder to the "production" folder.
    default_storage.cp(
        os.path.join(run_folder, "model.pkl"),
        os.path.join("production", "model.pkl"),
    )
    click.echo("Model deployed to production")

The code above defines the following:

Adds a new command named deploy to the manage.py CLI using the @cli.command() decorator.
Adds a new argument using Click that expects the user to pass a Run Folder path.
Uses default_storage to copy the file from our run folder to a centralized folder that we use for “production” models.

The command can now be called as follows:

$ python manage.py deploy runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/
Model deployed to production
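
One way to try out a command like this without touching real run folders is Click's test runner. The sketch below is self-contained and makes several assumptions: the cli group is a local stand-in for gluepy.commands.cli, copy_file is a hypothetical local replacement for default_storage.cp, and the --target option and the missing-file check are additions for illustration that are not part of the tutorial command.

```python
import os
import shutil
import tempfile

import click
from click.testing import CliRunner


@click.group()
def cli():
    """Stand-in for gluepy.commands.cli (assumption for illustration)."""


def copy_file(src, dst):
    """Hypothetical local stand-in for default_storage.cp."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copyfile(src, dst)


@cli.command()
@click.argument("run_folder")
@click.option("--target", default="production", help="Destination folder.")
def deploy(run_folder, target):
    """Copy model.pkl from RUN_FOLDER into the target folder."""
    source = os.path.join(run_folder, "model.pkl")
    if not os.path.isfile(source):
        # ClickException prints the message and exits with status 1.
        raise click.ClickException(f"No model.pkl found in {run_folder}")
    copy_file(source, os.path.join(target, "model.pkl"))
    click.echo(f"Model deployed to {target}")


# Exercise the command against a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    run_folder = os.path.join(root, "runs", "example")
    target = os.path.join(root, "production")
    os.makedirs(run_folder)
    with open(os.path.join(run_folder, "model.pkl"), "wb") as handle:
        handle.write(b"fake model bytes")

    runner = CliRunner()
    ok = runner.invoke(cli, ["deploy", run_folder, "--target", target])
    missing = runner.invoke(cli, ["deploy", os.path.join(root, "nope")])
    deployed = os.path.isfile(os.path.join(target, "model.pkl"))
```

Failing early with a readable error when model.pkl is missing is a design choice of this sketch. The tutorial command leaves that case to whatever the storage backend does.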

Wrapping up

That was it for this tutorial. We have now learned:

How to create new projects.
How to create a DAG consisting of two Tasks that train a machine learning model.
How to use output versioning with Run Folder.
How to retry DAG runs and run a subset of runs.
How to parameterize our model using YAML and Context.
How to interact with the file system using default_storage, as described in Storage and File System.

You should now be familiar with the key concepts of Gluepy. To read more details, see