Configuration
Settings
As described in Settings Reference, Gluepy has two types of configuration: the Context and the Settings.
The Settings are the project-wide application settings for the environment you are currently executing your code on. They include which modules you have activated, which logging configuration you use, any references to credentials required to connect to your data warehouse, and more.
Different Settings by Environment
A common use case is to have different project settings for different environments. For example, you may want to use different settings for your local development than for your production workloads. This can be done by splitting the `settings.py` file into multiple files by environment, and pointing to the one you want to enable as a dotted string path with the `GLUEPY_SETTINGS_MODULE` environment variable.
For example, you may have the following project layout:

```
src/
    configs/
        base.py
        prod.py
        …
    myapp/
    manage.py
```
```python
# base.py
import os

STORAGE_BACKEND = "gluepy.files.storages.local.LocalStorage"
STORAGE_ROOT = os.path.join(BASE_DIR, "data")
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'simple': {
            'format': '{levelname} {asctime} {module} - {message}',
            'style': '{',
        }
    },
    'handlers': {
        'stream': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
            'formatter': 'simple',
        }
    },
    'loggers': {
        'gluepy': {
            'handlers': ['stream', ],
            'level': 'DEBUG',
            'propagate': True,
        },
    },
}
```
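The `LOGGING` dict shown in `base.py` follows the standard library's `dictConfig` schema, so Python can apply it directly. A minimal, self-contained sketch of that schema in action (trimmed to the essentials, not the full config above):

```python
import logging.config

# A minimal dictConfig-schema logging configuration, mirroring the shape
# of the LOGGING setting in base.py.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "{levelname} {module} - {message}", "style": "{"}
    },
    "handlers": {
        "stream": {
            "level": "DEBUG",
            "class": "logging.StreamHandler",
            "formatter": "simple",
        }
    },
    "loggers": {
        "gluepy": {"handlers": ["stream"], "level": "DEBUG", "propagate": True}
    },
}

# Apply the configuration; the "gluepy" logger now emits DEBUG to stderr.
logging.config.dictConfig(LOGGING)
logging.getLogger("gluepy").debug("logging configured")
```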
```python
# prod.py
from .base import *

STORAGE_BACKEND = "gluepy.files.storages.google.GoogleStorage"
STORAGE_ROOT = "data/"
GOOGLE_GCS_BUCKET = "mybucket-1234"
```
Then set an environment variable on your system named `GLUEPY_SETTINGS_MODULE` to `configs.prod`. All commands executed with `manage.py` will now load the `prod.py` settings, using `base.py` as the defaults.
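To illustrate the mechanism, here is a minimal sketch of resolving a dotted module path from an environment variable at startup. It resolves a stdlib module purely to keep the example self-contained; Gluepy's own loader may differ:

```python
import importlib
import os

# Sketch: a GLUEPY_SETTINGS_MODULE-style variable holds a dotted path,
# which importlib resolves into a live module object at startup.
os.environ.setdefault("SETTINGS_MODULE", "logging.config")
settings = importlib.import_module(os.environ["SETTINGS_MODULE"])
print(settings.__name__)  # logging.config
```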
Context
In the exploratory, experimental, scientific world of Data Science and Machine Learning, it is crucial to be able to modify the behavior of your pipeline or model without changing significant amounts of code. For example, you may want to change the dates used to train your model, the features used, or other parameters.
The Context is a singleton object that is instantiated at the beginning of each execution of your pipeline and populated with various parameters and configurations that are specific to that execution.
The context of a run is always serialized and stored together with the Run Folder to ensure that you can rerun and replicate the output of a historical/previous execution of a DAG.
Default Context Object
The `default_context` object is the lazily evaluated object that holds the Context of your execution, and is where you can access all your parameters from within your code.
Your default parameters are defined in `.yaml` files stored within the `CONFIG_PATH` directory and are automatically populated by the Context Managers.
For example, you may have a `config.yaml` file that looks like this:

```yaml
# Gluepy protected parameters
meta:
  run_id:
  run_folder:
  created_at:

# Custom user added parameters
forecaster:
  start_date: 2024-01-01
```
That you later want to access in your Python code like this:

```python
from gluepy.conf import default_context
from gluepy.exec import Task


class ForecasterTask(Task):
    def run(self):
        print(default_context.forecaster.start_date)
```
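The dotted attribute access shown above can be sketched as a nested mapping (such as a parsed `config.yaml`) wrapped in namespaces. This is an illustration of the access pattern, not Gluepy's internal implementation:

```python
from types import SimpleNamespace


def to_namespace(obj):
    """Recursively wrap dicts so keys become attributes."""
    if isinstance(obj, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in obj.items()})
    return obj


# Stand-in for the parsed config.yaml contents.
default_context = to_namespace({"forecaster": {"start_date": "2024-01-01"}})
print(default_context.forecaster.start_date)  # 2024-01-01
```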
If you later change your `config.yaml` file, you can still rerun and replicate the same results using the same parameters, as described in Retry previous run.
Run ID
The Run ID is a unique identifier that is given to the specific execution/run that you are running.
This ID is assigned in the `DefaultContextManager` defined in `CONFIG_BACKEND` and defaults to a `uuid4`.
If you want to customize the default run ID, you can create your own `ContextManager` and refer to it in `CONFIG_BACKEND`.
```python
from gluepy.conf import default_context
from gluepy.exec import Task


class ForecasterTask(Task):
    def run(self):
        print(default_context.meta.run_id)
```
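Since the default run ID is a `uuid4`, it can be reproduced in plain Python; a custom `ContextManager` could substitute any unique string:

```python
import uuid

# Generate an identifier comparable to the default run ID (uuid4).
run_id = str(uuid.uuid4())
print(run_id)  # e.g. 'af41a763-18bc-44b0-9293-f52266898a89'
```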
Run Folder
To ensure the ability to replicate results of a previous run, and to ensure that the output of your pipeline is versioned, each execution of Gluepy uses the Run ID to create a directory on your Storage Backends that is unique to the specific run, where it can serialize and save the Context and any other input or output.
```python
from gluepy.conf import default_context
from gluepy.exec import Task


class ForecasterTask(Task):
    def run(self):
        # /runs/2024/01/01/af41a763-18bc-44b0-9293-f52266898a89/
        print(default_context.meta.run_folder)
```
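The date-partitioned path above can be sketched as a run folder composed from a date and a Run ID. The exact layout is an assumption based on the example path, not Gluepy's guaranteed format:

```python
import os
import uuid
from datetime import date

# Compose a run folder path: /runs/<YYYY>/<MM>/<DD>/<run_id>
run_id = str(uuid.uuid4())
run_folder = os.path.join("/runs", date(2024, 1, 1).strftime("%Y/%m/%d"), run_id)
print(run_folder)  # /runs/2024/01/01/<run_id>
```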
By default, all paths used with the data managers are relative to the `run_folder`. Any time you want to reach outside of the Run's `run_folder`, you need to do so explicitly using the `root` kwarg.
```python
import pandas as pd

from gluepy.conf import default_context
from gluepy.files.data import data_manager
from gluepy.exec import Task


class ForecasterTask(Task):
    def run(self):
        # Writes to /runs/2024/01/01/af41a763-18bc-44b0-9293-f52266898a89/file.csv
        data_manager.write("file.csv", pd.DataFrame({"foo": [1]}))

        # Writes to /file.csv
        data_manager.write("file.csv", pd.DataFrame({"foo": [1]}), root=True)
```
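The path resolution described above can be sketched as a small helper with a `root` escape hatch. This mirrors the documented behavior but is not Gluepy's actual code:

```python
import os


def resolve_path(path, run_folder, root=False):
    """Return a storage path; root=True bypasses the run folder entirely."""
    return os.path.join("/", path) if root else os.path.join(run_folder, path)


folder = "/runs/2024/01/01/af41a763-18bc-44b0-9293-f52266898a89"
print(resolve_path("file.csv", folder))             # inside the run folder
print(resolve_path("file.csv", folder, root=True))  # /file.csv
```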
Retry previous run
Warning
Retrying a run will reuse the same `run_folder` and `run_id`, which means the execution will overwrite any previous output.
Since the Context is serialized and stored in the Run Folder automatically on each run, and all data by default is saved within a run's Run Folder, Gluepy makes it very easy to re-run and replicate the output of a previous execution.
This is done using the dag command and the `--retry` option.

```shell
$ python manage.py dag forecaster --retry /runs/2024/01/01/af41a763-18bc-44b0-9293-f52266898a89/
```
You can also retry a previous run but start from a specific task, rather than the beginning, using the `--from-task` option.

```shell
$ python manage.py dag forecaster --from-task training --retry /runs/2024/01/01/af41a763-18bc-44b0-9293-f52266898a89/
```