=============
Configuration
=============

.. _topic_settings:

Settings
========

As described in :doc:`Settings Reference `, Gluepy has two types of configuration: the :ref:`context` and :doc:`Settings `.

Settings are the project-wide application settings for the environment in which your code is currently executing. They include which modules are activated, your logging configuration, references to the credentials required to connect to your data warehouse, and more.

.. _topic_settings_by_environment:

Different Settings by Environment
---------------------------------

A common use case is to have different project settings for different environments. For example, you may want to use different settings for local development than for production workloads.

This can be done by splitting the ``settings.py`` file into multiple files, one per environment, and pointing to the one you want to enable as a dotted string path with the ``GLUEPY_SETTINGS_MODULE`` environment variable.

For example, you may have the following settings:

- src/

  - configs/

    - base.py
    - prod.py
    - ...

  - myapp/
  - manage.py

.. code-block:: python

    # base.py
    import os

    # BASE_DIR is assumed to be defined earlier in this file (the project root).
    STORAGE_BACKEND = "gluepy.files.storages.local.LocalStorage"
    STORAGE_ROOT = os.path.join(BASE_DIR, "data")

    LOGGING = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'simple': {
                'format': '{levelname} {asctime} {module} - {message}',
                'style': '{'
            }
        },
        'handlers': {
            'stream': {
                'level': 'DEBUG',
                'class': 'logging.StreamHandler',
                'formatter': 'simple',
            }
        },
        'loggers': {
            'gluepy': {
                'handlers': ['stream', ],
                'level': 'DEBUG',
                'propagate': True,
            },
        }
    }

.. code-block:: python

    # prod.py
    from .base import *

    STORAGE_BACKEND = "gluepy.files.storages.google.GoogleStorage"
    STORAGE_ROOT = "data/"
    GOOGLE_GCS_BUCKET = "mybucket-1234"

Then set an environment variable on your system named ``GLUEPY_SETTINGS_MODULE`` to ``configs.prod``.
All commands executed with ``manage.py`` will now load the ``prod.py`` settings, using ``base.py`` as the defaults.

.. _topic_context:

Context
=======

In the exploratory, experimental, scientific world of Data Science and Machine Learning, it is crucial to be able to modify the behavior of your pipeline or model without changing a significant amount of code. For example, you may want to change the dates used to train your model, the features used, or other parameters.

The :ref:`context` is a singleton object that is instantiated at the beginning of each execution of your pipeline and populated with the parameters and configuration that are specific to that execution.

The context of a run is always serialized and stored in the :ref:`context_run_folder` to ensure that you can rerun and replicate the output of a previous execution of a DAG.

.. _context_default_context:

Default Context Object
----------------------

The ``default_context`` object is the lazily evaluated object that holds the :ref:`context` of your execution, and through which you can access all your parameters from within your code.

Your default parameters are defined in ``.yaml`` files stored within the :setting:`CONFIG_PATH` directory and are automatically populated by the :ref:`context_manager`.

For example, you may have a ``config.yaml`` file that looks like this:

.. code-block:: yaml

    # Gluepy protected parameters
    meta:
      run_id:
      run_folder:
      created_at:

    # Custom user added parameters
    forecaster:
      start_date: 2024-01-01

You can later access these parameters in your Python code like this:

.. code-block:: python

    from gluepy.conf import default_context
    from gluepy.exec import Task


    class ForecasterTask(Task):

        def run(self):
            print(default_context.forecaster.start_date)

If you change your ``config.yaml`` file in the future, you can still rerun and replicate the same results using the same parameters, as described in :ref:`context_retry_example`.

.. _context_run_id:

Run ID
------

The Run ID is a unique identifier given to the specific execution/run that you are running. This ID is assigned by the ``DefaultContextManager`` defined in :setting:`CONFIG_BACKEND` and defaults to a ``uuid4``.

If you want to customize the default run ID, you can create your own ``ContextManager`` and refer to it in :setting:`CONFIG_BACKEND`.

.. code-block:: python

    from gluepy.conf import default_context
    from gluepy.exec import Task


    class ForecasterTask(Task):

        def run(self):
            # ``meta`` is a top-level key in the ``config.yaml`` shown above
            print(default_context.meta.run_id)

.. _context_run_folder:

Run Folder
----------

To ensure that the results of a previous run can be replicated, and that the output of your pipeline is versioned, each execution of Gluepy uses the :ref:`context_run_id` to create a directory on your :ref:`storage_backends` that is unique to the specific run. This is where it serializes and saves the :ref:`context` and any other input or output.

.. code-block:: python

    from gluepy.conf import default_context
    from gluepy.exec import Task


    class ForecasterTask(Task):

        def run(self):
            # /runs/2024/01/01/af41a763-18bc-44b0-9293-f52266898a89/
            print(default_context.meta.run_folder)

By default, all paths used with the data managers are relative to the ``run_folder``. Any time you want to reach outside of the run's ``run_folder``, you need to do so explicitly using the ``root`` kwarg.

.. code-block:: python

    import pandas as pd
    from gluepy.files.data import data_manager
    from gluepy.exec import Task


    class ForecasterTask(Task):

        def run(self):
            # Writes to /runs/2024/01/01/af41a763-18bc-44b0-9293-f52266898a89/file.csv
            data_manager.write("file.csv", pd.DataFrame({"foo": [1]}))
            # Writes to /file.csv
            data_manager.write("file.csv", pd.DataFrame({"foo": [1]}), root=True)

.. _context_retry_example:

Retry previous run
------------------

.. warning::

    Retrying a run will reuse the same ``run_folder`` and ``run_id``, which means that the execution will overwrite any previous output.

Since the :ref:`context` is serialized and stored in the :ref:`context_run_folder` automatically on each run, and all data is saved within a run's :ref:`context_run_folder` by default, Gluepy makes it easy to rerun and replicate the output of a previous execution.

This is done using the :ref:`cli_dag` and the ``--retry`` option.

.. code-block:: bash

    $ python manage.py dag forecaster --retry /runs/2024/01/01/af41a763-18bc-44b0-9293-f52266898a89/

You can also retry a previous run starting from a specific task, rather than from the beginning, using the ``--from-task`` option.

.. code-block:: bash

    $ python manage.py dag forecaster --from-task training --retry /runs/2024/01/01/af41a763-18bc-44b0-9293-f52266898a89/
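
Conceptually, starting from a specific task means skipping every task that comes before the named one in the DAG's task order. The snippet below is a minimal, standalone sketch of that selection logic in plain Python. It is not Gluepy's implementation: the ``tasks_to_run`` helper and the task labels ``loading``, ``training`` and ``scoring`` are hypothetical names used only for illustration.

.. code-block:: python

    from typing import List, Optional


    def tasks_to_run(task_names: List[str], from_task: Optional[str] = None) -> List[str]:
        """Return the tasks to execute, skipping everything before ``from_task``.

        Hypothetical helper for illustration only; not part of Gluepy.
        """
        if from_task is None:
            # No starting point given: run the full task list.
            return list(task_names)
        if from_task not in task_names:
            raise ValueError(f"Unknown task: {from_task!r}")
        # Keep the named task and everything after it.
        return task_names[task_names.index(from_task):]


    # Hypothetical task labels for a "forecaster" DAG.
    forecaster_tasks = ["loading", "training", "scoring"]
    selected = tasks_to_run(forecaster_tasks, from_task="training")
    print(selected)  # ['training', 'scoring']

Because the skipped tasks are not executed again, a run that resumes this way depends on their earlier output still being present in the run folder.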