Data
Data is a key part of any machine learning or data science project, and Gluepy comes packaged with tools that help you read, write, and customize every data interaction.
Storage and File System
All file system interaction should be done using the default_storage object, which is instantiated with the class defined in the STORAGE_BACKEND setting. The default_storage object has the methods described on The BaseStorage Class.
By using the default_storage object, you do not need to be concerned with whether you are on Azure, GCP, S3 or your local file system. Your code looks the same, and migrating from local development to production workloads in the cloud is as easy as changing the STORAGE_BACKEND setting.
from gluepy.exec import Task
from gluepy.files.storages import default_storage


class ForecasterTask(Task):
    def run(self):
        # Paths are resolved against the configured storage backend.
        if default_storage.exists("file.txt"):
            data = default_storage.open("file.txt")
        else:
            raise FileNotFoundError("file.txt not found")
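The BaseStorage Class also defines write helpers, so the same abstraction covers output files. Here is a minimal sketch of writing a file back to the active backend, assuming a touch() method that accepts a path and a file-like object (check The BaseStorage Class for the exact signature):

from io import BytesIO

from gluepy.files.storages import default_storage

# Write bytes through the active storage backend; with LocalStorage this
# lands on disk, with a cloud backend it is uploaded to the configured bucket.
default_storage.touch("outputs/report.txt", BytesIO(b"forecast complete"))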
Use Cloud Storage
By default, STORAGE_BACKEND points to The LocalStorage Class, which makes it easy to get started with Gluepy on your local machine.
Many projects have data that cannot fit on the local machine, or your business may have a centralized data lake where all data resides. For these situations, you must change STORAGE_BACKEND to point to a storage backend that supports your cloud provider of choice. See Storage Backends for the currently supported providers.
Here is an example using The GoogleStorage Class.
# settings.py
STORAGE_BACKEND = "gluepy.files.storages.google.GoogleStorage"
GOOGLE_GCS_BUCKET = "mybucket-1234"
Now the default_storage object points to an instance of GoogleStorage, and all method calls will authenticate against Google Cloud Storage and use it as the file system for the project.
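Nothing in the task code changes after the switch; the same call from the earlier example now runs against the bucket. A small sketch (assuming relative paths resolve inside the configured GOOGLE_GCS_BUCKET):

from gluepy.files.storages import default_storage

# Identical to the local example; with GoogleStorage configured, this now
# checks for "file.txt" in the GCS bucket instead of the local file system.
default_storage.exists("file.txt")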
Data and DataFrames
All reads and writes of DataFrames should be done using the data_manager object. This object is lazily evaluated against the DATA_BACKEND setting, which allows you to change the kind of DataFrame that is returned with only minor changes to your code.
The benefit of having your project use the data_manager instead of reading data directly is that all your DataFrames are stored and versioned in the Run Folder, and the data is automatically read from the STORAGE_BACKEND defined.
You may want to work with Spark, Polars or Pandas DataFrames. To change the type of DataFrame your project relies on, update the DATA_BACKEND setting.
See Data Backends for the currently supported providers.
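Switching DataFrame libraries is then a one-line settings change, mirroring the STORAGE_BACKEND example above. A sketch, assuming a Polars backend is exposed at the path below (see Data Backends for the exact import paths):

# settings.py
DATA_BACKEND = "gluepy.files.data.polars.PolarsDataManager"  # hypothetical path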
from gluepy.exec import Task
from gluepy.files.data import data_manager


class ForecasterTask(Task):
    def run(self):
        # Returns a DataFrame of the type defined by DATA_BACKEND.
        df = data_manager.read("training.parquet")
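Writes go through the same object, which is how outputs end up versioned under the Run Folder. A minimal sketch continuing the run() method above, assuming a write() counterpart to read() (the exact method name may differ; see Data Backends):

        # Persist results via the configured STORAGE_BACKEND; the file is
        # versioned under the current Run Folder.
        data_manager.write("predictions.parquet", df)  # hypothetical method name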
Schemas and Data Validation
There are plenty of great DataFrame validation frameworks out there, and we have made the decision not to invent our own. We recommend that you use tools such as pandera or pydantic for your data validation.
from datetime import datetime

import pandera as pa

from gluepy.exec import Task
from gluepy.files.data import data_manager


class TrainingSchema(pa.DataFrameModel):
    created_at: pa.typing.Series[datetime]
    article_id: pa.typing.Series[int]
    store_id: pa.typing.Series[int]
    units_sold: pa.typing.Series[int]


class ForecasterTask(Task):
    def run(self):
        # validate() raises a pandera SchemaError if the DataFrame
        # does not match the schema.
        df = TrainingSchema.validate(data_manager.read("training.parquet"))
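By default, pandera raises on the first failed check. If you prefer to collect every schema violation in one report, validate() accepts a lazy flag:

import pandera as pa

from gluepy.files.data import data_manager

try:
    df = TrainingSchema.validate(
        data_manager.read("training.parquet"), lazy=True
    )
except pa.errors.SchemaErrors as exc:
    # failure_cases is a DataFrame describing every failed check.
    print(exc.failure_cases)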