Data
Data is a key part of any machine learning or data science project, and Gluepy comes packaged with tools that help you read, write, and customize every data interaction.
Storage and File System
All file system interaction should be done using the default_storage object, which is instantiated with the class defined in the STORAGE_BACKEND setting. The default_storage object has the methods described on The BaseStorage Class.
By using the default_storage object, you do not need to be concerned with whether you are on Azure, GCP, S3 or your local file system. Your code looks the same, and migrating from local development to production workloads in the cloud is as easy as changing the STORAGE_BACKEND setting.
from gluepy.exec import Task
from gluepy.files.storages import default_storage


class ForecasterTask(Task):
    def run(self):
        # Paths are resolved against the configured storage backend.
        if default_storage.exists("file.txt"):
            data = default_storage.open("file.txt")
        else:
            raise FileNotFoundError("file.txt not found")
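The BaseStorage Class also defines write helpers, so the same abstraction covers output files. Here is a minimal sketch of writing a file back to the active backend, assuming a touch() method that accepts a path and a file-like object (check The BaseStorage Class for the exact signature):

from io import BytesIO

from gluepy.files.storages import default_storage

# Write bytes through the active storage backend; with LocalStorage this
# lands on disk, with a cloud backend it is uploaded to the configured bucket.
default_storage.touch("outputs/report.txt", BytesIO(b"forecast complete"))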
Use Cloud Storage
By default, STORAGE_BACKEND points to The LocalStorage Class, which makes it easy to get started with Gluepy on your local machine.
Many projects have data that cannot fit on the local machine, or your business may have a centralized data lake where all data resides. For these situations, you must change STORAGE_BACKEND to point to a storage backend that supports your cloud provider of choice. See Storage Backends for the currently supported providers.
Here is an example using The GoogleStorage Class.
# settings.py
STORAGE_BACKEND = "gluepy.files.storages.google.GoogleStorage"
GOOGLE_GCS_BUCKET = "mybucket-1234"
Now the default_storage object points to an instance of GoogleStorage, and all method calls will authenticate against Google Cloud Storage and use it as the file system for the project.
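Nothing in the task code changes after the switch; the same call from the earlier example now runs against the bucket. A small sketch (assuming relative paths resolve inside the configured GOOGLE_GCS_BUCKET):

from gluepy.files.storages import default_storage

# Identical to the local example; with GoogleStorage configured, this now
# checks for "file.txt" in the GCS bucket instead of the local file system.
default_storage.exists("file.txt")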
Data and DataFrames
All reads and writes of DataFrames should be done using the data_manager object. This object is lazily evaluated against the DATA_BACKEND setting, which allows you to change the kind of DataFrame that is returned with only minor changes to your code.
The benefit of having your project use the data_manager instead of reading data directly is that all your DataFrames are stored and versioned in the Run Folder, and the data is automatically read from the STORAGE_BACKEND defined.
You may want to work with Spark, Polars or Pandas DataFrames. To change the type of DataFrame your project relies on, update the DATA_BACKEND setting.
See Data Backends for the currently supported providers.
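Switching DataFrame libraries is then a one-line settings change, mirroring the STORAGE_BACKEND example above. A sketch, assuming a Polars backend is exposed at the path below (see Data Backends for the exact import paths):

# settings.py
DATA_BACKEND = "gluepy.files.data.polars.PolarsDataManager"  # hypothetical path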
from gluepy.exec import Task
from gluepy.files.data import data_manager


class ForecasterTask(Task):
    def run(self):
        # Returns a DataFrame of the type defined by DATA_BACKEND.
        df = data_manager.read("training.parquet")
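Writes go through the same object, which is how outputs end up versioned under the Run Folder. A minimal sketch continuing the run() method above, assuming a write() counterpart to read() (the exact method name may differ; see Data Backends):

        # Persist results via the configured STORAGE_BACKEND; the file is
        # versioned under the current Run Folder.
        data_manager.write("predictions.parquet", df)  # hypothetical method name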
Schemas and Data Validation
There are plenty of great DataFrame validation frameworks out there, and we have made the decision not to invent our own. We recommend that you use tools such as pandera or pydantic for your data validation.
from datetime import datetime

import pandera as pa

from gluepy.exec import Task
from gluepy.files.data import data_manager


class TrainingSchema(pa.DataFrameModel):
    created_at: pa.typing.Series[datetime]
    article_id: pa.typing.Series[int]
    store_id: pa.typing.Series[int]
    units_sold: pa.typing.Series[int]


class ForecasterTask(Task):
    def run(self):
        # validate() raises a pandera SchemaError if the DataFrame
        # does not match the schema.
        df = TrainingSchema.validate(data_manager.read("training.parquet"))
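By default, pandera raises on the first failed check. If you prefer to collect every schema violation in one report, validate() accepts a lazy flag:

import pandera as pa

from gluepy.files.data import data_manager

try:
    df = TrainingSchema.validate(
        data_manager.read("training.parquet"), lazy=True
    )
except pa.errors.SchemaErrors as exc:
    # failure_cases is a DataFrame describing every failed check.
    print(exc.failure_cases)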