====
Data
====

Data is, of course, a key part of any machine learning or data science project, and Gluepy comes packaged with tools that help you read, interact with, and customize any data interaction.

.. _topic_storage:

Storage and File System
=======================

All file system interaction should be done using the ``default_storage`` object, which is instantiated with the class defined in :setting:`STORAGE_BACKEND`. The ``default_storage`` object has the methods described in :ref:`storage_backend_base`.

By using the ``default_storage`` object, you do not need to be concerned with whether you are on Azure, GCP, S3 or your local file system. Your code looks the same, and migrating from local development to production workloads in the cloud is as easy as changing the :setting:`STORAGE_BACKEND` setting.

.. code-block:: python

    from gluepy.exec import Task
    from gluepy.files.storages import default_storage


    class ForecasterTask(Task):
        def run(self):
            if default_storage.exists("file.txt"):
                data = default_storage.open("file.txt")
            else:
                raise FileNotFoundError("file.txt not found")


Use Cloud Storage
-----------------

By default, :setting:`STORAGE_BACKEND` points to the :ref:`storage_backend_local`. This makes it easy for you to get started with Gluepy on your local machine.

Many projects have data that cannot fit on the local machine, or your business may have a centralized data lake where all data resides. For these situations, you must change the :setting:`STORAGE_BACKEND` to point to one of the :ref:`storage_backends` that supports your cloud storage of choice.

Gluepy already supports the following providers:

* :ref:`storage_backend_local`
* :ref:`storage_backend_google`
* :ref:`storage_backend_s3`

Here is an example using the :ref:`storage_backend_google`.

.. code-block:: python

    # settings.py
    STORAGE_BACKEND = "gluepy.files.storages.google.GoogleStorage"
    GOOGLE_GCS_BUCKET = "mybucket-1234"

Now the ``default_storage`` object will point to an instance of ``GoogleStorage``, and all method calls will authenticate to Google Cloud Storage and use it as the file system for your Gluepy project.

.. _topic_data:

Data and DataFrames
===================

All reads and writes of DataFrames should be done using the ``data_manager`` object. This object is lazily evaluated to the :setting:`DATA_BACKEND`, which allows you to modify what kind of DataFrame is returned with only minor changes to your code.

The benefit of ensuring that your project uses the ``data_manager`` instead of reading data directly is that all your DataFrames are stored and versioned in the :ref:`context_run_folder`, and the data is automatically read from the :setting:`STORAGE_BACKEND` defined.

You may want to work with Spark DataFrames, Polars DataFrames or Pandas DataFrames. If you want to modify the type of DataFrame that your project relies on, you can change it using the :setting:`DATA_BACKEND`. See :ref:`data_backends` for the currently supported providers.

.. code-block:: python

    from gluepy.exec import Task
    from gluepy.files.data import data_manager


    class ForecasterTask(Task):
        def run(self):
            df = data_manager.read("training.parquet")
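The ``data_manager`` is also the intended write path for DataFrames. As a minimal sketch, assuming a ``write`` method that mirrors ``read`` (the exact method name and signature may differ; see :ref:`data_backends`), persisting an intermediate DataFrame could look like this:

.. code-block:: python

    from gluepy.exec import Task
    from gluepy.files.data import data_manager


    class PreprocessTask(Task):  # hypothetical task for illustration
        def run(self):
            df = data_manager.read("training.parquet")
            # Hypothetical cleaning step on the DataFrame.
            df = df.dropna()
            # Assumption: ``write`` mirrors ``read`` and stores the
            # DataFrame in the run folder through the configured
            # STORAGE_BACKEND.
            data_manager.write("preprocessed.parquet", df)

Because both calls go through the ``data_manager``, the output lands in the same versioned :ref:`context_run_folder` as the input, regardless of which storage backend is configured.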
.. _topic_data_validation:

Schemas and Data Validation
---------------------------

There are plenty of great DataFrame validation frameworks out there, and we have made the decision not to invent our own. We recommend that you use tools such as `pandera <https://pandera.readthedocs.io>`_ or `pydantic <https://docs.pydantic.dev>`_ for your data validation.

.. code-block:: python

    from datetime import datetime

    import pandera as pa

    from gluepy.exec import Task
    from gluepy.files.data import data_manager


    class TrainingSchema(pa.DataFrameModel):
        created_at: pa.typing.Series[datetime]
        article_id: pa.typing.Series[int]
        store_id: pa.typing.Series[int]
        units_sold: pa.typing.Series[int]


    class ForecasterTask(Task):
        def run(self):
            df = TrainingSchema.validate(data_manager.read("training.parquet"))
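If you prefer pydantic, a row-wise validation sketch could look like the following. This is a minimal illustration, assuming pydantic v2 (``model_validate``) and a Pandas DataFrame backend; validating row by row is slower than pandera's vectorized checks, so it is best suited for small frames or spot checks.

.. code-block:: python

    from datetime import datetime

    from pydantic import BaseModel

    from gluepy.exec import Task
    from gluepy.files.data import data_manager


    class TrainingRow(BaseModel):
        created_at: datetime
        article_id: int
        store_id: int
        units_sold: int


    class ForecasterTask(Task):
        def run(self):
            df = data_manager.read("training.parquet")
            # Validate each row against the pydantic model; raises
            # pydantic.ValidationError on the first invalid row.
            for row in df.to_dict(orient="records"):
                TrainingRow.model_validate(row)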