Data

Data interaction is a key component of any project that leverages Gluepy. As a technology-agnostic framework, Gluepy is intended to be flexible enough for you to customize and configure it to work with whatever data technology stack you choose, no matter which cloud provider you are using.

This document describes some of the features that might be relevant to Gluepy usage. It is not intended as a replacement for the documentation or reference manuals of the underlying storage services and dataframe libraries.

General notes

Separating Dataframes from Storage

When it comes to interacting with data, Gluepy separates the work into two types of actions. “File system actions” such as listing directories, creating files and checking paths are managed by the Storage Backends, while dataframe reads and writes are managed by the Data Backends.

This means that you can define whether you are using the local file system, S3 or Google Cloud Storage independently of whether you are using Pandas, Polars or PySpark in your project.

Storage Backends

As described in our Storages Topic guide, all file interactions such as creating directories, checking for file existence, creating files or reading files are done using the default_storage object. This object is lazily evaluated at runtime to whatever Storage Backend is defined in STORAGE_BACKEND.

This means that by using default_storage and modifying STORAGE_BACKEND in your project, you can use the same code or libraries on multiple file systems simply by adjusting your configuration.
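
For example, a task that checks whether a file exists before reading it can go through default_storage without knowing which file system sits behind it. The sketch below is illustrative only: the import path of default_storage and the open() method are assumptions, while isfile() is one of the methods mentioned in the backend descriptions below; see the Storages Topic guide for the exact interface.

    # A minimal sketch, assuming default_storage can be imported from
    # gluepy.files.storages and exposes isfile() and open(); verify the exact
    # import path and method names against the Storages Topic guide.
    from gluepy.files.storages import default_storage

    path = "my_project/data/input.csv"

    # The same code runs against LocalStorage, GoogleStorage or S3Storage,
    # depending on which backend STORAGE_BACKEND points to in your settings.
    if default_storage.isfile(path):
        content = default_storage.open(path).read()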

By default, Gluepy includes support for the following STORAGE_BACKEND options:

The LocalStorage Class

The LocalStorage class is the storage implementation that uses the local file system, and is used by Gluepy as the default STORAGE_BACKEND when you start a new project. The LocalStorage backend is based on the interface and methods defined on The BaseStorage Class.
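
Since LocalStorage is the default, no configuration is required to use it, but you can set it explicitly in your settings. The dotted path below is an assumption for illustration; check your installed Gluepy version for the exact module path.

    # settings.py -- a minimal sketch; the dotted path to LocalStorage is an
    # assumed example and may differ in your Gluepy version.
    STORAGE_BACKEND = "gluepy.files.storages.LocalStorage"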

The GoogleStorage Class

The GoogleStorage class is the storage implementation that uses Google Cloud Storage as a file system. It is based on the interface and methods defined on The BaseStorage Class.

The class is a wrapper around the google-cloud-storage PyPI package and is able to use the same authentication methods as described in the Google documentation.

Custom Settings

  • GOOGLE_GCS_BUCKET defines the name of the Google Cloud Storage bucket that you want to use.

Use STORAGE_ROOT to define where on the bucket files are stored. This setting should be set to a relative path on the bucket, e.g. "my_project/data/".
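
Put together, a settings sketch for the GoogleStorage backend could look like the following. The dotted path to GoogleStorage is an assumption for illustration; GOOGLE_GCS_BUCKET and STORAGE_ROOT are the settings documented above.

    # settings.py -- sketch only; the dotted path to GoogleStorage is assumed.
    STORAGE_BACKEND = "gluepy.files.storages.GoogleStorage"
    GOOGLE_GCS_BUCKET = "my-gcs-bucket"
    STORAGE_ROOT = "my_project/data/"

    # Authentication is handled by the google-cloud-storage package, e.g. via
    # the GOOGLE_APPLICATION_CREDENTIALS environment variable or any other
    # method described in the Google documentation.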

The S3Storage Class

Warning

The S3Storage backend is only partly implemented and is currently missing some methods such as cp(), isdir() and isfile().

The S3Storage class is the storage implementation that uses S3 as a file system. It is based on the interface and methods defined on The BaseStorage Class.

The class is a wrapper around the boto3 PyPI package and is able to use an Access Key and Secret Access Key as its authentication method.

Custom Settings

  • AWS_S3_REGION_NAME defines the AWS region to use. E.g. "eu-west-1".

  • AWS_S3_ENDPOINT_URL defines a custom S3 endpoint URL to use when connecting to S3, including the scheme.

  • AWS_ACCESS_KEY_ID defines the AWS Access Key used for authentication together with AWS_SECRET_ACCESS_KEY.

  • AWS_SECRET_ACCESS_KEY defines the AWS Secret Key used for authentication together with AWS_ACCESS_KEY_ID.

  • AWS_STORAGE_BUCKET_NAME defines the name of the S3 bucket to connect to.

Use STORAGE_ROOT to define where on the bucket files are stored. This setting should be set to a relative path on the bucket, e.g. "my_project/data/".
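
Putting the settings above together, an S3 configuration sketch could look like this. The dotted path to S3Storage is an assumption for illustration; the AWS_* settings and STORAGE_ROOT are the ones documented above.

    # settings.py -- sketch only; the dotted path to S3Storage is assumed.
    import os

    STORAGE_BACKEND = "gluepy.files.storages.S3Storage"
    AWS_S3_REGION_NAME = "eu-west-1"
    AWS_S3_ENDPOINT_URL = "https://s3.eu-west-1.amazonaws.com"
    AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
    AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")
    AWS_STORAGE_BUCKET_NAME = "my-s3-bucket"
    STORAGE_ROOT = "my_project/data/"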

The BaseStorage Class

The BaseStorage class is the default interface that defines the required methods and arguments for any storage implementation. It is an abstract class that raises NotImplementedError on use, but is used by The LocalStorage Class and other storage backends to ensure consistency.
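
To add support for a file system that Gluepy does not ship a backend for, you would subclass BaseStorage and point STORAGE_BACKEND at your class. The sketch below is illustrative only: the import path is an assumption, and the methods shown are just the ones mentioned in this document (cp(), isdir(), isfile()); the authoritative list of required methods is whatever BaseStorage itself defines.

    # A hypothetical custom backend -- the import path and the exact set of
    # required methods are assumptions; consult BaseStorage for the real
    # interface.
    from gluepy.files.storages import BaseStorage


    class MyCustomStorage(BaseStorage):
        """Illustrative storage backend for some other file system."""

        def isfile(self, path):
            # Return True if `path` points at a file on the target system.
            raise NotImplementedError

        def isdir(self, path):
            # Return True if `path` points at a directory.
            raise NotImplementedError

        def cp(self, source, destination):
            # Copy a file from `source` to `destination`.
            raise NotImplementedError

    # settings.py
    # STORAGE_BACKEND = "myproject.storages.MyCustomStorage"  # assumed path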

Data Backends

As described in our Data Topic guide, all dataframe reads and writes are done using the data_manager object. This object is lazily evaluated at runtime to whatever Data Backend is defined in DATA_BACKEND.

This means that by using data_manager and modifying DATA_BACKEND in your project, you can customize the behavior of Gluepy and work with your preferred dataframe technology such as Polars, PySpark or Pandas.
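
For example, a task can read and write dataframes through data_manager without caring which dataframe library is configured. The method names below are assumptions for illustration; the actual interface is defined by the configured Data Backend and described in the Data Topic guide.

    # A minimal sketch, assuming data_manager can be imported from
    # gluepy.files.data and exposes read()/write() style methods; verify the
    # exact import path and method names against the Data Topic guide.
    from gluepy.files.data import data_manager

    # With PandasDataManager configured, `df` would be a pandas DataFrame.
    df = data_manager.read("my_project/data/input.csv")
    data_manager.write("my_project/data/output.csv", df)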

By default, Gluepy includes support for the following DATA_BACKEND options:

The PandasDataManager Class

The PandasDataManager class is the data manager implementation that interacts with Pandas DataFrames, and is used by Gluepy as the default DATA_BACKEND when you start a new project. The PandasDataManager backend is based on the interface and methods defined on The BaseDataManager Class.
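
As with the storage backends, the data manager is selected through a dotted path in your settings. Since PandasDataManager is the default, you normally do not need to set it at all; the dotted path below is an assumption for illustration.

    # settings.py -- sketch only; the dotted path to PandasDataManager is
    # assumed and may differ in your Gluepy version.
    DATA_BACKEND = "gluepy.files.data.PandasDataManager"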

The BaseDataManager Class

The BaseDataManager class is the default interface that defines the required methods and arguments for any data manager implementation. It is an abstract class that raises NotImplementedError on use, but is used by The PandasDataManager Class and other data manager backends to ensure consistency.
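
If you want to use a dataframe library that Gluepy does not ship a manager for, such as Polars, you would subclass BaseDataManager and point DATA_BACKEND at it. The sketch below is illustrative only: the import path and the method names are assumptions; the required interface is whatever BaseDataManager defines.

    # A hypothetical Polars-backed data manager -- the import path and method
    # names are assumptions; consult BaseDataManager for the real interface.
    import polars as pl

    from gluepy.files.data import BaseDataManager


    class PolarsDataManager(BaseDataManager):
        """Illustrative data manager that reads and writes Polars DataFrames."""

        def read(self, path: str) -> pl.DataFrame:
            # A real implementation would resolve `path` through the configured
            # storage backend; reading directly keeps the sketch short.
            return pl.read_csv(path)

        def write(self, path: str, df: pl.DataFrame) -> None:
            df.write_csv(path)

    # settings.py
    # DATA_BACKEND = "myproject.data.PolarsDataManager"  # assumed dotted path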