Data
Data interaction is a key component of any project that uses Gluepy. As a technology-agnostic framework, Gluepy is meant to be flexible enough to customize and configure for whatever data technology stack you choose, regardless of which cloud provider you use.
This file describes some of the features that might be relevant to Gluepy usage. It is not intended as a replacement for server-specific documentation or reference manuals.
General notes
Separating Dataframes from Storage
When it comes to interacting with data, Gluepy separates actions into two kinds. File system actions, such as listing directories, creating files and checking paths, are managed by the Storage Backends. Dataframe reads and writes are managed by the Data Backends.
This means that you can choose between the local file system, S3 or Google Storage independently of whether you use Pandas, Polars or PySpark in your project.
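As a rough illustration of this separation, the two choices could sit side by side as independent entries in a project's settings module. The setting names STORAGE_BACKEND and DATA_BACKEND come from this page; the dotted paths are assumptions about where the classes live and should be checked against your installed Gluepy version.

```python
# settings.py (sketch): the dotted paths below are assumed, not confirmed by this page.

# Where files live: local disk, S3, Google Storage, ...
STORAGE_BACKEND = "gluepy.files.storages.LocalStorage"

# How dataframes are read and written: Pandas, Polars, PySpark, ...
DATA_BACKEND = "gluepy.files.data.PandasDataManager"
```

Changing one of the two values should not require touching the other.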
Storage Backends
As described in our Storages Topic guide, all file interactions such as creating directories, checking for file existence, creating files or reading files are done using the default_storage object. This object is lazily evaluated at runtime to whichever Storage Backend is defined in STORAGE_BACKEND.
This means that by using default_storage and modifying STORAGE_BACKEND in your project, you can use the same code or libraries on multiple file systems simply by adjusting your configuration.
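To make the idea of lazy evaluation concrete, here is a minimal, self-contained sketch of an object that resolves a dotted class path only when it is first used. It is not Gluepy's implementation: LazyBackend, the settings dict and the use of a standard-library class as a stand-in backend are all hypothetical.

```python
import importlib


class LazyBackend:
    """Resolve a dotted class path on first attribute access (illustrative only)."""

    def __init__(self, get_dotted_path):
        self._get_dotted_path = get_dotted_path
        self._instance = None

    def _setup(self):
        # Split "package.module.ClassName" into module path and class name.
        module_path, class_name = self._get_dotted_path().rsplit(".", 1)
        module = importlib.import_module(module_path)
        self._instance = getattr(module, class_name)()

    def __getattr__(self, name):
        # Only called for attributes not found normally, i.e. backend methods.
        if self._instance is None:
            self._setup()
        return getattr(self._instance, name)


# Hypothetical settings; a stdlib class stands in for a real storage backend.
settings = {"STORAGE_BACKEND": "collections.OrderedDict"}
lazy_storage = LazyBackend(lambda: settings["STORAGE_BACKEND"])

# Nothing has been imported or instantiated yet; the first method call does it.
lazy_storage.update({"a": 1})
print(list(lazy_storage.keys()))
```

Because the class is looked up at first use rather than at import time, calling code never needs to know which backend is configured.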
By default, Gluepy includes support for the following STORAGE_BACKEND options:
The LocalStorage Class
The LocalStorage class is the storage implementation that uses the local file system, and is used by Gluepy as the default STORAGE_BACKEND when you start a new project.
The LocalStorage backend is based on the interface and methods defined on The BaseStorage Class.
The GoogleStorage Class
The GoogleStorage class is the storage implementation that uses Google Cloud Storage as a file system. It is based on the interface and methods defined on The BaseStorage Class.
The class is a wrapper around the google-cloud-storage PyPI package and is able to use the same authentication methods as described in Google's documentation.
Custom Settings
GOOGLE_GCS_BUCKET defines the name of the Google Cloud Storage bucket that you want to use.
Use STORAGE_ROOT to define where on the bucket files are stored. This setting should be a relative path on the bucket, e.g. "my_project/data/".
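A settings fragment for this backend might look like the sketch below. The two setting names are taken from this section; the bucket name is a placeholder, and STORAGE_BACKEND would also need to point at the GoogleStorage class.

```python
# settings.py (sketch): the bucket name and path are placeholders.
GOOGLE_GCS_BUCKET = "my-example-bucket"  # hypothetical bucket name
STORAGE_ROOT = "my_project/data/"  # relative path inside the bucket
```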
The S3Storage Class
Warning
The S3Storage class is only partly implemented and is currently missing some methods, such as cp(), isdir() and isfile().
The S3Storage class is the storage implementation that uses S3 as a file system. It is based on the interface and methods defined on The BaseStorage Class.
The class is a wrapper around the boto3 PyPI package and can authenticate using an Access Key and a Secret Access Key.
Custom Settings
AWS_S3_REGION_NAME defines the AWS region to use, e.g. "eu-west-1".
AWS_S3_ENDPOINT_URL defines a custom S3 URL to use when connecting to S3, including the scheme.
AWS_ACCESS_KEY_ID is the AWS Access Key used for authentication together with AWS_SECRET_ACCESS_KEY.
AWS_SECRET_ACCESS_KEY is the AWS Secret Key used for authentication together with AWS_ACCESS_KEY_ID.
AWS_STORAGE_BUCKET_NAME is the name of the S3 bucket to connect to.
Use STORAGE_ROOT to define where on the bucket files are stored. This setting should be a relative path on the bucket, e.g. "my_project/data/".
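Put together, an S3 configuration could be sketched as follows. The setting names are those listed in this section; every value is a placeholder, and reading the credentials from environment variables is a suggestion of this example rather than something Gluepy requires.

```python
import os

# settings.py (sketch): every value below is a placeholder.
AWS_S3_REGION_NAME = "eu-west-1"
AWS_S3_ENDPOINT_URL = "https://s3.eu-west-1.amazonaws.com"  # includes the scheme
AWS_STORAGE_BUCKET_NAME = "my-example-bucket"  # hypothetical bucket name

# Read credentials from the environment rather than committing them to the repository.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

STORAGE_ROOT = "my_project/data/"  # relative path inside the bucket
```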
The BaseStorage Class
The BaseStorage class is the default interface that defines the required methods and arguments for any storage implementation.
This is an abstract class that raises NotImplementedError when used directly, but it serves as the base for The LocalStorage Class and other storage backends to ensure consistency.
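The pattern can be sketched with a stand-in base class and a toy subclass. This is not Gluepy's code: SketchBaseStorage and InMemoryStorage are hypothetical, the method names cp(), isdir() and isfile() are borrowed from the S3Storage warning on this page, and their argument names, like the write() helper, are assumptions made for the example.

```python
class SketchBaseStorage:
    """Stand-in for a base storage interface; not Gluepy's actual class."""

    def isfile(self, path):
        raise NotImplementedError()

    def isdir(self, path):
        raise NotImplementedError()

    def cp(self, src_path, dest_path):
        raise NotImplementedError()


class InMemoryStorage(SketchBaseStorage):
    """Toy backend that keeps file contents in a dict keyed by path."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        # Hypothetical helper so the example has something to copy.
        self._files[path] = data

    def isfile(self, path):
        return path in self._files

    def isdir(self, path):
        # A "directory" exists if any stored path lies beneath it.
        prefix = path.rstrip("/") + "/"
        return any(name.startswith(prefix) for name in self._files)

    def cp(self, src_path, dest_path):
        self._files[dest_path] = self._files[src_path]


storage = InMemoryStorage()
storage.write("data/raw.csv", b"a,b\n1,2\n")
storage.cp("data/raw.csv", "data/copy.csv")
print(storage.isfile("data/copy.csv"), storage.isdir("data"))
```

Any method a subclass forgets to override keeps raising NotImplementedError, which makes gaps in a custom backend visible as soon as that method is called.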
Data Backends
As described in our Data Topic guide, all dataframe reads and writes are done using the data_manager object. This object is lazily evaluated at runtime to whichever Data Backend is defined in DATA_BACKEND.
This means that by using data_manager and modifying DATA_BACKEND in your project, you can customize the behavior of Gluepy and work with your preferred dataframe technology, such as Polars, PySpark or Pandas.
By default, Gluepy includes support for the following DATA_BACKEND options:
The PandasDataManager Class
The PandasDataManager class is the data manager implementation that interacts with Pandas dataframes, and is used by Gluepy as the default DATA_BACKEND when you start a new project.
The PandasDataManager backend is based on the interface and methods defined on The BaseDataManager Class.
The BaseDataManager Class
The BaseDataManager class is the default interface that defines the required methods and arguments for any data manager implementation.
This is an abstract class that raises NotImplementedError when used directly, but it serves as the base for The PandasDataManager Class and other data manager backends to ensure consistency.
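A custom data manager following the same pattern could look roughly like the sketch below. Everything here is hypothetical: the class names, the read() and write() method names and their arguments are assumptions made for illustration, and a plain list of dicts stands in for a dataframe so that the example runs with only the standard library.

```python
import csv
import io


class SketchBaseDataManager:
    """Stand-in for a base data manager interface; not Gluepy's actual class."""

    def read(self, path):
        raise NotImplementedError()

    def write(self, path, rows):
        raise NotImplementedError()


class CsvRowsDataManager(SketchBaseDataManager):
    """Toy backend whose "dataframe" is a list of dicts, stored as CSV text."""

    def __init__(self, files):
        # `files` is a dict mapping path to text, standing in for a storage backend.
        self._files = files

    def read(self, path):
        return list(csv.DictReader(io.StringIO(self._files[path])))

    def write(self, path, rows):
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
        self._files[path] = buffer.getvalue()


manager = CsvRowsDataManager(files={})
manager.write("out.csv", [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}])
print(manager.read("out.csv"))
```

The data manager only deals with turning rows into text and back, while the mapping it receives decides where that text is kept, mirroring the split between Data Backends and Storage Backends.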