=====================================
Writing your first Gluepy app, part 2
=====================================

In the previous step, :doc:`/intro/tutorial01`, we set up our project, created our ``forecaster`` module, and ran our first ``SampleDAG``.
Next, we will write some code of our own that introduces how Gluepy helps you access your data, by writing a simple forecaster training step.

In this tutorial we will:

* Create a ``GetTrainingDataTask`` that generates some sample training data for us.
* Create a ``TrainingTask`` that uses ``xgboost`` to train a regression model.


Generate Training Data
======================


The first step is to modify our ``forecaster/tasks.py`` file and add the following class:


.. code-block:: python

   from datetime import datetime
   from gluepy.exec import Task
   from gluepy.files.data import data_manager
   import pandas as pd


   class GetTrainingDataTask(Task):
       label = "get-data"

       def run(self):
           """Entrypoint to our task"""
           # Generating a sample dataframe to later
           # be used for training. In a real project
           # you would likely instead read data from a
           # pre-existing dataset.
           df = pd.DataFrame({
               "units": [10, 5, 3, 11, 8],
               "date": [
                    datetime(2024, 1, 1),
                    datetime(2024, 1, 2),
                    datetime(2024, 1, 3),
                    datetime(2024, 1, 4),
                    datetime(2024, 1, 5)
               ],
               "article_id": [1, 1, 1, 1, 1]
           })

           # Write dataset to run_folder.
           data_manager.write("training.csv", df, index=False)


This does a few simple things:

* Define a new :ref:`tasks` instance where the business logic lives in the ``run()`` method.
* Generate mock data as a pandas ``DataFrame`` that will later be used for training.
* Use the :ref:`data_manager <topic_data>` to store the dataframe in the run's :ref:`context_run_folder`.
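
Because the dataset is stored as CSV, the file itself only holds text and does not record column types. The sketch below illustrates this with the standard library ``csv`` module alone. It does not use Gluepy or pandas, and the two rows are just a shortened copy of the mock data above. When reading with pandas, numeric columns are inferred again, but a date column typically comes back as strings unless it is parsed explicitly.

```python
import csv
import io
from datetime import datetime

# A shortened copy of the mock data used in GetTrainingDataTask.
rows = [
    {"units": 10, "date": datetime(2024, 1, 1), "article_id": 1},
    {"units": 5, "date": datetime(2024, 1, 2), "article_id": 1},
]

# Write the rows to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["units", "date", "article_id"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "date": row["date"].date().isoformat()})

# Read the rows back: every value is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0])
# {'units': '10', 'date': '2024-01-01', 'article_id': '1'}
```

This is why a later step that consumes ``training.csv`` has to decide how to interpret the ``date`` column.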


Next up, we need to add this new :ref:`tasks` to our ``SampleDAG`` within our ``forecaster/dags.py`` file.


.. code-block:: python

   from gluepy.exec import DAG
   from .tasks import GetTrainingDataTask


   class SampleDAG(DAG):
       label = "sample"
       tasks = [GetTrainingDataTask]


Now let's try running our ``SampleDAG`` again with the :ref:`cli_dag` to ensure it is still working.


.. code-block:: bash

   $ python manage.py dag sample
   INFO 2024-06-25 12:48:44,061 dag - ---------- Started task 'BootstrapTask'
   DEBUG 2024-06-25 12:48:44,062 tasks -
            Run ID: e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4
            Run Folder: runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4

   DEBUG 2024-06-25 12:48:44,062 local - Writing file to path '/demo/data/runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4/context.yaml'.
   INFO 2024-06-25 12:48:44,062 dag - ---------- Completed task 'BootstrapTask' in 0.000991 seconds
   INFO 2024-06-25 12:48:44,062 dag - ---------- Started task 'GetTrainingDataTask'
   INFO 2024-06-25 12:48:44,064 pandas - Writing file to path 'runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4/training.csv'.
   DEBUG 2024-06-25 12:48:44,102 local - Writing file to path '/demo/data/runs/2024/6/25/e7966509-ca8f-4e12-8c9f-7b0b1c2fcfd4/training.csv'.
   INFO 2024-06-25 12:48:44,102 dag - ---------- Completed task 'GetTrainingDataTask' in 0.039749 seconds
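
The log shows the tasks running one after another, with a built-in ``BootstrapTask`` first. As a rough mental model only, the sketch below uses hypothetical stand-in classes (``Task``, ``DAG``, ``FirstTask``, ``SecondTask``, ``SketchDAG`` and ``run_dag`` are all invented for illustration) to show the idea of running a list of task classes in order. It is not Gluepy's implementation, which also handles bootstrapping, logging and the run context.

```python
class Task:
    """Hypothetical stand-in for a task: a label plus a run() method."""

    label = ""

    def run(self):
        raise NotImplementedError


class FirstTask(Task):
    label = "first"

    def run(self):
        print("running", self.label)


class SecondTask(Task):
    label = "second"

    def run(self):
        print("running", self.label)


class DAG:
    """Hypothetical stand-in for a DAG: a label plus an ordered task list."""

    label = ""
    tasks = []


class SketchDAG(DAG):
    label = "sketch"
    tasks = [FirstTask, SecondTask]


def run_dag(dag_cls):
    """Instantiate and run every task in the order it is listed."""
    executed = []
    for task_cls in dag_cls.tasks:
        task_cls().run()
        executed.append(task_cls.label)
    return executed


order = run_dag(SketchDAG)
print(order)  # ['first', 'second']
```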


Training our Machine Learning Model
===================================

Next, it is time to create our ``TrainingTask``. First, we must install some additional dependencies that we will use to train our
machine learning model.

.. code-block:: bash

   $ pip install scikit-learn xgboost


After installing ``xgboost`` and ``scikit-learn`` we can add our ``TrainingTask`` class to ``forecaster/tasks.py``.


.. code-block:: python

   import os
   import io
   import pickle
   import xgboost as xgb
   from gluepy.exec import Task
   from gluepy.conf import default_context
   from gluepy.files.data import data_manager
   from gluepy.files.storages import default_storage
   import pandas as pd


   class TrainingTask(Task):
       label = "training"

       def run(self):
           # Read the training dataset previously generated in
           # ``GetTrainingDataTask``. The path is automatically
           # formatted to read from the run_folder to ensure data versioning
           # and isolation of output between executions.
           df: pd.DataFrame = data_manager.read("training.csv")
           # Cast the date column to pandas' categorical dtype so that it
           # can be used with ``enable_categorical=True`` below.
           df["date"] = df["date"].astype("category")

           # Train our machine learning model.
           model = xgb.XGBRegressor(enable_categorical=True)
           model.fit(df[["date", "article_id"]], df["units"])

           # Store model to disk to later be used when we want
           # to do inference.
           stream = io.BytesIO()
           pickle.dump(model, stream)
           stream.seek(0)
           default_storage.touch(
               os.path.join(default_context.gluepy.run_folder, "model.pkl"), stream
           )
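
The ``stream.seek(0)`` call in the task matters because ``pickle.dump`` leaves the stream position at the end of the written bytes. The standard-library sketch below shows the round trip through an in-memory buffer, with a plain dictionary standing in for the trained model. Whether ``default_storage.touch`` reads from the current stream position is an assumption here; rewinding is the safe choice either way.

```python
import io
import pickle

# Any picklable object works; a dict stands in for the trained model here.
model = {"name": "placeholder-model", "weights": [0.1, 0.2, 0.3]}

stream = io.BytesIO()
pickle.dump(model, stream)

# After dump() the position is at the end, so a read returns nothing.
leftover = stream.read()
print(leftover)  # b''

# Rewind so that a consumer reading the stream sees the full payload.
stream.seek(0)
restored = pickle.load(stream)
print(restored == model)  # True
```

The same ``pickle.load`` call is what an inference step would use to restore the model from ``model.pkl``. Only unpickle files you trust, since loading a pickle can execute arbitrary code.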

We must then add our new ``TrainingTask`` to our ``SampleDAG``:

.. code-block:: python

   from gluepy.exec import DAG
   from .tasks import GetTrainingDataTask, TrainingTask


   class SampleDAG(DAG):
       label = "sample"
       tasks = [GetTrainingDataTask, TrainingTask]


Finally, we can try executing our updated ``SampleDAG`` using the :ref:`cli_dag`.

.. code-block:: bash

   $ python manage.py dag sample
   INFO 2024-06-25 13:10:37,903 dag - ---------- Started task 'BootstrapTask'
   DEBUG 2024-06-25 13:10:37,903 tasks -
            Run ID: c29b8b49-dee9-4984-8ccc-860651780054
            Run Folder: runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054

   DEBUG 2024-06-25 13:10:37,904 local - Writing file to path '/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/context.yaml'.
   INFO 2024-06-25 13:10:37,904 dag - ---------- Completed task 'BootstrapTask' in 0.001035 seconds
   INFO 2024-06-25 13:10:37,904 dag - ---------- Started task 'GetTrainingDataTask'
   INFO 2024-06-25 13:10:37,905 pandas - Writing file to path 'runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv'.
   DEBUG 2024-06-25 13:10:37,906 local - Writing file to path '/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv'.
   INFO 2024-06-25 13:10:37,906 dag - ---------- Completed task 'GetTrainingDataTask' in 0.002413 seconds
   INFO 2024-06-25 13:10:37,906 dag - ---------- Started task 'TrainingTask'
   DEBUG 2024-06-25 13:10:37,906 local - Reading file from path '/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv'.
   INFO 2024-06-25 13:10:37,906 pandas - Reading file from path 'training.csv'.
   INFO 2024-06-25 13:10:37,945 dag - ---------- Completed task 'TrainingTask' in 0.038396 seconds


Note the following:

* Gluepy automatically assigns a unique :ref:`context_run_id` to this training run of our model.
* Gluepy automatically assigns a unique :ref:`context_run_folder` to version and isolate the :ref:`context` and all our output.
* Gluepy automatically serializes and saves our :ref:`context` to ``context.yaml`` within the run folder.
* Even though we only tell our code to read and write ``training.csv``, Gluepy automatically formats the path to ``/demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv``.
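
To make the path formatting concrete, here is a minimal sketch that reproduces the layout seen in the log output, ``runs/<year>/<month>/<day>/<run_id>`` under a storage root. The helper names ``build_run_folder`` and ``resolve`` and the ``root`` parameter are hypothetical and only mimic the observed paths; they are not Gluepy's API.

```python
import posixpath
from datetime import date


def build_run_folder(run_id, day):
    """Compose runs/<year>/<month>/<day>/<run_id>, as seen in the log."""
    return posixpath.join(
        "runs", str(day.year), str(day.month), str(day.day), run_id
    )


def resolve(run_folder, relative_path, root="/demo/data"):
    """Prefix a relative file name with the storage root and run folder."""
    return posixpath.join(root, run_folder, relative_path)


run_folder = build_run_folder(
    "c29b8b49-dee9-4984-8ccc-860651780054", date(2024, 6, 25)
)
full_path = resolve(run_folder, "training.csv")
print(full_path)
# /demo/data/runs/2024/6/25/c29b8b49-dee9-4984-8ccc-860651780054/training.csv
```

Because every run gets its own folder, two executions that both write ``training.csv`` never overwrite each other's output.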


We have now created a set of custom :ref:`tasks` instances that use the ``data_manager`` object and :ref:`topic_data` to read and write data. Next, we will see how we can use :ref:`context` to parameterize our project and make it easier to configure in the future.