    Kedro as a Data Pipeline in 10 Minutes

    Development Workflow Framework made easy — explained with Python examples

    Update: This article is part of a series. Check out other “in 10 Minutes” topics here!

    In a data science project, various coding components can be thought of as a flow of data — data flows from the source, to feature engineering, to modelling, to evaluation, etc. This flow is made more complex with training, evaluation, and scoring pipelines where the flow for each pipeline can be potentially very different.

    Kedro is a Python framework that helps structure code into modular data pipelines. Kedro allows reproducible and easy (one-line command!) running of different pipelines, and even ad-hoc rerunning of a small portion of a pipeline. This article covers the components and terminology used in Kedro, with Python examples on how to set up, configure, and run Kedro pipelines.

    Kedro is the first open-source software tool developed by McKinsey and was recently donated to the Linux Foundation. It is a Python framework for creating reproducible, maintainable, and modular code.

    Kedro combines the best practices of software engineering with the world of data science

    There are several components of Kedro (tied together in a toy sketch after this list), namely

    • Node: Function wrapper; bundles a function together with its inputs and outputs (defines what code should run)
    • Pipeline: Links nodes together; resolves dependencies and determines the execution order of functions (defines the order in which the code should run)
    • DataCatalog: Wrapper for data; links the input and output names specified in nodes to file paths
    • Runner: Object that determines how the pipeline is run, for example sequentially or in parallel (defines how the code should be run)
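
    To make these terms concrete, here is a toy sketch (not from the article) that wires the four components together entirely in memory; it assumes a Kedro version from this article's era, and the function and dataset names are placeholders.

    # Toy example: one node, one pipeline, an in-memory catalog, and a runner.
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import SequentialRunner


    def double(x: int) -> int:
        # Plain Python function holding the project logic
        return x * 2


    # Node: wraps the function with named inputs and outputs
    double_node = node(func=double, inputs="x", outputs="x_doubled", name="double_node")

    # Pipeline: links nodes and resolves their execution order
    toy_pipeline = Pipeline([double_node])

    # DataCatalog: maps the names used by nodes to actual datasets
    catalog = DataCatalog({"x": MemoryDataSet(21)})

    # Runner: decides how the pipeline is executed (sequentially here);
    # outputs not registered in the catalog are returned as a dictionary
    print(SequentialRunner().run(toy_pipeline, catalog))  # {'x_doubled': 42}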

    Kedro can be installed as a Python package, and a new project instantiated, with the following commands,

    pip install kedro
    kedro new

    Note that this setup only needs to be run once. The program will prompt for your project name and create a folder structure within your repository.

    To instantiate a project from an existing Kedro starter template, we can use the command kedro new --starter=https://github.com/your-repo.git instead.

    Now, let’s code some pipelines!

    After instantiating Kedro, several folders and files are added to the project. The project should have the following folder structure,

    project-name
    ├── conf (configuration files)
    │   ├── base
    │   └── local
    ├── data (project data)
    │   ├── 01_raw
    │   ├── 02_intermediate
    │   ├── 03_primary
    │   ├── 04_feature
    │   ├── 05_model_input
    │   ├── 06_models
    │   ├── 07_model_output
    │   └── 08_reporting
    ├── docs (project documentation)
    │   └── source
    ├── logs (project output logs)
    ├── notebooks (project jupyter notebooks)
    └── src (project source code)
        ├── project_name
        │   └── pipelines
        └── tests
            └── pipelines

    conf

    The conf folder is used to store configuration files. The base subfolder stores shared project-related configurations, while the local subfolder stores user-related or confidential configurations. Configurations in the local subfolder should never be committed to the code repository, and this setting is handled in the .gitignore file.

    Sample configuration files are created in the base subfolder for DataCatalog, logging, and parameters, whereas sample configuration files in the local subfolder are for storing credentials.
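
    For instance, a hypothetical conf/local/credentials.yml could look like the snippet below (the dev_s3 key name and values are placeholders); catalog entries can then reference the credentials by name.

    # conf/local/credentials.yml -- never commit this file
    dev_s3:
      key: YOUR_AWS_ACCESS_KEY_ID
      secret: YOUR_AWS_SECRET_ACCESS_KEY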

    data

    The data folder stores the input, intermediate, and output data created at various data pipeline stages. Similar to local configurations, files in the data folder should never be committed to the code repository.

    Since Kedro was created as a data science framework, the default folder structure corresponds to the various stages of a data science project — which you can customize further if required.

    src

    The src folder is where the bulk of the code lives. It contains the project code, node and pipeline definitions, runner code, and more.

    The high-level Kedro workflow follows this order,

    1. Define Conf: Specify DataCatalog and parameters
    2. Write Functions: Wrap project codes into functions
    3. Define Node(s): Wrap function into nodes
    4. Define Pipeline(s): Link nodes together
    5. Define Runner: Specify how the pipeline should be run
    6. Run Pipeline: Read conf, use runner, and run the pipeline!

    Breaking down the workflow, we will go through each step in more detail.

    №1: Define Conf

    The DataCatalog links a dataset name with its corresponding file type and location, and is defined in the conf/base/catalog.yml file. It can be defined as such,

    input_data:
      type: pandas.CSVDataSet
      filepath: data/01_raw/iris.csv

    intermediate_data:
      type: pandas.CSVDataSet
      filepath: data/02_intermediate/iris_small.csv

    processed_data:
      type: pandas.CSVDataSet
      filepath: data/03_primary/iris_processed.csv

    Note that we do not need all the files to be present in the data folder; only the input data is required, since the intermediate and output data will be created when the pipeline runs.

    Parameters are input to functions and are defined as a nested dictionary in the conf/base/parameters.yml file,

    input_param:
      n_rows: 100

    №2: Write Functions

    Functions are project-specific code defined within the src folder. The DataCatalog entries and parameters defined in the previous step are handled automatically, such that the inputs to a function arrive as a pandas DataFrame and a dictionary respectively, and the output is saved as a pandas DataFrame. We do not need to perform any file reading or saving operations — this abstracts away the file I/O code!

    In the advanced section below, we will write custom I/O classes that specify custom reading and saving methods, which can help connect to external databases.

    For this demonstration, let us assume I have written two functions with the following signatures (a sketch of their bodies follows the list)

    • load_and_truncate_data(data: pd.DataFrame, input_param: Dict[str, Any]) -> pd.DataFrame
    • drop_null_data(data: pd.DataFrame) -> pd.DataFrame
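
    One possible implementation consistent with these signatures and the parameters.yml above could be the sketch below (my code, not the article's).

    from typing import Any, Dict

    import pandas as pd


    def load_and_truncate_data(data: pd.DataFrame, input_param: Dict[str, Any]) -> pd.DataFrame:
        # Keep only the first n_rows rows, as configured in parameters.yml
        return data.head(input_param["n_rows"])


    def drop_null_data(data: pd.DataFrame) -> pd.DataFrame:
        # Drop rows containing any null values
        return data.dropna()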

    №3: Define Node(s)

    The Node object wraps the inputs and outputs of a function (defined in Step 1) around the function itself (written in Step 2). It is optional to specify the node name, but node names help when running specific nodes instead of the whole pipeline. To minimize the number of imports, I prefer to define nodes and pipelines in the same file.

    №4: Define Pipeline(s)

    The Pipeline object links several nodes together and is defined in the src/project_name/pipelines folder. The pipeline and nodes can be defined as such,

    https://gist.github.com/kayjan/f2f72b72fc7d42f506d11a01a7ce57d8
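
    Since the embedded gist does not render here, below is a sketch of what such a pipeline definition could look like, reusing the catalog entries and functions from the earlier steps; the import path of the functions is an assumption.

    # A sketch of the pipeline definition (the original lives in the linked gist).
    from kedro.pipeline import Pipeline, node

    from project_name.nodes import drop_null_data, load_and_truncate_data  # hypothetical module


    def processing_pipeline() -> Pipeline:
        return Pipeline(
            [
                node(
                    func=load_and_truncate_data,
                    inputs=["input_data", "params:input_param"],  # parameters use the "params:" prefix
                    outputs="intermediate_data",
                    name="load_and_truncate_data",
                ),
                node(
                    func=drop_null_data,
                    inputs="intermediate_data",
                    outputs="processed_data",
                    name="drop_null_data",
                ),
            ]
        )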

    The kedro.pipeline.Pipeline object chains together multiple kedro.pipeline.node objects, where each node wraps the function as func, the input data and parameters as inputs, the output as outputs, and the node name as name.

    Finally, the pipeline has to be registered in the pipeline_registry.py file. The bulk of the code is already generated; we only need to add our pipeline to the dictionary in the return statement,

    def register_pipelines() -> Dict[str, Pipeline]:
        return {
            "pipeline1": pipeline([processing_pipeline()]),
            "__default__": pipeline([processing_pipeline()]),
        }

    From the pipeline definition, we observe that the code is extendable — we can easily define separate training, validation, and scoring pipelines, each running a different sequence of nodes!
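
    For instance, such a registry could look like the following sketch; training_pipeline, validation_pipeline, and scoring_pipeline are hypothetical helpers not defined in this article.

    # Hypothetical multi-pipeline registry; the three pipeline helpers are assumptions.
    from typing import Dict

    from kedro.pipeline import Pipeline, pipeline

    from project_name.pipelines import scoring_pipeline, training_pipeline, validation_pipeline


    def register_pipelines() -> Dict[str, Pipeline]:
        return {
            "training": pipeline([training_pipeline()]),
            "validation": pipeline([validation_pipeline()]),
            "scoring": pipeline([scoring_pipeline()]),
            "__default__": pipeline([training_pipeline(), scoring_pipeline()]),
        }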

    №5: Define Runner

    The Runner object specifies how the pipeline is run. By default, the SequentialRunner is selected to run the nodes sequentially. There are also ParallelRunner (multiprocessing) and ThreadRunner (multithreading).

    A custom runner can also be used, for instance, to handle node failures by skipping the node or allowing retries.

    №6: Run Pipeline

    The Pipeline can be run using code or the Command Line Interface (CLI). It is recommended to run Kedro using the CLI as it ensures there is no variable leakage — and it is also simple: the pipeline can be run with the command kedro run! A sketch of the code-based alternative follows the CLI examples below.

    Figure 1: Terminal output from running pipeline — Image by author

    In the sample run above, we observe the sequence in which Kedro runs the pipeline; it loads the data and parameters before running each node, saves the data afterwards, and then reports that the task has completed. The logs also allow us to easily identify the data name, data type, node name, function name, and function signature at a glance.

    For more variations of running the pipeline, we can add options such as,

    kedro run --pipeline=pipeline1
    kedro run --node=load_and_truncate_data
    kedro run --from-nodes=node1 --to-nodes=node2
    kedro run --runner=ParallelRunner
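
    If you do prefer running the pipeline from code (for example, inside a notebook), a minimal sketch could look like the one below. It assumes a Kedro version from this article's era, rebuilds the catalog entries defined earlier in code, and imports processing_pipeline from a hypothetical module path.

    # Programmatic run (a sketch; `kedro run` via the CLI remains the recommended route).
    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.runner import SequentialRunner

    from project_name.pipelines import processing_pipeline  # hypothetical import path

    catalog = DataCatalog(
        {
            "input_data": CSVDataSet(filepath="data/01_raw/iris.csv"),
            "intermediate_data": CSVDataSet(filepath="data/02_intermediate/iris_small.csv"),
            "processed_data": CSVDataSet(filepath="data/03_primary/iris_processed.csv"),
            # parameters are exposed to nodes under the "params:" prefix
            "params:input_param": MemoryDataSet({"n_rows": 100}),
        }
    )

    SequentialRunner().run(processing_pipeline(), catalog)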

    Add documentation, unit testing, and linting capabilities

    As mentioned in the overview, Kedro makes it easy to implement coding best practices from software engineering.

    The command kedro test runs all the unit tests in the project, following the Python unittest or pytest framework, while the command kedro lint performs linting using the flake8, isort, and black Python packages. These CLI commands work seamlessly with no additional effort required from the user, other than writing the test cases.
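
    For example, a test for the drop_null_data function sketched earlier could live under the src/tests folder and look like this; the import path is an assumption.

    # A hypothetical unit test, picked up by `kedro test` (or plain pytest).
    import numpy as np
    import pandas as pd

    from project_name.nodes import drop_null_data  # hypothetical module path


    def test_drop_null_data_removes_rows_with_nulls():
        data = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
        result = drop_null_data(data)
        assert len(result) == 2
        assert result["a"].notna().all()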

    It is also good practice to write docstrings and documentation for your source code so that the repository can be understood by both technical and non-technical audiences. The docs folder, which was created in the Kedro template, follows the Sphinx folder structure, which can render the documentation in a clean HTML format.
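
    In the Kedro versions this article was written against, the Sphinx documentation could be built with a single command (newer releases may have moved or removed it),

    kedro build-docs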

    Kedro DataCatalog supports reading and saving datasets that are on local or network file systems, Hadoop File System (HDFS), Amazon S3, Google Cloud, Azure, and HTTP(s). Additional arguments such as fs_args, load_args, and save_args can be passed as keys to the DataCatalog to specify parameters when interacting with the file system.
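
    As a hypothetical example, a catalog entry reading from and writing to S3 with extra arguments could look like the snippet below; the bucket path and the dev_s3 credentials key (defined in conf/local/credentials.yml) are assumptions.

    cloud_data:
      type: pandas.CSVDataSet
      filepath: s3://my-bucket/data/01_raw/iris.csv
      credentials: dev_s3
      load_args:
        sep: ","
      save_args:
        index: false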

    If more read/write operations are needed by the user, custom I/O classes can be written in the project_name/io folder. These custom classes use Kedro Abstract classes, such as AbstractDataSet and AbstractVersionedDataSet, as the base class. A sample custom I/O class that implements reading and writing pandas DataFrame from CSV files can be written as such,

    import pandas as pd

    from kedro.io import AbstractDataSet
    from typing import Any, Dict


    class PythonCSV(AbstractDataSet):
        def __init__(self, filepath: str, load_args: Dict[str, Any] = None, save_args: Dict[str, Any] = None):
            self._filepath = filepath
            self._load_args = load_args if load_args is not None else {}
            self._save_args = save_args if save_args is not None else {}

        def _load(self) -> pd.DataFrame:
            # Read the CSV file into a pandas DataFrame
            return pd.read_csv(self._filepath, **self._load_args)

        def _save(self, data: pd.DataFrame) -> None:
            # Write the pandas DataFrame back to the CSV file
            data.to_csv(self._filepath, **self._save_args)

        def _describe(self) -> Dict[str, Any]:
            # Required by AbstractDataSet; used when printing or logging the dataset
            return dict(filepath=self._filepath, load_args=self._load_args, save_args=self._save_args)

    The corresponding DataCatalog can be defined as such,

    intermediate_data:
      type: project_name.io.python_csv.PythonCSV
      filepath: data/02_intermediate/iris_small.csv
      save_args:
        index: false

    Writing custom I/O classes with Kedro's Abstract classes as the base class is a neat way of extending Kedro's functionality to provide customization options — and if you want custom runner behaviour, it is done in the same way!

    Hope you gained a better understanding of Kedro as a development workflow framework. There are many more features that Kedro offers, such as data versioning, data transcoding, writing hooks, and writing custom CLI commands, to name a few.

    I find that the beauty of Kedro lies in its balance between strictness and the leeway it allows for customization

    For instance, Kedro enforces a folder structure, yet gives users the freedom to structure their code within the src folder. Kedro abstracts away the complexity of handling I/O operations, yet allows users to easily extend it and write custom I/O classes if needed.

    The intricacies of how Kedro works under the hood are also amazing

    The way that the runner, pipeline, and configurations are handled without having to explicitly state how they should interact with each other improves the coding experience. However, this abstraction inevitably results in less control and visibility of how the components are run.

    In terms of usability, Kedro forces the source code to be modular

    Since the node object references functions directly, there is no loose code lying around the codebase. This makes the codebase easier to maintain and makes it possible to run full, partial, or different pipelines in a reproducible manner. However, it can be space- and time-consuming, as every node loads and saves a dataset, resulting in many more I/O operations than if I were to write several functions and pass the output of one function directly to the next without saving and loading it every time. This is a tradeoff between the need for checkpointing and reproducibility versus the need for efficiency.

    This is the link for the GitHub repository used in the demonstration.

    Thank you for reading! If you liked this article, feel free to share it.

    Republished from Towards Data Science: https://towardsdatascience.com/kedro-as-a-data-pipeline-in-10-minutes-21c1a7c6bbb
