
    Top Python Packages for Feature Engineering

    Know these packages to improve your data workflow


    Feature engineering is the process of creating new features from existing data. Whether we simply add two columns together or combine more than a thousand features, the process counts as feature engineering.

    The feature engineering process is inherently different from data cleaning. While feature engineering creates additional features, data cleaning changes or removes existing ones.
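The contrast between the two activities can be shown in a minimal pandas sketch (the column names and values here are made up for illustration):

```python
import pandas as pd

# Toy data: two numeric columns (illustrative, not from the article's datasets)
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "tax": [1.0, 2.0, 3.0]})

# Feature engineering: CREATE a new feature from existing ones
df["total"] = df["price"] + df["tax"]

# Data cleaning, by contrast, CHANGES or REMOVES existing features,
# e.g. capping outliers in an existing column
df["price"] = df["price"].clip(upper=25.0)
```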

    Feature engineering is an essential part of the data workflow because it can massively improve project performance. For example, an empirical analysis by Heaton (2020) has shown that feature engineering improves the performance of various machine learning models.

    To help with the feature engineering process, this article will go through my top Python packages for feature engineering. Let’s get into it!

    1. Featuretools

    Featuretools is an open-source Python package developed by Alteryx to automate the feature engineering process. It is designed for deep feature creation from any features we have, especially from temporal and relational features.

    Deep Feature Synthesis (DFS) is the heart of Featuretools, as it allows us to acquire new features from our data quickly. How do we perform it? Let’s use the example dataset from Featuretools. First, we need to install the package.

    pip install featuretools

    Next, I would load the toy dataset that comes with the package to perform Deep Feature Synthesis.

    import featuretools as ft

    # Loading the mock data
    data = ft.demo.load_mock_customer()
    cust_df = data["customers"]
    session_df = data["sessions"]
    transaction_df = data["transactions"]
    All the datasets from Featuretools mock data (Image by Author)

    In the above dataset, we have three different connected datasets:

    • The customer table (unique customer)
    • The session table (unique session for the customer)
    • The transaction table (session transaction activity)

    All the datasets are connected through their respective keys. To use Featuretools DFS, we need to specify each table name and its primary key in a dictionary (if there is a DateTime feature, we also add it as a key).

    dataframes = {
        "customers": (cust_df, "customer_id"),
        "sessions": (session_df, "session_id", "session_start"),
        "transactions": (transaction_df, "transaction_id", "transaction_time"),
    }

    Then we also need to specify the relationships between the tables. This is important because DFS relies on these relationships to create the features.

    relationships = [
        ("sessions", "session_id", "transactions", "session_id"),
        ("customers", "customer_id", "sessions", "customer_id"),
    ]

    Finally, we can initiate the DFS process by running the following code. The important part is the target_dataframe_name parameter, which specifies the level at which the resulting features are aggregated. For example, this code produces features at the customer level.

    feature_matrix_customers, features_defs = ft.dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="customers",
    )
    feature_matrix_customers

    As we can see from the above picture, the customer data now contains various new features derived from the session and transaction tables, for example the count and mean of certain features. With a few lines of code, we have produced a lot of features.
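To build intuition for what DFS computes, here is a minimal plain-pandas sketch of a count and mean aggregated from child transactions up through sessions to the customer level (the rows below are hypothetical toy data, not the Featuretools mock dataset):

```python
import pandas as pd

# Hypothetical stand-ins for the transactions and sessions tables
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "session_id": [1, 1, 2, 2],
    "amount": [10.0, 20.0, 5.0, 15.0],
})
sessions = pd.DataFrame({"session_id": [1, 2], "customer_id": [100, 100]})

# Roughly what DFS does: walk the relationship chain (transactions -> sessions
# -> customers) and aggregate child rows at the target (customer) level
merged = transactions.merge(sessions, on="session_id")
customer_feats = merged.groupby("customer_id")["amount"].agg(["count", "mean"])
```

DFS automates exactly this kind of merge-then-aggregate work across every relationship and many aggregation primitives at once.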

    Of course, not all the features will be helpful for machine learning modelling, but that is the job of feature selection. Here, feature engineering is only concerned with creating the features.

    If you need an explanation of any feature, we can use the following function.

    feature = features_defs[10]
    ft.describe_feature(feature)

    2. Feature-Engine

    Feature-Engine is an open-source Python package for feature engineering and selection procedures. The package works as a transformer, similar to scikit-learn, with functions such as fit and transform.

    How valuable is Feature-Engine? It’s beneficial when you already have a machine learning pipeline in mind, especially if you use scikit-learn-based APIs. The Feature-Engine transformers are designed to work within the scikit-learn pipeline and interact similarly to the scikit-learn package.

    There are many APIs to try in the Feature-Engine package, but for this article’s purpose, we will focus only on the feature creation transformers: MathFeatures, RelativeFeatures, and CyclicalFeatures.

    Let’s try the transformers to test the feature engineering process. As a starter, I will use the example mpg dataset from seaborn.

    import seaborn as sns

    df = sns.load_dataset('mpg')

    First, I want to try the MathFeatures transformer for mathematical feature engineering. To do this, I set up the transformer with both the columns and the transformations we want.

    from feature_engine.creation import MathFeatures

    transformer = MathFeatures(
        variables=["mpg", "cylinders"],
        func=["sum", "min", "max", "std"],
    )

    After the setup, we could transform our original data using the transformer.

    df_t = transformer.fit_transform(df)
    df_t

    As we can see above, there are new columns from our feature engineering process, and the column names clearly state what happened in each calculation. Note that we can also pass our own function to the transformer to perform a custom calculation.
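As a sketch of that custom-function idea, here is the same row-wise computation in plain pandas (the function, values, and output column name are illustrative, not Feature-Engine's exact output):

```python
import pandas as pd

# Hypothetical custom function: the range (max - min) across the chosen columns
def value_range(row):
    return row.max() - row.min()

# Two toy rows standing in for the mpg dataset
df = pd.DataFrame({"mpg": [18.0, 15.0], "cylinders": [8, 8]})

# MathFeatures applies its functions row-wise across the selected variables;
# this is the plain-pandas equivalent of that behaviour
df["range_mpg_cylinders"] = df[["mpg", "cylinders"]].apply(value_range, axis=1)
```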

    We can also try the RelativeFeatures transformer, which uses a reference variable to develop new features.

    from feature_engine.creation import RelativeFeatures

    transformer = RelativeFeatures(
        variables=["mpg", "weight"],
        reference=["mpg"],
        func=["sub", "div", "mod"],
    )
    df_r = transformer.fit_transform(df)
    df_r

    As we can see from the result, the newly created columns are all based on the reference feature (‘mpg’); for example, weight subtracted by mpg. This way, we can quickly develop features relative to the feature we choose.
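For intuition, the following plain-pandas sketch reproduces two of those relative features by hand (toy values; the column names mimic, but are not guaranteed to match, Feature-Engine's naming scheme):

```python
import pandas as pd

# Two toy rows standing in for the mpg dataset
df = pd.DataFrame({"mpg": [18.0, 15.0], "weight": [3504.0, 3693.0]})

# What "sub" and "div" produce with reference column "mpg":
# each variable combined arithmetically with the reference
df["weight_sub_mpg"] = df["weight"] - df["mpg"]
df["weight_div_mpg"] = df["weight"] / df["mpg"]
```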

    3. Tsfresh

    Tsfresh is an open-source Python package for feature engineering on time-series and sequential data. The package allows us to create thousands of new features with a few lines of code. Moreover, the package is compatible with scikit-learn methods, which enables us to incorporate it into a pipeline.

    Feature engineering with Tsfresh is a bit different: the extracted features describe each time series as a whole, so they cannot be fed straight into model training on the raw rows. An additional step is needed to join them with the training data.
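A minimal sketch of that extra step, assuming hypothetical per-stock feature rows and a made-up target series (the feature name and labels below are invented for illustration):

```python
import pandas as pd

# Hypothetical: one row of extracted features per stock name...
extracted = pd.DataFrame({"close__mean": [105.2, 98.7]}, index=["AAPL", "MSFT"])
# ...plus a per-stock target we want to predict
labels = pd.Series([1, 0], index=["AAPL", "MSFT"], name="went_up")

# The extra step: align the per-series feature rows with per-series targets
train = extracted.join(labels)
X, y = train.drop(columns="went_up"), train["went_up"]
```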

    Let’s try the package with an example dataset. For this sample, I will use the DJIA 30 stock data from Kaggle (License: CC0: Public Domain), restricted to stock data from 2017 only. Let’s read the dataset.

    import pandas as pd

    df = pd.read_csv('all_stocks.csv')

    # cleaning the data by dropping nan values
    df = df.dropna().reset_index(drop=True)
    df

    The stock data contains the Date column as the time index and the Name column as the stock reference. The other columns are the values we are interested in describing with Tsfresh.

    Let’s try out the package using the following code.

    from tsfresh import extract_features
    extracted_features = extract_features(df, column_id="Name", column_sort="Date")
    extracted_features
    extracted_features.columns

    As we can see from the above result, the extraction produced around 3,945 new features. These features all describe the available columns and are ready to use. If you want to know what each feature means, you can read the descriptions in the Tsfresh documentation.

    We can also use feature selection functions to keep only the relevant features; the Tsfresh documentation covers its feature selection API.
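As a language-agnostic illustration of the idea behind such filtering, here is a crude correlation-based filter in plain pandas (this is not Tsfresh's hypothesis-test approach, and the data is made up):

```python
import pandas as pd

# Toy features: f1 tracks the target perfectly, f2 is constant (uninformative)
X = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [5, 5, 5, 5]})
y = pd.Series([1, 2, 3, 4])

# Keep features whose absolute correlation with the target passes a threshold;
# a constant column yields NaN correlation, which we treat as irrelevant
corr = X.apply(lambda col: col.corr(y)).abs()
relevant = X.loc[:, corr.fillna(0) > 0.5]
```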

    Feature engineering is the activity of creating new features from an existing dataset. It has been shown to help improve data science projects, but creating features manually can take a long time.

    This article has shown my top Python packages for the feature engineering process. They are:

    1. Featuretools
    2. Feature-Engine
    3. Tsfresh

    I hope it helps!

    Visit me on my Social Media to have a more in-depth conversation or any questions.

    If you are not subscribed as a Medium Member, please consider subscribing through my referral.

    Republished from https://towardsdatascience.com/top-python-packages-for-feature-engineering-c0a75dba0081
