## An efficient tool for exploratory data analysis

The groupby is one of the most frequently used Pandas functions in data analysis. It is used for grouping the data points (i.e. rows) based on the distinct values in the given column or columns. We can then calculate aggregated values for the generated groups.

If we have a dataset that contains brand and price information for cars, the groupby function can be used for calculating the average price for each brand.

In this article, we will go over 25 examples to try to discover the full potential of the groupby function. Even if you are comfortable with using this function, I suggest you keep reading because we will also cover operations that are not so-commonly-used but come in handy for a variety of tasks.

I have created a sales dataset with mock data. Let’s start by creating a DataFrame with this dataset.

import pandas as pdsales = pd.read_csv("sales_data.csv")sales.head()

**Example 1: Single aggregation**

We can calculate the average stock quantity for each store as follows:

sales.groupby("store")["stock_qty"].mean()Output:store

Daisy 1811.861702

Rose 1677.680000

Violet 14622.406061

Name: stock_qty, dtype: float64

**Example 2: Multiple aggregations**

We can do multiple aggregations in a single operation. Here is how we can calculate the average stock quantity and price for each store.

sales.groupby("store")[["stock_qty","price"]].mean()Output:

Make sure to write the columns to be aggregated in a Python list.

**Example 3: Multiple aggregations — 2**

We can also use the agg function for calculating multiple aggregate values.

sales.groupby("store")["stock_qty"].agg(["mean", "max"])Output

**Example 4: Named aggregations**

In the previous two examples, it’s not clear what the aggregate columns represent. For instance, “mean” does not tell us it is the mean of stock quantity. In such cases, we can use named aggregations.

sales.groupby("store").agg(avg_stock_qty = ("stock_qty", "mean"),

)

max_stock_qty = ("stock_qty", "max")Output

The column to be aggregated and the function name are written in a tuple.

**Example 5: Multiple aggregations and multiple functions**

sales.groupby("store")[["stock_qty","price"]].agg(["mean", "max"])Output

**Example 6: Named aggregations using different columns**

We can use named aggregations with different columns and functions.

sales.groupby("store").agg(avg_stock_qty = ("stock_qty", "mean"),

)

avg_price = ("price", "mean")Output

**Example 7: as_index parameter**

If the output of a groupby operation is DataFrame, the group values are shown in the index. We can make them a column in the DataFrame using the as_index parameter.

sales.groupby("store", as_index=False).agg(avg_stock_qty = ("stock_qty", "mean"),

)

avg_price = ("price", "mean")Output

**Example 8: Multiple columns for grouping**

Just like we can aggregate multiple columns, we can use multiple columns for grouping.

sales.groupby(["store","product_group"], as_index=False).agg(avg_sales = ("last_week_sales", "mean")

).head()

Output

A group is generated for each store-product group combination.

**Example 9: Sorting the output**

We can sort the output based on the aggregated columns using the sort_values function.

sales.groupby(["store","product_group"], as_index=False).agg( avg_sales = ("last_week_sales", "mean")).sort_values(by="avg_sales", ascending=False).head()

Output

The rows are sorted in descending order based on the average sales values.

**Example 10: Largest n values**

The max function returns the maximum value for each group. If we need the largest n of values, we can use the nlargest function.

# largest 2 values

sales.groupby("store")["last_week_sales"].nlargest(2)Outputstore

Daisy 413 1883

231 947

Rose 948 883

263 623

Violet 991 3222

339 2690

Name: last_week_sales, dtype: int64

We can see both the values and the index of their rows, which can be used for accessing the entire row.

**Example 11: Smallest n values**

The nsmallest function returns the n smallest values for each group.

# smallest 2 values

sales.groupby("store")["last_week_sales"].nsmallest(2)Output

store

Daisy 23 12

52 12

Rose 304 12

320 12

Violet 20 12

21 12

Name: last_week_sales, dtype: int64

**Example 12: The nth value**

We can also find the nth value in a group. Let’s first sort the sales DataFrame by the store and last month’s sales columns.

`sales_sorted = sales.sort_values(by=["store","last_month_sales"], ascending=False, ignore_index=True)`

We can find the products in each store with the 5th highest last month sales as follows:

sales_sorted.groupby("store").nth(4)Output

The output contains the 5th row for each group. Since rows are sorted based on the last month sales values, we get the rows with 5th highest last month sales.

**Example 13: The nth with negative index**

We can use the nth with negative values as well. For instance, “nth(-2)” returns the second rows from the end.

sales_sorted.groupby("store").nth(-2)Output

**Example 14: Unique values**

The unique function can be used for finding the unique values in each group. For instance, we can find the unique product codes in each group as follows:

sales.groupby("store", as_index=False).agg( unique_values = ("product_code","unique"))Output

**Example 15: Number of unique values**

We can also find the number of unique values in each group using the nunique function.

sales.groupby("store", as_index=False).agg( number_of_unique_values = ("product_code","nunique"))Output

**Example 16: Lambda expressions**

We can use lambda expressions as aggregations in the agg function.

sales.groupby("store").agg(total_sales_in_thousands = (

"last_month_sales",

lambda x: round(x.sum() / 1000, 1)

))

Output

**Example 17: Lambda expressions with apply**

Lambda expressions can be applied to each group using the apply function. For instance, we can calculate the average of the difference between the last week sales and one fourth of the last month sales for each store as follows:

sales.groupby("store").apply( lambda x: (x.last_week_sales - x.last_month_sales / 4).mean())Outputstore

Daisy 5.094149

Rose 5.326250

Violet 8.965152

dtype: float64

**Example 18: The dropna paramater.**

The groupby function ignores the missing values by default. Thus, if there is a missing value in the column used for grouping, it will not be included in any group and not shown separately. We can change this behavior using the dropna parameter.

Let’s first add a new row with a missing store value.

`sales.loc[1000] = [None, "PG2", 10000, 120, 64, 96, 15, 53]`

We will calculate the average price for each store with and without the dropna parameter to see the difference.

# without dropna

sales.groupby("store")["price"].mean()Output

store

Daisy 69.327426

Rose 60.513700

Violet 67.808727

Name: price, dtype: float64------------------------------------------------------------------

# with dropna

sales.groupby("store", dropna=False)["price"].mean()Output

store

Daisy 69.327426

Rose 60.513700

Violet 67.808727

NaN 96.000000

Name: price, dtype: float64

*Note**: In order to use the dropna parameter of the groupby function, you need to have pandas version 1.1.0 or higher.*

**Example 19: How many groups**

We sometimes need to know how many groups are generated, which can be found using the ngroups method.

sales.groupby(["store", "product_group"]).ngroupsOutput

18

There are 18 different combinations of the distinct values in the store and product group columns.

**Example 20: Getting a particular group**

The get_group function can be used for getting a particular group as a DataFrame. Groups are defined with the values in the column used for grouping. In the case of using multiple columns for grouping, the values are written in a tuple.

For instance, we can get the rows that belong to store “Daisy” and product group “PG1” as follows:

daisy_pg1 = sales.groupby( ["store", "product_group"]).get_group(("Daisy","PG1"))daisy_pg1.head()Output

**Example 21: Assigning a rank**

The rank function is used for assigning a rank to the rows based on the values in the given column. We can use the rank and the groupby functions to rank rows within each group separately.

sales["rank"] = sales.groupby("store"["price"].rank( ascending=False, method="dense")sales.head()Output

**Example 22: Cumulative sum**

We can calculate the cumulative sum within each group. Let’s create a new DataFrame with simple time series data.

import numpy as npdf = pd.DataFrame(

{

"date": pd.date_range(start="2022-08-01", periods=8, freq="D"),

"category": list("AAAABBBB"),

"value": np.random.randint(10, 30, size=8)

}

)dfOutput

We can create a column that contains the cumulative sum for the value column as follows:

df["cum_sum"] = df.groupby("category")["value"].cumsum()dfOutput

**Example 23: Cumulative sum with expanding**

The expanding function provides expanding transformations. We still need a function to do aggregations such as mean and sum. If it is used with the sum function, the result will be the same as with the cumsum function.

df["cum_sum_2"] = df.groupby(

"category"

)["value"].expanding().sum().valuesdfOutput

**Example 24: Cumulative mean**

We can also calculate cumulative average by using the expanding and mean functions.

df["cum_mean"] = df.groupby(

"category"

)["value"].expanding().mean().valuesdfOutput

**Example 25: Current highest with expanding**

The expanding and max functions can be used for keeping a record of the current maximum value in a group.

df["current_highest"] = df.groupby(

"category"

)["value"].expanding().max().valuesdfOutput

**Final thoughts**

The groupby function together with the aggregate functions make a highly efficient tool for data analysis. They help us extract informative insights from a dataset.

The examples we have done in this article cover most of the use cases of the groupby function.

*You can become a **Medium member** to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to **subscribe** if you’d like to get an email whenever I publish a new article.*

Thank you for reading. Please let me know if you have any feedback.

All About Pandas Groupby Explained with 25 Examples Republished from Source https://towardsdatascience.com/all-about-pandas-groupby-explained-with-25-examples-494e04a8ef56?source=rss----7f60cf5620c9---4 via https://towardsdatascience.com/feed