More

    How to Merge Pandas DataFrames

    How to Avoid Losing Valuable Data Points (incl. Cheat Sheet)

    Merged pandas (Image by the author)

    One common pitfall when merging two DataFrames is unintentionally losing valuable data points. Sometimes you need to extend your initial dataset with additional information from a second dataset. For this, you can read the two datasets into pandas DataFrames and then combine them with the .merge() method into one DataFrame. However, depending on how you merge them, you can end up with fewer or more data points as expected.

    However, depending on how you merge them, you can end up with fewer or more data points as expected.

    This article will go over the four most common methods to merge two DataFrames. Since merging pandas DataFrames is similar to SQL joins, we will use them as analogies [1]. Namely, we will showcase how to conduct:

    • LEFT OUTER JOIN (pandas: “left”)
    • RIGHT OUTER JOIN (pandas: “right”)
    • FULL OUTER JOIN (pandas: “outer”)
    • INNER JOIN (pandas: “inner”)

    Also, we will show you how you can verify your results.

    To explain the concepts we will use the following two minimal fictional datasets. In this example, we have two tables for pandas in zoos. The first table contains the location information about zoos. The second table contains information about which panda is in which zoo.

    On the left side you can see a pandas DataFrame for zoos with four entries, which is colored in blue. On the right side you can see a pandas DataFrame for pandas with six entries, which is colored in yellow.
    Left DataFrame df1 (blue) and right DataFrame df2 (yellow) (Image by the author)

    The DataFrames are colored to illustrate which entries result from which DataFrame in the following examples. When merging two DataFrames, you refer to them as “left” and “right” DataFrame. In this example, df1 is the left DataFrame and is colored in blue. df2 is the right DataFrame and is colored in yellow. If an entry in the merged DataFrame results from both DataFrames, it will be indicated with a green row background.

    • Left DataFrame: df1, colored in blue
    • Right DataFrame: df2, colored in yellow
    • Key column: Common column to merge df1 and df2 on. In this example, the key column is “zoo_id”.
    • Merged DataFrame: df_merged with rows from left in blue, from right in yellow, and from both in green

    Let’s have a look at the .merge() method and its essential parameters. This method has more parameters than those discussed below. However, we will only touch on those that are relevant to this article. You can refer to the documentation [2] for further parameters.

    DataFrame.merge(right, 
    how = "...",
    on = None,
    indicator = False,
    ...)

    First, you call the .merge() method from the left Dataframe df1 and the first parameter is the left DataFrame df2.

    df_merged = df1.merge(df2)

    You could also merge two DataFrames as follows, where the first argument is the left DataFrame and the second argument is the right DataFrame:

    df_merged = pd.merge(df1, df2)

    While the .merge() method is smart enough to find the common key column to merge on, I would recommend to explicitly define it with the parameter on. Not only does it make your code more readable, but this also speeds up the execution time.

    df_merged = df1.merge(df2,
    on = "zoo_id")

    If the key columns don’t have an identical name in both DataFrames, you can use the parameters on_left and on_right instead of on.

    df_merged = df1.merge(df2,
    on_left = "key1",
    on_right = "key2")

    To indicate from which DataFrame a row in the merged DataFrame resulted from, we will use the parameter indicator = True. This option will create a new column “_merge” in the merged DataFrame as you will see in the following examples. For the regular usage of merging DataFrames, you can omit the indicator parameter.

    What if you would want to get the big picture of every panda and every zoo?

    Venn diagram made from two illustrated pandas. The left one is blue. The right one is yellow. Their intersection is green.
    FULL OUTER JOIN (Image by the author)

    For this, you would use a FULL OUTER JOIN in SQL speak [1].

    SELECT *
    FROM df1
    FULL OUTER JOIN df2
    ON df1.zoo_id = df2.zoo_id;

    In pandas, you would use how = "outer" [2].

    df_merged = df1.merge(df2, 
    on = "zoo_id",
    how = "outer",
    indicator = True)

    Below you can see every possibility to match each row from both DataFrames via the key column. The values 101, 102, and 103 appear in both key columns of both DataFrames. A match with both DataFrames is indicated with a green dot at the intersection of the two DataFrames.

    However, the value 104 only appears in the key column of the left DataFrame and the value 105 only appears in the key column of the right DataFrame. The unmatched rows are indicated with a blue or yellow dot respectively at the intersection with a line called “no match”.

    A full outer join contains all the dots in the below figure.

    What if you would want to get the big picture of every panda and every zoo?

    Both DataFrames to be merged are tilted at a 30° angle. Each row in both DataFrames is extended with a line. At their intersections, a green dot indicates, if two rows are matched by their key column values. If a row is not matched, it is colred in blue or yellow respectively. Below you can see the merged DataFrame with seven entries: four green, one blue, two yellow.
    DataFrame merged with “outer” (Image by the author)

    As a sanity check, the expected length of the merged DataFrame should be longer than or equal to the length of the longer DataFrame. The merged DataFrame df_merged has a total of seven rows: four from both, one from left only, and two from right only as indicated in the column _merge.

    While the green rows contain no NULL values, the blue and yellow rows have missing values. Since the green rows result from both DataFrames, each column has a value. However, since the left DataFrame df2 did not contain any pandas living in the zoo with the zoo_id = 104, the column panda_name is nan for row 4. The same goes for the yellow rows 5 and 6 since df1 did not contain any information about the zoo with the zoo_id = 105.

    But what if you would want to look at only zoos which house pandas?

    Venn diagram made from two illustrated pandas. Both pandas are white but their intersection is green.
    INNER JOIN (Image by the author)

    For this, you would use an INNER JOIN in SQL speak [1].

    SELECT *
    FROM df1
    INNER JOIN df2
    ON df1.zoo_id = df2.zoo_id;

    In pandas, you would use how = "inner" [2].

    df_merged = df1.merge(df2, 
    on = "zoo_id",
    how = "inner",
    indicator = True)

    In the below figure, you can again see the matches as described for the FULL OUTER JOIN. However, an INNER JOIN only considers the green dots, which indicate that a value is present in both key columns of both DataFrames. The unmatched values (blue and yellow dots from FULL OUTER JOIN) are excluded as illustrated in the above panda Venn diagram.

    But what if you would want to look at only zoos which house pandas?

    Both DataFrames to be merged are tilted at a 30° angle. Each row in both DataFrames is extended with a line. At their intersections, a green dot indicates, if two rows are matched by their key column values. Below you can see the merged DataFrame with four green entries.
    DataFrame merged with “inner” (Image by the author)

    As a sanity check, the expected length of the merged DataFrame should be longer than or equal to the length of the shorter DataFrame. The merged DataFrame df_merged has a total of four rows: four from both as indicated in the column _merge.

    Now, let’s say you would want to know for every zoo which pandas it houses. E.g., with this information, you could calculate, how many pandas each zoo has.

    Venn diagram made from two illustrated pandas. The left one is blue. The right one is white. Their intersection is green.
    LEFT OUTER JOIN (Image by the author)

    For this, you would use a LEFT OUTER JOIN in SQL speak [1].

    SELECT *
    FROM df1
    LEFT OUTER JOIN df2
    ON df1.zoo_id = df2.zoo_id;

    In pandas, you would use how = "left" [2].

    df_merged = df1.merge(df2, 
    on = "zoo_id",
    how = "left",
    indicator = True)

    In the below figure, you can again see the matches as described for the FULL OUTER JOIN. However, a LEFT OUTER JOIN only considers the green and blue dots as illustrated in the above panda Venn diagram. The unmatched values from the right DataFrame (yellow dots from FULL OUTER JOIN) are excluded.

    Let’s say you would want to calculate how many pandas each zoo has.

    Both DataFrames to be merged are tilted at a 30° angle. Each row in both DataFrames is extended with a line. At their intersections, a green dot indicates, if two rows are matched by their key column values. If a row is not matched, it is colred in blue or yellow respectively. Below you can see the merged DataFrame with five entries: four green, one blue.
    DataFrame merged with “left” (Image by the author)

    As a sanity check, the expected length of the merged DataFrame should be longer than or equal to the length of the left DataFrame. The merged DataFrame df_merged has a total of five rows: four from both and one from left only as indicated in the column _merge.

    Finally, let’s say you would want to know for every panda in which zoo it lives.

    Venn diagram made from two illustrated pandas. The left one is white. The right one is yellow. Their intersection is green.
    RIGHT OUTER JOIN (Image by the author)

    For this, you would use a RIGHT OUTER JOIN in SQL speak [1].

    SELECT *
    FROM df1
    RIGHT OUTER JOIN df2
    ON df1.zoo_id = df2.zoo_id;

    In pandas, you would use how = "right" [2].

    df_merged = df1.merge(df2, 
    on = "zoo_id",
    how = "right",
    indicator = True)

    In the below figure, you can again see the matches as described for the FULL OUTER JOIN. However, a RIGHT OUTER JOIN only considers the green and yellow dots as illustrated in the above panda Venn diagram. The unmatched values from the left DataFrame (blue dots from FULL OUTER JOIN) are excluded.

    Let’s say you would want to know for every panda in which zoo it lives.

    Both DataFrames to be merged are tilted at a 30° angle. Each row in both DataFrames is extended with a line. At their intersections, a green dot indicates, if two rows are matched by their key column values. If a row is not matched, it is colred in blue or yellow respectively. Below you can see the merged DataFrame with six entries: four green, two yellow.
    DataFrame merged with “right” (Image by the author)

    As a sanity check, the expected length of the merged DataFrame should be longer than or equal to the length of the right DataFrame. The merged DataFrame df_merged has a total of six rows: four from both and two from right only as indicated in the column _merge.

    This article (literally) illustrated how to merge two pandas DataFrames with the .merge() method. Namely, we looked at how to conduct the most common types of SQL joins in pandas: FULL OUTER JOIN, INNER JOIN, LEFT OUTER JOIN, and RIGHT OUTER JOIN.

    Below you can find a visual summary of this article as a cheat sheet:

    Cheat sheet of pandas DataFrame merge: FULL OUTER JOIN (outer), INNER JOIN (inner), LEFT OUTER JOIN (left), RIGHT OUTER JOIN (right).
    Cheat sheet for how to merge pandas DataFrames (Image by the author)

    [1] “pandas 1.4.2 documentation”, “Comparison with SQL.” pandas.pydata.org. https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html#join (accessed July 13, 2022)

    [2] “pandas 1.4.2 documentation”, “pandas.DataFrame.merge.” pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html (accessed July 13, 2022)

    Recent Articles

    spot_img

    Related Stories

    Stay on op - Ge the daily news in your inbox