Skip to content

ENH: inspect duplicate rows for columns that vary #61318

Closed
@kirisakow

Description

@kirisakow

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I had a function that would inspect a DataFrame that has duplicate values and yield, per each group of rows that have a duplicate value, a subset of the input DataFrame featuring only the columns that vary.

Feature Description

from typing import Union
import pandas as pd


class ExtendedDF(pd.DataFrame):
    @property
    def _constructor(self):
        return ExtendedDF

    def inspect_duplicates(self, key_col: str) -> Union[pd.DataFrame, 'ExtendedDF']:
        """Inspects a DataFrame that has duplicate values in the `key_col` column,
        and yields, per each group of rows that have same `key_col` value, a subset
        of the input DataFrame featuring only the columns that vary.

        Args:
            key_col (str): name of the column with duplicate values

        Yields:
            pd.DataFrame: per each group of rows that have same `key_col` value,
            yields a subset of the input DataFrame featuring only the columns that
            vary.

        Examples:

        Consider a dataset containing ramen ratings with duplicates:

        ```py
        df = pd.DataFrame({
            'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
            'style': ['cup', 'pack', 'cup', 'cup', 'pack'],
            'rating': [4, 3.5, 4, 15, 5],
            'col_that_doesnt_change': ['so yummy' for _ in range(5)],
            'another_col_that_doesnt_change': ['mmm love it' for _ in range(5)],
        })
        df = ExtendedDF(df)
        ```
        ```
        df
             brand  style  rating col_that_doesnt_change another_col_that_doesnt_change
        0  Yum Yum    cup     4.0               so yummy                    mmm love it
        1  Yum Yum   pack     3.5               so yummy                    mmm love it
        2  Indomie    cup     4.0               so yummy                    mmm love it
        3  Indomie    cup    15.0               so yummy                    mmm love it
        4  Indomie   pack     5.0               so yummy                    mmm love it
        ```

        Inspect the duplicates using 'brand' column as the key:

        ```py
        print(
            *df.inspect_duplicates('brand')
        )
        ```
        ```
             brand  rating
        2  Indomie     4.0
        3  Indomie    15.0
        4  Indomie     5.0

             brand style  rating
        0  Yum Yum   cup     4.0
        1  Yum Yum  pack     3.5
        ```

        Inspect the duplicates using 'style' column as the key:

        ```py
        print(
            *df.inspect_duplicates('style')
        )
        ```
        ```
          style    brand
        0   cup  Yum Yum
        2   cup  Indomie
        3   cup  Indomie

          style    brand  rating
        1  pack  Yum Yum     3.5
        4  pack  Indomie     5.0
        ```

        Inspect the duplicates using 'rating' column as the key:

        ```py
        print(
            *df.inspect_duplicates('rating')
        )
        ```
        ```
           rating    brand
        0     4.0  Yum Yum
        2     4.0  Indomie
        ```

        You can also concatenate everything that is yielded into a single DataFrame:

        ```py
        print(
            pd.concat([
                *df.inspect_duplicates('brand')
            ])
        )
        ```
        ```
             brand style  rating
        0  Yum Yum   cup     4.0
        1  Yum Yum  pack     3.5
        2  Indomie   NaN     4.0
        3  Indomie   NaN    15.0
        4  Indomie   NaN     5.0
        ```
        """
        mark_all_dupl_mask = self.duplicated(key_col, keep=False)
        df_dupl = self.loc[mark_all_dupl_mask]
        for k in set(df_dupl[key_col].values):
            sub_df = self.loc[self[key_col] == k]
            mask_eq = sub_df.iloc[0] != sub_df.iloc[1]
            diff_cols = mask_eq.loc[mask_eq].index.values
            yield sub_df.loc[:, [key_col] + list(diff_cols)]

Alternative Solutions

None

Additional Context

Authors:

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions