How to Drop All Duplicate Rows in Python Pandas

By squashlabs, Last Updated: Oct. 15, 2023

To drop duplicate rows in a pandas DataFrame, use the drop_duplicates() method. By default it compares all columns and, for each group of identical rows, keeps the first occurrence and removes the rest. Here are two approaches you can take:

Approach 1: Using the drop_duplicates() method

The most straightforward way to drop duplicate rows in a pandas DataFrame is to call drop_duplicates() with no column arguments. Two rows count as duplicates only when they have the same values in every column, and the first occurrence of each is kept.

Here's an example of how you can use the drop_duplicates() method to drop all duplicate rows:

import pandas as pd

# Create a sample DataFrame with duplicate rows
data = {'col1': [1, 2, 3, 2, 4, 3],
        'col2': ['A', 'B', 'C', 'B', 'D', 'C']}
df = pd.DataFrame(data)

# Drop all duplicate rows
df.drop_duplicates(inplace=True)

# Print the resulting DataFrame
print(df)

Output:

   col1 col2
0     1    A
1     2    B
2     3    C
4     4    D

In this example, we create a DataFrame in which the rows at index 3 and 5 repeat the rows at index 1 and 2. We then call drop_duplicates() with inplace=True to modify the original DataFrame, which removes the repeated rows and keeps the first occurrence of each. The remaining rows keep their original index labels, which is why the printed index skips from 2 to 4.
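The following is a minimal sketch of a non-destructive variant of the same example. It assumes pandas 1.0 or later for the ignore_index parameter; the variable names deduped and renumbered are illustrative only.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'col1': [1, 2, 3, 2, 4, 3],
                   'col2': ['A', 'B', 'C', 'B', 'D', 'C']})

# Without inplace=True, a new DataFrame is returned and df is left untouched
deduped = df.drop_duplicates()

# ignore_index=True relabels the surviving rows 0..n-1
renumbered = df.drop_duplicates(ignore_index=True)

print(len(df))                  # 6 (original is unchanged)
print(list(deduped.index))      # [0, 1, 2, 4]
print(list(renumbered.index))   # [0, 1, 2, 3]
```

Assigning the result to a new variable keeps the original data available for comparison, which can be helpful when checking how many rows were removed.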

Approach 2: Dropping duplicate rows based on specific columns

In some cases, you may want to drop duplicate rows based on specific columns in your DataFrame. To achieve this, you can pass a subset of columns to the drop_duplicates() method.

Here's an example of how you can drop duplicate rows based on specific columns:

import pandas as pd

# Create a sample DataFrame with duplicate rows
data = {'col1': [1, 2, 3, 2, 4, 3],
        'col2': ['A', 'B', 'C', 'B', 'D', 'C'],
        'col3': ['X', 'Y', 'Z', 'Y', 'W', 'Z']}
df = pd.DataFrame(data)

# Drop duplicate rows based on 'col1' and 'col2'
df.drop_duplicates(subset=['col1', 'col2'], inplace=True)

# Print the resulting DataFrame
print(df)

Output:

   col1 col2 col3
0     1    A    X
1     2    B    Y
2     3    C    Z
4     4    D    W

In this example, we create a DataFrame with three columns, 'col1', 'col2', and 'col3'. We then call drop_duplicates() with subset=['col1', 'col2'] so that only those two columns are compared; 'col3' is ignored when deciding whether two rows are duplicates. Finally, we print the resulting DataFrame, which contains the first occurrence of each ('col1', 'col2') pair.
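In the example above, the repeated rows also match in 'col3', so the result would be the same without subset. The sketch below uses a small, made-up DataFrame in which 'col3' differs between otherwise matching rows, to show where subset changes the outcome. The variable names full_row and by_subset are illustrative.

```python
import pandas as pd

# Rows 1 and 2 agree on col1 and col2 but differ in col3
df = pd.DataFrame({'col1': [1, 2, 2, 3],
                   'col2': ['A', 'B', 'B', 'C'],
                   'col3': ['X', 'Y', 'Q', 'Z']})

# Comparing every column: no two rows are fully identical, so nothing is dropped
full_row = df.drop_duplicates()

# Comparing only col1 and col2: row 2 is treated as a duplicate of row 1
by_subset = df.drop_duplicates(subset=['col1', 'col2'])

print(len(full_row))              # 4
print(list(by_subset.index))      # [0, 1, 3]
print(by_subset['col3'].tolist()) # ['X', 'Y', 'Z']
```

Note that the 'col3' value that survives is the one from the first matching row, so the choice of subset also decides which non-key values are discarded.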

Best practices and considerations

When dropping duplicate rows in a pandas DataFrame, keep the following considerations in mind:

1. Be cautious when using the inplace=True parameter with the drop_duplicates() method. This parameter modifies the original DataFrame in place, so the duplicate rows are removed from that object, and the method returns None rather than a DataFrame. If you want to preserve the original DataFrame, omit inplace=True and assign the result of drop_duplicates() to a new variable.

2. If you want to drop duplicate rows based on a subset of columns, pass the column names to the subset parameter of the drop_duplicates() method. A single column label is accepted, and a list of names lets you compare several columns at once.

3. By default, the drop_duplicates() method keeps the first occurrence of each unique row and removes all subsequent occurrences. If you want to keep the last occurrence of each unique row and remove all previous occurrences, you can use the keep='last' parameter.

4. If you want to drop duplicate rows based on a specific column and keep the first or last occurrence, you can use the drop_duplicates() method with the subset and keep parameters. For example, to drop duplicate rows based on the 'col1' column and keep the first occurrence, you can use df.drop_duplicates(subset=['col1'], keep='first').

5. If you want to remove every row that has a duplicate, keeping none of the occurrences, pass keep=False. You can also use the duplicated() method to identify the duplicate rows and then filter the DataFrame using boolean indexing. Note that df[~df.duplicated(subset=['col1'])] keeps the first occurrence, just like drop_duplicates(subset=['col1']); to discard all occurrences based on the 'col1' column, use df[~df.duplicated(subset=['col1'], keep=False)] or df.drop_duplicates(subset=['col1'], keep=False).
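The keep options described in the list above can be compared side by side. This is a minimal sketch reusing the sample data from Approach 1; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 2, 4, 3],
                   'col2': ['A', 'B', 'C', 'B', 'D', 'C']})

# keep='first' (the default): retain the earliest row of each duplicate group
first = df.drop_duplicates(keep='first')

# keep='last': retain the latest row of each duplicate group
last = df.drop_duplicates(keep='last')

# keep=False: retain only rows that have no duplicate at all
no_dupes = df.drop_duplicates(keep=False)

# duplicated(keep=False) flags every member of a duplicate group
mask = df.duplicated(keep=False)

print(list(first.index))     # [0, 1, 2, 4]
print(list(last.index))      # [0, 3, 4, 5]
print(list(no_dupes.index))  # [0, 4]
print(mask.tolist())         # [False, True, True, True, False, True]
```

Filtering with df[~mask] gives the same rows as no_dupes, while df[mask] is a convenient way to inspect the duplicated rows before removing them.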

Overall, the drop_duplicates() method provides a convenient way to drop all duplicate rows in a pandas DataFrame. By specifying the appropriate parameters, you can customize the behavior of the method to suit your specific requirements.

For more information on the drop_duplicates() method and other data manipulation techniques in pandas, you can refer to the official pandas documentation: pandas.DataFrame.drop_duplicates().
