Beginner’s Guide to Data Cleaning with Pyjanitor

Data Cleaning with PyJanitor
Image by Author | DALLE-3 & Canva

Have you ever dealt with messy datasets? They are one of the biggest hurdles in any data science project. These datasets can contain inconsistencies, missing values, or irregularities that hinder analysis. Data cleaning is the essential first step that lays the foundation for accurate and reliable insights, but it's lengthy and time-consuming.

Fear not! Let me introduce you to Pyjanitor, a fantastic Python library that can save the day. It is a convenient Python package, providing a simple remedy to these data-cleaning challenges. In this article, I am going to discuss the importance of Pyjanitor along with its features and practical usage.

By the end of this article, you will have a clear understanding of how Pyjanitor simplifies data cleaning and its application in everyday data-related tasks.

What is Pyjanitor?

Pyjanitor is a Python implementation of the R package janitor, built on top of pandas, that simplifies data cleaning and preprocessing tasks. It extends pandas' functionality by offering a variety of useful functions that streamline the process of cleaning, transforming, and preparing datasets. Think of it as an upgrade to your data-cleaning toolkit. Are you eager to learn about Pyjanitor? Me too. Let’s start.

Getting Started

First things first, you need to install Pyjanitor. Open your terminal or command prompt and run the following command:

pip install pyjanitor

The next step is to import Pyjanitor and Pandas into your Python script. This can be done by:

import janitor
import pandas as pd

Now you are ready to use Pyjanitor for your data cleaning tasks. Next, I will cover some of its most useful features:

1. Cleaning Column Names

Raise your hand if you have ever been frustrated by inconsistent column names. Yup, me too. With Pyjanitor's clean_names() function, you can quickly standardize your column names with a single call. By default, it converts all characters to lowercase and replaces spaces and dots with underscores. Passing the optional remove_special=True argument also drops special characters such as the asterisk. Let’s understand it with a basic example.

# Create a data frame with inconsistent column names
student_df = pd.DataFrame({
    'Student.ID': [1, 2, 3],
    'Student Name': ['Sara', 'Hanna', 'Mathew'],
    'Student Gender': ['Female', 'Female', 'Male'],
    'Course*': ['Algebra', 'Data Science', 'Geometry'],
    'Grade': ['A', 'B', 'C']
})

# Clean the column names; remove_special=True also drops the '*' in 'Course*'
clean_df = student_df.clean_names(remove_special=True)
print(clean_df)

Output:

   student_id student_name student_gender        course grade
0           1         Sara         Female       Algebra     A
1           2        Hanna         Female  Data Science     B
2           3       Mathew           Male      Geometry     C
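
To make these steps concrete, here is a rough pandas-only approximation of the cleaning shown above: lowercase the names, swap spaces and dots for underscores, and drop any remaining special characters. It is a sketch for intuition, not Pyjanitor's actual implementation, and the helper name approx_clean_names is hypothetical.

```python
import re

import pandas as pd


def approx_clean_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: a rough imitation of column-name cleaning."""
    def tidy(name) -> str:
        name = str(name).lower()                # lowercase everything
        name = re.sub(r"[ .]", "_", name)       # spaces and dots -> underscores
        name = re.sub(r"[^0-9a-z_]", "", name)  # drop other special characters
        return re.sub(r"_+", "_", name)         # collapse repeated underscores

    return df.rename(columns=tidy)


demo_df = pd.DataFrame({
    'Student.ID': [1],
    'Student Name': ['Sara'],
    'Course*': ['Algebra'],
})
approx_columns = list(approx_clean_names(demo_df).columns)
print(approx_columns)  # ['student_id', 'student_name', 'course']
```

Seeing the steps spelled out this way also makes it easier to predict what a given messy header will turn into before you run the real function.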

2. Renaming Columns

At times, renaming columns not only enhances our understanding of the data but also improves its readability and consistency. Thanks to the rename_column() function, this task becomes effortless. A simple example showcasing the usability of this function is as follows:

student_df = pd.DataFrame({
    'stu_id': [1, 2],
    'stu_name': ['Ryan', 'James'],
})

# Rename the columns
student_df = student_df.rename_column('stu_id', 'Student_ID')
student_df = student_df.rename_column('stu_name', 'Student_Name')
print(student_df.columns)

Output:

Index(['Student_ID', 'Student_Name'], dtype='object')  
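
For comparison, pandas' built-in rename() can apply both renames in one call by passing a mapping of old names to new names. The sketch below shows that equivalent pandas approach purely as an alternative to the two rename_column() calls above.

```python
import pandas as pd

student_df = pd.DataFrame({
    'stu_id': [1, 2],
    'stu_name': ['Ryan', 'James'],
})

# Map old names to new names in a single call; the original frame is untouched
renamed_df = student_df.rename(columns={
    'stu_id': 'Student_ID',
    'stu_name': 'Student_Name',
})
print(list(renamed_df.columns))  # ['Student_ID', 'Student_Name']
```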

3. Handling Missing Values

Missing values are a real headache when dealing with datasets. Fortunately, Pyjanitor's fill_empty() function comes in handy for addressing these issues. Let's explore how to handle missing values using Pyjanitor with a practical example. First, we will create a dummy data frame and populate it with some missing values.

# Create a data frame with missing values
employee_df = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': ['Ryan', 'James', 'Alicia'],
    'department': ['HR', None, 'Engineering'],
    'salary': [60000, 55000, None]
})

Now, let's see how Pyjanitor can assist in filling up these missing values:

# Compute the mean of the known salaries before filling anything
mean_salary = employee_df['salary'].mean()

# Replace missing 'department' with 'Unknown'
# Replace the missing 'salary' with the mean of salaries
employee_df = (
    employee_df
    .fill_empty('department', 'Unknown')
    .fill_empty('salary', mean_salary)
)
print(employee_df)

Output:

   employee_id    name   department   salary
0            1    Ryan           HR  60000.0
1            2   James      Unknown  55000.0
2            3  Alicia  Engineering  57500.0

In this example, the department of employee ‘James’ is substituted with ‘Unknown’, and the salary of ‘Alicia’ is substituted with the average of Ryan's and James's salaries: (60000 + 55000) / 2 = 57500. You can use various strategies for handling missing values, such as forward fill, backward fill, or filling with a specific value.
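
The other fill strategies mentioned above (carrying the previous or the next known value into a gap) can be sketched with plain pandas, which Pyjanitor builds on. The small salary series here is made up for illustration.

```python
import pandas as pd

salaries = pd.Series([60000.0, None, 55000.0], name='salary')

forward_filled = salaries.ffill()   # copy the previous known value into the gap
backward_filled = salaries.bfill()  # copy the next known value into the gap

print(forward_filled.tolist())   # [60000.0, 60000.0, 55000.0]
print(backward_filled.tolist())  # [60000.0, 55000.0, 55000.0]
```

Which strategy fits depends on the data: forward or backward fill suits ordered records such as time series, while a constant or the mean is usually safer for unordered rows.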

4. Filtering Rows & Selecting Columns

Filtering rows and selecting columns are crucial tasks in data analysis. Pyjanitor's methods are attached to ordinary pandas data frames, so they combine freely with pandas' own tools for selecting columns and filtering rows based on specific conditions. Suppose you have a data frame containing student records, and you want to filter out the students (rows) whose marks are less than 60. Let’s see how to achieve this with the query() method.

# Create a data frame with student data
students_df = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Julia', 'Ali', 'Sara', 'Sam'],
    'subject': ['Maths', 'General Science', 'English', 'History', 'Maths'],
    'marks': [85, 58, 92, 45, 75],
    'grade': ['A', 'C', 'A+', 'D', 'B']
})

# Keep only the rows where marks are at least 60
filtered_students_df = students_df.query('marks >= 60')
print(filtered_students_df)

Output:

   student_id  name  subject  marks grade
0           1  John    Maths     85     A
2           3   Ali  English     92    A+
4           5   Sam    Maths     75     B

Now suppose you also want to output only specific columns, such as only the name and ID, rather than the entire data. Standard .loc indexing does the job here (Pyjanitor's own select_columns() method, used in the chaining example later, is an alternative):

# Select specific columns
selected_columns_df = filtered_students_df.loc[:, ['student_id', 'name']]
print(selected_columns_df)

Output:

   student_id  name
0           1  John
2           3   Ali
4           5   Sam
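
The two steps above can also be combined. As a sketch in plain pandas, a single .loc call can take a boolean row mask and a list of column names at once. The data frame below is a trimmed-down copy of the student data.

```python
import pandas as pd

students_df = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Julia', 'Ali', 'Sara', 'Sam'],
    'marks': [85, 58, 92, 45, 75],
})

# Rows: marks of at least 60; columns: only the ID and the name
passed_df = students_df.loc[students_df['marks'] >= 60, ['student_id', 'name']]
print(passed_df)
```

Note that the original row labels (0, 2, 4) are preserved, which makes it easy to trace each result back to the source data.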

5. Chaining Methods

With Pyjanitor's method chaining feature, you can perform multiple operations in a single line. This capability stands out as one of its best features. To illustrate, let's consider a data frame containing data about cars:

# Create a data frame with sample car data
cars_df = pd.DataFrame({
    'Car ID': [101, None, 103, 104, 105],
    'Car Model': ['Toyota', 'Honda', 'BMW', 'Mercedes', 'Tesla'],
    'Price ($)': [25000, 30000, None, 40000, 45000],
    'Year': [2018, 2019, 2017, 2020, None]
})
print("Cars Data Before Applying Method Chaining:")
print(cars_df)

Output:

Cars Data Before Applying Method Chaining:
   Car ID Car Model  Price ($)    Year
0   101.0    Toyota    25000.0  2018.0
1     NaN     Honda    30000.0  2019.0
2   103.0       BMW        NaN  2017.0
3   104.0  Mercedes    40000.0  2020.0
4   105.0     Tesla    45000.0     NaN

As we can see, the data frame contains missing values and inconsistent column names. We can fix this by performing operations such as clean_names(), rename_column(), and dropna() one after another in separate statements. Alternatively, we can chain these methods together, performing multiple operations in a single expression, for a fluent workflow and cleaner code.

# Chain methods to clean column names, drop rows with missing values,
# select specific columns, and rename a column
cleaned_cars_df = (
    cars_df
    # remove_special drops the '$'; strip_underscores trims the leftover
    # trailing underscore, so 'Price ($)' becomes 'price'
    .clean_names(remove_special=True, strip_underscores=True)
    .dropna()  # Drop rows with missing values
    .select_columns(['car_id', 'car_model', 'price'])  # Select columns
    .rename_column('price', 'price_usd')  # Rename column
)

print("Cars Data After Applying Method Chaining:")
print(cleaned_cars_df)

Output:

Cars Data After Applying Method Chaining:
   car_id car_model  price_usd
0   101.0    Toyota    25000.0
3   104.0  Mercedes    40000.0

In this pipeline, the following operations have been performed:

  • clean_names() standardizes the column names.
  • dropna() drops the rows with missing values.
  • select_columns() selects specific columns, which are ‘car_id’, ‘car_model’ and ‘price’.
  • rename_column() renames the column ‘price’ to ‘price_usd’.
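
Chains are not limited to built-in methods. As a sketch, pandas' pipe() lets a custom function take part in the same fluent style. The helper add_price_band and its 35000 threshold are hypothetical and exist only for this illustration.

```python
import pandas as pd


def add_price_band(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Hypothetical step: label each car as 'premium' or 'standard'."""
    band = df['price_usd'].apply(
        lambda price: 'premium' if price >= threshold else 'standard'
    )
    return df.assign(price_band=band)


cars_df = pd.DataFrame({
    'car_model': ['Toyota', 'Mercedes'],
    'price_usd': [25000.0, 40000.0],
})

banded_df = (
    cars_df
    .pipe(add_price_band, threshold=35000)      # custom function inside the chain
    .sort_values('price_usd', ascending=False)  # most expensive first
    .reset_index(drop=True)
)
print(banded_df)
```

Because each step returns a new data frame, the custom function slots in exactly like any other link in the chain.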

Wrapping Up

So, to wrap up, Pyjanitor proves to be a magical library for anyone working with data. It offers many more features than discussed in this article, such as encoding categorical variables, obtaining features and labels, identifying duplicate rows, and much more. All of these advanced features and methods can be explored in its documentation. The deeper you delve into its features, the more you will be surprised by its powerful functionality. Lastly, enjoy manipulating your data with Pyjanitor.

Kanwal Mehreen Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
