7 Essential Data Quality Checks with Pandas

As a data professional, you’re probably familiar with the cost of poor data quality. For all data projects—big or small—you should perform essential data quality checks.

There are dedicated libraries and frameworks for data quality assessment. But if you are a beginner, you can run simple yet important data quality checks with pandas. And this tutorial will teach you how.

We’ll use the California Housing Dataset from scikit-learn for this tutorial.

An Overview of the California Housing Dataset

We’ll use the California housing dataset from scikit-learn’s datasets module. The dataset contains 20,640 records, with eight numeric features and the median house value as the target.

Let’s read the dataset into a pandas dataframe df:

from sklearn.datasets import fetch_california_housing
import pandas as pd

# Fetch the California housing dataset
data = fetch_california_housing()

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target column
df['MedHouseVal'] = data.target

For a detailed description of the dataset, print data.DESCR as shown:

print(data.DESCR)

[Image: output of data.DESCR]

Let's get some basic information on the dataset:

df.info()

Here’s the output:

Output >>>

RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

Because we have numeric features, let’s also get the summary statistics using the describe() method:

df.describe()

[Image: output of df.describe()]

1. Check for Missing Values

Real-world datasets often have missing values. To analyze the data and build models, you need to handle these missing values.

To ensure data quality, you should check if the fraction of missing values is within a specific tolerance limit. You can then impute the missing values using suitable imputation strategies.

The first step, therefore, is to check for missing values across all features in the dataset.

This code checks for missing values in each column of the dataframe df:

# Check for missing values in the DataFrame
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

The result is a pandas series that shows the count of missing values for each column:

Output >>>

Missing Values:
MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

As seen, there are no missing values in this dataset.
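
Even though this dataset has no missing values, here’s a minimal sketch of the tolerance check and imputation mentioned earlier. The 5% tolerance and the median-fill strategy are hypothetical choices, not part of the original analysis:

# Fraction of missing values per column
missing_fraction = df.isnull().mean()

# Flag columns whose missing fraction exceeds a (hypothetical) 5% tolerance
tolerance = 0.05
print(missing_fraction[missing_fraction > tolerance])

# One simple imputation strategy: fill numeric columns with the column median
df_imputed = df.fillna(df.median(numeric_only=True))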

2. Identify Duplicate Records

Duplicate records in the dataset can skew analysis. So you should check for and drop the duplicate records as needed.

Here’s the code to identify and return duplicate rows in df. If there are any duplicate rows, they will be included in the result:

# Check for duplicate rows in the DataFrame
duplicate_rows = df[df.duplicated()]
print("Duplicate Rows:")
print(duplicate_rows)

The result is an empty dataframe, meaning there are no duplicate records in the dataset:

Output >>>

Duplicate Rows:
Empty DataFrame
Columns: [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, MedHouseVal]
Index: []
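
If the dataset did contain duplicates, you could drop them with drop_duplicates(). A minimal sketch; keeping the first occurrence is one common choice:

# Drop duplicate rows, keeping the first occurrence of each
df_deduped = df.drop_duplicates(keep='first')
print(f"Rows before: {len(df)}, rows after: {len(df_deduped)}")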

3. Check Data Types

When analyzing a dataset, you’ll often have to transform or scale one or more features. To avoid unexpected errors when performing such operations, it is important to check if the columns are all of the expected data type.

This code checks the data types of each column in the dataframe df:

# Check data types of each column in the DataFrame
data_types = df.dtypes
print("Data Types:")
print(data_types)

Here, all numeric features are of float data type as expected:

Output >>>

Data Types:
MedInc         float64
HouseAge       float64
AveRooms       float64
AveBedrms      float64
Population     float64
AveOccup       float64
Latitude       float64
Longitude      float64
MedHouseVal    float64
dtype: object
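
If a column did not match the expected type, you could coerce it. Here’s a hedged sketch; the expected_dtypes mapping is a hypothetical schema you would define from domain knowledge:

# Hypothetical expected schema: every column should be float64
expected_dtypes = {col: 'float64' for col in df.columns}

for col, expected in expected_dtypes.items():
    if df[col].dtype != expected:
        # to_numeric with errors='coerce' turns unparseable entries into NaN
        df[col] = pd.to_numeric(df[col], errors='coerce').astype(expected)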

4. Check for Outliers

Outliers are data points that are significantly different from other points in the dataset. If you remember, we ran the describe() method on the dataframe.

Based on the quartile values and the maximum value, you could’ve identified that a subset of features contain outliers. Specifically, these features:

  • MedInc
  • AveRooms
  • AveBedrms
  • Population

One approach to handling outliers is to use the interquartile range (IQR), the difference between the 75th and 25th percentiles. If Q1 is the 25th percentile and Q3 is the 75th percentile, then the interquartile range is given by: Q3 – Q1.

We then use the quartiles and the IQR to define the interval [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]. And all points outside this range are outliers.

columns_to_check = ['MedInc', 'AveRooms', 'AveBedrms', 'Population']

# Function to find records with outliers using the IQR method
def find_outliers_pandas(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers

# Find records with outliers for each specified column
outliers_dict = {}

for column in columns_to_check:
    outliers_dict[column] = find_outliers_pandas(df, column)

# Print the records with outliers for each column
for column, outliers in outliers_dict.items():
    print(f"Outliers in '{column}':")
    print(outliers)
    print("\n")

[Image: truncated output of the outliers check, showing outliers in the 'AveRooms' column]

5. Validate Numeric Ranges

An important check for numeric features is to validate the range. This ensures that all observations of a feature take on values in an expected range.

This code validates that the 'MedInc' values fall within an expected range and identifies data points that do not meet this criterion:

# Check numerical value range for the 'MedInc' column
valid_range = (0, 16)

value_range_check = df[~df['MedInc'].between(*valid_range)]
print("Value Range Check (MedInc):")
print(value_range_check)

You can try this check for other numeric features of your choice. Here, we see that all values in the 'MedInc' column lie in the expected range:

Output >>>

Value Range Check (MedInc):
Empty DataFrame
Columns: [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, MedHouseVal]
Index: []
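
To extend this check to several features at once, you can keep a mapping of expected ranges. A sketch; the ranges below are rough, hypothetical bounds (approximate California latitudes and longitudes), not values from the original analysis:

# Hypothetical expected ranges per column
expected_ranges = {
    'HouseAge': (0, 100),
    'Latitude': (32, 42),       # approximate range for California
    'Longitude': (-125, -114),  # approximate range for California
}

for column, (low, high) in expected_ranges.items():
    out_of_range = df[~df[column].between(low, high)]
    print(f"{column}: {len(out_of_range)} out-of-range records")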

6. Check Cross-Column Dependency

Most datasets contain related features. So it's important to include checks based on logically relevant relationships between columns (or features).

While features—individually—may take on values in the expected range, the relationship between them may be inconsistent.

Here is an example for our dataset. In a valid record, the 'AveRooms' value should typically be greater than or equal to the 'AveBedrms' value.

# AveRooms should not be smaller than AveBedrms
invalid_data = df[df['AveRooms'] < df['AveBedrms']]
print("Invalid Records (AveRooms < AveBedrms):")
print(invalid_data)

In the California housing dataset we’re working with, we see that there are no such invalid records:

Output >>>

Invalid Records (AveRooms < AveBedrms):
Empty DataFrame
Columns: [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, MedHouseVal]
Index: []

7. Check for Inconsistent Data Entry

Inconsistent data entry is a common data quality issue in most datasets. Examples include:

  • Inconsistent formatting in datetime columns
  • Inconsistent logging of categorical variable values
  • Recording readings in different units

In our dataset, we’ve verified the data types of columns and have identified outliers. But you can also run checks for inconsistent data entry.

Let’s whip up a simple example to check if all the date entries have consistent formatting.

Here, we use regular expressions with the pandas apply() function to check if all date entries are in the YYYY-MM-DD format:

import pandas as pd
import re

# Standalone example dataframe with mixed date formats
data = {'Date': ['2023-10-29', '2023-11-15', '23-10-2023', '2023/10/29', '2023-10-30']}
df = pd.DataFrame(data)

# Define the expected date format (YYYY-MM-DD)
date_format_pattern = r'^\d{4}-\d{2}-\d{2}$'

# Function to check if a date value matches the expected format
def check_date_format(date_str, date_format_pattern):
    return re.match(date_format_pattern, date_str) is not None

# Apply the format check to the 'Date' column
date_format_check = df['Date'].apply(lambda x: check_date_format(x, date_format_pattern))

# Identify and retrieve entries that do not follow the expected format
non_adherent_dates = df[~date_format_check]

if not non_adherent_dates.empty:
    print("Entries that do not follow the expected format:")
    print(non_adherent_dates)
else:
    print("All dates are in the expected format.")

This returns the entries that do not follow the expected format:

Output >>>

Entries that do not follow the expected format:
         Date
2  23-10-2023
3  2023/10/29
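
Once flagged, inconsistent entries can often be normalized rather than dropped. Here’s a minimal sketch using pd.to_datetime; note that format='mixed' requires pandas 2.0 or later, and errors='coerce' turns entries that cannot be parsed into NaT:

import pandas as pd

data = {'Date': ['2023-10-29', '2023-11-15', '23-10-2023', '2023/10/29', '2023-10-30']}
df = pd.DataFrame(data)

# Parse each entry individually; unparseable entries become NaT
df['Date'] = pd.to_datetime(df['Date'], format='mixed', errors='coerce')

# Re-serialize all parsed dates to the expected YYYY-MM-DD format
df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')
print(df)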

Wrapping Up

In this tutorial, we went over common data quality checks with pandas.

When you are working on smaller data analysis projects, these data quality checks with pandas are a good starting point. Depending on the problem and the dataset, you can include additional checks.

If you’re interested in learning data analysis, check out the guide 7 Steps to Mastering Data Wrangling with Pandas and Python.

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.
