How to Handle Missing Data with Scikit-learn’s Imputer Module


Let’s learn how to use Scikit-learn’s imputer for handling missing data.

Preparation

Ensure you have NumPy, Pandas, and Scikit-learn installed in your environment. If not, you can install them via pip with the following command:

pip install numpy pandas scikit-learn

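Then, import the packages into your environment:

import numpy as np
import pandas as pd
import sklearn
# IterativeImputer is experimental, so this import must come first
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer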

Handle Missing Data with Imputer

A Scikit-learn imputer is a class used to replace missing data with substitute values, which can streamline your data preprocessing. We will explore several strategies for handling missing data.
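As a minimal sketch of what this means in practice, an imputer learns its fill values in fit and applies them in transform. The tiny arrays below are made up for illustration and are not part of this article's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data, used only for this illustration
train = np.array([[1.0], [3.0], [np.nan], [5.0]])
new = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)                     # learns the column mean, ignoring NaN
learned_mean = imputer.statistics_[0]  # the value that will be used as the fill
filled = imputer.transform(new)        # reuses the learned mean on unseen rows

print(learned_mean)
print(filled)
```

Keeping fit and transform separate lets the same learned statistics be applied to data the imputer has not seen before.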

Let’s create a sample dataset for our example:

sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
               'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]}
df = pd.DataFrame(sample_data)
print(df)

   First  Second
0    1.0     NaN
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     NaN
7    NaN     8.0
8    9.0     9.0

You can fill each column's missing values with that column’s mean using Scikit-learn's SimpleImputer.
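The code for this step does not appear alongside its output. A minimal sketch, assuming it mirrors the median example later in this article with strategy='mean', might look like the following (the DataFrame is rebuilt so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
               'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]}
df = pd.DataFrame(sample_data)

# Replace each NaN with the mean of its own column
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns).round(2)

print(df_imputed)
```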

   First  Second
0   1.00    5.29
1   2.00    2.00
2   3.00    3.00
3   4.00    4.00
4   5.00    5.00
5   6.00    6.00
6   7.00    5.29
7   4.62    8.00
8   9.00    9.00

Note that the result is rounded to two decimal places.

It’s also possible to impute the missing data with the column median using SimpleImputer.

# SimpleImputer lives in sklearn.impute, not in the top-level sklearn namespace
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns).round(2)

print(df_imputed)

   First  Second
0    1.0     5.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.0
7    4.5     8.0
8    9.0     9.0

Mean and median imputation are simple, but they can distort the data distribution and bias the relationships between features.
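To make the distortion concrete, here is a small sketch that compares the spread of each column before and after mean imputation. Because every filled value sits exactly at the column mean, the mean is preserved while the standard deviation shrinks. The data reuses this article's sample values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
                   'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]})

imputer = SimpleImputer(strategy='mean')
df_mean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# pandas skips NaN by default, so this is the spread of the observed values only
std_before = df.std()
# The filled cells add no deviation from the mean, so the spread gets smaller
std_after = df_mean.std()

print(pd.DataFrame({'before': std_before, 'after': std_after}))
```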

It is also possible to use a K-NN imputer to fill in the missing data using the nearest neighbours approach.

# KNNImputer lives in sklearn.impute, not in the top-level sklearn namespace
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2)
knn_imputed_data = knn_imputer.fit_transform(df)
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)

print(knn_imputed_df)

   First  Second
0    1.0     2.5
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.5
7    7.5     8.0
8    9.0     9.0

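The KNN imputer fills each missing value with the mean of that feature across the k nearest neighbours, either a plain mean or one weighted by distance. Neighbours are found using a Euclidean distance that ignores missing coordinates.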
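As a sketch of how the neighbour averaging can be tuned, KNNImputer accepts a weights parameter: 'uniform' (the default) takes a plain mean of the neighbours' values, while 'distance' gives closer neighbours more influence. The example below reuses this article's sample data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
                   'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]})

uniform = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(df)
weighted = KNNImputer(n_neighbors=2, weights='distance').fit_transform(df)

# Row 0 is missing 'Second'; its two nearest rows have Second = 2 and Second = 3
uniform_value = uniform[0, 1]    # plain mean of the two neighbours
weighted_value = weighted[0, 1]  # pulled toward the closer neighbour (Second = 2)

print(uniform_value, weighted_value)
```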

Lastly, there is the IterativeImputer, which models each feature with missing values as a function of the other features. It is still an experimental feature, so it must be enabled first by importing enable_iterative_imputer from sklearn.experimental.

# Requires the enable_iterative_imputer import shown in the Preparation section
from sklearn.impute import IterativeImputer

iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
iterative_imputed_data = iterative_imputer.fit_transform(df)
iterative_imputed_df = pd.DataFrame(iterative_imputed_data, columns=df.columns).round(2)

print(iterative_imputed_df)

   First  Second
0    1.0     1.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     7.0
7    8.0     8.0
8    9.0     9.0
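The idea of modelling one feature as a function of another can be illustrated with a single hand-rolled regression pass. This is only a simplified sketch using LinearRegression on the complete rows. It is not how IterativeImputer is implemented: by default that class uses a BayesianRidge estimator and cycles over the features for several rounds.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
                   'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]})

complete = df.dropna()  # rows where both columns are observed

# Model 'Second' as a function of 'First' using the complete rows
model = LinearRegression().fit(complete[['First']], complete['Second'])

# Predict 'Second' only where it is missing and 'First' is available
missing_second = df['Second'].isna() & df['First'].notna()
filled = df.copy()
filled.loc[missing_second, 'Second'] = model.predict(df.loc[missing_second, ['First']])

print(filled)
```

In this sample the two columns are identical wherever both are observed, so the fitted line recovers the missing 'Second' values from 'First'. The missing 'First' value in row 7 is left untouched by this one-directional pass.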

Used properly, these imputers can improve the quality of your data science project.

Additional Resources

  • How to Deal with Missing Values in Your Dataset
  • How to Handle Missing Data with Python
  • Data Cleaning with Pandas

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
