How to Handle Missing Data with Scikit-learn’s Imputer Module

Image by Editor | Midjourney & Canva

Let’s learn how to use Scikit-learn’s imputer for handling missing data.

Preparation

Ensure you have NumPy, Pandas, and Scikit-learn installed in your environment. If not, you can install them via pip using the following command:

pip install numpy pandas scikit-learn

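Then, import the packages into your environment:

import numpy as np
import pandas as pd

# The experimental flag must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer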

Handle Missing Data with Imputer

A Scikit-learn imputer is a class that replaces missing data with substitute values, which can streamline your data preprocessing. We will explore several strategies for handling missing data.

Let’s create a small dataset for our examples:

sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
               'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]}
df = pd.DataFrame(sample_data)
print(df)

   First  Second
0    1.0     NaN
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     NaN
7    NaN     8.0
8    9.0     9.0

You can fill each column’s missing values with that column’s mean using Scikit-learn’s SimpleImputer.
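The code for this step does not appear in the text. A minimal sketch, assuming the `SimpleImputer` class from `sklearn.impute` with `strategy='mean'` and the same sample data as above, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same sample data as above, rebuilt so the snippet runs on its own
sample_data = {'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
               'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]}
df = pd.DataFrame(sample_data)

# Replace each NaN with the mean of its own column
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns).round(2)

print(df_imputed)
```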

   First  Second
0   1.00    5.29
1   2.00    2.00
2   3.00    3.00
3   4.00    4.00
4   5.00    5.00
5   6.00    6.00
6   7.00    5.29
7   4.62    8.00
8   9.00    9.00

Note that we round the result to 2 decimal places.

It’s also possible to impute the missing data with the median using SimpleImputer.

from sklearn.impute import SimpleImputer

# Replace each NaN with the median of its own column
imputer = SimpleImputer(strategy='median')
df_imputed = round(pd.DataFrame(imputer.fit_transform(df), columns=df.columns), 2)

print(df_imputed)

   First  Second
0    1.0     5.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.0
7    4.5     8.0
8    9.0     9.0

The mean and median imputation approaches are simple, but they can distort the data distribution and bias the relationships between features.

It is also possible to use a K-NN imputer to fill in the missing data using the nearest-neighbour approach.
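To make the distortion concrete, the sketch below compares each column's standard deviation before and after mean imputation on the sample data. Every imputed value sits exactly at the column mean, so the spread can only shrink. The variable names are illustrative and not part of the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
                   'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]})

mean_imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(df), columns=df.columns
)

# pandas skips NaN by default, so std_before is the spread of the observed values only
std_before = df.std()
std_after = mean_imputed.std()

print(pd.DataFrame({'std_before': std_before, 'std_after': std_after}))
```

On a dataset with many missing values, this shrinkage can be large enough to affect downstream models that rely on variance or correlation.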

from sklearn.impute import KNNImputer

# Fill each NaN from the 2 most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
knn_imputed_data = knn_imputer.fit_transform(df)
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)

print(knn_imputed_df)

   First  Second
0    1.0     2.5
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.5
7    7.5     8.0
8    9.0     9.0

The KNN imputer fills each missing value with the mean of that feature across the k nearest neighbours, weighted uniformly by default or by distance.
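To see where a KNN-imputed value comes from, the sketch below re-creates the imputation of the missing 'Second' value in row 0 by hand. It is a simplified illustration rather than the library's internal code: it assumes the NaN-aware Euclidean distance from `sklearn.metrics.pairwise` and uniform weights, and the names `donors`, `nearest`, and `manual_value` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

df = pd.DataFrame({'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
                   'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]})
X = df.to_numpy()

# Distances from row 0 to every row, skipping coordinates that are NaN in either row
dist = nan_euclidean_distances(X[[0]], X)[0]

# Only rows that actually have a 'Second' value can donate one
donors = np.flatnonzero(~np.isnan(X[:, 1]))

# Take the 2 closest donors (argsort places NaN distances last) and average their values
nearest = donors[np.argsort(dist[donors])[:2]]
manual_value = X[nearest, 1].mean()

# Compare against what KNNImputer produces for the same cell
knn_value = KNNImputer(n_neighbors=2).fit_transform(df)[0, 1]

print(nearest, manual_value, knn_value)
```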

Lastly, there is the Iterative Imputer, which models each feature with missing values as a function of the other features. It is still an experimental feature in Scikit-learn, so we need to enable it explicitly before importing it.

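# The experimental flag must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
iterative_imputed_data = iterative_imputer.fit_transform(df)
iterative_imputed_df = round(pd.DataFrame(iterative_imputed_data, columns=df.columns), 2)

print(iterative_imputed_df)

   First  Second
0    1.0     1.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     7.0
7    8.0     8.0
8    9.0     9.0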
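The idea of modelling one feature as a function of the others can be illustrated with a single regression step done by hand. The sketch below is an assumption-laden simplification: it uses a plain `LinearRegression` instead of the imputer's own estimator, fits only on fully observed rows, and performs just one pass rather than iterating.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
                   'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9]})

# Rows where both features are observed serve as training data
complete = df.dropna()

# Model 'Second' as a function of 'First'
model = LinearRegression().fit(complete[['First']], complete['Second'])

# Predict 'Second' for the rows where it is missing
missing_second = df['Second'].isna()
predicted = model.predict(df.loc[missing_second, ['First']])

print(np.round(predicted, 2))
```

Because the two columns are identical wherever both are observed, the fitted line recovers the missing values almost exactly, which is why the iterative result looks so clean on this toy data.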

Used properly, an imputer can improve the quality of the data behind your data science project.

Additional Resources

How to Deal with Missing Values in Your Dataset
How to Handle Missing Data with Python
Data Cleaning with Pandas

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
