How to Perform Memory-Efficient Operations on Large Datasets with Pandas


Let’s learn how to perform memory-efficient operations on large datasets with Pandas.

Preparation

As we are talking about the Pandas package, you should have it installed. Additionally, we will use the NumPy package, so install them both.

pip install pandas numpy

Now, let’s get into the main part of the tutorial.

Perform Memory-Efficient Operations with Pandas

Pandas is not typically known for processing large datasets, as memory-intensive operations with the package can take too much time or even consume all of your RAM. However, there are ways to make Pandas operations more efficient.

In this tutorial, we will walk you through ways to enhance your experience with large datasets in Pandas.

First, try loading the dataset with memory-optimization parameters. Also, try changing the data types to memory-friendly ones, and drop any unnecessary columns.

import pandas as pd

# Load only the columns we need and give them memory-friendly dtypes
df = pd.read_csv(
    'some_large_dataset.csv',
    low_memory=True,
    dtype={'col1': 'int32'},
    usecols=['col1', 'col2'],
)

Downcasting integer and float columns to the smallest suitable types helps reduce the memory footprint. Applying the category dtype to categorical columns with a small number of unique values also helps, as does keeping only the columns you need.
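For instance, here is a minimal sketch of downcasting and category conversion, assuming hypothetical column names:

import pandas as pd

# Downcast numeric columns to the smallest types that fit the data
df['int_column'] = pd.to_numeric(df['int_column'], downcast='integer')
df['float_column'] = pd.to_numeric(df['float_column'], downcast='float')

# Convert a low-cardinality text column to the category dtype
df['status_column'] = df['status_column'].astype('category')

You can check the effect of these conversions with df.memory_usage(deep=True).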

Next, we can process the data in chunks to avoid using all the memory at once; it is more efficient to process the file iteratively. For example, say we want the mean of a column, but the dataset is too big to load in full. We can process 100,000 rows at a time and combine the partial results.

total_sum = 0
total_count = 0

chunksize = 100000
for chunk in pd.read_csv('some_large_dataset.csv', chunksize=chunksize):
    # Accumulate a running sum and count; averaging per-chunk means would
    # be skewed when the final chunk is smaller than the others
    total_sum += chunk['target_column'].sum()
    total_count += chunk['target_column'].count()

final_result = total_sum / total_count

Additionally, avoid using the .apply method with lambda functions, as applying a Python function row by row can be slow and memory-intensive. It is better to use vectorized operations that work on entire columns at once.

df['new_column'] = df['existing_column'] * 2

For conditional operations in Pandas, it is also faster to use np.where than a lambda function with .apply.

import numpy as np

df['new_column'] = np.where(df['existing_column'] > 0, 1, 0)

Then, using inplace=True in many Pandas operations can be more memory-efficient than assigning the result back to the DataFrame. Assigning it back creates a separate DataFrame before it replaces the one stored in the variable.

df.drop(columns=['column_to_drop'], inplace=True)
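For comparison, the version without inplace assigns a new DataFrame back to the same variable:

df = df.drop(columns=['column_to_drop'])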

Lastly, filter the data as early as possible, before any heavy operations. This limits the amount of data we have to process.

df = df[df['filter_column'] > threshold]
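These tips can also be combined. Below is a minimal sketch, assuming the same hypothetical file, column, and threshold as above, that filters each chunk as it is read so that only the matching rows ever accumulate in memory:

import pandas as pd

chunksize = 100000
filtered_chunks = []

for chunk in pd.read_csv('some_large_dataset.csv', chunksize=chunksize):
    # Keep only the rows that pass the filter before storing the chunk
    filtered_chunks.append(chunk[chunk['filter_column'] > threshold])

df = pd.concat(filtered_chunks, ignore_index=True)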

Master these tips to improve your Pandas experience with large datasets.

Additional Resources

  • How to Remove Duplicates in Large Datasets
  • 7 Ways to Handle Large Data Files for Machine Learning
  • How to Load Large Datasets From Directories for Deep Learning in Keras

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing. Cornellius writes on a variety of AI and machine learning topics.
