NumPy with Pandas for More Efficient Data Analysis

Image by jcomp on Freepik

For data people, Pandas is a go-to package for any data manipulation activity because it’s intuitive and easy to use. That’s why many data science curricula include Pandas in their teaching material.

Pandas is built on top of the NumPy package, especially the NumPy array. Many NumPy functions and methods still work well with Pandas objects, so we can use NumPy to effectively improve our data analysis with Pandas.

This article will explore several examples of how NumPy can help our Pandas data analysis experience.

Let’s get into it.

Pandas Data Analysis Improvement with NumPy

Before proceeding with the tutorial, we should have all the required packages installed. If you haven’t done so, you can install Pandas and NumPy using the following code.

pip install pandas numpy

We can start by explaining how Pandas and NumPy are connected. As mentioned above, Pandas is built on the NumPy package. Let’s see how they could complement each other to improve our data analysis.

First, let’s try to create a NumPy array and Pandas DataFrame with the respective packages.

import numpy as np
import pandas as pd

np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pandas_df = pd.DataFrame(np_array, columns=['A', 'B', 'C'])

print(np_array)
print(pandas_df)

Output>>
[[1 2 3]
 [4 5 6]
 [7 8 9]]

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

As you can see in the code above, we can create a Pandas DataFrame from a NumPy array, and it keeps the same dimensional structure.
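The connection also works in the other direction. Here is a small sketch (reusing the same array and column names as above) showing that a DataFrame can be converted back into a NumPy array with the to_numpy method:

```python
import numpy as np
import pandas as pd

np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pandas_df = pd.DataFrame(np_array, columns=['A', 'B', 'C'])

# Convert back: .to_numpy() returns the DataFrame values as a NumPy array
back = pandas_df.to_numpy()

print(back.shape)  # (3, 3)
```

This round trip is handy when a library expects a plain NumPy array rather than a DataFrame.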

Next, we can use NumPy in the Pandas data processing and cleaning steps. For example, we can use the NumPy NaN object as the missing data placeholder.

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 3, 2],
    'C': [1, 2, 3, np.nan, 5]
})
print(df)

Output>>
     A    B    C
0  1.0  5.0  1.0
1  2.0  NaN  2.0
2  NaN  NaN  3.0
3  4.0  3.0  NaN
4  5.0  2.0  5.0

As you can see in the result above, the NumPy NaN object acts as the standard placeholder for missing data in Pandas.

This code can examine the number of NaN objects in each Pandas DataFrame column.

df.isnull().sum()

Output>>
A    1
B    2
C    1
dtype: int64

The data collector may represent missing values in a DataFrame column as strings. If that happens, we can replace those string values with the NumPy NaN object.

df['A'] = df['A'].replace('missing data', np.nan)
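If several different placeholder strings appear in the data, replace also accepts a list. Below is a sketch with a hypothetical column where missing values arrived as the strings 'missing data' and 'N/A' (the exact placeholders depend on your data source):

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing values arrived as strings
df = pd.DataFrame({'A': ['1', 'missing data', '3', 'N/A', '5']})

# Replace every known placeholder string with the NumPy NaN object,
# then convert the column to a numeric dtype
df['A'] = df['A'].replace(['missing data', 'N/A'], np.nan)
df['A'] = pd.to_numeric(df['A'])

print(df['A'].isnull().sum())  # 2
```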

NumPy can also be used for outlier detection. Let’s see how we can do that.

df = pd.DataFrame({
    'A': np.random.normal(0, 1, 1000),
    'B': np.random.normal(0, 1, 1000)
})

df.loc[10, 'A'] = 100
df.loc[25, 'B'] = -100

def detect_outliers(data, threshold=3):
    z_scores = np.abs((data - data.mean()) / data.std())
    return z_scores > threshold

outliers = detect_outliers(df)
print(df[outliers.any(axis=1)])

Output>>
             A           B
10  100.000000    0.355967
25    0.239933 -100.000000

In the code above, we generate random numbers with NumPy and then create a function that detects outliers using the Z-score and the three-sigma rule. The result is a DataFrame containing the outlier rows.
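The Z-score is not the only option. As an alternative, here is a sketch of the interquartile range (IQR) rule built on np.percentile, which is more robust when the data is not normally distributed (the planted value of 100 is an assumption for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'A': rng.normal(0, 1, 1000)})
df.loc[10, 'A'] = 100  # plant an obvious outlier

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(df['A'], [25, 75])
iqr = q3 - q1
outliers = (df['A'] < q1 - 1.5 * iqr) | (df['A'] > q3 + 1.5 * iqr)

print(df[outliers])
```

Note that the 1.5 multiplier flags more points than the three-sigma rule, so expect a handful of ordinary tail values in the output as well.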

We can perform statistical analysis with Pandas, and NumPy can make the aggregation step more efficient. For example, here is statistical aggregation with Pandas and NumPy.

df = pd.DataFrame({
    'Category': [np.random.choice(['A', 'B']) for i in range(100)],
    'Values': np.random.rand(100)
})

print(df.groupby('Category')['Values'].agg([np.mean, np.std, np.min, np.max]))

Output>>
              mean       std      amin      amax
Category
A         0.524568  0.288471  0.025635  0.999284
B         0.525937  0.300526  0.019443  0.999090

Using NumPy, we can apply statistical functions to a Pandas DataFrame and acquire aggregate statistics similar to the output above.
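One caveat: recent Pandas versions deprecate passing raw NumPy functions like np.mean to agg, preferring string names instead. A sketch of the same aggregation in that style (random data, so your numbers will differ):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Category': rng.choice(['A', 'B'], size=100),
    'Values': rng.random(100)
})

# String names avoid the deprecation warning and give cleaner column labels
stats = df.groupby('Category')['Values'].agg(['mean', 'std', 'min', 'max'])
print(stats)
```

The resulting columns are named 'mean', 'std', 'min', and 'max' rather than 'amin' and 'amax'.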

Lastly, we will talk about vectorized operations with Pandas and NumPy. Vectorized operations act on whole arrays at once rather than looping over elements individually, which makes them faster and more memory-efficient.
For example, we can perform element-wise addition operations between DataFrame columns using NumPy.

data = {'A': [15, 20, 25, 30, 35], 'B': [10, 20, 30, 40, 50]}

df = pd.DataFrame(data)
df['C'] = np.add(df['A'], df['B'])

print(df)

Output>>
    A   B   C
0  15  10  25
1  20  20  40
2  25  30  55
3  30  40  70
4  35  50  85
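To see why vectorization matters, here is a small sketch contrasting a Python-level loop with a single NumPy call on larger columns; both produce the same values, but the vectorized call executes in compiled code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(100_000), 'B': np.arange(100_000)})

# Loop version: Python-level iteration over the values (slow)
loop_result = [a + b for a, b in zip(df['A'], df['B'])]

# Vectorized version: one NumPy call over whole columns (fast)
vec_result = np.add(df['A'], df['B'])

print((vec_result == loop_result).all())  # True
```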

We can also transform the DataFrame column via the NumPy mathematical function.

df['B_exp'] = np.exp(df['B'])
print(df)

Output>>
    A   B   C         B_exp
0  15  10  25  2.202647e+04
1  20  20  40  4.851652e+08
2  25  30  55  1.068647e+13
3  30  40  70  2.353853e+17
4  35  50  85  5.184706e+21

There is also the possibility of conditional replacement with NumPy for Pandas DataFrame.

df['A_replaced'] = np.where(df['A'] > 20, df['B'] * 2, df['B'] / 2)
print(df)

Output>>
    A   B   C         B_exp  A_replaced
0  15  10  25  2.202647e+04         5.0
1  20  20  40  4.851652e+08        10.0
2  25  30  55  1.068647e+13        60.0
3  30  40  70  2.353853e+17        80.0
4  35  50  85  5.184706e+21       100.0
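When there are more than two conditions, np.select generalizes np.where: it takes a list of conditions and a matching list of choices, checked in order. A sketch with hypothetical thresholds and labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [15, 20, 25, 30, 35]})

# np.select checks conditions in order; 'default' covers everything else
conditions = [df['A'] < 20, df['A'] < 30]
choices = ['low', 'mid']
df['label'] = np.select(conditions, choices, default='high')

print(df)
```

Here 15 is labeled 'low', 20 and 25 'mid', and 30 and 35 'high'.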

Those are all the examples we have explored. These NumPy functions will undoubtedly help improve your data analysis process.

Conclusion

This article discusses how NumPy can make data analysis with Pandas more efficient. We have performed data preprocessing, data cleaning, statistical analysis, and vectorized operations with Pandas and NumPy.

I hope it helps!

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
