ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset
Image by Author

Analyzing a dataset to understand it better and gain insights is an important step in any data science project. And ChatGPT can help simplify and speed up such analysis.

Here we’ll perform exploratory data analysis on a sample dataset using ChatGPT. We’ll understand distributions of features, how a feature of interest varies with other features, and much more.

To follow along and access ChatGPT, you need an OpenAI account. If you don't have one, you can sign up for a free account.

Loading the Dataset and Getting Basic Info

The seaborn library has some ready-to-use datasets that we can analyze. We’ll use the tips dataset.

Prompt: Tell me more about the `tips` dataset built into the seaborn library.

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

Knowing what the dataset is about is a good starting point. Let's start by doing the following:

Load the dataset into a pandas dataframe, and
Get basic info and summary statistics of the columns to understand more about the dataset.

Prompt: Let’s use the `tips` dataset. Write the code to read the dataset into a pandas dataframe and get the basic info and summary statistics of the available fields.

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

import seaborn as sns  import pandas as pd    # Load the 'tips' dataset from Seaborn  tips_data = sns.load_dataset('tips')    # Create a Pandas DataFrame from the loaded dataset  tips_df = pd.DataFrame(tips_data)    # Display the first few rows of the DataFrame  print("First few rows of the 'tips' dataset:")  print(tips_df.head())    # Get basic information about the fields  print("nInformation about the 'tips' dataset:")  print(tips_df.info())    # Get summary statistics of the numeric fields  print("nSummary statistics of the numeric fields:")  print(tips_df.describe())

We see that we have quite an informative summary of the dataset:

First few rows of the 'tips' dataset:     total_bill   tip     sex smoker  day    time  size  0       16.99  1.01  Female     No  Sun  Dinner     2  1       10.34  1.66    Male     No  Sun  Dinner     3  2       21.01  3.50    Male     No  Sun  Dinner     3  3       23.68  3.31    Male     No  Sun  Dinner     2  4       24.59  3.61  Female     No  Sun  Dinner     4    Information about the 'tips' dataset:    RangeIndex: 244 entries, 0 to 243  Data columns (total 7 columns):   #   Column      Non-Null Count  Dtype     ---  ------      --------------  -----      0   total_bill  244 non-null    float64    1   tip         244 non-null    float64    2   sex         244 non-null    category   3   smoker      244 non-null    category   4   day         244 non-null    category   5   time        244 non-null    category   6   size        244 non-null    int64     dtypes: category(4), float64(2), int64(1)  memory usage: 7.4 KB  None    Summary statistics of the numeric fields:         total_bill         tip        size  count  244.000000  244.000000  244.000000  mean    19.785943    2.998279    2.569672  std      8.902412    1.383638    0.951100  min      3.070000    1.000000    1.000000  25%     13.347500    2.000000    2.000000  50%     17.795000    2.900000    2.000000  75%     24.127500    3.562500    3.000000  max     50.810000   10.000000    6.000000

From the summary statistics, we have an idea of the numerical features in the dataset. We know the minimum and maximum values, mean and median values, and percentile values for the numerical features. There are no missing values so we can proceed with the next steps.

Exploring the Dataset – The What, the Why, and the How

Now that we have an idea of the dataset, let's go further.

The goal of this exploratory data analysis is to understand the tipping behavior better. For this we can come up with helpful visualizations. These should help us understand the relationship of the tip amount to the various categorical variables in the dataset.

Because this is a simple dataset to analyze, let's prompt ChatGPT to give us a set of steps to go about analyzing this data set further.

Prompt: The goal of the analysis is to get some insights into the tipping behavior, starting with some simple visualizations. Suggest steps for the same.

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

The data exploration steps suggested by ChatGPT all seem valid. So we will use these steps—to better understand a dataset—one step at a time. We’ll prompt ChatGPT to generate code, try to run the generated code, and modify them as needed.

Exploring the Distribution of Tip Amounts

As a first step, let's visualize the distribution of the tip amount prompt.

Prompt: Write the code to plot the distribution of tip amounts.

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

ChatGPT gave the following code to generate a histogram and a kernel density plot that help us get an idea of the distribution of the tip amounts:

import matplotlib.pyplot as plt    # Create a histogram of tip amounts  plt.figure(figsize=(8, 6))  sns.histplot(data=tips_data, x='tip', kde=True)  plt.title("Distribution of Tip Amounts")  plt.xlabel("Tip Amount")  plt.ylabel("Frequency")  plt.show()

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset Analyzing Tipping Behavior Based on Categorical Variables

We have many categorical variables in the dataset. It will help to have a simple bar plot showing the average tip amount for the different values of these category variables.

Prompt: Use all the available categorical variables (‘sex’, ‘smoker’, ‘day’, and ‘time’). Write the code to generate bar plots to understand tipping behavior. Use subplots so we can have all the four bar plots in a single figure.

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

And here’s the code. The code ran without errors; I only had to modify the figure size:

# Define the categorical variables to analyze  categorical_vars = ['sex', 'smoker', 'day', 'time']    # Create subplots for each categorical variable  fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 5))  fig.suptitle("Tipping Behavior based on Categorical Variables", fontsize=16)    # Generate bar plots for each categorical variable  for ax, var in zip(axes.flatten(), categorical_vars):      sns.barplot(data=tips_data, x=var, y='tip', ax=ax)      ax.set_xlabel(var.capitalize())      ax.set_ylabel("Average Tip Amount")    plt.tight_layout()  plt.show()

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

From the plots, we see that features like sex and smoking behavior don’t influence tipping behavior (which is expected). While days and times seem to. The average tip amount on weekends and dinner seem to be slightly higher.

Visualizing the Relationship Between Total Bill and Tip Amount

Now, let’s see how the total bill influences the tip amount paid.

Prompt: I’d like to understand the relationship between total bill and the tip amount. Please give me the code to generate a suitable plot for this. I believe a simple scatter plot will be helpful.

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

Here’s the code to generate the required scatter plot:

# Create a scatter plot of total bill vs. tip amount  plt.figure(figsize=(6, 4))  sns.scatterplot(data=tips_data, x='total_bill', y='tip')  plt.title("Total Bill vs. Tip Amount")  plt.xlabel("Total Bill")  plt.ylabel("Tip Amount")  plt.show()

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

We see that there’s some positive correlation between the total bill and the tip amounts.

Visualizing the Relationship Between Party Size and Tip Amount

As a next step, let’s try visualizing the relationship between the party size and the tip amount.

Prompt: Now, I'd like to understand how the tip amount varies with the number of people who dine (party size). Which is a better choice: a bar plot or a violin plot?

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

As we’d also like to understand the distribution of the tip amounts, let’s go ahead with creating a violin plot.

Prompt: Cool! Please write the code to generate a violin plot for this visualization.

And here’s the code:

# Create a violin plot for tip amount by party size  plt.figure(figsize=(6, 4))  sns.violinplot(data=tips_data, x='size', y='tip')  plt.title("Tip Amount by Party Size")  plt.xlabel("Party Size")  plt.ylabel("Tip Amount")  plt.show()

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset Understanding Tipping Behavior Based on Both Time and Day

Next, let’s understand how the time and the day influence the tipping behavior.

Prompt: I want to understand how the tipping behavior varies depending on both time and day. What is a good visualization? Also, write the code to generate the required visualization.

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

This code snippet shows how to pivot using the day as the index and aggregate the mean value of the tip amounts for lunch and dinner (the time of the day):

# Create a pivot table of average tip amount by time and day  pivot_table = tips_data.pivot_table(values='tip', index='day', columns='time', aggfunc='mean')    # Create a heatmap of tipping behavior based on time and day  plt.figure(figsize=(8, 6))  sns.heatmap(pivot_table, cmap='YlGnBu', annot=True, fmt=".2f", cbar=True)  plt.title("Tipping Behavior based on Time and Day")  plt.xlabel("Time")  plt.ylabel("Day")  plt.show()

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

From the heat map, we see that the average tip amounts are marginally higher for dinner on Sunday than other time-day pairs. In general, for a given day, on an average, tip amounts for dinner seem to be higher than those for lunch.

We also see that there is not enough data for lunch on Saturdays and Sundays. Because we only have a total of 244 records, we may not have any record for lunches on weekends. Let’s go ahead and verify this.

Prompt: I want to get the number of parties who ordered lunch on weekends. Write the code to filter the `tips_data` dataframe and get this count.

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

Here’s the code to filter the dataframe and obtain a subset containing the records for lunches on weekends:

# Filter the data for lunch on weekends  weekend_lunch_data = tips_data[(tips_data['time'] == 'Lunch') & (tips_data['day'].isin(['Sat', 'Sun']))]    # Get the count of parties who ordered lunch on weekends  count = weekend_lunch_data['size'].count()  print("Number of parties who ordered lunch on weekends:", count)

And we see that we don’t have any records for the same, so our analysis is correct thus far:

Number of parties who ordered lunch on weekends: 0

And that’s a wrap! We explored the `tips` dataset and generated some helpful visualizations by prompting ChatGPT.

Wrapping Up

In this article, we learned how to leverage ChatGPT for data exploration. If you’re interested in integrating ChatGPT into your data science workflow, check out this guide. It walks through an example project—along with tips and best practices—to effectively use ChatGPT for data science experiments.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.

ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset

More On This Topic

Latest stories

CMS Uses Machine Learning to Fully Reconstruct LHC Collisions

LANL: AI Accelerates Elucidation of Nuclear Forces with Explosive Neutron...

PNNL: Integrating AI into Biological Research

Rick Stevens on the Genesis Mission and the Future of...

Inside the DOE’s 26 AI Challenges for Genesis Mission

You might also like...

CMS Uses Machine Learning to Fully Reconstruct LHC Collisions

LANL: AI Accelerates Elucidation of Nuclear Forces with Explosive Neutron Star Data

PNNL: Integrating AI into Biological Research