ChatGPT offers Advanced Data Analysis, a powerful tool that helps users crunch data and search for insights. Over the past year, I've had the opportunity to use it to quite good effect. Other AIs are starting to offer similar tools, but so far, they don't seem to have the power of ChatGPT's Advanced Data Analysis feature.
Data analysis has been part of computing's kit bag for decades, so AI offers, in a sense, more of the same. Raw data goes in, and some conclusion comes out.
Here's what makes the AI different in a context that you can put to use right away: AI can do heavy computational analytics tasks with just a few prompts — getting results that previously would require an experienced programmer or data scientist to produce.
I'm a programmer who's capable of doing such computational analytics; in fact, I enjoy writing data-crunching routines. I also love charts and tables to a degree that likely borders on the pathological. It's a condition that hovers somewhere between kind of cute and deeply worrisome, according to certain family members.
But my point here is that even I — someone with obsessive analytics programming chops — found that ChatGPT was helpful and could produce results in minutes that would take days or weeks of programming.
In this article, I'll introduce you to some of the projects I gave to ChatGPT and explain why they're relevant to your data needs. Then I'll review the lessons I learned regarding how to think about using ChatGPT for data analysis. Finally, I'll help you explore ways to work around the more unpredictable limitations in this generation of AI.
First, a bit of context
We'll be looking at chatbot-based data analysis only.
There's no doubt that AI and big data can work together using APIs and complex programs to do some amazing things. For example, cybersecurity professionals can now combine AI tools and deep automation to filter tremendous amounts of real-time data at rates far faster than humans can process. That approach typically relies on various APIs to link the systems together.
We're not doing that here. We're limiting the data analysis to what you can upload to or paste into ChatGPT's end-user-focused web interface (and its Mac and Windows apps).
Also, when I'm talking about ChatGPT, I'll generally be referring to the $20/month Plus subscription version. When Advanced Data Analysis was first introduced, its capabilities were available only to Plus users. File uploads were also limited to Plus users.
Today, both features are available to free users, although the new integrated web search is limited to Plus users. It's hard to nail down precisely which features are in ChatGPT Plus vs. the free version, other than OpenAI describing the free version as "limited."
To do most of its data analysis, ChatGPT uses a feature called Advanced Data Analysis. In earlier versions of ChatGPT, you had to turn it on and off in settings. Now, it's just part of the chatbot.
In my experience, I could run a few prompts by the free ChatGPT before being told I'd overstayed my welcome and asked to come back later. The Plus version lets me keep interacting with the AI for (mostly) as long as I want. I did get cut off once when I conducted a very lengthy, rapid-fire interrogation of the AI, which — after quite a while — threw up its virtual hands and exclaimed, "Enough, already!" But, you know, that's rare.
Feel free to get as far as you can on the free account. But do be aware that for some more serious work, you might be asked to pay up.
And with that, let's dive in.
Easy data normalization
One of my earliest tests of ChatGPT's data analysis capabilities was asking it to chart city populations. Working off of data in its internal knowledge base, the AI gave me functional results, although some of the data was inaccurate. Because the knowledge base had a training cutoff date, the city population data was a few years out of date.
Using just the knowledge in ChatGPT's knowledge base, its data analysis was a mere parlor trick. But once ChatGPT could upload actual data files, that changed.
My next test also used out-of-date information: New York City baby names from 2011-2014. I didn't care that it was out-of-date because I was only using it as test data to get to know the tool. It was a 69,215-record data set of baby names, ethnicity, gender, and overall count.
Also: How to use ChatGPT to make charts and tables with Advanced Data Analysis
From this data, I was able to do easy calculations, such as determining the ratio of boy babies to girl babies. I was also able to determine the most popular baby names. And I was able to use the fairly limited ethnicity data to chart names based on ethnicity.
However, there were problems with how ChatGPT interpreted the data. It assumed that capitalization was relevant, so it considered "Madison" and "MADISON" to be different names. It was also confused by abbreviated values, so "Hisp" and "Hispanic" were treated as different ethnicities.
What makes ChatGPT so powerful, especially for non-programmers, is that you can fix these problems with a prompt. Something like "For all the following requests, baby names should be case insensitive" can get you there with no programming.
The key trick is to keep in mind that data normalization may need to be embedded in your prompts. Don't be afraid to copy a previous prompt and add a sentence that addresses how to adjust the data for consistency.
Your job will be to pay diligent attention to the results provided by the AI and make prompt adjustments for areas where it doesn't get it right. Think of it as if you were getting an assignment back from a student in your class. You'd look at their work and make corrections. Same with the AI.
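For readers who prefer to do this cleanup themselves before uploading, the case-folding and alias-merging that a prompt like the one above asks for can be sketched in a few lines of Python. The sample records and the alias table below are invented for illustration, not the actual NYC data set.

```python
# A minimal sketch of normalizing name capitalization and ethnicity
# abbreviations before analysis. The records and ETHNICITY_ALIASES map
# are illustrative placeholders, not the real baby-names data.

from collections import Counter

ETHNICITY_ALIASES = {
    "HISP": "HISPANIC",
    "WHITE NON HISP": "WHITE NON HISPANIC",
    "BLACK NON HISP": "BLACK NON HISPANIC",
}

def normalize(record):
    name, ethnicity, count = record
    name = name.strip().title()              # "MADISON" and "Madison" merge
    ethnicity = ethnicity.strip().upper()
    ethnicity = ETHNICITY_ALIASES.get(ethnicity, ethnicity)
    return name, ethnicity, count

records = [
    ("MADISON", "Hisp", 100),
    ("Madison", "Hispanic", 50),
]

totals = Counter()
for name, ethnicity, count in map(normalize, records):
    totals[(name, ethnicity)] += count

print(totals)  # both rows collapse into a single ("Madison", "HISPANIC") bucket
```

A prompt instruction accomplishes the same thing without the code, but doing it yourself once makes it easier to spot when the AI hasn't.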
Quick business insights
Here's the moment I became convinced that ChatGPT's data analysis capabilities are game-changing: I performed a sentiment analysis project with my own proprietary business data.
I have a small freemium software business with a few products. This business model allows users to download and use a fully functional software product for free. The revenue comes from selling add-ons that increase the capability of the base software. I don't make a ton from this side business, but it keeps my coding chops up and helps pay for my slight power tool addiction.
I gather data based on when users uninstall the free product. If a user stops using a product, they're presented with a simple questionnaire asking about typical reasons for uninstalling, as well as a fill-in field where they can share any thoughts.
I'd been gathering the data on my server for years but had never gotten around to doing anything with it. I could never justify the time it would take to write all the custom code necessary to turn that gathered information into useful insights.
However, after a fairly substantial update to one of the products, I was concerned about whether users were satisfied with the update or were put off by the changes. I wanted to see what my uninstall data could tell me, and whether there had been any change in pattern before and after that update.
But I still didn't have the week or two it would take me to code an analysis. So I turned to ChatGPT. I pulled the data off the server and spent some time in Excel cleaning up the raw data so it could be read by the AI. That involved giving columns field names, removing garbage delimiters, etc. Then I uploaded the cleaned file to ChatGPT.
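As a sketch of what that Excel cleanup looked like, here's the same idea in Python: naming the columns and stripping stray delimiter characters. The field names and sample lines are invented for illustration; my real uninstall data has a different schema.

```python
# A sketch of pre-upload cleanup: assign field names to raw pipe-delimited
# export lines and remove garbage "||" delimiters. FIELDS and the raw
# sample are hypothetical, not the actual server data.

import csv
import io

FIELDS = ["date", "version", "reason", "comment"]   # hypothetical schema

raw = """2024-01-05|v3.1|too complex|"Liked it, but ||confusing"
2024-01-06|v3.1|found alternative|Switched tools
"""

cleaned = []
for line in raw.splitlines():
    line = line.replace("||", " ")                  # strip garbage delimiters
    parts = [p.strip().strip('"') for p in line.split("|")]
    cleaned.append(dict(zip(FIELDS, parts)))

# Write a proper CSV with a header row, ready to upload to ChatGPT
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(cleaned)
print(buf.getvalue())
```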
Also: The moment I realized ChatGPT Plus was a game-changer for my business
This is where I need to point out that I gave ChatGPT proprietary business data. In my case, it's a side hustle, so I don't care what — if anything — OpenAI does with the data. But you need to consider your corporate policy for data sharing before you upload your proprietary data into the cloud, especially an AI that might use it for training data.
In any case, ChatGPT was able to crunch my data. Not only did it handle the basic static field content, but it was also able to analyze the freeform comments written by users. It sliced and diced the results based on time, version, and product. From that, I was able to ascertain that there was no increase in either uninstalls or dissatisfaction following my update.
I wrote two prompts, essentially two sentences, and ChatGPT churned through 22,797 records and gave me a conclusion I didn't expect: Users demonstrated "a slight increase in positive sentiment" after the update.
This is incredibly powerful stuff. Setting aside the issue of whether you can share proprietary data with the AI, we now have a mechanism whereby non-programmers can perform amazing analytical feats in minutes. With that kind of ability in hand, it becomes worth digging through all those data silos where useful data has been stored but never examined, to see what insights might be unearthed by some quick queries.
Informative diagnostic forensics
Let's dig in for one more example. (You can read the full details of the project in this article.) The basic problem statement is that my 3D printer exhibits some inconsistent performance patterns.
This project involved feeding in two files of G-code, the machine movement instructions for producing the 3D prints I was evaluating. I created one of the files through normal practices, and the other was the factory-provided version.
Also: Why data is the Achilles Heel of AI (and every other business plan)
Their version took 16 minutes to print. Mine took 42 minutes. I wanted to know why. My outreach to the company yielded zero clarity, so I turned to ChatGPT for an assist.
Together, the two files consisted of more than 170,000 lines of code, mostly the X and Y coordinates of where the print head was to move. There were also some temperature and flow settings, along with occasional instructions to move the build plate so the machine could lay down another layer of molten plastic.
But why was one version so much faster than the other? I uploaded both files to ChatGPT and asked it to compare the files. Then I asked a simple question: Why is "fast print" so much faster?
That's it. Can you imagine the programming that would be necessary to do that kind of data analysis? All I did was ask a simple question. This is huge.
Also: How ChatGPT scanned 170k lines of code in seconds, saving me hours of work
Once I got an answer (basically, the factory played with movement speeds and material flow rates), I asked some simple follow-up questions. Once again, ChatGPT churned through a couple of giant data files and fed me back actionable information.
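If you wanted to verify that conclusion yourself, a rough first pass might pull the feed-rate (speed) settings out of each file and compare them. This assumes standard G-code, where "F" sets movement speed in mm/min; the sample lines below are placeholders, not my actual print files.

```python
# A rough sketch of comparing two G-code files' speed settings, assuming
# standard G-code where "F" parameters set feed rate in mm/min. The
# "slow" and "fast" samples stand in for the real uploaded files.

import re
from statistics import mean

FEED_RE = re.compile(r"\bF(\d+(?:\.\d+)?)")

def feed_rates(gcode_lines):
    """Collect every feed-rate setting, ignoring ';' comments."""
    return [float(m.group(1))
            for line in gcode_lines
            for m in FEED_RE.finditer(line.split(";")[0])]

slow = ["G1 X10 Y10 F1200", "G1 X20 Y10 E0.5", "G1 X20 Y20 F1500"]
fast = ["G1 X10 Y10 F3000", "G1 X20 Y10 E0.4", "G1 X20 Y20 F4800"]

print(f"my file:      avg feed {mean(feed_rates(slow)):.0f} mm/min")
print(f"factory file: avg feed {mean(feed_rates(fast)):.0f} mm/min")
```

The point, of course, is that ChatGPT did this kind of comparison (and the flow-rate equivalent) from a one-sentence prompt.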
Here's another important takeaway: The data you feed to ChatGPT doesn't need to be traditional form data with rows and columns. If you have data and it's in a file (or two), feel free to give it to the AI and see what it can discern from the raw files.
We've had database engines for years. But the AI is a lot more than that. It's like having an army of data prep interns and data scientists who can manage the data, find patterns, and produce insights. With results so fast and prompting so easy, there's no reason not to try the AI on any file of information you want to explore. I mean, why not?
Token limit
One of the interesting discussions I've had with ZDNet commenters concerns the question of token limits. AIs don't measure the quantity of information they're asked to process in lines or characters, but in tokens: the chunks of text (roughly word fragments) that the model actually processes.
According to ChatGPT itself, its token limit for GPT-3.5 was approximately 4K tokens per interaction. GPT-4 increased that to 8K tokens per interaction. GPT-4o is a bit more flexible, with a token limit of 128K for "extended interactions," but only 8K or 32K for most interactions.
What sent one of my commenters into a particularly amusing tizzy was the question of how I could get ChatGPT to process 170,000 lines of input data when the token limitations seemed to be a lot less.
I asked ChatGPT the question and was informed that tokens aren't one-for-one matches to commands or lines. Also, ChatGPT told me it processes files in chunks and does selective processing. So it does manage the data through mechanisms more nuanced than simple brute force scanning.
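You can sanity-check whether a file is anywhere near those limits using OpenAI's rule of thumb that one token is roughly four characters of English text (the company's tiktoken library gives exact counts; this sketch avoids the dependency):

```python
# A back-of-the-envelope token estimate, using OpenAI's published rule
# of thumb of ~4 characters per token for English text. The 128K limit
# and the synthetic G-code sample are stand-ins for a real scenario.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits(text: str, limit: int = 128_000) -> bool:
    return estimate_tokens(text) <= limit

sample = "G1 X10.5 Y20.3 F1500\n" * 170_000   # ~170k lines of G-code
print(estimate_tokens(sample))                 # far more than 128K tokens
print(fits(sample))                            # False -- hence the chunking
```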
My way of testing whether ChatGPT had processed the full file was to ask questions about the whole thing, from how many lines it contained to describing data at the very beginning and the very end, and then checking whether those answers matched what I saw in the file.
But with all that said, don't be afraid to use a spreadsheet program to pre-process data before uploading it. You can use this to normalize data for more predictable responses from the AI, to remove any distracting data elements you don't want the AI to consider, and — possibly — to segment the data if the chatbot gets overwhelmed by the amount of data it's been presented.
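If you'd rather segment a large file with a script than a spreadsheet, a minimal sketch might split a CSV into pieces, repeating the header so each piece stands on its own. The chunk size and file-naming scheme here are arbitrary choices, not anything ChatGPT requires.

```python
# A sketch of splitting a CSV into upload-sized pieces, each carrying
# the header row. CHUNK_ROWS and the ".partN.csv" naming are arbitrary
# illustrative choices.

import csv

CHUNK_ROWS = 5_000

def split_csv(path, chunk_rows=CHUNK_ROWS):
    """Split path into path.part1.csv, path.part2.csv, ... Returns part count."""
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows, part = [], 0
        for row in reader:
            rows.append(row)
            if len(rows) == chunk_rows:
                part += 1
                _write_part(path, part, header, rows)
                rows = []
        if rows:
            part += 1
            _write_part(path, part, header, rows)
    return part

def _write_part(path, part, header, rows):
    with open(f"{path}.part{part}.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
```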
Also: How to use ChatGPT to write Excel formulas
You can use ChatGPT to help you create the Excel formulas necessary to refactor and normalize your data. ChatGPT can also provide support for Google Sheets and Apple Numbers, so no matter which spreadsheet tool you choose, you don't have to go it alone.
10 quick tips
ChatGPT is surprisingly powerful and can provide insights that otherwise would have taken a lot of work and programming skills. I'll wrap up this overview with 10 quick tips based on my experience chatting with ChatGPT.
- When in doubt, try it. There's no harm or added cost in throwing data at ChatGPT and seeing what it will tell you.
- Data doesn't have to be represented only in rows and columns. You can feed ChatGPT full-text input, and even PDFs.
- Always double-check its results. The easiest way to do this is to just ask a lot of questions about the data and see what it says. Of course, be sure to look at the data yourself to confirm.
- Ask "What can you tell me about this data?" It's a great open-ended question that can get ChatGPT started giving you insights and is a jumping-off point for additional analysis.
- "Show your work" is another powerful way to get insights into how ChatGPT looks at and processes your data.
- If you don't get what you want, try again. Rephrase your prompt. Simplify it. Add details and constraints. Your first question may not be precise enough, but after a few iterations, you'll probably get good results.
- Ask for charts and tables. ChatGPT won't always produce them automatically, but they can yield very powerful insights.
- ChatGPT makes mistakes, but it doesn't get upset if you tell it so. If you don't think the results the AI gave you are correct, tell it, and ask it to rethink its approach.
- Slowly build up your analysis. You can copy and paste previous prompts, adding more specificity and instruction as ChatGPT proves it did earlier steps correctly.
- Have fun. There's nothing quite as much fun as feeding a giant data set to an AI and having it spit back cool charts and tables.
So there you go. Give ChatGPT data analysis a try. I'm sure you'll discover insights hidden in them thar data hills.
What about you?
Have you used ChatGPT to do advanced data analysis? Are you a ChatGPT Plus subscriber? Have you gotten any interesting insights or fed it any data or files you think were particularly interesting uses of the technology? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.