I pitted Claude 3.5 Sonnet against AI coding tests ChatGPT aced – and it failed creatively

Last week, I got an email from Anthropic announcing that Claude 3.5 Sonnet was available. According to the AI company, "Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations."

The company added: "Claude 3.5 Sonnet is ideal for complex tasks like code generation." I decided to see if that was true.

I'll subject the new Claude 3.5 Sonnet model to my standard set of coding tests — tests I've run against a wide range of AIs with a wide range of results. Want to follow along with your own tests? Point your browser to How I test an AI chatbot's coding ability – and you can too, which contains all the standard tests I apply, explanations of how they work, and what to look for in the results.

OK, let's dig into the results of each test and see how they compare to previous tests using Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.

1. Writing a WordPress plugin

At first, this seemed to have so much promise. Let's start with the user interface Claude 3.5 Sonnet created based on my test prompt.

This is the first time an AI has decided to put the two data fields side-by-side. The layout is clean and looks great.

Claude also decided to do something else I've never seen an AI do. This plugin can be created using just PHP code, which is the code running at the back end of a WordPress server.

But some AI implementations also have added JavaScript code (which runs in the browser to control dynamic user interface features) and CSS code (which controls how the browser displays information).

In a PHP environment, if you need PHP, JavaScript, and CSS, you can either include the CSS and JavaScript right in the PHP code (that's a feature of PHP), or you can put the code in three separate files — one for PHP, one for JavaScript, and one for CSS.

Usually, when an AI wants to use all three languages, it shows what needs to be cut and pasted into the PHP file, then another block to be cut and pasted into a JavaScript file, and then a third block to be cut and pasted into a CSS file.

But Claude provided just one PHP file that, when it ran, auto-generated the JavaScript and CSS files in the plugin's home directory. This is both fairly impressive and somewhat wrong-headed. It's cool that it tried to make the plugin creation process easier, but whether a plugin can write to its own folder depends on how the server's operating system is configured, and there's a very high chance it could fail.

I allowed it in my testing environment, but I'd never allow a plugin to rewrite its own code in a production environment. That's a very serious security flaw.
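To illustrate why runtime file generation is fragile, here is a minimal sketch of the check such code would need before writing anything. It is written in Python purely for illustration (the plugin itself is PHP), and the function name `write_assets` and the file names are hypothetical, not taken from Claude's output.

```python
import os
from pathlib import Path


def write_assets(plugin_dir: str, assets: dict[str, str]) -> list[str]:
    """Write generated asset files into plugin_dir if it is writable.

    Returns the names of the files written. An empty list signals that the
    caller should fall back to inline CSS/JavaScript instead.
    """
    target = Path(plugin_dir)
    # os.access reports whether the current process may write to the directory.
    if not target.is_dir() or not os.access(target, os.W_OK):
        return []
    written = []
    for name, content in assets.items():
        (target / name).write_text(content, encoding="utf-8")
        written.append(name)
    return written
```

Even with a check like this, the write succeeds only where the server permits it, and it does nothing to address the security concern. Shipping the JavaScript and CSS as static files avoids the problem entirely.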

Despite the fairly creative nature of Claude's code generation solution, the bottom line is that the plugin failed. Pressing the Randomize button does absolutely nothing. That's sad because, as I said, it had so much promise.

Here are the aggregate results of this and previous tests:

Claude 3.5 Sonnet: Interface: good, functionality: fail
ChatGPT GPT-4o: Interface: good, functionality: good
Microsoft Copilot: Interface: adequate, functionality: fail
Meta AI: Interface: adequate, functionality: fail
Meta Code Llama: Complete failure
Google Gemini Advanced: Interface: good, functionality: fail
ChatGPT 4: Interface: good, functionality: good
ChatGPT 3.5: Interface: good, functionality: good

2. Rewriting a string function

This test is designed to evaluate how well the AI rewrites code to better fit a given need; in this case, dollars-and-cents conversions.

The Claude 3.5 Sonnet revision properly removed leading zeros, making sure that entries like "000123" are treated as "123". It properly allows integers and decimals with up to two decimal places (which is the key fix the prompt asked for). It prevents negative values. And it's smart enough to return "0" for any weird or unexpected input, which prevents the code from abnormally ending in an error.

One failure is that it won't allow decimal values alone to be entered. So if the user entered 50 cents as ".50" instead of "0.50", the code would reject the entry. Based on how the original text description for the test is written, it should have allowed this input form.

Although most of the revised code worked, I have to count this as a fail because if the code were pasted into a production project, users would not be able to enter inputs that contained only values for cents.
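To make the requirements concrete, here is a minimal Python sketch of a validator that satisfies all of the rules described above, including the cents-only case. The test's actual code isn't reproduced here, so the function name `normalize_amount` and the regular expression are assumptions for illustration, not the code Claude produced.

```python
import re

# Optional dollar digits, then an optional decimal point followed by one or
# two cents digits. Hypothetical pattern, not the one from the test.
_AMOUNT = re.compile(r"(\d*)(?:\.(\d{1,2}))?")


def normalize_amount(raw: str) -> str:
    """Return a cleaned dollars-and-cents string, or "0" for invalid input."""
    match = _AMOUNT.fullmatch(raw.strip())
    # Reject non-matches (negatives, letters, three or more decimals) and
    # inputs that contain no digits at all, such as "".
    if match is None or not (match.group(1) or match.group(2)):
        return "0"
    # Strip leading zeros; a cents-only entry gets a single leading zero.
    dollars = match.group(1).lstrip("0") or "0"
    cents = match.group(2)
    return f"{dollars}.{cents}" if cents is not None else dollars
```

With this approach, "000123" becomes "123", both ".50" and "0.50" come back as "0.50", and "-5", "1.234", and "abc" all fall through to "0".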

Here are the aggregate results of this and previous tests:

Claude 3.5 Sonnet: Failed
ChatGPT GPT-4o: Succeeded
Microsoft Copilot: Failed
Meta AI: Failed
Meta Code Llama: Succeeded
Google Gemini Advanced: Failed
ChatGPT 4: Succeeded
ChatGPT 3.5: Succeeded

3. Finding an annoying bug

The big challenge of this test is that the AI is tasked with finding a bug that's not obvious and that can only be solved correctly with knowledge of the WordPress platform. It's also a bug I did not immediately see on my own and originally asked ChatGPT to solve (which it did).

Claude not only got this right — catching the subtlety of the error and correcting it — but it was also the first AI since I published the full set of tests online to catch the fact that the publishing process introduced an error into the sample query (which I subsequently fixed and republished).

Here are the aggregate results of this and previous tests:

Claude 3.5 Sonnet: Succeeded
ChatGPT GPT-4o: Succeeded
Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
Meta AI: Succeeded
Meta Code Llama: Failed
Google Gemini Advanced: Failed
ChatGPT 4: Succeeded
ChatGPT 3.5: Succeeded

So far, we're at two out of three fails. Let's move on to our last test.

4. Writing a script

This test is designed to see how far the AI's programming knowledge goes into specialized programming tools. While AppleScript is fairly common for scripting on Macs, Keyboard Maestro is a commercial application sold by a lone programmer in Australia. I find it indispensable, but it's just one of many such apps on the Mac.

However, when I ran this test with ChatGPT, it knew how to "speak" Keyboard Maestro as well as AppleScript, which shows how broad its programming language knowledge is.

Unfortunately, Claude does not have that knowledge. It did write an AppleScript that attempted to speak to Chrome (that's one of the test's requirements), but it ignored the essential Keyboard Maestro component.

Worse, it generated AppleScript code that fails with a syntax error. In an attempt to ignore case for the match in the test, Claude generated the line:

if theTab's title contains input ignoring case then

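This is pretty much a double error. AppleScript string comparisons such as "contains" are case-insensitive by default, so no modifier was needed, and "ignoring case" is not a clause that can trail a comparison: it belongs in an ignoring case ... end ignoring block wrapped around the statements it affects. The misplaced phrase caused the script to error out with an "Ignoring can't go after this" syntax error message.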

Here are the aggregate results of this and previous tests:

Claude 3.5 Sonnet: Failed
ChatGPT GPT-4o: Succeeded but with reservations
Microsoft Copilot: Failed
Meta AI: Failed
Meta Code Llama: Failed
Google Gemini Advanced: Succeeded
ChatGPT 4: Succeeded
ChatGPT 3.5: Failed

Overall results

Here are the overall results of the four tests:

Claude 3.5 Sonnet: 1 out of 4 succeeded
ChatGPT GPT-4o: 4 out of 4 succeeded, but there's that one weird dual-choice answer
Microsoft Copilot: 0 out of 4 succeeded
Meta AI: 1 out of 4 succeeded
Meta Code Llama: 1 out of 4 succeeded
Google Gemini Advanced: 1 out of 4 succeeded
ChatGPT 4: 4 out of 4 succeeded
ChatGPT 3.5: 3 out of 4 succeeded
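As a quick cross-check, the totals can be recomputed from the four per-test result lists. This Python sketch assumes that the plugin test counts as a success only when functionality was rated good, and that GPT-4o's "succeeded but with reservations" counts as a success; the `tally` helper is hypothetical.

```python
# Per-test outcomes copied from the result lists above (True = succeeded).
# Order: WordPress plugin, string function, bug hunt, script.
results = {
    "Claude 3.5 Sonnet":      [False, False, True,  False],
    "ChatGPT GPT-4o":         [True,  True,  True,  True],
    "Microsoft Copilot":      [False, False, False, False],
    "Meta AI":                [False, False, True,  False],
    "Meta Code Llama":        [False, True,  False, False],
    "Google Gemini Advanced": [False, False, False, True],
    "ChatGPT 4":              [True,  True,  True,  True],
    "ChatGPT 3.5":            [True,  True,  True,  False],
}


def tally(outcomes):
    """Count the successes for each AI."""
    return {name: sum(passed) for name, passed in outcomes.items()}


for name, score in tally(results).items():
    print(f"{name}: {score} out of 4 succeeded")
```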

I was somewhat bummed about Claude 3.5 Sonnet. The company specifically promised that this version was suited to programming. But as you can see, not so much. It's not that it can't program. It just can't program correctly.

I keep looking for an AI that can best the ChatGPT solutions, especially as platform and programming environment vendors start to integrate these other models directly into the programming process. But, for now, I'm going back to ChatGPT when I need programming help, and that's my advice to you as well.

Have you used an AI to help you program? Which one? How did it go? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.
