The thing I find most baffling about the coding tests I've been running is that tools based on the same large language model tend to perform quite differently.
Also: The best AI for coding in 2025 (and what not to use)
For example, ChatGPT, Perplexity, and GitHub Copilot are all based on OpenAI's GPT-4 model. But, as I'll show you below, while ChatGPT and Perplexity's pro plans performed excellently, GitHub Copilot failed as often as it succeeded.
I tested GitHub Copilot embedded inside a VS Code instance. I'll explain how to set that up and use GitHub Copilot in an upcoming step-by-step article. But first, let's run through the tests.
If you want to know how I test and see the prompts for each individual test, feel free to read how I test an AI chatbot's coding ability.
TL;DR: GitHub Copilot passed two tests and failed two.
Test 1: Writing a WordPress plugin
So, this failed miserably. This was my first test, so I can't tell yet whether GitHub Copilot is terrible at writing code or whether the context in which you interact with it is limiting to the point where it can't meet this requirement.
Let me explain.
This test involves asking the AI to create a fully functional WordPress plugin, complete with admin interface elements and operational logic. The plugin takes in a set of names, sorts them, and, if there are duplicates, separates the duplicates so that they're not side by side.
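To give a sense of what the prompt is asking for, the core logic (setting aside the WordPress admin scaffolding) can be sketched in a few lines. This is a hypothetical illustration of the task, not output from any of the AIs tested, and the function name is my own:

```javascript
// Hypothetical sketch of the plugin's core task: order a list of
// names, then greedily re-place them so duplicates never sit side
// by side whenever that is possible.
function separateDuplicates(names) {
  // Tally each name, inserting in alphabetical order so ties
  // resolve predictably.
  const counts = new Map();
  for (const n of [...names].sort()) {
    counts.set(n, (counts.get(n) || 0) + 1);
  }
  const result = [];
  while (result.length < names.length) {
    // Pick the most frequent remaining name that differs from the
    // previously placed one, which spreads duplicates apart.
    let best = null;
    for (const [name, count] of counts) {
      if (count === 0 || name === result[result.length - 1]) continue;
      if (best === null || count > counts.get(best)) best = name;
    }
    if (best === null) {
      // Only copies of the last-placed name remain; separation is
      // impossible, so place one anyway.
      best = [...counts.keys()].find((n) => counts.get(n) > 0);
    }
    result.push(best);
    counts.set(best, counts.get(best) - 1);
  }
  return result;
}
```

Placing the most frequent name first is what keeps, say, two "Dana" entries from ending up adjacent after an alphabetical sort.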
Also: I tested DeepSeek's R1 and V3 coding skills – and we're not all doomed (yet)
This was a real-world application that my wife needed as part of an engagement activity she runs on her very active Facebook group for her digital goods e-commerce business.
Most of the other AIs passed this test, at least in part. Five of the 10 AI models tested passed the test completely. Three of them passed part of the test. Two (including Microsoft Copilot) failed completely.
The thing is, I gave GitHub Copilot the same prompt I give all of them, but it only wrote PHP code. To be clear, this problem can be solved using PHP code alone. But some AIs like to include some JavaScript for the interactive features. GitHub Copilot included code for using JavaScript but never actually generated the JavaScript that it tried to use.
What's worse, when I created a JavaScript file and, from within that JavaScript file, tried to get GitHub Copilot to run the prompt, it gave me another PHP script, which also referenced a JavaScript file.
As you can see below, within the randomizer.js file, it tried to enqueue (basically, bring in to run) the randomizer.js file, and the code it wrote was PHP, not JavaScript.
Test 2: Rewriting a string function
This test is fairly simple. I wrote a function that was supposed to test for dollars and cents but wound up only testing for integers (dollars). The test asks the AI to fix the code.
GitHub Copilot did rework the code, but there were a bunch of problems with the code it produced.
- It assumed the value passed in was always a string. If it was empty, the code would break.
- The revised regular expression code would break if a decimal point (i.e., "3.") was entered, if a leading decimal point (i.e., ".3") was entered, or if leading zeros were included (i.e., "00.30").
For something that was supposed to test whether currency was entered correctly, failing with code that would crash on edge cases is not acceptable.
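For comparison, a more defensive version of that check — one that rejects empty or malformed input instead of breaking on it — might look something like this. This is a hypothetical sketch of the behavior I was testing for, not Copilot's actual output:

```javascript
// Hypothetical defensive currency check: accepts whole dollars with
// optional one- or two-digit cents, and treats empty, non-string,
// or malformed input as invalid rather than crashing.
function isValidCurrency(input) {
  if (typeof input !== "string") return false; // guard non-string values
  const trimmed = input.trim();
  // One or more digits (no leading zeros), optionally followed by a
  // decimal point and exactly one or two digits: "3", "3.5", "3.50"
  // pass; "3.", ".3", "00.30", and "" are cleanly rejected.
  return /^(0|[1-9]\d*)(\.\d{1,2})?$/.test(trimmed);
}
```

The point isn't whether inputs like "3." should be accepted or rejected — reasonable validators differ — it's that every edge case should produce a clean true/false rather than a crash.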
So, we have another fail.
Test 3: Finding an annoying bug
GitHub Copilot got this one right. This is another test pulled from my real-life coding escapades. What made this bug so annoying (and difficult to figure out) is that the error message wasn't directly related to the actual problem.
Also: I put DeepSeek AI's coding skills to the test – here's where it fell apart
The bug is kind of the coder equivalent of a trick question. Fixing it requires understanding how certain API calls in the WordPress framework work and then applying that knowledge to the bug in question.
Microsoft Copilot, Gemini, and Meta Code Llama all failed this test. But GitHub Copilot solved it correctly.
Test 4: Writing a script
Here, too, GitHub Copilot succeeded where Microsoft Copilot failed. The challenge here is that I'm testing the AI's ability to create a script that requires knowledge of coding in AppleScript, the Chrome object model, and a Mac-only third-party scripting utility called Keyboard Maestro.
Also: X's Grok did surprisingly well in my AI coding tests
To pass this test, the AI has to be able to recognize that all three coding environments need attention and then tailor individual lines of code to each of those environments.
Final thoughts
Given that GitHub Copilot uses GPT-4, I find the fact that it failed half of the tests discouraging. GitHub is just about the most popular source management environment on the planet, and one would hope that its AI coding help would be quite reliable.
As with all things AI, I'm sure performance will get better. Let's stay tuned and check back in a few months to see if the AI is more effective then.
Do you use an AI to help with coding? Which AI do you prefer? Have you tried GitHub Copilot? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.