
Communications of the ACM


AI Does Not Help Programmers

Bertrand Meyer

Everyone is blown away by the new AI-based assistants. (Myself included: see an earlier article on this blog which, by the way, I would write differently today.) They pass bar exams and write songs. They also produce programs. Starting with Matt Welsh's article in Communications of the ACM, many people now pronounce programming dead, most recently The New York Times.

I have tried to understand how I could use ChatGPT for programming and, unlike Welsh, found almost nothing. If the idea is to write some sort of program from scratch, well, then yes. I am willing to believe the experiment reported on Twitter of how a beginner using Copilot beat hands-down a professional programmer in a from-scratch development of a Minimum Viable Product program, from "Figma screens and a set of specs." I have also seen people who know next to nothing about programming get a useful program prototype by just typing in a general specification. I am talking about something else, the kind of use that Welsh touts: a professional programmer using an AI assistant to do a better job. It doesn't work.

Precautionary observations:

  • Caveat 1: We are in the early days of the technology and it is easy to mistake teething problems for fundamental limitations. (PC Magazine's initial review of the iPhone: "it's just a plain lousy phone, and although it makes some exciting advances in handheld Web browsing it is not the Internet in your pocket.") Still, we have to assess what we have, not what we could get.
  • Caveat 2: I am using ChatGPT (version 4). Other tools may perform better.
  • Caveat 3: It has become fair game to try to trick ChatGPT or Bard, etc., into giving wrong answers. We all have great fun when they tell us that Famous Computer Scientist X has received the Turing Award and next (equally wrongly) that X is dead. Such exercises have their use, but here I am doing something different: not trying to trick an AI assistant by pushing it to the limits of its knowledge, but genuinely trying to get help from it for my key purpose, programming. I would love to get correct answers and, when I started, thought I would. What I found through honest, open-minded enquiry is at complete odds with the hype.
  • Caveat 4: The title of this article is rather assertive. Take it as a proposition to be debated ("This house believes that..."). I would be interested to be proven wrong. The main immediate goal is not to decree an inflexible opinion (there is enough of that on social networks), but to spur a fruitful discussion to advance our understanding beyond the "Wow!" effect.

Here is my experience so far. As a programmer, I know where to go to solve a problem. But I am fallible; I would love to have an assistant who keeps me in check, alerting me to pitfalls and correcting me when I err. An effective pair-programmer. But that is not what I get. Instead, I have the equivalent of a cocky graduate student, smart and widely read, also polite and quick to apologize, but thoroughly, invariably, sloppy and unreliable. I have little use for such supposed help.

It is easy to see how generative AI tools can perform an excellent job and outperform people in many areas: where we need a result that comes very quickly, is convincing, resembles what a top expert would produce, and is almost right on substance. Marketing brochures. Translations of Web sites. Actually, translations in general (I would not encourage anyone to embrace a career as interpreter right now). Medical image analysis. There are undoubtedly many more. But programming has a distinctive requirement: programs must be right. We tolerate bugs, but the core functionality must be correct. If the customer's order is to buy 100 shares of Microsoft and sell 50 of Amazon, the program should not do the reverse because an object was shared rather than replicated. That is the kind of serious error professional programmers make and for which they need help.
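To make the "shared rather than replicated" kind of error concrete, here is a minimal Python sketch. Everything in it (the order template, the field names, the two functions) is hypothetical and invented for illustration; it is not taken from any real trading system. The first function reuses one object for both orders, so the second order silently overwrites the first; the second function copies the template before filling it in.

```python
import copy

def make_orders_shared(template):
    """Buggy sketch: both orders alias the SAME dict object."""
    buy = template                      # no copy: 'buy' is just another name
    buy.update(action="buy", symbol="MSFT", shares=100)
    sell = template                     # same object again
    sell.update(action="sell", symbol="AMZN", shares=50)
    return buy, sell                    # the buy order has been overwritten

def make_orders_copied(template):
    """Sketch of the fix: each order is an independent replica of the template."""
    buy = copy.deepcopy(template)
    buy.update(action="buy", symbol="MSFT", shares=100)
    sell = copy.deepcopy(template)
    sell.update(action="sell", symbol="AMZN", shares=50)
    return buy, sell

# Hypothetical blank order used as a starting point.
blank = {"action": None, "symbol": None, "shares": 0}

shared_buy, shared_sell = make_orders_shared(dict(blank))
copied_buy, copied_sell = make_orders_copied(dict(blank))
```

In the shared variant both names end up referring to the Amazon sell order, and the Microsoft purchase is lost without any error being raised. Both versions look equally plausible on a quick read, which is the point.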

AI in its modern form, however, does not generate correct programs: it generates programs inferred from many earlier programs it has seen. These programs look correct but have no guarantee of correctness. (I am talking about "modern" AI to distinguish it from the earlier kind, largely considered to have failed, which tried to reproduce human logical thinking, for example through expert systems. Today's AI works by statistical inference.)

Fascinating as they are, AI assistants are not works of logic; they are works of words. Large language models: smooth talkers (like the ones who got all the dates in high school). They have become incredibly good at producing text that looks right. For many applications that is enough. Not for programming.

Some time ago, I published on this blog a sequence of articles that tackled the (supposedly) elementary problem of binary search, each looking good and each proposing a version which, up to the last installments, was wrong. (The first article is here; it links to its successor, as all items in the series do. There is also a version on my personal blog as a single article, which may be more convenient to read.)
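As an illustration of how a plausible-looking binary search can be wrong, here is a small Python sketch. The function `buggy_search` is a hypothetical flawed variant written for this example, not necessarily any of the versions from the series, and `find_counterexample` is likewise an assumed helper. The helper compares a candidate search against the obviously correct `x in t` on all small sorted lists; a step limit turns non-termination into a detectable failure.

```python
from itertools import combinations_with_replacement

def buggy_search(t, x, max_steps=1000):
    """Hypothetical flawed binary search on a sorted, non-empty list t.

    Returns True if x is reported present. Raises RuntimeError if the
    loop has not finished after max_steps iterations.
    """
    i, j = 0, len(t) - 1
    steps = 0
    while i != j:
        steps += 1
        if steps > max_steps:
            raise RuntimeError("loop did not terminate")
        m = (i + j) // 2
        if t[m] <= x:
            i = m          # flaw: if j == i + 1 then m == i, so no progress
        else:
            j = m
    return t[i] == x

def find_counterexample(search, max_len=3, values=range(4)):
    """Return (t, x) on which search disagrees with 'x in t', or None."""
    for n in range(1, max_len + 1):
        # combinations_with_replacement yields non-decreasing tuples,
        # i.e. every sorted list of length n over the given values.
        for t in combinations_with_replacement(values, n):
            for x in values:
                try:
                    ok = search(list(t), x) == (x in t)
                except RuntimeError:
                    ok = False
                if not ok:
                    return list(t), x
    return None

counterexample = find_counterexample(buggy_search)
```

A single returned pair is enough to show the candidate incorrect. A result of `None` shows only that no failure exists among the small cases tried; it is not a proof of correctness.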

I submitted the initial version to ChatGPT. (The interaction took place late May; I have not run it again since.)

The answer begins with a useful description of the problem:

Good analysis; similar in fact to the debunking of the first version in my own follow-up. The problem can actually arise with any number of elements, not just two, but to prove a program incorrect it suffices to exhibit a single counterexample. (To prove it correct, you have to show that it works for all examples.) But here is what ChatGPT comes up with next, even though all I had actually asked was whether the program was correct, not how to fix it:


(Please examine this code now!) It includes helpful comments:


All this is very good, but if you did examine the proposed replacement code, you may have found something fishy, as I did.

I report it:

Indeed, in trying to fix my bug, ChatGPT produced another buggy version, although the bug is a new one. There is an eerie similarity with my own original sequence of binary search posts, where each attempt introduced a version that seemed to correct the mistake in the preceding one, only to reveal another problem.

The difference, of course, is that my articles were pedagogical, instead of asserting with undaunted assurance that the latest version is the correct fix!

One thing ChatGPT is very good at is apologizing:

Well, for my part, when looking for an assistant I am all for him/her/it to be polite and to apologize, but what I really want is that the assistant be right. Am I asking too much? ChatGPT volunteers, as usual, the corrected version that I had not even (or not yet) requested:

(Do you also find that the tool doth apologize too much? I know I am being unfair, but I cannot help thinking of the French phrase trop poli pour être honnête, too polite to be honest.)

At this point, I did not even try to determine whether that newest version is correct; any competent programmer knows that spotting cases that do not work and adding a specific fix for each is not the best path to a correct program.

I, too, remain (fairly) polite:

Now I am in for a good case of touché: ChatGPT is about to lecture me on the concept of loop invariant!

I never said or implied, by the way, that I "want a more systematic way of verifying the correctness of the algorithm." Actually, I do, but I never used words like "systematic" or "verify." A beautiful case of mind-reading by statistical inference from a large corpus: probably, people who start whining about remaining bugs and criticize software changes as "kludges" are correctness nuts like me who, in the next breath, are going to start asking for a systematic approach and verification.
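For readers who have not met the loop invariant idea in practice, here is a minimal Python sketch of a binary search that states its invariant and checks it at run time. The function names and the particular invariant are choices made for this example, not the code from the ChatGPT exchange. The invariant says that everything left of `lo` is smaller than `x` and everything from `hi` onward is at least `x`; the variant `hi - lo` must shrink on every iteration, which rules out endless loops.

```python
def invariant(t, x, lo, hi):
    """Everything in t[:lo] is < x, and everything in t[hi:] is >= x."""
    return (0 <= lo <= hi <= len(t)
            and all(v < x for v in t[:lo])
            and all(v >= x for v in t[hi:]))

def binary_search(t, x):
    """Return the leftmost index k with t[k] == x in sorted list t, else -1."""
    lo, hi = 0, len(t)                 # candidate positions are t[lo:hi]
    assert invariant(t, x, lo, hi)     # holds trivially: both slices are empty
    while lo < hi:
        variant = hi - lo              # must strictly decrease each iteration
        mid = (lo + hi) // 2           # lo <= mid < hi, so t[mid] exists
        if t[mid] < x:
            lo = mid + 1               # t is sorted, so all of t[:mid+1] < x
        else:
            hi = mid                   # t is sorted, so all of t[mid:] >= x
        assert invariant(t, x, lo, hi)
        assert 0 <= hi - lo < variant
    # Here lo == hi, so the invariant pins down the only possible position.
    return lo if lo < len(t) and t[lo] == x else -1
```

The assertions cost linear time per iteration and are there only to make the reasoning visible. They check the invariant on the runs actually executed; establishing it for all inputs requires a proof, by hand or with a verification tool.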

I am, however, a tougher nut to crack than what my sweet-talking assistant (the one who is happy to toss in knowledge about fancy topics such as class invariant) thinks. My retort:

There I get a nice answer, almost as if (you see my usual conceit) the training set had included our loop invariant survey (written with Carlo Furia and Sergey Velder) in ACM's Computing Surveys. Starting with a bit of flattery, which can never hurt:

And then I stopped.

Not that I had succumbed to the flattery. In fact, I would have no idea where to go next. What use do I have for a sloppy assistant? I can be sloppy just by myself, thanks, and an assistant who is even more sloppy than I is not welcome. The basic quality that I would expect from a supposedly intelligent assistant (any other is insignificant in comparison) is to be right.

It is also the only quality that the ChatGPT class of automated assistants cannot promise.

Help me produce a basic framework for a program that will "kind-of" do the job, including in a programming language that I do not know well? By all means. There is a market for that. But help produce a program that has to work correctly? In the current state of the technology, there is no way it can do that.

For software engineering there is, however, good news. For all the hype about not having to write programs, we cannot forget that any programmer, human or automatic, needs specifications, and that any candidate program requires verification. Past the "Wow!", stakeholders eventually realize that an impressive program written at the push of a button does not have much use, and can even be harmful, if it does not do the right things: what the stakeholders want. (The requirements literature, including my own recent book on the topic, is there to help us build systems that achieve that goal.)

There is no absolute reason why Generative AI For Programming could not integrate these concerns. I would venture that if it is to be effective for serious professional programming, it will have to spark a wonderful renaissance of studies and tools in formal specification and verification.


Bertrand Meyer is a professor and Provost at the Constructor Institute (Schaffhausen, Switzerland) and chief technology officer of Eiffel Software (Goleta, CA).


David Erb

The focus of this article on ChatGPT for coding makes the entire article uninteresting. True, ChatGPT can't write good code, but other AI tools, like GitHub Copilot, allow a coder to write simple prompts in comments and generate good boilerplate code, which the coder can then adjust as needed. I find my work goes 2-4 times faster with this simple approach, so I cannot agree that AI does not help programmers. A more apt title would be "The Current Generation of ChatGPT Does Not Help Programmers," but that might seem less intriguing.

Petr Kures

I agree with David. Of course I can't blindly use the output of GPT-4 or any other LM. But it helps with learning new APIs, transforming SQL query results to data classes, etc. Maybe you write binary search every day, but I am laughing at such examples, because it would be foolish to write such things unless absolutely necessary. There are libraries - tested libraries for that. Only if no such library is available will I write such low-level code, and then I know to be extra careful not to make many stupid mistakes, to write many tests, etc.

I was asked to write an integration with Microsoft Azure recently, and without the help of AI it would have taken 4 times as long. And I absolutely cannot see it replacing me as a programmer; it cannot understand, refactor and improve large-scale projects (yet :-), but it saves me a lot of time by not having to study documentation for a week before being able to write some incantation to authenticate to the MS cloud. I'd rather focus on getting the architecture right. Thinking about it - OK, AI-generated code isn't always correct - neither is mine, I have to test it and iterate - and AI makes the iteration faster. Unattended software development by AI is pretty far away, I'd think - I can't imagine our customers having to explain their ideas to ChatGPT ;-) We have to communicate with them all the time, read their minds and sometimes even suggest and develop things they didn't know they wanted.

Bertrand Meyer

Thanks for the comments.

@David Erb I explicitly wrote that my contribution is limited by looking at ChatGPT only. On the other hand ChatGPT is the current reference and defines the state of the art. I would be most interested to see what Copilot and others produce on the kind of issues I covered, and specifically the example covered in depth in my earlier binary search article.

@Petr Kures Thanks for the experience report. I do not attach much value to arguments of the sort "binary search is an academic exercise". Of course I know that binary search is not an exciting software project. But any technology needs to be evaluated on some examples and if it fails on supposedly toy examples it's unlikely that it will succeed on bigger ones. Whether binary search or (say) building the software for GPS (as an example of something truly big and significant), one of the key issues of building software, often the key issue, is to get it right. A tool that does no better than me in this respect is of limited interest. It can still have its uses, as you suggest.

Roman Suzi

I fully agree that currently ChatGPT can't replace a human programmer for formal verification; however, there are some points worth mentioning.

First, I believe LLMs can help retrieve the ontology behind a problem domain, something programmers need especially in new areas. ChatGPT does a remarkable job of finding proper terminology given examples. In a sense, ChatGPT picks the most frequently used words, but also knows less used ones. My own blog on this: . This is very useful for establishing a ubiquitous language, especially in a greenfield project. And building a good ontology is both costly and important for laying good foundations for the software.

Second, formal verification using proof assistants like Coq (or programming languages like Idris2) can be made more approachable for beginners (I mean professional programmers who are beginners in formal verification methods). If ChatGPT or similar technology can generate a proof by guessing hints, it can save a lot of time. And unlike the "untyped" example of binary search checked by human, a proof assistant will not let an incorrect proof slip through. With enough "sloppy assistants," the end result will still be correct. For this reason, I consider ChatGPT as a specialized syntax-aware search engine.

Third, well-architected systems require a minimum amount of boilerplate code that ChatGPT can assist in authoring. I see a lot of potential in AI-friendly domain-specific languages (DSLs) that can help bridge the gap in formalization (natural language -> DSL). This is also a relevant use case for professional programming.

And I am sure there are other ways programmers and data scientists can benefit.

Gilbert Nash

A few hours ago, encouraged by your article, I asked ChatGPT-4 for help in designing simple logic circuits.
It started faltering right after stating that it was familiar with the topic, as soon as I asked it for an easy functional extension of the basic NAND flip-flop. Then it went on piling up errors, omissions and humble apologies until I desisted, thanked it for trying (old habits die hard) and logged off.

I'm afraid you're right: as of today LLMs (or at least GPT-4) are better as peddlers or politicians than as scientists or engineers: excellent with words, polite, self-assured, persuasive, authoritative, poker-faced - regardless of whether they really understand what they're talking about or are just producing convincing noise.

Sic transit gloria mundi... (sigh)

James Jones

Back when I thought I was good at using search tools like Alta Vista as a senior software developer, a college intern working with me introduced me to Google, the latest in search engines, by having me copy/paste the error message I got for insights into the problem. That shifted my paradigm for search engines as a programmer. When I think of AI tools that will help programmers, I am looking for something that broadens my horizons to see different ways of coding something or encourages me to add test cases I overlooked, because of its understanding of info system and language processing design structures under the hood. So I am looking for AI as an assistant (like the intern) rather than a be-all and end-all developer. To be most useful, the AI generator should be able to give me the snippet of code that I ask for and be able to follow any restrictions about languages, frameworks, and tools that I want it to avoid or use. When I use Bard to improve a paragraph taken from a letter, it often "thinks" it is to write a whole letter and it creates more than the improved snippet. I rarely use much of what it gives me, but it helps to have a wordsmithing assistant like the synonym selector on steroids. Another analogy that comes to mind is when I would take an MS Word doc and save it as an HTML file. What I got was impossible for me to reach into and make coding tweaks. I would worry that AI coding might give us code that is overly bloated or complex and give us something we cannot tweak or understand. A human programmer makes adjustments all the time when following a certain path would make the result overly complex and inefficient.
