Once in a while I take a stab at a big, uncertain topic (like COVID or Bitcoin) as a way of recording a snapshot of my thinking. Now it’s time to do the same for AI (Artificial Intelligence), in a post that will surely be massively out-of-date almost as soon as I’ve published it. Doomed enterprise though it may be, I still want to do this, if nothing else because my own job as a software engineer is one of the ones most likely to be dramatically affected by the rise of AI. And while I could go on about image and video generation or any number of other applications of the new wave of AI products, I’m mostly going to focus on the area that is currently most relevant to the business of software engineering; that is, LLMs (Large Language Models) as applied to tasks in and around software development.
The current state of AI in software development
In the world of programming, LLMs are being crammed and wedged into every available gap. I say "crammed" because the textual, conversational model doesn’t necessarily always feel like a natural fit within our existing user interfaces. Products like GitHub Copilot seek to make the interaction as natural as possible — for example, proposing completions for you when you do something like type a comment describing what the code should do — but fundamentally the LLM paradigm imposes a turn-based, conversational interaction pattern. You ask for something by constructing a prompt, and the LLM provides a (hopefully) reasonable continuation. In various places you see products trying to make this interaction seem more seamless and less turn-like — sometimes the AI agent is hidden behind a button, a menu, or a keyboard shortcut — but I generally find these attempts to be clumsy and intrusive.
And how good is this state of affairs? At the time of writing, the answer is "it depends". There are times when an LLM can produce appallingly buggy but reasonable-seeming code (note: humans can do this too), and others where it knocks out exactly what you would have written yourself, given enough time. Use cases that have felt anywhere from "good" to "great" for me have been things like:
- Low-stakes stuff like Bash and Zsh scripts for local development. Shell scripts that run locally, using trusted input only, doing not-mission-critical things. Shells have all sorts of esoteric features and hard-to-remember syntax that an LLM can generally churn out quite rapidly; and even if it doesn’t work, the code it gives you is often close enough that it can give you an idea of what to do, or a hint about what part of the manual page you should be reading to find out about, say, a particular parameter expansion feature. The conversational model lends itself well to clarifying questions too. You might ask it to give you the incantation needed for your fancy shell prompt, and when it gives you something that looks indistinguishable from random noise, you can ask it to explain each part.
- React components. Once again, for low-stakes things (side-projects, for example), the LLM is going to do just fine here. I remember using an LLM after a period of many months of not doing React, and it helped me rapidly flesh out things like Error Boundary components that I would otherwise have had to read up on in order to refresh my memory.
- Dream interpretation. Ok, so I snuck in a non-programming use case. If you’ve ever had a weird dream and asked Google for help interpreting it, you’ll find yourself with nothing more than a bunch of links to low-quality "listicles" and SEO-motivated goop that you’ll have to wade into like a swamp, with little hope of actually coming out with useful answers; ask an LLM on the other hand, and you’ll obtain directed, on-point answers of a calibre equal to that of an experienced ~~charlatan~~ professional dream interpreter.
- Writing tests. Tests are often tedious things filled with painful boilerplate, but you want them to be that way (i.e. if they fail, you want to be able to jump straight to the failing test and read it straightforwardly from top to bottom, as opposed to having to jump through hoops reverse-engineering layers of cleverness and indirection). An LLM is good for churning out these things, and the risk of it hallucinating and producing something that doesn’t actually verify the correct behavior is far more benign than a comparable flaw making it into the implementation code that’s going to run in production. The bar is lower here because humans are at least as capable of writing bad tests as LLMs are. This is probably because it’s harder to ship a flagrant but undetected implementation bug: if anybody actually uses the software, the bug will be flushed out in short order. On the other hand, all manner of disgusting tests can get shipped and live on for extended periods in a test suite as long as they remain green. We’ve all seen ostensibly green tests that ended up verifying the wrong behavior, not verifying anything meaningful at all, or being mere facsimiles of the form and structure of the thing they purport to test, but utterly failing to express, exercise, specify, or constrain the expected behavior.
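To make the point about green-but-meaningless tests concrete, here is a minimal sketch in Python (my choice of language for illustration). The function `apply_discount` and both test names are hypothetical, invented purely for this example. The first test has the shape of a test but would stay green for almost any implementation; the second actually pins down the expected behavior.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: take `percent` off `price`."""
    return price * (1 - percent / 100)


def test_discount_vacuous() -> None:
    # Looks like a test, but only checks the return type. It would still
    # pass if the function ignored the discount and returned `price` as-is.
    result = apply_discount(200.0, 25.0)
    assert isinstance(result, float)


def test_discount_meaningful() -> None:
    # Pins concrete expected values, so a wrong formula turns the test red.
    assert apply_discount(200.0, 25.0) == 150.0
    assert apply_discount(80.0, 0.0) == 80.0


if __name__ == "__main__":
    test_discount_vacuous()
    test_discount_meaningful()
```

Whether a human or an LLM wrote it, the first kind of test is the one that tends to slip through review, precisely because it is green.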
But it’s not all roses. One of the problems with LLMs is they’re only as good as the data used to train them. So, given a huge corpus of code written by humans (code with bugs), it’s only to be expected that LLM code can be buggy too. The dark art of tuning models can only get you so far, and curating the training data is hard to scale up without a kind of chicken-and-egg problem in which you rely on (untrustworthy) AI to select the best training material to feed into your AI model. In my first experiences with LLMs, I found they had two main failure modes: one was producing something that looked reasonable, appeared to be what I asked for, and was indeed "correct", but was subtly ill-suited for the task; the other was producing code that again had the right shape that I’d expect to see in a solution, but which actually had some fatal bug or flaw (i.e. was objectively "incorrect"). This means you have to be skeptical of everything that comes out of an LLM; just because the tool seemed "confident" about it is no guarantee of it actually being any good! And as anybody who has had an interaction with an LLM has seen, the apparent confidence with which they answer your questions is the flimsiest of veneers, rapidly blown away by the slightest puff of questioning air:
Programmer: Give me a function that sorts this list in descending order, lexicographically and case-insensitively.
Copilot: Sure thing, the function you ask for can be composed of the following elements… (shows and explains function in great detail).
Programmer: This function sorts the list in ascending order.
Copilot: Oh yes, that is correct. My apologies for farting out broken garbage like that. To correct the function, we must do the following…
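For illustration, here is roughly what the request in that exchange amounts to, as a minimal Python sketch. The exchange doesn't name a language, so Python is my assumption, and the function name `sort_descending_ci` is hypothetical. The bug in the dialogue corresponds to leaving out `reverse=True`, which is exactly the kind of slip a one-line assertion catches.

```python
def sort_descending_ci(items: list[str]) -> list[str]:
    """Sort strings in descending lexicographic order, ignoring case."""
    # str.casefold normalizes case for comparison; reverse=True gives
    # descending order. Omitting reverse=True silently sorts ascending.
    return sorted(items, key=str.casefold, reverse=True)


if __name__ == "__main__":
    words = ["banana", "apple", "Cherry"]
    # An ascending result would be ["apple", "banana", "Cherry"] and fail here.
    assert sort_descending_ci(words) == ["Cherry", "banana", "apple"]
    print(sort_descending_ci(words))
```

The assertion is the cheap part; remembering to be skeptical enough to write it is the hard part.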
In practice, the double-edged sword of current LLMs means that I mostly don’t use tools like GitHub Copilot in my day-to-day work, but I do make light use of ChatGPT, as I described in a recent YouTube video. As I’ve hinted at already, I’m more likely to use LLMs for low-stakes things (local scripts, tests), and only ever as a scaffolding that I then scrutinize as closely or more closely than I would code from a human colleague. Sadly, when I observe my own colleagues’ usage of Copilot I see that not everybody shares my cautious skepticism; some people are wary of the quality of LLM-generated code and vet it carefully, but others gushingly accept whatever reasonable-seeming hallucination it sharts out.
One thing I’m all too keenly aware of right now is that my approach to code review will need to change. When I look at a PR, I still look at it with the eyes of a human who thinks they are reading code written by another human. I allow all sorts of circumstantial factors to influence my level of attention (who wrote the code? what do I know about their strengths, weaknesses, and goals? etc), and I rarely stop to think and realize that some or all of what I’m reviewing may actually have been churned out by a machine. I’m sure this awareness will come naturally to me over time, but for now I have to make a conscious effort in order to maintain that awareness.
Am I worried about losing my job?
I’m notoriously bad at predicting the future, but it seems it would be derelict of me not to at least contemplate the possibility of workforce reductions in the face of the rise of the AI juggernaut. I don’t think any LLM currently can consistently produce the kind of results I’d expect of a skilled colleague, but it’s certainly possible that that could change within a relatively short time-scale. It seems that right now the prudent course is to judiciously use AI to get your job done faster, allowing you to focus on the parts where you can clearly add more value than the machine can.
At the moment, LLMs are nowhere near being able to do the hard parts of my job, precisely because those parts require me to keep and access a huge amount of context that is not readily accessible to the machine itself. In my daily work, I routinely have to analyze and understand information coming from local sources (source code files, diffs) and other sources spread out spatially and temporally across Git repos (commit messages from different points in history, files spread across repositories and organizations), pull requests, issues, Google Docs, Slack conversations, documentation, lore, and many other places. It’s only a matter of time before we’ll be able to provide our LLMs with enough of that context for them to become competitive with a competent human when it comes to those tricky bug fixes, nuanced feature decisions, and cross-cutting changes that require awareness not just of code but also of how distributed systems, teams, and processes are structured.
It’s quite possible that, as with other forms of automation, AI will displace humans when it comes to the low-level tasks, but leave room "up top" for human decision-makers to specialize in high-leverage activities. That is, humans getting the machines to do their bidding, or using the machine to imbue them with apparent "superpowers" to get stuff done more quickly. Does this mean that the number of programming jobs will go down? Or that we’ll just find new things — or harder things — to build with all that new capacity? Will it change the job market, compensation levels, supply and demand? I don’t have the answer to any of those questions, but it makes sense to remain alert and seek to keep pace with developments so as not to be left behind.
Where will this take us all?
There have been some Twitter memes going around about how AI is capable of churning out essentially unreadable code, and how we may find ourselves in a future where we no longer understand the systems that we maintain. To an extent, it’s already true that we have systems large and complicated enough that they are impossible for any one person to understand exhaustively, but AI might be able to build something considerably worse: code that compiles and apparently behaves as desired but is nevertheless not even readable at the local level, when parts of it are examined in isolation. Imagine a future where, in the same way that we don’t really know how LLMs "think", they write software systems for us that we also can’t explain. I don’t really want to live in a world like that (too scary), although it may be that that way lies the path to game-changing achievements like faster-than-light travel, usable fusion energy, room-temperature superconductors and so on. I think that at least in the short term we humans have to impose the discipline required to ensure that LLMs are used for "good", in the sense of producing readable, maintainable code. The end goal should be that LLMs help us to write the best software that we can, the kind of software we’d expect an expert human practitioner to produce. I am in no hurry to rush forwards into a brave new world where genius machines spit out magical software objects that I can’t pull apart, understand, or aspire to build myself.
The other thing I am worried about is what’s going to happen once the volume of published code produced by LLMs exceeds that produced by humans, especially given that we don’t have a good way of indicating the provenance of any particular piece — everything is becoming increasingly mixed up, and it is probably already too late to hope to rigorously label it all. I honestly don’t know how we’ll train models to produce "code that does X" once our training data becomes dominated by machine-generated examples of "code that does X". The possibility that we might converge inescapably on suboptimal implementations is just as concerning as the contrary possibility (that we might see convergence in the direction of ever greater quality and perfection) is exciting. There could well be an inflection point somewhere up ahead, if not a singularity, beyond which all hope of making useful predictions breaks down.
Where would this take us all in an ideal world?
At the moment, I see LLMs being used for many programming-adjacent applications; for example, AI-summarization. There is something about these summaries that drains my soul. They end up being so pedestrian, so bland. I would rather read a thoughtful PR description written by a human than a mind-numbingly plain AI summary any day. Yet, in the mad rush to lead the race into the new frontier lands, companies are ramming things like summarization tools down our throats with the promise of productivity, in the hope of becoming winners in the AI gold rush.
Sadly, I don’t think the forces of free-market capitalism are going to drive AI towards the kinds of applications I really want, at least not in the short term, but here is a little wish list:
- I’d like the autocomplete on my phone to be actually useful as opposed to excruciating. Relatedly, I’d like speech-to-text to be at least as good at hearing what I’m saying as a human listener. Even after all these years, our existing implementations feel like they’ve reached some kind of local maximum beyond which progress is exponentially harder. 99% of all messages I type on my phone require me to backspace and correct at least once. As things currently stand, I can’t imagine ever trusting a speech-to-text artifact without carefully reviewing it.
- Instead of a web populated with unbounded expanses of soulless, AI-generated fluff, I want a search engine that can guide me towards the very best human-generated content. Instead of a dull AI summary, I’d like an AI that found, arranged, and quoted the best human content for me, in the same way a scholar or a librarian might curate the best academic source material.
- If I must have an AI pair-programmer, I’d want it to be a whole lot more like a skilled colleague than things like Copilot currently are. Right now they feel like a student that’s trying to game the system, producing answers that will get them the necessary marks and not deeply thinking and caring about producing the right answer.
- AI can be useful not just for guiding one towards the best information on the public internet. Even on my personal computing device, I already have an unmanageably large quantity of data. Consider, for example, the 50,000 photos I have on my laptop, taken over the last 20 years. I’d like a trustworthy ally that I can rely on to sort and classify these; not the relatively superficial things like face detection that software has been able to do for a while now, but something capable of reliably doing things like "thinning" the photo library guided only by vague instructions like "reduce the amount of near-duplication in here by identifying groups of similar photos taken around the same time and place, and keep the best ones, discarding the others". Basically, the kind of careful sorting you could do yourself if only you had a spare few dozen hours and the patience and resolve to actually get through it all.
I’m bracing myself for a period of intensive upheaval, and I’m not necessarily expecting any of this transformation to lead humanity into an actually-better place. Will AI make us happier? I’m not holding my breath. I’d give this an excitement score of 4 out of 10. For comparison, my feelings around the birth of personal computing (say, in the 1980s) were a 10 out of 10, and the mainstream arrival of the internet (the 1990s) were a 9 out of 10. But to end on a positive note, I will say that we’ll probably continue to have some beautiful, monumental, human-made software achievements to be proud of and to continue using into the foreseeable future (that is, during my lifetime): things like Git, for example. I’m going to cherish those while I still can.