Using LLMs for software development (summer '25 update)
(Or, "My big revelation about LLMs and software development")
A few months ago, I did a demo of Claude Code at the Northwest Ruby User Group. I had just started using Claude Code and it had impressed me. I wanted to show off some of its capabilities and explain how it was different to the LLM-powered stuff I'd seen before - in particular, the likes of GitHub Copilot, which up till then had been an over-powered auto-complete. Claude Code was different - it could actually do stuff: write code, run the tests, then go back and correct itself based on the feedback it received.
However, three months on, the way I'm using Claude Code has changed a lot. And that's partly because I've had a really big realisation about LLMs and how they can help software developers. (As an aside, this does not change the ethical issues around LLMs, nor does it affect the financial bubble that they seem to be creating.)
I've always said that code is easier to write than it is to read.
That's part of the reason I love Ruby. When you're writing in most other languages, whether that's JavaScript or Dart or whatever, what you're writing tends to look like `computerCode()`. Not so with Ruby - you can make it look like (weird) English. And, because code is hard to read, anything that reduces the friction is a good thing.
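As a purely illustrative sketch (hypothetical names, nothing from a real project), this is the sort of thing I mean:

```ruby
# Illustrative only - hypothetical names, nothing from a real project.
require "date"

Invoice = Struct.new(:customer, :due_on) do
  def overdue?
    due_on < Date.today
  end
end

invoices = [
  Invoice.new("ACME Ltd", Date.today - 7),
  Invoice.new("Bloggs & Co", Date.today + 7)
]

# Reads almost like a sentence: select the overdue invoices, chase each customer.
invoices.select(&:overdue?).each do |invoice|
  puts "Chase #{invoice.customer} for payment"
end
```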
However, LLMs are really good at reading code.
And that makes a big difference to how I'm now using Claude.
In the past I've spent a lot of time working as a freelancer, in a development team of one. When I did work with others, it was always for small companies, and I would be classed as the "senior". Which meant, in those scenarios, I was the gatekeeper for the "quality" of the codebase. In practical terms, this resulted in me looking at pull requests and reviewing code. But code is easy to write, not so easy to read. So I really do not enjoy this part of the job.
Claude Code has a feature called "commands". You can create a text file with a series of instructions, put it in your "commands" folder and it becomes a slash command, available in your terminal session. So I wrote one called `code_review.md`.
Now, when I have a pull request to review, I fetch the branch, launch Claude, then type `/code_review ISSUE_123`. Claude reads the `code_review.md` file and starts following the instructions.
Here's a shortened extract of the current version:
- Read the original specification for the work from Linear as "Issue $ARGUMENTS". The task details may be nested as part of a hierarchy of issues, so read the parent issues as well.
- Fetch the latest version of the `develop` branch, then fetch and switch to the latest version of the feature branch.
- Merge the `develop` branch into the feature branch, then run database migrations, lint and ensure that tests all pass.
- Compare the `develop` branch to the feature branch and get a list of files that have changed.
- *Briefly* evaluate if the code changes deliver all the required functionality and that the test coverage is adequate.
- Check all exposed endpoints to ensure they have automated authentication/authorisation tests.
- Evaluate the changed files against the project style-guide, glossary and documentation.
- Update the project documentation if it does not include details of this feature.
- Make your final evaluation:
  - ready to be merged
  - ready to be merged following a user-interface review
  - should be returned to the original developer with feedback
The rest of the command file then goes into more detail about each step.
To be clear, this does not absolve me of the responsibility for performing the code review. But what it does do is give the LLM the task of reading the code.
Claude is good at seeing if the authorisation rules have been updated to match the new functionality.
Claude is good at checking if any routes have been added, if any controllers have changed - and ensuring that they are covered by specs testing each endpoint for proper authentication and authorisation.
Claude is good at identifying deviations from the style guide (although not so good at following it).
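To give a flavour of the authorisation check, this is the shape of spec it's looking for - a minimal, hypothetical RSpec request spec (the routes, roles and `sign_in_as` helper are made up, not taken from the real project):

```ruby
# Hypothetical sketch - routes, roles and the sign_in_as helper are made up.
require "rails_helper"

RSpec.describe "Site workers endpoint", type: :request do
  describe "GET /sites/:site_id/site_workers" do
    it "redirects unauthenticated users to sign in" do
      get "/sites/1/site_workers"

      expect(response).to redirect_to("/users/sign_in")   # assuming Devise-style auth
    end

    it "is forbidden for roles without permission" do
      sign_in_as :operative   # hypothetical test helper

      get "/sites/1/site_workers"

      expect(response).to have_http_status(:forbidden)
    end
  end
end
```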
After a few minutes of burning through tokens, Claude comes back to me with a summary of everything it has checked, with a recommendation:
- ready to be merged
- ready to be merged following a user-interface review
- should be returned to the developer with the following feedback
Once I have that recommendation, I then know how much effort I need to put into reading the code myself.
If Claude says it meets my criteria then I take a quick glance at the diff in case there's stuff it's missed. If Claude says there are issues, I roll my sleeves up and dive into the code proper to see what's wrong.
In other words, it removes a whole load of boring grunt work from me.
Yesterday, when I was performing a code review, Claude said there were several issues. I realised that there was a problem with the original spec, which had confused the developer doing the work. As it was my fault (I had missed the issue in the specification), I began to fix it. And then I got stuck.
We use CanCanCan for defining our authorisation rules. You define permissions and the conditions in which they apply for a given user. These conditions get evaluated into SQL queries. For example, in this project, a user with the "site manager" role can read or update any `SiteWorker` record for sites that this user manages. This is defined as: `can [:read, :update], SiteWorker, site_id: user.managed_sites.pluck(:id)`. Whereas an "operative", a sub-contractor working at the site, can only read the records of other `SiteWorker`s who are on the same contract: `can :read, Operative, site: {status: "active", account: account}, contract: {workers: {user_id: user.id}}`.
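For context, those two rules live in a CanCanCan Ability class that looks roughly like this - only the two `can` calls are taken from the project; the surrounding class, the role predicates and the way `account` is looked up are my sketch:

```ruby
# Sketch only: the two `can` rules are from the project; everything around them
# (role predicates, account lookup) is an assumption about how the class is wired up.
class Ability
  include CanCan::Ability

  def initialize(user)
    return if user.blank?

    account = user.account   # assumption: how the account is resolved in the real class

    if user.site_manager?
      # Site managers can read or update SiteWorker records for sites they manage.
      can [:read, :update], SiteWorker, site_id: user.managed_sites.pluck(:id)
    end

    if user.operative?
      # Operatives can only read records for active sites on contracts they work on.
      can :read, Operative,
          site: { status: "active", account: account },
          contract: { workers: { user_id: user.id } }
    end
  end
end
```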
In this case, I was updating the specs for a controller that was accessed by three different roles - "administrators", "site managers" and "contract managers". I updated the rule in CanCanCan accordingly, but the spec kept on returning no output for contract managers. I peppered the spec with `puts` statements, trying to pinpoint why it was not working. Had I got my authorisation conditions wrong? Was I setting up the test data incorrectly? But everything looked OK and the test continued to fail.
After an hour of getting nowhere, I asked Claude to look at it. Within ten seconds, Claude had identified the problem and thirty seconds later, it had fixed the spec. CanCanCan was constructing a SQL join, based on the conditions I had specified. But, at a certain level of complexity, CanCanCan gets into a mess and builds an invalid query. Claude spotted this immediately, rewrote the conditions with a simpler join (producing the same results), then ran the spec again. We were green once more. Incidentally, I've been using CanCanCan for about ten years, and I knew that it struggles with complex conditions, but I still wasted an hour on something that Claude fixed in a minute.
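As a hypothetical illustration of that kind of rewrite (not the actual change Claude made): resolve part of the nested condition up front, so CanCanCan is left with a much simpler query to build.

```ruby
# Hypothetical illustration, not the actual fix - inside the Ability sketch above.
# Before: CanCanCan has to construct a three-table join from nested hashes.
can :read, Operative,
    site: { status: "active", account: account },
    contract: { workers: { user_id: user.id } }

# After: resolve the contract IDs up front (assuming Contract has a `workers`
# association and Operative has a `contract_id` column), leaving a flat condition.
contract_ids = Contract.joins(:workers).where(workers: { user_id: user.id }).pluck(:id)
can :read, Operative,
    site: { status: "active", account: account },
    contract_id: contract_ids
```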
One last example.
Our main product is a big Rails app that was written in a hurry. The full test suite takes 35 minutes to run, in parallel, on my M4 Pro MacBook. However, I knew there were a couple of things I could do to optimise this.
Firstly, I was using a factory gem that Caius and I had written fifteen years ago. It did the job, but there were certain places where it would create a whole tree of test records when only a few of them were required. I have recently written Fabrik (inspired by Oaken), which gives me much more fine-grained control over what gets built and when. But there were over 30,000 lines of spec code that would need updating - a job that, even if I had the time, I really didn't want to do.
Secondly, some of the models required ActiveJob. They would create or update dependent records for things that didn't need to happen immediately in the application, but would have an impact on the specs. Things like changing a user's role must then update permission records across a number of tables. So the tests had ActiveJob running `inline` for every spec. But not every spec needed those background jobs - and they were causing a whole load of database activity that, in most cases, was not necessary. I needed to switch ActiveJob off, go through every now-failing spec and switch it back on (`perform_enqueued_jobs`) for the examples where it was actually required.
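The mechanics of the switch are simple enough - roughly this, assuming RSpec plus ActiveJob's built-in test helpers (the example spec itself is hypothetical):

```ruby
# spec/rails_helper.rb - a sketch, assuming RSpec and ActiveJob::TestHelper.
RSpec.configure do |config|
  config.include ActiveJob::TestHelper

  # Jobs get queued but are NOT performed by default.
  config.before(:each) { ActiveJob::Base.queue_adapter = :test }
end

# Then only the examples that genuinely need the side effects run the jobs.
# (Hypothetical spec - the model and assertion are illustrative.)
RSpec.describe "changing a user's role" do
  it "updates the permission records" do
    user = User.create!(role: "operative")   # illustrative setup

    perform_enqueued_jobs do
      user.update!(role: "site_manager")
    end

    expect(user.permissions).not_to be_empty
  end
end
```

The fiddly part isn't the configuration - it's going through every spec that used to rely on the jobs running inline and deciding whether it really needs them.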
I knew what needed to be done, but it required me to go back through a lot of complex code. I'd need to patiently make changes, test those changes, fix the breakages before moving on to the next spec. A huge task. No wonder I kept putting it off.
Last weekend, I put Claude to work.
I broke the two tasks down into simple steps, then told Claude to do each step in turn, stopping before moving to the next one. (LLMs can get confused and get stuck in a loop when they're given big tasks). I got started on the Friday afternoon, and checked in every hour or so (giving feedback or refining the tasks as needed). I stopped on Friday evening, then restarted on Saturday morning, checking in less frequently this time. And by Sunday evening, both tasks were complete. The entire test suite had been updated, using Fabrik throughout and only using ActiveJob where necessary. The end result? The test suite had gone from over 35 minutes to an average of 18 minutes to complete.
The way LLMs can help us has changed immensely in the last six months - the rate of progress is breathtaking. But just using them to "vibe code", whilst empowering for non-technical users (that's an article for another time), is a waste of their best ability. They are good at the part of the job that we, as humans, are bad at. And whilst I'm happy to carry on working without them (the way I have been for almost thirty years), the time they've saved me in the last three weeks alone makes me think that would be a huge waste of my time. LLMs have suddenly become a vital tool in our software development arsenal.