Indeed, that is the most disturbing instance in that particular run. Just ignore explicit commands? Sounds like an insubordinate AI.
My company has been pushing all of us to use AI more often, and is itself releasing several products that incorporate it. Part of this push is making GitHub Copilot, with a choice of several models, available to everyone who fools with code.
Well, I have a script that finds and reports changes in the configuration of the product I support by comparing its very long XML files. However, there is one section that is contained in the XML but is not itself XML. It predates our using XML to store the configuration, and apparently someone didn’t want to monkey with this pretty core function and port its config to XML. So it is stored in its own proprietary format. It’s not too bad, but it can include nested sections and things like that. I could port the original parser over, but that seemed like no fun. Ditto on writing my own. Boooor-ing. I know the spec well enough to describe it in a paragraph or so, but it seems tedious to write the code for it by hand. Perfect job for Copilot, right?
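To give a rough sense of the shape of the job (the real format is proprietary, so everything below is made up): a minimal sketch of the kind of recursive parser I had in mind, assuming a toy format where sections open with "section NAME {", close with "}", and hold key = value lines.

    # Purely hypothetical format, just to show the shape of the problem --
    # not the actual config, whose details I can't share here.
    def parse_section(lines, i=0):
        """Parse key/value pairs and nested sections until a closing brace."""
        result = {}
        while i < len(lines):
            line = lines[i].strip()
            i += 1
            if not line or line.startswith('#'):
                continue                                  # skip blanks and comments
            if line == '}':
                break                                     # end of the current (nested) section
            if line.startswith('section ') and line.endswith('{'):
                name = line[len('section '):-1].strip()
                result[name], i = parse_section(lines, i)  # recurse into the nested section
            elif '=' in line:
                key, _, value = line.partition('=')
                result[key.strip()] = value.strip()
        return result, i

    # e.g.  tree, _ = parse_section(blob_text.splitlines())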
So, I asked Claude 3.7 (the most advanced model we had available at the time) to write me the parser. I read through its suggestion, noticed a couple of things that needed tweaking, suggested those changes, and it gave me what at first glance seemed to be a good result. But upon testing it would do weird things. It would work fine when you were comparing two configs for different systems, but it would totally skip sections if the two configs were for the same system at two different points in time. So I went back to Claude and asked it to debug why, providing the updated script, the configuration files it was to use, and the current output. It replied with a hallucination: the cause, it claimed, was that large sections of this config contained nothing but ellipses.
Ellipses occur nowhere in this config, not even in the commit comments.
So I argued with it a bit, clarifying that the sections of the config it was pointing out actually did not contain ellipses. It responded with more easily disproven hallucinations. I switched to a couple of other models to see if they could debug it. None of them provided a solution that was even plausible, even if one of them did offer a couple of valid improvements to the parser.
So I actually worked through the code like a good monkey should have, figured out that it was opening and closing file handles in inappropriate ways, and found a logic error in how it decided it was done reading that section of the config. I fixed those, and ta-da, it worked.
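If it helps to picture that class of mistake, here is a hypothetical sketch -- not the actual code, and the arguments are invented -- of the two failure modes: reopening the file handle mid-read, and using the wrong test for "done reading":

    # Hypothetical illustration only, not the real script.
    def read_embedded_section_buggy(path, start_marker):
        """Pull the legacy (non-XML) blob out of a config file -- badly."""
        fh = open(path)                     # bug: this handle is never closed...
        for line in fh:
            if start_marker in line:
                break
        body = []
        fh = open(path)                     # ...and a second one is opened here, which
        for line in fh:                     #    restarts reading from the top of the file
            if line.strip() == '':          # bug: a blank line is treated as "done reading",
                break                       #      rather than the real end-of-section marker
            body.append(line)
        return body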
Did it make my life better, or more efficient? Nah, not really. I probably spent about the same amount of time debugging its mistakes and arguing with it as I would have spent writing it myself or porting the original parser over. It was different and frustrating in a new way, I suppose. It did create reports in a nicer format than I usually would have (even if some of that output is useless, but pretty). I’ll probably give it another chance when I next try to improve this script – but that’s mostly because it’s the company’s time, and they want me to try it. The folks “vibe coding” with these models seem insane.