A lawyer in BC recently got caught citing a couple of cases that did not exist in a court filing; AI “hallucinations”.
She’s been ordered to pay costs to the other side, apologised profusely to the court, and may be facing professional discipline.
There was a similar case here in the US just a few months after ChatGPT exploded.
That’s too bad. Speaking generically, we want to teach students how to use “AI” tools to do real work, not as bullshit generators.
(This thread is mostly about professional and Phd-level submissions, but may I broaden it to freshman-level cheating?)
Here’s a technique which is intended to teach critical thinking, but may be useful for catching AI abuse: have the students write a long, personal annotated bibliography for each paper.
Have the student list every source he read and explain in great detail why he read that author, which chapters of that book applied to the topic of the paper, and how useful the student found that source. And, for other sources, have the student explain how he looked into the book but found that it was not relevant to his topic because it discussed different issues (and explain which issues).
A simple 4-page paper would need a 2-page explanatory note describing all the sources the student used, along with sources he investigated but did not use, and explaining how they relate to the professor’s lectures and the topic of the paper.
Example:
“In source number 1, I found this author to be more detailed in his description of the subject we discussed in class about Plato and Aristotle’s views on …x, but he did not mention the political situation of …y. I used this author for 3 of the 5 footnoted quotes in my paper. For source number 2, I found the book to be much more about the history of ancient Greece and the social position of Socrates, but not directly about Plato’s philosophy, so I only mention it in passing, with no direct quotes in the body of my paper. Source number 3 was an author who was more detailed and confusing than I had anticipated, as he assumed the reader already knew all the subjects we discussed in class about a, b, and c…and used many technical terms and Greek expressions such as x, y, and z, which were never mentioned in class and did not directly apply to the topic of my paper.”
The grade the student received on his paper was based on his explanatory notes in the bibliography as much as on the content of his paper. It seems like a good pedagogical technique, to force the student to organize and explain his own thought processes. But it seems to me that it would also be a way to test for cheating by AI. I think it would be difficult for ChatGPT to make up such specific references to the classroom discussions.
Legal Eagle had a good video on this very thing.
https://youtu.be/oqSYljRYDEM
From what I know about ChatGPT, I do not think this sort of thing will ever get better. It’s simply not what the technology is designed for. The tech has a mission to produce words, not accuracy.
It’s even worse than that. GPT, and generative AI at large, does not have any context with which to even distinguish “fact” from “fiction”. In machine intelligence terms, it is not ‘aligned’ for factuality, and has no inherent mechanism to ensure the accuracy of information or to distinguish its own confabulations from verifiable fact. It is all the same as far as the algorithm is concerned, so long as its trained neural network produces it. There is probably no way to force an LLM to be ‘honest’ or to validate the factuality of its own product, either, although you could add a post hoc ‘filter’ that at least verifies that information in the response independently corresponds to some database of factual information or published citations.
Stranger
I think you are both wrong. I work in this field now and factual accuracy is perhaps the biggest problem being addressed right now. It will be solved.
I know that people are working on the problem and, of course, expect that it is solvable, but most if not all of these approaches are ad hoc attempts at either verifying the factuality of particular types of statements or at controlling datasets and reinforcing factual (or factual-seeming) responses, largely for the purpose of making more reliable agents for natural language user interfaces. For certain types of basic fact checking, such as verifying that citations to publications actually exist, this approach of essentially having a lookup function can work (provided that you have some validated reference to check against), but the problem is that more complex semantic errors are non-trivial to check with such methods. Verifying the factuality of compound or multivariate statements intrinsically requires some kind of symbolic system of logic-like reasoning rather than just a simple lookup function, and making symbolic representation systems general enough to function within a broad world model (which was a leading approach in AI research before neural networks came back into vogue) is a frustratingly complicated problem that never achieved much outside of very constrained world models.
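As a toy illustration of just how shallow that lookup-style check is (the reference list and record format below are invented for the example), something like this is about all it can do:

```python
# Toy sketch of a lookup-style citation check: it can only confirm that a
# cited (title, author) pair appears in some validated reference list; it
# says nothing about whether the citation actually supports the claim.
# KNOWN_REFERENCES is an invented stand-in for a real bibliographic database.

KNOWN_REFERENCES = {
    ("Republic", "Plato"),
    ("Nicomachean Ethics", "Aristotle"),
}

def citation_exists(title: str, author: str) -> bool:
    """Return True only if the cited work appears in the validated list."""
    return (title.strip(), author.strip()) in KNOWN_REFERENCES

print(citation_exists("Republic", "Plato"))               # True: a real entry
print(citation_exists("Dialogues of Krontar", "Plato"))   # False: fabricated
```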
Furthermore, integrating these corrective methods into the core of an LLM (or other generative system) is essentially impossible because it is at odds with what the system is intended to do, which is to make statistical predictions about the next most likely set of tokens in order to produce verbiage. In other words, an LLM will produce a result that is most aligned with the frequencies in its training set rather than one that is “most truthful”, and even if you try to reinforce factual statements the system has no actual basis for interpreting intrinsic factuality in the context of the real world except by supervised reinforcement. This is compounded by the phenomenon of “hallucination”, where a generative model basically goes off the rails by reinforcing its own errors, often producing dramatically wrong or outright gibberish responses. There are a number of “hallucination mitigation methods” to limit just how far off track a generative AI can get, from retrieval-augmented generation and prompt tuning to supervised fine-tuning and loss-function frameworks, but that doesn’t mean it doesn’t still produce substantial errors, just that the remaining errors are subtler than the corrective methods can adjust for; in fact, hallucination is probably an intrinsic consequence of generative AI that can’t be coded out.
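For what it’s worth, the basic idea behind retrieval-augmented generation is simple enough to sketch; everything below (the corpus, the overlap scoring, the prompt format) is an invented stand-in rather than any particular product’s pipeline:

```python
# Minimal sketch of retrieval-augmented generation (RAG): fetch passages from
# a trusted corpus and prepend them to the prompt so the model is nudged to
# ground its answer in known text rather than in free invention.

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Rank corpus passages by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus.values(),
        key=lambda text: len(q_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, corpus: dict[str, str]) -> str:
    """Assemble a prompt that asks the model to answer only from the context."""
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

corpus = {  # tiny stand-in for a validated document store
    "plato": "Plato was a student of Socrates and founded the Academy in Athens.",
    "aristotle": "Aristotle studied at Plato's Academy and later tutored Alexander.",
}
print(build_prompt("Who founded the Academy in Athens?", corpus))
# The assembled prompt would then go to whatever LLM is actually in use.
```

Even with that kind of grounding, the model can still mangle or embellish the retrieved text, which is exactly the residual-error problem described above.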
And that gets to the fundamental problem of LLMs, which is that they are trained and ‘aligned’ to produce authoritative responses even when generating grammatically correct semantic nonsense. A ‘good’ LLM will produce a result in response to a prompt that sounds like an expert teacher lecturing to a student, and if the latter doesn’t know much about the topic they can easily be fooled into believing that the answer must be correct, even if a knowledgeable reader immediately sees factual errors and logical flaws in the response. Of course, human intelligences can also provide false responses to questions–and unlike LLMs trained and used for legitimate purposes, they’ll often do so intentionally rather than incidentally–but the difference is that most people are not actually good bullshitters, and a well-trained expert can see different patterns in truthful statements (“knowing”), unconfident but unintentionally non-truthful statements (“guessing”), and deliberate deception (“lying”), whereas an LLM projects the same level of confidence regardless of whether it is providing correct information or total nonsense. And while current LLMs from legitimate research organizations and companies are only incidentally non-factual, it is trivially easy to make an LLM trained on conspiranoia nonsense that produces authoritative-sounding bullshit propaganda, for which I can guarantee it is already being used. That kind of misuse is as inevitable as it is unpreventable, and frankly I don’t expect there to be any effective countermeasures as bad actors use reinforcement methods to make LLMs and other generative AIs even more effective at manipulating public beliefs and emotions.
Of course, people are already using and applying chatbots as ‘expert’ interfaces in critical applications even though it is well understood that these systems are just not reliable at producing factually correct answers to inexpert prompts, and nobody is waiting around for a bunch of egghead researchers to figure out how to make them more reliable as long as they can reduce labor costs and achieve business ‘efficiencies’. Which is the real danger of a rush into applied AI; not that it is going to produce AGI killbots that will murder us all and build a machine civilization on the ruins of ours (that comes later, if at all), but that we are going to place way too much faith in unreliable agents and put them in control of critical knowledge and control systems in ways that cannot easily be undone, as well as undermine our own critical thinking, organization, and communication skills so we don’t have to deal with the tedium of thinking about or doing this work ourselves, and hence lose the broad societal capacity to do so.
Stranger
Since generative AI tools such as ChatGPT became public in late 2022, publishers and researchers have debated these issues. Some say the tools can help draft manuscripts if used responsibly—by authors who do not have English as their first language, for example. Others fear scientific fraudsters will use them to publish convincing but fake work quickly. LLMs’ propensity to make things up, combined with their relative fluency in writing and an overburdened peer-review system, “poses a grave threat to scientific research and publishing,” says Tanya De Villiers-Botha, a philosopher at Stellenbosch University.
Some journals, including Science and Nature, and other bodies have already released rules about how scientists can use generative AI tools in their work. (Science’s News department is editorially independent.) Those policies often state that AI tools cannot be authors because they cannot be accountable for the work. They also require authors to declare where the tools have been used.
Stranger
This has got to be the most obvious AI-generated content I’ve ever seen submitted to Wikipedia:
Someone who saw that before I did reverted it as promotional rather than AI-generated.
Just as a followup, I got four papers to review for this conference and one was clearly AI generated. Supposedly from the place I mentioned in the OP, on a topic not even remotely relevant, and horribly written. Dead giveaway: Some of the references had titles, authors, journal numbers and page numbers, but no journal titles.
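That particular giveaway is easy enough to screen for mechanically; here is a rough sketch (the record format is just an assumption for the example, since every reference export looks different):

```python
# Crude screen for references that look complete except for a missing journal
# title, the giveaway described above. The field names are assumed for this
# example; a real reference export would need its own parsing.

def flag_suspicious(references: list[dict]) -> list[dict]:
    """Flag references with full details but no journal title."""
    flagged = []
    for ref in references:
        has_details = all(ref.get(field) for field in ("title", "authors", "volume", "pages"))
        if has_details and not ref.get("journal"):
            flagged.append(ref)
    return flagged

refs = [
    {"title": "A Real Paper", "authors": "Smith, J.", "journal": "J. Stuff",
     "volume": "12", "pages": "1-10"},
    {"title": "A Hallucinated Paper", "authors": "Jones, K.",
     "volume": "7", "pages": "99-110"},  # no journal title
]
for ref in flag_suspicious(refs):
    print("Suspicious:", ref["title"])
```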
I recommended that the authors be banned.
Zeitschrift für Undrukbare Komischeschrecken
Death is too good for them.
Stranger
I think being sentenced to be a rag picker in the garbage piles would be worse.
I had not heard being a reviewer called that…
I just finished judging 118 independently published science fiction books. I didn’t think of it in those terms, but it kind of fits …
I work in IT, where there is some concern that these LLMs will be coming for our jobs.
I am pretty sure that will not happen: writing the business requirements that make up even a small project clearly enough for an LLM would require an exceptionally competent technical writer, something of which the industry has a dearth.
But for shits and giggles, and because I am awake at 05:00 (SAST) due to a nagging cough waking me up, I asked ChatGPT to generate a recursive Fibonacci sequence generator to provide the first 20 numbers. This is a pretty trivial test you might give a junior applicant.
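For reference, the expected answer in Python would be something like this (my own sketch of what a junior applicant might hand back, not ChatGPT’s output):

```python
# The trivial junior-applicant version: a recursive Fibonacci function,
# printed for the first 20 numbers. Fine at this size; exponential for large n.

def fib(n: int) -> int:
    """Return the nth Fibonacci number using plain recursion."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print([fib(i) for i in range(20)])
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181]
```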
But I chose, in my somewhat unwell state, to ask it to write it in the esoteric language “Brainfuck” .
Now, granted, the language was purposefully designed to be hard to understand, and I imagine ChatGPT only knows about it by accident, but I was a little disappointed by the output when I ran it through an online compiler:
My job is safe.
Yes, well…
Note that you are not necessarily supposed to type your commands into “ChatGPT”. You are supposed to (e.g.) use Copilot or Devin or OpenDevin or whatever with a high-powered GPT on the back end. People already use these tools. And the point is that one person will be able to do the job that took n people in the past, not that all humans up to and including the CEO will be immediately replaced. Just hope that you are not one of the unlucky n-1.
Abstract
The use of ChatGPT and similar Large Language Model (LLM) tools in scholarly communication and academic publishing has been widely discussed since they became easily accessible to a general audience in late 2022. This study uses keywords known to be disproportionately present in LLM-generated text to provide an overall estimate for the prevalence of LLM-assisted writing in the scholarly literature. For the publishing year 2023, it is found that several of those keywords show a distinctive and disproportionate increase in their prevalence, individually and in combination. It is estimated that at least 60,000 papers (slightly over 1% of all articles) were LLM-assisted, though this number could be extended and refined by analysis of other characteristics of the papers or by identification of further indicative keywords.
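Roughly, the counting approach works like the sketch below; this is not the authors’ code, and the keyword list is only an assumed example of the kind of words reported as over-represented in LLM-assisted text:

```python
# Rough illustration of keyword-prevalence estimation: count how many
# abstracts contain words flagged as disproportionately common in
# LLM-generated prose. The keyword list here is illustrative only.

import re
from collections import Counter

SUSPECT_KEYWORDS = {"delve", "intricate", "showcasing", "underscores"}  # assumed examples

def keyword_hits(abstract: str) -> Counter:
    """Count occurrences of the suspect keywords in one abstract."""
    words = re.findall(r"[a-z]+", abstract.lower())
    return Counter(w for w in words if w in SUSPECT_KEYWORDS)

def prevalence(abstracts: list[str]) -> float:
    """Fraction of abstracts containing at least one suspect keyword."""
    flagged = sum(1 for a in abstracts if keyword_hits(a))
    return flagged / len(abstracts) if abstracts else 0.0

sample = [
    "We delve into the intricate interplay of factors, showcasing our framework.",
    "We measured the reaction rate at three temperatures.",
]
print(f"{prevalence(sample):.0%} of sample abstracts contain suspect keywords")
```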
Stranger
Interesting. I wish that they had looked at some of the papers with high numbers of the suspicious terms to see if they were written totally with AI or just partially, and which parts. All the papers in the study were published - I assume they mean really published, after peer review - so it would be interesting to see what got through. It would also be interesting to poll editors to see if there were many obviously AI written papers coming in, as I noticed.
The idea of reviewers using AI to write reviews is just scary. Falsifying results is bad enough; this might be even worse.