<https://theconversation.com/a-weird-phrase-is-plaguing-scientific-papers-and-we-traced-it-back-to-a-glitch-in-ai-training-data-254463>
A weird phrase is plaguing scientific papers – and we traced it
back to a glitch in AI training data
Rayane El Masri
Earlier this year, scientists discovered a peculiar term
appearing in published papers: “vegetative electron microscopy”.
This phrase, which sounds technical but is actually nonsense,
has become a “digital fossil” – an error preserved and
reinforced in artificial intelligence (AI) systems that is
nearly impossible to remove from our knowledge repositories.
Like biological fossils trapped in rock, these digital artefacts
may become permanent fixtures in our information ecosystem.
The case of vegetative electron microscopy offers a troubling
glimpse into how AI systems can perpetuate and amplify errors
throughout our collective knowledge.
A bad scan and an error in translation
Vegetative electron microscopy appears to have originated
through a remarkable coincidence of unrelated errors.
First, two papers from the 1950s, published in the journal
Bacteriological Reviews, were scanned and digitised.
However, the digitising process erroneously combined
“vegetative” from one column of text with “electron” from
another, creating the phantom term.
Decades later, “vegetative electron microscopy” turned up in
some Iranian scientific papers. In 2017 and 2019, two papers
used the term in English captions and abstracts.
This appears to be due to a translation error. In Farsi, the
words for “vegetative” and “scanning” differ by only a single
dot.
An error on the rise
The upshot? As of today, “vegetative electron microscopy”
appears in 22 papers, according to Google Scholar. One was the
subject of a contested retraction from a Springer Nature
journal, and Elsevier issued a correction for another.
The term also appears in news articles discussing subsequent
integrity investigations.
Vegetative electron microscopy began to appear more frequently
in the 2020s. To find out why, we had to peer inside modern AI
models – and do some archaeological digging through the vast
layers of data they were trained on.
Empirical evidence of AI contamination
The large language models behind modern AI chatbots such as
ChatGPT are “trained” on huge amounts of text to predict the
likely next word in a sequence. The exact contents of a model’s
training data are often a closely guarded secret.
To test whether a model “knew” about vegetative electron
microscopy, we fed it snippets of the original papers to see
whether it would complete them with the nonsense term or with
more sensible alternatives.
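
As a rough illustration of this kind of completion probe, here is a minimal sketch using the openly available GPT-2 model via the Hugging Face transformers library. The prompt and candidate continuations below are invented placeholders for illustration, not the snippets the authors actually tested; a probe of a closed model such as GPT-3 would instead go through its API.

```python
# Minimal sketch of a completion probe: compare how likely a model finds two
# candidate continuations of the same prompt. The prompt text is an invented
# placeholder, not a quote from the papers discussed in the article.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The morphology of the bacterial cells was examined by"
candidates = ["scanning electron microscopy", "vegetative electron microscopy"]

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum the model's log-probabilities for the continuation tokens."""
    prompt_ids = tokenizer.encode(prompt)
    cont_ids = tokenizer.encode(" " + continuation)
    full_ids = prompt_ids + cont_ids
    input_ids = torch.tensor([full_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score each continuation token given everything before it.
    for pos in range(len(prompt_ids), len(full_ids)):
        total += log_probs[0, pos - 1, full_ids[pos]].item()
    return total

for cand in candidates:
    print(f"{cand!r}: log-probability {continuation_logprob(prompt, cand):.2f}")
```

A contaminated model would assign the nonsense continuation a log-probability close to, or above, the sensible one; an uncontaminated model should strongly prefer “scanning”.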
The results were revealing. OpenAI’s GPT-3 consistently
completed phrases with “vegetative electron microscopy”. Earlier
models such as GPT-2 and BERT did not. This pattern helped us
isolate when and where the contamination occurred.
We also found the error persists in later models including
GPT-4o and Anthropic’s Claude 3.5. This suggests the nonsense
term may now be permanently embedded in AI knowledge bases.
By comparing what we know about the training datasets of
different models, we identified the CommonCrawl dataset of
scraped internet pages as the most likely vector through which
AI models first learned this term.
The scale problem
Finding errors of this sort is not easy. Fixing them may be
almost impossible.
One reason is scale. The CommonCrawl dataset, for example, is
millions of gigabytes in size. For most researchers outside
large tech companies, the computing resources required to work
at this scale are inaccessible.
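
To make that scale concrete, here is a rough sketch, assuming the warcio library and a locally downloaded Common Crawl WET (extracted-text) file, of scanning just one such file for the phantom phrase. A single monthly crawl contains tens of thousands of these files, each hundreds of megabytes compressed, which is exactly the problem.

```python
# Rough sketch: scan one Common Crawl WET (extracted-text) file for the
# phantom phrase and report the source URL of any page that contains it.
# Assumes the warcio library and a WET file already downloaded locally.
import sys
from warcio.archiveiterator import ArchiveIterator

PHRASE = b"vegetative electron microscopy"

def scan_wet_file(path: str) -> None:
    """Print the source URL of every page whose extracted text contains PHRASE."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text records
                continue
            text = record.content_stream().read().lower()
            if PHRASE in text:
                print("hit:", record.rec_headers.get_header("WARC-Target-URI"))

if __name__ == "__main__":
    # Usage: python scan_wet.py <downloaded .warc.wet.gz file>
    scan_wet_file(sys.argv[1])
```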
Another reason is a lack of transparency in commercial AI
models. OpenAI and many other developers refuse to provide
precise details about the training data for their models.
Research efforts to reverse engineer some of these datasets have
also been stymied by copyright takedowns.
When errors are found, there is no easy fix. Simple keyword
filtering could deal with specific terms such as vegetative
electron microscopy. However, it would also eliminate legitimate
references (such as this article).
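
As a toy illustration (entirely hypothetical, not any publisher's or AI developer's actual tooling), a naive blocklist filter drops contaminated documents and legitimate reporting alike:

```python
# Toy illustration of why naive keyword filtering over-blocks: the same rule
# that drops contaminated documents also drops legitimate discussions of the
# error, such as this article.
BLOCKLIST = {"vegetative electron microscopy"}

def keep(document: str) -> bool:
    """Return True if the document survives the blocklist filter."""
    lowered = document.lower()
    return not any(term in lowered for term in BLOCKLIST)

contaminated = "Samples were imaged by vegetative electron microscopy at 20 kV."
legitimate = ("The nonsense phrase 'vegetative electron microscopy' arose "
              "from a digitisation error, researchers found.")

print(keep(contaminated))  # False: correctly filtered out
print(keep(legitimate))    # False: legitimate reporting is filtered out too
```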
More fundamentally, the case raises an unsettling question. How
many other nonsensical terms exist in AI systems, waiting to be
discovered?
Implications for science and publishing
This “digital fossil” also raises important questions about
knowledge integrity as AI-assisted research and writing become
more common.
Publishers have responded inconsistently when notified of papers
including vegetative electron microscopy. Some have retracted
affected papers, while others defended them. Elsevier notably
attempted to justify the term’s validity before eventually
issuing a correction.
We do not yet know if other such quirks plague large language
models, but it is highly likely. Either way, the use of AI
systems has already created problems for the peer-review
process.
For instance, observers have noted the rise of “tortured
phrases” used to evade automated integrity software, such as
“counterfeit consciousness” instead of “artificial
intelligence”. Additionally, phrases such as “I am an AI
language model” have been found in other retracted papers.
Some automatic screening tools such as Problematic Paper
Screener now flag vegetative electron microscopy as a warning
sign of possible AI-generated content. However, such approaches
can only address known errors, not undiscovered ones.
Living with digital fossils
The rise of AI creates opportunities for errors to become
permanently embedded in our knowledge systems, through processes
no single actor controls. This presents challenges for tech
companies, researchers, and publishers alike.
Tech companies must be more transparent about training data and
methods. Researchers must find new ways to evaluate information
in the face of convincing AI-generated nonsense. Scientific
publishers must improve their peer review processes to spot both
human and AI-generated errors.
Digital fossils reveal not just the technical challenge of
monitoring massive datasets, but the fundamental challenge of
maintaining reliable knowledge in systems where errors can
become self-perpetuating.