New subject: [nexa] ‘Biggest act of copyright theft in history’: thousands of Australian books allegedly used to train AI model | Australia news | The Guardian

Sept. 29, 2023

      <https://www.theguardian.com/australia-news/2023/sep/28/australian-books-trai...>

Thousands of books from some of Australia’s most celebrated authors have potentially been caught up in what Booker prize-winning novelist Richard Flanagan has called “the biggest act of copyright theft in history”.

The works have allegedly been pirated by the US-based Books3 dataset and used to train generative AI for corporations such as Meta and Bloomberg.

Flanagan, who found 10 of his works, including the multi-international award-winning 2013 novel The Narrow Road to the Deep North, on the Books3 dataset, told Guardian Australia he was deeply shocked by the discovery made several days ago.

“I felt as if my soul had been strip mined and I was powerless to stop it,” he said in a statement.

“This is the biggest act of copyright theft in history.”

The Australian Publishers Association confirmed to Guardian Australia on Wednesday that as many as 18,000 fiction and nonfiction titles with Australian ISBNs (unique international standard book numbers) appeared to be affected by the copyright infringement, although it is not yet clear what proportion of these are Australian editions of internationally authored books.

“We’re still working through [the data] to work out the impact in terms of Australian authors,” APA spokesperson Stuart Glover said.

“This is a massive legal and ethical challenge for the publishing industry and for authors globally.”

A search tool published on Monday by US media platform The Atlantic and uploaded by the US Authors Guild on Wednesday revealed the works of Peter Carey, Helen Garner, Kate Grenville, Anna Funder, Christos Tsiolkas and Thomas Keneally, as well as Flanagan and dozens of other high-profile Australian authors, were included in the pirated dataset containing more than 180,000 titles.

On Thursday, the Australian Society of Authors issued a statement saying it was “horrified” to learn that the works of Australian writers were being used to train artificial intelligence without permission from the authors.

ASA chief executive, Olivia Lanchester, described the Books3 dataset as piracy on an industrial scale.

“Authors appropriately feel outraged,” Lanchester said. “The fact is this technology relies upon books, journals, essays written by authors, yet permission was not sought nor compensation granted.”

Lanchester said the Australian literary industry, while not objecting per se to emerging technologies such as AI, was deeply concerned about the lack of transparency evident in the development and monetisation of AI by global tech companies.

“Turning a blind eye to the legitimate rights of copyright owners threatens to diminish already precarious creative careers,” she said.

“The enrichment of a few powerful companies is at the cost of thousands of individual creators. This is not how a fair market functions.”

Josephine Johnston, chief executive of Australia’s Copyright Agency, described the Books3 development as “a free kick to big tech” at the expense of Australia’s creative and cultural life.

“We’re going to need greater transparency – how these tools have been developed, trained, how they operate – before people can truly understand what their legal rights might be,” she said.

“We seem to be in this terrible position now where content owners – remembering that the vast majority of them will be individual authors – may actually have to take out court cases to enforce their rights.”

Australian copyright law protects creators of original content from data scraping.

Litigation in the US against ChatGPT creator OpenAI over its use of allegedly pirated book datasets, Books1 and Books2 (which do not appear to be affiliated with Books3), has already commenced.

In July, North American horror/fantasy writers Mona Awad (author of Bunny) and Paul Tremblay (author of The Cabin at the End of the World) filed a lawsuit in a San Francisco federal court, alleging ChatGPT unlawfully digested their books as part of its AI training data.

On 28 August, OpenAI filed a motion to dismiss the lawsuit, arguing that the authors “misconceive the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence”.

On 19 September the Authors Guild and 17 of its members, including bestselling novelists John Grisham, George RR Martin and Jodi Picoult, filed a complaint in a New York district court against OpenAI, seeking redress for “flagrant and harmful infringements” of guild members’ registered copyrights.

In a statement on its website, the guild says while it is aware that companies such as Meta and Bloomberg have used the Books3 dataset to train their LLMs, it is not yet clear whether OpenAI is using Books3 to train its ChatGPT models GPT 3.5 or GPT 4.

Guardian Australia has sought comment from Meta and from OpenAI, which has yet to officially respond to the guild’s complaint.

On 4 September, US technology magazine Wired reported that a Danish anti-piracy group called Rights Alliance had been told by Bloomberg that the company did not plan to train future versions of its BloombergGPT using Books3.

Bloomberg declined to respond to the Guardian’s queries.

The APA said the global nature of the issue would present significant challenges in enforcement and prosecution, and has joined the authors’ society in calling for AI technologies to be regulated.

Consultation closed last month for a Department of Industry, Science and Resources discussion paper on supporting responsible AI.

A parliamentary inquiry is under way examining the use of generative artificial intelligence in the Australian education system.

Flanagan said it was up to the Australian government to act to protect Australia’s writers.

“It has power and we do not,” he said.

“If it cares for our culture it must now stand up and fight for it.”
