Why stolen Australian books are being used to train AI

Jennifer Dudley-Nicholson

Trent Dalton says it felt “very creepy” to learn his best-selling novel had been stolen to train AI.

Award-winning Australian author Trent Dalton worries about what tech giants are going to do with his most personal story.

The writer, who this week launched Lola in the Mirror, recently discovered his best-selling debut novel had been stolen and used to train generative artificial intelligence tools.

He says it feels “very creepy” having the intimate tale, Boy Swallows Universe, taken and exploited without his knowledge or consent.

“The story was my mum’s story so it’s not even potentially taking things from me, it’s taking things from my mum,” Mr Dalton told AAP.

“And that really terrifies me; this sweet mum of mine who went through hell and she beautifully gave me that story, the only gift my mum could give to me.

“For me, it’s really unsettling and I find it deeply invasive.”

His book is among as many as 18,000 pirated works by Australian authors including Sally Hepworth, Candice Fox, Grantlee Kieza, Liane Moriarty, John Marsden and Colleen McCullough, taken from a database of more than 191,000 titles used to advance the technology.

Artificial intelligence and publishing experts say the issue flags the beginning of a complicated debate about technology, copyright, compensation for creators and industry regulation.

And while some argue the use of these books will not endanger the vital role “human authors” play, it could start a conversation about how writers want their work to be used and how they work with AI.

The dispute began earlier in 2023 when researchers discovered a large collection of pirated novels had been downloaded and used to train high-profile generative AI systems.

US authors including Sarah Silverman and Christopher Golden launched a lawsuit against Meta and OpenAI – the creator of ChatGPT – over the issue, alleging the companies “copied and ingested” protected work.

America’s Authors Guild has also initiated a class action case against OpenAI, representing 17 celebrated writers including John Grisham and George R. R. Martin.

Australian Society of Authors chief executive Olivia Lanchester says the group supports the Guild’s action and will write to AI firms to express concerns about the use of authors’ works.

She says using books to train AI tools without permission from their creators or payment lacks basic fairness.

“The inescapable message to authors and artists is that while your work has been essential in developing our product, we’re not prepared to pay you for it,” she said.

“Tech companies will charge the end user of their products but will not pay for the labour that enabled it.”

Australian Writers’ Centre chief executive Valerie Khoo says the way the Books3 dataset has been used is “appalling” and proves Australia needs strong regulations to prevent recurrences.

“We need to develop a robust model that can protect the intellectual property of writers and provide compensation where appropriate,” she said.

“The model should also give authors the choice on whether or not they want their work to feed into datasets used for machine learning.

“Some authors may be happy to do this, others not.”

The Books3 dataset has been removed from its original host following a take-down notice from Danish anti-piracy group Rights Alliance, and a Bloomberg spokesperson said the information would not be used to train future commercial versions of its AI tool.

But University of Queensland digital media and culture lecturer Leah Henrickson says it is easy to see why developers would want to use books to train AI tools as they could help produce longer, more consistent results.

“What a lot of these AI systems have been really good at is generating short snippets – it’s why chatbots work really well,” she said.

“Books often include a train of thought so all of a sudden, we have this dataset that allows us to look … paragraph by paragraph or chapter by chapter and we can see how we can make sense globally, rather than just in short snippets.”

Novels also have a history of being used to develop language recognition technology such as Google’s BERT, which is used to process English words and phrases in search queries.

Using books to train generative AI is more controversial, Dr Henrickson says, because the data is being used to create results that can feel like original texts even though it cannot compete with the creativity of writers.

“We will always need human authors,” she said.

“We’re not at the level where AI can generate numerous Jane Austen equivalents.”

UNSW regulation and governance associate professor Rob Nicholls says laws will be needed to ensure copyright material is not used without permission and authors are compensated for the use of their work. 

The process will be tricky, as quantifying the value of the work to developers will be hard and lawsuits about the issue are still pending.

“In practice, there’s a likelihood that the well-known large language model creators … would probably be quite content to pay a very small amount, but that hasn’t been tested yet,” he said.

“Cases haven’t come to a conclusion, nor have they been settled out of court.”

The working relationship between AI tools and some authors could become more complicated in the future, with Amazon recently asking writers who publish work through Kindle Direct Publishing to disclose whether content is “AI-generated”.

Dr Nicholls says many writers have started using generative AI in research but greater transparency would be needed if its results were used substantially.

“There will be an ongoing relationship between authors and generative AI,” he said.

“If authors use the output of the generative AI directly, there should be disclosure, but if they use it to help their creative flows, I’m not sure they need to disclose that any more than saying they used a thesaurus, because it is a research tool.”

AAP