For anyone who missed them: a couple of papers that show quite clearly that LLMs are indeed gigantic memories containing entire copyrighted works.

Cheers, Enrico

1)

Extracting books from production language models
Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang
https://arxiv.org/abs/2601.02671

Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question whether similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question ... and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. ... e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone ... Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
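To give an intuition for the kind of metric the abstract mentions: the paper's "nv-recall" is described as a block-based approximation of longest common substring, but its exact definition is not given here. The following is a minimal illustrative sketch of a block-based recall in that spirit; the function name, block threshold, and scoring are my own assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a rough block-based recall metric.
# The threshold `min_block` and the scoring rule are assumptions
# made for illustration, not the paper's exact nv-recall.
from difflib import SequenceMatcher

def block_recall(reference: str, generated: str, min_block: int = 5) -> float:
    """Fraction of reference words covered by contiguous shared blocks
    of at least `min_block` words (an LCS-style recall)."""
    ref = reference.split()
    gen = generated.split()
    if not ref:
        return 0.0
    matcher = SequenceMatcher(None, ref, gen, autojunk=False)
    covered = sum(block.size
                  for block in matcher.get_matching_blocks()
                  if block.size >= min_block)
    return covered / len(ref)

text = "it was the best of times it was the worst of times"
print(block_recall(text, text))                     # identical text -> 1.0
print(block_recall(text, "completely different words"))  # no shared block -> 0.0
```

A score near 1.0 means most of the reference text reappears verbatim in long blocks, which is the signal the authors use to flag memorization.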

----------------

2)

Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty
Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
https://arxiv.org/abs/2603.20957

Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text.

...

Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors’ works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.


--

-- EN

https://www.hoepli.it/libro/la-rivoluzione-informatica/9788896069516.html
======================================================
Prof. Enrico Nardelli
Past President of "Informatics Europe"
Director of the "Informatica e Scuola" National Laboratory of CINI
Department of Mathematics - University of Rome "Tor Vergata"
Via della Ricerca Scientifica snc - 00133 Roma
home page: https://www.mat.uniroma2.it/~nardelli
blog: https://link-and-think.blogspot.it/
tel: +39 06 7259.4204 fax: +39 06 7259.4699
mobile: +39 335 590.2331 e-mail: nardelli@mat.uniroma2.it
online meeting: https://blue.meet.garr.it/b/enr-y7f-t0q-ont
======================================================
