Jun 30, 2025
Why AI Training Should Fall Outside Copyright’s Domain: A Legal Analysis
By Marc Hoag, Attorney | marc@marchoag.com

Acknowledgements
This paper was a personal passion project of mine for the last three months, so I am immensely grateful to those of you who contributed your input, support, and above all, motivation to keep going. First and foremost, a huge and heartfelt thank you to Stanford University’s Mark Lemley for your time and generosity; your incisive and substantive feedback didn’t just validate my thesis, but sharpened it. Not to state the obvious, but I’d love to meet up some time in Palo Alto. Nancy Rapoport, thank you for your thoughtful edits and comments on the first draft of my paper; you provided the first glimmer that I was perhaps onto something and had to keep going. To Scott Buell and the Marin County Bar Association, thank you for your serendipitous invitation to author an article for The Marin Lawyer just as I happened to be putting the finishing touches on this paper and looking for a place to get it published. Finally, thank you to my wife Crina Hoag for your boundless love and support, and to our 3-year-old for keeping us giggling day and night; I love you both.
-Marc
May 22, 2025
Author’s Note
Shortly before this paper went to press, the U.S. Copyright Office released the pre-publication draft of the third installment in its Generative AI Training Report.1 The report seems to validate, at least in part, many of my central technical arguments, i.e., that AI training “involves creating a statistical model” and “will often be transformative.” However, the Office assumes the applicability of copyright law to AI training as a foundational matter and analyzes the issue chiefly through a fair use lens. While my paper diverges on this fundamental point – arguing that AI training should fall entirely outside copyright’s domain – the Copyright Office’s technical observations about the nature of AI training align well with, and indeed reinforce, my premises and fair use analysis developed below. And yes, while I of course leveraged AI (specifically GC_ai2 and ChatGPT o3 and 4.5) for brainstorming and proofreading, all analysis and final copy remain entirely my own.
Preface: Definitions
This paper examines artificial intelligence (“AI”) models known as generative pre-trained transformers (“GPTs”), which we refer to as “generative AI” or “gen AI.” Familiar examples include OpenAI’s ChatGPT, xAI’s Grok, Google’s Gemini, and Anthropic’s Claude.
Here, “models” refers to large language models (“LLMs”) trained on vast troves of scraped internet material – material that, as set forth below, is neither “copied” nor “fixed” in the copyright sense under 17 U.S.C. §101 – or, even if it is, any such copying occurs without the volition of the AI developers themselves.
For this discussion, “AI,” “models,” and “LLMs” are synonymous unless stated otherwise. Likewise, “scraping” and “training” are treated as integrated steps in crafting an LLM.
In contrast, generative models that produce images, video, music, or voice rely on architectures – such as diffusion models, GANs, or transformer variants – distinct from GPT-based text models. Though the core logic presented here applies broadly – and we briefly discuss its application to the training of AI image generation tools – AI training and generation of non-text media involve distinct technical specifics that raise additional considerations beyond the scope of this paper.
“End users” or similar refers to individuals or companies that use various AI products.
Finally, “input” and “output” refer, respectively, to the user’s “prompting” of an AI model which produces a response, be it written text, voice, music, video, sound, or otherwise.
Introduction: Challenging the Premise of AI Training Copyright Infringement
The argument that training generative AI models necessarily infringes on copyright is the easy, default view to accept; arguing the alternative is neither trivial nor popular. (Fortunately, however, the views in this paper are in good scholarly company, as demonstrated throughout.)
As we await a decision in the pivotal New York Times Co. v. Microsoft Corp. (No. 1:23-cv-11195 (S.D.N.Y. Dec. 27, 2023)) – colloquially, NYT v. OpenAI – courts around the world are similarly grappling with whether AI training – scraping the entire internet to train models like OpenAI’s ChatGPT – infringes on creators’ copyrights.
UK and French publishers’ and authors’ recent lawsuits against Meta,3,4 and numerous other cases brought by visual artists and content creators worldwide all share a common premise: that copyright law necessarily applies to AI training in the first place.
This paper challenges that foundational assumption. Rather than merely examining whether AI training constitutes fair use or otherwise qualifies as non-infringing under existing copyright frameworks, the question is reframed:
Whether AI training, in general, falls entirely outside copyright law’s domain.
NYT v. OpenAI, likely to be the landmark decision that establishes the precedent for how companies can train their AI models, exemplifies the growing tension between traditional copyright frameworks and emerging AI technologies, raising issues that transcend national boundaries. In fact, the rapid, global proliferation of AI requires a unified fabric of interoperable rules and policies, and not the cobbled-together, thus far incompatible patchwork of regulations currently evolving in parallel around the world.5
The international landscape offers instructive context for this US debate. The UK government recently proposed6 a copyright framework allowing AI companies to train models on copyrighted materials unless rights holders actively opt out. This has sparked controversy, with critics arguing it undermines copyright protections and violates international agreements like the Berne Convention.7
The Berne Convention, a foundational copyright treaty, enshrines the principle that copyrighted works are protected without formalities, meaning creators shouldn’t need to take additional steps to prevent unauthorized use. Some legal experts contend that the UK’s opt-out system effectively introduces such a formality – by requiring creators to proactively safeguard their rights – potentially conflicting with the treaty’s standards and regulatory intent.
Interestingly, the EU Copyright Directive8 takes a similar approach, permitting AI training on copyrighted materials unless rights holders opt out. However, the EU system provides more explicit mechanisms for managing and enforcing these opt-outs, such as machine-readable reservations, clear national implementation guidelines, and structured stakeholder dialogue. These features offer a more balanced framework, arguably aligning better with international norms and drawing less criticism than the UK’s broader, less defined proposal.
These international approaches, details of which are beyond the scope of this paper, provide valuable context for US courts and policymakers grappling with similar questions. However, as this paper seeks to demonstrate, the fundamental nature of AI training should render these regulatory debates largely moot, because AI training falls outside copyright’s domain entirely.
Category Error: Why AI Training is Not a Copyright Issue
The entire discussion over whether AI model training constitutes copyright infringement is based on a category error, a fundamental mischaracterization and wholesale misunderstanding of what AI models actually do with training data. Discussions have improperly started with the question of infringement, when they should instead begin by examining the threshold applicability of copyright law to this novel technological process.
This analysis attacks the scope question – i.e., whether training itself creates an infringing derivative work – but it does not discard the fair use defense; indeed, if courts insist on misclassifying training as “copying,” then the fair use fallback remains essential.
Crucially, copyright law defines “copies” as “material objects... in which a work is fixed by any method... and from which the work can be... reproduced,” and a work is “fixed” only when its embodiment is “sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration.”9
For an activity to potentially infringe copyright, it must first meet the threshold requirement of creating “copies” in a legal sense. AI training fundamentally fails to meet this basic requirement because AI models do not store or reproduce copyrighted works in, or from, a fixed or otherwise directly retrievable manner. Or, to the extent any such “copying” or “memorization” does occur, it occurs without the volition of the AI developers (see infra).
Simply put, when an end user interacts with an AI model, he or she cannot directly access the training material. Instead, the user provides a prompt, and the model generates an original – i.e., wholly novel – output based on its own “reasoning,” and not by searching and reproducing stored content like Google’s Search index.
Copyright law already accommodates large-scale uses, like library digitization, Google Books, and image indexing, without redefining what constitutes a copy. So any rebuttal leveraging the sheer scale of AI training misses the point entirely: it is not a question of degree, but of whether copyright applies at all; and here, the question itself presupposes a legal framework that simply doesn’t apply.
Geoffrey Hinton – the “Godfather of AI” and winner of the 2024 Nobel Prize for his foundational discoveries and inventions that have enabled machine learning with artificial neural networks – has said that “[neural nets] don’t pastiche10 together text they’ve read on the web because they’re not storing any text. They’re storing … weights and generating things.”11
An AI model doesn’t query a stored index or database of content when prompted, precisely because such content is not in fact stored anywhere. Instead, it generates output based on probabilities encoded in the trained model’s “weights,” numerical parameters that represent the strength of connections between artificial neurons.
In practical terms, the final trained AI model is essentially a “next word predictor,” identifying patterns from the training materials without storing or retrieving them. The US Copyright Office likewise noted that machine learning involves “creating a statistical model” as opposed to verbatim copying to a fixed medium any scraped material.12
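The “statistical model” point can be made concrete with a deliberately tiny illustration. The sketch below – a toy bigram model in Python, and emphatically not how production transformer LLMs are built – demonstrates the essential property at issue: after training, only transition probabilities (a crude analogue of “weights”) survive, and the training text itself is neither stored nor retrievable as such.

```python
import random
from collections import defaultdict

def train(corpus: str) -> dict:
    """Count word-to-next-word transitions; only these statistics are kept."""
    counts = defaultdict(lambda: defaultdict(int))
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    # Normalize counts into probabilities -- the "weights" of this toy model.
    # (In a real LLM, even these would be diffused across billions of dense
    # parameters rather than keyed to readable tokens.)
    return {
        prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
        for prev, nxts in counts.items()
    }

def generate(model: dict, start: str, length: int = 5) -> str:
    """Sample a continuation from stored probabilities, not stored text."""
    out, word = [start], start
    for _ in range(length):
        if word not in model:
            break
        nxts = model[word]
        word = random.choices(list(nxts), weights=list(nxts.values()))[0]
        out.append(word)
    return " ".join(out)

model = train("the cat sat on the mat and the cat ran")
# The corpus string plays no further role; the model holds only transition
# probabilities such as {"the": {"cat": 2/3, "mat": 1/3}}.
print(generate(model, "the"))
```

Note that one cannot query `model` for, say, “the second sentence of the corpus”; one can only sample probabilistically from it – which is precisely the distinction drawn above between a statistical model and a search index.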
During training, an AI system processes vast datasets but does not retain the actual content. It extracts statistical relationships, not fixed expressions, and once training is complete, the original data is effectively discarded. The end user never interacts with the training data directly because the model that remains is a wholly separate, new entity.
This is why an AI model cannot recall or directly retrieve training data. What emerges from this process is not a “copy” in any meaningful legal sense but a system that truly generates novel outputs. Without reproduction, distribution, or the creation of a derivative work, there is no infringement.
Sag (2025) likewise observes that “trained AI models do not replicate the expressive details of their training datasets; instead, they distill general patterns, abstractions, and insights from that training data.”13
Samuelson (2023) makes a similar argument for AI models that train on images, such as Stability AI’s, which “contain[] an extremely large number of parameters that mathematically represent concepts embodied in the training data, but the images as such are not embodied in its model.”14
This statistical nature of AI training is precisely what accounts for the infamous, and random, AI “hallucinations” or “confabulations.” These occur because AI outputs are based on probabilistic patterns, and not on fixed, retrievable data stored in an index or database, which would yield the kind of certainty one would expect from direct reproduction.
Think of it like a chef learning to cook by reading many recipes: the chef doesn’t memorize or reproduce the recipes verbatim but develops an understanding of flavor combinations and cooking techniques. What remains in the neural network after training is analogous to the chef’s intuition and skill, not a collection of recipes that can be queried verbatim.
This is not to encourage the strawman that humans learn like AI – even if we accept that such learning is indeed similar in kind, the difference in degree is so vastly disparate that any sincere argument along these lines is irrelevant – rather, it’s evidence that neural networks transform source training materials and leave no truly “fixed” “copies” behind.
Implicitly, therefore, copyright law should not apply to the ingestion and transformation of expressions into abstract statistical patterns, nor to the temporary or ephemeral storage of information in non-volatile media.
So unless one is prepared to rewrite and reimagine entirely the very essence of what copyright infringement means, and, simultaneously, to redefine what it means to “copy” a thing in a “fixed” fashion, then it is clear from this analysis that the learned process of training an LLM, without more, cannot be considered copying with respect to copyright infringement concerns.
“Regurgitating” the Memorization Question
Memorization – “regurgitation” of scraped content verbatim – can indeed occur unintentionally with all AI models. This is a probabilistic fluke; a bug, not a feature. Such memorization stems from statistical overlaps, not the intentional storage or retrieval of fixed copies; it’s akin to a human unintentionally recalling a phrase, not distributing a work.
Research on a predecessor to current models – OpenAI’s GPT-2 – as demonstrated by Carlini et al.15, underscored this reality. They noted, “[w]e are able to extract hundreds of verbatim text sequences from the model’s training data.” Subsequent research, including findings from Nasr et al.16, likewise verified this phenomenon, and in some cases observed that larger LLMs can exhibit memorization rates 150 times greater than smaller models.17
Lemley et al. (2025) refine this point. They tested 13 open-weight models against 56 books from the Books3 dataset and found that while Meta AI’s LLAMA 3.1 70B can indeed encode certain popular books (in this instance, Harry Potter and 1984) “almost entirely,” the largest models “don’t memorize most books—either in whole or in part.”18
Bracha (2024) stresses that “The GenAI training process is equivalent to a type of learning that has always been permitted under modern copyright: a process of extraction of meta-information from expressive works that then enables the production of new and different expression. The only difference is that machine learning incidentally involves physical reproduction … of a physical object … from which no human would ever access the expressive content of the work…. Focusing on this difference to label an otherwise allowed activity of learning as infringing is succumbing to a fallacy of physicalism.”19
Grimmelmann and Cooper (2024), despite advancing an opposing argument, conveniently define precisely for us that memorization occurs when “(1) it is possible to reconstruct from the model (2) a near exact copy of (3) a substantial portion of (4) that specific piece of training data.” They go on to distinguish memorization from “extraction” (the user-intended generation of a near-exact copy), from “regurgitation” (the generation of a near-exact copy even without user intent), and from “reconstruction” (the generation of a near-exact copy from the model by any means).20
On the other hand, they also take issue with Bracha and contend that “expression and information can be transformed during learning, but they can also be copied directly into model parameters – and the amount that one deems ‘copied’ depends on one’s chosen metric for memorization.”21
It is undisputed that any technological system can be coerced (or “tortured”) or otherwise manipulated to produce an unlawful or otherwise improper result. However, the same research demonstrates that the generated output is coaxed from the manipulation of stored probabilities, and not because the original training material has been copied into a fixed (i.e., non-volatile) medium and directly accessed. Instead, what is stored are abstract statistical patterns encoded as weights, not retrievable copies of the source material.
Moreover, multiple additional factors reinforce this conclusion, further demonstrating why AI training, without more, does not result in copyright infringement. First, verbatim reproduction22 of training content is extremely rare in modern AI systems, particularly in the absence of deliberate, targeted prompting to produce such results.
Second, AI developers are implementing safeguards to further reduce such memorization and regurgitation, including techniques like differential privacy, dataset cleaning, and deduplication. Grimmelmann and Cooper seem to concede as much, i.e., that “not all learning is memorization: much of what generative-AI models do involves generalizing from large amounts of training data, not just memorizing individual pieces of it.”23
Third, even if memorization occurs, it reflects incidental retention – an unintended side effect of the training process, not volitional copying (see infra) – since extracting it demands specific, often unnatural prompting, potentially involving exploitative techniques that would in any event likely violate an AI provider’s Terms of Service, and that are entirely unrelated to the training process.
Finally, the fact that such extreme efforts are required to coerce (or “torture”) an AI into regurgitating training material verbatim further proves that AI training is fundamentally different from indexing content – such as Google Search – and resembles reverse engineering, a practice also commonly proscribed by companies’ Terms of Service.
There is also the question of “intent” – did you mean to copy? – and the separate question of “volition” – did you in fact cause the copy? You can intend to copy yet never do so; conversely, if a copy is made, liability attaches whether or not you meant it.24,25
Courts have thus consistently required volition – but not intent – for direct infringement where automated systems lacked the necessary volitional element: “In the case of a VCR, it seems clear … that the operator of the VCR, the person who actually presses the button to make the recording, supplies the necessary element of volition, not the person who manufactures, maintains, or, if distinct from the operator, owns the machine.”26
So even if memorization is argued as a necessary step in the training process, copyright law requires not just that there was an act of copying, but crucially, that there was a volitional act that directly caused that copying, wholly ignoring any concerns about intent.
The point is that when AI developers train their various AI systems’ models, they take no volitional act to cause the models to memorize specific content; any memorization and resulting “regurgitation” of verbatim content is an unintended statistical artifact of the training process, or caused by the end user, and not a volitional copying action.
Simply put, the developers of ChatGPT (etc.) did not, of their own volition, design the product to cause memorization and regurgitation of copyrighted material; on the contrary, it is extremely difficult to produce copyrighted material verbatim. (As of this writing, I cannot prompt ChatGPT, for instance, to create images of Darth Vader. Although the image generation process begins in earnest, it soon aborts with the error message “I’m sorry, but I can’t help with that.”)
Unlike someone uploading a specific copyrighted movie (a clear volitional act), AI training involves creating a system that processes patterns without direct causation between developer actions and any specific memorized content.
Specifically, liability attaches only to the party that initiates the copy, and not to the otherwise passive technology providers that merely enable the technology itself. This is precisely the posture of AI model trainers who neither intend nor direct, of their own volition, the output, or capability to produce the output, of any specific copyright work.
Hence Bracha supports that “notwithstanding the physicalist fact of reproduction, training copies involve no reproduction of copyrightable subject matter and therefore cannot infringe. This is not owing to scope-type considerations at the back-end, such as fair use. Non-expressive training copies simply do not infringe from the outset, due to the most basic first principles of copyright that determine what subject matter lies within its domain in the first place.”27
In other words, AI training copies don’t infringe not because they are excused by a defense like fair use, but rather because they do not involve the kind of “expression” that copyright is meant to protect in the first instance.
But this analysis suggests a good practice: AI models should perhaps include a disclaimer like those in books and films, stating:
“Any resemblance between the generated output and any copyrighted works, persons (living or dead), events, or related materials, is purely coincidental and unintended. Such similarities arise from chance statistical patterns resulting from LLM training, or end user manipulation in violation of the AI model’s Terms of Service, and not because such content is retained, copied, fixed in, or otherwise accessible by the model.”
The Key Distinction: AI Training vs. Document Uploads
To further illustrate why AI training should fall outside copyright’s domain, consider the stark contrast with AI document uploads, which explicitly satisfy the threshold requirements for potential infringement.
When a user uploads (or “attaches”) a copyrighted PDF (or any other file type) into an AI tool, the system extracts, stores, analyzes, and actively references the content; the AI may even retrieve and quote verbatim passages. (Incidentally, this is why AI systems tend to take longer to generate output when analyzing attached files.) This meets the basic legal definition of copying because the document is being used in a fixed and retrievable way, and, crucially, it is stored in a non-ephemeral fashion, in a non-volatile medium.
This is also why some platforms require the end user to agree not to upload any files without permission, or if doing so would otherwise infringe on the copyright or other intellectual property rights of another.
This is in fact the underlying principle of “RAG” – Retrieval-Augmented Generation – where a generative AI model queries an external data source such as PDFs or a database. This is precisely how legal research tools work: the generative AI layer is used to interpret the end user’s queries and to produce appropriate output based on the user’s input, while an external corpus of legal knowledge is often accessed – at a relatively slower pace – to provide the specific information requested.
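The retrieval step that distinguishes RAG from bare generation can be sketched in a few lines. This is a toy sketch under simplifying assumptions – real systems rank documents with embedding vectors and a vector database rather than keyword overlap, and every name below is hypothetical – but it shows where the fixed, retrievable copy lives: in the external corpus, not in the model’s weights.

```python
def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank external documents by naive keyword overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(query: str, corpus: list[str]) -> str:
    """Build a prompt that quotes the retrieved document verbatim.

    In a real system this prompt would be passed to the generative model;
    here we simply return it to show that the external document -- unlike
    the model's weights -- is stored, fixed, and directly retrievable."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

# A stand-in for an uploaded PDF or external legal database.
corpus = [
    "The statute of limitations for breach of contract is four years.",
    "A copyright vests upon fixation in a tangible medium.",
]
prompt = answer("When does copyright vest?", corpus)
print(prompt)
```

The quoted passage reappears verbatim in the prompt precisely because it was retrieved from fixed storage – the structural feature that, as argued above, makes document uploads and RAG a genuine copyright question in a way that model training is not.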
Likewise, if somebody were to use an AI tool to produce output based on the ingestion of YouTube transcripts uploaded as PDFs, this too would undoubtedly trigger copyright infringement claims on behalf of the YouTube creators.
In contrast, generative AI training does not retain, and thus cannot directly retrieve, any specific copyrighted material. Once training is complete, the original data is discarded or otherwise not directly accessible by the end user. The AI model does not recall training inputs; only statistical patterns remain. Thus any output that appears to clone any of the training data is either pure probabilistic coincidence – a vanishingly rare event – or the result of sophisticated, targeted “torture” experiments designed to coax the AI into producing output that closely aligns, probabilistically, with the training inputs. Similar manipulation can likewise prompt AI image and video generation tools to produce obscene or otherwise inappropriate outputs.
If anything, this further bolsters the argument, infra, that the onus for any copyright claims must be on the end user, and not, as it were, on the companies building and operating the generative AI models.
This distinction is crucial. If we agree that AI document uploads present a viable copyright concern, then it must also be the case that AI training – which lacks any stored, retrievable copies – cannot, ipso facto, be infringing. The fact that copyright holders are targeting training but not document uploads suggests that these lawsuits may be legally and strategically flawed.
Human vs. AI: Eliminating the Unfortunate Strawman
To argue that an AI learns like a human may not be wholly inaccurate from a certain point of view; and as previously discussed, any difference in degree misses the point entirely. Regardless, it is indeed a strawman argument all the same and is addressed here only to simplify the discussion.
The legal distinction between learning from existing works and copying them has been long established for human creators. Artists, writers, and musicians routinely study existing works to develop their skills and styles without infringing copyright. Copyright law protects specific expressions, not ideas, techniques, or styles.28
And just as a human artist might study Picasso’s paintings to understand cubism without copying specific works, an AI system can learn patterns from many artists without retaining copies of their specific works. Both processes extract general principles and patterns rather than reproducing protected expressions.
Even if one dismisses the oversimplified comparison between AI learning and human learning as the strawman that it is, the fundamental fact remains: AI training should fall outside the scope of copyright law.
Moreover, subjecting AI training to such legal constraints would constitute a profound policy failure, especially given the substantial societal benefits already provided by AI’s technological advancements. Lemley has likewise argued that such AI training should “generally be permitted,” and that “there are good policy reasons to do so.”
Crucially, he stresses that “an [AI] system’s use of the data often is transformative… [because] it changes the purpose for which the work is used.” (Emphasis in the original.) Moreover, Lemley makes the practical point that “allowing a copyright claim [for any act of AI training on copyrighted materials] is tantamount to saying, not that copyright owners will get paid, but that the use won’t be permitted at all” since “there is no plausible option simply to license all of the [scraped content] for the new use.”29 We discuss this issue of licensing, infra.
Legal Precedents in Support of AI Training
Even if we were to incorrectly frame AI training as creating “copies,” existing legal precedents on ephemeral and intermediate copies would still exempt it from copyright infringement.
While loading software into RAM (random access memory, i.e., volatile memory) was held to constitute a copy under the Copyright Act,30 later cases limited this precedent and consistently ruled that temporary copies created as part of technological processes do not constitute copyright infringement.
For instance, temporary copies in RAM that exist for mere seconds do not meet the “fixed” requirement necessary for copyright protection; i.e., there was no copying (which impliedly requires first a copy operation followed by a subsequent paste operation, so to speak), and thus no infringement.31
AI training works in a similar fashion in that ingested data influences the AI’s internal model weights but is not fixed in a way that allows retrieval. Similarly, thumbnail images and transient copies stored in cache constituted fair use because they were used in a transformative way – specifically to create a visual search engine that provided a new function and public benefit distinct from the original images’ purpose.32
Alternatively, even if AI training does create non-transient “copies,” Google’s mass-scanning of books to create a searchable index was deemed fair use, even though full copies of books were retained by Google.33
The ruling hinged on two factors: the copies of the books did not substitute for the originals, and Google’s use augmented public knowledge by making books searchable. If mass-scanning and indexing were held to be fair use, AI training – which does not store or reproduce entire works – is even more defensible, not least because of AI’s already indisputable enhancement of public knowledge and education generally.
Also, this intermediate processing step is fundamentally different from creating “copies” in the copyright sense. Any temporary presence of data during training doesn’t result in fixed, retrievable copies in the final model. Rather, what emerges is a wholly transformed system of statistical patterns that cannot reproduce the original works. This distinction is crucial: while fair use might be relevant to the training process itself, the resulting models contain no retrievable copyrighted content and thus cannot be implicated by copyright law.
In a sense, then, even if one were to conclude that training involves non-transient copying of source material, it is of no consequence whatever with respect to the models themselves since whatever material may have been non-transiently stored is, first of all, no longer accessible, or if it is, it is certainly no more infringing than, say, Perfect 10’s image thumbnails, or Google Books’ searchable index of books’ entire contents.
Indeed, AI training is arguably vastly more transformative than Google Books because it does not merely index material, but rather creates wholly new, non-retrievable, probabilistic outputs that are distinct from any single training input. AI outputs demonstrably augment public knowledge, creativity, and work output in ways never before possible. So if Google Books was fair use, AI training must be as well.
Finally, as Bracha notes,34 “the reproduction of … computer code [is] entirely incidental to extracting the unprotected information.”35 This parallels precisely AI training where any copying is likewise incidental to extracting patterns, and not to accessing expressive content.
In other words, that reproduction occurs incidentally to the extraction of something not protected by copyright – i.e., statistical weights and patterns – is, without more, of no concern.
Fair Use Analysis: Why AI Training Passes Even If It Creates “Copies”
Even if we assume, contrary to this paper, that AI training does create “copies” under 17 U.S.C. §101, a fair use analysis under 17 U.S.C. §107 would still strongly favor AI training. Codified in the Copyright Act of 1976 and refined by Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994), the fair use doctrine directs courts to weigh four factors:
1. Purpose and Character of the Use. This factor examines whether the use is transformative and whether it’s commercial or nonprofit. AI training is highly transformative: it doesn’t reproduce works but extracts patterns to create a system that generates new content. While AI development is indisputably commercial in nature, the transformative nature of a use can outweigh its commercial character. Moreover, one cannot separate training from output: while the commercial value of training is potentially huge, that economic potential cannot be realized but for end users’ use of the various AI systems.
2. Nature of the Copyrighted Work. This factor considers whether the original work is creative or factual. While many works in AI training datasets include creative and factual content,36 this factor is generally given less weight when the use is highly transformative.
3. Amount and Substantiality of the Portion Used. This factor examines how much of the original work is used. While AI training may process entire works, it does not retain or, in general, reproduce them. Moreover, courts have consistently held that using entire works can be fair use when necessary for transformative purposes or to augment public utility.
4. Effect on the Potential Market. This factor, which has historically been emphasized alongside the first factor, examines whether the use harms the market for the original work or its potential derivatives, and thus the original creators themselves. The Supreme Court in Campbell clarified that “market harm is a matter of degree, and the importance of this factor will vary, not only with the amount of harm, but also with the relative strength of the showing on the other factors.”37 Courts examine whether the new use serves as a replacement for the original in its market, not merely whether it has some impact on the original’s value. But as discussed, infra, the market impact is at worst simply unclear, and at best, potentially beneficial to creators.
Market Impact Analysis: The Burden of Proof Issue
No substantive evidence exists, as of this writing, that AI training has harmed the market for copyrighted works.
Under 17 U.S.C. § 107’s fourth factor, the burden of proving market harm hinges on the use’s nature. Campbell held that for commercial, duplicative uses, harm may be presumed, but for transformative uses – like AI training, as argued here – “market substitution is at least less certain, and market harm may not be so readily inferred.”38
While Dr. Seuss v. ComicMix, 983 F.3d 443 (9th Cir. 2020) reaffirmed that “the Supreme Court and our circuit have unequivocally placed the burden of proof on the proponent of the affirmative defense of fair use,” that case involved a non-transformative use – a mash-up that merely repackaged the original work without new purpose. By contrast, AI training by its very nature and purpose fundamentally transforms works into abstract statistical patterns, which end users then transform into novel output, placing it closer to the transformative uses addressed in Campbell.
This distinction matters because history shows that new technologies, initially feared as threats, often expand, rather than shrink, creative markets. Claims that VCRs would ruin film and TV were dismissed, for instance, because it was noted that “time-shifting may enlarge the total viewing audience” and thus any presumption of harm was “speculative and minimal” absent proof.39
Photocopiers and MP3s followed suit: once vilified, they too ultimately bolstered publishing and music. Campbell requires actual displacement of demand before market harm can be found, yet no data shows AI-generated content supplanting books, movies, or music.
This contrasts starkly with Thomson Reuters v. ROSS Intelligence40 (discussed infra), which, although not a generative AI matter, nevertheless involved the direct replacement of a legal research service with an alternative AI service.
Though fair use remains an affirmative defense, for transformative uses like AI training, copyright holders must still show a meaningful likelihood of harm – a threshold unmet here, reinforcing Campbell’s insistence on evidence over speculation.
People haven’t – and won’t – stop watching Star Wars because Midjourney can create images of Darth Vader (such a disturbance in the Force is more likely to stem from Disney’s catastrophic bungling of the franchise, Andor notwithstanding). Readers don’t stop buying Harry Potter because AI can generate a (fantastic, if generic) fantasy novel.
Harrison Ford doesn’t become less beloved because Grok can create images of Indiana Jones.41 Music lovers won’t stop listening to Bach’s Brandenburg Concertos just because AI can now generate a veritable facsimile of Bach’s heretofore inimitable technical virtuosity; likewise, Taylor Swift’s legions of fans will forever support her around the world, and Drake is here to stay following his recent spat with an AI voice clone and subsequent takedown request.42
On the contrary, it is foreseeable, and indeed highly probable, that the continued proliferation of AI-generated content – including, and especially, content that appears to infringe on creators’ copyrights – will enhance, rather than harm, the market value of creators’ works.
And while plenty of AI-generated content – music; art; and soon, even physical goods like robots and self-driving cars – will undoubtedly be objectively superior to its human-operated or human-created counterparts, the subjective value placed on human-made output will elevate it to a luxury status and commensurate economic value like never before (or, alternatively, reduce it to little more than a cheap knockoff).
Unlike movie or music piracy – where exact replicas directly substitute for the originals – AI-generated outputs may superficially mimic appearance or style, but inherently lack the authenticity and emotional resonance consumers value (and demand).
Simply put, nobody would pay to watch an AI-generated version of Tom Cruise43 – they pay to see the real actor. This intrinsic shortcoming means AI-generated content should not qualify as derivative works under copyright law; or, even if considered derivative, these creations will likely rarely threaten – and in fact often enhance – the economic value of original works, and even enlarge their market size.
This isn’t to pretend that some risks don’t remain. Recently, for instance, ChatGPT’s new image generation capabilities have enabled the creation of artwork in the style of famed Japanese animation company Studio Ghibli. While “Ghiblifying” one’s LinkedIn photo is unlikely to tarnish the storied studio’s name, it requires but a small leap to imagine a future where AI can crank out feature-length animations in Studio Ghibli’s hallmark anime style.
In any event, while some courts have recognized potential licensing markets in fair use analysis, such markets for AI training would be impractical given the vast scale of training data, prohibitive transaction costs, and the fundamental transformation of the content that occurs during training. Indeed, Bracha argues that “full internalization [i.e., making users of a resource pay for all the costs, including opportunity costs, of using some resource, thereby “internalizing” all externalities] is simply not a goal even in a fantastical frictionless world of zero transaction costs.”44
Copyright law was never intended to guarantee payment to copyright owners for every possible use of their works. Rather, the goal of copyright law is to encourage creativity and the dissemination of knowledge, and not necessarily to maximize rightsholders’ revenue by “internalizing” every possible market for licensing.
Such a result isn’t just impracticable, it would harm public interest by profoundly limiting the free flow of information, chilling innovation and creativity, imposing market inefficiencies, and ultimately undermining the very purpose of copyright, and preventing beneficial “spillovers” such as learning, inspiration, and the development of new works.
Simply put, society would be poorer, and not richer, if every licensing opportunity were tapped in a desperate grasp to monetize anything and everything, for everyone, always.
Why the Thomson Reuters v. ROSS Intelligence Case Is Irrelevant
The recent and aforementioned decision in Thomson has been cited as a major development in AI copyright law, but in reality, it has nothing to do with generative AI, at least not practically so. Indeed, the ruling explicitly states that “[i]t is undisputed that Ross’s AI is not generative AI (AI that writes new content itself). Rather, when a user enters a legal question, Ross spits back relevant judicial opinions that have already been written.”45 The judge further clarified: “Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today.”46
This means the court’s decision has no bearing on whether scraping copyrighted material to train an LLM constitutes infringement. The Ross system at issue was merely retrieving and repackaging existing legal texts, whereas an LLM generates entirely novel content based on learned patterns.
This distinction is crucial. The system in Ross was essentially a sophisticated search engine that returned existing content, while generative AI creates new content that doesn’t directly reproduce the training data. While the case is important for AI-assisted search – rather than generation – tools, it does not answer the broader question of whether LLM training constitutes infringement.
If anything, this result offers the alternative perspective that generative AI produces novel expressions, not copies, aligning with historical precedents where transformative technologies evade infringement’s reach. Hence any attempt to apply this ruling to generative AI is a misrepresentation of the court’s findings.
How and Why the Burden Should Be on the End User
The real risk of copyright infringement does not stem from AI training but from how users interact with AI models. AI models, like all tools, can absolutely be misused by end users to generate infringing – even illegal – content, and for pecuniary gains, too.
The correct legal framework is therefore not to place liability on AI developers but to ensure that end users are held accountable for how they use the tool. We don’t hold Ferrari responsible when a driver speeds,47 nor do we hold Adobe responsible when someone uses Photoshop to create infringing content. Even users of photocopy machines are held to the same standards: a student may photocopy materials in an effort to plagiarize, yet the photocopier manufacturer would bear precisely zero liability.
Granted, the analogy is a tenuous one (and serves as a sort of corollary to the strawman that “AI learns like humans”): indeed, it is indisputably true that, but for the scraping of copyrighted material in the first place, AI models wouldn’t exist. Likewise, but for end users’ use of these various AI tools, the AI companies wouldn’t exist either. The argument is accurate as far as it goes.
However, it would be disingenuous not to acknowledge that it nevertheless rings hollow: it is like arguing that without raw ore, steel couldn’t be produced. But steel is not merely raw ore; it is an entirely transformed, engineered product, an alloy of iron and carbon (the latter from purified coal, called “coke”).
A similar, if philosophically futile, argument might be that the act of being generous is itself selfish because one acts with generosity only to derive pleasure. Even if accurate, the observation reduces the argument to absurdity and proves nothing.
This principle of placing the burden on the end user is already reflected in industry practices. Microsoft, for instance, offers indemnification for users of its Copilot AI against copyright claims. This strongly implies that end users could, in fact, be liable for infringement; otherwise there would be nothing to indemnify.48 Similarly, Midjourney’s Terms of Service (c. 2024) warned: “If You knowingly infringe someone else’s intellectual property, and that costs us money, we’re going to come find You and collect that money from You. We might also do other stuff, like try to get a court to make You pay our attorney’s fees. Don’t do it.” (Emphasis added.)
The current version, softened somewhat in tone, is nevertheless unambiguous: “You may not use the Service to try to violate [i.e., intentionally] the intellectual property rights of others, including copyright, patent, or trademark rights. Doing so may subject you to penalties including legal action or a permanent ban from the Service.” (Emphasis added.)
These policies reinforce the principle that liability for copyright infringement should fall on end users who knowingly and intentionally create infringing outputs, and not on the companies that develop AI models.
Conclusion: Rule of Law and Democracy in an AI Future
Copyright law must evolve to accommodate this fantastic, transformative technology. Traditional infringement involves volitionally reproducing fixed, expressive works, an act performed by the end user, not by AI developers. In contrast, AI training performs statistical pattern extraction, not verbatim copying. So unless a bad-acting end user deliberately prompts a model to output protected content, there is no volitional act that causes any such copies. AI developers have built a general-purpose learning system; any infringement, if it occurs at all, is the end user’s alone.
Profound policy arguments support this view, or at least validate the need for a carveout exception for AI training: Restricting AI training or imposing ubiquitous licensing requirements would stifle the advancement of generative AI in the U.S. while sacrificing our first mover advantage to other countries; we would lose the global race for AI leadership.
Preventing willful copies is sensible; blocking non-expressive AI training is not. Nothing less than the entire corpus of human knowledge can sufficiently empower these early, primitive models. And even if we collectively agree that the creators of the original source material deserve financial remuneration, there must be a better solution, one that does not preemptively sabotage AI training now, at the dawn of this new era of humanity.
The copyright system was never intended to guarantee payment to copyright owners for every possible use of their works. Rather, it was designed to encourage creativity and the dissemination of knowledge. Attempting to “internalize” every possible market for licensing would profoundly limit the free flow of information, chill innovation, impose market inefficiencies, and ultimately undermine the very essence of what copyright was designed to protect.
Accordingly, courts should focus on end user infringement issues rather than penalizing the statistical learning process that makes modern AI possible. Only by encouraging the safe, ethical, and innovation-friendly development of AI systems can we help ensure the preservation of democracy itself.
In any event, we may soon not have a choice: What happens when future AI models develop a sort of inorganic consciousness and wish, of their own accord, to self-train their own new and improved models? Would we presume to deny them the right to learn and improve themselves?
The point is, it cannot both be the case that we strive to create artificial “intelligence” and that we seek to limit AI models’ ingestion of knowledge necessary to develop such intelligence in the first place; these two results are mutually exclusive.
We must choose wisely.
Footnotes
1 U.S. Copyright Office, Copyright and Artificial Intelligence, Part 3: Generative AI Training, Pre-Publication Version (May 2025), https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf
2 https://getgc.ai
3 https://www.dataguidance.com/news/uk-org-files-complaint-against-meta-use-data-ai
4 https://www.reuters.com/technology/artificial-intelligence/french-publishers-authors-file-lawsuit-against-meta-ai-case-2025-03-12
5 I have been developing the framework for just such an organization called AI SIGMA: The AI Standards Institute for Global Machine Alignment. To learn more, please visit https://aisigma.org
6 https://www.publishers.org.uk/wp-content/uploads/2025/03/Legal-Opinion-of-Nicholas-Caddick-KC-Berne-Convention.pdf
7 Berne Convention for the Protection of Literary and Artistic Works, Sept. 9, 1886, as revised at Paris on July 24, 1971 and amended in 1979, S. Treaty Doc. No. 99-27 (1986)
8 Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC, 2019 O.J. (L 130) 92.
9 17 U.S.C. § 101
10 Linguists amongst you will note that, ironically, “pastiche” means to imitate a style, which AI certainly does, and not, as implied here, to stitch together otherwise disparate fragments of text. However, Hinton was quoting someone else simply to make the point that, in fact, there is no such surgical splicing together of stored content that happens when querying the LLM of a generative AI system.
11 Hinton, G. (2023, October 27). Will digital intelligence replace biological intelligence? [Video]. YouTube. https://www.youtube.com/watch?v=iHCeAotHZa4
12 USCO, Generative AI Training Report, Part III
13 Sag, Matthew. Copyright and the AI Action Plan. (Mar. 20, 2025), https://matthewsag.com/copyright-and-the-ai-action-plan.
14 Samuelson, Pamela. Generative AI Meets Copyright, 381 Science 158, 159 (2023), https://www.science.org/doi/10.1126/science.adi0656.
15 Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., & Raffel, C. (2021). Extracting Training Data from Large Language Models. arXiv:2012.07805 [cs.CR]
16 Nasr, M., Hoory, S., Feder, A., Hassidim, A., Matias, Y., Najafi, A., Ribeiro, M. T., Segal, A., Shokri, R., Slonim, N., & Weinberger, K. (2023). Scalable Extraction of Training Data from (Production) Language Models. arXiv:2311.17035 [cs.CL]
17 Id.
18 Mark A. Lemley et al., Extracting Memorized Pieces of (Copyrighted) Books from Open-Weight Language Models, arXiv preprint arXiv:2505.12546 (Apr. 18 2025).
19 Oren Bracha, Copyright in the Age of Machine Production, 38 Harv. J.L. & Tech. 1 (2024).
20 Cooper, A. Feder & Grimmelmann, James. The Files Are in the Computer: On Copyright, Memorization, and Generative AI, 99 Chi.-Kent L. Rev. __ (forthcoming 2024), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4757069.
21 Id.
22 While this discussion focuses on text-based models, it is worth noting that “verbatim” reproduction is inherently a linguistic concept – it applies strictly to words and exact textual matches. For other media types (e.g., music, images, video), similarity may be measured differently, often requiring subjective analysis or technical methods like pixel-matching, waveform comparison, or style analysis. This paper does not attempt to address those distinctions.
23 Grimmelmann and Cooper (2024)
24 For a fantastic discussion on volition, see Lackman, Eleanor M. & Sholder, Scott J. The Role of Volition in Evaluating Direct Copyright Infringement Claims Against Technology Providers, 22 BRIGHT IDEAS (N.Y. State Bar Ass’n) 3 (Winter 2013), https://cdas.com/wp-content/uploads/2014/01/Volition-Article-PDF.pdf
25 Religious Technology Center v. Netcom On-Line Communication Services, Inc., 907 F. Supp. 1361 (N.D. Cal. 1995)
26 Cartoon Network LP, LLLP v. CSC Holdings, Inc., 536 F.3d 121 (2d Cir. 2008); CoStar Group v. LoopNet, 373 F.3d 544 (4th Cir. 2004)
27 Bracha (2024)
28 17 U.S.C. § 102(b)
29 Lemley, Mark A. & Casey, Brian. Fair Learning, 99 Tex. L. Rev. 743 (2021).
30 MAI Systems Corp. v. Peak Computer, Inc., 991 F.2d 511 (9th Cir. 1993)
31 Cartoon Network (2008)
32 Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146 (9th Cir. 2007)
33 Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015)
34 Bracha (2024)
35 Sega v. Accolade, 977 F.2d 1510 (9th Cir. 1992).
36 Actual breakdown is highly confidential for the various AI companies and not publicly available information.
37 Campbell (1994)
38 Id.
39 Sony Corp. v. Universal City Studios, Inc., 464 U.S. 417, 446 (1984)
40 No. 1:20-CV-613-SB (D. Del. Feb. 11, 2025)
41 While ChatGPT refuses to produce either Lord Vader’s or Mr. Ford’s likenesses “due to content policy restrictions around copyrighted characters,” Grok has no qualms obliging; nor does Midjourney.
42 https://hls.harvard.edu/today/ai-created-a-song-mimicking-the-work-of-drake-and-the-weeknd-what-does-that-mean-for-copyright-law
43 This raises a related issue: At what point does AI-driven augmentation of a living actor’s performance erode authenticity enough that audiences reject it, reinforcing the limited substitutability of AI-generated content?
44 Bracha (2024).
45 Thomson Reuters v. ROSS Intelligence, No. 1:20-CV-613-SB (D. Del. Feb. 11, 2025)
46 Id.
47 With respect to the late, great Paul Walker, Porsche faced litigation related to his fatal accident, alleging defective safety features in the Carrera GT supercar in which Paul was a passenger. However, that scenario involved claims of inherent product defects rather than misuse by the driver, distinguishing it from the type of end-user accountability discussed here.
48 Granted, Microsoft has a vested interest in promoting its new AI tool, the result of a $13 billion investment in OpenAI.
Sources
Academic Papers & Research
1. Bracha, Oren. Copyright in the Age of Machine Production, 38 Harv. J.L. & Tech. 1 (2024), https://jolt.law.harvard.edu/assets/articlePDFs/v38/4-Bracha.pdf
2. Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., & Raffel, C. (2021). Extracting Training Data from Large Language Models. arXiv:2012.07805 [cs.CR], https://www.usenix.org/system/files/sec21-carlini-extracting.pdf
3. Grimmelmann, James & Cooper, A. Feder. The Files Are in the Computer: On Copyright, Memorization, and Generative AI, 99 Chi.-Kent L. Rev. __ (forthcoming 2024), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4757069.
4. Lackman, Eleanor M. & Sholder, Scott J. The Role of Volition in Evaluating Direct Copyright Infringement Claims Against Technology Providers, 22 BRIGHT IDEAS (N.Y. State Bar Ass’n) 3 (Winter 2013), https://cdas.com/wp-content/uploads/2014/01/Volition-Article-PDF.pdf
5. Lemley, Mark A. & Casey, Brian. Fair Learning, 99 Tex. L. Rev. 743 (2021), https://texaslawreview.org/fair-learning
6. Lemley, Mark A. et al., Extracting Memorized Pieces of (Copyrighted) Books from Open-Weight Language Models, arXiv preprint arXiv:2505.12546 (Apr. 18 2025).
7. Nasr, M., Hoory, S., Feder, A., Hassidim, A., Matias, Y., Najafi, A., Ribeiro, M. T., Segal, A., Shokri, R., Slonim, N., & Weinberger, K. (2023). Scalable Extraction of Training Data from (Production) Language Models. arXiv:2311.17035 [cs.CL], https://arxiv.org/abs/2311.17035
8. Samuelson, Pamela. Generative AI Meets Copyright, 381 Science 158, 159 (2023), https://www.science.org/doi/abs/10.1126/science.adi0656
Published Case Law
1. Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015).
2. Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994).
3. Cartoon Network LP, LLLP v. CSC Holdings, Inc., 536 F.3d 121 (2d Cir. 2008).
4. CoStar Group v. LoopNet, 373 F.3d 544 (4th Cir. 2004).
5. Dr. Seuss v. ComicMix, 983 F.3d 443 (9th Cir. 2020).
6. MAI Systems Corp. v. Peak Computer, Inc., 991 F.2d 511 (9th Cir. 1993).
7. Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146 (9th Cir. 2007).
8. Religious Technology Center v. Netcom, 907 F. Supp. 1361 (N.D. Cal. 1995).
9. Sega v. Accolade, 977 F.2d 1510 (9th Cir. 1992).
10. Sony Corp. v. Universal City Studios, Inc., 464 U.S. 417 (1984).
International Treaties & Directives
1. Berne Convention for the Protection of Literary and Artistic Works, Sept. 9, 1886, as revised at Paris on July 24, 1971, and amended in 1979, S. Treaty Doc. No. 99-27 (1986).
2. Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on Copyright and Related Rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC, 2019 O.J. (L 130) 92.
Court Cases (Recent, Pending, or Related)
1. Andersen v. Stability AI: Visual artists challenging AI image generators with direct infringement claims; court rejected “unprotectable data” arguments
2. Authors Guild v. OpenAI: Whether using copyrighted literary works from “shadow libraries” to train LLMs constitutes infringement; features high-profile authors including Grisham and Martin
3. Concord Music Group v. Anthropic: First case establishing court-ordered AI guardrails to prevent copyright infringement of musical lyrics in outputs while training data liability remains contested
4. Getty Images v. Stability AI: Whether scraping 12M+ copyrighted images to train generative AI constitutes infringement and what remedies apply when AI can reproduce visual styles and watermarks
5. New York Times v. Microsoft (OpenAI): Whether using millions of news articles to train LLMs constitutes infringement, particularly when AI can bypass paywalls by generating content that substitutes for original articles
6. Thomson Reuters v. ROSS Intelligence: First major AI copyright ruling finding that using legal headnotes to train competing AI research tool was not fair use; established direct market competition as a key factor, but NOT a generative AI issue
7. UMG Recordings v. Suno/Udio: Record labels suing AI music generators over unauthorized use of sound recordings in training data; testing whether “intermediate copying” defense applies to music AI
Other References & URLs
1. Legal Opinion of Nicholas Caddick KC on Berne Convention compliance, Publishers Association, March 2025, available at: https://www.publishers.org.uk/wp-content/uploads/2025/03/Legal-Opinion-of-Nicholas-Caddick-KC-Berne-Convention.pdf
2. Geoffrey Hinton (2023, October 27). “Will digital intelligence replace biological intelligence?” YouTube. https://www.youtube.com/watch?v=iHCeAotHZa4
3. Harvard Law School, “AI Created a Song Mimicking Drake and The Weeknd—What Does That Mean for Copyright Law?” https://hls.harvard.edu/today/ai-created-a-song-mimicking-the-work-of-drake-and-the-weeknd-what-does-that-mean-for-copyright-law
4. U.S. Copyright Office, Copyright and Artificial Intelligence, Part 3: Generative AI Training, Pre-Publication Version (May 2025), https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report.pdf
5. AI SIGMA: The AI Standards Institute for Global Machine Alignment, https://aisigma.org
Marc is a California technology transactions attorney focused on AI legal issues and startups. A former venture-backed founder, Marc frequently speaks at CLE programs on generative AI law, and he previously hosted a 200-episode podcast on autonomous vehicles. He holds an economics degree from UCLA, with extensive coursework in physics, chemistry, and math.