I. Introduction: US Copyright Office “pre-publication” of Part 3
In May 2025, the US Copyright Office released a “pre-publication version” of the third part of its report on copyright and artificial intelligence.
This rather unusual reference to a “pre-publication version” may be seen (or not…) as a sign of anticipated dissatisfaction from the Trump administration with the rather nuanced (and, in my view, welcome) approach of the US Copyright Office.
This dissatisfaction turned out to be a reality. Two days after dismissing the Librarian of Congress, Carla Hayden, President Donald Trump fired the director of the Copyright Office, Shira Perlmutter.
As a result, one will have to wait and see whether the “pre-publication version” will turn into a “final version” in the coming months and, as may be expected, what will be left out of it.
This concern sounds all the more serious as there is currently a real divide among US scholars between a majority who tend to favor an anti-copyright approach (a trend which, to be fair, existed long before the Trump administration) and a minority of “classical” authors who favor a traditional copyright approach. This split was highlighted by the resignation on May 12, 2025 of four prominent scholars from the American Law Institute (ALI) following the publication of the ALI Restatement of Copyright Law, considered to reflect the views of the reporters rather than a global consensus.
The debate around copyright and its scope of protection may never have been so hot. Although the “final version” of this Part 3 may therefore look different from this “pre-publication”, I nevertheless believe it is worth summing up the key points of what I personally consider a very good report.
The report is split into three parts:
- Infringement of reproduction right? (II)
- Fair use? (III)
- Licensing models (IV)
II. Use of copyrighted works for AI training as an infringement of the reproduction right?
The US Copyright Office makes a distinction between:
- Data collection and curation: the steps required to produce a training dataset containing copyrighted works clearly implicate the right of reproduction.
- Training: the training process also implicates the right of reproduction on three accounts: (i) the speed and scale of training require developers to download the dataset and copy it to high performance storage prior to training; (ii) works or substantial portions of works are temporarily reproduced as they are “shown” to the model in batches, with a persistence that may be long enough to infringe the right of reproduction depending on the model at issue and the specific hardware and software implementations used by developers; (iii) the training process may result in model weights that contain copies of works in the training data if substantial protectable expression from the works at issue is memorized.
- Retrieval-augmented generation (RAG) also involves the reproduction of copyrighted works.
- Outputs may infringe the reproduction right (and potentially the derivative work right) if they replicate or closely resemble copyrighted works, potentially leading to an infringement of public display and public performance rights as well.
III. Does the reproduction of copyrighted works for AI training amount to a fair use case?
This is where the report deviates from the expectation that any use of copyrighted works to train a model would (or should) be considered fair use.
Such is – rightfully – not the view of the US Copyright Office, which does not, however, exclude that fair use may successfully be invoked depending upon the circumstances.
The US Copyright Office assesses the four factor test in the following way:
a) Factor one: purpose and nature – transformativeness
The US Copyright Office first highlights the fact that copyrighted works are used in different ways during development and deployment of generative AI models, so that different uses during AI development and deployment require separate consideration.
On the key issue of “transformativeness”, the Copyright Office, referring to Warhol, recalls that the question is “whether the new work merely supersedes the objects of the original creation, or instead adds something new, with a further purpose or different character, altering the first with new expression, meaning or message…”.
What matters is not the immediate act of copying, but its ultimate goal, a view that mirrors the one held on February 12, 2025 by the US District Court for the District of Delaware in Thomson Reuters v. Ross Intelligence. In the Office’s view, training a generative AI foundation model will often be transformative, as the process converts a massive collection of data into a statistical model that can generate a wide range of outputs across a diverse array of new situations. The assessment will however depend on the functionality of the model and how it is deployed: while training a model on a large collection of data such as social media posts, articles and books to deploy a system for content moderation will be transformative, such will not be the case if the training is meant to generate outputs that are substantially similar to copyrighted works in the dataset. As an example, training an audio model on sound recordings for deployment in a system generating new sound recordings aims to occupy the same space in the market for music and satisfy the same consumer desire.
Interestingly, the Office considers as mistaken the argument that the use of copyrighted works to train AI models would be highly transformative because it is not for expressive purposes. Language models are trained on vast numbers of tokens precisely because of the way words are selected and arranged at the sentence, paragraph or document level (and similarly for images), since the training is meant to enable the generation of expressive content. As a result, the training cannot be considered “non-expressive”.
Finally, the Office unsurprisingly considers that the knowing use of a dataset that consists of pirated or illegally accessed works should weigh against fair use without being determinative.
b) Factor two: nature of the copyrighted work
The use of more creative or expressive works (such as novels, movies, art or music) is less likely to be fair use than use of factual or functional works (such as computer code). Taking into account the fact that models are regularly trained on a variety of works (both expressive and functional, published or unpublished), the assessment may vary depending upon the model and works at issue.
c) Factor three: amount and substantiality of the portion used
In most instances, downloading works, curating them into a training dataset and training on that dataset will involve using all or substantially all of those works. While the copying of entire works and the use of their expressive content for training usually weighs against fair use, the Office acknowledges that, where there is a transformative purpose and a need to train on a large volume of works to effectively generalize, the copying of entire works may be reasonable. In these cases, the third factor may not weigh against fair use.
d) Factor four: effect of the use upon the potential market
The Office makes it clear that the inquiry must take into account not only harm to the original, but also harm to the market for derivative works. The effect upon the market can originate from the following circumstances:
(i) Lost sales
Lost sales deprive the rights holder of significant revenues where potential purchasers opt to acquire the copy in preference to the original. This may happen where a model can produce substantially similar outputs that directly substitute for works in the training data.
(ii) Market dilution
Market dilution refers to harm to a creator’s overall body of work, or even to the market more broadly. It can happen where a model’s outputs are not substantially similar to any specific copyrighted work but nevertheless dilute the market for works that are similar to those found in the training data, including by generating material stylistically similar to those works.
(iii) Lost licensing opportunities
The Office considers that voluntary licensing is already happening in some sectors, and that it appears reasonable or likely to develop in others, at least for certain types of works, training and models. In doing so, the Office rebuts Anthropic’s main argument in its response filed on March 27, 2025 before the United States District Court for the Northern District of California in Bartz v. Anthropic. The Office – rightfully – makes it clear that while concerns about the effects of licensing on competition among AI companies should not be discounted, licensing will always be easier for those with deeper pockets, so that these concerns should not alter the fair use analysis.
In conclusion, the Office unsurprisingly considers that the first and fourth factors can be expected to carry considerable weight in the analysis. The Office makes it clear that it will be for the courts to weigh the statutory factors. As GenAI involves a spectrum of uses and impacts, some uses will qualify as fair use while others will not.
IV. What are the potential licensing models to train AI on copyrighted works?
To conclude, the Office considers that voluntary direct and collective licensing agreements have emerged over the past several years. Further market developments may provide more insight on the extent to which licensing agreements can effectively compensate copyright owners for the use of their works in AI training. Compensation structures based on percentage of revenue or profits, without large up-front cash outlays, may be an attractive alternative for smaller developers looking to enter the market.
Compulsory licensing schemes, as we know them in Switzerland in particular, are disfavored by the Office (and the majority of commenters). These would take years to develop, lead to fixed royalty rates likely to be below market and difficult to change, and ultimately amount to a regrettable derogation from the author’s right to control the use and distribution of their works. The same goes for the opt-out mechanism (as contemplated by Art. 4 of Directive 2019/790), bearing in mind that, for the Office, copyright owners may want their works to be scraped for search engine purposes, but not for AI ingestion.
If market failures were to be demonstrated, extended collective licensing (as Nordic countries know it and as we have it in Switzerland under Art. 43a of our Copyright Act) may come into play.
V. Key take-aways
Although this “pre-publication” version may still be revised, so that one still has to take it cautiously, the following key take-aways stem from the paper (and may to a large extent apply to other jurisdictions):
| Action Area | Why It Matters | What You Should Do |
| --- | --- | --- |
| Audit Your AI Training Datasets | Training AI on copyrighted material may infringe reproduction rights if sourced or stored improperly. | – Conduct an IP compliance audit. – Classify all data sources (licensed, public, risky). – Implement traceability and documentation practices. |
| Implement Risk-Tiered Licensing Strategy | Fair use is uncertain; licensing offers a safer and more reputationally sound route. | – Use voluntary licensing when training on expressive works. – Consider revenue-share models if budgets are tight. – Use extended collective licenses where in existence (still very rare). |
| Avoid Pirated or Illegally Obtained Content | Knowing use of pirated data strongly weighs against fair use and creates legal exposure. | – Vet datasets carefully. – Avoid scraped or unlicensed material. – Log and review all ingestion sources. |
| Monitor Output for Copyright Risk | Output similar to copyrighted material may infringe reproduction or derivative work rights. | – Use output monitoring tools. – Set up human review for creative domains. – Avoid deploying generative models that replicate training data. |
| Clarify Purpose & Use Case Early | Legal protection depends on whether the use is “transformative” and distinct from original purpose. | – Document your model’s intended purpose. – Distinguish moderation, summarization, or analytics from content generation. |
| Track Market Impact and Licensing Opportunities | Courts consider impact on both original and derivative markets. | – Assess substitution risk. – Monitor lost licensing opportunities. – Be ready to show market-neutral or positive impacts. |
| Stay Ahead of Global Divergences | US, EU, and Swiss legal approaches vary significantly. | – Follow Swiss and EU copyright frameworks (esp. DSM Directive). – Monitor for final version of the report and potential impact of the Restatement of US Copyright Law. – Localize your compliance strategy. |
| Embed Legal Oversight in AI Lifecycle | Waiting until deployment is too late to address compliance issues. | – Involve legal/compliance teams from day one. – Set legal checkpoints in your AI development process. – Maintain cross-disciplinary collaboration. |
About the author
Philippe Gilliéron is an attorney at BMG Avocats (www.bmglaw.ch) in Geneva, specializing in intellectual property law, in particular trademarks, designs, patents and copyrights, as well as technology rights, in particular artificial intelligence, and data protection. He advises Swiss and international companies on their intangible asset protection strategies, and represents his clients before the Swiss courts and the Swiss Federal Institute of Intellectual Property (IPI).
For any questions relating to intellectual property, digital technology, artificial intelligence or data protection, contact Philippe Gilliéron, intellectual property attorney in Geneva, at philippe.gillieron@bmglaw.ch.
