
Discit ergo est: Training Data Provenance and Fair Use

Author(s)

Robert Mahari
JD-PhD Candidate at Harvard Law School & MIT
Shayne Longpre
PhD Candidate at the MIT Media Lab


This blog post summarizes our recent article Discit ergo est: Training Data Provenance and Fair Use. We focus on the legal status of finetuning data, which has been specifically curated to train AI models. While many legal scholars focus on AI’s reliance on large datasets scraped from the web without consent, less attention has been paid to data specifically created for AI, which has been pivotal in recent generative AI breakthroughs. We examine how curated data enhances AI performance, discuss fair use considerations, and explore the role of data provenance in responsible AI development.

Curated Data and Generative AI Performance

One of the breakthroughs behind the latest generation of widely adopted language models, like ChatGPT, is the use of highly specialized finetuning datasets created to elicit a wide set of capabilities and to make the model's responses more useful, helpful, and agreeable to humans.

An illustrative example of the impact of this curated data is ChatGPT’s response to the prompt ‘Teach an elementary school student about the doctrine of Champerty’.

Figure 1: ChatGPT's response to the prompt 'Teach an elementary school student about the doctrine of Champerty'.

The foundational knowledge about the doctrine originates from the pretraining data and potentially from domain-specific finetuning with legal data. Meanwhile, the response’s structure as a chat-friendly explanation appropriate for a young learner derives from instruction finetuning and alignment tuning.
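
To make this distinction concrete, a single instruction-finetuning example typically pairs a prompt with a human-written demonstration of the desired response. The sketch below is purely illustrative; the field names and content are hypothetical and not drawn from any particular dataset:

    # A minimal, hypothetical instruction-finetuning record (Python).
    # Field names and content are illustrative; real datasets vary in schema.
    example = {
        "instruction": (
            "Teach an elementary school student about the doctrine of Champerty."
        ),
        "response": (
            "Imagine a stranger tells your friend: 'I will pay for your side of "
            "an argument if you share your prize with me.' Champerty is an old "
            "legal rule about deals like that."
        ),
    }

Thousands of such demonstrations are what the finetuning stage consumes, which is why the expressive choices of the people who wrote them matter for the copyright analysis.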


Fair Use Overview

The fair use doctrine originated to give courts flexibility in applying copyright law when rigid application 'would stifle the very creativity which that law is designed to foster'. Iowa State University Research Foundation, Inc. v. American Broadcasting Cos., 621 F.2d 57, 60 (2d Cir. 1980). Courts weigh four statutory factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the market for the original. These factors are balanced differently depending on the context of a specific case.

Fair Use for Pretraining Data

Pretraining data, often scraped from the web, raises copyright concerns at three stages: data collection, model training, and output generation. Focusing on model training, fair use generally favors the use of pretraining datasets where the works are substantially transformed into model weights, little specific data is retained, the aim is to extract general insights, and the market impact on the original works is negligible. However, models may retain substantial excerpts from their training data, an issue that remains under active research.

Fair Use for Curated Data

By contrast, the unauthorized use of curated datasets for AI training is less likely to be considered fair use. Curated datasets contain expressive content created specifically to instruct AI models, so using them for their intended purpose is less likely to qualify as transformative. And given the vibrant market for AI training data, unpermitted use undermines the commercial potential of dataset creation. Notably, we find that 23% of supervised training datasets are published under research or non-commercial licenses, suggesting that many creators restrict commercial use and might permit it in exchange for proper remuneration.

The Complex Interplay of Data Provenance and Fair Use

The reality of the AI training data supply chain is complex, involving multiple layers of crawling, annotation, and compilation by various entities. Understanding data provenance—how data was collected, modified, and by whom—is crucial for the fair use analysis and responsible AI development.

Key considerations, illustrated in the sketch after this list, include:

  • Who created the data? Some datasets compile preexisting artifacts, while others create new data from whole cloth to target specific priorities.
  • Was third-party data used? Using copyrighted data as a basis for datasets complicates the fair use analysis, especially when datasets contain verbatim excerpts.
  • Were large language models (LLMs) involved? The use of LLMs to generate training data introduces complex questions about ownership and the legality of distilling models into datasets.
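
As a rough illustration of what tracking these considerations entails, consider a minimal metadata record for a single dataset. The schema below is our own hypothetical sketch and mirrors the questions above; it does not reflect any standard:

    # A hypothetical provenance record for a curated dataset (Python).
    # The schema, names, and values are illustrative only.
    provenance = {
        "dataset": "example-legal-instructions",    # hypothetical name
        "creators": ["Example Annotation Lab"],     # who created the data?
        "third_party_sources": ["court opinions"],  # was third-party data used?
        "llm_generated": False,                     # were LLMs involved?
        "license": "CC BY-NC 4.0",                  # research / non-commercial terms
    }

Recording information like this at creation time is what makes the downstream fair use analysis tractable.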

Dataset Provenance and Responsible AI Practices

Data provenance is essential for the fair use analysis and the responsible development of AI. Enforceable licenses for curated data promote transparency, incentivize open sharing, and help prevent the misuse of data. By giving dataset creators control over how their data is used, copyright law can encourage transparency and protect open science initiatives. Incentives that promote transparency are especially important because auditing training data can reveal limitations of the AI systems trained on it. These issues also intersect with competition and antitrust concerns, affecting both AI developers and creators of training data.

Conclusion

The legal discourse around AI benefits from a nuanced understanding of technical processes. Copyright law can incentivize responsible AI by fostering transparency and attribution. Enforceable licenses for curated data are crucial for transparency, requiring developers to track data sources and protecting creators who share their work openly. By promoting transparency around training data, copyright law can empower audits of AI systems, contributing to safer and more reliable AI development.


The authors’ article can be found here.

Robert Mahari is a JD-PhD Candidate at Harvard Law School & MIT.

Shayne Longpre is a PhD Candidate at the MIT Media Lab.

