THE CONTEXT: Two first-of-their-kind United States district-court orders have held that training large language models (LLMs) on copyrighted books can qualify as “transformative fair use”, even while leaving piracy-related damages to be decided.
- The rulings come amid at least 21 active US lawsuits and similar suits in the United Kingdom, India and the European Union, signalling a global jurisprudential churn on whether generative AI (Gen-AI) is built on “stolen” content or on a legally defensible text-and-data mining (TDM) foundation.
THE BACKGROUND:
- Copyright’s incentivisation theory posits that exclusive rights induce creativity, whereas the fair-use doctrine (four-factor test) protects socially valuable uses of protected works.
- Alongside fair use, statutory text-and-data mining (TDM) exceptions in Japan, Singapore and the EU explicitly permit machine-learning ingestion of copyrighted works, subject to conditions; India has no dedicated TDM clause.
- The 2021 UNESCO Recommendation on the Ethics of Artificial Intelligence and the 2019 OECD AI Principles together stress “human-centred, transparent and accountable AI”, laying a soft-law foundation for later hard-law measures.
THEORETICAL FRAMEWORK (WHAT–WHY–HOW):
ASPECT | KEY POINTS |
---|---|
What is at stake? | Whether non-consensual ingestion of full texts for model training violates the exclusive reproduction right of authors or is shielded by fair use/TDM exceptions. |
Why does it matter? | Gen-AI markets are forecast to reach USD 1.3 trillion by 2030; unresolved IP risk can chill both technological diffusion and creator livelihoods. |
How to reconcile? | A calibrated regulatory-licensing model that prices data inputs, mandates transparency, and preserves open innovation. |
TECHNICAL UNDERPINNINGS:
- LLMs break books into tokens and learn probabilistic relationships; the model can regurgitate short passages when over-parameterised, giving rise to “memorisation risk”, a core complaint in the Meta suit.
- Gen-AI development costs (compute + data) incentivise firms to scrape vast corpora (Books3, LibGen), blurring lines between public-domain, licensed and pirated content.
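The tokenisation and memorisation points above can be illustrated with a toy sketch. This is not how production LLMs work: they use subword tokenisers and neural networks, whereas the example below swaps in a simple bigram counter purely for illustration. The function names (`tokenize`, `train_bigram`, `greedy_generate`) and the sample sentences are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a stand-in for subword tokenisation)."""
    return text.lower().split()


def train_bigram(tokens):
    """Count which token follows which, then normalise the counts to probabilities."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: n / sum(followers.values()) for tok, n in followers.items()}
        for prev, followers in counts.items()
    }


def greedy_generate(model, start, max_tokens):
    """Starting from `start`, repeatedly append the most probable next token."""
    out = [start]
    for _ in range(max_tokens):
        followers = model.get(out[-1])
        if not followers:  # no known continuation: stop generating
            break
        out.append(max(followers, key=followers.get))
    return out


# A single short training text in which every token is unique.
book = "call me ishmael some years ago never mind how long precisely"
model = train_bigram(tokenize(book))

# With so little data, every learned probability is 1.0, so generation
# can only retrace the training text: a crude picture of memorisation.
generated = " ".join(greedy_generate(model, "call", 20))
print(generated == book)

# When a token has several observed continuations, probability is shared.
shared = train_bigram(tokenize("the cat sat on the mat"))
print(shared["the"])
```

The design choice here is deliberate simplicity: a model with enough capacity to store every transition of a small corpus reproduces it verbatim, which is the intuition behind "memorisation risk". Real models trained on large, varied corpora spread probability across many continuations, and verbatim recall becomes the exception rather than the rule.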
CURRENT GLOBAL SCENARIO:
JURISDICTION | REGULATORY/JUDICIAL SIGNAL | SALIENT PROVISION |
---|---|---|
United States | Anthropic – fair use (training); Meta – partial dismissal but piracy trial; US Copyright Office Report (2024-25) urges a Digital Replica Right. | Proposes statutory licence or compulsory negotiation for AI datasets. |
European Union | EU AI Act requires providers of general-purpose (Gen-AI) models to publish a “sufficiently detailed summary” of the content used for training. | Combines ex-ante transparency with ex-post liability. |
United Kingdom | Government negotiating a Code of Practice on Copyright & AI to widen data-mining licences while protecting creator remuneration. | |
India | MeitY’s dual advisories (1 & 15 Mar 2024) ask platforms to label under-tested Gen-AI outputs and seek permission for unreliable models. | |
INDIAN CONTEXT:
- Legal Vacuum: Copyright Act 1957 lacks a TDM exception; data scraping cases (ANI v. OpenAI, 2024) are likely to be litigated piecemeal.
- Data-protection overlay: The Digital Personal Data Protection Act 2023 introduces fiduciary duties but is silent on non-personal copyrighted content.
- Compute Public Good: IndiaAI compute cloud (10,000 GPUs) can embed mandatory compliance APIs (provenance logs, watermarking) as a precondition for subsidised access.
ETHICAL DIMENSIONS:
- Integrity & Probity: Training on illicit datasets undermines the legitimacy of innovation; the administration must ensure procurement only from compliant vendors.
- Empathy & Fairness: Creators, often less powerful than Big-Tech, deserve equitable value capture; failure breeds “algorithmic predation” and erodes trust.
- Accountability: OECD Principle of traceability demands audit trails of training data; absence thereof creates governance opacity.
THE ISSUES:
- Economic Displacement: Royalty-free AI outputs undercut freelance writers, illustrators and micro-media enterprises.
- Data Provenance Obscurity: Lack of granular disclosure hinders infringement tracing, especially across jurisdictions.
- Regulatory Fragmentation: Divergent US fair-use, EU transparency, and Indian advisory regimes raise compliance costs.
- Piracy-Platform Nexus: Shadow libraries (Books3, LibGen) persist because enforcement against mirrors is weak.
- Concentration of Compute Capital: Few firms control high-end GPUs, potentially dictating data-licensing norms.
- Chilling Innovation: Over-broad liability fears may deter small Indian start-ups from open-source model research.
THE WAY FORWARD:
- Statutory Extended Collective Licensing (ECL): Amend the Copyright Act to permit AI training upon payment to collecting societies, with creators free to opt out.
- Transparent Training Ledger: Mandate hash-based registries of datasets for every model above a defined compute threshold; integrate with IndiaAI compute portal.
- Algorithmic Impact Assessment: Require intermediaries to file ex-ante risk audits covering copyright, bias, and market impact before public deployment.
- Public-Domain Corpus Expansion: Accelerate digitisation of out-of-copyright Indian language texts to reduce inadvertent piracy and spur heritage-focused AI.
- Cross-Border IP Cooperation: Negotiate a G20 framework (India-Brazil-South Africa troika) for TDM licences, aligning with UNESCO and OECD principles.
- Ethics-by-Design Standards: BIS to release an Indian Standard for Gen-AI provenance watermarks, making adoption a tender prerequisite for all public projects.
- Creator–Tech Mediation Council: Institutionalise a quasi-judicial body under the Copyright Board for fast-track, tech-heavy infringement disputes.
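The "Transparent Training Ledger" proposal above could take many forms; the following is a minimal sketch of one, assuming SHA-256 fingerprints of each training document and a single digest over the sorted fingerprints to identify the dataset. The record layout, function names and document IDs are hypothetical and not drawn from any IndiaAI specification.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def build_ledger_entry(model_name, documents):
    """Build a registry record for one model.

    `documents` maps a (hypothetical) document ID to its raw bytes.
    Only hashes are stored, so the ledger never republishes the works.
    """
    hashes = sorted(fingerprint(d) for d in documents.values())
    # Hashing the sorted list gives one stable ID for the whole dataset,
    # independent of the order in which documents were supplied.
    dataset_digest = fingerprint("\n".join(hashes).encode("utf-8"))
    return {
        "model": model_name,
        "document_hashes": hashes,
        "dataset_digest": dataset_digest,
    }


def was_used(entry, data: bytes) -> bool:
    """Check whether a byte-identical copy of `data` is in the ledger entry."""
    return fingerprint(data) in entry["document_hashes"]


docs = {
    "doc-001": b"Public-domain text A",
    "doc-002": b"Licensed text B",
}
entry = build_ledger_entry("demo-model", docs)
print(entry["dataset_digest"])
print(was_used(entry, b"Licensed text B"))
```

A rights holder could hash their own file and query such a ledger without the developer disclosing the corpus itself. The limitation is that exact hashing only detects byte-identical copies; a reformatted or re-encoded edition of the same book would produce a different fingerprint, so a real registry would likely need fuzzy or content-based matching as well.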
THE CONCLUSION:
Generative AI sits at the cusp of an “innovation-copyright dialectic”. Fair-use-centred rulings confirm its transformative promise, yet unfinished piracy trials underscore moral debts to creators. India’s policy response must marry Compute for All with Compensation for Creators, embedding ethics, transparency and accountability by design.
UPSC PAST YEAR QUESTION:
Q. Impact of digital technology as a reliable source of input for rational decision making is a debatable issue. Critically evaluate with a suitable example. 2021
MAINS PRACTICE QUESTION:
Q. Examine the adequacy of India’s existing intellectual property and data-protection framework to safeguard innovation and creator rights.