Brief overview: Generative AI needs data. Copyright law (TDM exceptions and opt-out), the GDPR (legal bases, information obligations, data subject rights) and the AI Act (transparency and copyright compliance for general-purpose models) collide directly during training. A clean setup of legal bases, contractual assurances, technical opt-out handling and processes for objections, deletions and evidence is crucial. This guide bundles the practical steps – with a focus on German and European rules.
Legal framework at a glance: TDM exemptions, opt-out and German implementation
The TDM exceptions of Directive (EU) 2019/790 (DSM Directive) are the linchpin of EU law for training on copyright-protected content. Art. 3 privileges text and data mining by research and cultural heritage institutions where access is lawful – without rights holders being able to object. Art. 4 opens a general TDM exception for other purposes (including commercial AI training), but only where rights holders have not expressly reserved the use “in an appropriate form” (opt-out, for online content ideally machine-readable). In Germany, these rules are implemented as Section 60d UrhG (research) and Section 44b UrhG (general TDM with opt-out). In practice, this means:
– Research training with lawful access regularly falls under Section 60d UrhG.
– Commercial training can be based on Section 44b UrhG, provided that no effective opt-out was set and access was lawful.
– Database rights may be affected as well; the TDM exceptions likewise cover extractions from protected databases.
For content available online, the opt-out must be declared in machine-readable form. Discussion and initial decisions in Germany have made clear that “machine-readable” does not automatically mean a classic robots.txt ban; what is gaining acceptance is a dedicated TDM reservation that signals clearly, and in a technically evaluable way, that TDM uses are reserved. Early court decisions point in the same direction: the lawfulness of access, compliance with opt-outs and proper documentation are liability-relevant – already when compiling datasets for training, not only during the actual model training.
GDPR in web and user data training: legal bases, limits, obligations
AI training on personal data requires a viable legal basis under Art. 6 GDPR. The debate revolves primarily around legitimate interests (Art. 6(1)(f)). Data protection authorities emphasize that legitimate interests are conceivable, but require a strict three-part test (legitimate interest, necessity, balancing), security and transparency measures, opt-out mechanisms and demonstrable accountability. For special categories (Art. 9 GDPR), the bar is considerably higher: legitimate interests are not available as a basis; explicit consent or another Art. 9(2) exception is required.
Further key points:
– Transparency/information obligations (Art. 13/14): these must in principle also be fulfilled for web scraping; any reliance on exceptions (e.g. disproportionate effort under Art. 14(5)(b)) must be justified and documented.
– Data subject rights: objection (Art. 21), erasure (Art. 17), rectification (Art. 16) and disputes over accuracy – applicable to training datasets and, under certain circumstances, to models.
– Data minimization & storage limitation (Art. 5(1)(c)/(e)): curate corpora, filter sensitive fields, limit retention, maintain deletion routines and “do-not-train” blacklists (a filter sketch follows this list).
– Risk management & DPIA (Art. 35): regularly required for broad-based scraping/training projects; reflect the outcome in policies and technology.
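The blacklist and retention points above translate directly into pipeline code. A minimal sketch, assuming a record format with source_url and a timezone-aware ISO fetched_at field; the field names, the 365-day retention budget and the domain entry are illustrative assumptions, not values prescribed by the GDPR:

    # Do-not-train blacklist plus storage-limitation filter (sketch).
    # Record format and thresholds are assumptions, see lead-in above.
    from datetime import datetime, timedelta, timezone
    from urllib.parse import urlsplit

    DO_NOT_TRAIN_DOMAINS = {"opted-out.example"}   # maintained deny list (hypothetical entry)
    RETENTION = timedelta(days=365)                # storage-limitation budget (assumption)

    def eligible_for_training(record: dict) -> bool:
        if urlsplit(record["source_url"]).netloc in DO_NOT_TRAIN_DOMAINS:
            return False                           # honor the do-not-train blacklist
        fetched = datetime.fromisoformat(record["fetched_at"])
        if datetime.now(timezone.utc) - fetched > RETENTION:
            return False                           # enforce storage limitation (Art. 5(1)(e))
        return True

Records that fail the check should be routed into the deletion routines rather than silently retained.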
European and national authorities have published guidelines and task force reports in 2024/2025 that sharpen this framework: the EDPB addresses transparency, accuracy risks and legal bases; the CNIL explains the conditions under which training can be based on legitimate interests (including technical and organizational safeguards); the ICO (UK) specifies requirements for web scraping and legitimate interest assessments. In practice, the decisive step is to anchor these requirements demonstrably in governance and technology.
AI Act and copyright compliance: obligations for general purpose models
The AI Act has been in the Official Journal since July 2024; its obligations take effect in stages, with the GPAI rules applying from August 2025 and most remaining provisions by 2026. It standardizes transparency and copyright compliance obligations for general-purpose AI models (GPAI). Providers of GPAI models must, among other things, maintain a policy on compliance with EU copyright law and publish a sufficiently detailed summary of the content used for training – regardless of where the training took place. In parallel, a GPAI Code of Practice (2025) is being developed as a voluntary vehicle to implement these obligations – including copyright respect and documentation – in practice. The consequence: rights and data compliance becomes auditable and verifiable, not merely a matter of “best efforts”.
Opt-out in practice: machine-readable reservations and how AI teams respect them
The DSM Directive requires a machine-readable reservation for content available online. In practice, the TDM Reservation Protocol (TDMRep) has established itself as a dedicated, machine-interpretable standard. Among other things, it can signal via HTTP headers or a site-wide well-known file that TDM uses are reserved, optionally pointing to license paths (see the illustrative signals below). There are also unofficial signals (e.g. “noai” meta/robots tags); these are not harmonized and are observed inconsistently. Anyone relying on Section 44b UrhG should consistently parse TDM signals in the pipeline and be able to prove that opt-outs are respected – otherwise there is a risk of copyright infringement. Public bodies (Council/Commission) are pursuing parallel standardization and registry initiatives to make the opt-out interoperable across Europe.
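For illustration – the exact field names should be verified against the current TDMRep specification, and all URLs and paths below are placeholders – a reservation can be expressed either as per-document HTTP response headers or as a site-wide well-known file:

    HTTP response headers:
        tdm-reservation: 1
        tdm-policy: https://example.com/policies/licensing.json

    /.well-known/tdmrep.json:
        [
          {"location": "/news/", "tdm-reservation": 1,
           "tdm-policy": "https://example.com/policies/licensing.json"},
          {"location": "/press/", "tdm-reservation": 0}
        ]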
Minimum technical measures for scrapers/loaders
– Parser for tdm-reservation and, where available, tdm-policy; honoring robots.txt alone is not a sufficient fallback (a parser sketch follows this list).
– Allow/deny lists plus blocking of sources with known AI-crawler bans and TDM reservations.
– Evidence repository: per source, record time of access, HTTP header/file snapshot, opt-out status, license path and lawfulness of access.
– Re-crawl rules: TDM opt-outs can be declared retroactively; reconciliation runs must be scheduled.
– License router: if a reservation is set, trigger the license path (e.g. the rights-contact URL from the TDM policy).
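A minimal parser sketch for the first bullet, assuming the TDMRep signal forms shown earlier; the helper names (fetch, is_tdm_reserved) are our own, the User-Agent string is a placeholder, and the field names should be checked against the current spec:

    # Check the TDM-reservation status of a URL before ingestion (sketch).
    import json
    import urllib.parse
    import urllib.request

    def fetch(url: str, timeout: float = 10.0):
        req = urllib.request.Request(url, headers={"User-Agent": "compliance-bot/0.1"})
        return urllib.request.urlopen(req, timeout=timeout)

    def is_tdm_reserved(url: str) -> tuple[bool, str | None]:
        """Return (reserved?, policy_url) for a document URL."""
        parts = urllib.parse.urlsplit(url)
        # 1) Site-wide well-known file, if present.
        well_known = f"{parts.scheme}://{parts.netloc}/.well-known/tdmrep.json"
        try:
            with fetch(well_known) as resp:
                for rule in json.load(resp):
                    if parts.path.startswith(rule.get("location", "/")):
                        return bool(rule.get("tdm-reservation")), rule.get("tdm-policy")
        except Exception:
            pass  # no usable well-known file; fall back to per-document headers
        # 2) Per-document HTTP response headers.
        with fetch(url) as resp:
            if resp.headers.get("tdm-reservation") == "1":
                return True, resp.headers.get("tdm-policy")
        # 3) A production pipeline would also parse <meta name="tdm-reservation">
        #    tags in HTML bodies and persist an evidence snapshot; omitted for brevity.
        return False, None

A True result feeds the license router (last bullet) instead of the corpus; every lookup, including the raw headers, belongs in the evidence repository.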
Thinking copyright + GDPR together: four typical stumbling blocks
Lawful access is not a free pass. Content that is available free of charge may be freely usable under copyright law, but a legal basis is still required under data protection law. Without a viable Art. 6 basis and without transparent information, training on personal data becomes risky – even if no opt-out has been set.
Special categories creep into web corpora at scale (health, political opinions, religion). Without consent or one of the narrow Art. 9(2) exceptions, there is regularly no viable basis for training on them. Filters and exclusions are therefore mandatory, as are blacklists for sensitive entities (a prefilter sketch follows).
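A coarse prefilter sketch for such exclusions; the pattern lists are deliberately minimal stubs (real deployments need curated lexicons, trained classifiers and human sampling), and the routing labels are our own naming:

    # Heuristic prefilter for special-category signals (Art. 9 GDPR).
    import re

    SPECIAL_CATEGORY_PATTERNS = [
        re.compile(r"\b(diagnos\w+|prescription)\b", re.I),  # health (stub)
        re.compile(r"\b(voted for|party member)\b", re.I),   # political opinion (stub)
        re.compile(r"\b(mosque|synagogue|baptism)\b", re.I), # religion (stub)
    ]

    def route(text: str) -> str:
        """Route a document: any hit goes to exclusion or human review."""
        if any(p.search(text) for p in SPECIAL_CATEGORY_PATTERNS):
            return "exclude-or-review"
        return "include"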
Database rights are underestimated. Many “open” collections are sui generis databases; mass extractions can infringe § 87b UrhG rights if no TDM privilege applies.
Subsequent opt-outs and data subject rights affect not only datasets but also model artifacts (e.g. vectors, embeddings). There is not always a “right to erasure in the model”, but robust processes for suppression, corrective fine-tuning and information are required – and increasingly demanded by supervisory authorities (Gesetze im Internet, EDPB).
Practice roadmap: Governance, contracts, technology
Governance & documentation
– Policy stack: TDM compliance policy (opt-out respect, license paths), copyright policy (works, related rights, database rights), privacy policy (Art. 6/9, transparency, data subject rights), retention policy for corpora and artifacts.
– Roles: Data Sourcing, Rights & Privacy Counsel, Dataset Steward, Security/ML-Ops, Audit.
– DPIA and legitimate interest assessment with concrete safeguards (pseudonymization, blacklists, sensitive-data filters, rate limits, access controls, purpose limitation).
– Transparency: layered notices, model cards/datasheets; for GPAI: training content summary per the AI Act.
Contracts & chain of rights
– Content sources: License clauses on TDM permission/restriction, purpose limitation “training/fine-tuning/evaluation”, territories, term, remuneration, audit/chain of rights, no-scrape warranty.
– API/partners: assurance of lawful provision, no opt-outs violated, no special categories without a basis, indemnification + audit rights.
– User content (SaaS/UGC): clear T&C permission or default no-training with granular opt-ins, or an opt-out in privacy settings; explicit rules for narrowly defined purposes (e.g. “quality improvement only”, “no third-party model training”).
– Data providers (annotation, synthesis): confidentiality, copyright/related rights, personal data, bias/quality KPIs, rights to labels.
Technology & processes
– Crawler/loader respects tdm-reservation; the parser is mandatory in the pipeline.
– Sensitive-data filter before inclusion in training corpora; hashes/heuristics/rules + human sampling.
– Data subject rights: search and suppression functions across corpus and artifacts; documented objection and deletion process, differentiated for training vs. evaluation sets and for fine-tuning adapters.
– Dataset provenance: content, source URL, timestamp, opt-out status, license path, legal basis; immutability (e.g. WORM store) and audit trail (see the record sketch after this list).
– Model-level controls: red-team evals for personal outputs, prompt guards, throttling, output transparency notices.
– Security by design: access/keys, segmentation, secret management; protection against data leakage and poisoning; regular audits.
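A sketch of the provenance record from the dataset provenance bullet, written to an append-only JSONL log; the field names are illustrative, and a true WORM backend (e.g. object storage with object lock) is out of scope here:

    # Per-source provenance record, appended to an audit log (sketch).
    import json
    from dataclasses import dataclass, asdict

    @dataclass(frozen=True)
    class ProvenanceRecord:
        source_url: str
        fetched_at: str            # ISO timestamp of the crawl
        opt_out_status: str        # e.g. "none" or "tdm-reservation"
        license_path: str | None   # tdm-policy URL / contract reference, if any
        legal_basis: str           # e.g. "Art. 6(1)(f) GDPR + Section 44b UrhG"
        content_sha256: str        # hash of the stored content snapshot

    def log_record(rec: ProvenanceRecord, path: str = "provenance.jsonl") -> None:
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(rec), ensure_ascii=False) + "\n")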
Implementation steps for product teams: “Legal by Architecture”
Corpus design
– Initial sourcing only from sources without a TDM reservation or with a license; technical whitelists.
– Keep a dedicated research corpus separate from the commercial corpus; do not funnel Section 60d uses unchecked into commercial paths.
– Avoid repeated sampling of the same (especially sensitive) content to reduce memorization of personal samples (a dedup sketch follows this list).
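A minimal exact-match deduplication sketch for the last bullet; near-duplicate detection (e.g. MinHash) would be a natural extension and is not shown:

    # Drop repeated documents before corpus inclusion (sketch).
    import hashlib

    _seen: set[str] = set()

    def first_occurrence(text: str) -> bool:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in _seen:
            return False  # already sampled once; skip to limit memorization
        _seen.add(digest)
        return True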
Transparency & user control
– For products with user uploads, granular consent/opt-ins for training; restrictive by default; separate consent for special categories of data (see the gate sketch after this list).
– Information layer covering scraping sources and data subject rights; easy-to-find “Do-Not-Train” buttons.
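A sketch of such a consent gate; the settings model (allow_training, allow_special_data) is a hypothetical naming, defaulting to the restrictive option described above:

    # Per-user training-consent gate for user-generated content (sketch).
    from dataclasses import dataclass

    @dataclass
    class TrainingConsent:
        allow_training: bool = False       # restrictive by default
        allow_special_data: bool = False   # separate consent for Art. 9 data

    def may_use_for_training(consent: TrainingConsent, has_special_data: bool) -> bool:
        if not consent.allow_training:
            return False
        if has_special_data and not consent.allow_special_data:
            return False
        return True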
Evaluation & Operation
– Address the correctness and accuracy of personally identifiable outputs; the EDPB emphasizes accuracy requirements.
– Carefully curate the training content summary (AI Act): categories, source classes, license paths, opt-out compliance – without exposing trade secrets.
– Incident response for rights/data breaches: Intake channel, immediate action (block/suppress), notifications, remediation.
Common misconceptions – and how to avoid them
“Publicly accessible = freely trainable” – wrong. Publicly available content remains protected by copyright and data protection law; it needs a TDM privilege or a license plus a GDPR basis.
“robots.txt is sufficient as an opt-out” – unreliable. A dedicated TDM reservation signal is the better, machine-evaluable route.
“Once trained, never erasable” – an overgeneralization. A deletion/objection process can target the corpus (removal/suppression), artifacts (filters, adapter retraining) and output controls; whether a model retrain is necessary depends on the individual case (proportionality, technical feasibility, risk).
“The research clause cures everything” – it does not. Section 60d UrhG is limited to privileged institutions and lawful access; any carry-over into commercial use must be separately licensed and reviewed.
Checklist 2025: From legal theory to audit security
- Data source register with opt-out status (tdm-reservation), legality, license path.
- TDM parser in production, blocking of TDM-reserved sources active.
- GDPR basis identified (Art. 6/9), LIA/DPIA documented, transparency texts available.
- Sensitive data mitigation before training, current exclusion lists.
- Data subject rights process (information, objection, deletion) end-to-end.
- AI-Act-GPAI: Copyright policy + training content summary implemented; Code of Practice signed where applicable.
- Contractual assurances with content/API partners (rights clearance, indemnification, audit).
- Audit trail for sourcing, training, evaluation, releases; regular management reviews.
Conclusion
Legally compliant AI training is not a guessing game, but a process and evidence discipline. Those who technically respect TDM opt-outs, organizationally map GDPR obligations and substantially fulfill AI Act transparency significantly reduce the risk of disputes and sanctions – and at the same time gain the basis for predictable licensing with rights holders. The operational difference is not created in policy documents, but in crawler logs, parsers, filters, policies and contracts that stand up to audit.