A Technical Review of the EU's Mandatory AI Training Data Disclosure

On July 24, 2025, the European Commission published the final version of its "Explanatory Notice and Template for the Public Summary of Training Content", a document that operationalizes a key transparency requirement of the EU AI Act (Regulation (EU) 2024/1689). This document is not merely guidance; it carries significant legal weight and establishes a mandatory reporting framework for all providers of general-purpose AI (GPAI) models placed on the EU market.

This review provides a technical deep-dive into the notice and the template, explaining the obligations, the meaning of key concepts, and how the reporting structure is designed to be implemented.

What does the mandatory AI training data disclosure template bring to the table?

The European Commission has just unveiled a crucial piece of the puzzle for anyone developing or deploying AI in Europe. On July 24, 2025, it released a new template that mandates how companies must report the data they use to train their general-purpose AI models. This isn't just another layer of bureaucracy; it's a fundamental step towards transparency and a core component of the landmark EU AI Act.

For industry leaders, understanding this development is not optional. It directly impacts your AI strategy, compliance roadmap, and risk management. Here’s a breakdown of what this means for the industry.

What is it about?

The document is an Explanatory Notice and a standardized template from the European Commission. It details the obligations for all providers of general-purpose AI models under Article 53(1)(d) of the EU AI Act. The goal is simple: to create transparency about the content used to train these powerful models.

The AI Act itself, which entered into force on August 1, 2024, establishes harmonized rules for artificial intelligence, with specific obligations for providers of general-purpose AI models. These transparency rules will become applicable on August 2, 2025.

What is the purpose of this mandatory disclosure?

The primary objective is to increase transparency for several key reasons:

  • Protecting Intellectual Property: It aims to help rights holders, especially copyright holders, understand if their content was used in training an AI model, enabling them to exercise and enforce their rights under EU law.
  • Enforcing Data Protection: The summary helps in the enforcement of data protection rules like GDPR by clarifying what data has been collected, including data scraped from the internet or gathered from user interactions.
  • Ensuring Fundamental Rights: Transparency about training data can help downstream developers assess data diversity and mitigate biases, thereby respecting fundamental rights like non-discrimination.
  • Promoting Fair Competition: It sheds light on whether models are trained on other publicly available AI models or on proprietary user data, which can help prevent market lock-in effects.

What does your company have to do?

If your company provides a general-purpose AI model on the EU market—even a free and open-source one—you must prepare and publicly release a detailed summary of the training data used. This summary must follow the Commission's template and be made available on your website and through the model's distribution channels when the model is placed on the market.

The template is divided into three main sections:

  • General Information: This includes identifying the provider and the specific model(s) the summary covers, including model dependencies if it's built on another AI model. You'll also need to disclose the types of data (modalities like text, image, audio), and the approximate size of the training data within broad ranges.
  • List of Data Sources: This is the core of the disclosure and requires a comprehensive overview of where your training data comes from. This includes:
    • Publicly available datasets: You must list "large" public datasets used. A dataset is considered "large" if it exceeds 3% of the total size of all public datasets used for that modality.
    • Private datasets from third parties: For data licensed from rights holders, disclosure is limited. For other private, non-publicly known datasets, a general description is required.
    • Data scraped from online sources: You must describe the crawlers used, the collection period, and the type of content scraped. Crucially, you must provide a summary list of the most relevant domain names scraped. For most companies, this means the top 10% of domains by content size; for SMEs, it's the top 5% or the top 1,000 domains, whichever is lower.
    • User data: Information on data collected from user interactions with your services or products must be described.
    • Synthetic data: If your model was trained using AI-generated data (e.g., through model distillation), you must identify the AI models used to create that data.
  • Data Processing Aspects: This section requires you to describe measures taken to respect copyright holders' "opt-out" rights for text and data mining and steps taken to remove illegal content from your training data.
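To make the 3% "large dataset" threshold concrete, here is a minimal sketch of how a provider might flag which public datasets must be listed by name in Section 2.1; the dataset names and sizes below are hypothetical:

```python
# Sketch: flag which public datasets count as "large" (exceeding 3% of the
# total size of all public datasets used for a given modality) and so must
# be listed by name. Dataset names and sizes are hypothetical.

def large_public_datasets(datasets, threshold=0.03):
    """datasets: {modality: {dataset_name: size}}, any consistent unit per modality.
    Returns {modality: sorted list of dataset names exceeding the threshold}."""
    result = {}
    for modality, sizes in datasets.items():
        total = sum(sizes.values())
        result[modality] = sorted(
            name for name, size in sizes.items() if total and size / total > threshold
        )
    return result

public_data = {
    "text": {"corpus_a": 500, "corpus_b": 30, "corpus_c": 12},  # billions of tokens
    "image": {"set_x": 900, "set_y": 20},                       # millions of images
}

print(large_public_datasets(public_data))
# {'text': ['corpus_a', 'corpus_b'], 'image': ['set_x']}
```

Note that the threshold is computed per modality, so a dataset can be "large" for images even if the same volume would be negligible against the text corpus.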

Timelines and Enforcement: What to Expect

  • Application Date: The duty to provide the summary applies from August 2, 2025.
  • Existing Models: For models already on the market before this date, providers must publish their summaries no later than August 2, 2027.
  • Updates: The summary must be updated at least once every six months while the model is further trained on new data, or sooner if there is a "materially significant update".
  • Enforcement: Compliance will be monitored by the AI Office from August 2, 2026. Non-compliance is not a joke. Fines can go as high as 3% of your organization's total worldwide annual turnover or €15 million, whichever is larger.
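The fine ceiling reduces to a one-line calculation: the higher of 3% of total worldwide annual turnover and €15 million. The turnover figures below are hypothetical:

```python
# Sketch: the AI Act's penalty ceiling for GPAI providers is the HIGHER of
# 3% of total worldwide annual turnover and EUR 15 million.

def max_fine_eur(worldwide_annual_turnover_eur: float) -> float:
    return max(0.03 * worldwide_annual_turnover_eur, 15_000_000.0)

# A provider with EUR 2 billion turnover: 3% = EUR 60 million > EUR 15 million.
print(max_fine_eur(2_000_000_000))  # 60000000.0
# A provider with EUR 100 million turnover: 3% = EUR 3 million, so the
# EUR 15 million floor applies instead.
print(max_fine_eur(100_000_000))    # 15000000.0
```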

Key Concepts and Safeguards Explained

The document navigates a careful balance between transparency and protecting providers' legitimate interests.

  • Trade Secrets: The template is designed to be "generally comprehensive" but not "technically detailed" to avoid forcing the disclosure of competitively sensitive information. This is implemented through varying levels of required detail. For instance:
    • For commercially licensed data, minimal public disclosure is needed as rights holders are party to the agreements.
    • For private, non-licensed datasets, they only need to be listed if they are already publicly known.
    • The exact "mix and composition of data sources" is not required, only high-level information on data size per modality.
  • Text and Data Mining (TDM) Exception: This refers to Article 4 of the EU Copyright Directive (2019/790), which allows for text and data mining for any purpose, provided that rights holders have not explicitly "opted-out" or reserved their rights in a machine-readable format. Section 3.1 of the template directly addresses this by requiring providers to describe the measures they have implemented to identify and comply with these reservations of rights. This is a crucial link between the AI Act and existing EU copyright law.
  • Model Distillation: Mentioned in the context of synthetic data, this is a process where a smaller, more specialized AI model (the "student") is trained on the outputs of a larger, more complex model (the "teacher"). The notice requires disclosure of which GPAI models were used for this purpose to prevent circumvention of transparency obligations. If a provider uses its own proprietary model to generate synthetic data, it must still provide a general description of that model's training data.
  • Enforcement: Compliance is not optional. The AI Office has the authority to verify the accuracy of the submitted summaries and can impose corrective measures. Non-compliance can lead to severe fines of up to 3% of the provider's total worldwide annual turnover or €15,000,000, whichever is higher.
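As a purely numeric illustration of the distillation idea described above—assuming nothing about any real model—a "student" can be fitted to the outputs of a "teacher" rather than to original labels:

```python
# Minimal numeric sketch of distillation: a "student" linear model is fitted
# to the OUTPUTS of a "teacher" model rather than to original labels. Purely
# illustrative; real GPAI distillation trains a smaller network on a larger
# model's generated text or soft predictions.

def teacher(x):  # stand-in for a large pretrained model
    return 2.0 * x + 1.0

# Synthetic training set: the teacher labels the inputs.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [teacher(x) for x in xs]  # this is the "synthetic data"

# Fit the student y = w*x + b by ordinary least squares (closed form).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b = mean_y - w * mean_x

print(round(w, 6), round(b, 6))  # 2.0 1.0 -- the student recovers the teacher
```

Under the template, the provider would have to name the teacher model (or, if it is a proprietary in-house model, describe its training data in general terms).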

Technical Breakdown of the Reporting Template

The following is a detailed walkthrough of the mandatory template.

Template for the Public Summary of Training Content

Version of the Summary: [Provider enters version, with links to previous versions] Last update: DD/MM/YY

Section 1: General Information

This section serves to identify the provider and the model(s) covered by the summary.

Field Requirement
1.1 Provider identification
Provider name and contact details: [Name and contact information of the provider]
Authorised representative name and contact details: [Applicable only if the provider is established outside the Union]
1.2. Model identification
Versioned model name(s): Provide the unique identifier for the model(s) or version(s) covered (e.g., Llama 3.1-405B). The same summary can be used for multiple models if their training content is identical.
Model dependencies: If the model modifies or fine-tunes another GPAI model, specify the original model's name and link to its summary.
Date of placement of the model on the Union market: [Date of market placement]

1.3. Modalities, overall training data size and other characteristics

This part provides a high-level overview of the training data's composition.

For each modality present in the training data (to the extent identifiable), select the applicable checkbox, select the range within which the estimated total training data size for that modality falls (dynamic datasets may be excluded from the estimation), and provide a general description of the types of content included.

☐ Text
  Size: ☐ Less than 1 billion tokens ☐ 1 billion to 10 trillion tokens ☐ More than 10 trillion tokens (alternatively, specify size in another unit)
  Examples of possible types of content: fiction and non-fiction text, scientific text, press publications, legal and official documents, social media comments, source code.

☐ Image
  Size: ☐ Less than 1 million images ☐ 1 million to 1 billion images ☐ More than 1 billion images
  Examples of possible types of content: photography, visual art works, infographics, social media images, logos, brands.

☐ Audio
  Size: ☐ Less than 10,000 hours ☐ 10,000 to 1 million hours ☐ More than 1 million hours
  Examples of possible types of content: musical compositions and recordings, audiobooks, radio shows and podcasts, private audio communication.

☐ Video
  Size: ☐ Less than 10,000 hours ☐ 10,000 to 1 million hours ☐ More than 1 million hours
  Examples of possible types of content: music videos, films, TV programmes, performances, video games, video clips, journalistic videos, social media videos.

☐ Other
  Specify modality and approximate size/unit.
Field Requirement
Latest date of data acquisition/collection for model training: Indicate the latest date when data was collected/obtained for the model training: MM/YYYY
Additionally, indicate if the model is continuously trained on new or dynamic data after this date.
Description of the linguistic characteristics of the overall training data: Where applicable, describe the languages covered by the training data (e.g., text, videos or speech), focusing in particular on EU official languages.
Other relevant characteristics of the overall training data: Where such information is readily available and in so far as it is relevant and practicable, describe other relevant characteristics of the overall training data, such as national/regional or demographic specificities of the training data.
Additional comments (optional): Providers may also disclose other relevant information on a voluntary basis (e.g., the compression or tokenization methodologies applied for the data size calculation, or the sampling frequency/rate for audio or video content).
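A provider estimating its text corpus only needs to report one of the broad ranges above, not an exact count. As a sketch (how exact boundary values are assigned is not specified in the notice, so the bucketing below is an assumption):

```python
# Sketch: map an estimated text token count onto the template's broad
# disclosure ranges. Boundary values are assigned to the lower/middle
# bucket here; the notice does not specify this, so it is an assumption.

def text_size_range(tokens: int) -> str:
    if tokens < 1_000_000_000:            # under 1 billion
        return "Less than 1 billion tokens"
    if tokens <= 10_000_000_000_000:      # up to 10 trillion
        return "1 billion to 10 trillion tokens"
    return "More than 10 trillion tokens"

print(text_size_range(3_500_000_000_000))  # 1 billion to 10 trillion tokens
```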

Section 2: List of Data Sources

This is the most critical section for transparency, detailing the specific origins of the training data.

2.1. Publicly available datasets

This covers pre-packaged datasets compiled by third parties and made publicly available for free (e.g., on public repositories).

Have you used publicly available datasets to train the model?

☐ Yes ☐ No

If yes, specify modality(ies):

☐ Text ☐ Image ☐ Video ☐ Audio ☐ Other

List of large publicly available datasets:

For each "large" dataset (defined as >3% of the total data size for that modality from public datasets), provide its name and an access link. If a link is not available, provide a general description.

General description of other publicly available datasets not listed above:

Provide a general description of their content, including modality types, nature of content (e.g., personal data, copyright-protected material), and linguistic characteristics.

2.2. Private non-publicly available datasets obtained from third parties

This section is split into two categories to protect commercial sensitivities.

2.2.1. Datasets commercially licensed by rightsholders or their representatives

Have you concluded transactional commercial licensing agreement(s) with rightsholder(s) or with their representatives?

☐ Yes ☐ No

If yes, specify the modality(ies) of the content covered by the datasets concerned:

☐ Text ☐ Image ☐ Video ☐ Audio ☐ Other

Note: No further detail on the datasets is required here to protect confidential business agreements.

2.2.2. Private datasets obtained from other third parties

This covers private datasets from data intermediaries or other third parties not licensed directly from rights holders. Such datasets need only be listed if they are already publicly known; for the rest, a general description of their content suffices.

2.3. Data crawled and scraped from online sources

This requires a comprehensive look at the provider's own data collection activities.

Field Requirement
Were crawlers used by the provider or on its behalf? ☐ Yes ☐ No
If yes, specify crawler name(s)/identifier(s): [Name/ID of crawlers used]
Purposes of the crawler(s): [Explain the purpose of the crawling activity]
General description of crawler behaviour: Describe how crawlers behaved, e.g., respect for robots.txt, paywalls, captchas, etc.
Period of data collection: From MM/YYYY to MM/YYYY
Comprehensive description of the type of content and online sources crawled: Describe the type of content (geographical, linguistic characteristics) and websites scraped (e.g., news, blogs, social media, forums, government portals).
Summary of the most relevant domain names crawled: This is a key requirement. Providers must list the top-level internet domain names from which content was scraped and used. The threshold is the top 10% of all domains by content size. For SMEs, this is reduced to the top 5% of domains or the top 1,000 domains, whichever is lower. This list can be provided as a downloadable file.
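The domain-listing threshold can be sketched as a simple ranking. Domain names and sizes below are hypothetical, and rounding the percentage cutoff up is an assumption, as the notice does not specify rounding:

```python
# Sketch of the Section 2.3 domain-listing threshold: rank scraped domains
# by the size of content obtained from each, then keep the top 10% (non-SME
# providers) or, for SMEs, the top 5% of domains or the top 1,000 domains,
# whichever is lower. Rounding the cutoff up is an assumption.
import math

def domains_to_list(domain_sizes, sme=False):
    """domain_sizes: {domain_name: scraped content size}."""
    ranked = sorted(domain_sizes, key=domain_sizes.get, reverse=True)
    if sme:
        cutoff = min(math.ceil(0.05 * len(ranked)), 1000)
    else:
        cutoff = math.ceil(0.10 * len(ranked))
    return ranked[:cutoff]

# 50 hypothetical domains with decreasing content sizes.
sizes = {f"site{i}.example": 1000 - i for i in range(50)}
print(len(domains_to_list(sizes)))            # 5 (top 10% of 50)
print(len(domains_to_list(sizes, sme=True)))  # 3 (top 5% of 50, under the 1,000 cap)
```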

2.4. User data

This covers data collected from user interactions with the provider's own services and products.

Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model?

☐ Yes ☐ No

Was data collected from user interactions with the provider's other services or products used to train the model?

☐ Yes ☐ No

If yes, provide a general description of the provider's services or products that were used to collect the user data:

[General description of services/products]

Type of modality covered:

☐ Text ☐ Image ☐ Video ☐ Audio ☐ Other

2.5. Synthetic data

This relates to data generated by another AI model, particularly through model distillation or alignment techniques.

Was synthetic AI-generated data created by the provider or on their behalf to train the model?

☐ Yes ☐ No

If yes, modality of the synthetic data:

☐ Text ☐ Image ☐ Video ☐ Audio ☐ Other

If yes, specify the general-purpose AI model(s) used to generate the synthetic data if available on the market:

Specify the name of the GPAI model(s) and provide a link to their summaries where available.

Information about other AI models, including provider's own AI model(s) not available on the market:

Provide information about other AI models used, including a general description of their training data to the extent needed for rights-holders and to avoid circumvention.

2.6. Other sources of data

A catch-all category for any data not covered above, such as offline sources or self-digitized media.

Have data sources other than those described in Sections 2.1 to 2.5 been used to train the model?

☐ Yes ☐ No

If yes, provide a narrative description of these data sources and the data:

[Narrative description] 

Section 3: Data Processing Aspects

This final section addresses policies and measures related to legal compliance.

3.1. Respect of reservation of rights from text and data mining exception or limitation

Are you a Signatory to the Code of Practice for general-purpose AI models that includes commitments to respect reservations of rights from the TDM exception or limitation? ☐ Yes ☐ No

Describe the measures implemented to respect reservations of rights from the TDM exception or limitation before and during data collection, including the opt-out protocols and solutions honored by the provider or by third parties from which the datasets were obtained.
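As an illustrative sketch of one such measure: robots.txt is a widely used machine-readable opt-out signal, and Python's standard library can check whether a given crawler is allowed to fetch a URL. The crawler name and rules below are hypothetical, and the notice does not prescribe robots.txt specifically:

```python
# Sketch: checking a robots.txt-style reservation of rights with the
# standard library. The crawler name and rules are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# The hypothetical training crawler is opted out site-wide...
print(rp.can_fetch("ExampleTrainingBot", "https://example.org/articles/1"))  # False
# ...while other agents are only barred from /private/.
print(rp.can_fetch("OtherBot", "https://example.org/articles/1"))            # True
print(rp.can_fetch("OtherBot", "https://example.org/private/x"))             # False
```

In practice a provider would document which of these signals (robots.txt, metadata standards, or other opt-out protocols) its crawlers honor, which is exactly what Section 3.1 asks for.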

3.2. Removal of illegal content

This concerns measures taken to remove illegal content (e.g., child sexual abuse material, terrorist content) from training data.

General description of measures taken: [Describe general measures like blacklists, keywords, model-based classifiers without disclosing trade secrets].

Conclusion

This detailed framework represents a significant step-change in AI governance, moving from principles to concrete, enforceable obligations. For IT leaders, compliance will require meticulous data supply chain documentation and robust internal governance processes.

Broadly, this new reporting obligation is a core operational and strategic challenge. It turns the abstract debate about AI training data into a concrete compliance issue. Companies in the IT industry, particularly providers of general-purpose AI models, must begin diligently documenting their data supply chains. While this creates an administrative burden, it is also an opportunity to build trust and demonstrate a commitment to developing AI ethically and lawfully. The era of opaque training data in the EU is coming to an end. It’s time to prepare.
