On July 24, 2025, the European Commission published the final version of its "Explanatory Notice and Template for the Public Summary of Training Content", a document that operationalizes a key transparency requirement of the EU AI Act (Regulation (EU) 2024/1689). This document is not merely guidance; it carries significant legal weight and establishes a mandatory reporting framework for all providers of general-purpose AI (GPAI) models placed on the EU market.
This review provides a technical deep-dive into the notice and the template, explaining the obligations, the meaning of key concepts, and how the reporting structure is designed to be implemented.
The European Commission has just unveiled a crucial piece of the puzzle for anyone developing or deploying AI in Europe: a template that mandates how companies must report the data they use to train their general-purpose AI models. This isn't just another layer of bureaucracy; it's a fundamental step towards transparency and a core component of the landmark EU AI Act.
For industry leaders, understanding this development is not optional. It directly impacts your AI strategy, compliance roadmap, and risk management. Here's a breakdown of what it means for the industry.
The document is an Explanatory Notice and a standardized template from the European Commission. It details the obligations for all providers of general-purpose AI models under Article 53(1)(d) of the EU AI Act. The goal is simple: to create transparency about the content used to train these powerful models.
The AI Act itself, which entered into force on August 1, 2024, establishes harmonized rules for artificial intelligence, with specific obligations for providers of general-purpose AI models. These transparency rules will become applicable on August 2, 2025.
The primary objective is to increase transparency, in particular so that parties with legitimate interests, such as copyright holders, can exercise and enforce their rights.
If your company places a general-purpose AI model on the EU market, including free and open-source models, you must prepare and publicly release a detailed summary of the training data used. This summary must follow the Commission's template and be made available on your website and through the model's distribution channels when the model is placed on the market.
The template is divided into three main sections: general information about the provider and the model, a list of data sources, and relevant data processing aspects.
The document strikes a careful balance between transparency and the protection of providers' legitimate interests.
The following is a detailed walkthrough of the mandatory template.
Version of the summary: [Provider enters version, with links to previous versions]
Last update: DD/MM/YY
This section serves to identify the provider and the model(s) covered by the summary.
| Field | Requirement |
|---|---|
| 1.1 Provider identification | |
| Provider name and contact details: | [Name and contact information of the provider] |
| Authorised representative name and contact details: | [Applicable only if the provider is established outside the Union] |
| 1.2 Model identification | |
| Versioned model name(s): | Provide the unique identifier for the model(s) or version(s) covered (e.g., Llama 3.1-405B). The same summary can be used for multiple models if their training content is identical. |
| Model dependencies: | If the model modifies or fine-tunes another GPAI model, specify the original model's name and link to its summary. |
| Date of placement of the model on the Union market: | [Date of market placement] |
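Because a single summary may cover several model versions and must link back to any upstream model's summary, providers will likely want a machine-readable record behind the published document. The sketch below shows one possible internal representation; the field names and structure are illustrative assumptions, not part of the Commission's template.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ModelIdentification:
    """Illustrative internal record backing Section 1 of the public summary.

    Field names are assumptions for internal bookkeeping; the official
    template is a document, not a data schema.
    """
    provider_name: str
    provider_contact: str
    versioned_model_names: list[str]              # e.g. ["Llama 3.1-405B"], the example cited in the notice
    upstream_model: str | None = None             # set if this model fine-tunes another GPAI model
    upstream_summary_url: str | None = None       # link to the upstream model's public summary
    placed_on_market: date | None = None          # date of placement on the Union market
    authorised_representative: str | None = None  # only if the provider is established outside the EU

# Example: one summary covering two versions with identical training content
record = ModelIdentification(
    provider_name="ExampleAI Ltd.",                   # hypothetical provider
    provider_contact="compliance@example-ai.invalid",
    versioned_model_names=["example-7b-v1.0", "example-7b-v1.1"],
    placed_on_market=date(2025, 8, 2),
)
```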
This part provides a high-level overview of the training data's composition.
| Modality (select the modalities present in the training data, to the extent that they are identifiable) | Training data size (for each selected modality, select the range within which the estimated total training data size for that modality falls; dynamic datasets may be excluded from the estimation) | Types of content (for each selected modality, provide a general description of the type of content that has been included in the training data) |
|---|---|---|
| ☐ Text | ☐ Less than 1 billion tokens ☐ 1 billion to 10 trillion tokens ☐ More than 10 trillion tokens (alternatively, specify size in another unit) | Examples of possible types of content include fiction and non-fiction text, scientific text, press publications, legal and official documents, social media comments, source code. |
| ☐ Image | ☐ Less than 1 million images ☐ 1 million to 1 billion images ☐ More than 1 billion images | Examples of possible types of content include photography, visual art works, infographics, social media images, logos, brands. |
| ☐ Audio | ☐ Less than 10,000 hours ☐ 10,000 to 1 million hours ☐ More than 1 million hours | Examples of possible types of content include musical compositions and recordings, audiobooks, radio shows and podcasts, private audio communication. |
| ☐ Video | ☐ Less than 10,000 hours ☐ 10,000 to 1 million hours ☐ More than 1 million hours | Examples of possible types of content include music videos, films, TV programmes, performances, video games, video clips, journalistic videos, social media videos. |
| ☐ Other | Specify modality and approximate size/unit | |
| Field | Requirement |
|---|---|
| Latest date of data acquisition/collection for model training: | Indicate the latest date when data was collected/obtained for the model training (MM/YYYY). Additionally, indicate if the model is continuously trained on new or dynamic data after this date. |
| Description of the linguistic characteristics of the overall training data: | Where applicable, describe the languages covered by the training data (e.g., text, videos or speech), focusing in particular on EU official languages. |
| Other relevant characteristics of the overall training data: | Where such information is readily available and in so far as it is relevant and practicable, describe other relevant characteristics of the overall training data, such as national/regional or demographic specificities. |
| Additional comments (optional): | Providers may also disclose other relevant information on a voluntary basis (e.g., the compression or tokenization methodologies applied for the data size calculation, or the sampling frequency/rate used for audio or video content). |
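Filling in Section 1.3 only requires order-of-magnitude estimates per modality. The sketch below shows one way a provider might map raw corpus statistics onto the disclosure bands from the modality table above; the band boundaries are taken from the template, while the function and variable names are my own.

```python
# Map an estimated training-data size onto the disclosure bands from Section 1.3.
# Band boundaries come from the template table above; everything else is illustrative.

TEXT_BANDS = [
    (1_000_000_000, "Less than 1 billion tokens"),
    (10_000_000_000_000, "1 billion to 10 trillion tokens"),
    (float("inf"), "More than 10 trillion tokens"),
]

IMAGE_BANDS = [
    (1_000_000, "Less than 1 million images"),
    (1_000_000_000, "1 million to 1 billion images"),
    (float("inf"), "More than 1 billion images"),
]

AUDIO_VIDEO_BANDS = [
    (10_000, "Less than 10,000 hours"),
    (1_000_000, "10,000 to 1 million hours"),
    (float("inf"), "More than 1 million hours"),
]

def disclosure_band(estimate: float, bands: list[tuple[float, str]]) -> str:
    """Return the first band whose upper bound the estimate falls under."""
    for upper_bound, label in bands:
        if estimate < upper_bound:
            return label
    return bands[-1][1]

# Example: a corpus of roughly 2.3 trillion text tokens and 40 million images
print(disclosure_band(2_300_000_000_000, TEXT_BANDS))   # "1 billion to 10 trillion tokens"
print(disclosure_band(40_000_000, IMAGE_BANDS))         # "1 million to 1 billion images"
```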
This is the most critical section for transparency, detailing the specific origins of the training data.
This covers pre-packaged datasets compiled by third parties and made publicly available for free (e.g., on public repositories).
Third-party data is reported in two categories: publicly available datasets, covered here, and private datasets obtained from third parties, covered next, which are reported at a higher level to protect commercial sensitivities.
| Field | Requirement |
|---|---|
| Have you used publicly available datasets to train the model? | ☐ Yes ☐ No |
| If yes, specify modality(ies): | ☐ Text ☐ Image ☐ Video ☐ Audio ☐ Other |
| List of large publicly available datasets: | For each "large" dataset (defined as more than 3% of the total data size for that modality from public datasets), provide its name and an access link. If a link is not available, provide a general description. |
| General description of other publicly available datasets not listed above: | Provide a general description of their content, including modality types, nature of content (e.g., personal data, copyright-protected content), and linguistic characteristics. |
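The 3% threshold means a provider has to know, per modality, how much of its publicly sourced data each dataset contributes. Below is a minimal sketch of that bookkeeping, assuming the provider tracks per-dataset sizes in a common unit (e.g., tokens for text); the dataset names and sizes are hypothetical.

```python
# Split publicly available datasets into those that must be listed by name
# (more than 3% of the total public data for a modality, per the template)
# and those that only need a general description. Names and sizes are hypothetical.

def split_public_datasets(sizes: dict[str, float], threshold: float = 0.03):
    """Return (large, other) dataset names for one modality."""
    total = sum(sizes.values())
    large = [name for name, size in sizes.items() if total and size / total > threshold]
    other = [name for name in sizes if name not in large]
    return large, other

# Example: token counts for text datasets pulled from public repositories
text_public = {
    "public-corpus-A": 900_000_000_000,   # 900B tokens
    "public-corpus-B": 150_000_000_000,   # 150B tokens
    "small-forum-dump": 10_000_000_000,   # 10B tokens
}

large, other = split_public_datasets(text_public)
print(large)   # ['public-corpus-A', 'public-corpus-B']  -> listed with name and access link
print(other)   # ['small-forum-dump']                    -> general description only
```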
This covers private datasets obtained from third parties, whether licensed directly from rightsholders or sourced from data intermediaries and other third parties.
| Field | Requirement |
|---|---|
| Have you concluded transactional commercial licensing agreement(s) with rightsholder(s) or with their representatives? | ☐ Yes ☐ No |
| If yes, specify the modality(ies) of the content covered by the datasets concerned: | ☐ Text ☐ Image ☐ Video ☐ Audio ☐ Other |

Note: No further detail on the datasets is required here, in order to protect confidential business agreements.
This covers data crawled and scraped from online sources by the provider or on its behalf, and requires a comprehensive look at the provider's own data collection activities.
| Field | Requirement |
|---|---|
| Were crawlers used by the provider or on its behalf? | ☐ Yes ☐ No |
| If yes, specify crawler name(s)/identifier(s): | [Name/ID of crawlers used] |
| Purposes of the crawler(s): | [Explain the purpose of the crawling activity] |
| General description of crawler behaviour: | Describe how the crawlers behaved, e.g., whether they respected robots.txt, paywalls, captchas, etc. |
| Period of data collection: | From MM/YYYY to MM/YYYY |
| Comprehensive description of the type of content and online sources crawled: | Describe the type of content (geographical and linguistic characteristics) and the websites scraped (e.g., news, blogs, social media, forums, government portals). |
| Summary of the most relevant domain names crawled: | This is a key requirement. Providers must list the top-level internet domain names from which content was scraped and used, covering the top 10% of all domains by content size. For SMEs, this is reduced to the top 5% or 1,000 domains, whichever is lower. The list can be provided as a downloadable file. |
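Producing the domain list is mostly an aggregation exercise over crawl logs: rank the crawled domains by the amount of content retained for training and keep the required share. A minimal sketch, assuming per-domain content sizes are already aggregated; the domain names and sizes are placeholders, and the SME rule is applied as described above.

```python
# Rank crawled domains by the size of content used for training and keep the
# share the template requires: top 10% of domains by content size, or for SMEs
# the top 5% or 1,000 domains, whichever is lower. Domain data is hypothetical.
import math

def most_relevant_domains(domain_sizes: dict[str, int], sme: bool = False) -> list[str]:
    ranked = sorted(domain_sizes, key=domain_sizes.get, reverse=True)
    if sme:
        cutoff = min(math.ceil(0.05 * len(ranked)), 1000)
    else:
        cutoff = math.ceil(0.10 * len(ranked))
    return ranked[:cutoff]

# Example with a handful of hypothetical domains (real crawls involve far more)
crawl_stats = {
    "news.example": 120_000_000,
    "blog.example": 45_000_000,
    "forum.example": 30_000_000,
    "gov.example": 5_000_000,
    # one entry per top-level domain crawled
}
print(most_relevant_domains(crawl_stats))            # top 10% of domains by content size
print(most_relevant_domains(crawl_stats, sme=True))  # reduced SME list
```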
This covers data collected from user interactions with the provider's own services and products.
| Field | Requirement |
|---|---|
| Was data from user interactions with the AI model (e.g., user input and prompts) used to train the model? | ☐ Yes ☐ No |
| Was data collected from user interactions with the provider's other services or products used to train the model? | ☐ Yes ☐ No |
| If yes, provide a general description of the provider's services or products that were used to collect the user data: | [General description of services/products] |
| Type of modality covered: | ☐ Text ☐ Image ☐ Video ☐ Audio ☐ Other |
This relates to data generated by another AI model, particularly through model distillation or alignment techniques.
| Field | Requirement |
|---|---|
| Was synthetic AI-generated data created by the provider or on its behalf to train the model? | ☐ Yes ☐ No |
| If yes, modality of the synthetic data: | ☐ Text ☐ Image ☐ Video ☐ Audio ☐ Other |
| If yes, specify the general-purpose AI model(s) used to generate the synthetic data, if available on the market: | Specify the name of the GPAI model(s) and provide a link to their summaries where available. |
| Information about other AI models, including the provider's own AI model(s) not available on the market: | Provide information about the other AI models used, including a general description of their training data to the extent needed for rightsholders to exercise their rights and to avoid circumvention of the transparency obligation. |
A catch-all category for any data not covered above, such as offline sources or self-digitized media.
| Field | Requirement |
|---|---|
| Have data sources other than those described in Sections 2.1 to 2.5 been used to train the model? | ☐ Yes ☐ No |
| If yes, provide a narrative description of these data sources and the data: | [Narrative description] |
This final section addresses policies and measures related to legal compliance, in particular respect for copyright reservations and the removal of illegal content.
| Field | Requirement |
|---|---|
| Are you a signatory to the Code of Practice for general-purpose AI models, which includes commitments to respect reservations of rights from the text and data mining (TDM) exception or limitation? | ☐ Yes ☐ No |
| Measures to respect reservations of rights from the TDM exception or limitation: | Describe the measures implemented before model training, and before and during data collection, to respect such reservations of rights... |
| Opt-out protocols and solutions: | Describe the opt-out protocols and solutions honored by the provider or by third parties from which datasets were obtained. |
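One widely honored machine-readable opt-out signal is robots.txt. The sketch below checks whether a given URL may be fetched by a named crawler using Python's standard library; it is only one of several reservation mechanisms a provider might describe here, and the crawler name and URLs are placeholders.

```python
# Check a common machine-readable opt-out signal (robots.txt) before crawling.
# robots.txt is only one way rights reservations can be expressed; the crawler
# name and URLs below are placeholders.
from urllib import robotparser

def may_fetch(url: str, user_agent: str, robots_url: str) -> bool:
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetches and parses the site's robots.txt
    return parser.can_fetch(user_agent, url)

# Hypothetical crawler identifier and target URL
if may_fetch("https://www.example.com/articles/1",
             user_agent="ExampleAI-Crawler",
             robots_url="https://www.example.com/robots.txt"):
    pass  # proceed to fetch; otherwise skip and record the opt-out
```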
This concerns measures taken to remove illegal content (e.g., child sexual abuse material, terrorist content) from training data.
General description of measures taken: [Describe general measures such as blacklists, keyword filtering, or model-based classifiers, without disclosing trade secrets].
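As a rough illustration of the kind of measures meant here, the sketch below applies a domain blocklist and keyword screen to candidate training documents. Production pipelines typically combine such filters with model-based classifiers and hash matching against known illegal material, which are not shown; all names and lists are placeholders.

```python
# Illustrative pre-training filter combining a domain blocklist and a keyword
# screen. Real pipelines add model-based classifiers and hash matching against
# known illegal content; the lists below are placeholders, not real policy.
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"known-bad.example"}      # hypothetical blocklist
BLOCKED_KEYWORDS = {"example-banned-term"}   # hypothetical keyword list

def keep_document(url: str, text: str) -> bool:
    """Return True if the document passes the illustrative filters."""
    domain = urlparse(url).netloc.lower()
    if domain in BLOCKED_DOMAINS:
        return False
    lowered = text.lower()
    return not any(keyword in lowered for keyword in BLOCKED_KEYWORDS)

docs = [
    ("https://news.example/story", "an ordinary article"),
    ("https://known-bad.example/page", "anything"),
]
filtered = [(u, t) for u, t in docs if keep_document(u, t)]
print(len(filtered))  # 1
```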
This detailed framework represents a significant step-change in AI governance, moving from principles to concrete, enforceable obligations. For IT leaders, compliance will require meticulous data supply chain documentation and robust internal governance processes.
Broadly, this new reporting obligation is a core operational and strategic challenge. It turns the abstract debate about AI training data into a concrete compliance issue. Companies across the IT industry, particularly builders of general-purpose AI models, must begin diligently documenting their data supply chains. While this creates an administrative burden, it also creates an opportunity to build trust and demonstrate a commitment to developing AI ethically and lawfully. The era of opaque training data in the EU is coming to an end. It's time to prepare.