French CNIL Opens Public Consultation On Guidance On The Creation Of AI Training Databases

By Kristof Van Quathem & Alix Bertrand on November 20, 2023

On October 11, 2023, the French data protection authority (“CNIL”) issued a set of “how-to” sheets on artificial intelligence (“AI”) training databases. The sheets are open to consultation until December 15, 2023, and all AI stakeholders (including companies, researchers, NGOs) are encouraged to provide comments.

There are eight sheets in total, each covering a data protection issue AI providers should consider when designing their systems. We have outlined below the main takeaways for each sheet.

Scope of the how-to sheets: The sheets apply to the development phase (i.e., designing, creation of database, training) of AI systems based on machine learning (statistical/stochastic systems) or on logic and knowledge (deterministic systems), to the extent such systems rely on the collection and use of personal data and are subject to the GDPR.

Applicable legal regime: The development phase of an AI system and its operational use (deployment phase) involve separate processing activities, which may in some cases be subject to different legal regimes (e.g., the GDPR):
- Where the operational use of the AI system is already defined during the development phase (“Scenario 1”), the processing activities during both phases are generally subject to the same legal regime;
- Where the operational use of the AI system is not clearly identified during the development phase (e.g., for general purpose AI systems) (“Scenario 2”), the applicable legal regime for the development and deployment phases may differ. In this case, the CNIL considers that processing activities conducted in the development phase would generally be subject to the GDPR.

Defining a purpose: The creation of an AI training database should have a “specified, explicit and legitimate” purpose. The CNIL illustrates this for three situations:
- In Scenario 1, since the overall purpose is the same in both the development phase and the deployment phase, the controller should ensure the identified operational use of the AI system is sufficiently specified, explicit and legitimate.
- In Scenario 2, in order to be considered sufficiently precise, the purpose of processing in the development phase must refer cumulatively to (i) the type of system developed (e.g., generative AI system for voices) and (ii) the anticipated technical functionalities and capabilities of the system.
- Where the developer is building an AI training database for scientific research, the controller must define the purpose of the research and related processing of personal data. However, the CNIL acknowledges that the degree of precision here may be lower than for Scenario 2, as it may be difficult for researchers to fully identify such purpose at the beginning of a new project.

Legal qualification of AI system providers: This sheet is intended to help AI system providers determine whether they act as a controller, a joint controller or a processor within the meaning of the GDPR. To that end, the CNIL provides practical examples of processing activities carried out by various providers along with the corresponding qualification for the relevant provider. By way of illustration, the CNIL highlights that where a provider reuses its customers’ data to train an AI recommendation model, this provider would likely be deemed a controller under the GDPR.

Legal basis for the creation of the AI training dataset: According to the CNIL, a controller creating an AI training database would typically rely on consent, legitimate interest, contractual necessity or public interest as a legal basis. The CNIL provides examples of situations where a controller may rely on one of those legal bases, explaining the conditions that would have to be met in each case. In addition, the CNIL points out that controllers should ensure that the dataset they intend to use for AI training purposes is built lawfully. This could involve conducting additional checks, in particular where the data was initially collected for another purpose (i.e., a compatibility test).

Data protection impact assessments (“DPIAs”): This sheet is intended to help developers of AI systems identify when they are required to conduct a DPIA and provides guidance on how to conduct a DPIA in relation to AI-related processing activities.

Taking data protection into account in the system design choices: This sheet lists privacy-by-design ideas to consider when making design choices for an AI system, to ensure these choices will enable compliance with the GDPR data protection principles (in particular, the data minimization principle).

Taking data protection into account in data collection and management: This sheet provides guidance on how AI system providers can implement privacy-by-design principles when creating and managing an AI training dataset. It includes recommendations on checks to carry out when collecting data (in particular, from publicly available sources), on data minimization and data retention, as well as on the monitoring and updating of the dataset. It also mentions the need for AI providers to document their data protection choices relating to the creation and management of the dataset, and provides a documentation template to this end.

What are the next steps?

Stakeholders have until December 15, 2023, to submit their comments on these sheets. The CNIL will then review all the contributions before publishing an updated version of the sheets in early 2024. It will also issue additional AI how-to sheets by the end of 2023, covering other topics such as data subjects’ rights.

* * *

Covington’s Data Privacy and Cybersecurity Team and our Technology and Communications Team regularly advise clients on the laws surrounding AI (see in particular our previous blog posts regarding the EU AI Act), and we will continue to monitor developments in the field of AI. If you would like to submit a contribution to the CNIL, or if you have questions more broadly about how the regulation of AI will affect your business, please feel free to contact us.
