Cohere Model Training Privacy Notice
Last Update: January 30, 2026
We have prepared this notice (“Notice”) to explain how our foundation models (“Models”) are trained and how they work. In this Notice, we also explain how personal information of individuals who do not use our products may incidentally be processed as part of the training process of our Models.
For information about the personal information we handle in connection with our products, website, or other interactions with our customers, or for additional details not provided in this Notice, please see our Privacy Policy. You can also contact us at privacy@cohere.com.
Cohere Models and Who They are Made For
Cohere develops two types of Models to serve its customers:
- Generative models, like our Command model family, used to generate human-like text in response to user inputs containing text or images.
- Search and retrieval models, like our Embed and Rerank model families, used to represent the meaning of text or image inputs as a list of numbers and to index documents based on relevance to a user query. Common use cases include semantic search, clustering, and classification. These models cannot generate human-like text as an output.
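As a purely illustrative sketch of the "list of numbers" idea above, the Python snippet below ranks documents by how close their vectors are to a query vector. The vectors, document names, and the `cosine_similarity` and `rank_documents` helpers are all hypothetical and hand-written for this example; they are not outputs of, or calls to, any Cohere Model.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two equal-length vectors, from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs, most relevant first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors; a real embedding model would produce
# much longer lists of numbers.
DOC_VECS = {
    "refund_policy": [0.9, 0.1, 0.0],
    "office_hours": [0.1, 0.8, 0.3],
}
QUERY_VEC = [0.8, 0.2, 0.1]

ranking = rank_documents(QUERY_VEC, DOC_VECS)
print(ranking[0][0])  # id of the document closest to the query
```

Documents whose vectors point in a similar direction to the query vector score higher, which is the basic intuition behind semantic search.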
How Cohere Models are Trained
Our Models are developed in two key phases often called pre-training and post-training:
- Pre-training provides a base layer of foundational capabilities. It involves the analysis of a high volume of varied content. Before the analysis process, content is converted into tokens, which are short chunks of text or formatting characters of different lengths. The model processes the tokens and learns general patterns, syntax, and semantics across multiple languages.
- Post-training refines and improves specific capabilities. It involves training on smaller, focused datasets relevant to the specific types of tasks or capabilities that we want to improve. Human feedback as well as AI tools are often used during this phase to develop datasets or evaluate a model’s performance on a specific task. Cohere’s post-training process is focused on capabilities, like tool use, coding, safety, math, and multilingual capabilities, that optimize its Models for enterprise use.
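To make the idea of tokens more concrete, here is a minimal Python sketch of converting text into tokens and then into integer IDs. It is a simplified assumption for illustration only: it splits on words, whitespace, and punctuation, whereas production models generally use learned subword vocabularies. The `tokenize`, `build_vocab`, and `encode` helpers are hypothetical names, not part of any Cohere product.

```python
import re

def tokenize(text):
    """Split text into words, whitespace runs, and single punctuation marks."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Convert text into the list of token IDs a model would process."""
    return [vocab[token] for token in tokenize(text)]

sample = "Models learn patterns.\nModels learn syntax."
tokens = tokenize(sample)
vocab = build_vocab(tokens)
ids = encode(sample, vocab)

print(tokens)
print(ids)
```

Note that formatting characters such as spaces and the line break become tokens too, and that a repeated word maps to the same ID each time it appears.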
Collection of Personal Information
Cohere’s Models are trained on a proprietary mix of datasets from various sources including publicly available information, datasets developed by Cohere with human annotation or generated automatically with AI (i.e. synthetic data), and datasets sourced from specialised data vendors.
Cohere does not intentionally collect any personal information for the purpose of model training. Personal information is not particularly useful for the enterprise capabilities we train our Models for, and we take steps to minimise the possibility of any personal information being included in the mix of datasets we use for training. However, it is possible we may receive personal information in the following cases:
- Publicly available information on the web: Because the internet includes information about people, it is possible we may receive personal information when we use third party datasets that contain publicly available information. We take steps to remove any personal information we may have incidentally collected in this way before using content in training, like filtering out domains that are likely to contain high volumes of information about people (e.g. social media domains). If we collect information from the web directly ourselves, we take steps to ensure crawlers we use respect strict policies, like not accessing password-protected pages or content behind paywalls.
- Datasets sourced from third-party vendors: We take steps to ensure our vendors do not include personal information in datasets provided to us, or where that is not possible, we take steps to de-identify the information before any use for training purposes.
- Data from our Products: In most cases, Cohere customers use Cohere Products in their own environments or in third-party environments, meaning Cohere has no access to inputs submitted to its Models and other products. Where a user has given Cohere permission to use inputs and outputs for model training, we take steps to de-identify and strip personal information that may appear in inputs or outputs prior to use in model training. See our Privacy Policy for information about how we handle personal information on our Cohere API SaaS Platform.
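The kinds of steps described in the list above, such as filtering out domains and de-identifying text, could look roughly like the Python sketch below. This is an assumption-laden illustration and not a description of Cohere's actual pipeline: the blocklist entry `social.example`, the email pattern, and the `is_blocked`, `redact_emails`, and `prepare` helpers are all hypothetical, and real de-identification covers far more than email addresses.

```python
import re
from urllib.parse import urlparse

# Hypothetical blocklist of domains likely to hold information about people.
BLOCKED_DOMAINS = {"social.example"}

# Simplified email pattern, used here only as one example of personal data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_blocked(url):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)

def prepare(records):
    """Drop records from blocked domains, then redact what remains."""
    return [redact_emails(text) for url, text in records if not is_blocked(url)]

records = [
    ("https://docs.example/guide", "Contact jane.doe@example.com for access."),
    ("https://www.social.example/u/123", "Profile page text."),
]
cleaned = prepare(records)
print(cleaned)
```

In this sketch the record from the blocked domain is dropped entirely, while the remaining record is kept with the email address replaced by a placeholder.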
Privacy and Security Safeguards for Model Training
We implement measures throughout the AI development lifecycle to mitigate the risks of this process, including privacy and security risks. These measures include:
- Adherence to and implementation of ISO-certified controls for information security and AI management through our ISO 27001 and ISO 42001 certifications;
- Safety mitigations and evaluations to identify, assess, and mitigate risks of harms to individuals or society;
- Access controls to limit access to training data to only those personnel who require access to perform their functions;
- Supply chain controls for any third parties such as data vendors;
- Designing and testing our Models to perform well on real-world enterprise tasks, like providing accurate outputs in response to a user query.