Training Data
The text and information used to train LLMs, which shapes what they know about your brand.
Training data is the massive corpus of text that Large Language Models (LLMs) learn from. It includes websites, articles, books, documentation, and other content, much of it collected from the public web.
What's in the training data matters for your brand because:
- LLMs form "opinions" based on what they've learned
- If your brand isn't in the training data, a model answering from training alone won't know about you
- Negative content in training data can affect how AI describes you
- Training data has a cutoff date, so brands or content that appeared after it may not be included
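The cutoff point above comes down to a date comparison. The sketch below is illustrative only: the function name and all dates are hypothetical, and being published before a cutoff makes inclusion possible, not guaranteed.

```python
from datetime import date


def could_be_in_training_data(first_published: date, knowledge_cutoff: date) -> bool:
    """Return True if content existed on or before the model's cutoff date.

    This is a necessary condition only: the content must also have been
    collected and kept by whoever assembled the training corpus.
    """
    return first_published <= knowledge_cutoff


# Hypothetical dates, chosen only to illustrate the comparison.
model_cutoff = date(2023, 4, 30)
older_brand_launch = date(2019, 6, 1)
newer_brand_launch = date(2024, 1, 15)

print(could_be_in_training_data(older_brand_launch, model_cutoff))
print(could_be_in_training_data(newer_brand_launch, model_cutoff))
```

A brand that fails this check can still show up in answers from systems that use real-time retrieval.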
You can influence future training data by:
- Creating high-quality, authoritative content
- Getting mentioned on reputable sites
- Ensuring your messaging is clear and consistent
- Building a strong online presence that crawlers can index
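One practical check on the last point is whether your robots.txt lets AI crawlers in at all. The sketch below uses Python's standard `urllib.robotparser` against an inline, hypothetical robots.txt. The crawler names are user-agent tokens that some AI companies have published; treat the list as an assumption and confirm current names in each vendor's documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot everywhere, and blocks
# every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Illustrative list of AI crawler user-agent tokens (an assumption;
# verify against each vendor's own documentation).
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot"]


def crawler_access(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Report, per user agent, whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(ROBOTS_TXT, "https://example.com/about", AI_CRAWLERS)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()` to download the file. Note that robots.txt is a request that well-behaved crawlers honor, not an enforcement mechanism.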
Note: Training data is different from real-time retrieval (like Perplexity's web search), which fetches current web pages when a question is asked. Some AI systems use both.
Related Terms
LLM
Large Language Model: an AI system such as GPT-4, Claude, or Gemini that powers conversational AI assistants.
Knowledge Cutoff
The date after which an LLM has no training data — it doesn't know about events or content after this date.
AI Crawler
Bots that AI companies use to index web content for training data or real-time retrieval.