Training data
Updated June 10, 2026
The corpus a model learns from — and the explanation for nearly every AI writing habit you can name.
Definition
Training data is the text a large language model (LLM) learns from: web pages, books, articles, code, and forums, adding up to trillions of tokens. The model absorbs the statistical patterns of that corpus, and everything it writes is a recombination of patterns it saw. No experiences, no opinions: just text about text.
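To make "absorbing statistical patterns" concrete, here is a minimal sketch of a bigram model, a toy that only counts which word follows which. It is nowhere near how a real LLM is built, and the names train_bigram and generate, along with the tiny corpus, are made up for illustration. The point is only that every word pair it writes is one it already saw.

```python
import random
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows which in the training text."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, start, length, seed=0):
    """Sample a continuation one word at a time from the learned counts."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        followers = counts.get(output[-1])
        if not followers:
            break  # the last word never had a successor in the corpus
        choices, weights = zip(*followers.items())
        output.append(rng.choices(choices, weights=weights)[0])
    return output

# Hypothetical training corpus, far smaller than anything real.
corpus = "the model reads the text and the model writes the text"
counts = train_bigram(corpus)
sample = generate(counts, start="the", length=8)
print(" ".join(sample))
```

Whatever sentence comes out, each adjacent pair of words in it appears somewhere in the corpus. The toy can reorder what it saw, but it cannot reach for a word or a transition that was never there.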
How it explains AI style
The overused words ("delve", "crucial") are words that fit almost anywhere: statistically safe in any context, distinctive in none. The template structure mirrors the average shape of explanatory web prose. Post-training (instruction tuning and reinforcement learning from human feedback, or RLHF) adds the diplomatic, both-sides voice. Models trained on similar data with similar methods converge on similar style, which is one reason detectors often generalize across model families.
The loop nobody planned
As AI text floods the web, it becomes the next generation's training data — models learning from models, amplifying their own habits. Researchers call the degenerate end-state "model collapse". For writers the practical takeaway is simpler: distinctly human writing — experience, stance, voice — keeps getting scarcer in the training distribution, and therefore keeps getting more valuable on the page.
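The feedback loop can be sketched with a toy simulation. It assumes a deliberately crude "model" that only memorizes token frequencies and then writes a new corpus by sampling from them; the function next_generation and the made-up corpus of common and rare tokens are illustrative, not a reproduction of any published experiment. Rare tokens that happen not to be sampled in one generation can never come back in the next.

```python
import random
from collections import Counter

def next_generation(tokens, rng):
    """'Train' on tokens by recording frequencies, then sample a new corpus of the same size."""
    freq = Counter(tokens)
    vocab = list(freq)
    weights = [freq[t] for t in vocab]
    return rng.choices(vocab, weights=weights, k=len(tokens))

rng = random.Random(42)
# Hypothetical "human" corpus: one very common token and fifty rare ones.
corpus = ["common"] * 50 + [f"rare{i}" for i in range(50)]

# Track how many distinct tokens survive each round of models learning from models.
diversity = [len(set(corpus))]
for _ in range(10):
    corpus = next_generation(corpus, rng)
    diversity.append(len(set(corpus)))
print(diversity)
```

The count of distinct tokens can only stay flat or fall from one generation to the next, because each new corpus is drawn entirely from the previous one. The common token crowds out the rare ones, which is a small-scale picture of habits being amplified while the unusual material disappears.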