Training data

Updated June 10, 2026

The corpus a model learns from, and the source of most AI writing habits you can name.

Definition

Training data is the text an LLM learns from: web pages, books, articles, code, forums, often trillions of tokens in total. The model absorbs the statistical patterns of that corpus, and everything it writes is a recombination of patterns it saw. No experiences, no opinions: just text about text.
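To make "recombination of patterns" concrete, here is a minimal sketch of a bigram model, a toy that is far simpler than a real LLM. It records which word followed which in a tiny made-up corpus, then writes by replaying those pairs. The corpus string and the function names `train_bigram` and `generate` are illustrative assumptions, not part of any real library.

```python
import random
from collections import defaultdict


def train_bigram(corpus):
    """Record, for every word, the words that followed it in the corpus."""
    followers = defaultdict(list)
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, start, length, seed=0):
    """Write up to `length` words by repeatedly picking a seen follower."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    out = [start]
    for _ in range(length - 1):
        options = followers.get(out[-1])
        if not options:  # the last word never had a follower in training
            break
        out.append(rng.choice(options))
    return out


# Hypothetical toy corpus, standing in for trillions of tokens.
corpus = "the model reads text and the model writes text about text"
followers = train_bigram(corpus)
sample = generate(followers, "the", 8)
print(" ".join(sample))
```

Every adjacent word pair in the output already appeared in the corpus. The toy can reshuffle what it saw, but it cannot produce a pairing it never saw.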

How it explains AI style

The overused words ("delve", "crucial") are words that fit almost anywhere: statistically safe in any context, distinctive in none. The template structure mirrors the average shape of explanatory web prose. Post-training (instruction tuning, RLHF) adds the diplomatic, both-sides voice. Models trained on similar data with similar methods converge on similar style, which is part of why detectors tend to generalize across model families.

The loop nobody planned

As AI text floods the web, it becomes the next generation's training data — models learning from models, amplifying their own habits. Researchers call the degenerate end-state "model collapse". For writers the practical takeaway is simpler: distinctly human writing — experience, stance, voice — keeps getting scarcer in the training distribution, and therefore keeps getting more valuable on the page.
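The feedback loop can be sketched as a toy simulation. This is a simplified illustration under stated assumptions, not the setup used in the research on model collapse. Each "generation" samples a small amount of text from the previous generation's word distribution, then re-estimates the distribution from that sample alone. The starting distribution, the sample size, and the function names are all hypothetical.

```python
import random
from collections import Counter


def next_generation(dist, sample_size, rng):
    """Sample words from `dist`, then re-estimate it from that sample alone."""
    words = list(dist)
    weights = [dist[w] for w in words]
    sample = rng.choices(words, weights=weights, k=sample_size)
    counts = Counter(sample)
    # A word that was not sampled gets no entry, so it can never come back.
    return {w: c / sample_size for w, c in counts.items()}


def simulate(dist, generations, sample_size, seed=0):
    """Run the train-on-your-own-output loop, tracking vocabulary size."""
    rng = random.Random(seed)
    vocab_sizes = [len(dist)]
    for _ in range(generations):
        dist = next_generation(dist, sample_size, rng)
        vocab_sizes.append(len(dist))
    return dist, vocab_sizes


# Hypothetical "human" distribution: two common words and thirty rare ones.
human = {"the": 0.4, "and": 0.3}
human.update({f"rare_{i}": 0.3 / 30 for i in range(30)})

final, vocab_sizes = simulate(human, generations=20, sample_size=25)
print(vocab_sizes)
```

The vocabulary can only shrink from one generation to the next. A word missing from one generation's sample has zero probability in every later one, and the rare words are the most likely to go missing, so the loop forgets them first.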
