Training data

How models work · Last reviewed 2026-07-01

Definition

The collection of text, images, or other examples that a machine-learning model learns from. A model's knowledge, abilities, and biases all derive from what its training data did and did not contain.

In more depth

For large language models, training data typically includes web pages, books, code, and licensed datasets gathered up to a cutoff date. The model knows nothing about events or material after that date unless the information is supplied to it, for example in a prompt. Errors, gaps, and biases in the data surface as errors, gaps, and biases in the model. The provenance of training data is the subject of ongoing copyright litigation against AI developers. Whether user inputs become training data is a central confidentiality question when evaluating any AI tool.

About the editor: MHSB Solutions, Research desk. MHSB Solutions is not a law firm. This glossary is educational information, not legal advice.

AI terminology and tools change quickly; definitions reflect usage as of the last-reviewed date. For what bar associations and courts actually require of lawyers using AI, see legalaicompliance.help and consult a licensed attorney in your jurisdiction.