Reinforcement learning from human feedback (RLHF)

How models work Last reviewed 2026-07-01

Definition A training method in which people rate a model's answers and those ratings teach it to produce responses humans prefer—more helpful, better aligned with instructions, less harmful. A key step in turning raw language models into usable assistants.

In more depth

After initial training on text, models are refined by collecting human judgments on sample outputs and using them to steer the model toward preferred behavior. RLHF is a large part of why modern chatbots follow instructions and decline inappropriate requests. It also contributes to a known side effect: models tuned to please can sound confident and agreeable even when they are wrong.

Related terms

About the editor: MHSB Solutions, Research desk. MHSB Solutions is not a law firm. This glossary is educational information, not legal advice.

Educational information, not legal advice. AI terminology and tools change quickly; definitions reflect usage as of the last-updated date. For what bar associations and courts actually require of lawyers using AI, see legalaicompliance.help and consult a licensed attorney in your jurisdiction.