Reinforcement learning from human feedback (RLHF)
Definition A training method in which people rate a model's answers and those ratings teach it to produce responses humans prefer—more helpful, better aligned with instructions, less harmful. A key step in turning raw language models into usable assistants.
In more depth
After initial training on text, models are refined by collecting human judgments on sample outputs and using them to steer the model toward preferred behavior. RLHF is a large part of why modern chatbots follow instructions and decline inappropriate requests. It also contributes to a known side effect: models tuned to please can sound confident and agreeable even when they are wrong.
Further reading: Wikipedia.
Related terms
Educational information, not legal advice. AI terminology and tools change quickly; definitions reflect usage as of the last-updated date. For what bar associations and courts actually require of lawyers using AI, see legalaicompliance.help and consult a licensed attorney in your jurisdiction.