Glossary

RLHF

RLHF, or reinforcement learning from human feedback, is a model-alignment method that uses human preference judgments to teach AI systems which outputs are more helpful, accurate, safe, or appropriate. The quality of the human feedback strongly affects the quality of the aligned model.

Request a Model Quality Audit Explore Corpus

How RLHF Works

RLHF usually starts with collecting human comparisons between model outputs. Those preferences train a reward model or feedback signal that guides the model toward better behavior.

Collect prompts and candidate responses
Rank or score outputs with human reviewers
Use feedback for reward modeling or alignment

Why Expert Feedback Matters

Generic crowd feedback is often not enough for regulated or technical domains. Expert reviewers catch factual, safety, and domain-specific failures that broad review pools miss.

Medical and legal review
Safety and refusal calibration
Domain-specific preference rubrics