How RLHF Works
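RLHF (reinforcement learning from human feedback) usually starts by collecting human comparisons between model outputs. Those preferences are used to train a reward model, or a similar feedback signal, that scores outputs; the model is then fine-tuned to produce responses that score higher.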
- Collect prompts and candidate responses
- Rank or score outputs with human reviewers
- Use the collected feedback to train a reward model or to fine-tune the model directly
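
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production recipe: it assumes feedback arrives as pairwise comparisons and uses the Bradley-Terry style loss that is common for reward models. The names `Comparison`, `toy_reward`, `pairwise_loss`, and `mean_loss` are hypothetical, and the one-parameter "reward model" that scores responses by word count stands in for what would normally be a neural network.

```python
import math
from dataclasses import dataclass


@dataclass
class Comparison:
    """One human judgment: for `prompt`, reviewers preferred `chosen` over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def toy_reward(prompt: str, response: str, weight: float) -> float:
    """Hypothetical stand-in for a reward model: weight times the response word count."""
    return weight * len(response.split())


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Equivalent to softplus(-margin), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_loss(data: list[Comparison], weight: float) -> float:
    """Average pairwise loss of the toy reward model over all comparisons."""
    losses = [
        pairwise_loss(
            toy_reward(c.prompt, c.chosen, weight),
            toy_reward(c.prompt, c.rejected, weight),
        )
        for c in data
    ]
    return sum(losses) / len(losses)


# Illustrative data (made up): reviewers preferred the more detailed answer.
data = [
    Comparison("What is RLHF?", "A method that tunes models using human feedback", "A method"),
    Comparison("Define reward model", "A model that scores outputs by preference", "Unsure"),
]

# A reward model that agrees with the reviewers gets a lower loss.
print(mean_loss(data, weight=0.0), mean_loss(data, weight=1.0))
```

The loss is small when the reward model scores the preferred response well above the rejected one and large when it ranks them the wrong way round, so minimizing it pushes the reward model toward the reviewers' rankings. Fine-tuning the language model against that reward is a separate step not shown here.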