Tag
#rlhf
4 posts tagged rlhf.
- deep-dive
Constitutional AI Explained: How Principle-Based Training Builds Safer Models
Constitutional AI replaces human harm labels with a written set of principles and AI self-critique. Here is how the method works, where it sits in your
- deep-dive
KV Cache Compression Is Now an Alignment Problem
A new preprint argues that compressing KV cache during RL rollouts silently biases the policy you ship. For teams treating RLHF as a defensive control
- alignment
LLM Alignment: What It Does, Where It Breaks, How to Deploy
LLM alignment trains models to internalize safety constraints — but every technique has documented bypass paths.
- alignment
Model Alignment: What It Is, How It Works, and Where It Fails
Model alignment trains AI systems to follow human intent rather than optimize for proxy metrics. Here's what the main techniques actually do, how they're