Tag
#alignment
7 posts tagged alignment.
- bypass
G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%
A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections — KL divergence 0.
- alignment
LLM Alignment Evaluation: Why Benchmarks Don't Predict Safety
Practitioners rely on alignment benchmarks that miss the attack surface that matters: agentic tasks, implicit harm, and low-resource languages.
- deep-dive
KV Cache Compression Is Now an Alignment Problem
A new preprint argues that compressing KV cache during RL rollouts silently biases the policy you ship. For teams treating RLHF as a defensive control
- guardrails
ChatGPT Safety: How OpenAI's Guardrails Work and Fail
ChatGPT safety explained: how RLHF, Rule-Based Rewards, safe-completions, and the Moderation API work, plus the jailbreaks that defeat each layer.
- alignment
LLM Alignment: What It Does, Where It Breaks, How to Deploy
LLM alignment trains models to internalize safety constraints — but every technique has documented bypass paths.
- defense-in-depth
LLM Safety: What It Actually Means and How to Build It
LLM safety spans alignment training, inference-time guardrails, and external filters — each with known failure modes.
- alignment
Model Alignment: What It Is, How It Works, and Where It Fails
Model alignment trains AI systems to follow human intent rather than optimize for proxy metrics. Here's what the main techniques actually do, how they're