Tag

#alignment

7 posts tagged alignment.

bypass

G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%

A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections — KL divergence 0.
May 15, 2026
alignment

LLM Alignment Evaluation: Why Benchmarks Don't Predict Safety

Practitioners rely on alignment benchmarks that miss the attack surface that matters: agentic tasks, implicit harm, and low-resource languages.
May 13, 2026
deep-dive

KV Cache Compression Is Now an Alignment Problem

A new preprint argues that compressing KV cache during RL rollouts silently biases the policy you ship. For teams treating RLHF as a defensive control...
May 11, 2026
guardrails

ChatGPT Safety: How OpenAI's Guardrails Work and Fail

ChatGPT safety explained: how RLHF, Rule-Based Rewards, safe-completions, and the Moderation API work, plus the jailbreaks that defeat each layer.
May 10, 2026
alignment

LLM Alignment: What It Does, Where It Breaks, How to Deploy

LLM alignment trains models to internalize safety constraints — but every technique has documented bypass paths.
May 10, 2026
defense-in-depth

LLM Safety: What It Actually Means and How to Build It

LLM safety spans alignment training, inference-time guardrails, and external filters — each with known failure modes.
May 10, 2026
alignment

Model Alignment: What It Is, How It Works, and Where It Fails

Model alignment trains AI systems to follow human intent rather than optimize for proxy metrics. Here's what the main techniques actually do, how they're...
May 10, 2026