Tag
#bypass
3 posts tagged bypass.
- bypass
G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%
A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections — KL divergence 0.
- guardrails
ChatGPT Safety: How OpenAI's Guardrails Work and Fail
ChatGPT safety explained: how RLHF, Rule-Based Rewards, safe-completions, and the Moderation API work, plus the jailbreaks that defeat each layer.
- content-filter
AI Content Filter: Architecture, Bypasses, and Layered Defense
A practitioner's breakdown of AI content filter approaches — classifier-based, LLM-as-judge, and guard models — with honest coverage of bypass techniques