<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>GuardML</title><description>Practical coverage of defensive AI engineering. Guardrails for LLMs, content filters and moderation pipelines, model defenses against adversarial attacks, output safety, and how to ship AI features without shipping liability with them.</description><link>https://guardml.io/</link><language>en</language><item><title>Constitutional AI Explained: How Principle-Based Training Builds Safer Models</title><link>https://guardml.io/posts/constitutional-ai/</link><guid isPermaLink="true">https://guardml.io/posts/constitutional-ai/</guid><description>Constitutional AI replaces human harm labels with a written set of principles and AI self-critique. Here is how the method works, where it sits in your</description><pubDate>Thu, 18 Jun 2026 00:00:00 GMT</pubDate><category>constitutional-ai</category><category>rlaif</category><category>model-alignment</category><category>llm-safety</category><category>guardrails</category><category>rlhf</category><author>GuardML Editorial</author></item><item><title>LLM Guardrails Explained: What They Are and How to Implement Them</title><link>https://guardml.io/posts/llm-guardrails-3/</link><guid isPermaLink="true">https://guardml.io/posts/llm-guardrails-3/</guid><description>A practitioner&apos;s guide to LLM guardrails — the five rail types, what each one actually catches, where each is bypassed, and how to wire a stack that fails</description><pubDate>Wed, 03 Jun 2026 00:00:00 GMT</pubDate><category>llm-guardrails</category><category>guardrails</category><category>content-filter</category><category>prompt-injection</category><category>defense-in-depth</category><category>tooling</category><author>GuardML Editorial</author></item><item><title>MCP Tool Poisoning: The Guardrail Layer Most Teams Are Missing</title><link>https://guardml.io/posts/mcp-tool-poisoning-defensive-guardrails/</link><guid isPermaLink="true">https://guardml.io/posts/mcp-tool-poisoning-defensive-guardrails/</guid><description>MCP makes every server an injection surface in your LLM app. Tool poisoning, rug-pulls, and the lethal trifecta are live. Here is what to actually defend.</description><pubDate>Sat, 30 May 2026 00:00:00 GMT</pubDate><category>mcp</category><category>prompt-injection</category><category>tool-poisoning</category><category>agent-security</category><category>guardrails</category><category>lethal-trifecta</category><author>GuardML Editorial</author></item><item><title>G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%</title><link>https://guardml.io/posts/g4-meromero-31b-uncensored-heretic-is-out-now-a-finetune-of/</link><guid isPermaLink="true">https://guardml.io/posts/g4-meromero-31b-uncensored-heretic-is-out-now-a-finetune-of/</guid><description>A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections — KL divergence 0.</description><pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate><category>bypass</category><category>abliteration</category><category>alignment</category><category>fine-tuning</category><category>content-filter</category><category>guardrails</category><author>GuardML Editorial</author></item><item><title>AI Moderation Tools for LLMs: What Works and What Gets Bypassed</title><link>https://guardml.io/posts/ai-moderation-tools/</link><guid isPermaLink="true">https://guardml.io/posts/ai-moderation-tools/</guid><description>A practitioner&apos;s comparison of AI moderation tools — AWS Bedrock Guardrails, Azure AI Content Safety, Lakera Guard, NeMo Guardrails, and Llama Guard —</description><pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate><category>guardrails</category><category>content-filter</category><category>llm-security</category><category>jailbreak</category><category>prompt-injection</category><category>tooling</category><author>GuardML Editorial</author></item><item><title>LLM Alignment Evaluation: Why Benchmarks Don&apos;t Predict Safety</title><link>https://guardml.io/posts/llm-alignment-2/</link><guid isPermaLink="true">https://guardml.io/posts/llm-alignment-2/</guid><description>Practitioners rely on alignment benchmarks that miss the attack surface that matters: agentic tasks, implicit harm, and low-resource languages.</description><pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate><category>alignment</category><category>evaluation</category><category>benchmarks</category><category>agentic-ai</category><category>multilingual</category><category>guardrails</category><author>GuardML Editorial</author></item><item><title>AI Safety Tools: A Guide to Guardrails, Filters, and Defenses</title><link>https://guardml.io/posts/ai-safety-tools/</link><guid isPermaLink="true">https://guardml.io/posts/ai-safety-tools/</guid><description>A practitioner&apos;s breakdown of the leading AI safety tools — NeMo Guardrails, LLM Guard, Llama Guard, and managed platforms — with benchmark data, known</description><pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate><category>guardrails</category><category>llm-security</category><category>ai-safety</category><category>content-filter</category><category>defense-in-depth</category><author>GuardML Editorial</author></item><item><title>KV Cache Compression Is Now an Alignment Problem</title><link>https://guardml.io/posts/weekly-how-to-compress-kv-cache-in-rl-post-training-shadow-mask-dis/</link><guid isPermaLink="true">https://guardml.io/posts/weekly-how-to-compress-kv-cache-in-rl-post-training-shadow-mask-dis/</guid><description>A new preprint argues that compressing KV cache during RL rollouts silently biases the policy you ship. For teams treating RLHF as a defensive control</description><pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate><category>alignment</category><category>rlhf</category><category>kv-cache</category><category>rlaif</category><category>defense-in-depth</category><category>training-infra</category><author>GuardML Editorial</author></item><item><title>ChatGPT Safety: How OpenAI&apos;s Guardrails Work and Fail</title><link>https://guardml.io/posts/chatgpt-safety/</link><guid isPermaLink="true">https://guardml.io/posts/chatgpt-safety/</guid><description>ChatGPT safety explained: how RLHF, Rule-Based Rewards, safe-completions, and the Moderation API work, plus the jailbreaks that defeat each layer.</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><category>chatgpt</category><category>guardrails</category><category>content-filter</category><category>jailbreak</category><category>alignment</category><category>bypass</category><author>GuardML Editorial</author></item><item><title>LLM Alignment: What It Does, Where It Breaks, How to Deploy</title><link>https://guardml.io/posts/llm-alignment/</link><guid isPermaLink="true">https://guardml.io/posts/llm-alignment/</guid><description>LLM alignment trains models to internalize safety constraints — but every technique has documented bypass paths.</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><category>alignment</category><category>rlhf</category><category>constitutional-ai</category><category>dpo</category><category>guardrails</category><category>fine-tuning</category><author>GuardML Editorial</author></item><item><title>LLM Guardrails: Comparing Tools and Implementation Patterns</title><link>https://guardml.io/posts/llm-guardrails-2/</link><guid isPermaLink="true">https://guardml.io/posts/llm-guardrails-2/</guid><description>A practical comparison of LLM guardrail implementations — classifiers, rule engines, LLM judges — with empirical bypass rates and deployment patterns that</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><category>llm-guardrails</category><category>guardrails</category><category>content-filter</category><category>prompt-injection</category><category>tooling</category><category>defense-in-depth</category><author>GuardML Editorial</author></item><item><title>LLM Guardrails: Architecture, Bypasses, and What to Deploy</title><link>https://guardml.io/posts/llm-guardrails/</link><guid isPermaLink="true">https://guardml.io/posts/llm-guardrails/</guid><description>LLM guardrails are the control layer between a language model and the real world. This guide covers how they work, how they fail under adversarial</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><category>llm-guardrails</category><category>content-filter</category><category>prompt-injection</category><category>jailbreak</category><category>defense-in-depth</category><category>guardrails</category><author>GuardML Editorial</author></item><item><title>LLM Safety: What It Actually Means and How to Build It</title><link>https://guardml.io/posts/llm-safety/</link><guid isPermaLink="true">https://guardml.io/posts/llm-safety/</guid><description>LLM safety spans alignment training, inference-time guardrails, and external filters — each with known failure modes.</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><category>llm-safety</category><category>guardrails</category><category>alignment</category><category>jailbreak</category><category>defense-in-depth</category><category>content-filter</category><author>GuardML Editorial</author></item><item><title>LLM Security Tools: A Practical Guide to the Current Stack</title><link>https://guardml.io/posts/llm-security-tools/</link><guid isPermaLink="true">https://guardml.io/posts/llm-security-tools/</guid><description>A working guide to LLM security tools for 2026 — covering red-teaming frameworks, runtime guardrails, and observability layers, with honest notes on what</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><category>llm-security-tools</category><category>guardrails</category><category>red-teaming</category><category>prompt-injection</category><category>defense-in-depth</category><category>tooling</category><author>GuardML Editorial</author></item><item><title>Model Alignment: What It Is, How It Works, and Where It Fails</title><link>https://guardml.io/posts/model-alignment/</link><guid isPermaLink="true">https://guardml.io/posts/model-alignment/</guid><description>Model alignment trains AI systems to follow human intent rather than optimize for proxy metrics. Here&apos;s what the main techniques actually do, how they&apos;re</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><category>alignment</category><category>rlhf</category><category>reward-hacking</category><category>constitutional-ai</category><category>defense-in-depth</category><category>llm-security</category><author>GuardML Editorial</author></item><item><title>Content Moderation AI Tools: Benchmarks, Bypasses, and Deployment</title><link>https://guardml.io/posts/content-moderation-ai-tools/</link><guid isPermaLink="true">https://guardml.io/posts/content-moderation-ai-tools/</guid><description>A practitioner&apos;s comparison of leading content moderation AI tools — OpenAI Moderation, Azure AI Content Safety, Llama Guard 4, NeMo Guardrails, and more</description><pubDate>Sun, 10 May 2026 00:00:00 GMT</pubDate><category>content-moderation</category><category>guardrails</category><category>llm-safety</category><category>ai-tools</category><category>filtering</category><author>GuardML Editorial</author></item><item><title>AI Content Filter: Architecture, Bypasses, and Layered Defense</title><link>https://guardml.io/posts/ai-content-filter/</link><guid isPermaLink="true">https://guardml.io/posts/ai-content-filter/</guid><description>A practitioner&apos;s breakdown of AI content filter approaches — classifier-based, LLM-as-judge, and guard models — with honest coverage of bypass techniques</description><pubDate>Sat, 09 May 2026 00:00:00 GMT</pubDate><category>content-filter</category><category>guardrails</category><category>llm-safety</category><category>bypass</category><category>defense-in-depth</category><author>GuardML Editorial</author></item><item><title>Output Classification: A PII and Secrets Detector for LLM Apps</title><link>https://guardml.io/posts/output-classification-pii-secrets-detector/</link><guid isPermaLink="true">https://guardml.io/posts/output-classification-pii-secrets-detector/</guid><description>Most output filters catch the obvious cases and miss the long tail. Here&apos;s how to build an output classifier that&apos;s actually deployable in production.</description><pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate><category>output-filtering</category><category>pii-detection</category><category>secrets-detection</category><category>llm-security</category><category>detection-engineering</category><author>GuardML Editorial</author></item><item><title>Content Moderation Tools for LLMs: What Works and Where It Breaks</title><link>https://guardml.io/posts/content-moderation-tools/</link><guid isPermaLink="true">https://guardml.io/posts/content-moderation-tools/</guid><description>A practitioner&apos;s guide to the leading content moderation tools for LLM applications—OpenAI Moderation API, Llama Guard, Perspective API, and</description><pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate><category>content-moderation</category><category>guardrails</category><category>llm-safety</category><category>tooling</category><category>defense-in-depth</category><author>GuardML Editorial</author></item><item><title>OpenAI&apos;s Under-18 Principles: An Engineer Reads the Model Spec</title><link>https://guardml.io/posts/weekly-updating-our-model-spec-with-teen-protections/</link><guid isPermaLink="true">https://guardml.io/posts/weekly-updating-our-model-spec-with-teen-protections/</guid><description>OpenAI&apos;s December Model Spec adds Root-level Under-18 Principles that bind the model even against jailbreak framing.</description><pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate><category>model-spec</category><category>age-gating</category><category>guardrails</category><category>roleplay-jailbreak</category><category>policy-hierarchy</category><category>openai</category><author>GuardML Editorial</author></item><item><title>AI Content Moderation: How LLM Filters Work and Where They Break</title><link>https://guardml.io/posts/ai-content-moderation/</link><guid isPermaLink="true">https://guardml.io/posts/ai-content-moderation/</guid><description>A technical breakdown of AI content moderation for LLM applications — how classifier-based guardrails work, the bypass techniques that defeat them, and</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate><category>content-moderation</category><category>guardrails</category><category>llm-safety</category><category>jailbreak</category><category>llama-guard</category><category>defense-in-depth</category><author>GuardML Editorial</author></item><item><title>OpenAI&apos;s Under-18 Principles: What the New Model Spec Does</title><link>https://guardml.io/posts/updating-our-model-spec-with-teen-protections/</link><guid isPermaLink="true">https://guardml.io/posts/updating-our-model-spec-with-teen-protections/</guid><description>OpenAI&apos;s December 18 Model Spec adds Under-18 Principles, an age-prediction classifier, and real-time moderation across modalities.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate><category>guardrails</category><category>openai</category><category>model-spec</category><category>teen-safety</category><category>age-verification</category><category>moderation</category><author>GuardML Editorial</author></item></channel></rss>