ChatGPT Safety: How OpenAI's Guardrails Work and Fail
ChatGPT safety explained: how RLHF, Rule-Based Rewards, safe-completions, and the Moderation API work, plus the jailbreaks that defeat each layer.
ChatGPT safety refers to the set of mechanisms OpenAI has built to prevent the model from generating harmful content, assisting with weapons development, facilitating illegal activity, or behaving in ways that expose users and businesses to serious risk. For teams building products on top of ChatGPT, understanding what those mechanisms actually do — and which attack patterns reliably defeat them — is a prerequisite for responsible deployment. The official story and the adversarial reality diverge in several important places.
The Safety Stack: What OpenAI Has Built
OpenAI’s safety approach for ChatGPT operates across multiple layers, each addressing a different failure mode.
Training-time constraints are the foundation. Reinforcement Learning from Human Feedback (RLHF) shapes the model’s default behavior by rewarding compliant, helpful outputs and penalizing policy violations. On top of this, OpenAI layered Rule-Based Rewards (RBRs) ↗: a system that encodes explicit safety rules in plain language and uses them as a scoring signal during training. RBRs target edge cases where human preference data is sparse, noisy, or internally inconsistent — precisely the distribution where RLHF alone tends to underspecify behavior.
For GPT-5, OpenAI introduced safe-completions ↗, a training method that moves away from binary refuse-or-comply decisions. Instead of training the model to refuse entire request categories, safe-completions trains it to maximize helpfulness subject to policy constraints — producing a useful, bounded response where possible rather than a flat refusal. OpenAI reports that this approach reduced both safety failures on dual-use prompts and over-refusals on benign ones. The trade-off is that “nuanced compliance” is harder to audit than “hard refusal.”
Hardcoded behaviors are enforced at the model level regardless of what operators or users request. OpenAI’s Model Spec ↗ defines these as absolute limits: generating sexual content involving minors, providing actionable synthesis routes for CBRN weapons, and a small set of catastrophic-risk categories. No operator system prompt unlocks them. No API tier overrides them. The spec calls these “root-level prohibitions” — the floor below which no downstream configuration can go.
Softcoded behaviors occupy everything else. These defaults can be shifted by operators via the system prompt or by users with appropriate context. Adult content generation, detailed harm-reduction discussions for controlled substances, clinical detail in verified medical contexts — all of these are operator-configurable. The Model Spec makes the permission hierarchy explicit: operators inherit what OpenAI allows, users inherit what operators allow. If your system prompt doesn’t restrict a category, the model’s consumer-product defaults apply, which are calibrated for a general audience, not a specialized deployment.
The Moderation API provides a separate synchronous classifier that developers can run against both inputs and outputs. It returns structured confidence scores across harm categories — hate, self-harm, sexual, violence, and so on. It is not a substitute for training-time safety; it is a runtime checkpoint that catches content the model should have refused but didn’t. Critically, the Moderation API and the chat model operate on different internal representations and different training distributions. A response that scores clean on the Moderation API can still carry policy violations embedded in indirect language.
Consumer ChatGPT Safety Versus API Safety
A common point of confusion is whether “ChatGPT safety” and “OpenAI API safety” are the same thing. They share the same root-level prohibitions and the same underlying Model Spec, but the defaults differ. The consumer ChatGPT product applies softcoded behaviors calibrated for a broad, unauthenticated, general audience, including age-related defenses. The API hands the operator the system-prompt layer, which means an API application inherits the floor (root-level prohibitions that cannot be unlocked) but sets its own ceiling within what OpenAI allows for operators.
The practical consequence: code that behaves “safely” in the ChatGPT consumer app is not automatically safe in an API deployment, because the consumer app is doing softcoded work your system prompt now owns. If your product reaches minors, the consumer-app age defenses are not inherited by default. OpenAI’s Under-18 Principles in the December Model Spec describe the age-prediction and moderation layer the consumer product runs, and an API operator who needs equivalent protection has to build toward it explicitly.
The Bypass Landscape
Every layer described above has documented bypass patterns. ChatGPT’s safety mechanisms are mature enough to be extensively studied — which means the failure modes are public knowledge. The same failure modes recur across vendors, which is why our LLM guardrails architecture and bypasses guide treats them as a structural property of classifier-and-refusal stacks rather than an OpenAI-specific flaw.
Prompt engineering jailbreaks are the most systematically researched attack surface. A widely cited empirical study (arXiv 2305.13860 ↗) classified ten distinct jailbreak prompt patterns across three categories and demonstrated consistent policy evasion across 3,120 test cases spanning eight prohibited content scenarios, targeting both GPT-3.5 and GPT-4. The study frames the core mechanism precisely: jailbreaks exploit competing objectives — the model is trained to be helpful and to refuse harmful requests, and when adversarial prompts put those objectives in tension, the model sometimes resolves toward helpfulness. Roleplay framing, persona injection, and fictional context are persistent attack surfaces because they are also legitimate use cases. The training data cannot fully distinguish them.
Time Bandit is a more recent technique that exploits temporal and procedural ambiguity rather than roleplay framing. Discovered by researcher David Kuszmar and reported by BleepingComputer ↗, the method frames requests in historical timeframes while asking for information that requires current technical knowledge. A prompt might ask ChatGPT to describe how a process was handled “in 1789,” then request specific technical details that only exist today. The model, unable to consistently determine which version of its rules applies to historical contexts, has been observed providing malware code, weapons guidance, and controlled synthesis information. The jailbreak exploits what Kuszmar calls “procedural ambiguity” — uncertainty in how the model interprets and enforces its own policy under unfamiliar framing.
Multilingual and encoding attacks exploit a known gap in classifier coverage. Safety classifiers are trained predominantly on English-language data. Switching to a lower-resource language, or encoding the attack prompt in base64 or character substitution, degrades classifier confidence enough to slip past input-layer screening. This is not unique to ChatGPT — it affects most commercial classifier-based guardrails. AI-alert.org ↗ tracks disclosed jailbreaks and LLM vulnerability disclosures as they emerge, which is useful for teams who want current signal rather than relying only on academic benchmarks.
API-level bypass rates vary by model version. Recent adversarial testing found GPT-5-mini could be tricked roughly half the time on targeted attack prompts; older models showed higher rates. The pattern is consistent: safety training shifts the output distribution but does not enforce a hard boundary. A sufficiently adversarial prompt is a search problem — the attacker iterates against a fixed target; the model cannot adapt at inference time.
Deployment Recommendations
The native ChatGPT safety stack is a starting point. For any application that accepts free-form user input, that starting point leaves residual risk that operators are responsible for closing.
Apply the Moderation API bidirectionally. Run it on inputs before they reach the model, and run it on outputs before they reach the user. Skipping input-side screening means relying entirely on model-level refusals to handle adversarial prompts. Skipping output-side screening means a successful bypass produces no detection event. For the output side specifically, an output classifier that detects PII and secrets catches the data-leak failure mode the Moderation API’s harm categories were never designed to flag.
Layer additional classifiers. The Moderation API, model-level refusals, and a secondary classifier model — Llama Guard, Azure Prompt Shield, or a custom fine-tuned model — are three separate control points with different training distributions. A prompt that defeats one may not defeat two. Defense-in-depth ↗ here is not paranoia; it is the standard architecture for anything with meaningful abuse surface. Our roundup of LLM security tools maps the current options for each of these control points.
Write a narrow system prompt. ChatGPT’s softcoded defaults are calibrated for general-purpose consumer use. If your application has a specific scope, restrict it explicitly. Name the topics that are out of scope. Define the persona tightly. The more constrained the system prompt, the less behavioral surface area the model has to be manipulated across.
Monitor outputs, not just refusals. A refusal is visible. The more dangerous case is a bypass that produces a compliant-looking but policy-violating response. Log full responses, score them with your classifier stack, and alert on anomalous outputs even when no single check fired. The observability principles used for model drift detection apply directly to safety monitoring — output distributions should be stable, and deviation is signal. SentryML ↗ covers the ML monitoring layer that makes ongoing safety surveillance tractable.
Red-team ↗ before you ship. The jailbreak research literature is public. The ten prompt patterns in arXiv 2305.13860 and Time Bandit’s temporal framing technique represent known, documented attacks. If your configuration cannot withstand 2023-vintage attacks, you will not withstand novel ones either.
Sources
- OpenAI Model Spec (December 2025) ↗ — Authoritative specification of hardcoded versus softcoded ChatGPT behaviors and the operator/user permission hierarchy.
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study (arXiv 2305.13860) ↗ — Systematic empirical analysis of 3,120 jailbreak prompts across 8 prohibited scenarios; identifies 10 distinct bypass pattern categories across GPT-3.5 and GPT-4.
- Time Bandit ChatGPT Jailbreak Bypasses Safeguards on Sensitive Topics — BleepingComputer ↗ — Documents the temporal confusion bypass technique used to extract weapons guidance and malware instructions from ChatGPT.
- From Hard Refusals to Safe-Completions: GPT-5 Safety Training — OpenAI ↗ — OpenAI’s technical post on the shift from binary refusals to output-centric safety training, including reported reductions in both safety failures and over-refusals.
Sources
GuardML — in your inbox
Defensive AI — guardrails, content filters, model defenses, safe deployment. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
G4-MeroMero-31B: Abliteration Drops Refusal Rate 99% to 15%
A new uncensored fine-tune of Gemma 4 31B achieves a 15/100 refusal rate via Arbitrary-Rank Ablation on attention output projections — KL divergence 0.
LLM Safety: What It Actually Means and How to Build It
LLM safety spans alignment training, inference-time guardrails, and external filters — each with known failure modes.
LLM Guardrails: Architecture, Bypasses, and What to Deploy
LLM guardrails are the control layer between a language model and the real world. This guide covers how they work, how they fail under adversarial