Guardrail API

Background

Guardrail API: From Research to Technical Reality

Dr Wes Milam

A plain-language explanation of what I saw while studying prompt security since 2023, and why Guardrail had to exist.

Back to home

Most people talk about generative AI as if it has been around for a long time. It has not. The public only began interacting with modern large language models at scale in late 2022. That matters, because much of the confidence around “years of GenAI experience” is not grounded in reality. We have only had a short period of real-world exposure, and what security for this technology should look like is still evolving. What is clear is that more AI is not the solution to making AI more trustworthy.

I started researching prompt security in 2023 after early papers exposed prompt injection and jailbreaks. At the time, ChatGPT was the primary target simply because it was the only widely deployed system. Within months of its release, the first GenAI-powered crime platforms appeared. As access barriers fell for students and professionals, they also fell for criminals. Generative AI began accelerating traditional cyberattacks and introducing new risks that existing security models were not designed to handle.

What stood out in that early research was how often regulatory uncertainty was cited as a barrier. Organizations wanted to adopt AI, but they could not confidently control the risk. In 2024, I followed that thread into research on AI acceptance and why people resist systems they do not trust. In 2025, I turned fully toward building technical controls.

In the early days, prompt security mostly meant direct attacks. People tried to extract internal instructions, force unsafe outputs, or make models behave as malicious agents. That threat still exists, but widespread adoption of GenAI revealed a broader problem. Prompt security stopped being one thing and became a spectrum.

The first category is direct attacks against the model or platform itself. These attempts aim to expose protected information, alter behavior, or compromise the system. This is closest to traditional security thinking because the intent is obvious.

The second category is direct jailbreak persuasion. These attacks rely on framing rather than force. Requests are disguised as testing, role-play, or research to push a model outside its intended boundaries. This is what most people mean when they say “jailbreak,” and it is also where traditional filters begin to fail. For every malicious prompt attempt, there are thousands of legitimate prompts that look similar on the surface.

The third category is the long-term problem: indirect injection persuasion. This shifts prompt security away from hacking and toward social engineering. The risk is not a single bad answer. It is gradual influence over time. A model rejects an initial request, then encounters smaller, passive prompts that reduce resistance later. This mirrors how propaganda works on people through repeated exposure.

I am not claiming that a rejected prompt retrains a model in the moment. The practical issue is simpler. When a system repeatedly feeds a model manipulative language, workflows drift. Operators drift. Outputs drift, because the surrounding context keeps redefining what feels acceptable. Over time, models become easier to steer and people become more likely to accept risky behavior as normal.

That is the problem Guardrail was built to address. I initially built Guardrail to see if real-time screening against direct attacks was even possible. Traditional tools still matter, but language is not a fixed protocol. You cannot reliably secure against active intelligence using brittle pattern matching alone. Slight changes in phrasing are often enough to bypass static defenses.

This is why I moved enforcement to the API layer. I needed something that does not negotiate or infer intent. An API can be deterministic. It can apply the same rules every time, regardless of how convincing the language is.

In early 2025, I built a multi-prompt chain testing tool to examine this at scale. When you stop thinking in terms of single prompts and start looking at pressure over time, no frontier model consistently resisted. That is when intent verification became necessary. Not as another layer of intelligence, but as a way to pause, assess hypothetical harm, and avoid guessing when intent is unclear.

This is also why adding another model is not the answer. A safety layer that is itself an LLM is still persuadable. Asking one model to supervise another multiplies complexity without removing the core risk. Enforcement has to behave like infrastructure, not conversation.

As Guardrail evolved, the scope expanded. Multiple modalities, hidden text and code, confusable characters, agents, and streaming outputs all increased the surface area. Each hardening revealed another weakness. That is expected. Security is never finished.

This is why egress controls became non-negotiable. Even if a model drifts under pressure, outputs are what create real-world harm and liability. Guardrail monitors what goes into the model and what comes out of it. Ingress and egress operate independently and concurrently because they solve different problems. One protects models and workflows. The other protects organizations and people.

This is the intent model behind Guardrail API. Prompt security remains the unresolved center of GenAI risk. What goes in shapes behavior. What comes out affects trust, accountability, and real-world outcomes.

Want the full technical and GRC write-up? Read the full whitepaper.