From API Gateway to AI Gateway: When Infrastructure Is Stretched by Tokens

The more often I wire an LLM call into a product, the less it looks like just calling another API.

In the beginning everyone treats it that way. The frontend fires a request, the backend grabs the parameters, calls some model vendor's endpoint, ships the result back. On an architecture diagram that line is indistinguishable from any third-party integration we have done for the past fifteen years. One arrow in, one arrow out, with maybe a rate limiter and an auth check in the middle.

Then the trouble starts. Finance shows up because the bill chart has gotten interesting. For the same endpoint and the same user, latency wanders badly enough that nobody can hold a product rhythm against it. Security walks over to ask whether anyone is reading what users are typing in, and how sensitive content gets blocked. The business side wants to ship a slice of traffic to a different vendor, just to see how it performs.

None of these questions started inside the gateway's view. They each grew up in their own corner, got handled once, and only later did people realize that they were the same class of problem—just never centralized.

That is what this appendix is really about. The LLM call, as a new shape of request, is forcing a new piece of infrastructure into existence. Today people call it the AI Gateway. The name is not what matters. What matters is that the problems it has to solve are structural: not features of any one product, but the situation any team eventually walks into once they put AI behind their business.

1. The shape of the call has changed

An LLM call is HTTP at the protocol level. It is not HTTP at the behavior level.

The header still says Content-Type: application/json. The body is still JSON. In the OSI seven-layer view, it looks like every other REST call. But push one layer in and you find a few things in this kind of call that traditional API calls have never had.

The input is no longer opaque. For a normal API, the request body is a black box to the gateway—it does not need to look at it, and it cannot anyway. An LLM request body has the user's question written directly inside it. It may contain sensitive data, prompt-injection templates, or jailbreak attempts. Whether the gateway should look at the body, and how deep it should look, is itself a new question.
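To make "looking at the body" concrete, here is a deliberately shallow sketch of what the first level of inspection might look like. It assumes a chat-style request body with a messages list of role/content pairs, and the two patterns are illustrative placeholders, not a real defense; keyword matching is easy to evade, which is part of why "how deep" is an open question.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
# Real prompt-injection detection needs far more than keyword matching.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]


def screen_messages(body: dict) -> list:
    """Return the patterns that matched any user message in the request body."""
    hits = []
    for message in body.get("messages", []):
        if message.get("role") != "user":
            continue  # only user-supplied text is screened in this sketch
        text = str(message.get("content", ""))
        for pattern in SUSPICIOUS_PATTERNS:
            if pattern.search(text):
                hits.append(pattern.pattern)
    return hits
```

A gateway could log the hits, block the request, or hand it to a heavier classifier; which of those is appropriate is a policy decision this sketch does not make.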

The output is no longer deterministic. For a normal API the response is generated by code; given the same input and the same state, you get the same output. An LLM response is sampled from a probability distribution. The same input run twice can produce content that diverges entirely, length that varies by a factor of several, and quality that varies even more.

The cost is no longer flat. A normal API call's cost tracks the number of calls and is roughly stable. An LLM call's cost tracks the number of tokens, and token count depends on the prompt length, the context length, the depth of reasoning, and the length of generation. The same endpoint, the same user, two adjacent calls—the dollar figures on the bill can sit in entirely different orders of magnitude.
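The arithmetic behind that gap is simple enough to sketch. The prices below are assumed numbers chosen for illustration, not any vendor's real rates, and the Price and call_cost names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical per-million-token prices in USD (assumed, not real rates)."""
    input_per_mtok: float
    output_per_mtok: float


def call_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens in each direction times that direction's unit price."""
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


PRICE = Price(input_per_mtok=3.0, output_per_mtok=15.0)  # assumed numbers

# Same endpoint, same user, two adjacent calls.
small_talk = call_cost(PRICE, input_tokens=40, output_tokens=60)
long_context = call_cost(PRICE, input_tokens=120_000, output_tokens=4_000)
ratio = long_context / small_talk
```

With these assumed prices the long-context call costs roughly 400 times the small-talk call, even though a request counter would record them as two identical hits.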

Put those three together and the call has stopped being a deterministic request and become a piece of thinking that has to be understood before it can be scheduled.

The infrastructure of the cloud era was built on a stable triangle: compute, network, data. Each resource has its own unit. vCPU-seconds, GB of egress traffic, GB-month of storage. Every layer of infrastructure had a sensible meter against it. Tokens are the first new unit of metering since the cloud era to cut across all three dimensions at once. Tokens are compute (inference is computation). Tokens are network (every call is a cross-domain RPC). Tokens are data (context, cache, embeddings are all priced in tokens). Once a single resource straddles all three dimensions, every governance tool built around the old triangle—capacity planning, rate limiting, quotas, billing, SLOs—has to be reworked in front of tokens.

2. The assumptions of the old infrastructure have shifted

The phrase API gateway has been with us for almost twenty years. It did not start as a standalone product. It started as a reverse proxy with a few common features bolted on: routing, rate limiting, auth, logging. Once microservices took off and the service count exploded, rewriting this handful of features inside every service became wasteful and error-prone, so they got pulled out into a layer in front of everything and given a name. The gateway grew from there (protocol translation, canary rollouts, circuit breaking, observability), but every one of these is the same kind of concern radiating outward. At its core, it has always been about managing requests and responses.

Its working model is clean. A request arrives. Is the user real? Is the permission there? Is there quota left? Which backend should it land on? What is the timeout? How are errors reported? Walk through the rules, let the request through. The backend returns a response, send it back the way it came. The whole loop is rule-based. It does not need semantic understanding of the body. It does not need to care what is written in the response.

The entire value of an API gateway rests on a single precondition: requests and responses are deterministic, describable by rules, and opaque to the gateway itself.

That precondition is what lets the API gateway be infrastructure. It does not need to understand the business, so the business can change without breaking the proxy. It does not need to understand the content, so encrypted, compressed, or binary payloads all flow through fine. From SOAP to REST to gRPC, the protocols changed; the precondition did not. The API gateway has always been about handling more calls, not different kinds of calls.

LLM calls break both ends of that precondition at the same time.

They break opaque content. Without inspecting the body, you cannot route correctly. The same /v1/chat/completions endpoint should land on different backends depending on what was actually asked—small talk versus complex reasoning, factual lookup versus long-context code understanding, generic question versus a vertical-domain ask. The capability gap between models is large, the price gap is large, and you cannot pick the right backend without understanding the request.
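A minimal sketch of content-based routing follows. Everything in it is an assumption made for illustration: the backend names and context limits are invented, the keyword hints stand in for a real classifier (which in practice may itself be a model call), and the token estimate uses a rough four-characters-per-token rule of thumb instead of the target model's tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str         # hypothetical label, not a real model ID
    max_context: int  # tokens the backend is assumed to accept


SMALL = Backend("small-cheap", 8_000)
LARGE = Backend("large-reasoning", 32_000)
LONG = Backend("long-context", 200_000)

# Crude stand-in for a request classifier.
REASONING_HINTS = ("prove", "step by step", "debug", "refactor", "why does")


def estimate_tokens(text: str) -> int:
    """Rough estimate (about four characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def classify(prompt: str) -> str:
    """Label the request by what it asks for, not by its URL."""
    if estimate_tokens(prompt) > LARGE.max_context:
        return "long_context"
    lowered = prompt.lower()
    if any(hint in lowered for hint in REASONING_HINTS):
        return "reasoning"
    return "small_talk"


ROUTES = {"small_talk": SMALL, "reasoning": LARGE, "long_context": LONG}


def route(prompt: str) -> Backend:
    """Pick a backend for a prompt that arrived on the same endpoint as every other."""
    return ROUTES[classify(prompt)]
```

All three kinds of request arrive at the same path; only the body distinguishes them, which is exactly what a path-based router cannot see.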

They break deterministic responses. Responses come out of a distribution. Length floats. Quality floats. Cost floats. The traditional gateway tools—uniform timeouts, per-call billing, rule-based traffic splits—do not survive contact with that.
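One way to see why a uniform timeout stops working is to derive the deadline from the request itself. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the time-to-first-token, throughput, and slack defaults are made-up numbers that a real gateway would measure per model.

```python
def deadline_seconds(
    max_output_tokens: int,
    time_to_first_token_s: float = 2.0,   # assumed, measure per backend
    tokens_per_second: float = 50.0,      # assumed decode throughput
    slack: float = 1.5,                   # assumed safety multiplier
) -> float:
    """Scale the timeout with the longest generation the request allows."""
    if max_output_tokens < 0 or tokens_per_second <= 0:
        raise ValueError("max_output_tokens must be >= 0 and tokens_per_second > 0")
    return slack * (time_to_first_token_s + max_output_tokens / tokens_per_second)
```

Under these assumed defaults a request capped at 100 output tokens gets a 6 second deadline and one capped at 4,000 gets 123 seconds. Any single fixed value either cuts off the second or lets the first hang far too long.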

This is not more demand on the same infrastructure. The foundation has shifted. When a new shape of call rips open both opacity and determinism at once, it stops being something the old infrastructure can simply extend over. Something new has to sit on top of it.

3. New infrastructure grows on top of old infrastructure

This pattern repeats across infrastructure history.

The bare-metal era had ops scripts and monitoring. Virtualization broke through that ceiling, and we got vCenter and OpenStack. Containers broke through vCenter, and we got Kubernetes. None of those were replacements. Bare metal is still here. VMs are still here. Containers are still here. Each transition added a new layer on top of the previous one to handle the new problems brought by a new unit of granularity.

The AI Gateway is in exactly this kind of relationship with the API gateway.

What the API gateway used to do is still happening. An LLM request still has to terminate TLS, still has to be authenticated, still has to be rate-limited and logged. None of that has gone anywhere, and there is no reason to relocate it. The AI Gateway handles the additional layer: semantic routing across multiple models, token-level metering and quotas, prompt-injection defense and sensitive-content filtering, context-aware semantic caching, and pulling token drift, model availability, and call quality into the observability stack. Stack the two layers together, and you can move an LLM call from "we got it working in dev" to "we are willing to run it in production for a year."

The incumbents are growing into this. Kong AI Gateway and Higress AI are the API-gateway lineage extending into AI calls. Newer companies are forming directly around AI-call governance—Portkey is the canonical example. Different starting points, different paths, but the problem they are pulling toward is the same. Who consolidates is still open. That a layer of this shape needs to exist is no longer the question.

Step back, and the causality is not unique to the AI era. After HTTP standardized the shape of web calls, WAF, CDN, and Ingress grew up as the governance layer. After gRPC standardized microservice calls, Service Mesh grew up. Every new shape of call, on its way to maturity, follows the same three-stage path: protocol layer → governance layer → ingress layer. This path has been walked several times. The cadence has been similar each time: the protocol stabilizes, the governance layer appears within a year or two, and over a longer period it gets pulled into a unified ingress. The AI Gateway accelerating right now is not coincidence. It is the same path repeating.

4. The fulcrum is no longer traffic, it is context

If you stop here, the AI Gateway still sounds like an extended traditional gateway with a few new plugins bolted on. Push one layer further in and the fulcrum has actually moved.

The traditional API gateway's fulcrum is traffic. Almost everything it does is built on "I see one request and one response." Rate limiting counts requests. Auth reads headers. Routing reads paths. Logging records the pair. The relationship between calls is outside its field of view. Every call is independent. There is no state between the previous call and this one.

The AI Gateway's fulcrum is context. Many of the things it does are not about managing this single request. They are about managing the state across calls.

Take a few concrete cases. Caching in a traditional gateway works at the URL level: same URL, same query parameters, hit. The AI Gateway does not work that way. It has to hit on contextual similarity. Two questions phrased differently but semantically close should let the second one reuse the first answer. Doing that requires the gateway itself to perform an act of understanding. URL alone is not enough. Quota management in a traditional gateway means counting requests; in the AI Gateway it means counting tokens, and token count depends on how much context is hanging off this request and how the model's tokenizer is going to slice it. You cannot compute that without some understanding of the body. Security audit in a traditional gateway looks at the auth token in the header and checks IP allowlists; the AI Gateway has to look at the prompt itself, identify whether prompt injection is hiding inside, whether jailbreak templates are present, whether the system message is being coaxed out. Every one of these is no longer about traffic. It is about context.
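The caching case can be sketched in a few lines. This is a toy under loud assumptions: the embed function is a bag-of-words stand-in for a real embedding model, so it only notices word overlap and not actual meaning, and the 0.8 threshold and the SemanticCache class are hypothetical choices for illustration.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a lowercase bag of words."""
    return Counter(text.lower().replace("?", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hit on similarity to an earlier prompt instead of on an exact key."""

    def __init__(self, threshold: float = 0.8):  # assumed threshold
        self.threshold = threshold
        self.entries = []  # list of (vector, answer) pairs

    def put(self, prompt: str, answer: str) -> None:
        self.entries.append((embed(prompt), answer))

    def get(self, prompt: str) -> Optional[str]:
        query = embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

The threshold is where the risk sits: set it too low and users receive answers to questions they did not ask, set it too high and the cache degrades back into exact-match lookup.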

Or, more precisely: the AI Gateway is the part of context engineering that has been pushed down from the application layer into the infrastructure layer. Context engineering—memory, compression, token budgeting, prompt safety—is something every team writes from scratch the first time. By the third or fourth team, the patterns that keep recurring start drifting downward. Down to where? Down to whatever every call has to pass through. That is the AI Gateway.

It is not that the AI Gateway came first and context engineering came second. It is that context engineering, after enough repetition in practice, naturally settled the recurring parts into the gateway.

5. The network layer is starting to carry thinking

Hold that fulcrum in place and push one step further. There is something genuinely strange here. For the gateway to make a routing decision, it has to call a model first. The routing itself is an inference.

In traditional networking that is unimaginable. A load balancer that has to think first before deciding where to forward sounds as wrong as a switch that has to read the payload before forwarding the packet. But this is the normal state of the AI Gateway. It does not behave this way out of cleverness. It behaves this way because the capability and pricing gaps between models are too large to ignore. Without understanding the request, you cannot pick the right backend. Sending small talk to a top-tier model is wasteful. Sending complex reasoning to a cheap model is broken. Sending a vertical-domain question to a generic model leaves quality on the table. A reasonable routing decision requires first understanding what the request is actually asking.

This is not an isolated event. There is a longer trend behind it.

Smart NICs have been pushing storage and network processing down into the NIC hardware: network hardware is starting to carry compute. Service Mesh has moved circuit breaking, retries, observability, and traffic tagging into the sidecar: the network layer is starting to carry business policy. eBPF lets the network stack execute business logic directly in the kernel: the network's executable surface is being exposed to the application for the first time. The AI Gateway takes one more step on this line: the network layer is starting to carry thinking.

The direction is consistent. The network layer, originally responsible only for moving packets through, is being repeatedly asked to understand the content moving through it. Header → protocol → application semantics → inference output. The granularity of understanding keeps climbing. The AI Gateway is not the end of this line, but it is the station where the line clearly crosses the threshold of content understanding. What it processes is no longer a byte stream and no longer a request structure. It is the piece of thinking inside the request itself.

Once you accept that, the organizational mess shows up too. Who owns the AI Gateway: the network team, the platform team, or the AI team? In the past this question was rare: Kubernetes belonged to platform, CDN belonged to network, model serving belonged to AI. Ownership was clean. The AI Gateway is different. Its responsibilities cut across team boundaries. Routing governance looks like network team work. Quota management looks like platform work. Prompt safety looks like AI team work. The answer will differ from one company to another, but every company is going to be asked.

6. It is becoming the unified entry point for all token traffic

The AI Gateway's job is not going to stop at managing LLM calls. It is becoming the entry point for every kind of traffic in the company that is denominated in tokens.

A company's AI traffic today usually splits into three streams.

The first is direct model calls. The business code stitches a prompt, calls some vendor's endpoint, takes the result. This is the earliest form, and where most companies still are.

The second is MCP traffic. Once Agents start running, they need to call external tools, read external data, mount external resources. This used to be each business team rolling their own RPC integration. With MCP, the way context is organized, the way tools are surfaced, and the way resources are mounted have been standardized. MCP traffic is, underneath, still token traffic. Context goes in, context comes out, with a tool call detoured through the middle.

The third is multi-step reasoning traffic inside a single Agent. Hand an Agent a moderately complex task, and internally it will call the model repeatedly, use tools repeatedly, correct itself repeatedly. A single external request may explode into many internal model calls and tool calls. This stream used to be hidden inside the application framework, invisible from outside.

In the past these three streams ran on their own. Direct calls went through SDKs. MCP went through stdio or HTTP. Internal Agent calls ran through the application framework's event loop. They lived in different libraries and were monitored separately.

But step back from a governance angle and they are the same thing. All three are spending tokens to call a piece of thinking. All three need billing. All three need quotas. All three need audit trails. All three need monitoring. No company is going to maintain one quota system for external model calls, a second for MCP, and a third for Agent-internal traffic. That is engineering debt nobody can carry. So they will eventually converge in some layer. Where? At the only place every call has to go through. Back to the AI Gateway.
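A sketch of what "the same thing" means in practice: one usage record and one ledger, regardless of which stream produced the tokens. The Stream, UsageEvent, and TokenLedger names are hypothetical, and the sketch assumes each stream can report input and output token counts per call.

```python
from dataclasses import dataclass
from enum import Enum


class Stream(Enum):
    DIRECT = "direct_model_call"
    MCP = "mcp_tool_call"
    AGENT = "agent_internal_step"


@dataclass(frozen=True)
class UsageEvent:
    team: str
    stream: Stream
    input_tokens: int
    output_tokens: int


class TokenLedger:
    """One ledger for all three streams, so billing, quota, and audit share a source."""

    def __init__(self):
        self._by_team = {}
        self._by_stream = {}
        self.audit_log = []  # every event is kept for the audit trail

    def record(self, event: UsageEvent) -> None:
        total = event.input_tokens + event.output_tokens
        self._by_team[event.team] = self._by_team.get(event.team, 0) + total
        self._by_stream[event.stream] = self._by_stream.get(event.stream, 0) + total
        self.audit_log.append(event)

    def team_total(self, team: str) -> int:
        return self._by_team.get(team, 0)

    def stream_total(self, stream: Stream) -> int:
        return self._by_stream.get(stream, 0)
```

The stream becomes a label on the event, not a separate system. A quota check only has to ask for a team's total and never needs to know how the tokens were spent.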

MCP's role here deserves to be called out, because it is what makes any of this possible.

I have written before that MCP is not really about wiring up tools. That is the surface. What MCP is actually doing is standardizing what an AI call should look like: how context attaches, how tools are listed, how resources are exposed, even how sampling can be requested in the reverse direction, from server to client. Each framework used to invent its own version of these motions. MCP pulls them into one protocol.

Once the shape stabilizes, governance gets a handle. The reason an AI Gateway can do quotas, caching, and audit at the call layer is that the protocol exposes what needs to be looked at. The gateway can see which tools this call attached, which resources it referenced, roughly what shape the context has. Without protocol-layer standardization, governance at the gateway layer has nothing to grip onto. So MCP is not just an application-layer protocol. It is also one of the preconditions that makes the AI Gateway feasible.

The relationship between MCP and the AI Gateway is upstream-and-downstream, not substitution. MCP shapes context cleanly on the application side. The AI Gateway picks up that shaped context on the infrastructure side and runs governance against it. One handles form, the other handles control. The two interlock; only then does "unified entry point" mean anything concrete.

7. Why this has to live on the enterprise side

The natural next question: can the model vendors (MaaS providers) do all of this themselves?

In theory yes. Every MaaS API today already includes some rate limiting, quota, and content moderation. But none of it can reach what an AI Gateway has to reach, and the reason is not technical. It is structural, in three pieces.

First, multi-model routing is fundamentally not something a single MaaS vendor can offer. No vendor's API will tell you "for this kind of request you should route to my competitor, because they are cheaper or better at this." But the way enterprises actually use LLMs is multi-vendor. Cheap work goes to open-source models. Critical work goes to top-tier models. Vertical-domain work goes to specifically fine-tuned models. Cross-vendor routing has to live somewhere outside all the vendors. It cannot live inside any one of them.

Second, token quotas, content audit, and compliance are enterprise-side responsibilities, not the model vendor's. When a company's finance team says "engineering's LLM budget for this month is X," that constraint cannot be delegated to a vendor for enforcement. The vendor does not know how many departments you have, what your budget structure is, how you split the bill internally. The same is true for "we do not allow customer national ID numbers to be sent to the model," a compliance constraint that the vendor has no way to enforce on your behalf. These constraints have to land on the enterprise side, before the call leaves the enterprise's own gateway.
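Both constraints can be sketched as a check that runs before the call leaves the enterprise. The names here (EgressPolicy, DepartmentBudget, PolicyViolation) are hypothetical, and the ID pattern is an assumed shape (17 digits followed by a digit or X) used only for illustration; a real deployment would use whatever formats its compliance team specifies.

```python
import re
from dataclasses import dataclass

# Assumed ID shape for illustration only; not a validated national ID check.
ID_PATTERN = re.compile(r"\b\d{17}[\dXx]\b")


class PolicyViolation(Exception):
    """Raised when a request must not leave the enterprise gateway."""


@dataclass
class DepartmentBudget:
    monthly_token_limit: int
    used: int = 0


@dataclass
class EgressPolicy:
    budgets: dict  # department name -> DepartmentBudget

    def check(self, department: str, prompt: str, estimated_tokens: int) -> None:
        """Block the call on a compliance hit or a budget overrun; otherwise reserve tokens."""
        if ID_PATTERN.search(prompt):
            raise PolicyViolation("prompt appears to contain a national ID number")
        budget = self.budgets.get(department)
        if budget is None:
            raise PolicyViolation(f"no budget configured for {department}")
        if budget.used + estimated_tokens > budget.monthly_token_limit:
            raise PolicyViolation(f"{department} would exceed its monthly token budget")
        # Reserve the estimate; a real gateway would reconcile with actual usage
        # once the response comes back.
        budget.used += estimated_tokens
```

Neither input to this check exists on the vendor side: the department map comes from the company's own org chart, and the blocked pattern comes from its own compliance rules.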

Third, model capability is still evolving fast, and enterprises will not bind their governance logic to one vendor. The vendor of choice this year may not be the vendor next year. The model version this quarter may not be the version next quarter. If quotas, audit, and monitoring all live on top of one vendor's API, switching vendors means rebuilding the whole governance stack. That cost is too high for any enterprise to actually accept. So the governance layer is going to be pulled out and built outside all vendors.
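The decoupling argument can be made concrete with a vendor-neutral interface. In this sketch the Provider protocol, the Completion record, and the Gateway class are all hypothetical, and EchoProvider is a fake vendor that exists only so the example runs; real adapters would wrap each vendor's SDK behind the same method.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class Provider(Protocol):
    """The only surface the governance layer depends on."""
    name: str

    def complete(self, prompt: str) -> Completion: ...


class EchoProvider:
    """Fake vendor for illustration; counts words as a stand-in for tokens."""

    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str) -> Completion:
        words = prompt.split()
        return Completion(f"[{self.name}] {prompt}", len(words), len(words) + 1)


class Gateway:
    """Metering lives here, outside any vendor, so swapping vendors keeps it intact."""

    def __init__(self, provider: Provider):
        self.provider = provider
        self.tokens_by_team = {}

    def complete(self, team: str, prompt: str) -> Completion:
        result = self.provider.complete(prompt)
        spent = result.input_tokens + result.output_tokens
        self.tokens_by_team[team] = self.tokens_by_team.get(team, 0) + spent
        return result
```

Switching vendors is then a single assignment to the gateway's provider, and the per-team totals carry over untouched. If the metering had lived in vendor A's dashboard, it would have to be rebuilt on vendor B.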

This pattern has happened before. After public cloud took over, enterprises built multi-cloud management platforms to decouple from any single cloud provider. After SaaS took over, enterprises built integration platforms to decouple from any single SaaS provider. Whenever a key capability comes from a small set of irreplaceable external vendors, the enterprise will build a decoupling layer on its own side. The AI Gateway sits exactly where this rule places it: it is the multi-cloud management layer for models. It is not driven by technology. It is driven by vendor structure.

Which means that if model capability ever consolidated tightly into one or two providers, this layer would shrink. As long as model capability is spread across multiple irreplaceable vendors, this layer stays. It will get thinner or thicker over time, but it will not be absorbed by any MaaS.

8. The defining property of this generation of infrastructure is non-determinism pushed downward

Pulling the threads together.

The AI Gateway did not appear out of nowhere. It is the natural sediment of context engineering being practiced repeatedly. It is the inevitable governance layer that follows once a protocol stabilizes. It is the decoupling layer enterprises always build when key capability is distributed across multiple irreplaceable suppliers. The earlier sections have been walking through each of these threads.

If only one judgment can survive, it is this: the defining property of this generation of infrastructure is non-determinism pushed downward.

Past infrastructure handled deterministic problems. Where a request should go was determined. Where data should sit was determined. Whether a service should come up was determined. Every piece of rate limiting, routing, caching, and monitoring rested on the same assumption—we know what the right answer is; the job is to execute it efficiently.

The AI Gateway is the first time infrastructure introduces non-determinism as a first-class citizen. Routing is non-deterministic: you have to understand the request before you know where it goes. Output is non-deterministic: you only know after the fact whether the quality is acceptable. Cost is non-deterministic: you only know what this call cost after the response comes back. Latency is non-deterministic: a single threshold cannot bound it. Correctness is the most non-deterministic of all: even what counts as correct sometimes requires another model call to decide. These uncertainties used to live in the application layer, hidden inside try/except blocks in business code, hidden inside the PM's resigned line, "the model is just like that sometimes." The AI Gateway lifts them up to infrastructure.

From an engineer's point of view, the analogy is what Kubernetes was at its moment. When Kubernetes appeared, it was not, for engineers, a new market, a new startup opportunity, or a question of whether it was worth learning. It was a piece of new infrastructure forcing a new division of responsibility, pressing every team to answer fresh questions: which concerns belong to the cluster, which stay in the application, who restarts a Pod when it dies, how traffic gets in, how config gets distributed, how secrets get wired. None of these questions were absent before Kubernetes. The answers were just scattered across scripts and docs, each team handling them their own way. Kubernetes pulled the questions out into the open and forced explicit answers.

The AI Gateway will pose its own set. Which governance steps drop down into the gateway and which stay in the application. Whether token budget is computed in the gateway or in the application. Where multi-model routing strategy is written and who maintains it. Whether prompt-safety policy is enforced uniformly at the gateway or implemented per-team. When something breaks, whether the responsibility lies with the gateway team or the AI team. Every company will answer these in its own way. Every company will be asked.

Place this inside a larger picture, and the AI Gateway is only the first stop in the infrastructure of software's second half. The era of codified rules built an entire stack around deterministic requests: gateways, load balancers, service discovery, config systems, observability. The token-based era is building another stack around non-deterministic thinking. The AI Gateway is the first concrete landing of that stack at the call layer. More layers will appear (around evaluation, monitoring, replay, detection), and each will resurface the same question: who owns this, which standards apply, where does responsibility sit.

The second half of "software eats the world" is not only the application layer changing. The change keeps moving downward: into the call layer, into the network layer, into the parts of the system we used to assume would never feel any of this.