Memory and Context Engineering

The first two parts dealt with two questions: how the model works, and how an agent moves from "talking" to "doing." But there has been one problem we kept setting aside — the model does not remember.

Yesterday you spent a full day teaching it your project layout, your code conventions, your stack choices. You open a new session today and everything is gone. It did not "forget" — it was never holding on in the first place. Every inference is a stateless pure function. The moment the context window closes, every piece of information inside it disappears. This is not a flaw in some product; it is a basic property of the model architecture itself.

So engineers built memory systems on top of it — session memory, project memory, global memory — external persistent storage that gets selectively injected back into the context on each call. But context is not free. Every token has a cost: in money, in latency, and in attention dilution. Stuffing everything into the context is not the answer. It is just a new problem.

How do you make every token in a finite window earn its place? That is the question context engineering has to answer. And when the knowledge the model needs simply isn't in its training data — your company's internal frameworks, your team's private docs — how do you make the model "know" things it has never seen? RAG, fine-tuning, long context: three roads, each with its own costs and its own edges.

The three chapters in this part — from "how to make the AI remember" to "how to manage context cost" to "how to inject external knowledge" — are all answering the same underlying question. The model's knowledge and memory are bounded. Within that bound, how do engineers build a working information supply system around it?