Skip to content

8. Building Memory for AI Systems

You spend a full day pairing with your AI coding assistant.

In the morning you have it refactor the auth module. Along the way it picks up your project structure—Go, monorepo, business logic in internal/, shared libraries in pkg/. In the afternoon you ask it to write a new middleware. By then it already knows that all your middlewares follow the func(next http.Handler) http.Handler signature. It knows you use slog instead of logrus for logging. It knows your error-handling style returns a custom AppError type, not bare error.

By the evening, the collaboration is humming. You can say "add a cache layer to OrderService" and it knows to use the cache.Store interface that already exists in your project. It knows the cache-key convention is {service}:{entity}:{id}. It knows TTL should be read from the config service, not hardcoded. You no longer have to explain any of it—it has "learned" all of it through the day's conversation.

The next morning, you open a new session.

"Add a cache layer to UserService."

It returns a snippet. The snippet uses the go-cache library—your project does not depend on go-cache. The cache key is user_123—nothing like your project's naming convention. The TTL is hardcoded to five minutes—exactly what you told it not to do yesterday.

It does not remember. A full day of "learning" is gone.

You sigh, and you start re-explaining the project structure, the conventions, the technology choices. You did all of this yesterday. You will do it again tomorrow. And the day after.

This is not a defect of one product. This is the basic architecture of large language models.

8.1 The Cost of Being Stateless

Chapter 2 already took this mechanism apart: every inference call is a stateless pure function. The "memory" you experience inside a multi-turn conversation is just the client re-sending the full history each time. "Remembering" within a session works because the context window still has room for the prior messages. Once the window fills up, history gets truncated. Once the session closes, history is gone.

That chapter explained the mechanism. This chapter is about the practical cost of that mechanism over long-running collaboration.

The opening scenario is the most typical cost—repeated cold starts.

Every new session, you re-tell the AI your project structure, your stack choices, your code conventions, your team's agreements. None of this is "occasional background information." It is the baseline context you need every single time. Without it, the code the AI generates reads like something written by an intern on day one—syntactically correct, totally out of place inside your codebase.

The cost of repeated cold starts is more than "wasting time re-explaining things." There are several less visible costs underneath.

Cognitive load. You have to remember which pieces of context the AI does not yet have. Yesterday you told it to use slog instead of logrus; today you have to remember that today's session has not been told yet—otherwise it will use logrus, and you only realize "oh, I forgot to mention that" after you read the code. Your brain is now maintaining a checklist of "things the AI should know but does not know yet" on its behalf.

Information leakage. You will not remember to repeat all of the context every single time. You will skip some details that "feel unimportant but actually matter." You forget to mention "error messages should be in English," and the AI mixes Chinese and English. The rework caused by these omissions adds up to a serious time tax.

A ceiling on collaboration depth. Human teammates get more efficient over time—you build shared vocabulary, an implicit division of labor, an understanding of each other's style. Collaboration with the AI never moves past day one. There is no accumulated rapport, because there is no cross-session memory.

You may notice that Skills, from Chapter 5, look like a way out. Coding conventions, project agreements, architectural patterns—can these be auto-loaded at the start of every session through a Skill? Yes. Skills cover information that is predefined and relatively stable: the team's coding style, the project's stack, the standard error-handling pattern. Once that information is written into a Skill, it gets injected automatically every session and you do not re-explain it.

What Skills do not cover is dynamically accumulated knowledge—information that emerges during collaboration and cannot be defined ahead of time. The database schema you decided on in yesterday's discussion. The API-naming change agreed last week. The temporary workaround for a performance bottleneck found three days ago. These pieces emerge inside the conversation; you cannot pre-write them into a Skill. And they keep changing—today's decision may overturn last week's, a new finding may invalidate yesterday's plan.

Skills solve reuse of static knowledge. A memory system has to solve accumulation of dynamic knowledge. The two complement each other; they do not substitute for each other.

These costs are bearable in the short term—a one-off question, a quick code snippet—where the cold-start tax is negligible. In long-running project collaboration, the loss of dynamic knowledge becomes a serious efficiency bottleneck. Even with Skills covering the baseline conventions, you still have to re-explain "things only decided in the last conversation"—and those tend to be the most important context for the current task.

This gap will not close as models get smarter. A smarter model just makes better judgments based on the information it sees—and if the information is not in the context to begin with, intelligence does not help.

To close the gap, you have to build a stateful memory layer on top of the stateless inference engine. That layer is responsible for persisting key information across sessions and injecting the relevant pieces back into the context at the start of a new session, so the AI appears to remember you.

This is the core problem a memory system solves: build a stateful experience on top of a stateless engine.

8.2 What a Memory System Actually Does

The phrase "stateful memory layer" invites an easy misreading: did the model itself become stateful? Did the server bolt a hard drive onto it?

No. The model is still that stateless pure function. Every inference, all it sees is the text inside the context window. Anything outside the window still does not exist for it.

What the "memory system" actually does is plain. Outside the model, there is a persistent store—it can be a database, a flat file, even a JSON blob—holding information worth keeping. When the next conversation starts, the system pulls relevant pieces out of that store and stitches them into the context window, usually into the system prompt. The model "sees" those lines and behaves as if it remembers you. Delete that injected text and it stops remembering immediately.

In other words, a memory system does not give the model memory. It gives the model an "assistant" who takes notes on its behalf and reads the relevant notes aloud before each meeting. The model's own brain has not changed at all.

With that framing in mind, here is how today's AI tools actually implement that "assistant."

In early 2024 OpenAI added a Memory feature to ChatGPT. If you have used it, you may notice it works in a strangely simple way—simple enough that you might wonder "that's it?"

When you mention something in a conversation like "I'm a Go developer" or "my project uses PostgreSQL," the model quietly invokes an internal tool while answering you and writes that fact into an external store. The format is a short sentence: "User is a Go developer; prefers slog for logging." The next time you open a new conversation, the system pastes all of your memory entries into the head of the system prompt.

No vector database. No semantic retrieval. No fancy relevance ranking. Just paste them all in.

Why does this "crude" approach work? Because each memory entry is one or two sentences, and a hundred entries adds up to a few hundred tokens. In a context window that handles hundreds of thousands of tokens, that overhead is like pouring a cup of water into a swimming pool—you cannot detect the difference. Rather than spend engineering effort on precision retrieval (which still risks dropping something important), full injection guarantees nothing is lost.

The design exposes the first counter-intuitive fact about memory systems: when total memory volume is small, the dumbest approach is the optimal one. Zero retrieval latency, zero risk of dropping a relevant entry, near-zero implementation cost. You only need the fancy retrieval mechanisms when memory volume gets too large to inject in full.

ChatGPT Memory has one obvious limit: it does not separate by project. The "uses React" memory you accumulated while working on a frontend project will show up in your context when you switch to writing backend Go. It will not produce wrong code, but it wastes space—like having every project's papers piled on your desk and having to dig through unrelated stacks each time you want a specific one.

Cursor takes a finer-grained approach. It is built for one specific scenario—programming—and programming has a natural boundary: the project. You may be working on three projects at once, each with its own stack, conventions, and architectural decisions; mixing them only causes interference. So Cursor scopes memory to projects. Each project has its own memory store. When you open project A you see only project A's memories, never polluted by project B's. It is like giving each project its own notebook instead of writing everything onto one shared sheet.

Cursor also does something interesting. When it judges that a piece of information is worth keeping, it does not silently store it. It tells you explicitly: "Got it: your project uses gRPC for inter-service communication." You can confirm, or you can correct: "No, we were just discussing gRPC; we actually use REST."

Why this extra step? Because a wrong memory is worse than no memory. Imagine the system mistakenly stored "the project uses REST" (when in fact you were comparing REST and gRPC and ended up choosing gRPC), and from then on every conversation generates code based on that wrong premise. You see REST-style code, you correct it, you see it again, you correct it again—never realizing the bug lives inside a memory entry you never confirmed. The confirmation step trades a little automation for substantially higher accuracy. In a programming setting, that trade is almost always worth it.

Claude takes another path. Its memory is not a pile of loose sentences but structured information, organized by category—user preferences as one block, project stack as another, coding conventions as another. When you say "we switched the logging library to zerolog," the system knows to update the "logging library" field under "coding preferences," not append a new entry. That avoids a thorny problem: a memory store containing both "logging uses slog" and "logging uses zerolog" at the same time. The prerequisite is that the memory has to be classified correctly at write time—if React is tagged as "frontend framework" and Vue is tagged as "migration plan," the system will not detect a conflict.

Claude also gives you a full memory-management UI—you can view, edit, delete every entry. It looks like a small feature, but it solves the deepest trust problem in any memory system: you need to know what the AI actually remembers. If memory is a black box, you can never be sure whether some weird AI behavior is being driven by a wrong memory entry. Give the user an audit pane and let them see the whole picture; only then does trust accumulate.

After taking apart these three products, you will notice that the implementation details differ but the underlying job is the same: identify valuable information during a conversation, extract and store it, and inject it back into the next conversation. The differences are about the strategy at each step—who decides what is worth remembering, what format to store it in, how to handle conflicts, how much control to give the user.

These differences are not arbitrary product decisions. They are adaptations to different scenarios. General conversation (ChatGPT) only needs the simplest design—memory volume is small, full injection works, no retrieval needed. Professional programming (Cursor) needs project isolation and higher accuracy—a single wrong memory produces a stream of wrong code. Deep collaboration (Claude) needs structured management and full user control—because memory deeply shapes the AI's long-term behavior, and the user must be able to audit and correct it.

8.3 What's Actually Worth Remembering

The first and most fundamental question a memory system has to answer: out of an entire conversation, what should be stored?

A single conversation may run thirty turns. Twenty-five of them are throwaway exchanges—"rename this variable to count," "what does the third argument of this function mean?", "add a blank line here." Useful in the moment, no cross-session value. Next time, you do not need to know that you once renamed a variable.

But on turn 12 you say: "We decided to use gRPC instead of REST, because internal service-to-service calls happen at high frequency." This sentence has a completely different value—it is an architectural decision that will shape every later piece of work involving service communication. If the AI remembers it, the next time you ask for a new service call it will not generate REST-style code.

The core challenge of memory extraction is separating signal from noise. Across thirty turns, only two or three may contain information worth keeping long-term. The system has to find those two or three and ignore the other twenty-seven.

The mainstream approach is to ask the model itself to make the call. The system gives the model an extraction prompt that lists what is worth keeping across sessions—user preferences, technical facts about the project, settled design decisions, team-level conventions—and what is not: throwaway questions, rejected proposals, things that only matter inside the current task.

There is a subtlety in how that prompt is written: it has to include both "what to extract" and "what not to extract." If you only tell the model "extract valuable information," it will over-extract. The model has a natural lean toward "better to over-store than miss," and it will mark almost anything as valuable. A negative list—"do not extract throwaway questions; do not extract rejected proposals"—is what brings extraction precision back into a reasonable range.

There is one more design choice that is easy to overlook: let the model say "this turn has nothing worth remembering." Without this exit, the model will manufacture content. Because it has been asked to extract, it will find something to extract, even if that something has no real value. The result is a memory store filled with noise like "the user asked how to use fmt.Sprintf."

Once you decide what to remember, the next question is when to remember.

The intuitive choice is to extract right after each turn. This has a problem: at turn 5 you say "we use REST," at turn 15 you say "actually, let's use gRPC." If extraction runs after each turn, turn 5 writes "project uses REST," turn 15 writes "project uses gRPC," and the memory store contains a contradiction. If extraction waits until the session ends, the model can see the full discussion, knows the final decision is gRPC, and never stores the rejected REST option.

But waiting until session end has its own risk—if the browser crashes, the network drops, or the tab is closed, the entire session's memory is lost. So most systems run a hybrid in practice: information the user explicitly tagged with "remember this" gets written immediately (high confidence, no need to wait); information the model judges autonomously gets extracted at session end as a batch (it needs the full context to make accurate calls); particularly long sessions get incremental extractions every so often (so a sudden interruption does not destroy everything).

8.4 Storing It Is Not Using It

The information is in the store. That does not mean it is usable. Imagine the memory store has accumulated five hundred entries—stacks for three projects, two years of coding preferences, dozens of architectural decisions. Now you start a new session: "add a cache layer to OrderService."

If total memory volume still fits the "inject everything" range (a few hundred tokens), the answer is easy—paste it all in and let the model decide what is relevant. But five hundred entries may be several thousand or even tens of thousands of tokens, and full injection no longer works. The system has to pick the few entries that matter.

"Cache strategy is Cache-Aside; TTL is read from config service"—highly relevant. "Cache key naming convention is {service}:{entity}:{id}"—highly relevant. "Frontend uses React"—not relevant at all. The system has to make those calls.

The most common approach is semantic retrieval—convert your message and each memory entry into vectors and rank by vector distance. "Add a cache layer" and "Cache-Aside strategy" share no surface words but are semantically close, and vector retrieval picks up that link.

But semantic retrieval has a trap: semantic similarity is not task relevance. "We discussed Redis eviction policies last time" and "add a cache layer to OrderService" are semantically close—both about caching. But if last time's discussion was about a different project's Redis configuration, it has nothing to do with the current task. Vector retrieval only sees semantic distance. It does not see project boundaries.

This is exactly why Cursor's memories carry a project tag—retrieval first filters by project, then runs semantic matching inside that subset. The point is not "tidy organization." The point is retrieval precision. Without that filter, cross-project semantic noise drags retrieval results badly off target.

There is another, subtler trap. The user's first message often does not carry enough information. "Help me write an API"—too broad; it has some relevance to almost every memory entry, but high relevance to none. The retrieval comes back with a pile of "sort of related" entries, with the actually useful ones drowned in noise.

The fix is delayed retrieval—do not retrieve on the first message; wait until the conversation has run two or three turns and the context is richer. Or augment the retrieval query with IDE context—the file currently open, the cursor position—to give the query enough signal to be useful.

Once relevant memories are retrieved, one question remains: where in the context do they go? It looks like a technical detail, but it directly shapes the model's attitude toward memory. Placed inside the system prompt, the model treats memory as instruction"use slog for logging" gets read as a rule that must be followed. Placed as a separate context block (for example, wrapped in a <memories> tag), the model treats memory as reference information—it knows these are historical facts, it consults them, but it does not blindly obey.

In practice both placements are used together. User preferences and team conventions go into the system prompt (they are rules and should be obeyed strictly). Project knowledge and historical decisions go into a separate block (they are reference; the current request takes priority).

8.5 When New Memories Fight Old Memories

Run a memory system long enough and you hit a problem you cannot avoid. Last week you said "our frontend uses React." This week you say "we decided to migrate the frontend to Vue." Both statements are now in the memory store. Next time the AI retrieves both, what happens?

If it only sees "frontend uses React" (because that entry's semantics happen to match the current query better), it generates React code—wrong. If it sees both, it may get confused. If it is smart enough, it may infer that "migrate to Vue" is the newer information—but that requires understanding temporal order and the semantics of "migrate."

The deeper issue: the world changes, but old entries in the memory store do not disappear by themselves. Different products handle this differently.

Claude's structured store solves part of the conflict naturally. When memory is keyed, the "frontend framework" key has only one value, and a new value overwrites the old one. Conflict detection becomes simple key matching, no semantic understanding required. The prerequisite is that the memory was classified correctly at write time. If React is tagged as "frontend framework" and Vue is tagged as "migration plan," the system will not detect a conflict at all.

Another approach is to search semantically similar old entries before writing a new one, and ask the model whether the two contradict. The model can recognize that "REST" and "gRPC" are mutually exclusive choices in the "service communication" context, mark them as a conflict, and keep the newer one. This is more flexible, but every write now requires an extra model call for conflict detection.

The most reliable approach is also the simplest: rely on explicit user expression. When you say "remember: we now use gRPC, not REST anymore"—the sentence itself carries "override old information" semantics. "Now" implies something different was used before. "Not anymore" explicitly negates the old information. The model picks up these cues and does more than write the new memory—it actively searches for and removes the contradicting old one.

Beyond conflict, there is a slower problem: memory goes stale. "Project uses Go 1.20"—if the project has upgraded to 1.22, that entry is wrong. But how does the system know it went stale? Unless you say "we upgraded," the system has no way to detect it. "We discussed connection-pool sizing last time"—the entry may still be useful one or two weeks later, but three months later it is almost certainly stale.

In theory you could design a time-based decay—a memory's "importance score" drops over time, and entries that have not been used in a long time fade out. In practice most products choose the most conservative policy: do not forget automatically; rely on user management. The reasoning is pragmatic—the cost of automatic forgetting is too high. If the system mistakenly drops a still-valuable entry, the user may not notice (they only feel "the AI suddenly stopped remembering my preferences") and recovery is impossible. Compared to that, "some stale entries in the memory store" costs almost nothing—a bit of wasted context space at most.

8.6 How to Make a Memory System Actually Work for You

With the mechanics in hand, the practical question: what should you do?

Most people use memory systems purely passively—just have a normal conversation and let the system extract things in the background. That works, but the effect is limited. The system will miss things you find important (its sense of "important" does not always match yours), and it will keep things that do not matter.

A more effective mode is active feeding. When you make an important decision, tell the AI explicitly: "remember this." Do not wait for the system to discover it; mark it yourself. It is the same instinct you use with a new teammate—you say "this matters; write it down" rather than expecting them to infer importance from scattered fragments.

The first conversation on a new project is the best moment to seed baseline memory. Instead of letting the system slowly "discover" your project facts over the next dozens of conversations, hand the key facts over up front—language version, framework, database, project layout, coding conventions, error-handling style. That block gets written into memory once, and every conversation afterward benefits. It is far more efficient than scattering the same information across ten conversations, and far less likely to leak.

There is one more habit that is easy to overlook: review the memory store regularly. Every so often—say once a month—spend five minutes looking at what the AI has actually remembered. The project upgraded a dependency? Update the relevant entries. Switched a technical approach? Delete the old ones. Notice the AI behaving in ways that do not fit the project? Check whether a wrong memory is steering it. The investment is small, and it prevents the "AI keeps generating wrong code based on stale information" problem—a problem that, once it lands, costs far more than five minutes to debug.

One more practical call: if you have both a memory system and a rules file (for example, Cursor's .cursor/rules), where should a piece of information live? The rule of thumb is simple: is this shared by the team, or is it personal? Team-shared conventions—coding style everyone follows, project structure that does not change often, agreements that need to be version-controlled—go in the rules file, get committed to Git, and become visible to every team member. Personal preferences—your own habits, decisions that shift over time, ad-hoc choices that do not belong in the repo—go in memory. The rules file is "the team's shared agreement." Memory is "the rapport between you and the AI."


To trace the line of this chapter:

We started with a daily scenario—the AI does not remember yesterday—and surfaced the core tension: building a stateful experience on top of a stateless engine. We took apart three real products and saw their shared underlying architecture and their different design trade-offs. We went down to the engineering layer: what is worth remembering, when to remember it, how to store it, how to retrieve it, how to handle conflict and staleness. We landed on practice: active feeding, project bootstrapping, periodic review.

Memory systems are still a fast-moving area. The core problem will not change—building stateful experience on top of a stateless engine. Once you have the structure of the problem and the shape of today's solutions, you can read the next solution as it appears.

This chapter also surfaces a deeper question. The context space taken up by memory injection is space taken away from the current task. The richer the memory, the more gets injected, and the narrower the "bandwidth" left for actual work. This is not a memory-only problem. The system prompt, tool descriptions, Skill instructions, retrieval results, memory injections—all of them compete for the same finite context window. How do you allocate space among these competitors so that every token is spent where it counts? That is what the next chapter is about.