How Large Language Models Actually Work

A quick look at the kinds of prompts that fill 2025-era AI coding sessions:

give me complete, runnable code
answer me in Chinese
I told you to reuse the existing interface
stop dropping a pile of TODOs into the file
why did you pull in another dependency

These look like requirements, but they are really constraints. We keep adding them to pull the model back from drifting, to force it closer to the context we actually care about, to fence in its freedom, and to correct its idea of what a "right answer" looks like. Prompts grow longer not because we have started to understand the model, but because we have not — and the only tool we have is to keep poking at its edges and patching what comes back.

This exposes a quietly uncomfortable position most engineers are in. Our understanding of large language models sits in a dangerous middle ground. We know they are powerful, but we cannot say clearly why they are powerful. We know they get things wrong, but we cannot say clearly why they get things wrong. So on one hand we treat the model as a near-omniscient senior engineer, and on the other we keep pasting in more constraints to drag it back from wherever it drifted. On the surface this looks like learning to "ask better questions"; underneath, it is the daily friction of working with a probabilistic system for which we do not yet have a working mental model.

That is what this part is about. Not how to write a particular prompt, but a more basic and more important question: how does a large language model actually run? You do not need to become a machine learning researcher, but you do need a working picture of a few mechanisms — how the model turns your code and instructions into something it can process, how it builds correlations across enormous amounts of text, how it produces an answer one token at a time, and what the "context window" really means. These are not distant theory. They are the everyday basis for judging what the model can do, what it cannot do, why it got something right today, and why it got something wrong yesterday.
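To make those four mechanisms slightly more concrete before the later chapters treat them properly, here is a deliberately tiny sketch. It is not how a real large language model is built. It assumes whitespace splitting in place of a real tokenizer and simple bigram counts in place of a neural network, and the names train_bigram, generate, and context_window are made up for illustration. What it does share with the real thing is the shape of the loop: text becomes tokens, the model holds statistics learned from text, and the answer is produced one token at a time from whatever still fits in a bounded window.

```python
import random
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace splitting stands in for a real subword tokenizer.
    return text.split()


def train_bigram(corpus):
    # "Training" here is just counting which token follows which.
    # A real model learns far richer correlations; this is a toy stand-in.
    counts = defaultdict(Counter)
    tokens = tokenize(corpus)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_tokens=5, context_window=4, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        # Only the most recent tokens are visible; older ones are simply gone.
        context = tokens[-context_window:]
        if not context:
            break
        # This toy looks only at the last visible token. A real model
        # conditions on every token in the window.
        followers = counts.get(context[-1])
        if not followers:
            break  # nothing ever followed this token in the corpus
        choices, weights = zip(*followers.items())
        # Sample the next token in proportion to how often it was seen,
        # append it, and repeat: one token at a time.
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens


corpus = "the model reads the prompt and the model writes the answer"
counts = train_bigram(corpus)
print(" ".join(generate(counts, "the")))
```

Even at this scale, two things carry over. The output is sampled rather than looked up, so a different seed can give a different continuation from the same prompt. And the generator never "knows" an answer in advance: each token is chosen only from what the statistics say tends to come next, given what is currently in view.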

Only once that layer is clear does everything later in the book (agents, memory, RAG) have solid ground to stand on. Because no matter how the tooling evolves, every capability we build on top of these systems still starts from one place: a clear-eyed understanding of how the model itself runs.