What's actually happening inside a model

Beneath all the chat UI polish, a model is still doing something surprisingly plain: it takes tokens in and produces probabilities for what should come next. That basic loop explains a lot of the stack above it.

The short version

A model is a function. A large, weird, statistical function.

You give the model a sequence of tokens, which are chunks of text rather than whole ideas. It looks at that sequence and produces probabilities for what token should come next.

Then one candidate gets chosen, appended, and the process runs again. Token after token, the output grows until the host hits a stop condition, a length limit, or some other rule.

That is the engine. Tool calling, memory, reasoning traces, agent loops, approvals — all of that is built around the same next-token process.

Tokens

Models don't read words. They read chunks.

Before the model can do anything, your text gets tokenized. That means it is split into subword pieces that are useful for the model's vocabulary. Those pieces are not quite words and not quite characters. A word like unhappiness might come through as something like un, happi, and ness.

Tokens often include whitespace, punctuation, or common fragments, which is why token counts never line up neatly with word counts. Roughly, 4000 tokens is about 3000 words of plain English. Code is usually denser because symbols and short identifiers get split more aggressively.

This ends up mattering more than people expect. Token count drives cost, context limits, and latency, so tokens are one of the basic budgeting units of AI work.

Approximate token split

The ␠quick ␠brown ␠fox

That example is simple on purpose. Real tokenizers are messier, and the exact split depends on the model family.

Weights

The model's knowledge lives in billions of numbers.

During training, the model sees huge amounts of text and keeps adjusting its internal parameters so it gets a little better at predicting the next token. Those parameters are called weights. Modern models can have billions of them.

The important thing is what weights are not. They are not a tidy database of sentences stored somewhere inside the model. They are a compressed statistical structure shaped by training, which is why a model can generalize, imitate styles, and answer questions it has never seen in exactly that wording.

After training, those weights are usually frozen for inference. Your chat does not update the model live unless someone fine-tunes or retrains it later. That is why knowledge cutoffs exist, and why the same model with the same settings can sometimes produce the same answer twice.

Inference

Running a model is called inference. It's just a forward pass.

When you send a prompt, the model does a forward pass through the network and computes probabilities for the next token. Then it does it again, and again. Nothing magical has to happen at runtime. It is math over a lot of weights, repeated very fast.

The catch is cost. Large models need serious compute, huge memory bandwidth, and fast hardware to stay responsive. That is why hosted APIs cost money, why local models often want a GPU, and why latency becomes part of the product experience.

Sampling settings shape how the output feels. Low temperature makes the model more conservative. Higher temperature flattens the distribution and makes the output more varied, which can be useful, but also less stable.

Why it matters

Everything in the tooling stack exists because of these constraints.

  • Context limits exist because the model can only attend to so many tokens at once. That is why memory and context management matter in Lab 12.
  • Models do not have live access to the world by default. That is why tools exist in Lab 01, Lab 02, and Lab 03.
  • Models are stateless between calls. That is why agents manage state explicitly in Lab 06 and Lab 12.
  • Models are probabilistic. That is why hooks, governance, and approval gates show up in Lab 05, Lab 09, and Lab 10.

Go deeper

If you want to go further.

Build the intuition from scratch

Andrej Karpathy's Neural Networks: Zero to Hero series is still one of the best ways to move from hand-wavy intuition to implementation-level understanding.

Read the paper that set the shape

Attention Is All You Need is the 2017 transformer paper. It is not light reading, but it is shorter and more readable than most people expect.

Come back to the access layer

Once the model itself makes more sense, the site's model-access page is a good next step for understanding the difference between the model and the path you use to reach it.

Ready to build

Understanding the model is step one.

Once the next-token loop feels real, the rest of the stack gets less mysterious. Start Here walks through that progression, and Lab 00 puts a live model endpoint in your hands in one sitting.