Operations layer

Cost & Tokens

If you are using hosted models, tokens are the real unit of account. This page is the practical framing: what tokens are, how pricing works, why long conversations get expensive, and what people actually do to keep spend under control.

The meter is running on token flow, not on vibes.

Plain English first: providers do not bill you for characters or pages. They bill you for tokens going in and tokens coming out.

Technical version: tokenized input and output are the units used for model pricing, context windows, and rate limits. If you want to reason about cost, latency ceilings, or prompt size, token accounting is the real place to start.

This is one of those details that feels small until you build anything with real usage. Then it becomes operational immediately.

A useful framing

The cheapest prompt is not always the shortest prompt. The useful question is whether the tokens you are sending are doing work. Reused boilerplate, stale conversation history, and overly long outputs are usually where the waste lives.

Section 1

Tokens, not characters

LLMs do not see raw characters. They see token IDs.

A token is roughly 3 to 4 characters of English text, but that varies by model and by language. Hello world is 2 tokens. Code, punctuation, emojis, and non-English text often tokenize differently, sometimes a lot differently.

The rough rule of thumb is that 1,000 tokens is about 750 words. That is approximate, not exact, but it is good enough for planning.

Billing

You are billed per token, not per message.

Context windows

Model limits are measured in tokens, so prompt size and retrieved context compete for the same budget.

Rate limits

A lot of providers also meter usage in tokens per minute, not just requests per minute.

Section 2

How pricing works

Input and output are usually priced separately.

Input tokens

Your prompt, your conversation history, and the system prompt all count here.

Output tokens

The model's response counts separately, and it is typically much more expensive.

Output tokens are often 3 to 5 times more expensive than input tokens. That means verbose answers really do cost more, and it is worth being explicit about response length when you do not need an essay.

Important: Treat the figures below as illustrative order-of-magnitude examples, not a current price sheet. Providers rename families, change discounts, and update rates often enough that you should always check the official pricing pages before budgeting: OpenAI, Anthropic, and Google.

Illustrative pricing examples only:

  • GPT-4o: $2.50/M input, $10/M output
  • GPT-4o-mini: $0.15/M input, $0.60/M output
  • Anthropic mid-tier family: roughly low-single-digit dollars per million input tokens, with output notably higher
  • Anthropic fast / small family: much cheaper than the mid-tier family, but still worth checking current pricing because this tier has moved around a lot
  • Google long-context family: often priced competitively for very large-context work, but the exact model names and rates change frequently
  • Local models: $0 usage charge after the hardware cost is already paid

The gap between GPT-4o and GPT-4o-mini is roughly 15 to 20 times on cost. For high-volume applications, that is not a rounding error. It is architecture.

Section 3

The context window problem

Long chats quietly resend a lot of old text.

Every message in a conversation costs input tokens. In a 10-turn chat, you usually resend the full conversation history on every turn. If your history is 100K tokens and your system prompt is another 10K, you are spending 110K input tokens every time the user sends one more message.

Summarization

Periodically replace older turns with a shorter summary that preserves the important state.

Sliding window

Keep only the last N turns when older context is no longer doing useful work.

RAG instead of long context

Retrieve the relevant history or documents rather than attaching everything every time.

Stateless design

For one-shot tasks that do not need memory, do not send history at all.

Section 4

Prompt caching

Reused prompt prefixes can be cheaper than they look.

Several providers discount cached prompt prefixes, especially when you keep reusing the same system prompt or the same few-shot examples.

  • OpenAI: automatic prompt caching with a 50% discount on cached input tokens
  • Anthropic: explicit cache control through the cache_control parameter, with discounts that can reach 90%

If your system prompt is long, or if you have a stable evaluation harness with repeated examples, caching can reduce spend dramatically without changing the application design very much.

Section 5

Counting tokens programmatically

Exact counts are model-specific, but rough estimates are still useful.

For OpenAI models, tiktoken is the standard library for exact counts. Other providers usually have their own tokenizer or token-count helper.

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, world!")
print(len(tokens))  # 4

Anthropic also exposes token counting through the SDK with client.messages.count_tokens().

If you do not have a tokenizer handy, len(text.split()) * 1.3 is a decent English-language approximation. It is not exact, but it is good enough for budgeting and back-of-the-envelope comparisons.

Section 6

Cost optimization patterns

Most token savings come from boring habits, not magic tricks.

Right-size the model

Use smaller models for extraction, classification, and other simple tasks. Save the larger models for reasoning-heavy work.

Reduce output verbosity

Ask for concise responses when that is enough. Fewer output tokens usually means meaningfully lower cost.

Batch requests

For workloads that are not latency sensitive, batch APIs can cut price in half on some providers.

Cache aggressively

Reuse expensive system prompts and repeated few-shot examples whenever the provider supports it.

Keep retries under control

For deterministic tasks, lower temperature can reduce inconsistent outputs and wasted reruns.