Billing
You are billed per token, not per message.
Operations layer
If you are using hosted models, tokens are the real unit of account. This page is the practical framing: what tokens are, how pricing works, why long conversations get expensive, and what people actually do to keep spend under control.
Plain English first: providers do not bill you for characters or pages. They bill you for tokens going in and tokens coming out.
Technical version: tokenized input and output are the units used for model pricing, context windows, and rate limits. If you want to reason about cost, latency ceilings, or prompt size, token accounting is the real place to start.
This is one of those details that feels small until you build anything with real usage. Then it becomes operational immediately.
The cheapest prompt is not always the shortest prompt. The useful question is whether the tokens you are sending are doing work. Reused boilerplate, stale conversation history, and overly long outputs are usually where the waste lives.
Section 1
LLMs do not see raw characters. They see token IDs.
A token is roughly 3 to 4 characters of English text, but that varies by model and by language. Hello world is 2 tokens. Code, punctuation, emojis, and non-English text often tokenize differently, sometimes a lot differently.
The rough rule of thumb is that 1,000 tokens is about 750 words. That is approximate, not exact, but it is good enough for planning.
You are billed per token, not per message.
Model limits are measured in tokens, so prompt size and retrieved context compete for the same budget.
A lot of providers also meter usage in tokens per minute, not just requests per minute.
Section 2
Input and output are usually priced separately.
Your prompt, your conversation history, and the system prompt all count here.
The model's response counts separately, and it is typically much more expensive.
Output tokens are often 3 to 5 times more expensive than input tokens. That means verbose answers really do cost more, and it is worth being explicit about response length when you do not need an essay.
Important: Treat the figures below as illustrative order-of-magnitude examples, not a current price sheet. Providers rename families, change discounts, and update rates often enough that you should always check the official pricing pages before budgeting: OpenAI, Anthropic, and Google.
Illustrative pricing examples only:
The gap between GPT-4o and GPT-4o-mini is roughly 15 to 20 times on cost. For high-volume applications, that is not a rounding error. It is architecture.
Section 3
Long chats quietly resend a lot of old text.
Every message in a conversation costs input tokens. In a 10-turn chat, you usually resend the full conversation history on every turn. If your history is 100K tokens and your system prompt is another 10K, you are spending 110K input tokens every time the user sends one more message.
Periodically replace older turns with a shorter summary that preserves the important state.
Keep only the last N turns when older context is no longer doing useful work.
Retrieve the relevant history or documents rather than attaching everything every time.
For one-shot tasks that do not need memory, do not send history at all.
Section 4
Reused prompt prefixes can be cheaper than they look.
Several providers discount cached prompt prefixes, especially when you keep reusing the same system prompt or the same few-shot examples.
cache_control parameter, with discounts that can reach 90%If your system prompt is long, or if you have a stable evaluation harness with repeated examples, caching can reduce spend dramatically without changing the application design very much.
Section 5
Exact counts are model-specific, but rough estimates are still useful.
For OpenAI models, tiktoken is the standard library for exact counts. Other providers usually have their own tokenizer or token-count helper.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, world!")
print(len(tokens)) # 4
Anthropic also exposes token counting through the SDK with client.messages.count_tokens().
If you do not have a tokenizer handy, len(text.split()) * 1.3 is a decent English-language approximation. It is not exact, but it is good enough for budgeting and back-of-the-envelope comparisons.
Section 6
Most token savings come from boring habits, not magic tricks.
Use smaller models for extraction, classification, and other simple tasks. Save the larger models for reasoning-heavy work.
Ask for concise responses when that is enough. Fewer output tokens usually means meaningfully lower cost.
For workloads that are not latency sensitive, batch APIs can cut price in half on some providers.
Reuse expensive system prompts and repeated few-shot examples whenever the provider supports it.
For deterministic tasks, lower temperature can reduce inconsistent outputs and wasted reruns.