Reference
Picking a model is mostly about tradeoffs: cost, speed, capability, context window, privacy, and whether the model is even available when you need it. The right choice depends on your task and on which of those you care about most. Pick the wrong one and you're either overpaying, getting mediocre results, or both.
Plain English version: the best model is the one that fits the job. Technical version: model choice is a multi-variable optimization problem, not a leaderboard trophy.
If you do not have measurements yet, start with a cheap fast model, build a tiny eval set, and only move up-market when the output quality justifies it.
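The "start cheap, move up only when quality justifies it" rule can be sketched as a small selection function. This is a minimal illustration, not a recommended policy: the candidate names, the scores, and the `cheapest_passing` helper are all hypothetical, and the scores would come from your own eval set.

```python
from typing import Optional


def cheapest_passing(candidates: list[dict], min_score: float) -> Optional[dict]:
    """Return the lowest-cost candidate whose eval score meets the bar, or None."""
    passing = [c for c in candidates if c["score"] >= min_score]
    if not passing:
        return None
    return min(passing, key=lambda c: c["cost_per_1m_input"])


# Hypothetical eval results (fraction of test cases passed) and illustrative costs.
candidates = [
    {"name": "cheap-fast-model", "cost_per_1m_input": 0.15, "score": 0.82},
    {"name": "mid-tier-model", "cost_per_1m_input": 2.50, "score": 0.91},
    {"name": "reasoning-model", "cost_per_1m_input": 15.00, "score": 0.95},
]

choice = cheapest_passing(candidates, min_score=0.90)
print(choice["name"] if choice else "no candidate meets the bar")
```

With a quality bar of 0.90 the cheap model is skipped and the mid-tier one is chosen; lower the bar to 0.80 and the cheap model wins.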
Section 1
These are the knobs that usually matter most once the initial hype wears off.
Capability
Plain English: How good the model is at your actual task.
Technical view: Benchmarks like MMLU, MATH, and HumanEval are useful directional signals, but your own evals beat them because they measure the thing you really care about.
Context window
Plain English: How much you can fit into one request before the model runs out of room.
Technical view: GPT-4o supports a 128K-token context window, Anthropic's Sonnet-tier models commonly support 200K tokens, and Llama 3.3 70B is commonly deployed at 128K tokens. A bigger context window helps with large documents, but it is not the same as deeper reasoning.
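A quick way to sanity-check whether a document fits is to estimate its token count before sending it. The sketch below assumes roughly four characters per token, which is only a rule of thumb for English text; real tokenizers differ by model and language, and both helper functions and the 4,000-token output reserve are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; use the provider's tokenizer for real budgets."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 4_000) -> bool:
    """Check that the prompt plus a reserve for the reply fits in the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window


document = "word " * 100_000  # 500,000 characters of filler text

print(estimate_tokens(document))           # rough prompt size in tokens
print(fits_in_context(document, 128_000))  # 128K window
print(fits_in_context(document, 200_000))  # 200K window
```

Under these assumptions the filler document is slightly too large for a 128K window once output space is reserved, but fits comfortably in 200K.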
Speed
Plain English: How long you wait before useful output starts showing up.
Technical view: Time to first token and tokens per second both matter. Smaller models are usually faster, and streaming helps perceived latency even when total runtime stays the same.
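Both numbers can be measured from any streaming response. The sketch below times a generic iterator of text chunks; `fake_stream` is a stand-in for a real streaming client, and chunks are only a proxy for tokens because chunk boundaries vary by provider.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Record time to first chunk and overall chunk throughput for a stream."""
    start = time.perf_counter()
    first_at = None
    count = 0
    for _ in chunks:
        if first_at is None:
            first_at = time.perf_counter()
        count += 1
    total = time.perf_counter() - start
    return {
        "chunks": count,
        "time_to_first_chunk_s": (first_at - start) if first_at is not None else None,
        "total_s": total,
        "chunks_per_s": count / total if total > 0 else 0.0,
    }


def fake_stream(n: int, startup_s: float = 0.05, per_chunk_s: float = 0.005) -> Iterator[str]:
    """Simulated model stream: a startup delay, then n evenly spaced chunks."""
    time.sleep(startup_s)
    for i in range(n):
        time.sleep(per_chunk_s)
        yield f"tok{i} "


stats = measure_stream(fake_stream(20))
print(stats)
```

Swapping `fake_stream` for a real streaming iterator gives comparable numbers across models, as long as the prompt and network conditions are held constant.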
Cost
Plain English: Input and output tokens add up fast.
Technical view: The per-token spread between the cheapest and most expensive options can easily reach 100x. Choosing between a fast-tier model, GPT-4o, and o1 is a budget decision, not a rounding error.
Privacy
Plain English: Decide whether your data can leave your machine or your region.
Technical view: Cloud APIs necessarily process your data on the provider's servers, and they may also log or retain it unless a specific enterprise policy or opt-out says otherwise. Local models keep inference on your own hardware.
Availability
Plain English: The best model on paper is not helpful if you cannot reliably call it.
Technical view: Rate limits, uptime, SLAs, and regional restrictions are part of model selection too, especially once a prototype becomes a real workflow.
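One common way to soften availability problems is to retry with backoff and then fall back to a second provider. The sketch below is illustrative only: `ProviderUnavailable`, `call_with_fallback`, and the stub providers are hypothetical names, and a real client would map its own rate-limit and outage errors onto this pattern.

```python
import time
from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Hypothetical error standing in for rate limits or outages."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
    retries_per_provider: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each provider in order, retrying with exponential backoff."""
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(prompt)
            except ProviderUnavailable:
                # Back off before retrying the same provider; skip the wait
                # after the last attempt and move on to the next provider.
                if attempt < retries_per_provider - 1:
                    sleep(base_delay_s * 2 ** attempt)
    raise ProviderUnavailable("all providers failed")


def flaky_provider(prompt: str) -> str:
    raise ProviderUnavailable("rate limited")


def steady_provider(prompt: str) -> str:
    return "ok: " + prompt


used, reply = call_with_fallback(
    [("primary", flaky_provider), ("backup", steady_provider)],
    "hello",
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(used, reply)
```

The fallback model may differ in quality, so it is worth running it through the same evals as the primary.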
Section 2
This is one of the first real forks in the road.
Plain English: Best capability, no hardware setup, and you pay as you go.
Technical view: OpenAI, Anthropic, and Google currently offer the strongest frontier capability for most general-purpose work. The tradeoff is that data leaves your machine and billing scales with usage.
Plain English: More private, effectively free after hardware, and bounded by what your machine can run.
Technical view: Ollama and llama.cpp make local inference approachable. You are limited by RAM or VRAM, but the capability gap is closing fast enough that a good 7B-13B model is already sufficient for many tasks.
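The RAM or VRAM limit can be approximated from parameter count and quantization width: weights take roughly parameters times bits per weight divided by eight bytes. The sketch below applies that arithmetic; the 1.2 overhead factor for KV cache and runtime buffers is an assumption chosen for illustration, and actual usage depends on context length and the runtime.

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB: 1B parameters at 8 bits is about 1 GB."""
    return params_billion * bits_per_weight / 8


def fits_on_device(
    params_billion: float,
    bits_per_weight: int,
    device_memory_gb: float,
    overhead_factor: float = 1.2,  # assumed headroom for KV cache and buffers
) -> bool:
    needed = estimate_weight_memory_gb(params_billion, bits_per_weight) * overhead_factor
    return needed <= device_memory_gb


for size in (7, 13, 70):
    print(size, "B at 4-bit:", estimate_weight_memory_gb(size, 4), "GB of weights")

print(fits_on_device(7, 4, device_memory_gb=8))    # small model, modest GPU
print(fits_on_device(70, 4, device_memory_gb=24))  # 70B model, single 24 GB GPU
```

By this estimate a 4-bit 7B model fits easily on an 8 GB device, while a 4-bit 70B model needs well over 24 GB.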
The gap between frontier cloud models and the best local models like Llama 3.3 70B, Qwen 2.5 72B, and Mistral Large is still real, but it is shrinking. For a lot of classification, extraction, summarization, and drafting work, local is no longer just a toy.
Section 3
If you want a starting point instead of a taxonomy, use this.
Pick: GPT-4o-mini or an Anthropic fast-tier model
Cheap, fast, and usually good enough to validate whether the workflow is even worth building.
Pick: GPT-4o or an Anthropic Sonnet-tier model
The current sweet spot for many teams: strong quality without jumping all the way to the most expensive reasoning tier.
Pick: o1 or o3-mini
More expensive, but often worth it when you need deliberate multi-step reasoning rather than quick pattern matching.
Pick: an Anthropic fast-tier model, GPT-4o-mini, or a local Llama 3.3 70B setup
When throughput matters, shaving cost per token can matter more than squeezing out the last few points of benchmark quality.
Pick: local models only
Ollama or llama.cpp running a Llama, Qwen, or Mistral family model is the cleanest answer when the data should not leave your hardware.
Pick: an Anthropic Sonnet-tier model or Google's long-context Gemini family
Anthropic's 200K context is already useful; Google's long-context Gemini line is better suited when the whole problem is large-context recall.
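The picks above can be encoded as a simple rule list. The sketch below is one possible ordering, not the guide's official logic: the `Needs` fields and the priority order (privacy first, then context, reasoning, volume, prototyping) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    data_must_stay_local: bool = False
    very_long_context: bool = False
    hard_reasoning: bool = False
    high_volume: bool = False
    prototyping: bool = False


def starting_pick(needs: Needs) -> str:
    """Map requirements to a starting suggestion; hard constraints are checked first."""
    if needs.data_must_stay_local:
        return "local models only (Llama, Qwen, or Mistral family)"
    if needs.very_long_context:
        return "Anthropic Sonnet-tier or Google's long-context Gemini family"
    if needs.hard_reasoning:
        return "o1 or o3-mini"
    if needs.high_volume:
        return "Anthropic fast-tier, GPT-4o-mini, or local Llama 3.3 70B"
    if needs.prototyping:
        return "GPT-4o-mini or an Anthropic fast-tier model"
    return "GPT-4o or an Anthropic Sonnet-tier model"


print(starting_pick(Needs(prototyping=True)))
print(starting_pick(Needs(hard_reasoning=True, high_volume=True)))
print(starting_pick(Needs(data_must_stay_local=True, very_long_context=True)))
```

Privacy is treated as a hard constraint here, so it wins over every other requirement; reorder the checks if your priorities differ.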
Section 4
Important: Prices change frequently, and this table is not a live price sheet. Treat the figures as illustrative order-of-magnitude examples for rough comparison, then confirm against the official pricing pages from OpenAI, Anthropic, Google, or the relevant provider before making budget decisions.
| Model family | Context window | Approx. cost / 1M tokens | Primary strength |
|---|---|---|---|
| GPT-4o | 128K | ≈ $2.50 input / $10 output | Strong general-purpose quality across chat, code, and multimodal work |
| GPT-4o-mini | 128K | ≈ $0.15 input / $0.60 output | Cheap, fast baseline for prototypes and volume |
| o1 | ~200K | premium-tier pricing | Deliberate reasoning on harder math and planning tasks |
| Anthropic Sonnet-tier family | 200K | illustrative mid-tier hosted pricing | High-quality writing, coding, and long-context work |
| Anthropic fast-tier family | 200K | illustrative low-cost hosted pricing | Fast, inexpensive high-volume usage |
| Google Gemini long-context family | up to very large context windows | illustrative competitive hosted pricing | Very large context window tasks |
| Llama 3.3 70B | 128K | Local hardware cost only | Best local balance of privacy and capability |
| Mistral Large | 128K | ≈ $2 input / $6 output | Strong open-ish ecosystem fit with good general capability |
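To turn per-million-token rates into a budget figure, multiply by your expected token volume. The sketch below uses only the illustrative rates from the table above; the request shape (2,000 input tokens, 500 output tokens) and the daily volume are hypothetical, and the rates should be replaced with current values from the provider's pricing page.

```python
# Illustrative USD rates per 1M tokens, copied from the table: (input, output).
PRICES_PER_1M = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Mistral Large": (2.00, 6.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed rates."""
    input_rate, output_rate = PRICES_PER_1M[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def monthly_cost(model: str, requests_per_day: int, input_tokens: int,
                 output_tokens: int, days: int = 30) -> float:
    return request_cost(model, input_tokens, output_tokens) * requests_per_day * days


for model in PRICES_PER_1M:
    per_request = request_cost(model, input_tokens=2_000, output_tokens=500)
    per_month = monthly_cost(model, 10_000, 2_000, 500)
    print(f"{model}: ${per_request:.4f} per request, ${per_month:,.2f} per month")
```

At these example rates the same workload costs roughly 17 times more on GPT-4o than on GPT-4o-mini, which is why volume workloads often start on the cheaper tier.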
Section 5
Benchmarks are useful, but they are not your product.
MMLU measures knowledge breadth across academic subjects. HumanEval measures Python code generation. MATH measures competition-style mathematical problem solving. Those numbers do correlate with general capability, but they do not reliably predict how well a model will handle your extraction pipeline, your support workflow, or your internal docs assistant.
Build your own evals whenever you can. A small task-specific test set will usually tell you more than another round of leaderboard browsing. See also the tool comparison page for side-by-side tool tradeoffs.
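A task-specific eval can be as small as a list of prompts with expected answers and a pass-rate function. The sketch below is a minimal version under stated assumptions: `run_eval` uses a simple substring check, `stub_model` is a placeholder for a real model call, and the three test cases are invented for illustration.

```python
from typing import Callable


def run_eval(model_fn: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    if not cases:
        return 0.0
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in model_fn(prompt).lower()
    )
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call: a naive keyword classifier."""
    return "positive" if "love" in prompt.lower() else "negative"


# Hypothetical sentiment-classification cases: (prompt, expected label).
cases = [
    ("Classify sentiment: I love this product", "positive"),
    ("Classify sentiment: This is terrible", "negative"),
    ("Classify sentiment: It arrived on time and works", "positive"),
]

score = run_eval(stub_model, cases)
print(f"pass rate: {score:.2f}")
```

Running the same `cases` against each candidate model gives a like-for-like comparison on the task you actually care about; substring matching is crude, so tasks with free-form output usually need a stricter or more forgiving scorer.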
Next move
Lab 17: Model Selector turns these tradeoffs into a simple scoring tool. Change the privacy, cost, speed, and capability requirements, then watch the recommendation move.