Model Selection — AI Tooling Field Guide

There is no best model

Every model involves tradeoffs: cost, speed, capability, context window, privacy, and availability. The right model depends on your task. Pick the wrong one and you overpay, get mediocre results, or both.

Plain English version: the best model is the one that fits the job. Technical version: model choice is a multi-variable optimization problem, not a leaderboard trophy.

A practical default

If you do not have measurements yet, start with a cheap fast model, build a tiny eval set, and only move up-market when the output quality justifies it.
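The default above can be sketched as a small "cheapest model that clears the bar" loop. This is a minimal illustration, not a real client: `cheap_model` and `strong_model` are hypothetical stand-ins for API calls, the four test cases are made up, and exact-match scoring is an assumption that only suits short, unambiguous answers.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical types for illustration: a "model" is any function from prompt to text.
Model = Callable[[str], str]
EvalCase = Tuple[str, str]  # (prompt, expected answer)


def pass_rate(model: Model, cases: Sequence[EvalCase]) -> float:
    """Fraction of cases where the model's answer matches the expected one."""
    hits = sum(
        1 for prompt, expected in cases
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)


def pick_cheapest_passing(ladder: Sequence[Tuple[str, Model]],
                          cases: Sequence[EvalCase],
                          threshold: float) -> Optional[str]:
    """Walk models from cheapest to most expensive; return the first that clears the bar."""
    for name, model in ladder:
        if pass_rate(model, cases) >= threshold:
            return name
    return None  # nothing was good enough; revisit the task or the threshold


# Stand-in "models" so the sketch runs offline. Swap in real API calls.
def cheap_model(prompt: str) -> str:
    return "positive" if "love" in prompt.lower() else "unknown"


def strong_model(prompt: str) -> str:
    text = prompt.lower()
    return "positive" if ("love" in text or "great" in text) else "negative"


cases = [
    ("I love this", "positive"),
    ("Great product", "positive"),
    ("Terrible support", "negative"),
    ("I love the price", "positive"),
]
ladder = [("cheap-model", cheap_model), ("strong-model", strong_model)]
```

With these stubs, `pick_cheapest_passing(ladder, cases, threshold=0.9)` returns `"strong-model"` because the cheap stub only passes half the cases. Lowering the threshold to 0.5 would keep you on `"cheap-model"`.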

Section 1

The main dimensions

These are the knobs that usually matter most once the initial hype wears off.

Capability

Plain English: How good the model is at your actual task.

Technical view: Benchmarks like MMLU, MATH, and HumanEval are useful directional signals, but your own evals beat them because they measure the thing you really care about.

Context window

Plain English: How much you can fit into one request before the model runs out of room.

Technical view: Context windows are measured in tokens, not characters or pages. GPT-4o supports 128K tokens, Anthropic's Sonnet-tier models commonly support 200K, and Llama 3.3 70B is commonly deployed at 128K. Bigger context helps with large documents, but it is not the same as deeper reasoning.
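A rough pre-flight check can tell you whether a document will fit before you send it. The sketch below uses the limits quoted above and assumes about 4 characters per token, which is only a crude average for English text. A real tokenizer for your chosen model will give a better count. The model keys and the `reserved_output_tokens` default are illustrative names, not provider identifiers.

```python
# Context limits in tokens, taken from the figures quoted above.
# "128K" is treated conservatively as 128,000.
CONTEXT_LIMITS = {
    "gpt-4o": 128_000,
    "anthropic-sonnet-tier": 200_000,
    "llama-3.3-70b": 128_000,
}

CHARS_PER_TOKEN = 4  # assumption: rough average for English prose


def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits(model: str, prompt: str, reserved_output_tokens: int = 4_000) -> bool:
    """True if the prompt plus room for the reply fits in the model's window."""
    needed = estimate_tokens(prompt) + reserved_output_tokens
    return needed <= CONTEXT_LIMITS[model]


document = "x" * 600_000  # about 150K estimated tokens
print(fits("gpt-4o", document))                 # False
print(fits("anthropic-sonnet-tier", document))  # True
```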

Speed (latency)

Plain English: How long you wait before useful output starts showing up.

Technical view: Time to first token and tokens per second both matter. Smaller models are usually faster, and streaming helps perceived latency even when total runtime stays the same.
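Both numbers are easy to measure if your client exposes the response as a stream of chunks. The sketch below times any iterable of text chunks. `fake_stream` is a hypothetical stand-in that sleeps instead of calling an API, and note that a streamed chunk is not always exactly one token, so treat the throughput figure as approximate.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream; report time to first chunk and chunk throughput."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()  # first useful output arrived
        count += 1
    end = time.perf_counter()

    ttft = (first - start) if first is not None else None
    # Throughput over the generation phase only (after the first chunk).
    gen_time = (end - first) if first is not None else 0.0
    rate = (count - 1) / gen_time if (count > 1 and gen_time > 0) else None
    return {
        "ttft_s": ttft,
        "chunks": count,
        "chunks_per_s": rate,
        "total_s": end - start,
    }


def fake_stream(n: int, first_delay: float, per_chunk: float) -> Iterator[str]:
    """Stand-in for a streaming response (hypothetical; no network calls)."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(per_chunk)
        yield "tok"


stats = measure_stream(fake_stream(n=20, first_delay=0.05, per_chunk=0.005))
print(stats)
```

Running the same measurement against two candidate models with the same prompt gives you a like-for-like latency comparison.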

Cost

Plain English: Input and output tokens add up fast.

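Technical view: Per-token prices can easily differ by 100x between the cheapest and most expensive options. Moving from a fast-tier model to GPT-4o to o1 is a large jump at each step: at the illustrative rates in the quick reference table, GPT-4o costs roughly 17x as much per token as GPT-4o-mini.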
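To see how quickly tokens add up, it helps to run the arithmetic for your own traffic. The sketch below uses the illustrative GPT-4o and GPT-4o-mini rates from this guide's quick reference table. The workload (10,000 requests per day, 1,500 input and 500 output tokens each) is a made-up example, and real prices should come from the provider's pricing page.

```python
# Illustrative USD prices per 1M tokens as (input, output),
# copied from this guide's quick reference table. Not live prices.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def monthly_cost(model: str, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Cost in USD of a steady workload over `days` days."""
    return request_cost(model, input_tokens, output_tokens) * requests_per_day * days


# Hypothetical workload: 10,000 requests/day, 1,500 tokens in, 500 tokens out.
for model in PRICES:
    print(model, round(monthly_cost(model, 10_000, 1_500, 500), 2))
```

For this workload the totals come to about $2,625 per month on GPT-4o and about $157.50 on GPT-4o-mini, the same 17x ratio as the per-token prices.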

Privacy / data residency

Plain English: Decide whether your data can leave your machine or your region.

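Technical view: Cloud APIs process your data on the provider's servers by definition, and many providers retain requests for some period unless a zero-retention or enterprise agreement says otherwise. Check each provider's retention and training-use policy before sending sensitive data. Local models keep inference on your own hardware.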

Availability

Plain English: The best model on paper is not helpful if you cannot reliably call it.

Technical view: Rate limits, uptime, SLAs, and regional restrictions are part of model selection too, especially once a prototype becomes a real workflow.
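One common way to absorb rate limits and short outages is to retry with exponential backoff and then fall back to a second model. The sketch below shows the pattern only. `TransientError` is a hypothetical stand-in for whatever rate-limit or timeout exception your client library raises, and each "model" is just a function from prompt to text.

```python
import random
import time
from typing import Callable, Optional, Sequence


class TransientError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or timeout error."""


def call_with_fallback(models: Sequence[Callable[[str], str]],
                       prompt: str,
                       retries: int = 3,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each model in order, retrying transient failures with backoff."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(retries):
            try:
                return model(prompt)
            except TransientError as err:
                last_error = err
                if attempt < retries - 1:
                    # Exponential backoff with jitter before the next attempt.
                    sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    raise RuntimeError("all models failed") from last_error
```

The `sleep` parameter is injectable so the function can be tested without real delays. In production you would also cap total wait time and log which model actually served the request.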

Section 2

Cloud vs local

This is one of the first real forks in the road.

Cloud

Plain English: Best capability, no hardware setup, and you pay as you go.

Technical view: OpenAI, Anthropic, and Google currently offer the strongest frontier capability for most general-purpose work. The tradeoff is that data leaves your machine and billing scales with usage.

Local

Plain English: More private, effectively free per request once the hardware is paid for, and bounded by what your machine can run.

Technical view: Ollama and llama.cpp make local inference approachable. You are limited by RAM or VRAM, but the capability gap is closing fast enough that a good 7B-13B model is already sufficient for many tasks.
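A back-of-the-envelope way to check the RAM or VRAM limit is to multiply parameter count by bytes per weight. The sketch below does that arithmetic. The 1.2 overhead factor for KV cache and runtime buffers is an assumption for illustration, and actual usage depends on context length, quantization format, and the runtime.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_memory(params_billions: float, bits_per_weight: float,
                   available_gb: float, overhead: float = 1.2) -> bool:
    """Rough fit check. `overhead` is an assumed allowance, not a measured value."""
    return weight_memory_gb(params_billions, bits_per_weight) * overhead <= available_gb


print(weight_memory_gb(7, 4))    # 3.5  -> a 7B model at 4-bit
print(weight_memory_gb(70, 4))   # 35.0 -> a 70B model at 4-bit
print(weight_memory_gb(70, 16))  # 140.0 -> a 70B model at 16-bit

print(fits_in_memory(7, 4, available_gb=8))    # True
print(fits_in_memory(70, 4, available_gb=24))  # False
```

This is why 7B-13B models are the usual starting point on consumer hardware, while a 70B model generally needs a high-memory workstation or multiple GPUs even when quantized.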

The gap between frontier cloud models and the best local models like Llama 3.3 70B, Qwen 2.5 72B, and Mistral Large is still real, but it is shrinking. For a lot of classification, extraction, summarization, and drafting work, local is no longer just a toy.

Section 3

A practical decision framework

If you want a starting point instead of a taxonomy, use this.

Prototyping

Pick: GPT-4o-mini or an Anthropic fast-tier model

Cheap, fast, and usually good enough to validate whether the workflow is even worth building.

Production, high quality

Pick: GPT-4o or an Anthropic Sonnet-tier model

The current sweet spot for many teams: strong quality without jumping all the way to the most expensive reasoning tier.

Reasoning and math

Pick: o1 or o3-mini

More expensive, but often worth it when you need deliberate multi-step reasoning rather than quick pattern matching.

High volume, cost-sensitive

Pick: an Anthropic fast-tier model, GPT-4o-mini, or a local Llama 3.3 70B setup

When throughput matters, shaving cost per token can matter more than squeezing out the last few points of benchmark quality.

Private data

Pick: local models only

Ollama or llama.cpp (Ollama builds on llama.cpp) running a Llama, Qwen, or Mistral family model is the cleanest answer when the data should not leave your hardware.

Long documents

Pick: an Anthropic Sonnet-tier model or Google's long-context Gemini family

Anthropic's 200K context is already useful; Google's long-context Gemini line is better suited when the whole problem is large-context recall.
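The framework above can be written down as a handful of ordered rules. The sketch below is one possible encoding. The `Requirements` fields and the priority order (privacy first, then long documents, reasoning, volume, production) are assumptions made for illustration, since the guide does not rank the cases against each other.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical flags describing a workload."""
    private_data: bool = False
    long_documents: bool = False
    heavy_reasoning: bool = False
    high_volume: bool = False
    production: bool = False


def recommend(req: Requirements) -> str:
    """Return the guide's suggested pick. Rule order is an assumption."""
    # Hard constraint first: private data rules out hosted APIs.
    if req.private_data:
        return "local model (Llama, Qwen, or Mistral family)"
    if req.long_documents:
        return "Anthropic Sonnet-tier model or Google's long-context Gemini family"
    if req.heavy_reasoning:
        return "o1 or o3-mini"
    if req.high_volume:
        return "Anthropic fast-tier model, GPT-4o-mini, or local Llama 3.3 70B"
    if req.production:
        return "GPT-4o or an Anthropic Sonnet-tier model"
    # Default: prototyping.
    return "GPT-4o-mini or an Anthropic fast-tier model"


print(recommend(Requirements()))
print(recommend(Requirements(production=True)))
print(recommend(Requirements(private_data=True, heavy_reasoning=True)))
```

The last call shows why order matters: when privacy and reasoning conflict, this encoding lets the hard constraint win and returns a local model.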

Section 4

Model families quick reference

Important: Prices change frequently, and this table is not a live price sheet. Treat the figures as illustrative order-of-magnitude examples for rough comparison, then confirm against the official pricing pages from OpenAI, Anthropic, Google, or the relevant provider before making budget decisions.

Model family	Context window (tokens)	Approx. cost (USD per 1M tokens)	Primary strength
GPT-4o	128K	≈ $2.50 input / $10 output	Strong general-purpose quality across chat, code, and multimodal work
GPT-4o-mini	128K	≈ $0.15 input / $0.60 output	Cheap, fast baseline for prototypes and volume
o1	~200K	premium-tier pricing (see provider page)	Deliberate reasoning on harder math and planning tasks
Anthropic Sonnet-tier family	200K	illustrative mid-tier hosted pricing	High-quality writing, coding, and long-context work
Anthropic fast-tier family	200K	illustrative low-cost hosted pricing	Fast, inexpensive high-volume usage
Google Gemini long-context family	1M+ (varies by model)	illustrative competitive hosted pricing	Very large context window tasks
Llama 3.3 70B	128K	Local hardware cost only	Best local balance of privacy and capability
Mistral Large	128K	≈ $2 input / $6 output	Strong open-ish ecosystem fit with good general capability

Section 5

What the benchmarks actually measure

Benchmarks are useful, but they are not your product.

MMLU measures knowledge breadth through multiple-choice questions across many subjects. HumanEval measures Python code generation on short, self-contained functions. MATH measures competition-style mathematical problem solving. Those numbers do correlate with general capability, but they do not reliably predict how well a model will handle your extraction pipeline, your support workflow, or your internal docs assistant.

Build your own evals whenever you can. A small task-specific test set will usually tell you more than another round of leaderboard browsing. See also the tool comparison page for side-by-side tool tradeoffs.
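For an extraction pipeline, a task-specific eval can be as simple as comparing extracted fields against hand-labeled answers. The sketch below scores field-level accuracy on JSON output. The two invoice examples and `stub_extractor` are invented for illustration, and the stub stands in for a real model call.

```python
import json
import re
from typing import Callable, Dict, Sequence, Tuple

Case = Tuple[str, Dict[str, object]]  # (input text, expected fields)


def field_accuracy(model: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Share of expected fields the model got exactly right across all cases."""
    correct = total = 0
    for text, expected in cases:
        try:
            got = json.loads(model(text))
        except json.JSONDecodeError:
            got = {}  # unparseable output counts as all fields wrong
        if not isinstance(got, dict):
            got = {}
        for key, value in expected.items():
            total += 1
            correct += int(got.get(key) == value)
    return correct / total


def stub_extractor(text: str) -> str:
    """Stand-in for a model call (hypothetical). Misses 'total due: N' phrasing."""
    invoice = re.search(r"INV-\d+", text)
    total = re.search(r"total (\d+)", text)
    return json.dumps({
        "invoice": invoice.group(0) if invoice else None,
        "total": int(total.group(1)) if total else None,
    })


cases = [
    ("Invoice INV-101, total 250 USD", {"invoice": "INV-101", "total": 250}),
    ("Invoice INV-102, total due: 90 USD", {"invoice": "INV-102", "total": 90}),
]
print(field_accuracy(stub_extractor, cases))  # 0.75
```

Even two cases surface a concrete failure (the "total due:" phrasing) that no public leaderboard would reveal. A few dozen cases drawn from your real inputs is usually a reasonable starting size.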

Next move

Try the decision logic in a small script.

Lab 17: Model Selector turns these tradeoffs into a simple scoring tool. Change the privacy, cost, speed, and capability requirements, then watch the recommendation move.