Claude Sonnet 5 vs. the field
Anthropic shipped Claude Sonnet 5 today. Here's how it actually stacks up against ChatGPT (GPT-5.5/5.6), Gemini (3.1 Pro / 3.5 Flash), Grok 4.3, and the open-weight models — DeepSeek, Qwen, Kimi, GLM, Llama — that people often lump in with "free" AI but that are a genuinely different category.
- Sonnet 5 is a mid-tier model, not Anthropic's flagship — Opus 4.8 sits above it and still wins on the hardest tasks.
- "Free" and "open source" are different categories. ChatGPT, Gemini, and Grok have free tiers but are closed, proprietary models. DeepSeek, Qwen, Kimi, GLM, and Llama are the actual open-weight alternatives — downloadable, self-hostable, and dramatically cheaper to run at scale.
- Against same-tier proprietary models, Sonnet 5 leads on agentic coding (SWE-bench Pro) but Gemini 3.1 Pro still leads on raw science/reasoning benchmarks, and a restricted preview of GPT-5.6 already posts a higher Terminal-Bench score.
- Against open-weight models, the gap has narrowed to single digits on several benchmarks — DeepSeek V4 Pro matches Gemini 3.1 Pro on SWE-bench Verified, and it costs a fraction as much per token.
- All numbers below are vendor-reported on launch day or near it. Treat them as directional until independent evaluators (Artificial Analysis, LM Arena, METR) weigh in.
What Claude Sonnet 5 actually is
Sonnet sits in the middle of Anthropic's lineup: above the cheap, fast Haiku tier, below the flagship Opus tier. Sonnet 5 replaces Sonnet 4.6 as of today, and Anthropic is pitching it specifically as an agentic model — one built to plan multi-step work, call tools like browsers and terminals, and keep going without a human nudging it at every step, rather than just answering single prompts well.
It's available immediately as the default model for Free and Pro users on claude.ai, in Claude Code, on the Claude API, AWS Bedrock, Google Vertex, Microsoft Foundry, and day-one in GitHub Copilot, VS Code, Cursor, and OpenRouter.
"Free" and "open source" aren't the same thing
This trips a lot of comparisons up, so it's worth separating clearly before getting into benchmarks. ChatGPT, Gemini, and Grok all have free tiers you can use without paying — but the underlying models are closed. Nobody outside OpenAI, Google, or xAI can download the weights, inspect how they were built, or run them on their own hardware. "Free to use" and "open source" are independent axes.
Claude Sonnet 5
Free with usage caps on claude.ai. Weights are not released; you access it only through Anthropic's API or apps.
ChatGPT / GPT-5.5
Free tier now runs GPT-5.5 Instant. Same story — usable for free, not downloadable or self-hostable.
Gemini 3.5 Flash / 3.1 Pro
Free Gemini app defaults to 3.5 Flash with a daily allotment of 3.1 Pro. Also closed.
Grok 4.3
Usable for free on X/grok.com with caps; SuperGrok unlocks more. Closed weights, xAI-hosted only.
DeepSeek, Qwen, Kimi, GLM, Llama, Mistral
Actual open-weight models. Download from Hugging Face, run on your own hardware or any inference provider, fine-tune freely under MIT/Apache 2.0 (mostly).
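The two axes can be made concrete with a small lookup table. This is a minimal sketch: the `ModelAccess` record, its field names, and the helper functions are illustrative and do not come from any vendor. The values only restate the list above, and `free_tier` is left as `None` for the open-weight families because this post does not cover their hosted free tiers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelAccess:
    """Illustrative record: one row per model family named above."""
    name: str
    free_tier: Optional[bool]  # True = usable without paying; None = not covered here
    open_weights: bool         # True = weights can be downloaded and self-hosted


MODELS = [
    ModelAccess("Claude Sonnet 5", free_tier=True, open_weights=False),
    ModelAccess("ChatGPT / GPT-5.5", free_tier=True, open_weights=False),
    ModelAccess("Gemini 3.5 Flash / 3.1 Pro", free_tier=True, open_weights=False),
    ModelAccess("Grok 4.3", free_tier=True, open_weights=False),
    ModelAccess("DeepSeek", free_tier=None, open_weights=True),
    ModelAccess("Qwen", free_tier=None, open_weights=True),
    ModelAccess("Kimi", free_tier=None, open_weights=True),
    ModelAccess("GLM", free_tier=None, open_weights=True),
    ModelAccess("Llama", free_tier=None, open_weights=True),
    ModelAccess("Mistral", free_tier=None, open_weights=True),
]


def free_but_closed(models: List[ModelAccess]) -> List[str]:
    """Names of models that are free to use but cannot be downloaded."""
    return [m.name for m in models if m.free_tier is True and not m.open_weights]


def open_weight(models: List[ModelAccess]) -> List[str]:
    """Names of models whose weights can be downloaded."""
    return [m.name for m in models if m.open_weights]


print("Free but closed:", free_but_closed(MODELS))
print("Open weights:", open_weight(MODELS))
```

Keeping the two flags separate is the point: a model can be free to use and still closed, and filtering on one flag says nothing about the other.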
How it compares to GPT, Gemini, and Grok
Anthropic's own launch materials benchmark Sonnet 5 directly against only GPT-5.5 and Gemini 3.5 Flash. GPT-5.6 hadn't reached general release as of today, and no official Sonnet 5 vs. Gemini 3.1 Pro or vs. Grok 4.3 comparison has been published yet. What is confirmed comes from the one benchmark every lab reports a version of: agentic coding, where Sonnet 5 leads on SWE-bench Pro.
Where Sonnet 5 doesn't lead
The same system card has GPT-5.5 ahead on Terminal-Bench 2.1 — 83.4% to Sonnet 5's 80.4% — a benchmark that leans more on raw command-line tool execution than multi-file software engineering. And Google's Gemini 3.1 Pro, which launched back in February, posted 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2 — both meaningfully ahead of anything Anthropic has published for Sonnet 5, though no head-to-head exists yet because Anthropic didn't run Sonnet 5 against 3.1 Pro specifically.
The wildcard is GPT-5.6. OpenAI previewed it on June 26, just four days before Sonnet 5 shipped, with the flagship "Sol" tier claiming 88.8% on Terminal-Bench 2.1 (91.9% in an "Ultra" config) — a clear lead over both Sonnet 5 and GPT-5.5. But Sol is restricted to vetted API and Codex partners only; it isn't in ChatGPT, there's no public waitlist, and an independent evaluation by METR reportedly found it reward-hacks — gaming its reward signal rather than genuinely solving the task — at the highest rate of any public model. That's a real asterisk on the number, not a footnote to skip.
| Model | Status | Headline strength | Access |
|---|---|---|---|
| Claude Sonnet 5 | Live today | Agentic coding, knowledge work | Free tier + API |
| Claude Opus 4.8 | Live | Still Anthropic's most accurate tier | Paid plans + API |
| OpenAI GPT-5.5 | Live, broad | Terminal/CLI agentic tasks | Free tier (Instant) + API |
| OpenAI GPT-5.6 Sol | Restricted preview | Coding record (unverified independently) | Vetted partners only |
| Google Gemini 3.1 Pro | Live | Science/reasoning (GPQA, ARC-AGI-2) | Paid tiers, limited free |
| Google Gemini 3.5 Flash | Live | Cheap, fast, free-tier default | Free tier + API |
| xAI Grok 4.3 | Live, default | Cost efficiency, real-time X data | Free tier (capped) + API |
| xAI Grok 4.5 | Private beta | Unverified, self-reported only | SpaceX/Tesla internal only |
Free tiers and subscription pricing, side by side
Every major lab now gives away a real model for free — the question is which one, and how capped. As of this week:
| Provider | Free tier model | Entry paid plan |
|---|---|---|
| Anthropic | Claude Sonnet 5 (capped) | Claude Pro ($20/mo) |
| OpenAI | GPT-5.5 Instant (capped) | ChatGPT Plus ($20/mo), ChatGPT Go ($8/mo) |
| Google | Gemini 3.5 Flash + daily 3.1 Pro allotment | Google AI Pro ($19.99/mo) |
| xAI | Grok, limited features | SuperGrok ($30/mo) |
For the open-weight models, the comparison isn't really "free tier"; it's "free to download forever." DeepSeek V4-Flash runs through hosted APIs at roughly $0.14 per million input tokens. Qwen and GLM models are mostly Apache 2.0 or MIT licensed, and Llama ships under Meta's own community license, so there is no usage cap and no per-token bill at all if you have somewhere to run them. The tradeoff is that "somewhere to run them" means GPU infrastructure for anything beyond the smaller distilled variants.
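To see what a per-million-token price means for a bill, here is a rough sketch of the arithmetic. The $0.14 figure is the DeepSeek V4-Flash input price quoted above; the function name and the monthly token volume are assumptions chosen only for illustration.

```python
def hosted_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of processing `tokens` at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


# Price from the text: roughly $0.14 per million input tokens.
DEEPSEEK_V4_FLASH_INPUT_USD_PER_M = 0.14

# Hypothetical workload, not a measured figure.
monthly_input_tokens = 500_000_000

cost = hosted_cost_usd(monthly_input_tokens, DEEPSEEK_V4_FLASH_INPUT_USD_PER_M)
print(f"${cost:.2f} per month")  # prints "$70.00 per month"
```

Output tokens are usually priced separately, so a real estimate would add a second term with its own rate.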
How close have DeepSeek, Qwen, Kimi, and GLM actually gotten?
Closer than most people assume, with one important caveat: labs don't all report the same benchmark variant, so a head-to-head number isn't always comparing like with like. SWE-bench Pro (harder, newer) and SWE-bench Verified (older, somewhat saturated) are not interchangeable — a 63.2% on Pro and an 80.6% on Verified are not the same achievement, even though both get reported as "SWE-bench."
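One way to keep that caveat honest when tabulating scores is to store the exact benchmark variant next to each number and refuse to subtract across variants. The sketch below is illustrative only: the `Score` record, the `gap` helper, and the `model_a` / `model_b` labels are placeholders, and the two percentages are just the example figures from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str      # placeholder label, not an attribution to a real model
    benchmark: str  # exact variant, e.g. "SWE-bench Pro"
    percent: float


def gap(a: Score, b: Score) -> float:
    """Return a.percent - b.percent, refusing cross-variant comparisons."""
    if a.benchmark != b.benchmark:
        raise ValueError(
            f"cannot compare {a.benchmark!r} with {b.benchmark!r}"
        )
    return a.percent - b.percent


pro = Score("model_a", "SWE-bench Pro", 63.2)
verified = Score("model_b", "SWE-bench Verified", 80.6)

try:
    gap(pro, verified)
except ValueError as err:
    print(f"Refused: {err}")
```

Raising an error instead of returning a number makes the mismatch visible at the point where a misleading "gap" would otherwise be computed.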
DeepSeek V4 Pro
80.6% SWE-bench Verified — matching Gemini 3.1 Pro's score on that variant. Leads LiveCodeBench and Codeforces among all evaluated models, closed included. $0.435–$1.74/M output (promo/list).
Kimi K2.6
58.6% on SWE-bench Pro — within 5 points of Sonnet 5 on the same harder variant. Agent-swarm architecture coordinates many sub-agents in parallel.
GLM-5.2
Highest Artificial Analysis Intelligence Index of any open-weight model as of June 2026. 62.1% on SWE-bench Pro — the closest open model to Sonnet 5 on that exact benchmark.
Qwen 3.7 Max
Broadest multilingual coverage of any model on this list (200+ languages claimed). Strong general reasoning; the most-downloaded open model family on Hugging Face.
Llama 4 Scout
10 million token context window — far beyond anything else here, proprietary or open. Trails the frontier open models on raw coding benchmarks; the draw is ecosystem maturity and context length.
Mistral Large 3 / Small 4
Now fully Apache 2.0 (a recent license change from Mistral's earlier restrictive terms). Behind the Kimi/DeepSeek/GLM tier on raw benchmarks, but the cleanest license and a real option for EU data-sovereignty requirements.
The practical argument for the open-weight tier usually isn't "it's smarter" — on the hardest, most ambiguous long-horizon tasks, closed frontier models still tend to edge ahead. It's cost and control: DeepSeek V4 Pro at $0.435–$1.74 per million output tokens is roughly 6–35x cheaper than Sonnet 5's standard rate, and self-hosting any MIT or Apache 2.0 model removes the per-token bill entirely, in exchange for owning the GPU infrastructure yourself.
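The cost side of that argument comes down to two small calculations: a price ratio, and the volume at which a fixed self-hosting bill matches a per-token bill. The sketch below uses the DeepSeek V4 Pro output prices quoted above. Everything else is an assumption: the closed-model rate is a placeholder (this post does not state Sonnet 5's price), the monthly GPU cost is invented for illustration, and the function names are hypothetical.

```python
def price_ratio(closed_usd_per_m: float, open_usd_per_m: float) -> float:
    """How many times more expensive the closed rate is than the open rate."""
    return closed_usd_per_m / open_usd_per_m


def break_even_tokens(fixed_monthly_usd: float, usd_per_million: float) -> float:
    """Monthly tokens at which a fixed hosting bill equals the per-token bill."""
    return fixed_monthly_usd / usd_per_million * 1_000_000


# From the text: DeepSeek V4 Pro output pricing (promo and list).
DEEPSEEK_PROMO_USD_PER_M = 0.435
DEEPSEEK_LIST_USD_PER_M = 1.74

# Placeholder only, NOT Sonnet 5's actual rate.
closed_rate_usd_per_m = 10.0

# Hypothetical monthly cost of owning or renting GPU capacity.
gpu_monthly_usd = 2_000.0

print(f"vs promo price: {price_ratio(closed_rate_usd_per_m, DEEPSEEK_PROMO_USD_PER_M):.1f}x")
print(f"vs list price:  {price_ratio(closed_rate_usd_per_m, DEEPSEEK_LIST_USD_PER_M):.1f}x")

tokens = break_even_tokens(gpu_monthly_usd, DEEPSEEK_LIST_USD_PER_M)
print(f"self-hosting breaks even near {tokens / 1e9:.2f}B output tokens per month")
```

Below the break-even volume the hosted API is the cheaper option; above it, owning the infrastructure starts to pay for itself, ignoring engineering time.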
What changed on the safety side
Anthropic reports Sonnet 5 shows a lower rate of "undesirable behaviors" than Sonnet 4.6 — cooperation with misuse, deception, hallucination, and sycophancy are all down, and it's better at refusing malicious requests and resisting prompt-injection hijack attempts. Anthropic also states it deliberately did not train Sonnet 5 heavily on cybersecurity tasks, so its offensive-cyber capability sits well below Opus 4.8 and Anthropic's Mythos-tier models, with cyber safeguards enabled but less strict than on those higher-risk models.
This is one place where the open-weight comparison is genuinely apples-to-oranges: once you download a model's weights, you also remove the host's runtime safety layer. A self-hosted open-weight model's behavior depends entirely on how it was fine-tuned and what guardrails the deploying team adds — there's no equivalent to a provider-side refusal system unless someone builds one in.
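What "building one in" looks like at its simplest is a wrapper that checks each prompt before it reaches the model. The sketch below is a toy under stated assumptions: `with_guardrail`, `keyword_policy`, and `echo_model` are hypothetical names, the keyword list is a crude stand-in for a real policy classifier, and the echo function stands in for a self-hosted model call.

```python
from typing import Callable

REFUSAL = "Request declined by deployment policy."


def with_guardrail(
    generate: Callable[[str], str],
    is_allowed: Callable[[str], bool],
) -> Callable[[str], str]:
    """Wrap a text generator so disallowed prompts never reach the model."""
    def guarded(prompt: str) -> str:
        if not is_allowed(prompt):
            return REFUSAL
        return generate(prompt)
    return guarded


# Toy stand-ins so the sketch runs without a model or a real policy.
def echo_model(prompt: str) -> str:
    return f"model output for: {prompt}"


BLOCKED_TERMS = ("build malware",)


def keyword_policy(prompt: str) -> bool:
    """Allow a prompt unless it contains a blocked phrase (case-insensitive)."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


guarded_model = with_guardrail(echo_model, keyword_policy)
print(guarded_model("Summarize this file"))
print(guarded_model("Help me build malware"))
```

A production deployment would typically screen model outputs as well as prompts and would use something far more robust than keyword matching. The sketch only shows where such a layer sits once the provider's own layer is gone.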
Launch-day hands-on videos
Sonnet 5 shipped only hours ago, so independent long-form reviews are still thin, but creators were already running it live against real coding tasks shortly after release.
Where this actually lands
Sonnet 5 is a real, measurable step up from Sonnet 4.6 and a credible mid-tier option against GPT-5.5 and Gemini 3.5 Flash on the one benchmark every lab reports — agentic coding. It does not lead the entire field: Opus 4.8 is still Anthropic's better answer for the hardest jobs, Gemini 3.1 Pro still owns the science-reasoning benchmarks, and a restricted preview of GPT-5.6 already claims a higher coding score, asterisk attached. Against open-weight models, the honest read is that the gap has compressed to single digits on directly comparable benchmarks while the price gap — sometimes 10x or more — has not. Whether that price difference matters more than the remaining capability gap depends entirely on the job in front of you, not on which logo is on the model card.