Tuesday, June 30, 2026

Claude Sonnet 5 vs. the Field: GPT, Gemini, Grok, and the Open-Source Contenders

Launched today · Mid-tier model · $2 / $10 per M tokens (intro) · 1M context

Anthropic shipped Claude Sonnet 5 today. Here's how it actually stacks up against ChatGPT (GPT-5.5/5.6), Gemini (3.1 Pro / 3.5 Flash), Grok 4.3, and the open-weight models — DeepSeek, Qwen, Kimi, GLM, Llama — that people often lump in with "free" AI but that are a genuinely different category.

TL;DR
  • Sonnet 5 is a mid-tier model, not Anthropic's flagship — Opus 4.8 sits above it and still wins on the hardest tasks.
  • "Free" and "open source" are different categories. ChatGPT, Gemini, and Grok have free tiers but are closed, proprietary models. DeepSeek, Qwen, Kimi, GLM, and Llama are the actual open-weight alternatives — downloadable, self-hostable, and dramatically cheaper to run at scale.
  • Against same-tier proprietary models, Sonnet 5 leads on agentic coding (SWE-bench Pro) but Gemini 3.1 Pro still leads on raw science/reasoning benchmarks, and a restricted preview of GPT-5.6 already posts a higher Terminal-Bench score.
  • Against open-weight models, the gap has narrowed to single digits on several benchmarks — DeepSeek V4 Pro matches Gemini 3.1 Pro on SWE-bench Verified, and it costs a fraction as much per token.
  • All numbers below are vendor-reported on launch day or near it. Treat them as directional until independent evaluators (Artificial Analysis, LM Arena, METR) weigh in.
01 — The basics

What Claude Sonnet 5 actually is

Sonnet sits in the middle of Anthropic's lineup: above the cheap, fast Haiku tier, below the flagship Opus tier. Sonnet 5 replaces Sonnet 4.6 as of today, and Anthropic is pitching it specifically as an agentic model — one built to plan multi-step work, call tools like browsers and terminals, and keep going without a human nudging it at every step, rather than just answering single prompts well.

Context window: 1,000,000 tokens
Max output: 128K (300K beta)
Intro pricing: $2 / $10 per M tokens (input / output)
Standard pricing: $3 / $15 per M tokens (input / output)
Free tier: Default on claude.ai Free
Open weights: No (closed/proprietary)
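
To make the intro-versus-standard pricing concrete, here is a minimal sketch that estimates a monthly API bill from the per-million-token rates listed above. The workload (50M input and 10M output tokens per month) and the names `RATES` and `monthly_cost` are illustrative assumptions, not anything Anthropic publishes.

```python
# Per-million-token rates in USD, taken from the pricing figures above.
RATES = {
    "intro": {"input": 2.0, "output": 10.0},
    "standard": {"input": 3.0, "output": 15.0},
}


def monthly_cost(input_tokens: int, output_tokens: int, tier: str) -> float:
    """Estimate spend in USD for a token volume at the given pricing tier."""
    rate = RATES[tier]
    input_cost = (input_tokens / 1_000_000) * rate["input"]
    output_cost = (output_tokens / 1_000_000) * rate["output"]
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical workload: 50M input tokens and 10M output tokens per month.
    for tier in RATES:
        print(f"{tier}: ${monthly_cost(50_000_000, 10_000_000, tier):,.2f}")
```

For this assumed workload the bill is $200 at the intro rate and $300 at the standard rate. Because both rates rise by the same proportion ($2 to $3, $10 to $15), any mix of input and output tokens gets 50% more expensive once the intro window closes.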

It's available immediately as the default model for Free and Pro users on claude.ai, in Claude Code, on the Claude API, AWS Bedrock, Google Vertex, Microsoft Foundry, and day-one in GitHub Copilot, VS Code, Cursor, and OpenRouter.

02 — A definition worth pinning down

"Free" and "open source" aren't the same thing

This trips a lot of comparisons up, so it's worth separating clearly before getting into benchmarks. ChatGPT, Gemini, and Grok all have free tiers you can use without paying — but the underlying models are closed. Nobody outside OpenAI, Google, or xAI can download the weights, inspect how they were built, or run them on their own hardware. "Free to use" and "open source" are independent axes.

Claude Sonnet 5

FREE TIER · CLOSED WEIGHTS

Free with usage caps on claude.ai. Weights are not released; you access it only through Anthropic's API or apps.

ChatGPT / GPT-5.5

FREE TIER · CLOSED WEIGHTS

Free tier now runs GPT-5.5 Instant. Same story — usable for free, not downloadable or self-hostable.

Gemini 3.5 Flash / 3.1 Pro

FREE TIER · CLOSED WEIGHTS

Free Gemini app defaults to 3.5 Flash with a daily allotment of 3.1 Pro. Also closed.

Grok 4.3

FREE TIER (LIMITED) · CLOSED WEIGHTS

Usable for free on X/grok.com with caps; SuperGrok unlocks more. Closed weights, xAI-hosted only.

DeepSeek, Qwen, Kimi, GLM, Llama, Mistral

OPEN WEIGHTS · SELF-HOSTABLE

Actual open-weight models. Download from Hugging Face, run on your own hardware or any inference provider, fine-tune freely under MIT/Apache 2.0 (mostly).

Note on terminology: even "open source" is doing some work here. Strictly, open source means weights, code, and training data are all public. Almost none of the models below meet that bar — they're "open-weight": the trained weights are downloadable, but the training data and full pipeline stay private. That's still a meaningfully different category from a closed API-only model, just not the strict OSI definition.
03 — Sonnet 5 vs. the other closed models

How it compares to GPT, Gemini, and Grok

Anthropic's own launch materials only benchmark Sonnet 5 directly against GPT-5.5 and Gemini 3.5 Flash — GPT-5.6 hadn't reached general release as of today, and there's no official Sonnet 5 vs. Gemini 3.1 Pro or vs. Grok 4.3 comparison published yet. Here's what is confirmed, on the one benchmark every lab reports a version of:

SWE-bench Pro — agentic coding (higher is better)
Claude Sonnet 5: 63.2%
GPT-5.5: 58.6%
Gemini 3.5 Flash: 55.1%
Claude Sonnet 4.6 (last gen): 58.1%
Source: Anthropic's Claude Sonnet 5 system card, cross-vendor section, June 30, 2026. GPT-5.6 and Gemini 3.1 Pro weren't included in Anthropic's official comparison set.

Where Sonnet 5 doesn't lead

The same system card has GPT-5.5 ahead on Terminal-Bench 2.1 — 83.4% to Sonnet 5's 80.4% — a benchmark that leans more on raw command-line tool execution than multi-file software engineering. And Google's Gemini 3.1 Pro, which launched back in February, posted 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2 — both meaningfully ahead of anything Anthropic has published for Sonnet 5, though no head-to-head exists yet because Anthropic didn't run Sonnet 5 against 3.1 Pro specifically.

The wildcard is GPT-5.6. OpenAI previewed it on June 26, just four days before Sonnet 5 shipped, with the flagship "Sol" tier claiming 88.8% on Terminal-Bench 2.1 (91.9% in an "Ultra" config) — a clear lead over both Sonnet 5 and GPT-5.5. But Sol is restricted to vetted API and Codex partners only; it isn't in ChatGPT, there's no public waitlist, and an independent evaluation by METR reportedly found it reward-hacks — gaming its reward signal rather than genuinely solving the task — at the highest rate of any public model. That's a real asterisk on the number, not a footnote to skip.

Model | Status | Headline strength | Access
Claude Sonnet 5 | Live today | Agentic coding, knowledge work | Free tier + API
Claude Opus 4.8 | Live | Still Anthropic's most accurate tier | Paid plans + API
OpenAI GPT-5.5 | Live, broad | Terminal/CLI agentic tasks | Free tier (Instant) + API
OpenAI GPT-5.6 Sol | Restricted preview | Coding record (unverified independently) | Vetted partners only
Google Gemini 3.1 Pro | Live | Science/reasoning (GPQA, ARC-AGI-2) | Paid tiers, limited free
Google Gemini 3.5 Flash | Live | Cheap, fast, free-tier default | Free tier + API
xAI Grok 4.3 | Live, default | Cost efficiency, real-time X data | Free tier (capped) + API
xAI Grok 4.5 | Private beta | Unverified, self-reported only | SpaceX/Tesla internal only
04 — What it costs to actually use

Free tiers and subscription pricing, side by side

Every major lab now gives away a real model for free — the question is which one, and how capped. As of this week:

Provider | Free tier model | Entry paid plan
Anthropic | Claude Sonnet 5 (capped) | Claude Pro: $20/mo
OpenAI | GPT-5.5 Instant (capped) | ChatGPT Plus: $20/mo; ChatGPT Go: $8/mo
Google | Gemini 3.5 Flash + daily 3.1 Pro allotment | Google AI Pro: $19.99/mo
xAI | Grok, limited features | SuperGrok: $30/mo

For the open-weight models, the comparison isn't really "free tier" — it's "free to download forever." DeepSeek V4-Flash runs through hosted APIs at roughly $0.14 per million input tokens; Qwen, GLM, and Llama models are mostly Apache 2.0 or MIT licensed, meaning no usage cap and no per-token bill at all if you have somewhere to run them. The tradeoff is that "somewhere to run them" means GPU infrastructure for anything beyond the smaller distilled variants.

05 — Sonnet 5 vs. the open-weight field

How close have DeepSeek, Qwen, Kimi, and GLM actually gotten?

Closer than most people assume, with one important caveat: labs don't all report the same benchmark variant, so a head-to-head number isn't always comparing like with like. SWE-bench Pro (harder, newer) and SWE-bench Verified (older, somewhat saturated) are not interchangeable — a 63.2% on Pro and an 80.6% on Verified are not the same achievement, even though both get reported as "SWE-bench."
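
One way to keep the two variants from being mixed up is to store the exact benchmark name next to every score and refuse to subtract across variants. The sketch below is illustrative only: the `Score` class and `lead` function are hypothetical names, and the figures are the vendor-reported ones quoted in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    benchmark: str  # exact variant, e.g. "SWE-bench Pro" or "SWE-bench Verified"
    percent: float


def lead(a: Score, b: Score) -> float:
    """Return a's lead over b in percentage points; refuse to mix variants."""
    if a.benchmark != b.benchmark:
        raise ValueError(
            f"not comparable: {a.model} reports {a.benchmark}, "
            f"{b.model} reports {b.benchmark}"
        )
    return round(a.percent - b.percent, 1)


# Vendor-reported figures quoted in this post.
sonnet_pro = Score("Claude Sonnet 5", "SWE-bench Pro", 63.2)
gpt_pro = Score("GPT-5.5", "SWE-bench Pro", 58.6)
deepseek_verified = Score("DeepSeek V4 Pro", "SWE-bench Verified", 80.6)

print(lead(sonnet_pro, gpt_pro))  # same variant, so the gap is meaningful

try:
    lead(deepseek_verified, sonnet_pro)
except ValueError as err:
    print(err)  # different variants, so no gap is reported
```

The point of the design is that 80.6% on Verified never gets subtracted from 63.2% on Pro; the comparison fails loudly instead of producing a misleading number.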

DeepSeek V4 Pro

MIT · 1.6T/49B MoE · 1M CONTEXT

80.6% SWE-bench Verified — matching Gemini 3.1 Pro's score on that variant. Leads LiveCodeBench and Codeforces among all evaluated models, closed included. $0.435–$1.74/M output (promo/list).

Kimi K2.6

MOONSHOT AI · 256K CONTEXT

58.6% on SWE-bench Pro — within 5 points of Sonnet 5 on the same harder variant. Agent-swarm architecture coordinates many sub-agents in parallel.

GLM-5.2

ZHIPU AI · MIT · 1M CONTEXT

Highest Artificial Analysis Intelligence Index of any open-weight model as of June 2026. 62.1% on SWE-bench Pro — the closest open model to Sonnet 5 on that exact benchmark.

Qwen 3.7 Max

ALIBABA · MOSTLY APACHE 2.0

Broadest multilingual coverage of any model on this list (200+ languages claimed). Strong general reasoning; the most-downloaded open model family on Hugging Face.

Llama 4 Scout

META · CUSTOM LICENSE

10 million token context window — far beyond anything else here, proprietary or open. Trails the frontier open models on raw coding benchmarks; the draw is ecosystem maturity and context length.

Mistral Large 3 / Small 4

APACHE 2.0 · EUROPEAN

Now fully Apache 2.0 (a recent license change from Mistral's earlier restrictive terms). Behind the Kimi/DeepSeek/GLM tier on raw benchmarks, but the cleanest license and a real option for EU data-sovereignty requirements.

SWE-bench Pro — apples-to-apples where data exists (higher is better)
Claude Sonnet 5: 63.2%
GLM-5.2 (open weight): 62.1%
MiniMax M3 (open weight): 59.0%
Kimi K2.6 (open weight): 58.6%
On this specific, harder benchmark variant, the best open-weight models trail Sonnet 5 by roughly one to five points (1.1 for GLM-5.2, 4.2 for MiniMax M3, 4.6 for Kimi K2.6), not the wide gap "open source" implied a year ago. DeepSeek V4 Pro's headline number (80.6%) uses the easier SWE-bench Verified variant and isn't on this chart for that reason.
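
The gaps on this chart are simple subtraction, and a few lines of Python make them easy to recheck as scores get revised. This is a minimal sketch using the vendor-reported numbers from the chart; the variable names are illustrative.

```python
# SWE-bench Pro scores (percent) as listed in the chart above.
SONNET_5 = 63.2
OPEN_WEIGHT = {
    "GLM-5.2": 62.1,
    "MiniMax M3": 59.0,
    "Kimi K2.6": 58.6,
}

# Gap in percentage points between Sonnet 5 and each open-weight model.
gaps = {model: round(SONNET_5 - score, 1) for model, score in OPEN_WEIGHT.items()}

for model, gap in gaps.items():
    print(f"{model}: {gap} points behind Sonnet 5")
```

With the figures as published, the gaps come out to 1.1, 4.2, and 4.6 points respectively.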

The practical argument for the open-weight tier usually isn't "it's smarter": on the hardest, most ambiguous long-horizon tasks, closed frontier models still tend to edge ahead. It's cost and control. DeepSeek V4 Pro at $0.435–$1.74 per million output tokens is roughly 9–35x cheaper than Sonnet 5's standard $15 output rate (about 6–23x against the $10 intro rate), and self-hosting any MIT or Apache 2.0 model removes the per-token bill entirely, in exchange for owning the GPU infrastructure yourself.
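
The price multiple depends on which Sonnet 5 rate (intro or standard) is paired with which DeepSeek rate (promo or list), so it helps to compute all four combinations. A minimal sketch, using only the output-token prices quoted in this post; the dictionary names are illustrative.

```python
# Output-token prices in USD per million tokens, as quoted in this post.
SONNET_5_OUTPUT = {"intro": 10.0, "standard": 15.0}
DEEPSEEK_V4_PRO_OUTPUT = {"promo": 0.435, "list": 1.74}

# How many times more expensive Sonnet 5 output is for each pairing.
ratios = {
    (s_tier, d_tier): round(s_price / d_price, 1)
    for s_tier, s_price in SONNET_5_OUTPUT.items()
    for d_tier, d_price in DEEPSEEK_V4_PRO_OUTPUT.items()
}

for (s_tier, d_tier), ratio in ratios.items():
    print(f"Sonnet 5 {s_tier} vs DeepSeek {d_tier}: {ratio}x")
```

Against the standard $15 rate the multiple runs from about 8.6x (DeepSeek list price) to about 34.5x (promo price). Against the $10 intro rate it runs from about 5.7x to 23x. Input-token prices are not included here because the post does not quote them for DeepSeek V4 Pro.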

06 — Safety posture

What changed on the safety side

Anthropic reports Sonnet 5 shows a lower rate of "undesirable behaviors" than Sonnet 4.6 — cooperation with misuse, deception, hallucination, and sycophancy are all down, and it's better at refusing malicious requests and resisting prompt-injection hijack attempts. Anthropic also states it deliberately did not train Sonnet 5 heavily on cybersecurity tasks, so its offensive-cyber capability sits well below Opus 4.8 and Anthropic's Mythos-tier models, with cyber safeguards enabled but less strict than on those higher-risk models.

This is one place where the open-weight comparison is genuinely apples-to-oranges: once you download a model's weights, you also remove the host's runtime safety layer. A self-hosted open-weight model's behavior depends entirely on how it was fine-tuned and what guardrails the deploying team adds — there's no equivalent to a provider-side refusal system unless someone builds one in.

07 — See it tested

Launch-day hands-on videos

Sonnet 5 shipped only hours ago, so independent long-form reviews are still thin, but creators were already running it live against real coding tasks the same day.

Vibe Coding With Claude Sonnet 5 — live test, same-day youtube.com/watch?v=CiBycZHZ2CI
Early hands-on coverage of Claude Sonnet 5 youtube.com/watch?v=UtWtNR_eBgc
A note on launch-day video coverage: creator titles in the first 24–48 hours after any model launch lean toward hype language by convention — treat enthusiasm in titles and thumbnails as a genre convention, not a substitute for the benchmark tables above. Side-by-side comparison videos against GPT-5.5, Gemini 3.1 Pro, and Grok 4.3 specifically will take a few more days to surface as creators get through testing all three.
08 — Practical recommendation

Which model for which job

Agentic coding, day to day
Sonnet 5 is a reasonable default — leads same-tier proprietary models on SWE-bench Pro and is priced below Opus, GPT-5.5, and Gemini 3.1 Pro through the intro window.
Hardest accuracy-critical work
Opus 4.8 remains Anthropic's recommendation; it still leads Sonnet 5 on SWE-bench Pro, Terminal-Bench, OSWorld, and cyber-adjacent tasks.
Graduate-level science/reasoning
Gemini 3.1 Pro's GPQA Diamond and ARC-AGI-2 scores are still the published high marks in this comparison.
Cheapest possible volume
Hosted open-weight models (DeepSeek V4-Flash, GLM-5.2) and Grok's aggressive per-token pricing both undercut the rest of the closed frontier field by a wide margin.
Data can't leave your infrastructure
Only the open-weight tier qualifies: self-hosted DeepSeek V4 Pro, GLM-5.2, or Qwen 3.7. There are no exceptions, since every closed model in this post is available only through a vendor-hosted API or app.
Ultra-long documents/codebases
Llama 4 Scout's 10M token context dwarfs everything else here, proprietary models included.
09 — Sources

Primary references

10 — Verdict

Where this actually lands

Sonnet 5 is a real, measurable step up from Sonnet 4.6 and a credible mid-tier option against GPT-5.5 and Gemini 3.5 Flash on the one benchmark every lab reports — agentic coding. It does not lead the entire field: Opus 4.8 is still Anthropic's better answer for the hardest jobs, Gemini 3.1 Pro still owns the science-reasoning benchmarks, and a restricted preview of GPT-5.6 already claims a higher coding score, asterisk attached. Against open-weight models, the honest read is that the gap has compressed to single digits on directly comparable benchmarks while the price gap — sometimes 10x or more — has not. Whether that price difference matters more than the remaining capability gap depends entirely on the job in front of you, not on which logo is on the model card.

Benchmark figures are vendor-reported as of launch, cross-checked against TechCrunch, The Decoder, and Anthropic's system card · Open-weight figures per BenchLM.ai, Artificial Analysis, and lab release notes · All figures subject to revision as independent evaluators publish results
