We ran both models against 200+ real coding tasks over six weeks. Same prompts, same codebases, same scoring rubric. Here's what landed.

The setup

Three codebases:

  • A 60k-line TypeScript monorepo with React, Next.js, and a complex API layer
  • A Python ML repo with mixed tooling (PyTorch, FastAPI, Polars)
  • A small Rust service with strict performance requirements

We tested both models in two modes:

  1. Chat mode — operator-driven. We give the model a task, copy in relevant code, evaluate the response.
  2. Agent mode — model runs in a loop with file-system access (Cursor for both, plus Claude Code for the Anthropic side and Codex CLI for the OpenAI side).

Tasks were drawn from real backlog items on active products: bug fixes, multi-file refactors, new endpoints with tests, performance optimization, and a few "explain why this is broken" puzzles.
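To make the scoring concrete, here's a minimal sketch of the harness shape this implies. Everything in it is illustrative: the rubric dimensions, the Result fields, and the run/grade callables are stand-ins of ours, not tooling from either vendor.

    # Minimal sketch of a paired-eval harness. Rubric dimensions, field names,
    # and the run/grade callables are illustrative stand-ins, not a real tool.
    from dataclasses import dataclass
    from typing import Callable

    RUBRIC = ("correctness", "tests_pass", "explanation")

    @dataclass
    class Result:
        model: str
        task_id: str
        mode: str                # "chat" or "agent"
        scores: dict[str, int]   # rubric dimension -> score, 0..5

    def run_suite(models: list[str], tasks: list[str],
                  run: Callable[[str, str, str], str],
                  grade: Callable[[str, str, str], int]) -> list[Result]:
        # Every (task, mode) cell is scored for every model, with the same
        # prompt and codebase, so differences are attributable to the model.
        results = []
        for task in tasks:
            for model in models:
                for mode in ("chat", "agent"):
                    output = run(model, task, mode)
                    scores = {dim: grade(output, task, dim) for dim in RUBRIC}
                    results.append(Result(model, task, mode, scores))
        return results

The shape matters more than the details: pairing every task across both models and both modes is what lets the per-category comparisons below hold up.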

Where Claude wins

Three categories, decisively.

Planning quality. Give Claude a complex task — "refactor this directory to use the new auth helper, but only where the helper is functionally equivalent" — and it will think before it types. The plan it produces is usually right, and when it isn't, the wrongness is in the spec rather than the implementation. GPT-5.5 tends to start editing immediately and reason as it goes, which works for simple tasks and breaks on complex ones.

Multi-file refactor accuracy. On the 22 multi-file refactor tasks in our suite, Claude got 19 fully right on the first try (86%); GPT-5.5 got 14 (64%). The misses on both sides were concentrated on tasks where the refactor required understanding implicit invariants, which is exactly where planning quality matters.

Reasoning about existing code. When asked "why is this slow?" or "where is the bug?", Claude's diagnoses were correct and well-explained noticeably more often. GPT-5.5 sometimes proposes the right fix without articulating why, which is fine until you need the explanation for code review.

Where GPT-5.5 wins

Three categories, also decisively.

Inline autocomplete. For short, predictable completions inside an editor, GPT-5.5 is faster and the suggestions feel marginally more on-target. If you're using model-as-autocomplete (Copilot-style), GPT-5.5 is the better backbone.

Latency. First-token latency is meaningfully lower on GPT-5.5: about 35% lower by OpenAI's published numbers, which roughly matched what we measured. For interactive use cases (chat-style coding assistance), it feels snappier.

Cost per task. Even before last week's price cuts, GPT-5.5 was the cheaper model per coding task. With the new pricing, it's about 40% cheaper for equivalent agent-loop work. If your unit economics are constrained, this matters.
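For a sense of how that arithmetic works, here's a back-of-envelope sketch. The token counts and per-million-token prices are made-up placeholders chosen only to illustrate the ratio, not actual Anthropic or OpenAI pricing:

    # Back-of-envelope cost-per-task arithmetic. All prices and token counts
    # below are illustrative placeholders, NOT real vendor pricing.
    def cost_per_task(in_tok: int, out_tok: int,
                      price_in_per_mtok: float, price_out_per_mtok: float) -> float:
        return (in_tok / 1e6) * price_in_per_mtok + (out_tok / 1e6) * price_out_per_mtok

    # A hypothetical agent-loop task: ~60k input tokens across tool calls, ~8k output.
    model_a = cost_per_task(60_000, 8_000, price_in_per_mtok=3.0, price_out_per_mtok=15.0)
    model_b = cost_per_task(60_000, 8_000, price_in_per_mtok=1.8, price_out_per_mtok=9.0)
    print(f"model_a = ${model_a:.2f}, model_b = ${model_b:.2f}")  # 0.30 vs 0.18

With these placeholder numbers, model B comes out 40% cheaper per task. The other takeaway is that agent loops are input-heavy, so input pricing dominates the per-task cost.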

Where they tie

On greenfield features with low cross-cutting complexity (single-file additions, small new modules, isolated handlers), both models produce work that ships. Quality differences exist on close inspection but don't matter in practice.

What we'd actually pick

If you're building a coding agent product where one model has to power the whole loop, Claude Sonnet 4.6 is the right pick. The planning advantage compounds across long tasks, the agent-loop reliability is meaningfully better, and the gap on refactor accuracy is the kind of thing that shows up in production as fewer broken PRs.

If you're a solo engineer using AI to accelerate your own work, the answer is "use both". Claude for planning and complex refactors, GPT-5.5 for inline completion and cheap iterative tasks. Most engineers we know who care about this question run both.
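If you do run both, the routing doesn't need to be clever. A minimal sketch, assuming you already bucket tasks into rough categories (the category names and model IDs here are placeholders):

    # Naive task router for a "use both" setup. Categories and model IDs are
    # placeholders; real routing would key off your own task taxonomy.
    PLANNING_HEAVY = {"multi_file_refactor", "architecture", "debug_explanation"}
    LATENCY_SENSITIVE = {"autocomplete", "inline_edit", "quick_iteration"}

    def pick_model(task_category: str) -> str:
        if task_category in PLANNING_HEAVY:
            return "claude-sonnet-4.6"   # planning and refactor accuracy win here
        if task_category in LATENCY_SENSITIVE:
            return "gpt-5.5"             # lower first-token latency, cheaper
        return "gpt-5.5"                 # ties go to the cheaper model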

If you're running a heavy hosted-API workload where cost matters more than the last 10% of quality, GPT-5.5 is now the more economical choice. The quality gap is real but often not worth the price difference.

What this isn't

This comparison is about coding specifically. For other workloads — long-form writing, image generation, voice, multimodal — the picture is different and we cover those separately.

We also haven't tested every coding-specific model on the market. Codestral, DeepSeek Coder, and the open Qwen Coder series are all credible alternatives in narrower contexts. Those are worth their own writeups.

See the side-by-side matrix below for the full breakdown by dimension.

Side-by-side

Feature                                    Claude Sonnet 4.6 (Anthropic)   GPT-5.5 (OpenAI)
Planning quality (multi-step tasks)        ★★★★★                           ★★★★
Multi-file refactor accuracy               ★★★★★                           ★★★★
Bug fix correctness                        ★★★★★                           ★★★★
Greenfield feature build                   ★★★★                            ★★★★
Inline autocomplete (short suggestions)    ★★★                             ★★★★★
Latency to first token                     ★★★★                            ★★★★★
Cost per coding task                       ★★★                             ★★★★★
Long-context retrieval (>128k)             ★★★★                            ★★★
Agent-loop reliability (5+ tool calls)     ★★★★★                           ★★★★
Explainability (why it made a change)      ★★★★★                           ★★★★