OpenAI just released three new realtime voice models in the API. The headline model, gpt-realtime-2, is the first OpenAI voice model with GPT-5-class reasoning. The context window jumps from 32K to 128K tokens. The launch landed on May 7, 2026.
The second model, gpt-realtime-translate, does live speech translation from 70+ input languages into 13 output languages, paced with the speaker. The third is gpt-realtime-whisper, a streaming speech-to-text model that transcribes live as a person talks.
The capability that matters most for builders is multi-tool calling with audible narration. While reasoning, gpt-realtime-2 can call multiple tools simultaneously and read out what it is doing, with phrases like "checking your calendar" or "looking that up now." That removes the silent-pause UX problem that has held voice agents back.
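To make the pattern concrete, here is a minimal sketch of "narrate, then run tools in parallel" written as plain application code. It is not the OpenAI Realtime API and makes no claim about how gpt-realtime-2 works internally. The tool functions (check_calendar, look_up_weather), the helper run_tools_with_narration, and the narrate callback are all hypothetical names invented for illustration.

```python
import asyncio


async def check_calendar(day: str) -> str:
    # Hypothetical tool: a real one would call a calendar service.
    await asyncio.sleep(0.02)  # stand-in for network latency
    return f"calendar for {day}: free after 3pm"


async def look_up_weather(city: str) -> str:
    # Hypothetical tool: a real one would call a weather service.
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"weather in {city}: clear"


async def run_tools_with_narration(calls, narrate):
    """Speak a short phrase for each tool, then await all tools concurrently.

    calls: list of (phrase, coroutine) pairs.
    narrate: callback that receives each phrase (a real agent would send it to
    speech output; this sketch just passes the text along).
    """
    for phrase, _ in calls:
        narrate(phrase)
    # asyncio.gather runs the coroutines concurrently and returns their
    # results in the same order as the inputs.
    return await asyncio.gather(*(coro for _, coro in calls))


if __name__ == "__main__":
    spoken = []
    results = asyncio.run(
        run_tools_with_narration(
            [
                ("checking your calendar", check_calendar("Friday")),
                ("looking that up now", look_up_weather("Lisbon")),
            ],
            spoken.append,
        )
    )
    print(spoken)
    print(results)
```

The point of the sketch is the ordering: the user hears something immediately, and total wait time is roughly the slowest tool rather than the sum of all tools.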
Pricing tells you the deployment shape. gpt-realtime-2 is $32 per million audio input tokens and $64 per million audio output tokens. Translate is $0.034 per minute. Whisper is $0.017 per minute. Per-minute billing makes Translate and Whisper costs scale linearly with usage, so they are safe to put behind production traffic without surprise bills. The gpt-realtime-2 token price is premium, aimed at high-value live conversations (sales calls, support escalations, complex assistance) where one minute of high-quality reasoning is worth dollars, not cents.
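The arithmetic behind those prices can be sketched in a few lines. The rates below are the ones quoted above; the function names and the example volumes are illustrative assumptions. The number of audio tokens per minute of speech is not stated here, so the token-based estimate takes token counts directly rather than guessing a conversion.

```python
# Prices quoted in this article, in USD.
REALTIME_2_INPUT_PER_M_TOKENS = 32.0
REALTIME_2_OUTPUT_PER_M_TOKENS = 64.0
TRANSLATE_PER_MINUTE = 0.034
WHISPER_PER_MINUTE = 0.017


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Cost of a per-minute-billed model: linear in audio minutes."""
    return minutes * rate_per_minute


def realtime_2_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of gpt-realtime-2 given audio token counts (assumed to be known)."""
    return (
        input_tokens * REALTIME_2_INPUT_PER_M_TOKENS
        + output_tokens * REALTIME_2_OUTPUT_PER_M_TOKENS
    ) / 1_000_000


if __name__ == "__main__":
    # Example volumes are made up for illustration.
    print(f"${per_minute_cost(1000, WHISPER_PER_MINUTE):.2f}")    # $17.00
    print(f"${per_minute_cost(1000, TRANSLATE_PER_MINUTE):.2f}")  # $34.00
    print(f"${realtime_2_cost(1_000_000, 500_000):.2f}")          # $64.00
```

For a real budget, the token counts would need to come from measured usage, since output tokens cost twice as much as input tokens and the split depends on how much the agent talks.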
Why this matters: until this launch, the production voice-agent stack required stitching together STT, an LLM, and TTS: three separate models, three latencies, three failure modes. OpenAI just collapsed that into one realtime model with reasoning. Builders should re-evaluate any voice product that was waiting for "real-time agent quality."
What to watch: whether Anthropic and Google ship parallel voice models within 4-6 weeks. The pattern from the last 90 days is that when one lab ships a category first, the other two follow within a quarter. Voice agents just became a frontier-lab race.