Unified multimodal I/O
Accepts and reasons over text, audio, and images in one model — and can respond in text or natural speech without a separate TTS stage.
Unified multimodal foundation model with native voice, vision, and text.
GPT-5 is OpenAI's unified multimodal foundation model, released in August 2025. It folds text, voice, and vision into a single model, so the same deployment can hold a spoken conversation, read a screenshot, and call tools without routing between specialised models.
It replaces the entire GPT-4 family and is tuned for low-latency, consumer-facing experiences: real-time voice, fast vision understanding, and broad tool use across modalities. For most teams it is the default choice when speed and modality breadth matter more than the strictest refusal behaviour.
It is available through OpenAI's API and Azure OpenAI, with native function-calling, structured outputs, and a large ecosystem of SDKs and integrations.
Sub-second voice responses with interruption handling, making it well suited to live phone and in-app voice agents.
Structured function calling and parallel tool calls across modalities, with reliable JSON-mode outputs for downstream systems.
First-class SDKs, Azure availability, and the widest set of third-party integrations of any frontier model.
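As a rough illustration of the structured and parallel tool calling described above, the sketch below dispatches several tool calls concurrently and returns JSON results. It assumes tool-call payloads shaped like the OpenAI Chat Completions format (an `id` plus a `function` object holding a `name` and a JSON-encoded `arguments` string). The two tools, their fields, and the sample payloads are hypothetical, and the model request itself is left out.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tools: names, arguments, and return values are illustrative only.
def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

def get_account_balance(account_id: str) -> dict:
    return {"account_id": account_id, "balance": 42.5}

# Registry mapping tool names to local Python callables.
TOOLS = {
    "get_order_status": get_order_status,
    "get_account_balance": get_account_balance,
}

def run_tool_call(call: dict) -> dict:
    """Execute one tool call and wrap the result as a tool message."""
    name = call["function"]["name"]
    # Arguments arrive as a JSON string, so decode before calling the tool.
    args = json.loads(call["function"]["arguments"])
    if name in TOOLS:
        result = TOOLS[name](**args)
    else:
        result = {"error": f"unknown tool: {name}"}
    return {
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps(result),
    }

def run_parallel(calls: list) -> list:
    """Run all tool calls concurrently; results keep the input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_tool_call, calls))

# Sample payloads standing in for what a model might return (assumed shape).
SAMPLE_CALLS = [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_order_status",
                  "arguments": '{"order_id": "A17"}'}},
    {"id": "call_2", "type": "function",
     "function": {"name": "get_account_balance",
                  "arguments": '{"account_id": "U9"}'}},
]

if __name__ == "__main__":
    for message in run_parallel(SAMPLE_CALLS):
        print(message)
```

Each returned message carries the `tool_call_id` of the call it answers, so the results can be appended to the conversation and matched up regardless of which tool finished first.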
Real-time spoken support that reads account context, answers in natural speech, and hands off cleanly to a human.
<1s voice time-to-first-response
In-app assistants that take a photo, a voice note, or text and respond in whichever modality the user prefers.
1 model: voice + vision + text in a single call
Drafting copy, repurposing assets across formats, and summarising calls, all from mixed text/audio inputs.
3× faster content turnaround per team
No self-hosting or fine-tuning of weights. For data sovereignty or air-gapped environments, an open-weights model like Llama 4 Behemoth is the alternative.
Less conservative than Claude on regulated workloads. For healthcare, legal, and finance agents that must fail safely, Claude Opus 4.7 generally has cleaner refusal behaviour.
Cheaper than Opus, but still a frontier tier. For very high-volume simple workloads, GPT-5 mini or Claude Haiku are materially cheaper and usually good enough.
Yes — it natively handles text, voice, and vision input in one model and can respond in text or speech, without separate models for each modality.
At launch, GPT-5's API list price was $1.25 per million input tokens and $10 per million output tokens. High-volume simple tasks are cheaper on GPT-5 mini.
GPT-5 wins on latency, voice, and native multimodal generation. Claude Opus 4.7 wins on long-horizon agentic reliability and safe refusal behaviour in regulated industries.
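Per-million-token pricing of the kind quoted above turns into a monthly estimate with simple arithmetic. The sketch below is a minimal helper for that calculation. The function name is hypothetical, and the rates in the usage lines are placeholders rather than quoted prices, so substitute the current list prices for whichever model is being compared.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Estimate spend in USD from token counts and per-million-token rates."""
    if min(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m) < 0:
        raise ValueError("token counts and rates must be non-negative")
    # Rates are quoted per one million tokens, so scale the total down.
    total = input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    return total / 1_000_000

if __name__ == "__main__":
    # Placeholder rates for illustration only, not quoted prices.
    IN_RATE, OUT_RATE = 1.0, 4.0
    monthly = estimate_cost_usd(2_000_000, 500_000, IN_RATE, OUT_RATE)
    print(f"Estimated monthly cost: ${monthly:.2f}")
```

Running the same token volumes through the rates for a smaller tier gives a quick check on whether a cheaper model is worth evaluating for a high-volume workload.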