Is GPT-5 fully multimodal?

Yes — it natively handles text, voice, and vision input in one model and can respond in text or speech, without separate models for each modality.

How much does GPT-5 cost?

GPT-5 is priced around $10 per million input tokens and $40 per million output tokens. High-volume simple tasks are cheaper on GPT-5 mini.

GPT-5 or Claude Opus 4.7?

GPT-5 wins on latency, voice, and native multimodal generation. Claude Opus 4.7 wins on long-horizon agentic reliability and safe refusal behaviour in regulated industries.

Book a Strategy Call

BY OpenAI

Released December 2025 · v5.0

Live

GPT-5

Unified multimodal foundation model with native voice, vision, and text in one model.

Context: 400K tokens
Pricing · in / out: $10 / $40 per 1M
Modality: Text + Voice + Vision

At a glance

GPT-5 is OpenAI's unified multimodal foundation model, released in December 2025. It folds text, voice, and vision into a single model — so the same deployment can hold a spoken conversation, read a screenshot, and call tools without routing between specialised models.

It replaces the entire GPT-4 family and is tuned for low-latency, consumer-facing experiences: real-time voice, fast vision understanding, and broad tool-use across modalities. For most teams it is the default choice when speed and modality breadth matter more than the very strictest refusal behaviour.

It is available through OpenAI's API and Azure OpenAI, with native function-calling, structured outputs, and a large ecosystem of SDKs and integrations.

Key abilities

Unified multimodal I/O

Accepts and reasons over text, audio, and images in one model — and can respond in text or natural speech without a separate TTS stage.

Real-time voice

Sub-second voice responses with interruption handling, making it well suited to live phone and in-app voice agents.

Broad tool-use & function calling

Structured function calling and parallel tool calls across modalities, with reliable JSON-mode outputs for downstream systems.

Large ecosystem

First-class SDKs, Azure availability, and the widest set of third-party integrations of any frontier model.

How teams use it

Customer support

Voice support agents

Real-time spoken support that reads account context, answers in natural speech, and hands off cleanly to a human.

<1svoice time-to-first-response

Consumer apps

Multimodal assistants

In-app assistants that take a photo, a voice note, or text and respond in whichever modality the user prefers.

1 modelvoice + vision + text in a single call

Sales & marketing

Multimodal content generation

Drafting copy, repurposing assets across formats, and summarising calls — all from mixed text/audio inputs.

3×faster content turnaround per team

Drawbacks

Closed weights

No self-hosting or fine-tuning of weights. For data sovereignty or air-gapped environments, an open-weights model like Llama 4 Behemoth is the alternative.

Refusal behaviour

Less conservative than Claude on regulated workloads. For healthcare, legal, and finance agents that must fail safely, Claude Opus 4.7 generally has cleaner refusal behaviour.

Cost at scale

Cheaper than Opus, but still a frontier tier. For very high-volume simple workloads, GPT-5 mini or Claude Haiku are materially cheaper and usually good enough.

GPT-5

Unified multimodal foundation model with native voice, vision, and text in one model.

Context: 400K tokens
Pricing · in / out: $10 / $40 per 1M
Modality: Text + Voice + Vision

At a glance

It is available through OpenAI's API and Azure OpenAI, with native function-calling, structured outputs, and a large ecosystem of SDKs and integrations.

Key abilities

Unified multimodal I/O

Accepts and reasons over text, audio, and images in one model — and can respond in text or natural speech without a separate TTS stage.

Real-time voice

Sub-second voice responses with interruption handling, making it well suited to live phone and in-app voice agents.

Broad tool-use & function calling

Structured function calling and parallel tool calls across modalities, with reliable JSON-mode outputs for downstream systems.

Large ecosystem

First-class SDKs, Azure availability, and the widest set of third-party integrations of any frontier model.

How teams use it

Customer support

Voice support agents

Real-time spoken support that reads account context, answers in natural speech, and hands off cleanly to a human.

<1svoice time-to-first-response

Consumer apps

Multimodal assistants

In-app assistants that take a photo, a voice note, or text and respond in whichever modality the user prefers.

1 modelvoice + vision + text in a single call

Sales & marketing

Multimodal content generation

Drafting copy, repurposing assets across formats, and summarising calls — all from mixed text/audio inputs.

3×faster content turnaround per team

Drawbacks

Closed weights

No self-hosting or fine-tuning of weights. For data sovereignty or air-gapped environments, an open-weights model like Llama 4 Behemoth is the alternative.

Refusal behaviour

Less conservative than Claude on regulated workloads. For healthcare, legal, and finance agents that must fail safely, Claude Opus 4.7 generally has cleaner refusal behaviour.

Cost at scale

Cheaper than Opus, but still a frontier tier. For very high-volume simple workloads, GPT-5 mini or Claude Haiku are materially cheaper and usually good enough.

GPT-5

At a glance

Key abilities

Unified multimodal I/O

Real-time voice

Broad tool-use & function calling

Large ecosystem

How teams use it

Customer support

Voice support agents

Consumer apps

Multimodal assistants

Sales & marketing

Multimodal content generation

Drawbacks

Closed weights

Refusal behaviour

Cost at scale

People also ask

Read AI Insights weekly.

GPT-5

At a glance

Key abilities

Unified multimodal I/O

Real-time voice

Broad tool-use & function calling

Large ecosystem

How teams use it

Customer support

Voice support agents

Consumer apps

Multimodal assistants

Sales & marketing

Multimodal content generation

Drawbacks

Closed weights

Refusal behaviour

Cost at scale

People also ask

Read AI Insights weekly.