Picking the right LLM for a voice agent.

Callibre Team · 21 May 2026 · 6 min read

Callibre runs ten LLMs at flat per-minute rates from $0.006 to $0.060. The right pick isn't always the most capable one, and the floor model is genuinely good enough for a surprising fraction of agent workloads. Here's how to choose without spending a week benchmarking.

The instinct on any voice-AI build is to reach for the most capable LLM available, on the theory that "better model = better agent." For about a quarter of workloads that's right. For the other three-quarters, you've bought yourself an extra three-and-a-half cents per minute and your callers cannot tell. The trick is knowing which side of the line your agent is on before you ship.

The menu, in three tiers

Callibre's flat per-minute LLM pricing falls into three natural buckets:

Budget · $0.006–$0.015 · Gemini Flash, Llama 4
Balanced · $0.025–$0.030 · Haiku 4.5, Qwen3
Premium · $0.040–$0.060 · GPT-5, GPT-4.x, Sonnet 4.6

The all-in cost (voice + LLM + telephony) ranges from $0.086/min on the cheapest config to $0.140/min on the most expensive. That's a spread of roughly 63% ($0.054/min) on infrastructure. If you can sell every client into the top tier and absorb that, fine. Most agencies can't, and shouldn't try.
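The all-in figures are simple addition, and it can help to see them side by side. A minimal sketch, assuming the voice and telephony lines together come to $0.080/min (inferred here from the $0.086 floor minus the $0.006 LLM rate; the two lines are not broken out separately above). The model keys and the function name are illustrative, not Callibre identifiers:

```python
# Assumption: voice + telephony together cost $0.080/min, inferred from the
# $0.086/min cheapest all-in figure minus the $0.006/min LLM floor.
NON_LLM_PER_MIN = 0.080

# LLM rates quoted in this article (USD per minute). Keys are illustrative.
LLM_PER_MIN = {
    "gemini-flash": 0.006,
    "llama-4": 0.015,
    "haiku-4.5": 0.025,
    "qwen3": 0.030,
    "premium-low": 0.040,   # bottom of the premium band
    "premium-high": 0.060,  # top of the premium band
}

def all_in_per_min(model: str) -> float:
    """Return the all-in per-minute cost for a model, rounded to 0.1 cent."""
    return round(NON_LLM_PER_MIN + LLM_PER_MIN[model], 3)

for model in LLM_PER_MIN:
    print(f"{model:<13} ${all_in_per_min(model):.3f}/min")

# Relative spread between the most and least expensive configs.
spread = all_in_per_min("premium-high") / all_in_per_min("gemini-flash") - 1
print(f"spread: {spread:.0%}")
```

Because the non-LLM portion is identical on every config, the whole spread comes from the LLM line.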

Budget tier — Gemini Flash 2.5 ($0.006/min) and Llama 4 ($0.015/min)

The dirty secret about voice agents is that the LLM is only one piece of what makes them feel good. Most of the perceived "intelligence" of a well-built voice agent comes from outside the model choice: ElevenLabs voice quality, sub-second turn-taking, clean barge-in, and grounded knowledge.

Workloads where the budget tier is genuinely the right call:

High-volume routing. "Are you calling about a quote, a service appointment, or a billing question?" The agent picks the intent from two answers and routes. Gemini Flash handles this comfortably at about one-quarter of Haiku's LLM rate ($0.006 vs. $0.025/min).
Confirmation calls. Reading details back and capturing yes/no doesn't need deep reasoning.
Outbound list calls. Recall, win-back, recovery sequences where the script is well-defined and deviation is minimal.

Where the budget tier breaks down: any workflow that needs the agent to reason across multiple facts or hold context across a long, branching conversation. The model gets confused about who said what three turns ago, and the conversation feels slightly stilted.

Balanced tier — Claude Haiku 4.5 ($0.025/min) and Qwen3 ($0.030/min)

This is the default for a reason. Haiku 4.5 is the sweet spot for the vast majority of voice agent workloads — inbound qualification, appointment booking with branching availability, support triage, conversational follow-up. It holds context cleanly across multi-turn conversations, follows the system prompt without drifting, and handles knowledge-base grounding without making up details.

If you're not sure where to start, start here. The all-in cost is $0.105/min, and you can quote a client a flat $0.30/min with a $0.195/min margin floor that holds across any caller pattern.
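The margin floor is the quoted rate minus the all-in cost. A small sketch of that arithmetic, using the $0.30/min quote and $0.105/min all-in cost from the paragraph above; the function names and the 10,000-minute monthly volume are hypothetical, chosen only for illustration:

```python
def margin_floor(quoted_per_min: float, all_in_per_min: float) -> float:
    """Per-minute margin left after the all-in infrastructure cost."""
    return round(quoted_per_min - all_in_per_min, 3)

def monthly_margin(quoted_per_min: float, all_in_per_min: float, minutes: int) -> float:
    """Total margin in USD for a given number of billed minutes."""
    return round(margin_floor(quoted_per_min, all_in_per_min) * minutes, 2)

# Haiku 4.5 config: $0.30/min quote against $0.105/min all-in.
print(f"margin floor: ${margin_floor(0.30, 0.105):.3f}/min")

# Hypothetical volume, purely to show how the floor scales.
print(f"monthly margin at 10,000 min: ${monthly_margin(0.30, 0.105, 10_000):,.2f}")
```

Since both inputs are flat per-minute rates, the result does not depend on how long or how talkative any individual call is.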

Premium tier — GPT-5.x, GPT-4o, Claude Sonnet 4.6

You move up to the premium tier when one of three things is true:

The agent has to handle genuine reasoning — comparing options, walking through a multi-step troubleshooting tree, performing nuanced intent detection where the caller doesn't say what they actually want.
The knowledge base is large and unstructured and the agent needs to synthesize across many entries rather than retrieve one matching one.
The brand voice is unusual — premium models follow stylistic guidance more closely. If you're building an agent for a luxury brand or a regulated-language vertical, the larger model holds the voice better.

The premium tier costs $0.120 to $0.140 all-in. For a client paying $0.30/min, that's still a profitable margin floor at $0.16–$0.18/min — just less headroom. If the client's use case is materially better served, it's the right call. If it isn't, you're burning margin on capability you don't use.
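The criteria from the three tier sections can be collapsed into a rough first-pass check. This is a sketch of the reasoning in this article, not a Callibre feature; the `Workload` fields and `suggest_tier` are hypothetical names, and the output is a starting guess rather than a substitute for listening to real calls:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    # Premium signals described above.
    needs_multi_step_reasoning: bool = False
    large_unstructured_kb: bool = False
    unusual_brand_voice: bool = False
    # Budget signal: routing, confirmations, well-scripted outbound lists.
    scripted_low_deviation: bool = False

def suggest_tier(w: Workload) -> str:
    """Return 'premium', 'budget', or 'balanced' as a first guess."""
    if w.needs_multi_step_reasoning or w.large_unstructured_kb or w.unusual_brand_voice:
        return "premium"
    if w.scripted_low_deviation:
        return "budget"
    # Everything else starts on the balanced tier.
    return "balanced"

print(suggest_tier(Workload(scripted_low_deviation=True)))
print(suggest_tier(Workload(large_unstructured_kb=True)))
print(suggest_tier(Workload()))
```

Premium signals are checked first on purpose: a scripted flow that still needs an unusual brand voice is treated as a premium candidate, which errs toward quality over cost.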

Practical picking rules

Start at Haiku. Profile in production. Move only if needed.

Ship the first agent on Haiku 4.5 ($0.025/min LLM line). Listen to the first hundred calls. If the agent is consistently sharp, drop down to a Llama 4 or Gemini Flash variant and re-listen to the next hundred. If the calls degrade, go back. If they don't, you just cut your LLM line by 40% (Llama 4 at $0.015/min) or 76% (Gemini Flash at $0.006/min).
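How much the step-down saves depends on which budget model you land on. A quick sketch using the LLM rates quoted in this article; the function name is illustrative:

```python
HAIKU_PER_MIN = 0.025  # Haiku 4.5 LLM rate from this article

def llm_saving(from_rate: float, to_rate: float) -> float:
    """Fractional reduction in the LLM line when moving between two rates."""
    return round(1 - to_rate / from_rate, 2)

for name, rate in [("Llama 4", 0.015), ("Gemini Flash", 0.006)]:
    print(f"Haiku 4.5 -> {name}: {llm_saving(HAIKU_PER_MIN, rate):.0%} off the LLM line")
```

Note that this is the saving on the LLM line only. Measured against the all-in cost, the reduction is smaller, because the voice and telephony lines stay the same.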

Don't pick by demo, pick by production.

Every model demos well on a scripted call. The real differences emerge over volume. Run the same agent on three models for a week each on the same call mix, then compare. Callibre's per-call observability shows the call audio, the transcript, the tool calls fired, and the effective per-minute cost — enough to do this comparison without instrumenting it yourself.
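If you export the calls and label them by hand, the week-by-week comparison reduces to a small aggregation. The sketch below assumes a hypothetical `CallRecord` shape; it is not Callibre's export format, and the sample records are made up purely to show the structure of the comparison:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CallRecord:
    model: str       # which LLM handled the call
    minutes: float   # call duration
    cost_usd: float  # all-in cost billed for the call
    resolved: bool   # hand-labeled after listening to the call

def summarize(calls: list[CallRecord]) -> dict[str, dict[str, float]]:
    """Aggregate call count, resolution rate, and effective $/min per model."""
    buckets: dict[str, list[CallRecord]] = defaultdict(list)
    for call in calls:
        buckets[call.model].append(call)

    summary = {}
    for model, group in buckets.items():
        total_minutes = sum(c.minutes for c in group)
        total_cost = sum(c.cost_usd for c in group)
        summary[model] = {
            "calls": len(group),
            "resolved_rate": sum(c.resolved for c in group) / len(group),
            "cost_per_min": total_cost / total_minutes if total_minutes else 0.0,
        }
    return summary

# Made-up records, only to illustrate the aggregation.
sample = [
    CallRecord("haiku-4.5", 4.0, 0.42, True),
    CallRecord("haiku-4.5", 2.0, 0.21, False),
    CallRecord("gemini-flash", 3.0, 0.258, True),
]
for model, stats in summarize(sample).items():
    print(model, stats)
```

The `resolved` flag stands in for whatever quality judgment you make while listening. The comparison is only fair if each model saw a similar call mix.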

BYOK exists for a reason.

If you have a direct contract with Anthropic or OpenAI, you can run the LLM line in BYOK (bring your own key) mode: the provider bills you directly for model usage, and the LLM line on the Callibre invoice drops to $0. This is rarely worth it under a few hundred thousand minutes a month, but at scale it's a real lever.

The pricing structure is the point

None of this picking logic is interesting if your LLM cost is metered by token at call time — because then your model choice doesn't fully determine your cost, the caller does. The whole reason this menu is choosable in a useful way is that every line is flat per-minute. Pick once, the rate locks, the margin holds.
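To make the contrast concrete, here is a toy comparison of a flat rate against token metering. Only the $0.025/min flat rate comes from this article. The token price and the tokens-per-minute figures are invented for illustration and do not reflect any provider's actual pricing:

```python
FLAT_LLM_PER_MIN = 0.025  # Haiku 4.5 flat rate from this article

def metered_per_min(tokens_per_min: float, usd_per_1k_tokens: float) -> float:
    """Effective per-minute LLM cost when billing is by token."""
    return tokens_per_min * usd_per_1k_tokens / 1000

# Hypothetical values: both the price and the token rates are made up.
HYPOTHETICAL_USD_PER_1K_TOKENS = 0.01
callers = [("terse caller", 1500), ("chatty caller", 4000)]

for label, tokens_per_min in callers:
    metered = metered_per_min(tokens_per_min, HYPOTHETICAL_USD_PER_1K_TOKENS)
    print(f"{label:<14} metered ${metered:.3f}/min   flat ${FLAT_LLM_PER_MIN:.3f}/min")
```

Under metering, the same model lands on a different per-minute cost for each caller, so the margin moves with behavior you don't control. Under a flat rate, the second column never changes.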

Try the LLM picker in the calculator.

Drop in your call volume and switch LLMs to see the spread. Same calculator the comparison page uses.
