
Choosing an LLM

The model is the assistant's brain. Pick the cheapest one that gets the job done — most assistants run beautifully on gpt-4o-mini at the lowest price tier. Reach for stronger models only when you have evidence the simpler one is failing.

Each model family has a cost tier, a multiplier applied to the baseline rate. The baseline text rate is 1.5¢ per query, so a 2× model such as o3-mini costs 3.0¢ per query. The voice baseline is 6¢ per minute (see Pricing for the full breakdown).

The models and what they cost

Model | Cost tier | When to pick it
gpt-4o-mini | 1× | The default. Fast, strong, cheap. Start here for every assistant.
gpt-5.4-mini (Azure) | 1× | Same tier as gpt-4o-mini, hosted on Azure. Useful when you need an Azure-billed path.
gpt-3.5-turbo (and -0125, -1106) | 1× | Legacy. Use if you have a prompt already tuned to it.
deepseek-r1-distill-llama-70b (Groq) | 1× | Open-weight, very fast on Groq's hardware. Good for high-volume support traffic.
llama-3.1-70b-versatile (Groq) | 1× | Open-weight Llama. Same speed advantage on Groq.
o3-mini | 2× | Reasoning model. Use when the task involves multi-step logic: tool chains, structured analysis, comparison.
gpt-4o-2024-05-13 | 10× | Top-tier general model. Use when mini and o3-mini both miss nuance.
gpt-4-0125-preview, gpt-4-1106-preview | 20× | Older 4-class. The 4o family supersedes these; use only if you have a specific reason.
gpt-4o-mini-realtime-preview | 12¢/min bundle | Realtime voice assistants. Default for the Realtime OpenAI type.
gpt-4o-realtime-preview | 48¢/min bundle | Realtime voice, premium quality. Use when the brand experience justifies the cost.

There is no Claude option in the dropdown today. The provider list is OpenAI, Azure OpenAI, and Groq.

Cost per query (chat)

Model | Cost per query
gpt-4o-mini, gpt-5.4-mini, gpt-3.5-turbo, deepseek-*, llama-3.1-70b-versatile | 1.5¢
o3-mini | 3.0¢
gpt-4o-2024-05-13 | 15¢
gpt-4-* (older) | 30¢
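
The per-query figures above follow from tier × baseline. A minimal Python sketch of that arithmetic is below. The dictionary and function names are illustrative only (they are not an Insighto API), and the sketch assumes each chat turn is billed as one query.

```python
# Baseline text rate from this page, in cents per query.
BASELINE_TEXT_CENTS = 1.5

# Tier multipliers implied by the per-query costs on this page.
# This mapping is a local illustration, not an official data structure.
MODEL_TIER = {
    "gpt-4o-mini": 1,
    "gpt-5.4-mini": 1,
    "gpt-3.5-turbo": 1,
    "deepseek-r1-distill-llama-70b": 1,
    "llama-3.1-70b-versatile": 1,
    "o3-mini": 2,
    "gpt-4o-2024-05-13": 10,
    "gpt-4-0125-preview": 20,
    "gpt-4-1106-preview": 20,
}


def cost_per_query_cents(model: str) -> float:
    """Tier multiplier times the 1.5 cent baseline."""
    return MODEL_TIER[model] * BASELINE_TEXT_CENTS


def conversation_cost_cents(model: str, queries: int) -> float:
    """Cost of a chat conversation, assuming one billed query per turn."""
    return cost_per_query_cents(model) * queries


if __name__ == "__main__":
    for name in ("gpt-4o-mini", "o3-mini", "gpt-4o-2024-05-13"):
        print(f"{name}: {cost_per_query_cents(name)} cents per query")
```

For example, `conversation_cost_cents("o3-mini", 4)` gives 12.0 cents for a four-turn conversation.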

Realtime voice models

Realtime models collapse speech-to-text, the LLM, and text-to-speech into a single OpenAI streaming call. That delivers sub-500ms turn-taking: latency is barely noticeable, and even barge-in works smoothly. The tradeoff is cost and voice choice:

  • gpt-4o-mini-realtime-preview: 12¢ per minute all-in. The default for Realtime assistants.
  • gpt-4o-realtime-preview: 48¢ per minute all-in. Premium voice quality.

You can't mix in an ElevenLabs voice on a Realtime assistant — the voice comes from OpenAI's built-in list (alloy, echo, fable, onyx, nova, shimmer, and newer additions).
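
To see how the per-minute rates compare for a given call length, here is a rough Python sketch. The `"standard"` key is a label invented here for the 6¢ per minute voice baseline, and the sketch assumes billing is pro-rated by the second; this page does not say how partial minutes are rounded.

```python
# Per-minute voice rates quoted on this page, in cents.
# "standard" is this sketch's own label for the 6 cent voice baseline.
VOICE_CENTS_PER_MINUTE = {
    "standard": 6,
    "gpt-4o-mini-realtime-preview": 12,
    "gpt-4o-realtime-preview": 48,
}


def call_cost_cents(option: str, seconds: int) -> float:
    """Cost of one call, assuming per-second pro-rating (an assumption)."""
    return VOICE_CENTS_PER_MINUTE[option] * seconds / 60


def multiple_of_baseline(option: str) -> float:
    """How many times the standard voice rate an option costs."""
    return VOICE_CENTS_PER_MINUTE[option] / VOICE_CENTS_PER_MINUTE["standard"]


if __name__ == "__main__":
    for option in VOICE_CENTS_PER_MINUTE:
        print(option, call_cost_cents(option, 180), multiple_of_baseline(option))
```

Under these assumptions the default Realtime model costs twice the standard voice rate per minute, and the premium model eight times.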

Four business use cases

Restaurant — phone reservations on Groq. A 12-location restaurant chain runs a Phone assistant on deepseek-r1-distill-llama-70b (1× tier, very fast on Groq). High traffic, simple intent — confirm party size, time, contact details. Average call: 90 seconds. Cost: ~9¢ per call.

Fintech — compliance-grade reasoning on o3-mini. A neo-bank's KYC assistant has to compare provided documents against a checklist and decide whether to escalate. They use o3-mini for its step-by-step reasoning. Cost per chat conversation: ~12¢ across 4 turns — worth it given the alternative is a human reviewer at 100x the cost.

Dental practice — voice on gpt-4o-mini with Azure. Bright Smile Dental runs a Phone assistant on gpt-4o-mini + Azure STT + Azure TTS. Baseline voice rate: 6¢/min. Average call: 3 minutes. Cost per call: 18¢.

Luxury concierge — premium Realtime. A high-end travel concierge uses Realtime gpt-4o-realtime-preview for the brand-defining voice experience. Cost: 48¢/min. Average call: 4 minutes. Cost per call: $1.92 — justified by an average booking value above $5,000.
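
The four per-conversation figures above can be reproduced as rate × quantity. The sketch below is only a check of that arithmetic; the `UseCase` class is hypothetical, and it assumes the restaurant call is billed at the 6¢ per minute voice baseline, which is consistent with the quoted ~9¢ for 90 seconds.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    unit_rate_cents: float  # cents per minute (voice) or per query (chat)
    units: float            # minutes per call, or queries per conversation

    @property
    def cost_cents(self) -> float:
        return self.unit_rate_cents * self.units


USE_CASES = [
    UseCase("Restaurant (phone, Groq)", 6, 1.5),         # 90-second call
    UseCase("Fintech (chat, o3-mini)", 3.0, 4),          # 4 turns
    UseCase("Dental (phone, gpt-4o-mini)", 6, 3),        # 3-minute call
    UseCase("Concierge (Realtime, premium)", 48, 4),     # 4-minute call
]

if __name__ == "__main__":
    for uc in USE_CASES:
        print(f"{uc.name}: ${uc.cost_cents / 100:.2f}")
```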

A decision tree

Chat or phone assistant?
├─ Default → gpt-4o-mini (1× cost)
├─ Multi-step reasoning needed → o3-mini (2× cost)
└─ Still failing → gpt-4o-2024-05-13 (10× cost)

Realtime voice assistant?
├─ Default → gpt-4o-mini-realtime-preview (12¢/min)
└─ Premium experience → gpt-4o-realtime-preview (48¢/min)
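
The same tree can be written as a small function. This is a sketch for illustration: the function name and its boolean flags are made up here, and deciding when a model is "still failing" remains a human judgment.

```python
def pick_model(
    realtime: bool,
    needs_reasoning: bool = False,
    still_failing: bool = False,
    premium: bool = False,
) -> str:
    """Return the model the decision tree above suggests."""
    if realtime:
        # Realtime voice branch: only two choices.
        if premium:
            return "gpt-4o-realtime-preview"
        return "gpt-4o-mini-realtime-preview"
    # Chat or phone branch: escalate only on evidence of failure.
    if still_failing:
        return "gpt-4o-2024-05-13"
    if needs_reasoning:
        return "o3-mini"
    return "gpt-4o-mini"


if __name__ == "__main__":
    print(pick_model(realtime=False))
    print(pick_model(realtime=False, needs_reasoning=True))
    print(pick_model(realtime=True, premium=True))
```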

Switching models on a live assistant

The model is a single dropdown on the assistant. Change it, save, and the next conversation uses the new model. No re-indexing, no migration. A useful A/B technique:

  1. Clone the assistant.
  2. Change the clone's model.
  3. Point a test widget at the clone.
  4. Compare results in the Conversations view for a few days.
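
One way to make the comparison in step 4 concrete is cost per resolved conversation. The helper below is a hypothetical sketch: it assumes you tally conversations and resolutions yourself from the Conversations view, and the counts in the example are placeholders, not measured results.

```python
def cost_per_resolved_cents(
    conversations: int,
    resolved: int,
    queries_per_conversation: float,
    cents_per_query: float,
) -> float:
    """Total LLM spend divided by the number of resolved conversations."""
    if resolved <= 0:
        raise ValueError("need at least one resolved conversation")
    total_cents = conversations * queries_per_conversation * cents_per_query
    return total_cents / resolved


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    original = cost_per_resolved_cents(200, 150, 4, 1.5)  # gpt-4o-mini rate
    clone = cost_per_resolved_cents(200, 180, 4, 3.0)     # o3-mini rate
    print(f"original: {original:.2f} cents, clone: {clone:.2f} cents")
```

A stronger model can resolve more conversations and still cost more per resolution, so it helps to look at both numbers before switching.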

What does not automatically carry across a model switch is prompt tuning. A prompt tuned for gpt-4o-mini may need adjustment on o3-mini or gpt-4o. Re-tune in the Playground after any swap.

BYOK and pricing

If you've configured your own OpenAI key under Settings → BYOK Credentials, LLM costs are billed directly to your provider account. Insighto's wallet doesn't get touched for the LLM line item. See BYOK Credentials and Pricing.
