Choosing an LLM
The model is the assistant's brain. Pick the cheapest one that gets the job done — most assistants run beautifully on gpt-4o-mini at the lowest price tier. Reach for stronger models only when you have evidence the simpler one is failing.
Each model's price is the baseline rate multiplied by a cost tier that depends on the model family. The baseline text rate is 1.5¢ per query, and the baseline voice rate is 6¢ per minute (see Pricing for the full breakdown).
The models and what they cost
| Model | Cost tier | When to pick it |
|---|---|---|
| gpt-4o-mini | 1× | The default. Fast, strong, cheap. Start here for every assistant. |
| gpt-5.4-mini (Azure) | 1× | Same tier as gpt-4o-mini, hosted on Azure. Useful when you need an Azure-billed path. |
| gpt-3.5-turbo (and -0125, -1106) | 1× | Legacy. Use if you have a prompt already tuned to it. |
| deepseek-r1-distill-llama-70b (Groq) | 1× | Open-weight, very fast on Groq's hardware. Good for high-volume support traffic. |
| llama-3.1-70b-versatile (Groq) | 1× | Open-weight Llama. Same speed advantage on Groq. |
| o3-mini | 2× | Reasoning model. Use when the task involves multi-step logic: tool chains, structured analysis, comparison. |
| gpt-4o-2024-05-13 | 10× | Top-tier general model. Use when mini and o3-mini both miss nuance. |
| gpt-4-0125-preview, gpt-4-1106-preview | 20× | Older 4-class. The 4o family supersedes these; use only if you have a specific reason. |
| gpt-4o-mini-realtime-preview | 12¢/min bundle | Realtime voice assistants. Default for Realtime OpenAI type. |
| gpt-4o-realtime-preview | 48¢/min bundle | Realtime voice, premium quality. Use when the brand experience justifies the cost. |
There is no Claude option in the dropdown today. The provider list is OpenAI, Azure OpenAI, and Groq.
Cost per query (chat)
| Model | Cost per query |
|---|---|
| gpt-4o-mini, gpt-5.4-mini, gpt-3.5-turbo, deepseek-*, llama-3.1-70b-versatile | 1.5¢ |
| o3-mini | 3.0¢ |
| gpt-4o-2024-05-13 | 15¢ |
| gpt-4-* (older) | 30¢ |
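
The per-query figures above are just the baseline text rate multiplied by each model's cost tier. Here is a minimal sketch of that arithmetic, using the numbers from the two tables. The names `BASELINE_TEXT_CENTS`, `COST_TIER`, and `cost_per_query_cents` are made up for illustration and are not part of any Insighto API.

```python
# Baseline text rate (cents per query) and tier multipliers, copied from the tables above.
# All names here are hypothetical; this is not an Insighto API.
BASELINE_TEXT_CENTS = 1.5

COST_TIER = {
    "gpt-4o-mini": 1,
    "o3-mini": 2,
    "gpt-4o-2024-05-13": 10,
    "gpt-4-0125-preview": 20,
}


def cost_per_query_cents(model: str) -> float:
    """Return the baseline text rate multiplied by the model's cost tier."""
    return BASELINE_TEXT_CENTS * COST_TIER[model]


if __name__ == "__main__":
    for model in COST_TIER:
        print(f"{model}: {cost_per_query_cents(model)} cents per query")
```

Treat this as a way to sanity-check a budget, not as a billing reference. The Pricing page stays the source of truth.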
Realtime voice models
Realtime models collapse speech-to-text, the LLM, and text-to-speech into a single OpenAI streaming call. That delivers sub-500ms turn-taking: latency is barely noticeable, and even barge-in works smoothly. The tradeoff is cost and voice choice:
- gpt-4o-mini-realtime-preview: 12¢ per minute all-in. The default for Realtime assistants.
- gpt-4o-realtime-preview: 48¢ per minute all-in. Premium voice quality.
You can't mix in an ElevenLabs voice on a Realtime assistant — the voice comes from OpenAI's built-in list (alloy, echo, fable, onyx, nova, shimmer, and newer additions).
Four business use cases
Restaurant: phone reservations on Groq. A 12-location restaurant chain runs a Phone assistant on deepseek-r1-distill-llama-70b (1× tier, very fast on Groq). High traffic, simple intent: confirm party size, time, contact details. Average call: 90 seconds. Cost at the 6¢/min baseline voice rate: ~9¢ per call.
Fintech — compliance-grade reasoning on o3-mini. A neo-bank's KYC assistant has to compare provided documents against a checklist and decide whether to escalate. They use o3-mini for its step-by-step reasoning. Cost per chat conversation: ~12¢ across 4 turns — worth it given the alternative is a human reviewer at 100x the cost.
Dental practice — voice on gpt-4o-mini with Azure. Bright Smile Dental runs a Phone assistant on gpt-4o-mini + Azure STT + Azure TTS. Baseline voice rate: 6¢/min. Average call: 3 minutes. Cost per call: 18¢.
Luxury concierge — premium Realtime. A high-end travel concierge uses Realtime gpt-4o-realtime-preview for the brand-defining voice experience. Cost: 48¢/min. Average call: 4 minutes. Cost per call: $1.92 — justified by an average booking value above $5,000.
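
The per-call and per-conversation costs in these four cases can be reproduced with two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the voice cases are billed as rate per minute times call length, and the restaurant case is assumed to use the 6¢/min baseline voice rate (which matches the ~9¢ figure quoted above).

```python
def voice_call_cost_cents(rate_cents_per_min: float, seconds: float) -> float:
    """Cost of one voice call: per-minute rate times call length in minutes."""
    return rate_cents_per_min * seconds / 60


def chat_conversation_cost_cents(cost_per_query_cents: float, turns: int) -> float:
    """Cost of one chat conversation: per-query cost times number of turns."""
    return cost_per_query_cents * turns


if __name__ == "__main__":
    # Rates and call lengths are taken from the four use cases above.
    restaurant = voice_call_cost_cents(6, 90)       # baseline voice rate (assumed), 90 s
    fintech = chat_conversation_cost_cents(3.0, 4)  # o3-mini, 4 turns
    dental = voice_call_cost_cents(6, 180)          # baseline voice rate, 3 min
    concierge = voice_call_cost_cents(48, 240)      # premium Realtime, 4 min
    for name, cents in [("restaurant", restaurant), ("fintech", fintech),
                        ("dental", dental), ("concierge", concierge)]:
        print(f"{name}: {cents} cents")
```

Swap in your own average call length or turn count to estimate what a model change would do to your monthly bill.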
A decision tree
Chat or phone assistant?
├─ Default → gpt-4o-mini (1× cost)
├─ Multi-step reasoning needed → o3-mini (2× cost)
└─ Still failing → gpt-4o-2024-05-13 (10× cost)
Realtime voice assistant?
├─ Default → gpt-4o-mini-realtime-preview (12¢/min)
└─ Premium experience → gpt-4o-realtime-preview (48¢/min)
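
If you prefer to read the tree as code, it can be sketched as a single function. The function name `pick_model` and its flags are hypothetical and exist only to mirror the branches above. Nothing here calls Insighto.

```python
def pick_model(realtime: bool, needs_reasoning: bool = False,
               still_failing: bool = False, premium: bool = False) -> str:
    """Mirror the decision tree above. All parameter names are illustrative."""
    if realtime:
        # Realtime voice assistants choose between the two bundled models.
        return "gpt-4o-realtime-preview" if premium else "gpt-4o-mini-realtime-preview"
    if still_failing:
        # The cheaper models have both been tried and still miss nuance.
        return "gpt-4o-2024-05-13"
    if needs_reasoning:
        # Multi-step logic: tool chains, structured analysis, comparison.
        return "o3-mini"
    # Default for chat and phone assistants.
    return "gpt-4o-mini"


if __name__ == "__main__":
    print(pick_model(realtime=False))
    print(pick_model(realtime=False, needs_reasoning=True))
    print(pick_model(realtime=True, premium=True))
```

The order of the checks matters: "still failing" is tested before "needs reasoning" so that an assistant that has already outgrown o3-mini is sent to the 10× model.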
Switching models on a live assistant
The model is a single dropdown on the assistant. Change it, save, and the next conversation uses the new model. No re-indexing, no migration. A useful A/B technique:
- Clone the assistant.
- Change the clone's model.
- Point a test widget at the clone.
- Compare results in the Conversations view for a few days.
What does not automatically carry across a model switch is prompt tuning. A prompt tuned for gpt-4o-mini may need adjustment on o3-mini or gpt-4o. Re-tune in the Playground after any swap.
BYOK and pricing
If you've configured your own OpenAI key under Settings → BYOK Credentials, LLM costs are billed directly to your provider account. Insighto's wallet doesn't get touched for the LLM line item. See BYOK Credentials and Pricing.
Where to next
- Pricing — the full per-minute / per-query matrix.
- Writing a system prompt — most "the model is dumb" complaints are actually prompt complaints.
- Voice settings — for the STT/TTS half of a voice assistant.