Voice settings
Voice settings let you pick how your assistant hears (speech-to-text) and how it sounds (text-to-speech). The right combination is the difference between a call that feels effortless and one customers hang up on. This page covers every provider, the cost multipliers, language coverage, and voicemail handling.
Voice settings apply to Phone and Realtime OpenAI assistant types only.
The voice pipeline
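A Phone assistant runs each turn through three stages: speech-to-text (STT), the LLM, and text-to-speech (TTS). For Realtime OpenAI assistants, those three stages collapse into one streaming call: STT, the LLM, and TTS all happen inside OpenAI's Realtime API.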
Speech-to-text (STT) providers
| Provider | Best for |
|---|---|
| Azure Speech | Widest language coverage; very reliable on phone-quality audio. Default for most assistants. |
| Deepgram | Lowest latency in English. Good when call-flow latency matters more than language coverage. |
| ElevenLabs | Available as an option for matched workflows where you're already using ElevenLabs TTS. |
STT cost is a flat 1¢ per minute across all three providers. Pick by language coverage and latency, not price.
Text-to-speech (TTS) providers and voice multipliers
The voice you pick decides part of the per-minute cost. Each TTS provider has a multiplier on the base 2¢/min voice line:
| Provider | Multiplier | Effective voice cost |
|---|---|---|
| Azure Speech | 1× | 2¢ / min |
| Cartesia | 1.75× | 3.5¢ / min |
| ElevenLabs | 2.5× | 5¢ / min |
| PlayHT | 2.5× | 5¢ / min |
| Custom-cloned voice (ElevenLabs) | 4× | 8¢ / min |
Azure neural voices are good enough that the premium often doesn't pay for itself — start there, upgrade only when you can hear the difference matters to your brand.
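The multiplier arithmetic in the table can be reproduced with a short sketch. The base rate and multipliers are copied from this page; the function and variable names are illustrative only and are not part of any Insighto API.

```python
from decimal import Decimal

BASE_VOICE_CENTS = Decimal("2")  # base voice line, in cents per minute

# Multipliers copied from the table above.
TTS_MULTIPLIER = {
    "Azure Speech": Decimal("1"),
    "Cartesia": Decimal("1.75"),
    "ElevenLabs": Decimal("2.5"),
    "PlayHT": Decimal("2.5"),
    "Custom-cloned voice (ElevenLabs)": Decimal("4"),
}


def effective_voice_cost(provider: str) -> Decimal:
    """Return the voice line in cents per minute for a TTS provider."""
    return BASE_VOICE_CENTS * TTS_MULTIPLIER[provider]


for name in TTS_MULTIPLIER:
    print(f"{name}: {effective_voice_cost(name)} cents / min")
```

Decimal is used instead of float so that fractional cents such as 3.5 stay exact.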
Per-minute pricing examples
Voice baseline is 6¢/min (1¢ STT + 1¢ LLM at 1× + 2¢ TTS at 1× + 2¢ platform). Switching providers shifts the components:
| Setup | Cost / min |
|---|---|
| gpt-4o-mini + Azure STT + Azure TTS | 6¢ |
| gpt-4o-mini + Azure STT + Cartesia | 7.5¢ |
| gpt-4o-mini + Azure STT + ElevenLabs | 9¢ |
| gpt-4o-mini + Azure STT + custom-cloned voice | 12¢ |
| o3-mini + Azure STT + Azure TTS | 7¢ |
| Realtime OpenAI (gpt-4o-mini-realtime-preview) | 12¢ (bundled) |
| Realtime OpenAI (gpt-4o-realtime-preview) | 48¢ (bundled) |
See Pricing for the full table.
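As a rough check on the non-Realtime rows, the per-minute total can be rebuilt from its four components. This is a sketch, not the billing logic: the function name is made up, and the 2× LLM multiplier for o3-mini is an inference from the 7¢ row rather than a figure stated on this page. The bundled Realtime rows are not modelled.

```python
from decimal import Decimal

STT_CENTS = Decimal("1")        # flat across STT providers
LLM_BASE_CENTS = Decimal("1")   # LLM line at 1x
TTS_BASE_CENTS = Decimal("2")   # voice line at 1x
PLATFORM_CENTS = Decimal("2")


def cost_per_minute(llm_multiplier: str = "1", tts_multiplier: str = "1") -> Decimal:
    """Sum the four components, scaling the LLM and TTS lines by their multipliers."""
    return (
        STT_CENTS
        + LLM_BASE_CENTS * Decimal(llm_multiplier)
        + TTS_BASE_CENTS * Decimal(tts_multiplier)
        + PLATFORM_CENTS
    )


# (LLM multiplier, TTS multiplier) per setup; the "2" for o3-mini is inferred.
setups = {
    "gpt-4o-mini + Azure TTS": ("1", "1"),
    "gpt-4o-mini + Cartesia": ("1", "1.75"),
    "gpt-4o-mini + ElevenLabs": ("1", "2.5"),
    "gpt-4o-mini + custom-cloned voice": ("1", "4"),
    "o3-mini + Azure TTS": ("2", "1"),
}

for label, (llm, tts) in setups.items():
    print(f"{label}: {cost_per_minute(llm, tts)} cents / min")
```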
Picking a voice
Open Build → Assistants → [your assistant] → Voice. The picker shows the same catalogue as Build → Library → Voices, and you can filter it by language, provider, and gender. Every voice has a play button that streams a sample. Always preview before saving: voices that look alike on paper can sound very different.
For Realtime OpenAI assistants, the picker is OpenAI's built-in list: alloy, echo, fable, onyx, nova, shimmer, plus newer additions. You can't route Realtime audio through Azure or ElevenLabs.
Language
Set the working languages under Voice → Languages. This decides which STT model is loaded and filters the voice picker to matching locales. You can pick multiple — the assistant will auto-detect within the set.
Common choices:
- `en-US`: North American English
- `en-GB`: British English
- `de-DE`: German
- `es-ES`: European Spanish
- `es-MX`: Latin American Spanish
- `hi-IN`: Hindi
- `mr-IN`: Marathi
- `fr-FR`: French
Azure has the largest catalogue; if your locale is unusual, start there.
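The way a set of working languages narrows the voice picker can be pictured as a simple locale filter. The sketch below is only an illustration of that idea: the `Voice` class, the catalogue, and the two `example-...` voice names are hypothetical, and the real picker may match locales differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    name: str
    provider: str
    locale: str


# Tiny stand-in catalogue; only en-US-AvaNeural is named on this page.
CATALOGUE = [
    Voice("en-US-AvaNeural", "Azure Speech", "en-US"),
    Voice("example-es-mx-voice", "Azure Speech", "es-MX"),
    Voice("example-de-voice", "ElevenLabs", "de-DE"),
]


def voices_for(working_languages: set[str], catalogue=CATALOGUE) -> list[Voice]:
    """Keep only voices whose locale is one of the working languages."""
    return [voice for voice in catalogue if voice.locale in working_languages]


for voice in voices_for({"en-US", "es-MX"}):
    print(voice.name, voice.locale)
```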
Four business use cases
Dental clinic: English only, Azure. Bright Smile Dental uses Azure STT + an en-US-AvaNeural voice. Total voice cost: 6¢/min with gpt-4o-mini. Average call: 3 minutes. Cost per call: 18¢.
Real estate: bilingual EN/ES, Azure. A San Diego brokerage sets working languages to en-US and es-MX. The assistant detects which language the caller is speaking and switches automatically. Both languages use Azure neural voices.
Luxury concierge: ElevenLabs custom clone. A boutique travel agency cloned its founder's voice via ElevenLabs and uses it for VIP callers. Voice line: 8¢/min (12¢/min all-in on the gpt-4o-mini + Azure STT setup in the pricing examples). Average call: 5 minutes. Cost per call at that rate: about 60¢. The brand experience justifies the premium.
Healthcare scheduling: Cartesia for low latency. A telehealth company uses Cartesia TTS specifically because Cartesia's first-byte latency is the fastest in the picker. Patients notice the absence of a half-second silence between their question and the assistant's reply.
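Per-call cost is the all-in per-minute rate multiplied by the average call length. The sketch below reworks the dental clinic figures and then scales them to a monthly volume; the 400 calls per month is a made-up number for illustration, and the function name is not part of any API.

```python
from decimal import Decimal


def cost_per_call(rate_cents_per_min: str, avg_minutes: str) -> Decimal:
    """Return the cost of one call in cents."""
    return Decimal(rate_cents_per_min) * Decimal(avg_minutes)


# Dental clinic: 6 cents/min all-in, 3-minute average call.
dental = cost_per_call("6", "3")
print(f"Dental clinic: {dental} cents per call")  # 18 cents, matching the text

# Hypothetical volume, used only to show the scaling.
calls_per_month = 400
monthly_dollars = dental * calls_per_month / 100
print(f"At {calls_per_month} calls/month: ${monthly_dollars}")
```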
Voicemail handling
When the call hits a voicemail box instead of a person, you usually want to leave a message and hang up — not let the assistant talk to dead air. Configure this in Voice → Advanced:
- Detect voicemail: set to `drop` to enable voicemail-drop behaviour.
- Voicemail message: the text the assistant will speak when it detects voicemail. Insighto renders this once to MP3 and plays it automatically.
Example message: Hi, this is the appointment desk at Bright Smile Dental returning your call. We were trying to confirm your visit on Friday. Please call us back at 512-555-0123.
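If you keep assistant settings in your own tooling, the two voicemail fields can be modelled and sanity-checked before you paste them into the UI. This is a sketch under assumptions: the class, the field names, and the `"off"` default are hypothetical, since this page only documents the `drop` value.

```python
from dataclasses import dataclass


@dataclass
class VoicemailSettings:
    # "drop" is the value named on this page; "off" is a hypothetical default.
    detect_voicemail: str = "off"
    voicemail_message: str = ""

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the settings look usable."""
        problems = []
        if self.detect_voicemail == "drop" and not self.voicemail_message.strip():
            problems.append("voicemail drop is enabled but the message is empty")
        return problems


settings = VoicemailSettings(
    detect_voicemail="drop",
    voicemail_message=(
        "Hi, this is the appointment desk at Bright Smile Dental returning your call. "
        "Please call us back at 512-555-0123."
    ),
)
print(settings.validate() or "ok")
```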
Custom voice cloning
You can clone a voice on ElevenLabs and use it on any Phone assistant:
- Go to Build → Library → Voices → Create custom voice.
- Upload a clean audio sample — 16 kHz mono, no background music, at least 30 seconds.
- Save. The voice appears in the picker tagged as custom.
Custom voices require the custom_voice entitlement on your plan. Sample quality is the floor for clone quality — pay attention to room noise and microphone.
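Before uploading, the measurable parts of the sample spec (16 kHz, mono, at least 30 seconds) can be checked locally. The sketch below uses Python's standard `wave` module and assumes the sample is an uncompressed WAV file, which this page does not specify. It cannot judge background music or room noise; listen to the sample for that.

```python
import wave

MIN_SECONDS = 30
REQUIRED_RATE_HZ = 16_000
REQUIRED_CHANNELS = 1  # mono


def check_sample(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file meets the stated specs."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate

    problems = []
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {REQUIRED_RATE_HZ} Hz")
    if channels != REQUIRED_CHANNELS:
        problems.append(f"file has {channels} channels, expected mono")
    if seconds < MIN_SECONDS:
        problems.append(f"sample is {seconds:.1f} s long, need at least {MIN_SECONDS} s")
    return problems
```

Calling `check_sample("founder.wav")` (a hypothetical file name) returns an empty list when all three checks pass.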
BYOK voice
If you've configured ElevenLabs or Azure Speech under Settings → BYOK Credentials, voice cost on this assistant drops to 0¢ because Insighto routes the audio through your account. LLM and platform cost still apply. See BYOK Credentials.
Common mistakes
- Defaulting to ElevenLabs because it sounds better in demos. Azure is good enough at 40% of the voice-line cost (2¢ vs 5¢ per minute), so try it first.
- Forgetting to update voice after switching language. Voice locale and assistant language are independent. Re-pick the voice when you change languages.
- Using Realtime OpenAI for cost-sensitive workloads. Realtime is 12–48¢/min — pick it only when sub-500ms latency justifies the cost.
- Wiring a Phone-type assistant to a chat widget. No audio will render. Match assistant type to widget type.
Where to next
- Choosing an LLM — the LLM half of the voice pipeline.
- Voices library — the full catalogue and the custom-clone flow.
- Pricing — the full per-minute breakdown.