
Voice settings

Voice settings let you pick how your assistant hears (speech-to-text) and how it sounds (text-to-speech). The right combination is the difference between a call that feels effortless and one customers hang up on. This page covers every provider, the cost multipliers, language coverage, and voicemail handling.

Voice settings apply to Phone and Realtime OpenAI assistant types only.

The voice pipeline

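A Phone assistant runs three stages in sequence: speech-to-text (STT), the LLM, and text-to-speech (TTS). For Realtime OpenAI assistants, those three stages collapse into one streaming call: STT, the LLM, and TTS all happen inside OpenAI's Realtime API.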

Speech-to-text (STT) providers

Provider | Best for
Azure Speech | Widest language coverage; very reliable on phone-quality audio. Default for most assistants.
Deepgram | Lowest latency in English. Good when call-flow latency matters more than language coverage.
ElevenLabs | Available as an option for matched workflows where you're already using ElevenLabs TTS.

STT cost is a flat 1¢ per minute across all three providers. Pick by language coverage and latency, not price.

Text-to-speech (TTS) providers and voice multipliers

The voice you pick decides part of the per-minute cost. Each TTS provider has a multiplier on the base 2¢/min voice line:

Provider | Multiplier | Effective voice cost
Azure Speech | 1× | 2¢ / min
Cartesia | 1.75× | 3.5¢ / min
ElevenLabs | 2.5× | 5¢ / min
PlayHT | 2.5× | 5¢ / min
Custom-cloned voice (ElevenLabs) | 4× | 8¢ / min

Azure neural voices are good enough that the premium often doesn't pay for itself — start there, upgrade only when you can hear the difference matters to your brand.

Per-minute pricing examples

Voice baseline is 6¢/min (1¢ STT + 1¢ LLM at 1× + 2¢ TTS at 1× + 2¢ platform). Switching providers shifts the components:

Setup | Cost / min
gpt-4o-mini + Azure STT + Azure TTS | 6¢
gpt-4o-mini + Azure STT + Cartesia | 7.5¢
gpt-4o-mini + Azure STT + ElevenLabs | 9¢
gpt-4o-mini + Azure STT + custom-cloned voice | 12¢
o3-mini + Azure STT + Azure TTS | (see Pricing)
Realtime OpenAI (gpt-4o-mini-realtime-preview) | 12¢ (bundled)
Realtime OpenAI (gpt-4o-realtime-preview) | 48¢ (bundled)

See Pricing for the full table.
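The per-minute arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not Insighto's billing code: the function and constant names are made up for the example, the component prices are the ones quoted on this page, and the 4× figure for a custom-cloned voice is inferred from 8¢ against the 2¢ base. The LLM multiplier defaults to 1×, which is what the baseline above uses; multipliers for other models are not listed here, so pass your own.

```python
# Component prices in cents per minute, as quoted on this page.
STT_CENTS = 1.0        # flat across all STT providers
LLM_BASE_CENTS = 1.0   # LLM line at a 1x multiplier
TTS_BASE_CENTS = 2.0   # voice line at a 1x multiplier
PLATFORM_CENTS = 2.0

# TTS multipliers on the 2 cent base voice line.
# "elevenlabs_custom" is inferred: 8 cents / 2 cents base = 4x.
TTS_MULTIPLIER = {
    "azure": 1.0,
    "cartesia": 1.75,
    "elevenlabs": 2.5,
    "playht": 2.5,
    "elevenlabs_custom": 4.0,
}


def cost_per_minute(tts_provider: str, llm_multiplier: float = 1.0) -> float:
    """Return the all-in cost in cents per minute for a Phone assistant."""
    tts = TTS_BASE_CENTS * TTS_MULTIPLIER[tts_provider]
    llm = LLM_BASE_CENTS * llm_multiplier
    return STT_CENTS + llm + tts + PLATFORM_CENTS


def cost_per_call(tts_provider: str, minutes: float,
                  llm_multiplier: float = 1.0) -> float:
    """Return the cost in cents of a call lasting `minutes`."""
    return cost_per_minute(tts_provider, llm_multiplier) * minutes
```

With these numbers, `cost_per_minute("cartesia")` gives 7.5 and `cost_per_call("azure", 3)` gives 18.0, matching the Cartesia row and a 3-minute call at the 6¢ baseline. Realtime OpenAI pricing is bundled and does not follow this formula.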

Picking a voice

Pick a voice from Build → Assistants → [your assistant] → Voice. The picker uses the same catalogue as Build → Library → Voices, and you can filter it by language, provider, and gender. Every voice has a play button that streams a sample. Always preview before saving: voices that look alike on paper can sound very different.

For Realtime OpenAI assistants, the picker is OpenAI's built-in list: alloy, echo, fable, onyx, nova, shimmer, plus newer additions. You can't route Realtime audio through Azure or ElevenLabs.

Language

Set the working languages under Voice → Languages. This decides which STT model is loaded and filters the voice picker to matching locales. You can pick multiple — the assistant will auto-detect within the set.

Common choices:

  • en-US — North American English
  • en-GB — British English
  • de-DE — German
  • es-ES — European Spanish
  • es-MX — Latin American Spanish
  • hi-IN — Hindi
  • mr-IN — Marathi
  • fr-FR — French

Azure has the largest catalogue; if your locale is unusual, start there.
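To make the locale filtering concrete, here is a small sketch of how working languages might narrow a voice catalogue, and how to spot a voice whose locale no longer matches after a language change. The `Voice` class, the `CATALOGUE` list, and both helper functions are hypothetical and exist only for this example; they are not part of any Insighto API, and the catalogue entries are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    name: str
    provider: str
    locale: str


# Hypothetical catalogue entries, for illustration only.
CATALOGUE = [
    Voice("en-US-AvaNeural", "azure", "en-US"),
    Voice("es-MX-DaliaNeural", "azure", "es-MX"),
    Voice("de-DE-KatjaNeural", "azure", "de-DE"),
]


def voices_for(languages: set[str],
               catalogue: list[Voice] = CATALOGUE) -> list[Voice]:
    """Keep only voices whose locale is one of the working languages."""
    return [v for v in catalogue if v.locale in languages]


def voice_matches(voice: Voice, languages: set[str]) -> bool:
    """Return False when the selected voice is stale for the language set."""
    return voice.locale in languages
```

For a bilingual en-US and es-MX assistant, `voices_for({"en-US", "es-MX"})` keeps the first two entries and drops the German voice. Running `voice_matches` after every language change is one way to catch a voice that was picked for a locale you have since removed.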

Four business use cases

Dental clinic — English only, Azure. Bright Smile Dental uses Azure STT + an en-US-AvaNeural voice. Total voice cost: 6¢/min with gpt-4o-mini. Average call: 3 minutes. Cost per call: 18¢.

Real estate — bilingual EN/ES, Azure. A San Diego brokerage sets working languages to en-US and es-MX. The assistant detects which the caller is speaking and switches automatically. Both use Azure neural voices.

Luxury concierge — ElevenLabs custom clone. A boutique travel agency cloned its founder's voice via ElevenLabs and uses it for VIP callers. Voice line: 8¢/min, which comes to 12¢/min all-in on gpt-4o-mini. Average call: about 5 minutes. Cost per call: ~62¢. The brand experience justifies the premium.

Healthcare scheduling — Cartesia for low latency. A telehealth company uses Cartesia TTS specifically because Cartesia's first-byte latency is the fastest in the picker. Patients notice the absence of a half-second silence between their question and the assistant's reply.

Voicemail handling

When the call hits a voicemail box instead of a person, you usually want to leave a message and hang up — not let the assistant talk to dead air. Configure this in Voice → Advanced:

  • Detect voicemail — set to drop to enable voicemail-drop behaviour.
  • Voicemail message — the text the assistant will speak when it detects voicemail. Insighto renders this once to MP3 and plays it automatically.

Example message: Hi, this is the appointment desk at Bright Smile Dental returning your call. We were trying to confirm your visit on Friday. Please call us back at 512-555-0123.

Custom voice cloning

You can clone a voice on ElevenLabs and use it on any Phone assistant:

  1. Go to Build → Library → Voices → Create custom voice.
  2. Upload a clean audio sample — 16 kHz mono, no background music, at least 30 seconds.
  3. Save. The voice appears in the picker tagged as custom.

Custom voices require the custom_voice entitlement on your plan. Sample quality is the floor for clone quality — pay attention to room noise and microphone.
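Before uploading, you can check the measurable parts of the sample spec locally. The sketch below uses Python's standard `wave` module and assumes the sample is an uncompressed WAV file, which this page does not actually specify. It also treats 16 kHz as an exact requirement rather than a minimum, another assumption. The function name `check_sample` is made up for the example, and the check cannot detect background music or room noise; listen to the file for that.

```python
import wave

REQUIRED_RATE_HZ = 16_000   # "16 kHz" from the upload step, treated as exact
REQUIRED_CHANNELS = 1       # mono
MIN_SECONDS = 30


def check_sample(path: str) -> list[str]:
    """Return a list of problems; an empty list means the basic specs are met."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate

    problems = []
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {REQUIRED_RATE_HZ} Hz")
    if channels != REQUIRED_CHANNELS:
        problems.append(f"file has {channels} channels, expected mono")
    if seconds < MIN_SECONDS:
        problems.append(f"duration is {seconds:.1f} s, expected at least {MIN_SECONDS} s")
    return problems
```

An empty result only means the file meets the sample-rate, channel, and length requirements; clone quality still depends on the recording itself.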

BYOK voice

If you've configured ElevenLabs or Azure Speech under Settings → BYOK Credentials, the voice cost on this assistant drops, because Insighto routes the audio through your own provider account. LLM and platform costs still apply. See BYOK Credentials.

Common mistakes

  • Defaulting to ElevenLabs because it sounds better in demos. Azure is good enough at 40% of the ElevenLabs voice cost (2¢ vs 5¢ per minute), so try it first.
  • Forgetting to update voice after switching language. Voice locale and assistant language are independent. Re-pick the voice when you change languages.
  • Using Realtime OpenAI for cost-sensitive workloads. Realtime is 12–48¢/min — pick it only when sub-500ms latency justifies the cost.
  • Wiring a Phone-type assistant to a chat widget. No audio will render. Match assistant type to widget type.
