Voices library
The Voices library is the catalogue of every text-to-speech voice your assistants can use, plus your custom-cloned voices. Browse by language and provider, preview each with a play button, and pick the one that fits your brand. The voice you choose directly affects per-minute cost — pick wisely.
TTS providers
Five options (four providers plus a custom-cloned ElevenLabs tier), each with its own price multiplier.
| Provider | Multiplier | Effective voice cost | Best for |
|---|---|---|---|
| Azure Speech | 1× | 2¢ / min | Default. Strong neural voices, the widest language catalogue. |
| Cartesia | 1.75× | 3.5¢ / min | Lowest first-byte latency. Good when call rhythm matters. |
| ElevenLabs | 2.5× | 5¢ / min | Most expressive voices. Premium brand experiences. |
| PlayHT | 2.5× | 5¢ / min | Distinctive voice library, conversational quality. |
| Custom-cloned (ElevenLabs) | 4× | 8¢ / min | Your own voice — the receptionist sounds like the founder. |
STT providers (separate from TTS)
For speech-to-text:
- Azure Speech — default, widest language coverage.
- Deepgram — lowest latency in English.
- ElevenLabs — matched workflows when you're using ElevenLabs TTS.
All STT providers bill at the same flat 1¢ / minute. Pick by language and latency.
What you see in the picker
The library page lets you filter by language, provider, and gender. Each row has a play button that streams a sample sentence in that voice — provider-supplied, so previewing costs you nothing.
Sample voice names you'll find:
- Azure: en-US-AvaNeural, en-US-AndrewNeural, en-GB-LibbyNeural, de-DE-KatjaNeural, es-ES-ElviraNeural, hi-IN-SwaraNeural.
- ElevenLabs: Rachel, Adam, Bella, Antoni.
- Cartesia: named voices with regional flavour.
- PlayHT: distinctive conversational voices.
For Realtime OpenAI assistants, the picker switches to OpenAI's built-in voices: alloy, echo, fable, onyx, nova, shimmer, plus newer additions. You can't mix in a third-party voice on a Realtime assistant.
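As a rough illustration of the Realtime restriction, the sketch below checks a provider and voice pair before it is assigned to a Realtime OpenAI assistant. The function name and the "openai" provider label are hypothetical, and the set holds only the six voices named above, so the unnamed newer additions would be rejected until they are added.

```python
# Built-in voices named in the text. The "newer additions" are not listed
# there, so this set is deliberately incomplete.
OPENAI_REALTIME_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}


def realtime_voice_ok(provider: str, voice: str) -> bool:
    """Hypothetical check: Realtime assistants accept only OpenAI built-in voices."""
    # "openai" is an assumed provider label, not a documented identifier.
    return provider == "openai" and voice in OPENAI_REALTIME_VOICES
```

With these assumptions, `realtime_voice_ok("openai", "nova")` passes, while any ElevenLabs or Azure voice fails.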
Four business use cases
Dental clinic: Azure neural, the cheap-but-good choice. Bright Smile Dental picks en-US-AvaNeural from Azure. Voice cost: 2¢/min; total call cost 6¢/min with gpt-4o-mini. Patients consistently say the assistant "sounded like a real person".
Real estate: bilingual EN/ES. A San Diego brokerage picks en-US-AvaNeural and es-MX-DaliaNeural. The assistant auto-switches between them based on the language of the call. Both stay on the 2¢/min Azure tier.
Luxury concierge: ElevenLabs custom clone. A boutique travel agency cloned its founder's voice via ElevenLabs and assigned it to its VIP-line assistant. Voice cost: 8¢/min. The 4× premium is justified by the brand experience for $5,000+ trips.
Telehealth: Cartesia for snap responsiveness. A virtual-clinic company picked Cartesia specifically for its first-byte latency. Patients notice the absence of the half-second silence between their question and the reply.
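To make the bilingual real-estate case concrete, here is a minimal sketch of mapping a detected call language to one of the two Azure voices. The platform handles the switching itself. The mapping table, the function name, and the fallback to English are assumptions for illustration only, and language detection is out of scope.

```python
# Voices from the real-estate example; both are on the Azure 1x tier.
VOICE_BY_LANGUAGE = {
    "en": "en-US-AvaNeural",
    "es": "es-MX-DaliaNeural",
}
DEFAULT_LANGUAGE = "en"  # assumed fallback, not stated in the docs


def voice_for(language_tag: str) -> str:
    """Map a language tag such as 'es' or 'es-MX' to a configured voice."""
    # Keep only the primary subtag so 'es-MX' and 'es' resolve the same way.
    primary = language_tag.split("-")[0].lower()
    return VOICE_BY_LANGUAGE.get(primary, VOICE_BY_LANGUAGE[DEFAULT_LANGUAGE])
```

A caller detected as `es-MX` would get es-MX-DaliaNeural. Any language without an entry falls back to the English voice under this sketch.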
Picking a voice on an assistant
Build → Assistants → [your assistant] → Voice → pick from catalogue → Save.
Switching voices on a live assistant is safe — in-flight calls finish on the old voice; new calls use the new voice.
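One way to picture this behaviour is that each call copies the assistant's voice when it starts. The sketch below models that idea with two hypothetical classes. It illustrates the described semantics and is not a description of how Insighto implements them.

```python
from dataclasses import dataclass, field


@dataclass
class Assistant:
    voice: str


@dataclass
class Call:
    assistant: Assistant
    voice: str = field(init=False)

    def __post_init__(self) -> None:
        # Copy the voice once, at call start, so later edits to the
        # assistant do not affect a call that is already in progress.
        self.voice = self.assistant.voice


assistant = Assistant(voice="en-US-AvaNeural")
in_flight = Call(assistant)      # starts before the switch
assistant.voice = "Rachel"       # voice changed and saved
new_call = Call(assistant)       # starts after the switch
```

Here `in_flight.voice` stays en-US-AvaNeural, while `new_call.voice` is Rachel.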
Custom voice cloning (ElevenLabs)
The cloning flow:
- Open Build → Library → Voices → Create custom voice.
- Give it a name and short description.
- Upload an audio sample: 16 kHz mono, no background music, at least 30 seconds long. The cleaner the sample, the cleaner the clone.
- Save. ElevenLabs trains the clone (a few minutes).
- The new voice appears in your assistant's voice picker, tagged as custom.
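Before uploading, it can help to check the sample against the stated requirements. The sketch below uses Python's standard `wave` module, so it only reads uncompressed PCM WAV files. The function name is hypothetical, and it cannot detect background music, which still needs a listen.

```python
import wave

MIN_SECONDS = 30
REQUIRED_RATE_HZ = 16_000
REQUIRED_CHANNELS = 1  # mono


def check_sample(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file meets the spec."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {REQUIRED_RATE_HZ} Hz")
    if channels != REQUIRED_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    if seconds < MIN_SECONDS:
        problems.append(f"only {seconds:.1f} s long, need at least {MIN_SECONDS} s")
    return problems
```

Calling `check_sample("/path/to/sample.wav")` returns an empty list when the file is 16 kHz, mono, and at least 30 seconds long.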
Custom voices require the custom_voice entitlement on your plan. If you don't see the Create custom voice button, you don't have the entitlement.
API equivalent:
curl -X POST https://api.insighto.ai/api/v1/voice/custom_voice \
  -H "Authorization: Bearer $TOKEN" \
  -F "name=Founder voice" \
  -F "description=Cloned from Founder Smith's onboarding call" \
  -F "audio=@/path/to/sample.wav"
How voice choice affects cost
Each voice carries a multiplier shown in the picker. The voice line in your per-minute cost is 2¢ × multiplier. The other lines (STT, LLM, platform) don't change with the voice — so a switch from Azure (1×) to ElevenLabs (2.5×) adds 3¢ to every minute of every call.
If you've set up BYOK Credentials for ElevenLabs or Azure Speech, voice cost drops to 0¢ because Insighto calls your account directly. See BYOK Credentials.
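The arithmetic above can be sketched as a small calculator. The multipliers come from the TTS providers table. The provider keys, function names, and the BYOK handling (limited here to the two providers named above) are assumptions for illustration.

```python
from decimal import Decimal

BASE_VOICE_CENTS = Decimal("2")  # base voice rate per minute

# Multipliers copied from the TTS providers table.
MULTIPLIERS = {
    "azure": Decimal("1"),
    "cartesia": Decimal("1.75"),
    "elevenlabs": Decimal("2.5"),
    "playht": Decimal("2.5"),
    "custom_clone": Decimal("4"),
}

# The text mentions BYOK only for ElevenLabs and Azure Speech.
BYOK_PROVIDERS = {"azure", "elevenlabs"}


def voice_cents_per_minute(provider: str, byok: bool = False) -> Decimal:
    """Voice line in cents per minute; 0 when BYOK applies to the provider."""
    if byok and provider in BYOK_PROVIDERS:
        return Decimal("0")
    return BASE_VOICE_CENTS * MULTIPLIERS[provider]


def switch_delta_dollars(old: str, new: str, minutes: int) -> Decimal:
    """Extra dollars on the voice line for `minutes` of calls after a switch."""
    delta_cents = voice_cents_per_minute(new) - voice_cents_per_minute(old)
    return delta_cents * minutes / Decimal("100")
```

Under these numbers, moving 1,000 minutes a month from Azure to ElevenLabs adds 3¢ × 1,000 = $30 to the voice line, and the STT, LLM, and platform lines are unaffected.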
Quick recipes
- Cheapest viable phone bot. Azure neural English: pick any en-US-*Neural voice. 2¢/min.
- Premium concierge. ElevenLabs: for brand-defining voices or a custom clone.
- Low-latency interactive. Cartesia: the fastest first-byte time in the picker.
- Wide language coverage. Azure: by far the largest catalogue of locales.
- Realtime sub-500ms latency. Realtime OpenAI with nova or shimmer: bundles STT + TTS + LLM at 12¢/min.
Where to next
- Voice settings — the full STT/TTS pipeline configuration.
- Choosing an LLM — for the LLM half of voice pricing.
- Pricing — full per-minute breakdown.