
Voices library

The Voices library is the catalogue of every text-to-speech voice your assistants can use, plus your custom-cloned voices. Browse by language and provider, preview each with a play button, and pick the one that fits your brand. The voice you choose directly affects per-minute cost — pick wisely.

TTS providers

Four providers plus custom-cloned voices (built on ElevenLabs), each with its own catalogue and price multiplier.

Provider | Multiplier | Effective voice cost | Best for
Azure Speech | 1× | 2¢ / min | Default. Strong neural voices, the widest language catalogue.
Cartesia | 1.75× | 3.5¢ / min | Lowest first-byte latency. Good when call rhythm matters.
ElevenLabs | 2.5× | 5¢ / min | Most expressive voices. Premium brand experiences.
PlayHT | 2.5× | 5¢ / min | Distinctive voice library, conversational quality.
Custom-cloned (ElevenLabs) | 4× | 8¢ / min | Your own voice: the receptionist sounds like the founder.

STT providers (separate from TTS)

For speech-to-text:

  • Azure Speech — default, widest language coverage.
  • Deepgram — lowest latency in English.
  • ElevenLabs — matched workflows when you're using ElevenLabs TTS.

All STT providers bill at the same flat 1¢ / minute. Pick by language and latency.

What you see in the picker

The library page lets you filter by language, provider, and gender. Each row has a play button that streams a sample sentence in that voice — provider-supplied, so previewing costs you nothing.

Sample voice names you'll find:

  • Azure: en-US-AvaNeural, en-US-AndrewNeural, en-GB-LibbyNeural, de-DE-KatjaNeural, es-ES-ElviraNeural, hi-IN-SwaraNeural.
  • ElevenLabs: Rachel, Adam, Bella, Antoni.
  • Cartesia: named voices with regional flavour.
  • PlayHT: distinctive conversational voices.

For Realtime OpenAI assistants, the picker switches to OpenAI's built-in voices: alloy, echo, fable, onyx, nova, shimmer, plus newer additions. You can't mix in a third-party voice on a Realtime assistant.

Four business use cases

Dental clinic — Azure neural, the cheap-but-good choice. Bright Smile Dental picks en-US-AvaNeural from Azure. Voice cost is 2¢/min, and total call cost is 6¢/min with gpt-4o-mini. Patients consistently say the assistant "sounded like a real person".

Real estate — bilingual EN/ES. A San Diego brokerage picks en-US-AvaNeural and es-MX-DaliaNeural. The assistant auto-switches between them based on the language of the call. Both stay on the 2¢/min Azure tier.
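The bilingual setup above amounts to a language-to-voice mapping. The sketch below is only an illustration of that idea: the platform handles the switch itself, and the function name, the fallback rule, and the use of language tags as input are all assumptions rather than part of the product. Only the two voice names come from the example.

```python
# Hypothetical mapping from a call's primary language to an Azure voice.
# The two voice names are taken from the brokerage example; everything else is assumed.
VOICE_BY_LANGUAGE = {
    "en": "en-US-AvaNeural",
    "es": "es-MX-DaliaNeural",
}
DEFAULT_VOICE = "en-US-AvaNeural"  # assumed fallback when the language is not mapped


def voice_for(language_tag: str) -> str:
    """Return the voice for a tag such as 'es-MX' or 'en', falling back to the default."""
    primary = language_tag.split("-")[0].lower()
    return VOICE_BY_LANGUAGE.get(primary, DEFAULT_VOICE)


print(voice_for("es-MX"))
print(voice_for("en-US"))
```

Because both voices sit on the same Azure tier, the mapping changes the voice but not the per-minute voice cost.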

Luxury concierge — ElevenLabs custom clone. A boutique travel agency cloned its founder's voice via ElevenLabs and assigned it to its VIP-line assistant. Voice cost: 8¢/min. The 4× premium is justified by the brand experience for $5,000+ trips.

Telehealth — Cartesia for snap responsiveness. A virtual-clinic company picked Cartesia specifically for its first-byte latency. Patients notice the absence of the half-second silence between their question and the reply.

Picking a voice on an assistant

Build → Assistants → [your assistant] → Voice → pick from catalogue → Save.

Switching voices on a live assistant is safe — in-flight calls finish on the old voice; new calls use the new voice.

Custom voice cloning (ElevenLabs)

The cloning flow:

  1. Open Build → Library → Voices → Create custom voice.
  2. Give it a name and short description.
  3. Upload an audio sample. 16 kHz mono, no background music, at least 30 seconds. The cleaner the sample, the cleaner the clone.
  4. Save. ElevenLabs trains the clone (a few minutes).
  5. The new voice appears in your assistant's voice picker, tagged as custom.
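Before uploading, it can help to check a sample locally against the requirements in step 3. The following is a minimal sketch using Python's standard `wave` module, assuming the sample is an uncompressed WAV file; the function name and the messages are hypothetical. It checks sample rate, channel count, and duration only, and cannot detect background music.

```python
import wave

REQUIRED_RATE_HZ = 16_000   # "16 kHz" from step 3
REQUIRED_CHANNELS = 1       # "mono" from step 3
MIN_DURATION_S = 30.0       # "at least 30 seconds" from step 3


def check_sample(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file meets the stated spec."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    problems = []
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {REQUIRED_RATE_HZ} Hz")
    if channels != REQUIRED_CHANNELS:
        problems.append(f"{channels} channels found, expected mono")
    if duration < MIN_DURATION_S:
        problems.append(f"duration is {duration:.1f} s, expected at least {MIN_DURATION_S:.0f} s")
    return problems
```

A call such as `check_sample("/path/to/sample.wav")` returning an empty list suggests the file matches the format requirements; audio cleanliness still needs a listen.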

Custom voices require the custom_voice entitlement on your plan. If you don't see the Create custom voice button, you don't have the entitlement.

API equivalent:

curl -X POST https://api.insighto.ai/api/v1/voice/custom_voice \
-H "Authorization: Bearer $TOKEN" \
-F "name=Founder voice" \
-F "description=Cloned from Founder Smith's onboarding call" \
-F "audio=@/path/to/sample.wav"
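For scripts that cannot shell out to curl, the same multipart upload can be built with Python's standard library. This is a sketch under assumptions: the endpoint and the three field names come from the curl command above, while the function name, the `application/octet-stream` part type, and the hand-rolled multipart encoding are illustrative choices. The response format is not documented here, so the sketch stops at building the request.

```python
import urllib.request
import uuid

API_URL = "https://api.insighto.ai/api/v1/voice/custom_voice"  # from the curl example


def build_custom_voice_request(token, name, description, audio_bytes, filename="sample.wav"):
    """Build (but do not send) a multipart/form-data POST mirroring the curl command."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields: name and description. Values are not escaped in this sketch.
    for field, value in (("name", name), ("description", description)):
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{field}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # File field: the audio sample. The part's content type is an assumption.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# To send it, read the sample from disk and pass the request to urlopen:
# with open("/path/to/sample.wav", "rb") as f:
#     req = build_custom_voice_request(TOKEN, "Founder voice", "Cloned sample", f.read())
# urllib.request.urlopen(req)
```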

How voice choice affects cost

Each voice carries a multiplier shown in the picker. The voice line in your per-minute cost is 2¢ × multiplier. The other lines (STT, LLM, platform) don't change with the voice — so a switch from Azure (1×) to ElevenLabs (2.5×) adds 3¢ to every minute of every call.

If you've set up BYOK Credentials for ElevenLabs or Azure Speech, voice cost drops to 0¢ because Insighto calls your account directly. See BYOK Credentials.
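The multiplier rule and the BYOK exception can be worked through in a few lines. This is a sketch for estimating costs, not a billing implementation: the provider keys and function names are hypothetical, the 4× figure for custom clones is derived from the 8¢/min rate, and BYOK is applied only to ElevenLabs and Azure Speech because those are the two providers named above.

```python
BASE_VOICE_CENTS_PER_MIN = 2.0

# Multipliers per provider; keys are illustrative labels, not API identifiers.
MULTIPLIERS = {
    "azure": 1.0,
    "cartesia": 1.75,
    "elevenlabs": 2.5,
    "playht": 2.5,
    "custom_clone": 4.0,  # derived from 8 cents / 2 cents
}

# Providers for which the text says BYOK credentials zero out the voice line.
BYOK_PROVIDERS = {"azure", "elevenlabs"}


def voice_cents_per_min(provider: str, byok: bool = False) -> float:
    """Voice line of the per-minute cost: 2 cents x multiplier, or 0 with applicable BYOK."""
    if byok and provider in BYOK_PROVIDERS:
        return 0.0
    return BASE_VOICE_CENTS_PER_MIN * MULTIPLIERS[provider]


def switch_delta_cents(old: str, new: str, minutes: float) -> float:
    """Extra voice cost, in cents, of moving `minutes` of calls from one provider to another."""
    return (voice_cents_per_min(new) - voice_cents_per_min(old)) * minutes


print(voice_cents_per_min("elevenlabs"))             # 5.0
print(switch_delta_cents("azure", "elevenlabs", 1))  # 3.0
print(voice_cents_per_min("elevenlabs", byok=True))  # 0.0
```

At 1,000 call minutes a month, the Azure-to-ElevenLabs switch in this model adds 3,000¢, or $30, to the voice line alone.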

Quick recipes

  • Cheapest viable phone bot. Azure neural English — pick any en-US-*Neural voice. 2¢/min.
  • Premium concierge. ElevenLabs — for brand-defining voices or a custom clone.
  • Low-latency interactive. Cartesia — the fastest first-byte time in the picker.
  • Wide language coverage. Azure — by far the largest catalogue of locales.
  • Realtime sub-500ms latency. Realtime OpenAI with nova or shimmer — bundles STT + TTS + LLM at 12¢/min.

Where to next