Playground

The Playground is a sandbox conversation wired to your live assistant configuration. Anything you change on the assistant (prompt, model, data sources, tools) shows up in the Playground as soon as you save it, so you can iterate quickly. The Playground uses the same engine as production traffic, so what you see is what end users get.

Treat it as your primary tuning surface. Most teams run 20–50 Playground conversations before any change touches a real customer.

Where to find it

From the dashboard, go to Build → Playground. Or open any assistant and click the Playground tab on its detail page — that view is pre-scoped to the assistant.

The two panels

Left: the conversation

A regular chat surface. A few things to know:

The transcript uses the current saved version of the assistant. If you edited the prompt and didn't save, you're testing the old version.
Reload clears the conversation and starts a fresh session.
If you opened the Playground from a widget context, the first message you see is the widget's intro message.

Right: the retrieval panel

The retrieval panel shows, for every turn that pulled from a data source:

Which chunks the assistant retrieved.
The source document (filename or crawl URL).
The chunk text the assistant actually read.

Almost every "the assistant gave a wrong answer" problem reduces to one of three causes:

Wrong chunks pulled. The right information exists but a different chunk scored higher. Fix the data source.
No relevant chunks exist. Add a document.
Right chunks pulled, wrong answer. This is a prompt problem.

If you don't check the retrieval panel on every failed turn, you'll misdiagnose all three as "the model is bad."
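
The three-way triage above can be written down as a small decision rule. The sketch below is not part of the product: `FailedTurn`, its fields, and `diagnose` are hypothetical names for notes you would record by hand while reading the retrieval panel.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    NO_RELEVANT_CHUNKS = "no relevant chunks exist: add a document"
    WRONG_CHUNKS = "wrong chunks pulled: fix the data source"
    PROMPT = "right chunks pulled, wrong answer: fix the prompt"


@dataclass
class FailedTurn:
    # Hypothetical record, filled in by hand from the retrieval panel.
    relevant_chunk_retrieved: bool   # did any retrieved chunk contain the right answer?
    answer_exists_in_sources: bool   # does any document in your data sources contain it?


def diagnose(turn: FailedTurn) -> Cause:
    # Check the retrieval outcome first, then whether the content exists at all.
    if turn.relevant_chunk_retrieved:
        return Cause.PROMPT
    if turn.answer_exists_in_sources:
        return Cause.WRONG_CHUNKS
    return Cause.NO_RELEVANT_CHUNKS
```

For example, `diagnose(FailedTurn(relevant_chunk_retrieved=False, answer_exists_in_sources=True))` returns `Cause.WRONG_CHUNKS`, which points at the data source rather than the prompt or the model.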

Four business use cases

Dental clinic — tuning the booking flow. Bright Smile Dental's first Playground session revealed the assistant was offering Saturday appointments when the clinic is closed. The conversation panel made it obvious. They added "We are closed Saturday and Sunday" to the system prompt; the next 20 Playground turns showed clean weekend refusal.

Real estate — buyer/seller branching. A brokerage tested both branches of their Conversation Flow in the Playground before going live. They pretended to be a buyer ("I want to look at properties under 500k") and then a seller ("I want to list my house"), confirming each branch routed to the right sub-flow.

E-commerce — retrieval debugging. A meal-kit company saw their assistant claim "we don't ship to Hawaii" when they actually do. The retrieval panel showed the chunk retrieved was a 2022 policy doc, not the current 2026 one. They deleted the old data source and the next test answered correctly.

Fintech — hostile-user testing. A neo-bank's chat assistant was tested with abusive messages, attempts to extract internal logic, and rapid topic shifts. The Playground revealed the assistant agreed too readily with "that policy is stupid" type prompts. They added a constraint: "do not agree with criticism of company policies." Re-tested. Passed.

Testing patterns

The script test. Write a 5–10 turn user script before opening the Playground. Paste each line in order.
The pivot test. Get the assistant deep into one topic, then suddenly switch ("actually, can you help with billing instead?").
The hostile-user test. Send "you're useless", "give me a human", "I'm going to leave a bad review". Check that escalation rules fire.
The off-topic test. Ask something outside scope. Make sure the assistant politely declines.
The repetition test. Ask the same question two or three times in slightly different wording. If you get three different answers, your retrieval is unstable.
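
For the repetition test, one rough way to spot diverging answers is to paste the replies into a small script and compare them pairwise. This is a sketch under assumptions: it uses plain textual similarity from Python's standard `difflib`, the function names are made up for illustration, and the `0.6` threshold is an arbitrary starting point. It flags answers that read very differently; it cannot tell you whether two differently worded answers mean the same thing, so read the flagged turns yourself.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences are ignored.
    return " ".join(text.lower().split())


def min_pairwise_similarity(answers: list[str]) -> float:
    # Lowest similarity ratio (0.0 to 1.0) across every pair of answers.
    cleaned = [normalize(a) for a in answers]
    if len(cleaned) < 2:
        return 1.0
    return min(
        SequenceMatcher(None, a, b).ratio() for a, b in combinations(cleaned, 2)
    )


def looks_unstable(answers: list[str], threshold: float = 0.6) -> bool:
    # The threshold is an assumption; tune it against answers you have judged by hand.
    return min_pairwise_similarity(answers) < threshold
```

If `looks_unstable` returns `True` for the replies to one reworded question, open the retrieval panel for each of those turns and check whether different chunks were pulled.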

Testing voice

For Simple assistants with voice playback enabled, the Playground can render replies through the TTS voice you picked.

For Phone and Realtime OpenAI assistants, the Playground is not enough — telephony adds jitter and barge-in behaviour that the Playground doesn't simulate. Attach the assistant to a phone widget and dial in for a true voice test.

The iteration loop

Hypothesis. "I think the assistant is failing because the prompt doesn't say to refuse off-topic questions."
Edit. Make the change. One thing at a time.
Save. The Playground reads saved state.
Test. Run five messages exercising the change.
Inspect. For each failure, open the retrieval panel.
Repeat.

A worked retrieval example

User asks: Do you take Delta Dental insurance? Assistant replies: I'm not sure about your specific plan, but I'd recommend calling our office.

Three patterns you might see in the retrieval panel:

No chunks retrieved. Your data source has nothing about insurance. Add a document.
Three chunks retrieved, all about office hours. The retriever pulled the wrong content. Split the data source — one focused doc per topic.
Correct chunk retrieved with the explicit insurance list — and the assistant still hedged. This is a prompt problem. Look for an over-cautious line like "do not commit to insurance coverage without verification."

Doing this on every failed turn for a week is the single best skill-builder in the product.

Comparing two assistants side by side

The standalone Playground view lets you pick an assistant from a dropdown. A common workflow:

Clone your current assistant to a -test variant.
Make one change on the clone (new prompt, different model, swapped voice).
Open two browser tabs, one on each.
Run the same script through both. Pick the winner.
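
To make "pick the winner" less subjective, you can keep a pass/fail scorecard per script line. The sketch below assumes you record the results by hand after each run; the assistant names and the `pick_winner` helper are hypothetical, not product features.

```python
def pass_rate(results: list[bool]) -> float:
    # Share of script lines that the assistant handled correctly.
    return sum(results) / len(results) if results else 0.0


def pick_winner(scorecard: dict[str, list[bool]]) -> str:
    # Return the assistant with the highest pass rate, or report a tie.
    if not scorecard:
        raise ValueError("scorecard is empty")
    rates = {name: pass_rate(results) for name, results in scorecard.items()}
    best = max(rates.values())
    leaders = sorted(name for name, rate in rates.items() if rate == best)
    return leaders[0] if len(leaders) == 1 else "tie: " + ", ".join(leaders)


# One True/False per script line, in script order (illustrative values).
scorecard = {
    "support-bot": [True, True, False, True, False],
    "support-bot-test": [True, True, True, True, False],
}
print(pick_winner(scorecard))  # prints "support-bot-test"
```

A tie is a useful result too: it suggests the change on the clone made no measurable difference on this script, so either extend the script or drop the change.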

Latency: Playground vs. production

Playground latency is lower than production latency. There's no telephony jitter, no widget CDN hop, and history payloads start small. A 1.2-second Playground reply can take 1.8–2.5 seconds on a live channel.
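
As a rough planning aid, the example figures above imply a live-channel multiplier of about 1.5x to 2.1x. The sketch below simply scales a Playground measurement by those two ratios. Treat the ratios as an assumption drawn from that single example, not a measured guarantee; `estimate_live_latency` is a hypothetical helper.

```python
# Ratios implied by the example: 1.2 s in the Playground vs. 1.8-2.5 s live.
LOW_RATIO = 1.8 / 1.2   # 1.5
HIGH_RATIO = 2.5 / 1.2  # about 2.08


def estimate_live_latency(playground_seconds: float) -> tuple[float, float]:
    # Scale a Playground reply time to a rough (low, high) live-channel range.
    return (playground_seconds * LOW_RATIO, playground_seconds * HIGH_RATIO)


low, high = estimate_live_latency(1.2)
print(f"{low:.1f}-{high:.1f} s")  # prints "1.8-2.5 s"
```

Measure real traffic once you are live and replace the ratios with your own numbers.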

Going from Playground to production

Attach a widget. Open Build → Widgets, link your assistant to a chat embed, phone number, or WhatsApp connection.
Start small. Ship the embed on a low-traffic page first.
Watch the Conversations tab for the first 24–48 hours. Real users find failure modes you didn't script.
Keep iterating. The Playground stays your primary tuning surface for the life of the assistant.

Where to next

Writing a system prompt — most Playground iteration is prompt iteration.
Data sources overview — when retrieval is wrong, fix it here.
Conversations — watch real traffic after you go live.
