Data sources overview
A data source is the knowledge your assistant draws on to answer questions. Upload a PDF, crawl a website, paste a block of text, or connect a CRM, and the assistant can search that material on every turn and quote relevant passages back to your customer. This is what makes the assistant accurate about your business instead of just sounding generally smart.
Data sources are decoupled from assistants on purpose. One knowledge base can power multiple assistants — your sales SDR, your support bot, and your phone receptionist can all draw from the same set of product docs without duplicating anything.
How retrieval works at query time
Behind the scenes, every document is split into small passages, each turned into a numerical fingerprint (an embedding). When a customer asks a question, the assistant searches for the passages whose fingerprints best match the question and feeds those passages into the LLM alongside the prompt. This is what people mean when they say "RAG" — retrieval-augmented generation.
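As a rough illustration of the retrieval step described above, here is a minimal Python sketch. It is not the platform's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the function names are hypothetical, and the sample passages are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k passages whose vectors best match the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved passages next to the question for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these passages:\n{context}\n\nQuestion: {question}"


# Invented sample passages, for illustration only.
passages = [
    "Teeth whitening costs 250 dollars per session.",
    "We accept most major dental insurance providers.",
    "Cancellations require 24 hours notice.",
]
question = "How much does teeth whitening cost?"
top = retrieve(question, passages, top_k=1)
prompt = build_prompt(question, top)
```

With this sample data the whitening passage ranks first because it shares the most words with the question. A real embedding model matches on meaning rather than exact word overlap, but the search-then-prompt shape is the same.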
Source types
| Type | What it is | How it stays fresh |
|---|---|---|
| File upload | PDF, DOC, DOCX, or pasted text | Re-upload to refresh |
| Website crawl | Pages discovered from a starting URL | Re-run the crawl manually |
| CRM | HubSpot, Zoho, GoHighLevel, ModMed | Always live — looked up per conversation |
| Database | Postgres tool wired in as a data source | Always live — queried at runtime |
| Custom HTTP | Your own API exposed as a data source | Always live — called at runtime |
File uploads and website crawls are indexed (chunked and embedded at ingest, then searched quickly). CRM, database, and custom HTTP sources are live (no indexing; fetched on demand whenever the assistant needs them).
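The indexed versus live split in the table above can be written down as a small lookup. This is only a sketch: the string keys and the `needs_manual_refresh` helper are hypothetical names, not identifiers from the product.

```python
from enum import Enum


class Freshness(Enum):
    INDEXED = "indexed"  # chunked and embedded at ingest, then searched
    LIVE = "live"        # no indexing, fetched on demand


# Source types from the table; the keys are hypothetical identifiers.
SOURCE_FRESHNESS = {
    "file_upload": Freshness.INDEXED,
    "website_crawl": Freshness.INDEXED,
    "crm": Freshness.LIVE,
    "database": Freshness.LIVE,
    "custom_http": Freshness.LIVE,
}


def needs_manual_refresh(source_type: str) -> bool:
    """Indexed sources stay as ingested until re-uploaded or re-crawled."""
    return SOURCE_FRESHNESS[source_type] is Freshness.INDEXED
```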
Four business use cases
Dental practice — clinic FAQ. Bright Smile Dental uploads three PDFs: services and pricing, insurance providers, and clinic policies. Total: 12 pages. Indexing takes 90 seconds. Every patient question about a service, a price, or a policy now retrieves the right paragraph and the assistant quotes it.
Real estate brokerage: listings. A San Diego brokerage crawls its public listings site every Monday morning. Each crawl picks up new properties, and retired listings drop off on the next crawl. The phone assistant can describe any active listing without anyone manually maintaining a knowledge base.
E-commerce — help center + live order lookup. A meal-kit company crawls its help-center subdomain (60 articles, refreshed weekly) for general questions, and wires in the Postgres tool for live order status lookups. Customers asking "where is my order" get the real shipping status; customers asking "how do I pause delivery" get the help-center answer.
B2B SaaS — product docs + HubSpot context. An analytics startup uploads its product documentation as a single PDF, then connects HubSpot as a live data source. The sales assistant answers product questions from the docs and personalises every reply with the lead's HubSpot lifecycle stage and last touchpoint.
How content becomes searchable
For uploaded files and crawls, the pipeline is:
- Upload or crawl. The raw content lands in storage.
- Extract. Text is pulled from the PDF, DOC, or HTML pages.
- Chunk. The text is split into passages of a few hundred words.
- Embed. Each passage becomes a fingerprint vector.
- Index. Vectors are stored in the searchable index.
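The chunk step above can be sketched in a few lines of Python. The 300-word default is an assumption chosen to match "a few hundred words"; the actual passage size and splitting rules used by the platform are not stated here.

```python
def chunk_words(text: str, max_words: int = 300) -> list[str]:
    """Split text into consecutive passages of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

Production chunkers often split on sentence or heading boundaries instead of a fixed word count, so treat this as the simplest possible version of the idea.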
The status of each data source moves through: empty → initialized → in process → ready for use. If something fails (a scanned PDF with no extractable text, for example), the status changes to failed, and you can re-upload after fixing the source.
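If you poll or display these statuses in your own tooling, a small state table can help. The sketch below uses the status names from the text; which stages can fail, the retry edge from failed back to initialized, and the `advance` helper are assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    EMPTY = "empty"
    INITIALIZED = "initialized"
    IN_PROCESS = "in process"
    READY = "ready for use"
    FAILED = "failed"


# Forward path from the text, plus failure. The failure and retry edges
# are assumptions, not documented behaviour.
TRANSITIONS = {
    Status.EMPTY: {Status.INITIALIZED},
    Status.INITIALIZED: {Status.IN_PROCESS, Status.FAILED},
    Status.IN_PROCESS: {Status.READY, Status.FAILED},
    Status.READY: set(),
    Status.FAILED: {Status.INITIALIZED},  # re-upload after fixing the source
}


def advance(current: Status, nxt: Status) -> Status:
    """Move to nxt if the transition is allowed, otherwise raise."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value!r} to {nxt.value!r}")
    return nxt
```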
Live data sources (CRM, database, custom HTTP) skip this pipeline. They are fetched on demand during a conversation instead.
Linking a data source to an assistant
A data source does nothing until it is linked to an assistant. To attach one:
- Open the assistant under Build → Assistants.
- Go to the Data sources tab.
- Pick the source.
- Save.
One source can be linked to many assistants. They all share the same underlying index, so a single update propagates everywhere.
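One way to picture the shared index is as a single object that several assistants hold a reference to. The class and method names below are hypothetical, and a plain list of passages stands in for the real index.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    name: str
    passages: list[str] = field(default_factory=list)  # stands in for the shared index


@dataclass
class Assistant:
    name: str
    sources: list[DataSource] = field(default_factory=list)

    def link(self, source: DataSource) -> None:
        """Attach a source by reference; nothing is copied."""
        if not any(s is source for s in self.sources):
            self.sources.append(source)

    def searchable_passages(self) -> list[str]:
        """Everything this assistant can search across its linked sources."""
        return [p for s in self.sources for p in s.passages]


docs = DataSource("product docs", ["Reports can be exported as CSV."])
sales = Assistant("sales")
support = Assistant("support")
sales.link(docs)
support.link(docs)

# A single update to the source is visible to every linked assistant.
docs.passages.append("Dashboards refresh every hour.")
```

Because both assistants point at the same `DataSource` object, the appended passage shows up for each of them without any per-assistant copy.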
Words allowance
Every ingested document counts "words" against your plan allowance. A 100-page PDF is roughly 50,000 words; a 100-page crawl can be 200,000 words or more. Check the Usage page before you plan a big ingest.
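For a back-of-the-envelope estimate before a large ingest, the figures above work out to about 500 words per PDF page and 2,000 words per crawled page. The sketch below applies those rates; the function names and the allowance numbers in the usage lines are hypothetical, and real documents will vary.

```python
# Per-page rates implied by the figures above
# (50,000 and 200,000 words per 100 pages).
WORDS_PER_PDF_PAGE = 500
WORDS_PER_CRAWLED_PAGE = 2_000  # "or more", so treat this as a low estimate


def estimate_words(pdf_pages: int = 0, crawled_pages: int = 0) -> int:
    """Rough word count for a planned ingest."""
    return pdf_pages * WORDS_PER_PDF_PAGE + crawled_pages * WORDS_PER_CRAWLED_PAGE


def fits_allowance(estimated: int, used: int, allowance: int) -> bool:
    """True if the ingest would stay within the plan's word allowance."""
    return used + estimated <= allowance


# Hypothetical plan numbers, for illustration only.
planned = estimate_words(pdf_pages=12, crawled_pages=60)
ok = fits_allowance(planned, used=100_000, allowance=500_000)
```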
When RAG is the wrong answer
- Fast-moving data like inventory or current orders → use a tool, not a data source.
- Computations like sums, conversions, or comparisons across rows → use a tool.
- Highly structured data like a product catalogue with 12 attributes per SKU → use the Postgres tool.
- Writes like booking a meeting or creating a ticket → use a tool.
Data sources are read-only. The moment the assistant needs to take an action or read something that changed two seconds ago, the right answer is a tool.
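The rules of thumb above can be condensed into a small decision helper. This is only a sketch of the guidance in this section: the `Need` fields and the returned labels are hypothetical, not product settings.

```python
from dataclasses import dataclass


@dataclass
class Need:
    writes: bool = False             # books, creates, or updates something
    fast_moving: bool = False        # inventory, current orders
    computes: bool = False           # sums, conversions, comparisons across rows
    highly_structured: bool = False  # many attributes per record


def recommend(need: Need) -> str:
    """Map a requirement to the kind of integration this section suggests."""
    if need.writes or need.fast_moving or need.computes:
        return "tool"
    if need.highly_structured:
        return "postgres tool"
    return "data source"
```

For example, `recommend(Need(fast_moving=True))` returns `"tool"`, while a plain FAQ lookup with no flags set returns `"data source"`.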
Where to next
- Upload PDFs and documents — the simplest starting point.
- Crawl a website — for everything that already lives on your site.
- Connect a CRM — for per-contact context.