Data sources overview

A data source is the knowledge your assistant draws on to answer questions. Upload a PDF, crawl a website, paste a block of text, or connect a CRM, and the assistant can search that material on every turn and quote relevant passages back to your customer. This is what makes the assistant accurate about your business instead of just sounding generally smart.

Data sources are decoupled from assistants on purpose. One knowledge base can power multiple assistants — your sales SDR, your support bot, and your phone receptionist can all draw from the same set of product docs without duplicating anything.

How retrieval works at query time

Behind the scenes, every document is split into small passages, each turned into a numerical fingerprint (an embedding). When a customer asks a question, the assistant searches for the passages whose fingerprints best match the question and feeds those passages into the LLM alongside the prompt. This is what people mean when they say "RAG" — retrieval-augmented generation.
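
To make the matching step concrete, here is a minimal sketch of retrieval in Python. It is not the platform's implementation: the `embed` function is a toy word-count stand-in for a real embedding model, and the names `retrieve`, `build_prompt`, and the sample passages are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages whose fingerprints best match the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: list[str], k: int = 2) -> str:
    """Place the retrieved passages next to the question for the LLM."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return f"Answer using only these passages:\n{context}\n\nQuestion: {question}"

# Made-up passages, for illustration only.
SAMPLE_PASSAGES = [
    "Teeth whitening costs 250 dollars per session.",
    "We accept Delta Dental and Cigna insurance.",
    "The clinic is closed on Sundays.",
]
```

Calling `retrieve("How much does whitening cost?", SAMPLE_PASSAGES, k=1)` returns the whitening passage, because it is the only one that shares a word with the question. A real embedding model matches on meaning rather than exact words, which is why it also handles paraphrased questions.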

Source types

Type	What it is	How it stays fresh
File upload	PDF, DOC, DOCX, or pasted text	Re-upload to refresh
Website crawl	Pages discovered from a starting URL	Re-run the crawl manually
CRM	HubSpot, Zoho, GoHighLevel, ModMed	Always live — looked up per conversation
Database	Postgres tool wired in as a data source	Always live — queried at runtime
Custom HTTP	Your own API exposed as a data source	Always live — called at runtime

The first two are indexed (chunked and embedded once, then queried fast). The last three are live (no indexing — fetched on demand whenever the assistant needs them).

Four business use cases

Dental practice — clinic FAQ. Bright Smile Dental uploads three PDFs: services and pricing, insurance providers, and clinic policies. Total: 12 pages. Indexing takes 90 seconds. Every patient question about a service, a price, or a policy now retrieves the right paragraph and the assistant quotes it.

Real estate brokerage — listings. A San Diego brokerage crawls its public listings site every Monday morning. Each crawl picks up new properties, and retired listings drop off on the next crawl. The phone assistant can describe any active listing without anyone manually maintaining a knowledge base.

E-commerce — help center + live order lookup. A meal-kit company crawls its help-center subdomain (60 articles, refreshed weekly) for general questions, and wires in the Postgres tool for live order status lookups. Customers asking "where is my order" get the real shipping status; customers asking "how do I pause delivery" get the help-center answer.

B2B SaaS — product docs + HubSpot context. An analytics startup uploads its product documentation as a single PDF, then connects HubSpot as a live data source. The sales assistant answers product questions from the docs and personalises every reply with the lead's HubSpot lifecycle stage and last touchpoint.

How content becomes searchable

For uploaded files and crawls, the pipeline is:

Upload or crawl. The raw content lands in storage.
Extract. Text is pulled from the PDF, DOC, or HTML pages.
Chunk. The text is split into passages of a few hundred words.
Embed. Each passage becomes a fingerprint vector.
Index. Vectors are stored in the searchable index.
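
The Chunk step above can be sketched as a simple word-window splitter. This is a minimal illustration, not the platform's chunker: the function name `chunk_words`, the 300-word default, and the overlap between neighbouring passages are all assumptions (the text only says "a few hundred words").

```python
def chunk_words(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Split text into passages of at most max_words words.

    Neighbouring passages share `overlap` words so a sentence that straddles
    a boundary still appears whole in one of them. The overlap is an
    assumption of this sketch, not something the pipeline is known to do.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each returned passage would then go through the Embed and Index steps. Smaller passages tend to give more precise matches, while larger ones carry more surrounding context, so the window size is a trade-off rather than a fixed rule.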

You'll see each data source move through these statuses: empty → initialized → in process → ready for use. If something fails (a scanned PDF with no extractable text, for example), the status changes to failed, and you can re-upload once you've fixed the source.

Live data sources (CRM, database, custom HTTP) skip this pipeline. Nothing is indexed ahead of time; they're fetched on demand during each conversation instead.

Linking a data source to an assistant

A data source on its own does nothing. To attach one:

Open the assistant under Build → Assistants.
Go to the Data sources tab.
Pick the source.
Save.

One source can be linked to many assistants. They all share the same underlying index, so a single update propagates everywhere.
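
One way to picture the sharing is that each assistant holds a reference to the source rather than a copy of it. The sketch below is a toy model under that assumption; the `DataSource` and `Assistant` classes and their methods are hypothetical names, not the product's API.

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    """Hypothetical stand-in for an indexed knowledge base."""
    name: str
    passages: list[str] = field(default_factory=list)

@dataclass
class Assistant:
    """Hypothetical assistant that references sources instead of copying them."""
    name: str
    sources: list[DataSource] = field(default_factory=list)

    def link(self, source: DataSource) -> None:
        # Store a reference to the shared object; ignore repeat links.
        if not any(s is source for s in self.sources):
            self.sources.append(source)

    def searchable_passages(self) -> list[str]:
        # Read through to the shared sources every time.
        return [p for s in self.sources for p in s.passages]
```

If a sales assistant and a support assistant both call `link` with the same `DataSource`, a passage appended to that source afterwards shows up in `searchable_passages()` for both, with no second copy to keep in sync.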

Words allowance

Every ingested document burns "words" against your plan allowance. A 100-page PDF is roughly 50,000 words; a 100-page crawl can come to 200,000 words or more. Check the Usage page before you start a big ingest.
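
For a rough pre-flight estimate, the figures above work out to about 500 words per PDF page and about 2,000 words per crawled page. The sketch below just applies that arithmetic. The function names are hypothetical, the per-page rates are ballpark values derived from this page, and real documents will vary.

```python
# Derived from the figures above: 50,000 words / 100 PDF pages.
WORDS_PER_PDF_PAGE = 500
# Derived from 200,000 words / 100 crawled pages; the text says "or more",
# so treat this as a floor for crawls rather than a ceiling.
WORDS_PER_CRAWLED_PAGE = 2_000

def estimate_words(pdf_pages: int = 0, crawled_pages: int = 0) -> int:
    """Rough word count for a planned ingest."""
    if pdf_pages < 0 or crawled_pages < 0:
        raise ValueError("page counts cannot be negative")
    return pdf_pages * WORDS_PER_PDF_PAGE + crawled_pages * WORDS_PER_CRAWLED_PAGE

def fits_allowance(estimated_words: int, remaining_words: int) -> bool:
    """True if the planned ingest fits in what is left of the allowance."""
    return estimated_words <= remaining_words
```

By this estimate, the 12-page dental FAQ from the use cases comes to about 6,000 words, while a 60-article help-center crawl lands around 120,000. The remaining allowance you pass to `fits_allowance` is whatever the Usage page shows for your plan.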

When RAG is the wrong answer

Fast-moving data like inventory or current orders → use a tool, not a data source.
Computations like sums, conversions, or comparisons across rows → use a tool.
Highly structured data like a product catalogue with 12 attributes per SKU → use the Postgres tool.
Writes like booking a meeting or creating a ticket → use a tool.

Data sources are read-only. The moment the assistant needs to take an action or read something that changed two seconds ago, the right answer is a tool.
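
The rules above can be written down as a small decision helper. This is only a restatement of the list in code form: the function name, its parameters, and the returned labels are hypothetical, and real cases often need judgment that a few boolean flags don't capture.

```python
def choose_mechanism(
    *,
    writes: bool = False,
    fast_moving: bool = False,
    needs_computation: bool = False,
    highly_structured: bool = False,
) -> str:
    """Pick between a tool and a data source, following the rules above."""
    # Actions, fresh-to-the-second reads, and calculations all need a tool.
    if writes or fast_moving or needs_computation:
        return "tool"
    # Attribute-heavy records are better queried than searched as prose.
    if highly_structured:
        return "postgres tool"
    # Everything else (stable, read-only prose) suits a data source.
    return "data source"
```

For example, `choose_mechanism(fast_moving=True)` covers the "where is my order" case and returns `"tool"`, while a plain policy FAQ with no flags set returns `"data source"`.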

Where to next

Upload PDFs and documents — the simplest starting point.
Crawl a website — for everything that already lives on your site.
Connect a CRM — for per-contact context.
