Upload PDFs and documents

The fastest way to teach an assistant about your business is to upload a file. Drop in a PDF, a Word document, or pasted text, and for a short document the assistant can usually quote any paragraph in its answers within a minute or two. No engineering, no syncing, no manual chunking: the platform handles all of it.

Use this for content that doesn't change daily: product manuals, policies, pricing sheets, FAQs, training materials.

The upload flow

The whole pipeline usually finishes in 30 seconds to 2 minutes for documents under 50 pages. Larger files (200+ pages) can take 5–10 minutes.

Supported formats

Format          What it's good for
PDF             Manuals, policies, signed contracts, anything already produced as a PDF
DOC / DOCX      Word documents straight from your team
Text (paste)    Quick notes, scraped article text, copy-pasted FAQs
Text + image    Documents where some content is OCR'd image text
Image           Pure image data sources for visual lookups

PDFs are the most common. Scanned PDFs (images of text) don't work as PDF uploads; run OCR on them first and upload the resulting text.

Size and word limits

  • 25 MB per file. Split larger files into chapters before uploading.
  • Word allowance. Each upload counts against your plan's words_count allowance. A typical 30-page PDF is ~15,000 words.
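
Before uploading, it can help to check a file against these limits locally. The sketch below is one way to do that in the shell, under a few assumptions: the 25 MB limit is read as 25 × 1024 × 1024 bytes, words are counted as whitespace-separated tokens (the platform's own count may differ), and the 500 words-per-page figure is simply the ~15,000 words per 30 pages quoted above. The function names are illustrative and not part of the platform.

```shell
# Per-file upload limit: 25 MB, read here as MiB (an assumption).
MAX_UPLOAD_BYTES=$((25 * 1024 * 1024))
# Rough words per page, derived from ~15,000 words per 30-page PDF.
WORDS_PER_PAGE=500

# Print the size of a file in bytes.
file_bytes() { echo $(( $(wc -c < "$1") )); }

# Print the number of whitespace-separated words in a plain-text file.
file_words() { echo $(( $(wc -w < "$1") )); }

# Succeed (exit 0) if the file is within the per-file size limit.
fits_size_limit() { [ "$(file_bytes "$1")" -le "$MAX_UPLOAD_BYTES" ]; }

# Estimate the word count of a document from its page count.
estimate_words() { echo $(( $1 * WORDS_PER_PAGE )); }
```

For example, fits_size_limit policies.pdf || echo "split this file first" flags an oversized file, and estimate_words 40 prints 20000 for a 40-page guide. For a PDF, run file_words on the extracted text rather than on the PDF itself.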

Four business use cases

Dental practice — three PDFs, done. Bright Smile Dental uploads services-and-prices.pdf, insurance-providers.pdf, and clinic-policies.pdf. Total 12 pages, ~6,000 words, indexed in under two minutes. Every patient question about a service, a price, or a policy now retrieves the right paragraph.

Real estate — buyer's guide as a PDF. A brokerage uploads its 40-page Buyer's Guide. The assistant can answer "what does an inspection contingency mean" or "what's the typical closing timeline in California" without anyone manually transcribing the FAQ.

B2B SaaS — product docs as one file. An analytics startup exports its product docs to a single PDF and uploads it. Whenever docs change, they re-export and re-upload — about a 5-minute task. The assistant always reflects the latest version.

Restaurant chain — menu and allergen sheet. A 12-location pizza chain uploads its menu and allergen guide. The voice assistant answers "do you have gluten-free crust" and "what's in the spicy sausage" accurately because the answers come straight from the source documents.

Uploading a file

  1. Open Build → Data sources → Add data source.
  2. Pick File upload.
  3. Drag your file into the drop zone (or click to browse).
  4. Give the data source a name (e.g. clinic-policies-2026).
  5. Click Create.

You'll land on the data source detail page. The status indicator shows initialized → in process → ready for use (or failed if indexing cannot complete). You can attach the source to an assistant before indexing finishes; until then, retrieval returns only the chunks already indexed.
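
If you script uploads, you may want to wait for the status to settle before testing the assistant. The sketch below is a generic polling loop, not a platform feature: this page does not document how to read the status programmatically, so the loop takes a caller-supplied command (a placeholder you would write yourself) that prints the current status string. Only the status values "ready for use" and "failed" come from this page; POLL_INTERVAL and POLL_MAX_TRIES are hypothetical tuning variables.

```shell
# wait_for_ready CMD [ARGS...]
# Runs CMD repeatedly until it prints a terminal status.
# CMD is a caller-supplied placeholder that prints the current status.
# Returns 0 on "ready for use", 1 on "failed", 2 on timeout.
wait_for_ready() {
  local interval="${POLL_INTERVAL:-5}"     # seconds between checks
  local max_tries="${POLL_MAX_TRIES:-120}" # give up after this many checks
  local tries=0
  local current
  while [ "$tries" -lt "$max_tries" ]; do
    current=$("$@")
    case "$current" in
      "ready for use") echo "$current"; return 0 ;;
      "failed")        echo "$current"; return 1 ;;
    esac
    sleep "$interval"
    tries=$((tries + 1))
  done
  echo "timed out"
  return 2
}
```

A typical call would be wait_for_ready my_status_command && echo "indexed", where my_status_command is whatever you use to look up the status.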

API equivalent

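# Create the data source (the body is JSON, so declare the content type)
curl -X POST https://api.insighto.ai/api/v1/datasource \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "clinic-policies-2026", "ds_type": "pdf"}'

# Attach the file. Replace <id> with the data source id; the URL is quoted
# so the shell does not treat < and > as redirections.
curl -X POST "https://api.insighto.ai/api/v1/datasource/<id>/file" \
  -H "Authorization: Bearer $TOKEN" \
  -F "datasourcefile_file=@/path/to/policies.pdf" \
  -F "ds_type=pdf" \
  -F "name=policies.pdf"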

Updating a file

Uploaded files can't be edited in place. To replace a PDF with a newer version, delete the old file entry and re-upload. The vectors for the old version are removed automatically. If you keep the data source name stable, every assistant linked to it picks up the new content with no re-linking required.

Linking to an assistant

A data source on its own does nothing. Open the assistant under Build → Assistants → [your assistant] → Data sources and toggle the source on. One source can be linked to many assistants.

Common problems

"The assistant says it doesn't know, but the answer is right there in the PDF." Check the status. If it's still in process or failed, the content isn't indexed yet. If the status is ready for use but the chunk count is zero, the PDF is probably scanned — try OCR-ing it first and uploading as text.

"The PDF indexed but the assistant cites the wrong section." Chunks are produced with a fixed boundary policy. If a section header gets separated from its body, retrieval can pick the wrong passage. Splitting the PDF into smaller logical files (one per topic) usually fixes this.

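To see how a fixed boundary can separate a header from its body, the sketch below cuts text into chunks of a fixed number of words. It is an illustration only: the platform's actual chunk size and boundary rules are not documented on this page, and chunk_words is a hypothetical helper.

```shell
# chunk_words N
# Reads text on stdin and prints it as chunks of N words, one chunk per line.
chunk_words() {
  tr -s '[:space:]' '\n' | awk -v n="$1" '
    NF == 0 { next }   # skip an empty first line from leading whitespace
    { line = (count % n == 0) ? $0 : line " " $0; count++ }
    count % n == 0 { print line; line = "" }
    END { if (count % n != 0) print line }
  '
}
```

Running printf 'Hours Mon to Fri Refund policy Refunds take 30 days' | chunk_words 6 prints "Hours Mon to Fri Refund policy" as the first chunk and "Refunds take 30 days" as the second. The header "Refund policy" ends up attached to the opening-hours text while its body sits in the next chunk, which is the kind of mismatch that one file per topic avoids.
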
"Indexing failed." Most common cause: a PDF whose text can't be extracted. Re-create it as a text-based PDF, or convert via OCR and upload the OCR'd output as a text data source.

Where to next

  • Crawl a website — for content that already lives on your site.
  • Connect a CRM — for per-contact live context.
  • Playground — confirm the assistant actually retrieves your content correctly.