
Crawl a website

If your content already lives on your website — help center articles, blog posts, product pages, FAQ pages — a website crawl is the fastest way to teach the assistant about your business. Paste a starting URL, set how many pages to crawl, and within a few minutes the assistant can answer questions using the freshest version of your public content.

Crawls are powered by Apify under the hood. You don't need an Apify account; everything runs through Insighto.

The crawl flow

A crawl of 100 pages typically finishes in 5–15 minutes. Larger crawls (500+ pages) can take an hour or more.

What the crawler does

The crawler:

  • Starts from the URL you provide.
  • Walks up to max_crawl_pages pages — defaults to 100. You can raise this if you need more.
  • Honours include and exclude URL patterns so you can scope the crawl to /blog/** or skip /tag/**.
  • Starts from sitemap.xml instead of following links, if you turn the Use sitemap option on.
  • Respects robots.txt by default.
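To illustrate how include and exclude patterns scope a crawl, here is a minimal Python sketch. It assumes fnmatch-style globbing, where * matches any run of characters including slashes, and it assumes exclude patterns win over include patterns. The crawler's own matching rules may differ, and in_scope is a hypothetical helper, not part of Insighto.

```python
from fnmatch import fnmatchcase


def in_scope(url, include=(), exclude=()):
    """Return True if url passes the include/exclude filters.

    Assumption: an exclude match always wins, and an empty include
    list means "everything not excluded is in scope".
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    if not include:
        return True
    return any(fnmatchcase(url, pattern) for pattern in include)


urls = [
    "https://example.com/blog/launch-notes",
    "https://example.com/tag/news/",
    "https://example.com/pricing",
]

# Keep blog pages, drop tag listings.
kept = [u for u in urls if in_scope(u, include=["**/blog/**"], exclude=["**/tag/**"])]
print(kept)  # ['https://example.com/blog/launch-notes']
```

Running a handful of real URLs from your site through a check like this before you start a crawl is a cheap way to catch a pattern that is too narrow or too broad.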

Known limit: JavaScript-rendered sites don't work well. The crawler fetches HTML directly and doesn't run JavaScript. If your site is a single-page app (React, Vue, or Angular without server-side rendering), the crawler sees <div id="root"></div> and not much else. See "Working around JS-rendered sites" below for workarounds.
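One way to check whether a page is affected is to count the visible words in its raw HTML, which is roughly what a crawler that doesn't run JavaScript has to work with. The sketch below uses only the Python standard library. The 50-word threshold and the helper names are assumptions chosen for illustration, not values Insighto uses.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Counts words outside script, style, noscript and template tags."""

    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.word_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.word_count += len(data.split())


def visible_word_count(html):
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return parser.word_count


def looks_like_spa_shell(html, min_words=50):
    """Heuristic: very little visible text suggests a JavaScript-rendered page."""
    return visible_word_count(html) < min_words


shell = (
    '<html><head><title>App</title></head><body><div id="root"></div>'
    '<script>window.__DATA__ = {"a": 1};</script></body></html>'
)
print(looks_like_spa_shell(shell))  # True
```

Feed it the output of "view source" (not the browser's inspector, which shows the page after JavaScript has run).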

Setting up a crawl

  1. Open Build → Data sources → Add data source.
  2. Pick Website crawl.
  3. Paste the starting URL. The most specific URL that still covers everything you want is best — https://example.com/help is better than https://example.com.
  4. Set Max crawl pages (default 100).
  5. Optional: Add include patterns (e.g. **/help/**) to restrict the crawl.
  6. Optional: Add exclude patterns (e.g. **/tag/**, **/login) to skip noise.
  7. Optional: Toggle Use sitemap to start from sitemap.xml.
  8. Click Create.

You'll land on the data source detail page. The status progresses initialized → in process → ready for use.

Four business use cases

E-commerce — help center, 60 pages. A meal-kit company crawls help.example.com once a week. The crawl picks up new articles and refreshes existing ones. The chat widget on the customer dashboard now answers "how do I pause my subscription" with the freshest article content.

Real estate — active listings, 200 pages. A brokerage crawls its public listings site every Monday morning at 5 AM. New listings get indexed; retired listings drop on the next crawl. The voice assistant can describe any current property without anyone editing a knowledge base.

Restaurant chain — menu and locations, 30 pages. A chain with 12 locations crawls its main site. Menu changes propagate to every location's voice receptionist after a single re-crawl. No more "the website says something different from what the assistant says."

B2B SaaS — public docs, 400 pages. An analytics startup crawls docs.example.com with the include pattern **/docs/** and the page cap raised to 500. The crawl runs nightly. Developers asking the in-product assistant get answers from the same docs they'd read on the site.

Re-crawling

A data source's Re-crawl action re-runs with the same configuration. Edit the configuration first if you want a different scope. Each re-crawl counts against your words_count allowance just like the first crawl, so plan for it.
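To plan for that cost, a rough estimate is pages × average words per page × crawls per month. The sketch below applies it to two of the use cases above. The 600 words per page figure is an assumption for illustration, and monthly_word_usage is a hypothetical helper; check your own pages for a realistic average.

```python
def monthly_word_usage(pages, avg_words_per_page, crawls_per_month):
    """Words counted in a month if every crawl re-indexes every page."""
    return pages * avg_words_per_page * crawls_per_month


# 600 words per page is an assumed average, not a measured figure.
help_center = monthly_word_usage(pages=60, avg_words_per_page=600, crawls_per_month=4)
public_docs = monthly_word_usage(pages=400, avg_words_per_page=600, crawls_per_month=30)

print(help_center)  # 144000  (60-page help center, weekly)
print(public_docs)  # 7200000 (400-page docs site, nightly)
```

The gap between the two numbers shows why crawl frequency matters as much as page count.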

For regular refreshes (weekly, say), either run the re-crawl manually on a recurring calendar reminder, or automate it by calling the data source API from your own scheduler.
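If you drive re-crawls from your own scheduler, the scheduling part is ordinary date arithmetic. The sketch below computes the next weekly slot, Monday at 05:00 by default, as in the real estate example above. The call that actually triggers the re-crawl is deliberately left out, because this page doesn't document the data source API endpoint. next_weekly_run is a hypothetical helper.

```python
from datetime import datetime, timedelta


def next_weekly_run(now, weekday=0, hour=5):
    """Return the next slot strictly after `now` on `weekday` at `hour`:00.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed, so move to next week.
        candidate += timedelta(days=7)
    return candidate


# Wednesday 3 January 2024, midday: the next Monday 05:00 slot is 8 January.
print(next_weekly_run(datetime(2024, 1, 3, 12, 0)))  # 2024-01-08 05:00:00
```

In practice a cron entry or your platform's built-in scheduler does the same job; the point is only that the re-crawl trigger needs to come from outside.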

Watching progress

The data source detail page polls for the crawl status and updates automatically. You'll see:

  • Initialized — request accepted, crawl is being prepared.
  • In process — pages are being fetched and indexed.
  • Ready for use — every retrieved page is indexed.
  • Failed — usually means the crawler couldn't extract content (see below).
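If you track a crawl from a script rather than the detail page, the same status list maps onto a simple polling loop. This is a sketch under stated assumptions: fetch_status is a hypothetical callable you would supply, the status strings are the labels shown above (an API may return different values), and the two-hour timeout is an arbitrary default.

```python
import time

# Labels as shown on the detail page; values returned by an API may differ.
TERMINAL_STATUSES = {"ready for use", "failed"}


def wait_for_crawl(fetch_status, interval_s=30.0, timeout_s=2 * 60 * 60,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the crawl is ready or has failed.

    fetch_status is a hypothetical callable returning the current
    status as a string. Raises TimeoutError if no terminal status
    is seen within timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status().strip().lower()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"crawl still '{status}' after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a canned status sequence instead of a real API call.
canned = iter(["Initialized", "In process", "In process", "Ready for use"])
final = wait_for_crawl(lambda: next(canned), sleep=lambda _seconds: None)
print(final)  # ready for use
```

Passing sleep and clock as parameters keeps the loop testable without real waiting.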

Common gotchas

"The crawler returned 12 pages out of 500 — what happened?"

Almost always one of:

  • The site is a single-page app. The crawler doesn't run JavaScript. Open one of your pages, view source, and look for the actual content. If the body is mostly <div id="root"></div>, the crawler sees the same thing.
  • robots.txt blocks most paths. Check your site's robots.txt.
  • Max crawl pages is too low. Bump it (and be ready for the word cost).
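For the robots.txt case, Python's standard urllib.robotparser can tell you which of your URLs a rule set blocks. A minimal sketch: it parses robots.txt content you paste in, so nothing is fetched over the network. Checking against the "*" user agent is an assumption, since this page doesn't state which user agent the crawler sends, and allowed_urls is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt content permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots = """\
User-agent: *
Disallow: /account/
Disallow: /search
"""

candidates = [
    "https://example.com/help/pausing",
    "https://example.com/account/billing",
    "https://example.com/search/meal-kits",
]

print(allowed_urls(robots, candidates))  # ['https://example.com/help/pausing']
```

If most of the pages you care about drop out of the list, robots.txt is the likely reason for a short crawl.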

"Indexing finished but the assistant pulls garbage."

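The crawler keeps the page text as-is, so cookie banners, sidebars, and "related articles" blocks can dominate chunks. Exclude patterns work on URLs, so they can skip whole noisy pages (tag indexes, login pages) but not elements inside a page. If the noise sits inside pages you need, export them to clean text or markdown and upload that as a file instead.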

"The crawl is stuck at in process forever."

Large crawls can take an hour or two. If it's been longer than that, contact support and quote the data source name — they can read the crawl ID and check what's happening.

Working around JS-rendered sites

If your site is a single-page app:

  1. Add server-side rendering or static pre-rendering. Most SPA frameworks have this option.
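  2. Crawl the sitemap. Many SPAs still publish a sitemap of canonical URLs the crawler can follow. This helps the crawler find pages that are only linked through JavaScript, but each page still needs its content in the raw HTML.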
  3. Export to markdown or PDF. Run a tool like httrack or your CMS's export feature, then upload the output as a file data source.
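Before relying on the sitemap option, it can help to list what the sitemap actually contains. A minimal sketch, assuming a standard sitemaps.org urlset file; sitemap index files, which point to other sitemaps, are not handled. sitemap_urls is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> values from a urlset sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/help/pausing</loc></url>"
    "<url><loc>https://example.com/help/billing</loc></url>"
    "</urlset>"
)

print(sitemap_urls(sample))
# ['https://example.com/help/pausing', 'https://example.com/help/billing']
```

An empty or very short list means the sitemap won't give the crawler much to start from.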

Linking the crawl to an assistant

Same as any data source — open the assistant under Build → Assistants → [your assistant] → Data sources and toggle the crawl on. Chunks become queryable as they're indexed, so even a still-running crawl is partially usable.
