How Web Crawling Works

Before your bot can answer questions, it needs content. Crawling is the process of visiting your website, reading each page, and extracting the useful text. This page explains how ChatbotIQ’s crawler works and why it matters.


The challenge: not all websites are the same

A simple HTML blog is easy to crawl — just download the page and read the text. But modern websites are much more complex:

  • Content Management Systems (WordPress, Zendesk) add navigation, sidebars, related posts, and footer content alongside the actual article.
  • Single-Page Applications (React, Vue, Angular) render content with JavaScript. The initial HTML is often just a blank shell.
  • Wiki platforms (Confluence) load content via background API calls after the page structure renders.

Many chatbot tools only handle the simple case. ChatbotIQ is designed to handle all of these.


Standard mode reads your website’s sitemap.xml file to discover pages. This is fast and reliable for most websites:

  1. ChatbotIQ fetches your sitemap (or auto-discovers it).
  2. You see the full list of URLs within seconds, not minutes.
  3. You review the list and select which pages to include.
  4. Selected pages are crawled and indexed in a separate step.
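
To make the discovery step concrete, here is a minimal sketch of reading page URLs from a sitemap and filtering them before crawling. It is not ChatbotIQ's implementation: the function names, the example URLs, and the prefix-based selection are assumptions made for illustration, and only the standard sitemaps.org XML namespace is taken as given.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


def select_pages(urls: list[str], prefix: str) -> list[str]:
    """Hypothetical review step: keep only URLs under a given prefix."""
    return [url for url in urls if url.startswith(prefix)]


# Illustrative sitemap; a real one would be downloaded from the site.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/install</loc></url>
  <url><loc>https://example.com/docs/usage</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
</urlset>"""

all_urls = urls_from_sitemap(EXAMPLE_SITEMAP)
docs_only = select_pages(all_urls, "https://example.com/docs/")
```

Because discovery only parses one XML file, the URL list is available before any page is downloaded, which is why review and crawling can be separate steps.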

Best for: Any site with a complete sitemap — documentation sites, blogs, CMS platforms, most commercial websites.

Advanced mode discovers pages by visiting your starting URL and following links:

  1. The crawler visits the starting page with a real browser.
  2. It extracts the content and finds all links on the page.
  3. It follows each link to discover new pages, repeating the process.
  4. Content is captured during discovery (no separate crawl step).

Best for: Sites without a sitemap, SPAs where links are generated by JavaScript, or when the sitemap is missing pages.
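
The link-following loop above can be sketched as a breadth-first traversal. This is an illustration under stated assumptions, not the actual crawler: `fetch` is a hypothetical stand-in for loading a page in a browser and returning its rendered HTML, the `max_pages` limit is invented, and restricting discovery to the starting host is an assumed policy.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def discover(start_url, fetch, max_pages=100):
    """Breadth-first discovery that captures content as it goes.

    fetch(url) is a hypothetical callable returning rendered HTML,
    or None when the page cannot be loaded.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    captured = {}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html  # content is stored during discovery
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return captured


# A tiny in-memory "site" standing in for real browser page loads.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a><a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/">Home</a><a href="https://other.example/x">Ext</a>',
    "https://example.com/b": '<a href="/a">A</a>',
}

pages = discover("https://example.com/", FAKE_SITE.get)
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, as most site navigation does.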


Once a page is reached, ChatbotIQ extracts the text content. By default, pages are loaded in a full browser to handle JavaScript-rendered content. Several settings let you control extraction — stripping navigation clutter, targeting specific page sections, or skipping the browser entirely for simple HTML sites.

See Crawling Settings Reference for all options, or Configure Crawling for Your Site for ready-to-use recipes by site type.
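
As a rough picture of what stripping navigation clutter means, the sketch below keeps the text of a page while ignoring everything inside common non-content tags. The tag list and class name are assumptions chosen for illustration; ChatbotIQ's actual extraction rules are described in the settings reference, not here.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as clutter; a real crawler may use other rules.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


EXAMPLE_PAGE = (
    "<nav><a href='/'>Home</a></nav>"
    "<main><h1>Reset your password</h1><p>Open Settings.</p></main>"
    "<footer>Copyright Example Inc.</footer>"
)

article_text = extract_text(EXAMPLE_PAGE)
```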


ChatbotIQ does not index everything it downloads. Built-in quality checks help ensure your knowledge base contains meaningful content:

  • Significance check — pages with very little real content are flagged for your review.
  • Degradation protection — when re-crawling, ChatbotIQ detects if content has unexpectedly shrunk (e.g., due to a site issue) and blocks the update to prevent data loss.
  • Auto-tuning — on the first crawl, ChatbotIQ can automatically test different approaches and pick the one that extracts the best content from your site.
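
The three checks above can be pictured as simple comparisons. The sketch below is only an illustration: the thresholds (`MIN_WORDS`, `MAX_SHRINK_RATIO`), the function names, and the "most words wins" rule for auto-tuning are all assumptions, since the real criteria are not documented on this page.

```python
MIN_WORDS = 50          # hypothetical threshold for "very little content"
MAX_SHRINK_RATIO = 0.5  # hypothetical: block if new text is under half the old


def is_significant(text: str, min_words: int = MIN_WORDS) -> bool:
    """Significance check: does the page have enough words to be useful?"""
    return len(text.split()) >= min_words


def should_block_update(old_text: str, new_text: str,
                        max_shrink_ratio: float = MAX_SHRINK_RATIO) -> bool:
    """Degradation protection: block a re-crawl that shrank unexpectedly."""
    if not old_text:
        return False  # nothing stored yet, so nothing can be lost
    return len(new_text) < len(old_text) * max_shrink_ratio


def pick_best_strategy(results: dict[str, str]) -> str:
    """Auto-tuning stand-in: pick the approach that extracted the most words."""
    return max(results, key=lambda name: len(results[name].split()))


best = pick_best_strategy({
    "plain_html": "Loading",
    "full_browser": "How to reset your password in three steps",
})
```

Blocking the update instead of overwriting means a temporary site outage leaves the previously indexed content in place.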

Many modern websites (React apps, Confluence, etc.) don’t include their content in the initial page — the content loads dynamically after the page appears. Basic crawlers miss this entirely.

ChatbotIQ handles these sites by loading pages in a real browser, waiting for the content to fully appear, and then extracting it. For slower sites, the Content Load Delay setting gives pages extra time to finish loading.
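
One way to think about waiting for content to fully appear is to read the page text repeatedly until it stops changing. The sketch below shows that idea only. It is not how ChatbotIQ decides a page is ready: `read_text` is a hypothetical callable returning the page's current visible text, and `delay_seconds` merely plays a role similar to the Content Load Delay setting.

```python
import time


def wait_for_stable_content(read_text, delay_seconds=0.5, max_checks=10,
                            sleep=time.sleep):
    """Poll read_text() until two consecutive non-empty reads match.

    Returns the stable text, or the last text seen if max_checks runs out.
    """
    previous = read_text()
    for _ in range(max_checks):
        sleep(delay_seconds)
        current = read_text()
        if current and current == previous:
            return current
        previous = current
    return previous


# Simulated page that fills in over time; a real reader would query a browser.
snapshots = iter(["", "Loading", "Loading full article", "Loading full article"])
final_text = wait_for_stable_content(
    lambda: next(snapshots),
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
```

A longer delay between reads gives slow pages more time, at the cost of a slower crawl.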


After content is extracted, ChatbotIQ automatically processes it — cleaning up the HTML, splitting it into searchable pieces, and making it available for your bot to use. This happens in the background and typically takes a few minutes.
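
Splitting content into searchable pieces is often done with overlapping word windows. The sketch below shows that general technique under stated assumptions: the chunk size, the overlap, and the function name are invented for illustration, and ChatbotIQ's actual splitting method may differ.

```python
def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows that share `overlap` words with the next."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
pieces = split_into_chunks(sample, max_words=4, overlap=1)
```

The overlap keeps a sentence that straddles a boundary partly visible in both neighboring pieces, so a search can still match it.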