How Web Crawling Works

Before your bot can answer questions, it needs content. Crawling is the process of visiting your website, reading each page, and extracting the useful text. This page explains how ChatbotIQ’s crawler works and why it matters.

The challenge: not all websites are the same

A simple HTML blog is easy to crawl: just download the page and read the text. But modern websites are much more complex:

Content Management Systems (WordPress, Zendesk) add navigation, sidebars, related posts, and footer content alongside the actual article.
Single-Page Applications (React, Vue, Angular) render content with JavaScript. The initial HTML is often just a blank shell.
Wiki platforms (Confluence) load content via background API calls after the page structure renders.

Many chatbot tools only handle the simple case. ChatbotIQ is designed to handle all of these.

Two modes of discovery

In the Add Source wizard, these are labeled Sitemap Discovery and Link Crawling. In Manual configuration mode, they appear under “Discovery Mode”.

Sitemap Discovery: sitemap-based

Sitemap Discovery reads your website’s sitemap.xml file to discover pages. This is fast and reliable for most websites:

ChatbotIQ fetches your sitemap (or auto-discovers it).
You see a list of all URLs within seconds, not minutes.
You review the list and select which pages to include.
Selected pages are crawled and indexed in a separate step.

Best for: Any site with a complete sitemap, such as documentation sites, blogs, CMS platforms, and most commercial websites.
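
To make the sitemap step concrete, here is a minimal sketch of how URLs can be read from a standard sitemap.xml file. It is an illustration only, not ChatbotIQ's actual code: the function name parse_sitemap is hypothetical, and the sketch assumes the sitemap follows the public sitemaps.org format.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (page_urls, child_sitemap_urls) found in one sitemap document.

    Hypothetical helper for illustration. A <urlset> lists pages directly;
    a <sitemapindex> lists further sitemap files that must be fetched next.
    """
    root = ET.fromstring(xml_text)
    pages: list[str] = []
    children: list[str] = []

    if root.tag == SITEMAP_NS + "sitemapindex":
        for loc in root.iter(SITEMAP_NS + "loc"):
            if loc.text:
                children.append(loc.text.strip())
    elif root.tag == SITEMAP_NS + "urlset":
        for url in root.findall(SITEMAP_NS + "url"):
            loc = url.find(SITEMAP_NS + "loc")
            if loc is not None and loc.text:
                pages.append(loc.text.strip())

    return pages, children


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/docs/start</loc></url>"
    "<url><loc>https://example.com/docs/faq</loc></url>"
    "</urlset>"
)
page_urls, child_sitemaps = parse_sitemap(sample)
print(page_urls)
```

Because this step only reads one small XML file per sitemap, the full URL list is available before any page is downloaded, which is why review can happen ahead of the crawl.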

Link Crawling: link-following

Link Crawling discovers pages by visiting your starting URL and following links:

The crawler visits the starting page with a real browser.
It extracts the content and finds all links on the page.
It follows each link to discover new pages, repeating the process.
Content is captured during discovery (no separate crawl step).

Best for: Sites without a sitemap, SPAs where links are generated by JavaScript, or when the sitemap is missing pages.
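
The link-following loop described above can be sketched roughly as a breadth-first search. This is a simplified illustration under stated assumptions, not ChatbotIQ's implementation: the names crawl and LinkExtractor are hypothetical, the fetch argument stands in for "load the page in a real browser and return the rendered HTML", and the same-host rule and max_pages limit are assumptions added to keep the example bounded.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(
    start_url: str,
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 50,
) -> dict[str, str]:
    """Breadth-first crawl from start_url, staying on the same host.

    fetch(url) returns rendered HTML, or None if the page could not be
    loaded. Content is stored as each page is visited, so discovery and
    capture happen in one pass.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages: dict[str, str] = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return pages
```

In a test, fetch can simply be the get method of a dictionary that maps URLs to HTML strings, which makes the traversal order easy to check without touching the network.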

How content extraction works

Once a page is reached, ChatbotIQ extracts the text content. By default, pages are loaded in a full browser to handle JavaScript-rendered content. Several settings let you control extraction: stripping navigation clutter with Reader Mode, targeting specific page sections with the Content Area Selector, or skipping the browser entirely with Simple Mode for simple HTML sites.

See Crawling Settings Reference for all options, or Configure Crawling for Your Site for ready-to-use recipes by site type.
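
As a rough illustration of what "stripping navigation clutter" means, the sketch below keeps visible text while skipping common non-article regions. It is not how Reader Mode is implemented, which this page does not describe. The class name MainTextExtractor, the function extract_text, and the list of skipped tags are all assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Assumed set of tags that usually hold clutter rather than article text.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page text with navigation-style regions removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<nav>Home | Docs | Pricing</nav>"
    "<main><h1>Reset your password</h1>"
    "<p>Open Settings and choose Security.</p></main>"
    "<footer>Contact us</footer>"
)
print(extract_text(page))
```

A tag-based filter like this only works when a site uses semantic HTML. Sites that build their layout from generic div elements are the case where targeting a specific page section, as the Content Area Selector does, becomes more useful.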

Quality verification

ChatbotIQ doesn’t just blindly index whatever it downloads. Built-in quality checks ensure your knowledge base contains meaningful content:

Significance check - pages with very little real content are flagged for your review.
Degradation protection - when re-crawling, ChatbotIQ detects if content has unexpectedly shrunk (e.g., due to a site issue) and blocks the update to prevent data loss.
Auto-tuning - on the first crawl, ChatbotIQ can automatically test different approaches and pick the one that extracts the best content from your site.
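
The first two checks can be pictured as simple comparisons. The sketch below is illustrative only: the function names and the thresholds (50 words, 50% shrinkage) are assumptions made for the example, not ChatbotIQ's actual values or logic.

```python
def is_significant(text: str, min_words: int = 50) -> bool:
    """Significance check: does the page contain enough real text?

    min_words is a hypothetical threshold for illustration.
    """
    return len(text.split()) >= min_words


def should_block_update(old_text: str, new_text: str, max_shrink: float = 0.5) -> bool:
    """Degradation protection: block a re-crawl result that shrank too much.

    Returns True when new_text has lost more than max_shrink (a fraction
    between 0 and 1) of the previous length. max_shrink is an assumed value.
    """
    if not old_text:
        return False  # nothing indexed yet, so nothing to protect
    return len(new_text) < len(old_text) * (1 - max_shrink)


previous = "word " * 400            # content from the last successful crawl
latest = "Service unavailable"      # what a broken page might return today
print(is_significant(latest))               # False: flag for review
print(should_block_update(previous, latest))  # True: keep the old content
```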

Why JavaScript-heavy sites are hard

Many modern websites (React apps, Confluence, etc.) don’t include their content in the initial page. Instead, the content loads dynamically after the page appears, so basic crawlers miss it entirely.

ChatbotIQ handles these sites by loading pages in a real browser, waiting for the content to fully appear, and then extracting it. For slower sites, the Wait Time for Page Content (seconds) setting gives pages extra time to finish loading.
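
One common way to decide that content has "fully appeared" is to poll the page until its text stops changing or a time limit runs out. The sketch below shows that idea in isolation. It is an assumption about the general technique, not a description of ChatbotIQ's internals: read_text is a hypothetical callback that returns the page's current visible text, and timeout plays a role similar to a wait-time setting.

```python
import time
from typing import Callable


def wait_for_stable_content(
    read_text: Callable[[], str],
    timeout: float = 10.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll read_text() until two consecutive reads match, or time runs out.

    Returns the last text seen. An empty page is never treated as stable,
    because a blank shell is exactly what a JavaScript site shows first.
    """
    deadline = time.monotonic() + timeout
    previous = read_text()
    while time.monotonic() < deadline:
        sleep(interval)
        current = read_text()
        if current and current == previous:
            return current
        previous = current
    return previous
```

The sleep parameter exists so that tests can pass a no-op function and run instantly. A slower site simply needs a larger timeout before the crawler gives up and takes whatever text is there.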

What happens after crawling

After content is extracted, ChatbotIQ automatically processes it: cleaning up the HTML, splitting it into searchable pieces, and making it available for your bot to use. This happens in the background and typically takes a few minutes.
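
To give a feel for "splitting it into searchable pieces", here is a minimal sketch of one common approach: fixed-size word windows with a small overlap, so a sentence cut at a boundary still appears whole in one piece. The function name split_into_chunks and the sizes used are assumptions for illustration. This page does not specify how ChatbotIQ actually splits content.

```python
def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows of up to max_words, sharing overlap words.

    Both sizes are hypothetical defaults chosen for the example.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


numbers = " ".join(str(i) for i in range(10))
print(split_into_chunks(numbers, max_words=4, overlap=1))
```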

Configure Crawling for Your Site - practical recipes
Crawling Settings Reference - every setting explained
How RAG Works - what happens after content is indexed