How Web Crawling Works
Before your bot can answer questions, it needs content. Crawling is the process of visiting your website, reading each page, and extracting the useful text. This page explains how ChatbotIQ’s crawler works and why it matters.
The challenge: not all websites are the same
A simple HTML blog is easy to crawl: just download the page and read the text. But modern websites are much more complex:
- Content Management Systems (WordPress, Zendesk) add navigation, sidebars, related posts, and footer content alongside the actual article.
- Single-Page Applications (React, Vue, Angular) render content with JavaScript. The initial HTML is often just a blank shell.
- Wiki platforms (Confluence) load content via background API calls after the page structure renders.
Many chatbot tools only handle the simple case. ChatbotIQ is designed to handle all of these.
Two modes of discovery
Standard mode: sitemap-based
Standard mode reads your website’s sitemap.xml file to discover pages. This is fast and reliable for most websites:
- ChatbotIQ fetches your sitemap (or auto-discovers it).
- You see a list of all URLs within seconds, not minutes.
- You review the list and select which pages to include.
- Selected pages are crawled and indexed in a separate step.
Best for: Any site with a complete sitemap — documentation sites, blogs, CMS platforms, most commercial websites.
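To make the discovery step concrete, here is a minimal sketch of reading page URLs out of a sitemap and then selecting a subset. It is an illustration only, not ChatbotIQ's implementation: the function name `urls_from_sitemap`, the sample sitemap, and the `/docs/` filter are all assumptions made for the example, and it handles a plain `urlset` file rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the page URLs listed in a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, inlined so the example runs offline.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
  <url><loc>https://example.com/docs/billing</loc></url>
</urlset>"""

all_urls = urls_from_sitemap(SAMPLE_SITEMAP)

# "Review and select": keep only the documentation pages.
selected = [url for url in all_urls if "/docs/" in url]
```

Because discovery only parses one XML file, listing URLs is cheap; the slower work of loading each selected page happens afterwards.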
Advanced mode: link-following
Advanced mode discovers pages by visiting your starting URL and following links:
- The crawler visits the starting page with a real browser.
- It extracts the content and finds all links on the page.
- It follows each link to discover new pages, repeating the process.
- Content is captured during discovery (no separate crawl step).
Best for: Sites without a sitemap, SPAs where links are generated by JavaScript, or when the sitemap is missing pages.
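The link-following loop above can be sketched as a breadth-first traversal. This is a simplified illustration under stated assumptions, not ChatbotIQ's code: `discover`, `LinkParser`, and the `fetch` callable are hypothetical names, `fetch` stands in for loading a page in a real browser, and the in-memory `SITE` dictionary replaces a live website so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def discover(start_url, fetch, max_pages=100):
    """Follow same-host links breadth-first, capturing HTML as pages are found.

    fetch(url) is assumed to return the page HTML, or None if it is unavailable.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # content is captured during discovery
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site used in place of real HTTP requests.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/">Home</a> <a href="https://other.test/x">X</a>',
    "https://example.com/b": '<a href="/a">A</a>',
}

pages = discover("https://example.com/", SITE.get)
```

The `seen` set keeps the crawler from revisiting pages that link to each other, and the host check keeps it from wandering onto other websites.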
How content extraction works
Once a page is reached, ChatbotIQ extracts the text content. By default, pages are loaded in a full browser to handle JavaScript-rendered content. Several settings let you control extraction: stripping navigation clutter, targeting specific page sections, or skipping the browser entirely for simple HTML sites.
See Crawling Settings Reference for all options, or Configure Crawling for Your Site for ready-to-use recipes by site type.
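As a rough picture of what "stripping navigation clutter" means, the sketch below pulls text out of a page while ignoring everything inside navigation, footer, and similar elements. It is a minimal example using Python's standard library, not ChatbotIQ's extractor; the `SKIP_TAGS` set and the sample page are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Elements treated as clutter in this sketch (an assumption, not a product setting).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def extract_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


PAGE = """
<html><body>
<nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>
<article><h1>Reset your password</h1><p>Open Settings, then choose Security.</p></article>
<footer>Copyright Example Inc.</footer>
</body></html>
"""

article_text = extract_text(PAGE)
```

Only the article text survives; the menu links and the footer never reach the knowledge base.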
Quality verification
ChatbotIQ doesn’t just blindly index whatever it downloads. Built-in quality checks ensure your knowledge base contains meaningful content:
- Significance check — pages with very little real content are flagged for your review.
- Degradation protection — when re-crawling, ChatbotIQ detects if content has unexpectedly shrunk (e.g., due to a site issue) and blocks the update to prevent data loss.
- Auto-tuning — on the first crawl, ChatbotIQ can automatically test different approaches and pick the one that extracts the best content from your site.
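The first two checks can be pictured with a short sketch. The thresholds and function names here (`MIN_WORDS`, `MAX_SHRINK`, `is_significant`, `should_block_update`) are hypothetical values picked for illustration; the actual rules ChatbotIQ applies are not described on this page.

```python
MIN_WORDS = 50    # hypothetical: fewer words than this counts as "very little content"
MAX_SHRINK = 0.5  # hypothetical: block updates that lose more than half the text


def is_significant(text, min_words=MIN_WORDS):
    """Significance check: does the page contain enough real text?"""
    return len(text.split()) >= min_words


def should_block_update(old_text, new_text, max_shrink=MAX_SHRINK):
    """Degradation protection: has the content shrunk more than allowed?"""
    if not old_text:
        return False  # nothing indexed yet, so nothing to protect
    return len(new_text) < len(old_text) * (1 - max_shrink)


# A near-empty page would be flagged for review.
flag_for_review = not is_significant("Loading...")

# A re-crawl that returns a tenth of the previous text would be blocked.
blocked = should_block_update("x" * 1000, "x" * 100)

# A small edit that trims the page slightly would go through.
allowed = not should_block_update("x" * 1000, "x" * 900)
```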
Why JavaScript-heavy sites are hard
Many modern websites (React apps, Confluence, etc.) don’t include their content in the initial page; the content loads dynamically after the page appears. Basic crawlers miss this entirely.
ChatbotIQ handles these sites by loading pages in a real browser, waiting for the content to fully appear, and then extracting it. For slower sites, the Content Load Delay setting gives pages extra time to finish loading.
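One way to picture "waiting for the content to fully appear" is to poll the rendered text until it stops changing. The sketch below illustrates that general idea only and is not ChatbotIQ's implementation: `wait_for_stable_text` is a hypothetical helper, `read_text` stands in for asking a real browser for the page's current text, and the fake page at the bottom simulates content that arrives late.

```python
import time


def wait_for_stable_text(read_text, poll_interval=0.5, stable_polls=2,
                         timeout=10.0, sleep=time.sleep):
    """Poll read_text() until it returns the same non-empty text
    stable_polls times in a row, or until timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    last = read_text()
    stable = 0
    while stable < stable_polls and time.monotonic() < deadline:
        sleep(poll_interval)
        current = read_text()
        if current and current == last:
            stable += 1
        else:
            stable = 0  # still empty or still changing, so start counting again
        last = current
    return last


# Simulated page: empty at first, then the article appears and stays put.
snapshots = ["", "", "Full article text"]


def fake_read():
    return snapshots.pop(0) if len(snapshots) > 1 else snapshots[0]


# sleep is replaced with a no-op so the demonstration finishes immediately.
rendered = wait_for_stable_text(fake_read, sleep=lambda _seconds: None)
```

A fixed extra delay, like the Content Load Delay setting, is the simpler alternative: instead of watching for the text to settle, the crawler just gives the page more time before reading it.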
What happens after crawling
After content is extracted, ChatbotIQ automatically processes it: cleaning up the HTML, splitting it into searchable pieces, and making it available for your bot to use. This happens in the background and typically takes a few minutes.
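"Splitting it into searchable pieces" can be illustrated with a simple word-based chunker. This is a sketch under assumptions, not a description of ChatbotIQ's pipeline: the function name, the chunk size, and the use of overlapping chunks are all choices made for the example.

```python
def split_into_chunks(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny sizes so the result is easy to inspect.
sample = " ".join(str(n) for n in range(10))
chunks = split_into_chunks(sample, max_words=4, overlap=1)
```

With these settings the ten-word sample becomes three chunks, each starting with the last word of the previous one.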
Related
- Configure Crawling for Your Site: practical recipes
- Crawling Settings Reference: every setting explained
- How RAG Works: what happens after content is indexed