HTML Settings

HTML Settings control how the Knowledge Base (KB) Engine crawls and extracts content from website data sources. Use them to exclude non-content page elements, skip unwanted links, and run scripts on dynamic pages before extraction.

These settings apply to website data sources. Configure defaults at the KB level on this page, or override them per data source when a site needs different selectors, link rules, or scripts.

HTML Selector Tags to Ignore

Specify CSS selectors for HTML elements to exclude during extraction. Matching elements are not indexed, which keeps navigation, headers, footers, modals, and other structural UI out of the KB.

Enter one selector per line.

Examples

img
svg
script
style
noscript
iframe
nav
footer
header
[role="alert"]
[role="banner"]
[role="heading"]
[role="dialog"]
.hidden
.exclusive
div.sidemenu
div.topbar
div.search-content
div.navigations-container
[role="alertdialog"]
[class *= "breadcrumb"]
[role="region"][aria-label*="skip" i]
[aria-modal="true"]
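
Before saving a selector list, it can help to check how many elements each selector matches on a representative page. The following is a minimal sketch for the browser DevTools console and is not part of the KB Engine; the function name `previewIgnoredSelectors` and its report format are hypothetical. It assumes the field content is split on line breaks, one selector per line.

```javascript
// Hypothetical helper (not a KB Engine API): reports how many elements each
// selector matches. In the browser console, `root` defaults to the current page.
function previewIgnoredSelectors(selectorText, root = document) {
  const report = [];
  for (const line of selectorText.split(/\r?\n/)) {
    const selector = line.trim();
    if (!selector) continue; // skip blank lines
    try {
      report.push({ selector, matches: root.querySelectorAll(selector).length });
    } catch (err) {
      // querySelectorAll throws a SyntaxError for an invalid selector
      report.push({ selector, matches: 0, error: err.message });
    }
  }
  return report;
}
```

A selector that matches far more elements than expected, such as a broad class substring match, may be removing real content and is worth narrowing.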

Link Patterns to Ignore

Define regular expression (regex) patterns for URLs that the crawler should neither follow nor index. Use them to skip social profiles, media embeds, phone links, and mailto links, which add noise without contributing useful knowledge.

Enter one pattern per line.

Examples

^\s*https?:\/\/(?:[a-zA-Z0-9-]+\.)*facebook\.com
^\s*https?:\/\/(?:[a-zA-Z0-9-]+\.)*instagram\.com
^\s*https?:\/\/(?:[a-zA-Z0-9-]+\.)*whatsapp\.com
^\s*https?:\/\/(?:[a-zA-Z0-9-]+\.)*twitter\.com
^\s*https?:\/\/(?:[a-zA-Z0-9-]+\.)*linkedin\.com
^\s*https?:\/\/(?:[a-zA-Z0-9-]+\.)*youtube\.com
^\s*https?:\/\/(?:[a-zA-Z0-9-]+\.)*x\.com
^\s*tel:
^\s*mailto:
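
To try patterns against sample URLs before saving them, a small check like the one below may help. It is a sketch under assumptions: it treats each non-empty line as a JavaScript regex that is searched (not full-matched) against the link URL, and the KB Engine's actual regex flavor and matching mode are not stated in this page. The name `shouldIgnoreLink` is hypothetical.

```javascript
// Hypothetical helper (not a KB Engine API): returns true when any pattern
// line matches the URL. Assumes search semantics, like RegExp.prototype.test.
function shouldIgnoreLink(url, patternText) {
  return patternText
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(Boolean) // drop blank lines
    .some(pattern => new RegExp(pattern).test(url));
}

// String.raw keeps the backslashes exactly as typed in the settings field.
const patterns = String.raw`
^\s*https?:\/\/(?:[a-zA-Z0-9-]+\.)*facebook\.com
^\s*tel:
^\s*mailto:
`;

console.log(shouldIgnoreLink('https://www.facebook.com/acme', patterns)); // true
console.log(shouldIgnoreLink('mailto:help@example.com', patterns));       // true
console.log(shouldIgnoreLink('https://docs.example.com/faq', patterns));  // false
```

The example patterns anchor only the start of the URL, so under these assumptions anything after the domain, such as a path or query string, does not affect the match.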

HTML Post Render Script

Enter Playwright-compatible JavaScript that runs after the page loads and before extraction. The script can modify the DOM so that dynamically rendered content is visible to the crawler, for example by expanding accordions or tabs, clicking Load more, closing cookie banners, or waiting for API-driven content.

NOTE: This setting applies only to website data sources where Crawl using HTTP requests is disabled. If you have multiple website data sources, configure the script on each data source that needs it. For more information, see Step 2. Provide content rendering script (only for websites with dynamically rendered content).

Leave the field empty when pages are fully static and no interaction is required before extraction.

Script example

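(async () => {
  // Helpers: a sleep function plus lookups for each accordion card,
  // its toggle link, and its collapsible panel.
  const delay = ms => new Promise(r => setTimeout(r, ms));
  const getCards = () => [...document.querySelectorAll('#accordionFaq .card')];
  const findToggle = card => card.querySelector('.card-header h5 a[href^="javascript"]');
  const findPanel  = card => card.querySelector('.card-header+div');
  // A panel counts as open when it is visibly rendered or has an "open"-style class.
  const panelOpen  = p => p && ((p.style.display !== 'none' && p.offsetHeight > 0) || /open|show|expanded|active/i.test(p.className));

  // Wait up to about 1 second (10 x 100 ms) for the accordion to render.
  for (let i = 0; i < 10 && !document.querySelector('#accordionFaq .card-header h5 a'); i++)
    await delay(100);

  // Expand each card in turn. Cards are re-queried by index because a click
  // may re-render the DOM.
  for (let i = 0, total = getCards().length; i < total; i++) {
    let card  = getCards()[i];
    let toggle = findToggle(card);
    let panel  = findPanel(card);
    if (!toggle) continue; // no toggle link, nothing to expand

    toggle.scrollIntoView({ block: 'center' });

    // Click up to 3 times until the panel reports as open.
    let attempt = 0;
    while (!panelOpen(panel) && attempt < 3) {
      attempt++;
      try { toggle.click(); } catch {}
      await delay(200);
      card  = getCards()[i];
      toggle = findToggle(card) || toggle;
      panel  = findPanel(card) || panel;
    }

    // Fallback: force the panel visible if clicking did not open it.
    if (!panelOpen(panel) && panel)
      panel.style.display = 'block';

    await delay(80); // brief pause before the next card
  }
})()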
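
The same approach extends to the other cases mentioned above, such as repeatedly clicking Load more. The sketch below makes several assumptions: `#load-more` is a placeholder selector, `clickUntilGone` is a hypothetical helper name, and the click limit and delay are arbitrary values to tune per site.

```javascript
// Hypothetical helper: clicks a "Load more" style button until it is
// removed or disabled, or until maxClicks is reached.
async function clickUntilGone(root, selector, maxClicks = 20, waitMs = 300) {
  const delay = ms => new Promise(resolve => setTimeout(resolve, ms));
  let clicks = 0;
  while (clicks < maxClicks) {
    const button = root.querySelector(selector);
    if (!button || button.disabled) break; // nothing left to load
    button.click();
    clicks++;
    await delay(waitMs); // give the page time to append new items
  }
  return clicks;
}

// Assumed usage inside the script field, wrapped like the accordion example
// ('#load-more' is a placeholder selector):
// (async () => {
//   /* paste clickUntilGone here */
//   await clickUntilGone(document, '#load-more');
// })()
```

The `maxClicks` cap keeps the script from looping indefinitely on pages where the button never disappears, such as infinite feeds.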

Data source settings override KB defaults for that source only. When you crawl several websites with different layouts, set selectors and scripts on each data source instead of relying only on the KB-level defaults.

To apply the KB-level HTML settings defaults to all data sources and tree elements, use Save to All from the Knowledge Base advanced settings actions menu.