Website Data Sources
Website data sources, classified as unstructured data sources, are a type of data collection mechanism within DRUID that specifically targets information extracted from websites. They capture unorganized or semi-structured data from web pages, providing a means to integrate online content into the Knowledge Base.
By default, when extracting paragraphs from an HTML website, the extractor excludes content from the following elements:
- HTML tags:
  - img
  - svg
  - script
  - style
  - noscript
  - nav
  - footer
  - header
- Elements with specific roles:
  - [role="alert"]
  - [role="banner"]
  - [role="heading"]
  - [role="dialog"]
  - [role="alertdialog"]
  - [role="region"][aria-label*="skip" i]
- Elements with specific attributes or classes:
  - .hidden
  - .exclusive
  - div.sidemenu
  - div.topbar
  - div.search-content
  - div.navigations-container
  - [class*="breadcrumb"]
  - [aria-modal="true"]
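To make the filtering concrete, the following sketch shows how CSS selectors like the ones above can strip non-content elements before paragraphs are collected. It is illustrative only, not DRUID's extractor; the function name and the use of the beautifulsoup4 package are assumptions for the example.

```python
# Illustrative only: mimic the default exclusion list with CSS selectors.
# Assumes the beautifulsoup4 package; this is not DRUID's extractor.
from bs4 import BeautifulSoup

SELECTORS_TO_IGNORE = [
    "img", "svg", "script", "style", "noscript", "nav", "footer", "header",
    '[role="alert"]', '[role="banner"]', '[role="heading"]',
    '[role="dialog"]', '[role="alertdialog"]',
    '[role="region"][aria-label*="skip" i]',
    ".hidden", ".exclusive", "div.sidemenu", "div.topbar",
    "div.search-content", "div.navigations-container",
    '[class*="breadcrumb"]', '[aria-modal="true"]',
]

def extract_paragraphs(html: str) -> list[str]:
    """Drop non-content elements, then collect the remaining paragraph text."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in SELECTORS_TO_IGNORE:
        for element in soup.select(selector):
            element.decompose()  # Remove the element and all its children.
    return [p.get_text(strip=True) for p in soup.find_all("p")]
```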
Starting with DRUID 8.6, you can configure the HtmlSelectorTagsToIgnore parameter under the "featureFlags" collection in the KB Advanced Settings JSON field. This parameter allows you to control which HTML elements are excluded during content extraction. Setting it to an empty list ([]) includes all content, while specifying values filters out unwanted elements.
{
  "caption": null,
  "name": "HtmlSelectorTagsToIgnore",
  "value": "[\"img\",\"svg\",\"script\",\"style\",\"noscript\",\"nav\",\"footer\",\"header\",\"[role=\\\"alert\\\"]\",\"[role=\\\"banner\\\"]\",\"[role=\\\"heading\\\"]\",\"[role=\\\"dialog\\\"]\",\".hidden\",\".exclusive\",\"div.sidemenu\",\"div.topbar\",\"div.search-content\",\"div.navigations-container\",\"[role=\\\"alertdialog\\\"]\",\"[class *= \\\"breadcrumb\\\"]\",\"[role=\\\"region\\\"][aria-label*=\\\"skip\\\" i]\",\"[aria-modal=\\\"true\\\"]\"]"
}
Adding website data sources
To add a website data source, follow these steps:
Step 1: Create the data source
- Click the Add New button to open the Add New Data Source page.
- In the Name field, enter a name for the data source to help you identify and search for it easily.
- From the Language drop-down, select the language of the data you are uploading. It must be one of the bot languages.
- From the Type drop-down, select Website.
- Leave Crawl using HTTP requests selected if you want DRUID to retrieve data from websites by sending HTTP requests directly to your servers. This method is suitable for accessing information from websites that do not require JavaScript execution or interaction with dynamic content.
- In the URL field, provide the URL of the website, starting with https://. For example, https://druidai.com.
- Optionally, set the Min score threshold and the Target match score for the data source. If not set, the thresholds from the Knowledge Base apply (see the conceptual sketch after these steps).
- To verify the URL you provided, click the Test button. If the test fails, review the URL to ensure it is correct. You can also verify the URL later by going to the Details tab of the website data source and clicking the Test button at the bottom of the page.
- Click Create to save the data source. The website data source appears on the Knowledge base page.
The data source requires crawling.
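For intuition, here is a conceptual sketch of how the two score thresholds might gate answers at prediction time. The scoring semantics, function name, and return values are assumptions made for illustration and do not describe DRUID's actual matching logic.

```python
# Conceptual sketch: how a minimum score and a target match score might
# gate Knowledge Base answers. The semantics below are assumptions for
# illustration, not DRUID's actual matching logic.
def answer(matches: list[tuple[str, float]], min_score: float,
           target_match_score: float):
    """matches: (article, score) pairs sorted by score, highest first."""
    candidates = [(a, s) for a, s in matches if s >= min_score]
    if not candidates:
        return None  # Nothing clears the minimum: no answer is returned.
    best_article, best_score = candidates[0]
    if best_score >= target_match_score:
        return best_article  # Confident match: answer directly.
    return candidates  # Weaker matches might be offered as suggestions.

# Example: with min_score=0.4 and target_match_score=0.8, a best score of
# 0.85 answers directly, 0.6 yields suggestions, and 0.3 yields no answer.
```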
Step 2: Crawl the website data source
On the Knowledge base page, click the website data source. The data source page opens on the Extracted paragraphs tab by default. There are no text articles yet because the data source has not been crawled.
Click the Start crawling button at the top-left corner of the data source page. The Start Crawling Parameters page appears.
Define the crawling policy by setting the parameters described in the table below.
Parameter | Description
---|---
Depth | The depth of the child links to be crawled.
Throttle | The crawling speed, that is, the minimum number of seconds to wait between two consecutive crawl requests.
Use site maps | Select this option if you want the crawler to use the site map. If the website you entered does not have a site map, do not select this option.
Follow links | Select this option if you want the crawler to visit all the hyperlinks identified in the retrieved web pages, starting from the specified URL.
Document content extraction | Enable this option if you want the crawler to detect non-password-protected document assets available on the website. The crawler identifies only documents that match the file types supported by the KB Engine. Note: This option is available as a technology preview in DRUID 8.11 and higher. Important! The crawler extracts document content from files with a maximum size of 20 MB; files exceeding this limit are not processed. This behavior applies in DRUID 8.12 and higher.
After you define the crawling policy, click Start.
As the crawler visits the link provided in the URL field, it will identify all the hyperlinks in the retrieved web pages and will add them to the list of URLs to visit.
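As a mental model, the sketch below shows what a depth-limited, throttled, link-following crawl looks like. The parameter names mirror the crawling policy table above; the helper name, the same-site restriction, and the use of the requests and beautifulsoup4 packages are assumptions, and this is not DRUID's implementation.

```python
# Conceptual sketch of a depth-limited, throttled, link-following crawl.
# Parameter names mirror the crawling policy table above; the same-site
# restriction and the use of requests/beautifulsoup4 are assumptions.
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, depth: int = 2, throttle: float = 1.0,
          follow_links: bool = True) -> dict[str, str]:
    """Return a mapping of visited URL -> raw HTML."""
    queue = [(start_url, 0)]
    visited, pages = set(), {}
    site = urlparse(start_url).netloc
    while queue:
        url, level = queue.pop(0)
        if url in visited or level > depth:
            continue
        visited.add(url)
        response = requests.get(url, timeout=10)
        pages[url] = response.text
        if follow_links and level < depth:
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                child = urljoin(url, anchor["href"])
                if urlparse(child).netloc == site:  # Stay on the same site.
                    queue.append((child, level + 1))
        time.sleep(throttle)  # Minimum wait between consecutive requests.
    return pages
```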
After the crawling completes, you can exclude from scraping specific hyperlinks (sub-links) on the website that you do not want DRUID to extract information from during the extraction process. To exclude a hyperlink, click the dots next to it and select Exclude. By excluding specific hyperlinks, you ensure that only relevant content is captured and added to the Knowledge Base.
The URLs excluded from scraping appear on the Details tab, in the Exclude from scraping area.
Step 3: Extract the text articles
On the Extracted Paragraphs tab of your website data source, click the Extract button at the top-left corner of the page.
When the extraction completes, the extracted paragraphs appear under the Extracted paragraphs > Content tab.
Step 4: Train the data source
After extracting or updating paragraphs, retrain your data source so that the KB Engine recognizes the changes and provides accurate responses to user queries. Click the Train button at the top-left corner of the data source.
Editing articles
To keep your Knowledge Base quality high, we recommend reviewing the extracted articles and taking the proper actions to improve them: open the URL from which the crawler extracted each article, compare the content, and edit or delete the article as needed.
To edit an article, click the dots next to the article and click Edit. Update the article (user intent / question / title / short description and answer) and click the Save icon at the top-right corner of the page.
Fine-tuning Predictions
You can configure Advanced Settings at both the data source and node/leaf levels to achieve more precise predictions. This approach offers granular control, allowing you to adjust the extractors and trainable elements, resulting in better accuracy and performance. Unlike KB-level settings, which apply changes broadly, this targeted method adapts configurations to the unique needs of each data source or element, streamlining your authoring process.
Fine-tuning at the data source level
- Navigate to the desired data source.
- Select the Advanced Settings tab.
- Modify advanced parameters as needed and save the settings.
Fine-tuning at the node or leaf level
- In the tree explorer, select the desired node or leaf.
- On the right side, select the Advanced Settings tab.
- Modify advanced parameters as needed and save the settings.
Reset advanced settings
To reset advanced configurations at the data source and node/leaf levels to match the KB Advanced settings, go to Knowledge Base > Advanced Settings and click the Save to All button. This action streamlines your settings management by applying consistent KB Advanced settings across your entire configuration with just one click.
Enhance KB prediction
Refine your articles by transforming unstructured data into a question-and-answer format: edit the articles and add a question / title / short description.
Access the Knowledge Base Advanced Settings, set the "trainableColumns" parameter to "Question,Answer", then train the Knowledge Base. The KB Engine will leverage both questions and answers from unstructured data sources during the prediction process, ultimately leading to improved prediction accuracy.
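As a sketch, the snippet below assembles such a featureFlags entry, assuming it follows the same structure as the HtmlSelectorTagsToIgnore example shown earlier; the exact entry format is an assumption, not confirmed syntax.

```python
import json

# Hypothetical featureFlags entry for trainableColumns; the structure is
# assumed to mirror the HtmlSelectorTagsToIgnore example shown earlier.
trainable_columns_entry = {
    "caption": None,
    "name": "trainableColumns",
    "value": "Question,Answer",
}

print(json.dumps(trainable_columns_entry, indent=2))
```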
Identify the reasons why specific links are missing from the data source
Website data sources provide clear error messages for links that were not successfully crawled, enabling you to swiftly pinpoint the underlying issues. This enhanced visibility helps you maintain the accuracy and completeness of your data sources.
To understand why specific links are missing from the data source:
- In the tree explorer, click the Actions icon for the page whose links were not crawled, then select Page Info.
- On the Page Info page, click the Extracted links tab.
- Review the information provided for each link. Error messages will highlight the specific reasons why a link was not crawled.