Website Data Sources
Website data sources, classified as unstructured data sources, are a type of data collection mechanism within DRUID that specifically targets information extracted from websites. They capture unorganized or semi-structured data from web pages, providing a means to integrate online content into the Knowledge Base.
By default, when extracting paragraphs from an HTML website, the extractor excludes content from the following elements:
- HTML tags:
  - img
  - svg
  - script
  - style
  - noscript
  - nav
  - footer
  - header
- Elements with specific roles:
  - [role="alert"]
  - [role="banner"]
  - [role="heading"]
  - [role="dialog"]
  - [role="alertdialog"]
  - [role="region"][aria-label*="skip" i]
- Elements with specific attributes or classes:
  - .hidden
  - .exclusive
  - div.sidemenu
  - div.topbar
  - div.search-content
  - div.navigations-container
  - [class*="breadcrumb"]
  - [aria-modal="true"]
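To make the filtering concrete, the following sketch shows how CSS selectors like the ones above can strip non-content elements before paragraphs are collected. It is illustrative only, not DRUID's extractor; the function name and the use of the beautifulsoup4 package are assumptions for the example.

```python
# Illustrative only: mimic the default exclusion list with CSS selectors.
# Assumes the beautifulsoup4 package; this is not DRUID's extractor.
from bs4 import BeautifulSoup

SELECTORS_TO_IGNORE = [
    "img", "svg", "script", "style", "noscript", "nav", "footer", "header",
    '[role="alert"]', '[role="banner"]', '[role="heading"]',
    '[role="dialog"]', '[role="alertdialog"]',
    '[role="region"][aria-label*="skip" i]',
    ".hidden", ".exclusive", "div.sidemenu", "div.topbar",
    "div.search-content", "div.navigations-container",
    '[class*="breadcrumb"]', '[aria-modal="true"]',
]

def extract_paragraphs(html: str) -> list[str]:
    """Drop non-content elements, then collect the remaining paragraph text."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in SELECTORS_TO_IGNORE:
        for element in soup.select(selector):
            element.decompose()  # Remove the element and all its children.
    return [p.get_text(strip=True) for p in soup.find_all("p")]
```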
Starting with DRUID 8.6, you can configure the HtmlSelectorTagsToIgnore parameter under the "featureFlags" collection in the KB Advanced Settings JSON field. This parameter allows you to control which HTML elements are excluded during content extraction. Setting it to an empty list ([]) includes all content, while specifying values filters out unwanted elements.
{
  "caption": null,
  "name": "HtmlSelectorTagsToIgnore",
  "value": "[\"img\",\"svg\",\"script\",\"style\",\"noscript\",\"nav\",\"footer\",\"header\",\"[role=\\\"alert\\\"]\",\"[role=\\\"banner\\\"]\",\"[role=\\\"heading\\\"]\",\"[role=\\\"dialog\\\"]\",\".hidden\",\".exclusive\",\"div.sidemenu\",\"div.topbar\",\"div.search-content\",\"div.navigations-container\",\"[role=\\\"alertdialog\\\"]\",\"[class *= \\\"breadcrumb\\\"]\",\"[role=\\\"region\\\"][aria-label*=\\\"skip\\\" i]\",\"[aria-modal=\\\"true\\\"]\"]"
}
Adding website data sources
To add a website data source, follow these steps:
Step 1: Create the data source
- Click the Add New button to open the Add New Data Source page.
- In the Name field, enter a name for the data source to help you identify and search for it easily.
- From the Language drop-down, select the language of the data you are uploading. It must be one of the bot languages.
- From the Type drop-down, select Website.
- Leave Crawl using HTTP requests selected if you want DRUID to retrieve data from websites by sending HTTP requests directly to your servers. This method is suitable for accessing information from websites that do not require JavaScript execution or interaction with dynamic content.
- In the URL field, provide the URL of the website, starting with https://. For example, https://druidai.com.
- Optionally, set the Min score threshold and the Target match score for the data source. If not set, the thresholds from the Knowledge Base apply (see the conceptual sketch after these steps).
- To verify the URL you provided, click the Test button. If the test fails, review the URL to ensure it is correct. You can also verify the URL later by going to the Details tab of the website data source and clicking the Test button at the bottom of the page.
- Click Create to save the data source. The website data source appears on the Knowledge base page.
The data source requires crawling.
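For intuition, here is a conceptual sketch of how the two score thresholds might gate answers at prediction time. The scoring semantics, function name, and return values are assumptions made for illustration and do not describe DRUID's actual matching logic.

```python
# Conceptual sketch: how a minimum score and a target match score might
# gate Knowledge Base answers. The semantics below are assumptions for
# illustration, not DRUID's actual matching logic.
def answer(matches: list[tuple[str, float]], min_score: float,
           target_match_score: float):
    """matches: (article, score) pairs sorted by score, highest first."""
    candidates = [(a, s) for a, s in matches if s >= min_score]
    if not candidates:
        return None  # Nothing clears the minimum: no answer is returned.
    best_article, best_score = candidates[0]
    if best_score >= target_match_score:
        return best_article  # Confident match: answer directly.
    return candidates  # Weaker matches might be offered as suggestions.

# Example: with min_score=0.4 and target_match_score=0.8, a best score of
# 0.85 answers directly, 0.6 yields suggestions, and 0.3 yields no answer.
```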
Step 2: Crawl the website data source
On the Knowledge base page, click the website data source. The data source page opens on the Extracted paragraphs tab by default. There are no text articles yet because the data source has not been crawled.
Click the Start crawling button at the top-left corner of the data source page. The Start Crawling Parameters page appears.
Define the crawling policy by setting the parameters described in the table below.
Parameter | Description
---|---
Depth | The depth of the child links to be crawled.
Throttle | The crawling speed, that is, the minimum number of seconds to wait between two consecutive crawl requests.
Use site maps | Select this option if you want the crawler to use the site map. If the website you entered does not have a site map, do not select this option.
Follow links | Select this option if you want the crawler to visit all the hyperlinks identified in the retrieved web pages, starting from the specified URL.
Document content extraction | Enable this option if you want the crawler to detect non-password-protected document assets available on the website. The crawler identifies only documents that match the file types supported by the KB Engine. Note: This option is available as a technology preview in DRUID 8.11 and higher. Important! The crawler extracts document content from files with a maximum size of 20 MB; files exceeding this limit are not processed. This behavior applies in DRUID 8.12 and higher.
After you define the crawling policy, click Start.
As the crawler visits the link provided in the URL field, it will identify all the hyperlinks in the retrieved web pages and will add them to the list of URLs to visit.
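As a mental model, the sketch below shows what a depth-limited, throttled, link-following crawl looks like. The parameter names mirror the crawling policy table above; the helper name, the same-site restriction, and the use of the requests and beautifulsoup4 packages are assumptions, and this is not DRUID's implementation.

```python
# Conceptual sketch of a depth-limited, throttled, link-following crawl.
# Parameter names mirror the crawling policy table above; the same-site
# restriction and the use of requests/beautifulsoup4 are assumptions.
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, depth: int = 2, throttle: float = 1.0,
          follow_links: bool = True) -> dict[str, str]:
    """Return a mapping of visited URL -> raw HTML."""
    queue = [(start_url, 0)]
    visited, pages = set(), {}
    site = urlparse(start_url).netloc
    while queue:
        url, level = queue.pop(0)
        if url in visited or level > depth:
            continue
        visited.add(url)
        response = requests.get(url, timeout=10)
        pages[url] = response.text
        if follow_links and level < depth:
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                child = urljoin(url, anchor["href"])
                if urlparse(child).netloc == site:  # Stay on the same site.
                    queue.append((child, level + 1))
        time.sleep(throttle)  # Minimum wait between consecutive requests.
    return pages
```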
After the crawling completes, you can exclude from scraping specific hyperlinks (sub-links) on the website that you do not want DRUID to extract information from during the extraction process. To exclude a hyperlink, click the dots next to it and select Exclude. By excluding specific hyperlinks, you ensure that only relevant content is captured and added to the Knowledge Base.
The URLs excluded from scraping appear on the Details tab, in the Exclude from scraping area.
Step 3: Extract the text articles
On the Extracted Paragraphs tab of your website data source, click the Extract button at the top-left corner of the page.
When the extraction completes, the extracted paragraphs appear under the Extracted paragraphs > Content tab.
Step 4: Train the data source
After extracting or updating paragraphs, retrain your data source so that the KB Engine recognizes the changes and provides accurate responses to user queries. Click the Train button at the top-left corner of the data source.
Editing articles
To keep your Knowledge Base quality high, we recommend reviewing the extracted articles and taking the proper actions to improve them: open the URL from which the crawler extracted each article, compare the content, and edit or delete the article as needed.
To edit an article, click the dots next to the article and click Edit. Update the article (user intent / question / title / short description and answer) and click the Save icon at the top-right corner of the page.
Fine-tuning Predictions
You can configure Advanced Settings at both the data source and node/leaf levels to achieve more precise predictions. This approach offers granular control, allowing you to adjust the extractors and trainable elements, resulting in better accuracy and performance. Unlike KB-level settings, which apply changes broadly, this targeted method adapts configurations to the unique needs of each data source or element, streamlining your authoring process.
Fine-tuning at the data source level
- Navigate to the desired data source.
- Select the Advanced Settings tab.
- Modify advanced parameters as needed and save the settings.
Fine-tuning at the node or leaf level
- In the tree explorer, select the desired node or leaf.
- On the right side, select the Advanced Settings tab.
- Modify advanced parameters as needed and save the settings.
Reset advanced settings
To reset advanced configurations at the data source and node/leaf levels to match the KB Advanced settings, go to Knowledge Base > Advanced Settings and click the Save to All button. This action streamlines your settings management by applying consistent KB Advanced settings across your entire configuration with just one click.
Enhance KB prediction
Refine your articles by transforming unstructured data into a question-and-answer format: edit the articles and add a question / title / short description.
Access the Knowledge Base Advanced Settings, set the "trainableColumns" parameter to "Question,Answer", then train the Knowledge Base. The KB Engine will leverage both questions and answers from unstructured data sources during the prediction process, ultimately leading to improved prediction accuracy.
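As a sketch, the snippet below assembles such a featureFlags entry, assuming it follows the same structure as the HtmlSelectorTagsToIgnore example shown earlier; the exact entry format is an assumption, not confirmed syntax.

```python
import json

# Hypothetical featureFlags entry for trainableColumns; the structure is
# assumed to mirror the HtmlSelectorTagsToIgnore example shown earlier.
trainable_columns_entry = {
    "caption": None,
    "name": "trainableColumns",
    "value": "Question,Answer",
}

print(json.dumps(trainable_columns_entry, indent=2))
```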
Identify the reasons why specific links are missing from the data source
Website data sources provide clear error messages for links that were not successfully crawled, enabling you to swiftly pinpoint the underlying issues. This enhanced visibility helps you maintain the accuracy and completeness of your data sources.
To understand why specific links are missing from the data source:
- In the tree explorer, click the Actions icon for the page whose links were not crawled, then select Page Info.
- On the Page Info page, click the Extracted links tab.
- Review the information provided for each link. Error messages will highlight the specific reasons why a link was not crawled.