Custom Data Sources

Custom data sources allow you to extract data from third-party systems that expose data via APIs. They retrieve content through DRUID integrations and are intended for cases where none of the standard DRUID data source types apply.

Custom data sources can include databases, where each table represents a node, and table content represents records (such as articles). You can also use them to extract data from systems like Confluence, Zendesk, and Salesforce Ticketing/KB.

How Custom Data Sources Work

Custom Knowledge Base (KB) data sources enable DRUID to access and retrieve data from external systems using DRUID integrations (DRUID Connectors). This approach leverages DRUID's existing Hybrid Deployment capabilities to securely access on-premise data sources.

To create a custom data source, you need to configure two key processes (integrations):

  • Crawl Integration – Responsible for discovering and retrieving the structure (nodes and hierarchy) of the data source, including file nodes.
  • Extract Integration – Responsible for extracting the actual content from the files identified and retrieved by the Crawl Integration.

The Crawling Process

The crawling process is iterative, meaning it repeats until a specific condition is met. It starts at a defined point in the data source and retrieves nodes for each hierarchy level. This process continues, exploring each level of the hierarchy, until one of the following occurs:

  • Crawl Depth limit reached: The crawling process reaches the maximum depth level specified in the crawling configuration.
  • No more child nodes available: The crawling process reaches a point in the hierarchy where a node has no further child nodes.

For example:

  • Depth = 1: The integration runs once at the root level (level 0) and retrieves all direct child nodes of the root.
  • Depth = 2: The integration runs once for the root node (level 0) and then once for each of its direct child nodes (level 1) to discover the level 2 children of each level 1 node. If the root node has 10 child nodes, the integration executes 11 times in total (once for the root and once for each child node).

During each execution, the crawling integration populates the request entity with the data of the level currently being processed. The request entity includes the parent node's ID ([[KBCustomDSDiscoverTask]].ParentNode.Id), which is essential for establishing the hierarchical relationship between nodes: each discovered child node must be associated with that parent in the response returned by the integration.
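For illustration, the fragment below shows how the crawling custom code reads this value from the request entity; the full template in Step 1 assigns it to each returned node as ParentId.

// Illustration only: read the parent node's ID from the request entity
let requestEntity = Context.GetRequestEntity();
let parentId = requestEntity.ParentNode.Id;
// ...each child node returned in the response then sets ParentId = parentId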

Iterating through the discovered hierarchy of nodes is controlled by the Knowledge Base engine itself. You don’t need to implement logic for iterating through the hierarchy in the crawling integration. The crawling integration should focus only on returning the child nodes of a single given parent node. The crawling engine will further call the integration for each parent node, depending on the requested depth and the number of parent nodes.
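The sketch below is conceptual only; the function names are illustrative, not DRUID internals. It shows the division of responsibility: the engine walks the hierarchy level by level up to the configured depth, calling the crawling integration once per parent node, while the integration itself returns only that parent's direct children.

// Conceptual sketch only - illustrative names, not the DRUID KB Engine implementation
function crawl(rootNode, maxDepth) {
    let currentLevel = [rootNode];
    for (let depth = 0; depth < maxDepth; depth++) {
        let nextLevel = [];
        for (let parent of currentLevel) {
            // One execution of the crawling integration per parent node;
            // it returns only that parent's direct children.
            let children = runCrawlIntegration(parent);
            nextLevel = nextLevel.concat(children);
        }
        if (nextLevel.length === 0) break; // no more child nodes available
        currentLevel = nextLevel;
    }
}
// For Depth = 2 and a root with 10 children, this amounts to 1 + 10 = 11 executions.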

The Extraction Process

The extraction process focuses on retrieving the content of the nodes identified during crawling. The Extract integration is executed once for each node discovered in the crawling stage; it is therefore designed to deliver the content of a single node per execution.

The DRUID KB Engine orchestrates this by invoking the Extract integration for every node, populating the request entity with the relevant information for that specific node.

For example, if the crawling process discovers 15 nodes under the root, the DRUID KB Engine will execute the Extract Integration 15 times, once per node, to extract their respective content.
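Conceptually (again, with illustrative names rather than DRUID internals), the orchestration amounts to the following.

// Conceptual sketch only - not the DRUID KB Engine implementation
for (let node of discoveredNodes) {   // e.g. the 15 nodes discovered under the root
    runExtractIntegration(node);      // 15 executions, one node per run
}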

Setting up custom data sources

This section describes how to set up a custom data source.

Step 1: Create an integration for node crawling

To create a crawling integration:

  1. Search for the system entity [[KBCustomDSDiscoverTask]] and create an integration for crawling.

    Add a custom code integration task and any other integration tasks your use case requires (REST, SQL, SOAP, etc.).

    Note: The crawling integration must exclusively use this dedicated system entity; otherwise, it will not appear in the data source configuration panel.

  2. Define the crawling logic.
Crawling logic template
(function main() {
    // Request Entity manipulation
    let requestEntity = Context.GetRequestEntity();

    // When Request Entity has not been initialized in the context, its value is null
    if (requestEntity == null) {
        Context.RaiseError("01", "Request entity is empty. Contact technical support.");
        Context.CompleteAction();
        return; // stop here so the discovery task is not executed without a request entity
    }

    Context.ExecuteTask("Wordpress Discover Sites"); // Calls the actual API that loads data from the external data source
    let responseEntity = Context.GetResponseEntity(); // in this example, the Name, the URL, and the external ID (ExternalObjectId) of each node are mapped directly in the API call response

    // Add additional technical metadata to each node
    if (responseEntity.Nodes != null && responseEntity.Nodes.Count > 0) {
        for (let i = 0; i < responseEntity.Nodes.Count; i++) {
            responseEntity.Nodes[i].Processed = true; // Marks crawling with success in the KB tree
            responseEntity.Nodes[i].ExternalObjectType = "0"; // 0 - WebPage; 1 - Repository File; 2 - Repository Folder; 3 - WebFile;
            responseEntity.Nodes[i].Id = generateUUID(); // Each node must have a UUID
            responseEntity.Nodes[i].ParentId = requestEntity.ParentNode.Id; // From the request entity - sets the parent-child relationship
            responseEntity.Nodes[i].PageType = 1; // 0 - WebPage; 1 - Repository File; 2 - Repository Folder; 3 - WebFile;
            responseEntity.Nodes[i].ContentMimeType = "text/html"; // Set the appropriate content MIME type based on the file extension
            
            // │ File Type  │ MIME Type                                                  │
            // ├────────────┼─────────────────────────────────────────────────────────────┤
            // │ .docx      │ application/vnd.openxmlformats-officedocument.wordprocessingml.document │
            // │ .pdf       │ application/pdf                                              │
            // │ .doc       │ application/msword                                           │
            // │ .xls       │ application/vnd.ms-excel                                     │
            // │ .xlsx      │ application/vnd.openxmlformats-officedocument.spreadsheetml.sheet │
            // │ .xlsm      │ application/vnd.ms-excel.sheet.macroEnabled.12              │
            // │ .pptx      │ application/vnd.openxmlformats-officedocument.presentationml.presentation │
            // │ .ppsx      │ application/vnd.openxmlformats-officedocument.presentationml.slideshow   │
            // │ .csv       │ text/csv                                                     │
            // │ .json      │ application/json                                             │
            // │ .html      │ text/html                                                    │
        }
    }

    responseEntity.Status = 1; // Possible values: 1-Complete, 2-NextPage, 3-Error. Complete → set this to 1 to stop the iteration for the given level (in case the data is retrieved by pagination)
    Context.SetResponseEntity(responseEntity);

    Context.CompleteAction();
})();

// generateUUID is referenced above but is not defined in the template as shown; a simple
// UUID v4 generator like the one below can be used (or substitute your own implementation)
function generateUUID() {
    return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function (c) {
        let r = Math.random() * 16 | 0;
        let v = c === 'x' ? r : (r & 0x3 | 0x8);
        return v.toString(16);
    });
}

How this template script works:

  1. Retrieves the request entity from the crawling context. If the request entity has not been initialized in the context, an error message is displayed and the process stops.
  2. Executes a discovery task to fetch data. Calls an external task (for example, Wordpress Discover Sites) to retrieve data from the source system, such as WordPress pages or files.
  3. Retrieves and prepares the response entity. Gets the response entity, which will hold the list of discovered nodes (for example, KBWebsitePage items).
  4. Adds metadata to each node. For each discovered node, assigns values such as name, URL, MIME type, UUID, external object ID, and the parent node’s ID.
  5. Completes the response entity. Sets the status of the current level to complete, indicating that all nodes at this level have been discovered (for example, when pagination is not used).
  6. Sends back the response. Returns the response entity to the Knowledge Base engine so it can continue building the knowledge tree.
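If the external API is paginated, the Status field in the template above is how you tell the engine whether the current level is finished. The fragment below is a sketch only; hasMorePages stands in for whatever pagination indicator your API provides, and how you track the current page between executions depends on your integration design.

// Sketch only: report the crawling status for the current level
// 1 - Complete, 2 - NextPage, 3 - Error
// hasMorePages is a placeholder for your API's own pagination indicator
responseEntity.Status = hasMorePages ? 2 : 1; // 2 indicates more results remain for this level
Context.SetResponseEntity(responseEntity);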

Step 2: Create an integration for data extraction

To create an extraction integration:

  1. In the [[KBCustomDSProcessNodeTask]] system entity, create an integration for data extraction.
    Note: The extraction integration must exclusively use this dedicated system entity; otherwise, it will not appear in the data source configuration panel.

  2. Add a REST integration task to retrieve content files from the third-party system. Save the response in [[KBCustomDSProcessNodeTask]].<field of type file> by setting Project Result To Entity Path to [[KBCustomDSProcessNodeTask]].<field of type file>.

    Hint: The most efficient way to extract data from third-party systems is to configure the extraction integration to construct JSON content that follows the DRUID structured JSON format; this allows structured content from external systems to be processed efficiently. Save the response in a JSON file. For details on the structure of the JSON file, see JSON File Structure.

  3. Add a custom code task to extract the content from the previous response and store it in FileContent on the corresponding KB node (file node).
Data Extraction custom code template
(function main() {
    // Build the response entity expected by the extraction process
    let responseEntity = EntityFactory.CreateEntityByName("KBCustomDSProcessNodeTask");
    responseEntity.Node = EntityFactory.CreateEntityByName("KBWebsitePage");
    responseEntity.Node.Context = {};
    // Copy the content retrieved by the previous task into the node's FileContent
    responseEntity.Node.Context.FileContent = Context.GetResponseEntity().Node.Content;

    Context.SetResponseEntity(responseEntity);
    Context.CompleteAction();
})();

You can add additional integration tasks, for example, to use generative AI to rephrase content.

Step 3: Create a custom data source

To create a custom data source:

  1. On the Knowledge Base page, click Add Data Source.
  2. Enter a name and select Custom as the data source type.
  3. From the Crawl Integration field, select the integration created for crawling.
  4. From the Extract Integration field, select the integration created for extraction.
  5. Click Save.

Once the custom data source is set up:

  • Click Crawl to start the crawling process and retrieve all nodes and child nodes.
  • Click Extract to run the extraction integration and extract content from the external system.

JSON File Structure

For Custom data sources, third-party tools must support REST APIs for data exchange. Extracted content — whether in Word documents, Excel files, PDFs, or JSON format — must be mapped into a structured JSON file before integration.

The JSON file should follow this format:


JSON structure

[
  {
    "Title": "Sample Title",
    "Content": "Content 1"
  },
  {
    "Title": "Sample Title 2",
    "Content": "Content 2",
    "PageNumber": "3"
  },
  {
    "Title": "Sample Title 3",
    "Content": "Sample Content 3",
    "SheetName": "Sheet1"
  }
]

The following table provides the description of each JSON property:

Property     Required   Description
Title        Yes        The title of the content entry.
Content      Yes        The content to be added to the Knowledge Base.
PageNumber   No         Relevant only when mapping data from PDFs or Word documents. Specifies the page number where the content was extracted.
SheetName    No         Relevant only when mapping data from Excel files. Specifies the sheet name where the content was extracted.
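
As an illustration only, a custom code task in the extraction integration could assemble such entries and store the serialized JSON as the node's FileContent; the entry values below are placeholders, and in practice they would come from data retrieved by a previous integration task.

(function main() {
    // Sketch only: assemble structured JSON entries (placeholder values shown;
    // real values would come from data retrieved by a previous task)
    let entries = [
        { "Title": "Sample Title", "Content": "Content 1" },
        { "Title": "Sample Title 2", "Content": "Content 2", "PageNumber": "3" }
    ];

    let responseEntity = EntityFactory.CreateEntityByName("KBCustomDSProcessNodeTask");
    responseEntity.Node = EntityFactory.CreateEntityByName("KBWebsitePage");
    responseEntity.Node.Context = {};
    // Store the serialized JSON as the file content of the KB node
    responseEntity.Node.Context.FileContent = JSON.stringify(entries);

    Context.SetResponseEntity(responseEntity);
    Context.CompleteAction();
})();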