Custom Data Sources

Custom data sources allow you to extract data from third-party systems that expose their data via APIs. They build on DRUID integration responses and let you retrieve content when none of the standard DRUID data source types apply.

Custom data sources can include databases, where each table represents a node, and table content represents records (such as articles). You can also use them to extract data from systems like Confluence, Zendesk, and Salesforce Ticketing/KB.

How Custom Data Sources Work

Custom Knowledge Base (KB) data sources enable DRUID to access and retrieve data from external systems using DRUID integrations (DRUID Connectors). This approach leverages DRUID's existing Hybrid Deployment capabilities to securely access on-premise data sources.

To create a custom data source, you need to configure two key processes (integrations):

  • Crawl Integration – Responsible for discovering and retrieving the structure (nodes and hierarchy) of the data source, including file nodes.
  • Extract Integration – Responsible for extracting the actual content from the files identified and retrieved by the Crawl Integration.

The Crawling Process

The crawling process is iterative, meaning it repeats until a specific condition is met. It starts at a defined point in the data source and retrieves nodes and their immediate child nodes. This process continues, exploring each level of the hierarchy, until one of the following occurs:

  • Crawl Depth limit reached: The crawling process reaches the maximum depth level specified in the crawling configuration.
  • No more child nodes available: The crawling process reaches a point in the hierarchy where a node has no further child nodes.

The Crawl integration is executed for each node during the crawling process. The number of times it's executed depends on the configured crawl depth.

For example:

  • Depth = 1: The integration runs once at the root level (level 0) and retrieves all direct child nodes of the root.
  • Depth = 2: The integration runs for the root node (level 0) and then once for each of its direct child nodes (level 1). If the root node has 10 child nodes, the integration executes 11 times in total (once for the root and once for each child node).
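
If it helps to see the arithmetic, the short sketch below (plain JavaScript, not a DRUID integration script) computes the total number of Crawl integration executions from the node counts per level; it simply restates the rule above.

Crawl execution count illustration
function countCrawlExecutions(nodesPerLevel, depth) {
    // nodesPerLevel[0] is the root level (always 1 node), nodesPerLevel[1] is
    // the number of its direct children, and so on.
    let executions = 0;
    // The Crawl integration runs once for every node on levels 0 .. depth - 1.
    for (let level = 0; level < depth && level < nodesPerLevel.length; level++) {
        executions += nodesPerLevel[level];
    }
    return executions;
}

countCrawlExecutions([1, 10], 1); // Depth = 1: root only -> 1 execution
countCrawlExecutions([1, 10], 2); // Depth = 2: root + 10 children -> 11 executions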

During each execution, the crawling integration populates the request entity with the data of the currently processed node. This includes the parent node's ID, specifically the [[KBCustomDSDiscoverTask]].ParentNode value, which is essential for correctly establishing the hierarchical relationship between nodes in the data source.
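
For instance, a crawl custom code task can read ParentNode from the request entity to decide which children to query next. The following minimal sketch assumes a REST task named "Query Child Nodes" exists in the same integration and that the parent's ExternalObjectId is what the external API expects; both are illustrative assumptions.

ParentNode usage sketch
(function main() {

    // The KB Engine populates the request entity (KBCustomDSDiscoverTask)
    // with the node that is currently being processed.
    let requestEntity = Context.GetRequestEntity();

    // At the root level ParentNode may be empty; on deeper levels it holds the
    // KBWebsitePage discovered in the previous crawl iteration.
    let parentExternalId = requestEntity.ParentNode
        ? requestEntity.ParentNode.ExternalObjectId
        : "root"; // Illustrative fallback for the starting point

    // Pass the parent ID to a REST task (hypothetical name) that queries the
    // external system for the children of that node.
    Context.SetContextVariable("parentExternalId", parentExternalId);
    Context.ExecuteTask("Query Child Nodes");

    // ...then build the KBWebsitePage collection from the REST response,
    // as shown in the crawling logic template later in this article.

})();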

The Extraction Process

The extraction process focuses on retrieving the content from the files identified during crawling. The Extract integration is executed once for each node discovered in the data source, and as such, it is designed to deliver the content of a single node per execution.

The DRUID KB Engine orchestrates this by invoking the Extract integration for every node, populating the request entity with the relevant information for that specific node.

For example, if the crawling process discovers 15 nodes under the root, the DRUID KB Engine will execute the Extract Integration 15 times, once per node, to extract their respective content.
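
To make this concrete, here is a minimal sketch of how an Extract custom code task could read the node it was invoked for. It assumes the request entity exposes the discovered node on a Node field (mirroring the response entity used in the data extraction template in Step 2) and uses a hypothetical REST task name, "Download Node Content".

Extract request sketch
(function main() {

    // The KB Engine populates the request entity (KBCustomDSProcessNodeTask)
    // with the node to extract. The Node field is assumed here, mirroring the
    // response entity used in the data extraction template.
    let requestEntity = Context.GetRequestEntity();
    let node = requestEntity.Node; // KBWebsitePage discovered during crawling

    // Use the node metadata to fetch its content from the external system
    // (hypothetical REST task name).
    Context.SetContextVariable("externalObjectId", node.ExternalObjectId);
    Context.SetContextVariable("contentUrl", node.Url);
    Context.ExecuteTask("Download Node Content");

    // ...then map the downloaded content into the response entity, as shown
    // in the data extraction template in Step 2.

})();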

Setting up custom data sources

This section describes how to set up a custom data source.

Step 1: Create an integration for node crawling

To create a crawling integration:

  1. Search for the system entity [[KBCustomDSDiscoverTask]] and create an integration for crawling.

    Add a custom code integration task and any other integration tasks your use case requires (REST, SQL, SOAP, etc.).

    Note:  The crawling integration must use this dedicated system entity exclusively; otherwise, it will not appear in the data source configuration panel.

  2. Define the crawling logic.
Crawling logic template
 (function main() {

     // Retrieve the incoming request entity
     let requestEntity = Context.GetRequestEntity();

     // Create a response entity of type KBCustomDSDiscoverTask
     let responseEntity = EntityFactory.CreateEntityByName("KBCustomDSDiscoverTask");

     // Create a KBWebsitePage entity (represents a knowledge base item)
     var pageItem = EntityFactory.CreateEntityByName("KBWebsitePage");
     pageItem.Name = "Page Name"; // Replace with actual name
     pageItem.Url = "Page URL"; // Replace with actual URL
     pageItem.Processed = true; // Mark as processed
     pageItem.Context = {
         "customField1": "value1",
         "customField2": ["item1", "item2"]
     };
     pageItem.ExternalObjectId = "unique-external-id"; // Replace with actual external ID
     pageItem.ExternalObjectType = "3"; // Replace with actual type value (e.g., 3 for text/html)
     pageItem.ContentMimeType = "application/pdf"; // Change based on content type (e.g., text/html, application/json)
     pageItem.PageType = 1; // Stop in-depth navigation once the desired depth has been reached
     pageItem.Id = generateUUID(); // See the generateUUID() function definition below

     // Create a collection to store page items
     let nodes = EntityFactory.CreateCollection("KBWebsitePage");
     nodes.Add(pageItem);

     // Set the response entity properties
     responseEntity.DataSource = requestEntity.DataSource;
     responseEntity.Nodes = nodes;
     responseEntity.Status = 1; // Possible values: 1 - Complete, 2 - NextPage, 3 - Error
     responseEntity.Context = requestEntity.Context || {};

     // Add additional context fields if needed
     responseEntity.Context["customResponseField"] = "customValue";

     // Set the response entity and complete the action
     Context.SetResponseEntity(responseEntity);
     Context.CompleteAction();

 })();

 function generateUUID() {
     return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
         var r = Math.random() * 16 | 0,
             v = c == 'x' ? r : (r & 0x3 | 0x8);
         return v.toString(16);
     })
 }

How this template script works:

  1. Retrieves the request entity (which contains input details for crawling, like depth).
  2. Creates a response entity (KBCustomDSDiscoverTask) to hold the crawled nodes.
  3. Creates a KBWebsitePage node representing an item (such as a table or file).
  4. Adds metadata, including name, URL, content type, and external object ID.
  5. Adds the node to the response entity and marks the crawling as complete.
  6. Sends back the response entity so DRUID can recognize the discovered nodes.

How to use this template

  • Replace "Page Name", "Page URL", and other placeholder values with actual data.
  • Modify ContentMimeType to match the document type (application/pdf, text/html, etc.).
  • Extend the Context object with any required custom fields.
  • Update the code if you want to query only for data that matches specific filtering criteria.
Crawling logic example
(function main() {
    let pageSize = 100;
    let maxPages = 1000;
    var requestEntity = Context.GetRequestEntity();
    let queryFilter = "PublishStatus='Online'+AND+IsVisibleInCsp+=+true+AND+Product__c+IN+('Product A;Category X','Category X','Feature Y;Category X','Product A;Category Z;Category X','Solution B;Platform C;Category X')"
    let iteration = 0;
    if (!!requestEntity?.Context) {
        let lastId = !!requestEntity.Context["lastId"] ? requestEntity.Context["lastId"] : null;
        iteration = !!requestEntity.Context["iteration"] ? parseInt(requestEntity.Context["iteration"]) : 0
        if (lastId) {
            queryFilter = queryFilter + "+AND Id>'" + lastId + "'+";
        }
    }

    Context.SetContextVariable("queryFilter", queryFilter);
    Context.SetContextVariable("pageSize", pageSize.toString());
    Context.ExecuteTask("Query KB Articles SF");
    var initialResponseEntity = Context.GetResponseEntity();
    var responseEntity = EntityFactory.CreateEntityByName("KBCustomDSDiscoverTask");

    // Request entity is KBCustomDSDiscoverTask:
    //   DataSource - KBDataSource
    //   Context - JObject
    //   ParentNode - KBWebsitePage
    var nodes = EntityFactory.CreateCollection("KBWebsitePage");
    var baseURL = Context.GetConnectorApplicationVariable("sfdc_community_baseurl");
    // Build one KBWebsitePage node per returned article
    let lastId = "";
    for (let i = 0; i < initialResponseEntity.KnowledgeArticles.Count; i++) {
        let page = EntityFactory.CreateEntityByName("KBWebsitePage");
        page.Name = initialResponseEntity.KnowledgeArticles[i].Name;
        page.Url = baseURL + initialResponseEntity.KnowledgeArticles[i].Url;
        page.Processed = true;
        page.ExternalObjectId = initialResponseEntity.KnowledgeArticles[i].Id;
        page.ExternalObjectType = "3";
        page.ContentMimeType = "text/html";
        page.Id = generateUUID();

        // Response

        nodes.Add(page);
        //responseEntity.Name = requestEntity.Name + "_response";

        lastId = page.ExternalObjectId;
    }
    //initialResponseEntity.nextRecordsUrl
    iteration++;
    responseEntity.Context = {
        "lastId": lastId,
        "iteration": iteration
    }

    /*
     * Possible values: 1-Complete, 2-NextPage, 3-Error
     */
    responseEntity.Status = (initialResponseEntity.KnowledgeArticles.Count < pageSize || iteration >= maxPages) ? 1 : 2; // 1 - Complete, 2 - NextPage
    responseEntity.Nodes = nodes;
    Context.SetResponseEntity(responseEntity);

    Context.CompleteAction();

})();

function generateUUID() {

    return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
        var r = Math.random() * 16 | 0,
            v = c == 'x' ? r : (r & 0x3 | 0x8);
        return v.toString(16);
    })
}

The example script:

  1. Queries Salesforce KB for published, customer-visible articles filtered by specific products.
  2. Handles pagination to fetch large datasets in batches of 100 articles.
  3. Processes each article into a structured format (name, URL, ID).
  4. Generates unique IDs for each article using a UUID generator.
  5. Returns the articles to DRUID while managing pagination through context variables.

In the example above, the crawling integration also includes a REST task ("Query KB Articles SF") that queries the Salesforce KB and maps the Salesforce article metadata to DRUID entity fields.

Step 2: Create an integration for data extraction

To create an extraction integration:

  1. In the [[KBCustomDSProcessNodeTask]] system entity, create an integration for data extraction.

    Note:  The extraction integration must use this dedicated system entity exclusively; otherwise, it will not appear in the data source configuration panel.

  2. Add a REST integration task to retrieve content files from the third-party system.
  3. Save the response in [[KBCustomDSProcessNodeTask]].<field of type file>: set Project Result To Entity Path to [[KBCustomDSProcessNodeTask]].<field of type file>.

    Hint:  The most efficient way to extract data from a third-party system is to configure the extraction integration to construct JSON content that follows the DRUID structured JSON format. This allows structured content from external systems to be processed efficiently. Save the response in a JSON file. For details on the structure of the JSON file, see JSON File Structure.

  4. Add a custom code task to extract the content from the previous response and store it inside FileContent in the corresponding KB node (file node).
Data Extraction custom code template
(function main() {
    // Create the response entity expected by the KB Engine
    let responseEntity = EntityFactory.CreateEntityByName("KBCustomDSProcessNodeTask");

    // Create the node that will carry the extracted content
    responseEntity.Node = EntityFactory.CreateEntityByName("KBWebsitePage");
    responseEntity.Node.Context = {};

    // Copy the content from the previous task's response into FileContent on the node
    responseEntity.Node.Context.FileContent = Context.GetResponseEntity().Node.Content;

    // Set the response entity and complete the action
    Context.SetResponseEntity(responseEntity);
    Context.CompleteAction();
})();

You can add additional integration tasks, for example, to rephrase content using generative AI before it is stored on the node.
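
A minimal sketch of such a chain is shown below; the task name "Rephrase Content GPT" and its RephrasedText response field are purely illustrative and must be replaced with the names used in your own integration.

Generative AI chaining sketch
(function main() {

    // Content produced by the previous extraction step (its shape depends on
    // your integration; .Node.Content mirrors the template above).
    let extracted = Context.GetResponseEntity();

    // Hand the raw content to a generative AI task (hypothetical task name).
    Context.SetContextVariable("rawContent", extracted.Node.Content);
    Context.ExecuteTask("Rephrase Content GPT");

    // Read the rephrased text back (hypothetical response field) and store it
    // on the node, following the data extraction template above.
    let aiResponse = Context.GetResponseEntity();
    let responseEntity = EntityFactory.CreateEntityByName("KBCustomDSProcessNodeTask");
    responseEntity.Node = EntityFactory.CreateEntityByName("KBWebsitePage");
    responseEntity.Node.Context = {};
    responseEntity.Node.Context.FileContent = aiResponse.RephrasedText;

    Context.SetResponseEntity(responseEntity);
    Context.CompleteAction();

})();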

Step 3: Create a custom data source

To create a custom data source:

  1. On the Knowledge Base page, click Add Data Source.
  2. Enter a name and select Custom as the data source type.
  3. From the Crawl Integration field, select the integration created for crawling.
  4. From the Extract Integration field, select the integration created for extraction.
  5. Click Save.

Once the custom data source is set up:

  • Click Crawl to start the crawling process and retrieve all nodes and child nodes.
  • Click Extract to run the extraction integration and extract content from the external system.

JSON File Structure

For Custom data sources, third-party tools must support REST APIs for data exchange. Extracted content — whether in Word documents, Excel files, PDFs, or JSON format — must be mapped into a structured JSON file before integration.

The JSON file should follow this format:


JSON structure

[
  {
    "Title": "Sample Title",
    "Content": "Content 1"
  },
  {
    "Title": "Sample Title 2",
    "Content": "Content 2",
    "PageNumber": "3"
  },
  {
    "Title": "Sample Title 3",
    "Content": "Sample Content 3",
    "SheetName": "Sheet1"
  }
]

The following table describes each JSON property:

Property   | Required | Description
Title      | Yes      | The title of the content entry.
Content    | Yes      | The content to be added to the Knowledge Base.
PageNumber | No       | Relevant only when mapping data from PDFs or Word documents. Specifies the page number from which the content was extracted.
SheetName  | No       | Relevant only when mapping data from Excel files. Specifies the sheet name from which the content was extracted.
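
For reference, the sketch below shows one way an extraction custom code task could assemble content into this structure and store it on the node. The Articles collection and its Title and Body fields stand in for whatever your previous REST task returns and are assumptions for illustration.

Structured JSON assembly sketch
(function main() {

    // Response from the previous extraction task; the Articles collection and
    // its fields are illustrative assumptions.
    let extracted = Context.GetResponseEntity();

    // Build the structured JSON array expected by the DRUID KB Engine.
    let entries = [];
    for (let i = 0; i < extracted.Articles.Count; i++) {
        entries.push({
            "Title": extracted.Articles[i].Title,
            "Content": extracted.Articles[i].Body
        });
    }

    // Store the serialized JSON on the node, as in the data extraction template.
    let responseEntity = EntityFactory.CreateEntityByName("KBCustomDSProcessNodeTask");
    responseEntity.Node = EntityFactory.CreateEntityByName("KBWebsitePage");
    responseEntity.Node.Context = {};
    responseEntity.Node.Context.FileContent = JSON.stringify(entries);

    Context.SetResponseEntity(responseEntity);
    Context.CompleteAction();

})();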