File Extractors

File Extractors are DRUID-specific tools that help the Knowledge Base (KB) engine extract content from various file formats.

By default, the Standard Extractor is applied to all file MIME types, with the Basic Content Chunker enabled. You can switch the content chunker to an LLM Chunker for any file MIME type as needed.

This section provides an overview of the available file extractors for different MIME types, allowing you to select the most suitable option based on your data format.

You can configure file extractors at multiple levels:

  • Global level (applies to all files and data sources within the KB)
  • Unstructured data source level (specific to a data source)
  • Node and leaf level (granular customization within a data source tree)

Once you've selected the preferred file extractor(s), click Save to apply your changes.

From DRUID 8.11 onwards, the Modified Configuration icon indicates updates to advanced settings at both the data source level and the tree element level. In the Advanced Settings tab, the icon also appears next to each modified parameter, so authors can easily identify what has changed.

CSV Extractors

File Extractor | Description | When to use
Pan | The Pan Extractor is designed for handling complex, mixed, or loosely formatted CSV files. It is best suited for situations where the structure of the CSV is not strictly uniform, making it ideal for handling variations or irregularities in the data. | You need to extract data from CSV files with unstructured or inconsistent formats.
Structured | The Structured Extractor is optimized for clean, well-formed CSV files where each row follows a consistent format. It is faster and more efficient for extracting data when the file adheres to a regular, predefined structure. | You have well-organized and consistent CSV files with a fixed structure.
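
One rough way to judge whether a CSV file is "consistent" is to check that every row has the same number of fields. The sketch below is an illustrative heuristic only and is not part of DRUID; the function name suggest_csv_extractor and the fallback to Pan for empty input are assumptions made for this example.

```python
import csv
import io


def suggest_csv_extractor(csv_text: str) -> str:
    """Suggest "Structured" when all non-empty rows share one field count, else "Pan".

    Hypothetical helper: this is a heuristic for illustration, not DRUID logic.
    """
    # csv.reader yields an empty list for blank lines, so filter those out.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        # Assumption: with no rows to inspect, fall back to the more tolerant option.
        return "Pan"
    field_counts = {len(row) for row in rows}
    return "Structured" if len(field_counts) == 1 else "Pan"


print(suggest_csv_extractor("id,name\n1,Ann\n2,Bob\n"))
print(suggest_csv_extractor("id,name\n1,Ann,extra\n2\n"))
```

A uniform field count is a necessary sign of a fixed structure, not a sufficient one, so treat the result as a starting point and confirm by testing the extraction.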

DOC and DOCX Extractors

For DOC and DOCX file MIME types, DRUID supports only the Standard Extractor. This extractor processes text content while preserving the document’s basic structure, including headings, paragraphs, and lists. It ensures accurate text extraction for knowledge base indexing and search.

Starting from DRUID 8.3, the Standard Extractor can also extract images from DOC and DOCX files. In DRUID 8.13 and higher, image extraction is enabled by default.

In versions earlier than 8.13, enable image extraction manually: go to Advanced Settings JSON and, in the feature flags section of the JSON configuration (featureFlagTypes in the example below), set the value of the LinkAndPersistFileResourcesOnExtract flag to "true":

  "featureFlagTypes": [
    {
      "key": "LinkAndPersistFileResourcesOnExtract",
      "name": "LinkAndPersistFileResourcesOnExtract",
      "value": "true"
    },
    {
      "key": "ContentChunkerLlmInstructions",
      "name": "ContentChunkerLlmInstructions"
    }
  ]
Note: Extracted images are stored in DRUID storage and linked using a 30-minute authentication token. The image link is embedded in the extracted article paragraph, which gives temporary access to the image within the chat.
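
If you maintain the Advanced Settings JSON outside the portal, the flag change can be scripted. The following is a minimal sketch, assuming the settings are an object with a top-level featureFlagTypes array shaped like the fragment above; the function name enable_image_extraction is hypothetical and not a DRUID API.

```python
import json

FLAG = "LinkAndPersistFileResourcesOnExtract"


def enable_image_extraction(settings: dict) -> dict:
    """Set the image-extraction flag to "true", adding the entry if it is missing.

    Assumption: feature flags live in a top-level "featureFlagTypes" list of
    {"key", "name", "value"} objects, as in the documented fragment.
    """
    flags = settings.setdefault("featureFlagTypes", [])
    for flag in flags:
        if flag.get("key") == FLAG:
            flag["value"] = "true"  # the fragment stores the value as a string
            break
    else:
        # No existing entry for the flag: append one in the same shape.
        flags.append({"key": FLAG, "name": FLAG, "value": "true"})
    return settings


settings = {
    "featureFlagTypes": [
        {"key": "ContentChunkerLlmInstructions", "name": "ContentChunkerLlmInstructions"}
    ]
}
print(json.dumps(enable_image_extraction(settings), indent=2))
```

The function is idempotent, so running it against settings that already have the flag enabled leaves them unchanged.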

HTML Extractors

For HTML file MIME types, DRUID supports only the Standard Extractor. This extractor processes and extracts text content from HTML files while preserving the basic structure, such as headings, paragraphs, lists, and links. It removes formatting to ensure clean and efficient content extraction for the Knowledge Base.

PDF Extractors

DRUID supports the following extractors for PDF file MIME types: Elpis, Omni, Standard, and Structured. Each extractor is designed for different types of PDF documents, ensuring optimal content extraction based on document structure and format.

File Extractor | Description | When to use
Elpis | The Elpis Extractor is optimized for multimedia-rich PDFs. In DRUID 8.3 and higher, it can extract both text and images, making it ideal for documents that include diagrams, charts, and embedded visuals. | Your PDFs contain important images that should be accessible in the extracted content.
Omni | The Omni Extractor is specifically designed to extract content from structured PDFs. | You add structured PDFs to unstructured data sources and want to improve article quality.
Standard | The Standard Extractor is a general-purpose tool that extracts text content only, without preserving document structure or layout. It works well for simple PDFs without complex formatting. | You need basic text extraction without concerns about layout or formatting.
Structured | The Structured Extractor is optimized for PDFs with consistent formatting, ensuring accurate extraction of headings, tables, and paragraphs. | You need to extract text from highly structured PDFs with a defined layout.

The following applies to image extraction with the Elpis Extractor. In DRUID 8.13 and higher, image extraction is enabled by default. In versions earlier than 8.13, enable it manually: go to Advanced Settings JSON and, in the feature flags section of the JSON configuration (featureFlagTypes in the example below), set the value of the LinkAndPersistFileResourcesOnExtract flag to "true":

  "featureFlagTypes": [
    {
      "key": "LinkAndPersistFileResourcesOnExtract",
      "name": "LinkAndPersistFileResourcesOnExtract",
      "value": "true"
    },
    {
      "key": "ContentChunkerLlmInstructions",
      "name": "ContentChunkerLlmInstructions"
    }
  ]
Note: Extracted images are stored in DRUID storage and linked using a 30-minute authentication token. The image link is embedded in the extracted article paragraph, which gives temporary access to the image within the chat.

XLS, XLSM, and XLSX Extractors

DRUID provides multiple extractors for XLS, XLSM, and XLSX file types, each designed for different extraction needs. Choose the appropriate extractor based on your need for table structure, formatting, or bulk data extraction.

File Extractor | Description | When to use
Pan | Extracts content from Excel files while preserving table structures. | Use when maintaining the original table layout is important.
Structured | Extracts structured data by identifying patterns within rows and columns, ensuring a clean and organized output. | Ideal for extracting well-structured tables for better indexing and search accuracy.
Standard | Extracts text-based content while ignoring complex formatting or embedded objects. | Suitable for general text extraction without requiring table structure preservation.
Reader | Processes the entire spreadsheet and extracts data efficiently, including multiple sheets if applicable. | Best for bulk extraction where data needs to be read from multiple sheets.

PPTX and PPSX Extractors

For PPTX and PPSX file MIME types, DRUID supports only the Standard extractor for content processing.

Note: The extractor for PPTX and PPSX files is available from DRUID 8.11 onwards.

JSON Extractors

For JSON file MIME types, DRUID supports only the Structured extractor for content processing.

To add content from third-party systems (e.g., Salesforce, Confluence) to the Knowledge Base, you can use structured JSON files. This extraction method is available for Unstructured, File Repository, and Custom data sources starting from DRUID 8.9.

For Custom data sources, third-party tools must support REST APIs for data exchange. Extracted content — whether in Word documents, Excel files, PDFs, or JSON format — must be mapped into a structured JSON file before integration.
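
The mapping step might look like the sketch below. It assumes records already fetched from a third-party system; the input field names (subject, body, page, sheet) and the function name to_kb_entries are hypothetical, while the output property names (Title, Content, PageNumber, SheetName) are the ones described under JSON File Structure.

```python
import json


def to_kb_entries(records: list[dict]) -> list[dict]:
    """Map third-party records to Knowledge Base entries.

    The input keys are placeholders for whatever the source system returns.
    """
    entries = []
    for record in records:
        entry = {"Title": record["subject"], "Content": record["body"]}
        if record.get("page") is not None:
            # The documented sample stores the page number as a string.
            entry["PageNumber"] = str(record["page"])
        if record.get("sheet"):
            entry["SheetName"] = record["sheet"]
        entries.append(entry)
    return entries


records = [
    {"subject": "Refund policy", "body": "Refunds are issued within 14 days."},
    {"subject": "Shipping", "body": "Orders ship in 2 days.", "page": 3},
]
print(json.dumps(to_kb_entries(records), indent=2, ensure_ascii=False))
```

Optional properties are only written when the source record provides them, so entries that do not come from PDF or Excel sources stay minimal.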

JSON File Structure

The JSON file should follow this format:

[
  {
    "Title": "Sample Title",
    "Content": "Content 1"
  },
  {
    "Title": "Sample Title 2",
    "Content": "Content 2",
    "PageNumber": "3"
  },
  {
    "Title": "Sample Title 3",
    "Content": "Sample Content 3",
    "SheetName": "Sheet1"
  }
]

The following table provides the description of each JSON property:

Property | Required | Description
Title | Yes | The title of the content entry.
Content | Yes | The content to be added to the Knowledge Base.
PageNumber | No | Relevant only when mapping data from PDF documents. Specifies the page number from which the content was extracted.
SheetName | No | Relevant only when mapping data from Excel files. Specifies the name of the sheet from which the content was extracted.
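
Before uploading a file, it can help to check it against the required properties listed above. The following is a minimal sketch, not a DRUID tool; the function name validate_entries is hypothetical, and treating an empty or whitespace-only string as missing is an assumption.

```python
import json

REQUIRED = ("Title", "Content")


def validate_entries(entries) -> list[str]:
    """Return a list of problems found; an empty list means the entries look valid."""
    if not isinstance(entries, list):
        return ["Top-level JSON value must be an array of entries."]
    errors = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(f"Entry {index}: must be a JSON object.")
            continue
        for prop in REQUIRED:
            value = entry.get(prop)
            # Assumption: an empty or whitespace-only string counts as missing.
            if not isinstance(value, str) or not value.strip():
                errors.append(f"Entry {index}: required property '{prop}' is missing or empty.")
    return errors


sample = json.loads('[{"Title": "Sample Title", "Content": "Content 1"}, {"Title": "No content"}]')
for problem in validate_entries(sample):
    print(problem)
```

The optional PageNumber and SheetName properties are not checked here, because the table above does not constrain their values.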