Content Chunkers

The DRUID KB Engine supports two types of content chunking (article extraction) methods: Basic and LLM. These options help define how content is divided into chunks for processing, allowing you to customize the level of detail extracted from documents.

Basic Content Chunking

This method extracts articles in chunks of 512 tokens (approximately 2,000 characters).
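
As a rough illustration of fixed-size chunking, the sketch below splits text into pieces of about 512 tokens. It is not DRUID's implementation: the function name `basic_chunks` is hypothetical, and tokens are approximated at 4 characters each, a ratio inferred from the 512-token / 2,000-character figure above. A real tokenizer would place the boundaries differently.

```python
# Assumption: ~4 characters per token, inferred from "512 tokens (approximately 2,000 characters)".
CHARS_PER_TOKEN = 4
MAX_TOKENS = 512


def basic_chunks(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split text into consecutive pieces of at most max_tokens * CHARS_PER_TOKEN characters."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    size = max_tokens * CHARS_PER_TOKEN
    # Slice the text at fixed offsets; the last piece may be shorter.
    return [text[i:i + size] for i in range(0, len(text), size)]


if __name__ == "__main__":
    sample = "x" * 5000
    print([len(chunk) for chunk in basic_chunks(sample)])  # [2048, 2048, 904]
```

Because the split ignores sentence and paragraph boundaries, a chunk can end mid-sentence, which is the trade-off the use case below describes.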

Use Case: Use this option for simple content extraction when you do not require deep semantic understanding. It’s ideal for smaller, less complex documents or when speed is a priority over nuanced content interpretation.

LLM Content Chunking

LLM chunking uses generative AI to extract more meaningful and contextually relevant chunks. It is more sophisticated than basic chunking: a language model identifies and extracts the key pieces of content, which improves the relevance of each chunk.

HINT: Starting with DRUID 8.19, LLM-based chunking is enabled by default to extract contextually relevant content chunks. If the default LLM resource (such as Becus) is unavailable, DRUID automatically falls back to an alternative model like Azure OpenAI GPT-4o-mini, if it's configured. If no generative resource is activated on your tenant, DRUID defaults to the basic (non-AI) chunking method. Resetting advanced settings restores LLM chunking using the available resource.
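
The fallback order described in the hint can be sketched as a small decision function. This is only an illustration of the documented order (default LLM resource, then an alternative configured model, then basic chunking); the function name, parameters, and return strings are hypothetical and do not correspond to any DRUID API.

```python
from typing import Optional


def select_chunking_method(default_resource: Optional[str],
                           fallback_resource: Optional[str]) -> str:
    """Return the chunking path that would be used, following the documented fallback order.

    Each argument is the name of an available generative resource, or None
    when that resource is unavailable or not configured (hypothetical model).
    """
    if default_resource is not None:
        return f"llm:{default_resource}"
    if fallback_resource is not None:
        return f"llm:{fallback_resource}"
    # No generative resource is active on the tenant.
    return "basic"


if __name__ == "__main__":
    print(select_chunking_method(None, "Azure OpenAI GPT-4o-mini"))
    print(select_chunking_method(None, None))
```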

Adding a new LLM chunker

To add a new LLM chunker:

Click Add new.
Select LLM as the chunker type.
Select the generative endpoint you want to use.
Configure the chunker parameters (see details below).
Click Save.

Adding Content Chunkers in DRUID Versions Prior to 8.4

In versions prior to DRUID 8.4, content chunkers can be added manually through the Advanced Settings JSON field. To do this:

Navigate to the Advanced Settings JSON section.
In the contentChunkers section, copy the structure of an existing chunker and paste it.
Set the caption parameter to define the name of your new chunker.
Save the changes, and your new chunker will appear in the Content Chunkers section of the UI.
Select LLM from the Type field.
Select the generative endpoint you want to use.
Configure the chunker parameters (see details below).
Click Save.
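
The copy-and-rename step can be pictured with the sketch below, which works on a Python dictionary standing in for the Advanced Settings JSON. Only the `contentChunkers` section and the `caption` parameter are named in this article; the list layout, the `type` key, and the sample values are assumptions made for illustration, so check them against the JSON in your own tenant.

```python
import copy
import json

# Hypothetical excerpt of the Advanced Settings JSON.
# Only "contentChunkers" and "caption" come from the documentation.
EXAMPLE_SETTINGS = {
    "contentChunkers": [
        {"caption": "Default chunker", "type": "LLM"}
    ]
}


def add_chunker(settings: dict, caption: str) -> dict:
    """Return a copy of settings with the first chunker duplicated under a new caption."""
    updated = copy.deepcopy(settings)
    chunkers = updated["contentChunkers"]
    if not chunkers:
        raise ValueError("no existing chunker to copy")
    if any(chunker.get("caption") == caption for chunker in chunkers):
        raise ValueError(f"a chunker named {caption!r} already exists")
    new_chunker = copy.deepcopy(chunkers[0])
    new_chunker["caption"] = caption
    chunkers.append(new_chunker)
    return updated


if __name__ == "__main__":
    print(json.dumps(add_chunker(EXAMPLE_SETTINGS, "Contracts chunker"), indent=2))
```

Working on a deep copy leaves the original settings untouched, which makes it easy to compare the two versions before pasting the result back.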

Configuring LLM Chunker Parameters

Caption

Set the name of the content chunker as displayed in the user interface. A distinct name helps you tell chunkers apart when you have multiple configurations.

Max Tokens and Max Lines

These parameters are automatically populated when you select the generative endpoint. They are specific to the generative model being used. For optimal performance, you should adjust these values based on the capabilities of the selected generative endpoint.

Both Max Tokens and Max Lines are limits that can be applied in text processing, but they serve different purposes.

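Max Tokens limits the number of tokens the model can extract from each chunk. A token is a unit of text, such as a word or part of a word (roughly four characters on average, based on the 512-token / 2,000-character figure above). This limit controls the length of the model's response and prevents excessively long outputs.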

NOTE: Max Tokens cannot exceed the remaining space in the Model Context Window Size after accounting for the input tokens. If your input takes up most of the context window, the maximum number of tokens for the output will be reduced. The Max Tokens parameter is not available for LLM chunkers.

Max Lines limits the number of lines to be extracted from each chunk regardless of token count. It structures the response into a fixed number of lines.

HINT: If you set both limits, the AI will generate up to Max Tokens but must also stop at Max Lines, whichever comes first.

Example: If you use a model with a token limit of 4,096 tokens, you might want to set Max Tokens to 2,000 and Max Lines to 30 to ensure each chunk is appropriately sized without overwhelming the model.
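
The interaction between the two limits and the context window can be sketched as follows. This is an illustration under stated assumptions, not DRUID's logic: both function names are hypothetical, and tokens are approximated as whitespace-separated words because the actual tokenizer depends on the selected model.

```python
def output_budget(context_window: int, input_tokens: int, max_tokens: int) -> int:
    """Effective output limit: Max Tokens capped by the space left in the context window."""
    remaining = max(context_window - input_tokens, 0)
    return min(max_tokens, remaining)


def apply_limits(text: str, max_tokens: int, max_lines: int) -> str:
    """Keep text until Max Tokens or Max Lines is reached, whichever comes first.

    Assumption: one token is one whitespace-separated word.
    """
    kept = []
    used = 0
    for line in text.splitlines()[:max_lines]:
        words = line.split()
        room = max_tokens - used
        if len(words) > room:
            # The token limit is hit inside this line: keep what still fits, then stop.
            if room > 0:
                kept.append(" ".join(words[:room]))
            break
        kept.append(line)
        used += len(words)
    return "\n".join(kept)


if __name__ == "__main__":
    # A 4,096-token window with 3,000 input tokens leaves less room than Max Tokens = 2,000.
    print(output_budget(4096, 3000, 2000))  # 1096
    print(apply_limits("a b c\nd e f\ng h i", max_tokens=100, max_lines=2))
```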

Prompt

This parameter is part of the system's code and cannot be edited. It defines how the model interprets and processes the chunking task.

Keep Table Structure

This option allows you to exclude tables from LLM chunking and process them with DRUID’s proprietary chunking technology. Enabling it helps preserve the structure and readability of tables in the extracted content, because tables are handled more precisely than with LLM-based chunking.

NOTE: This option is available in DRUID 8.4 and higher.

HINT: If your documents include structured tables or data that need to be accurately retained, consider enabling this option for better results.

Extended options for Content Chunker with LLM

NOTE: LLM content chunker extended options are available starting with DRUID version 9.12. These options may increase chunking execution time, but they improve matching quality at prediction time.

These options are configured in the Advanced Settings JSON field under the ContentChunkerLlmOptions feature flag. By default, boolean fields are set to false. Set them to true to enable them.

Option	Type	Description	Note
SystemPromptGenerateSummary	Boolean	Determines if the LLM content chunker generates a summary for each document paragraph during the initial chunking process.
SystemPromptTrainGeneratedSummary	Boolean	Specifies whether the LLM uses the generated summary for document chunks in training sets.	This option is available under Advanced settings > Trainable elements. To provide maximum flexibility, you can configure it at the knowledge base (KB) or data source level (for any element type: root, node, or leaf). IMPORTANT! It requires SystemPromptGenerateSummary to be set to true; otherwise, no summary is generated.
SystemPromptGenerateSplitReasons	Boolean	Controls whether the LLM provides a logical explanation for content split decisions during the document processing phase.
SystemPromptGenerateTags	Boolean	Enables the automatic generation of metadata tags for each chunk to improve searchability and filtering accuracy.
SystemPromptGeneratedQuestionsNoForEvaluation	Integer	Defines the number of questions the LLM generates automatically for evaluation data purposes.
SystemPromptGeneratedQuestionsNoForTrain	Integer	Defines the number of additional questions the LLM generates for training data to enhance model performance.	This option is available under Advanced settings > Trainable elements. To provide maximum flexibility, you can configure it at the knowledge base (KB) or data source level (for any element type: root, node, or leaf).

To review the data generated by the LLM when any of these options is active, create a dedicated view and add it to a workspace.
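
Before pasting option values into the Advanced Settings JSON, it can help to check them against the table above. The sketch below does that for a flat dictionary of option values. The option names and types come from the table; the function name `validate_llm_options` is hypothetical, and the sketch does not model where each option is stored (two of them are configured under Trainable elements, as noted in the table).

```python
import json

BOOLEAN_OPTIONS = (
    "SystemPromptGenerateSummary",
    "SystemPromptTrainGeneratedSummary",
    "SystemPromptGenerateSplitReasons",
    "SystemPromptGenerateTags",
)
INTEGER_OPTIONS = (
    "SystemPromptGeneratedQuestionsNoForEvaluation",
    "SystemPromptGeneratedQuestionsNoForTrain",
)


def validate_llm_options(overrides: dict) -> dict:
    """Return the options with boolean defaults filled in, or raise on an invalid setting."""
    # Boolean fields default to false, as stated in the documentation.
    options = {name: False for name in BOOLEAN_OPTIONS}
    for name, value in overrides.items():
        if name in BOOLEAN_OPTIONS:
            if not isinstance(value, bool):
                raise TypeError(f"{name} must be true or false")
        elif name in INTEGER_OPTIONS:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                raise TypeError(f"{name} must be a non-negative integer")
        else:
            raise KeyError(f"unknown option: {name}")
        options[name] = value
    # The trained summary only exists if summary generation is enabled.
    if options["SystemPromptTrainGeneratedSummary"] and not options["SystemPromptGenerateSummary"]:
        raise ValueError("SystemPromptTrainGeneratedSummary requires SystemPromptGenerateSummary")
    return options


if __name__ == "__main__":
    checked = validate_llm_options({
        "SystemPromptGenerateSummary": True,
        "SystemPromptTrainGeneratedSummary": True,
        "SystemPromptGeneratedQuestionsNoForTrain": 3,
    })
    print(json.dumps(checked, indent=2))
```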

Customizing Content Chunker LLM Instructions

By default, DRUID provides generic instructions for LLM content chunkers. You can customize these instructions directly in the UI.

To modify the content chunker LLM instructions:

In the KB Advanced Settings, click the Content chunker LLM instructions section.
Make your changes to the instructions as needed.
Click the Save & Close button at the bottom of the page.

Starting with DRUID 9.6, the default instructions for LLM content chunkers have been updated to ensure article titles are always extracted in the same language as the source document. If you use a custom LLM prompt, you can add these rules manually.

- Detect the main language of the document once at the beginning.
- All chapter titles MUST be written strictly in the same language as the document text.
- Do not translate, do not switch languages, and do not mix languages in chapter titles.
- If the document is entirely in English, every chapter title must also be in English.
- If the document is entirely in another language, every chapter title must also be in that same language.
- You are strictly forbidden to generate titles in a language different from the text.

To restore the updated default instructions, click Reset Advanced Settings (this action removes any custom changes).

In DRUID versions prior to 8.17, the Content chunker LLM instructions section is hidden by default.

To enable it:

Go to the Advanced Settings JSON section.
Under the featureFlags object, locate the ContentChunkerLlmInstructions property.
Set the caption value to a string, such as "null".
Click the Save & Close button at the bottom of the page.
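
The edit described in these steps can be pictured with the sketch below, which sets the caption on a Python dictionary standing in for the Advanced Settings JSON. The `featureFlags` object, the `ContentChunkerLlmInstructions` property, and the `caption` value are named in the steps above; the function name and the assumption that the property holds a JSON object are illustrative only.

```python
import copy
import json


def enable_instructions_section(settings: dict, caption: str = "null") -> dict:
    """Return a copy of settings with a caption set on ContentChunkerLlmInstructions."""
    updated = copy.deepcopy(settings)
    flags = updated.setdefault("featureFlags", {})
    flag = flags.get("ContentChunkerLlmInstructions")
    if not isinstance(flag, dict):
        # Assumption: the property is a JSON object; create one if it is missing or null.
        flag = {}
        flags["ContentChunkerLlmInstructions"] = flag
    flag["caption"] = caption
    return updated


if __name__ == "__main__":
    before = {"featureFlags": {"ContentChunkerLlmInstructions": {"caption": None}}}
    print(json.dumps(enable_instructions_section(before), indent=2))
```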