Content Chunkers
The DRUID KB Engine supports two content chunking (article extraction) methods: Basic and LLM. The method you choose defines how content is divided into chunks for processing and how much detail is extracted from your documents.
Basic Content Chunking
This method extracts articles in chunks of 512 tokens (approximately 2,000 characters).
Use Case: Use this option for simple content extraction when you do not require deep semantic understanding. It’s ideal for smaller, less complex documents or when speed is a priority over nuanced content interpretation.
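To make the size figures above concrete, here is a minimal sketch of fixed-size chunking. It is not DRUID's implementation: the function name `basic_chunks` and the ratio of 4 characters per token are assumptions chosen so that 512 tokens map to roughly 2,000 characters, and a real tokenizer would produce different boundaries.

```python
def basic_chunks(text: str, max_tokens: int = 512, chars_per_token: int = 4) -> list[str]:
    """Split text into consecutive chunks of roughly max_tokens tokens.

    Tokens are only estimated from character counts (assumption: ~4 chars per token).
    """
    max_chars = max_tokens * chars_per_token  # 512 * 4 = 2048 characters
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        # Prefer to break at the last space inside the window so words stay whole.
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        start = end
    # Drop chunks that were only whitespace.
    return [c for c in chunks if c]


if __name__ == "__main__":
    sample = "word " * 1000
    for i, chunk in enumerate(basic_chunks(sample)):
        print(i, len(chunk))
```

Because the split is purely size-based, a chunk can end in the middle of a topic, which is the trade-off the LLM method is meant to address.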
LLM Content Chunking
LLM chunking uses a generative language model to split content into chunks that are meaningful and contextually coherent. Because the model identifies where topics begin and end, each chunk tends to be more relevant than a fixed-size chunk.
Adding a new LLM chunker
To add a new LLM chunker:
- Click Add new.
- Select LLM as the chunker type.
- Select the generative endpoint you want to use.
- Configure the chunker parameters (see details below).
- Click Save.
Adding Content Chunkers in DRUID Versions Prior to 8.4
In versions prior to DRUID 8.4, content chunkers can be added manually through the Advanced Settings JSON field. To do this:
- Navigate to the Advanced Settings JSON section.
- In the contentChunkers section, copy the structure of an existing chunker and paste it.
- Set the caption parameter to define the name of your new chunker.
- Save the changes, and your new chunker will appear in the Content Chunkers section of the UI.
- Select LLM from the Type field.
- Select the generative endpoint you want to use.
- Configure the chunker parameters (see details below).
- Click Save.
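The copy-and-rename steps for the Advanced Settings JSON can be sketched as follows. Only the `contentChunkers` section and the `caption` parameter come from the steps above; the assumption that `contentChunkers` is a list, the `type` field, and the helper name `add_chunker` are hypothetical placeholders, so check the structure of your own settings before editing them.

```python
import copy
import json


def add_chunker(settings: dict, caption: str) -> dict:
    """Return a copy of settings with a new chunker cloned from the first existing one."""
    updated = copy.deepcopy(settings)
    chunkers = updated["contentChunkers"]      # assumed to be a list of chunker objects
    new_chunker = copy.deepcopy(chunkers[0])   # copy the structure of an existing chunker
    new_chunker["caption"] = caption           # name shown in the Content Chunkers section
    chunkers.append(new_chunker)
    return updated


# Placeholder settings: every field except "contentChunkers" and "caption" is hypothetical.
settings = {"contentChunkers": [{"caption": "Basic", "type": "Basic"}]}
print(json.dumps(add_chunker(settings, "Legal documents (LLM)"), indent=2))
```

The sketch changes only the caption; the type, generative endpoint, and remaining parameters are then set in the UI as described in the steps above.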
Configuring LLM Chunker Parameters
Caption
Customize the name of the content chunker that will be displayed in the user interface. This helps differentiate between different chunkers if you have multiple configurations.
Max Tokens and Max Lines
These parameters are automatically populated when you select the generative endpoint. They are specific to the generative model being used. For optimal performance, you should adjust these values based on the capabilities of the selected generative endpoint.
Both Max Tokens and Max Lines are limits that can be applied in text processing, but they serve different purposes.
Max Tokens limits the number of tokens the model can extract for each chunk. A token is a sub-word unit of text, typically a few characters long, rather than a whole word or a single character. This limit controls the length of the model's output and prevents excessively long chunks.
Max Lines limits the number of lines to be extracted from each chunk regardless of token count. It structures the response into a fixed number of lines.
Example: If you use a model with a token limit of 4,096 tokens, you might want to set Max Tokens to 2,000 and Max Lines to 30 to ensure each chunk is appropriately sized without overwhelming the model.
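The interaction between the two limits can be illustrated with a short sketch that closes a chunk as soon as either limit would be exceeded. This is an illustration of the idea only, not how DRUID applies the parameters internally: the function name `group_lines` is hypothetical, and tokens are estimated at about 4 characters each rather than counted with the model's tokenizer.

```python
def group_lines(lines, max_tokens=2000, max_lines=30, chars_per_token=4):
    """Group lines into chunks that respect both a token limit and a line limit."""
    chunks, current, current_tokens = [], [], 0
    for line in lines:
        # Rough token estimate (assumption): ceiling of characters / chars_per_token.
        line_tokens = max(1, -(-len(line) // chars_per_token))
        too_many_lines = len(current) >= max_lines
        too_many_tokens = current_tokens + line_tokens > max_tokens
        if current and (too_many_lines or too_many_tokens):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    short_lines = ["abcd"] * 100    # the line limit is reached first
    long_lines = ["x" * 4000] * 10  # the token limit is reached first
    print([len(c) for c in group_lines(short_lines)])
    print([len(c) for c in group_lines(long_lines)])
```

With many short lines, Max Lines ends each chunk; with a few very long lines, Max Tokens does. Setting both keeps chunks bounded for either kind of document.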
Prompt
This parameter is a part of the system's code and cannot be edited. It defines how the model interprets and processes the chunking task.
Keep Table Structure
This option excludes tables from LLM chunking and processes them with DRUID’s proprietary chunking technology instead. Enable it to preserve the structure and readability of tables in the extracted content, because tables are handled more precisely this way than with LLM-based chunking.
Extended options for Content Chunker with LLM
These options are configured in the Advanced Settings JSON field under the ContentChunkerLlmOptions feature flag. By default, boolean fields are set to false. Set them to true to enable them.
| Option | Type | Description | Note |
|---|---|---|---|
| SystemPromptGenerateSummary | Boolean | Determines if the LLM content chunker generates a summary for each document paragraph during the initial chunking process. | |
| SystemPromptTrainGeneratedSummary | Boolean | Specifies whether the LLM uses the generated summary for document chunks in training sets. | This option is available under Advanced settings > Trainable elements. To provide maximum flexibility, you can configure it at the knowledge base (KB) or data source level (for any element type: root, node, or leaf). IMPORTANT! It requires SystemPromptGenerateSummary to be set to true; otherwise, the summary will not be generated. |
| SystemPromptGenerateSplitReasons | Boolean | Controls whether the LLM provides a logical explanation for content split decisions during the document processing phase. | |
| SystemPromptGenerateTags | Boolean | Enables the automatic generation of metadata tags for each chunk to improve searchability and filtering accuracy. | |
| SystemPromptGeneratedQuestionsNoForEvaluation | Integer | Defines the number of questions the LLM generates automatically for evaluation data purposes. | |
| SystemPromptGeneratedQuestionsNoForTrain | Integer | Defines the number of additional questions the LLM generates for training data to enhance model performance. | This option is available under Advanced settings > Trainable elements. To provide maximum flexibility, you can configure it at the knowledge base (KB) or data source level (for any element type: root, node, or leaf). |
To review the data generated by the LLM when any of these options are active, create a dedicated view and add it to a workspace.
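As an illustration, the sketch below assembles the options from the table into a JSON fragment and checks the one dependency the table calls out. The option names come from the table. The nesting under a single `ContentChunkerLlmOptions` key, the sample values, and the `validate` helper are assumptions. Note also that two of the options are configured under Advanced settings > Trainable elements, so the exact placement in your settings may differ.

```python
import json

# Sample values are illustrative only.
options = {
    "SystemPromptGenerateSummary": True,
    "SystemPromptTrainGeneratedSummary": True,  # depends on SystemPromptGenerateSummary
    "SystemPromptGenerateSplitReasons": False,
    "SystemPromptGenerateTags": True,
    "SystemPromptGeneratedQuestionsNoForEvaluation": 3,
    "SystemPromptGeneratedQuestionsNoForTrain": 2,
}

INTEGER_OPTIONS = (
    "SystemPromptGeneratedQuestionsNoForEvaluation",
    "SystemPromptGeneratedQuestionsNoForTrain",
)


def validate(opts: dict) -> list[str]:
    """Return a list of human-readable problems found in the options."""
    problems = []
    # Boolean fields default to false when omitted.
    if opts.get("SystemPromptTrainGeneratedSummary", False) and not opts.get(
        "SystemPromptGenerateSummary", False
    ):
        problems.append(
            "SystemPromptTrainGeneratedSummary requires SystemPromptGenerateSummary to be true"
        )
    for key in INTEGER_OPTIONS:
        value = opts.get(key, 0)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            problems.append(f"{key} must be a non-negative integer")
    return problems


assert validate(options) == []
print(json.dumps({"ContentChunkerLlmOptions": options}, indent=2))
```

Running a check like this before pasting the fragment helps catch the case where training on summaries is enabled but summaries are never generated.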
Customizing Content Chunker LLM Instructions
By default, DRUID provides generic instructions for LLM content chunkers. You can customize these instructions directly in the UI.
To modify the content chunker LLM instructions:
- In the KB Advanced Settings, click on the Content chunker LLM instructions section.
- Make your changes to the instructions as needed.
- Click the Save & Close button at the bottom of the page.
Starting with DRUID 9.6, the default instructions for LLM content chunkers ensure that article titles are always extracted in the same language as the source document. If you use a custom LLM prompt, add the following rules to it manually:
- Detect the main language of the document once at the beginning.
- All chapter titles MUST be written strictly in the same language as the document text.
- Do not translate, do not switch languages, and do not mix languages in chapter titles.
- If the document is entirely in English, every chapter title must also be in English.
- If the document is entirely in another language, every chapter title must also be in that same language.
- You are strictly forbidden to generate titles in a language different from the text.
To restore the updated default instructions, click Reset Advanced Settings (this action removes any custom changes).
In DRUID versions prior to 8.17, the Content chunker LLM instructions section is hidden by default.
To enable it:

