High Availability On-Premises Deployment
Druid HA deployments run on industry-standard Kubernetes technology. The setup described here is sized for light to moderate chat traffic: an average of 100 messages per minute, with occasional spikes of up to 300 messages per minute, and no significant load on the Druid Connector.
Standard Deployment Architecture Diagram
Components Description
Name | Description | Type | Details |
---|---|---|---|
Provisioning | Used for provisioning bot-related resources. | Druid | Provisions the bot itself and its channels, and manages export/import of authored elements (dialogs, integrations, entities, etc.). |
APC | Admin Portal, used for administration of bot solutions, users, tenants, etc. | Druid | Hosts the web portal interface for chatbot authoring and user management. |
Kibana | Used for log investigation. | Third-party | The web application used to explore the Druid applications’ technical logs, which are stored in the Elasticsearch database. Image: docker.elastic.co/kibana/kibana |
Elasticsearch | Used for log storage. | Third-party | The Elasticsearch (time-series) database where all Druid applications’ logs are gathered. Image: docker.elastic.co/elasticsearch/elasticsearch |
RabbitMQ | Message broker for communication between Druid applications. | Third-party | The communication protocol is AMQPS. Image: rabbitmq |
Redis | Used for memory cache. | Third-party | Druid uses it as an in-memory data store for several applications, for multi-instance synchronization (High Availability) of the Druid applications, and for the internal notifications system. Image: redis/redis-stack-server |
Nginx | Inbound traffic to the Druid platform. | Third-party | The only way to interact with the Druid platform. Images: registry.k8s.io/ingress-nginx/controller, registry.k8s.io/ingress-nginx/kube-webhook-certgen |
Grafana | Used for dashboards. | Third-party | The GUI used to explore the monitoring KPIs. Image: grafana/grafana |
Prometheus | Used for metrics collection and storage. | Third-party | Manages a time-series database, which is automatically updated by the Druid applications. Images: quay.io/prometheus/node-exporter, k8s.gcr.io/kube-state-metrics/kube-state-metrics, quay.io/prometheus/alertmanager, jimmidyson/configmap-reload, quay.io/prometheus/prometheus |
BotService | Message manager for the chatbot. | Druid | The main messaging endpoint for the DirectLine channel (public webchats communicate with this service to transfer user and chatbot messages). |
FlowEngine | Used for chat session flows. | Druid | The main dialog management engine, which executes the configured dialogs in all chat conversations. |
Endpoints | Flow starter for external apps. | Druid | Hosts the endpoints that allow third-party applications (e.g., RPAs, electronic signature solutions, etc.) to integrate with the Druid conversational engine. |
BotApp | The chatbot. | Druid | The message dispatcher between public communication channels (e.g., WhatsApp, Facebook, Viber, etc.) and the conversational engine (the Flow Engine). In practice, every conversation passes through BotApp, which forwards messages to the right channel. |
Connector | Used for integration with enterprise services. | Druid | The main automation service, which performs all activities related to data exchange between the conversational engine and third-party applications, databases, etc., through specific interfaces (e.g., REST, SOAP, SQL, MSCRM, AZ Blob Storage, document generator, file download). The Connector also persists conversation transcripts to the history database. |
ML Api Gateway | Proxies calls between ML services and their clients. | Druid | Proxies the calls from FlowEngine and APC to ML Model Serving and ML Model Training. |
ML Model Serving | Resolves NLU predict requests. | Druid | Acts as an active NLP engine, responding to intent/entity predict requests based on the NLU models provided by ML Model Training. |
ML Model Training | Creates NLU models. | Druid | Generates NLU models based on training phrases from APC. |
Ignite | Persisted cache for the conversational engine. | Druid | Used mainly for managing conversation users. |
Antimalware | File signature checker. | Druid | Used by the druidflowengine component to verify a file's signature against its extension and to validate the extension against the supported extensions: pdf, png, jpg, jpeg, doc, docx, xls, xlsx, odt, ods, tiff, tif, mp3, mp4, mkv, webm, txt, json, csv. It can also be integrated with any third-party antimalware system that is compliant with the AMSI interface. |
API | Conversational authorizer and live agent notification service. | Druid | Exposes web sockets for the Druid live agent webpage to manage live chat notifications. It also hosts lightweight web resources for certain chat functionality, such as sensitive data input, SSO auth, etc. |
BotApi | Used for managing message statuses. | Druid | Statuses: Sent, Received, Read. |
Dataservice | Druid proprietary solution for conversational context storage. | Druid | Used to persist DRUID entity records created and managed within the DRUID Platform, simplifying record authoring. |
Webview | Conversational Business Applications. | Druid | Hosts the Druid CBA interface. |
ContactCenterIntegration | Integrates the Flow Engine with third-party contact center solutions. | Druid | E.g., Oracle B2C, Amazon Connect, FreshChat, SalesForce, etc. |
Knowledgebase Agent | Main knowledgebase engine. | Druid | Manages KB-related requests (web crawl, document extraction, embedding, train, and predict). |
Knowledgebase API | Proxies calls between KB services and their clients. | Druid | Proxies the calls from FlowEngine and APC to Knowledgebase Agent and Connector. |
Service Gateway | Proxies calls between the KB agent and embeddings servers (Triton, etc.). | Druid | Through Service Gateway, embeddings services are offered "as a service" to requesting clients (KB agent, Model Serving, and others). |
MongoDB | Databases used by Knowledgebase Agent and Dataservice. | Third-party | Image: mongo |
Triton | NVIDIA AI model. | Druid | Generates semantic embeddings for ML and KB services. |
vLLM | Generative AI server. | Third-party | Used with the Druid Knowledgebase service to generate completions over KB responses. |
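
The Antimalware entry above describes two checks: the file extension must be in the supported list, and the file's signature must agree with that extension. The following is a minimal sketch of that idea, not Druid's actual implementation. The function name `check_file`, the result labels, and the small `MAGIC_BYTES` table (a handful of well-known leading byte sequences) are assumptions made for illustration only.

```python
from pathlib import Path

# Supported extensions, copied from the Antimalware description above.
SUPPORTED_EXTENSIONS = {
    "pdf", "png", "jpg", "jpeg", "doc", "docx", "xls", "xlsx", "odt", "ods",
    "tiff", "tif", "mp3", "mp4", "mkv", "webm", "txt", "json", "csv",
}

# Well-known leading "magic bytes" for a few formats.
# Illustrative subset only; a real checker would cover every supported type.
MAGIC_BYTES = {
    "pdf": [b"%PDF-"],
    "png": [b"\x89PNG\r\n\x1a\n"],
    "jpg": [b"\xff\xd8\xff"],
    "jpeg": [b"\xff\xd8\xff"],
    "docx": [b"PK\x03\x04"],  # OOXML documents are ZIP containers
    "xlsx": [b"PK\x03\x04"],
    "tif": [b"II*\x00", b"MM\x00*"],
    "tiff": [b"II*\x00", b"MM\x00*"],
}


def check_file(name: str, content: bytes) -> str:
    """Return 'unsupported', 'unchecked', 'ok', or 'mismatch' (hypothetical labels)."""
    ext = Path(name).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        return "unsupported"
    signatures = MAGIC_BYTES.get(ext)
    if signatures is None:
        # This sketch has no signature for the type (e.g. txt, csv).
        return "unchecked"
    if any(content.startswith(sig) for sig in signatures):
        return "ok"
    return "mismatch"
```

For example, a file named `report.pdf` whose bytes start with a PNG header would be reported as a mismatch, which is the kind of disguised upload this check is meant to catch.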
H/W and S/W requirements - Non-Cloud Specifications
Production Environment
# | Item | Qty (Nodes) | OS | CPU (Intel Xeon) | RAM | SSD | Data | Notes |
---|---|---|---|---|---|---|---|---|
1 | App Server - The host of the Druid platform | 5 | Linux, min. kernel 3.10, e.g., Ubuntu 18.04 LTS or RedHat 7.4 (or newer/equivalent) | 8 vCPU | 32 GB | OS 120 GB | 100 GB (Scale as required) | Kubernetes Cluster (min version 1.19) |
2 | App Server – Druid semantic classification machine | 1 | Linux, min. kernel 3.10, e.g., Ubuntu 18.04 LTS or RedHat 7.4 (or newer/equivalent) | 4 vCPU | 8 GB | OS 120 GB | 50 GB (Scale as required) | NVIDIA 16 GB GPU with compute capability 7.5 (e.g., T4, V100, P100) |
3 | App Server – LLM Service for Gen.AI | 1 | Linux, min. kernel 3.10, e.g., Ubuntu 18.04 LTS or RedHat 7.4 (or newer/equivalent) | 8 vCPU | 32 GB | OS 120 GB | 200 GB (Scale as required) | 2 × NVIDIA A100 80 GB GPU |
4 | Microsoft server (App server + Land bot page) | 1 | Windows Server 2019+, with updates up to date | 2 vCPU | 8 GB | OS 120 GB | - | ASP.NET 4.6.1. IIS hosting is required (Dedicated or shared) |
5 | Microsoft SQL server (DB server) | 1 | Windows Server 2019+, with updates up to date | 4 vCPU | 16 GB | OS 120 GB | 400 GB (Scale as required) | Microsoft SQL Server Enterprise 2019+ Enterprise Database Service (Dedicated or shared) |
6 | Dedicated storage – container and infrastructure storage | - | - | - | - | - | 100 GB (Scale as required) | Dedicated or shared - NFS |
Testing Environment
# | Item | Qty (Nodes) | OS | CPU (Intel Xeon) | RAM | SSD | Data | Notes |
---|---|---|---|---|---|---|---|---|
1 | App Server - The host of the Druid platform | 1 | Linux, min. kernel 3.10, e.g., Ubuntu 18.04 LTS or RedHat 7.4 (or newer/equivalent) | 10 vCPU | 40 GB | OS 120 GB | 100 GB (Scale as required) | Kubernetes Cluster (min version 1.19) |
2 | App Server – Druid semantic classification machine | 1 | Linux, min. kernel 3.10, e.g., Ubuntu 18.04 LTS or RedHat 7.4 (or newer/equivalent) | 4 vCPU | 8 GB | OS 120 GB | 50 GB (Scale as required) | NVIDIA 16 GB GPU with compute capability 7.5 (optional for the testing environment) |
3 | App Server – LLM Service for Gen.AI | 1 | Linux, min. kernel 3.10, e.g., Ubuntu 18.04 LTS or RedHat 7.4 (or newer/equivalent) | 8 vCPU | 32 GB | OS 120 GB | 200 GB (Scale as required) | 2 × NVIDIA A100 80 GB GPU |
4 | Microsoft test server (App server + Land bot page) | 1 | Windows Server 2016+, with updates up to date | 2 vCPU | 8 GB | OS 120 GB | - | ASP.NET 4.6.1. IIS hosting is required (Dedicated or shared) |
5 | Microsoft SQL server (DB server) | 1 | Windows Server 2016+, with updates up to date | 2 vCPU | 8 GB | OS 120 GB | 50 GB (Scale as required) | Microsoft SQL Server Standard 2019+ Database Service (Dedicated or shared) |
Testing Environment non-GPU specs
# | Item | Qty (Nodes) | OS | CPU (Intel Xeon) | RAM | SSD | Data | Notes |
---|---|---|---|---|---|---|---|---|
1 | App Server - The host of the Druid platform | 1 | Linux, min. kernel 3.10, e.g., Ubuntu 18.04 LTS or RedHat 7.4 (or newer/equivalent) | 16 vCPU | 64 GB | OS 120 GB | 150 GB (Scale as required) | Kubernetes Cluster (min version 1.19) |
2 | App Server – LLM Service for Gen.AI | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
3 | Microsoft test server (App server + Land bot page) | 1 | Windows Server 2016+, with updates up to date | 2 vCPU | 8 GB | OS 120 GB | - | ASP.NET 4.6.1. IIS hosting is required (Dedicated or shared) |
4 | Microsoft SQL server (DB server) | 1 | Windows Server 2016+, with updates up to date | 2 vCPU | 8 GB | OS 120 GB | 50 GB (Scale as required) | Microsoft SQL Server Standard 2019+ Database Service (Dedicated or shared) |
H/W and S/W requirements - Cloud (Azure, EKS, etc.)
Production Environment
# | Item | Qty (Nodes) | OS | CPU (Intel Xeon) | RAM | SSD | Data | Notes |
---|---|---|---|---|---|---|---|---|
1 | App Server - The host of the Druid platform | 5 | Cloud specific | 8 vCPU | 32 GB | Cloud specific | - | Kubernetes Cluster (min version 1.19) |
2 | App Server – Druid semantic classification machine | 1 | Cloud specific | 4 vCPU | 8 GB | Cloud specific | - | NVIDIA 16 GB GPU with compute capability 7.5 (e.g., T4, V100, P100) |
3 | App Server – LLM Service for Gen.AI | 1 | Cloud specific | 8 vCPU | 32 GB | Cloud specific | - | 2 × NVIDIA A100 80 GB GPU |
4 | Microsoft server (App server + Land bot page) | 1 | Windows Server 2016+, with updates up to date | 2 vCPU | 8 GB | OS 120 GB | - | ASP.NET 4.6.1. IIS hosting is required (Dedicated or shared) |
5 | Microsoft SQL server (DB server) | 1 | Windows Server 2016+, with updates up to date | 4 vCPU | 16 GB | OS 120 GB | 400 GB | Microsoft SQL Server Enterprise 2019+ Enterprise Database Service (Dedicated or shared) |
6 | Network disks | - | - | - | - | - | 700 GB | Cumulative for the entire platform. |
Testing Environment
# | Item | Qty (Nodes) | OS | CPU (Intel Xeon) | RAM | SSD | Data | Notes |
---|---|---|---|---|---|---|---|---|
1 | App Server - The host of the Druid platform | 1 | Cloud specific | 10 vCPU | 40 GB | Cloud specific | - | Kubernetes Cluster (min version 1.19) |
2 | App Server – Druid semantic classification machine | 1 | Cloud specific | 4 vCPU | 8 GB | Cloud specific | - | NVIDIA 16 GB GPU with compute capability 7.5 (e.g., T4, V100, P100) |
3 | App Server – LLM Service for Gen.AI | 1 | Cloud specific | 8 vCPU | 32 GB | Cloud specific | - | 2 × NVIDIA A100 80 GB GPU |
4 | Microsoft test server (App server + Land bot page) | 1 | Windows Server 2016+, with updates up to date | 2 vCPU | 8 GB | OS 120 GB | - | ASP.NET 4.6.1. IIS hosting is required (Dedicated or shared) |
5 | Microsoft SQL server (DB server) | 1 | Windows Server 2016+, with updates up to date | 2 vCPU | 8 GB | OS 120 GB | 50 GB (Scale as required) | Microsoft SQL Server Standard 2019+ Database Service (Dedicated or shared) |
6 | Network disks | - | - | - | - | - | 300 GB (Scale as required) | Cumulative for the entire platform. |
DRUID Platform DB Server - Additional software requirements
- SQL Server instance attributes:
  - Collation: Latin1_General_CI_AS
  - Windows and SQL Server Authentication mode enabled.
  - TCP protocol enabled (in SQL Server Configuration Manager).
  - SQL Server port open in the firewall of the DB Server. It must be a fixed port, not a dynamically allocated one.
- SQL Server Management Studio (SSMS). Alternatively, Azure Data Studio or the osql utility can be used to run the T-SQL statements required during installation.
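
Before installation, it can help to confirm from an App Server node that the fixed SQL Server port is actually reachable through the firewall. The sketch below uses only the Python standard library; the function name `is_tcp_port_open` is hypothetical, and the host and port you pass in are your own values (1433 is SQL Server's default port, but this document does not state which port your instance uses).

```python
import socket


def is_tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A call such as `is_tcp_port_open("db-server.example.com", 1433)` (placeholder host name) only proves that the TCP port accepts connections; it does not validate the collation or the authentication mode.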
Detailed component CPU and memory requests and limits
Pod Name | Mem Req. | CPU Req. | Mem Lim. | CPU Lim. |
---|---|---|---|---|
Antimalware | 512 | 100 | 512 | 1000 |
ApcBack | 1536 | 500 | 4096 | 2000 |
Apc | 100 | 100 | 384 | 250 |
Api | 512 | 100 | 1024 | 1000 |
BotApi | 512 | 100 | 1024 | 1000 |
BotApp | 768 | 100 | 1536 | 1000 |
BotService | 512 | 100 | 1024 | 1000 |
Connector | 768 | 200 | 2048 | 2000 |
Dataservice | 512 | 100 | 1024 | 1000 |
Endpoints | 512 | 100 | 1024 | 1000 |
Flow Engine | 1024 | 250 | 2048 | 2000 |
Ignite | 512 | 100 | 5120 | 1500 |
Knowledgebase API | 512 | 100 | 1024 | 1000 |
Knowledgebase Agent | 3584 | 600 | 15360 | 6000 |
ML Api Gateway | 512 | 100 | 1024 | 1000 |
ML Model Serving | 512 | 100 | 2048 | 1000 |
ML Model Training | 2048 | 500 | 4096 | 2000 |
Migrator | Best Effort | | | |
Provisioning | 512 | 50 | 1024 | 400 |
Service Gateway | 512 | 100 | 1024 | 1000 |
Webview | 512 | 100 | 1024 | 1000 |
RabbitMQ | 2048 | 1000 | 2048 | 1000 |
Redis | 256 | 200 | 1024 | 1000 |
Elasticsearch | 2048 | 500 | 2048 | 1000 |
Kibana | 512 | 100 | 1024 | 500 |
Nginx | 90 | 100 | | |
MongoDB | 2048 | 250 | 2048 | 1000 |
Triton Server | 512 | 100 | 6144 | 3500 |
Triton Models | Best Effort | | | |
Grafana | 512 | 300 | 1024 | 2000 |
Prometheus Node Exporter | Best Effort | | | |
Prometheus Server | Best Effort | | | |
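
The requests and limits above determine both how much capacity the scheduler reserves and which Kubernetes QoS class each pod receives. The sketch below sums the requests for a few rows of the table and derives the QoS class using the standard Kubernetes rules. It rests on assumptions not stated in this document: memory values are taken to be MiB and CPU values millicores, each pod is treated as a single container, and "Best Effort" rows are read as having no requests or limits. The names `PodResources`, `qos_class`, and `total_requests` are hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class PodResources(NamedTuple):
    mem_req: Optional[int]  # assumed MiB
    cpu_req: Optional[int]  # assumed millicores
    mem_lim: Optional[int]  # assumed MiB
    cpu_lim: Optional[int]  # assumed millicores


# A few rows copied from the table above (None = not set).
PODS: Dict[str, PodResources] = {
    "Connector": PodResources(768, 200, 2048, 2000),
    "Flow Engine": PodResources(1024, 250, 2048, 2000),
    "RabbitMQ": PodResources(2048, 1000, 2048, 1000),
    "Nginx": PodResources(90, 100, None, None),
    "Migrator": PodResources(None, None, None, None),
}


def qos_class(pod: PodResources) -> str:
    """Kubernetes QoS class for a single-container pod."""
    if all(value is None for value in pod):
        return "BestEffort"
    fully_specified = None not in pod
    if fully_specified and pod.mem_req == pod.mem_lim and pod.cpu_req == pod.cpu_lim:
        return "Guaranteed"
    return "Burstable"


def total_requests(pods: Dict[str, PodResources]) -> Tuple[int, int]:
    """Sum of (memory, CPU) requests; unset requests count as zero."""
    mem = sum(p.mem_req or 0 for p in pods.values())
    cpu = sum(p.cpu_req or 0 for p in pods.values())
    return mem, cpu
```

Extending `PODS` with every row, and multiplying by the replica count you plan for each component, gives a rough lower bound on the capacity the cluster nodes must provide.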
Network Communication Matrix
Source (Name, IP, URL, etc.) | Destination (Name, IP, URL, etc.) | Protocol | Port | Function | Used For |
---|---|---|---|---|---|
App Server* | druidcontainerregistry.azurecr.io | HTTPS | 443 | Druid Container Registry | Installation |
App Server* | api.dso.docker.com, api.segment.io, auth.docker.io, cdn.auth0.com, cdn.segment.com, desktop.docker.com, docker-pinata-support.s3.amazonaws.com, docker.elastic.co, hub.docker.com, k8s.gcr.io, login.docker.com, mcr.microsoft.com, notify.bugsnag.com, nvcr.io, production.cloudflare.docker.com, quay.io, registry-1.docker.io, sessions.bugsnag.com | HTTPS | 443 | Third-party Containers | Installation |
WebApp (public) | druidapc.{{domain}}**, druidapcback.{{domain}}, druidapi.{{domain}}, druidbapi.{{domain}}, druidbs.{{domain}} | HTTPS | 443 | Chatbot interaction | Utilization |
Intranet*** | druidapc.{{domain}}, druidapcback.{{domain}}, druidapi.{{domain}}, druidbapi.{{domain}}, druidbapp.{{domain}}, druidbs.{{domain}}, druidep.{{domain}}, druidkib.{{domain}}, druidrmq.{{domain}} | HTTPS | 443 | Platform administration | Utilization |
App Server (Connector) | <TBD> | <TBD> | <TBD> | Enterprise Services | Utilization |
* This entry is required at installation or upgrade time so that the Kubernetes engine can automatically download the needed binaries.
** If the client does not want to expose the APC component, some specific files must be downloaded from APC and made accessible (as resources) to WebApp. The DRUID team will provide the list. The one downside is that the files must be copied to WebApp again during every DRUID Platform upgrade.
*** Dedicated names for Intranet-only access can be accommodated; this requires additional certificates.
Applications’ Technical Users
Application | User | Notes |
---|---|---|
Druid APC | admin | Used for platform administration. |
Druid APC | {{WEB-API-USER-NAME}} | Used for programmatic access to the platform API. Password parameter: {{WEB-API-USER-PASSWORD}} |
RabbitMQ | {{RMQ-USER}} | Used for queue administration, mainly for troubleshooting. Password parameter: {{RMQ-PASSWORD}} |
Kibana | {{KIBANA-USER}} | Used for exploring logs, mainly for troubleshooting. Password parameter: {{KIBANA-PASSWORD}} |
BotApp, BotService | **** | Password only. Used by BotApp to authenticate to BotService (both are Druid components). It cannot be used from outside. Parameter: {{BOTSERVICE-PASSWORD}} |
Redis | **** | Password only. It cannot be used from outside. Parameter: {{REDIS-PASSWORD}} |
Endpoints | **** | Password only. Parameter: {{ENDPOINTS-PASSWORD}} |
DNS entries
DNS registration of Druid service FQDNs: please register the following FQDNs in your DNS and provide us with the list (examples are given for the first two entries; extrapolate for the rest).
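Domain | Type | Name | Value (IP addresses) | FQDN |
---|---|---|---|---|
{{DOMAIN}} | A | Kibana | {{APP-SERVER-IP}} | druidkib.example.com |
{{DOMAIN}} | A | RabbitMQ | {{APP-SERVER-IP}} | druidrmq.example.com |
{{DOMAIN}} | A | Apc | {{APP-SERVER-IP}} | druidapc.{{domain}} |
{{DOMAIN}} | A | ApcBack | {{APP-SERVER-IP}} | druidapcback.{{domain}} |
{{DOMAIN}} | A | Api | {{APP-SERVER-IP}} | druidapi.{{domain}} |
{{DOMAIN}} | A | BotAPI | {{APP-SERVER-IP}} | druidbapi.{{domain}} |
{{DOMAIN}} | A | BotApp | {{APP-SERVER-IP}} | druidbapp.{{domain}} |
{{DOMAIN}} | A | BotService | {{APP-SERVER-IP}} | druidbs.{{domain}} |
{{DOMAIN}} | A | EndPoints | {{APP-SERVER-IP}} | druidep.{{domain}} |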
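
The FQDNs in the table follow a fixed pattern: a per-service prefix followed by your domain. As a convenience, the full list can be generated rather than typed by hand. The sketch below copies the prefixes from the table; the function name `druid_fqdns` is hypothetical.

```python
from typing import Dict

# Host name prefixes copied from the DNS table above.
SERVICE_PREFIXES: Dict[str, str] = {
    "Kibana": "druidkib",
    "RabbitMQ": "druidrmq",
    "Apc": "druidapc",
    "ApcBack": "druidapcback",
    "Api": "druidapi",
    "BotAPI": "druidbapi",
    "BotApp": "druidbapp",
    "BotService": "druidbs",
    "EndPoints": "druidep",
}


def druid_fqdns(domain: str) -> Dict[str, str]:
    """Map each service name to its FQDN for the given domain."""
    # Normalise input such as " Example.COM. " to "example.com".
    domain = domain.strip().strip(".").lower()
    return {name: f"{prefix}.{domain}" for name, prefix in SERVICE_PREFIXES.items()}
```

Calling `druid_fqdns("example.com")` yields the nine names to register as A records pointing at the App Server IP.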
SSL Certificate
To access the Druid platform over HTTPS, the SSL certificate(s) must be prepared in advance. The certificate(s) must cover all names defined in the “DNS entries” section above.
You can provide one or more certificates. The following approaches are valid for the Druid platform (we strongly recommend the last two options):
- Multiple certificates: One certificate for each service in the list of names.
- A single certificate with multiple hosts (CN or SANs).
- A wildcard certificate.
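
Whichever option is chosen, every registered FQDN must be matched by some name in the certificate(s). The sketch below checks a list of certificate names (CN/SAN entries) against a list of host names, treating a leading `*.` as covering exactly one left-most label, which is how wildcard certificates are normally matched. The function names are hypothetical, and the certificate names are supplied by you; the sketch does not parse certificate files.

```python
from typing import Iterable, List


def name_matches(pattern: str, hostname: str) -> bool:
    """True if a certificate name (optionally '*.domain') covers the hostname."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if not pattern.startswith("*."):
        return pattern == hostname
    # A wildcard covers a single label: "*.example.com" matches
    # "druidapc.example.com" but not "a.b.example.com" or "example.com".
    first_label, _, rest = hostname.partition(".")
    return bool(first_label) and rest == pattern[2:]


def uncovered_names(cert_names: Iterable[str], hostnames: Iterable[str]) -> List[str]:
    """Return the hostnames that no certificate name covers."""
    cert_names = list(cert_names)
    return [h for h in hostnames if not any(name_matches(p, h) for p in cert_names)]
```

An empty result means the supplied certificate names cover every host; any names returned still need a certificate or an additional SAN entry.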
Specific component needs
Component | Storage Class: RWO | Storage Class: RWX (ReadWriteMany) | Ingress | Load Balancer | Special configs/reqs |
---|---|---|---|---|---|
nginx/traefik/other | No | No | No | Yes | No |
rabbitmq | Yes | No | Yes | No | No |
redis | Yes | No | No | No | optional: sysctl -w net.core.somaxconn=10000; echo never > /sys/kernel/mm/transparent_hugepage/enabled (also add it to /etc/rc.local) |
elasticsearch | Yes | No (opt. Yes) | No | No | mandatory: sysctl -w vm.max_map_count=262144; for OpenShift, see https://developers.redhat.com/blog/2019/11/12/using-the-red-hat-openshift-tuned-operator-for-elasticsearch |
kibana | No | No | Yes | No | No |
druid components | Yes | Yes | Yes | No | No |
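
The mandatory Elasticsearch setting can be verified on each Linux node before deployment. On Linux, the value set by `sysctl -w vm.max_map_count=...` is exposed in `/proc/sys/vm/max_map_count`, so a check only needs to read that file. The sketch below is an assumption-level helper (the function name and the optional `proc_file` parameter are hypothetical); it reports whether the current value is at least the 262144 required by the table.

```python
from pathlib import Path

# Minimum value required for Elasticsearch, taken from the table above.
REQUIRED_MAX_MAP_COUNT = 262144


def max_map_count_ok(
    proc_file: str = "/proc/sys/vm/max_map_count",
    required: int = REQUIRED_MAX_MAP_COUNT,
) -> bool:
    """Return True if the kernel setting in proc_file is at least `required`."""
    current = int(Path(proc_file).read_text().strip())
    return current >= required
```

Note that a value set with `sysctl -w` does not survive a reboot unless it is also persisted in the node's sysctl configuration.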