Voice

Druid enables AI Agent voice capabilities to meet the demand for hands-free, conversational interactions across various business scenarios. This allows users to communicate with AI Agents naturally through two primary modes:

  • Telephony. Users can interact with AI Agents via traditional phone lines. This is ideal for automating call center triage before agent hand-off, or providing automated HR and IT Help Desk support through a dedicated phone number and telephone exchange.
  • Voice-enabled intranet pages. Users can issue voice commands directly within a web interface. For example, a user can verbally instruct an AI Agent to perform tasks or edit documents while working within an intranet page.

The voice channel is currently available as a technology preview via the Druid web snippet. You can configure and test voice conversations within the Druid Portal or on hosted web snippets.

How the Voice Channel works with the WebChat snippet

  1. Press the microphone button in the chat snippet to start talking with the AI Agent.

  HINT: If you don’t see the microphone icon, set up the speech provider and configure the voice channel as described below.

  2. Your voice is processed by the Speech-to-Text (STT) service as you speak, and the transcript appears in the input area. When you finish speaking, the text is sent to the AI Agent.
  3. The AI Agent processes the text and responds with a text message. The text response appears in the chat snippet, and the AI Agent also speaks the response to you.
  4. The spoken response is delivered by the Text-to-Speech (TTS) service (sketched after this list).

  NOTE: In Flow authoring, each step has a dedicated Voice setting where you can customize a spoken response specifically for voice channels, distinct from the text response.
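To make the round trip concrete, here is a minimal TypeScript sketch of the flow above. Everything in it (the speechToText, sendToAgent, and textToSpeech functions and the voiceText field) is a hypothetical stand-in for the real STT, AI Agent, and TTS services, not the Druid snippet API.

```typescript
// Hypothetical sketch of the WebChat voice round trip described above.
// None of these names come from the Druid snippet API.

interface AgentReply {
  text: string;        // the text response shown in the chat snippet
  voiceText?: string;  // optional spoken variant authored in a step's Voice setting
}

// Step 2: the STT service turns the spoken utterance into a transcript.
async function speechToText(audio: Blob): Promise<string> {
  return "what is my leave balance"; // stub; a real STT provider call goes here
}

// Step 3: the transcript is sent to the AI Agent, which answers with text.
async function sendToAgent(transcript: string): Promise<AgentReply> {
  return {
    text: "You have 12 days of leave left.",
    voiceText: "You have twelve days of leave left.",
  };
}

// Step 4: the TTS service speaks the reply back to the user.
async function textToSpeech(text: string): Promise<void> {
  console.log(`TTS would speak: "${text}"`);
}

async function voiceTurn(utterance: Blob): Promise<void> {
  const transcript = await speechToText(utterance);  // STT
  const reply = await sendToAgent(transcript);       // AI Agent processing
  await textToSpeech(reply.voiceText ?? reply.text); // prefer the Voice setting
}
```

The voiceText ?? text fallback mirrors the NOTE above: when a step has a Voice setting, the spoken response can differ from the text shown in the snippet.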

Set up the speech provider

Druid delivers Speech-to-Text (STT) and Text-to-Speech (TTS) functionality through integrations with industry-leading Technology Partners. Out-of-the-box support includes:

  • Microsoft Cognitive Services
  • ElevenLabs (available starting with Druid 9.15)
  • Deepgram (STT only)

To integrate a preferred speech provider not listed above, please contact Druid Tech Support.

Setting up Microsoft Cognitive Services

IMPORTANT! To use the Voice channel in production environments, contact Druid Tech Support for the necessary keys.
  1. In the Druid Portal, go to your AI Agent settings.
  2. Select the AI & Cognitive Services category and click Microsoft Cognitive Service.
  3. Enter the Key and Region provided by the Druid Support Team in the voice channel activation email.

  HINT: For demo purposes, you can request a test key from Druid Tech Support.

  4. Map the languages your AI Agent supports to specific voices in the configuration table.
    1. In the table below the Voice channel details, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Voice dropdown, select the specific voice the AI Agent will use to respond.
    4. Click the Save icon displayed inline.
  5. Click Save at the bottom of the page and close the modal.
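If you want to sanity-check the Key and Region outside Druid before saving, a small test with Microsoft's official Speech SDK for JavaScript (npm package microsoft-cognitiveservices-speech-sdk) might look like the sketch below. The voice name is just an example of a voice you could map in the table; this check is not part of the Druid setup itself.

```typescript
// Minimal check that the Key and Region from the activation email are valid,
// using Microsoft's Speech SDK (npm: microsoft-cognitiveservices-speech-sdk).
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

const speechConfig = sdk.SpeechConfig.fromSubscription(
  process.env.SPEECH_KEY!,    // the Key provided by the Druid Support Team
  process.env.SPEECH_REGION!, // the Region, e.g. "westeurope"
);
// Example neural voice; use any voice you mapped in the configuration table.
speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural";

// Write the synthesized audio to a file so the test also works in Node.
const audioConfig = sdk.AudioConfig.fromAudioFileOutput("test.wav");
const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

synthesizer.speakTextAsync(
  "Voice channel test.",
  (result) => {
    if (result.reason === sdk.ResultReason.SynthesizingAudioCompleted) {
      console.log("Key and region are valid; audio was synthesized.");
    } else {
      console.error("Synthesis failed:", result.errorDetails);
    }
    synthesizer.close();
  },
  (error) => {
    console.error(error);
    synthesizer.close();
  },
);
```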

Setting up Deepgram

IMPORTANT! Deepgram is available as a voice provider for WebChat in Druid 9.1 and higher, for Speech-to-Text (STT) only.

Prerequisites

  • You need a Deepgram API key with Member permissions. Refer to the Deepgram documentation (Token-Based Authentication) for information on how to create a key with Member permissions.

Setup procedure

  1. In the Druid Portal, go to your AI Agent settings.
  2. Select the AI & Cognitive Services category and click Deepgram.
  3. Enter your Deepgram API Key.
  4. Map the languages your AI Agent supports to specific Deepgram models in the configuration table.
    1. In the table, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Model dropdown, select the Deepgram model the AI Agent will use to transcribe speech in that language.
    4. Click the Save icon displayed inline.

    HINT: For Druid versions prior to 9.6, type the Deepgram model name (e.g., nova-2-medical). See the Deepgram documentation for the complete list of available models.
  5. Click Save at the bottom of the page and close the modal.

  6. Select the Web & Email category and click the WebChat channel, then select Deepgram as the Speech-to-Text Provider and Azure as the Text-to-Speech Provider.
  7. (Optional) Select Azure as the Fallback Speech-to-Text Provider in WebChat. Azure is used when Deepgram does not support the chat user’s language.

  IMPORTANT! Using Azure as the Text-to-Speech Provider or the Fallback Speech-to-Text Provider requires setting up Microsoft Cognitive Services as well.

  8. Click the Save button at the bottom of the page.

The microphone button appears in the chat snippet, and users can click it to speak with the AI Agent.
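To confirm a Deepgram API key works before entering it in the portal, you could run a quick transcription with Deepgram's official JavaScript SDK (npm: @deepgram/sdk, v3). The model name and sample audio URL below are only examples:

```typescript
// Quick STT smoke test with the Deepgram JavaScript SDK (v3).
import { createClient } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

async function main() {
  const { result, error } = await deepgram.listen.prerecorded.transcribeUrl(
    { url: "https://dpgr.am/spacewalk.wav" }, // sample audio from Deepgram's docs
    { model: "nova-2", smart_format: true },  // any model you mapped in the table
  );
  if (error) throw error;
  console.log(result?.results.channels[0].alternatives[0].transcript);
}

main();
```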

Setting up ElevenLabs

Druid supports ElevenLabs as a high-quality Text-to-Speech (TTS) provider, enabling your AI Agent to communicate using specialized synthetic voices and custom voice clones.

NOTE: ElevenLabs is available in Druid 9.15 and higher.

Prerequisites

  • You need an ElevenLabs API key.

Setup procedure

  1. In the Druid Portal, go to your AI Agent settings.
  2. Select the AI & Cognitive Services category and click ElevenLabs.
  3. Enter your ElevenLabs API Key.
  4. Map the languages your AI Agent supports to specific ElevenLabs voices in the configuration table.
    1. In the table, click the plus icon (+) to add a row.
    2. From the Language dropdown, select the AI Agent language (default or additional).
    3. From the Voice dropdown, select the specific ElevenLabs voice the AI Agent will use to respond. The model is automatically filled in after you select the voice.
    4. Click the Save icon displayed inline.

  5. Click Save at the bottom of the page and close the modal.
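To verify an ElevenLabs API key and voice outside Druid, you could call the public ElevenLabs text-to-speech endpoint directly. The voice ID placeholder and model_id below are examples only; in Druid, the model is filled in automatically once you pick a voice.

```typescript
// Minimal TTS check against the public ElevenLabs REST API (Node 18+, ESM).
import { writeFile } from "node:fs/promises";

const voiceId = "YOUR_VOICE_ID"; // copy a voice ID from your ElevenLabs voice library

const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`, {
  method: "POST",
  headers: {
    "xi-api-key": process.env.ELEVENLABS_API_KEY!,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    text: "Voice channel test.",
    model_id: "eleven_multilingual_v2", // example model
  }),
});
if (!res.ok) throw new Error(`ElevenLabs request failed: ${res.status}`);

await writeFile("test.mp3", Buffer.from(await res.arrayBuffer()));
console.log("API key works; wrote test.mp3");
```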

Configure the Voice channel

Once a speech provider is set up, you must explicitly tell the WebChat channel to use these services:

  1. Select the Web & Email category and click the WebChat channel.
  2. Select the desired Speech-to-Text Provider. If you selected Deepgram, you should also select Azure as a Fallback Speech-to-Text Provider. Azure will be used automatically if Deepgram does not support the user’s language (see the sketch below).
  3. Select the desired Text-to-Speech Provider. If you selected ElevenLabs, you should also select Azure as a Fallback Text-to-Speech Provider. Azure will be used automatically if ElevenLabs does not support the user’s language.
  4. Click Save at the bottom of the page and close the modal.

A microphone icon will automatically appear in the webchat snippet. This allows users to switch from text to voice conversations seamlessly, enabling natural vocal interaction with the AI Agent.
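The fallback rule from steps 2 and 3 boils down to a per-message provider choice. The sketch below is a hypothetical illustration of that rule, not Druid's internal implementation:

```typescript
// Hypothetical illustration of the fallback rule: the primary provider handles
// the languages it supports, and Azure catches everything else.
type SttProvider = "Deepgram" | "Azure";
type TtsProvider = "ElevenLabs" | "Azure";

function pickSttProvider(userLang: string, deepgramLangs: Set<string>): SttProvider {
  return deepgramLangs.has(userLang) ? "Deepgram" : "Azure";
}

function pickTtsProvider(userLang: string, elevenLabsLangs: Set<string>): TtsProvider {
  return elevenLabsLangs.has(userLang) ? "ElevenLabs" : "Azure";
}

// Example: a Romanian-speaking user when only English and Spanish are mapped.
console.log(pickSttProvider("ro", new Set(["en", "es"]))); // "Azure"
console.log(pickTtsProvider("en", new Set(["en", "es"]))); // "ElevenLabs"
```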

How the Voice Channel works with SDL Real-time Machine Translation

If you use a translation service for real-time translation and activate the Voice channel, the AI Agent will play back the response in the user's language.

When activating SDL machine translation, you can choose when the translation is performed: at conversation time or authoring time. For more information, see Using Machine Translation.

Voice Channel with Conversation Time Translation

  1. The user speaks in Language A.
  2. Speech-to-Text (STT) is performed in Language A.
  3. The text is translated into the AI Agent default language.
  4. NLP is performed in the AI Agent default language.
  5. A response is generated in the AI Agent default language.
  6. The response is translated back into Language A.
  7. The AI Agent responds with text in Language A.
  8. The response text is converted into audio by the Text-to-Speech (TTS) service.
NOTE: When Conversation Time Translation is enabled for the AI Agent, the language selector in the webchat snippet is replaced by a non-selectable World icon. This indicates that the AI Agent can process languages beyond its natively authored set. The webchat snippet automatically adapts to the user's language code: if a user sends a voice message in a language different from the AI Agent's pre-configured languages, the snippet detects the change and sends the audio to the Speech-to-Text (STT) service in that specific language. This ensures the AI Agent can accurately transcribe and translate the user's input regardless of the initial language settings. The sketch below walks through this pipeline.
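Put together, the conversation-time pipeline looks like the following hypothetical sketch; the service functions are stubs standing in for the real STT, translation, NLP, and TTS components.

```typescript
// Hypothetical sketch of the conversation-time translation pipeline (steps 1-8).
// The stubs below just echo their inputs; real services replace them.
const speechToText = async (_audio: Blob, lang: string) =>
  `[transcript in ${lang}]`;
const translate = async (text: string, from: string, to: string) =>
  `[${text} translated ${from} -> ${to}]`;
const runNlpAndFlows = async (text: string, lang: string) =>
  `[response in ${lang} to ${text}]`;
const textToSpeech = async (text: string, lang: string) => {
  console.log(`TTS speaks in ${lang}: ${text}`);
};

async function conversationTimeVoiceTurn(
  audio: Blob,
  userLang: string,    // Language A
  defaultLang: string, // the AI Agent default language
) {
  const userText = await speechToText(audio, userLang);                // step 2
  const agentInput = await translate(userText, userLang, defaultLang); // step 3
  const agentText = await runNlpAndFlows(agentInput, defaultLang);     // steps 4-5
  const replyText = await translate(agentText, defaultLang, userLang); // steps 6-7
  await textToSpeech(replyText, userLang);                             // step 8
}
```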

Voice Channel with Authoring Time Translation

When using Authoring Time Translation, Druid translates the message written in the Voice setting of flow steps from the default AI Agent language to all additional languages (see the sketch after the steps below).

  1. The user speaks in Language A (default or additional AI Agent language).
  2. Speech-to-Text (STT) is performed in Language A.
  3. NLP is performed in Language A.
  4. The AI Agent responds with text in Language A.
  5. The response text is converted into audio by the Text-to-Speech (TTS) service and spoken to the user.
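By contrast with the conversation-time sketch above, authoring-time translation does all translation work up front: each flow step's Voice prompt already exists in every additional language, so the runtime path has no translate calls. A hypothetical illustration, with example prompt strings:

```typescript
// Hypothetical: Voice prompts pre-translated at authoring time, keyed by language.
const voicePrompts: Record<string, string> = {
  en: "You have twelve days of leave left.",      // default AI Agent language
  ro: "Mai aveți douăsprezece zile de concediu.", // generated at authoring time
};

// At runtime, STT, NLP, and TTS all run directly in the user's language
// (Language A); the spoken response is looked up, never translated on the fly.
function spokenReply(userLang: string): string {
  return voicePrompts[userLang] ?? voicePrompts["en"];
}

console.log(spokenReply("ro")); // "Mai aveți douăsprezece zile de concediu."
```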