Integrating STT and TTS for Real-time Transcription and Text-to-Speech in Wazo Calls

Hi everyone,

I’m working on a project to integrate Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities into Wazo calls. The goal is to achieve real-time transcription of user speech and playback of the transcribed text as synthesized audio.

Here’s a breakdown of the desired workflow:

  1. User Speaks: A user speaks into their phone or headset during a Wazo call.
  2. Audio Capture: Wazo captures the audio stream from the call.
  3. STT Integration: The captured audio is sent to an STT API (e.g., Google Cloud Speech-to-Text, Amazon Transcribe) for transcription.
  4. Transcription Return: The transcribed text is returned to Wazo.
  5. TTS Integration: The transcribed text is sent to a TTS API (e.g., Google Text-to-Speech, Amazon Polly) for text-to-speech conversion.
  6. Audio Playback: The synthesized audio is streamed back to Wazo and played in the call session.
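To make the workflow concrete, here is a minimal sketch of the processing loop. Everything in it is hypothetical: `transcribe` and `synthesize` are stand-ins for whichever STT/TTS provider you end up using, and `CallAudio` is just an illustrative container for audio captured from a call.

```python
from dataclasses import dataclass

@dataclass
class CallAudio:
    call_id: str
    pcm: bytes  # raw audio captured from the Wazo call (step 2)

def transcribe(audio: CallAudio) -> str:
    """Stand-in for an STT provider (Google Speech-to-Text, Amazon Transcribe, ...)."""
    return "hello world"  # a real implementation would stream audio.pcm to the API

def synthesize(text: str) -> bytes:
    """Stand-in for a TTS provider (Google Text-to-Speech, Amazon Polly, ...)."""
    return text.encode("utf-8")  # a real implementation would return synthesized audio

def handle_chunk(audio: CallAudio) -> bytes:
    """Steps 3-6: captured audio -> STT -> text -> TTS -> audio to play back."""
    text = transcribe(audio)
    return synthesize(text)
```

The real work is in replacing the two stand-ins with streaming calls to your chosen providers and feeding `handle_chunk`'s output back into the call session.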

I’m looking for guidance and advice on the following:

  • Wazo APIs: Which Wazo APIs or modules can be used to capture audio streams and inject synthesized audio into a call?
  • External API Integration: How can I integrate external STT and TTS APIs with Wazo? Node-RED might be a good option.
  • Real-time Processing: What strategies can be employed to ensure low-latency processing and minimal impact on call quality?
  • Error Handling: How can I handle potential errors or failures in the STT and TTS processes?
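On the error-handling point: external STT/TTS calls will occasionally time out or fail, so wrapping them in a retry with exponential backoff, and re-raising after the last attempt so the call can fall back to a prompt, is a common pattern. A minimal sketch (the helper name is my own, not a Wazo API):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Retry a flaky STT/TTS call with exponential backoff.

    Re-raises the last error so the caller can fall back gracefully,
    e.g. play a "sorry, please repeat" prompt instead of hanging the call.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide the fallback
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
```

Usage would look like `with_retries(lambda: transcribe_via_api(chunk))`. Keep `attempts` and the delays small in a live call: retrying for several seconds is itself a latency failure.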

Any insights, code examples, or best practices would be greatly appreciated.

Thank you in advance for your help!

I haven’t done this myself, but here are some ideas.

1/ Using an application endpoint

You can create an application and then start transcription on a call:

POST /applications/{application_uuid}/calls/{call_id}/stt/start

with options:

max_time = 0 (unlimited)
engine = 2
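Assuming such an endpoint existed (to be clear, `stt/start` is the suggestion above, not a documented Wazo API), the request could be built like this. The calld base path and the `X-Auth-Token` header follow the usual Wazo conventions, but double-check them against your version's API reference:

```python
def stt_start_request(host: str, application_uuid: str, call_id: str, token: str):
    """Build the (hypothetical) stt/start request for wazo-calld.

    The endpoint and options mirror the suggestion above; they are
    NOT a documented Wazo API, so treat this as a sketch only.
    """
    url = (f"https://{host}/api/calld/1.0/applications/"
           f"{application_uuid}/calls/{call_id}/stt/start")
    headers = {"X-Auth-Token": token}
    body = {"max_time": 0, "engine": 2}  # 0 = unlimited transcription time
    return url, headers, body
```

You would then send it with e.g. `requests.post(url, headers=headers, json=body, verify=...)`.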

Look at

2/ Using Node-RED

I have examples, but here are some docs:
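One way to wire Node-RED in is to have an `http request` node POST captured audio to a small bridge service that talks to the STT provider. A self-contained sketch of such a bridge, using only the Python standard library; `fake_transcribe` is a stand-in for the real provider call:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def fake_transcribe(pcm: bytes) -> str:
    """Stand-in for the real STT provider call."""
    return f"{len(pcm)} bytes received"

class SttBridge(BaseHTTPRequestHandler):
    """Tiny HTTP bridge: Node-RED POSTs raw call audio, gets back JSON text."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        pcm = self.rfile.read(length)  # raw audio from the Node-RED flow
        reply = json.dumps({"text": fake_transcribe(pcm)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # silence per-request logging
        pass
```

In the flow, the bridge's JSON response can then feed a second `http request` node toward the TTS provider, and the resulting audio goes back to Wazo.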

cheers