feat: add optional speech-sdk TTS backend (14 cloud providers, BYOK) #142

Merged
recabasic merged 3 commits into CaviraOSS:main from btpod:btpod/speech-sdk-tts
Jun 11, 2026
Conversation

@btpod btpod commented Jun 10, 2026

Contributor

Disclosure up front: I work on speech-sdk (Apache 2.0, TypeScript). This PR adds it as an optional TTS backend. It runs fully BYOK against each provider's API with your users' own keys; no account with us is needed for any of it.

Proposed in #141 per the CONTRIBUTING "discuss first" guideline; opening the PR alongside so the diff is concrete. Happy to close either if this isn't a fit.

Summary

  • TTS_PROVIDER=speechsdk enables a fourth backend in backend/src/utils/tts/index.ts.
  • The whole podcast renders through one generateConversation() call: segments become dialogue turns with alternating A/B voices (per-segment voice still wins, like the other backends). When the selected provider has a native multi-speaker dialogue model the SDK uses it in a single API call; otherwise it synthesizes turns in parallel and stitches them itself.
  • The SDK inserts a natural gap between speakers and loudness-normalizes the merged audio to -20 dBFS (the usual podcast target), so episodes don't jump in volume when the two hosts come from different voices.
  • This backend needs no ffmpeg. The SDK returns one merged mp3, so the per-segment file + concat pass is skipped entirely.
  • SPEECH_SDK_MODEL=<provider>/<model> selects the engine across 14 providers: openai, elevenlabs, cartesia, hume, deepgram, google (Gemini TTS), minimax, fish-audio, murf, resemble, fal-ai, mistral, xai, inworld. Voices come from SPEECH_SDK_VOICE_A / SPEECH_SDK_VOICE_B.
  • Defaults to openai/gpt-4o-mini-tts, so podcast audio works with the OPENAI_API_KEY most installs already set for the LLM, with zero new keys to configure.
  • Retries with exponential backoff come from the SDK itself.
  • edge remains the default and the existing backends are untouched.
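
For reviewers who want the turn-building rule spelled out, the mapping from podcast segments to dialogue turns could look roughly like the sketch below. It is not the code in this diff: `Segment`, `Turn`, and `buildTurns` are hypothetical names, and the voice strings in the usage note are placeholders. It only illustrates the alternating A/B assignment with a per-segment voice taking precedence.

```typescript
// Hypothetical shapes for illustration; the real types live in the backend and the SDK.
interface Segment {
  text: string;
  voice?: string; // optional per-segment override
}

interface Turn {
  speaker: "A" | "B";
  voice: string;
  text: string;
}

// Alternate speakers A, B, A, B, ... by segment index.
// A voice set on the segment wins over the A/B default voice.
function buildTurns(segments: Segment[], voiceA: string, voiceB: string): Turn[] {
  return segments.map((seg, i): Turn => {
    const speaker: "A" | "B" = i % 2 === 0 ? "A" : "B";
    const fallback = speaker === "A" ? voiceA : voiceB;
    return { speaker, voice: seg.voice ?? fallback, text: seg.text };
  });
}
```

In this sketch, calling `buildTurns(segments, voiceA, voiceB)` with the values of SPEECH_SDK_VOICE_A and SPEECH_SDK_VOICE_B would produce the list of turns that is then handed to the single conversation call.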

Implementation notes

  • @speech-sdk/core is ESM-only and the backend compiles to CJS, so the module loads through a transpile-proof dynamic import. This is the same reason the google backend uses await import('@google-cloud/text-to-speech').
  • Provider API keys are read from the standard env vars (OPENAI_API_KEY, ELEVENLABS_API_KEY, etc.) by the SDK. Optionally, setting SPEECHBASE_API_KEY routes the same provider/model strings through speechbase.ai, the hosted gateway we run, so one key covers every provider; without it, calls go directly to the provider. Direct is the default.
  • Unknown provider prefixes throw speechsdk_unknown_provider_<prefix>, matching the existing error style.
  • The audio_progress emit fires once on completion. As far as I can tell, nothing in frontend/src consumes it today. If you'd rather keep per-segment events, say the word and I'll switch to per-turn generateSpeech() calls plus the existing ffmpeg concat (that variant is in this branch's history at 8e0643c).
  • The SDK adds four transitive deps (mediabunny + mp3 encoder for audio conversion, p-retry, zod).
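
To make two of the notes above concrete, here is a minimal sketch of how the `<provider>/<model>` string and the ESM load could be handled. The provider list, the default model, and the error string come from this description; `parseSpeechSdkModel`, `importEsm`, and the choice to split on the first slash are assumptions for illustration, not the code in the diff.

```typescript
// Provider prefixes listed in the PR summary.
const KNOWN_PROVIDERS = [
  "openai", "elevenlabs", "cartesia", "hume", "deepgram", "google", "minimax",
  "fish-audio", "murf", "resemble", "fal-ai", "mistral", "xai", "inworld",
] as const;
type Provider = (typeof KNOWN_PROVIDERS)[number];

const DEFAULT_MODEL = "openai/gpt-4o-mini-tts";

// Hypothetical helper: split "<provider>/<model>" on the first slash so that
// model names containing further slashes stay intact. The env object is passed
// in (rather than reading process.env) to keep the function easy to test.
function parseSpeechSdkModel(
  env: Record<string, string | undefined>,
): { provider: Provider; model: string } {
  const raw = (env["SPEECH_SDK_MODEL"] ?? "").trim() || DEFAULT_MODEL;
  const slash = raw.indexOf("/");
  const prefix = slash === -1 ? raw : raw.slice(0, slash);
  const model = slash === -1 ? "" : raw.slice(slash + 1);
  const provider = KNOWN_PROVIDERS.find((p) => p === prefix);
  if (!provider) {
    // Error string format taken from the PR description.
    throw new Error(`speechsdk_unknown_provider_${prefix}`);
  }
  return { provider, model };
}

// Hypothetical loader: building import() inside new Function keeps the
// TypeScript compiler from rewriting it to require() when emitting CJS.
const importEsm = new Function("specifier", "return import(specifier)") as (
  specifier: string,
) => Promise<unknown>;
```

Under these assumptions, the backend would call `parseSpeechSdkModel(process.env)` once at startup and load the SDK with `await importEsm("@speech-sdk/core")` on first use.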

Test plan

  • npm run build passes.
  • Manually generated a three-segment podcast with TTS_PROVIDER=speechsdk and only OPENAI_API_KEY set: alternating voices, single merged mp3 (ffprobe: format_name=mp3, 15.2s for three short turns with audible inter-speaker gaps), no ffmpeg invoked.
  • Verified the unknown-provider error path.

Review-driven changes

  • Reworked the backend from per-segment generateSpeech() + ffmpeg concat to a single generateConversation() call (commit ad06480): native dialogue support, parallel turn synthesis, inter-speaker gaps, loudness normalization, and no ffmpeg requirement for this backend.

I'll maintain this integration and take responsibility for breakage in it. Happy to rename the provider key, drop the README table tweak, or restructure however you prefer.

🤖 Generated with Claude Code

btpod and others added 2 commits June 10, 2026 16:43
Adds TTS_PROVIDER=speechsdk as a fourth synthesis backend. One integration
covers OpenAI, ElevenLabs, Cartesia, Hume, Deepgram, Google Gemini TTS,
MiniMax, Fish Audio, Murf, Resemble, fal, Mistral, xAI, and Inworld, selected
with SPEECH_SDK_MODEL=<provider>/<model>. Defaults to openai/gpt-4o-mini-tts
using the OPENAI_API_KEY most installs already set for the LLM. Existing
backends and the edge default are untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

One call renders the full dialogue: native multi-speaker models where the
provider has one, otherwise parallel per-turn synthesis stitched and
loudness-normalized to -20 dBFS by the SDK. Removes the per-segment ffmpeg
concat from this backend.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@recabasic recabasic merged commit 736f22b into CaviraOSS:main Jun 11, 2026
2 checks passed
@btpod btpod deleted the btpod/speech-sdk-tts branch June 11, 2026 20:11

2 participants