feat: add optional speech-sdk TTS backend (14 cloud providers, BYOK) by btpod · Pull Request #142 · CaviraOSS/PageLM

btpod · 2026-06-10T23:44:44Z

Disclosure up front: I work on speech-sdk (Apache 2.0, TypeScript). This PR adds it as an optional TTS backend. It runs fully BYOK against each provider's API with your users' own keys; no account with us is needed for any of it.

Proposed in #141 per the CONTRIBUTING "discuss first" guideline; opening the PR alongside so the diff is concrete. Happy to close either if this isn't a fit.

Summary

TTS_PROVIDER=speechsdk enables a fourth backend in backend/src/utils/tts/index.ts.
The whole podcast renders through one generateConversation() call: segments become dialogue turns with alternating A/B voices (per-segment voice still wins, like the other backends). When the selected provider has a native multi-speaker dialogue model the SDK uses it in a single API call; otherwise it synthesizes turns in parallel and stitches them itself.
The SDK inserts a natural gap between speakers and loudness-normalizes the merged audio to -20 dBFS (the usual podcast target), so episodes don't jump in volume when the two hosts use different voices.
This backend needs no ffmpeg. The SDK returns one merged mp3, so the per-segment file + concat pass is skipped entirely.
SPEECH_SDK_MODEL=<provider>/<model> selects the engine across 14 providers: openai, elevenlabs, cartesia, hume, deepgram, google (Gemini TTS), minimax, fish-audio, murf, resemble, fal-ai, mistral, xai, inworld. Voices come from SPEECH_SDK_VOICE_A / SPEECH_SDK_VOICE_B.
Defaults to openai/gpt-4o-mini-tts, so podcast audio works with the OPENAI_API_KEY most installs already set for the LLM, with zero new keys to configure.
Retries with exponential backoff come from the SDK itself.
edge remains the default and the existing backends are untouched.
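
For reviewers who want the voice rule spelled out, here is a minimal sketch of how segments could map to dialogue turns. The `Segment` and `Turn` shapes and the `toDialogueTurns` name are hypothetical, not taken from the diff, and it assumes alternation follows segment position; it only illustrates "alternate A/B unless the segment names its own voice".

```typescript
// Hypothetical shapes for illustration; the real types live in the PR diff.
interface Segment {
  text: string;
  voice?: string; // optional per-segment override
}

interface Turn {
  text: string;
  voice: string;
}

// Alternate voiceA / voiceB by position (an assumption), but let an explicit
// per-segment voice win, as the other backends do.
function toDialogueTurns(
  segments: Segment[],
  voiceA: string,
  voiceB: string
): Turn[] {
  return segments.map((seg, i) => ({
    text: seg.text,
    voice: seg.voice ?? (i % 2 === 0 ? voiceA : voiceB),
  }));
}
```

The resulting turns would then be handed to a single generateConversation() call; the exact argument shape of that call is not shown here because the description does not spell it out.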

Implementation notes

@speech-sdk/core is ESM-only and the backend compiles to CJS, so the module loads through a transpile-proof dynamic import. The google backend uses await import('@google-cloud/text-to-speech') for the same reason.
Provider API keys are read from the standard env vars (OPENAI_API_KEY, ELEVENLABS_API_KEY, etc.) by the SDK. Optionally, setting SPEECHBASE_API_KEY routes the same provider/model strings through speechbase.ai, the hosted gateway we run, so one key covers every provider; without it, calls go directly to the provider. Direct is the default.
Unknown provider prefixes throw speechsdk_unknown_provider_<prefix>, matching the existing error style.
The audio_progress emit fires once on completion; as far as I can tell nothing in frontend/src consumes it today, but say the word if you'd rather keep per-segment events and I'll switch to per-turn generateSpeech() calls plus the existing ffmpeg concat (that variant is in this branch's history at 8e0643c).
The SDK adds four transitive deps (mediabunny + mp3 encoder for audio conversion, p-retry, zod).
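
One common way to make a dynamic import survive a CJS build is sketched below. This is an illustration of the general technique, not necessarily the exact code in this branch; `dynamicImport` and `loadEsm` are hypothetical names.

```typescript
// When TypeScript targets CommonJS it rewrites `import()` into `require()`,
// which fails for ESM-only packages. Building the call with `new Function`
// hides it from the compiler, so Node's native import() runs at runtime.
const dynamicImport = new Function(
  "specifier",
  "return import(specifier)"
) as (specifier: string) => Promise<unknown>;

// Small typed wrapper; the caller supplies the module shape it expects.
async function loadEsm<T>(specifier: string): Promise<T> {
  return (await dynamicImport(specifier)) as T;
}
```

Usage would look like `const sdk = await loadEsm<SomeSdkShape>("@speech-sdk/core")`, where `SomeSdkShape` is whatever subset of the module the backend needs. The trade-off is that the specifier is invisible to bundlers and static analysis, which is acceptable for a server-side optional dependency.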

Test plan

npm run build passes.
Manually generated a three-segment podcast with TTS_PROVIDER=speechsdk and only OPENAI_API_KEY set: alternating voices, single merged mp3 (ffprobe: format_name=mp3, 15.2s for three short turns with audible inter-speaker gaps), no ffmpeg invoked.
Verified the unknown-provider error path.
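
The unknown-provider path checked above could be exercised with something like the following sketch. `parseSpeechSdkModel` and `KNOWN_PROVIDERS` are hypothetical names; the provider list, the default model, and the error string format come from this description, while the parsing details (splitting on the first slash, trimming whitespace) are assumptions.

```typescript
// Provider prefixes listed in the PR description.
const KNOWN_PROVIDERS: readonly string[] = [
  "openai", "elevenlabs", "cartesia", "hume", "deepgram", "google", "minimax",
  "fish-audio", "murf", "resemble", "fal-ai", "mistral", "xai", "inworld",
];

const DEFAULT_MODEL = "openai/gpt-4o-mini-tts";

// Parse a SPEECH_SDK_MODEL value of the form <provider>/<model>.
// Splits on the first slash only, so a model id may itself contain slashes.
function parseSpeechSdkModel(
  raw: string | undefined
): { provider: string; model: string } {
  const value = raw && raw.trim() ? raw.trim() : DEFAULT_MODEL;
  const slash = value.indexOf("/");
  const prefix = slash === -1 ? value : value.slice(0, slash);
  if (KNOWN_PROVIDERS.indexOf(prefix) === -1) {
    throw new Error(`speechsdk_unknown_provider_${prefix}`);
  }
  return {
    provider: prefix,
    model: slash === -1 ? "" : value.slice(slash + 1),
  };
}
```

Calling it with an unset variable falls back to the default model, and a value such as `acme/foo` throws with the message `speechsdk_unknown_provider_acme`.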

Review-driven changes

Reworked the backend from per-segment generateSpeech() + ffmpeg concat to a single generateConversation() call (commit ad06480): native dialogue support, parallel turn synthesis, inter-speaker gaps, loudness normalization, and no ffmpeg requirement for this backend.

I'll maintain this integration and take responsibility for breakage in it. Happy to rename the provider key, drop the README table tweak, or restructure however you prefer.

🤖 Generated with Claude Code

Adds TTS_PROVIDER=speechsdk as a fourth synthesis backend. One integration covers OpenAI, ElevenLabs, Cartesia, Hume, Deepgram, Google Gemini TTS, MiniMax, Fish Audio, Murf, Resemble, fal, Mistral, xAI, and Inworld, selected with SPEECH_SDK_MODEL=<provider>/<model>. Defaults to openai/gpt-4o-mini-tts using the OPENAI_API_KEY most installs already set for the LLM. Existing backends and the edge default are untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

One call renders the full dialogue: native multi-speaker models where the provider has one, otherwise parallel per-turn synthesis stitched and loudness-normalized to -20 dBFS by the SDK. Removes the per-segment ffmpeg concat from this backend.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

btpod and others added 2 commits June 10, 2026 16:43

recabasic approved these changes Jun 11, 2026

Merge branch 'main' into btpod/speech-sdk-tts

d58fd7a

recabasic merged commit 736f22b into CaviraOSS:main Jun 11, 2026
2 checks passed

btpod deleted the btpod/speech-sdk-tts branch June 11, 2026 20:11

recabasic merged 3 commits into CaviraOSS:main from btpod:btpod/speech-sdk-tts
