feat: add optional speech-sdk TTS backend (14 cloud providers, BYOK) #142
Merged
Adds TTS_PROVIDER=speechsdk as a fourth synthesis backend. One integration covers OpenAI, ElevenLabs, Cartesia, Hume, Deepgram, Google Gemini TTS, MiniMax, Fish Audio, Murf, Resemble, fal, Mistral, xAI, and Inworld, selected with SPEECH_SDK_MODEL=<provider>/<model>. Defaults to openai/gpt-4o-mini-tts using the OPENAI_API_KEY most installs already set for the LLM. Existing backends and the edge default are untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
One call renders the full dialogue: native multi-speaker models where the provider has one, otherwise parallel per-turn synthesis stitched and loudness-normalized to -20 dBFS by the SDK. Removes the per-segment ffmpeg concat from this backend. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
recabasic
approved these changes
Jun 11, 2026
Disclosure up front: I work on speech-sdk (Apache 2.0, TypeScript). This PR adds it as an optional TTS backend. It runs fully BYOK against each provider's API with your users' own keys; no account with us is needed for any of it.
Proposed in #141 per the CONTRIBUTING "discuss first" guideline; opening the PR alongside so the diff is concrete. Happy to close either if this isn't a fit.
Summary
- TTS_PROVIDER=speechsdk enables a fourth backend in backend/src/utils/tts/index.ts.
- One generateConversation() call renders the full dialogue: segments become dialogue turns with alternating A/B voices (a per-segment voice still wins, as in the other backends). When the selected provider has a native multi-speaker dialogue model, the SDK uses it in a single API call; otherwise it synthesizes turns in parallel and stitches them itself.
- SPEECH_SDK_MODEL=<provider>/<model> selects the engine across 14 providers: openai, elevenlabs, cartesia, hume, deepgram, google (Gemini TTS), minimax, fish-audio, murf, resemble, fal-ai, mistral, xai, inworld. Voices come from SPEECH_SDK_VOICE_A / SPEECH_SDK_VOICE_B.
- Defaults to openai/gpt-4o-mini-tts, so podcast audio works with the OPENAI_API_KEY most installs already set for the LLM, with zero new keys to configure.
- edge remains the default and the existing backends are untouched.
Implementation notes
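As a rough illustration of the model selection and voice alternation summarized above, here is a minimal sketch. It is not the code in this PR: the Segment and Turn shapes, the parseModel and toTurns helpers, and the choice to alternate voices by segment index are all assumptions made for the example. Only the provider list, the <provider>/<model> format, and the speechsdk_unknown_provider_<prefix> error string come from the description.

```typescript
// Hypothetical shapes for illustration; the backend's real types may differ.
interface Segment { text: string; voice?: string }
interface Turn { text: string; voice: string }

// Provider prefixes as listed in the Summary.
const KNOWN_PROVIDERS = [
  "openai", "elevenlabs", "cartesia", "hume", "deepgram", "google", "minimax",
  "fish-audio", "murf", "resemble", "fal-ai", "mistral", "xai", "inworld",
];

// Split "<provider>/<model>" at the first slash, so model names that
// themselves contain slashes are kept whole. An empty model is passed
// through here; validating it is left to whoever consumes the result.
function parseModel(spec: string): { provider: string; model: string } {
  const slash = spec.indexOf("/");
  const provider = slash === -1 ? spec : spec.slice(0, slash);
  if (!KNOWN_PROVIDERS.includes(provider)) {
    throw new Error(`speechsdk_unknown_provider_${provider}`);
  }
  return { provider, model: slash === -1 ? "" : spec.slice(slash + 1) };
}

// Assign voice A to even-indexed segments and voice B to odd-indexed ones
// (an assumed alternation rule); an explicit per-segment voice always wins.
function toTurns(segments: Segment[], voiceA: string, voiceB: string): Turn[] {
  return segments.map((segment, index) => ({
    text: segment.text,
    voice: segment.voice ?? (index % 2 === 0 ? voiceA : voiceB),
  }));
}
```

The resulting turns would then be handed to the SDK in a single call; that call is omitted here because its exact signature is not shown in this description.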
- @speech-sdk/core is ESM-only and the backend compiles to CJS, so the module loads through a transpile-proof dynamic import, for the same reason the google backend uses await import('@google-cloud/text-to-speech').
- API keys are read from each provider's standard environment variable (OPENAI_API_KEY, ELEVENLABS_API_KEY, etc.) by the SDK. Optionally, setting SPEECHBASE_API_KEY routes the same provider/model strings through speechbase.ai, the hosted gateway we run, so one key covers every provider; without it, calls go directly to the provider. Direct is the default.
- An unknown provider prefix fails with speechsdk_unknown_provider_<prefix>, matching the existing error style.
- The audio_progress emit fires once on completion; as far as I can tell nothing in frontend/src consumes it today, but say the word if you'd rather keep per-segment events and I'll switch to per-turn generateSpeech() calls plus the existing ffmpeg concat (that variant is in this branch's history at 8e0643c).
Test plan
- npm run build passes.
- Ran with TTS_PROVIDER=speechsdk and only OPENAI_API_KEY set: alternating voices, single merged mp3 (ffprobe: format_name=mp3, 15.2s for three short turns with audible inter-speaker gaps), no ffmpeg invoked.
Review-driven changes
- Moved from per-turn generateSpeech() + ffmpeg concat to a single generateConversation() call (commit ad06480): native dialogue support, parallel turn synthesis, inter-speaker gaps, loudness normalization, and no ffmpeg requirement for this backend.
I'll maintain this integration and take responsibility for breakage in it. Happy to rename the provider key, drop the README table tweak, or restructure however you prefer.
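For reviewers unfamiliar with the ESM-from-CJS issue mentioned in the implementation notes, a common workaround looks roughly like the sketch below. This is one possible approach and not necessarily the code in this PR; nativeImport and loadEsm are hypothetical names. It assumes the TypeScript compiler targets CommonJS, where a plain import() expression is rewritten into a require() call that cannot load an ESM-only package.

```typescript
// Hypothetical helper, not the PR's code. Building the import() call from a
// string with `new Function` hides it from the TypeScript compiler, so it is
// not rewritten into require() and Node performs a real dynamic import at
// runtime.
const nativeImport = new Function(
  "specifier",
  "return import(specifier)",
) as (specifier: string) => Promise<unknown>;

// Thin typed wrapper: the caller supplies the module shape it expects.
// Nothing is validated at runtime, so T is only a compile-time assumption.
function loadEsm<T = unknown>(specifier: string): Promise<T> {
  return nativeImport(specifier) as Promise<T>;
}

// Usage inside an async function (requires the package to be installed):
//   const sdk = await loadEsm("@speech-sdk/core");
```

One trade-off of this pattern is that bundlers and some test runners that execute code in isolated contexts may not resolve the hidden import, so it is worth checking in whatever tooling the backend uses.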
🤖 Generated with Claude Code