feat(openai): emit interim input audio transcription deltas from Realtime API #5544
Open
F1nnM wants to merge 3 commits into livekit:main
Handle `conversation.item.input_audio_transcription.delta` events from the OpenAI Realtime API instead of dropping them. Deltas are accumulated per item and emitted as `input_audio_transcription_completed` with `is_final=False`, matching the pattern already used by the Gemini realtime plugin. The completed event cleans up state and continues to emit `is_final=True` as before. This enables streaming user input transcription for applications that subscribe to `user_input_transcribed` on `AgentSession`.
The existing `test_input_audio_transcription` now waits for `is_final=True` before asserting, since interim deltas now fire first. The new `test_input_audio_transcription_interim` validates that interim transcription events (`is_final=False`) arrive before the final transcript.
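The accumulation described above can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual code: `TranscriptionEvent` and `DeltaAccumulator` are hypothetical stand-ins, and the sketch assumes each interim event carries the full text accumulated so far for its item.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TranscriptionEvent:
    """Hypothetical stand-in for the event emitted to subscribers."""

    item_id: str
    transcript: str
    is_final: bool


@dataclass
class DeltaAccumulator:
    """Hypothetical helper: tracks partial transcripts keyed by item id."""

    _partials: dict[str, str] = field(default_factory=dict)

    def on_delta(self, item_id: str, delta: str) -> TranscriptionEvent:
        # Append the new fragment to whatever has arrived for this item so far
        # and report the running text as an interim (non-final) event.
        text = self._partials.get(item_id, "") + delta
        self._partials[item_id] = text
        return TranscriptionEvent(item_id, text, is_final=False)

    def on_completed(self, item_id: str, transcript: str) -> TranscriptionEvent:
        # The completed event carries the authoritative transcript, so the
        # accumulated partial is discarded to avoid leaking per-item state.
        self._partials.pop(item_id, None)
        return TranscriptionEvent(item_id, transcript, is_final=True)
```

Keying the state by item id keeps concurrent items from mixing their text, and dropping the entry on completion keeps the mapping from growing over a long session.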
Summary
The OpenAI Realtime API sends `conversation.item.input_audio_transcription.delta` events with streaming transcription text as the user speaks, but the plugin currently drops them with `pass`.
This PR handles those deltas by accumulating text per item and emitting `input_audio_transcription_completed` with `is_final=False`, matching the pattern already used by the Google Gemini realtime plugin.
The `.completed` event continues to emit `is_final=True` as before, and cleans up accumulated state.
- Accumulate delta text per `item_id` in `RealtimeSession._input_transcriptions`
- Emit `InputTranscriptionCompleted(is_final=False)` on each delta
- Clean up accumulated state on `.completed`
- Update `test_input_audio_transcription` to wait for `is_final=True` (interims now fire first)
- Add `test_input_audio_transcription_interim` asserting interim events arrive before final

Context
The same plugin's `openai.STT` class (in `stt.py`) already handles these deltas via its own WebSocket connection. This brings parity to the `RealtimeModel` path so that `AgentSession`'s `user_input_transcribed` event receives streaming transcripts in realtime mode.

Test plan
- `make check` passes (format, lint, type-check)
- `test_input_audio_transcription_interim` validates interim deltas arrive before final
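
For readers who want to picture the "wait for the final event before asserting" pattern mentioned in the test plan, here is a rough sketch. It is not the test code from this PR: `Transcribed`, `collect_until_final`, and `demo` are hypothetical names, and the events are fed from an in-memory `asyncio.Queue` rather than a real session.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class Transcribed:
    """Hypothetical stand-in for a transcription event."""

    transcript: str
    is_final: bool


async def collect_until_final(
    queue: asyncio.Queue, timeout: float = 5.0
) -> list[Transcribed]:
    """Drain events until the first final one; the timeout applies per event."""
    events: list[Transcribed] = []
    while True:
        ev = await asyncio.wait_for(queue.get(), timeout)
        events.append(ev)
        if ev.is_final:
            return events


async def demo() -> list[Transcribed]:
    # Simulated stream: two interim events followed by the final transcript.
    queue: asyncio.Queue = asyncio.Queue()
    for ev in (
        Transcribed("Hel", False),
        Transcribed("Hello", False),
        Transcribed("Hello.", True),
    ):
        queue.put_nowait(ev)
    return await collect_until_final(queue)
```

A test built this way can assert on the last element for the final transcript and on the earlier elements to confirm that at least one interim event arrived first.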