feat(openai): emit interim input audio transcription deltas from Realtime API #5544

Open
F1nnM wants to merge 3 commits into livekit:main from F1nnM:feat/openai-realtime-streaming-input-transcription

F1nnM commented Apr 24, 2026

Summary

The OpenAI Realtime API sends `conversation.item.input_audio_transcription.delta` events with
streaming transcription text as the user speaks, but the plugin currently discards them with a bare `pass`.
This PR handles those deltas by accumulating text per item and emitting
`input_audio_transcription_completed` with `is_final=False`, matching the pattern already used
by the Google Gemini realtime plugin.

The `.completed` event continues to emit `is_final=True` as before and cleans up the accumulated
state.

  • Accumulate delta text per `item_id` in `RealtimeSession._input_transcriptions`
  • Emit `InputTranscriptionCompleted(is_final=False)` on each delta
  • Clean up state on `.completed`
  • Fix existing `test_input_audio_transcription` to wait for `is_final=True` (interims now fire first)
  • Add `test_input_audio_transcription_interim` asserting interim events arrive before final
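
The accumulate, emit, and clean-up steps listed above can be sketched in isolation. This is a minimal illustration and not the PR's actual diff: `TranscriptEvent` and `InputTranscriptAccumulator` are hypothetical stand-ins for the plugin's event type and the `RealtimeSession._input_transcriptions` bookkeeping, and the sketch assumes each interim event carries the text accumulated so far rather than the bare delta.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptEvent:
    """Hypothetical stand-in for the plugin's transcription event."""
    item_id: str
    transcript: str
    is_final: bool


@dataclass
class InputTranscriptAccumulator:
    """Hypothetical helper that tracks partial transcripts per item_id."""
    _partials: dict[str, str] = field(default_factory=dict)

    def on_delta(self, item_id: str, delta: str) -> TranscriptEvent:
        # Append the new fragment to whatever has been seen for this item.
        text = self._partials.get(item_id, "") + delta
        self._partials[item_id] = text
        # Assumption: the interim event carries the accumulated text.
        return TranscriptEvent(item_id, text, is_final=False)

    def on_completed(self, item_id: str, transcript: str) -> TranscriptEvent:
        # Drop per-item state so the mapping does not grow across turns.
        self._partials.pop(item_id, None)
        return TranscriptEvent(item_id, transcript, is_final=True)
```

Popping with a default on completion keeps the handler safe when a `.completed` event arrives for an item that never produced a delta.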

Context

The same plugin's `openai.STT` class (in `stt.py`) already handles these deltas via its own
WebSocket connection. This brings parity to the `RealtimeModel` path so that `AgentSession`'s
`user_input_transcribed` event receives streaming transcripts in realtime mode.
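
For applications, the practical effect is that a `user_input_transcribed` handler has to distinguish interim from final transcripts. A rough sketch of such a handler follows, assuming the event exposes `transcript` and `is_final` fields. `TranscribedEvent` and `CaptionBuffer` are hypothetical names used here so the example runs without LiveKit installed.

```python
from dataclasses import dataclass, field


@dataclass
class TranscribedEvent:
    """Hypothetical stand-in for the event a user_input_transcribed handler receives."""
    transcript: str
    is_final: bool


@dataclass
class CaptionBuffer:
    """Keeps committed lines plus one live, overwritable interim line."""
    committed: list[str] = field(default_factory=list)
    live: str = ""

    def handle(self, ev: TranscribedEvent) -> None:
        if ev.is_final:
            # The final transcript replaces whatever interim text was shown.
            self.committed.append(ev.transcript)
            self.live = ""
        else:
            # Interim text overwrites the previous interim; it is never appended.
            self.live = ev.transcript
```

Treating interims as overwrites rather than appends avoids duplicated text if each interim already contains the full transcript so far.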

Test plan

  • `make check` passes (format, lint, type-check)
  • All 18 existing OpenAI realtime tests pass
  • New `test_input_audio_transcription_interim` validates that interim deltas arrive before the final transcript
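
The ordering property the new test checks, and the "wait for `is_final=True`" fix to the existing test, can be expressed as two small predicates over a recorded event list. This is only a sketch: the real tests are presumably asynchronous and drive a live session, and `TranscriptEvent`, `first_final`, and `interims_precede_final` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TranscriptEvent:
    """Hypothetical stand-in for a recorded transcription event."""
    transcript: str
    is_final: bool


def first_final(events: list[TranscriptEvent]) -> TranscriptEvent | None:
    """Skip interim events and return the first final one, if any."""
    return next((ev for ev in events if ev.is_final), None)


def interims_precede_final(events: list[TranscriptEvent]) -> bool:
    """True if at least one interim event arrives before the first final one."""
    for index, ev in enumerate(events):
        if ev.is_final:
            # Everything before the first final event is interim by construction.
            return index > 0
    return False  # no final event was seen at all
```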

F1nnM added 2 commits April 24, 2026 11:05
Handle `conversation.item.input_audio_transcription.delta` events
from the OpenAI Realtime API instead of dropping them. Deltas are
accumulated per item and emitted as `input_audio_transcription_completed`
with `is_final=False`, matching the pattern already used by the
Gemini realtime plugin. The completed event cleans up state and
continues to emit `is_final=True` as before.

This enables streaming user input transcription for applications
that subscribe to `user_input_transcribed` on `AgentSession`.
The existing `test_input_audio_transcription` now waits for `is_final=True`
before asserting, since interim deltas now fire first.

New `test_input_audio_transcription_interim` validates that interim
transcription events (`is_final=False`) arrive before the final transcript.
devin-ai-integration[bot]

This comment was marked as resolved.
