
Add Nemotron-ASR streaming inference to Rust SDK#613

Open
rui-ren wants to merge 11 commits into main from ruiren/live-audio-stream-rust

Conversation


@rui-ren rui-ren commented Apr 8, 2026


Description

Ports the C# live audio transcription feature (PR #485) to the Rust SDK with full API parity.

The existing AudioClient only supports file-based transcription. This PR introduces LiveAudioTranscriptionSession that accepts continuous PCM audio chunks (e.g., from a microphone) and returns partial/final transcription results as an async stream.

What's included

New files

  • sdk/rust/src/openai/live_audio_client.rs — Streaming session with start(), append(), get_transcription_stream(), stop(), plus types, cancellation support, and unit tests
  • sdk/rust/tests/integration/live_audio_test.rs — E2E integration test with synthetic PCM audio
  • samples/rust/live-audio-transcription-example/ — Full sample with real microphone capture (cpal) and resampling

Modified files

  • sdk/rust/src/detail/core_interop.rs — Added StreamingRequestBuffer FFI struct and execute_command_with_binary() for binary audio data
  • sdk/rust/src/openai/audio_client.rs — Added create_live_transcription_session() factory method
  • sdk/rust/src/detail/model.rs, model_variant.rs — Wired factory method to Model
  • sdk/rust/src/openai/mod.rs, src/lib.rs — Module registration and public exports
  • sdk/rust/Cargo.toml — Added tokio-util dependency for CancellationToken
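The StreamingRequestBuffer and the empty-slice handling mentioned above can be sketched roughly as follows. Field names and layout here are illustrative assumptions; the real definition lives in sdk/rust/src/detail/core_interop.rs.

```rust
use std::os::raw::c_char;

/// Hypothetical C-compatible buffer pairing JSON parameters with a binary
/// PCM payload, as passed to execute_command_with_binary().
#[repr(C)]
struct StreamingRequestBuffer {
    json_params: *const c_char, // NUL-terminated JSON string
    binary_data: *const u8,     // PCM bytes; null when the chunk is empty
    binary_len: usize,
}

/// Returns a pointer suitable for the FFI boundary: null for an empty slice,
/// so a dangling pointer is never handed to native code.
fn binary_ptr(data: &[u8]) -> *const u8 {
    if data.is_empty() {
        std::ptr::null()
    } else {
        data.as_ptr()
    }
}

fn main() {
    let chunk: &[u8] = &[0u8, 1, 2, 3];
    assert!(binary_ptr(&[]).is_null());
    assert!(!binary_ptr(chunk).is_null());
    println!("null-for-empty convention holds");
}
```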

API surface

let audio_client = model.create_audio_client();
let mut session = audio_client.create_live_transcription_session();

session.settings.sample_rate = 16000;
session.settings.channels = 1;
session.settings.language = Some("en".into());

session.start(None).await?;

// Push audio from microphone callback
session.append(&pcm_bytes, None).await?;

// Read results as async stream
use tokio_stream::StreamExt;
let mut stream = session.get_transcription_stream()?;
while let Some(result) = stream.next().await {
    let result = result?;
    println!("{}", result.content[0].text);
}

session.stop(None).await?;

C# API parity

C# → Rust
  • CreateLiveTranscriptionSession() → create_live_transcription_session()
  • StartAsync(CancellationToken) → start(Option<CancellationToken>)
  • AppendAsync(ReadOnlyMemory<byte>, CancellationToken) → append(&[u8], Option<CancellationToken>)
  • GetTranscriptionStream(CancellationToken) → get_transcription_stream()
  • StopAsync(CancellationToken) + cancel-safe cleanup → stop(Option<CancellationToken>) + cancel-safe cleanup
  • IAsyncDisposable.DisposeAsync() → Drop with best-effort native stop
  • LiveAudioTranscriptionResponse.Content[0].Text → response.content[0].text
  • LiveAudioTranscriptionResponse.Content[0].Transcript → response.content[0].transcript
  • LiveAudioTranscriptionResponse.IsFinal → response.is_final
  • LiveAudioTranscriptionResponse.StartTime/EndTime → response.start_time / response.end_time
  • LiveAudioTranscriptionOptions (SampleRate, Channels, BitsPerSample, Language, PushQueueCapacity) → LiveAudioTranscriptionOptions (sample_rate, channels, bits_per_sample, language, push_queue_capacity)
  • CoreErrorResponse.TryParse() → CoreErrorResponse::try_parse()
  • Native commands audio_stream_start, audio_stream_push, audio_stream_stop → same commands via execute_command / execute_command_with_binary

Design highlights

  • CancellationToken support — start/append/stop accept Option<CancellationToken> via tokio_util::sync::CancellationToken
  • Cancel-safe stop — stop() always performs the native audio_stream_stop even if the token fires, preventing native session leaks (matches the C# StopAsync pattern)
  • Response envelope — LiveAudioTranscriptionResponse uses content: Vec<ContentPart>, matching C#'s ConversationItem.Content[0].Text/Transcript
  • Bounded push queue — Backpressure via bounded channel (capacity=100); prevents unbounded memory growth
  • Push loop on blocking thread — execute_command_with_binary FFI calls run on spawn_blocking, keeping the async runtime free
  • Settings freeze — Audio format settings are cloned at start() and immutable during the session
  • Drop safety — Best-effort synchronous audio_stream_stop in Drop to prevent native session leaks
  • FFI null pointer safety — Empty binary slices use std::ptr::null() to avoid dangling pointer across FFI boundary
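The bounded push queue described above can be illustrated with a std-only sketch. The SDK itself uses a tokio bounded channel with capacity 100; here a tiny std::sync::mpsc::sync_channel shows the same backpressure behavior: a non-blocking send fails once the queue is full instead of buffering without bound.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Tiny capacity for demonstration; the SDK uses 100.
    let (tx, rx) = sync_channel::<Vec<u8>>(2);

    tx.try_send(vec![0u8; 4]).unwrap();
    tx.try_send(vec![1u8; 4]).unwrap();

    // Queue is full: the send is rejected rather than growing memory.
    assert!(matches!(tx.try_send(vec![2u8; 4]), Err(TrySendError::Full(_))));

    // Draining one chunk frees a slot, so pushing can resume.
    let _ = rx.recv().unwrap();
    assert!(tx.try_send(vec![2u8; 4]).is_ok());
    println!("bounded queue enforces backpressure");
}
```

A blocking `send()` on the same channel would instead park the producer until a slot frees up, which is the behavior the async push path relies on for backpressure.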

Verified working

  • ✅ SDK build succeeds (0 errors, 0 clippy warnings)
  • ✅ 13 unit tests passing (JSON deserialization, settings defaults, error parsing, content envelope)
  • ✅ E2E pipeline: Microphone (48kHz/2ch/F32) → Resample (16kHz/mono/16-bit) → SDK → Core.dll → onnxruntime-genai.dll → nemotron model
  • ✅ Synthetic audio test: 30 chunks (96KB PCM) pushed with clean session lifecycle
  • ✅ Live microphone test: real-time capture, session start/stop, no native errors
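The conversion stage of the E2E pipeline above (48 kHz / 2-channel / F32 capture down to 16 kHz / mono / 16-bit PCM) can be sketched as follows. The real sample uses cpal plus a proper resampler; this simplified version averages stereo to mono and uses naive 3:1 decimation (48000 / 16000), which is a stand-in for real resampling, not the sample's actual algorithm.

```rust
/// Convert interleaved stereo f32 samples at 48 kHz into mono i16 at 16 kHz.
fn convert_audio(input: &[f32]) -> Vec<i16> {
    input
        .chunks_exact(2)                 // interleaved stereo frames [L, R]
        .map(|f| (f[0] + f[1]) / 2.0)    // downmix to mono
        .step_by(3)                      // naive 3:1 decimation: 48 kHz -> 16 kHz
        .map(|s| (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)
        .collect()
}

fn main() {
    // 48 stereo frames (1 ms at 48 kHz) -> 16 mono samples (1 ms at 16 kHz).
    let input = vec![0.5f32; 96];
    let out = convert_audio(&input);
    assert_eq!(out.len(), 16);
    println!("converted 48 stereo frames to {} mono samples", out.len());
}
```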

Stats

  • 14 files changed, 1,329 additions, 2 deletions

Copilot AI review requested due to automatic review settings April 8, 2026 19:45

Copilot AI left a comment

Pull request overview

This PR adds live (chunked) PCM audio transcription streaming to the Foundry Local Rust SDK, aligning the Rust API with the existing C# live audio transcription session feature and extending the SDK beyond file-based transcription.

Changes:

  • Introduces LiveAudioTranscriptionSession + associated response/options/types and stream wrapper in the Rust SDK.
  • Extends the Rust FFI bridge with execute_command_with_binary() to send JSON params + binary PCM payloads.
  • Adds integration test coverage and a new Rust sample demonstrating microphone capture (cpal) and streaming transcription.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Summary per file
  • sdk/rust/src/openai/live_audio_client.rs — New streaming session implementation, types, cancellation support, and unit tests
  • sdk/rust/src/detail/core_interop.rs — Adds StreamingRequestBuffer + optional execute_command_with_binary symbol support
  • sdk/rust/src/openai/audio_client.rs — Adds create_live_transcription_session() factory method
  • sdk/rust/src/detail/model.rs — Exposes Model::create_live_transcription_session()
  • sdk/rust/src/detail/model_variant.rs — Wires variant factory for live transcription sessions
  • sdk/rust/src/openai/mod.rs — Registers and re-exports live audio transcription module/types
  • sdk/rust/src/lib.rs — Public re-exports for the new live transcription session/types
  • sdk/rust/Cargo.toml — Adds tokio-util for CancellationToken
  • sdk/rust/tests/integration/main.rs — Registers the new integration test module
  • sdk/rust/tests/integration/live_audio_test.rs — New E2E-ish integration test using synthetic PCM audio
  • samples/rust/live-audio-transcription-example/src/main.rs — New microphone/synthetic streaming transcription sample
  • samples/rust/live-audio-transcription-example/Cargo.toml — Declares sample dependencies (cpal, tokio, sdk path dep)
  • samples/rust/Cargo.toml — Adds the new sample crate to the workspace
  • codex-feedback.md — Adds review/validation notes for the feature


rui-ren changed the title from "Add live audio transcription streaming support to Foundry Local Rust SDK" to "Add Nemotron-ASR streaming inference to Rust SDK" on Apr 14, 2026
ruiren_microsoft and others added 9 commits April 23, 2026 15:13
Port the C# live audio transcription feature (PR #485) to the Rust SDK
with full API parity.

New files:
- src/openai/live_audio_client.rs: LiveAudioTranscriptionSession with
  start/append/get_transcription_stream/stop lifecycle, response types,
  CoreErrorResponse, and unit tests
- tests/integration/live_audio_test.rs: E2E test with synthetic PCM audio

Modified files:
- src/detail/core_interop.rs: StreamingRequestBuffer FFI struct and
  execute_command_with_binary method for binary audio data
- src/openai/audio_client.rs: create_live_transcription_session() factory
- src/detail/model.rs, model_variant.rs: create_live_transcription_session()
- src/openai/mod.rs, src/lib.rs: Module and public type exports

API surface:
  let audio_client = model.create_audio_client();
  let session = audio_client.create_live_transcription_session();
  session.settings.sample_rate = 16000;
  session.start().await?;
  session.append(&pcm_bytes).await?;
  let mut stream = session.get_transcription_stream()?;
  // use tokio_stream::StreamExt;
  while let Some(result) = stream.next().await { ... }
  session.stop().await?;

Design highlights:
- Bounded push channel with backpressure (capacity=100)
- Push loop runs on blocking thread via spawn_blocking
- Fail-fast on native errors (no retry logic)
- Settings frozen at start() via clone snapshot
- Output channel completed on stop() after final result

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- core_interop.rs: Use std::ptr::null() for empty binary_data slices
  to avoid passing dangling pointer across FFI boundary
- live_audio_client.rs: Call native audio_stream_stop synchronously
  in Drop to prevent native session leaks when stop() is not called

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address codex-feedback.md parity gaps:

1. CancellationToken support: start/append/stop now accept
   Option<CancellationToken> (via tokio_util::sync::CancellationToken).
   stop() uses a cancel-safe pattern matching C# StopAsync: the native
   session stop is always performed even if the token fires.

2. Response envelope matches C#: LiveAudioTranscriptionResponse now
   has content: Vec<ContentPart> with text/transcript fields, so
   callers use result.content[0].text (identical to C# Content[0].Text).

3. Added tokio-util dependency for CancellationToken.

Updated E2E sample and integration test to use new API shape.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Sample: update download progress callback from &str to f64 to match
  upstream API change (PR #608)
- Apply cargo fmt to all SDK and sample files for CI compliance

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The function and its AudioClient import triggered -D warnings (dead_code)
in the CI build. The E2E test creates the session directly via
model.create_audio_client() and doesn't use this helper.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SDK: fix append() deadlock (clone tx before await), start()
cancellation leak, double-stop, get_transcription_stream() async,
remove unused field, fix error parsing, remove manual Unpin,
improve error messages, refactor stop() into helpers, rewrite
push_loop per reviewer suggestion.

Sample: extract convert_audio(), edition 2024, remove
codex-feedback.md, document BufferSize::Default, replace
Arc+spawn with channel-based mic forwarding.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

/// Await the start future with cancellation safety. If cancelled, any
/// native session that was already created is cleaned up.
async fn await_start(
Contributor:

Will this leak the handle if we cancel? Should cancellation call audio_stream_stop?

Contributor Author:

Fixed the leak concerns, thanks.

Comment thread sdk/rust/src/detail/model.rs Outdated

state
.push_tx
.as_ref()
Contributor:

Just clone

@nenad1002 (Contributor):

Run cargo fmt & cargo clippy

Comment thread sdk/rust/src/detail/model_variant.rs Outdated
EmbeddingClient::new(&self.info.id, Arc::clone(&self.core))
}

pub(crate) fn create_live_transcription_session(&self) -> LiveAudioTranscriptionSession {
Contributor:
Same comment as here

…implify clone

- Remove create_live_transcription_session() from Model and ModelVariant
  per bmehta001/kunal-vaishnavi: session should only be created via
  AudioClient, not directly from Model (matches C# pattern)
- Simplify .as_ref().cloned() to .clone() per nenad1002

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Always await start_future to completion; if cancellation was requested,
check is_cancelled() afterwards and call audio_stream_stop to clean up
any native session that was created. This prevents native handle leaks
when CancellationToken fires during start().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.



let _ = output_tx.send(Err(FoundryLocalError::CommandExecution {
reason: format!("Push failed (code={code}): {e}"),
}));
break;

Copilot AI Apr 24, 2026


On a fatal push error, the push loop breaks but the output channel sender stored in session state remains alive, so get_transcription_stream() will not terminate (it will yield the error once, then await forever) unless stop()/Drop is called to drop output_tx. Consider mirroring the JS/C# behavior by ensuring the stream is completed on fatal push errors (e.g., drop/close the state-held sender, or send a control event to an async task that finalizes state and closes the output channel).

Suggested change:
- break;
+ // Fatal push failures are terminal for the transcription stream.
+ // Drop the loop-owned sender and return immediately so the stream
+ // can complete once no other senders remain.
+ drop(output_tx);
+ return;

Comment on lines +70 to +76
while let Some(result) = stream.next().await {
match result {
Ok(r) => results_clone.lock().await.push(r),
Err(e) => {
eprintln!("Stream error: {e}");
break;
}

Copilot AI Apr 24, 2026


This integration test swallows transcription stream errors: the spawned read task only logs Err(e) and breaks, but the test still passes (especially if results ends up empty, since the final assertions are inside a for loop). To make this test meaningful as an E2E signal, propagate the first stream error back to the test (e.g., store it in shared state and assert!(error.is_none()) after stop()), and/or assert that at least one valid response was received once the session completes.

Comment on lines +550 to +552
let code = CoreErrorResponse::try_parse(&e.to_string())
.map(|ei| ei.code)
.unwrap_or_else(|| "UNKNOWN".into());

Copilot AI Apr 24, 2026


CoreErrorResponse::try_parse(&e.to_string()) will never successfully parse JSON from the native core because FoundryLocalError’s Display format prefixes the message (e.g., "command execution error: ..."). This makes code always fall back to "UNKNOWN" even when the core returns structured JSON. Match on FoundryLocalError::CommandExecution { reason } (or otherwise access the raw reason string) and pass that to try_parse instead of e.to_string().

Suggested change:
- let code = CoreErrorResponse::try_parse(&e.to_string())
-     .map(|ei| ei.code)
-     .unwrap_or_else(|| "UNKNOWN".into());
+ let code = match &e {
+     FoundryLocalError::CommandExecution { reason } => {
+         CoreErrorResponse::try_parse(reason)
+             .map(|ei| ei.code)
+             .unwrap_or_else(|| "UNKNOWN".into())
+     }
+     _ => "UNKNOWN".into(),
+ };
