#qwen #asr #text-to-speech #inference

qwen-asr

CPU-only Qwen3-ASR speech recognition (pure Rust)

11 releases (4 breaking)

0.5.0 Mar 21, 2026
0.4.2 Mar 14, 2026
0.3.0 Feb 23, 2026
0.2.3 Feb 22, 2026
0.1.2 Feb 22, 2026

#153 in Audio


141 downloads per month
Used in qwen-asr-cli

MIT license

320KB
7K SLoC

qwen_asr

CPU-only Qwen3-ASR speech recognition in pure Rust. No Python, no ONNX runtime, no framework dependencies — just libc and BLAS. BF16 weights stay memory-mapped for minimal RAM usage; SIMD kernels (NEON / AVX2+FMA) accelerate inference.

Prerequisites

  • Rust 1.70+
  • BLAS: Accelerate (macOS, linked automatically) or OpenBLAS (Linux)
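
If OpenBLAS is not already present on a Linux system, it can usually be installed from the distribution's package manager; for example, on Debian/Ubuntu (the package name may differ on other distributions):

sudo apt-get install libopenblas-dev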

Building

Platform-specific optimizations are detected automatically at compile time:

Platform                BLAS                SIMD
macOS (Apple Silicon)   Accelerate + vDSP   NEON (always available)
macOS (Intel)           Accelerate + vDSP   AVX2+FMA
Linux (x86_64)          OpenBLAS            AVX2+FMA
Linux (aarch64)         OpenBLAS            NEON
Other                   OpenBLAS            Generic scalar fallback

For best performance, build with native CPU tuning so the compiler can emit AVX2+FMA instructions on x86_64:

RUSTFLAGS="-C target-cpu=native" cargo build --release

On AArch64 (Apple Silicon, ARM Linux), NEON is part of the baseline instruction set, so no extra flags are needed; -C target-cpu=native is still recommended for general micro-architecture tuning.

Important: Always use --release mode. Debug builds are 10-50x slower due to missing optimizations and are not usable for real-time inference.

Model Download

# Install huggingface-cli if needed
pip install huggingface_hub

# Download the 0.6B model (~1.3 GB)
huggingface-cli download Qwen/Qwen3-ASR-0.6B --local-dir qwen3-asr-0.6b

# Download the 0.6B forced-aligner model (~1.3 GB)
huggingface-cli download Qwen/Qwen3-ASR-0.6B-Aligner --local-dir qwen3-aligner-0.6b

Usage
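
Add the crate to your Cargo project first, e.g. with cargo add (the published crate name is qwen-asr):

cargo add qwen-asr

A minimal offline transcription then looks like this: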

use qwen_asr::context::QwenCtx;
use qwen_asr::transcribe;

fn main() {
    // Load model (returns None on failure)
    let mut ctx = QwenCtx::load("qwen3-asr-0.6b").expect("failed to load model");

    // Transcribe a WAV file
    let text = transcribe::transcribe(&mut ctx, "audio.wav").unwrap();
    println!("{}", text);
}

Segmented Mode

For long audio files, split into overlapping segments to reduce memory usage and improve accuracy:

use qwen_asr::context::QwenCtx;
use qwen_asr::transcribe;

let mut ctx = QwenCtx::load("qwen3-asr-0.6b").unwrap();
ctx.segment_sec = 30.0; // split every ~30 seconds

let text = transcribe::transcribe(&mut ctx, "long-meeting.wav").unwrap();

Raw PCM Input

use qwen_asr::context::QwenCtx;
use qwen_asr::transcribe;

let mut ctx = QwenCtx::load("qwen3-asr-0.6b").unwrap();

// f32 samples at 16 kHz, mono, range [-1, 1]
let samples: Vec<f32> = load_audio_somehow(); // placeholder; see the WAV-loading sketch below
let text = transcribe::transcribe_audio(&mut ctx, &samples).unwrap();
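
One way to obtain such samples is to decode the WAV yourself. The sketch below uses the third-party hound crate (not part of qwen_asr) and assumes the file is already 16 kHz, mono, 16-bit PCM, so it only converts the samples to f32:

use hound::WavReader;

fn load_wav_16k_mono(path: &str) -> Vec<f32> {
    let mut reader = WavReader::open(path).expect("failed to open WAV");
    let spec = reader.spec();
    // qwen_asr expects 16 kHz mono input; resample/downmix beforehand if this fails
    assert_eq!(spec.sample_rate, 16_000, "expected 16 kHz audio");
    assert_eq!(spec.channels, 1, "expected mono audio");
    reader
        .samples::<i16>()
        .map(|s| s.expect("bad sample") as f32 / 32768.0)
        .collect()
}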

Streaming API

For real-time incremental transcription, use StreamState and stream_push_audio:

use qwen_asr::context::QwenCtx;
use qwen_asr::transcribe::{StreamState, stream_push_audio};

let mut ctx = QwenCtx::load("qwen3-asr-0.6b").unwrap();
let mut state = StreamState::new();

// As audio arrives (e.g., from a microphone), accumulate samples
let mut all_samples: Vec<f32> = Vec::new();
loop {
    let new_audio = get_audio_chunk(); // your audio source
    if new_audio.is_empty() {
        break; // audio source exhausted; fall through to the finalize call below
    }
    all_samples.extend_from_slice(&new_audio);

    // Push all accumulated audio; stream_push_audio tracks its own cursor
    if let Some(delta) = stream_push_audio(&mut ctx, &all_samples, &mut state, false) {
        if !delta.is_empty() {
            print!("{}", delta); // incremental output
        }
    }
}

// Finalize to flush remaining tokens
stream_push_audio(&mut ctx, &all_samples, &mut state, true);

Forced Alignment

Produce word-level timestamps for a known transcript. Requires the ForcedAligner model variant (Qwen3-ASR-0.6B-Aligner).

use qwen_asr::context::QwenCtx;
use qwen_asr::align;

let mut ctx = QwenCtx::load("qwen3-aligner-0.6b").unwrap();
let samples: Vec<f32> = load_audio_somehow();

let results = align::forced_align(&mut ctx, &samples, "Hello world", "English")
    .expect("alignment failed");

for r in &results {
    println!("{}: {:.0} ms – {:.0} ms", r.text, r.start_ms, r.end_ms);
}

CLI:

qwen-asr -d qwen3-aligner-0.6b -i audio.wav --align "Hello world" --align-language English

Each AlignResult contains the word text, start_ms, and end_ms timestamps. For CJK languages the text is split at character level; for others it is split on whitespace.
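
As a usage sketch, the alignment results can be written out as an SRT subtitle file. This continues the example above and assumes start_ms / end_ms hold millisecond values (their exact numeric type is not spelled out here, so the casts are an assumption):

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm
let ts = |ms: u64| format!("{:02}:{:02}:{:02},{:03}",
    ms / 3_600_000, (ms / 60_000) % 60, (ms / 1_000) % 60, ms % 1_000);

let mut srt = String::new();
for (i, r) in results.iter().enumerate() {
    let (start, end) = (r.start_ms as u64, r.end_ms as u64);
    srt.push_str(&format!("{}\n{} --> {}\n{}\n\n", i + 1, ts(start), ts(end), r.text));
}
std::fs::write("audio.srt", srt).expect("failed to write SRT");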

Feature Flags

Feature   Default   Description
blas      yes       Link Accelerate (macOS) or OpenBLAS (Linux) for matrix ops
vdsp      yes       Use vDSP/vForce from Accelerate for dot products and exp (macOS only)
ios       no        Build C-FFI API for iOS integration
android   no        Build C-FFI + JNI API for Android integration
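
Non-default features are enabled with standard Cargo flags. As an illustration, a cross-build of the iOS C-FFI layer might look like the following (the rustup step and target triple are standard tooling; iOS SDK and linker setup are outside the scope of this sketch):

rustup target add aarch64-apple-ios
cargo build --release --features ios --target aarch64-apple-ios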

Performance

Benchmarks on Apple M2 Pro (10-core), 0.6B model:

Mode                Audio   Wall Time   Realtime Factor
Offline             11 s    1.8 s       6.2x
Offline             28 s    4.0 s       7.0x
Segmented (-S 30)   45 s    4.6 s       9.8x
Streaming           28 s    10.4 s      2.7x
Streaming (live)    51 s    14.1 s      3.6x

License

MIT

Dependencies