#detection #voice #activity #speech #web-rtc #fixed-point #pure

no-std earshot

Ridiculously fast voice activity detection in pure #[no_std] Rust

1 unstable release

0.1.0 Sep 29, 2024

93 downloads per month
Used in voice-stream

BSD-3-Clause

66KB
1K SLoC

Earshot

Ridiculously fast, only slightly bad voice activity detection in pure Rust. Port of the famous WebRTC VAD.

Features

  • #![no_std], doesn't even require alloc
    • Internal buffers can get pretty big when stored on the stack, so the alloc feature is enabled by default, which allocates them on the heap instead.
  • Stupidly fast; uses only fixed-point arithmetic
    • Achieves a real-time factor (RTF) of ~3e-4 with 30 ms 48 kHz frames and ~3e-5 with 30 ms 8 kHz frames.
    • For comparison, Silero VAD v4 with ort achieves an RTF of ~3e-3 with 60 ms 16 kHz frames.
  • Okay accuracy
    • Great at distinguishing between silence and noise, but not between noise and speech.
    • Earshot provides alternative models with slight accuracy gains compared to the base WebRTC model.
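To make the RTF figures concrete: the real-time factor is the wall-clock processing time divided by the duration of the audio processed. An RTF of ~3e-4 on 30 ms frames therefore works out to roughly 9 µs of processing per frame. The sketch below is plain std Rust. It shows one way to measure such a number. `real_time_factor` and `measure_rtf` are illustrative helpers, not part of earshot, and this is not the benchmark that produced the figures above.

```rust
use std::time::{Duration, Instant};

/// Real-time factor: processing time divided by the duration of audio processed.
/// Values below 1.0 mean faster than real time.
fn real_time_factor(processing: Duration, audio: Duration) -> f64 {
    processing.as_secs_f64() / audio.as_secs_f64()
}

/// Runs `process` once per frame for `n_frames` frames of `frame_ms` milliseconds
/// each and returns the resulting RTF. `process` stands in for a per-frame VAD call.
fn measure_rtf<F: FnMut()>(n_frames: u32, frame_ms: u64, mut process: F) -> f64 {
    assert!(n_frames > 0 && frame_ms > 0, "need a non-zero amount of audio");
    let start = Instant::now();
    for _ in 0..n_frames {
        process();
    }
    let elapsed = start.elapsed();
    let audio = Duration::from_millis(frame_ms) * n_frames;
    real_time_factor(elapsed, audio)
}

fn main() {
    // Worked example: an RTF of ~3e-4 applied to a 30 ms frame.
    let per_frame_secs = 3e-4 * 0.030;
    println!("~{:.1} µs per 30 ms frame", per_frame_secs * 1e6);

    // Placeholder workload. In practice the closure would run the detector on one
    // frame (hypothetically, something like `vad.predict_16khz(&frame)`).
    let mut acc: u64 = 0;
    let rtf = measure_rtf(1_000, 30, || acc = std::hint::black_box(acc.wrapping_add(1)));
    println!("measured RTF: {rtf:.2e}");
}
```

Measuring over many frames rather than one amortizes timer overhead, which matters when a single frame takes only microseconds.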

lib.rs:

Earshot is a fast voice activity detection library.

For more details, see VoiceActivityDetector.

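use earshot::{VoiceActivityDetector, VoiceActivityProfile};

let mut vad = VoiceActivityDetector::new(VoiceActivityProfile::VERY_AGGRESSIVE);

// Stand-in for a real audio source: three 30 ms frames of 16 kHz silence
// (480 samples each, assuming mono i16 samples).
let stream = vec![vec![0i16; 480]; 3];

for frame in &stream {
	// predict_16khz returns a Result; unwrap panics if the frame is rejected.
	let is_speech_detected = vad.predict_16khz(frame).unwrap();
	// Pure silence should not be classified as speech.
	assert!(!is_speech_detected);
}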
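The example above assumes frames of a valid size. The WebRTC VAD, which earshot ports, accepts 10, 20 or 30 ms frames. The sketch below assumes that earshot keeps the same restriction and takes mono i16 samples. It shows how to size and split a buffer accordingly. `frame_len` and `frames` are hypothetical helpers, not part of earshot's API.

```rust
/// Samples in one frame of `frame_ms` milliseconds at `sample_rate_hz`.
/// Assumption: only the WebRTC VAD durations (10, 20, 30 ms) are valid.
fn frame_len(sample_rate_hz: u32, frame_ms: u32) -> Option<usize> {
    match frame_ms {
        10 | 20 | 30 => Some(sample_rate_hz as usize * frame_ms as usize / 1000),
        _ => None,
    }
}

/// Splits a mono buffer into consecutive full frames.
/// A trailing partial frame is dropped.
fn frames<'a>(
    samples: &'a [i16],
    sample_rate_hz: u32,
    frame_ms: u32,
) -> impl Iterator<Item = &'a [i16]> + 'a {
    let len = frame_len(sample_rate_hz, frame_ms).expect("unsupported frame duration");
    samples.chunks_exact(len)
}

fn main() {
    // One second of 16 kHz silence: 33 full 30 ms frames, with 160 samples left over.
    let buffer = vec![0i16; 16_000];
    for frame in frames(&buffer, 16_000, 30) {
        // Each `frame` is a 480-sample &[i16], the size passed to predict_16khz above.
        assert_eq!(frame.len(), 480);
    }
    println!("{} full frames", frames(&buffer, 16_000, 30).count());
}
```

Dropping the remainder keeps every frame at a size the detector is assumed to accept. A streaming caller would instead carry the leftover samples into the next buffer.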

No runtime deps