#audio-stream #voice #detection #stream-processing #sample-rate #web-rtc #input

voice-stream

Voice Stream is a real-time audio stream processing library with voice activity detection. It provides a high-level interface for capturing audio input, performing voice detection with both WebRTC VAD and Silero VAD, and processing audio streams.

3 releases (breaking)

0.3.0 Nov 18, 2024
0.2.0 Nov 16, 2024
0.1.0 Nov 16, 2024

#130 in Audio

282 downloads per month

BSD-3-Clause

38KB
633 lines

Voice Stream

A Rust library for real-time voice activity detection and audio stream processing. This library provides a high-level interface for capturing audio input, performing voice detection using both WebRTC VAD and Silero VAD, and processing audio streams.

Features

  • Real-time audio capture from input devices
  • Audio resampling to desired sample rate (default 16kHz)
  • Dual voice activity detection using:
    • WebRTC VAD
    • Silero VAD
  • Configurable buffer sizes and voice detection parameters
  • Channel-based audio data transmission
  • Support for multiple sample formats (I8, I16, I32, F32)
  • Conversion from multi-channel audio to mono
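The sample-format conversion and mono downmix features can be illustrated with a minimal std-only sketch. `i16_to_f32` and `to_mono` are hypothetical helpers for illustration, not the library's API, and the `i16::MAX` normalization shown is just one common convention:

```rust
/// Normalize an i16 PCM sample to f32 in roughly [-1.0, 1.0].
fn i16_to_f32(s: i16) -> f32 {
    s as f32 / i16::MAX as f32
}

/// Downmix interleaved multi-channel f32 samples to mono by
/// averaging each frame's channels.
fn to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

fn main() {
    // Two interleaved stereo frames: (0.5, -0.5) and (1.0, 0.0).
    let stereo = [0.5_f32, -0.5, 1.0, 0.0];
    println!("{:?}", to_mono(&stereo, 2)); // [0.0, 0.5]
    println!("{}", i16_to_f32(i16::MAX)); // 1
}
```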

Usage

use voice_stream::VoiceStream;
use voice_stream::cpal::traits::StreamTrait;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a default voice stream with receiver
    let (voice_stream, receiver) = VoiceStream::default_device().unwrap();

    // Start capturing audio
    voice_stream.play().unwrap();

    // Receive voice data chunks
    for voice_data in receiver {
        // Process voice data (Vec<f32>)
        println!("Received voice data chunk of size: {}", voice_data.len());
    }

    Ok(())
}
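Iterating the receiver as above blocks until the next chunk arrives. If you need a non-blocking poll instead, the receiver is a standard `std::sync::mpsc::Receiver<Vec<f32>>` (the builder example below wires an mpsc channel), so `try_recv` also works. A sketch with a hypothetical `drain_pending` helper, using a plain channel as a stand-in for the stream:

```rust
use std::sync::mpsc::{channel, Receiver};

/// Drain all currently pending voice chunks without blocking;
/// returns the total number of samples seen.
fn drain_pending(rx: &Receiver<Vec<f32>>) -> usize {
    let mut total = 0;
    // try_recv returns Err immediately when no chunk is queued.
    while let Ok(chunk) = rx.try_recv() {
        total += chunk.len();
    }
    total
}

fn main() {
    // Stand-in for the channel the library feeds voice chunks into.
    let (tx, rx) = channel::<Vec<f32>>();
    tx.send(vec![0.0; 512]).unwrap();
    tx.send(vec![0.0; 512]).unwrap();
    println!("drained {} samples", drain_pending(&rx)); // drained 1024 samples
}
```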

Diagram

flowchart TD
    Start --> Capture[Capture audio input from device mono/multi channels at various sample rates]
    Capture --> Convert
    IntoMono --> TakeBuffer[Buffer f32 samples to at least 512 size]
    TakeBuffer --> Step1[Split off samples buffer when >= 512]

    subgraph Resampler
        %% Nodes
        Convert[Convert i8, i16, i32 or f32 samples to f32]
        Resample[Resample to target sample rate 16,000 Hz]
        IntoMono[Convert multi channel sound to mono]

        %% Flow connections
        Convert --> Resample --> IntoMono
    end

    subgraph Voice Detection
        %% Nodes
        Step1[Convert to 8kHz and check is_noise]
        webrtc[is_noise = webrtc_vad_is_noise samples]
        Step2[Get predict from silero_vad_prediction]
        silero[predict = silero_vad_prediction samples]
        is_voice[is_voice = predict > silero_vad_voice_threshold]
        Decision{Match is_noise, is_voice}

        %% Subgraphs for Each Case
        subgraph CaseTrueTrue ["Case: is_noise and is_voice"]
            ActionTT[Accumulate samples into samples_buffer]
            ReturnTT[Return None]
        end

        subgraph CaseTrueFalse ["Case: is_noise and !is_voice"]
            ActionTF[Clear samples_buffer]
            ReturnTF[Return None]
        end

        subgraph CaseFalse ["Case: !is_noise"]
            ActionF[Push predict to silero_predict_buffer]
            BufferEmpty{Is samples_buffer empty?}
            ReturnNone[Return None]
            ReturnSamples[Return all voice samples]
        end

        %% Flow connections
        Step1 --> webrtc --> Step2 --> silero --> is_voice --> Decision

        %% Decision branches
        Decision -->|is_noise = true and is_voice = true| CaseTrueTrue
        CaseTrueTrue --> ReturnTT

        Decision -->|is_noise = true and is_voice = false| CaseTrueFalse
        CaseTrueFalse --> ReturnTF

        Decision -->|is_noise = false| CaseFalse
        CaseFalse --> BufferEmpty
        BufferEmpty -->|Yes| ReturnNone
        BufferEmpty -->|No| ReturnSamples
    end

    %%Nodes
    ProcessVoiceDetectionSamples{Process voice detection}
    ChannelSendData{Channel send}
    NoiseDiscard[Discarded as noise]
    User[User channel receiver]

    ReturnNone -->|None| ProcessVoiceDetectionSamples
    ReturnTT -->|None| ProcessVoiceDetectionSamples
    ReturnTF -->|None| ProcessVoiceDetectionSamples
    ReturnSamples -->|Some| ProcessVoiceDetectionSamples
    ChannelSendData --> User

    %% ChannelSendData branches
    ProcessVoiceDetectionSamples -->|Some voice| ChannelSendData

    ProcessVoiceDetectionSamples -->|No voice| NoiseDiscard
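The decision table in the diagram can be sketched as a plain Rust `match`. This is an illustrative reconstruction, not the library's internals: `VadStep` and `process` are hypothetical names, `is_noise` stands for the WebRTC VAD result and `is_voice` for Silero's prediction compared against the threshold (the `silero_predict_buffer` bookkeeping is omitted):

```rust
/// Outcome of one voice-detection step, mirroring the diagram above.
enum VadStep {
    /// Nothing to emit yet (still buffering, or buffer was cleared).
    None,
    /// A complete voice segment is ready to send on the channel.
    Voice(Vec<f32>),
}

/// Hypothetical sketch of the diagram's three cases.
fn process(buffer: &mut Vec<f32>, samples: &[f32], is_noise: bool, is_voice: bool) -> VadStep {
    match (is_noise, is_voice) {
        // Both detectors fire: accumulate samples and keep waiting.
        (true, true) => {
            buffer.extend_from_slice(samples);
            VadStep::None
        }
        // WebRTC hears something but Silero sees no voice: discard.
        (true, false) => {
            buffer.clear();
            VadStep::None
        }
        // Silence: flush the buffered voice segment, if any.
        (false, _) => {
            if buffer.is_empty() {
                VadStep::None
            } else {
                VadStep::Voice(std::mem::take(buffer))
            }
        }
    }
}

fn main() {
    let mut buf = Vec::new();
    process(&mut buf, &[0.1, 0.2], true, true); // accumulates 2 samples
    if let VadStep::Voice(segment) = process(&mut buf, &[], false, false) {
        println!("flushed {} samples", segment.len()); // flushed 2 samples
    }
}
```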

Advanced Configuration

The library provides a builder pattern for advanced configuration:

use voice_stream::cpal;
use voice_stream::cpal::traits::{DeviceTrait, HostTrait};
use voice_stream::{VoiceStreamBuilder, WebRtcVoiceActivityProfile};

let (tx, rx) = std::sync::mpsc::channel();

let host = cpal::default_host();

let select_device = "default";

// Set up the input device and stream with the default input config.
let device = if select_device == "default" {
    host.default_input_device()
} else {
    host.input_devices()
        .expect("Failed to get input devices")
        .find(|x| x.name().map(|y| y == select_device).unwrap_or(false))
}
.expect("failed to find input device");

let config = device
    .default_input_config()
    .expect("Failed to get default input config");

let voice_stream = VoiceStreamBuilder::new(config, device, tx)
    .with_sound_buffer_until_size(1024)
    .with_voice_detection_silero_voice_threshold(0.5)
    .with_voice_detection_webrtc_profile(WebRtcVoiceActivityProfile::AGGRESSIVE)
    .build()
    .unwrap();
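The buffer-size setting relates to the "split off samples buffer when >= 512" step in the diagram: samples accumulate until a full chunk is available, and the remainder stays buffered for the next round. A std-only sketch of that split-off logic (`split_off_chunks` is a hypothetical helper, not part of the library):

```rust
/// Append incoming samples and split off every full chunk of
/// `chunk_size` samples, leaving any remainder in the buffer.
fn split_off_chunks(buffer: &mut Vec<f32>, incoming: &[f32], chunk_size: usize) -> Vec<Vec<f32>> {
    buffer.extend_from_slice(incoming);
    let mut chunks = Vec::new();
    while buffer.len() >= chunk_size {
        // split_off keeps the first chunk_size samples in `buffer`
        // and returns the rest; swap them so the chunk moves out.
        let rest = buffer.split_off(chunk_size);
        chunks.push(std::mem::replace(buffer, rest));
    }
    chunks
}

fn main() {
    let mut buf = Vec::new();
    let chunks = split_off_chunks(&mut buf, &[0.0; 700], 512);
    println!("{} chunk(s), {} left over", chunks.len(), buf.len()); // 1 chunk(s), 188 left over
}
```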

Dependencies

~8–41MB
~561K SLoC