stt

A library for transcription using the Whisper AI model.

1 unstable release

0.0.1 Nov 20, 2022

Used in 2 crates (via jupiter_downloader)

MIT license


Jupiter Search


A complete set of tools for making your favourite podcast searchable.

Originally created for Jupiter network podcasts, using Meilisearch.

Overview

The project contains two main modules:

  • podcast2text, a CLI tool for downloading an RSS feed and transcribing podcast episodes
  • search-load, a CLI tool for loading the obtained transcriptions into a Meilisearch instance

Getting started

To build, you will need the following packages on your system:

  • cargo
  • pkg-config
  • openssl
  • ffmpeg

A Nix flake is configured to provide the build dependencies; run direnv allow, then:

cargo build --release

To appease the gods of good taste, please add the following pre-commit hook:

git config --local core.hooksPath .githooks
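The hook itself is not shown here; as a sketch, assuming a standard Rust formatting check, a hook under .githooks could look like this (the cargo fmt check is an assumption, not the repository's actual hook):

```shell
# Create a hypothetical pre-commit hook; the cargo fmt check below is an
# assumption, not the repository's actual hook contents.
hooks_dir=$(mktemp -d)/.githooks
mkdir -p "$hooks_dir"
cat > "$hooks_dir/pre-commit" <<'EOF'
#!/bin/sh
# Refuse the commit if the Rust sources are not formatted.
set -e
cargo fmt --all -- --check
EOF
chmod +x "$hooks_dir/pre-commit"
```

With core.hooksPath set as above, git runs this script before every commit.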

Usage

Downloading podcasts

Process audio from an RSS feed

  1. Create cache directories and download the Whisper model
mkdir -p {models,output}
# this might be one of:
# "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large"
model=tiny.en

curl -L --output models/$model.bin https://huggingface.co/datasets/ggerganov/whisper.cpp/resolve/main/ggml-$model.bin
  2. Run the inference on the RSS feed
# get information about the cli
docker run flakm/podcast2text rss --help

docker run \
    -v $PWD/models:/data/models \
    -v $PWD/output:/data/output \
    flakm/podcast2text \
    rss \
    --num-of-episodes 2 \
    https://feed.jupiter.zone/allshows 

# or using cargo
cargo run --bin podcast2text --release -- \
    --model-path=models/tiny.en.bin \
    --output-dir=output/ \
    --threads-per-worker=4 \
    --download-dir=catalog \
    rss \
    --worker-count=6 \
    https://feed.jupiter.zone/allshows 

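Since only the model names listed in step 1 are published in the whisper.cpp model repository, it can be worth validating the name before spending bandwidth on the download. A minimal sketch (this check is not part of the project itself):

```shell
# Validate a Whisper model name against the names published in the
# whisper.cpp model repository before constructing the download URL.
model=tiny.en

case "$model" in
  tiny.en|tiny|base.en|base|small.en|small|medium.en|medium|large)
    url="https://huggingface.co/datasets/ggerganov/whisper.cpp/resolve/main/ggml-$model.bin"
    echo "ok: $url"
    ;;
  *)
    echo "unknown model: $model" >&2
    exit 1
    ;;
esac
```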
The output directory should now contain JSON files with each episode's transcription and metadata. Note that the results are cached, so restarting the job will not redownload or reprocess RSS entries that have already been seen.
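The caching behaviour can be pictured with a small hypothetical sketch (the real implementation lives inside podcast2text): an entry is only processed when its output JSON does not yet exist.

```shell
# Hypothetical sketch of the cache check: an episode is transcribed only
# when its JSON output is not already present in the output directory.
output_dir=$(mktemp -d)

transcribe() {
  out="$output_dir/$1.json"
  if [ -f "$out" ]; then
    echo "cached: $1"
  else
    # stand-in for the actual download and transcription work
    echo '{"transcript": "..."}' > "$out"
    echo "processed: $1"
  fi
}

first=$(transcribe episode-1)   # runs the work
second=$(transcribe episode-1)  # hits the cache
```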

Create search engine

Install Meilisearch

The project uses Meilisearch as the back-end engine for the search functionality:

docker pull getmeili/meilisearch:v0.29
docker run -it --rm \
    -p 7700:7700 \
    -e MEILI_MASTER_KEY='MASTER_KEY'\
    -v $(pwd)/meili_data:/meili_data \
    getmeili/meilisearch:v0.29 \
    meilisearch --env="development"
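Once the instance is up, an index for the transcriptions can be created over Meilisearch's HTTP API. A sketch, guarded so it is a no-op when no instance is reachable; the index name "episodes" and primary key "id" are assumptions, not fixed by the project:

```shell
# Create a search index on the local Meilisearch instance. The index
# name "episodes" and primary key "id" are assumed for illustration.
meili_url="http://localhost:7700"

if command -v curl >/dev/null 2>&1 && curl -sf "$meili_url/health" >/dev/null 2>&1; then
  # POST /indexes is the standard Meilisearch index-creation endpoint.
  curl -s -X POST "$meili_url/indexes" \
    -H 'Authorization: Bearer MASTER_KEY' \
    -H 'Content-Type: application/json' \
    --data '{"uid": "episodes", "primaryKey": "id"}'
  status=requested
else
  status=unreachable
fi
```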

Run index creation and data loading
