jupiter_downloader

Library for downloading podcast episodes from an RSS feed and transcribing them

1 unstable release

0.0.1 Nov 20, 2022


Used in podcast2text

MIT license


Jupiter Search

Crates.io MIT licensed APACHE 2 licensed Build Status

A complete set of tools for making your favourite podcast searchable.

Originally created for Jupiter network podcasts, using Meilisearch as the search engine.

Overview

The project contains two main modules:

  • podcast2text: a CLI tool that downloads an RSS feed and transcribes its podcast episodes
  • search-load: a CLI tool that loads the resulting transcriptions into a Meilisearch instance

Getting started

To build the project you need the following packages on your system:

  • cargo
  • pkg-config
  • openssl
  • ffmpeg
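
A quick way to see which of these are missing is a small shell check. This is only a sketch: missing_tools is a hypothetical helper, not part of the project, and it only looks for executables on PATH, so it cannot tell whether the OpenSSL development files that pkg-config needs to locate are installed.

```shell
# Hypothetical helper (not part of this project):
# print the name of every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call, left commented so that sourcing this file has no side effects:
# missing_tools cargo pkg-config openssl ffmpeg
```

If the call prints nothing, all four tools were found; otherwise each missing tool is printed on its own line.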

A Nix flake is configured to provide the build dependencies. Run direnv allow to load it, then build with:

cargo build --release

To appease the gods of good taste, please add the following pre-commit hook:

git config --local core.hooksPath .githooks

Usage

Download podcasts

Process audio from RSS feed

  1. Create cache directories and download the whisper model
mkdir -p {models,output}
# this might be one of:
# "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large"
model=tiny.en

curl -L --output models/$model.bin https://huggingface.co/datasets/ggerganov/whisper.cpp/resolve/main/ggml-$model.bin
  2. Run inference on the RSS feed
# get information about the cli
docker run flakm/podcast2text rss --help

docker run \
    -v $PWD/models:/data/models \
    -v $PWD/output:/data/output \
    flakm/podcast2text \
    rss \
    --num-of-episodes 2 \
    https://feed.jupiter.zone/allshows 

# or using cargo
cargo run --bin podcast2text --release -- \
    --model-path=models/tiny.en.bin \
    --output-dir=output/ \
    --threads-per-worker=4 \
    --download-dir=catalog \
    rss \
    --worker-count=6 \
    https://feed.jupiter.zone/allshows 

The output directory should now contain JSON files with each episode's transcription and metadata. Results are cached, so if you restart the job it will not re-download or re-process RSS entries it has already seen.
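
To guard against typos in the model name from step 1, the download URL can be built by a small function that accepts only the names listed there. This is a sketch: model_url is a hypothetical helper, and the URL pattern is copied from the curl command in step 1.

```shell
# Hypothetical helper (not part of this project): print the download URL for a
# model name from the list in step 1, or return status 1 for any other name.
model_url() {
    case "$1" in
        tiny.en|tiny|base.en|base|small.en|small|medium.en|medium|large)
            printf 'https://huggingface.co/datasets/ggerganov/whisper.cpp/resolve/main/ggml-%s.bin\n' "$1"
            ;;
        *)
            echo "unknown model: $1" >&2
            return 1
            ;;
    esac
}

# Example call, left commented so that sourcing this file downloads nothing:
# url=$(model_url "$model") && curl -L --output "models/$model.bin" "$url"
```

With this in place, a mistyped model name fails before curl runs instead of saving an error page as models/$model.bin.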

Create search engine

Install meilisearch

The project uses Meilisearch as the back end for its search functionality.

docker pull getmeili/meilisearch:v0.29
docker run -it --rm \
    -p 7700:7700 \
    -e MEILI_MASTER_KEY='MASTER_KEY' \
    -v $(pwd)/meili_data:/meili_data \
    getmeili/meilisearch:v0.29 \
    meilisearch --env="development"
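
Before loading any data it can help to confirm that the container is accepting requests. The sketch below polls the Meilisearch /health endpoint on the port published above. The function names, the default of 10 attempts and the one-second pause are illustrative assumptions, not part of this project.

```shell
# Hypothetical helper: succeed if a /health response body reports "available".
meili_is_available() {
    case "$1" in
        *'"status"'*'"available"'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: poll the health endpoint until the instance answers
# or the attempts run out.
# Arguments: base URL (default http://localhost:7700), attempts (default 10).
wait_for_meili() {
    url=${1:-http://localhost:7700}
    attempts=${2:-10}
    while [ "$attempts" -gt 0 ]; do
        # An unreachable server yields an empty body, which counts as not ready.
        body=$(curl -s "$url/health" || true)
        meili_is_available "$body" && return 0
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Example call, left commented so that sourcing this file makes no requests:
# wait_for_meili http://localhost:7700 30 && echo "Meilisearch is ready"
```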

Run index creation and data loading
