1 unstable release: 0.0.1 (Nov 20, 2022)
Jupiter Search
A complete set of tools for making your favourite podcast searchable.
Originally created for the jupiter network podcasts, using meilisearch.
Overview
The project contains two main modules:
- podcast2text: a CLI tool for downloading an RSS feed and transcribing podcast episodes
- search-load: a CLI tool for loading the obtained transcriptions into an instance of meilisearch
Getting started
To build the project you need the following packages on your system:
- cargo
- pkg-config
- openssl
- ffmpeg
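A quick way to check the list above before building is to look for each tool on PATH. The following is a minimal sketch, not part of the project: `missing_tools` is a hypothetical helper name, and finding the `openssl` command does not prove that the OpenSSL development headers are installed.

```shell
#!/bin/sh
# Print each of the given commands that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Tools named in the list above. Assumption: openssl is checked through its
# CLI only, which says nothing about the headers pkg-config needs to find.
missing=$(missing_tools cargo pkg-config openssl ffmpeg)
if [ -n "$missing" ]; then
    echo "missing build tools:" $missing
else
    echo "all build tools found"
fi
```

If anything is reported missing, install it with your system package manager or use the nix flake.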
There is also a nix flake configured to ship the build dependencies.
To use it, just run direnv allow
and then run:
cargo build --release
To appease the gods of good taste, please enable the following pre-commit hooks:
git config --local core.hooksPath .githooks
Usage
Download and transcribe podcasts
Process audio from RSS feed
- Create cache directories and download the whisper model
mkdir -p {models,output}
# this might be one of:
# "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large"
model=tiny.en
curl -L --output models/$model.bin https://huggingface.co/datasets/ggerganov/whisper.cpp/resolve/main/ggml-$model.bin
- Run the inference on the RSS feed
# get information about the cli
docker run flakm/podcast2text rss --help
docker run \
-v $PWD/models:/data/models \
-v $PWD/output:/data/output \
flakm/podcast2text \
rss \
--num-of-episodes 2 \
https://feed.jupiter.zone/allshows
# or using cargo
cargo run --bin podcast2text --release -- \
--model-path=models/tiny.en.bin \
--output-dir=output/ \
--threads-per-worker=4 \
--download-dir=catalog \
rss \
--worker-count=6 \
https://feed.jupiter.zone/allshows
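The model download from the first step can be guarded against typos in the model name. The sketch below assumes the model names and the download URL shown above; `is_known_model`, `model_url` and `fetch_model` are hypothetical helper names, not part of podcast2text.

```shell
#!/bin/sh
# Succeed only if the argument is one of the model names listed above.
is_known_model() {
    case "$1" in
        tiny.en|tiny|base.en|base|small.en|small|medium.en|medium|large) return 0 ;;
        *) return 1 ;;
    esac
}

# Build the download URL used by the curl command above.
model_url() {
    printf 'https://huggingface.co/datasets/ggerganov/whisper.cpp/resolve/main/ggml-%s.bin\n' "$1"
}

# Hypothetical helper: download a model into models/ unless the name is
# unknown or the file is already present.
fetch_model() {
    model=$1
    if ! is_known_model "$model"; then
        echo "unknown model: $model" >&2
        return 1
    fi
    target="models/$model.bin"
    if [ -f "$target" ]; then
        return 0
    fi
    mkdir -p models
    curl -L --fail --output "$target" "$(model_url "$model")"
}
```

With these functions defined, `fetch_model tiny.en` performs the same download as the curl command in the first step, and `fetch_model tiny.eng` fails before any network request is made.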
The output directory should now contain JSON files with each episode's transcription and metadata. Note that the results are cached, so if you restart the job it will not re-download or re-process RSS entries it has already seen.
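One way to sanity-check a run is to count the JSON files it produced. This is a minimal sketch: `count_json` is a hypothetical helper, and it assumes the files sit directly in the output directory with a `.json` extension, which the text above does not spell out.

```shell
#!/bin/sh
# Count regular files ending in .json directly inside the given directory.
count_json() {
    find "$1" -maxdepth 1 -type f -name '*.json' | wc -l | tr -d ' '
}

# Report the count for the output directory used above, if it exists.
if [ -d output ]; then
    echo "transcripts in output/: $(count_json output)"
fi
```

Running it before and after restarting the job is a simple way to see how many new entries were processed.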
Create search engine
Install meilisearch
The project uses meilisearch as the back end for its search functionality.
docker pull getmeili/meilisearch:v0.29
docker run -it --rm \
-p 7700:7700 \
-e MEILI_MASTER_KEY='MASTER_KEY' \
-v $(pwd)/meili_data:/meili_data \
getmeili/meilisearch:v0.29 \
meilisearch --env="development"
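Since the container above runs in the foreground, you can confirm from a second terminal that the instance is up before loading data. The sketch below polls Meilisearch's `/health` endpoint. `wait_for_meili` is a hypothetical helper name, and the default URL assumes the `-p 7700:7700` port mapping used above.

```shell
#!/bin/sh
# Poll the Meilisearch health endpoint until it answers successfully or the
# attempts run out. Arguments: base URL (default http://localhost:7700) and
# number of attempts (default 10), one second apart.
wait_for_meili() {
    url=${1:-http://localhost:7700}
    attempts=${2:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl --silent --fail "$url/health" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

For example, `wait_for_meili && echo "meilisearch is ready"` prints the message once the server responds, and exits with a non-zero status if it never does.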
Run index creation and data loading