3 releases
Version | Released |
---|---|
0.0.3 | Nov 21, 2022 |
0.0.2 | Nov 21, 2022 |
0.0.1 | Nov 21, 2022 |
Jupiter Search
A showcase for indexing Jupiter network podcasts using meilisearch. This repository was built to provide a possible solution to the following problems:
DISCLAIMER!
Warning! This is a work-in-progress version that showcases how indexing and transcription might work.
Overview
The project contains two main modules:
- podcast2text: a CLI tool for downloading an RSS feed and transcribing podcast episodes
- search-load: a CLI tool for loading the obtained transcriptions into a meilisearch instance
Building
To build the project, you need the following packages on your system:
- cargo
- pkg-config
- openssl
- ffmpeg
There is a nix flake configured to provide the build dependencies; to use it, run direnv allow.
Then build the project:
git submodule update --init --recursive
cargo build --release
To appease the gods of good taste, please add the following pre-commit hook:
git config --local core.hooksPath .githooks
Usage
Download and transcribe podcasts
Process audio from RSS feed
- Download the whisper model
mkdir models
# this might be one of:
# "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large"
model=medium.en
curl --output models/model.bin https://ggml.ggerganov.com/ggml-model-whisper-$model.bin
- Run the inference on the RSS feed
# get information about the cli
docker run flakm/podcast2text --help
docker run \
-v $PWD/models:/data/models \
flakm/podcast2text \
rss https://feed.jupiter.zone/allshows
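To get a feel for what the rss subcommand has to work through, the audio files referenced by a feed can be listed by hand. The following is only a rough sketch and not how podcast2text is implemented: list_enclosures is a hypothetical helper that assumes the feed uses standard RSS enclosure tags with a double-quoted url attribute.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of podcast2text): print every enclosure URL
# found in an RSS document read from standard input, one URL per line.
# Assumes double-quoted attributes; single-quoted ones are not matched.
list_enclosures() {
  # grep -o emits each matching tag prefix on its own line,
  # then sed keeps only the value of the url attribute.
  grep -o '<enclosure[^>]*url="[^"]*"' \
    | sed 's/.*url="\([^"]*\)".*/\1/'
}
```

For example, curl -sL https://feed.jupiter.zone/allshows | list_enclosures | head -n 3 should print the first few audio URLs of the feed, provided the feed is reachable and follows that layout.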
Install meilisearch
docker pull getmeili/meilisearch:v0.29
docker run -it --rm \
-p 7700:7700 \
-e MEILI_MASTER_KEY='MASTER_KEY' \
-v $(pwd)/meili_data:/meili_data \
getmeili/meilisearch:v0.29 \
meilisearch --env="development"
Run index creation and data loading
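The exact search-load invocation is not shown here. As a rough illustration of what loading and querying involve at the HTTP level, the sketch below talks to the Meilisearch REST API directly with curl. It is not the search-load tool: the index name episodes and the helper names are hypothetical, and the address and key are assumed to match the docker command above.

```shell
#!/usr/bin/env bash
# Assumed values: address and key match the docker command above;
# the index name "episodes" is hypothetical.
MEILI_URL="${MEILI_URL:-http://localhost:7700}"
MEILI_KEY="${MEILI_KEY:-MASTER_KEY}"
MEILI_INDEX="${MEILI_INDEX:-episodes}"

# Print the URL of an endpoint ("documents" or "search") under the index.
index_url() {
  printf '%s/indexes/%s/%s\n' "$MEILI_URL" "$MEILI_INDEX" "$1"
}

# Upload a JSON file holding an array of documents.
# Meilisearch creates the index on first upload and queues the work as a task.
load_documents() {
  curl -sS -X POST "$(index_url documents)" \
    -H "Authorization: Bearer $MEILI_KEY" \
    -H 'Content-Type: application/json' \
    --data-binary "@$1"
}

# Run a search query. The query is pasted into JSON unescaped, so it must
# not contain double quotes or backslashes.
search() {
  curl -sS -X POST "$(index_url search)" \
    -H "Authorization: Bearer $MEILI_KEY" \
    -H 'Content-Type: application/json' \
    --data-binary "{\"q\": \"$1\"}"
}
```

With a running instance, load_documents episodes.json followed by search linux would upload and then query the data, where episodes.json is a hypothetical file such as [{"id": 1, "title": "...", "text": "..."}].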
Running inference on some audio
- Download whisper model
mkdir models
# this might be one of:
# "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large"
model=medium.en
curl --output models/ggml-$model.bin https://ggml.ggerganov.com/ggml-model-whisper-$model.bin
- Download the example audio from the RSS feed
curl https://feed.jupiter.zone/link/19057/15745245/55bb5263-04be-43a3-8b92-678072a9cfc8.mp3 -L -o action.mp3
- Install ffmpeg and put it on the PATH variable.
- Run the inference example
cargo run --release --example=get_transcript -- models/ggml-medium.en.bin action_short.wav | tee output.txt
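The download step produces action.mp3, while the example command reads action_short.wav, so a conversion step is presumably needed in between. A minimal sketch, assuming the example expects the 16 kHz, mono, 16-bit PCM WAV input that whisper.cpp works with; the helper name to_whisper_wav and the 60 second default clip length are assumptions, not taken from the project.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert an audio file to 16 kHz mono 16-bit PCM WAV.
#   $1 = input file, $2 = output file,
#   $3 = clip length in seconds (assumed default: 60)
to_whisper_wav() {
  # -y overwrites the output, -t limits the duration,
  # -ar sets the sample rate, -ac the channel count,
  # -c:a pcm_s16le selects 16-bit little-endian PCM.
  ffmpeg -y -i "$1" -t "${3:-60}" -ar 16000 -ac 1 -c:a pcm_s16le "$2"
}
```

Running to_whisper_wav action.mp3 action_short.wav before the cargo command would then provide the file that the example reads.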