# Ragtime
Ragtime is a Rust library that makes it easier to build self-hosted retrieval-augmented generation (RAG) applications. It currently uses Phi3 for question answering and summarization, and supports multiple embedding models. A generic model interface facilitates integrating additional models as they become available.

ONNX and llama.cpp backends are currently supported for running models; additional backends (such as burn or candle) may be added as they mature.

At the moment the best results are obtained with the gte-Qwen2-instruct family of embedding models together with Phi3. This combination can index both source code and text documentation in the same vector database for question answering or simple retrieval.
```rust
use ragtime::{llama, RagQaPhi3GteQwen27bInstruct};
use llama_cpp_2::llama_backend::LlamaBackend;
use anyhow::Result;
use std::{io::{stdout, Write}, sync::Arc};

fn main() -> Result<()> {
    let backend = Arc::new(LlamaBackend::init()?);
    let mut qa = RagQaPhi3GteQwen27bInstruct::new(
        64,
        backend.clone(),
        // the embedding model
        llama::Args::default().with_model("gte-Qwen2-7B-instruct/ggml-model-q8_0.gguf"),
        backend,
        // the question answering model
        llama::Args::default().with_model("Phi-3-mini-128k-instruct/ggml-model-q8_0.gguf"),
    )?;
    // add documents to the index
    qa.add_document("doc0", 256, 128)?;
    qa.add_document("doc1", 256, 128)?;
    // query; the answer is streamed token by token
    for tok in qa.ask("question about your docs", None)? {
        let tok = tok?;
        print!("{tok}");
        stdout().flush()?;
    }
    Ok(())
}
```
While ragtime does not directly use any async runtime, its iterator-based return mechanism makes it simple to run generation on a background thread and push tokens through a channel such as std::sync::mpsc. This keeps ragtime agnostic to the choice of async runtime, and easy to use in non-async applications such as command line tools, as sketched below.
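A minimal sketch of the pattern, assuming `qa` is constructed as in the example above and that `ask` yields `anyhow::Result<String>` items (adjust to the actual item type):

```rust
use std::{sync::mpsc, thread};

// Inside a function returning anyhow::Result<()>, with `qa` built as above.
let (tx, rx) = mpsc::channel();
thread::spawn(move || match qa.ask("question about your docs", None) {
    Ok(tokens) => {
        for tok in tokens {
            // Stop early if the receiver has been dropped.
            if tx.send(tok).is_err() {
                break;
            }
        }
    }
    Err(e) => {
        let _ = tx.send(Err(e));
    }
});
// The receiving end can live anywhere: an async task, a UI thread, or a
// plain loop like this one.
for tok in rx {
    print!("{}", tok?);
}
```

Because the channel decouples generation from consumption, the consuming side only needs to iterate the receiver, whatever runtime it lives in.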
## Utils
In addition to the core crate, there are several simple command line wrappers in the utils folder:
- qa: a question answering command line utility that can ingest documents, persist and load the index, and provide a question answering REPL
- gen: a driver to run a Phi3 session with no RAG support
- chunk: a utility to display how a document will be split into chunks
## Models
Weights for Phi3 and all of the supported embedding models are available on Hugging Face. In some cases you will need to convert them to GGUF format for llama.cpp using the Python conversion script included in the llama.cpp repository (see the sketch after the list below). For ONNX models, many are available for direct download from Hugging Face; otherwise you will have to convert them from the Hugging Face format.
- Phi3: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
- gte-Qwen2: https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct
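When a GGUF file is needed, the conversion looks something like the following (a hypothetical invocation; the script name and flags vary between llama.cpp versions, and the paths are placeholders):

```sh
# Hypothetical example: convert a local Hugging Face checkout to 8-bit GGUF.
python convert_hf_to_gguf.py Phi-3-mini-128k-instruct \
    --outtype q8_0 \
    --outfile Phi-3-mini-128k-instruct/ggml-model-q8_0.gguf
```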
## Deployment
Ragtime applications can be deployed with minimal dependencies when using CPU or Vulkan acceleration. With CUDA, only the CUDA runtime is required.