#llama #cpp #language #wrapper #generation #model #drama

bin+lib drama_llama

A library for language modeling and text generation

4 releases

new 0.2.0 Apr 25, 2024
0.1.2 Apr 20, 2024
0.1.1 Apr 20, 2024
0.1.0 Apr 20, 2024

Custom license

2MB
11K SLoC

drama_llama

llama with drama mask logo

drama_llama is yet another Rust wrapper for llama.cpp. It is a work in progress and not intended for production use. The API will change.

For examples, see the bin folder. There are two example binaries.

  • Dittomancer - Chat with personalities well represented in the training data.
  • Regurgitater - Test local language models for memorized content.
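
As a toy illustration of the idea behind a memorization test like Regurgitater (this is not the binary's actual method), one can prompt a model with the opening of a known text and measure how far its continuation matches the reference, character for character:

```rust
// Toy sketch of a memorization check (not Regurgitater's actual logic):
// count how many leading characters of the model's continuation match
// the reference continuation verbatim.
fn matching_prefix_len(generated: &str, reference: &str) -> usize {
    generated
        .chars()
        .zip(reference.chars())
        .take_while(|(g, r)| g == r)
        .count()
}

fn main() {
    let reference = "to be, or not to be, that is the question";
    let generated = "to be, or not to be, that is the answer";
    let n = matching_prefix_len(generated, reference);
    println!("{}", n); // prints 33
}
```

A long verbatim match on held-out text suggests the passage was memorized during training.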

Supported Features

  • LLaMA 3 Support.
  • Iterators yielding tokens and pieces.
  • Stop criteria based on regex, token sequences, and/or string sequences.
  • Metal support. CUDA may be enabled with the cuda and cuda_f16 features.
  • Rust-native sampling code. All sampling methods from llama.cpp have been translated.
  • N-gram based repetition penalties with custom exclusions for n-grams that should not be penalized.
  • Support for N-gram blocking with a default, hardcoded blocklist.
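
To illustrate the idea behind an n-gram repetition penalty (this is a concept sketch in plain Rust, not drama_llama's actual implementation or API), the sampler can check whether accepting a candidate token would repeat an n-gram already present in the context, and if so lower that candidate's logit:

```rust
// Concept sketch of an n-gram repetition penalty (not drama_llama's
// implementation): if extending the context with a candidate token would
// repeat an n-gram already seen, subtract a penalty from its logit.
use std::collections::HashSet;

fn penalize_repeats(context: &[u32], candidates: &mut [(u32, f32)], n: usize, penalty: f32) {
    if context.len() + 1 < n {
        return;
    }
    // Collect every n-gram already present in the context.
    let seen: HashSet<&[u32]> = context.windows(n).collect();
    // The last n-1 context tokens plus the candidate form the would-be n-gram.
    let tail = &context[context.len() - (n - 1)..];
    for (token, logit) in candidates.iter_mut() {
        let mut ngram = tail.to_vec();
        ngram.push(*token);
        if seen.contains(ngram.as_slice()) {
            *logit -= penalty;
        }
    }
}

fn main() {
    // Context ends in token 5; the bigram [5, 6] already occurred, so
    // candidate 6 is penalized while candidate 7 is left untouched.
    let context = [5u32, 6, 5];
    let mut candidates = vec![(6u32, 1.0f32), (7, 1.0)];
    penalize_repeats(&context, &mut candidates, 2, 0.5);
    assert_eq!(candidates, vec![(6, 0.5), (7, 1.0)]);
}
```

Exclusion lists, as in the crate's feature, would simply skip the penalty for whitelisted n-grams.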

Contributing

  • Code is poetry. Make it pretty.
  • Respect is universal.
  • Use rustfmt.

Roadmap

  • Candidate iterator with fine-grained control over sampling
  • Examples for new Candidate API.
  • Support for chaining sampling methods using SampleOptions. mode will become modes, applied one after another until only a single Candidate token remains.
  • Common command line options for sampling. Currently these are not exposed.
  • API closer to Ollama. Potentially support for something like Modelfile.
  • Logging (non-blocking) and benchmark support.
  • Better chat and instruct model support.
  • Web server. Tokenization in the browser.
  • Tiktoken as the tokenizer for some models instead of llama.cpp's internal one.
  • Reworked, functional, public Candidate API
  • Grammar constraints (maybe or maybe not llama.cpp style)
  • Async streams, better parallelism with automatic batch scheduling
  • Backends other than llama.cpp (e.g., MLC, TensorRT-LLM, Ollama)
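
The planned sampling-mode chaining can be pictured as a pipeline of filters over the candidate list, each narrowing it until one token remains. A minimal sketch of that idea in plain Rust, assuming hypothetical top-k and greedy stages (the real SampleOptions API may differ):

```rust
// Concept sketch of chained sampling modes (not drama_llama's SampleOptions
// API): each mode narrows the candidate list; the chain ends when a single
// candidate remains.
#[derive(Clone, Debug, PartialEq)]
struct Candidate {
    token: u32,
    logit: f32,
}

/// Keep only the `k` highest-logit candidates.
fn top_k(mut cands: Vec<Candidate>, k: usize) -> Vec<Candidate> {
    cands.sort_by(|a, b| b.logit.partial_cmp(&a.logit).unwrap());
    cands.truncate(k);
    cands
}

/// Reduce to the single highest-logit candidate.
fn greedy(cands: Vec<Candidate>) -> Vec<Candidate> {
    cands
        .into_iter()
        .max_by(|a, b| a.logit.partial_cmp(&b.logit).unwrap())
        .into_iter()
        .collect()
}

fn main() {
    let cands = vec![
        Candidate { token: 1, logit: 0.1 },
        Candidate { token: 2, logit: 2.5 },
        Candidate { token: 3, logit: 1.7 },
    ];
    // Modes applied one after another until a single Candidate remains.
    let picked = greedy(top_k(cands, 2));
    assert_eq!(picked.len(), 1);
    assert_eq!(picked[0].token, 2);
}
```

Stochastic modes (temperature, top-p, etc.) would slot into the same chain, each taking and returning a candidate list.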

Known issues

  • With LLaMA 3, safe vocabulary does not work yet, so --vocab unsafe must be passed as a command line argument, or VocabKind::Unsafe used with an Engine constructor.
  • The model doesn't load until generation starts, so there can be a long pause on the first generation. However, because mmap is used, the model should already be cached by the OS on subsequent process launches.

Generative AI Disclosure

  • Generative AI, specifically Microsoft's Bing Copilot, GitHub Copilot, and Dall-E 3, was used for portions of this project. See inline comments for sections where generative AI was used. Completion was also used for getters, setters, and some tests. Logos were generated with Dall-E and post-processed in Inkscape.

Dependencies

~14–53MB
~798K SLoC