#language-model #language #model #sanitization

langsan

A library for sanitizing language model input and output

11 releases

0.0.10 Oct 17, 2024
0.0.9 Oct 16, 2024

#775 in Text processing


250 downloads per month
Used in misanthropic

MIT license

100KB
2K SLoC

langsan is a sanitization library for language models


Out of a desire to be first to market, many companies from OpenAI to Anthropic are releasing language models without proper input or output sanitization. This can lead to a variety of safety and security issues, including but not limited to human-invisible adversarial attacks, data leakage, and generation of harmful content.

langsan provides immutable string wrappers that guarantee their contents fall within restricted Unicode ranges, generally only those officially supported by a particular language model. Almost all Unicode blocks are available as Cargo features; not all of them can be, because crates.io limits a crate to 300 features.
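To illustrate the idea of a range-restricted immutable wrapper, here is a minimal sketch using only the standard library. It is not langsan's API: the names `RestrictedString`, `ALLOWED`, `is_allowed`, `new`, and `new_lossy` are hypothetical, and the allow-list (newline, printable Basic Latin, Latin-1 Supplement) is an assumption chosen for the example.

```rust
/// Hypothetical allow-list of inclusive code point ranges (an assumption for
/// this sketch, not taken from langsan).
const ALLOWED: &[(char, char)] = &[
    ('\n', '\n'),         // line feed
    ('\u{20}', '\u{7E}'), // Basic Latin, printable characters
    ('\u{A0}', '\u{FF}'), // Latin-1 Supplement
];

/// Returns true if `c` falls inside any allowed range.
fn is_allowed(c: char) -> bool {
    ALLOWED.iter().any(|&(lo, hi)| (lo..=hi).contains(&c))
}

/// Immutable wrapper: the inner String is private and never mutated after
/// construction, so every value holds only allowed characters.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RestrictedString(String);

impl RestrictedString {
    /// Strict constructor: rejects the input and reports the first
    /// disallowed character.
    pub fn new(s: &str) -> Result<Self, char> {
        match s.chars().find(|&c| !is_allowed(c)) {
            Some(bad) => Err(bad),
            None => Ok(Self(s.to_owned())),
        }
    }

    /// Lossy constructor: silently drops disallowed characters.
    pub fn new_lossy(s: &str) -> Self {
        Self(s.chars().filter(|&c| is_allowed(c)).collect())
    }

    /// Read-only access to the validated contents.
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    let clean = RestrictedString::new("Hello, world!").unwrap();
    println!("{}", clean.as_str());

    // U+200B (zero-width space) is invisible to a human reader.
    let sneaky = "pay\u{200B}load";
    assert_eq!(RestrictedString::new(sneaky), Err('\u{200B}'));
    assert_eq!(RestrictedString::new_lossy(sneaky).as_str(), "payload");
}
```

Because the only way to obtain a value is through a validating constructor, code that accepts a `RestrictedString` does not need to re-check its contents. A strict constructor suits untrusted input that should be rejected outright, while a lossy one suits model output that should be cleaned and passed on.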

Dependencies

~0.3–1MB
~23K SLoC