93 releases

0.5.5 Nov 21, 2022
0.5.1 Sep 10, 2022
0.4.18 Jul 10, 2022
0.4.13 Mar 24, 2022
0.1.33 Nov 29, 2021

#96 in Rust patterns

Download history 197/week @ 2022-08-10 51/week @ 2022-08-17 338/week @ 2022-08-24 332/week @ 2022-08-31 436/week @ 2022-09-07 535/week @ 2022-09-14 441/week @ 2022-09-21 926/week @ 2022-09-28 1399/week @ 2022-10-05 1618/week @ 2022-10-12 1672/week @ 2022-10-19 1861/week @ 2022-10-26 1413/week @ 2022-11-02 981/week @ 2022-11-09 1147/week @ 2022-11-16 1048/week @ 2022-11-23

4,824 downloads per month
Used in 14 crates (3 directly)

MIT/Apache

500KB
2.5K SLoC

rustrict

rustrict is a sophisticated profanity filter for Rust.

Features

  • Multiple types (profane, offensive, sexual, mean, spam)
  • Multiple levels (mild, moderate, severe)
  • Resistant to evasion
    • Alternative spellings (like "fck")
    • Repeated characters (like "craaaap")
    • Confusable characters (like 'ᑭ', '𝕡', and '🅿')
    • Spacing (like "c r_a-p")
    • Accents (like "pÓöp")
    • Bidirectional Unicode (related reading)
    • Self-censoring (like "f*ck")
    • Safe phrase list for known bad actors]
    • Censors invalid Unicode characters
    • Battle-tested in Mk48.io
  • Resistant to false positives
    • One word (like "assassin")
    • Two words (like "push it")
  • Flexible
    • Censor and/or analyze
    • Input &str or Iterator<Item = char>
    • Can track per-user state with context feature
    • Can add words with the customize feature
    • Accurately reports the width of Unicode via the width feature
    • Plenty of options
  • Performant
    • O(n) analysis and censoring
    • No regex (uses custom trie)
    • 3 MB/s in release mode
    • 100 KB/s in debug mode

Limitations

  • Mostly English/emoji only
  • Censoring removes most diacritics (accents)
  • Does not detect right-to-left profanity while analyzing, so...
  • Censoring forces Unicode to be left-to-right
  • Doesn't understand context
  • Not resistant to false positives affecting profanities added at runtime

Usage

Strings (&str)

use rustrict::CensorStr;

let censored: String = "hello crap".censor();
let inappropriate: bool = "f u c k".is_inappropriate();

assert_eq!(censored, "hello c***");
assert!(inappropriate);

Iterators (Iterator<Type = char>)

use rustrict::CensorIter;

let censored: String = "hello crap".chars().censor().collect();

assert_eq!(censored, "hello c***");

Advanced

By constructing a Censor, one can avoid scanning text multiple times to get a censored String and/or answer multiple is queries. This also opens up more customization options (defaults are below).

use rustrict::{Censor, Type};

let (censored, analysis) = Censor::from_str("123 Crap")
    .with_censor_threshold(Type::INAPPROPRIATE)
    .with_censor_first_character_threshold(Type::OFFENSIVE & Type::SEVERE)
    .with_ignore_false_positives(false)
    .with_ignore_self_censoring(false)
    .with_censor_replacement('*')
    .censor_and_analyze();

assert_eq!(censored, "123 C***");
assert!(analysis.is(Type::INAPPROPRIATE));
assert!(analysis.isnt(Type::PROFANE & Type::SEVERE | Type::SEXUAL));

If you cannot afford to let anything slip though, or have reason to believe a particular user is trying to evade the filter, you can check if their input matches a short list of safe strings:

use rustrict::{CensorStr, Type};

// Figure out if a user is trying to evade the filter.
assert!("pron".is(Type::EVASIVE));
assert!("porn".isnt(Type::EVASIVE));

// Only let safe messages through.
assert!("Hello there!".is(Type::SAFE));
assert!("nice work.".is(Type::SAFE));
assert!("yes".is(Type::SAFE));
assert!("NVM".is(Type::SAFE));
assert!("gtg".is(Type::SAFE));
assert!("not a common phrase".isnt(Type::SAFE));

If you want to add custom profanities or safe words, enable the customize feature.

#[cfg(feature = "customize")]
{
    use rustrict::{add_word, CensorStr, Type};

    // You must take care not to call these when the crate is being
    // used in any other way (to avoid concurrent mutation).
    unsafe {
        add_word("reallyreallybadword", (Type::PROFANE & Type::SEVERE) | Type::MEAN);
        add_word("mybrandname", Type::SAFE);
    }
    
    assert!("Reallllllyreallllllybaaaadword".is(Type::PROFANE));
    assert!("MyBrandName".is(Type::SAFE));
}

But wait, there's more! If your use-case is chat moderation, and you can store data on a per-user basis, you might benefit from the context feature.

#[cfg(feature = "context")]
{
    use rustrict::{BlockReason, Context};
    use std::time::Duration;
    
    pub struct User {
        context: Context,
    }
    
    let mut bob = User {
        context: Context::default()
    };
    
    // Ok messages go right through.
    assert_eq!(bob.context.process(String::from("hello")), Ok(String::from("hello")));
    
    // Bad words are censored.
    assert_eq!(bob.context.process(String::from("crap")), Ok(String::from("c***")));

    // Can take user reports (After many reports or inappropriate messages,
    // will only let known safe messages through.)
    for _ in 0..5 {
        bob.context.report();
    }
   
    // If many bad words are used or reports are made, the first letter of
    // future bad words starts getting censored too.
    assert_eq!(bob.context.process(String::from("crap")), Ok(String::from("****")));
    
    // Can manually mute.
    bob.context.mute_for(Duration::from_secs(2));
    assert!(matches!(bob.context.process(String::from("anything")), Err(BlockReason::Muted(_))));
}

Comparison

To compare filters, the first 100,000 items of this list is used as a dataset. Positive accuracy is the percentage of profanity detected as profanity. Negative accuracy is the percentage of clean text detected as clean.

Crate Accuracy Positive Accuracy Negative Accuracy Time
rustrict 89.92% 92.19% 89.36% 9s
censor 76.16% 72.76% 77.01% 23s

Development

Build

If you make an adjustment that would affect false positives, you will need to run false_positive_finder:

  1. Run ./download.sh to get the required word lists.
  2. Run cargo run --bin false_positive_finder --release --all-features

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Dependencies

~0.4–3.5MB
~57K SLoC