11 stable releases

1.6.0 Mar 4, 2023
1.5.3 Feb 26, 2023
1.5.0 Jan 28, 2023
1.4.7 Dec 18, 2022

#1811 in Text processing

29 downloads per month

Apache-2.0

52KB
643 lines


Text-Sanitizer

Rust Crate that converts raw text bytes into a valid std::string::String with plain ASCII encoding

Features

  • Very few Dependencies
    This leads to:
    • High Compatibility (compiles even with old Rust Compilers)
    • Very fast Startup Time (Execution Time less than 3 ms on a 27KB document)
  • Robust Code (does not use risky unwrap() Methods)
    Developed with the DevOps Mentality: "can fail but will live to tell"

Motivation

Most Rust parsing libraries bail out when fed raw data that is not UTF-8 encoded, such as ISO-8859-15 or Windows encodings, or data with mixed-up encodings.
String::from_utf8_lossy() damages such data, because every invalid byte sequence is replaced with the replacement character U+FFFD and the original bytes are lost. It also involves linear back-and-forth parsing at byte level, which introduces a performance penalty on bigger data.
text-sanitizer does not depend on proper encoding detection and relies only on an internal customizable conversion map.
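The behaviour of the standard library described above can be reproduced with a short sketch that uses only std. The helper names first_invalid_offset, decode_lossy and strict_vs_lossy are made up for this illustration and are not part of text-sanitizer.

```rust
/// Hypothetical helper: strict decoding. Returns the byte offset of the
/// first invalid sequence, or None when the whole input is valid UTF-8.
fn first_invalid_offset(raw: &[u8]) -> Option<usize> {
    match std::str::from_utf8(raw) {
        Ok(_) => None,
        Err(err) => Some(err.valid_up_to()),
    }
}

/// Hypothetical helper: lossy decoding. Invalid sequences are replaced
/// with U+FFFD, so the original bytes cannot be recovered afterwards.
fn decode_lossy(raw: &[u8]) -> String {
    String::from_utf8_lossy(raw).into_owned()
}

fn strict_vs_lossy() {
    // Two Sparkle Hearts with the invalid byte 250 (0xFA) in the center
    let raw: Vec<u8> = vec![240, 159, 146, 150, 119, 250, 240, 159, 146, 150];

    // Strict decoding rejects the whole input
    assert!(String::from_utf8(raw.clone()).is_err());

    // The first 5 bytes (Sparkle Heart + 'w') are valid, byte 5 is not
    assert_eq!(first_invalid_offset(&raw), Some(5));

    // Lossy decoding keeps the valid parts but erases the value 0xFA
    assert_eq!(decode_lossy(&raw), "\u{1F496}w\u{FFFD}\u{1F496}");
}
```

In the lossy result the byte value 0xFA is no longer visible, which is the information loss the paragraph above refers to.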

Usage

Object Oriented Method

When several sanitization operations are executed, the Conversion Map can be reused.
This saves resources and also acts as a Micro Optimization. A Web Server, for example, can benefit from this.

    //-------------------------------------
    // Test data is the Sparkle Heart from the UTF-8 documentation examples

    use text_sanitizer::TextSanitizer;

    let vsparkle_heart = vec![240, 159, 146, 150];

    let mut sanitizer = TextSanitizer::new_with_options(false, true, false);

    sanitizer.add_request_language(&"en");

    let srsout = sanitizer.sanitize_u8(&vsparkle_heart);

    println!("sparkle_heart: '{}'", srsout);

    assert_eq!(srsout, "<3");

Procedural Method

The sanitizer::sanitize_u8() function takes the raw data and creates a new valid UTF-8 std::string::String from it.

use text_sanitizer::sanitizer;

fn sparkle_heart() {
    //-------------------------------------
    // Test data is the Sparkle Heart from the UTF-8 documentation examples
    // which will be converted to "<3".

    let vsparkle_heart = vec![240, 159, 146, 150];
    let vrqlngs: Vec<String> = vec![String::from("en")];

    let srsout = sanitizer::sanitize_u8(&vsparkle_heart, &vrqlngs, &"");

    println!("sparkle_heart: '{}'", srsout);

    assert_eq!(srsout, "<3");
}

Consider this example, where the data in the center is corrupted somehow. Such data cannot be parsed by normal Rust libraries, and the valid information it contains would be lost.

use text_sanitizer::sanitizer;

fn two_hearts_center() {
    //-------------------------------------
    // Test data contains 2 Sparkle Hearts but is corrupted in the center
    // According to the Official Standard Library Documentation at:
    // https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8
    // this would produce a FromUtf8Error or panic the application
    // when used with unwrap()

    let vsparkle_heart = vec![240, 159, 146, 150, 119, 250, 240, 159, 146, 150];
    let vrqlngs: Vec<String> = vec![String::from("en")];

    let srsout = sanitizer::sanitize_u8(&vsparkle_heart, &vrqlngs, &" -d");

    println!("sparkle_heart: '{}'", srsout);

    assert_eq!(srsout, "<3w(?fa)<3");
}
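
To illustrate the idea of a conversion map, the following is a minimal sketch that uses only std and reproduces the output of the example above. It is not the implementation used by text-sanitizer. The functions sanitize_sketch and push_mapped, the HashMap-based map, and the "(?xx)" notation for unmapped non-ASCII characters are assumptions made for this illustration.

```rust
use std::collections::HashMap;

/// Hypothetical sketch, NOT the text-sanitizer implementation:
/// walks the input, maps valid characters through `map` and renders
/// invalid bytes as "(?xx)" in hexadecimal notation.
fn sanitize_sketch(raw: &[u8], map: &HashMap<char, &str>) -> String {
    let mut out = String::new();
    let mut rest = raw;

    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                push_mapped(&mut out, valid, map);
                break;
            }
            Err(err) => {
                // Everything before valid_up_to() is valid UTF-8
                let (valid, tail) = rest.split_at(err.valid_up_to());
                if let Ok(text) = std::str::from_utf8(valid) {
                    push_mapped(&mut out, text, map);
                }
                // error_len() is None for a truncated sequence at the end
                let bad_len = err.error_len().unwrap_or(tail.len());
                for byte in &tail[..bad_len] {
                    out.push_str(&format!("(?{:02x})", byte));
                }
                rest = &tail[bad_len..];
            }
        }
    }

    out
}

/// Hypothetical helper: mapped characters are replaced, ASCII is kept,
/// any other character is rendered as its code point (an assumption).
fn push_mapped(out: &mut String, text: &str, map: &HashMap<char, &str>) {
    for ch in text.chars() {
        match map.get(&ch) {
            Some(replacement) => out.push_str(replacement),
            None if ch.is_ascii() => out.push(ch),
            None => out.push_str(&format!("(?{:x})", ch as u32)),
        }
    }
}

fn two_hearts_sketch() {
    // Assumed map entry: Sparkle Heart (U+1F496) becomes "<3"
    let mut map = HashMap::new();
    map.insert('\u{1F496}', "<3");

    let raw: Vec<u8> = vec![240, 159, 146, 150, 119, 250, 240, 159, 146, 150];

    assert_eq!(sanitize_sketch(&raw, &map), "<3w(?fa)<3");
}
```

The sketch never gives up on the whole input: each decoding error only affects the bytes it covers, and parsing continues behind them.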

Dependencies

~2MB
~49K SLoC