11 stable releases
Version | Released |
---|---|
1.6.0 | Mar 4, 2023 |
1.5.3 | Feb 26, 2023 |
1.5.0 | Jan 28, 2023 |
1.4.7 | Dec 18, 2022 |
Text-Sanitizer
Rust crate to convert raw text bytes into a valid std::string::String with plain ASCII encoding
Features
- Very few dependencies
  This leads to:
  - High compatibility (compiles even with old Rust compilers)
  - Very fast startup time (execution time of less than 3 ms on a 27KB document)
- Robust code (does not use risky unwrap() methods)
Developed with the DevOps mentality: "can fail but will live to tell"
Motivation
Most Rust parsing libraries bail out when fed raw data that is not UTF-8 encoded, such as ISO-8859-15, Windows encodings and others, or data with mixed-up encodings.
Using String::from_utf8_lossy() damages such data and involves linear back-and-forth parsing at the byte level, which introduces a performance penalty on larger data.
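To make the standard-library behavior concrete, here is a minimal sketch that uses only std. The helper name decode_with_std is made up for this illustration and is not part of text-sanitizer. It shows that strict decoding rejects a buffer containing a single invalid byte, while lossy decoding replaces that byte with U+FFFD.

```rust
/// Hypothetical helper for illustration only (not part of text-sanitizer).
/// Returns whether strict decoding fails, plus the lossy-decoded text.
fn decode_with_std(raw: &[u8]) -> (bool, String) {
    // Strict decoding: one invalid byte makes the whole buffer an error
    let strict_fails = String::from_utf8(raw.to_vec()).is_err();
    // Lossy decoding: invalid bytes are replaced with U+FFFD
    let lossy = String::from_utf8_lossy(raw).into_owned();
    (strict_fails, lossy)
}

fn main() {
    // Two Sparkle Hearts with 'w' (119) and an invalid byte (250) in between
    let raw: Vec<u8> = vec![240, 159, 146, 150, 119, 250, 240, 159, 146, 150];
    let (strict_fails, lossy) = decode_with_std(&raw);
    assert!(strict_fails);
    assert_eq!(lossy, "\u{1F496}w\u{FFFD}\u{1F496}");
    println!("lossy: '{}'", lossy);
}
```

The lossy result is valid UTF-8, but the original value of the invalid byte can no longer be recovered from it.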
text-sanitizer does not depend on proper encoding detection and relies only on an internal customizable conversion map.
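The idea of a map-driven conversion can be sketched as follows. This is a toy illustration under stated assumptions and not the crate's actual implementation: the function toy_sanitize, the HashMap layout and the 4-byte lookup window are all hypothetical, and the "(?fa)" placeholder style is borrowed from the expected output of the corrupted-data example in the Usage section.

```rust
use std::collections::HashMap;

/// Hypothetical toy sanitizer; NOT the crate's actual implementation.
/// `map` associates multi-byte sequences with ASCII replacements.
fn toy_sanitize(raw: &[u8], map: &HashMap<Vec<u8>, &str>) -> String {
    let mut out = String::new();
    let mut i = 0;
    while i < raw.len() {
        let byte = raw[i];
        if byte.is_ascii() {
            // Plain ASCII passes through unchanged
            out.push(byte as char);
            i += 1;
            continue;
        }
        // Try the longest mapped sequence (up to 4 bytes) starting at position i
        let max_len = (raw.len() - i).min(4);
        let hit = (1..=max_len)
            .rev()
            .find_map(|len| map.get(&raw[i..i + len]).map(|rep| (len, *rep)));
        match hit {
            Some((len, rep)) => {
                out.push_str(rep);
                i += len;
            }
            None => {
                // Unmapped byte: emit a visible hex placeholder instead of failing
                out.push_str(&format!("(?{:02x})", byte));
                i += 1;
            }
        }
    }
    out
}

fn main() {
    let mut map: HashMap<Vec<u8>, &str> = HashMap::new();
    map.insert(vec![240, 159, 146, 150], "<3"); // Sparkle Heart
    let raw: [u8; 10] = [240, 159, 146, 150, 119, 250, 240, 159, 146, 150];
    let out = toy_sanitize(&raw, &map);
    assert_eq!(out, "<3w(?fa)<3");
    println!("sanitized: '{}'", out);
}
```

Because the lookup works on raw bytes, no encoding detection is needed: every byte is either ASCII, part of a mapped sequence, or reported as a placeholder.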
Usage
Object Oriented Method
When several sanitization operations are executed, the conversion map can be reused.
This saves resources and amounts to a micro-optimization.
A web server, for example, can benefit from this.
//-------------------------------------
// Test data is the Sparkle Heart from the UTF-8 documentation examples
use text_sanitizer::TextSanitizer;
let vsparkle_heart = vec![240, 159, 146, 150];
let mut sanitizer = TextSanitizer::new_with_options(false, true, false);
sanitizer.add_request_language(&"en");
let srsout = sanitizer.sanitize_u8(&vsparkle_heart);
println!("sparkle_heart: '{}'", srsout);
assert_eq!(srsout, "<3");
Procedural Method
The sanitizer::sanitize_u8() function takes the raw data and creates a new valid UTF-8 std::string::String from it.
use text_sanitizer::sanitizer;
fn sparkle_heart() {
    //-------------------------------------
    // Test data is the Sparkle Heart from the UTF-8 documentation examples
    // which will be converted to "<3".
    let vsparkle_heart = vec![240, 159, 146, 150];
    let vrqlngs: Vec<String> = vec![String::from("en")];
    let srsout = sanitizer::sanitize_u8(&vsparkle_heart, &vrqlngs, &"");
    println!("sparkle_heart: '{}'", srsout);
    assert_eq!(srsout, "<3");
}
Consider this example, where the data in the center is corrupted. Such data cannot be parsed by standard Rust libraries, and the valid information it contains would be lost.
use text_sanitizer::sanitizer;
fn two_hearts_center() {
    //-------------------------------------
    // Test data contains 2 Sparkle Hearts but is corrupted in the center
    // According to the Official Standard Library Documentation at:
    // https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8
    // this would produce a FromUtf8Error or panic the application
    // when used with unwrap()
    let vsparkle_heart = vec![240, 159, 146, 150, 119, 250, 240, 159, 146, 150];
    let vrqlngs: Vec<String> = vec![String::from("en")];
    let srsout = sanitizer::sanitize_u8(&vsparkle_heart, &vrqlngs, &" -d");
    println!("sparkle_heart: '{}'", srsout);
    assert_eq!(srsout, "<3w(?fa)<3");
}