Latest release: 0.2.2 (unstable), Feb 1, 2025
tktax-io
tktax-io is a Rust library that provides text preprocessing utilities, tokenization and stemming routines, and configurable formatted header-printing functions. It is designed for use within the TKTAX project, but it can also be adopted for general lexical cleanup or linguistic normalization workflows.
Features
- Punctuation Filtering: Uses regex to remove extraneous punctuation and special symbols.
- Case Normalization: Converts strings to lowercase for uniform comparisons.
- Tokenization & Stemming: Splits text using Unicode word boundaries and applies Snowball-based stemming to reduce words to canonical roots.
- Formatted Header Printing: Generates structured output lines with user-configurable width and character styles.
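To make the first three features concrete, here is a minimal standard-library sketch of the same pipeline. It is not the crate's implementation: the names preprocess_sketch, toy_stem, and tokenize_and_stem_sketch are hypothetical, punctuation is replaced by spaces as an assumption, whitespace splitting stands in for Unicode word-boundary segmentation, and a crude suffix stripper stands in for Snowball stemming.

```rust
/// Hypothetical stand-in for punctuation filtering plus case normalization.
/// Assumption: every character that is neither alphanumeric nor whitespace
/// is replaced by a space, then runs of whitespace are collapsed.
fn preprocess_sketch(input: &str) -> String {
    let replaced: String = input
        .chars()
        .map(|c| if c.is_alphanumeric() || c.is_whitespace() { c } else { ' ' })
        .collect();
    replaced
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

/// Crude suffix stripper used only for illustration. This is NOT the
/// Snowball algorithm: it removes the first matching suffix as long as
/// at least three characters remain.
fn toy_stem(word: &str) -> String {
    for suffix in ["ing", "ed", "es", "s"].iter() {
        if let Some(root) = word.strip_suffix(*suffix) {
            if root.chars().count() >= 3 {
                return root.to_string();
            }
        }
    }
    word.to_string()
}

/// Splits on whitespace (a simplification of Unicode word boundaries)
/// and applies the toy stemmer to each token.
fn tokenize_and_stem_sketch(text: &str) -> Vec<String> {
    text.split_whitespace().map(toy_stem).collect()
}
```

In this sketch, preprocess_sketch("7-ELEVEN!!!") yields "7 eleven", and tokenize_and_stem_sketch("recurring payments charged") yields ["recurr", "payment", "charg"]. The crate's own functions may treat hyphens and stems differently.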
Example Usage
Below is a minimal example showing how to use the main functions in this crate:
    use tktax_io::{preprocess, tokenize_and_stem, print_header, print_thick_header};

    fn main() {
        // Input text to preprocess
        let transaction_description = "7-ELEVEN!!!";

        // Remove punctuation and transform to lowercase
        let clean_text = preprocess(transaction_description);
        println!("Preprocessed: {}", clean_text);

        // Tokenize and stem the cleaned text
        let tokens = tokenize_and_stem(&clean_text);
        println!("Tokens: {:?}", tokens);

        // Print a couple of headers
        print_header("Light Header");
        print_thick_header("Heavy Header");
    }
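The header functions called above take only a title, so the configurable width and character style mentioned under Features are not visible in that example. As a rough illustration of the idea, the sketch below centers a title in a fixed-width line padded with a chosen fill character. The function format_header and its parameters are hypothetical and are not part of the tktax-io API; the output the crate actually prints may look different.

```rust
/// Hypothetical helper: centers `title` in a line `width` characters wide,
/// padding both sides with `fill`. When the title (plus one space on each
/// side) does not fit, the bare title is returned unpadded.
fn format_header(title: &str, width: usize, fill: char) -> String {
    let label = format!(" {} ", title);
    let label_len = label.chars().count();
    if label_len >= width {
        return title.to_string();
    }

    // Split the leftover width between both sides; an odd remainder
    // puts the extra fill character on the right.
    let total_pad = width - label_len;
    let left = total_pad / 2;
    let right = total_pad - left;

    let mut line = String::with_capacity(width);
    line.extend(std::iter::repeat(fill).take(left));
    line.push_str(&label);
    line.extend(std::iter::repeat(fill).take(right));
    line
}
```

With these assumptions, format_header("Light Header", 20, '-') produces "--- Light Header ---", and passing '=' as the fill gives a heavier-looking variant of the same line.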
Run the tests with:
cargo test
Contributing
- Fork the repository and create a feature branch.
- Make changes, then open a pull request to the main repository.
- Provide a clear and detailed description of all modifications.
License
This project is licensed under either of:
- Apache License, Version 2.0
- MIT License
at your option.