#preprocessor #header #tokenization #formatted #utilities #printing #tktax

tktax-io

A library providing text preprocessing, tokenization, and formatted header printing utilities for the TKTAX project

1 unstable release

new 0.2.2 Feb 1, 2025

#9 in #tktax


Used in 8 crates (2 directly)

MIT/Apache

43KB
167 lines

tktax-io

tktax-io is a Rust library that supplies text preprocessing utilities, tokenization and stemming routines, as well as configurable formatted header-printing functions. It is designed for integration within the TKTAX project but can also be adopted for general lexical cleansing or linguistic normalization workflows.

Features

  • Punctuation Filtering: Uses regex to remove extraneous punctuation and special symbols.
  • Case Normalization: Converts strings to lowercase for uniform comparisons.
  • Tokenization & Stemming: Splits text using Unicode word boundaries and applies Snowball-based stemming to reduce words to canonical roots.
  • Formatted Header Printing: Generates structured output lines with user-configurable width and character styles.

Example Usage

Below is a minimal example showing how to use the main functions in this crate:

use tktax_io::{preprocess, tokenize_and_stem, print_header, print_thick_header};

fn main() {
    // Input text to preprocess
    let transaction_description = "7-ELEVEN!!!";

    // Remove punctuation and transform to lowercase
    let clean_text = preprocess(transaction_description);
    println!("Preprocessed: {}", clean_text);

    // Tokenize and stem the cleaned text
    let tokens = tokenize_and_stem(&clean_text);
    println!("Tokens: {:?}", tokens);

    // Print a couple of headers
    print_header("Light Header");
    print_thick_header("Heavy Header");
}

Run the tests with:

cargo test

Contributing

  1. Fork the repository and create a feature branch.
  2. Make changes, then open a pull request to the main repository.
  3. Provide a clear and detailed description of all modifications.

License

This project is licensed under either of:

  • Apache License, Version 2.0
  • MIT License

at your option.

Dependencies

~26–37MB
~641K SLoC