1 unstable release

0.2.2	Feb 1, 2025

#1 in #tktax

119 downloads per month
Used in 19 crates (11 directly)

MIT license

79KB
735 lines

tktax-transaction-category

Overview

tktax-transaction-category provides robust categorization (Lat. Categorĭa, Gr. Κατηγορία) for financial transactions. It leverages a Porter stemmer to reduce vendor descriptions to root forms, then maps tokens onto enumerated categories. This approach supports composable classification, large reference tables (CSV-based), and advanced trait constraints for production usage.

Key Features

Porter-Stem-Based Tokenization
- Rust Stemmer from rust-stemmers is invoked to canonicalize words.
- Alphanumeric filtering purges punctuation.
CategoryMap
- Maintains a HashMap<StemmedToken, HashSet<TxCat>> mapping tokens to one or more categories.
- Dynamically constructed from CSV lines (allowing multiple categories via semicolon-delimited strings).
Predictive Scoring
- predict_category(desc, &CategoryMap) computes a distribution of categories.
- Each recognized token contributes 1 / N if it maps to N categories.
Trait-Based Extension
- TransactionCategory trait constraints unify advanced classification methods.
- Implement or extend domain-specific categories while retaining the same classification routines.
Flexible CSV Ingestion
- The method GetCategoryGoldenCsv::category_golden_csv() supplies a reference dataset.
- Real-world usage: integrate your own line-delimited “category,description” CSV for automatic classification.

Example Usage

// Use your own category enum implementing `TransactionCategory`
use tktax_transaction_category::{
    CategoryMap,
    predict_category,
    TransactionCategory,
    // ... other necessary imports
};

// Construct the map from a "golden" CSV reference
let cat_map = CategoryMap::<MockTransactionCategory>::new();

// Some unknown transaction description
let transaction_description = "AMZN MKTPLACE MED - FIRST AID KITS and extras";

// Predict the categories (in descending score order)
let predictions = predict_category(transaction_description, &cat_map);

// Evaluate the top prediction or inspect the full distribution
if let Some(best) = predictions.first() {
    println!("Likely category: {:?} with score {:?}", best.category(), best.score());
}

Production Considerations

Token scoring is purely additive; tokens recognized in multiple categories can split point allocations (e.g., 0.5 each if two categories match).
CSV lines with multiple categories (cat1;cat2,desc) unify them for each recognized token in desc.
Implement your own traits or rely on default ones to handle domain-specific expansions (e.g., medical_and_insurance_categories).

Testing

Extensive unit tests can be found under #[cfg(test)] within each module:

StemmedToken correctness under empty or punctuation-only input.
CategoryMap edge cases: unknown categories, repeated lines, multi-category expansions.
predict_category scoring distribution, tie resolution, and robust sorting in descending order.

To run tests:

cargo test --package tktax-transaction-category

License

This project is licensed under the MIT License.

Contributing

Fork the repository.
Create a feature branch (git checkout -b my-new-feature).
Commit changes (git commit -am 'Add new feature').
Push to the branch (git push origin my-new-feature).
Create a new Pull Request on GitHub.

Happy categorizing!

Dependencies

~26–37MB
~637K SLoC