1 unstable release
new 0.2.2 | Feb 1, 2025 |
---|
#23 in #advanced
Used in 14 crates
(8 directly)
79KB
735 lines
tktax-transaction-category
Overview
tktax-transaction-category provides robust categorization (Lat. Categorĭa, Gr. Κατηγορία) for financial transactions. It leverages a Porter stemmer to reduce vendor descriptions to root forms, then maps tokens onto enumerated categories. This approach supports composable classification, large reference tables (CSV-based), and advanced trait constraints for production usage.
Key Features
-
Porter-Stem-Based Tokenization
- Rust
Stemmer
fromrust-stemmers
is invoked to canonicalize words. - Alphanumeric filtering purges punctuation.
- Rust
-
CategoryMap
- Maintains a
HashMap<StemmedToken, HashSet<TxCat>>
mapping tokens to one or more categories. - Dynamically constructed from CSV lines (allowing multiple categories via semicolon-delimited strings).
- Maintains a
-
Predictive Scoring
predict_category(desc, &CategoryMap)
computes a distribution of categories.- Each recognized token contributes
1 / N
if it maps to N categories.
-
Trait-Based Extension
TransactionCategory
trait constraints unify advanced classification methods.- Implement or extend domain-specific categories while retaining the same classification routines.
-
Flexible CSV Ingestion
- The method
GetCategoryGoldenCsv::category_golden_csv()
supplies a reference dataset. - Real-world usage: integrate your own line-delimited “category,description” CSV for automatic classification.
- The method
Example Usage
// Use your own category enum implementing `TransactionCategory`
use tktax_transaction_category::{
CategoryMap,
predict_category,
TransactionCategory,
// ... other necessary imports
};
// Construct the map from a "golden" CSV reference
let cat_map = CategoryMap::<MockTransactionCategory>::new();
// Some unknown transaction description
let transaction_description = "AMZN MKTPLACE MED - FIRST AID KITS and extras";
// Predict the categories (in descending score order)
let predictions = predict_category(transaction_description, &cat_map);
// Evaluate the top prediction or inspect the full distribution
if let Some(best) = predictions.first() {
println!("Likely category: {:?} with score {:?}", best.category(), best.score());
}
Production Considerations
- Token scoring is purely additive; tokens recognized in multiple categories can split point allocations (e.g., 0.5 each if two categories match).
- CSV lines with multiple categories (
cat1;cat2,desc
) unify them for each recognized token indesc
. - Implement your own traits or rely on default ones to handle domain-specific expansions (e.g.,
medical_and_insurance_categories
).
Testing
Extensive unit tests can be found under #[cfg(test)]
within each module:
- StemmedToken correctness under empty or punctuation-only input.
- CategoryMap edge cases: unknown categories, repeated lines, multi-category expansions.
- predict_category scoring distribution, tie resolution, and robust sorting in descending order.
To run tests:
cargo test --package tktax-transaction-category
License
This project is licensed under the MIT License.
Contributing
- Fork the repository.
- Create a feature branch (
git checkout -b my-new-feature
). - Commit changes (
git commit -am 'Add new feature'
). - Push to the branch (
git push origin my-new-feature
). - Create a new Pull Request on GitHub.
Happy categorizing!
Dependencies
~26–37MB
~641K SLoC