tktax-transaction-category

A Rust library for categorizing financial transactions using Porter stemming, CSV-driven classification, and advanced trait-based extensibility

1 unstable release

new 0.2.2 Feb 1, 2025


Used in 14 crates (8 directly)

MIT license

79KB
735 lines

Overview

tktax-transaction-category categorizes (Lat. Categorĭa, Gr. Κατηγορία) financial transactions. It uses a Porter stemmer to reduce the words of a vendor description to their root forms, then maps the resulting tokens onto enumerated categories. This approach supports composable classification, large CSV-based reference tables, and extension through trait bounds on the category type.

Key Features

  1. Porter-Stem-Based Tokenization

    • The Stemmer from the rust-stemmers crate reduces each word to a canonical stem.
    • Alphanumeric filtering purges punctuation.
  2. CategoryMap

    • Maintains a HashMap<StemmedToken, HashSet<TxCat>> mapping tokens to one or more categories.
    • Dynamically constructed from CSV lines (allowing multiple categories via semicolon-delimited strings).
  3. Predictive Scoring

    • predict_category(desc, &CategoryMap) computes a distribution of categories.
    • Each recognized token contributes 1 / N if it maps to N categories.
  4. Trait-Based Extension

    • The classification routines are generic over any category type that satisfies the TransactionCategory trait constraints.
    • Implement or extend domain-specific categories while retaining the same classification routines.
  5. Flexible CSV Ingestion

    • The method GetCategoryGoldenCsv::category_golden_csv() supplies a reference dataset.
    • Real-world usage: integrate your own line-delimited “category,description” CSV for automatic classification.
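
To make the tokenization and map-building features above more concrete, here is a minimal, dependency-free sketch. It is not the crate's implementation: the function names tokenize and build_map are hypothetical, categories are plain strings rather than an enum, and the Porter stemming step (which the crate performs via rust-stemmers) is deliberately left out, so tokens are only lowercased and stripped of punctuation.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercase a description and split it into alphanumeric tokens.
/// NOTE: the real crate also stems each token with a Porter stemmer;
/// that step is omitted here to keep the sketch free of dependencies.
fn tokenize(desc: &str) -> Vec<String> {
    desc.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Build a token -> categories map from "cat1;cat2,description" lines.
/// Hypothetical helper: categories are plain strings in this sketch.
fn build_map(csv: &str) -> HashMap<String, HashSet<String>> {
    let mut map: HashMap<String, HashSet<String>> = HashMap::new();
    for line in csv.lines() {
        // Split on the first comma only, so descriptions may contain commas.
        // Lines without a comma are skipped.
        let Some((cats, desc)) = line.split_once(',') else { continue };
        let cats: Vec<&str> = cats
            .split(';')
            .map(str::trim)
            .filter(|c| !c.is_empty())
            .collect();
        if cats.is_empty() {
            continue;
        }
        // Every token of the description is associated with every listed category.
        for token in tokenize(desc) {
            map.entry(token)
                .or_default()
                .extend(cats.iter().map(|c| c.to_string()));
        }
    }
    map
}
```

With this sketch, a line such as Medical;Shopping,FIRST AID kits would associate each of the tokens first, aid, and kits with both Medical and Shopping.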

Example Usage

// Use your own category enum implementing `TransactionCategory`
use tktax_transaction_category::{
    CategoryMap,
    predict_category,
    TransactionCategory,
    // ... other necessary imports
};

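// Construct the map from a "golden" CSV reference.
// `MockTransactionCategory` is a placeholder: substitute your own enum
// implementing `TransactionCategory`.
let cat_map = CategoryMap::<MockTransactionCategory>::new();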

// Some unknown transaction description
let transaction_description = "AMZN MKTPLACE MED - FIRST AID KITS and extras";

// Predict the categories (in descending score order)
let predictions = predict_category(transaction_description, &cat_map);

// Evaluate the top prediction or inspect the full distribution
if let Some(best) = predictions.first() {
    println!("Likely category: {:?} with score {:?}", best.category(), best.score());
}

Production Considerations

  • Token scoring is purely additive: each recognized token contributes one point in total, split evenly among the categories it maps to (e.g., 0.5 each if it maps to two categories).
  • A CSV line that lists multiple categories (cat1;cat2,desc) associates every recognized token in desc with all of the listed categories.
  • Implement your own traits or rely on default ones to handle domain-specific expansions (e.g., medical_and_insurance_categories).
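
The additive 1 / N rule described above can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: score_tokens is a hypothetical name, tokens are assumed to be already normalized, and categories are plain strings.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

/// Accumulate category scores for already-normalized tokens.
/// Each token found in the map adds 1/N to each of its N categories;
/// tokens missing from the map contribute nothing.
/// Hypothetical helper, not part of the crate's API.
fn score_tokens(
    tokens: &[&str],
    map: &HashMap<String, HashSet<String>>,
) -> BTreeMap<String, f64> {
    let mut scores = BTreeMap::new();
    for token in tokens {
        if let Some(cats) = map.get(*token) {
            // One point per token, shared equally among its categories.
            let share = 1.0 / cats.len() as f64;
            for cat in cats {
                *scores.entry(cat.clone()).or_insert(0.0) += share;
            }
        }
    }
    scores
}
```

For example, if amzn maps only to Shopping and aid maps to both Medical and Shopping, the tokens amzn and aid together yield 1.5 for Shopping and 0.5 for Medical.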

Testing

Extensive unit tests can be found under #[cfg(test)] within each module:

  • StemmedToken correctness under empty or punctuation-only input.
  • CategoryMap edge cases: unknown categories, repeated lines, multi-category expansions.
  • predict_category scoring distribution, tie resolution, and robust sorting in descending order.
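
As an illustration of what deterministic descending-order sorting with tie resolution can look like, here is a small sketch. The function name rank is hypothetical, and breaking ties by category name is an assumption made for the example; the crate's actual tie-breaking rule is not stated here.

```rust
/// Sort (category, score) pairs by descending score.
/// Ties are broken by category name (an assumption for this sketch),
/// which makes the output order deterministic.
/// `f64::total_cmp` defines a total order over floats, so the comparison
/// never needs to unwrap a partial ordering, even if a score is NaN.
fn rank(mut scored: Vec<(String, f64)>) -> Vec<(String, f64)> {
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored
}
```

A test in this style would assert that two categories with equal scores always come back in the same relative order, and that an empty input yields an empty result.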

To run tests:

cargo test --package tktax-transaction-category

License

This project is licensed under the MIT License.

Contributing

  1. Fork the repository.
  2. Create a feature branch (git checkout -b my-new-feature).
  3. Commit changes (git commit -am 'Add new feature').
  4. Push to the branch (git push origin my-new-feature).
  5. Create a new Pull Request on GitHub.

Happy categorizing!

Dependencies

~26–37MB
~641K SLoC