2 releases

0.1.1	Mar 25, 2020
0.1.0	Mar 6, 2019

#367 in Machine learning

MIT license

30KB
647 lines

rust-bn

A simple Naive Bayes Model in Rust.

Basic idea is from blayze but rust-bn is a Rust implementation and use a very simple Rust HashMap<String, f64> to save model in memory.

Later it should be very simple to enhance it with a hard drive key-value store to make model persist.

How To Use

Simply checkout this repo and run some examples locally:

git clone git@github.com:liufuyang/rust-nb.git

cargo run --example spam

# or run a more complex example, use --release to speed up train/test process
cargo run --example 20newsgroup_stopwords --release

And then you can modify those examples in the examples folder and perhaps from there build your own models.

Or you can use this package in your application by setting in Cargo.toml:

[dependencies]
...
rust_nb = "0.1.0"

Just take make a main function looks like below. See how a simple email spam model might look like when you train and predict on it.

extern crate rust_nb;

use rust_nb::{Feature, FeatureType, Model};

fn main() {
    let mut model = Model::new();

    let input_train = vec![
        (
            "spam".to_owned(),
            vec![
                Feature {
                    feature_type: FeatureType::Text,
                    name: "email.body".to_owned(),
                    value: "Good day dear beneficiary. This is Secretary to president of Benin republic is writing this email ... heritage, tax, dollars, money, credit card...".to_owned(),
                },
                Feature {
                    feature_type: FeatureType::Category,
                    name: "email.domain".to_owned(),
                    value: "evil.com".to_owned(),
                },
                Feature {
                    feature_type: FeatureType::Gaussian,
                    name: "email.n_words".to_owned(),
                    value: "482".to_owned(),
                },
            ],
        ),
        (
            "not spam".to_owned(),
            vec![
                Feature {
                    feature_type: FeatureType::Text,
                    name: "email.body".to_owned(),
                    value: "Hey bro, how's work these days, wanna join me for hotpot next week?".to_owned(),
                },
                Feature {
                    feature_type: FeatureType::Category,
                    name: "email.domain".to_owned(),
                    value: "gmail.com".to_owned(),
                },
                Feature {
                    feature_type: FeatureType::Gaussian,
                    name: "email.n_words".to_owned(),
                    value: "42".to_owned(),
                },
            ],
        ),
    ];

    model.train("Spam checker", &input_train);

    // test example 1
    let result = model.predict(
        "Spam checker",
        &vec![
            Feature {
                feature_type: FeatureType::Text,
                name: "email.body".to_owned(),
                value: "Hey bro, This is Secretary to president want to give you some money. Please give me your credit card number ..."
                    .to_owned(),
            },
            Feature {
                feature_type: FeatureType::Category,
                name: "email.domain".to_owned(),
                value: "example.com".to_owned(),
            },
            Feature {
                feature_type: FeatureType::Gaussian,
                name: "email.n_words".to_owned(),
                value: "288".to_owned(),
            },
        ],
    );

    println!("{:?}\n", result);
    assert!(result.get("spam").unwrap().abs() > 0.9);
    // result will be:
    // {"not spam": 0.04228956359881729, "spam": 0.9577104364011828}

    // test example 2
    let result = model.predict(
        "Spam checker",
        &vec![
            Feature {
                feature_type: FeatureType::Text,
                name: "email.body".to_owned(),
                value: "Hey bro, hotpot again?".to_owned(),
            },
            Feature {
                feature_type: FeatureType::Category,
                name: "email.domain".to_owned(),
                value: "gmail.com".to_owned(),
            },
            Feature {
                feature_type: FeatureType::Gaussian,
                name: "email.n_words".to_owned(),
                value: "10".to_owned(),
            },
        ],
    );

    println!("{:?}\n", result);
    assert!(result.get("not spam").unwrap().abs() > 0.9);
    // result will be:
    // {"spam": 0.03786816269284711, "not spam": 0.9621318373071529}
}

About Naive Bayes Model (and how to understand the code)

Firstly let's take a look at the Bayes equations for only 2 classes and a feature

$p(c_1 | x) = \frac{ p(x | c_1) p(c_1) }{ p(x | c_1) p(c_1) + p(x | c_2) p(c_2)}$

$p(c_2 | x) = \frac{ p(x | c_2) p(c_2) }{ p(x | c_1) p(c_1) + p(x | c_2) p(c_2)}$

As we can see the denominator is same for both probabilities of class 1 and 2 based on input x (and they sum equal to 1).

Thus we could simply only focusing calculating only the numerator part for each classes then normalized them all in the end to get probabilities for each class prediction.

This also generalize to the classes number greater than 2 as well.

$p(c_n | x) <= { p(x | c_n) p(c_n) }$

Note here we use <= notation meaning we can infer p(c_n | x) by the value on the right later on, after we have all calculations of classes with number index as `n.

Now expand this to situations where we have multiple features, index of features noted with i, and let X = x_1, x_2, ... x_i, we have:

$p(c_n | X) <= { p(X | c_n) p(c_n) }$

$p(c_n | X) <= { p(x_1, x_2, ... x_i | c_n) p(c_n) }$

As "Naive" way of thinking, each feature appearance x_i is independent, so we could have:

Feature type: Multinomial and Categorical

Currently we support two feature type:

Categorical: each x_i in above equations are different value
Multinomial (a.k.a Text feature): each x_i in above equations could be the same. For example, in the case of counting words to predict document class, a word apple as x_i can appear multiple times, let's denote it as t_i (as in code it's called inputFeatureCounts.)

One can also think of Categorical feature being as Multinomial feature but all t_i is 1.

So we for now just look at equations for Multinomial. Suppose now our x_i as unique words, the equations becomes as:

$p(c_n | X) <= p(c_n) \: p(x_1| c_n)^{t_1} \, p(x_2| c_n)^{t_2} \, p(x_3| c_n)^{t_3} \,...\, p(x_i | c_n) ^{t_i}$

There are many multiplications with values less than one. To prevent the number gets to small to be presented as double in computers, we can calculate the log value on each side instead:

$log(p(c_n | X)) <= log( p(c_n)) + \: \sum_{i} t_i log(p(x_i| c_n))$

To calculate the priors p(c_n) and conditional probabilities p(x_i| c_n):

$p(c_n) = \frac{N_{c_n}}{N} ,\: \: p(x_i| c_n) = \frac{count(x_i, c_n) + \epsilon }{count(c_n) + |V| * \epsilon }$

$\begin{align*} log(p(c_n | X)) <= \: & log( N_{c_n}) - log(N) +\\ & \sum_{i} t_i (log(count(x_i, c_n) + \epsilon) - log(count(c_n) + |V| * \epsilon)) \end{align*}$

So in the end we have those things need to calculate during training and predicting:

During Training and Predicting, save or access these parameters:
- N_cn : prior count of class c_n. Calculated via function logPrior in code.
- N: sum of all prior count N_cn of all class c_n. Calculated via function logPrior in code.
- count(x_i, c_n): count of word/feature i's appearance in class c_n countFeatureAppearsInOutcome
- count(c_n): count of total numbers of word/feature appeared in class c_n totalFeatureCountInOutcome
- |V|: count of the unique words/feature/vocabulary among all classes. In code it is called numOfUniqueFeaturesSeen
Only During Predicting, also calculate:
- t_i: number of time word/feature i appears in the incoming data for prediction. In code it is called inputFeatureCounts
Constant:
- epsilon: pseudocount, no probability is ever set to be exactly zero. By default we set it as 1, This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one

pic-1

Dependencies

~4–6MB
~112K SLoC