4 releases

0.1.3	Apr 2, 2024
0.1.2	Mar 8, 2024
0.1.1	Feb 24, 2024
0.1.0	Feb 9, 2024

MIT/Apache

mpnet-rs

What is this?

This is a port of MPNet from PyTorch to Rust, built on Candle.

The trained model I used is PatentSBERTa, which is designed to obtain embeddings optimized for the patent domain.
A training pipeline is NOT yet available.
If you have your own MPNet weights, they can be loaded using this crate.

Updates

v.0.1.3

Some dtypes were changed in get_embeddings() and get_embeddings_parallel().

v.0.1.2

Candle upgraded from 0.3.3 to 0.4.1.

v.0.1.1

Added a parallel version of get_embeddings(): get_embeddings_parallel().

How to use

Get the trained model

Download the model from Hugging Face.
Candle v0.4.0 supports loading pytorch_model.bin directly, but v0.3.3 does not.
If you want to load the model from .safetensors, you have to convert it yourself; this implementation might be helpful.

Load the model and weights

use mpnet_rs::mpnet::load_model;
let (model, tokenizer, pooler) = load_model("/path/to/model/and/tokenizer").unwrap();

Get embeddings (with pooler): see the test function below

The test function shows how to compute embeddings and rank sentence pairs by cosine similarity.

use candle_core::Result;

// Only the items this example actually uses are imported.
use mpnet_rs::mpnet::{load_model, get_embeddings, normalize_l2};


fn test_get_embeddings() -> Result<()> {
    let path_to_checkpoints_folder = "D:/RustWorkspace/checkpoints/AI-Growth-Lab_PatentSBERTa".to_string();

    // The tokenizer is only borrowed immutably below, so it does not need to be `mut`.
    let (model, tokenizer, pooler) = load_model(path_to_checkpoints_folder).unwrap();

    let sentences = vec![
        "an invention that targets GLP-1",
        "new chemical that targets glucagon like peptide-1 ",
        "de novo chemical that targets GLP-1",
        "invention about GLP-1 receptor",
        "new chemical synthesis for glp-1 inhibitors",
        "It feels like I'm in America",
        "It's rainy. all day long.",
    ];
    let n_sentences = sentences.len();
    let embeddings = get_embeddings(&model, &tokenizer, Some(&pooler), &sentences).unwrap();

    let l2norm_embeds = normalize_l2(&embeddings).unwrap();
    println!("pooled embeddings {:?}", l2norm_embeds.shape());

    let mut similarities = vec![];
    for i in 0..n_sentences {
        let e_i = l2norm_embeds.get(i)?;
        for j in (i + 1)..n_sentences {
            let e_j = l2norm_embeds.get(j)?;
            let sum_ij = (&e_i * &e_j)?.sum_all()?.to_scalar::<f32>()?;
            let sum_i2 = (&e_i * &e_i)?.sum_all()?.to_scalar::<f32>()?;
            let sum_j2 = (&e_j * &e_j)?.sum_all()?.to_scalar::<f32>()?;
            let cosine_similarity = sum_ij / (sum_i2 * sum_j2).sqrt();
            similarities.push((cosine_similarity, i, j))
        }
    }
    similarities.sort_by(|u, v| v.0.total_cmp(&u.0));
    for &(score, i, j) in similarities[..5].iter() {
        println!("score: {score:.2} '{}' '{}'", sentences[i], sentences[j])
    }

    Ok(())
}
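
The similarity loop above can also be sketched without Candle, on plain f32 slices. The helper names below (dot, l2_normalize, cosine_similarity, top_pairs) are hypothetical and not part of mpnet-rs; they only illustrate the arithmetic. Note that once vectors are L2-normalized, their dot product already equals the cosine similarity, so the division in the loop above is close to a no-op. The sketch also uses truncate instead of slicing with [..5], which avoids a panic when fewer than five pairs exist.

```rust
/// Dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scale a vector to unit L2 norm (returned unchanged if its norm is zero).
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = dot(v, v).sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Cosine similarity computed from raw (not necessarily normalized) vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b) / (dot(a, a) * dot(b, b)).sqrt()
}

/// Score every pair (i, j) with i < j and return at most `k` pairs,
/// sorted from most to least similar.
fn top_pairs(embeddings: &[Vec<f32>], k: usize) -> Vec<(f32, usize, usize)> {
    let mut scores = Vec::new();
    for i in 0..embeddings.len() {
        for j in (i + 1)..embeddings.len() {
            scores.push((cosine_similarity(&embeddings[i], &embeddings[j]), i, j));
        }
    }
    scores.sort_by(|u, v| v.0.total_cmp(&u.0));
    scores.truncate(k); // safe even when fewer than `k` pairs exist
    scores
}
```

Calling top_pairs(&embeddings, 5) on a list of embedding vectors returns the same kind of (score, i, j) tuples that the test function prints.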

Note

Pooling layer

In the original PyTorch implementation in Transformers, the pooling layers are declared inside the MPNetModel class.
Here the pooling layer is implemented independently, separate from the MPNetModel class.

Activation

In the original implementation, tanh is used as the activation function for the pooling layers.
However, since it was difficult to find an implementation of tanh in Candle, gelu is set as the default.
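
To make the difference between the two activations concrete, here is a minimal plain-Rust sketch of a dense pooling step, y = act(W x + b), with the activation passed in as a function. It is not the crate's code: dense_pool, tanh_act and gelu_tanh_approx are hypothetical names, and the GELU shown is the common tanh approximation, which may differ slightly from what Candle computes.

```rust
use std::f32::consts::PI;

/// tanh activation, as used by the pooler in the original implementation.
fn tanh_act(x: f32) -> f32 {
    x.tanh()
}

/// GELU via the widely used tanh approximation (an assumption for this sketch).
fn gelu_tanh_approx(x: f32) -> f32 {
    0.5 * x * (1.0 + ((2.0 / PI).sqrt() * (x + 0.044715 * x.powi(3))).tanh())
}

/// Hypothetical dense pooling step: applies `act` element-wise to W x + b.
/// `weight` holds one row per output dimension.
fn dense_pool(weight: &[Vec<f32>], bias: &[f32], x: &[f32], act: fn(f32) -> f32) -> Vec<f32> {
    weight
        .iter()
        .zip(bias)
        .map(|(row, b)| {
            let z = row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + b;
            act(z)
        })
        .collect()
}
```

With tanh_act every output stays strictly between -1 and 1, while gelu_tanh_approx is unbounded above and maps large negative inputs to values near zero. Pooled embeddings from the two activations are therefore not interchangeable, which matters when comparing results against the PyTorch model.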
