7 unstable releases
| Version | Released |
|---|---|
| 0.4.1 | Sep 23, 2023 |
| 0.3.0 | Oct 6, 2020 |
| 0.2.1 | Aug 21, 2020 |
| 0.1.2 | Jun 19, 2020 |
| 0.1.1 | May 19, 2020 |
Rust SBert
Rust port of sentence-transformers using rust-bert and tch-rs.
Supports both rust-tokenizers and Hugging Face's tokenizers.
Supported models
- distiluse-base-multilingual-cased: Supported languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. Performance on the extended STS2017: 80.1
- DistilRoBERTa-based classifiers
Usage
Example
The API is designed to be easy to use and lets you create quality multilingual sentence embeddings in a few lines.
Load an SBert model and its weights by specifying the model directory:
use std::env;
use std::path::PathBuf;

let mut home: PathBuf = env::current_dir().unwrap();
home.push("path-to-model");
Two model variants are available, one for each tokenizer backend:
// To use Hugging Face tokenizer
let sbert_model = SBertHF::new(home.to_str().unwrap());
// To use Rust-tokenizers
let sbert_model = SBertRT::new(home.to_str().unwrap());
Now, you can encode your sentences:
let texts = ["You can encode",
             "As many sentences",
             "As you want",
             "Enjoy ;)"];

let batch_size = 64;

let output = sbert_model.forward(texts.to_vec(), batch_size).unwrap();
The batch_size parameter can be set to None to let the model use its default value.
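To illustrate what a batch size with a fallback default means in practice, here is a minimal, standalone sketch of splitting sentences into batches. It is not the crate's implementation: `make_batches` and `DEFAULT_BATCH_SIZE` are hypothetical names, and the value 64 is only an assumed placeholder, since the crate's actual default is not stated here.

```rust
/// Hypothetical default used for illustration only; the crate's real
/// default batch size is not stated in this README.
const DEFAULT_BATCH_SIZE: usize = 64;

/// Split `texts` into consecutive batches of at most `batch_size` items,
/// falling back to `DEFAULT_BATCH_SIZE` when `batch_size` is `None`.
fn make_batches<'a>(texts: &'a [&'a str], batch_size: Option<usize>) -> Vec<&'a [&'a str]> {
    // Clamp to at least 1, because `chunks` panics on a chunk size of 0.
    let size = batch_size.unwrap_or(DEFAULT_BATCH_SIZE).max(1);
    texts.chunks(size).collect()
}

fn main() {
    let texts = ["You can encode", "As many sentences", "As you want", "Enjoy ;)"];

    // An explicit batch size of 3 splits the four sentences into 3 + 1.
    for (i, batch) in make_batches(&texts, Some(3)).iter().enumerate() {
        println!("batch {}: {} sentence(s)", i, batch.len());
    }

    // `None` falls back to the default, so everything fits in one batch.
    println!("default: {} batch(es)", make_batches(&texts, None).len());
}
```

Larger batches generally mean fewer forward passes at the cost of more memory per pass, which is why the value is left configurable.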
You can then use the output sentence embeddings in any application.
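A common use of sentence embeddings is comparing two sentences by cosine similarity. The sketch below is standalone and uses only the standard library. It assumes each embedding can be viewed as a slice of `f32`; the toy 3-dimensional vectors stand in for real model output, and `cosine_similarity` is a hypothetical helper, not part of this crate.

```rust
/// Cosine similarity between two embeddings of equal length.
/// Returns `None` if the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    // Toy vectors standing in for real sentence embeddings (assumption).
    let output: Vec<Vec<f32>> = vec![
        vec![1.0, 0.0, 0.0],
        vec![0.5, 0.5, 0.0],
        vec![0.0, 0.0, 2.0],
    ];

    // Compare every pair of embeddings once.
    for i in 0..output.len() {
        for j in (i + 1)..output.len() {
            match cosine_similarity(&output[i], &output[j]) {
                Some(score) => println!("similarity({}, {}) = {:.3}", i, j, score),
                None => println!("similarity({}, {}) is undefined", i, j),
            }
        }
    }
}
```

Scores close to 1 indicate embeddings pointing in nearly the same direction, while scores near 0 indicate unrelated ones.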
Convert models from Python to Rust
First, get a model provided by UKPLab (all models are here):
mkdir -p models/distiluse-base-multilingual-cased
wget -P models https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/distiluse-base-multilingual-cased.zip
unzip models/distiluse-base-multilingual-cased.zip -d models/distiluse-base-multilingual-cased
Then convert the model to a suitable format (requires PyTorch):
python utils/prepare_distilbert.py models/distiluse-base-multilingual-cased
A dockerized environment is also available for running the conversion script:
docker build -t tch-converter -f utils/Dockerfile .
docker run \
  -v "$(pwd)/models/distiluse-base-multilingual-cased":/model \
  tch-converter:latest \
  python prepare_distilbert.py /model
Finally, set "output_attentions": true in distiluse-base-multilingual-cased/0_distilbert/config.json.
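Editing the file by hand is the simplest route. If you want to script this step, the following is a rough, standard-library-only sketch that sets the flag in a flat JSON object by text manipulation. `enable_output_attentions` is a hypothetical helper, not part of this crate, and it assumes the key is not nested and does not appear inside a string value; a real JSON parser would be more robust.

```rust
/// Return `config` with `"output_attentions"` set to `true`.
/// Naive text-level edit that assumes a flat JSON object.
/// Returns `None` if the text does not look like a JSON object.
fn enable_output_attentions(config: &str) -> Option<String> {
    let key = "\"output_attentions\"";
    match config.find(key) {
        Some(key_start) => {
            // Key present: overwrite the value that follows the colon.
            let after_key = key_start + key.len();
            let value_start = after_key + config[after_key..].find(':')? + 1;
            // The value runs until the next comma or closing brace.
            let raw_len = config[value_start..].find(|c: char| c == ',' || c == '}')?;
            // Keep whitespace that trails the old value (e.g. a newline).
            let value_end =
                value_start + config[value_start..value_start + raw_len].trim_end().len();
            Some(format!("{} true{}", &config[..value_start], &config[value_end..]))
        }
        None => {
            // Key absent: insert it right after the opening brace.
            let after_brace = config.find('{')? + 1;
            let rest = &config[after_brace..];
            // Only add a comma when other entries follow.
            let separator = if rest.trim_start().starts_with('}') { "" } else { "," };
            Some(format!(
                "{}\n  \"output_attentions\": true{}{}",
                &config[..after_brace],
                separator,
                rest
            ))
        }
    }
}

fn main() {
    // Inline sample standing in for the contents of config.json.
    let config = r#"{"dim": 768, "output_attentions": false}"#;
    match enable_output_attentions(config) {
        Some(updated) => println!("{}", updated),
        None => eprintln!("input does not look like a JSON object"),
    }
}
```

To apply it to the real file, read config.json with `std::fs::read_to_string`, pass the contents through the helper, and write the result back with `std::fs::write`.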