gline-rs: Inference Engine for GLiNER Models, in Rust
Introduction
gline-rs is an inference engine for GLiNER models. These language models have proved to be efficient at zero-shot Named Entity Recognition (NER) and other tasks such as Relation Extraction, while consuming fewer resources than large generative models (LLMs).
This implementation has been written from the ground up in Rust, and supports both span- and token-oriented variants (for inference only). It aims to provide a production-grade and user-friendly API in a modern and safe programming language, including a clean and maintainable implementation of the mechanics surrounding these models.
For those interested, it can also help in gaining a deeper understanding of how GLiNER operates.
Background and Motivation
Common drawbacks of machine learning systems include cryptic implementations and high resource consumption. gline-rs aims to take a step toward a more maintainable and sustainable approach.
Why GLiNER?
The term stands for "Generalist and Lightweight Model for Named Entity Recognition", after the original work by Zaratiana et al. It now refers to a family of lightweight models capable of performing various zero-shot extractions using a bidirectional transformer architecture (BERT-like). For this kind of task, this approach can be much more relevant than full-blown LLMs.
However, it is characterized by a number of operations that need to be performed both upstream and downstream of applying the pre-trained model. These operations are conceptually described in the academic papers, but the implementation details are not trivial to understand and reproduce. To address this issue, this implementation emphasizes code readability, modularity, and documentation.
Why Rust?
The original implementation was written in Python, which is widely used for machine learning but is not particularly efficient and not always suitable for production environments.
Rust combines bare-metal performance with memory and thread safety. It helps to write fast, reliable, and resource-efficient code by ensuring sound concurrency and memory use at compile time. For example, the borrow checker enforces strict ownership rules, preventing data races without resorting to costly defensive operations such as cloning.
Although it is not yet as widespread as Python in the ML world, it makes an excellent candidate for enabling reliable and efficient ML systems.
Public API
Include gline-rs as a regular dependency in your Cargo.toml:
[dependencies]
"gline-rs" = "0.9.1"
The public API is self-explanatory:
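// Load a GLiNER model in token mode from a local tokenizer and ONNX graph: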
let model = GLiNER::<TokenMode>::new(
Parameters::default(),
RuntimeParameters::default(),
"tokenizer.json",
"model.onnx",
)?;
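// Texts to analyze, and the entity labels to look for: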
let input = TextInput::from_str(
&[
"My name is James Bond.",
"I like to drive my Aston Martin.",
],
&[
"person",
"vehicle",
],
)?;
let output = model.inference(input)?;
// => "James Bond" : "person"
// => "Aston Martin" : "vehicle"
Please refer to the examples source code for the complete programs.
Getting the Models
To leverage gline-rs, you need the appropriate models in ONNX format.
Ready-to-use models can be downloaded from Hugging Face repositories: for example, the gliner-multitask-large-v0.5 (token mode) and gliner_small-v2.1 (span mode) models referenced below.
To run the examples without any modification, this file structure is expected:
For token-mode:
models/gliner-multitask-large-v0.5/tokenizer.json
models/gliner-multitask-large-v0.5/onnx/model.onnx
For span-mode:
models/gliner_small-v2.1/tokenizer.json
models/gliner_small-v2.1/onnx/model.onnx
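As a minimal sketch (reusing the constructor shown in the Public API section), the token-mode layout above maps directly onto the model paths:
let model = GLiNER::<TokenMode>::new(
    Parameters::default(),
    RuntimeParameters::default(),
    "models/gliner-multitask-large-v0.5/tokenizer.json",
    "models/gliner-multitask-large-v0.5/onnx/model.onnx",
)?;
// For span mode, substitute the gliner_small-v2.1 paths (and the span-oriented mode type).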
The original GLiNER implementation also provides tools to convert models on your own.
Running the Examples
The examples are located in the examples directory. For instance, in token mode:
$ cargo run --example token-mode
Expected output:
0 | James Bond | person | 99.7%
1 | James | person | 98.1%
1 | Chelsea | location | 96.4%
1 | London | location | 92.4%
2 | James Bond | person | 99.4%
3 | Aston Martin | vehicle | 99.9%
GPU/NPU Inferences
The ort execution providers can be leveraged to perform considerably faster inferences on GPU/NPU hardware. A working example is provided in examples/benchmark-gpu.rs.
The first step is to pass the appropriate execution providers in RuntimeParameters (which is then passed to GLiNER initialization). For example:
let rtp = RuntimeParameters::default().with_execution_providers([
CUDAExecutionProvider::default().build()
]);
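These runtime parameters then take the place of the defaults at model creation; a minimal sketch, reusing the constructor and model paths shown earlier:
let model = GLiNER::<TokenMode>::new(
    Parameters::default(),
    rtp,
    "models/gliner-multitask-large-v0.5/tokenizer.json",
    "models/gliner-multitask-large-v0.5/onnx/model.onnx",
)?;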
The second step is to activate the appropriate features (see the related section below), otherwise the example will silently fall back to CPU. For example:
$ cargo run --example benchmark-gpu --features=cuda
Please refer to doc/ORT.md for details about execution providers.
Crate Features
This crate mirrors the following ort features:
- To allow for dynamic loading of ONNX-runtime libraries: load-dynamic
- To allow for activation of execution providers: cuda, tensorrt, directml, coreml, rocm, openvino, onednn, xnnpack, qnn, cann, nnapi, tvm, acl, armnn, migraphx, vitis, and rknpu
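For example, a Cargo.toml entry enabling the CUDA execution provider might look like this (a sketch; pick the feature matching your hardware):
[dependencies]
"gline-rs" = { version = "0.9.1", features = ["cuda"] }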
Performances
CPU
Comparing performance from one implementation to another is complicated, as it depends on many factors. But according to initial measurements, gline-rs appears to run about 4x faster on CPU than the original implementation out of the box:
Implementation | sequences/second |
---|---|
gline-rs | 6.67 |
GLiNER.py | 1.61 |
Both implementations have been tested under the following configuration:
- Dataset: subset of the NuNER dataset (first 100 entries)
- Mode: token, flat_ner: true, multi_label: false
- Number of entity classes: 3
- Threshold: 0.5
- Model: gliner-multitask-large-v0.5
- CPU specs: Intel Core i9 @ 2.3 GHz with 8 cores (12 threads)
GPU
Unsurprisingly, leveraging a GPU dramatically increases the throughput:
Implementation | sequences/second |
---|---|
gline-rs | 248.75 |
The configuration of the test is similar to the above, except:
- Dataset: subset of the NuNER dataset (first 1000 entries)
- Execution provider: CUDA
- GPU specs: NVIDIA RTX 4080 (16 GB VRAM)
- CPU specs: Intel Core i7 13700KF @ 3.4 GHz
(Comparison with the original implementation has yet to be done.)
Current Status
Although it is sufficiently mature to be embraced by the community, the current version (0.9.x) should not be considered production-ready.
For any critical use, it is advisable to wait until it has been extensively tested and ort-2.0 (the ONNX runtime wrapper) has reached a stable release.
The first stable, production-grade release will be labeled 1.0.0.
Design Principles
gline-rs is written in pure and safe Rust (besides the ONNX runtime itself), with the following dependencies:
- the ort ONNX runtime wrapper,
- the Hugging-Face tokenizers,
- the ndarray crate,
- the regex crate.
The implementation aims to clearly distinguish and comment each processing step, make them easily configurable, and model the pipeline concept almost declaratively.
Default configurations are provided, but it should be easy to adapt them:
- One can have a look at the model::{pipeline, input, output} modules to see how the pre- and post-processing steps are defined by implementing the Pipeline trait.
- Other traits like Splitter or Tokenizer can be easily leveraged to test different implementations of the text-processing steps.
- While there is always room for improvement, special care has been taken to craft idiomatic, generic, commented, and efficient code.
References and Acknowledgments
The following papers were used as references:
- GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer by Urchade Zaratiana, Nadi Tomeh, Pierre Holat and Thierry Charnois (2023).
- GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks by Ihor Stepanov and Mykhailo Shtopko (2024).
- Named Entity Recognition as Structured Span Prediction by Urchade Zaratiana, Nadi Tomeh, Pierre Holat and Thierry Charnois (2022).
The original implementation was also used to check implementation details.
Special thanks to the original authors of GLiNER for this great and original work.