
๐ŸŒฟ gline-rs: Inference Engine for GLiNER Models, in Rust

๐Ÿ’ฌ Introduction

gline-rs is an inference engine for GLiNER models. These language models have proved efficient at zero-shot Named Entity Recognition (NER) and other tasks such as Relation Extraction, while consuming fewer resources than large generative models (LLMs).

This implementation has been written from the ground up in Rust, and supports both span- and token-oriented variants (for inference only). It aims to provide a production-grade and user-friendly API in a modern and safe programming language, including a clean and maintainable implementation of the mechanics surrounding these models.

For those interested, it can also help in gaining a deeper understanding of how GLiNER works.

๐Ÿ’ก Background and Motivation

Common drawbacks of machine learning systems include cryptic implementations and high resource consumption. gline-rs aims to take a step toward a more maintainable and sustainable approach. ๐ŸŒฑ

Why GLiNER?

The term stands for "Generalist and Lightweight Model for Named Entity Recognition", after the original work by Zaratiana et al. It now refers to a family of lightweight models capable of performing various zero-shot extraction tasks using a bidirectional transformer architecture (BERT-like). For these kinds of tasks, this approach can be much more relevant than full-blown LLMs.

However, it is characterized by a number of operations that need to be performed both upstream and downstream of applying the pre-trained model. These operations are conceptually described in the academic papers, but the implementation details are not trivial to understand and reproduce. To address this issue, this implementation emphasizes code readability, modularity, and documentation.

Why Rust?

The original implementation was written in Python, which is widely used for machine learning, but not particularly efficient and not always suitable in production environments.

Rust combines bare-metal performance with memory and thread safety. It helps to write fast, reliable, and resource-efficient code by ensuring sound concurrency and memory use at compile time. For example, the borrow checker enforces strict ownership rules, preventing data races without resorting to costly defensive operations such as cloning.

Although Rust is not yet as widespread as Python in the ML world, it is an excellent candidate for building reliable and efficient ML systems.

๐ŸŽ“ Public API

Include gline-rs as a regular dependency in your Cargo.toml:

[dependencies]
gline-rs = "0.9.1"

The public API is self-explanatory:

let model = GLiNER::<TokenMode>::new(
    Parameters::default(),
    RuntimeParameters::default(),
    "tokenizer.json",
    "model.onnx",
)?;

let input = TextInput::from_str(
    &[
        "My name is James Bond.", 
        "I like to drive my Aston Martin.",
    ],
    &[
        "person", 
        "vehicle",
    ],
)?;

let output = model.inference(input)?;

// => "James Bond" : "person"
// => "Aston Martin" : "vehicle"

Please refer to the examples source code for complete listings.
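
For reference, the inference result can then be consumed programmatically. The snippet below is only a sketch: the spans field and the text(), class(), and probability() accessors are assumptions made for illustration (based on the example output shown further down), so check the examples for the actual names:

// Hypothetical iteration over the result; accessor names are assumed, not verified.
for (sequence_id, spans) in output.spans.iter().enumerate() {
    for span in spans {
        // Print the sequence index, entity text, entity class, and confidence.
        println!("{} | {} | {} | {:.1}%", sequence_id, span.text(), span.class(), span.probability() * 100.0);
    }
}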

๐Ÿงฌ Getting the Models

To leverage gline-rs, you need the appropriate models in ONNX format.

Ready-to-use models can be downloaded from ๐Ÿค— Hugging Face repositories, such as the ones referenced below.

To run the examples without any modification, this file structure is expected:

For token-mode:

models/gliner-multitask-large-v0.5/tokenizer.json
models/gliner-multitask-large-v0.5/onnx/model.onnx

For span-mode:

models/gliner_small-v2.1/tokenizer.json
models/gliner_small-v2.1/onnx/model.onnx
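
With that layout in place, the constructor shown earlier can point directly at the downloaded files. The sketch below assumes span mode, with SpanMode assumed to mirror the TokenMode type parameter used above:

// Assumed span-mode initialization, using the file layout described above.
let model = GLiNER::<SpanMode>::new(
    Parameters::default(),
    RuntimeParameters::default(),
    "models/gliner_small-v2.1/tokenizer.json",
    "models/gliner_small-v2.1/onnx/model.onnx",
)?;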

The original GLiNER implementation also provides tools to convert models yourself.

๐Ÿš€ Running the Examples

The examples are located in the examples directory. For instance, to run the token-mode example:

$ cargo run --example token-mode

Expected output:

0 | James Bond      | person     | 99.7%
1 | James           | person     | 98.1%
1 | Chelsea         | location   | 96.4%
1 | London          | location   | 92.4%
2 | James Bond      | person     | 99.4%
3 | Aston Martin    | vehicle    | 99.9%

โšก๏ธ GPU/NPU Inferences

The ort execution providers can be leveraged to perform considerably faster inferences on GPU/NPU hardware. A working example is provided in examples/benchmark-gpu.rs.

The first step is to pass the appropriate execution providers in RuntimeParameters (which is then passed to GLiNER initialization). For example:

let rtp = RuntimeParameters::default().with_execution_providers([
    CUDAExecutionProvider::default().build()
]);
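
The resulting runtime parameters then simply take the place of the defaults in the constructor shown earlier (model paths are placeholders):

// Pass the GPU-enabled runtime parameters to the model constructor.
let model = GLiNER::<TokenMode>::new(
    Parameters::default(),
    rtp,
    "tokenizer.json",
    "model.onnx",
)?;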

The second step is to activate the appropriate features (see related section below), otherwise the example will silently fall back to CPU. For example:

$ cargo run --example benchmark-gpu --features=cuda

Please refer to doc/ORT.md for details about execution providers.

๐Ÿ“ฆ Crate Features

This crate mirrors the following ort features (a Cargo.toml sketch follows the list):

  • To allow for dynamic loading of ONNX-runtime libraries: load-dynamic
  • To allow for activation of execution providers: cuda, tensorrt, directml, coreml, rocm, openvino, onednn, xnnpack, qnn, cann, nnapi, tvm, acl, armnn, migraphx, vitis, and rknpu
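
For instance, to enable the CUDA execution provider when using gline-rs as a dependency, the feature can be declared in Cargo.toml (version taken from the installation section above, adjust as needed):

[dependencies]
gline-rs = { version = "0.9.1", features = ["cuda"] }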

โฑ๏ธ Performances

CPU

Comparing performance across implementations is tricky, as it depends on many factors. But according to initial measurements, gline-rs can run about 4x faster on CPU than the original implementation, out of the box:

Implementation | sequences/second
gline-rs       |   6.67
GLiNER.py      |   1.61

Both implementations have been tested under the following configuration:

  • Dataset: subset of the NuNER dataset (first 100 entries)
  • Mode: token, flat_ner: true, multi_label: false
  • Number of entity classes: 3
  • Threshold: 0.5
  • Model: gliner-multitask-large-v0.5
  • CPU specs: Intel Core i9 @ 2.3 GHz with 8 cores (12 threads)

GPU

Unsurprisingly, leveraging a GPU dramatically increases the throughput:

Implementation | sequences/second
gline-rs       | 248.75

The configuration of the test is similar to the above, except:

  • Dataset: subset of the NuNER dataset (first 1000 entries)
  • Execution provider: CUDA
  • GPU specs: NVIDIA RTX 4080 (16 GB VRAM)
  • CPU specs: Intel Core i7-13700KF @ 3.4 GHz

(Comparison with the original implementation has yet to be done.)

๐Ÿงช Current Status

Although it is sufficiently mature to be embraced by the community, the current version (0.9.x) should not be considered production-ready.

For any critical use, it is advisable to wait until it has been extensively tested and ort-2.0 (the ONNX runtime wrapper) has reached a stable release.

The first stable, production-grade release will be labeled 1.0.0.

โš™๏ธ Design Principles

gline-rs is written in pure and safe Rust (apart from the ONNX runtime itself); see Cargo.toml for the full list of dependencies.

The implementation aims to clearly distinguish and comment each processing step, make them easily configurable, and model the pipeline concept almost declaratively.

Default configurations are provided, but it should be easy to adapt them:

  • One can have a look at the model::{pipeline, input, output} modules to see how the pre- and post-processing steps are defined by implementing the Pipeline trait.
  • Other traits such as Splitter or Tokenizer can easily be leveraged to experiment with different implementations of the text-processing steps.
  • While there is always room for improvement, special care has been taken to craft idiomatic, generic, commented, and efficient code.

๐Ÿ“– References and Acknowledgments

The GLiNER papers by Zaratiana et al. were used as references, and the original implementation was also consulted to check the details.

Special thanks to the original authors of GLiNER for this great and original work. ๐Ÿ™
