9 unstable releases

0.5.0 Apr 24, 2023
0.5.0-beta.2 Nov 26, 2022
0.5.0-beta.1 Aug 5, 2022
0.5.0-beta.0 Jul 19, 2022
0.2.0 Nov 19, 2020

#786 in Machine learning


86 downloads per month
Used in 2 crates

MIT/Apache

150KB
3.5K SLoC

SyntaxDot

Introduction

SyntaxDot is a sequence labeler and dependency parser that uses Transformer networks. SyntaxDot models can be trained from scratch or fine-tuned from pretrained models such as BERT or XLM-RoBERTa.

In principle, SyntaxDot can be used to perform any sequence labeling task, but so far the focus has been on:

  • Part-of-speech tagging
  • Morphological tagging
  • Topological field tagging
  • Lemmatization
  • Named entity recognition

The easiest way to get started with SyntaxDot is to use a pretrained sticker2 model (SyntaxDot is currently compatible with sticker2 models).

Features

  • Input representations:
    • Word pieces
    • Sentence pieces
  • Flexible sequence encoder/decoder architecture, which supports:
    • Simple sequence labels (e.g. POS, morphology, named entities)
    • Lemmatization, based on edit trees (see the sketch after this list)
    • Simple API to extend to other tasks
    • Dependency parsing as sequence labeling
  • Dependency parsing using deep biaffine attention and MST decoding
  • Multi-task training and classification using scalar weighting
  • Encoder models:
    • Transformers
    • Finetuning of BERT, XLM-RoBERTa, ALBERT, and SqueezeBERT models
  • Model distillation
  • Deployment:
    • Standalone binary that links against PyTorch's libtorch
    • Very liberal license
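
As a rough illustration of the edit-tree approach to lemmatization, the sketch below derives a rewrite rule from a single (form, lemma) pair by finding their longest common substring and recursing into the parts around it, and then reapplies that rule to an unseen form. This is a minimal, self-contained sketch of the general idea, not SyntaxDot's implementation; all names in it are invented for this example.

    // Minimal edit-tree lemmatization sketch (names invented for this example).
    enum EditTree {
        // Replace the covered part of the form by a fixed string (possibly empty).
        Replace(String),
        // Keep the longest common substring; `pre` form characters before it are
        // rewritten by `left`, `suf` form characters after it by `right`.
        Match { pre: usize, suf: usize, left: Box<EditTree>, right: Box<EditTree> },
    }

    // Naive longest common substring of `a` and `b`: (start in a, start in b, length).
    fn lcs(a: &[char], b: &[char]) -> (usize, usize, usize) {
        let (mut len, mut start_a, mut start_b) = (0, 0, 0);
        for i in 0..a.len() {
            for j in 0..b.len() {
                let mut k = 0;
                while i + k < a.len() && j + k < b.len() && a[i + k] == b[j + k] {
                    k += 1;
                }
                if k > len {
                    len = k;
                    start_a = i;
                    start_b = j;
                }
            }
        }
        (start_a, start_b, len)
    }

    // Build an edit tree that rewrites `form` into `lemma`.
    fn build(form: &[char], lemma: &[char]) -> EditTree {
        let (fa, la, len) = lcs(form, lemma);
        if len == 0 {
            return EditTree::Replace(lemma.iter().collect());
        }
        EditTree::Match {
            pre: fa,
            suf: form.len() - fa - len,
            left: Box::new(build(&form[..fa], &lemma[..la])),
            right: Box::new(build(&form[fa + len..], &lemma[la + len..])),
        }
    }

    // Apply an edit tree to a (possibly unseen) form; `None` if it does not fit.
    fn apply(tree: &EditTree, form: &[char]) -> Option<String> {
        match tree {
            EditTree::Replace(replacement) => Some(replacement.clone()),
            EditTree::Match { pre, suf, left, right } => {
                if form.len() < pre + suf {
                    return None;
                }
                let left_part = apply(left, &form[..*pre])?;
                let kept: String = form[*pre..form.len() - suf].iter().collect();
                let right_part = apply(right, &form[form.len() - suf..])?;
                Some(format!("{}{}{}", left_part, kept, right_part))
            }
        }
    }

    fn main() {
        let chars = |s: &str| s.chars().collect::<Vec<_>>();
        // Learn one rule from a single (form, lemma) pair ...
        let tree = build(&chars("gesagt"), &chars("sagen"));
        // ... and reuse it for another verb that follows the same pattern.
        assert_eq!(apply(&tree, &chars("gefragt")).as_deref(), Some("fragen"));
    }

The attraction of this encoding is that one tree, such as the rule learned from gesagt → sagen above, also covers gefragt → fragen, so lemmatization reduces to predicting a single edit-tree label per token.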

Documentation

References

SyntaxDot uses techniques from or was inspired by the following papers:

Issues

You can report bugs and feature requests in the SyntaxDot issue tracker.

License

For licensing information, see COPYRIGHT.md.


lib.rs:

Transformer models (Vaswani et al., 2017)

This crate implements various transformer models, provided through the models module. The implementations are more restricted than, for example, their Hugging Face counterparts, focusing only on the parts necessary for sequence labeling.
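
As a hedged illustration of what "only the parts necessary for sequence labeling" can mean in practice, the sketch below shows the shape of such an interface: an encoder produces one hidden vector per input piece, and a tagger is nothing more than a per-piece classifier on top of that. The trait, types, and function names are hypothetical and invented for this example; they are not this crate's actual API.

    // Hypothetical interface sketch; these names are not this crate's actual API.
    // All a sequence labeler needs from a transformer encoder: one hidden
    // vector per input piece.
    trait PieceEncoder {
        fn encode(&self, piece_ids: &[i64]) -> Vec<Vec<f32>>;
    }

    // A tagger is then just a per-piece classifier over the encoder output.
    fn tag<E, C>(encoder: &E, piece_ids: &[i64], classify: C) -> Vec<usize>
    where
        E: PieceEncoder,
        C: Fn(&[f32]) -> usize,
    {
        encoder
            .encode(piece_ids)
            .iter()
            .map(|hidden| classify(hidden.as_slice()))
            .collect()
    }

    // A toy encoder: embeds each piece id as a one-dimensional "hidden state".
    struct ToyEncoder;

    impl PieceEncoder for ToyEncoder {
        fn encode(&self, piece_ids: &[i64]) -> Vec<Vec<f32>> {
            piece_ids.iter().map(|&id| vec![id as f32]).collect()
        }
    }

    fn main() {
        // Label a piece as 1 when its (toy) hidden state is positive, else 0.
        let labels = tag(&ToyEncoder, &[3, -7, 12], |hidden| (hidden[0] > 0.0) as usize);
        assert_eq!(labels, vec![1, 0, 1]);
    }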

Dependencies

~11MB
~237K SLoC