
srx

A mostly compliant Rust implementation of the Segmentation Rules eXchange (SRX) 2.0 standard for text segmentation

5 releases

0.1.4 Jul 17, 2023
0.1.3 Mar 27, 2021
0.1.2 Feb 10, 2021
0.1.1 Feb 6, 2021
0.1.0 Feb 6, 2021

#459 in Text processing


1,185 downloads per month
Used in 5 crates (via nlprule)

MIT/Apache

43KB
455 lines

SRX


A simple and reasonably fast Rust implementation of the Segmentation Rules eXchange 2.0 standard for text segmentation. srx is not fully compliant with the standard.

This crate is intended for segmenting plain text, so markup information (<formathandle> and segmentsubflows) is ignored.

Contrary to the SRX specification, overlapping matches of the same <rule> are not found, which could lead to different behavior in a few edge cases.
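To illustrate what "overlapping matches are not found" means, here is a minimal sketch using only the standard library. It searches for a literal substring instead of a regular expression and is not the code srx uses; the function names non_overlapping and overlapping are made up for this example.

```rust
/// Start offsets of non-overlapping occurrences of `needle`, scanning left to
/// right. After a match, the search resumes behind the end of that match.
fn non_overlapping(haystack: &str, needle: &str) -> Vec<usize> {
    haystack.match_indices(needle).map(|(i, _)| i).collect()
}

/// Start offsets of all occurrences of a non-empty `needle`, including those
/// that begin inside a previous match.
fn overlapping(haystack: &str, needle: &str) -> Vec<usize> {
    haystack
        .char_indices()
        .map(|(i, _)| i)
        .filter(|&i| haystack[i..].starts_with(needle))
        .collect()
}

fn main() {
    let text = "a.a.a";
    // Only the match at offset 0 is reported.
    println!("{:?}", non_overlapping(text, "a.a"));
    // The match at offset 2 starts inside the first one and is reported too.
    println!("{:?}", overlapping(text, "a.a"));
}
```

A rule whose pattern can match again starting inside its own previous match would therefore produce fewer candidate positions under the non-overlapping strategy.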

A note on regular expressions

This crate uses the regex crate for parsing and executing regular expressions. The regex crate is mostly compatible with the regular expression standard from the SRX specification. However, some metacharacters such as \Q and \E are not supported.

Some useful SRX files, such as segment.srx from LanguageTool, do not comply with the standard, e.g. by using look-ahead and look-behind. To still be able to parse such files and others containing unsupported rules, srx ignores <rule> elements with invalid regular expressions and provides information about them via the srx.errors() function.
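The "skip invalid rules, but report them" strategy can be sketched as follows. This is an illustration under stated assumptions, not srx's implementation or API: Rule, check_pattern and load_rules are hypothetical names, and check_pattern is a naive substring check standing in for real regular expression compilation. The shape of the returned error map is also an assumption and may differ from what srx.errors() returns.

```rust
use std::collections::HashMap;

/// Hypothetical rule: text before and after a candidate position, plus
/// whether a match means "break here" or "do not break here".
#[derive(Debug, Clone)]
struct Rule {
    before_break: String,
    after_break: String,
    do_break: bool,
}

/// Naive stand-in for regex compilation (assumption): rejects patterns that
/// contain look-around or the \Q / \E metacharacters. It does not account for
/// escaping, so it can reject some patterns that are actually valid.
fn check_pattern(pattern: &str) -> Result<(), String> {
    const UNSUPPORTED: [&str; 6] = ["(?=", "(?!", "(?<=", "(?<!", "\\Q", "\\E"];
    for construct in UNSUPPORTED.iter() {
        if pattern.contains(construct) {
            return Err(format!(
                "unsupported construct `{}` in `{}`",
                construct, pattern
            ));
        }
    }
    Ok(())
}

/// Keeps the valid rules per language and records one error message for every
/// rule that was skipped, instead of failing on the first invalid rule.
fn load_rules(
    raw: Vec<(&str, Rule)>,
) -> (HashMap<String, Vec<Rule>>, HashMap<String, Vec<String>>) {
    let mut rules: HashMap<String, Vec<Rule>> = HashMap::new();
    let mut errors: HashMap<String, Vec<String>> = HashMap::new();
    for (lang, rule) in raw {
        let checked = check_pattern(&rule.before_break)
            .and_then(|_| check_pattern(&rule.after_break));
        match checked {
            Ok(()) => rules.entry(lang.to_string()).or_default().push(rule),
            Err(msg) => errors.entry(lang.to_string()).or_default().push(msg),
        }
    }
    (rules, errors)
}

fn main() {
    let raw = vec![
        ("English", Rule {
            before_break: r"\.".to_string(),
            after_break: r"\s".to_string(),
            do_break: true,
        }),
        // Uses look-behind, so this rule is skipped and reported.
        ("English", Rule {
            before_break: r"(?<=Mr)\.".to_string(),
            after_break: r"\s".to_string(),
            do_break: false,
        }),
    ];
    let (rules, errors) = load_rules(raw);
    for (lang, kept) in &rules {
        for rule in kept {
            println!("{}: kept rule (break = {}): {:?}", lang, rule.do_break, rule);
        }
    }
    for (lang, messages) in &errors {
        for message in messages {
            println!("{}: skipped rule: {}", lang, message);
        }
    }
}
```

Collecting errors rather than aborting means a file with a few unsupported rules still yields a usable, if slightly less precise, rule set, and the caller can inspect what was dropped.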
