#word #tokenization #wordpiece #piece

wordpieces

Split tokens into word pieces

8 releases (4 breaking)

0.5.0 Jul 6, 2021
0.4.1 Feb 5, 2021
0.4.0 May 13, 2020
0.3.0 Apr 15, 2020
0.1.0 Nov 29, 2019

#6 in #word

Download history (weekly downloads, 2021-07-04 through 2021-10-17)

194 downloads per month
Used in 5 crates (3 directly)

MIT/Apache

13KB
295 lines

lib.rs:

Tokenize words into word pieces.

This crate provides a subword tokenizer. A subword tokenizer splits a token into several pieces, so-called word pieces. Word pieces were popularized by their use in the BERT natural language encoder.
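
Conceptually, word-piece tokenizers of this kind perform greedy longest-match-first splitting: they take the longest vocabulary entry that prefixes the word, then repeat on the remainder using continuation pieces. The standalone sketch below illustrates the idea with a plain HashSet vocabulary; it is not this crate's implementation, and the "##" continuation marker and the split_greedy name are illustrative assumptions.

use std::collections::HashSet;

// Illustrative sketch only: not this crate's implementation. The "##"
// continuation marker is an assumed BERT-style convention.
fn split_greedy(word: &str, vocab: &HashSet<String>) -> Vec<Option<String>> {
    let mut pieces = Vec::new();
    let mut rest = word;
    let mut initial = true;
    while !rest.is_empty() {
        let mut len = rest.len();
        let mut matched = None;
        // Shrink the candidate from the right until it is in the vocabulary.
        while len > 0 {
            if rest.is_char_boundary(len) {
                let candidate = if initial {
                    rest[..len].to_string()
                } else {
                    format!("##{}", &rest[..len])
                };
                if vocab.contains(&candidate) {
                    matched = Some(candidate);
                    break;
                }
            }
            len -= 1;
        }
        match matched {
            Some(piece) => {
                pieces.push(Some(piece));
                rest = &rest[len..];
                initial = false;
            }
            None => {
                // No vocabulary entry matches: the tail cannot be split.
                pieces.push(None);
                break;
            }
        }
    }
    pieces
}

fn main() {
    let vocab: HashSet<String> =
        ["voor", "##komen"].iter().map(|s| s.to_string()).collect();
    assert_eq!(
        split_greedy("voorkomen", &vocab),
        vec![Some("voor".to_string()), Some("##komen".to_string())]
    );
}

Because matching is greedy and longest-first, an out-of-vocabulary tail terminates the split, which is what the partial split in the example below demonstrates.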

The tokenizer splits a word, providing an iterator over its pieces. Each piece is represented as a string together with its vocabulary index.

use std::convert::TryFrom;
use std::fs::File;
use std::io::{BufRead, BufReader};

use wordpieces::WordPieces;

// Read the piece vocabulary line by line from a file.
let f = File::open("testdata/test.pieces").unwrap();
let word_pieces = WordPieces::try_from(BufReader::new(f).lines()).unwrap();

// A word that can be split fully.
let pieces = word_pieces.split("coördinatie")
 .map(|p| p.piece()).collect::<Vec<_>>();
assert_eq!(pieces, vec![Some("coördina"), Some("tie")]);

// A word that can be split partially.
let pieces = word_pieces.split("voorkomen")
 .map(|p| p.piece()).collect::<Vec<_>>();
assert_eq!(pieces, vec![Some("voor"), None]);

Dependencies

~2.5MB
~25K SLoC
