9 releases (5 breaking)

Version | Date
---|---
0.6.1 | Oct 10, 2022
0.5.0 | Jul 6, 2021
0.4.1 | Feb 5, 2021
0.4.0 | May 13, 2020
0.1.0 | Nov 29, 2019
#1099 in Text processing
112 downloads per month
Used in 5 crates (3 directly)
15KB
342 lines
wordpieces
This crate provides a subword tokenizer. A subword tokenizer splits a token into several pieces, so-called word pieces. Word pieces were popularized by, and used in, the BERT natural language encoder.
lib.rs:
Tokenize words into word pieces.
The tokenizer splits a word and provides an iterator over the resulting pieces; each piece is represented by its string form and its vocabulary index.
```rust
use std::fs::File;
use std::io::BufReader;

use wordpieces::WordPieces;

let f = File::open("testdata/test.pieces").unwrap();
let word_pieces = WordPieces::from_buf_read(BufReader::new(f)).unwrap();

// A word that can be split fully.
let pieces = word_pieces.split("coördinatie")
    .map(|p| p.piece()).collect::<Vec<_>>();
assert_eq!(pieces, vec![Some("coördina"), Some("tie")]);

// A word that can only be split partially.
let pieces = word_pieces.split("voorkomen")
    .map(|p| p.piece()).collect::<Vec<_>>();
assert_eq!(pieces, vec![Some("voor"), None]);
```
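For intuition, here is a minimal, self-contained sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers perform. It is an illustration under assumptions, not this crate's implementation: the toy vocabulary, the `split_greedy` helper, and the conventional `##` continuation marker are all invented for the example.

```rust
use std::collections::HashMap;

// Illustrative sketch only; the real wordpieces crate has its own
// representation and API. Splits `word` into the longest vocabulary
// pieces, left to right, returning each piece with its vocabulary index.
// Returns `None` when some remainder cannot be matched at all.
fn split_greedy<'a>(
    word: &str,
    vocab: &'a HashMap<String, u64>,
) -> Option<Vec<(&'a str, u64)>> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < word.len() {
        // Try the longest candidate first, shrinking from the right.
        let mut end = word.len();
        let mut found = None;
        while end > start {
            // Non-initial pieces conventionally carry a "##" prefix.
            let candidate = if start == 0 {
                word[start..end].to_string()
            } else {
                format!("##{}", &word[start..end])
            };
            if let Some((piece, &idx)) = vocab.get_key_value(&candidate) {
                found = Some((piece.as_str(), idx));
                break;
            }
            // Step back one character, staying on a UTF-8 boundary.
            end = word[..end]
                .char_indices()
                .last()
                .map(|(i, _)| i)
                .unwrap_or(start);
        }
        match found {
            Some(p) => pieces.push(p),
            None => return None, // unmatchable remainder
        }
        start = end;
    }
    Some(pieces)
}

fn main() {
    let vocab: HashMap<String, u64> = [("voor", 0u64), ("##komen", 1)]
        .into_iter()
        .map(|(s, i)| (s.to_string(), i))
        .collect();

    // "voorkomen" -> [("voor", 0), ("##komen", 1)]
    assert_eq!(
        split_greedy("voorkomen", &vocab),
        Some(vec![("voor", 0), ("##komen", 1)])
    );
}
```

Returning `None` for an unmatchable remainder loosely mirrors the partial-split case above, where `piece()` yields `None` for the unknown tail of "voorkomen".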
Dependencies: ~2MB, ~21K SLoC