bert_create_pretraining
This crate provides a port of the original create_pretraining_data.py script from the Google BERT repository.
Installation
Cargo
$ cargo install bert_create_pretraining
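If you prefer to build from a local checkout of the crate source instead (the usage example below assumes the built binary lives under ${TARGET_DIR}), a plain Cargo release build works; this is standard Cargo behavior, not anything specific to this crate:
# from inside a checkout of the crate's source directory
$ cargo build --release
# the binary is then available at target/release/bert_create_pretraining
$ ./target/release/bert_create_pretraining --help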
Usage
You can use the bert_create_pretraining binary to create the pretraining data for BERT in parallel. The binary takes the following arguments:
$ find "${DATA_DIR}" -name "*.txt" | xargs -I% -P $NUM_PROC -n 1 \
basename % | xargs -I% -P ${NUM_PROC} -n 1 \
"${TARGET_DIR}/bert_create_pretraining" \
--input-file="${DATA_DIR}/%" \
--output-file="${OUTPUT_DIR}/%.tfrecord" \
--vocab-file="${VOCAB_DIR}/vocab.txt" \
--max-seq-length=512 \
--max-predictions-per-seq=75 \
--masked-lm-prob=0.15 \
--random-seed=12345 \
--dupe-factor=5
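For a single input file, the same options can be passed to the binary directly, without the find/xargs fan-out; the paths below are placeholders, and the flag values simply repeat the ones from the example above:
$ bert_create_pretraining \
  --input-file=./corpus/part-0001.txt \
  --output-file=./output/part-0001.tfrecord \
  --vocab-file=./vocab/vocab.txt \
  --max-seq-length=512 \
  --max-predictions-per-seq=75 \
  --masked-lm-prob=0.15 \
  --random-seed=12345 \
  --dupe-factor=5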
You can check the full list of options with the following command:
$ bert_create_pretraining --help
License
MIT license. See the LICENSE file for the full license text.