4 releases

Uses old Rust 2015

0.1.3 Mar 24, 2017
0.1.2 Mar 20, 2017
0.1.1 Mar 19, 2017
0.1.0 Mar 19, 2017

CC0 license

157 lines

opus_tools: Miscellaneous tools for working with the OPUS parallel corpus

These are small utilities for working with the OPUS parallel corpus, which is normally used for machine translation research. To install:

curl https://sh.rustup.rs -sSf | sh
cargo install opus_tools

opusraw2txt: Extract raw text from a raw, monolingual file

Download the file ca.raw.tar.gz from the right-hand column of the subtitle page and run:

opusraw2txt ca.raw.tar.gz

This will print a huge number of sentences to standard output, encoded as UTF-8, for further processing.
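As a small illustration of "further processing", the sketch below counts the non-empty lines arriving on standard input. It assumes that opusraw2txt writes one sentence per line, which this README does not state explicitly, and the function name count_sentences is made up for this example.

```shell
# Count non-empty lines read from standard input.
# Assumption: each sentence occupies exactly one line of output.
count_sentences() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Hypothetical usage:
#   opusraw2txt ca.raw.tar.gz | count_sentences
```

Because the function only reads a stream, it can sit at the end of any pipeline that produces line-oriented text.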

If you want to process an entire directory of files, you could install GNU parallel and szip, and run:

ls *.raw.tar.gz |
    sed 's/\.raw\.tar\.gz$//' |
    parallel --joblog out.log 'opusraw2txt {}.raw.tar.gz | szip > {}.sz'

This will rapidly extract a huge number of sentences:

Extracted 26782811 sentences from 27605 files.
Extracted 80140630 sentences from 90319 files.
Extracted 79320 sentences from 89 files.
Extracted 112360292 sentences from 124815 files.
Extracted 22917237 sentences from 23492 files.
Extracted 229583 sentences from 188 files.
Extracted 7335505 sentences from 6438 files.
Extracted 38677592 sentences from 44584 files.
Extracted 101502145 sentences from 114150 files.

...and so on.
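Since the pipeline above passes --joblog out.log to GNU parallel, the log can be checked afterwards for jobs that exited with a non-zero status. The following is a rough sketch: failed_jobs is a hypothetical helper name, and it relies on the tab-separated joblog layout in which column 7 is Exitval and column 9 is Command. Note that for a piped command the recorded exit value is normally that of the last stage (szip here), so a failure in opusraw2txt itself may not show up unless the shell runs with pipefail.

```shell
# Print the command of every job in a GNU parallel joblog whose
# exit value is non-zero. The first row of the log is a header.
failed_jobs() {
    # Column 7 is Exitval, column 9 is Command (tab-separated).
    awk -F '\t' 'NR > 1 && $7 != 0 { print $9 }' "$1"
}

# Hypothetical usage:
#   failed_jobs out.log
```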

If you see:

couldn't process OpenSubtitles2016/raw/es/2015/4544966/6155032.xml.gz (skipping):
Error: corrupt deflate stream
Error: couldn't process es.raw.tar.gz
Caused by: corrupt deflate stream

...this means that the file you downloaded was truncated. As far as I can tell, this affects the master copies of es.raw.tar.gz and pt_br.raw.tar.gz.
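One way to spot a truncated download before running the extraction is to test the outer gzip stream with gzip -t. This is only a sketch: find_truncated is a hypothetical helper name, and the check covers the outer .tar.gz layer only, not the individual .xml.gz files stored inside the archive.

```shell
# Print the name of each given archive whose gzip stream fails
# an integrity test (for example, because it ends prematurely).
find_truncated() {
    for archive in "$@"; do
        if ! gzip -t "$archive" 2>/dev/null; then
            printf '%s\n' "$archive"
        fi
    done
}

# Hypothetical usage:
#   find_truncated *.raw.tar.gz
```

An archive that passes this test can still contain damaged inner files, so the error message above remains the authoritative signal.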


Your feedback and contributions are welcome! For more information, see the subtitles-rs project.

