5 releases (breaking)

0.5.0 Feb 6, 2023
0.4.0 Jan 26, 2022
0.3.0 Jan 25, 2022
0.2.0 Jan 24, 2022
0.1.0 Jan 19, 2022


An ORC reader for Rust


This project contains tools for working with Apache ORC files from the Rust programming language.

ORC is an open source data format that lets you represent tables of data efficiently (think CSV, but with types, compression, indexing, etc.).

Please note that this software is not "open source", but the source is available for use and modification by individuals, non-profit organizations, and worker-owned businesses (see the license section below for details).

Example use case

I've recently been working with the Twitter Stream Grab, a data set published by the Archive Team and the Internet Archive that includes billions of tweets and Twitter user profiles collected between 2011 and 2021.

The Twitter Stream Grab is 5.2 terabytes of compressed JSON data, and around 50 terabytes uncompressed. It takes many hundreds of hours of computing time to parse this data, which makes repeated processing impractical for personal projects, or for projects by activist groups with limited resources.

Storing this much data can also be impractical. I personally spent several hundred dollars just getting a copy from the Internet Archive's servers to Berlin, and storing a (compressed) copy in S3 currently costs about $122 per month.

There are many kinds of derived data sets and products you might want to build from data like the Twitter Stream Grab. One example is this collection of several million Twitter user profile snapshots for accounts that were active in spreading false claims about voter fraud in 2020. I'm also running a web service that allows users to look up past screen names for Twitter accounts.

I'm using the ORC format to make building projects like these from this data more practical. The basic idea is that instead of re-processing the entire 50 terabytes of JSON data for each application, you parse it once to extract the user profiles (and other information) into a set of ORC tables.

This intermediate representation is considerably more compact: for example, the original compressed data for December 2020 takes up about 60 gigabytes, but the ORC table I've built for data from that month only takes up about 21 gigabytes. This means storing the ORC representation of the full 10 years of data only costs around $40 per month using a service like S3, but more importantly it means that it's much, much cheaper and easier to process or query the data.
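
The storage figures above can be reproduced with a little arithmetic. The following is a rough sketch, assuming an S3 Standard price of $0.023 per gigabyte-month and 1024 gigabytes per terabyte (neither figure comes from this project), and assuming the December 2020 size ratio holds for the whole data set:

```rust
/// Assumed S3 Standard price in USD per gigabyte-month (an assumption, not from this README).
const S3_USD_PER_GB_MONTH: f64 = 0.023;

/// Monthly storage cost in USD for `terabytes` of data (assuming 1 TB = 1024 GB).
fn monthly_storage_cost(terabytes: f64) -> f64 {
    terabytes * 1024.0 * S3_USD_PER_GB_MONTH
}

/// Estimated ORC size, scaled by the December 2020 ratio
/// (about 21 GB of ORC per 60 GB of compressed JSON).
fn estimated_orc_terabytes(compressed_json_terabytes: f64) -> f64 {
    compressed_json_terabytes * 21.0 / 60.0
}

fn main() {
    let json_tb = 5.2; // size of the compressed Twitter Stream Grab
    let orc_tb = estimated_orc_terabytes(json_tb);
    println!(
        "compressed JSON: {:.2} TB, ${:.0} per month",
        json_tb,
        monthly_storage_cost(json_tb)
    );
    println!(
        "ORC (estimated): {:.2} TB, ${:.0} per month",
        orc_tb,
        monthly_storage_cost(orc_tb)
    );
}
```

Under these assumptions the estimate comes out at roughly $122 per month for the compressed JSON and roughly $43 per month for the ORC tables.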

AWS's Athena, for example, lets you run SQL queries directly against ORC files stored in S3. You can also use Athena to process CSV files in S3, but running any SQL query against compressed CSV files for the entire Twitter Stream Grab would cost at least $10 (since all of the two or three terabytes of compressed data have to be scanned, at Athena's on-demand price of around $5 per terabyte scanned). Querying ORC in Athena generally costs a tiny fraction of that, since the ORC format makes it possible to avoid scanning data that isn't relevant to the query.

Products like Athena are useful for exploring data like the Twitter Stream Grab, and ORC makes this practical in terms of cost and time. It's also possible to process the ORC files directly. For example, instead of spending hundreds of hours of computing time to build a relational database of Twitter user info from the raw JSON data, you can spend a few hours extracting the data from the ORC files.

Why this project?

The ORC format was developed to be a native storage format for Apache Hive, which is built on Hadoop, which is firmly in the Java ecosystem. I personally find Hive to be extremely annoying and painful to work with, and I'd rather not write Java.

There is also a C++ API for ORC, but I have a fair amount of related tooling already written in Rust, and I wanted to learn more about the internals of the ORC spec. So I decided to try to put together this implementation, which only took a couple of days.

Use

The project currently provides one command-line tool that does a couple of things:

$ target/release/orcrs --help
orcrs 0.1.0
Travis Brown <travisrobertbrown@gmail.com>

USAGE:
    orcrs [OPTIONS] <SUBCOMMAND>

OPTIONS:
    -h, --help       Print help information
    -v, --verbose    Level of verbosity
    -V, --version    Print version information

SUBCOMMANDS:
    export    Export the contents of the ORC file
    help      Print this message or the help of the given subcommand(s)
    info      Dump raw info about the ORC file

To list all profiles for verified Twitter accounts from the provided sample data, for example:

$ target/release/orcrs -vvv export --header --columns 0,3,9 examples/ts-10k-2020-09-20.orc | egrep -v "(false|,)$"
id,screen_name,verified
561595762,morinaga_pino,true
1746230882607849472,weareoneEXO,true
29363584,Sandi,true
2067989391190130694,WayV_official,true
36764368,AdamParkhomenko,true
53970806,stephengrovesjr,true
15327404,fox32news,true
1678598579585548288,Mippcivzla,true
158278844,fadlizon,true
79721594,alfredodelmazo,true
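
If you'd rather do the filtering in Rust than with egrep, the same step can be sketched as a small standalone program that reads the exported CSV from standard input. This is only an illustration and not part of orcrs; the function name keep_line is made up for the example. It mirrors the egrep pattern exactly: a line is dropped if it ends in "false" or in a comma (an empty last column).

```rust
use std::io::{self, BufRead, Write};

/// Mirrors `egrep -v "(false|,)$"`: keep a line unless it ends in "false" or ",".
/// (Hypothetical helper for this example, not part of orcrs.)
fn keep_line(line: &str) -> bool {
    !(line.ends_with("false") || line.ends_with(','))
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();

    // Stream line by line so that large exports never have to fit in memory.
    for line in stdin.lock().lines() {
        let line = line?;
        if keep_line(&line) {
            writeln!(out, "{}", line)?;
        }
    }
    Ok(())
}
```

The header line ends in "verified", so it passes through unchanged, just as it does with the egrep version.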

This tool can currently export around 10 million rows of this data from an 886 megabyte ORC file (representing one day from 2020) in about 5 seconds:

$ time target/release/orcrs -vvv export --header --columns 0,3,9 /data/tsg/users/v2/2020-09-20.orc | wc
9705227 9705227 314287998

real    0m5.088s
user    0m6.048s
sys     0m0.349s

This is currently completely unoptimized and could be made at least a little faster.
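
To put the timing above in terms of throughput, here is a quick back-of-the-envelope calculation. It uses the line and byte counts from the wc output and the wall-clock ("real") time. It treats a megabyte as 10^6 bytes, which is an assumption about how the 886 megabyte file size was measured.

```rust
/// Rows processed per second of wall-clock time.
fn rows_per_second(rows: u64, seconds: f64) -> f64 {
    rows as f64 / seconds
}

/// Megabytes (10^6 bytes, an assumption) per second of wall-clock time.
fn megabytes_per_second(bytes: u64, seconds: f64) -> f64 {
    bytes as f64 / 1_000_000.0 / seconds
}

fn main() {
    // Figures taken from the `wc` and `time` output above.
    let lines = 9_705_227_u64; // includes the one header line
    let output_bytes = 314_287_998_u64; // size of the exported CSV
    let input_bytes = 886_000_000_u64; // approximate size of the ORC file
    let real_seconds = 5.088;

    println!("rows/s:      {:.0}", rows_per_second(lines, real_seconds));
    println!("CSV out:     {:.1} MB/s", megabytes_per_second(output_bytes, real_seconds));
    println!("ORC in:      {:.1} MB/s", megabytes_per_second(input_bytes, real_seconds));
}
```

That works out to roughly 1.9 million rows per second on whatever machine produced the timing above.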

Features

This project currently only supports reading ORC files (writing will probably stay out of scope unless I switch to using bindings to the ORC C++ API at some point).

Feature                Status  Notes
Integer types          ✔️
String types           ✔️
Floating point types           Coming soon
Date types
Compound types
Zlib compression       ✔️
Zstandard compression  ✔️
Snappy compression             Probably trivial
Column encryption              Almost certainly permanently out of scope

Also note that right now these tools don't use the indices: you see every row in the file. So far this is fast enough for the things I need to do, but that will probably change in the future.

Known issues

This software is largely untested, undocumented, and unoptimized.

Developing

You'll need to install Rust and Cargo to build the project. Once you've got them, you can check out this repository and run cargo test (to run the tests) and cargo build --release (to build the command-line tool, which will be available as target/release/orcrs).

The Protobuf schemas for the metadata in the ORC file are not distributed with this repository. The build used to download them to $OUT_DIR/proto/ (with the schema version controlled by the commit in build.rs), but I got frustrated after 15 minutes of trying to figure out how to make the Protobuf code generation work properly with the build, so that step is gone. To update the Protobuf schemas, you'll now need to copy the scripts/build.rs file into the project directory (but this shouldn't be necessary very often).

This repository also includes a Java project with some code that I used for generating ORC test data during development.

Previous work

There's a partial implementation of a few pieces of an ORC reader for Rust here. I've borrowed a couple of test cases for the byte run length encoding reader, but my implementation is otherwise unrelated.
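
For readers unfamiliar with the byte run length encoding mentioned above, here is a minimal sketch of a decoder based on the ORC specification. It is not this project's implementation, and the function name decode_byte_rle is made up for the example. Each group starts with a header byte: a value from 0 to 127 means a run of (header + 3) copies of the next byte, and a negative value (as a signed byte) means that many literal bytes follow.

```rust
/// Decodes ORC byte run-length encoded data into raw bytes.
/// Returns `None` if the input ends in the middle of a group.
/// (Illustrative sketch only; not the reader used by this project.)
fn decode_byte_rle(input: &[u8]) -> Option<Vec<u8>> {
    let mut output = Vec::new();
    let mut pos = 0;

    while pos < input.len() {
        // Interpret the header as a signed byte.
        let header = input[pos] as i8;
        pos += 1;

        if header >= 0 {
            // Run: the header is the run length minus 3, followed by the repeated value.
            let value = *input.get(pos)?;
            pos += 1;
            let len = header as usize + 3;
            output.extend(std::iter::repeat(value).take(len));
        } else {
            // Literal list: the header is the negated count of bytes that follow.
            let len = (-(header as i16)) as usize;
            let literals = input.get(pos..pos + len)?;
            output.extend_from_slice(literals);
            pos += len;
        }
    }

    Some(output)
}

fn main() {
    // The two examples given in the ORC specification.
    assert_eq!(decode_byte_rle(&[0x61, 0x00]), Some(vec![0u8; 100]));
    assert_eq!(decode_byte_rle(&[0xfe, 0x44, 0x45]), Some(vec![0x44, 0x45]));
    println!("byte RLE examples decoded as expected");
}
```

The two inputs in main are the examples from the ORC specification: one hundred zero bytes encoded as a run, and the two bytes 0x44 and 0x45 encoded as a literal list.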

Future work

I'll probably continue to add support for ORC format features as I need them. Eventually it'd be nice to have Rust bindings for the C++ API, and I may end up doing that here.

License

This software is published under the Anti-Capitalist Software License (v. 1.4).
