
parquet2json

A command-line tool for streaming Parquet as line-delimited JSON

14 stable releases

2.0.1 Jun 18, 2022
1.6.1 Apr 4, 2022
1.4.0 Mar 21, 2022
1.2.2 Oct 3, 2021
1.0.1 Jul 31, 2021

#857 in Command line utilities


98 downloads per month

MIT license

612 lines



It reads only the required byte ranges from file, HTTP, or S3 locations, and supports offset/limit and column selection.

It uses the Apache Parquet Official Native Rust Implementation which has excellent support for compression formats and complex types.

How to use it

Install from crates.io and execute from the command line, e.g.:

$ cargo install parquet2json
$ parquet2json --help

    parquet2json [OPTIONS] <FILE> <SUBCOMMAND>

    <FILE>    Location of Parquet input file (file path, HTTP or S3 URL)

    -t, --timeout <TIMEOUT>    Request timeout in seconds [default: 60]
    -h, --help                 Print help information
    -V, --version              Print version information

    cat         Outputs data as JSON lines
    schema      Outputs the Thrift schema
    rowcount    Outputs only the total row count
    help        Print this message or the help of the given subcommand(s)
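The schema and rowcount subcommands are useful for inspecting a file before converting it. A sketch, using a hypothetical local file path:

```shell
# Print the Thrift schema of the Parquet file
$ parquet2json ./myfile.pq schema

# Print only the total number of rows
$ parquet2json ./myfile.pq rowcount
```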

S3 Settings

Credentials are provided as per the standard AWS toolchain, i.e. via environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY), the AWS credentials file, or an IAM ECS container/instance profile.

The default AWS region must be set via an environment variable (AWS_DEFAULT_REGION) or in the AWS credentials file, and must match the region the bucket is located in.
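A minimal environment setup might look like the following (the bucket name and credential values are placeholders):

```shell
# Placeholder credentials -- substitute your own
$ export AWS_ACCESS_KEY_ID=AKIA...
$ export AWS_SECRET_ACCESS_KEY=...
# Must match the region the bucket is located in
$ export AWS_DEFAULT_REGION=us-east-1

$ parquet2json s3://my-bucket/data.parquet cat
```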


Use it to stream output to files and other tools such as grep and jq.

Output to a file

$ parquet2json ./myfile.pq cat > output.jsonl

From S3 or HTTP (S3)

$ parquet2json s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet cat
$ parquet2json https://amazon-reviews-pds.s3.us-east-1.amazonaws.com/parquet/product_category%3DGift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet cat

Filter selected columns with jq

$ parquet2json ./myfile.pq cat --columns=url,level | jq 'select(.level==3) | .url'
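Slice the output with offset/limit

Since cat supports offset/limit as described above, a subset of rows can be extracted without reading the whole file. A sketch, assuming the options are named --offset and --limit:

```shell
# Skip the first 100 rows, then output the next 10 as JSON lines
$ parquet2json ./myfile.pq cat --offset=100 --limit=10
```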




~1.5M SLoC