40 releases
Version | Date |
---|---|
0.20.0 | Dec 15, 2024 |
0.19.0 | Sep 18, 2024 |
0.18.1 | Jun 8, 2024 |
0.17.10 | Feb 5, 2024 |
0.1.4 | Mar 29, 2021 |
CSV to Parquet
Convert CSV files to Apache Parquet. This package is part of Arrow CLI tools.
Installation
Download prebuilt binaries
You can get the latest releases from https://github.com/domoritz/arrow-tools/releases.
With Homebrew
brew install domoritz/homebrew-tap/csv2parquet
With Cargo
cargo install csv2parquet
With Cargo B(inary)Install
To avoid recompilation and speed up installation, you can install this tool with cargo binstall:
cargo binstall csv2parquet
Usage
Usage: csv2parquet [OPTIONS] <CSV> <PARQUET>
Arguments:
<CSV>
Input CSV file, stdin if not present
<PARQUET>
Output file
Options:
-s, --schema-file <SCHEMA_FILE>
File with Arrow schema in JSON format
--max-read-records <MAX_READ_RECORDS>
The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
--header <HEADER>
Set whether the CSV file has headers
[default: true]
[possible values: true, false]
--delimiter <DELIMITER>
Set the CSV file's column delimiter as a byte character
--escape <ESCAPE>
Specify an escape character
--quote <QUOTE>
Specify a custom quote character
--comment <COMMENT>
Specify a comment character.
Lines starting with this character will be ignored
--null-regex <NULL_REGEX>
Provide a regex to match null values
-c, --compression <COMPRESSION>
Set the compression
[possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd, lz4-raw]
-e, --encoding <ENCODING>
Sets encoding for any column
[possible values: plain, plain-dictionary, rle, rle-dictionary, delta-binary-packed, delta-length-byte-array, delta-byte-array, byte-stream-split]
--data-page-size-limit <DATA_PAGE_SIZE_LIMIT>
Sets data page size limit
--dictionary-page-size-limit <DICTIONARY_PAGE_SIZE_LIMIT>
Sets dictionary page size limit
--write-batch-size <WRITE_BATCH_SIZE>
Sets write batch size
--max-row-group-size <MAX_ROW_GROUP_SIZE>
Sets max size for a row group
--created-by <CREATED_BY>
Sets "created by" property
--dictionary <DICTIONARY>
Sets flag to enable/disable dictionary encoding for any column
[possible values: true, false]
--statistics <STATISTICS>
Sets flag to enable/disable statistics for any column
[possible values: none, chunk, page]
--max-statistics-size <MAX_STATISTICS_SIZE>
Sets max statistics size for any column. Applicable only if statistics are enabled
-p, --print-schema
Print the schema to stderr
-n, --dry
Only print the schema
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
The --schema-file option uses the same file format as --dry and --print-schema.
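Several of the options above can be combined in a single call. The following is a minimal sketch, assuming a semicolon-delimited input; the file name sample.csv and its contents are made up for illustration, and the conversion step is skipped when csv2parquet is not on the PATH.

```shell
# Create a small semicolon-delimited sample file (hypothetical data).
printf 'id;name\n1;alice\n2;bob\n' > sample.csv

# Infer the schema from at most 100 records, read ';' as the column
# delimiter, and compress the output with zstd.
# The guard skips the conversion if csv2parquet is not installed.
if command -v csv2parquet >/dev/null 2>&1; then
  csv2parquet --delimiter ';' --max-read-records 100 --compression zstd \
    sample.csv sample.parquet
fi
```

Adding -p to the same call should also print the inferred schema to stderr, which is a quick way to check that the delimiter was picked up.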
Examples
Convert a CSV to Parquet
csv2parquet data.csv data.parquet
Convert a CSV with no header to Parquet
csv2parquet --header false <CSV> <PARQUET>
Get the schema from a CSV with header
csv2parquet --header true --dry <CSV> <PARQUET>
Convert a CSV using schema-file to Parquet
Below is an example of the schema-file content:
{
  "fields": [
    {
      "name": "col1",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "col2",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}
Then add the schema-file schema.json in the command:
csv2parquet --header false --schema-file schema.json <CSV> <PARQUET>
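For CSV files with many columns, writing the schema file by hand gets tedious. The sketch below generates one from a list of column names, typing every column as a non-nullable Utf8 field exactly as in the example above. The column names (id, name, city) are hypothetical, and the field layout is copied from the example rather than taken from any formal specification.

```shell
# Hypothetical column names; replace with the columns of your CSV.
columns="id name city"

{
  printf '{\n  "fields": [\n'
  sep=""
  for col in $columns; do
    # %b expands the "\n" in the separator; it is empty for the first field.
    printf '%b    {"name": "%s", "data_type": "Utf8", "nullable": false, "dict_id": 0, "dict_is_ordered": false, "metadata": {}}' "$sep" "$col"
    sep=",\n"
  done
  printf '\n  ],\n  "metadata": {}\n}\n'
} > schema.json
```

The resulting schema.json can then be passed with --schema-file as in the command above. Column names containing spaces or quotes would need proper JSON escaping, which this sketch does not attempt.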
Convert a stream by piping from standard input to standard output
This technique can prevent you from writing large files to disk. For example, here we stream a CSV file from a URL to S3.
curl <FILE_URL> | csv2parquet /dev/stdin /dev/stdout | aws s3 cp - <S3_DESTINATION>
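The same pattern works for local pipelines. As a minimal sketch, assuming a gzip-compressed CSV, the following decompresses on the fly and writes only the Parquet result to disk. The file names and sample rows are made up for illustration, and the conversion is skipped when csv2parquet is not installed.

```shell
# Create a small gzip-compressed CSV (hypothetical data).
printf 'id,name\n1,alice\n2,bob\n' | gzip -c > sample.csv.gz

# Decompress to a pipe and convert without writing the
# uncompressed CSV to disk.
if command -v csv2parquet >/dev/null 2>&1; then
  gzip -dc sample.csv.gz | csv2parquet /dev/stdin /dev/stdout > sample.parquet
fi
```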