#data-fusion #query #data-query #cli

bin+lib dfkit

A command-line toolkit for querying and transforming CSV, JSON, Parquet, and Avro data

2 unstable releases

Uses new Rust 2024

0.2.0 Apr 25, 2025
0.1.0 Apr 17, 2025

#1431 in Command line utilities


301 downloads per month

MIT license

36KB
536 lines


dfkit

dfkit is a suite of command-line tools for viewing, querying, and manipulating CSV, Parquet, JSON, and Avro files. It is written in Rust and powered by Apache Arrow and Apache DataFusion. The project is currently a work in progress.

Highlights

Here's a high-level overview of some of the features in dfkit:

  • Supports viewing and manipulating both local files and files from remote URLs
  • Works with CSV, JSON, Parquet, and Avro files
  • Fast columnar query execution powered by Apache Arrow and DataFusion
  • Transform data with SQL or with built-in subcommands such as sort, dedup, and reverse
  • Written entirely in Rust!

Commands

dfkit 0.2.0

USAGE:
    dfkit <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    cat         Concatenate multiple files or all files in a directory
    convert     Convert file format (CSV, Parquet, JSON)
    count       Count the number of rows in a file
    dedup       Remove duplicate rows
    describe    Show summary statistics for a file
    help        Prints this message or the help of the given subcommand(s)
    query       Run a SQL query on a file
    reverse     Reverse the order of rows
    schema      Show schema of a file
    sort        Sort rows by one or more columns
    split       Split a file into N chunks
    view        View the contents of a file

Installation

dfkit can be installed via cargo (requires Rust):

cargo install dfkit

Examples
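
The examples in this section run against a small file named sample.csv, whose contents are not shown. The following is a minimal sketch for recreating it, with the header and rows inferred from the output tables in this section (the exact original file is an assumption):

```shell
# Recreate the sample file assumed by the examples in this section.
# Column names and values are inferred from the printed tables,
# not copied from the project's repository.
cat > sample.csv <<'EOF'
name,age
Joe,34
Matt,24
Emily,65
EOF

# Quick sanity check: one header line plus three data rows.
wc -l < sample.csv
```

With this file in the current directory, the commands in this section should be reproducible as written.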

view prints the contents of a file. It takes the filename and an optional limit argument.

dfkit view sample.csv

+-------+-----+
| name  | age |
+-------+-----+
| Joe   | 34  |
| Matt  | 24  |
| Emily | 65  |
+-------+-----+

query runs a SQL query against a file. In the example below the file is referenced as table t. An optional output argument can also be supplied to save the results.

dfkit query sample.csv --sql "SELECT * FROM t WHERE age < 50"

+------+-----+
| name | age |
+------+-----+
| Joe  | 34  |
| Matt | 24  |
+------+-----+
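
Since the query is plain SQL, aggregates should work the same way as filters. The sketch below reuses only the --sql flag and the table name t from the example above. The column aliases are illustrative, and the output is not shown because its exact formatting has not been verified.

```shell
# Assumes sample.csv is in the current directory.
# The aliases "people" and "avg_age" are hypothetical names chosen for this sketch.
sql='SELECT COUNT(*) AS people, AVG(age) AS avg_age FROM t WHERE age < 50'

if command -v dfkit >/dev/null 2>&1; then
  dfkit query sample.csv --sql "$sql"
else
  # dfkit is not installed; show the command instead of running it.
  printf 'dfkit not found; would run: dfkit query sample.csv --sql "%s"\n' "$sql"
fi
```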

Show the file schema.

dfkit schema sample.csv

+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| name        | Utf8      | YES         |
| age         | Int64     | YES         |
+-------------+-----------+-------------+

Show summary statistics of a file with describe.

dfkit describe sample.csv

+------------+-------+-------------------+
| describe   | name  | age               |
+------------+-------+-------------------+
| count      | 3     | 3.0               |
| null_count | 0     | 0.0               |
| mean       | null  | 41.0              |
| std        | null  | 21.37755832643195 |
| min        | Emily | 24.0              |
| max        | Matt  | 65.0              |
| median     | null  | 34.0              |
+------------+-------+-------------------+
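
The std value in this table is consistent with the sample standard deviation (n - 1 denominator) of the three ages: the mean is (34 + 24 + 65) / 3 = 41, the squared deviations sum to 49 + 289 + 576 = 914, and sqrt(914 / 2) = sqrt(457) ≈ 21.3776. The sketch below repeats that arithmetic with awk as an independent cross-check. It is not how dfkit computes the statistic.

```shell
# Cross-check of the describe output for the age column (values 34, 24, 65).
# Uses the sample (n - 1) variance formula; independent of dfkit.
stats=$(printf '34\n24\n65\n' | LC_ALL=C awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    var = (sumsq - n * mean * mean) / (n - 1)
    printf "mean=%.1f std=%.6f", mean, sqrt(var)
  }')
echo "$stats"   # prints: mean=41.0 std=21.377558
```

The median of 34.0 is the middle value of the sorted ages 24, 34, 65. The min and max for the name column are the lexicographically first and last strings.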

Reverse the order of rows. Save the output with --output.

dfkit reverse sample.csv

+-------+-----+
| name  | age |
+-------+-----+
| Emily | 65  |
| Matt  | 24  |
| Joe   | 34  |
+-------+-----+

Sort rows and optionally save the output with --output. You can specify multiple columns as a comma-separated string.

dfkit sort sample.csv --columns "age"

+-------+-----+
| name  | age |
+-------+-----+
| Matt  | 24  |
| Joe   | 34  |
| Emily | 65  |
+-------+-----+
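
A multi-column sort might look like the sketch below, which passes two column names as a single comma-separated string as described above. That the first column acts as the primary sort key is an assumption, not something the source confirms.

```shell
# Assumes sample.csv is in the current directory.
# Assumption: columns are applied in the order given (age first, then name).
cols="age,name"

if command -v dfkit >/dev/null 2>&1; then
  dfkit sort sample.csv --columns "$cols"
else
  # dfkit is not installed; show the command instead of running it.
  printf 'dfkit not found; would run: dfkit sort sample.csv --columns "%s"\n' "$cols"
fi
```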

Dependencies

~79MB
~1.5M SLoC