13 releases
0.3.2 | May 25, 2024 |
---|---|
0.3.0 | Mar 4, 2024 |
0.2.1 | Oct 12, 2023 |
0.2.0 | Aug 14, 2023 |
0.1.5 | May 29, 2023 |
#1095 in Parser implementations
285KB
3K
SLoC
dply is a command line tool for viewing, querying, and writing csv and parquet files, inspired by dplyr.
Usage overview
A dply pipeline consists of a number of functions to read, transform, or write Parquet or CSV files.
Conversions between CSV, NdJSON, and Parquet files
The functions csv
, json
and parquet
read and write data for their respective
formats. The following two steps pipeline converts a Parquet file to NdJSON:
$ dply -c 'parquet("nyctaxi.parquet") | json("nyctaxi.json")'
We can use a select
step if we want to convert a subset of the columns:
$ dply -c 'parquet("nyctaxi.parquet") |
select(ends_with("time"), payment_type) |
json("nyctaxi.json")'
$ head -2 nyctaxi.json| jq
{
"payment_type": "Credit card",
"tpep_dropoff_datetime": "2022-11-22T19:45:53",
"tpep_pickup_datetime": "2022-11-22T19:27:01"
}
{
"payment_type": "Cash",
"tpep_dropoff_datetime": "2022-11-27T16:50:06",
"tpep_pickup_datetime": "2022-11-27T16:43:26"
}
Extracting nested fields from nested NdJSON
To extract a nested field in a NdJSON file we can use the field
function in a
mutate
step. The following example extracts the sha
from the list of
commits
in the payload
object:
$ dply -c 'json("./tests/data/github.json") |
mutate(commits = field(payload, commits)) |
unnest(commits) |
mutate(sha = field(commits, sha)) |
select(sha) |
show()'
shape: (4, 1)
┌──────────────────────────────────────────┐
│ sha │
│ --- │
│ str │
╞══════════════════════════════════════════╡
│ a02be18dc2a0faa0faec14f50c8b190ca0b50034 │
│ ac97a4ab3a4d86f61a6ba167c06cd8813b470867 │
│ null │
│ e4b233f1323a4b4e4461ed1aad31d20a7fbf0db4 │
└──────────────────────────────────────────┘
Complex NdJSON files can be converted to Parquet for faster query processing:
$ dply -c 'json("github.json") | parquet("github.parquet")'
Grouping, sorting columns, and saving results to a file
The following pipeline reads a Parquet file[^1], group rows by payment_type
,
computes the minimum, mean, and maximum fare for each payment type, saves the
result to fares.csv
CSV file, and shows the result:
$ dply -c 'parquet("nyctaxi.parquet") |
group_by(payment_type) |
summarize(
min_price = min(total_amount),
mean_price = mean(total_amount),
max_price = max(total_amount)
) |
arrange(payment_type) |
csv("fares.csv") |
show()'
shape: (5, 4)
┌──────────────┬───────────┬────────────┬───────────┐
│ payment_type ┆ min_price ┆ mean_price ┆ max_price │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞══════════════╪═══════════╪════════════╪═══════════╡
│ Cash ┆ -61.85 ┆ 18.07 ┆ 86.55 │
│ Credit card ┆ 4.56 ┆ 22.969491 ┆ 324.72 │
│ Dispute ┆ -55.6 ┆ -0.145161 ┆ 54.05 │
│ No charge ┆ -16.3 ┆ 0.086667 ┆ 19.8 │
│ Unknown ┆ 9.96 ┆ 28.893333 ┆ 85.02 │
└──────────────┴───────────┴────────────┴───────────┘
Running dply without any parameter starts the interactive client:
[^1]: The file nyctaxi.parquet
in the tests/data folder is a
250 rows parquet file sampled from the NYC trip record data.
Supported functions
dply
supports the following functions:
- arrange Sorts rows by column values
- count Counts columns unique values
- config Configure display format options
- csv Reads or writes a dataframe in CSV format
- distinct Retains unique rows
- filter Filters rows that satisfy given predicates
- glimpse Shows a dataframe overview
- group by and summarize Performs grouped aggregations
- head Shows the first few dataframe rows in table format
- joins Left, inner, outer and cross joins
- json Reads or writes a dataframe in JSON format
- mutate Creates or mutate columns
- parquet Reads or writes a dataframe in Parquet format
- relocate Moves columns positions
- rename Renames columns
- select Selects columns
- show Shows all dataframe rows
- unnest Expands list columns into rows
more examples can be found in the tests folder.
Installation
Binaries generated by the release Github action for Linux, macOS (x86), and Windows are available in the releases page.
You can also install dply
using Cargo:
cargo install dply
or by building it from this repository:
git clone https://github.com/vincev/dply-rs
cd dply-rs
cargo install --path .
Dependencies
~48–81MB
~1.5M SLoC