9 breaking releases

Uses new Rust 2021

0.10.0 Aug 2, 2022
0.9.0 May 31, 2022
0.8.0 May 18, 2022
0.5.0 Mar 1, 2022

#159 in Database implementations

Download history 292/week @ 2022-04-22 710/week @ 2022-04-29 716/week @ 2022-05-06 527/week @ 2022-05-13 808/week @ 2022-05-20 797/week @ 2022-05-27 608/week @ 2022-06-03 702/week @ 2022-06-10 436/week @ 2022-06-17 439/week @ 2022-06-24 483/week @ 2022-07-01 216/week @ 2022-07-08 410/week @ 2022-07-15 520/week @ 2022-07-22 802/week @ 2022-07-29 324/week @ 2022-08-05

2,074 downloads per month

Unlicense

37KB
768 lines

sqlite2parquet

sqlite2parquet lets you generate parquet files from a sqlite database. It's available as a library and as a CLI application; they're feature-equivalent, so use whichever is most convenient. See here for some notes on archiving sqlite databases as parquet files. The code is in the public domain.

The CLI application

Install it like this:

$ cargo install sqlite2parquet

The library

If you use the library directly, you'll probably want to remove the CLI-related dependencies, so set default-features = false in your Cargo.toml.


lib.rs:

Generate parquet files from sqlite databases

This library provides two things:

  1. A flexible way to generate a parquet file from a bunch of SQL statements
  2. A way to generate the neccessary config for writing a whole table to a parquet file

This package also contains a binary crate which lets you easily compress a whole sqlite DB into a bunch of parquet files. This typically gets a better compression ratio than xz, and is much faster. See ARCHIVE for a comparison.

The easy way

If you just want to dump the whole table as-is into a parquet file, you can use the handy [infer_schema()]. It tries to guess the best encoding based on the sqlite schema.

# let conn = rusqlite::Connection::open_in_memory().unwrap();
# conn.execute("CREATE TABLE my_table (category TEXT, timestamp DATETIME)", []);
let cols = sqlite2parquet::infer_schema(&conn, "my_table")
.unwrap()
.collect::<anyhow::Result<Vec<_>>>()
.unwrap();
let out_path = std::path::PathBuf::from("my_table.parquet");
sqlite2parquet::write_table(&conn, "my_table", &cols, &out_path, 1_000_000).unwrap();

The flexible way

Explicitly define the columns that will go in the parquet file. One thing to be careful about: the SELECT queries must all return the same number of rows. If not, you'll get a runtime error.

# let conn = rusqlite::Connection::open_in_memory().unwrap();
# conn.execute("CREATE TABLE my_table (category TEXT, timestamp DATETIME)", []);
use sqlite2parquet::*;
let cols = vec![
Column {
name: "category".to_string(),
required: true,
physical_type: PhysicalType::ByteArray,
logical_type: Some(LogicalType::String),
encoding: None,
dictionary: true,
query: "SELECT category FROM my_table GROUP BY category ORDER BY MIN(timestamp)".to_string(),
},
Column {
name: "first_timestamp".to_string(),
required: true,
physical_type: PhysicalType::Int64,
logical_type: Some(LogicalType::Timestamp(TimeType { utc: true, unit: TimeUnit::Nanos })),
encoding: Some(Encoding::DeltaBinaryPacked),
dictionary: false,
query: "SELECT MIN(timestamp) FROM my_table GROUP BY category ORDER BY MIN(timestamp)".to_string(),
},
];

let out_path = std::path::PathBuf::from("category_start_times.parquet");
write_table(&conn, "category_start_times", &cols, &out_path, 1_000_000).unwrap();

Dependencies

~34MB
~646K SLoC