#sqlite #parquet #generate

bin+lib sqlite2parquet

Generate parquet files from sqlite databases

12 releases (breaking)

0.10.2 Sep 22, 2023
0.10.1 Nov 23, 2022
0.10.0 Aug 2, 2022
0.9.0 May 31, 2022
0.5.0 Mar 1, 2022

#858 in Database interfaces

Unlicense

38KB
770 lines

sqlite2parquet

sqlite2parquet lets you generate parquet files from a sqlite database. It's available as a library and as a CLI application; they're feature-equivalent, so use whichever is most convenient. See here for some notes on archiving sqlite databases as parquet files. The code is in the public domain.

The CLI application

Install it like this:

$ cargo install sqlite2parquet

The library

If you use the library directly, you'll probably want to remove the CLI-related dependencies, so set default-features = false in your Cargo.toml.


lib.rs:

Generate parquet files from sqlite databases

This library provides two things:

  1. A flexible way to generate a parquet file from a bunch of SQL statements
  2. A way to generate the neccessary config for writing a whole table to a parquet file

This package also contains a binary crate which lets you easily compress a whole sqlite DB into a bunch of parquet files. This typically gets a better compression ratio than xz, and is much faster. See ARCHIVE for a comparison.

The easy way

If you just want to dump the whole table as-is into a parquet file, you can use the handy [infer_schema()]. It tries to guess the best encoding based on the sqlite schema.

let cols = sqlite2parquet::infer_schema(&conn, "my_table")
.unwrap()
.collect::<anyhow::Result<Vec<_>>>()
.unwrap();
let out_path = std::fs::File::create("my_table.parquet").unwrap();
sqlite2parquet::write_table(&conn, "my_table", &cols, &out_path, 1_000_000).unwrap();

The flexible way

Explicitly define the columns that will go in the parquet file. One thing to be careful about: the SELECT queries must all return the same number of rows. If not, you'll get a runtime error.

use sqlite2parquet::*;
let cols = vec![
Column {
name: "category".to_string(),
required: true,
physical_type: PhysicalType::ByteArray,
logical_type: Some(LogicalType::String),
encoding: None,
dictionary: true,
query: "SELECT category FROM my_table GROUP BY category ORDER BY MIN(timestamp)".to_string(),
},
Column {
name: "first_timestamp".to_string(),
required: true,
physical_type: PhysicalType::Int64,
logical_type: Some(LogicalType::Timestamp(TimeType { utc: true, unit: TimeUnit::Nanos })),
encoding: Some(Encoding::DeltaBinaryPacked),
dictionary: false,
query: "SELECT MIN(timestamp) FROM my_table GROUP BY category ORDER BY MIN(timestamp)".to_string(),
},
];

let out_path = std::fs::File::create("category_start_times.parquet").unwrap();
write_table(&conn, "category_start_times", &cols, &out_path, 1_000_000).unwrap();

Dependencies

~38MB
~626K SLoC