12 releases (breaking)
0.10.2 | Sep 22, 2023 |
---|---|
0.10.1 | Nov 23, 2022 |
0.10.0 | Aug 2, 2022 |
0.9.0 | May 31, 2022 |
0.5.0 | Mar 1, 2022 |
#858 in Database interfaces
38KB
770 lines
sqlite2parquet
sqlite2parquet lets you generate parquet files from a sqlite database. It's available as a library and as a CLI application; they're feature-equivalent, so use whichever is most convenient. See here for some notes on archiving sqlite databases as parquet files. The code is in the public domain.
The CLI application
Install it like this:
$ cargo install sqlite2parquet
The library
If you use the library directly, you'll probably want to remove the CLI-related
dependencies, so set default-features = false
in your Cargo.toml.
lib.rs
:
Generate parquet files from sqlite databases
This library provides two things:
- A flexible way to generate a parquet file from a bunch of SQL statements
- A way to generate the neccessary config for writing a whole table to a parquet file
This package also contains a binary crate which lets you easily compress a whole sqlite DB into a bunch of parquet files. This typically gets a better compression ratio than xz, and is much faster. See ARCHIVE for a comparison.
The easy way
If you just want to dump the whole table as-is into a parquet file, you can
use the handy [infer_schema()
]. It tries to guess the best encoding based
on the sqlite schema.
let cols = sqlite2parquet::infer_schema(&conn, "my_table")
.unwrap()
.collect::<anyhow::Result<Vec<_>>>()
.unwrap();
let out_path = std::fs::File::create("my_table.parquet").unwrap();
sqlite2parquet::write_table(&conn, "my_table", &cols, &out_path, 1_000_000).unwrap();
The flexible way
Explicitly define the columns that will go in the parquet file. One thing
to be careful about: the SELECT
queries must all return the same number
of rows. If not, you'll get a runtime error.
use sqlite2parquet::*;
let cols = vec![
Column {
name: "category".to_string(),
required: true,
physical_type: PhysicalType::ByteArray,
logical_type: Some(LogicalType::String),
encoding: None,
dictionary: true,
query: "SELECT category FROM my_table GROUP BY category ORDER BY MIN(timestamp)".to_string(),
},
Column {
name: "first_timestamp".to_string(),
required: true,
physical_type: PhysicalType::Int64,
logical_type: Some(LogicalType::Timestamp(TimeType { utc: true, unit: TimeUnit::Nanos })),
encoding: Some(Encoding::DeltaBinaryPacked),
dictionary: false,
query: "SELECT MIN(timestamp) FROM my_table GROUP BY category ORDER BY MIN(timestamp)".to_string(),
},
];
let out_path = std::fs::File::create("category_start_times.parquet").unwrap();
write_table(&conn, "category_start_times", &cols, &out_path, 1_000_000).unwrap();
Dependencies
~38MB
~626K SLoC