spreadsheet-to-json
Convert spreadsheets and CSV files to JSON
This library crate provides the core functions to convert common spreadsheet and CSV files into JSON or JSONL (JSON Lines), either directly or asynchronously.
It relies on the Calamine and CSV crates to process files, the tokio crate for asynchronous operations, and the serde and serde_json crates for serialization.
It supports the following formats:
- Excel 2007+ Workbook (.xlsx)
- Excel 2007+ Binary (.xlsb)
- Excel 97-2004 Legacy (.xls)
- OpenDocument Spreadsheets (.ods) compatible with LibreOffice
- CSV: comma separated values (.csv)
- TSV: tab-separated values (.tsv)
Features
- Blazingly fast: It can import 10,000 rows in less than 0.4 seconds.
- Can export to standard JSON or to JSON Lines when writing large files
- Formula cells are read as calculated values
- Can identify and convert both Excel's 1900 datetime format and the standard ISO format used in OpenDocument Spreadsheets
- Can identify numeric fields formatted as text and convert them to integers or floats
- Can identify truthy text or number cells and convert them to booleans (see the sketch after this list)
- Can save large files asynchronously
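To make the numeric-text and truthy-value conversions concrete, the sketch below shows the kind of coercion described, written as a stand-alone function using serde_json. It is only an illustration of the idea, not this crate's code, and the exact rules the crate applies may differ.

```rust
use serde_json::{json, Value};

/// Illustration only (not this crate's code): coerce a raw text cell into a
/// typed JSON value, in the spirit of the numeric and truthy conversions above.
fn coerce_cell(raw: &str) -> Value {
    let trimmed = raw.trim();
    // Truthy / falsy text becomes a boolean
    match trimmed.to_lowercase().as_str() {
        "true" | "yes" => return json!(true),
        "false" | "no" => return json!(false),
        _ => {}
    }
    // Numeric text becomes an integer or a float
    if let Ok(i) = trimmed.parse::<i64>() {
        return json!(i);
    }
    if let Ok(f) = trimmed.parse::<f64>() {
        return json!(f);
    }
    // Everything else stays a string
    json!(trimmed)
}

// coerce_cell("42")  -> 42
// coerce_cell("3.5") -> 3.5
// coerce_cell("Yes") -> true
// coerce_cell("N/A") -> "N/A"
```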
To do
Full explanation of options to come.
Core Options
Options can be set by instantiating `OptionSet::new("path/to/spreadsheet.xlsx")` with chained setter methods:
- `.max_row_count(max: u32)`: overrides the default maximum row count of 10,000. Use this in direct mode or to return only the first n rows.
- `.header_row(index: u8)`: overrides the default header row index of 0; useful for spreadsheets with a title and notes above the header row.
- `.omit_header(index: u8)`: omits the header altogether and assigns default A1-style keys or column numbers.
- `.sheet_index(index: u32)`: zero-based sheet index. Any value over zero will override the specified sheet name.
- `.sheet_name(name: &str)`: case-insensitive sheet name. It will match the first sheet with this name after stripping spaces and punctuation.
- `.read_mode_async()`: defers processing of rows to the callback passed as the second argument of `render_spreadsheet_async()`.
- `.json_lines()`: output will be rendered as one JSON object per row.
- `.field_name_mode(system: &str, override_header: bool)`: use either A1 or C for the default column key notation where headers are either unavailable or suppressed via the `override_header` flag.
- `.override_headers(keys: &[&str])`: overrides matched or automatic column keys. More advanced column options will be detailed soon.
Simple example:
```rust
let opts = OptionSet::new("path/to/spreadsheet.ods")
    .sheet_index(3)
    .read_mode_async()
    .override_headers(&["name", "height", "width", "weight", "hue", "price"]);
```
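For completeness, here is a second, purely hypothetical combination covering the header and key options listed above; the file path is a placeholder and the setters are assumed to behave exactly as described in the list.

```rust
// Hypothetical combination of the header and key options listed above
let opts = OptionSet::new("path/to/report.xlsx") // placeholder path
    .header_row(2)               // the header sits on the third row
    .max_row_count(500)          // read at most the first 500 data rows
    .field_name_mode("C", false) // C-style numbered column keys when headers are unavailable
    .json_lines();               // render one JSON object per row
```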
Core functions
- `process_spreadsheet_direct(opts: &OptionSet)`: may be called in a synchronous context where you need to process results immediately.
Alpha Version History
This crate is still alpha and likely to undergo breaking changes, as it is part of a larger data import project. I do not expect a stable version before mid-January, when it has been battle-tested.
- 0.1.2: Core public functions with Result return types now use a GenericError error type.
- 0.1.3: Refined the A1 and C01 column name styles and added result output as vectors of lines for interoperability with CLI utilities and debugging.
- 0.1.4: Added support for the Excel Binary format (.xlsb).
- 0.1.5: Added two new core functions: `process_spreadsheet_direct()` for direct row processing in a synchronous context and `process_spreadsheet_async()` for an asynchronous context with a callback. If you need to process a spreadsheet directly in an async function, use `render_spreadsheet_direct()` instead.
Examples
The main implementation is my unpublished Spreadsheet to JSON CLI crate.
Simple immediate JSON conversion
This function processes the spreadsheet file immediately.
```rust
use spreadsheet_to_json::*;

fn main() -> Result<(), GenericError> {
    let opts = OptionSet::new("path/to/spreadsheet.xlsx").sheet_index(1);
    let result = process_spreadsheet_direct(&opts);
    let json_value = match result {
        Err(msg_code) => json!({ "error": true, "key": msg_code.to_string() }),
        Ok(data_set) => data_set.to_json(), // full result set
    };
    println!("{}", json_value);
    Ok(())
}
```
Asynchronous parsing, saving to a database
This must be called in an async function; each row is saved via a separate callback function.
```rust
use spreadsheet_to_json::*;
use spreadsheet_to_json::tokio;

#[tokio::main]
async fn main() -> Result<(), GenericError> {
    let opts = OptionSet::new("path/to/spreadsheet.xlsx").read_mode_async();

    let dataset_id = db_dataset_id(&opts);
    let callback = move |row: IndexMap<String, Value>| -> Result<(), GenericError> {
        save_data_row(row, &connection, dataset_id)
    };

    let result = process_spreadsheet_async(&opts, callback, Some(dataset_id)).await;
    let result_set = match result {
        Err(msg_code) => json!({ "error": true, "key": msg_code.to_string() }),
        Ok(data_set) => data_set.to_json(), // full result set
    };
    println!("{}", result_set);
    Ok(())
}

// Save function called in the closure for each row, with a database connection
// and data_id from the outer scope
fn save_data_row(row: IndexMap<String, Value>, connection: &PgConnection, data_id: u32) -> Result<(), GenericError> {
    let mut row_struct = CustomTableStruct {
        id: None,
        data_id, // or whatever ID-setting logic you have
        field1: None,
        field2: None,
        // ... set other fields to None or default values
    };
    for (key, value) in row {
        match key.as_str() {
            "field1" => {
                if let Value::String(s) = value {
                    row_struct.field1 = Some(s);
                }
            },
            "field2" => {
                if let Value::Number(n) = value {
                    if let Some(i) = n.as_i64() {
                        row_struct.field2 = Some(i as i32);
                    }
                }
            },
            // Add other field mappings here
            _ => {} // Ignore unknown keys
        }
    }
    diesel::insert_into(data_rows::table)
        .values(&row_struct)
        .execute(connection)
        .map_err(|_| GenericError("database_error"))?;
    Ok(())
}
```