4 releases (2 breaking)
Uses new Rust 2024
Version | Released |
---|---|
0.3.0 (new) | May 1, 2025 |
0.2.1 | May 1, 2025 |
0.2.0 | May 1, 2025 |
0.1.0 | May 1, 2025 |
csv_polars_cleaner
A robust Rust library for extracting and cleaning tabular data from messy CSV files using the Polars DataFrame engine.
Objective
- To reliably parse CSV files that may contain metadata, comments, empty lines, or other non-tabular content before or after the actual data table.
- To automatically detect the start and end of the true data region using statistical heuristics (mode of column counts).
Functionality
- Skips metadata, comments, and blank lines to find the real table header and data.
- Uses the most frequent column count to infer the bounds of the data block.
- Returns a Polars DataFrame for further analysis or processing.
- Provides clear error messages for malformed or unsupported files.
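
The column-count heuristic described above can be sketched as follows. This is a minimal illustration, not the crate's actual implementation: `find_data_region` is a hypothetical helper, it splits naively on the delimiter without honouring quoted fields, and it assumes the data rows outnumber the lines that share any other field count.

```rust
use std::collections::HashMap;

/// Returns the half-open line range `[start, end)` of the inferred data
/// region, or `None` if the input has no non-blank lines.
///
/// Hypothetical helper: splits naively on `delimiter` and ignores quoting.
fn find_data_region(text: &str, delimiter: char) -> Option<(usize, usize)> {
    // Field count per line; blank lines are recorded as 0 fields.
    let counts: Vec<usize> = text
        .lines()
        .map(|line| {
            if line.trim().is_empty() {
                0
            } else {
                line.split(delimiter).count()
            }
        })
        .collect();

    // Tally how often each non-zero field count occurs.
    let mut freq: HashMap<usize, usize> = HashMap::new();
    for &c in counts.iter().filter(|&&c| c > 0) {
        *freq.entry(c).or_insert(0) += 1;
    }

    // The mode is the most frequent count; ties go to the larger count.
    let mode = freq
        .into_iter()
        .max_by_key(|&(count, n)| (n, count))
        .map(|(count, _)| count)?;

    // The region starts at the first line with the modal count and covers
    // the contiguous run of lines that share it.
    let start = counts.iter().position(|&c| c == mode)?;
    let len = counts[start..].iter().take_while(|&&c| c == mode).count();
    Some((start, start + len))
}
```

In this sketch the first line of the returned range would be treated as the header and the remaining lines as data rows.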
Limitations
- Only supports single-table CSVs (not multi-table or hierarchical data).
- Assumes the delimiter is consistent within the data region (default: `,`).
- Does not attempt to infer or repair rows with inconsistent column counts within the main data region.
- Metadata and comments must not contain the delimiter in a way that mimics a table row.
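
To see why the last limitation matters, consider a naive field count. The sketch below uses hypothetical helpers (`field_count`, `lines_matching`) that are not part of the crate; it only illustrates that a metadata line containing the delimiter can have the same field count as a real table row, which makes a count-based heuristic unable to tell the two apart.

```rust
/// Naive field count: splits on `delimiter` without honouring quotes.
fn field_count(line: &str, delimiter: char) -> usize {
    line.split(delimiter).count()
}

/// Returns the indices of the lines whose field count equals `expected`.
fn lines_matching(text: &str, delimiter: char, expected: usize) -> Vec<usize> {
    text.lines()
        .enumerate()
        .filter(|(_, line)| field_count(line, delimiter) == expected)
        .map(|(i, _)| i)
        .collect()
}
```

For an input whose first line is `Exported by: Smith, J., Inc` and whose table has three columns, `lines_matching(text, ',', 3)` would include that first line alongside the header and data rows.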
Usage
Add to your Cargo.toml:

    [dependencies]
    csv_polars_cleaner = "<version>"

Example usage:

    use csv_polars_cleaner::parse_folder;

    fn main() {
        let folder = "path/to/your/folder";

        // Parse the CSV files found in the folder, using ',' as the delimiter.
        match parse_folder(folder, b',') {
            Ok(dfs) => {
                println!("Parsed {} files", dfs.len());
                // Print the header names and row count of each DataFrame.
                for (i, df) in dfs.iter().enumerate() {
                    println!("\nFile {}:", i + 1);
                    println!("Headers: {:?}", df.get_column_names());
                    println!("Number of rows: {}", df.height());
                }
            }
            Err(e) => {
                eprintln!("Failed to parse folder: {:?}", e);
            }
        }
    }

Command-line Usage
To get started, clone this repository:

    git clone https://github.com/sanjaysingh13/csv_polars_cleaner.git
    cd csv_polars_cleaner

This crate includes a simple CLI for quickly checking CSV parsing on your system:

    cargo run -- path/to/your/folder

This will recursively parse all .csv files in the specified folder and its subfolders.
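
A recursive search of this kind can be written with the standard library alone. The sketch below is an assumption about how such a walk might look, not the crate's own code: `collect_csv_files` is a hypothetical helper, and matching the extension case-insensitively is a choice made here, not something the crate documents.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collects every file with a `.csv` extension under `dir`.
/// Hypothetical helper, not part of the crate's API.
fn collect_csv_files(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into subfolders.
            collect_csv_files(&path, out)?;
        } else if path
            .extension()
            .map_or(false, |ext| ext.eq_ignore_ascii_case("csv"))
        {
            // Assumption: `.CSV` and `.csv` are treated alike.
            out.push(path);
        }
    }
    Ok(())
}
```

Each collected path could then be parsed individually, so that one unreadable file does not prevent the rest from being processed.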
For more details, see the source code.