bin+lib csv_polars_cleaner

A robust Rust library for extracting and cleaning tabular data from messy CSV files using Polars

4 releases (2 breaking)

Uses new Rust 2024

new 0.3.0 May 1, 2025
0.2.1 May 1, 2025
0.2.0 May 1, 2025
0.1.0 May 1, 2025

MIT/Apache

21KB
332 lines

csv_polars_cleaner

A robust Rust library for extracting and cleaning tabular data from messy CSV files using the Polars DataFrame engine.

Objective

  • To reliably parse CSV files that may contain metadata, comments, empty lines, or other non-tabular content before or after the actual data table.
  • To automatically detect the start and end of the true data region using statistical heuristics (mode of column counts).

Functionality

  • Skips metadata, comments, and blank lines to find the real table header and data.
  • Uses the most frequent column count to infer the bounds of the data block.
  • Returns a Polars DataFrame for further analysis or processing.
  • Provides clear error messages for malformed or unsupported files.
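The mode-of-column-counts heuristic described above can be sketched in plain Rust. This is an illustration only, not the crate's actual implementation: the function names `field_count` and `find_data_region` are hypothetical, fields are split naively (quoted delimiters are not handled), and single-field lines are assumed to be metadata and excluded from the mode.

```rust
use std::collections::HashMap;

/// Count fields on a line by splitting on the delimiter.
/// Naive on purpose: quoted fields containing the delimiter are not handled.
fn field_count(line: &str, delimiter: char) -> usize {
    line.split(delimiter).count()
}

/// Return the half-open line range `[start, end)` of the longest contiguous
/// run of lines whose field count equals the most frequent field count.
/// Hypothetical helper; assumes single-field lines are never table rows.
fn find_data_region(lines: &[&str], delimiter: char) -> Option<(usize, usize)> {
    // Blank lines get a count of 0 so they can never match the mode.
    let counts: Vec<usize> = lines
        .iter()
        .map(|l| if l.trim().is_empty() { 0 } else { field_count(l, delimiter) })
        .collect();

    // Tally how often each multi-column count occurs.
    let mut freq: HashMap<usize, usize> = HashMap::new();
    for &c in counts.iter().filter(|&&c| c > 1) {
        *freq.entry(c).or_insert(0) += 1;
    }

    // The mode; ties are broken toward the larger column count.
    let mode = freq
        .into_iter()
        .max_by_key(|&(cols, n)| (n, cols))
        .map(|(cols, _)| cols)?;

    // Scan for the longest contiguous run of lines matching the mode.
    let mut best = (0usize, 0usize);
    let mut start = 0usize;
    let mut in_run = false;
    for i in 0..=counts.len() {
        let matches = i < counts.len() && counts[i] == mode;
        if matches && !in_run {
            start = i;
            in_run = true;
        } else if !matches && in_run {
            if i - start > best.1 - best.0 {
                best = (start, i);
            }
            in_run = false;
        }
    }
    Some(best)
}
```

For a file with two metadata lines, a blank line, a three-column header with two data rows, another blank line, and a trailing summary line, this sketch returns `Some((3, 6))`: the header at index 3 through the last data row at index 5.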

Limitations

  • Only supports single-table CSVs (not multi-table or hierarchical data).
  • Assumes the delimiter is consistent within the data region (default: ,).
  • Does not attempt to infer or repair rows with inconsistent column counts within the main data region.
  • Metadata and comments must not contain the delimiter in a way that mimics a table row.
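Because rows with inconsistent column counts are not repaired, it can help to check a file for them before parsing. The following is a minimal sketch of such a pre-check, not part of the crate: `inconsistent_rows` is a hypothetical helper, the expected column count is supplied by the caller, and splitting is naive (quoted delimiters are not handled).

```rust
/// Return `(line_index, field_count)` for every non-blank line whose field
/// count differs from `expected`. Hypothetical helper for illustration;
/// splitting ignores CSV quoting rules.
fn inconsistent_rows(lines: &[&str], delimiter: char, expected: usize) -> Vec<(usize, usize)> {
    lines
        .iter()
        .enumerate()
        // Blank lines are skipped rather than reported.
        .filter(|(_, l)| !l.trim().is_empty())
        .map(|(i, l)| (i, l.split(delimiter).count()))
        // Keep only the rows that deviate from the expected width.
        .filter(|&(_, n)| n != expected)
        .collect()
}
```

Given the lines `id,name,value`, `1,alpha,3.5`, `2,beta`, and `3,gamma,1.0,extra` with an expected count of 3, the sketch reports `[(2, 2), (3, 4)]`. The same check shows why a metadata line such as `Author: Smith, John, PhD` is a problem: it splits into three fields and is indistinguishable from a table row.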

Usage

Add to your Cargo.toml:

[dependencies]
csv_polars_cleaner = "<version>"

Example usage:

use csv_polars_cleaner::parse_folder;

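fn main() {
    // Folder containing the CSV files to parse.
    let folder = "path/to/your/folder";
    // The second argument is the field delimiter, passed as a byte.
    match parse_folder(folder, b',') {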
        Ok(dfs) => {
            println!("Parsed {} files", dfs.len());
            for (i, df) in dfs.iter().enumerate() {
                println!("\nFile {}:", i + 1);
                println!("Headers: {:?}", df.get_column_names());
                println!("Number of rows: {}", df.height());
            }
        }
        Err(e) => {
            eprintln!("Failed to parse folder: {:?}", e);
        }
    }
}

Command-line Usage

To get started, clone this repository:

git clone https://github.com/sanjaysingh13/csv_polars_cleaner.git
cd csv_polars_cleaner

This crate includes a simple CLI for quickly checking CSV parsing on your system:

cargo run -- path/to/your/folder

This will recursively parse all .csv files in the specified folder and its subfolders.
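The recursive file discovery mentioned above could look roughly like the following, using only the standard library. This is a sketch under assumptions, not the crate's code: `collect_csv_files` is a hypothetical name, and matching the extension case-insensitively is a choice made here rather than documented behavior.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect paths with a `.csv` extension under `dir`.
/// Hypothetical helper; the extension match is case-insensitive by assumption.
fn collect_csv_files(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into subfolders.
            collect_csv_files(&path, out)?;
        } else if path
            .extension()
            .and_then(|e| e.to_str())
            .is_some_and(|e| e.eq_ignore_ascii_case("csv"))
        {
            out.push(path);
        }
    }
    Ok(())
}
```

Note that `is_dir` follows symbolic links, so a directory tree containing a symlink cycle would need extra handling in a real tool.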

For more details, see the source code.

View API Documentation (GitHub Pages)
