Headache-free data clean-up

8 unstable releases (3 breaking)

0.4.0 May 21, 2023
0.3.0 May 20, 2023
0.2.1 May 19, 2023
0.1.3 May 17, 2023

MIT/Apache

sanitise

A library for headache-free data clean-up and validation.

sanitise is a CSV processing and validation library that generates code at compile time based on a YAML configuration file. The generated code is robust and will not panic.

no_std environments are supported, but the alloc crate is required.

Quick Start

Add sanitise to your dependencies in your Cargo.toml:

[dependencies]
sanitise = "0.4"

Import the macro:

use sanitise::sanitise_string;

Then call it with your configuration and your CSV data:

// main.rs
use std::{fs, iter::zip};

use sanitise::sanitise_string;

fn main() {
    // data.csv is read at runtime from the working directory.
    let csv = fs::read_to_string("data.csv").unwrap();
    // Here the result holds one tuple of columns per process
    // (validate, then process); ignored columns are left out.
    let ((time_millis, pulse, movement), (time_secs,)) =
        sanitise_string!(include_str!("sanitise_config.yaml"), &csv).unwrap();

    println!("time_millis,time_secs,pulse,movement");
    for (((time_millis, pulse), movement), time_secs) in
        zip(zip(zip(time_millis, pulse), movement), time_secs)
    {
        println!("{time_millis},{time_secs},{pulse},{movement}");
    }
}

# sanitise_config.yaml
processes:
  - name: validate
    columns:
      - title: time
        type: integer
      - title: pulse
        type: integer
        max: 100
        min: 40
        on-invalid: average
        valid-streak: 3
      - title: movement
        type: integer
        valid-values: [0, 1]
        output-type: boolean
        output: "value == 1"
  - name: process
    columns:
      - title: time
        type: integer
        output: "value / 1000"
      - title: pulse
        type: integer
        ignore: true
      - title: movement
        type: integer
        ignore: true

# data.csv
time,pulse,movement
0,67,0
15,45,1
126,132,1

The first argument to sanitise_string! must be a string literal or a macro call that expands to one, such as include_str!. The second argument must be an expression that evaluates to a &str containing CSV data. In the example above, sanitise_config.yaml must sit next to main.rs, because include_str! resolves paths relative to the source file that invokes it, and data.csv must be in the working directory at runtime.
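To make the example configuration more concrete, here is a hand-written sketch of the checks it appears to describe for a single CSV row. It is not the code the macro generates. The names Validated, validate_row and to_seconds are hypothetical, the min and max bounds are assumed to be inclusive, and rejecting a row whose movement value is outside valid-values is an assumption as well.

```rust
/// Hypothetical result of the `validate` process for one CSV row.
#[derive(Debug, PartialEq)]
struct Validated {
    time_millis: i64,
    /// `None` marks a pulse outside the assumed inclusive range 40..=100.
    pulse: Option<i64>,
    movement: bool,
}

/// Parses one `time,pulse,movement` line and applies the rules sketched
/// above. Malformed input yields `None` instead of a panic.
fn validate_row(line: &str) -> Option<Validated> {
    let mut fields = line.split(',').map(str::trim);
    let time_millis: i64 = fields.next()?.parse().ok()?;
    let pulse: i64 = fields.next()?.parse().ok()?;
    let movement: i64 = fields.next()?.parse().ok()?;
    if fields.next().is_some() {
        return None; // more columns than the configuration lists
    }
    // valid-values: [0, 1] (assumption: anything else rejects the row)
    if movement != 0 && movement != 1 {
        return None;
    }
    Some(Validated {
        time_millis,
        // min: 40, max: 100 (assumed inclusive)
        pulse: (40..=100).contains(&pulse).then_some(pulse),
        // output: "value == 1"
        movement: movement == 1,
    })
}

/// The `process` step for the time column: output: "value / 1000".
fn to_seconds(time_millis: i64) -> i64 {
    time_millis / 1000
}
```

Under these assumptions, validate_row("126,132,1") keeps the row but flags the pulse as invalid, which is the case that on-invalid: average is meant to handle. Integer division means to_seconds(126) is 0.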

The other macro, sanitise!, is used when your data has already been parsed into the correct shape. See the documentation for more details.

Configuration

For details on the configuration file, see the specification.

Optional features

  • benchmark: Print the time taken to complete various stages of the process. Disables no_std support. You probably don't want this.

Efficiency

The macro creates linear finite automata to process each column. If on-invalid is set to average for a given column, that column's automaton will use a state machine to keep track of valid and invalid values. If a column is ignored, no automaton will be generated for it. All data is stored in native Rust types.
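As an illustration of how a single-pass state machine can track valid and invalid values, the sketch below fills each run of invalid cells with the mean of the valid values on either side. This is one plausible reading of on-invalid: average, not the crate's actual generated code. The State enum and fill_with_average are hypothetical names, the handling of leading and trailing gaps is an assumption, and valid-streak is ignored entirely.

```rust
/// Where the automaton currently is within the column.
enum State {
    /// No valid value seen yet; `pending` invalid cells are waiting.
    Start { pending: usize },
    /// The most recent cell was valid.
    Valid { last: f64 },
    /// `pending` invalid cells have followed the valid value `last`.
    Gap { last: f64, pending: usize },
}

/// Replaces invalid cells (`None`) in one pass over the column.
/// A column with no valid cells at all yields an empty vector.
fn fill_with_average(column: &[Option<f64>]) -> Vec<f64> {
    let mut out = Vec::with_capacity(column.len());
    let mut state = State::Start { pending: 0 };
    for &cell in column {
        state = match (state, cell) {
            (State::Start { pending }, None) => State::Start { pending: pending + 1 },
            (State::Start { pending }, Some(v)) => {
                // Leading gap: only one neighbour exists, so reuse it.
                out.extend(std::iter::repeat(v).take(pending));
                out.push(v);
                State::Valid { last: v }
            }
            (State::Valid { last }, None) => State::Gap { last, pending: 1 },
            (State::Valid { .. }, Some(v)) => {
                out.push(v);
                State::Valid { last: v }
            }
            (State::Gap { last, pending }, None) => State::Gap { last, pending: pending + 1 },
            (State::Gap { last, pending }, Some(v)) => {
                // Close the gap with the mean of its two valid neighbours.
                let mean = (last + v) / 2.0;
                out.extend(std::iter::repeat(mean).take(pending));
                out.push(v);
                State::Valid { last: v }
            }
        };
    }
    // Trailing gap: fall back to the last valid value (assumption).
    if let State::Gap { last, pending } = state {
        out.extend(std::iter::repeat(last).take(pending));
    }
    out
}
```

Each cell is visited once and the state holds only a counter and one previous value, so the work grows linearly with the column length.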

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
