12 stable releases

2.2.1 Mar 30, 2022
2.2.0 Mar 29, 2022
2.1.0 Mar 27, 2021
1.4.0 Dec 31, 2020
0.1.0 Jun 30, 2020

#970 in Text processing

41 downloads per month

MIT license

155KB
4K SLoC

CSVSC

A library for building transformation chains on csv files.

Docs

Please visit https://docs.rs/csvsc.


lib.rs:

csvsc is a framework for building csv file processors.

Imagine you have N csv files with the same structure and you want to use them to produce M other csv files whose information depends in some way on the original ones. This is what csvsc was built for. With this tool you can build a processing chain (a row stream) that takes each of the input files and generates new output files with the transformed data.

Quickstart

Start a new binary project with cargo:

$ cargo new --bin miprocesadordecsv

Add csvsc and encoding as dependencies in Cargo.toml:

[dependencies]
csvsc = "2.2"
encoding = "0.2"

Now start building your processing chain. Specify the inputs (one or more csv files), the transformations, and the output.

use csvsc::prelude::*;

let mut chain = InputStreamBuilder::from_paths(&[
        // Put here the paths to your source files, from one to a million
        "test/assets/chicken_north.csv",
        "test/assets/chicken_south.csv",
    ]).unwrap().build().unwrap()

    // Here is where you do the magic: add columns, remove others, filter
    // rows, group and aggregate, even transpose the data to fit your
    // needs.

    // Specify some (zero, one or many) output targets so that the results
    // of your computations get stored somewhere.
    .flush(Target::path("data/output.csv")).unwrap()

    .into_iter();

// And finally consume the stream, reporting any errors to stderr.
while let Some(item) = chain.next() {
    if let Err(e) = item {
        eprintln!("{}", e);
    }
}
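
If you prefer to stop at the first problem instead of logging every error, the same iterator can be drained with a plain for loop. This is just a sketch using the standard library; `chain` is the iterator built above.

// Alternative consumption: abort on the first error instead of reporting them all.
for item in chain {
    item.expect("failed to process a row");
}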

Example

Grab your input files. In this case I'll use these two:

chicken_north.csv

month,eggs per week
1,3
1,NaN
1,6
2,
2,4
2,8
3,5
3,1
3,8

chicken_south.csv

month,eggs per week
1,2
1,NaN
1,
2,7
2,8
2,23
3,3
3,2
3,12

Now build your processing chain.

// main.rs
use csvsc::prelude::*;

use encoding::all::UTF_8;

let mut chain = InputStreamBuilder::from_paths(vec![
        "test/assets/chicken_north.csv",
        "test/assets/chicken_south.csv",
    ]).unwrap()

    // optionally specify the encoding
    .with_encoding(UTF_8)

    // optionally add a column with the path of the source file as specified
    // in the builder
    .with_source_col("_source")

    // build the row stream
    .build().unwrap()

    // Filter out rows with invalid values in this column
    .filter_col("eggs per week", |value| {
        value.len() > 0 && value != "NaN"
    }).unwrap()

    // add a column with a value extracted from the filename, wow!
    .add(
        Column::with_name("region")
            .from_column("_source")
            .with_regex("_([a-z]+).csv").unwrap()
            .definition("$1")
    ).unwrap()

    // group by two columns, compute some aggregates
    .group(["region", "month"], |row_stream| {
        row_stream.reduce(vec![
            Reducer::with_name("region").of_column("region").last("").unwrap(),
            Reducer::with_name("month").of_column("month").last("").unwrap(),
            Reducer::with_name("avg").of_column("eggs per week").average().unwrap(),
            Reducer::with_name("sum").of_column("eggs per week").sum(0.0).unwrap(),
        ]).unwrap()
    })

    // Write a report to a single file that will contain all the data
    .flush(
        Target::path("data/report.csv")
    ).unwrap()

    // This column will allow us to output to multiple files, in this case
    // one report per month
    .add(
        Column::with_name("monthly report")
            .from_all_previous()
            .definition("data/monthly/{month}.csv")
    ).unwrap()

    .del(vec!["month"])

    // Write every row to the file specified by its `monthly report` column
    // added previously
    .flush(
        Target::from_column("monthly report")
    ).unwrap()

    // Pack the processing chain into an iterator that can be consumed.
    .into_iter();

// Consuming the iterator actually triggers all the transformations.
while let Some(item) = chain.next() {
    item.unwrap();
}
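
Since this is an ordinary binary crate, running the chain is just the usual cargo invocation (nothing csvsc-specific):

$ cargo run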

This is the output:

data/monthly/1.csv

region,avg,sum
south,2,2
north,4.5,9

data/monthly/2.csv

region,avg,sum
north,6,12
south,12.666666666666666,38

data/monthly/3.csv

region,avg,sum
north,4.666666666666667,14
south,5.666666666666667,17

data/report.csv

region,month,avg,sum
north,2,6,12
south,1,2,2
south,2,12.666666666666666,38
north,3,4.666666666666667,14
south,3,5.666666666666667,17
north,1,4.5,9

Dig deeper

Check InputStreamBuilder to see more options for starting a processing chain and reading your input.

Go to the RowStream documentation to see all the transformations available as well as options to flush the data to files or standard I/O.
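
As a taste of how those pieces combine, here is a minimal sketch that only reuses adapters already shown above: read the same test files, drop the bad rows, and write everything to a single file (the output path data/filtered.csv is arbitrary).

use csvsc::prelude::*;

// A minimal chain: read, drop bad rows, write. No grouping required.
let chain = InputStreamBuilder::from_paths(&[
        "test/assets/chicken_north.csv",
        "test/assets/chicken_south.csv",
    ]).unwrap()
    .build().unwrap()

    // keep only rows with a usable value in this column
    .filter_col("eggs per week", |value| value.len() > 0 && value != "NaN").unwrap()

    .flush(Target::path("data/filtered.csv")).unwrap()
    .into_iter();

for item in chain {
    if let Err(e) = item {
        eprintln!("{}", e);
    }
}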

Dependencies

~7MB
~105K SLoC