csv-diff

Compare two CSVs - with ludicrous speed 🚀

1 unstable release

0.1.0-alpha Dec 30, 2021

#494 in Encoding

MIT/Apache

115KB
2.5K SLoC

Find the difference between two CSVs - with ludicrous speed!🚀


Documentation

https://docs.rs/csv-diff

⚠️Warning⚠️

This crate is still in its infancy. Expect breaking changes (and dragons🐉) in the early releases.

Highlights ✨

  • fastest CSV-diffing library in the world🚀
    • compare two CSVs with 1,000,000 rows x 9 columns in under 500ms
  • thread-pool agnostic 🧵🧶

Example

use std::io::Cursor;
use csv_diff::{csv_diff::CsvByteDiff, csv::Csv};
use csv_diff::diff_row::{ByteRecordLineInfo, DiffByteRecord};
// note: `csv::ByteRecord` below comes from the `csv` crate, so it must also be a dependency

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // some csv data with a header, where the first column is a unique id
    let csv_data_left = "id,name,kind\n\
                        1,lemon,fruit\n\
                        2,strawberry,fruit";
    let csv_data_right = "id,name,kind\n\
                        1,lemon,fruit\n\
                        2,strawberry,nut";

    let csv_byte_diff = CsvByteDiff::new()?;

    let mut diff_byte_records = csv_byte_diff.diff(
        // we need to wrap our bytes in a cursor, because it needs to be `Seek`able
        Csv::new(Cursor::new(csv_data_left.as_bytes())),
        Csv::new(Cursor::new(csv_data_right.as_bytes())),
    )?;

    diff_byte_records.sort_by_line();

    let diff_byte_rows = diff_byte_records.as_slice();

    assert_eq!(
        diff_byte_rows,
        &[DiffByteRecord::Modify {
            delete: ByteRecordLineInfo::new(
                csv::ByteRecord::from(vec!["2", "strawberry", "fruit"]),
                3
            ),
            add: ByteRecordLineInfo::new(csv::ByteRecord::from(vec!["2", "strawberry", "nut"]), 3),
            field_indices: vec![2]
        }]
    );
    Ok(())
}
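
The comment in the example says the input must be `Seek`able, which is why the byte slices are wrapped in a `Cursor`. The following is a minimal, std-only sketch of what `Seek` adds on top of `Read`: the ability to jump back and read the same input again. The names `read_twice` and `csv_cursor` are hypothetical helpers invented for this illustration; the sketch does not show how csv-diff itself uses `Seek`.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

/// Reads the whole input twice, rewinding in between.
/// The `Seek` bound is what makes the rewind possible.
fn read_twice<R: Read + Seek>(reader: &mut R) -> std::io::Result<(String, String)> {
    let mut first = String::new();
    reader.read_to_string(&mut first)?;

    // Jump back to the start; a type that is only `Read` could not do this.
    reader.seek(SeekFrom::Start(0))?;

    let mut second = String::new();
    reader.read_to_string(&mut second)?;
    Ok((first, second))
}

/// Hypothetical helper: wraps in-memory CSV text in a `Cursor`,
/// which implements both `Read` and `Seek`.
fn csv_cursor(data: &str) -> Cursor<&[u8]> {
    Cursor::new(data.as_bytes())
}
```

Calling `read_twice(&mut csv_cursor("id,name\n1,lemon"))` yields the same text for both reads. A `std::fs::File` can be passed in the same way, since it also implements `Read` and `Seek`.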

Getting Started

In your Cargo.toml file add the following lines under [dependencies]:

csv-diff = "0.1.0-alpha"

By default this uses a rayon thread-pool. You can opt out of it and, for example, use threads without a thread-pool by enabling the crossbeam-threads feature (and disabling the default features):

csv-diff = { version = "0.1.0-alpha", default-features = false, features = ["crossbeam-threads"] }

Use Case

This crate should be used on CSV data that has some sort of primary key for uniquely identifying a record. It is not a general line-by-line diffing crate. For example, you can dump a database table in CSV format from both your test and production systems and compare the two dumps to find differences.
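
To make the difference from line-by-line diffing concrete, here is a deliberately naive, std-only sketch of key-based comparison. It assumes the first column is the primary key and that fields contain no quoted commas or newlines. `RowDiff`, `index_by_key` and `diff_by_key` are hypothetical names for this illustration only; this is not the crate's algorithm and has none of its performance characteristics.

```rust
use std::collections::HashMap;

/// Outcome for one primary key; a simplified, hypothetical stand-in
/// for the richer result type the crate returns.
#[derive(Debug, PartialEq)]
enum RowDiff {
    Added(String),
    Deleted(String),
    Modified { left: String, right: String },
}

/// Indexes the data rows (header skipped) by their first column.
fn index_by_key(csv: &str) -> HashMap<&str, &str> {
    csv.lines()
        .skip(1)
        .filter(|line| !line.is_empty())
        .map(|line| (line.split(',').next().unwrap_or(""), line))
        .collect()
}

/// Naive key-based diff: rows are matched by key, not by line position.
fn diff_by_key(left: &str, right: &str) -> Vec<RowDiff> {
    let left_rows = index_by_key(left);
    let right_rows = index_by_key(right);
    let mut diffs = Vec::new();

    for (key, l) in &left_rows {
        match right_rows.get(key) {
            // Same key on both sides, but the row content differs.
            Some(r) if r != l => diffs.push(RowDiff::Modified {
                left: l.to_string(),
                right: r.to_string(),
            }),
            Some(_) => {} // identical row, nothing to report
            None => diffs.push(RowDiff::Deleted(l.to_string())),
        }
    }
    for (key, r) in &right_rows {
        if !left_rows.contains_key(key) {
            diffs.push(RowDiff::Added(r.to_string()));
        }
    }
    diffs
}
```

Because rows are matched by key, reordering the rows of one input does not produce any differences, which a line-by-line diff would report. The result order follows `HashMap` iteration and is therefore unspecified; sort it if a stable order is needed.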

Caveats

Because this crate is still in its infancy, there are some caveats, which we might resolve in the near future:

  • resulting CSV records/lines that have differences are provided as raw bytes; you can use StringRecord::from_byte_record, provided by the csv crate, to try converting them into UTF-8 encoded records.
  • CSVs must be Seekable
    • Seek is implemented for the most important types like:
      • Files
      • and when wrapped in a Cursor
        • Strings and &str
        • [u8]
  • when using your own custom thread-pool, thread-spawning must support scoped threads
  • documentation must be improved
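
For readers unfamiliar with the term in the caveat above, scoped threads are threads that are guaranteed to finish before the scope ends, so they may borrow local data instead of requiring `'static` ownership. The sketch below illustrates the concept with `std::thread::scope` from the standard library, which requires Rust 1.63 or newer and is therefore above this crate's MSRV. `count_rows_in_parallel` is a hypothetical function for illustration; it is not taken from csv-diff.

```rust
use std::thread;

/// Counts the data rows (lines after the header) of both inputs in parallel.
/// The spawned threads borrow `left` and `right` directly, which is only
/// possible because the scope joins them before returning.
fn count_rows_in_parallel(left: &str, right: &str) -> (usize, usize) {
    thread::scope(|s| {
        let l = s.spawn(|| left.lines().skip(1).count());
        let r = s.spawn(|| right.lines().skip(1).count());
        (
            l.join().expect("left worker panicked"),
            r.join().expect("right worker panicked"),
        )
    })
}
```

With `std::thread::spawn` instead, the same closures would be rejected by the compiler, because the borrowed `&str` arguments are not `'static`.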

Benchmarks

You can run benchmarks with the following command:

cargo bench

Safety

This crate is implemented in 100% Safe Rust, which is ensured by using #![forbid(unsafe_code)].

MSRV

The Minimum Supported Rust Version for this crate is 1.49. An increase of MSRV will be indicated by a breaking change (according to SemVer).

Credits

This crate is inspired by the CLI tool csvdiff by Aswin Karthik, which is written in Go. Definitely check it out. It is a great tool.

Additionally, this crate would not exist without the awesome Rust community and the fantastic crates it builds on 🦀.

License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Dependencies

~2.1–3MB
~51K SLoC