#csv #csv-tsv #tsv #slice #file-header #dataset #command

bin+lib csv-guillotine

CSV files often have metadata at the top, before the data headers. This removes it.

9 releases

0.3.5 May 27, 2019
0.3.4 May 27, 2019
0.2.0 May 3, 2019
0.1.1 Apr 30, 2019

#13 in #tsv

30 downloads per month

MIT license

22KB
502 lines

CSV Guillotine

Purpose

Many banks, stockbrokers and other large institutions allow you to download your account history as a CSV file. This is good and to be applauded, but they often include an extra metadata header at the top of the file explaining what it is. Such a CSV file may look something like the following:

Account:,****07493
£4536.24
£4536.24

Transaction type,Description,Paid out,Paid in,Balance
Bank credit YOUR EMPLOYER,Bank credit YOUR EMPLOYER,,£2016.12,£4536.24
Direct debit CREDIT PLASTIC,CREDIT PLASTIC,£402.98,,£520.12

For many users this is fine as it can still be loaded into a spreadsheet application.

For my use case, I need to download many of these files, which together make up one large data set, and these extra metadata headers are quite an issue because I can no longer use xsv to parse them directly.

This library is a form of buffer which removes this metadata header. It does this by looking at the field count in a given number of rows and removing the lines that come before the row where the maximum field count is reached.
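The heuristic described above can be sketched in a few lines of plain Rust. This is only an illustration of the idea, not the code inside this crate: the function names `header_index` and `strip_metadata` are made up for this example, it works on whole strings rather than as a streaming buffer, and it counts separators naively without handling quoted fields.

```rust
/// Return the index of the first line, among the first `consider` lines,
/// that has the highest field count seen in that window.
/// (Hypothetical helper for illustration; not part of the crate's API.)
fn header_index(lines: &[&str], separator: char, consider: usize) -> usize {
    let mut best_index = 0;
    let mut best_count = 0;
    for (index, line) in lines.iter().take(consider).enumerate() {
        // Naive field count: number of separators plus one.
        let count = line.matches(separator).count() + 1;
        // Strictly greater, so the first line reaching the maximum wins.
        if count > best_count {
            best_count = count;
            best_index = index;
        }
    }
    best_index
}

/// Drop every line before the detected header line.
/// (Hypothetical helper for illustration; not part of the crate's API.)
fn strip_metadata(input: &str, separator: char, consider: usize) -> String {
    let lines: Vec<&str> = input.lines().collect();
    let start = header_index(&lines, separator, consider);
    lines[start..].join("\n")
}
```

Applied to the bank statement shown earlier, the "Transaction type,..." line is the first one with five fields, so everything above it would be dropped.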

Compiling

To compile, install Rust using rustup, check out this repository and run:

    cargo install --path .

Command Line Usage

This can be used like the following:

    cat with_metadata_headers.csv | csv-guillotine --separator=',' --consider=20 > csv_header_and_data_only.csv

or

    csv-guillotine -i with_metadata_headers.csv -o csv_header_and_data_only.csv

See csv-guillotine --help for full usage instructions.

Errors will be printed to STDERR and their existence can be detected via the exit status.
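If you drive the command line program from another Rust program, the exit status can be checked with std::process::Command. The sketch below assumes the binary is on your PATH and uses only the -i and -o flags shown above; the function name `strip_with_cli` is made up for this example.

```rust
use std::io;
use std::process::Command;

/// Run `program` (normally "csv-guillotine") on `input`, writing to `output`.
/// Returns Ok(true) when the exit status reports success, Ok(false) when the
/// tool ran but reported an error, and Err when it could not be started.
/// (Hypothetical wrapper for illustration; not part of the crate's API.)
fn strip_with_cli(program: &str, input: &str, output: &str) -> io::Result<bool> {
    let status = Command::new(program)
        .args(["-i", input, "-o", output])
        // The child's STDERR is inherited, so its error messages stay visible.
        .status()?;
    Ok(status.success())
}
```

A caller could then treat Ok(false) as "the tool printed an error to STDERR" and Err as "the tool is not installed or could not be launched".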

NOTE: This software makes no attempt to actually validate your CSV.

Library Usage

This library exposes a Blade type, which is constructed from a Read, a separator character (expressed as a u8) and a line limit. Blade can itself be used as a Read to get the actual data out.

Example below:

    extern crate csv_guillotine;
    use std::io::{BufRead, BufReader};
    use csv_guillotine::Blade;

    fn main() {
        let stdin = std::io::stdin();
        // 44 is the byte value of ','; 20 is the number of lines to consider.
        let blade = Blade::new(stdin, 44, 20);
        let mut buf_reader = BufReader::new(blade);

        loop {
            let mut buffer = String::new();
            match buf_reader.read_line(&mut buffer) {
                // A read of zero bytes means end of input.
                Ok(0) => break,
                Ok(_) => print!("{}", buffer),
                Err(e) => {
                    eprintln!("ERROR: {}", e);
                    // Stop instead of retrying a failing read forever.
                    break;
                }
            }
        }
    }

Versions

  • 0.3.5 - Allow tab separated CSV files (AKA TSV) in binary.
  • 0.3.4 - Convert Blade (lib.rs) to Blade (a generic) instead of using a Box field internally.
  • 0.3.3 - More normal project layout and nicer code.
  • 0.3.2 - More normal project layout and nicer code.
  • 0.3.1 - Improve test coverage and fix bugs.
  • 0.3.0 - Use bytes instead of String for everything so it can process non UTF8 files.
  • 0.2.0 - Add a command line program
  • 0.1.1 - Rename main class to Blade to keep with the guillotine theme
  • 0.1.0 - Initial version

Dependencies

~1.5MB
~19K SLoC