1 stable release

1.0.0 Feb 17, 2022

#16 in #tsv

MIT license

28KB
556 lines

Solidify: CSV data consolidator

Introduction

Solidify is a command line tool that allows to combine CSV/TSV files like so:

Input 1 Input 2
Country	Population
China	1.41B
India	1.39B
US	333M
Country	Area
Canada	10M km²
US	9.8M km²
China	9.6M km²

Output:

Country	Population	Area
China	1.41B	9.6M km²
India	1.39B	N/A
US	333M	9.8M km²
Canada	N/A	10M km²

Installation

Install Rust, then run:

cargo install solidify

Usage

Basic usage

The introductory example can be reproduced using the following command:

solidify -i 1.tsv 2.tsv -o out.tsv --shared 1 --filler N/A

Here --shared 1 refers to the fact that the first column is shared between 1.tsv and 2.tsv—and it is this column’s contents that are used to identify and match records across the files.

Inputs

You can specify two or more input files to be combined using -i or --inputs:

-i 1.tsv 2.tsv
--inputs a.csv b.csv c.csv

Output

You have to specify the output file with -o or --output:

-o out.tsv
--output combined.csv

To prevent accidental overriding of data, the output path must be different from all the input paths.

Delimiter

Solidify does not attempt to autodetect delimiters used in your data, so you need to manually specify one (the same delimiter will also be applied to the output). If a delimiter is not provided, the default will be assumed: the tab character (" "). To prevent any mistakes when specifying a delimiter, Solidify will exit with an error if each of the input files appears to have a single column (unless you explicitly allow it).

Only ASCII characters are currently accepted as delimiters. You can provide one with -d or --delimiter:

-d ,
--delimiter "	"

Shared columns

Using -s, or --shared, you can specify which of the columns of your data are shared between input files (in case there are multiple columns, each value has to be provided separately by repeating the option):

-s 1
--shared 3
-s 2 -s 3 -s 8

These columns will be used to identify which records should be matched and merged.

Reverse indexing

Negative values refer to columns in reverse order, that is, -1 refers to the last column, -2 to the second-to-last, etc. To guarantee consistency of output data, negatively indexed columns are not allowed to precede any positively indexed column in any of the input files.

Merge all vs. merge none

If no shared columns are specified, any pair of records will be considered matching (given multiway merge is allowed).

For instance, running

solidify -i 1.tsv 2.tsv -o out.tsv --multi

against the introductory example would produce the following output:

Country	Population	Country	Area
China	1.41B	Canada	10M km²
India	1.39B	US	9.8M km²
US	333M	China	9.6M km²

In contrast, if a special value of 0 is provided as the value of -s/--shared, no two records will be considered matching. Running

solidify -i 1.tsv 2.tsv -o out.tsv -s 0 -s 1 --filler N/A

will hence produce:

Country	Population	N/A
China	1.41B	N/A
India	1.39B	N/A
US	333M	N/A
Country	N/A	Area
Canada	N/A	10M km²
US	N/A	9.8M km²
China	N/A	9.6M km²

Single-columned inputs

To prevent any mistakes when specifying a delimiter, Solidify will exit with an error if each of the input files appears to have a single column. To allow processing such inputs, pass the --single flag.

Multiway merge

When data admits multiple ways to match records, Solidify needs to be passed the --multi flag to proceed. If the flag is set, records will be matched in the order they appear in input files (see Merge all vs. merge none for an example).

Filler

The value of --filler determines the content of unmatched cells (N/A in the introductory example). If not provided, an empty string will be used.

Warn on similar records

To track records not being matched due to typos, you may set --warn-similar to a positive integer. If the combined edit distance between a pair of records does not exceed this value, and yet the records are not identical, a warning will be displayed. Only values in columns declared as shared are compared.

Warn on unmatched records

When the flag --warn-unmatched is set, any records that could not be matched with any records in at least one of the other input files will be reported.

Dependencies

~5.5MB
~90K SLoC