1 stable release: 1.0.0 (Feb 17, 2022)
Solidify: CSV data consolidator
Introduction
Solidify is a command-line tool that combines CSV/TSV files like so:
Input 1:
Country Population
China 1.41B
India 1.39B
US 333M
Input 2:
Country Area
Canada 10M km²
US 9.8M km²
China 9.6M km²
Output:
Country Population Area
China 1.41B 9.6M km²
India 1.39B N/A
US 333M 9.8M km²
Canada N/A 10M km²
Installation
Install Rust, then run:
cargo install solidify
Usage
Basic usage
The introductory example can be reproduced using the following command:
solidify -i 1.tsv 2.tsv -o out.tsv --shared 1 --filler N/A
Here --shared 1 indicates that the first column is shared between 1.tsv and 2.tsv; the contents of this column are used to identify and match records across the files.
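The matching described above can be sketched in Python. This is only an illustration of the behavior shown in the introductory example, not Solidify's actual implementation; the function name merge_on_shared is hypothetical, rows are assumed to be already split into cells, and placing the shared column first in the output is an assumption.

```python
def merge_on_shared(tables, shared=1, filler="N/A"):
    """Combine tables on one shared column (1-based, like --shared).

    Every row, including the header row, is treated as an ordinary record.
    """
    idx = shared - 1
    # Number of non-shared columns each table contributes to the output.
    widths = [len(table[0]) - 1 for table in tables]
    merged = {}  # key -> non-shared cells; dicts keep first-seen order
    offset = 0
    for table, width in zip(tables, widths):
        for record in table:
            cells = merged.setdefault(record[idx], [filler] * sum(widths))
            cells[offset:offset + width] = record[:idx] + record[idx + 1:]
        offset += width
    return [[key] + cells for key, cells in merged.items()]


first = [["Country", "Population"], ["China", "1.41B"],
         ["India", "1.39B"], ["US", "333M"]]
second = [["Country", "Area"], ["Canada", "10M km²"],
          ["US", "9.8M km²"], ["China", "9.6M km²"]]
for row in merge_on_shared([first, second]):
    print("\t".join(row))
```

Records found in only one input keep the filler value in the cells the other input would have supplied.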
Inputs
You can specify two or more input files to be combined using -i or --inputs:
-i 1.tsv 2.tsv
--inputs a.csv b.csv c.csv
Output
You must specify the output file with -o or --output:
-o out.tsv
--output combined.csv
To prevent accidental overwriting of data, the output path must be different from all the input paths.
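A check of this kind could look like the following Python sketch. It is an assumption about how such a guard might work, not Solidify's code; the name check_output_path is hypothetical, and comparing resolved absolute paths is one possible choice.

```python
from pathlib import Path


def check_output_path(inputs, output):
    """Raise ValueError if the output path points at one of the inputs."""
    out = Path(output).resolve()
    for path in inputs:
        # Resolving makes "1.tsv" and "./1.tsv" compare equal.
        if Path(path).resolve() == out:
            raise ValueError(f"output path {output!r} is also an input path")


check_output_path(["1.tsv", "2.tsv"], "out.tsv")  # passes silently
```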
Delimiter
Solidify does not attempt to autodetect the delimiter used in your data, so you need to specify one manually (the same delimiter is also applied to the output). If no delimiter is provided, the default is assumed: the tab character. To catch mistakes in the delimiter, Solidify exits with an error if every input file appears to have a single column (unless you explicitly allow it; see Single-columned inputs).
Only ASCII characters are currently accepted as delimiters. You can provide one with -d or --delimiter:
-d ,
--delimiter " "
Shared columns
Using -s or --shared, you can specify which columns of your data are shared between input files (if there are multiple shared columns, provide each value separately by repeating the option):
-s 1
--shared 3
-s 2 -s 3 -s 8
These columns will be used to identify which records should be matched and merged.
Reverse indexing
Negative values refer to columns in reverse order, that is, -1 refers to the last column, -2 to the second-to-last, etc. To guarantee consistency of output data, negatively indexed columns are not allowed to precede any positively indexed column in any of the input files.
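The indexing rule can be illustrated with a short Python sketch. The function names are hypothetical and this is not Solidify's implementation. It assumes that "precede" means a strictly smaller column position, and it does not handle the special value 0.

```python
def resolve_column(index, width):
    """Map a 1-based, possibly negative column number to a 0-based position."""
    if index == 0 or abs(index) > width:
        raise ValueError(f"column {index} is out of range for {width} columns")
    return index - 1 if index > 0 else width + index


def resolve_shared(indices, width):
    """Resolve all shared columns for a file that has `width` columns."""
    positions = [resolve_column(i, width) for i in indices]
    positives = [p for i, p in zip(indices, positions) if i > 0]
    for i, p in zip(indices, positions):
        # A negative index must not land before any positively indexed column.
        if i < 0 and any(p < q for q in positives):
            raise ValueError(f"column {i} precedes a positively indexed column")
    return positions


print(resolve_shared([1, -1], width=3))  # [0, 2]
```

With three columns, -s 3 -s -2 would be rejected in this sketch, because -2 resolves to the second column, which comes before the third.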
Merge all vs. merge none
If no shared columns are specified, any pair of records is considered matching (provided that multiway merge is allowed).
For instance, running
solidify -i 1.tsv 2.tsv -o out.tsv --multi
against the introductory example would produce the following output:
Country Population Country Area
China 1.41B Canada 10M km²
India 1.39B US 9.8M km²
US 333M China 9.6M km²
In contrast, if the special value 0 is provided as the value of -s/--shared, no two records will be considered matching. Running
solidify -i 1.tsv 2.tsv -o out.tsv -s 0 -s 1 --filler N/A
will hence produce:
Country Population N/A
China 1.41B N/A
India 1.39B N/A
US 333M N/A
Country N/A Area
Canada N/A 10M km²
US N/A 9.8M km²
China N/A 9.6M km²
Single-columned inputs
To catch mistakes in the delimiter, Solidify exits with an error if every input file appears to have a single column. To allow processing such inputs, pass the --single flag.
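As a rough illustration of this safeguard, the Python sketch below treats an input as single-columned when none of its lines contains the delimiter. That criterion is an assumption, the function name is hypothetical, and Solidify's actual check may differ (for example, in how it treats quoted fields).

```python
def all_single_column(inputs, delimiter="\t"):
    """Return True if no line of any input text contains the delimiter."""
    return all(
        delimiter not in line
        for text in inputs
        for line in text.splitlines()
    )


comma_data = ["a,b\nc,d\n", "e,f\n"]
# Parsed with the default tab delimiter, comma-separated data looks
# single-columned, which is the situation the error is meant to catch.
print(all_single_column(comma_data))        # True
print(all_single_column(comma_data, ","))   # False
```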
Multiway merge
When the data admits more than one way to match records, Solidify requires the --multi flag to proceed. If the flag is set, records are matched in the order they appear in the input files (see Merge all vs. merge none for an example).
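Order-based matching can be sketched in Python as follows. This is an interpretation of the description, not Solidify's code: the name pair_in_order is hypothetical, only two inputs are handled, and pairing leftover records with nothing is an assumption.

```python
from itertools import zip_longest


def pair_in_order(left, right, key):
    """Pair records that share a key, in order of appearance."""
    groups = {}  # key -> (records from left, records from right)
    for record in left:
        groups.setdefault(key(record), ([], []))[0].append(record)
    for record in right:
        groups.setdefault(key(record), ([], []))[1].append(record)
    pairs = []
    for lefts, rights in groups.values():
        # Within a group, the n-th left record meets the n-th right record;
        # leftovers are paired with None.
        pairs.extend(zip_longest(lefts, rights))
    return pairs


left = [["Country", "Population"], ["China", "1.41B"],
        ["India", "1.39B"], ["US", "333M"]]
right = [["Country", "Area"], ["Canada", "10M km²"],
         ["US", "9.8M km²"], ["China", "9.6M km²"]]
# No shared columns: every record gets the same (empty) key.
for a, b in pair_in_order(left, right, key=lambda record: ()):
    print("\t".join(a + b))
```

With an empty key, every record falls into the same group, so rows are paired purely by position, as in the Merge all vs. merge none example.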
Filler
The value of --filler determines the content of unmatched cells (N/A in the introductory example). If not provided, an empty string will be used.
Warn on similar records
To catch records that fail to match because of typos, you can set --warn-similar to a positive integer. If the combined edit distance between a pair of records does not exceed this value, and yet the records are not identical, a warning is displayed. Only values in columns declared as shared are compared.
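To make the idea concrete, here is a Python sketch. It assumes that "edit distance" means Levenshtein distance and that "combined" means the sum over the shared columns; neither is confirmed by this document, and the function names are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def combined_distance(key_a, key_b):
    """Sum of per-column distances between two tuples of shared values."""
    return sum(levenshtein(x, y) for x, y in zip(key_a, key_b))


def similar_pairs(keys, threshold):
    """Pairs of keys that differ, but by no more than `threshold` edits."""
    return [
        (a, b)
        for a, b in combinations(keys, 2)
        if 0 < combined_distance(a, b) <= threshold
    ]


print(similar_pairs([("China",), ("Chna",), ("India",)], threshold=1))
```

Here "China" and "Chna" are one edit apart, so that pair would trigger a warning at a threshold of 1.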
Warn on unmatched records
When the --warn-unmatched flag is set, any record that could not be matched with a record in at least one of the other input files is reported.
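The condition can be sketched in Python as shown below. This is an illustration rather than Solidify's implementation: the function name is hypothetical, and each input is reduced to the set of its shared-column values.

```python
def unmatched_keys(key_sets):
    """For each input, collect keys absent from at least one other input."""
    report = []
    for i, keys in enumerate(key_sets):
        others = [other for j, other in enumerate(key_sets) if j != i]
        report.append({k for k in keys if any(k not in o for o in others)})
    return report


first_keys = {"Country", "China", "India", "US"}
second_keys = {"Country", "Canada", "US", "China"}
print(unmatched_keys([first_keys, second_keys]))
```

For the introductory example, this reports India for the first input and Canada for the second.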