csv-groupby
gb
A command that performs a SQL-like "group by" on delimited files or arbitrary lines of text.
gb is a command that takes delimited data (like csv files) or lines of text (like a log file) and emulates a SQL select-group-by on that data. It is a utility partially inspired by xsv and by the desire to stop writing the same perl one-liners to analyze massive log files.
It does this job very fast by "slicing" blocks of data at line boundaries and forwarding those line-aligned blocks to multiple parser threads. Multiple IO threads are also used when a list of files is provided as the data source.
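Because parsing and IO run on separate thread pools, throughput can be tuned with the -t (parser threads) and -I (IO threads) options shown in the help below. A hypothetical invocation (the file name is made up, and useful thread counts depend on your machine):

gb -f big.csv -k 1 -s 2 -t 16 -I 4 --stats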
- Fast - processing at 500MB/s to 2GB/s is not uncommon on fast multicore machines
- Do group-bys with sum, count, count distinct, avg, min, and max
- Can process CSV files OR text using regular expressions
- Process files or stdin as a data source (see the stdin example after this list):
  - csv files
  - text/logs handled via regex mode, where sub-groups map to field positions
  - compressed files (.zst, .gz, .xz, etc.) are decompressed on the fly
  - recursively --walk directory trees and filter for only the files you want
- Filenames (-p) can be filtered/parsed with regular expressions where sub-groups become fields
- Delimited output or "aligned" table output
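When none of -f, -l, or -i is given, gb reads stdin, so it composes with other tools. A hypothetical pipeline (the file name and field numbers are made up):

cat sales.csv | gb -k 1 -s 3 -c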
HOW-TO:
You identify fields by their column numbers.
Each field is used either as part of the "key" (the group-by), as an aggregate (avg, sum, min, max), or as a count distinct.
You may use any number of fields - none to many - in each role, including the key.
Use the -A option to give these numbered fields "names" used in the output.
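For instance, a minimal sketch (the file name and its layout are hypothetical) that groups by field 1, sums field 3, and counts the distinct values of field 2:

gb -f data.csv -k 1 -s 3 -u 2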
Summaries on a csv file:
Using some airline flight data taken from: http://stat-computing.org/dataexpo/2009/2008.csv.bz2
Note that this data is truncated a bit here and reformatted from the original csv to make it readable.
1 2 3 4 5 6 7 8 9
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier ...
2008 1 3 4 2003 1955 2211 2225 WN ...
2008 1 3 4 754 735 1002 1000 WN ...
2008 1 3 4 628 620 804 750 WN ...
2008 1 3 4 926 930 1054 1100 WN ...
2008 1 3 4 1829 1755 1959 1925 WN ...
2008 1 3 4 1940 1915 2121 2110 WN ...
2008 1 3 4 1937 1830 2037 1940 WN ...
2008 1 3 4 1039 1040 1132 1150 WN ...
2008 1 3 4 617 615 652 650 WN ...
....
Running the command:
gb -f 2008.csv -k 2,9 -s 14 -a 14 --skip_header | head -10
How this command corresponds to the SQL:
select Month, UniqueCarrier, count(*), sum(AirTime), avg(AirTime) from csv group by Month, UniqueCarrier
       ^                               ^             ^
       |                               |             |
       -k 2,9                          -s 14         -a 14
Here's a partial output:
k:2 | k:9 | count | s:14 | a:14
-----+-----+--------+---------+--------------------
1 | 9E | 22840 | 1539652 | 71.22412915760744
1 | AA | 52390 | 7317245 | 144.5524496246543
1 | AQ | 4026 | 247200 | 61.830915457728864
1 | AS | 12724 | 1587637 | 129.23378103378104
1 | B6 | 16441 | 2477670 | 152.93315227455096
1 | CO | 25168 | 3878167 | 155.51858683883387
1 | DL | 38241 | 4812768 | 130.22967853663818
1 | EV | 23106 | 1584233 | 72.21739526826822
....
Another example determines the number of distinct airplanes used and the time spent in the air by each carrier:
select UniqueCarrier, count(*), sum(AirTime), sum(AirTime), avg(AirTime), count(distinct TailNum) from csv
group by UniqueCarrier
The following command emulates this:
gb -f ~/dl/2008.csv -k 9 -s 14 -u 11 -s 14 -a 14 --skip_header
Output:
Note that the output order of the columns does not correspond to the order of the field options. It is fixed: keys, count, sums, avgs, and then uniques.
k:9 | count | s:14 | s:14 | a:14 | u:11
-----|---------|-----------|-----------|--------------------|------
9E | 262109 | 18080077 | 18080077 | 71.11840692300127 | 162
AA | 604655 | 82989937 | 82989937 | 141.8001178963196 | 656
AQ | 7797 | 479581 | 479581 | 61.889405084527034 | 21
AS | 151045 | 19569900 | 19569900 | 131.83977040764768 | 126
B6 | 196018 | 28849406 | 28849406 | 150.22524356777978 | 154
CO | 298342 | 45784515 | 45784515 | 155.86589297447088 | 378
...
Summaries on arbitrary text
Using regular expression sub groups as data fields.
Alternatively, you can use a regular expression (see the -r option) against each line; the captured sub-groups become fields, and the field indices correspond to the sub-group numbers. This is useful for data that is not as organized as a csv, such as log files.
This example peels the date off the myapp log file names and summarizes ERROR lines by the 5 to 40 characters that follow the ERROR string.
It is also an example of using the file path as part of the reporting.
gb --walk /some/log/directory -p 'myapp.*(2019-\d\d-\d\d).log' -r '.*ERROR(.{5,40}).*' -k 1,2
Here sub-groups 1 and 2 create a composite key: the date from the log file name, and the bit of text after the ERROR string in the log line.
Test RE Mode
If you want to test how a line of text and your regular expression interact, use the -R "some regular expression" and -L "line of text" options to see the sub-groups gb will find.
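For example, using the regex from above (the log line here is made up for illustration):

gb -R '.*ERROR(.{5,40}).*' -L '2019-03-01 00:12:44 ERROR timeout talking to backend'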
Help: gb -h
csv-groupby ver: 0.8.2 rev: 194704d date: 2020-08-29
Execute a sql-like group-by on arbitrary text or csv files. Field indices start at 1.
Note that -l, -f, and -i define where data comes from. If none of these options is given, it defaults to reading
stdin.
USAGE:
gb [FLAGS] [OPTIONS]
FLAGS:
-c, --csv_output Write delimited output
-v Verbosity - use more than one v for greater detail
--skip_header Skip the first (header) line
--no_record_count Do not write record counts
--noop_proc Do no real work - no parsing - for testing
-i Read a list of files to parse from stdin
--stats Write stats after processing
--no_output do not write summary output
--recycle_io_blocks_disable disable reusing memory io blocks
--disable_key_sort disables the key sort
-E, --print_examples Prints example usage scenarios - extra help
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-R, --test_re <testre> One-off test of a regex
-L, --test_line <testline>... Line(s) of text to test with -R option instead of stdin
-k, --key_fields <keyfield>... Fields that will act as group by keys
-u, --unique_values <uniquefield>... Fields to count distinct
-D, --write_distros <writedistros>... write unique value distro with -u option
--write_distros_upper <writedistrosupper> number highest value x count [default: 5]
--write_distros_bottom <writedistrobottom> number lowest value x count [default: 2]
-s, --sum_values <sumfield>... Sum fields as float64s
-a, --avg_values <avg_fields>... Average fields
-x, --max_nums <max_num_fields>... Max fields as float64s
-n, --min_nums <min_num_fields>... Min fields as float64s
-X, --max_strings <max_str_fields>... Max fields as string
-N, --min_strings <min_str_fields>... Min fields as string
-A, --field-aliases <field_aliases>... Alias the field positions to meaningful names
-r, --regex <re-str>... Regex mode regular expression to parse fields
-p, --path_re <re-path> Match path on files and get fields from sub groups
--re_line_contains <re-line-contains> Grep lines that must contain a string
-d, --input_delimiter <delimiter> Delimiter if in csv mode [default: ,]
-q, --quote <quote> csv quote character
-e, --escape <escape> csv escape character
-C, --comment <comment> csv mode comment character
-o, --output_delimiter <outputdelimiter> Output delimiter for written summaries [default: ,]
--empty_string <empty> Empty string substitution [default: ]
-t, --parse_threads <parse-threads> Number of parser threads [default: 12]
-I, --io_threads <io-threads> Number of IO threads [default: 6]
--queue_size <thread-qsize> Queue length of blocks between threads [default: 48]
--path_qsize <path-qsize> Queue length of paths to IO slicer threads [default: 0]
--io_block_size <io-block-size> IO block size - 0 use default [default: 0]
--q_block_size <q-block-size> Block size between IO thread and worker [default: 256K]
-l <file_list> A file containing a list of input files
-f <file>... List of input files
-w, --walk <walk> recursively walk a tree of files to parse
--null_write <nullstring> String to use for NULL fields [default: NULL]
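As a usage note on the delimiter options, tab-delimited input can be handled by passing a literal tab as the delimiter (the file name here is hypothetical; $'\t' is bash syntax for a tab character):

gb -f data.tsv -d $'\t' -k 1 -s 3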
TODO/ideas:
- --where 1:<re>
- --where_not 1:<re>
- --tail &| --head
- --count_ge <num>
- --count_le
- ~--sort_count_desc
- --sort_count_asc
- Fix -i to also support -p option
  - fixed
- More better readme - sometimes more is not better
- More diverse unit testing
- Oh musl, wherefore art thou musl? Why is your alloc so bad...
- more aggregates: min, max, empty_count, number_count, zero_count
  - avg and sum done
  - max and min done for strings and numbers
- faster sort on mixed key fields before output
- faster output for csv output mode
  - faster, but not as fast as possible
- do more work on multi-line re mode - not sure how it should really work yet
  - This does function and I have used it on well-formed xml, but not sure it will be a thing or not.
- flat mode - no summary, just write X fields to the output as found - a kind of s/()..()..()/$1,$2,$3,.../;
- use RE find/search for matches instead of the line-bound interface?