4 releases (breaking)
0.6.0 | Aug 19, 2021 |
---|---|
0.5.0 | Aug 19, 2021 |
0.4.0 | Aug 19, 2021 |
0.3.0 | Aug 18, 2021 |
#50 in #cleanup
10KB
175 lines
🔥 cleanse
A small utility that cleans up delimited data so it can be consumed by standard Unix tools.
Search words
Clean tsv data. Clean csv data.
Overview
Under the hood this uses the csv crate to parse the data as CSV, respecting quoting and escaping rules. For each field, cleanse will then try to do the following three things:
- Inside a field, replace any instances of the delimiter character.
- Inside a field, replace any instances of the terminator (\n) character.
- Inside a field, replace any malformed UTF-8 with the UTF-8 replacement character (U+FFFD).
If any changes were made to a field, a log entry is written with the record number, the field number, and the changes applied.
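The three per-field steps can be sketched roughly as follows. This is a minimal illustration, not cleanse's actual implementation: `clean_field` and `replace_byte` are hypothetical names, the sketch works on a single field that has already been parsed (it does not use the csv crate), and it assumes the delimiter, terminator, and replacement are single ASCII bytes. The replacement byte is passed in by the caller because this README does not say which character cleanse substitutes. The `Change` variants reuse the labels that appear in the log output.

```rust
use std::borrow::Cow;

/// Kinds of change, named after the labels seen in the log output.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Change {
    DelimiterReplacement,
    TerminatorReplacement,
    FixedEncoding,
}

/// Hypothetical helper: overwrite every `from` byte with `to`.
/// Returns true if at least one byte was rewritten.
fn replace_byte(bytes: &mut [u8], from: u8, to: u8) -> bool {
    let mut changed = false;
    for b in bytes.iter_mut() {
        if *b == from {
            *b = to;
            changed = true;
        }
    }
    changed
}

/// Hypothetical helper: clean one already-parsed field.
/// `replacement` is an assumption supplied by the caller and must be ASCII,
/// otherwise the byte-level substitution could itself produce invalid UTF-8.
fn clean_field(
    raw: &[u8],
    delimiter: u8,
    terminator: u8,
    replacement: u8,
) -> (String, Vec<Change>) {
    let mut changes = Vec::new();
    let mut bytes = raw.to_vec();

    // Steps 1 and 2: ASCII bytes never occur inside a multi-byte UTF-8
    // sequence, so replacing them byte-by-byte is safe.
    if replace_byte(&mut bytes, delimiter, replacement) {
        changes.push(Change::DelimiterReplacement);
    }
    if replace_byte(&mut bytes, terminator, replacement) {
        changes.push(Change::TerminatorReplacement);
    }

    // Step 3: malformed UTF-8 becomes U+FFFD. `from_utf8_lossy` only
    // allocates (Cow::Owned) when it actually had to repair something.
    let text = match String::from_utf8_lossy(&bytes) {
        Cow::Borrowed(valid) => valid.to_owned(),
        Cow::Owned(repaired) => {
            changes.push(Change::FixedEncoding);
            repaired
        }
    };

    (text, changes)
}
```

With a space as the assumed replacement, `clean_field(b"a\tb", b'\t', b'\n', b' ')` returns the text `a b` together with a single `DelimiterReplacement` entry. A caller could pair that list with the record and field numbers to produce log lines like the ones in the example below.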
Example
$ cat data.tsv | cleanse -o cleansed.tsv -
Aug 18 15:28:02.556 INFO cleanse: Record number 23485, field number 35: [TerminatorReplacement]
Aug 18 15:28:02.724 INFO cleanse: Record number 31036, field number 24: [DelimiterReplacement]
Aug 18 15:28:02.984 INFO cleanse: Record number 44053, field number 35: [TerminatorReplacement]
Aug 18 15:28:03.456 INFO cleanse: Record number 66273, field number 35: [TerminatorReplacement]
Aug 18 15:28:05.149 INFO cleanse: Record number 150669, field number 14: [FixedEncoding]
Dependencies
~10–17MB
~194K SLoC