#field #cleanup #text #data #cli #clean #up

app cleanse

Small utility to clean up delimited (TSV/CSV) data

4 releases (breaking)

0.6.0 Aug 19, 2021
0.5.0 Aug 19, 2021
0.4.0 Aug 19, 2021
0.3.0 Aug 18, 2021

#50 in #cleanup

Unlicense/MIT

10KB
175 lines

🔥 cleanse

A small utility that cleans up delimited data so that it can be consumed by standard Unix tools.

Search words

Clean tsv data. Clean csv data.

Overview

Under the hood, this uses the csv crate to parse the data as CSV, respecting quoting and escaping rules. For each field, cleanse then tries to do the following three things:

  1. Inside a field, replace any instances of the delimiter character.
  2. Inside a field, replace any instances of the terminator character (\n).
  3. Inside a field, replace any malformed UTF-8 with the Unicode replacement character (U+FFFD).

If any changes were made to a field, a log entry is written with the record number, the field number, and the changes made.
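
The three steps above can be sketched in Rust using only the standard library. This is a minimal illustration and not cleanse's actual source: `clean_field`, the `Change` enum, and the `replacement` byte parameter are hypothetical names chosen here, with the variant names borrowed from the labels in the log output. The real tool reads its fields through the csv crate, which this sketch leaves out.

```rust
use std::borrow::Cow;

/// Kinds of change made to a field. The variant names are borrowed from the
/// labels in the log output; the enum itself is a hypothetical stand-in.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Change {
    DelimiterReplacement,
    TerminatorReplacement,
    FixedEncoding,
}

/// Clean one raw field and report which changes were made.
/// `replacement` is an assumed parameter and is expected to be an ASCII byte.
fn clean_field(
    raw: &[u8],
    delimiter: u8,
    terminator: u8,
    replacement: u8,
) -> (String, Vec<Change>) {
    let mut changes = Vec::new();
    let mut bytes = raw.to_vec();

    // Step 1: replace delimiter bytes that appear inside the field.
    if bytes.contains(&delimiter) {
        for b in bytes.iter_mut().filter(|b| **b == delimiter) {
            *b = replacement;
        }
        changes.push(Change::DelimiterReplacement);
    }

    // Step 2: replace terminator bytes that appear inside the field.
    if bytes.contains(&terminator) {
        for b in bytes.iter_mut().filter(|b| **b == terminator) {
            *b = replacement;
        }
        changes.push(Change::TerminatorReplacement);
    }

    // Step 3: decode as UTF-8, substituting U+FFFD for malformed sequences.
    // A borrowed result means the bytes were already valid UTF-8.
    let text = match String::from_utf8_lossy(&bytes) {
        Cow::Borrowed(s) => s.to_owned(),
        Cow::Owned(s) => {
            changes.push(Change::FixedEncoding);
            s
        }
    };

    (text, changes)
}
```

For example, `clean_field(b"a\tb", b'\t', b'\n', b' ')` returns the text `a b` together with a single `DelimiterReplacement` change. Replacing bytes before decoding is safe here because ASCII delimiter and terminator bytes never occur inside a multi-byte UTF-8 sequence.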

Example

$ cat data.tsv | cleanse -o cleansed.tsv -
Aug 18 15:28:02.556  INFO cleanse: Record number 23485, field number 35: [TerminatorReplacement]
Aug 18 15:28:02.724  INFO cleanse: Record number 31036, field number 24: [DelimiterReplacement]
Aug 18 15:28:02.984  INFO cleanse: Record number 44053, field number 35: [TerminatorReplacement]
Aug 18 15:28:03.456  INFO cleanse: Record number 66273, field number 35: [TerminatorReplacement]
Aug 18 15:28:05.149  INFO cleanse: Record number 150669, field number 14: [FixedEncoding]

Dependencies

~10–17MB
~194K SLoC