#csv #parser #control-codes

no-std c0sv

Binary CSV, using C0 ASCII control codes

3 unstable releases

0.2.0 Dec 4, 2020
0.1.1 Nov 28, 2020
0.1.0 Nov 28, 2020

#48 in #parsers

MIT license

23KB
418 lines

c0sv

This is an incredibly simple binary CSV format whose fields and records are delimited by ASCII control characters. It uses SOH (Start of Heading), STX (Start of Text), ETX (End of Text), ESC (Escape), US (Unit Separator), and RS (Record Separator).

The stream is expressed in the following faux-EBNF (where * represents any single byte):

stream = [header], STX, records, ETX
header = SOH, units
units = unit, { US, unit }
unit = { (* - control) | (ESC, *) }
control = SOH | STX | ETX | ESC | US | RS
records = units, { RS, units }
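The grammar above can be followed almost literally in code. The sketch below is a minimal, standalone parser for a single stream; it does not use this crate's API, and the names `Document`, `parse`, and `read_units` are hypothetical. It assumes the standard ASCII values for the six control codes and ignores any bytes that follow the closing ETX.

```rust
// Standard ASCII values for the control codes named in the grammar.
const SOH: u8 = 0x01;
const STX: u8 = 0x02;
const ETX: u8 = 0x03;
const ESC: u8 = 0x1B;
const RS: u8 = 0x1E;
const US: u8 = 0x1F;

/// Hypothetical in-memory form of one stream.
#[derive(Debug, PartialEq)]
pub struct Document {
    pub header: Option<Vec<Vec<u8>>>,
    pub records: Vec<Vec<Vec<u8>>>,
}

/// stream = [header], STX, records, ETX
pub fn parse(input: &[u8]) -> Result<Document, String> {
    let mut bytes = input.iter().copied();
    let header = match bytes.next() {
        Some(SOH) => {
            // header = SOH, units (and the units must be closed by STX)
            let (units, end) = read_units(&mut bytes)?;
            if end != STX {
                return Err("header must be followed by STX".into());
            }
            Some(units)
        }
        Some(STX) => None,
        _ => return Err("expected SOH or STX".into()),
    };
    // records = units, { RS, units }
    let mut records = Vec::new();
    loop {
        let (units, end) = read_units(&mut bytes)?;
        records.push(units);
        match end {
            RS => continue,
            ETX => break,
            other => return Err(format!("unexpected control byte {:#04x}", other)),
        }
    }
    Ok(Document { header, records })
}

/// units = unit, { US, unit }
/// Returns the units plus the unescaped control byte that ended them.
fn read_units(bytes: &mut impl Iterator<Item = u8>) -> Result<(Vec<Vec<u8>>, u8), String> {
    let mut units = Vec::new();
    let mut unit = Vec::new();
    loop {
        match bytes.next() {
            None => return Err("unexpected end of input".into()),
            // ESC, * : the next byte is taken literally, whatever it is.
            Some(ESC) => match bytes.next() {
                Some(b) => unit.push(b),
                None => return Err("dangling ESC at end of input".into()),
            },
            Some(US) => units.push(std::mem::take(&mut unit)),
            Some(end @ (SOH | STX | ETX | RS)) => {
                units.push(unit);
                return Ok((units, end));
            }
            Some(b) => unit.push(b),
        }
    }
}
```

Because a unit may be empty and `units` always contains at least one unit, this parser never produces a record with zero fields, which matches the grammar.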

This is mostly a simple experiment to see how feasible it would be to create a very CSV-like format using these ASCII control characters as delimiters (and particularly to use the control characters in the way they are intended to be used). It's probably not extraordinarily useful, because the only real purpose of CSV is exchange where manual readability and/or writability is important. If you want real binary flexibility, you're probably better off using an established binary format, like bincode or MessagePack.

Still, this does have some convenient aspects, such as the fact that it is rather easily streamable, allowing processing records while needing only one in memory at a time. At the expense of being slower to parse, this format is capable of being slightly smaller than most other binary formats, as there are no length prefixes.
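To illustrate the streaming property, here is a hedged sketch of a reader that yields one record at a time from any `std::io::Read` source, so only the current record is held in memory. It is not this crate's API: the `Records` type and its methods are hypothetical, and for brevity the constructor simply skips everything up to the first unescaped STX, discarding the header if there is one.

```rust
use std::io::{self, Bytes, Read};

const SOH: u8 = 0x01;
const STX: u8 = 0x02;
const ETX: u8 = 0x03;
const ESC: u8 = 0x1B;
const RS: u8 = 0x1E;
const US: u8 = 0x1F;

/// Hypothetical streaming reader over the records of one document.
pub struct Records<R: Read> {
    bytes: Bytes<R>,
    done: bool,
}

impl<R: Read> Records<R> {
    /// Skips the optional header and positions the reader after STX.
    pub fn new(reader: R) -> io::Result<Self> {
        let mut this = Records { bytes: reader.bytes(), done: false };
        loop {
            match this.next_byte()? {
                ESC => {
                    this.next_byte()?; // escaped byte inside the header
                }
                STX => return Ok(this),
                _ => {}
            }
        }
    }

    fn next_byte(&mut self) -> io::Result<u8> {
        match self.bytes.next() {
            Some(result) => result,
            None => Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "stream ended before ETX",
            )),
        }
    }

    /// Reads units until the record is closed by RS or the document by ETX.
    fn read_record(&mut self) -> io::Result<Vec<Vec<u8>>> {
        let mut record = Vec::new();
        let mut unit = Vec::new();
        loop {
            match self.next_byte()? {
                ESC => unit.push(self.next_byte()?),
                US => record.push(std::mem::take(&mut unit)),
                end @ (RS | ETX) => {
                    self.done = end == ETX;
                    record.push(unit);
                    return Ok(record);
                }
                SOH | STX => {
                    return Err(io::Error::new(
                        io::ErrorKind::InvalidData,
                        "unescaped SOH or STX inside the records",
                    ))
                }
                byte => unit.push(byte),
            }
        }
    }
}

impl<R: Read> Iterator for Records<R> {
    type Item = io::Result<Vec<Vec<u8>>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        let result = self.read_record();
        if result.is_err() {
            self.done = true; // stop iterating after the first error
        }
        Some(result)
    }
}
```

`Read::bytes` pulls one byte per call, so when reading from a file or socket it is worth wrapping the source in a `std::io::BufReader` first.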

Advantages

Over CSV:

  • Headers are explicitly delimited, so there is never any guessing about whether the first row constitutes a header.
  • There is only one method of escaping, so there is no confusion about how to handle special characters like commas or newlines in fields. Parsing is also simpler, as CSV parsers often handle newlines embedded in fields naively.
  • The end of the document is also explicitly delimited, so multiple documents with headers can be concatenated in the same stream and parsed without any loss.
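The concatenation point can be sketched as follows: because every document ends with an unescaped ETX, a stream of several documents can be split without parsing the fields themselves. The function name `split_documents` is hypothetical and not part of this crate; the sketch assumes the standard ASCII values for ESC and ETX and silently drops any trailing bytes that are not closed by an ETX.

```rust
const ETX: u8 = 0x03;
const ESC: u8 = 0x1B;

/// Splits a buffer holding concatenated documents into one slice per
/// document, each ending with its own unescaped ETX.
pub fn split_documents(stream: &[u8]) -> Vec<&[u8]> {
    let mut docs = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < stream.len() {
        match stream[i] {
            // Skip the byte after ESC so an escaped ETX is not a boundary.
            ESC => i += 1,
            ETX => {
                docs.push(&stream[start..=i]);
                start = i + 1;
            }
            _ => {}
        }
        i += 1;
    }
    docs
}
```

Each returned slice is a complete stream in the sense of the grammar, including its own header if it has one.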

Disadvantages

  • Control characters aren't really printable, so this format cannot be easily edited with a text editor.
  • Because fields and records are delimited with a separator, there is a minimum of a single field and a single record. A document represented by [STX][ETX] is a document with a single record consisting of a single empty field. This means it is impossible to represent a record without any fields (it will be a single empty field, rather than having no fields), or a document without any records (it will be a document with a single record of a single empty field).

That last one could feasibly be solved by using the record separator and unit separator as prefixes instead of separators, but that's less fun, and doesn't fit the semantics of the control characters. It also increases the document size by one extra byte per record.
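The ambiguity and the prefix alternative can be made concrete with two small encoders. This is only a sketch: `encode` and `encode_prefixed` are hypothetical names rather than this crate's API, units are taken as `&str` purely for brevity (real units are arbitrary bytes), and no header is written.

```rust
const SOH: u8 = 0x01;
const STX: u8 = 0x02;
const ETX: u8 = 0x03;
const ESC: u8 = 0x1B;
const RS: u8 = 0x1E;
const US: u8 = 0x1F;

/// Writes one unit, putting ESC in front of every control byte.
fn push_unit(out: &mut Vec<u8>, unit: &str) {
    for &b in unit.as_bytes() {
        if matches!(b, SOH | STX | ETX | ESC | RS | US) {
            out.push(ESC);
        }
        out.push(b);
    }
}

/// Separator form, as in the grammar: RS between records, US between units.
pub fn encode(records: &[&[&str]]) -> Vec<u8> {
    let mut out = vec![STX];
    for (i, record) in records.iter().enumerate() {
        if i > 0 {
            out.push(RS);
        }
        for (j, unit) in record.iter().enumerate() {
            if j > 0 {
                out.push(US);
            }
            push_unit(&mut out, unit);
        }
    }
    out.push(ETX);
    out
}

/// Prefix form: RS before every record, US before every unit.
pub fn encode_prefixed(records: &[&[&str]]) -> Vec<u8> {
    let mut out = vec![STX];
    for record in records {
        out.push(RS);
        for unit in record.iter() {
            out.push(US);
            push_unit(&mut out, unit);
        }
    }
    out.push(ETX);
    out
}
```

With the separator form, `encode(&[])`, `encode(&[&[]])`, and `encode(&[&[""]])` all produce the same two bytes, [STX][ETX], so a reader cannot tell them apart. The prefix form yields three different outputs for those inputs, at the cost of the extra delimiter bytes.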

Dependencies

~1MB
~22K SLoC