#json5 #serde #json #parser #json-key #data-model #round-trip

json-five

JSON5 parser with round-trip capabilities and compatible with the serde data model

3 releases

new 0.1.2 Feb 6, 2025
0.1.1 Feb 6, 2025
0.1.0 Feb 5, 2025

#417 in Parser implementations

403 downloads per month

MIT license

275KB
7K SLoC

json-five-rs

This project provides a handwritten JSON5 tokenizer and recursive descent parser compatible with serde.

Key Features

  • Compatible with serde data model
  • Supports round-trip use cases with preservation/editing of whitespace and comments
  • Supports formatting (indent, compact formats, etc.) in serialization
  • Supports both model-based (AST) edits and token-based round-tripping
  • Performance-focused default tokenizer/parser that avoids copying input
  • Ergonomics-focused round-trip tokenizer/parser that produce structures with solely owned types for ease of editing
  • Supports basic parsing and serialization without serde (you may disable the default serde feature!)

Usage

You can use this lib with serde in the typical way:

use json_five::from_str;
use serde::Deserialize;
#[derive(Debug, PartialEq, Deserialize)]
struct MyData {
    name: String,
    count: i64,
    maybe: Option<f64>,
}

fn main() {
    let source = r#"
    // A possible JSON5 input
    {
      name: 'Hello',
      count: 42,
      maybe: NaN
    }
"#;

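    let parsed = from_str::<MyData>(source).unwrap();
    // NaN never compares equal to itself, so check each field individually
    // rather than comparing against an expected struct with assert_eq!.
    assert_eq!(parsed.name, "Hello");
    assert_eq!(parsed.count, 42);
    assert!(matches!(parsed.maybe, Some(v) if v.is_nan()));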
}

Examples

See the examples/ directory for examples of programs that utilize round-tripping features.

  • examples/json5-doublequote-fixer gives an example of tokenization-based round-tripping edits
  • examples/json5-trailing-comma-formatter gives an example of model-based round-tripping edits

Benchmarking

Benchmarks are available in the benches/ directory, and test data is in the data/ directory. A couple of benchmarks use large files that are not committed to this repo; run ./data/setupdata.sh to download them so that the big benchmarks are not skipped. The benchmarks compare json_five (this crate) to serde_json and json5-rs.

Notwithstanding the general caveats of benchmarks, json_five outperformed json5-rs in initial testing, typically by about 3-4x. At the time of writing (pre-v0), no performance optimizations have been made, so I expect performance to improve, at least marginally, in the future.

These benchmarks were run on Windows on an i9-10900K. This table won't be updated unless significant changes happen.

| test | json_five | serde_json | json5 |
|---|---|---|---|
| big | 580.31 ms | 150.39 ms | 3.0861 s |
| empty | 228.62 ns | 38.786 ns | 708.00 ns |
| medium-ascii | 199.88 ms | 59.008 ms | 706.94 ms |
| arrays | 578.24 ns | 100.95 ns | 1.3228 µs |
| objects | 922.91 ns | 205.75 ns | 2.0748 µs |
| nested-array | 22.990 µs | 5.0483 µs | 29.356 µs |
| nested-objects | 50.659 µs | 14.755 µs | 132.75 µs |
| string | 421.17 ns | 91.051 ns | 3.5691 µs |
| number | 238.75 ns | 36.179 ns | 779.13 ns |
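
The relative speeds can be derived directly from the table above. The following is a small standalone sketch (it does not use this crate) that converts each timing to nanoseconds and computes the ratios. The `Row` struct and the helper names are made up for this illustration.

```rust
/// One row of the benchmark table, with every timing converted to nanoseconds.
/// (Illustrative type; not part of json_five.)
struct Row {
    name: &'static str,
    json_five_ns: f64,
    serde_json_ns: f64,
    json5_ns: f64,
}

/// How many times faster json_five is than json5 on this row.
fn speedup_over_json5(row: &Row) -> f64 {
    row.json5_ns / row.json_five_ns
}

/// How many times slower json_five is than serde_json on this row.
fn slowdown_vs_serde_json(row: &Row) -> f64 {
    row.json_five_ns / row.serde_json_ns
}

/// The figures from the table (ms = 1e6 ns, µs = 1e3 ns, s = 1e9 ns).
fn table() -> Vec<Row> {
    vec![
        Row { name: "big", json_five_ns: 580.31e6, serde_json_ns: 150.39e6, json5_ns: 3.0861e9 },
        Row { name: "empty", json_five_ns: 228.62, serde_json_ns: 38.786, json5_ns: 708.00 },
        Row { name: "medium-ascii", json_five_ns: 199.88e6, serde_json_ns: 59.008e6, json5_ns: 706.94e6 },
        Row { name: "arrays", json_five_ns: 578.24, serde_json_ns: 100.95, json5_ns: 1.3228e3 },
        Row { name: "objects", json_five_ns: 922.91, serde_json_ns: 205.75, json5_ns: 2.0748e3 },
        Row { name: "nested-array", json_five_ns: 22.990e3, serde_json_ns: 5.0483e3, json5_ns: 29.356e3 },
        Row { name: "nested-objects", json_five_ns: 50.659e3, serde_json_ns: 14.755e3, json5_ns: 132.75e3 },
        Row { name: "string", json_five_ns: 421.17, serde_json_ns: 91.051, json5_ns: 3.5691e3 },
        Row { name: "number", json_five_ns: 238.75, serde_json_ns: 36.179, json5_ns: 779.13 },
    ]
}

fn main() {
    for row in table() {
        println!(
            "{:<15} {:>5.1}x faster than json5, {:>5.1}x slower than serde_json",
            row.name,
            speedup_over_json5(&row),
            slowdown_vs_serde_json(&row),
        );
    }
}
```

On these figures the speedup over json5 ranges from roughly 1.3x (nested-array) to roughly 8.5x (string), while serde_json remains faster on every row.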

Round-trip model

The rt module contains the round-trip parser. This is intended to be ergonomic for round-trip use cases, although it is still very possible to use the default parser (which is more performance-oriented) for certain round-trip use cases. The round-trip AST model produced by the round-trip parser includes additional context fields that describe the whitespace, comments, and (where applicable) trailing commas on each production. Moreover, unlike the default parser, the AST consists entirely of owned types, allowing for simplified in-place editing.

The context field holds a single-field struct whose wsc field (short for 'white space and comments') is a tuple of Strings representing the contextual whitespace and comments. For JSONArrayValue and JSONKeyValuePair, the last element of the wsc tuple is an Option<String>, which marks an optional trailing comma and holds any whitespace that follows it.

The context field is always an Option.

Contexts are associated with the following structs (which correspond to the JSON5 productions) and their context layout:

rt::parser::JSONText

Represents the top-level Text production of a JSON5 document. It consists solely of a single (required) value. It may have whitespace/comments before or after the value. The value field contains any JSONValue and the context field contains the context struct containing the wsc field, a two-length tuple that describes the whitespace before and after the value. In other words: { wsc.0 } value { wsc.1 }

use json_five::rt::parser::from_str;
use json_five::rt::parser::JSONValue;

let doc = from_str(" 'foo'\n").unwrap();
let context = doc.context.unwrap();

assert_eq!(&context.wsc.0, " ");
assert_eq!(doc.value, JSONValue::SingleQuotedString("foo".to_string()));
assert_eq!(&context.wsc.1, "\n");
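
To make the `{ wsc.0 } value { wsc.1 }` layout concrete, here is a minimal standalone sketch of how a document could be reassembled from its parts. The `Text` and `TextContext` types are local stand-ins written for this example; they only mirror the layout described above and are not the crate's own definitions. Falling back to the bare value when the context is `None` is an assumption of the sketch.

```rust
// Local stand-ins that mirror the described layout; NOT the crate's types.
struct TextContext {
    /// (whitespace/comments before the value, whitespace/comments after it)
    wsc: (String, String),
}

struct Text {
    /// The already-serialized value, e.g. "'foo'".
    value: String,
    context: Option<TextContext>,
}

/// Reassemble the source as: wsc.0, then the value, then wsc.1.
/// With no context, only the value is emitted (an assumption of this sketch).
fn render_text(text: &Text) -> String {
    match &text.context {
        Some(ctx) => format!("{}{}{}", ctx.wsc.0, text.value, ctx.wsc.1),
        None => text.value.clone(),
    }
}

fn main() {
    let mut doc = Text {
        value: "'foo'".to_string(),
        context: Some(TextContext { wsc: (" ".to_string(), "\n".to_string()) }),
    };
    // Unedited, the original source comes back byte for byte.
    assert_eq!(render_text(&doc), " 'foo'\n");

    // Because everything is owned, the leading whitespace can be edited in place.
    if let Some(ctx) = doc.context.as_mut() {
        ctx.wsc.0 = "// greeting\n".to_string();
    }
    assert_eq!(render_text(&doc), "// greeting\n'foo'\n");
}
```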

rt::parser::JSONValue::JSONObject

Member of the rt::parser::JSONValue enum representing JSON5 objects.

There are two fields: key_value_pairs, which is a Vec of JSONKeyValuePairs, and context whose wsc is a one-length tuple containing the whitespace/comments that occur after the opening brace. In non-empty objects, the whitespace that precedes the closing brace is part of the last item in the key_value_pairs Vec.
In other words: LBRACE { wsc.0 } [ key_value_pairs ] RBRACE
and: .context.wsc: (String,)

rt::parser::JSONKeyValuePair

The JSONKeyValuePair struct represents the 'JSON5Member' production. It has three fields: key, value, and context. The key is a JSONValue, in practice limited to JSONValue::Identifier, JSONValue::DoubleQuotedString, or JSONValue::SingleQuotedString. The value is any JSONValue.

Its context describes whitespace/comments that are between the key and :, between the : and the value, after the value, and (optionally) a trailing comma and whitespace trailing the comma.
In other words, roughly: key { wsc.0 } COLON { wsc.1 } value { wsc.2 } [ COMMA { wsc.3 } [ next_key_value_pair ] ]
and: .context.wsc: (String, String, String, Option<String>)

When context.wsc.3 is Some(...), it indicates the presence of a trailing comma (not included in the string) and holds the whitespace that follows the comma. It MUST be Some(...) for every member except the last one in the object.
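
The member layout and the comma rule can be sketched as follows. This is a standalone illustration: `Member` and `MemberContext` are hypothetical stand-ins that copy the `(String, String, String, Option<String>)` shape described above, and the behavior for a `None` context is an assumption of the sketch rather than documented crate behavior.

```rust
// Hypothetical stand-ins that copy the described wsc shape; NOT the crate's types.
struct MemberContext {
    /// (before ':', after ':', after the value, Some(after ',') if a comma is present)
    wsc: (String, String, String, Option<String>),
}

struct Member {
    key: String,   // serialized key, e.g. "name"
    value: String, // serialized value, e.g. "'Hello'"
    context: Option<MemberContext>,
}

/// Emit: key wsc.0 ':' wsc.1 value wsc.2 [ ',' wsc.3 ]
fn render_member(m: &Member) -> String {
    match &m.context {
        // Assumption: without a context, emit the tightest form and no comma.
        None => format!("{}:{}", m.key, m.value),
        Some(ctx) => {
            let (before_colon, after_colon, after_value, after_comma) = &ctx.wsc;
            let mut out = format!(
                "{}{}:{}{}{}",
                m.key, before_colon, after_colon, m.value, after_value
            );
            if let Some(ws) = after_comma {
                out.push(',');
                out.push_str(ws);
            }
            out
        }
    }
}

/// The rule above: every member except the last must carry Some(..) in wsc.3.
fn commas_are_valid(members: &[Member]) -> bool {
    let all_but_last = members.len().saturating_sub(1);
    members
        .iter()
        .take(all_but_last)
        .all(|m| matches!(&m.context, Some(ctx) if ctx.wsc.3.is_some()))
}

fn main() {
    let members = vec![
        Member {
            key: "name".to_string(),
            value: "'Hello'".to_string(),
            context: Some(MemberContext {
                wsc: (String::new(), " ".to_string(), String::new(), Some(" ".to_string())),
            }),
        },
        Member {
            key: "count".to_string(),
            value: "42".to_string(),
            context: Some(MemberContext {
                wsc: (String::new(), " ".to_string(), String::new(), None),
            }),
        },
    ];
    assert!(commas_are_valid(&members));
    let body: String = members.iter().map(render_member).collect();
    assert_eq!(format!("{{{}}}", body), "{name: 'Hello', count: 42}");
}
```

A check like `commas_are_valid` is the kind of validation one would want to run after editing a model by hand, since dropping the comma marker on a non-final member would serialize to an invalid document.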

rt::parser::JSONValue::JSONArray

Member of the rt::parser::JSONValue enum representing JSON5 arrays.

There are two fields on this struct: values, which is of type Vec<JSONArrayValue>, and context which holds a one-length tuple containing the whitespace/comments that occur after the opening bracket. In non-empty arrays, the whitespace that precedes the closing bracket is part of the last item in the values Vec.
In other words: LBRACKET { wsc.0 } [ values ] RBRACKET
and: .context.wsc: (String,)

rt::parser::JSONArrayValue

The JSONArrayValue struct represents a single member of a JSON5 Array. It has two fields: value, which is any JSONValue, and context which contains the contextual whitespace/comments around the member. The context's wsc field is a two-length tuple for the whitespace that may occur after the value and (optionally) after the comma following the value.
In other words, roughly: value { wsc.0 } [ COMMA { wsc.1 } [ next_value ]]
and: .context.wsc: (String, Option<String>)

When context.wsc.1 is Some(...), it indicates the presence of the comma (not included in the string); the string holds any whitespace following the comma. It MUST be Some(...) for every member except the last one in the array.
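
As an illustration of a model-based edit, the sketch below adds a trailing comma to the last element of an array while keeping the existing whitespace. It is standalone: `ArrayValue` and `ArrayValueContext` are hypothetical stand-ins with the `(String, Option<String>)` shape described above, not the crate's types, and moving the whitespace to sit after the new comma is one possible policy, not necessarily what the bundled formatter example does.

```rust
// Hypothetical stand-ins with the described wsc shape; NOT the crate's types.
struct ArrayValueContext {
    /// (after the value, Some(after ',') if a comma is present)
    wsc: (String, Option<String>),
}

struct ArrayValue {
    value: String, // serialized value, e.g. "1"
    context: Option<ArrayValueContext>,
}

/// Emit: '[' opening_wsc { value wsc.0 [ ',' wsc.1 ] } ']'
fn render_array(opening_wsc: &str, values: &[ArrayValue]) -> String {
    let mut out = String::from("[");
    out.push_str(opening_wsc);
    for v in values {
        out.push_str(&v.value);
        if let Some(ctx) = &v.context {
            out.push_str(&ctx.wsc.0);
            if let Some(after_comma) = &ctx.wsc.1 {
                out.push(',');
                out.push_str(after_comma);
            }
        }
    }
    out.push(']');
    out
}

/// Give the last element a trailing comma if it lacks one. The whitespace that
/// used to precede ']' is moved after the new comma so the layout is preserved.
fn ensure_trailing_comma(values: &mut [ArrayValue]) {
    if let Some(last) = values.last_mut() {
        let ctx = last
            .context
            .get_or_insert_with(|| ArrayValueContext { wsc: (String::new(), None) });
        if ctx.wsc.1.is_none() {
            let moved = std::mem::take(&mut ctx.wsc.0);
            ctx.wsc.1 = Some(moved);
        }
    }
}

fn main() {
    let mut values = vec![
        ArrayValue {
            value: "1".to_string(),
            context: Some(ArrayValueContext { wsc: (String::new(), Some(" ".to_string())) }),
        },
        ArrayValue {
            value: "2".to_string(),
            context: Some(ArrayValueContext { wsc: ("\n".to_string(), None) }),
        },
    ];
    assert_eq!(render_array("", &values), "[1, 2\n]");
    ensure_trailing_comma(&mut values);
    assert_eq!(render_array("", &values), "[1, 2,\n]");
}
```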

Other rt::parser::JSONValues

  • JSONValue::Integer(String)
  • JSONValue::Float(String)
  • JSONValue::Exponent(String)
  • JSONValue::Null
  • JSONValue::Infinity
  • JSONValue::NaN
  • JSONValue::Hexadecimal(String)
  • JSONValue::Bool(bool)
  • JSONValue::DoubleQuotedString(String)
  • JSONValue::SingleQuotedString(String)
  • JSONValue::Unary { operator: UnaryOperator, value: Box<JSONValue> }
  • JSONValue::Identifier(String) (for object keys only!).

Where these enum members have Strings, they represent the object as it was tokenized without any modifications (that is, for example, without any escape sequences un-escaped). The single- and double-quoted Strings do not include the surrounding quote characters. These members alone have no context.
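
Because the numeric variants carry the raw lexeme, any interpretation is left to the caller. The sketch below shows one way such lexemes could be turned into an f64. It is standalone and uses local stand-in enums (`Number`, `Sign`) rather than the crate's JSONValue and UnaryOperator. It also assumes that a hexadecimal lexeme keeps its `0x`/`0X` prefix, which follows from "as it was tokenized" but is not confirmed here.

```rust
// Local stand-ins for illustration only; these are NOT the crate's types.
#[derive(Debug, Clone, PartialEq)]
enum Sign {
    Plus,
    Minus,
}

#[derive(Debug, Clone, PartialEq)]
enum Number {
    Integer(String),
    Float(String),
    Exponent(String),
    Hexadecimal(String),
    Infinity,
    NaN,
    Unary { operator: Sign, value: Box<Number> },
}

/// Interpret a raw numeric lexeme as an f64, or None if it cannot be parsed.
fn to_f64(n: &Number) -> Option<f64> {
    match n {
        // Decimal forms are handed to the standard library's float parser.
        Number::Integer(s) | Number::Float(s) | Number::Exponent(s) => s.parse().ok(),
        // Assumption: the lexeme still carries its 0x / 0X prefix.
        Number::Hexadecimal(s) => {
            let digits = s.strip_prefix("0x").or_else(|| s.strip_prefix("0X"))?;
            u64::from_str_radix(digits, 16).ok().map(|v| v as f64)
        }
        Number::Infinity => Some(f64::INFINITY),
        Number::NaN => Some(f64::NAN),
        // A sign wraps another value, so recurse and then apply it.
        Number::Unary { operator, value } => {
            let inner = to_f64(value)?;
            Some(match operator {
                Sign::Plus => inner,
                Sign::Minus => -inner,
            })
        }
    }
}

fn main() {
    let hex = Number::Hexadecimal("0xFF".to_string());
    assert_eq!(to_f64(&hex), Some(255.0));

    let neg_inf = Number::Unary {
        operator: Sign::Minus,
        value: Box::new(Number::Infinity),
    };
    assert_eq!(to_f64(&neg_inf), Some(f64::NEG_INFINITY));
}
```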

Round-trip tokenizer

The rt::tokenize module contains some useful tools for round-tripping tokens. The Tokens produced by the rt tokenizer are owned types containing the lexeme from the source. There are two key functions in this module:

  • rt::tokenize::source_to_tokens
  • rt::tokenize::tokens_to_source

Each Token generated from source_to_tokens also contains some contextual information, such as line/col numbers, offsets, etc. This contextual information is not required for tokens_to_source -- that is: you can create new tokens and insert them into your tokens array and process those tokens back to JSON5 source without issue.

The tok_type attribute leverages the same json_five::tokenize::TokType types. Those are:

  • LeftBrace
  • RightBrace
  • LeftBracket
  • RightBracket
  • Comma
  • Colon
  • Name (Identifiers)
  • SingleQuotedString
  • DoubleQuotedString
  • BlockComment
  • LineComment (note: the lexeme includes the single trailing newline, if present; a comment at EOF with no final newline has none)
  • Whitespace
  • True
  • False
  • Null
  • Integer
  • Float
  • Infinity
  • Nan
  • Exponent
  • Hexadecimal
  • Plus
  • Minus
  • EOF

Note: string tokens will include surrounding quotes.
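
The idea behind token-based round-tripping can be sketched without the crate: if every token keeps its lexeme (including whitespace and comments), concatenating the lexemes reproduces the source, and an edit is just a change to some lexemes. The example below rewrites single-quoted strings as double-quoted ones, in the spirit of the doublequote-fixer example. `Tok`, `Kind`, `to_source`, and `requote` are hypothetical names for this illustration, not the crate's Token, TokType, or tokens_to_source.

```rust
// Hypothetical stand-ins; NOT the crate's Token / TokType.
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    LeftBracket,
    RightBracket,
    Comma,
    Whitespace,
    SingleQuotedString,
    DoubleQuotedString,
}

#[derive(Debug, Clone, PartialEq)]
struct Tok {
    kind: Kind,
    lexeme: String, // string lexemes include their surrounding quotes
}

/// Concatenating every lexeme in order reproduces the source text.
fn to_source(tokens: &[Tok]) -> String {
    tokens.iter().map(|t| t.lexeme.as_str()).collect()
}

/// Turn a single-quoted lexeme into a double-quoted one: drop the escape on \',
/// add one to bare ", and leave every other escape sequence untouched.
fn requote(lexeme: &str) -> String {
    let inner = match lexeme.strip_prefix('\'').and_then(|s| s.strip_suffix('\'')) {
        Some(inner) => inner,
        None => return lexeme.to_string(), // not a single-quoted lexeme
    };
    let mut out = String::from("\"");
    let mut chars = inner.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => match chars.next() {
                Some('\'') => out.push('\''),
                Some(next) => {
                    out.push('\\');
                    out.push(next);
                }
                None => out.push('\\'),
            },
            '"' => out.push_str("\\\""),
            other => out.push(other),
        }
    }
    out.push('"');
    out
}

/// Rewrite every single-quoted string token in place.
fn double_quote_all(tokens: &mut [Tok]) {
    for t in tokens.iter_mut() {
        if t.kind == Kind::SingleQuotedString {
            t.lexeme = requote(&t.lexeme);
            t.kind = Kind::DoubleQuotedString;
        }
    }
}

fn main() {
    let tok = |kind, lexeme: &str| Tok { kind, lexeme: lexeme.to_string() };
    let mut tokens = vec![
        tok(Kind::LeftBracket, "["),
        tok(Kind::SingleQuotedString, "'a'"),
        tok(Kind::Comma, ","),
        tok(Kind::Whitespace, " "),
        tok(Kind::DoubleQuotedString, "\"b\""),
        tok(Kind::RightBracket, "]"),
    ];
    assert_eq!(to_source(&tokens), "['a', \"b\"]");
    double_quote_all(&mut tokens);
    assert_eq!(to_source(&tokens), "[\"a\", \"b\"]");
}
```

Only the lexemes matter when converting back to source, which matches the point above that positional information is not needed for newly created tokens.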

Notes

Status

This project is in very early phases. While the crate is usable right now, more thorough testing is needed to ensure that the tokenizer/parser rejects invalid documents.

Questions, discussions, and contributions are welcome. Right now, things are moving fast, so the best way to contribute is likely to just open an issue.

Expect breaking changes for now, even in patch releases.

Serde is optional

Using serde is optional. Some use cases do not need serde's deserialization methods and rely only on the tokenizer and/or AST features. The serde feature is enabled by default, but it can be disabled. Even without it, the parser modules provide functions and methods for parsing and serialization, including the ability to customize the style.
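
For example, a dependency entry along these lines should turn the default features off. This is a sketch that assumes the package name is `json-five`, as published, and that `serde` is the only default feature you want to drop; adjust the version to the release you use.

```toml
[dependencies]
# Assumption: package name `json-five`; `default-features = false` disables
# the default `serde` feature described above.
json-five = { version = "0.1", default-features = false }
```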

TODOs

Some things I need to implement and some things I may or may not implement. In rough priority order:

  • Move documentation from readme to crate documentation
  • Provide methods for safely editing models (e.g., validate that, when serialized, the model will produce a valid JSON5 document). This may also let us adjust the visibility of certain attributes.
  • Provide a json5! macro similar to serde_json's json! macro
  • Investigate no_std support
  • Optimize the round-trip tokenizer to avoid processing the input twice
  • More serialization formatting options (e.g., prefer single- or double-quoted strings, try to use identifiers where possible, etc.)
  • Incremental parsing. Originally, an incremental tokenizer/parser was actually developed. In testing, speeds were the same or worse. Maybe it could be done in a performant way. But this may be useful for specific use cases, such as memory-constrained environments, very large JSON5 files (why?), or use cases where the input is streamed (say, over the network).
  • Publish crate
  • Benchmarks
  • Basic formatting options (indent, compact, trailing comma)
  • Complete logic for serialization of values (specifically: processing all [unicode] escape sequences in strings/identifiers and handling certain float formats like .0 and 1.)
  • Come up with a way to reject invalid unicode escape sequences (e.g., when an illegal escape sequence is used at the start of an identifier)
  • Validate correctness of the tokenizer (specifically: use of is_alphabetic may not comport with the JSON5 spec)
