2 unstable releases

new 0.1.0 Feb 10, 2025
0.0.1 Feb 9, 2025

MIT/Apache

deser-incomplete: Deserialize incomplete or broken data with Serde

Parse incomplete or broken data with existing Serde data formats.

This is nice for ingesting streaming JSON, which is technically invalid until the stream is done. By tolerating premature end of input, we can immediately make use of the streaming input.

[Figure: someone is slowly typing JSON into a terminal program called "live". The JSON is an array of objects. The program gradually renders the JSON input as Rust debug output, and as a table. The fields of the Rust struct are printed even though they are missing in the JSON input.]

Here, we wrapped serde_json with deser-incomplete, and printed the Rust debug representation of the result. We also reserialized to JSON and let nushell do its beautiful table formatting.

The JSON can also come from an external program. Here is a demo program that computes disk usage of directories and outputs the results as JSON. In true Unix style, displaying for the user is a separate concern, implemented by a separate program.

[Figure: a Unix pipeline with two programs. The source program computes the disk size of a bunch of directories and outputs a JSON array of objects. The sink program pretty-prints the JSON table. Computing the disk size takes a while, and you can see which directory is being analyzed because the result for that directory is empty while it is computing.]

deser-incomplete sits between #[derive(Deserialize)] and the data format. When a parse error is detected (presumably because the input ended), it safely halts parsing.

[Diagram: this library sits in between Deserialize and Deserializer. Information about the parsed data is successfully sent from Deserializer through deser-incomplete to Deserialize, but errors from Deserializer are blocked.]

How to use: JSON and YAML

// JSON: the input stops after "4, ", yet the two complete elements are returned.
let result: Result<Vec<u32>, deser_incomplete::Error<serde_json::Error>>
    = deser_incomplete::from_json_str("[3, 4, ");

assert_eq!(result.unwrap(), vec![3, 4]);

// YAML: the last list item is still empty, so only the first two are returned.
let result: Result<Vec<bool>, deser_incomplete::Error<serde_yaml::Error>>
    = deser_incomplete::from_yaml_str("- true\n- false\n- ");

assert_eq!(result.unwrap(), vec![true, false]);

Command line:

$ cargo install deser-incomplete --example repair-deser

$ echo '[3, 4' | repair-deser    # JSON by default
[3,4]

How to use: other data formats

  • You need to explain how to create the Deserializer by implementing Source.

    • If your format has &mut T: Deserializer then mimic source::JsonStr.
    • If your format has T: Deserializer then mimic source::YamlStr.
  • Some formats need a trailer for best results. For example, from_json_str appends a double-quote to the input before parsing; this lets serde_json see strings that weren't actually complete.

    from_yaml_str also preprocesses its input, and there it matters even more for good results.

    Add preprocessing with Options::set_random_trailer, or turn such preprocessing off with Options::disable_random_tag. You can see its effect with cargo run --example live -- --use-random-trailer false.

    I expect that binary formats don't need this preprocessing.

How this works internally

The implementation sits in between Deserialize, Deserializer, and Visitor, gathers metadata during the parse, and saves successful sub-parses. It also "backtracks": if a parse fails, then we retry, but just before the failure point we swap out the real Deserializer for a decoy which can bring deserialization to a safe end.
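
The backtracking idea can be illustrated with a small std-only toy model. This is a sketch under assumptions, not the crate's actual code: a slice of tokens stands in for the real Deserializer, and the names attempt, parse_with_backtracking and decoy_at are hypothetical, invented only for this illustration.

```rust
/// One deserialization attempt over `tokens` (a stand-in for the real data format).
///
/// `decoy_at` is the position at which the real source is swapped for a decoy
/// that simply reports "end of sequence". Returns the position of the first
/// failing token if the attempt fails.
fn attempt(tokens: &[Result<u32, String>], decoy_at: Option<usize>) -> Result<Vec<u32>, usize> {
    let mut values = Vec::new();
    for (pos, token) in tokens.iter().enumerate() {
        if decoy_at == Some(pos) {
            // The decoy takes over here and ends the sequence cleanly.
            break;
        }
        match token {
            Ok(value) => values.push(*value),
            Err(_) => return Err(pos),
        }
    }
    Ok(values)
}

/// First try a plain parse. If it fails, retry with the decoy placed
/// at the position where the first attempt failed.
fn parse_with_backtracking(tokens: &[Result<u32, String>]) -> Vec<u32> {
    match attempt(tokens, None) {
        Ok(values) => values,
        Err(failed_at) => attempt(tokens, Some(failed_at))
            .expect("every token before the first failure is Ok"),
    }
}
```

In this model the second attempt can never fail, because it stops exactly where the first attempt hit an error. The real library has to achieve the same effect through the Serde traits, which is considerably more involved.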

We apply multiple techniques. Suppose we want to parse Vec<u32> with serde_json. Here are the main techniques.

  1. (Example: parse empty JSON as [] .) — On the top level, if parsing fails immediately (e.g. empty input) but a sequence is expected, then return [].

    [setting name: fallback_seq_empty_at_root]

  2. (Example: parse JSON "[3" as [3] .) — When there are no more elements in a sequence, let the Visitor construct the Vec<u32> and put it somewhere safe. Now serde_json::Deserializer::deserialize_seq notices the missing close bracket and returns Err to us. We ignore Err, retrieve the saved value again, and return Ok of it.

    This happens for every deserialize_* method, not just sequences.

    [setting name: tolerate_deserializer_fail_after_visit_success]

  3. (Example: parse JSON "[3," as [3] .) — Inside a sequence, if parsing the next element will fail, then don't even try.

    This works using backtracking.

    [setting name: backtrack_seq_skip_item]

  4. Before deserializing, we append a random trailer.
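
To make the observable outcome of the first three rules concrete, here is a hand-written toy parser for a prefix of a JSON array of unsigned integers. It is only a sketch: it does not use Serde or backtracking, it merely reproduces the results given in the examples above, and parse_u32_array_prefix is a hypothetical name that is not part of this library.

```rust
/// Parses a possibly truncated JSON array of unsigned integers, e.g. `[3, 4, `.
/// Toy model of the outcomes of rules 1 to 3; not how the library works internally.
fn parse_u32_array_prefix(input: &str) -> Vec<u32> {
    let mut saved = Vec::new();

    // Rule 1: nothing usable at the root, so fall back to the empty sequence.
    let Some(mut rest) = input.trim_start().strip_prefix('[') else {
        return saved;
    };

    loop {
        rest = rest.trim_start();
        let len = rest.bytes().take_while(u8::is_ascii_digit).count();

        // Rule 3: if the next element cannot be parsed, do not even try.
        let Ok(value) = rest[..len].parse::<u32>() else {
            break;
        };
        saved.push(value);

        rest = rest[len..].trim_start();
        match rest.strip_prefix(',') {
            Some(after_comma) => rest = after_comma,
            // Rule 2: a missing `]` after the last element is tolerated,
            // because the elements seen so far are already saved.
            None => break,
        }
    }
    saved
}
```

With this sketch, the empty input gives an empty vector, and both `[3` and `[3,` give a vector containing only 3.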

Random trailer

Additionally we have a "random trailer" technique to get incomplete strings to parse. Unfortunately this technique is specific to the data format. This library implements it for JSON and YAML.

This technique is not applied by default for other data formats. Even with JSON/YAML, this technique can be turned off with Options::disable_random_tag.

Random trailer for JSON

We actually append tRANDOM" to every JSON input, where RANDOM are some randomly chosen letters. It turns out that serde_json can parse any prefix of valid JSON, as long as we concatenate tRANDOM" to it. Some examples:

  1. (Example: "hello .) The concatenation is "hellotRANDOM" and we actually get this back from serde_json through fn visit_borrowed_str --- after serde_json removed the double-quotes.

    In fn visit_borrowed_str, we notice that the string ends in RANDOM. Because this is a random string of letters, it cannot have been part of the incomplete JSON input. We remove the tRANDOM suffix and get back just "hello".

  2. (Example: "hello\ --- perhaps breaking in the middle of \n .) The concatenation is "hello\tRANDOM"; the \t parses to a tab character. We strip off <TAB>RANDOM and again return "hello".

  3. (Example: "hello" .) The concatenation is "hello"tRANDOM". Now serde_json visits the hello string as it would normally do, and if there should be any error after the visit, we can recover from it anyway as per tolerate_deserializer_fail_after_visit_success.
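
The append and strip steps from the three examples can be sketched with plain string functions. This is an illustration under assumptions rather than the library's code: append_trailer and strip_trailer are hypothetical names, the tag is passed in as a fixed argument instead of being generated randomly, and the string given to strip_trailer is assumed to be already unescaped, as it would be when it reaches the visitor.

```rust
/// Appends the trailer `t<random>"` to a possibly truncated JSON input.
fn append_trailer(input: &str, random: &str) -> String {
    format!("{input}t{random}\"")
}

/// Removes the trailer from a string value after the JSON parser has unescaped it.
/// Strings that do not end in the random tag are returned unchanged.
fn strip_trailer<'a>(visited: &'a str, random: &str) -> &'a str {
    let Some(rest) = visited.strip_suffix(random) else {
        // Example 3: the string was complete, so no trailer ended up inside it.
        return visited;
    };
    // Example 1: the tag is preceded by the literal letter `t`.
    // Example 2: the input ended right after a backslash, so `\t` became a tab.
    rest.strip_suffix('t')
        .or_else(|| rest.strip_suffix('\t'))
        .unwrap_or(visited)
}
```

For instance, with the tag QZXV, the visited string helloQZXV preceded by either a `t` or a tab character is reduced to hello, while a plain hello passes through untouched.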

Inspecting at runtime

There is extensive logging through the tracing library, which becomes visible if you initialize a tracing subscriber.

Guiding principles

The logic was hand-tweaked to the following criteria:

  1. ("soundness") For any complete and valid JSON/YAML, if you call deser-incomplete on a prefix, then its output should not contain data that doesn't exist in the complete JSON/YAML.

  2. ("monotone") A larger prefix should not parse to a shorter output.

  3. ("prompt") Ideally, each prefix contains as much data as we can be certain of.

The implementation of Deserializer (data format) may influence the quality of the output, but the default ruleset generally does very well with serde_json and serde_yaml.

There are extensive snapshot tests that validate the quality of the output on these criteria.
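
As a rough illustration of the first two criteria (and not a description of the crate's snapshot tests), the following std-only sketch checks every prefix of an input against any parser function that returns a Vec. It rests on a simplifying assumption: for a list, "soundness" is modelled as "the output for a prefix is a prefix of the output for the complete input". The names check_prefix_properties and parse_terminated are hypothetical.

```rust
/// Runs `parse` on every prefix of `complete` and checks two properties:
/// soundness (modelled as: the output is a prefix of the full output) and
/// monotonicity (a longer prefix never yields a shorter output).
fn check_prefix_properties<T, F>(complete: &str, parse: F) -> Result<(), String>
where
    T: PartialEq,
    F: Fn(&str) -> Vec<T>,
{
    let full = parse(complete);
    let mut previous_len = 0;
    // All char boundaries, plus the end of the string.
    let ends = complete
        .char_indices()
        .map(|(index, _)| index)
        .chain(std::iter::once(complete.len()));
    for end in ends {
        let prefix = &complete[..end];
        let output = parse(prefix);
        if !full.starts_with(&output) {
            return Err(format!("unsound at prefix {prefix:?}"));
        }
        if output.len() < previous_len {
            return Err(format!("not monotone at prefix {prefix:?}"));
        }
        previous_len = output.len();
    }
    Ok(())
}

/// Toy parser used to exercise the check: it only emits a number
/// once the number is terminated by `,` or `]`.
fn parse_terminated(input: &str) -> Vec<u32> {
    let mut items = Vec::new();
    let mut digits = String::new();
    for c in input.chars() {
        match c {
            '0'..='9' => digits.push(c),
            ',' | ']' => {
                if let Ok(value) = digits.parse::<u32>() {
                    items.push(value);
                }
                digits.clear();
            }
            _ => {}
        }
    }
    items
}
```

Calling check_prefix_properties("[3, 14, 15]", parse_terminated) succeeds. A parser that emits a number before it is terminated would fail on an input such as [34], because the prefix [3 yields a 3 that the complete input does not contain. The third criterion ("prompt") is not covered by this sketch.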

If you are curious, then it is possible to tweak the ruleset with unstable::UnstableCustomBehavior. We also have snapshot tests for some alternative parsing configurations.

Notes and limitations

  • Ideally, your data format should be relatively greedy, in the sense that it generates information quickly and does not need to look ahead in the serialized stream too much.

  • This approach lets us safely abort parsing and get a value, but we cannot skip over invalid segments of input. (For that you need an approach like tree-sitter.)

  • We cannot distinguish end of input (EOF) from invalid input.

  • YAML works well in general, but it is a bit less exhaustively tested than JSON. The randomized trailer is really important for YAML.

  • JSON: when parsing a floating-point number, if the end of input happens to fall directly after the decimal point, then the number is missing from the output.

  • For YAML, the randomized trailer uses a heuristic to see if we are currently in an escape sequence in a string --- but this heuristic can fail. In this case, the incomplete string will be missing from the output.

Have fun!

Acknowledgements

Thanks to Annisa Chand and @XAMPPRocky for useful feedback.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
