13 breaking releases

0.14.0 Apr 2, 2024
0.13.0 Mar 4, 2024
0.12.0 Mar 2, 2024

MIT license

1.5MB
1.5K SLoC

JSN

A queryable, streaming, JSON pull-parser with low allocation overhead.

  • Pull parser? The parser is implemented as an iterator that emits tokens.
  • Streaming? The JSON document being parsed is never fully loaded into memory. It is read and validated byte by byte, which makes it well suited to large JSON documents.
  • Queryable? You can configure the parser to emit and allocate tokens only for the parts of the input you are interested in.
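
To make the "pull" and "streaming" ideas concrete, here is a minimal sketch using only the standard library. It is not jsn's implementation: the `Structural` enum and `StructuralTokens` type are hypothetical names invented for this illustration, and the tokenizer only recognises braces and brackets (skipping string contents) rather than validating JSON.

```rust
use std::io::{Bytes, Read};

/// Hypothetical token type for this sketch: structural characters only.
#[derive(Debug, PartialEq)]
enum Structural {
    ObjectStart,
    ObjectEnd,
    ArrayStart,
    ArrayEnd,
}

/// A toy pull tokenizer: wraps any `Read` and yields one token per `next()` call.
struct StructuralTokens<R: Read> {
    bytes: Bytes<R>,
    in_string: bool,
    escaped: bool,
}

impl<R: Read> StructuralTokens<R> {
    fn new(reader: R) -> Self {
        Self { bytes: reader.bytes(), in_string: false, escaped: false }
    }
}

impl<R: Read> Iterator for StructuralTokens<R> {
    type Item = std::io::Result<Structural>;

    fn next(&mut self) -> Option<Self::Item> {
        // Pull bytes one at a time; the whole document is never held in memory.
        loop {
            let byte = match self.bytes.next()? {
                Ok(b) => b,
                Err(e) => return Some(Err(e)),
            };
            if self.in_string {
                // Skip string contents, honouring backslash escapes.
                if self.escaped {
                    self.escaped = false;
                } else if byte == b'\\' {
                    self.escaped = true;
                } else if byte == b'"' {
                    self.in_string = false;
                }
                continue;
            }
            match byte {
                b'"' => self.in_string = true,
                b'{' => return Some(Ok(Structural::ObjectStart)),
                b'}' => return Some(Ok(Structural::ObjectEnd)),
                b'[' => return Some(Ok(Structural::ArrayStart)),
                b']' => return Some(Ok(Structural::ArrayEnd)),
                _ => {}
            }
        }
    }
}

/// Convenience helper for the example: collect all tokens from a string.
fn structural_tokens(input: &str) -> Vec<Structural> {
    StructuralTokens::new(input.as_bytes()).map(|t| t.unwrap()).collect()
}

fn main() {
    // The "}" inside the string literal is skipped, not reported as a token.
    let tokens = structural_tokens(r#"{ "a": [ "}" ] }"#);
    println!("{:?}", tokens);
}
```

The caller decides when to ask for the next token, which is what distinguishes a pull parser from a callback-driven (push) one. For a file, wrapping the reader in `std::io::BufReader` avoids a system call per byte.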

Input is expected to conform to RFC 8259. However, newline-delimited JSON and concatenated JSON formats are also supported.

Input can come from any source that implements the Read trait (e.g. a file, byte slice, or network socket).
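
As a rough illustration of what "any source that implements the Read trait" buys you, the sketch below writes one function that is generic over `std::io::Read` and feeds it two different in-memory sources. The `count_newlines` helper is a hypothetical name made up for this example and is not part of jsn; it simply stands in for any consumer that reads a stream in small chunks.

```rust
use std::io::{self, Cursor, Read};

/// Counts newline bytes from any `Read` source using a small fixed buffer,
/// so memory use does not depend on the size of the input.
fn count_newlines<R: Read>(mut source: R) -> io::Result<usize> {
    let mut buf = [0u8; 4096];
    let mut count = 0;
    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            // A read of zero bytes signals end of input.
            return Ok(count);
        }
        count += buf[..n].iter().filter(|&&b| b == b'\n').count();
    }
}

fn main() -> io::Result<()> {
    let ndjson = "{\"a\":1}\n{\"a\":2}\n";

    // A byte slice implements `Read`...
    println!("{}", count_newlines(ndjson.as_bytes())?);

    // ...and so does an in-memory `Cursor`. `std::fs::File` and
    // `std::net::TcpStream` implement `Read` too and could be passed the same way.
    println!("{}", count_newlines(Cursor::new(ndjson.as_bytes().to_vec()))?);

    Ok(())
}
```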

Basic Usage

use jsn::{TokenReader, mask::*, Format};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let data = r#"
        {
            "name": "John Doe",
            "age": 43,
            "nicknames": [ "joe" ],
            "phone": {
                "carrier": "Verizon",
                "numbers": [ "+44 1234567", "+44 2345678" ]
            }
        }
        {
            "name": "Jane Doe",
            "age": 32,
            "nicknames": [ "J" ],
            "phone": {
                "carrier": "AT&T",
                "numbers": ["+33 38339"]
            }
        }
    "#;

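    // Select the first element of each "numbers" array,
    // plus every "name" and "age" value.
    let mask = key("numbers").and(index(0))
        .or(key("name"))
        .or(key("age"));
    // The input holds two back-to-back documents, hence Format::Concatenated.
    let mut iter = TokenReader::new(data.as_bytes())
        .with_mask(mask)
        .with_format(Format::Concatenated)
        .into_iter();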

    assert_eq!(iter.next().unwrap()?, "John Doe");
    assert_eq!(iter.next().unwrap()?, 43);
    assert_eq!(iter.next().unwrap()?, "+44 1234567");
    assert_eq!(iter.next().unwrap()?, "Jane Doe");
    assert_eq!(iter.next().unwrap()?, 32);
    assert_eq!(iter.next().unwrap()?, "+33 38339");
    assert_eq!(iter.next(), None);

    Ok(())
}

Quick Explanation

Like traditional streaming parsers, the parser emits JSON tokens. The twist is that you can query them in a "fun" way. The best analogy is bitmasks.

If you can use a bitwise AND to extract a bit pattern:

input   : 0101 0101
AND
bitmask : 0000 1111
=
pattern : 0000 0101

Why can't you use a bitwise AND to extract a JSON token pattern?

input     : { "hello": { "name" : "world" } }
AND
json mask : {something that extracts a "hello" key}
=
pattern   : _ ________ { "name" : "world" } _

That {something that extracts a "hello" key} is what this crate provides.
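
The analogy can be sketched in a few lines of plain Rust. The first function is the literal bitwise AND from the example above. The second is a toy stand-in for a "JSON mask": it assumes the document has already been flattened into hypothetical (dotted path, value) pairs, which is an assumption of this sketch and not how jsn represents tokens or masks.

```rust
/// Bitwise AND keeps only the bits selected by the mask.
fn apply_bitmask(input: u8, mask: u8) -> u8 {
    input & mask
}

/// Toy "JSON mask": keeps only the values whose dotted path passes through `key`.
/// The (path, value) representation is invented for this illustration.
fn apply_key_mask<'a>(tokens: &[(&'a str, &'a str)], key: &str) -> Vec<&'a str> {
    tokens
        .iter()
        .filter(|(path, _)| path.split('.').any(|segment| segment == key))
        .map(|(_, value)| *value)
        .collect()
}

fn main() {
    // 0101 0101 AND 0000 1111 = 0000 0101
    println!("{:08b}", apply_bitmask(0b0101_0101, 0b0000_1111));

    // Everything outside the "hello" key is masked out.
    let tokens = [("hello.name", "world"), ("other", "ignored")];
    println!("{:?}", apply_key_mask(&tokens, "hello"));
}
```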

Memory Footprint

jsn allows you to select the parts of your JSON that are of interest. What you do with those parts and how long you keep them in memory is up to you.

To illustrate this, I'll use the Valgrind DHAT tool to profile the heap memory usage of two similar programs. Both programs read a JSON file and extract keys from it. I'll be using the sf-city-lots JSON file (189 MB) from here.

  • examples/store-tokens.rs: This program keeps the extracted tokens in a Vec.
  • examples/print-tokens.rs: This program prints the tokens as they are encountered.

valgrind --tool=dhat ./target/profiling/examples/store-tokens ~/downloads/citylots.json
# ==1146722== Total:     13,823,524 bytes in 196,541 blocks
# ==1146722== At t-gmax: 7,529,044 bytes in 196,515 blocks

valgrind --tool=dhat ./target/profiling/examples/print-tokens ~/downloads/citylots.json
# ==1152944== Total:     1,240,708 bytes in 196,524 blocks
# ==1152944== At t-gmax: 9,367 bytes in 9 blocks

The first number (Total) is the total amount of heap memory that was allocated by the program during its execution.

The second number (At t-gmax) is the maximum amount of heap memory allocated at any one time during execution.

Unsurprisingly, store-tokens.rs has the higher footprint. Even so, the crate's utility is clear: the total memory allocated (about 13.8 MB) is still an order of magnitude less than the size of the file (189 MB).

Things get better when you can operate on tokens immediately as they are yielded (i.e. you do not accumulate them). Not only do you allocate less in total (about 1.2 MB), but your peak footprint is far smaller: print-tokens.rs ripped through the file while using at most about 9 KB (9,367 bytes) of heap memory at any one time.
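
The difference between the two programs comes down to a pattern that can be shown with any iterator. The sketch below does not use jsn or reproduce the example programs: `store_tokens` and `total_len` are hypothetical helpers, and the tokens are plain `String`s generated on the fly.

```rust
/// Accumulates every token: peak heap use grows with the number of tokens kept.
fn store_tokens<I: Iterator<Item = String>>(tokens: I) -> Vec<String> {
    tokens.collect()
}

/// Handles each token and drops it straight away: only one token is alive
/// at a time, so peak heap use stays roughly constant.
fn total_len<I: Iterator<Item = String>>(tokens: I) -> usize {
    tokens.map(|t| t.len()).sum()
}

fn main() {
    // A lazy source of owned tokens; each String is allocated only when pulled.
    let make_tokens = || (0..5).map(|i| format!("token-{}", i));

    let stored = store_tokens(make_tokens());
    println!("kept {} tokens in memory", stored.len());

    println!("streamed {} bytes without keeping any token", total_len(make_tokens()));
}
```

Both functions allocate one `String` per token, which is why the two DHAT runs report a similar number of blocks; what changes is how many of those allocations are alive at the same moment.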

Dependencies