13 breaking releases

Version | Released |
---|---|
0.14.0 | Apr 2, 2024 |
0.13.0 | Mar 4, 2024 |
0.12.0 | Mar 2, 2024 |
JSN
A queryable, streaming, JSON pull-parser with low allocation overhead.
- Pull parser?: The parser is implemented as an iterator that emits tokens.
- Streaming?: The JSON document being parsed is never fully loaded into memory. It is read & validated byte by byte, which makes it well suited to large JSON documents.
- Queryable?: You can configure the parser to only emit & allocate tokens for the parts of the input you are interested in.
The input is expected to conform to RFC 8259. Newline-delimited JSON and concatenated JSON are also supported.
Input can come from any source that implements the Read trait (e.g. a file, byte slice, network socket, etc.).
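To make the "pull" and "streaming" ideas concrete, here is a minimal std-only sketch. It is not jsn's implementation and does not validate JSON; the `Event` type, `StructuralEvents` iterator and `max_depth` helper are hypothetical names invented for this illustration. It only shows the shape of the approach: an iterator that reads from any `Read` source one byte at a time and hands back the next event when asked.

```rust
use std::io::{self, Read};

/// Hypothetical toy event: a JSON container was opened or closed.
#[derive(Debug, PartialEq)]
enum Event {
    Open,
    Close,
}

/// Pulls structural events from any `Read` source, one byte at a time.
struct StructuralEvents<R: Read> {
    bytes: io::Bytes<R>,
    in_string: bool,
    escaped: bool,
}

impl<R: Read> StructuralEvents<R> {
    fn new(source: R) -> Self {
        Self { bytes: source.bytes(), in_string: false, escaped: false }
    }
}

impl<R: Read> Iterator for StructuralEvents<R> {
    type Item = io::Result<Event>;

    fn next(&mut self) -> Option<Self::Item> {
        // Consume input only until the next event is found.
        for byte in self.bytes.by_ref() {
            let b = match byte {
                Ok(b) => b,
                Err(e) => return Some(Err(e)),
            };
            if self.in_string {
                // Skip string contents, honouring backslash escapes.
                if self.escaped {
                    self.escaped = false;
                } else if b == b'\\' {
                    self.escaped = true;
                } else if b == b'"' {
                    self.in_string = false;
                }
                continue;
            }
            match b {
                b'"' => self.in_string = true,
                b'{' | b'[' => return Some(Ok(Event::Open)),
                b'}' | b']' => return Some(Ok(Event::Close)),
                _ => {}
            }
        }
        None
    }
}

/// Computes the maximum nesting depth without ever holding the whole document.
fn max_depth<R: Read>(source: R) -> io::Result<usize> {
    let (mut depth, mut max) = (0usize, 0usize);
    for event in StructuralEvents::new(source) {
        match event? {
            Event::Open => {
                depth += 1;
                max = max.max(depth);
            }
            Event::Close => depth = depth.saturating_sub(1),
        }
    }
    Ok(max)
}

fn main() -> io::Result<()> {
    // The "}" inside the string value must not count as a closing brace.
    let doc = r#"{"a": [1, {"b": "}"}]}"#;
    println!("max depth: {}", max_depth(doc.as_bytes())?);
    Ok(())
}
```

The same function works unchanged on a file or socket; for those sources it is usually worth wrapping the reader in `std::io::BufReader` so that byte-at-a-time reads do not each become a system call.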
Basic Usage
use jsn::{TokenReader, mask::*, Format};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let data = r#"
    {
        "name": "John Doe",
        "age": 43,
        "nicknames": [ "joe" ],
        "phone": {
            "carrier": "Verizon",
            "numbers": [ "+44 1234567", "+44 2345678" ]
        }
    }
    {
        "name": "Jane Doe",
        "age": 32,
        "nicknames": [ "J" ],
        "phone": {
            "carrier": "AT&T",
            "numbers": ["+33 38339"]
        }
    }
    "#;

    let mask = key("numbers").and(index(0))
        .or(key("name"))
        .or(key("age"));

    let mut iter = TokenReader::new(data.as_bytes())
        .with_mask(mask)
        .with_format(Format::Concatenated)
        .into_iter();

    assert_eq!(iter.next().unwrap()?, "John Doe");
    assert_eq!(iter.next().unwrap()?, 43);
    assert_eq!(iter.next().unwrap()?, "+44 1234567");
    assert_eq!(iter.next().unwrap()?, "Jane Doe");
    assert_eq!(iter.next().unwrap()?, 32);
    assert_eq!(iter.next().unwrap()?, "+33 38339");
    assert_eq!(iter.next(), None);

    Ok(())
}
Quick Explanation
Like traditional streaming parsers, the parser emits JSON tokens. The twist is that you can query them in a "fun" way. The best analogy is bitmasks.
If you can use a bitwise AND to extract a bit pattern:
input : 0101 0101
AND
bitmask : 0000 1111
=
pattern : 0000 0101
Why can't you use a bitwise AND to extract a JSON token pattern?
input : { "hello": { "name" : "world" } }
AND
json mask : {something that extracts a "hello" key}
=
pattern : _ ________ { "name" : "world" } _
That {something that extracts a "hello" key} is what this crate provides.
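The analogy can be sketched in plain Rust. The first part is the ordinary bitwise AND from above. The second part is a toy model of a "JSON mask": a predicate over the path from the document root to a value, built with `and`/`or` combinators. The `Step` and `Mask` types and their matching rules are assumptions made up for this illustration; they mimic the shape of `key("numbers").and(index(0)).or(key("name"))` from Basic Usage, but jsn's real mask types and semantics may differ.

```rust
/// One step on the path from the document root to a value.
#[derive(Debug, PartialEq)]
enum Step {
    Key(&'static str),
    Index(usize),
}

/// Hypothetical toy mask: a predicate over the path to a value.
enum Mask {
    /// Some step on the path is this object key.
    Key(&'static str),
    /// The last step on the path is this array index.
    Index(usize),
    And(Box<Mask>, Box<Mask>),
    Or(Box<Mask>, Box<Mask>),
}

impl Mask {
    fn and(self, other: Mask) -> Mask {
        Mask::And(Box::new(self), Box::new(other))
    }

    fn or(self, other: Mask) -> Mask {
        Mask::Or(Box::new(self), Box::new(other))
    }

    /// Returns true if a value at `path` should be emitted.
    fn matches(&self, path: &[Step]) -> bool {
        match self {
            Mask::Key(k) => path.contains(&Step::Key(*k)),
            Mask::Index(i) => path.last() == Some(&Step::Index(*i)),
            Mask::And(a, b) => a.matches(path) && b.matches(path),
            Mask::Or(a, b) => a.matches(path) || b.matches(path),
        }
    }
}

/// Toy counterpart of the mask built in Basic Usage.
fn example_mask() -> Mask {
    Mask::Key("numbers")
        .and(Mask::Index(0))
        .or(Mask::Key("name"))
        .or(Mask::Key("age"))
}

fn main() {
    // The bit-level version of the analogy.
    let input: u8 = 0b0101_0101;
    let bitmask: u8 = 0b0000_1111;
    assert_eq!(input & bitmask, 0b0000_0101);

    // The path-level version: only selected values pass through the mask.
    let mask = example_mask();
    let first_number = [Step::Key("phone"), Step::Key("numbers"), Step::Index(0)];
    let second_number = [Step::Key("phone"), Step::Key("numbers"), Step::Index(1)];
    println!("first number selected:  {}", mask.matches(&first_number));
    println!("second number selected: {}", mask.matches(&second_number));
}
```

In this toy model the mask is evaluated against a path, so a streaming parser only needs to track where it currently is in the document to decide whether a value is worth allocating.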
Memory Footprint
jsn allows you to select the parts of your JSON that are of interest. What you do with those parts and how long you keep them in memory is up to you.
To illustrate this, I'll use the Valgrind DHAT tool to profile the heap memory usage of two similar programs. Both programs read & extract keys from a JSON file. I'll be using the sf-city-lots JSON file (189 MB) from here.
- examples/store-tokens.rs: This program keeps the extracted tokens in a Vec
- examples/print-tokens.rs: This program prints the tokens as they are encountered
valgrind --tool=dhat ./target/profiling/examples/store-tokens ~/downloads/citylots.json
# ==1146722== Total: 13,823,524 bytes in 196,541 blocks
# ==1146722== At t-gmax: 7,529,044 bytes in 196,515 blocks
valgrind --tool=dhat ./target/profiling/examples/print-tokens ~/downloads/citylots.json
# ==1152944== Total: 1,240,708 bytes in 196,524 blocks
# ==1152944== At t-gmax: 9,367 bytes in 9 blocks
The first number (Total) is the total amount of heap memory that was allocated by the program during its execution.
The second number (At t-gmax) is the maximum amount of heap memory that was allocated at any one time during execution.
Unsurprisingly, store-tokens.rs has a higher footprint. Yet the crate's utility is still clear: the total memory allocated (13 MB) is an order of magnitude less than the size of the file (189 MB).
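As a quick check of the arithmetic, the ratios can be computed directly from the DHAT figures quoted above. The only assumption here is that "189 MB" means 189 × 10^6 bytes; the exact file size is not given. Under that assumption the file is roughly 13.7 times larger than everything store-tokens allocated, store-tokens allocated about 11 times more in total than print-tokens, and its peak was about 800 times higher.

```rust
// Assumption: "189 MB" is read as 189 * 10^6 bytes.
const FILE_BYTES: u64 = 189_000_000;

// Figures copied from the DHAT output above.
const STORE_TOTAL: u64 = 13_823_524;
const STORE_PEAK: u64 = 7_529_044;
const PRINT_TOTAL: u64 = 1_240_708;
const PRINT_PEAK: u64 = 9_367;

/// Returns how many times larger `a` is than `b`.
fn ratio(a: u64, b: u64) -> f64 {
    a as f64 / b as f64
}

fn main() {
    println!("file size / store-tokens total:  {:.1}x", ratio(FILE_BYTES, STORE_TOTAL));
    println!("store total / print total:       {:.1}x", ratio(STORE_TOTAL, PRINT_TOTAL));
    println!("store peak / print peak:         {:.0}x", ratio(STORE_PEAK, PRINT_PEAK));
}
```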
Things get better when you can operate immediately on tokens as they are yielded (i.e. you do not accumulate them). Not only do you allocate less in total, but your peak footprint is far smaller. print-tokens.rs ripped through the file while using at most about 9 KB (9,367 bytes) of heap memory at any one time.
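The difference between the two programs comes down to what the consumer does with each item an iterator yields. The sketch below uses only the standard library; `tokens` is a hypothetical stand-in for a token iterator and is not how jsn or its examples are written. `store_all` keeps every item alive at once, while `longest_len` drops each item as soon as it has been inspected, so only one item is alive at a time.

```rust
/// Hypothetical stand-in for a parser: lazily yields `n` owned strings.
fn tokens(n: usize) -> impl Iterator<Item = String> {
    (0..n).map(|i| format!("token-{}", i))
}

/// "store-tokens" style: every yielded item stays on the heap until the Vec is dropped.
fn store_all(n: usize) -> Vec<String> {
    tokens(n).collect()
}

/// "print-tokens" style: each item is used and dropped before the next is produced.
fn longest_len(n: usize) -> usize {
    tokens(n).map(|t| t.len()).max().unwrap_or(0)
}

fn main() {
    let stored = store_all(1_000);
    // Sum of string lengths held at once, ignoring Vec and allocator overhead.
    let held: usize = stored.iter().map(|t| t.len()).sum();
    println!("stored {} tokens, {} bytes of text held at once", stored.len(), held);
    println!("longest token seen while streaming: {} bytes", longest_len(1_000));
}
```

Both functions allocate the same number of strings in total; what changes is how many of them are alive at the same moment, which is what the At t-gmax figure measures.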