4 stable releases

1.1.2 Feb 11, 2025
1.1.0 Jan 27, 2025
1.0.1 Jan 5, 2025
1.0.0 Jan 4, 2025

Used in scan_json

MIT license


RJiter: Streaming JSON parser for Rust

RJiter processes large JSON files using a small buffer. It is a wrapper around Jiter; the "R" stands for "Reader", the component that fills the buffer on demand.

API documentation:

  • RJiter. For most functions, the documentation redirects to Jiter
  • Jiter

See also scan_json for a callback-based API built on top of RJiter.

Example

This example repeats the one from Jiter. The only difference is how RJiter is constructed: it parses the JSON through a buffer of only 16 bytes.

use rjiter::jiter::{NumberInt, Peek};
use rjiter::RJiter;
use std::io::Cursor;

let json_data = r#"
{
    "name": "John Doe", 
    "age": 43,
    "phones": [
        "+44 1234567",
        "+44 2345678"
    ]
}"#;

// Create RJiter
let mut buffer = [0u8; 16];
let mut reader = Cursor::new(json_data.as_bytes());
let mut rjiter = RJiter::new(&mut reader, &mut buffer);

// The rest is again the same as in Jiter
assert_eq!(rjiter.next_object().unwrap(), Some("name"));
assert_eq!(rjiter.next_str().unwrap(), "John Doe");
assert_eq!(rjiter.next_key().unwrap(), Some("age"));
assert_eq!(rjiter.next_int().unwrap(), NumberInt::Int(43));
assert_eq!(rjiter.next_key().unwrap(), Some("phones"));
assert_eq!(rjiter.next_array().unwrap(), Some(Peek::String));
// we know the next value is a string as we just asserted so
assert_eq!(rjiter.known_str().unwrap(), "+44 1234567");
assert_eq!(rjiter.array_step().unwrap(), Some(Peek::String));
// same again
assert_eq!(rjiter.known_str().unwrap(), "+44 2345678");
// next we'll get `None` from `array_step` as the array is finished
assert_eq!(rjiter.array_step().unwrap(), None);
// and `None` from `next_key` as the object is finished
assert_eq!(rjiter.next_key().unwrap(), None);
// and we check there's nothing else in the input
rjiter.finish().unwrap();

Logic and limitations

First, RJiter calls Jiter. If the result is ok, RJiter returns it. Otherwise, the logic is as follows:

  1. Skip spaces
  2. Shift the buffer
  3. Read more data and try again, repeating until the call succeeds or the error can't be fixed by reading more data

The buffer must be large enough to hold each complete JSON element. In the example above, with a 12-byte buffer, parsing would fail on the telephone numbers:

called `Result::unwrap()` on an `Err` value: Error { error_type: JsonError(EofWhileParsingString), index: 79 }
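
The retry loop and the buffer-size limit described above can be illustrated with a minimal sketch that uses only the standard library. It is not RJiter's implementation: `ToyReader`, `try_parse_string`, and `read_strings` are hypothetical names, and the toy parser only understands quoted strings without escapes.

```rust
use std::io::{Cursor, Read};

/// Result of one parse attempt on the bytes currently in the buffer.
enum Attempt {
    /// A complete string: content is `data[start..end]`, closing quote at `end`.
    Done { start: usize, end: usize },
    /// The data ran out first; reading more input may help.
    Incomplete,
    /// The data cannot be a string, no matter how much more is read.
    Invalid,
}

/// Toy stand-in for the inner parser (an assumption, not Jiter):
/// a quoted string without escapes, optionally preceded by whitespace.
fn try_parse_string(data: &[u8]) -> Attempt {
    let ws = data.iter().take_while(|b| b.is_ascii_whitespace()).count();
    match data.get(ws).copied() {
        None => Attempt::Incomplete,
        Some(b'"') => match data[ws + 1..].iter().position(|&b| b == b'"') {
            Some(len) => Attempt::Done { start: ws + 1, end: ws + 1 + len },
            None => Attempt::Incomplete,
        },
        Some(_) => Attempt::Invalid,
    }
}

/// Hypothetical reader that parses through a fixed-size buffer.
struct ToyReader<R: Read> {
    reader: R,
    buf: Vec<u8>,
    start: usize, // first unconsumed byte
    end: usize,   // one past the last valid byte
}

impl<R: Read> ToyReader<R> {
    fn new(reader: R, size: usize) -> Self {
        Self { reader, buf: vec![0; size], start: 0, end: 0 }
    }

    fn next_string(&mut self) -> Result<String, String> {
        loop {
            // First, try the inner parser on what is already buffered.
            match try_parse_string(&self.buf[self.start..self.end]) {
                Attempt::Done { start, end } => {
                    let bytes = &self.buf[self.start + start..self.start + end];
                    // Copy the value out: a later shift would overwrite it.
                    let value = String::from_utf8_lossy(bytes).into_owned();
                    self.start += end + 1;
                    return Ok(value);
                }
                Attempt::Invalid => return Err("invalid input".to_string()),
                Attempt::Incomplete => {}
            }
            // 1. Skip spaces so they do not occupy buffer space.
            while self.start < self.end && self.buf[self.start].is_ascii_whitespace() {
                self.start += 1;
            }
            // 2. Shift the unconsumed bytes to the front of the buffer.
            self.buf.copy_within(self.start..self.end, 0);
            self.end -= self.start;
            self.start = 0;
            // 3. Read more data; a full buffer means the element cannot fit.
            if self.end == self.buf.len() {
                return Err("buffer too small for one element".to_string());
            }
            let n = self.reader.read(&mut self.buf[self.end..]).map_err(|e| e.to_string())?;
            if n == 0 {
                return Err("unexpected end of input".to_string());
            }
            self.end += n;
        }
    }
}

/// Reads `count` strings from `input` through a buffer of `size` bytes.
fn read_strings(input: &str, size: usize, count: usize) -> Result<Vec<String>, String> {
    let mut toy = ToyReader::new(Cursor::new(input.as_bytes()), size);
    (0..count).map(|_| toy.next_string()).collect()
}
```

In this sketch, `read_strings("  \"alpha\"   \"beta\" ", 8, 2)` succeeds because each quoted string fits into 8 bytes once the leading spaces are skipped, while a 6-byte buffer fails on `"alpha"`, which needs 7 bytes including the quotes.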

Functions that return pointers to bytes return pointers into the buffer. Copy the bytes elsewhere before calling RJiter again; otherwise, RJiter may shift the buffer and invalidate the pointers.

Pass-through long strings

Strings can be longer than the buffer, so the default logic doesn't work for them. RJiter provides a workaround: the caller supplies a writer, and RJiter writes the string to it.

  • write_long_bytes: Copy bytes as is, without touching escapes. Useful for JSON-to-JSON conversion.
  • write_long_str: Unescape the string during copying. Useful for JSON-to-text conversion.

use rjiter::RJiter;
use std::io::Cursor;

let cdata = r#"\"\u4F60\u597d\",\n\\\\\\\\\\\\\\\\\\\\\\\\ how can I help you today?"#;
let input = format!("\"{cdata}\"\"{cdata}\"");

let mut buffer = [0u8; 10];
let mut reader = Cursor::new(input.as_bytes());
let mut rjiter = RJiter::new(&mut reader, &mut buffer);

//
// write_long_bytes
//

let mut writer = Vec::new();
let wb = rjiter.write_long_bytes(&mut writer);
wb.unwrap();
assert_eq!(writer, cdata.as_bytes()); // <--- bytes are copied as is

//
// write_long_str
//
let mut writer = Vec::new();
let wb = rjiter.write_long_str(&mut writer);
wb.unwrap();
assert_eq!( // <--- escapes are decoded
    writer,
    r#""你好",
\\\\\\\\\\\\ how can I help you today?"#.as_bytes()
);

let finish = rjiter.finish();
assert!(finish.is_ok());
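
To show why a writer removes the buffer-size limit, here is a minimal standard-library sketch of the pass-through idea in the "copy bytes as is" style. It is an illustration under stated assumptions, not RJiter's code: `copy_string_body` and `copy_body` are hypothetical helpers, and the input is assumed to start just after the opening quote.

```rust
use std::io::{self, Cursor, Read, Write};

/// Copies the body of a JSON string to `writer` without decoding escapes,
/// reading through a buffer of `chunk` bytes. The reader is assumed to be
/// positioned just after the opening quote. Hypothetical helper.
fn copy_string_body<R: Read, W: Write>(
    reader: &mut R,
    writer: &mut W,
    chunk: usize,
) -> io::Result<()> {
    let mut buf = vec![0u8; chunk];
    // True when the previous byte was a backslash that escapes the next byte.
    // The flag survives chunk boundaries, so an escape may be split in two.
    let mut escaped = false;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "unterminated string",
            ));
        }
        for (i, &b) in buf[..n].iter().enumerate() {
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                // Unescaped quote: the string ends here.
                // Simplification: bytes after the quote in this chunk are
                // dropped; a real parser would keep them in its buffer.
                writer.write_all(&buf[..i])?;
                return Ok(());
            }
        }
        // No closing quote yet: flush the whole chunk and reuse the buffer.
        writer.write_all(&buf[..n])?;
    }
}

/// Convenience wrapper: copies the string body from `input` into a `Vec`.
fn copy_body(input: &str, chunk: usize) -> io::Result<Vec<u8>> {
    let mut reader = Cursor::new(input.as_bytes());
    let mut out = Vec::new();
    copy_string_body(&mut reader, &mut out, chunk)?;
    Ok(out)
}
```

Because each chunk is flushed to the writer before the next read, memory use stays at `chunk` bytes regardless of the string length. Decoding escapes during the copy, as write_long_str does, would additionally require holding back an incomplete escape sequence at the end of a chunk.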

Skip tokens

When JSON fragments are mixed with known text, RJiter provides the function known_skip_token to skip that text.

use rjiter::{RJiter, Result as RJiterResult};
use rjiter::jiter::Peek;
use std::io::Cursor;

let json_data = r#"
    event: ping
    data: {"type": "ping"}
"#;

fn peek_skipping_tokens(rjiter: &mut RJiter, tokens: &[&str]) -> RJiterResult<Peek> {
    'outer: loop {
        let peek = rjiter.peek();
        for token in tokens {
            let found = rjiter.known_skip_token(token.as_bytes());
            if found.is_ok() {
                continue 'outer;
            }
        }
        return peek;
    }
}

let mut buffer = [0u8; 10];
let mut reader = Cursor::new(json_data.as_bytes());
let mut rjiter = RJiter::new(&mut reader, &mut buffer);

// Skip non-json
let tokens = vec!["data:", "event:", "ping"];
let result = peek_skipping_tokens(&mut rjiter, &tokens);
assert_eq!(result.unwrap(), Peek::Object);

// Continue with json
let key = rjiter.next_object();
assert_eq!(key.unwrap(), Some("type"));
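
The control flow of peek_skipping_tokens can also be shown on a plain string, without any buffering. The following standard-library sketch is only an analogy: `skip_known_tokens` is a hypothetical function, and it works on text that is already fully in memory.

```rust
/// Repeatedly strips leading whitespace and any of the known `tokens`
/// from the start of `text`, and returns what remains.
/// Hypothetical helper, not part of RJiter.
fn skip_known_tokens<'a>(mut text: &'a str, tokens: &[&str]) -> &'a str {
    'outer: loop {
        text = text.trim_start();
        for token in tokens {
            // An empty token would match forever, so ignore it.
            if token.is_empty() {
                continue;
            }
            if let Some(rest) = text.strip_prefix(*token) {
                // A token was skipped: start over with all tokens.
                text = rest;
                continue 'outer;
            }
        }
        // No token matched: the remaining text starts with something else.
        return text;
    }
}
```

As in the example above, the loop restarts after every skipped token, so the tokens may appear in any order and any number of times before the JSON fragment.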

Colophon

License: MIT

Author: Oleg Parashchenko, olpa@ https://uucode.com/

Contact: via email or Ailets Discord

RJiter is a part of the ailets.org project.
