Genson Core


A fast, robust Rust library for JSON schema inference. It adapts the genson-rs library's SIMD parallelism, but first pre-validates the input with serde_json in a streaming pass (without allocating values), so that invalid JSON produces a proper error instead of a panic.

This is the core library that powers both the genson-cli command-line tool and the polars-genson Python plugin. It includes a vendored and enhanced version of the genson-rs library with added safety features and comprehensive error handling.

Features

  • Robust JSON Schema Inference: Generate JSON schemas from JSON data with comprehensive type detection
  • Normalisation Against Schema: Enforce a consistent Avro schema across heterogeneous JSON inputs (handles empty arrays/maps, unions, type coercion, etc.)
  • Parallel Processing: Efficient processing of large JSON datasets using Rayon
  • Enhanced Error Handling: Proper error propagation instead of panics for invalid JSON
  • Multiple Input Formats: Support for regular JSON, NDJSON, and arrays of JSON objects
  • Field Order Preservation: Maintains original field ordering using IndexMap
  • Memory Efficient: Uses mimalloc for optimized memory allocation
  • SIMD Acceleration: Fast JSON parsing with simd-json

Feature Flags

| Feature | Description | Dependencies |
|---------|-------------|--------------|
| `avro` | Enables Avro schema export and normalisation against Avro types | `avrotize` |
| `parquet` | Enables Parquet schema integration | `arrow`, `parquet` |
| `trace` | Enables tracing and visualisation of schema inference using crustrace + Mermaid diagrams | `crustrace`, `tracing`, `tracing-subscriber` |
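For example, to enable Avro support (a minimal sketch; pin the version you actually use):

[dependencies]
genson-core = { version = "0.1.2", features = ["avro"] }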

Quick Start

Add this to your Cargo.toml:

[dependencies]
genson-core = "0.1.2"
serde_json = { version = "1.0", features = ["preserve_order"] }

  • ⚠️ Caution: if you include serde_json in your dependencies but don't activate its preserve_order feature, genson-core schema properties will not be in insertion order. This may be an unwelcome surprise!

Basic Usage

use genson_core::{infer_json_schema, SchemaInferenceConfig};

fn main() -> Result<(), String> {
    let json_strings = vec![
        r#"{"name": "Alice", "age": 30, "scores": [95, 87]}"#.to_string(),
        r#"{"name": "Bob", "age": 25, "city": "NYC", "active": true}"#.to_string(),
    ];

    let result = infer_json_schema(&json_strings, None)?;
    
    println!("Processed {} JSON objects", result.processed_count);
    println!("Schema: {}", serde_json::to_string_pretty(&result.schema)?);
    
    Ok(())
}

Configuration Options

Schema Inference Configuration

The SchemaInferenceConfig struct controls how genson-core infers, merges, and normalises JSON schemas. It allows you to fine-tune map detection, scalar wrapping, and unification behaviour.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `ignore_outer_array` | `bool` | `true` | Treat top-level arrays as streams of JSON objects instead of a single array value. |
| `delimiter` | `Option<u8>` | `None` | Enables NDJSON processing when set (typically `b'\n'`). |
| `schema_uri` | `Option<String>` | `"AUTO"` | Base URI for the generated schema; `"AUTO"` uses a default inferred URI. |
| `map_threshold` | `usize` | `20` | When an object has more than this number of distinct keys across records, it's treated as a map instead of a record. |
| `map_max_required_keys` | `Option<usize>` | `None` | Upper limit for required keys before forcing an object to remain a record. If `None`, no restriction applies. |
| `unify_maps` | `bool` | `false` | Enables merging of record-like and map-like structures during schema unification. |
| `no_unify` | `HashSet<String>` | `{}` | Fields whose subfields should not be merged during schema unification. Prevents overgeneralisation. |
| `force_field_types` | `HashMap<String, String>` | `{}` | Explicitly force certain fields to specific types, e.g. `{ "labels": "map" }`. |
| `force_parent_field_types` | `HashMap<String, String>` | `{}` | Prevents objects containing specific child fields from being inferred as maps. Ensures the parent remains a record. |
| `force_scalar_promotion` | `HashSet<String>` | `{}` | Always wrap specific scalar fields in objects to ensure schema stability across datasets. |
| `wrap_scalars` | `bool` | `true` | When scalar values collide with object values, promote the scalar to a wrapped object (e.g. `"foo"` → `{ "foo__string": "foo" }`). |
| `wrap_root` | `Option<String>` | `None` | Wraps the entire schema under a single required field name (e.g. `"labels"`). |
| `no_root_map` | `bool` | `true` | Prevents the top-level document from being inferred as a map. |
| `max_builders` | `Option<usize>` | `None` | Limits the number of schema builders used in parallel (reduces peak memory usage). |
| `avro` (feature = `"avro"`) | `bool` | `false` | When enabled, outputs an Avro-compatible schema instead of JSON Schema. |
| `debug` | `bool` | `false` | Enables structured debug output showing inference and unification decisions. |
| `profile` | `bool` | `false` | Enables profiling output for timing information during schema inference. |
| `verbosity` | `DebugVerbosity` | `Normal` | Controls how detailed debug/profiling output is (`Normal` or `Verbose`). |
For example (any fields you omit fall back to their defaults):

use genson_core::{infer_json_schema, SchemaInferenceConfig};

let config = SchemaInferenceConfig {
    ignore_outer_array: true,             // Treat top-level arrays as streams of objects
    delimiter: Some(b'\n'),               // Enable NDJSON processing
    schema_uri: Some("AUTO".to_string()), // Auto-detect schema URI
    ..Default::default()
};

let result = infer_json_schema(&json_strings, Some(config))?;

DebugVerbosity

As well as profiling (which mainly reports how long each step takes), the unification itself can be debugged more verbosely.

| Variant | Description |
|---------|-------------|
| `Normal` | Shows high-level inference and unification decisions. |
| `Verbose` | Shows all internal debug output, including field introductions and merges. |
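A minimal sketch of enabling verbose debug output (assuming DebugVerbosity is exported from the crate root alongside SchemaInferenceConfig):

use genson_core::{infer_json_schema, DebugVerbosity, SchemaInferenceConfig};

let config = SchemaInferenceConfig {
    debug: true,                        // print inference/unification decisions
    verbosity: DebugVerbosity::Verbose, // include field introductions and merges
    ..Default::default()
};

let result = infer_json_schema(&json_strings, Some(config))?;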

NDJSON Processing

let ndjson_data = vec![
    r#"
    {"user": "alice", "action": "login"}
    {"user": "bob", "action": "logout"}
    {"user": "charlie", "action": "login", "ip": "192.168.1.1"}
    "#.to_string()
];

let config = SchemaInferenceConfig {
    delimiter: Some(b'\n'),  // Enable NDJSON mode
    ..Default::default()
};

let result = infer_json_schema(&ndjson_data, Some(config))?;

Advanced Schema Building

For more control over the schema building process:

use genson_core::genson_rs::{get_builder, build_json_schema, BuildConfig};

let mut builder = get_builder(Some("https://json-schema.org/draft/2020-12/schema"));

let build_config = BuildConfig {
    delimiter: None,
    ignore_outer_array: true,
};

let mut json_bytes = br#"{"field": "value"}"#.to_vec();
// Feed the JSON bytes into the builder; the intermediate return value is
// not needed here, since `to_schema()` below produces the final schema
let _schema = build_json_schema(&mut builder, &mut json_bytes, &build_config);

let final_schema = builder.to_schema();

Normalisation

In addition to inferring schemas, genson-core can normalise arbitrary JSON values against an Avro schema. This is useful when working with jagged or heterogeneous data where rows may encode the same field in different ways.

What it does

  • Ensures every row conforms to the same inferred schema.
  • Converts empty arrays/maps to null by default (configurable).
  • Normalises scalars into arrays when the schema requires it.
  • Handles optional fields, missing values, and type mismatches gracefully.
  • Supports unions: the first non-null type branch takes precedence.
  • Optional coercion of strings to numbers/booleans ("42" → 42, "true" → true).

API

use genson_core::normalise::{normalise_value, normalise_values, NormaliseConfig};
use serde_json::json;

let schema = json!({
    "type": "record",
    "name": "doc",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "labels", "type": {"type": "map", "values": "string"}}
    ]
});

let cfg = NormaliseConfig::default();

let input = json!({"id": 42, "labels": {}});
let normalised = normalise_value(input, &schema, &cfg);

assert_eq!(normalised, json!({"id": 42, "labels": null}));

Configuration

NormaliseConfig lets you control behaviour:

let cfg = NormaliseConfig {
    empty_as_null: true,   // [] and {} become null (default)
    coerce_string: false,  // mismatched strings like "42" become null rather than being coerced (default)
};

Example

Input values:

{"id": 7,    "labels": {"en": "Hello"}}
{"id": "42", "labels": {}}

Normalised (default):

{"id": 7,    "labels": {"en": "Hello"}}
{"id": null, "labels": null}

Normalised (with coerce_string = true):

{"id": 7,  "labels": {"en": "Hello"}}
{"id": 42, "labels": null}

Performance Features

Parallel Processing

The library automatically uses parallel processing for:

  • Large JSON arrays (when items > 10)
  • NDJSON files with delimiter-based splitting
  • Multiple JSON objects in a single input
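If peak memory is a concern, parallelism can be capped via max_builders from the configuration table above (a minimal sketch):

let config = SchemaInferenceConfig {
    max_builders: Some(4), // cap concurrent schema builders to bound peak memory
    ..Default::default()
};

let result = infer_json_schema(&json_strings, Some(config))?;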

Memory Optimisation

  • mimalloc: Fast memory allocation
  • SIMD JSON: Hardware-accelerated parsing where available
  • Streaming: Processes large files without loading everything into memory

Error Handling

The library has been put together so as to avoid panics. As a second line of defence, any panic that does occur is caught and converted into an error. This safety net was left in place after the original panic sources were fixed, and it should not trigger in practice: input is always pre-validated with serde_json, and panics only occurred on invalid JSON. Please report any example that panics, along with the JSON that caused it if possible.

The library provides comprehensive error handling that catches and converts internal panics into proper error messages:

let invalid_json = vec![r#"{"invalid": json}"#.to_string()];

match infer_json_schema(&invalid_json, None) {
    Ok(result) => println!("Success: {:?}", result),
    Err(error) => {
        // Will contain a descriptive error message instead of panicking
        eprintln!("JSON parsing failed: {}", error);
    }
}

Error messages include:

  • Position information for syntax errors
  • Truncated JSON content for context (prevents huge error messages)
  • Clear descriptions of what went wrong

Schema Features

Type Inference

The library accurately infers:

  • Basic types: string, number, integer, boolean, null
  • Complex types: object, array
  • Nested structures with proper schema merging
  • Optional vs required fields based on data presence

Field Order Preservation

This library uses an order-preserving map (IndexMap) to preserve the original field ordering from JSON input:

// Input: {"z": 1, "b": 2, "a": 3}
// Output schema will maintain z -> b -> a ordering

Schema Merging

When processing multiple JSON objects, schemas are intelligently merged:

// Object 1: {"name": "Alice", "age": 30}
// Object 2: {"name": "Bob", "city": "NYC"}
// Merged schema: name (required), age (optional), city (optional)
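Concretely, with the public API (a sketch; run inside a function returning Result<(), String>):

let objects = vec![
    r#"{"name": "Alice", "age": 30}"#.to_string(),
    r#"{"name": "Bob", "city": "NYC"}"#.to_string(),
];

let result = infer_json_schema(&objects, None)?;
// Only "name" appears in both objects, so only "name" ends up required;
// "age" and "city" are inferred as optional properties.
println!("{}", serde_json::to_string_pretty(&result.schema).map_err(|e| e.to_string())?);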

Integration

This crate is designed as the foundation for:

  • polars-genson: Python plugin for Polars DataFrames
  • genson-cli: Command-line JSON schema inference tool (mainly for testing)

Safety & Reliability

  • Panic Safety: All genson-rs panics are caught and converted to errors
  • Memory Safety: Comprehensive bounds checking and safe parsing
  • Input Validation: JSON validation before processing
  • Graceful Degradation: Handles malformed input gracefully

Contributing

This crate is part of the polars-genson project. See the main repository for the contribution and development docs.

License

Licensed under the MIT License. See [LICENSE](https://github.com/lmmx/polars-genson/blob/master/LICENSE) for details.

Contains vendored and adapted code from the Apache 2.0 licensed genson-rs crate.
