#xsd #null #xml #xml-parser #another #format #false

app yaxp-cli

<yaxp-cli ⚡> Yet Another XML Parser CLI

21 releases

0.2.3 Feb 17, 2025
0.2.2 Feb 15, 2025
0.1.18 Feb 10, 2025

#244 in Parser implementations

Download history 318/week @ 2025-01-27 857/week @ 2025-02-03 680/week @ 2025-02-10 100/week @ 2025-02-17

1,955 downloads per month

MIT license

81KB
1.5K SLoC

version downloads pipelines

<yaxp ⚡> Yet Another XSD Parser

📌 Note: This project is still under heavy development, and its APIs are subject to change.

Introduction

Using roxmltree to parse XML files.

Converts xsd schema to:

  • arrow
  • avro
  • duckdb (read_csv columns/types)
  • json
  • json representation of spark schema
  • jsonschema
  • polars
  • protobuf

Installation

When you already have Rust installed or want to install from crates.io:

$ cargo install yaxp-cli

on MacOS, you can also install using homebrew, from the tap opensourceworks-org/homebrew-yaxp-cli

$ tap opensourceworks-org/homebrew-yaxp-cli
$ install yaxp-cli
==> Downloading https://formulae.brew.sh/api/formula.jws.json
==> Downloading https://formulae.brew.sh/api/cask.jws.json
==> Fetching dependencies for opensourceworks-org/yaxp-cli/yaxp-cli: libgit2@1.8 and rust
==> Fetching libgit2@1.8
==> Downloading https://ghcr.io/v2/homebrew/core/libgit2/1.8/manifests/1.8.4
Already downloaded: /Users/jeroen/Library/Caches/Homebrew/downloads/9302724e2f7c0eb8122204e7d395e6c2575f176e627ea6f6a16ac4fc24be4d72--libgit2@1.8-1.8.4.bottle_manifest.json
==> Downloading https://ghcr.io/v2/homebrew/core/libgit2/1.8/blobs/sha256:5a9fe4aae3865e5c977633107b829e639e6535d8f986c851d60d63bb2e5b0932
Already downloaded: /Users/jeroen/Library/Caches/Homebrew/downloads/4932981d5b3e9b3df6840f9997858933be19bc15a7f9d8c5ce8e792b7339ee79--libgit2@1.8--1.8.4.arm64_sequoia.bottle.tar.gz
==> Fetching rust
==> Downloading https://ghcr.io/v2/homebrew/core/rust/manifests/1.84.1-1
Already downloaded: /Users/jeroen/Library/Caches/Homebrew/downloads/0c05e1e855a42deca67c60dbd378e38f9a8c2abe0ac9adf40600280372100cfa--rust-1.84.1-1.bottle_manifest.json
==> Downloading https://ghcr.io/v2/homebrew/core/rust/blobs/sha256:6fe0e14f08adae82662551b478fdfaeb87f516be7762c60d28203e830c5caa91
Already downloaded: /Users/jeroen/Library/Caches/Homebrew/downloads/551c47c2bbea27cd0ba44e951c5e3a99e60485a5e8be0ed4087eb3b6850e2284--rust--1.84.1.arm64_sequoia.bottle.1.tar.gz
==> Fetching opensourceworks-org/yaxp-cli/yaxp-cli
==> Downloading https://github.com/opensourceworks-org/homebrew-yaxp-cli/releases/download/v0.2.2/macos-arm64-v0.2.2.tar.gz
Already downloaded: /Users/jeroen/Library/Caches/Homebrew/downloads/738a5b79287f1e0d54a967db2f6f8212053c762792331b5545b173779b141dfb--macos-arm64-v0.2.2.tar.gz
==> Installing yaxp-cli from opensourceworks-org/yaxp-cli
==> Installing dependencies for opensourceworks-org/yaxp-cli/yaxp-cli: libgit2@1.8 and rust
==> Installing opensourceworks-org/yaxp-cli/yaxp-cli dependency: libgit2@1.8
==> Downloading https://ghcr.io/v2/homebrew/core/libgit2/1.8/manifests/1.8.4
Already downloaded: /Users/jeroen/Library/Caches/Homebrew/downloads/9302724e2f7c0eb8122204e7d395e6c2575f176e627ea6f6a16ac4fc24be4d72--libgit2@1.8-1.8.4.bottle_manifest.json
==> Pouring libgit2@1.8--1.8.4.arm64_sequoia.bottle.tar.gz
🍺  /opt/homebrew/Cellar/libgit2@1.8/1.8.4: 106 files, 4.7MB
==> Installing opensourceworks-org/yaxp-cli/yaxp-cli dependency: rust
==> Downloading https://ghcr.io/v2/homebrew/core/rust/manifests/1.84.1-1
Already downloaded: /Users/jeroen/Library/Caches/Homebrew/downloads/0c05e1e855a42deca67c60dbd378e38f9a8c2abe0ac9adf40600280372100cfa--rust-1.84.1-1.bottle_manifest.json
==> Pouring rust--1.84.1.arm64_sequoia.bottle.1.tar.gz
🍺  /opt/homebrew/Cellar/rust/1.84.1: 3,566 files, 321.3MB
==> Installing opensourceworks-org/yaxp-cli/yaxp-cli
🍺  /opt/homebrew/Cellar/yaxp-cli/0.2.2: 4 files, 1.3MB, built in 1 second
==> Running `brew cleanup yaxp-cli`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
$ yaxp-cli --version
yaxp-cli 0.2.2
$ which yaxp-cli
/opt/homebrew/bin/yaxp-cli
$ yaxp-cli --help
<yaxp-cli > Yet Another XSD Parser

Usage: yaxp-cli [OPTIONS] --xsd <XSD>

Options:
  -x, --xsd <XSD>              Path to the XSD file
  -f, --format <FORMAT>        Output format [default: json] [possible values: json, arrow, spark, json-schema, duckdb, polars, avro]
  -o, --output <OUTPUT>        optional output filename
  -t, --timeunit <TIMEUNIT>    optional timeunit [default: ns]
  -z, --timezone <TIMEZONE>    optional timezone [default: UTC]
  -e, --encoding <ENCODING>    optional encoding of the XSD file [default: utf-8]
  -l, --lowercase <LOWERCASE>  optional lowercase column names [default: false] [possible values: true, false]
  -h, --help                   Print help
  -V, --version                Print version
$

Usage

$ yaxp-cli --help
<yaxp-cli > Yet Another XSD Parser

Usage: yaxp-cli [OPTIONS] --xsd <XSD>

Options:
  -x, --xsd <XSD>        Path to the XSD file
  -f, --format <FORMAT>  Output format: json (default), arrow [default: json] [possible values: json, arrow]
  -o, --output <OUTPUT>  optional output filename
  -h, --help             Print help
  -V, --version          Print version
$

Examples

$ yaxp-cli --xsd example.xsd --format polars
Schema:
name: Field1, field: String
name: Field2, field: String
name: Field3, field: String
name: Field4, field: String
name: Field5, field: Datetime(Milliseconds, None)
name: Field6, field: Date
name: Field7, field: Date
name: Field8, field: String
name: Field9, field: String
name: Field10, field: String
name: Field11, field: String
name: Field12, field: Decimal(Some(25), Some(7))
name: Field13, field: String
name: Field14, field: String
name: Field15, field: String
name: Field16, field: String
name: Field17, field: Date
name: Field18, field: String
name: Field19, field: String
name: Field20, field: Decimal(Some(38), Some(10))
name: Field21, field: Int64

$ yaxp-cli --xsd example.xsd --format arrow
Schema { fields: [Field { name: "Field1", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1", "maxLength": "15"} }, Field { name: "Field2", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"maxLength": "20", "maxOccurs": "1"} }, Field { name: "Field3", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"maxLength": "10", "maxOccurs": "1"} }, Field { name: "Field4", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxLength": "50", "maxOccurs": "1"} }, Field { name: "Field5", data_type: Timestamp(Nanosecond, None), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1"} }, Field { name: "Field6", data_type: Date32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1"} }, Field { name: "Field7", data_type: Date32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1"} }, Field { name: "Field8", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"minLength": "2", "maxOccurs": "1", "maxLength": "10"} }, Field { name: "Field9", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1", "maxLength": "3"} }, Field { name: "Field10", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1", "maxLength": "30"} }, Field { name: "Field11", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1", "maxLength": "10"} }, Field { name: "Field12", data_type: Decimal128(25, 7), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1"} }, Field { name: "Field13", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1", "values": "N,Q,V,C"} }, Field { name: "Field14", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1", "values": "%,P,R"} }, Field { name: "Field15", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"values": "%,P,R", "maxOccurs": "1"} }, Field { name: "Field16", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"pattern": ".{3}", "maxOccurs": "1"} }, Field { name: "Field17", data_type: Date32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1"} }, Field { name: "Field18", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxLength": "30", "pattern": "[a-cA-C]*", "maxOccurs": "1"} }, Field { name: "Field19", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1", "values": "Y,N"} }, Field { name: "Field20", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1"} }, Field { name: "Field21", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"maxOccurs": "1"} }], metadata: {} }
$ yaxp-cli --xsd example.xsd --format json |jq
{
  "namespace": null,
  "schemaElement": {
    "id": "Main_Element",
    "name": "Main_Element",
    "dataType": null,
    "minOccurs": "1",
    "maxOccurs": "1",
    "minLength": null,
    "maxLength": null,
    "minExclusive": null,
    "maxExclusive": null,
    "minInclusive": null,
    "maxInclusive": null,
    "pattern": null,
    "fractionDigits": null,
    "totalDigits": null,
    "values": null,
    "isCurrency": false,
    "xpath": "Main_Element/Main_Element",
    "nullable": null,
    "elements": [
      {
        "id": "Field1",
        "name": "Field1",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": "15",
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field1",
        "nullable": false,
        "elements": []
      },
      {
        "id": "Field2",
        "name": "Field2",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": "20",
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field2",
        "nullable": false,
        "elements": []
      },
      {
        "id": "Field3",
        "name": "Field3",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": "10",
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field3",
        "nullable": false,
        "elements": []
      },
      {
        "id": "Field4",
        "name": "Field4",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": "50",
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field4",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field5",
        "name": "Field5",
        "dataType": "dateTime",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field5",
        "nullable": false,
        "elements": []
      },
      {
        "id": "Field6",
        "name": "Field6",
        "dataType": "date",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field6",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field7",
        "name": "Field7",
        "dataType": "date",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field7",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field8",
        "name": "Field8",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": "2",
        "maxLength": "10",
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field8",
        "nullable": false,
        "elements": []
      },
      {
        "id": "Field9",
        "name": "Field9",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": "3",
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field9",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field10",
        "name": "Field10",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": "30",
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field10",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field11",
        "name": "Field11",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": "10",
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field11",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field12",
        "name": "Field12",
        "dataType": "decimal",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": "7",
        "totalDigits": "25",
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field12",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field13",
        "name": "Field13",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": [
          "N",
          "Q",
          "V",
          "C"
        ],
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field13",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field14",
        "name": "Field14",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": [
          "%",
          "P",
          "R"
        ],
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field14",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field15",
        "name": "Field15",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": [
          "%",
          "P",
          "R"
        ],
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field15",
        "nullable": false,
        "elements": []
      },
      {
        "id": "Field16",
        "name": "Field16",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": ".{3}",
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field16",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field17",
        "name": "Field17",
        "dataType": "date",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field17",
        "nullable": false,
        "elements": []
      },
      {
        "id": "Field18",
        "name": "Field18",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": "30",
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": "[a-cA-C]*",
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field18",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field19",
        "name": "Field19",
        "dataType": "string",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": [
          "Y",
          "N"
        ],
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field19",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field20",
        "name": "Field20",
        "dataType": "decimal",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field20",
        "nullable": true,
        "elements": []
      },
      {
        "id": "Field21",
        "name": "Field21",
        "dataType": "integer",
        "minOccurs": "1",
        "maxOccurs": "1",
        "minLength": null,
        "maxLength": null,
        "minExclusive": null,
        "maxExclusive": null,
        "minInclusive": null,
        "maxInclusive": null,
        "pattern": null,
        "fractionDigits": null,
        "totalDigits": null,
        "values": null,
        "isCurrency": false,
        "xpath": "Main_Element/Main_Element/Field21",
        "nullable": true,
        "elements": []
      }
    ]
  }
}
$

TODO

  • pyo3/maturin support
  • parameter for timezone unit/TZ (testing with polars)
  • support for different xsd file encoding: UTF-16, UTF16LE, ...
  • more tests
  • strict schema validation to spec (xsd, avro, json-schema, ...)
  • example implementation <xsd ⚡> convert
  • option to lowercase column names

Dependencies

~50–81MB
~1.5M SLoC