4 releases

0.2.2 Oct 5, 2024
0.2.1 Mar 7, 2024
0.2.0 Feb 27, 2024
0.1.0 Feb 20, 2024

#840 in Parser implementations

28 downloads per month
Used in dotnet-lens

MPL-2.0 license

440KB
7.5K SLoC

Simple(ish) parser and extractor of XML.

This package provides an XmlReader which can automatically determine the character encoding of UTF-8 and UTF-16 (big endian and little endian byte order) XML byte streams, and parse the XML into an immutable Element tree held within an XmlDocument. It's also possible to use a custom byte stream decoder to read XML in other character encodings.
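
To illustrate what automatic detection of UTF-8 and UTF-16 can involve, here is a minimal standalone sketch that inspects the first bytes of a stream for a byte-order mark, or failing that for the byte pattern of a leading `<?`. It is not this package's implementation; the `DetectedEncoding` enum and `detect_encoding` function are hypothetical names invented for the example.

```rust
/// Encodings this sketch can recognise (hypothetical type, not from the package).
#[derive(Debug, PartialEq)]
enum DetectedEncoding {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Guess the encoding from the first bytes of an XML byte stream.
/// A byte-order mark is checked first, then the pattern of a leading "<?".
fn detect_encoding(bytes: &[u8]) -> DetectedEncoding {
    match bytes {
        [0xEF, 0xBB, 0xBF, ..] => DetectedEncoding::Utf8, // UTF-8 BOM
        [0xFE, 0xFF, ..] => DetectedEncoding::Utf16Be,    // UTF-16 big endian BOM
        [0xFF, 0xFE, ..] => DetectedEncoding::Utf16Le,    // UTF-16 little endian BOM
        [0x00, 0x3C, 0x00, 0x3F, ..] => DetectedEncoding::Utf16Be, // "<?" with no BOM
        [0x3C, 0x00, 0x3F, 0x00, ..] => DetectedEncoding::Utf16Le, // "<?" with no BOM
        _ => DetectedEncoding::Utf8, // fall back to UTF-8 when nothing else is signalled
    }
}

fn main() {
    println!("{:?}", detect_encoding(b"\xEF\xBB\xBF<root/>"));
    println!("{:?}", detect_encoding(&[0xFF, 0xFE, 0x3C, 0x00]));
    println!("{:?}", detect_encoding(b"<root/>"));
}
```

The sketch only covers the three encodings named above; any other encoding would need to be known in advance.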

The aim of this package is to support as closely as possible the W3C specifications Extensible Markup Language (XML) 1.0 and Namespaces in XML 1.0 for well-formed XML. This package does not aim to support validation of XML, and consequently DTD (document type definition) is deliberately not supported.

Namespace support is always enabled, so the colon character is not permitted within the names of elements or attributes.

XML concepts already supported

  • Elements
  • Attributes
  • Default namespaces xmlns="namespace.com"
  • Prefixed namespaces xmlns:prefix="namespace.com"
  • Processing instructions
  • Comments (skipped and thus not retrievable)
  • CDATA sections
  • Element language xml:lang and filtering by language
  • White space indication xml:space
  • Automatic detection and decoding of UTF-8 and UTF-16 XML streams.
  • Support for custom encodings where the encoding is known before parsing, and where the client supplies a custom decoder to handle the byte-to-character conversion.
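
As a rough idea of the byte-to-character conversion a custom decoder has to perform, the sketch below decodes ISO-8859-1 (Latin-1) bytes, where every byte maps to the Unicode code point with the same value. The interface this package expects a custom decoder to implement is not shown in this document, so `decode_latin1` is a hypothetical free function rather than a decoder you could plug in directly.

```rust
/// Decode ISO-8859-1 (Latin-1) bytes into a String.
/// Each byte maps to the Unicode code point with the same numeric value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    // "café" in Latin-1: the final byte 0xE9 is the letter é.
    let bytes: [u8; 4] = [0x63, 0x61, 0x66, 0xE9];
    println!("{}", decode_latin1(&bytes));
}
```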

Examples

Reading an XML file

Suppose you want to read and extract XML from a file you know to be either UTF-8 or UTF-16 encoded. You can use XmlReader::parse_auto to read, parse, and extract the XML from the file and return either an XmlDocument or an std::io::Error.

let xml_file = File::open("test_resources/xml_utf8_BOM.xml")?;
let xml_doc = XmlReader::parse_auto(xml_file)?;

Traversing an XmlDocument

Once you have an XmlDocument you can grab an immutable reference to the root Element and then traverse through the element tree using the req (required child element) and opt (optional child element) methods to target the first child element with the specified name. And once we're pointing at the desired target, we can use element() or text() to attempt to grab the element or text-only content of the target element.

For example, let's define a simple XML structure where required elements have a name starting with "r_" and optional elements have a name starting with "o_".

<root>
    <r_Widget>
        <r_Name>Helix</r_Name>
        <o_AdditionalInfo>
            <r_ReleaseDate>2021-05-12</r_ReleaseDate>
            <r_CurrentVersion>23.10</r_CurrentVersion>
            <o_TopContributors>
                <r_Name>archseer</r_Name>
                <r_Name>the-mikedavis</r_Name>
                <r_Name>sudormrfbin</r_Name>
                <r_Name>pascalkuthe</r_Name>
                <r_Name>dsseng</r_Name>
                <r_Name>pickfire</r_Name>
            </o_TopContributors>
        </o_AdditionalInfo>
    </r_Widget>
</root>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // Let's start by grabbing a reference to the widget element.
    // Because we use req to indicate that it should be considered
    // an error if this required element is missing, the element()
    // method will return a Result<&Element, XmlError>. So we use
    // the `?` operator to throw the XmlError if it occurs.
    let widget = xml_doc.root().req("r_Widget").element()?;

    // The name is required, so we just use req again. We also
    // expect the name to contain only simple text content (not
    // mixed with other elements or processing instructions) so we
    // call text() followed by the `?` operator to throw the
    // XmlError that will be generated if either the name element
    // is not found, or if it contains non-simple content.
    let widget_name = widget.req("r_Name").text()?;

    // The info and top contributor elements are optional (may or
    // may not appear in this type of XML document) so we can use
    // the opt method to indicate that it is not an error if
    // either element is not found. Instead of a
    // Result<&Element, XmlError> this entirely optional chain
    // will cause element() to give us an Option<&Element>
    // instead, so we use `if let` to take action only if the
    // given optional chain elements all exist.
    if let Some(top_contrib_list) = widget
        .opt("o_AdditionalInfo")
        .opt("o_TopContributors")
        .element() {
        println!("Found top {} contributors!",
            top_contrib_list.elements()
                .filter(|e| e.is_named("r_Name")).count());
    }

    // If we want the release date, that's a required element
    // within an optional element. In other words, it's not an
    // error if "o_AdditionalInfo" is missing, but if it *is*
    // found then we consider it an error if it does not contain
    // "r_ReleaseDate". This is a mixed chain, involving both
    // required and optional, which means that element() will
    // return a Result<Option<&Element>, XmlError>, an Option
    // wrapped in a Result. So we use `if let` and the `?`
    // operator together.
    if let Some(release_date) = widget
            .opt("o_AdditionalInfo")
            .req("r_ReleaseDate")
            .element()? {
        println!("Release date: {}", release_date.text()?);
    }

    Ok(())
}

Note that the return type of the element() and text() methods varies depending on whether the method chain involves req or opt or both. This table summarizes the scenarios.

Chain involves      element() returns                     text() returns
only req            Result<&Element, XmlError>            Result<&str, XmlError>
only opt            Option<&Element>                      Result<Option<&str>, XmlError>
both req and opt    Result<Option<&Element>, XmlError>    Result<Option<&str>, XmlError>

Similarly, the return types of att_req and att_opt methods also vary depending on the method chain.

Chain involves      att_req(name) returns             att_opt(name) returns
only req            Result<&str, XmlError>            Result<Option<&str>, XmlError>
only opt            Result<Option<&str>, XmlError>    Option<&str>
both req and opt    Result<Option<&str>, XmlError>    Result<Option<&str>, XmlError>

An easy way to remember this: req and att_req generate an error if the element or attribute does not exist, so using them means the return type involves a Result<_, XmlError> of some sort. opt and att_opt may or may not find a value, so using them means the return type involves an Option<_> of some sort. Mixing the two (required and optional) means the return type involves a Result<Option<_>, XmlError> of some sort. Finally, text() generates an error if the target element does not have simple content (no child elements and no processing instructions), so its use always involves a Result of some sort as well.
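
The way these shapes combine can be modelled with plain standard-library types. The following sketch does not use this package at all: `LookupError`, `required`, `required_in_optional` and `report` are hypothetical names, and an `Option<Option<&str>>` stands in for "an optional parent that may contain a required child". It shows why a mixed chain ends up as a `Result<Option<_>, _>`, and how `?` and `if let` peel off one layer each.

```rust
/// Hypothetical stand-in for an extraction error.
#[derive(Debug, PartialEq)]
struct LookupError(String);

/// "Only req": a missing value is an error.
fn required<'a>(value: Option<&'a str>, name: &str) -> Result<&'a str, LookupError> {
    value.ok_or_else(|| LookupError(format!("missing required {}", name)))
}

/// "Both req and opt": the parent may be absent, but if it is present
/// then the child must be present too.
fn required_in_optional<'a>(
    parent: Option<Option<&'a str>>,
    name: &str,
) -> Result<Option<&'a str>, LookupError> {
    match parent {
        None => Ok(None),                               // optional parent absent: no error
        Some(child) => required(child, name).map(Some), // parent present: child is mandatory
    }
}

fn report(parent: Option<Option<&str>>) -> Result<String, LookupError> {
    // `?` removes the Result layer, `if let` removes the Option layer.
    if let Some(date) = required_in_optional(parent, "r_ReleaseDate")? {
        Ok(format!("Release date: {}", date))
    } else {
        Ok("No additional info".to_string())
    }
}

fn main() {
    println!("{:?}", report(Some(Some("2021-05-12"))));
    println!("{:?}", report(None));
    println!("{:?}", report(Some(None)));
}
```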

More complex traversal using XmlPath

The methods req and opt always turn their attention to the first child element with the given name. It's not possible to use them to target a sibling, say the second "Widget" within a list of "Widget" elements. To target siblings, and/or to iterate multiple elements, you instead use XmlPath. (Don't confuse this with XPath which has a similar purpose but very different implementation.)

For example, if you have XML which contains a list of employees, and you want to iterate the employees' tasks' deadlines, you could use XmlPath like this:

<roster>
    <employee>
        <name>Angelica</name>
        <department>Finance</department>
        <task-list>
            <task>
                <name>Payroll</name>
                <deadline>tomorrow</deadline>
            </task>
            <task>
                <name>Reconciliation</name>
                <deadline>Friday</deadline>
            </task>
        </task-list>
    </employee>
    <employee>
        <name>Byron</name>
        <department>Sales</department>
        <task-list>
            <task>
                <name>Close the big deal</name>
                <deadline>Saturday night</deadline>
            </task>
        </task-list>
    </employee>
    <employee>
        <name>Cat</name>
        <department>Software</department>
        <task-list>
            <task>
                <name>Fix that bug</name>
                <deadline>Maybe later this month</deadline>
            </task>
            <task>
                <name>Add that new feature</name>
                <deadline>Possibly this year</deadline>
            </task>
            <task>
                <name>Make that customer happy</name>
                <deadline>Good luck with that</deadline>
            </task>
        </task-list>
    </employee>
</roster>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    for deadline in xml_doc.root()
        .all("employee")
        .first("task-list")
        .all("task")
        .first("deadline")
        .iter() {
        println!("Found task deadline: {}", deadline.text()?);
    }
    Ok(())
}

This creates and iterates an XmlPath which represents "the first deadline element within every task within the first task-list within every employee". Based on the example XML above, this will print out all the text content of all six "deadline" elements.

Note that we could use first("employee") if we only wanted the first employee. Or we could use nth("employee", 1) if we only want the second employee (zero would point to the first). Or we could use last("employee") if we only want the last employee. Similarly, we could use first("task") if we only wanted to consider the first task in each employee's list.
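
To make the difference between "every" and "first" concrete, here is a small model of the same selection written with standard iterators over a toy tree. It is only an analogy for the semantics described above, not how XmlPath is implemented; the `Node` type and all helper functions are hypothetical.

```rust
/// A toy element: a name, some text, and child elements.
struct Node {
    name: &'static str,
    text: &'static str,
    children: Vec<Node>,
}

impl Node {
    fn leaf(name: &'static str, text: &'static str) -> Node {
        Node { name, text, children: vec![] }
    }
    fn branch(name: &'static str, children: Vec<Node>) -> Node {
        Node { name, text: "", children }
    }
    /// Every child with the given name.
    fn all<'a>(&'a self, name: &'a str) -> impl Iterator<Item = &'a Node> + 'a {
        self.children.iter().filter(move |c| c.name == name)
    }
    /// Only the first child with the given name, if there is one.
    fn first(&self, name: &str) -> Option<&Node> {
        self.children.iter().find(|c| c.name == name)
    }
}

fn task(name: &'static str, deadline: &'static str) -> Node {
    Node::branch("task", vec![Node::leaf("name", name), Node::leaf("deadline", deadline)])
}

fn employee(name: &'static str, tasks: Vec<Node>) -> Node {
    Node::branch("employee", vec![Node::leaf("name", name), Node::branch("task-list", tasks)])
}

fn sample_roster() -> Node {
    Node::branch("roster", vec![
        employee("Angelica", vec![task("Payroll", "tomorrow"), task("Reconciliation", "Friday")]),
        employee("Byron", vec![task("Close the big deal", "Saturday night")]),
    ])
}

/// First deadline within every task within the first task-list within every employee.
fn deadlines(roster: &Node) -> Vec<&'static str> {
    roster
        .all("employee")
        .filter_map(|e| e.first("task-list"))
        .flat_map(|list| list.all("task"))
        .filter_map(|t| t.first("deadline"))
        .map(|d| d.text)
        .collect()
}

fn main() {
    for deadline in deadlines(&sample_roster()) {
        println!("Found task deadline: {}", deadline);
    }
}
```

Swapping a `flat_map` over `all` for a `filter_map` over `first` at any step narrows that step to a single element, which mirrors the choice between all and first in the chain.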

Filtering elements within an XmlPath

An XmlPath not only lets you specify which child element names are of interest, but also lets you specify which xml:lang patterns are of interest, and lets you specify a required attribute name-value pair which must be found within a child element in order to include it in the iterator.

<inventory>
    <box type='games'>
        <item>
            <name xml:lang='en'>C&amp;C: Tiberian Dawn</name>
            <name xml:lang='en-US'>Command &amp; Conquer</name>
            <name xml:lang='de'>C&amp;C: Teil 1</name>
        </item>
        <item>
            <name xml:lang='en'>Doom</name>
            <name xml:lang='sr'>Zla kob</name>
            <name xml:lang='ja'>ドゥーム</name>
        </item>
        <item>
            <name xml:lang='en'>Half-Life</name>
            <name xml:lang='sr'>Polu-život</name>
        </item>
    </box>
    <box type='movies'>
        <item>
            <name xml:lang='en'>Aliens</name>
            <name xml:lang='sv-SE'>Aliens - Återkomsten</name>
            <name xml:lang='vi'>Quái Vật Không Gian 2</name>
        </item>
        <item>
            <name xml:lang='en'>The Cabin In The Woods</name>
            <name xml:lang='bg'>Хижа в гората</name>
            <name xml:lang='fr'>La cabane dans les bois</name>
        </item>
    </box>
</inventory>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    let english = ExtendedLanguageRange::new("en")?;

    for game in xml_doc.root()
        .all("box")
        .with_attribute("type", "games")
        .all("item")
        .all("name")
        .filter_lang_range(&english)
        .iter() {
        println!("Found game title in English: {}",
            game.text()?);
    }
    Ok(())
}

This will print out the names of all four English-language titles for the three games. It will skip all of the movies, and all names which are rejected by the "en" language filter. Note that this "en" filter will match both xml:lang="en" and xml:lang="en-US" so you'll get two matching name elements for the first game.
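
The reason "en" matches both "en" and "en-US" can be shown with a simplified matcher. The sketch below implements the basic filtering rule from RFC 4647 (a range matches a tag that is equal to it, or that starts with it followed by a hyphen, ignoring case). The name ExtendedLanguageRange suggests the package uses the more elaborate extended filtering, so treat this as an approximation that should agree for a simple range such as "en"; `basic_filter_matches` is a hypothetical function.

```rust
/// RFC 4647 basic filtering: does the language range match the language tag?
fn basic_filter_matches(range: &str, tag: &str) -> bool {
    if range == "*" {
        return true; // the wildcard range matches every tag
    }
    let range = range.to_ascii_lowercase();
    let tag = tag.to_ascii_lowercase();
    // Either an exact match, or the range is a prefix that ends at a '-' boundary.
    tag == range
        || tag
            .strip_prefix(range.as_str())
            .map_or(false, |rest| rest.starts_with('-'))
}

fn main() {
    for tag in ["en", "en-US", "de", "sr"] {
        println!("{} matches \"en\": {}", tag, basic_filter_matches("en", tag));
    }
}
```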

Attribute extraction

Getting the value of an attribute is done with the methods att_req (generate an error if the attribute is missing) and att_opt (no error if the attribute is missing).

For example, given this simple XML document, we can grab the attribute values easily.

<root generationDate='2023-02-09T18:10:00Z'>
    <record id='35517'>
        <temp locationId='23'>40.5</temp>
    </record>
    <record id='35518'>
        <temp locationId='36'>38.9</temp>
    </record>
</root>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // Iterate the records using an XmlPath.
    for record in xml_doc.root().all("record").iter() {
        // The record@id attribute is required (we consider it an
        // error if it is missing). So use att_req and then the
        // `?` syntax to throw any XmlError generated.
        let record_id = record.att_req("id")?;

        let temp = record.req("temp").element()?;
        let temp_value = temp.text()?;

        // The temp@locationId attribute is optional (we don't
        // consider it an error if it's not found within this
        // element). So use att_opt and then `if let` to check for
        // it.
        if let Some(loc_id) = temp.att_opt("locationId") {
            println!("Found temperature {} at {}",
                temp_value, loc_id);
        } else {
            println!("Found temperature {} at ??? location.",
                temp_value);
        }
    }
    Ok(())
}

Note: the xml:lang and xml:space values cannot be read as attribute values from an Element, because these are "special attributes" whose values are inherited by child elements (and the language is inherited by an element's attributes too). To get the effective value of these language and space properties, use the methods language_tag and white_space_handling instead.
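
To show what "inherited" means here, the sketch below resolves an effective language by walking from an element up through its ancestors until one declares xml:lang. It is a standalone model, not the package's language_tag method; the `Elem` struct, the arena layout and `effective_lang` are hypothetical. An empty xml:lang value is treated as clearing the inherited language, as the XML specification describes.

```rust
/// A toy element stored in a flat arena.
struct Elem {
    /// The xml:lang value declared directly on this element, if any.
    lang: Option<&'static str>,
    /// Index of the parent element in the arena (None for the root).
    parent: Option<usize>,
}

/// Walk up the ancestor chain until some element declares a language.
fn effective_lang(arena: &[Elem], mut index: usize) -> Option<&'static str> {
    loop {
        let elem = &arena[index];
        if let Some(lang) = elem.lang {
            // An empty value overrides the inherited language with "none".
            return if lang.is_empty() { None } else { Some(lang) };
        }
        index = elem.parent?; // reached the root without finding a declaration
    }
}

fn main() {
    let arena = vec![
        Elem { lang: Some("en"), parent: None },  // root declares "en"
        Elem { lang: None, parent: Some(0) },     // inherits "en"
        Elem { lang: Some("de"), parent: Some(1) }, // overrides with "de"
    ];
    for i in 0..arena.len() {
        println!("element {}: {:?}", i, effective_lang(&arena, i));
    }
}
```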

Namespace handling

All of the examples so far have used XML without any namespace declarations, which means that the element and attribute names are not within any namespace (or put another way, they have a namespace which has no value). Specifying the target name of an element or attribute can be done with a string slice &str when the namespace has no value. But when the target name has a namespace value, you must specify the namespace in order to target the desired element.

The most direct way of doing this is to use a (&str, &str) tuple which contains the local part and then the namespace (not the prefix) of the element name. But you can also call the pre_ns (preset or predefined namespace) method to let a cursor or XmlPath know that it should assume the given namespace value if you don't use a tuple to directly specify the namespace for each element and attribute within the method chain. An example is probably the easiest way to explain this.

<!-- The root element declares that the default namespace for it
and its descendants should be the given URI. It also declares that
any element/attribute using prefix 'pfx' belongs to a namespace
with a different URI. -->
<root xmlns='example.com/DefaultNamespace'
xmlns:pfx='example.com/OtherNamespace'>
    <one>This child element has no prefix, so it inherits
the default namespace.</one>
    <pfx:two>This child element has prefix pfx, so inherits the
other namespace.</pfx:two>
    <pfx:three pfx:key='value'>Attribute names can be prefixed
too.</pfx:three>
    <four key2='value2'>Unprefixed attribute names do *not*
inherit namespaces.</four>
    <five xmlns='' key3='value3'>The default namespace can be
cleared too.</five>
</root>

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    let root = xml_doc.root();
    // You can use a tuple to specify the local part and namespace
    // of the targeted element.
    let one = root.req(("one", "example.com/DefaultNamespace"))
        .element()?;

    // Or you can call pre_ns before a chain of
    // req/opt/first/all/nth/last method calls.
    let two = root.pre_ns("example.com/OtherNamespace")
        .req("two").element()?;

    // The effect of pre_ns continues until you call element() or
    // text(), so you can keep assuming the same namespace for
    // child elements or attributes.
    let three_key = root.pre_ns("example.com/OtherNamespace")
        .req("three").att_req("key")?;

    // Be careful if the namespace changes (or is cleared) when
    // moving down through child elements and attributes. If that
    // happens, you can call pre_ns again, or you can use a tuple
    // to explicitly state the different namespace.
    let four_key = root
        .pre_ns("example.com/DefaultNamespace")
        .req("four")
        .pre_ns("")
        .att_req("key2")?;

    // When no namespace applies to an element or attribute name,
    // you don't need to specify any namespace to target it, so
    // you don't need to use pre_ns nor a tuple. But you can
    // anyway if you want to make it more explicit that there is
    // no namespace.
    let five_key = root.req(("five", "")).att_req(("key3", ""))?;

    Ok(())
}
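
The namespace rules annotated in the XML above (unprefixed elements take the default namespace, unprefixed attributes do not) can be sketched as a small resolver. This follows the rules of Namespaces in XML 1.0 and is not taken from the package; the `Scope` alias and the three functions are hypothetical, and the empty string is used as the key for the default namespace purely for convenience.

```rust
use std::collections::HashMap;

/// Namespace declarations in scope: prefix -> URI, with "" as the default-namespace key.
type Scope = HashMap<&'static str, &'static str>;

/// Split "pfx:local" into (Some("pfx"), "local"); an unprefixed name gives (None, name).
fn split_qname(qname: &str) -> (Option<&str>, &str) {
    match qname.split_once(':') {
        Some((prefix, local)) => (Some(prefix), local),
        None => (None, qname),
    }
}

/// Resolve an element name to (local part, namespace).
/// Unprefixed elements take the default namespace, if one is declared.
fn resolve_element<'a>(scope: &Scope, qname: &'a str) -> Option<(&'a str, &'static str)> {
    let (prefix, local) = split_qname(qname);
    match prefix {
        Some(p) => scope.get(p).map(|ns| (local, *ns)), // undeclared prefix gives None
        None => Some((local, scope.get("").copied().unwrap_or(""))),
    }
}

/// Resolve an attribute name to (local part, namespace).
/// Unprefixed attributes are in no namespace, whatever the default is.
fn resolve_attribute<'a>(scope: &Scope, qname: &'a str) -> Option<(&'a str, &'static str)> {
    let (prefix, local) = split_qname(qname);
    match prefix {
        Some(p) => scope.get(p).map(|ns| (local, *ns)),
        None => Some((local, "")),
    }
}

fn main() {
    let mut scope = Scope::new();
    scope.insert("", "example.com/DefaultNamespace");
    scope.insert("pfx", "example.com/OtherNamespace");

    println!("{:?}", resolve_element(&scope, "one"));
    println!("{:?}", resolve_element(&scope, "pfx:two"));
    println!("{:?}", resolve_attribute(&scope, "pfx:key"));
    println!("{:?}", resolve_attribute(&scope, "key2"));
}
```

The resulting (local part, namespace) pairs have the same shape as the name tuples passed to req and att_req in the example above.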

It's important to note that once you call element() the effect of pre_ns vanishes. So if you do call element() in the middle of a method chain, don't forget to call pre_ns again to specify the preset namespace from that point forward.

<root xmlns='example.com/DefaultNamespace'>
    <topLevel>
        <innerLevel>
            <list>
                <item>something</item>
                <item>whatever</item>
                <item>more</item>
                <item>and so on</item>
            </list>
        </innerLevel>
    </topLevel>
</root>

// Defining a constant makes it quicker to type namespaces,
// and easier to read the code.
const NS_DEF: &str = "example.com/DefaultNamespace";

// Once the above XML is turned into an XmlDocument, it gets
// passed to this method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    // Use a chain of req calls to get to the required list, then
    // use an XmlPath to iterate however many items are found
    // within the list and count them.

    // This first attempt will actually give us the wrong number,
    // because once we call element()? we receive an `&Element`
    // reference, and the preset namespace effect is lost. So the
    // XmlPath we chain on straight after that will be searching
    // the empty namespace and won't find any matching elements
    // and will report a count of zero.
    let mistake = xml_doc
        .root()
        .pre_ns(NS_DEF)
        .req("topLevel")
        .req("innerLevel")
        .req("list")
        .element()?
        .all("item")
        .iter()
        .count();

    // You can fix the problem by either using an explicit name
    // tuple `("item", NS_DEF)` or by calling pre_ns again after
    // element() so that the XmlPath knows which namespace should
    // be used when searching for items.
    let correct = xml_doc
        .root()
        .pre_ns(NS_DEF)
        .req("topLevel")
        .req("innerLevel")
        .req("list")
        .element()?
        .pre_ns(NS_DEF)
        .all("item")
        .iter()
        .count();

    // However, to avoid confusion, it's recommended to avoid
    // including `element()` between two different method chains,
    // and to instead assign it to a variable name for clarity.
    let list = xml_doc
        .root()
        .pre_ns(NS_DEF)
        .req("topLevel")
        .req("innerLevel")
        .req("list")
        .element()?;

    let cleanest = list.all(("item", NS_DEF)).iter().count();

    Ok(())
}

Error handling

The examples above have simplified the code snippets for brevity, but in a real application you will need to handle the different error types returned by the different steps of reading/parsing and extracting from XML. Here is a compact example which shows the error handling needed for each step.

fn main() {
    // Decide what to do if either step returns an error.
    // For simplicity, we'll simply panic in this example, but in
    // a real application you may want to remap the error to the
    // type used by your application, or trigger some recovery
    // logic instead.
    let xml_doc = match read_xml() {
        Ok(d) => d,
        Err(e) => panic!("XML reading or parsing failed: {}", e),
    };
    match extract_xml(xml_doc) {
        Ok(()) => println!("Finished without errors!"),
        Err(_) => panic!("XML extraction failed!"),
    }
}

// The XML parsing methods might throw an std::io::Error, so they
// go into their own method.
fn read_xml() -> Result<XmlDocument, std::io::Error> {
    let xml = "<root><child/></root>";
    XmlReader::parse_auto(xml.as_bytes())
}

// The extraction methods might throw an XmlError, so they go into
// their own method.
fn extract_xml(xml_doc: XmlDocument) -> Result<(), XmlError> {
    let _child = xml_doc.root().req("child").element()?;
    Ok(())
}
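
If you would rather remap both error types to a single application error than panic, one common pattern is an enum with `From` implementations so that `?` converts automatically. The sketch below is self-contained and does not use this package: `ExtractError` is a hypothetical stand-in for the extraction error type, and `read_step` and `extract_step` are placeholder functions standing in for the reading and extraction steps.

```rust
use std::fmt;
use std::io;

/// Hypothetical stand-in for an extraction error type.
#[derive(Debug)]
struct ExtractError(String);

/// Application-level error wrapping both kinds of failure.
#[derive(Debug)]
enum AppError {
    Read(io::Error),
    Extract(ExtractError),
}

impl From<io::Error> for AppError {
    fn from(e: io::Error) -> Self {
        AppError::Read(e)
    }
}

impl From<ExtractError> for AppError {
    fn from(e: ExtractError) -> Self {
        AppError::Extract(e)
    }
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::Read(e) => write!(f, "reading or parsing failed: {}", e),
            AppError::Extract(e) => write!(f, "extraction failed: {}", e.0),
        }
    }
}

impl std::error::Error for AppError {}

/// Placeholder for the read/parse step, which fails with an io::Error.
fn read_step(ok: bool) -> Result<&'static str, io::Error> {
    if ok {
        Ok("<root><child/></root>")
    } else {
        Err(io::Error::new(io::ErrorKind::InvalidData, "not well-formed"))
    }
}

/// Placeholder for the extraction step, which fails with an ExtractError.
fn extract_step(doc: &str) -> Result<usize, ExtractError> {
    if doc.contains("<child") {
        Ok(1)
    } else {
        Err(ExtractError("missing child".to_string()))
    }
}

/// Both steps in one function: `?` converts each error into AppError.
fn run(ok: bool) -> Result<usize, AppError> {
    let doc = read_step(ok)?;
    let found = extract_step(doc)?;
    Ok(found)
}

fn main() {
    match run(true) {
        Ok(n) => println!("Found {} child element(s)", n),
        Err(e) => eprintln!("{}", e),
    }
}
```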
