1 unstable release
0.1.4 | Nov 17, 2024 |
---|---|
0.1.3 |
|
0.1.2 |
|
0.1.1 |
|
0.1.0 |
|
XML Language Tag Parser
Brief Description
The XML Language Tag Parser is a Rust-based parser for XML documents, focused on extracting tags, elements, attributes, and content. It is intended for applications that need to validate and manipulate structured content. The parsed result is a tree of elements, each carrying its attributes and content, that downstream code can use directly.
How the Parsing Works
The parser uses the `pest` library to define grammar rules for XML-like tags. Parsing proceeds in the following steps:
- Parsing the XML document: The input XML string is parsed with the `pest` parser library, which matches it against the predefined grammar rules.
- Element parsing: Each XML element (e.g., `<div>`, `<p>`) is identified. The parser extracts its tag name and attributes.
- Attribute handling: The parser identifies the attributes within each element, ensuring they follow the correct format (e.g., `id="123"`).
- Content handling: The parser processes both text content and nested elements.
- Error handling: If the XML is malformed (for example, a missing closing tag or an invalid attribute), the parser reports an error so the input can be corrected.
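To make the steps above concrete, here is a minimal hand-written sketch of the same flow using only the standard library. It does not use `pest` and is not the crate's implementation. The `Cursor` helper, the `ParseError` variants, and the function names are hypothetical. The `Element` and `Attribute` field names are taken from the example output later in this README, while their types are assumed. Trimming whitespace and concatenating text chunks are also assumptions.

```rust
#[derive(Debug, PartialEq)]
pub struct Attribute {
    pub name: String,
    pub value: String,
}

#[derive(Debug, PartialEq)]
pub struct Element {
    pub tag_name: String,
    pub attributes: Vec<Attribute>,
    pub content: Vec<Element>,
    pub text: Option<String>,
}

/// Hypothetical error type; the crate's real error type is not shown here.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    UnexpectedEnd,
    Expected(&'static str),
    InvalidName,
    MismatchedClosingTag { expected: String, found: String },
    TrailingInput,
}

/// Tracks the part of the input that has not been consumed yet.
struct Cursor<'a> {
    rest: &'a str,
}

impl<'a> Cursor<'a> {
    /// Consumes `prefix` if the remaining input starts with it.
    fn eat(&mut self, prefix: &str) -> bool {
        let rest = self.rest;
        match rest.strip_prefix(prefix) {
            Some(tail) => {
                self.rest = tail;
                true
            }
            None => false,
        }
    }

    /// Like `eat`, but turns a mismatch into an error.
    fn expect(&mut self, token: &'static str) -> Result<(), ParseError> {
        if self.eat(token) {
            Ok(())
        } else if self.rest.is_empty() {
            Err(ParseError::UnexpectedEnd)
        } else {
            Err(ParseError::Expected(token))
        }
    }

    /// Consumes and returns the longest prefix whose characters satisfy `keep`.
    fn take_while(&mut self, keep: impl Fn(char) -> bool) -> &'a str {
        let rest = self.rest;
        let end = rest.find(|c: char| !keep(c)).unwrap_or(rest.len());
        let (head, tail) = rest.split_at(end);
        self.rest = tail;
        head
    }
}

/// Reads a tag name, or an attribute name when `is_attribute` is true (colons allowed).
fn parse_name(cur: &mut Cursor<'_>, is_attribute: bool) -> Result<String, ParseError> {
    let name = cur.take_while(|c| {
        c.is_ascii_alphanumeric() || c == '-' || c == '_' || (is_attribute && c == ':')
    });
    let starts_ok = is_attribute || name.starts_with(|c: char| c.is_ascii_alphabetic());
    if name.is_empty() || !starts_ok {
        Err(ParseError::InvalidName)
    } else {
        Ok(name.to_string())
    }
}

fn parse_element(cur: &mut Cursor<'_>) -> Result<Element, ParseError> {
    // Element parsing: opening tag and its name.
    cur.expect("<")?;
    let tag_name = parse_name(cur, false)?;
    // Attribute handling: zero or more name="value" pairs.
    let mut attributes = Vec::new();
    loop {
        cur.rest = cur.rest.trim_start();
        if cur.eat("/>") {
            return Ok(Element { tag_name, attributes, content: vec![], text: None });
        }
        if cur.eat(">") {
            break;
        }
        let name = parse_name(cur, true)?;
        cur.expect("=")?;
        cur.expect("\"")?;
        let value = cur.take_while(|c| c != '"').to_string();
        cur.expect("\"")?;
        attributes.push(Attribute { name, value });
    }
    // Content handling: text and nested elements until the closing tag.
    let mut content = Vec::new();
    let mut text = String::new();
    loop {
        text.push_str(cur.take_while(|c| c != '<' && c != '>'));
        if cur.eat("</") {
            break;
        }
        if cur.rest.is_empty() {
            return Err(ParseError::UnexpectedEnd); // missing closing tag
        }
        content.push(parse_element(cur)?);
    }
    let found = parse_name(cur, false)?;
    if found != tag_name {
        return Err(ParseError::MismatchedClosingTag { expected: tag_name, found });
    }
    cur.expect(">")?;
    let text = text.trim();
    let text = if text.is_empty() { None } else { Some(text.to_string()) };
    Ok(Element { tag_name, attributes, content, text })
}

pub fn parse_document(input: &str) -> Result<Element, ParseError> {
    let mut cur = Cursor { rest: input.trim() };
    let root = parse_element(&mut cur)?;
    if cur.rest.is_empty() { Ok(root) } else { Err(ParseError::TrailingInput) }
}
```

In this sketch, `parse_document("<a><b></a>")` returns a `MismatchedClosingTag` error and `parse_document("<a>")` returns `UnexpectedEnd`, which illustrates the "missing closing tag" case from the error-handling step.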
Grammar Rules
- Element Rule: an element consists of an opening tag (`open_tag`), content (`content`), and a closing tag (`close_tag`).
- Open Tag Rule: the `open_tag` starts with a `<`, followed by a `tag_name` (the name of the element), optional attributes (if any), and ends with a `>`.
- Close Tag Rule: the `close_tag` starts with `</`, followed by a `tag_name`, and ends with a `>`.
- Self-Closing Tag Rule: a self-closing tag is an element that doesn't require a separate closing tag. It ends with `/>`.
- Tag Name Rule: the tag name must start with an alphabetic character and may include alphanumeric characters, hyphens (`-`), or underscores (`_`).
- Attributes Rules: attributes are key-value pairs within the opening tag. Multiple attributes can be present, separated by whitespace. An attribute consists of a name, an equals sign (`=`), and a value enclosed in double quotes.
- Attribute Name Rule: the attribute name can contain alphanumeric characters, hyphens (`-`), underscores (`_`), or colons (`:`).
- Attribute Value Rule: the attribute value consists of any characters except for the double quote (`"`).
- Content Rule: the `content` rule defines what can appear inside an element. It can be a mixture of other elements and text.
- Text Rule: text is the content inside an element that is not a tag. It can include any characters except for `<` and `>`.
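The character-level rules above can be expressed as plain Rust predicates. This is only an illustrative sketch, not the crate's `pest` grammar. The function names are hypothetical, and two details are assumptions because the rules do not state them: "alphabetic" and "alphanumeric" are read as ASCII-only, and attribute names must be non-empty.

```rust
/// Tag Name Rule: first character alphabetic, the rest alphanumeric, `-` or `_`.
pub fn is_valid_tag_name(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) if first.is_ascii_alphabetic() => {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
        }
        _ => false, // empty, or starts with a non-alphabetic character
    }
}

/// Attribute Name Rule: alphanumeric characters, `-`, `_` or `:`.
/// Rejecting the empty string is an assumption of this sketch.
pub fn is_valid_attribute_name(s: &str) -> bool {
    !s.is_empty()
        && s.chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | ':'))
}

/// Attribute Value Rule: any characters except the double quote.
pub fn is_valid_attribute_value(s: &str) -> bool {
    !s.contains('"')
}

/// Text Rule: any characters except `<` and `>`.
pub fn is_valid_text(s: &str) -> bool {
    !s.contains('<') && !s.contains('>')
}
```

For example, `is_valid_tag_name("my-tag_1")` holds while `is_valid_tag_name("1div")` does not, and `is_valid_attribute_name("xml:lang")` holds because colons are allowed in attribute names.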
Example: Parsing a Simple XML with Attributes
For the input:
cargo run "<person id=\"1\" age=\"30\"><name>John Doe</name></person>"
Parsed XML Structure:
Element {
    tag_name: "person",
    attributes: [
        Attribute {
            name: "id",
            value: "1",
        },
        Attribute {
            name: "age",
            value: "30",
        },
    ],
    content: [
        Element {
            tag_name: "name",
            attributes: [],
            content: [],
            text: Some(
                "John Doe",
            ),
        },
    ],
    text: None,
}
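The listing above looks like Rust's pretty-printed `Debug` output (`{:#?}`). The following sketch shows data structures that would print in that shape, together with one way downstream code might read them. The field names come from the listing. The field types, the `attribute` helper, and `example_tree` are assumptions made for illustration, and the crate's actual definitions may differ.

```rust
/// A single name="value" pair from an opening tag.
#[derive(Debug, Clone, PartialEq)]
pub struct Attribute {
    pub name: String,
    pub value: String,
}

/// One parsed element; field names mirror the debug output above.
#[derive(Debug, Clone, PartialEq)]
pub struct Element {
    pub tag_name: String,
    pub attributes: Vec<Attribute>,
    pub content: Vec<Element>,
    pub text: Option<String>,
}

impl Element {
    /// Hypothetical helper: value of the first attribute called `name`, if any.
    pub fn attribute(&self, name: &str) -> Option<&str> {
        self.attributes
            .iter()
            .find(|a| a.name == name)
            .map(|a| a.value.as_str())
    }
}

/// Builds by hand the tree shown for the `<person>` example input.
pub fn example_tree() -> Element {
    Element {
        tag_name: "person".to_string(),
        attributes: vec![
            Attribute { name: "id".to_string(), value: "1".to_string() },
            Attribute { name: "age".to_string(), value: "30".to_string() },
        ],
        content: vec![Element {
            tag_name: "name".to_string(),
            attributes: vec![],
            content: vec![],
            text: Some("John Doe".to_string()),
        }],
        text: None,
    }
}
```

Printing the result with `println!("{:#?}", example_tree())` produces output in the same layout as the listing, and `example_tree().attribute("age")` yields `Some("30")`.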