Releases:
- 0.1.1 (Nov 13, 2024)
- 0.1.0 (Nov 11, 2024)
URL Parser
Link: https://crates.io/crates/url_parser
Docs: https://docs.rs/url_parser/0.1.0/url_parser/
URL Parser is a Rust library that parses URLs into structured components: scheme, domain, path, query and fragment.
Parsing Process
URL Parser processes a URL string and extracts the following components:
- Scheme: The protocol used (for example, "http", "https").
- Domain: The domain name or IP address of the URL (for example, "example.com").
- Path: The path that specifies the resource location on the server (for example, "/products/electronics").
- Query: The query string, introduced by '?', used for passing parameters (for example, "?query=1").
- Fragment: The section identifier, introduced by '#', referring to a part of the resource (for example, "#reviews").
The parsed components are essential for applications needing to analyze, validate or manipulate URLs.
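As a rough illustration of how these components fit together, the sketch below stores them in a struct and reassembles the original URL. The `UrlParts` type and its field names are hypothetical and are not taken from the crate's API. The optional parts are assumed to keep their leading delimiter, as in the examples above.

```rust
use std::fmt;

/// Hypothetical container for the five components listed above.
/// Optional parts keep their leading delimiter ('/', '?', '#').
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UrlParts {
    pub scheme: String,
    pub domain: String,
    pub path: Option<String>,
    pub query: Option<String>,
    pub fragment: Option<String>,
}

impl fmt::Display for UrlParts {
    /// Reassembles the URL in the order scheme, "://", domain, path, query, fragment.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}://{}", self.scheme, self.domain)?;
        // Missing optional parts contribute nothing to the output.
        for part in [&self.path, &self.query, &self.fragment] {
            if let Some(text) = part {
                f.write_str(text)?;
            }
        }
        Ok(())
    }
}
```

Calling `to_string()` on a value with every field set yields the full URL, while a value with only `scheme` and `domain` yields something like `https://example.com`.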
Grammar
url
url = { scheme ~ "://" ~ domain ~ path? ~ query? ~ fragment? }
Purpose: Defines the overall structure of the URL.
Explanation: The url rule is the basic rule for parsing a full URL. It expects:
- scheme followed by ://
- domain
- optional path
- optional query
- optional fragment
'?' after some components (path?, query?, fragment?) means that these components are optional.
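To make the sequencing concrete, here is a hand-written sketch that follows the same order as the url rule using only the standard library. It is not the crate's implementation: `Url`, `parse_url` and `split_at_first` are hypothetical names, and the sketch assumes the grammar defines no implicit WHITESPACE rule.

```rust
/// Hypothetical result type; field names follow the grammar's rule names.
#[derive(Debug, PartialEq, Eq)]
pub struct Url<'a> {
    pub scheme: &'a str,
    pub domain: &'a str,
    pub path: Option<&'a str>,
    pub query: Option<&'a str>,
    pub fragment: Option<&'a str>,
}

/// Splits `input` just before the first character for which `stop` is true.
fn split_at_first(input: &str, stop: impl Fn(char) -> bool) -> (&str, &str) {
    let end = input.find(stop).unwrap_or(input.len());
    input.split_at(end)
}

pub fn parse_url(input: &str) -> Option<Url<'_>> {
    // scheme ~ "://": one or more ASCII alphanumerics, then the literal separator.
    let (scheme, rest) = split_at_first(input, |c| !c.is_ascii_alphanumeric());
    if scheme.is_empty() {
        return None;
    }
    let rest = rest.strip_prefix("://")?;

    // domain: mandatory, at least one character before '/' or '?'.
    let (domain, rest) = split_at_first(rest, |c| c == '/' || c == '?');
    if domain.is_empty() {
        return None;
    }

    // path?: present only when the next character is '/'.
    let (path, rest) = if rest.starts_with('/') {
        let (p, r) = split_at_first(rest, |c| c == '?' || c == '#');
        (Some(p), r)
    } else {
        (None, rest)
    };

    // query?: present only when the next character is '?'.
    let (query, rest) = if rest.starts_with('?') {
        let (q, r) = split_at_first(rest, |c| c == '#');
        (Some(q), r)
    } else {
        (None, rest)
    };

    // fragment?: everything from '#' to the end of the input.
    let fragment = if rest.starts_with('#') { Some(rest) } else { None };

    Some(Url { scheme, domain, path, query, fragment })
}
```

The optional components map naturally onto `Option`: an input such as `http://localhost` produces `None` for path, query and fragment, while an input without `://` is rejected outright.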
scheme
scheme = @{ ASCII_ALPHANUMERIC+ }
Purpose: Identifies the URL scheme or protocol (for example, "http", "https", "ftp").
Explanation: The scheme rule matches one or more ASCII alphanumeric characters (letters and digits). The '@' modifier marks the rule as atomic: in pest, no implicit whitespace is skipped between its parts and any inner rules do not produce separate tokens, so the scheme is captured as a single unit.
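The behaviour of the scheme rule can be approximated with a few lines of standard-library Rust. The `match_scheme` function below is a hypothetical helper written for illustration, not part of the crate. It returns the matched prefix together with the unconsumed remainder.

```rust
/// Mirrors `scheme = @{ ASCII_ALPHANUMERIC+ }`: returns the matched prefix and
/// the remaining input, or `None` when not even one character matches.
pub fn match_scheme(input: &str) -> Option<(&str, &str)> {
    // The match ends at the first character that is not an ASCII letter or digit.
    let end = input
        .find(|c: char| !c.is_ascii_alphanumeric())
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some(input.split_at(end))
    }
}
```

Because only letters and digits are accepted, a scheme such as `git+ssh` stops matching at the `+`, so the full url rule would then fail to find `://` at that position.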
domain
domain = @{ (!("/" | "?") ~ ANY)+ }
Purpose: Defines the domain part of the URL (for example, "example.com", "localhost", "192.168.0.1").
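Explanation: The domain rule matches one or more characters that are not '/' or '?'. The ANY keyword matches any single character, including a newline. This allows the domain to include subdomains, IP addresses or hostnames, as long as they do not contain '/' or '?'. Note that '#' is not excluded, so a fragment that directly follows the domain (with no path or query in between) is absorbed into the domain.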
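A small standard-library sketch can show what the domain rule accepts. `match_domain` is a hypothetical helper, not a function exported by the crate. It consumes input up to the first '/' or '?' and nothing else ends the match.

```rust
/// Mirrors `domain = @{ (!("/" | "?") ~ ANY)+ }`: consumes characters until
/// the first '/' or '?', and requires at least one character.
pub fn match_domain(input: &str) -> Option<(&str, &str)> {
    let end = input
        .find(|c: char| c == '/' || c == '?')
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some(input.split_at(end))
    }
}
```

With this logic, `example.com/path` splits into `example.com` and `/path`, whereas `example.com#reviews` is consumed whole because '#' does not terminate the match.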
path
path = { "/" ~ (!("?" | "#") ~ ANY)* }
Purpose: Defines the path component of the URL, which indicates the location of a resource on the server (for example, "/products/electronics").
Explanation: The path rule begins with '/' and then allows zero or more characters that are not the '?' or '#' character. This excludes query parameters and fragment identifiers from the path part.
query
query = { "?" ~ (!"#" ~ ANY)* }
Purpose: Defines a query string in the URL that is typically used to pass parameters (for example, "?query=1").
Explanation: The query rule starts with '?' and allows zero or more characters that are not '#'. This ensures that the query string does not include a fragment character (#) that belongs to a URL fragment component.
fragment
fragment = { "#" ~ ANY* }
Purpose: Defines a section identifier that points to a specific section within the resource (for example, "#reviews").
Explanation: The fragment rule starts with '#' and matches every remaining character, so the fragment always extends to the end of the input.
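The path, query and fragment rules share one shape: a leading delimiter followed by zero or more characters that are not in a set of stop characters. The sketch below expresses that shape once and derives the three matchers from it. All function names are hypothetical and the code is only an approximation of the grammar, not the crate's implementation.

```rust
/// Matches `lead` followed by zero or more characters that are not in `stops`.
/// Returns the matched text (including `lead`) and the unconsumed remainder.
fn match_delimited<'a>(
    input: &'a str,
    lead: char,
    stops: &[char],
) -> Option<(&'a str, &'a str)> {
    if !input.starts_with(lead) {
        return None;
    }
    // Search for a stop character only after the leading delimiter.
    let body_start = lead.len_utf8();
    let end = input[body_start..]
        .find(stops)
        .map_or(input.len(), |i| body_start + i);
    Some(input.split_at(end))
}

/// path = { "/" ~ (!("?" | "#") ~ ANY)* }
pub fn match_path(input: &str) -> Option<(&str, &str)> {
    match_delimited(input, '/', &['?', '#'])
}

/// query = { "?" ~ (!"#" ~ ANY)* }
pub fn match_query(input: &str) -> Option<(&str, &str)> {
    match_delimited(input, '?', &['#'])
}

/// fragment = { "#" ~ ANY* }
pub fn match_fragment(input: &str) -> Option<(&str, &str)> {
    match_delimited(input, '#', &[])
}
```

Chaining the three in order on `/products/electronics?query=1#reviews` peels off the path, then the query, then the fragment, each call receiving the remainder left by the previous one.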