url_parser

2 releases

0.1.1	Nov 13, 2024
0.1.0	Nov 11, 2024

MIT license

14KB
155 lines

URL Parser

Link: https://crates.io/crates/url_parser
Docs: https://docs.rs/url_parser/0.1.0/url_parser/

URL Parser is a Rust crate that parses URLs into structured components: scheme, domain, path, query and fragment.

Parsing Process

URL Parser processes a URL string and extracts the following components:

Scheme: The protocol used (for example, "http", "https").
Domain: The domain name or IP address of the URL (for example, "example.com").
Path: The path that specifies the resource location on the server (for example, "/products/electronics").
Query: The query string following a '?', used for passing parameters (for example, "?query=1").
Fragment: The section identifier following a '#', referring to a part of the resource (for example, "#reviews").

The parsed components are useful to applications that need to analyze, validate or manipulate URLs.

Grammar

url

url = { scheme ~ "://" ~ domain ~ path? ~ query? ~ fragment? }

Purpose: Defines the overall structure of the URL.

Explanation: The url rule is the basic rule for parsing a full URL. It expects:

scheme followed by ://
domain
optional path
optional query
optional fragment

'?' after some components (path?, query?, fragment?) means that these components are optional.
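The order that the url rule imposes can be illustrated without a parser generator. The following is a minimal hand-written sketch using only the standard library; it is not the crate's API. The names UrlParts and split_url are hypothetical, and the sketch assumes the grammar defines no implicit WHITESPACE rule.

```rust
/// Components in the order the `url` rule lists them.
/// Hypothetical type for illustration; not the crate's API.
#[derive(Debug, PartialEq)]
struct UrlParts<'a> {
    scheme: &'a str,
    domain: &'a str,
    path: Option<&'a str>,
    query: Option<&'a str>,
    fragment: Option<&'a str>,
}

/// Hand-written approximation of the `url` rule (assumes no implicit whitespace).
fn split_url(input: &str) -> Option<UrlParts<'_>> {
    // scheme: one or more ASCII letters or digits.
    let scheme_len = input
        .find(|c: char| !c.is_ascii_alphanumeric())
        .unwrap_or(input.len());
    if scheme_len == 0 {
        return None;
    }
    let (scheme, rest) = input.split_at(scheme_len);

    // The literal "://" must follow the scheme.
    let rest = rest.strip_prefix("://")?;

    // domain: one or more characters up to the first '/' or '?'.
    let domain_len = rest
        .find(|c: char| c == '/' || c == '?')
        .unwrap_or(rest.len());
    if domain_len == 0 {
        return None;
    }
    let (domain, rest) = rest.split_at(domain_len);

    // Optional path: starts with '/', runs up to the first '?' or '#'.
    let (path, rest) = if rest.starts_with('/') {
        let end = rest
            .find(|c: char| c == '?' || c == '#')
            .unwrap_or(rest.len());
        (Some(&rest[..end]), &rest[end..])
    } else {
        (None, rest)
    };

    // Optional query: starts with '?', runs up to the first '#'.
    let (query, rest) = if rest.starts_with('?') {
        let end = rest.find('#').unwrap_or(rest.len());
        (Some(&rest[..end]), &rest[end..])
    } else {
        (None, rest)
    };

    // Optional fragment: starts with '#', takes everything that remains.
    let fragment = if rest.starts_with('#') { Some(rest) } else { None };

    Some(UrlParts { scheme, domain, path, query, fragment })
}
```

Calling split_url on "https://example.com/products/electronics?query=1#reviews" yields the five components described above, with the '?' and '#' delimiters kept as part of the query and fragment, as in the grammar.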

scheme

scheme = @{ ASCII_ALPHANUMERIC+ }

Purpose: Identifies the URL scheme or protocol (for example, "http", "https", "ftp").

Explanation: The scheme rule matches one or more ASCII alphanumeric characters (letters and digits). The '@' symbol marks the rule as atomic in pest: no implicit whitespace or comments are allowed between its tokens and inner rules produce no tokens of their own, so the scheme is captured as a single unit.
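What the scheme rule accepts can be sketched with a few lines of standard-library Rust. The helper name match_scheme is hypothetical and is not part of the crate; it only mirrors the "one or more ASCII alphanumerics" condition.

```rust
/// Returns the leading run of ASCII letters and digits, or `None` if it is empty.
/// Hypothetical helper mirroring `scheme = @{ ASCII_ALPHANUMERIC+ }`.
fn match_scheme(input: &str) -> Option<&str> {
    // Index of the first character that is not an ASCII letter or digit.
    let end = input
        .find(|c: char| !c.is_ascii_alphanumeric())
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some(&input[..end])
    }
}
```

Under this rule a scheme may start with a digit, and characters such as '+', '-' and '.' end the match. For "git+ssh://host" only "git" is matched, so the enclosing url rule would then fail to find "://". RFC 3986 is stricter about the first character and more permissive about the rest.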

domain

domain = @{ (!("/" | "?") ~ ANY)+ }

Purpose: Defines the domain part of the URL (for example, "example.com", "localhost", "192.168.0.1").

Explanation: The domain rule matches one or more characters that are not '/' or '?'. In pest, ANY matches any single Unicode character, including a newline, so the negative lookahead !("/" | "?") is the only restriction. This allows the domain to include subdomains, IP addresses or hostnames. It also means that a port, user info or a '#' placed directly after the host is captured as part of the domain.
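The effect of the negative lookahead can be reproduced by scanning for the first '/' or '?'. This is a minimal sketch with a hypothetical helper named match_domain, written against the standard library only; it is not how the crate itself is implemented.

```rust
/// Returns everything up to the first '/' or '?', or `None` if that span is empty.
/// Hypothetical helper mirroring `domain = @{ (!("/" | "?") ~ ANY)+ }`.
fn match_domain(input: &str) -> Option<&str> {
    // Only '/' and '?' stop the match; every other character is accepted.
    let end = input
        .find(|c: char| c == '/' || c == '?')
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some(&input[..end])
    }
}
```

Because '#' is not in the excluded set, match_domain("example.com#reviews") returns the whole string, and match_domain("user@host:8080/x") returns "user@host:8080". Callers that need a bare hostname would have to split these further.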

path

path = { "/" ~ (!("?" | "#") ~ ANY)* }

Purpose: Defines the path component of the URL, which indicates the location of a resource on the server (for example, "/products/electronics").

Explanation: The path rule begins with '/' and then allows zero or more characters that are not the '?' or '#' character. This excludes query parameters and fragment identifiers from the path part.
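As a rough illustration, the same boundary logic can be written by hand. The helper match_path below is hypothetical and uses only the standard library.

```rust
/// Returns the path starting at '/' and ending before the first '?' or '#'.
/// Hypothetical helper mirroring `path = { "/" ~ (!("?" | "#") ~ ANY)* }`.
fn match_path(input: &str) -> Option<&str> {
    // The rule requires a leading '/'.
    if !input.starts_with('/') {
        return None;
    }
    // Zero or more further characters, stopping at a query or fragment marker.
    let end = input
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(input.len());
    Some(&input[..end])
}
```

A lone "/" is a valid path under this rule, since the repetition allows zero characters after the slash.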

query

query = { "?" ~ (!"#" ~ ANY)* }

Purpose: Defines a query string in the URL that is typically used to pass parameters (for example, "?query=1").

Explanation: The query rule starts with '?' and allows zero or more characters that are not '#'. The match therefore includes the leading '?' and stops before the '#' that begins the fragment component.

fragment

fragment = { "#" ~ ANY* }

Purpose: Defines a section identifier that points to a specific section within the resource (for example, "#reviews").

Explanation: The fragment rule starts with '#' and matches every remaining character, so a fragment always extends to the end of the input.
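Taken together, the query and fragment rules split whatever follows the path. A minimal sketch of that split, assuming a hypothetical helper named split_query_fragment and only the standard library:

```rust
/// Splits the text after the path into an optional query and an optional fragment.
/// Hypothetical helper mirroring `query? ~ fragment?` from the `url` rule.
fn split_query_fragment(tail: &str) -> (Option<&str>, Option<&str>) {
    // query: a '?' followed by everything up to, but not including, the first '#'.
    let (query, rest) = if tail.starts_with('?') {
        let end = tail.find('#').unwrap_or(tail.len());
        (Some(&tail[..end]), &tail[end..])
    } else {
        (None, tail)
    };
    // fragment: a '#' followed by all remaining characters.
    let fragment = if rest.starts_with('#') { Some(rest) } else { None };
    (query, fragment)
}
```

Since the fragment takes all remaining characters, a later '?' or '#' inside it stays in the fragment: for "#a?b#c" the query is absent and the fragment is the whole string.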

Dependencies

~4MB
~73K SLoC