#pest-parser #header #body #validate #grammar #structured

bin+lib email_pest_parser

An email parser that parse entire email and validates email addresses

3 releases

0.1.2 Nov 16, 2024
0.1.1 Nov 12, 2024
0.1.0 Nov 12, 2024

#79 in Email

Download history 185/week @ 2024-11-08 147/week @ 2024-11-15 17/week @ 2024-11-22 2/week @ 2024-11-29 3/week @ 2024-12-06

179 downloads per month

MIT license

13KB
105 lines

Email Pest Parser

Link: https://crates.io/crates/email_pest_parser
Docs: https://docs.rs/email_pest_parser/latest/email_pest_parser/

A Rust-based email parser using the pest parsing library, designed to extract and validate various components such as headers, email addresses, and the body content of an email.

Parsing Logic

The grammar is defined using pest and covers the following components:

  • Headers: Key-value pairs that represent the metadata of an email, such as "From," "To," etc.
  • Email Addresses: Extracted from specific headers like "From" and "To." Supports standard formats for usernames and domains.
  • Body: The main content of the email, supporting multiple lines and any type of character.

How It Works

The parser processes emails by breaking them into components using the predefined grammar rules. Then, it encapsulates the results in a structured ParsedEmail object for easy access to headers, email addresses, and body content.

Example

From: sender@example.com
To: recipient@example.com
Subject: Meeting Update

Hello,

This is a reminder for our meeting scheduled tomorrow at 10 AM.
Please let us know if you have any questions.

Best regards,
Sender

Result

ParsedEmail {
    headers: [
        (
            "From",
            "sender@example.com",
        ),
        (
            "To",
            "recipient@example.com",
        ),
        (
            "Subject",
            "Meeting Update",
        ),
    ],
    body: "Hello,\r\n\r\nThis is a reminder for our meeting scheduled tomorrow at 10 AM.\r\nPlease let us know if you have any questions.\r\n\r\nBest regards,\r\nSender",
    email_addresses: [
        "sender@example.com",
        "recipient@example.com",
    ],
}

Grammar

email

email = { headers ~ NEWLINE ~ body }

The email rule consists of two main parts: the headers and the body. These are separated by a NEWLINE. The headers are a series of header lines, and the body contains the actual message content.

headers

headers = { (header_line ~ NEWLINE)* }

The headers rule consists of one or more header_line rules, each followed by a NEWLINE.

header_line

header_line = { field_name ~ ": " ~ field_value }

A header_line consists of a field_name, followed by a colon and a space (": "), and then a field_value. The field_name represents the name of the header (e.g., Subject, From), and the field_value represents the value of the header.

field_name

field_name = { (ASCII_ALPHANUMERIC | "-" | "_")+ }

A field_name can consist of one or more characters, which can be ASCII alphanumeric characters (letters and numbers), or the special characters "-" (hyphen) or "_" (underscore).

field_value

field_value = { (!NEWLINE ~ ANY)+ }

A field_value consists of any characters except a NEWLINE. The ANY rule matches any character, and the value can have one or more characters.

body

body = { (!EOI ~ ANY)* }

The body rule matches any character (ANY) except the end of input (EOI), repeated zero or more times. This rule defines the actual content of the email after the headers section.

email_address

email_address = { username ~ "@" ~ domain }

An email_address consists of a username, followed by the "@" symbol, and then a domain. The domain is further broken down into subdomains.

username

username = { (ASCII_ALPHANUMERIC | "_" | "." | "-")+ }

A username can consist of one or more characters, which can be ASCII alphanumeric characters, or the special characters "_", ".", and "-".

domain

domain = { subdomain ~ ("." ~ subdomain)+ }

A domain consists of one or more subdomain rules, separated by periods ("."). Each subdomain is defined by the subdomain rule.

subdomain

subdomain = { ASCII_ALPHANUMERIC+ }

A subdomain consists of one or more ASCII alphanumeric characters. Subdomains are typically used in the domain name (e.g., gmail in gmail.com).

Dependencies

~2.2–3MB
~58K SLoC