#match #string #optimization #byte #trie

macro no-std lighter-derive

Procedural macros for the lighter crate

1 unstable release

0.1.0 Feb 28, 2022

#56 in #bytes

48 downloads per month
Used in lighter

CC0 license

13KB
195 lines

lighter - slightly better than a match 🔥

Say we have some Rust code that uses a match statement on a string:

pub fn greeting_id(greeting: &str) -> usize {
    match greeting {
        "hi" => 0,
        "hey" => 1,
        "hello" => 2,
        _ => 3,
    }
}

It turns out match statements with strings for patterns are compiled into a series of if statements, such that the above code is essentially equivalent to this:

pub fn greeting_id(greeting: &str) -> usize {
    if greeting == "hi" {
        0
    } else if greeting == "hey" {
        1
    } else if greeting == "hello" {
        2
    } else {
        3
    }
}

rustc and LLVM have many tricks to optimize match statements where the arguments are primitives such as integers or enums, but Rust doesn't seem to treat strings as "special" the way it does with numbers or enums in match statements. Instead, it falls back to calling the relevant PartialEq implementation. This can be inefficient, especially if we have many strings and some of them share common prefixes. Here is greeting_id modified to use lighter instead of a plain match:

use lighter::lighter;

pub fn greeting_id(greeting: &str) -> usize {
    lighter! {
        match greeting {
            "hi" => 0,
            "hey" => 1,
            "hello" => 2,
            _ => 3,
        }
    }
}

During compilation, lighter essentially turns the flat match statement into a static trie, where each match statement considers one character at a time:

pub fn greeting_id(greeting: &str) -> usize {
    let mut bytes = greeting.bytes();
    match bytes.next() {
        Some(b'h') => match bytes.next() {
            Some(b'i') => match bytes.next() {
                None => 0,
                _ => 3,
            },
            Some(b'e') => match bytes.next() {
                Some(b'y') => match bytes.next() {
                    None => 1,
                    _ => 3,
                },
                Some(b'l') => match bytes.next() {
                    Some(b'l') => match bytes.next() {
                        Some(b'o') => match bytes.next() {
                            None => 2,
                            _ => 3,
                        },
                        _ => 3,
                    },
                    _ => 3,
                },
                _ => 3,
            },
            _ => 3,
        },
        _ => 3,
    }
}

This may look somewhat gnarly compared to the original match without lighter, but by using byte literals (which are actually just u8s) we allow Rust and LLVM to use their full arsenal of optimizations for switches mapping numbers to numbers, resulting in better code. The nested-match structure also means we only have to compare each character once: with a plain match, greeting_id compares its input against both the h in "hi" and the h in "hello", whereas with lighter, greeting_id matches an h once and knows the suffixes it is looking for are either i or ello.

What's more, lighter doesn't just work with strings or slices: you can match on the output of any iterator of u8, or anything that can be turned into one. lighter can also match strings matching a prefix instead of or in addition to entire strings; see lighter/examples/is_whitespace_2.rs.

Dependencies

~3.5MB
~73K SLoC