#search-query #query-parser #lexer #search-engine #search #parser #ui

whydrogen

A slightly opinionated search query parser/lexer

1 unstable release

0.1.0 Jul 6, 2024

#2018 in Parser implementations


Used in unobtanium-viewer

LGPL-3.0-only

23KB
473 lines

whydrogen: Search Query Parser Library

Whydrogen is a slightly opinionated search query parser written in Rust.

The syntax should be familiar to anyone who knows a bit about the "Advanced search" features of several popular search engines. It supports plain words, "Quoted tokens", key:value pairs, !prefixed-values, -"negated" -key:"values and" -!prefixes.

Which keywords and prefixes are recognized is fully customizable. The syntax is Unicode-aware.

For a detailed syntax description see the documentation in the lib.rs file.

License

whydrogen is licensed under the LGPL-3.0-only license.


lib.rs:

Whydrogen is a slightly opinionated parser for search queries from humans.

Its main purpose is converting strings of text from a search entry into an easier-to-process list of tokens/lexemes (depending on what you are doing with them).

The search syntax in a nutshell:

  • Unless something else applies, everything separated by a space is a word.
  • Character classification is based on Unicode category groups.
  • Any sequence of whitespace (Unicode whitespace category plus newline and tab) or the start or end of a query can be a token separator.
  • Phrases are quoted sequences of text.
    • Supported pairs of quotes are: "", »…« and «…»; this will be expanded in the future.
    • A phrase can only start after a token separator.
    • The closing quote must be followed by a token separator (if not, it is taken as part of the phrase).
    • Any token separator inside a quote is taken as its literal character (as in most quoting syntaxes).
    • Inside a phrase a backslash \ can be used to escape the closing quote character (independent of any token separators).
    • A double backslash \\ can be used to unambiguously represent a backslash inside the quotes.
    • A backslash followed by anything else is taken as is.
    • A minus - before the first quote marks the phrase as inverted.
    • An unclosed phrase is ignored: the part with the opening quote is treated as a word, and parsing continues as usual after that.
  • Key-Value pairs are a keyword and an optionally quoted value separated by a colon :.
    • A keyword may contain alphanumeric characters (Unicode letters or numbers) and the characters -, _ and . but may only start with an alphanumeric character.
    • Valid keywords are implementation defined.
    • A minus - before the keyword marks the Key-Value pair as inverted.
  • Prefixed values are optionally quoted values prefixed with a single non-alphanumeric character.
    • Valid prefixes are implementation defined.
    • Prefixed values are parsed to the same data structure as Key-Value pairs.
    • A minus - before the prefix marks the prefixed value as inverted.
  • Optionally quoted means:
    • A text literal that ends at the next token separator like a word.
    • Quoted text according to the same quoting rules as phrases (but starting immediately instead of after a token separator); an additional quote pair of semicolons ;; is supported.
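
To make the keyword and Key-Value rules above more concrete, here is a minimal sketch in Rust. It is not whydrogen's API: the `Token` type and the functions `is_valid_keyword` and `classify` are hypothetical names made up for this illustration. The sketch only handles unquoted tokens (no phrases, prefixes or quoted values), and it assumes that an empty value or an unknown keyword makes the token fall back to a plain word.

```rust
/// Hypothetical token type for this sketch; not whydrogen's actual data structure.
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    KeyValue { key: String, value: String, inverted: bool },
}

/// Keyword rule from the list above: alphanumerics plus '-', '_' and '.',
/// and the first character must be alphanumeric.
fn is_valid_keyword(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_alphanumeric() => {}
        _ => return false,
    }
    chars.all(|c| c.is_alphanumeric() || matches!(c, '-' | '_' | '.'))
}

/// Splits on Unicode whitespace and classifies each unquoted token.
/// Assumption: anything that is not a well-formed pair with an allowed
/// keyword is kept verbatim as a word, so no input is a syntax error.
fn classify(query: &str, allowed_keys: &[&str]) -> Vec<Token> {
    query
        .split_whitespace()
        .map(|raw| {
            // A leading minus marks the Key-Value pair as inverted.
            let (inverted, rest) = match raw.strip_prefix('-') {
                Some(r) => (true, r),
                None => (false, raw),
            };
            if let Some((key, value)) = rest.split_once(':') {
                let known = allowed_keys.iter().any(|k| *k == key);
                if is_valid_keyword(key) && known && !value.is_empty() {
                    return Token::KeyValue {
                        key: key.to_string(),
                        value: value.to_string(),
                        inverted,
                    };
                }
            }
            Token::Word(raw.to_string())
        })
        .collect()
}
```

Calling `classify("rust -site:example.org error:", &["site"])` in this sketch yields a word, an inverted Key-Value pair and another word, because `error` is not among the allowed keywords.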

Design goals of the syntax were:

  • Familiar to anyone who has used such syntax in other search engines.
  • Fault tolerant, without syntax errors in case of clumsy use.
  • Pasting things like error messages into the search field should not trigger any search syntax.
  • Quoting must be able to reliably encode any sequence of characters without getting in the way of more casual use.
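
The last goal can be illustrated with the backslash rules from the syntax description. The following Rust sketch is an assumption-laden illustration, not code from the crate: `escape_phrase` and `unescape_phrase` are hypothetical helpers, and only the `"` quote pair is considered. It shows why the three escape rules are enough to round-trip arbitrary text.

```rust
/// Hypothetical helper: encodes `text` so it can sit between `"` quotes.
/// Backslashes are doubled and the closing quote is backslash-escaped.
fn escape_phrase(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        if c == '\\' || c == '"' {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Hypothetical helper: decodes a phrase body using the documented rules:
/// `\"` becomes `"`, `\\` becomes `\`, and a backslash followed by
/// anything else (or by nothing) is taken as is.
fn unescape_phrase(body: &str) -> String {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.peek() {
                Some(&next) if next == '\\' || next == '"' => {
                    out.push(next);
                    chars.next(); // consume the escaped character
                }
                _ => out.push('\\'), // lone backslash stays literal
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

Because an unknown escape is kept literally, casually pasted text such as a Windows path still decodes to itself, while `unescape_phrase(&escape_phrase(s))` returns `s` for any input in this sketch.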

The name is made up of the word for asking the most important kind of question and the most abundant chemical element in the universe, which also happens to be a very important component of answer-seeking beings :D.

Dependencies

~285KB