#wildcard #search #edit #character #trie #suffix #text

spyglass

Search engine for documents, inspired by bioinformatics

2 releases (1 stable)

1.1.0 Jan 9, 2021
0.1.0 Jan 3, 2021

#1927 in Text processing

MIT license

71KB
878 lines

Spyglass

A tool for searching texts using a suffix trie built from the sentences of the text.
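As a rough illustration of the idea, here is a minimal sketch of a suffix trie built from sentences. The names `Node`, `SuffixTrie`, `insert` and `find` are hypothetical and are not taken from Spyglass's source; the sketch also assumes each sentence is inserted once, under its own id.

```rust
use std::collections::HashMap;

/// One trie node: children keyed by character, plus the ids of the
/// sentences that have a suffix passing through this node.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    sentences: Vec<usize>,
}

/// Hypothetical suffix trie over sentences (an assumption, not Spyglass's type).
#[derive(Default)]
struct SuffixTrie {
    root: Node,
}

impl SuffixTrie {
    /// Inserts every suffix of `sentence`, tagging each visited node with `id`.
    /// Assumes all suffixes of one sentence are inserted before the next id.
    fn insert(&mut self, id: usize, sentence: &str) {
        let chars: Vec<char> = sentence.chars().collect();
        for start in 0..chars.len() {
            let mut node = &mut self.root;
            for &c in &chars[start..] {
                node = node.children.entry(c).or_default();
                // Avoid recording the same sentence twice on one node.
                if node.sentences.last() != Some(&id) {
                    node.sentences.push(id);
                }
            }
        }
    }

    /// Returns the ids of sentences that contain `pattern` as a substring.
    fn find(&self, pattern: &str) -> Vec<usize> {
        let mut node = &self.root;
        for c in pattern.chars() {
            match node.children.get(&c) {
                Some(next) => node = next,
                None => return Vec::new(),
            }
        }
        node.sentences.clone()
    }
}
```

Every substring of a sentence is a prefix of one of its suffixes, which is why a walk from the root finds matches anywhere in the sentence. An uncompressed trie like this can use memory quadratic in the sentence length, so it is only practical for short units of text such as sentences.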

Search types

  1. Single wildcard

te?t matches test and text

  2. Multi character wildcard

mush* matches mushroom and mushy and mush

Equivalent to \w* in regex
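The two character-level wildcards can be sketched as a small recursive matcher over a single word. This is an illustration only: the function names are hypothetical, `?` is assumed to stand for exactly one character of any kind, and `*` is assumed to cover zero or more alphanumeric or underscore characters, mirroring `\w*`.

```rust
/// Returns true if `word` matches `pattern`, where `?` stands for exactly
/// one character and `*` for zero or more word characters (like `\w*`).
fn wildcard_match(pattern: &str, word: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let w: Vec<char> = word.chars().collect();
    match_chars(&p, &w)
}

/// Word characters in the sense assumed here: alphanumerics and underscore.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

fn match_chars(p: &[char], w: &[char]) -> bool {
    match p.split_first() {
        // Pattern exhausted: succeed only if the word is exhausted too.
        None => w.is_empty(),
        // `?` consumes exactly one character.
        Some((&'?', rest)) => !w.is_empty() && match_chars(rest, &w[1..]),
        // `*` tries consuming 0, 1, 2, ... word characters.
        Some((&'*', rest)) => {
            let mut i = 0;
            loop {
                if match_chars(rest, &w[i..]) {
                    return true;
                }
                if i < w.len() && is_word_char(w[i]) {
                    i += 1;
                } else {
                    return false;
                }
            }
        }
        // A literal character must match exactly.
        Some((&c, rest)) => w.first() == Some(&c) && match_chars(rest, &w[1..]),
    }
}
```

The backtracking on `*` is exponential in the worst case for patterns with many stars, which is acceptable for a sketch but not for an index-backed search.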

  3. Multi word wildcard

this ** rabbit matches this rabbit and this enormous rabbit and this big furry rabbit
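A word-level version of the same idea might look like the sketch below. It assumes `**` stands for zero or more whole words, that words are separated by whitespace, and that the pattern must match the whole text rather than a part of it; the function names are hypothetical.

```rust
/// Matches pattern tokens against words. The token `**` stands for zero or
/// more whole words; every other token must equal the word it lines up with.
fn words_match(pattern: &[&str], words: &[&str]) -> bool {
    match pattern.split_first() {
        None => words.is_empty(),
        // `**` may skip any number of words, including none.
        Some((tok, rest)) if *tok == "**" => {
            (0..=words.len()).any(|skip| words_match(rest, &words[skip..]))
        }
        Some((tok, rest)) => {
            words.first() == Some(tok) && words_match(rest, &words[1..])
        }
    }
}

/// Splits both inputs on whitespace and matches them word by word.
fn multi_word_match(pattern: &str, text: &str) -> bool {
    let p: Vec<&str> = pattern.split_whitespace().collect();
    let t: Vec<&str> = text.split_whitespace().collect();
    words_match(&p, &t)
}
```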

  4. Approximate match using edit distance

he repl'd with edit distance 2 matches he replied
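The text does not say which edit distance is meant. Assuming the common Levenshtein distance (insertions, deletions and substitutions each cost 1), the example can be checked with a short dynamic-programming sketch; `edit_distance` is a hypothetical helper, not part of Spyglass's API.

```rust
/// Levenshtein distance between `a` and `b`, computed row by row so that
/// only two rows of the table are kept in memory.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a` and first j of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}
```

Turning repl'd into replied takes one substitution (the apostrophe becomes i) and one insertion (e), which gives the distance of 2 from the example.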

  5. Searching with list of ignorable characters

E.g. ignoring vowels and punctuation, wracked matches rack'd and wrecked
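One simple way to picture ignorable characters is to strip them from both the query and the candidate before comparing. The sketch below does this for whole words only; the helper names are hypothetical and Spyglass may well handle ignorable characters during trie traversal instead.

```rust
/// Removes every character listed in `ignorable` from `s`.
fn strip_ignorable(s: &str, ignorable: &[char]) -> String {
    s.chars().filter(|c| !ignorable.contains(c)).collect()
}

/// Compares two words after dropping the ignorable characters from both.
fn matches_ignoring(query: &str, candidate: &str, ignorable: &[char]) -> bool {
    strip_ignorable(query, ignorable) == strip_ignorable(candidate, ignorable)
}
```

With the vowels and the apostrophe as the ignorable set, wracked and wrecked both reduce to wrckd and therefore compare equal under this whole-word comparison.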

To do

  • Allow deterministic printing: HashMap keys are unordered, so the keys can be printed in a different order each time
  • Matching with ? wildcard
  • Matching with * wildcard
  • Matching with ** wildcard
  • Return proper match object, including line number of match
  • Deal with multiple matches of same line/section e.g. when edit distance is large
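For the deterministic printing item, one possible approach is to sort the keys before printing. The sketch below assumes the trie's children live in a `HashMap<char, _>`; the value type and the helper name `sorted_keys` are placeholders.

```rust
use std::collections::HashMap;

/// Returns the map's keys in sorted order, so repeated prints agree.
fn sorted_keys(children: &HashMap<char, usize>) -> Vec<char> {
    let mut keys: Vec<char> = children.keys().copied().collect();
    keys.sort_unstable();
    keys
}
```

An alternative is to store the children in a `std::collections::BTreeMap`, which always iterates in key order, at the cost of O(log n) rather than expected O(1) lookups.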

Dependencies

~4–13MB
~140K SLoC