#suffix #search #index #saca

sacapart

Partitioned suffix arrays, for use with sacabase

1 stable release

2.0.0 Nov 23, 2019

#2312 in Algorithms

302 downloads per month
Used in 3 crates (2 directly)

MIT license

15KB
287 lines

sacapart

Computing the suffix array (the lexicographic order of all suffixes of a text) is expensive, especially as the text gets large.
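To make the definition concrete, here is a minimal sketch of a naive suffix array construction using only the standard library. The function name `naive_suffix_array` is hypothetical and is not part of sacapart or sacabase; a real SACA such as divsufsort is far faster than this comparison sort, which is only meant to show what the output looks like.

```rust
/// Naive suffix array (illustration only, not the sacapart API):
/// collect every suffix start offset, then sort the offsets by
/// comparing the suffixes they point to. Worst case O(n^2 log n).
fn naive_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    // Byte slices compare lexicographically, which is the order we want.
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}
```

For the text `banana`, the sorted suffixes are `a`, `ana`, `anana`, `banana`, `na`, `nana`, so the suffix array is `[5, 3, 1, 0, 4, 2]`.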

Sometimes, for very large inputs, a compromise is possible. Instead of computing the suffix array of the whole text, we can split the text into partitions and compute one suffix array per partition: for two partitions, the suffix array of the first half and the suffix array of the second half.

Memory usage remains roughly the same (depending on the suffix array construction algorithm, or SACA, used). Lookup time gets worse by a constant factor (the number of partitions). And across partition boundaries, worse (shorter) matches are sometimes found.
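The trade-off can be sketched with the standard library alone. Everything below (`Partition`, `build_partitions`, `longest_match`) is a hypothetical illustration, not the sacapart API, and it sorts suffixes naively instead of calling a real SACA. It assumes each partition is indexed in isolation, so a match can never extend past the end of its partition.

```rust
/// One partition of the text with its own suffix array (hypothetical type).
struct Partition {
    offset: usize,  // where this partition starts in the full text
    len: usize,     // length of this partition
    sa: Vec<usize>, // suffix offsets, relative to `offset`, in sorted order
}

/// Split `text` into at most `num_partitions` chunks and index each one.
fn build_partitions(text: &[u8], num_partitions: usize) -> Vec<Partition> {
    assert!(num_partitions > 0);
    // Ceiling division; `.max(1)` keeps `chunks` valid for empty input.
    let chunk = ((text.len() + num_partitions - 1) / num_partitions).max(1);
    text.chunks(chunk)
        .enumerate()
        .map(|(i, part)| {
            let mut sa: Vec<usize> = (0..part.len()).collect();
            sa.sort_by(|&a, &b| part[a..].cmp(&part[b..]));
            Partition { offset: i * chunk, len: part.len(), sa }
        })
        .collect()
}

fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Longest prefix of `needle` found in any partition, as (start, length).
/// One binary search per partition: cost grows with the partition count.
fn longest_match(text: &[u8], parts: &[Partition], needle: &[u8]) -> (usize, usize) {
    let mut best = (0, 0);
    for p in parts {
        let part = &text[p.offset..p.offset + p.len];
        // Index of the first suffix that is not smaller than the needle.
        let pos = p.sa.partition_point(|&s| &part[s..] < needle);
        // The best match in this partition is a neighbor of that position.
        for idx in [pos.wrapping_sub(1), pos] {
            if let Some(&s) = p.sa.get(idx) {
                let len = common_prefix_len(&part[s..], needle);
                if len > best.1 {
                    best = (p.offset + s, len);
                }
            }
        }
    }
    best
}
```

With the text `abcdefgh` and the needle `cdef`, a single partition yields the full match `(2, 4)`. With two partitions (`abcd` and `efgh`) the needle straddles the boundary, and the sketch reports the shorter match `(2, 2)`.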

For some applications, like diffing very large files, this compromise makes sense. Read the docs and the tests to see if sacapart is right for you.

Note: sacapart is meant to be used in conjunction with a SACA that supports sacabase, like divsufsort.

Dependencies

~1.5MB
~28K SLoC