3 releases (breaking)

Uses old Rust 2015

0.3.0 May 7, 2018
0.2.0 Mar 30, 2018
0.1.0 Feb 22, 2018

#1155 in Text processing

28 downloads per month
Used in dedup

MIT/Apache

13KB
252 lines


A better deduplicator written in Rust.

Basic usage: dedup <INPUT> [-o <OUTPUTFILE>]

Run dedup --help to see:

USAGE:
    dedup.exe [FLAGS] [OPTIONS] [INPUT]

FLAGS:
    -l, --count-lines        If flag is set only print the number of unique entries found.
        --mmap               Enables use of memory mapped files. This is enabled by default.
        --no-mmap            Prohibits usage of memory mapped files. This will slow down the deduplication process
                             significantly!
    -z, --zero-terminated    Specifies that entries should be intepreted as being separated by a null byte rather than a
                             newline.
    -h, --help               Prints help information
    -V, --version            Prints version information

OPTIONS:
    -o, --output <OUTPUT>
        --terminator <TERMINATOR>    Specifies the single-byte pattern to separate entries by. Default is newline.
                                     [default: \n]

ARGS:
    <INPUT>    Specifies the input file to read from. Omit or supply '-' to read from stdin.
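The options above describe entries separated by a single-byte terminator (newline by default, a null byte with -z). As an illustration of that idea only, here is a minimal sketch using the Rust standard library. The function name unique_entries is hypothetical, and keeping entries in first-seen order and ignoring one trailing terminator are assumptions, not documented behaviour of dedup.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of dedup): returns the unique entries of
/// `input`, where entries are separated by the single byte `terminator`.
/// Assumption: entries are kept in first-seen order.
fn unique_entries(input: &[u8], terminator: u8) -> Vec<&[u8]> {
    if input.is_empty() {
        return Vec::new();
    }
    // Assumption: one trailing terminator does not start an extra empty entry.
    let input = input.strip_suffix(&[terminator]).unwrap_or(input);

    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for entry in input.split(|&b| b == terminator) {
        // `insert` returns true only the first time an entry is seen.
        if seen.insert(entry) {
            unique.push(entry);
        }
    }
    unique
}
```

In this sketch, the length of the returned vector corresponds to the number that --count-lines is described as printing, and passing 0 as the terminator mirrors --zero-terminated.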

To run the benchmark, run python benchsuite/benchrunner. This downloads a large (400MB+) text file to use as the benchmark case.

Feature requests and bug reports are always welcome! Please raise them as an issue in this GitHub repository.


lib.rs:

This crate provides one function, fastchr, which quickly finds the first occurrence of a given byte in a slice. fastchr is implemented using SIMD intrinsics and runtime CPU feature detection, so it uses the fastest method available on the current platform. If SIMD features are not available, fastchr falls back to memchr.
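The dispatch pattern described above can be sketched with the standard library alone. This is not the crate's source: the names find_byte, find_byte_sse2 and find_byte_scalar are hypothetical, SSE2 stands in for whichever instruction sets the crate actually targets, and a plain scalar loop stands in for the memchr fallback.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar fallback (stand-in for the memchr fallback mentioned above).
fn find_byte_scalar(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Hypothetical SSE2 path: compares 16 bytes at a time.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn find_byte_sse2(needle: u8, haystack: &[u8]) -> Option<usize> {
    // Broadcast the needle into all 16 lanes.
    let splat = _mm_set1_epi8(needle as i8);
    let mut offset = 0;
    while offset + 16 <= haystack.len() {
        // Unaligned load of the next 16 bytes (in bounds by the loop check).
        let chunk = _mm_loadu_si128(haystack.as_ptr().add(offset) as *const __m128i);
        // One mask bit per lane; bit i is set when byte i equals the needle.
        let mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, splat));
        if mask != 0 {
            return Some(offset + mask.trailing_zeros() as usize);
        }
        offset += 16;
    }
    // Fewer than 16 bytes remain: finish with the scalar loop.
    find_byte_scalar(needle, &haystack[offset..]).map(|i| offset + i)
}

/// Picks an implementation at run time based on detected CPU features.
pub fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // Safe to call: the required feature was just detected.
            return unsafe { find_byte_sse2(needle, haystack) };
        }
    }
    find_byte_scalar(needle, haystack)
}
```

Calling find_byte(b'\n', data) on a byte slice returns the index of the first newline, or None if there is none. The runtime check means one binary can run on CPUs with and without the feature.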

Dependencies

~170–315KB