2 unstable releases

0.2.0 Feb 18, 2023
0.1.0 Jan 2, 2019

#954 in Text processing

MIT/Apache

8KB
115 lines

sortuniq

sortuniq provides optimised versions of various | sort | uniq constructions which are common in shell scripting. This construction finds all the unique lines in an input, regardless of the line order.

For example:

1
2
3
2
1

becomes:

1
2
3

Usage

Without arguments, | sortuniq generates the same set of results as | sort | uniq, and produces them immediately. This is, uh, slightly disconcerting. They are not sorted, unlike in the original.

| sortuniq -c generates the similar output to | sort | uniq -c.

local

| sortuniq --local will drop "local" duplicates from a stream, again immediately. This can be useful if your data is very large, and has many useless values in, and you only want the eventual uniq values, or some other idea of the stream. For example, for some data which looks like:

one
two
ponies!
two
one
one
horses!
two
one
ponies!

| sortuniq --local --size-hint 3 will (immediately) print:

one
two
ponies!
horses!
ponies!

It has eliminated many of the one and two entries, but can't eliminate the second ponies! as it runs out of "memory". You can (arbitrarily) increase the size-hint (default: 64).

You can't really do that with uniq, which can't look more than one line back in the history.

Indicative benchmarks

I took a gigabyte of rendered Wikipedia pages, and extracted the "words", giving >200 million lines.

For this input:

  • | sortuniq -c takes 17.5s (single core) and 190MB of memory (max RSS)
  • | sort | uniq -c takes 111s (405s total user time) and around 4gb of memory.

That's a 6x-23x speedup, and a 21x memory improvement.


Here's a subset of the input (via. wikiextractor and | perl -pe 's/\b/\n/g;s/[ \t]//g' | egrep -v '^$':

Helena
Carroll
Helena
Winifred
Carroll
(
13
November
192831
March
2013
)
was
a
veteran
film
,
television
and
stage
actress
.

The most common words are, unsurprisingly:

   1195445 by
   1229458 with
   1274390 on
   1360706 as
   1405234 for
   1416371 '
   1675168 is
   1777549 The
   1931603 -
   2074956 was
   3038885 "
   3462954 a
   3630791 to
   4267812 in
   4999315 and
   5732322 of
   8336207 .
   9153796 ,
  10748102 the

Dependencies

~3MB
~41K SLoC