2 unstable releases
0.2.0 | Feb 18, 2023 |
---|---|
0.1.0 | Jan 2, 2019 |
#1034 in Text processing
8KB
117 lines
sortuniq
sortuniq
provides optimised versions of various | sort | uniq
constructions which are common in shell scripting. This construction
finds all the unique lines in an input, regardless of the line order.
For example:
1
2
3
2
1
becomes:
1
2
3
Usage
Without arguments, | sortuniq
generates the same set of results
as | sort | uniq
, and produces them immediately. This is, uh,
slightly disconcerting. They are not sorted, unlike in the original.
| sortuniq -c
generates the similar output to | sort | uniq -c
.
local
| sortuniq --local
will drop "local" duplicates from a stream, again
immediately. This can be useful if your data is very large, and has many
useless values in, and you only want the eventual uniq
values, or some
other idea of the stream. For example, for some data which looks like:
one
two
ponies!
two
one
one
horses!
two
one
ponies!
| sortuniq --local --size-hint 3
will (immediately) print:
one
two
ponies!
horses!
ponies!
It has eliminated many of the one
and two
entries, but can't
eliminate the second ponies!
as it runs out of "memory". You can
(arbitrarily) increase the size-hint
(default: 64
).
You can't really do that with uniq
, which can't look more than
one line back in the history.
Indicative benchmarks
I took a gigabyte of rendered Wikipedia pages, and extracted the "words", giving >200 million lines.
For this input:
| sortuniq -c
takes 17.5s (single core) and 190MB of memory (max RSS)| sort | uniq -c
takes 111s (405s total user time) and around 4gb of memory.
That's a 6x-23x speedup, and a 21x memory improvement.
Here's a subset of the input (via. wikiextractor
and | perl -pe 's/\b/\n/g;s/[ \t]//g' | egrep -v '^$'
:
Helena
Carroll
Helena
Winifred
Carroll
(
13
November
1928
–
31
March
2013
)
was
a
veteran
film
,
television
and
stage
actress
.
The most common words are, unsurprisingly:
1195445 by
1229458 with
1274390 on
1360706 as
1405234 for
1416371 '
1675168 is
1777549 The
1931603 -
2074956 was
3038885 "
3462954 a
3630791 to
4267812 in
4999315 and
5732322 of
8336207 .
9153796 ,
10748102 the
Dependencies
~3MB
~40K SLoC