31 releases (breaking)
Uses new Rust 2024
new 0.23.0 | May 9, 2025 |
---|---|
0.21.0 | May 3, 2025 |
0.16.0 | Feb 13, 2025 |
0.15.0 | Nov 21, 2024 |
0.8.2 | Jul 25, 2024 |
#166 in Text processing
671 downloads per month
110KB
2K
SLoC
word-tally
Output a tally of the number of times unique words appear in source input.
Usage
Usage: word-tally [OPTIONS] [PATH]
Arguments:
[PATH] File path to use as input rather than stdin ("-") [default: -]
Options:
-c, --case <FORMAT> Case normalization [default: lower] [possible values: original, upper, lower]
-s, --sort <ORDER> Sort order [default: desc] [possible values: desc, asc, unsorted]
-m, --min-chars <COUNT> Exclude words containing fewer than min chars
-M, --min-count <COUNT> Exclude words appearing fewer than min times
-E, --exclude-words <WORDS> Exclude words from a comma-delimited list
-i, --include <PATTERN> Include only words matching a regex pattern
-x, --exclude <PATTERN> Exclude words matching a regex pattern
-f, --format <FORMAT> Output format [default: text] [possible values: text, json, csv]
-d, --delimiter <VALUE> Delimiter between keys and values [default: " "]
-o, --output <PATH> Write output to file rather than stdout
-v, --verbose Print verbose details
--io <STRATEGY> I/O strategy to use for input processing [default: streamed] [possible values: streamed, buffered, mmap]
-p, --parallel Use parallel processing
-h, --help Print help (see more with '--help')
-V, --version Print version
Stability Notice
Pre-release level stability: This project is currently in pre-release stage. Expect breaking interface changes at MINOR version bumps (0.x.0) as the API evolves. The library will maintain API stability once it reaches 1.0.0.
Examples
Basic Usage
word-tally README.md | head -n3
#>> tally 22
#>> word 20
#>> https 11
echo "one two two three three three" | word-tally
#>> three 3
#>> two 2
#>> one 1
word-tally README.md --output=words.txt
Filtering Words
# Only include words that appear at least 10 times
word-tally --min-count=10 book.txt
# Exclude words with fewer than 5 characters
word-tally --min-chars=5 book.txt
# Exclude words by pattern
word-tally --exclude="^a.*" --exclude="^the$" book.txt
# Combining include and exclude patterns
word-tally --include="^w.*" --include=".*o$" --exclude="^who$" book.txt
# Exclude specific words
word-tally --exclude-words="the,a,an,and,or,but" book.txt
CSV output:
# Using delimiter (manual CSV)
word-tally --delimiter="," --output="tally.csv" README.md
# Using CSV format (with headers)
word-tally --format=csv --output="tally.csv" README.md
JSON output:
word-tally --format=json --output="tally.json" README.md
Transform JSON output for visualization with d3-cloud:
word-tally --format=json README.md | jq 'map({text: .[0], value: .[1]})' > d3-cloud.json
Transform and pipe the JSON output to the wordcloud_cli to produce an image:
word-tally --format=json README.md | jq -r 'map(.[0] + " ") | join(" ")' | wordcloud_cli --imagefile wordcloud.png
I/O and Processing Strategies
word-tally supports various I/O modes and parallel processing:
# --io=streamed is the default I/O strategy
output | word-tally
word-tally file.txt
word-tally --parallel large-file.txt
word-tally --io=mmap --parallel large-file.txt
word-tally --io=buffered file.txt
word-tally --io=mmap file.txt
Performance Considerations
Synthetic enchmarks with semi-realistic data suggest these strategies based on file size:
File Size | Best for Speed | Best for Memory | Balanced Approach |
---|---|---|---|
Small (<1MB) | Sequential + Memory-mapped | Sequential + Streamed | Sequential + Streamed |
Medium (1-80MB) | Sequential + Memory-mapped | Sequential + Streamed | Sequential + Memory-mapped |
Large (>80MB) | Parallel + Memory-mapped | Parallel + Streamed | Parallel + Memory-mapped |
Very Large (>1GB) | Parallel + Buffered | Parallel + Streamed | Parallel + Streamed |
Anecdotal insights:
- The inflection point where parallel processing becomes faster for me is around 80MB
- At this point, parallel processing may be several times faster than sequential
- For pipes and non-seekable sources, streaming I/O is required
- Memory-mapped I/O provides excellent performance but requires a seekable file
- Sequential streaming processing remains memory-efficient for files under 80MB
Performance can be further tuned through environment variables (detailed below).
Environment Variables
The following environment variables configure various aspects of the library:
Memory allocation and performance in all modes:
WORD_TALLY_UNIQUENESS_RATIO
- Divisor for estimating unique words from input size (default: 10)WORD_TALLY_DEFAULT_CAPACITY
- Default initial capacity when there is no size hint (default: 1024)WORD_TALLY_WORD_DENSITY
- Multiplier for estimating unique words per chunk (default: 15)WORD_TALLY_RESERVE_THRESHOLD
- Base threshold for capacity reservation when merging maps (default: 1000, scales with input size)
Parallel processing configuration:
WORD_TALLY_THREADS
- Number of threads for parallel processing (default: all available cores)WORD_TALLY_CHUNK_SIZE
- Size of chunks for parallel processing in bytes (default: 65536, 64KB)
I/O and processing strategy configuration:
WORD_TALLY_IO
- I/O strategy (default: streamed, options: streamed, buffered, memory-mapped)WORD_TALLY_PROCESSING
- Processing strategy (default: sequential, options: sequential, parallel)WORD_TALLY_VERBOSE
- Enable verbose mode (default: false, options: true/1/yes/on)
Installation
cargo install word-tally
Library Usage
[dependencies]
word-tally = "0.22.0"
use std::fs::File;
use word_tally::{Io, Options, Processing, WordTally};
fn main() -> std::io::Result<()> {
// Create a word tally with default options (Streamed I/O, Sequential processing)
let file = File::open("document.txt")?;
let word_tally = WordTally::new(file, &Options::default());
// Or customize I/O and processing strategies
let file = File::open("large-document.txt")?;
let options = Options::default()
.with_io(Io::MemoryMapped) // Use memory-mapped I/O for better performance with large files
.with_processing(Processing::Parallel); // Use parallel processing for multi-core efficiency
// For memory-mapped I/O, use try_from_file to handle potential errors
let word_tally = WordTally::try_from_file(file, &options).expect("Failed to process file");
// Print basic statistics
println!("Words: {} total, {} unique", word_tally.count(), word_tally.uniq_count());
// Print the top 5 words and the count of times each appear
for (word, count) in word_tally.tally().iter().take(5) {
println!("{}: {}", word, count);
}
Ok(())
}
The library supports customization including case normalization, sorting, filtering, and I/O and processing strategies.
Documentation
Tests & Benchmarks
Clone the repository.
git clone https://github.com/havenwood/word-tally
cd word-tally
Run the tests.
cargo test
Run the benchmarks.
cargo bench
Benchmarks
The project includes comprehensive benchmarks for measuring performance across different strategies:
# Run specific benchmark groups
cargo bench --bench core
cargo bench --bench io
cargo bench --bench features
# Run specific benchmark tests
cargo bench --bench io -- size_10kb
cargo bench --bench io -- size_75kb
Dependencies
~7–9.5MB
~159K SLoC