#simd #white-space #collapse

fast_whitespace_collapse

Collapse consecutive spaces and tabs into a single space using SIMD

1 unstable release

new 0.1.0 Feb 20, 2025

#433 in Text processing

MIT license

14KB
133 lines

fast_whitespace_collapse

A high-performance Rust crate for collapsing consecutive spaces and tabs into a single space.
Uses SIMD (u8x16) via the wide crate for efficient processing.
Automatically falls back to a scalar implementation if SIMD is unavailable.

Features

  • Collapses multiple spaces and tabs into a single space.
  • Preserves newlines and non-whitespace characters.
  • Uses SIMD (u8x16) when supported to process 16 bytes at a time.
  • Falls back to a fast scalar implementation if SIMD is unavailable.
  • Ensures valid UTF-8 output.
  • SIMD requires AVX2, SSE2, or NEON instruction sets.

Installation

Add this to your Cargo.toml:

[dependencies]
fast_whitespace_collapse = "0.1"

Or run the following command:

cargo add fast_whitespace_collapse

Controlling SIMD Support

By default, SIMD acceleration is enabled. You can control it via Cargo features:

πŸ”Ή Disable SIMD for Embedded Targets

cargo build --no-default-features

πŸ”Ή Explicitly Enable SIMD

cargo build --features simd-optimized

Usage

use fast_whitespace_collapse::collapse_whitespace;

let input = "This   is \t  a   test.";
let output = collapse_whitespace(input);
assert_eq!(output, "This is a test.");

Performance

  • Processes text using SIMD (u8x16), handling 16 bytes in parallel.
  • Falls back to scalar processing when SIMD is unavailable.
  • Handles large inputs efficiently while maintaining valid UTF-8 output.

Benchmark Results

Comparison with Other Approaches

Method Time
Regex approach 11.289 Β΅s
collapse crate 1.2624 Β΅s
Iterative approach 629.60 ns
Iterative bytes 428.00 ns
fast_whitespace_collapse crate 388.73 ns

πŸš€ fast_whitespace_collapse outperforms other methods, achieving the lowest execution time.

πŸ“Œ Benchmark executed on Apple M1 Pro (NEON SIMD enabled).

πŸ”Ή Run Your Own Benchmark

cargo bench

Compatibility

fast_whitespace_collapse supports multiple architectures:

  • x86_64: Uses SIMD (SSE2, AVX2) for maximum performance.
  • ARM (aarch64, M1/M2/M3): Uses NEON SIMD.
  • **Other: Falls back to a scalar implementation.

Examples

Basic Usage

use fast_whitespace_collapse::collapse_whitespace;

assert_eq!(collapse_whitespace("Hello    world"), "Hello world");
assert_eq!(collapse_whitespace("   Trim   spaces   " ), "Trim spaces");
assert_eq!(collapse_whitespace("Tabs\t\tconverted"), "Tabs converted");

Unicode Support

assert_eq!(collapse_whitespace("こんにけは  δΈ–η•Œ"), "こんにけは δΈ–η•Œ"); // Japanese
assert_eq!(collapse_whitespace("δ½ ε₯½  δΈ–η•Œ"), "δ½ ε₯½ δΈ–η•Œ"); // Chinese
assert_eq!(collapse_whitespace("πŸ˜€  πŸ˜ƒ  πŸ˜„"), "πŸ˜€ πŸ˜ƒ πŸ˜„"); // Emojis

Handling Newlines

assert_eq!(collapse_whitespace("Line1\n   Line2\nLine3"), "Line1\n Line2\nLine3");

Tests

Run tests with:

cargo test

License

This project is licensed under the MIT License.

Dependencies

~215KB