7 releases
new 0.1.6 | Apr 25, 2025 |
---|---|
0.1.5 | Apr 25, 2025 |
#335 in Text processing
743 downloads per month
18KB
283 lines
๐งน newline_normalizer ๐งน
The library for normalizing text into Unix (\n
) or DOS (\r\n
) newline formats, using fast SIMD search and zero-copy when possible.
โจ Features
- Adds extension traits to
str
โ call.to_unix_newlines()
and.to_dos_newlines()
directly. - Preserves input with
Cow<str>
โ skips allocation if no changes are needed. - Converts
\r
and\r\n
into consistent Unix (\n
) or DOS (\r\n
) newlines. - Unicode-safe โ preserves all characters without loss.
- Fast scanning with memchr and SIMD.
๐ Examples
use newline_normalizer::{ToUnixNewlines, ToDosNewlines};
let unix = "line1\r\nline2\rline3".to_unix_newlines();
assert_eq!(unix, "line1\nline2\nline3");
let dos = "line1\nline2\nline3".to_dos_newlines();
assert_eq!(dos, "line1\r\nline2\r\nline3");
๐ Benchmark
Benchmarks are in the /benches
folder.
Run them using:
cargo bench --bench to_unix
cargo bench --bench to_dos
All suggestions on how to improve the benchmarks are welcome.
๐ Results
Hardware: AMD Ryzen 9 9900X 12-Core Processor with 64 GB RAM.
Rust version: rustc 1.86.0 (05f9846f8 2025-03-31)
Benchmark framework: Criterion
Normalizing to DOS newlines (\r\n
):
Case | newline-converter |
This crate (newline_normalizer ) |
---|---|---|
Small Unicode paragraph | ~685.46 ns | ~88.789 ns ๐ |
Small Unicode paragraph pre-normalized | ~151.39 ns | ~58.350 ns ๐ |
The Adventures of Sherlock Holmes (608kb) | ~345.27 ยตs | ~138.26 ยตs ๐ |
The Adventures of Sherlock Holmes (608kb) pre-normalized | ~342.91 ยตs | ~137.54 ยตs ๐ |
Note: Pre-normalized means the input already has correct line endings and does not require changes.
Normalizing to Unix newlines (\n
):
Case | newline-converter |
regex replace all |
This crate (newline_normalizer ) |
---|---|---|---|
Small Unicode paragraph | ~1.0858 ยตs | ~101.42 ns | ~24.464 ns ๐ |
Small Unicode paragraph pre-normalized | ~164.41 ns | ~20.744 ns | ~4.6608 ns ๐ |
The Adventures of Sherlock Holmes (608kb) | ~680.12 ยตs | ~289.84 ยตs | ~89.150 ยตs ๐ |
The Adventures of Sherlock Holmes (608kb) pre-normalized | ~318.83 ยตs | ~7.7864 ยตs | ~2.5146 ยตs ๐ |
Benchmark result notes
- Pre-normalized means the input text already uses the correct line endings.
- In such cases,
newline_normalizer
can skip allocations and return a borrowed reference. - Extremely low latency (e.g., ~4.66 ns) is achieved by using
Cow::Borrowed
, avoiding an allocation of a new string when the input does not change.
๐ค Unicode behavior
This crate does not alter Unicode content. It only rewrites newline boundaries.
All valid UTF-8 sequences are preserved, including:
- Combining characters stay attached
- Emoji and multi-codepoint sequences remain valid
- Right-to-left (RTL) markers are unaffected
โ ๏ธ Limitations
This crate does not currently normalize U+2028 (LINE SEPARATOR) or U+2029 (PARA SEP). Only ASCII newline formats are converted.
๐ Licensed under MIT
This project is licensed under the MIT License.
Dependencies
~240KB