#newlines #unix-windows #unix #windows #normalize #text

newline_normalizer

Zero-copy newline normalization to \n or \r\n with SIMD acceleration

7 releases

new 0.1.6 Apr 25, 2025
0.1.5 Apr 25, 2025

#335 in Text processing


743 downloads per month

MIT license

18KB
283 lines

🧹 newline_normalizer 🧹

A library for normalizing text to Unix (\n) or DOS (\r\n) newlines. It uses SIMD-accelerated search and returns the input without copying when no changes are needed.


✨ Features

  • Adds extension traits to str: call .to_unix_newlines() and .to_dos_newlines() directly.
  • Preserves input with Cow<str>: skips allocation if no changes are needed.
  • Converts \r and \r\n into consistent Unix (\n) or DOS (\r\n) newlines.
  • Unicode-safe: preserves all characters without loss.
  • Fast scanning with memchr and SIMD.
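
To make the Cow<str> behavior concrete, here is a minimal std-only sketch of the idea. The function name to_unix_sketch is hypothetical and this is not the crate's actual implementation (which, per the list above, scans with memchr and SIMD); it only illustrates borrowing the input when no \r is present and allocating otherwise.

```rust
use std::borrow::Cow;

/// Hypothetical helper, NOT the crate's real implementation: normalizes
/// "\r\n" and lone '\r' to '\n', borrowing the input when nothing changes.
pub fn to_unix_sketch(input: &str) -> Cow<'_, str> {
    // Fast path: without any '\r' the text is already Unix-style,
    // so the original slice can be returned without allocating.
    if !input.contains('\r') {
        return Cow::Borrowed(input);
    }

    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' {
            // Collapse "\r\n" into a single '\n'; a lone '\r' also becomes '\n'.
            if chars.peek() == Some(&'\n') {
                chars.next();
            }
            out.push('\n');
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

Calling to_unix_sketch("a\nb") yields a Cow::Borrowed pointing at the original slice, while to_unix_sketch("a\r\nb") yields a Cow::Owned holding a new String.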

📚 Examples

use newline_normalizer::{ToUnixNewlines, ToDosNewlines};

let unix = "line1\r\nline2\rline3".to_unix_newlines();
assert_eq!(unix, "line1\nline2\nline3");

let dos = "line1\nline2\nline3".to_dos_newlines();
assert_eq!(dos, "line1\r\nline2\r\nline3");

🚀 Benchmark

Benchmarks are in the /benches folder.

Run them using:

cargo bench --bench to_unix
cargo bench --bench to_dos

All suggestions on how to improve the benchmarks are welcome.

📈 Results

Hardware: AMD Ryzen 9 9900X 12-Core Processor with 64 GB RAM.

Rust version: rustc 1.86.0 (05f9846f8 2025-03-31)

Benchmark framework: Criterion

Normalizing to DOS newlines (\r\n):

| Case | newline-converter | This crate (newline_normalizer) |
|---|---|---|
| Small Unicode paragraph | ~685.46 ns | ~88.789 ns 🚀 |
| Small Unicode paragraph, pre-normalized | ~151.39 ns | ~58.350 ns 🚀 |
| The Adventures of Sherlock Holmes (608 KB) | ~345.27 µs | ~138.26 µs 🚀 |
| The Adventures of Sherlock Holmes (608 KB), pre-normalized | ~342.91 µs | ~137.54 µs 🚀 |

Note: Pre-normalized means the input already has correct line endings and does not require changes.

Normalizing to Unix newlines (\n):

| Case | newline-converter | regex replace all | This crate (newline_normalizer) |
|---|---|---|---|
| Small Unicode paragraph | ~1.0858 µs | ~101.42 ns | ~24.464 ns 🚀 |
| Small Unicode paragraph, pre-normalized | ~164.41 ns | ~20.744 ns | ~4.6608 ns 🚀 |
| The Adventures of Sherlock Holmes (608 KB) | ~680.12 µs | ~289.84 µs | ~89.150 µs 🚀 |
| The Adventures of Sherlock Holmes (608 KB), pre-normalized | ~318.83 µs | ~7.7864 µs | ~2.5146 µs 🚀 |

Benchmark result notes

  • For pre-normalized input, newline_normalizer can skip allocation and return a borrowed reference.
  • The very low latencies in those rows (e.g., ~4.66 ns) come from returning Cow::Borrowed, which avoids allocating a new string when the input does not change.
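
The borrowed fast path for the DOS direction can be sketched as follows. This is an illustrative std-only version under stated assumptions: the names is_dos_normalized and to_dos_sketch are hypothetical, and the crate's real code may differ. Unlike the Unix case, where the absence of \r is enough, this check has to confirm that every \r is followed by \n and that no \n stands alone.

```rust
use std::borrow::Cow;

/// Hypothetical check: true when every newline in `input` is exactly "\r\n".
fn is_dos_normalized(input: &str) -> bool {
    let bytes = input.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\r' => {
                // A '\r' must be immediately followed by '\n'.
                if bytes.get(i + 1) != Some(&b'\n') {
                    return false;
                }
                i += 2;
            }
            // Any '\n' reached here was not preceded by '\r'.
            b'\n' => return false,
            _ => i += 1,
        }
    }
    true
}

/// Hypothetical helper, NOT the crate's real implementation: converts
/// '\n', lone '\r', and "\r\n" to "\r\n", borrowing when nothing changes.
pub fn to_dos_sketch(input: &str) -> Cow<'_, str> {
    if is_dos_normalized(input) {
        return Cow::Borrowed(input);
    }

    let mut out = String::with_capacity(input.len() + input.len() / 16);
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Consume the '\n' of an existing "\r\n" so it is not doubled.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                out.push_str("\r\n");
            }
            '\n' => out.push_str("\r\n"),
            _ => out.push(c),
        }
    }
    Cow::Owned(out)
}
```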

🔤 Unicode behavior

This crate does not alter Unicode content. It only rewrites newline boundaries.

All valid UTF-8 sequences are preserved. In particular:

  • Combining characters stay attached
  • Emoji and multi-codepoint sequences remain valid
  • Right-to-left (RTL) markers are unaffected
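
The reason newline rewriting cannot damage other characters is a property of UTF-8 itself: every byte of a multi-byte sequence is 0x80 or higher, so the ASCII bytes \n (0x0A) and \r (0x0D) never occur inside one. The sketch below demonstrates this with two hypothetical helper names; it is a simplified byte swap for illustration only and does not collapse \r\n the way the crate does.

```rust
/// Hypothetical demonstration helper: replaces every '\r' byte with '\n'
/// at the byte level, then re-validates the result as UTF-8.
fn replace_cr_bytes(input: &str) -> String {
    let bytes: Vec<u8> = input
        .bytes()
        .map(|b| if b == b'\r' { b'\n' } else { b })
        .collect();
    // Swapping one ASCII byte for another cannot break a multi-byte
    // sequence, because all bytes of such sequences are >= 0x80.
    String::from_utf8(bytes).expect("ASCII-for-ASCII swap keeps UTF-8 valid")
}

/// Returns true if any multi-byte character in `input` contains a
/// 0x0A or 0x0D byte in its UTF-8 encoding (this never happens).
fn has_newline_byte_inside_multibyte_char(input: &str) -> bool {
    input.chars().filter(|c| c.len_utf8() > 1).any(|c| {
        let mut buf = [0u8; 4];
        c.encode_utf8(&mut buf)
            .bytes()
            .any(|b| b == b'\n' || b == b'\r')
    })
}
```

Running replace_cr_bytes on text containing combining marks, zero-width-joiner emoji sequences, or RTL markers leaves every non-newline character byte-for-byte identical.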

⚠️ Limitations

This crate does not currently normalize U+2028 (LINE SEPARATOR) or U+2029 (PARAGRAPH SEPARATOR). Only the ASCII newline sequences (\n, \r, and \r\n) are converted.
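
If your input may contain these separators, one option is a small pre-pass before normalizing. The sketch below uses only the standard library; the function name unicode_separators_to_lf is hypothetical and not part of this crate, and mapping both separators to a single \n is an assumption that may not suit every application.

```rust
use std::borrow::Cow;

/// Hypothetical pre-pass (not part of the crate): maps U+2028 and U+2029
/// to '\n', borrowing the input when neither separator is present.
pub fn unicode_separators_to_lf(input: &str) -> Cow<'_, str> {
    const SEPARATORS: [char; 2] = ['\u{2028}', '\u{2029}'];

    // A `&[char]` slice is a valid pattern for `contains` and `replace`,
    // matching any one of the listed characters.
    if input.contains(&SEPARATORS[..]) {
        Cow::Owned(input.replace(&SEPARATORS[..], "\n"))
    } else {
        Cow::Borrowed(input)
    }
}
```

The result can then be passed to .to_unix_newlines() or .to_dos_newlines() as usual.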

📝 Licensed under MIT

This project is licensed under the MIT License.

Dependencies

~240KB