#sorting #alphanumeric #natural #lexicographical

natlex_sort

Provides hybrid natural and lexicographical sorting for strings and byte slices, useful for sorting mixed lists of filenames and identifiers

1 unstable release

new 0.1.0 Apr 27, 2025

#354 in Algorithms


Used in c5store

MPL-2.0 license

37KB
505 lines

Natural Lexicographic Order Sort

Crates.io Docs.rs License: MPL-2.0

Provides hybrid sorting logic for Rust, combining natural and lexicographical ordering for strings (&str) and byte slices (&[u8]).

Overview

Standard sorting methods often fall short when dealing with lists containing both fixed-format identifiers and human-readable names with embedded numbers.

  • Lexicographical sort: Works well for fixed-length IDs (e.g., id-001, id-002) but sorts filenames counter-intuitively (file10.txt before file2.txt).
  • Natural sort: Handles filenames correctly (file2.txt before file10.txt) but might not be the desired behaviour for fixed-length, zero-padded IDs where lexicographical order is preferred.

natlex_sort offers a hybrid approach to address this:

  1. Items of the same length are compared lexicographically.
  2. Items of different lengths are compared using natural ordering.

This allows fixed-length IDs to sort byte-wise while variable-length names are sorted naturally.

Case-sensitive and case-insensitive variants are provided. The string-based natural sorting relies on the excellent natord crate.

Features

  • Hybrid comparison functions for &str and &[u8].
  • Case-sensitive (nat_lex_cmp, nat_lex_byte_cmp) and case-insensitive (nat_lex_cmp_ignore, nat_lex_byte_cmp_ignore) variants.
  • Convenience functions to sort slices in place (nat_lex_sort, nat_lex_sort_bytes, etc.).
  • NatLexOrderedString wrapper type for use with ordered collections like BTreeMap.
  • Relies on natord for robust natural sorting of strings.
  • Custom byte-slice natural comparison logic suitable for ASCII text with numbers.

Installation

Add natlex_sort to your Cargo.toml:

[dependencies]
natlex_sort = "0.1.0" # Replace with the latest version from crates.io

Usage Examples

use natlex_sort::{nat_lex_sort, nat_lex_sort_ignore_case, NatLexOrderedString};
use std::collections::BTreeMap;

// --- Sorting Mixed Strings ---
let mut items = vec![
    "item10",
    "item2",
    "id_002", // Same length as id_001
    "id_001",
    "Item1",  // Different length from item2/item10
];

// Case-sensitive sort
nat_lex_sort(&mut items);
// Expected: "Item1", "id_001", "id_002", "item2", "item10"
// - "id_001", "id_002" (same length -> lexicographical)
// - "Item1", "item2", "item10" (different lengths -> natural, 'I' < 'i')
// - Group comparison: "Item1" vs "id_001" (diff length -> natural, 'I' < 'i')
assert_eq!(items, vec!["Item1", "id_001", "id_002", "item2", "item10"]);


// --- Case-Insensitive Sorting ---
let mut items_ci = vec![
    "item10",
    "item2",
    "ID_002", // Different case
    "id_001",
    "Item1",
];
nat_lex_sort_ignore_case(&mut items_ci);
// Expected: "ID_002", "id_001", "Item1", "item2", "item10"
// - "ID_002", "id_001" (same length -> case-insensitive lex, 'D' < 'd' tie-breaker)
// - "Item1", "item2", "item10" (different lengths -> case-insensitive natural)
// - Group comparison: "ID_002" vs "Item1" (diff length -> natural ignore case, 'I'=='I', then 'D' < 't')
assert_eq!(items_ci, vec!["ID_002", "id_001", "Item1", "item2", "item10"]);


// --- Using the Wrapper Type ---
let mut map = BTreeMap::new();
// Keys are ordered using case-sensitive nat_lex_cmp
map.insert(NatLexOrderedString::from("config_10"), "Value C");
map.insert(NatLexOrderedString::from("id_abc"), "Value A"); // Same length
map.insert(NatLexOrderedString::from("id_abd"), "Value B"); // Same length
map.insert(NatLexOrderedString::from("config_2"), "Value D");

let ordered_keys: Vec<_> = map.keys().map(|k| k.0.as_str()).collect();
// Expected: "config_2", "config_10", "id_abc", "id_abd"
// ('c' < 'i')
assert_eq!(ordered_keys, vec!["config_2", "config_10", "id_abc", "id_abd"]);

// --- See documentation for byte slice sorting and direct comparator usage ---

API Documentation

Detailed API documentation can be found on docs.rs.

The main components are:

  • Comparison Functions:
    • nat_lex_cmp, nat_lex_cmp_ignore (for &str)
    • nat_lex_byte_cmp, nat_lex_byte_cmp_ignore (for &[u8])
  • Sorting Functions:
    • nat_lex_sort, nat_lex_sort_ignore_case (for &[T: AsRef<str>])
    • nat_lex_sort_bytes, nat_lex_sort_bytes_ignore_case (for &[&[u8]])
  • Wrapper Type:
    • NatLexOrderedString (owning String wrapper using nat_lex_cmp)

Behavior Notes

  • Case-Insensitive Tie-Breaking: When comparing items of the same length using the _ignore_case variants, if the items are identical ignoring case (e.g., "FileA" vs "filea"), a standard case-sensitive comparison (a.cmp(b)) is used as a tie-breaker to ensure a stable, total order.
  • Byte Slice Logic: The byte comparison functions (nat_lex_byte_cmp, nat_lex_byte_cmp_ignore) implement custom natural sorting logic suitable for ASCII text with embedded numbers. See the documentation for details. Behavior with non-ASCII bytes in the natural sorting path might be less predictable than the &str versions which use natord.

License

This project is licensed under the Mozilla Public License Version 2.0 (MPL-2.0).

Contributing

Contributions are welcome! Please feel free to open an issue for bugs, feature requests, or questions. If you'd like to submit a pull request, please open an issue first to discuss the proposed changes.

When contributing, please ensure:

  • Code is formatted using cargo fmt.
  • Tests pass (cargo test).
  • New functionality includes relevant tests.
  • Clippy lints are addressed (cargo clippy).

Dependencies