#filename #sanitizer #nodejs #sanitiser

no-std sanitise-file-name

An unusually flexible and efficient file name sanitiser

1 stable release

1.0.0 Jan 5, 2022

#612 in Filesystem

966 downloads per month
Used in 4 crates (3 directly)

BlueOak-1.0.0 OR MIT OR Apache-2.0

69KB
879 lines

sanitise-file-name: an unusually flexible and efficient file name sanitiser

At the time of writing, I believe this to be one of the very best file name sanitisers around (comparing it with extant Rust options like sanitize-filename and sanitize-filename-reader-friendly, and other implementations I found for environments like Node.js and Python; I didn’t look at anything C/C++).

  • It’s faster: while its flexibility may act against it in some cases (depending on the optimiser), it starts out with the substantial advantage of making exactly one allocation, whereas the alternatives (even Rust ones like sanitize-filename and sanitize-filename-reader-friendly) make at least three or four, normally quite a few more. (Note that I haven’t done any benchmarking comparison.) What’s more, it lets you keep on reusing one large-enough buffer if you want, for amortised zero allocations.

  • It’s better documented: each option declares precisely what it does, why you might care, and sometimes gives extra suggestions (e.g. “if you want to support HFS+, normalise to NFD first so the length limit is correct”).

  • It’s more flexible: you can choose whether you want things like Windows-safety and URL-safety, plus there are more options for producing probably-prettier results (mostly inspired a bit by sanitize-filename-reader-friendly).

  • It’s more correct: it doesn’t remove unnecessary characters and does remove all necessary characters (a surprisingly rare combination, though certainly not unknown). Length limits are implemented correctly, counting UTF-8 code units rather than UTF-16 code units, Unicode code points or scalar values, and they truncate the base name rather than the extension where possible, a feature that I haven’t found in any other library. If you prefer to append the extension afterwards, I’ve got you covered, including adjusting the length limit, which is also a feature that I haven’t found in any other library.

  • It behaves in a platform- and file-system-neutral way, because matching the local platform’s behaviour is just asking for trouble, especially in cases where you can’t accurately detect the file system in use. Instead, it supports all even vaguely popular file systems by default (which only care about ␀ and /), and you can opt out of Windows support since it’s the only one with even mildly cumbersome rules.

    • But ext3cow, which doesn’t allow @, is not supported.
    • And HFS+ environments where : is reserved are only supported incidentally via Windows-safety; but I believe (without having definitely confirmed this) that that’s pretty much ancient history, Mac OS 9 or so from memory.
  • It doesn’t even require std or alloc (though they’re enabled by default): it can support tinyvec_string::ArrayString, requiring no more than 510 bytes under the default options (and only that much because of extension cleverness).

  • It uses The Original And The Best™ English. (That is: sanitise instead of sanitize, and file name instead of filename.) As a concession to Americans, though, the functions are also exported under the spelling sanitize; but you’ll still have to steel yourselves to spelling it sanitise in the crate name.
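The length-limit behaviour described in the list above (counting UTF-8 code units, and shortening the base name rather than the extension) can be sketched with the standard library alone. This is only an illustration under simplifying assumptions: `truncate_base_name` is a hypothetical helper, not a function of this crate, and the crate’s own extension handling may well differ in its details.

```rust
/// Hypothetical helper, NOT part of the sanitise-file-name API.
/// Shortens `name` to at most `limit` bytes of UTF-8, cutting the base name
/// and keeping the extension where possible.
fn truncate_base_name(name: &str, limit: usize) -> String {
    // `str::len` counts UTF-8 code units (bytes), which is the unit most
    // file systems use for their length limits.
    if name.len() <= limit {
        return name.to_string();
    }

    // Assumption: the extension is everything from the last '.' onwards,
    // unless that dot is the first character (a dotfile such as ".bashrc").
    let (base, ext) = match name.rfind('.') {
        Some(i) if i > 0 => name.split_at(i),
        _ => (name, ""),
    };

    // If the extension alone would use up the whole limit, give up on
    // preserving it and truncate the name as a whole.
    let (base, ext) = if ext.len() >= limit { (name, "") } else { (base, ext) };

    // Step back to a character boundary so no code point is cut in half.
    let mut end = (limit - ext.len()).min(base.len());
    while !base.is_char_boundary(end) {
        end -= 1;
    }

    // One allocation, sized up front.
    let mut out = String::with_capacity(end + ext.len());
    out.push_str(&base[..end]);
    out.push_str(ext);
    out
}
```

With a limit of 12 bytes, for example, `truncate_base_name("a-very-long-report-name.txt", 12)` keeps `.txt` and shortens only the part before it. Counting bytes rather than characters matters for non-ASCII names: each `é` occupies two bytes, so the helper has to back off to a character boundary instead of slicing mid-character.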

Demonstration of the simplest and most convenient form of usage:

use sanitise_file_name::sanitise;

fn main() {
	// Examples of some of the things it can do:
	// whitespace is collapsed to one space,
	// various ASCII punctuation gets replaced by underscores,
	// outer whitespace is trimmed.
	// (There are reasons for each of these things,
	// and they can all be turned off or customised with options.)
	assert_eq!(
		sanitise("   https://example.com/Some\tfile \u{a0}  name .exe "),
		"https___example.com_Some file name.exe",
	);

	// The windows_safe option leads to the addition of the underscore.
	assert_eq!(sanitise("aux.h"), "aux_.h");
}
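To show why `aux.h` gains an underscore in the example above: Windows reserves a handful of device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9), case-insensitively and regardless of any extension. The following is a simplified sketch of that one rule only. `avoid_windows_device_name` is a hypothetical helper written for this illustration, not the crate’s implementation, and the real `windows_safe` option covers more than this.

```rust
/// Hypothetical helper, NOT part of the sanitise-file-name API.
/// Appends an underscore to the base name when it matches one of the
/// classic Windows reserved device names.
fn avoid_windows_device_name(name: &str) -> String {
    const RESERVED: &[&str] = &[
        "CON", "PRN", "AUX", "NUL",
        "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
        "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
    ];

    // Windows looks at the part before the FIRST dot, so "aux.tar.gz" is
    // just as reserved as "aux".
    let (base, rest) = match name.find('.') {
        Some(i) => name.split_at(i),
        None => (name, ""),
    };

    if RESERVED.iter().any(|r| r.eq_ignore_ascii_case(base)) {
        // Put the underscore between the base name and the extension.
        format!("{}_{}", base, rest)
    } else {
        name.to_string()
    }
}
```

Names that merely start with a reserved word, such as `auxiliary.h`, pass through unchanged, because only an exact match of the base name is treated as reserved.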

The sanitise_file_name::Options docs explain all sanitisation functionality precisely, and all of it is customisable.

This crate supports no_std operation and has several other Cargo features; refer to the root of the crate docs for information.

Dependencies

~67KB