2 releases: 0.1.1 (new, May 16, 2025) and 0.1.0 (May 16, 2025)
superfold
A multilingual Rust library and CLI tool that processes UTF-8 strings, stripping diacritics and folding non-phonetic graphemes into a phonetic ASCII representation (romanization by transliteration). The library preserves the original whitespace (spaces, tabs, newlines, etc.) and transforms only the word content and emoji. This means that:
- Text in Sino-Tibetan and Japonic languages, such as Chinese and Japanese, is represented as ASCII.
- Emoji are replaced by their name enclosed in ":", so 🍆 becomes ":eggplant:".
Examples:
use superfold::fold;
assert_eq!(fold("北亰"), "BeiJing");
assert_eq!(fold("🦄"), ":unicorn:");
// Whitespace and structure are preserved:
assert_eq!(
fold(" 你好 世界\nNext line with piejlüsse কথাটা 🦄!"),
" NiHao ShiJie\nNext line with piejlusse kotha :unicorn:!"
);
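The whitespace-preserving behaviour described above can be illustrated with a minimal, standard-library-only sketch. It is not superfold's implementation: `fold_char` and `fold_sketch` are hypothetical names, and the tiny lookup table stands in for the real transliteration data, which is assumed to be far larger.

```rust
/// Hypothetical stand-in for a real transliteration table: maps one
/// character to an ASCII replacement, or returns None if it is unknown.
fn fold_char(c: char) -> Option<&'static str> {
    match c {
        '\u{e3}' | '\u{e1}' | '\u{e0}' => Some("a"), // ã á à
        '\u{fc}' | '\u{fa}' => Some("u"),            // ü ú
        '\u{e7}' => Some("c"),                       // ç
        '\u{1F984}' => Some(":unicorn:"),            // 🦄
        _ => None,
    }
}

/// Folds a string character by character (an illustrative sketch only).
fn fold_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if c.is_ascii() {
            // ASCII, including spaces, tabs and newlines, passes through untouched.
            out.push(c);
        } else if let Some(replacement) = fold_char(c) {
            out.push_str(replacement);
        } else {
            // Not in the sketch's table: left unchanged.
            out.push(c);
        }
    }
    out
}
```

Because every whitespace character is copied verbatim, the layout of the input survives folding. The sketch assumes precomposed characters (for example U+00E3 for ã); a decomposed sequence would need Unicode normalization first.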
This library is inspired by the great work of others, such as:
CLI Usage
superfold can also be used as a command-line tool to process files and directories.
Installation:
If you have Rust installed, you can build and install the CLI:
cargo install --path . # Run from the root of the superfold project directory
Or, after building with cargo build --release, find the binary at target/release/superfold.
Usage:
superfold [OPTIONS] [INPUTS]...
Options:
-o, --output-dir <OUTPUT_DIR>: Output directory for processed files when multiple inputs or a directory are provided. Defaults to "superfold_output".
-f, --overwrite: Overwrite output files or directory if they already exist.
-h, --help: Print help information.
-V, --version: Print version information.
Examples:
- Fold a string from stdin:
  echo "precisão" | superfold
  Output:
  precisao
- Fold a single file (outputs to filename_folded.ext):
  superfold myfile.txt
  This will create myfile_folded.txt in the same directory.
- Fold specific files into an output directory:
  superfold file1.txt path/to/file2.log -o my_folded_texts
  This will create my_folded_texts/file1.txt and my_folded_texts/file2.log.
- Fold all text files in a directory (recursively) into an output directory:
  superfold ./input_documents --output-dir ./folded_documents
  This will process text files in ./input_documents and its subdirectories, replicating the structure in ./folded_documents.
- Overwrite existing output:
  superfold myfile.txt -f
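The output naming rules in the examples above can be sketched with `std::path`. This is an illustration of the described behaviour, not the CLI's actual code: the function names `folded_sibling_path`, `path_in_output_dir`, and `mirrored_path` are hypothetical, and edge cases (such as files without an extension) are handled by assumption.

```rust
use std::path::{Path, PathBuf};

/// Single-file case: `myfile.txt` becomes `myfile_folded.txt` next to the input.
fn folded_sibling_path(input: &Path) -> PathBuf {
    // Fallback stem is an assumption for paths without a usable file name.
    let stem = input.file_stem().and_then(|s| s.to_str()).unwrap_or("output");
    let name = match input.extension().and_then(|e| e.to_str()) {
        Some(ext) => format!("{}_folded.{}", stem, ext),
        None => format!("{}_folded", stem),
    };
    input.with_file_name(name)
}

/// Multiple-file case: each input keeps its file name inside the output directory.
fn path_in_output_dir(input: &Path, output_dir: &Path) -> PathBuf {
    match input.file_name() {
        Some(name) => output_dir.join(name),
        None => output_dir.to_path_buf(),
    }
}

/// Directory case: the path relative to the input root is replicated under
/// the output root. Returns None if `file` is not located under `input_root`.
fn mirrored_path(input_root: &Path, file: &Path, output_root: &Path) -> Option<PathBuf> {
    let relative = file.strip_prefix(input_root).ok()?;
    Some(output_root.join(relative))
}
```

Keeping the three cases as separate pure functions makes each rule easy to check in isolation, without touching the file system.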
Piping:
superfold supports piping from stdin and to stdout, fitting into standard Unix pipelines:
cat long_text_file.txt | superfold > output.txt
echo "你好 🦄" | superfold | sed 's/:unicorn:/U/' # Example of further processing
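For readers curious how a stdin-to-stdout filter of this kind is typically structured in Rust, here is a minimal sketch using only the standard library. `run_filter` is a hypothetical helper, not part of superfold; the transformation is passed in as a closure, so the folding function is an assumption left to the caller.

```rust
use std::io::{self, Read, Write};

/// Reads all of `input` as UTF-8, applies `transform`, and writes the result
/// to `output`. Generic over reader and writer so it works with stdin/stdout
/// as well as in-memory buffers.
fn run_filter<R: Read, W: Write>(
    mut input: R,
    mut output: W,
    transform: impl Fn(&str) -> String,
) -> io::Result<()> {
    let mut text = String::new();
    input.read_to_string(&mut text)?; // returns an error on invalid UTF-8
    output.write_all(transform(&text).as_bytes())?;
    output.flush()
}
```

In a binary this could be called as `run_filter(io::stdin().lock(), io::stdout().lock(), my_fold)`, where `my_fold` is whichever folding function is in use. Reading the whole input at once keeps the sketch short; a line-by-line loop would suit very large inputs better.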