1 unstable release
Uses new Rust 2024
Version 0.1.0, released Mar 15, 2025
#698 in Text processing
71 downloads per month
31KB
537 lines
Arabic Text Utils
A comprehensive Rust library for Arabic text processing and manipulation. This crate provides a collection of utilities for working with Arabic text, including character analysis, number conversion, text normalization, and more.
🇵🇸 Free Palestine
We stand in solidarity with the Palestinian people. To learn more about supporting Palestine, please visit BDS Movement.
Features
- Character Analysis
  - Identify Arabic characters
  - Get Arabic character names
  - Check for diacritical marks (harakat)
- Number Handling
  - Convert between Arabic and Western numerals
  - Support for Eastern Arabic numerals (٠١٢٣٤٥٦٧٨٩)
- Text Processing
  - Remove diacritical marks (tashkeel)
  - Normalize Arabic text
  - Detect Arabic content
  - Count Arabic words
  - Extract Arabic text from mixed content
  - RTL character detection
  - Character frequency analysis
  - Word and sentence segmentation
  - URL-friendly slug generation
  - Text wrapping
  - Text sanitization
- Presentation Forms
  - Normalize Arabic presentation forms
  - Handle Arabic ligatures
  - Convert between isolated and connected forms
- Punctuation
  - Convert between Arabic and Latin punctuation marks
Installation
Add this to your Cargo.toml:
[dependencies]
arabic_text_utils = "0.1.0"
Usage
use arabic_text_utils::{
    remove_tashkeel,
    normalize_arabic,
    convert_numbers_to_arabic,
    is_arabic_char,
};

// Remove diacritical marks
let text = "مَرْحَباً بِكُمْ";
assert_eq!(remove_tashkeel(text), "مرحبا بكم");

// Normalize Arabic text
let text = "ﷺ"; // Arabic ligature
let normalized = normalize_arabic(text);
assert_eq!(normalized, "صلى الله عليه وسلم");

// Convert numbers to Arabic
let text = "Page 123";
assert_eq!(convert_numbers_to_arabic(text), "Page ١٢٣");

// Check if a character is Arabic
assert!(is_arabic_char('ع'));
assert!(!is_arabic_char('x'));
Documentation
For detailed documentation and examples, please visit docs.rs/arabic_text_utils.
Features in Detail
Character Module
- is_arabic_char: Detect Arabic characters
- get_arabic_char_name: Get Unicode names for Arabic characters
- is_haraka: Check for diacritical marks
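To illustrate the kind of check these functions perform, here is a minimal standalone sketch based on Unicode ranges. It is not the crate's implementation: the names `in_arabic_block` and `is_haraka_sketch` are hypothetical, and the ranges (the main Arabic block and the eight core tashkeel marks) are an assumption about what a basic check would cover.

```rust
/// Hypothetical stand-in: true if `c` lies in the main Arabic block,
/// U+0600..=U+06FF. Supplementary Arabic blocks are ignored here.
fn in_arabic_block(c: char) -> bool {
    ('\u{0600}'..='\u{06FF}').contains(&c)
}

/// Hypothetical stand-in: true for the core tashkeel marks,
/// U+064B (fathatan) through U+0652 (sukun).
fn is_haraka_sketch(c: char) -> bool {
    ('\u{064B}'..='\u{0652}').contains(&c)
}
```

With these definitions, `in_arabic_block('ع')` is true and `in_arabic_block('x')` is false, while `is_haraka_sketch` accepts a fatha (U+064E) and rejects a base letter such as ب.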
Numbers Module
- convert_numbers_to_arabic: Convert Western numerals to Arabic
- convert_numbers_from_arabic: Convert Arabic numerals to Western
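Digit conversion of this kind can be done as a per-character offset between ASCII digits and the Eastern Arabic digits U+0660 to U+0669. The sketch below shows the idea using only the standard library; the function names are hypothetical and the crate may handle more cases (for example other digit sets) than this does.

```rust
/// Hypothetical helper: replace ASCII digits with Eastern Arabic digits
/// (U+0660..=U+0669). All other characters pass through unchanged.
fn to_eastern_arabic_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '0'..='9' => char::from_u32(0x0660 + (c as u32 - '0' as u32)).unwrap_or(c),
            _ => c,
        })
        .collect()
}

/// Hypothetical helper: the inverse mapping, back to ASCII digits.
fn to_western_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0660}'..='\u{0669}' => {
                char::from_u32('0' as u32 + (c as u32 - 0x0660)).unwrap_or(c)
            }
            _ => c,
        })
        .collect()
}
```

For example, `to_eastern_arabic_digits("Page 123")` yields `"Page ١٢٣"`, and applying `to_western_digits` to that result returns the original string.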
Text Module
- remove_tashkeel: Strip diacritical marks
- normalize_arabic: Standardize Arabic text representation
- contains_arabic: Check for Arabic content
- count_arabic_words: Count Arabic words in text
- extract_arabic_text: Extract Arabic-only content
- has_rtl_characters: Detect right-to-left characters
- arabic_character_frequency: Analyze character distribution
- segment_words: Split text into words
- segment_sentences: Split text into sentences
- generate_slug: Create URL-friendly text
- wrap_text: Wrap text at specified width
- sanitize_arabic: Clean and standardize Arabic text
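Two of these operations, stripping diacritics and counting Arabic words, can be sketched with plain iterator code. The version below is an assumption about how such functions might work, not the crate's source: the names are hypothetical, only the marks U+064B to U+0652 are treated as tashkeel, and a "word" is any whitespace-separated token containing at least one character from the main Arabic block.

```rust
/// Hypothetical helper: drop the core tashkeel marks (U+064B..=U+0652)
/// and keep every other character as is.
fn strip_harakat(input: &str) -> String {
    input
        .chars()
        .filter(|c| !('\u{064B}'..='\u{0652}').contains(c))
        .collect()
}

/// Hypothetical helper: count whitespace-separated tokens that contain
/// at least one character from the main Arabic block (U+0600..=U+06FF).
fn count_words_with_arabic(input: &str) -> usize {
    input
        .split_whitespace()
        .filter(|word| word.chars().any(|c| ('\u{0600}'..='\u{06FF}').contains(&c)))
        .count()
}
```

Under these assumptions, `strip_harakat("كَتَبَ")` returns `"كتب"`, and `count_words_with_arabic` skips purely Latin or numeric tokens in mixed text.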
Presentation Module
- normalize_presentation_forms: Standardize Arabic character forms
- replace_ligatures: Handle special character combinations
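Presentation-form normalization maps positional glyph code points (isolated, initial, medial, final) and ligatures back to their base letters. The sketch below covers only a handful of code points from Arabic Presentation Forms-B to show the shape of such a mapping; the function names are hypothetical, and a full implementation would need the complete table (or Unicode compatibility normalization), which this example does not attempt.

```rust
/// Hypothetical lookup for a few Presentation Forms-B code points.
/// Returns the base letter(s), or None if `c` is not in this small table.
fn expand_presentation_form(c: char) -> Option<&'static str> {
    match c {
        '\u{FE8D}' | '\u{FE8E}' => Some("\u{0627}"),         // alef: isolated, final
        '\u{FE8F}'..='\u{FE92}' => Some("\u{0628}"),         // beh: all four forms
        '\u{FEDD}'..='\u{FEE0}' => Some("\u{0644}"),         // lam: all four forms
        '\u{FEFB}' | '\u{FEFC}' => Some("\u{0644}\u{0627}"), // lam-alef ligature
        _ => None,
    }
}

/// Hypothetical helper: rewrite a string using the table above,
/// leaving characters outside the table untouched.
fn normalize_forms_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match expand_presentation_form(c) {
            Some(base) => out.push_str(base),
            None => out.push(c),
        }
    }
    out
}
```

Here the lam-alef ligature U+FEFB expands to the two-letter sequence لا, which is why the lookup returns a string slice rather than a single `char`.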
Punctuation Module
- convert_punctuation: Convert between Arabic and Latin punctuation
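As a rough illustration of punctuation conversion, the sketch below swaps the three marks that have direct Arabic counterparts: the comma (U+060C), semicolon (U+061B), and question mark (U+061F). The function names and the choice of marks are assumptions made for this example; the crate's `convert_punctuation` may take different parameters and cover more characters.

```rust
/// Hypothetical helper: replace Latin comma, semicolon, and question mark
/// with their Arabic counterparts.
fn latin_to_arabic_punct(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            ',' => '\u{060C}', // Arabic comma
            ';' => '\u{061B}', // Arabic semicolon
            '?' => '\u{061F}', // Arabic question mark
            other => other,
        })
        .collect()
}

/// Hypothetical helper: the reverse mapping, Arabic marks to Latin ones.
fn arabic_to_latin_punct(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{060C}' => ',',
            '\u{061B}' => ';',
            '\u{061F}' => '?',
            other => other,
        })
        .collect()
}
```

Because each mark maps one-to-one, the two functions are inverses of each other on text that uses only these three marks.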
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Unicode Standard for Arabic script processing
- The Rust community for their valuable feedback and contributions