arabic_text_utils

A Rust library for Arabic text processing and manipulation

1 unstable release

Uses new Rust 2024

new 0.1.0 Mar 15, 2025

MIT license

31KB
537 lines

Arabic Text Utils

A Rust library for Arabic text processing and manipulation. This crate provides utilities for working with Arabic text: character analysis, numeral conversion, diacritic removal, normalization, presentation-form handling, and punctuation conversion.

🇵🇸 Free Palestine

We stand in solidarity with the Palestinian people. To learn more about supporting Palestine, please visit BDS Movement.

Features

  • Character Analysis

    • Identify Arabic characters
    • Get Arabic character names
    • Check for diacritical marks (harakat)
  • Number Handling

    • Convert between Arabic and Western numerals
    • Support for Eastern Arabic numerals (٠١٢٣٤٥٦٧٨٩)
  • Text Processing

    • Remove diacritical marks (tashkeel)
    • Normalize Arabic text
    • Detect Arabic content
    • Count Arabic words
    • Extract Arabic text from mixed content
    • RTL character detection
    • Character frequency analysis
    • Word and sentence segmentation
    • URL-friendly slug generation
    • Text wrapping
    • Text sanitization
  • Presentation Forms

    • Normalize Arabic presentation forms
    • Handle Arabic ligatures
    • Convert between isolated and connected forms
  • Punctuation

    • Convert between Arabic and Latin punctuation marks

Installation

Add this to your Cargo.toml:

[dependencies]
arabic_text_utils = "0.1.0"

Usage

use arabic_text_utils::{
    remove_tashkeel,
    normalize_arabic,
    convert_numbers_to_arabic,
    is_arabic_char,
};

// Remove diacritical marks
let text = "مَرْحَباً بِكُمْ";
assert_eq!(remove_tashkeel(text), "مرحبا بكم");

// Normalize Arabic text
let text = "\u{FDFA}"; // Arabic ligature U+FDFA (ﷺ)
let normalized = normalize_arabic(text);
assert_eq!(normalized, "صلى الله عليه وسلم");

// Convert numbers to Arabic
let text = "Page 123";
assert_eq!(convert_numbers_to_arabic(text), "Page ١٢٣");

// Check if a character is Arabic
assert!(is_arabic_char('ع'));
assert!(!is_arabic_char('x'));

Documentation

For detailed documentation and examples, please visit docs.rs/arabic_text_utils.

Features in Detail

Character Module

  • is_arabic_char: Detect Arabic characters
  • get_arabic_char_name: Get Unicode names for Arabic characters
  • is_haraka: Check for diacritical marks
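
To illustrate what checks of this kind typically involve, here is a minimal sketch that classifies characters by Unicode code point range. The names looks_arabic and looks_like_haraka are hypothetical, and the exact ranges the crate uses are not stated in this README, so treat the ranges below as an assumption rather than the crate's behavior.

```rust
/// Hypothetical stand-in for an Arabic-character check (not the crate's source).
/// Assumption: "Arabic" means any code point in the main Arabic blocks.
pub fn looks_arabic(c: char) -> bool {
    matches!(
        c,
        '\u{0600}'..='\u{06FF}'       // Arabic
            | '\u{0750}'..='\u{077F}' // Arabic Supplement
            | '\u{08A0}'..='\u{08FF}' // Arabic Extended-A
            | '\u{FB50}'..='\u{FDFF}' // Arabic Presentation Forms-A
            | '\u{FE70}'..='\u{FEFC}' // Arabic Presentation Forms-B, excluding U+FEFF (BOM)
    )
}

/// Hypothetical haraka check: fathatan (U+064B) through sukun (U+0652).
pub fn looks_like_haraka(c: char) -> bool {
    matches!(c, '\u{064B}'..='\u{0652}')
}
```

Note that the main Arabic block also contains punctuation and digits, so a stricter letter-only check would need narrower ranges.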

Numbers Module

  • convert_numbers_to_arabic: Convert Western digits (0-9) to Eastern Arabic digits (٠-٩)
  • convert_numbers_from_arabic: Convert Eastern Arabic digits (٠-٩) to Western digits (0-9)
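
Digit conversion of this kind can be sketched as a fixed offset between ASCII digits and the Eastern Arabic digits U+0660 to U+0669. The function names below are hypothetical and the code is an illustration using only the standard library, not the crate's implementation.

```rust
/// Hypothetical sketch: map ASCII digits 0-9 to Eastern Arabic digits U+0660..U+0669.
/// All other characters pass through unchanged.
pub fn to_eastern_arabic_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '0'..='9' => char::from_u32(0x0660 + (c as u32 - '0' as u32)).unwrap_or(c),
            _ => c,
        })
        .collect()
}

/// Hypothetical sketch of the reverse mapping, back to ASCII digits.
pub fn to_western_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0660}'..='\u{0669}' => {
                char::from_u32('0' as u32 + (c as u32 - 0x0660)).unwrap_or(c)
            }
            _ => c,
        })
        .collect()
}
```

This sketch does not touch the Extended Arabic-Indic digits (U+06F0 to U+06F9) used in Persian and Urdu; whether the crate handles them is not stated here.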

Text Module

  • remove_tashkeel: Strip diacritical marks
  • normalize_arabic: Standardize Arabic text representation
  • contains_arabic: Check for Arabic content
  • count_arabic_words: Count Arabic words in text
  • extract_arabic_text: Extract Arabic-only content
  • has_rtl_characters: Detect right-to-left characters
  • arabic_character_frequency: Analyze character distribution
  • segment_words: Split text into words
  • segment_sentences: Split text into sentences
  • generate_slug: Create URL-friendly text
  • wrap_text: Wrap text at specified width
  • sanitize_arabic: Clean and standardize Arabic text
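
As a rough illustration of two of the operations listed above, the sketch below strips diacritics by filtering a code point range and counts whitespace-separated words that contain at least one Arabic letter. The function names and the chosen ranges are assumptions made for this example. The crate's own definitions of "tashkeel" and "Arabic word" may differ.

```rust
/// Assumption: tashkeel means U+064B (fathatan) through U+0652 (sukun).
fn is_tashkeel(c: char) -> bool {
    matches!(c, '\u{064B}'..='\u{0652}')
}

/// Hypothetical sketch: drop every tashkeel mark, keep everything else.
pub fn strip_tashkeel(input: &str) -> String {
    input.chars().filter(|&c| !is_tashkeel(c)).collect()
}

/// Hypothetical sketch: count whitespace-separated words that contain
/// at least one character in U+0621..U+064A (hamza through yeh).
pub fn count_words_with_arabic(input: &str) -> usize {
    input
        .split_whitespace()
        .filter(|word| word.chars().any(|c| matches!(c, '\u{0621}'..='\u{064A}')))
        .count()
}
```

Applied to the greeting from the Usage section, strip_tashkeel removes the fatha, sukun, and tanwin marks and leaves the base letters intact.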

Presentation Module

  • normalize_presentation_forms: Standardize Arabic character forms
  • replace_ligatures: Handle special character combinations
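
Presentation-form handling generally means mapping positional glyph code points (isolated, initial, medial, final) and ligatures back to the base letters. The sketch below covers only a handful of code points to show the shape of such a mapping. The function name and the table are illustrative assumptions, not the crate's implementation. A complete mapping would follow the Unicode compatibility decompositions.

```rust
/// Hypothetical sketch: fold a few presentation forms back to base letters.
/// Only ALEF, BEH, and the LAM-ALEF ligature are handled here.
pub fn fold_presentation_forms(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            // ALEF isolated / final forms -> U+0627
            '\u{FE8D}' | '\u{FE8E}' => out.push('\u{0627}'),
            // BEH isolated / final / initial / medial forms -> U+0628
            '\u{FE8F}'..='\u{FE92}' => out.push('\u{0628}'),
            // LAM WITH ALEF ligature (isolated / final) -> LAM + ALEF
            '\u{FEFB}' | '\u{FEFC}' => out.push_str("\u{0644}\u{0627}"),
            other => out.push(other),
        }
    }
    out
}
```

A ligature can expand into several characters, which is why the sketch builds a String rather than mapping char to char.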

Punctuation Module

  • convert_punctuation: Convert between Arabic and Latin punctuation
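
Punctuation conversion can be pictured as a small character table. The sketch below maps the Arabic comma, semicolon, and question mark to their Latin counterparts and back. The function names and the choice of these three marks are assumptions for illustration. This README does not show the signature of convert_punctuation or how it selects a direction.

```rust
/// Hypothetical sketch: Arabic comma, semicolon, question mark -> Latin.
pub fn arabic_punctuation_to_latin(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{060C}' => ',', // ARABIC COMMA
            '\u{061B}' => ';', // ARABIC SEMICOLON
            '\u{061F}' => '?', // ARABIC QUESTION MARK
            other => other,
        })
        .collect()
}

/// Hypothetical sketch of the reverse direction.
pub fn latin_punctuation_to_arabic(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            ',' => '\u{060C}',
            ';' => '\u{061B}',
            '?' => '\u{061F}',
            other => other,
        })
        .collect()
}
```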

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Unicode Standard for Arabic script processing
  • The Rust community for their valuable feedback and contributions

No runtime deps