#text #english #ngrams #gibberish #dictionary #analysis #figure

bin+lib gibberish-or-not

Figure out if text is gibberish or not

11 releases (4 stable)

new 1.3.0 Feb 27, 2025
0.7.0 Feb 23, 2025

#379 in Text processing

589 downloads per month

MIT license

6.5MB
375K SLoC

🔍 Gibberish Detection Tool

Instantly detect if text is English or nonsense with 99% accuracy

⚡ Quick Install

# As a CLI tool
cargo install gibberish-or-not

# As a library, in Cargo.toml
[dependencies]
gibberish-or-not = "1.0.0"

✨ Features

🚀 Lightning Fast

  • Zero runtime loading
  • Perfect hash table lookups
  • Optimized for speed

📚 Smart Analysis

  • Dictionary of 370k+ words
  • N-gram pattern matching
  • Frequency analysis

🎯 High Accuracy

  • 99% detection rate
  • Handles edge cases
  • Works with technical text

🎯 Examples

use gibberish_or_not::is_gibberish;

// Valid English
assert!(!is_gibberish("The quick brown fox jumps over the lazy dog"));
assert!(!is_gibberish("Technical terms like TCP/IP and README.md work too"));

// Gibberish
assert!(is_gibberish("asdf jkl qwerty"));
assert!(is_gibberish("xkcd vwpq mntb"));

🔬 How It Works

Our advanced detection algorithm uses three main components:

1. 📚 Dictionary Analysis

  • 370,000+ English words compiled into the binary
  • Perfect hash table for O(1) lookups
  • Zero runtime loading overhead
  • Includes technical terms and proper nouns
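
To make the dictionary stage concrete, here is a minimal sketch of counting the dictionary words in a piece of text. It is not the crate's code: the word list is a tiny stand-in, and a runtime `HashSet` replaces the compile-time perfect hash table described above. `WORDS`, `build_dictionary` and `count_english_words` are hypothetical names used only for illustration.

```rust
use std::collections::HashSet;

// Tiny stand-in word list; the crate is described as shipping 370,000+ words.
const WORDS: &[&str] = &["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"];

/// Builds the lookup set. Hypothetical helper: this sketch uses a runtime
/// HashSet where the crate is described as using a perfect hash table.
fn build_dictionary() -> HashSet<&'static str> {
    WORDS.iter().copied().collect()
}

/// Counts whitespace-separated tokens that are dictionary words,
/// ignoring case and any punctuation around each token.
fn count_english_words(text: &str, dict: &HashSet<&'static str>) -> usize {
    text.split_whitespace()
        .map(|tok| tok.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty() && dict.contains(w.as_str()))
        .count()
}
```

The word count is what the classification step keys on, so punctuation and case are normalised before the lookup. How the crate itself tokenises text is not stated here; splitting on whitespace is an assumption.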

2. 🧮 N-gram Analysis

  • Trigrams (3-letter sequences)
    • Needs a >15% match for single-word texts (e.g. Bletchley)
    • Needs a >10% match for no-word texts (e.g. TextThatLooksLikeThisWhichCouldTripItUpOtherwise)
  • Quadgrams (4-letter sequences)
    • Needs a >10% match for single-word texts
    • Needs a >5% match for no-word texts (e.g. TextThatLooksLikeThisWhichCouldTripItUpOtherwise)
  • Trained on a large English text corpus
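
The "match" percentages above can be read as the share of a text's n-grams that appear in a table of known English n-grams. The sketch below shows one way to compute that share. It assumes that only ASCII letters count and that n-grams run across word boundaries, neither of which is confirmed by this README. `ngrams` and `match_ratio` are hypothetical names.

```rust
use std::collections::HashSet;

/// Returns every overlapping n-letter window of `text`, lowercased,
/// after dropping every character that is not an ASCII letter.
fn ngrams(text: &str, n: usize) -> Vec<String> {
    let letters: Vec<char> = text
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase())
        .collect();
    // `windows` panics on a size of zero, and short input has no n-grams.
    if n == 0 || letters.len() < n {
        return Vec::new();
    }
    letters
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Fraction (0.0 to 1.0) of the text's n-grams that appear in `known`.
/// Returns 0.0 when the text is too short to contain any n-gram.
fn match_ratio(text: &str, n: usize, known: &HashSet<&str>) -> f64 {
    let grams = ngrams(text, n);
    if grams.is_empty() {
        return 0.0;
    }
    let hits = grams.iter().filter(|g| known.contains(g.as_str())).count();
    hits as f64 / grams.len() as f64
}
```

With a trigram table containing "the" and "her", the text "there" yields the trigrams "the", "her" and "ere", so two of its three trigrams match.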

3. 🎯 Smart Classification

  • Text with 2+ English words → Valid English
  • Text with 1 English word → Must pass n-gram thresholds
  • Text with no English words → Must pass lower n-gram thresholds
  • Short text (<10 chars) → Dictionary check only (not enough data for n-grams)
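
The rules above can be sketched as a single decision function. This is an illustration under stated assumptions, not the crate's implementation. It assumes the trigram and quadgram thresholds must both be exceeded, that the short-text rule accepts one dictionary word, and that the 2+ word rule is checked first. The `Evidence` struct and `looks_english` are hypothetical names.

```rust
/// Inputs the classifier needs. In practice they would come from the
/// dictionary and n-gram stages. Hypothetical type, not part of the crate's API.
struct Evidence {
    char_count: usize,    // length of the input text in characters
    english_words: usize, // tokens found in the dictionary
    trigram_ratio: f64,   // fraction of trigrams matched, 0.0 to 1.0
    quadgram_ratio: f64,  // fraction of quadgrams matched, 0.0 to 1.0
}

/// Returns true when the evidence points to valid English.
fn looks_english(e: &Evidence) -> bool {
    // Two or more dictionary words: accept outright.
    if e.english_words >= 2 {
        return true;
    }
    // Short text: dictionary check only.
    // Assumption: a single dictionary word is enough to pass.
    if e.char_count < 10 {
        return e.english_words >= 1;
    }
    // Thresholds from the n-gram section: higher with one word, lower with none.
    let (tri_min, quad_min) = if e.english_words == 1 {
        (0.15, 0.10)
    } else {
        (0.10, 0.05)
    };
    // Assumption: both thresholds must be exceeded, not just one of them.
    e.trigram_ratio > tri_min && e.quadgram_ratio > quad_min
}
```

Keeping the evidence in a plain struct separates the scoring from the decision, so the thresholds can be tested without a dictionary or an n-gram table.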

👥 Contributing

Contributions are welcome! Here's how you can help:

  • 🐛 Report bugs and request features
  • 📝 Improve documentation
  • 🔧 Submit pull requests
  • 💡 Share ideas and feedback

📜 License

MIT License - see the LICENSE file for details.

Dependencies

~5MB
~148K SLoC