11 releases

0.1.10 Dec 1, 2024
0.1.9 Nov 30, 2024
0.1.7 Oct 23, 2024
0.1.4 Aug 29, 2024

MIT license

37KB
660 lines

auto_encoder

auto_encoder is a Rust library that detects text encodings from a locale or from content, recognizes known binary file formats, and converts bytes in language-specific encodings into Rust strings.

Features

  • Automatic Encoding Detection: Detects text encoding based on locale or content.
  • Binary Format Detection: Checks if a given file is a known binary format by inspecting its initial bytes.
  • HTML Language Detection: Extracts and detects the language of an HTML document from its content.

Installation

Add this to your Cargo.toml:

[dependencies]
auto_encoder = "0.1"

Usage

Encoding Detection

Automatically detect the encoding for a given locale:

use auto_encoder::encoding_for_locale;

let encoding = encoding_for_locale("ja-jp").unwrap();
println!("Encoding for Japanese locale: {:?}", encoding);

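Convert HTML bytes to a String using the encoding associated with a language code:

use auto_encoder::encode_bytes_from_language;

// Byte-string literals (b"...") accept ASCII only, so take the bytes of a UTF-8 string instead.
let html_content = "こんにちは、世界!".as_bytes();
let encoded = encode_bytes_from_language(html_content, "ja");
println!("Encoded content: {}", encoded);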

Binary Format Detection

Check whether file content begins with the signature of a known binary format:

use auto_encoder::is_binary_file;

let file_content = &[0xFF, 0xD8, 0xFF]; // JPEG file signature
let is_binary = is_binary_file(file_content);
println!("Is the file a known binary format? {}", is_binary);
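
To make the idea of inspecting initial bytes concrete, here is a minimal standard-library sketch of signature matching. It is not the crate's implementation: `SIGNATURES` and `sniff_format` are hypothetical names, and the table holds only a handful of widely documented magic numbers.

```rust
/// Hypothetical signature table: (leading bytes, format name).
/// Illustrative only; this is not the list used by auto_encoder.
const SIGNATURES: &[(&[u8], &str)] = &[
    (&[0xFF, 0xD8, 0xFF], "jpeg"),
    (&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A], "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (&[b'P', b'K', 0x03, 0x04], "zip"),
];

/// Hypothetical helper: return the name of the first format whose
/// signature is a prefix of `content`, or `None` if nothing matches.
pub fn sniff_format(content: &[u8]) -> Option<&'static str> {
    SIGNATURES
        .iter()
        .find(|(magic, _)| content.starts_with(magic))
        .map(|(_, name)| *name)
}
```

Calling `sniff_format(&[0xFF, 0xD8, 0xFF, 0xE0])` yields `Some("jpeg")`, while plain text yields `None`. A boolean check in the style of `is_binary_file` would then be `sniff_format(content).is_some()`.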

HTML Language Detection

Detect the language attribute from an HTML document:

use auto_encoder::detect_language;

let html_content = br#"<html lang="en"><head><title>Test</title></head><body></body></html>"#;
let language = detect_language(html_content).unwrap();
println!("Language detected: {}", language);
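
For readers curious what reading the language attribute involves, the following is a deliberately naive standard-library sketch. It assumes the attribute sits on the opening `<html>` tag and uses plain string search instead of an HTML parser. `html_lang_attr` is a hypothetical name and does not reflect how `detect_language` is implemented.

```rust
/// Hypothetical helper: return the lowercased value of the `lang`
/// attribute on the opening `<html>` tag, if present and non-empty.
/// Naive by design: string search only, no real HTML parsing.
pub fn html_lang_attr(html: &[u8]) -> Option<String> {
    // Lossy conversion keeps the sketch simple; tag names and attributes are ASCII.
    let lower = String::from_utf8_lossy(html).to_ascii_lowercase();

    // Isolate the opening tag, from "<html" up to (not including) the next '>'.
    let tag_start = lower.find("<html")?;
    let tag_end = tag_start + lower[tag_start..].find('>')?;
    let tag = &lower[tag_start..tag_end];

    // Require a leading space so `xml:lang=` or `data-lang=` do not match.
    let attr = tag.find(" lang=")? + " lang=".len();
    let value = tag[attr..].trim_start_matches(|c: char| c == '"' || c == '\'');

    // The value ends at the closing quote, at whitespace, or at the end of the tag.
    let end = value
        .find(|c: char| c == '"' || c == '\'' || c.is_whitespace())
        .unwrap_or(value.len());
    let value = &value[..end];

    if value.is_empty() {
        None
    } else {
        Some(value.to_string())
    }
}
```

The sketch handles double-quoted, single-quoted, and unquoted values, and returns `None` when the attribute is missing. Documents without a `lang` attribute would need content-based detection, which this sketch does not attempt.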

API Documentation

Functions

encoding_for_locale

Get the encoding for a given locale if found.

pub fn encoding_for_locale(locale: &str) -> Option<&'static encoding_rs::Encoding>;

is_binary_file

Check if the file is a known binary format using its initial bytes.

pub fn is_binary_file(content: &[u8]) -> bool;

detect_language

Detect the language of an HTML resource based on its content.

pub fn detect_language(html_content: &[u8]) -> Option<String>;

encode_bytes

Return the content as a String, interpreting the bytes with the given encoding. Pass a valid encoding label such as SHIFT_JIS.

pub fn encode_bytes(html: &[u8], label: &str) -> String;

encode_bytes_from_language

Return the content as a String, interpreting the bytes with the encoding associated with a language code (e.g., ja for Japanese).

pub fn encode_bytes_from_language(html: &[u8], language: &str) -> String;

Supported Locales and Encodings

The library maps a wide range of locales to their corresponding encodings, for example WINDOWS_1252 for Western European languages, SHIFT_JIS for Japanese, and GB18030 for Simplified Chinese.
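
As a rough illustration of a locale-to-encoding lookup, the sketch below normalizes a locale tag and falls back from the full tag to its primary language subtag. Everything here is an assumption made for the example: `legacy_encoding_label` and `lookup` are hypothetical names, the table covers only the three encodings named above, and the choice of which Western European languages to list is arbitrary. The crate's own table and matching rules may differ.

```rust
/// Hypothetical lookup: map a locale tag to an encoding label.
/// Accepts forms like "ja-jp", "ja_JP", or "JA".
pub fn legacy_encoding_label(locale: &str) -> Option<&'static str> {
    // Normalize case and separators: "ja_JP" becomes "ja-jp".
    let normalized = locale.trim().to_ascii_lowercase().replace('_', "-");
    // Try the full tag first, then fall back to the primary language subtag.
    let primary = normalized.split('-').next().unwrap_or("");
    lookup(&normalized).or_else(|| lookup(primary))
}

/// Illustrative table only; not the crate's actual mapping.
fn lookup(tag: &str) -> Option<&'static str> {
    match tag {
        "ja" => Some("Shift_JIS"),
        "zh-cn" => Some("gb18030"),
        "en" | "fr" | "de" | "es" => Some("windows-1252"),
        _ => None,
    }
}
```

With this table, `"ja-jp"` resolves through its primary subtag `ja`, `"zh_CN"` resolves through the normalized full tag, and an unlisted tag such as `"zh-tw"` returns `None` instead of guessing.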

Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.
