6 releases (stable)

Uses old Rust 2015

2.2.0	Jan 26, 2022
2.1.0	Jan 29, 2020
2.0.0	Oct 9, 2016
1.0.1	Oct 9, 2016
0.1.0	Aug 31, 2015

#1847 in Text processing

17,547 downloads per month
Used in 19 crates (3 directly)

MIT license

14KB
194 lines

xhtmlchardet

Basic character set detection for XML and HTML in Rust.

Minimum Supported Rust Version: 1.24.0

Example

extern crate xhtmlchardet;

use std::io::Cursor;

fn main() {
    // An XML document whose declaration names its encoding.
    let text = b"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><channel><title>Example</title></channel>";
    let mut text_cursor = Cursor::new(text.to_vec());
    // detect reads from the cursor and returns the candidate charsets, lowercased.
    let detected_charsets: Vec<String> = xhtmlchardet::detect(&mut text_cursor, None).unwrap();
    assert_eq!(detected_charsets, vec!["iso-8859-1".to_string()]);
}

Rationale

I wrote a feed crawler that needed to determine the character set of fetched content so that it could be normalised to UTF-8. Initially I used the uchardet crate, but I encountered some situations where it misdetected the charset. I collected these edge cases into a test suite, then implemented this crate, which passes all of those tests. It uses a fairly naïve approach derived from Appendix F (Autodetection of Character Encodings) of the XML specification.
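
To give a rough idea of what detection based on the XML specification's autodetection appendix can look like, here is a minimal sketch. It is not this crate's implementation: the function names sniff_xml_encoding and declared_encoding and the labels they return are hypothetical, chosen only for illustration. The sketch first matches the leading bytes against the byte patterns tabulated in the specification (a subset of them), then, for ASCII-compatible input, reads the encoding name from the XML declaration.

```rust
/// Guess an encoding family from the first bytes of an XML document.
/// Hypothetical helper for illustration; not part of xhtmlchardet's API.
fn sniff_xml_encoding(bytes: &[u8]) -> Option<&'static str> {
    // Ordered so that the 4-byte UCS-4 byte order marks are tried before
    // the 2-byte UTF-16 marks they begin with.
    const PATTERNS: &[(&[u8], &str)] = &[
        // With a byte order mark.
        (&[0x00, 0x00, 0xFE, 0xFF], "ucs-4be"),
        (&[0xFF, 0xFE, 0x00, 0x00], "ucs-4le"),
        (&[0xEF, 0xBB, 0xBF], "utf-8"),
        (&[0xFE, 0xFF], "utf-16be"),
        (&[0xFF, 0xFE], "utf-16le"),
        // Without a byte order mark: the bytes of "<?xm" or "<" as they
        // appear in each encoding family.
        (&[0x00, 0x00, 0x00, 0x3C], "ucs-4be"),
        (&[0x3C, 0x00, 0x00, 0x00], "ucs-4le"),
        (&[0x00, 0x3C, 0x00, 0x3F], "utf-16be"),
        (&[0x3C, 0x00, 0x3F, 0x00], "utf-16le"),
        (&[0x3C, 0x3F, 0x78, 0x6D], "ascii-compatible"),
        (&[0x4C, 0x6F, 0xA7, 0x94], "ebcdic"),
    ];
    for &(prefix, name) in PATTERNS {
        if bytes.starts_with(prefix) {
            return Some(name);
        }
    }
    None
}

/// Read the encoding name from an XML declaration in ASCII-compatible input.
/// Simplification: whitespace around `=` is not handled.
fn declared_encoding(bytes: &[u8]) -> Option<String> {
    let lossy = String::from_utf8_lossy(bytes);
    let text: &str = &lossy;
    // Only the declaration is inspected; it ends at the first "?>".
    let decl = &text[..text.find("?>")?];
    let after = &decl[decl.find("encoding=")? + "encoding=".len()..];
    // The value must be wrapped in matching single or double quotes.
    let quote = after.chars().next()?;
    if quote != '"' && quote != '\'' {
        return None;
    }
    let value = &after[1..];
    let end = value.find(quote)?;
    Some(value[..end].to_ascii_lowercase())
}
```

For the document in the example above, sniff_xml_encoding reports the ASCII-compatible family and declared_encoding then yields "iso-8859-1". Input that begins with a UTF-8 byte order mark is classified as "utf-8" without looking at the declaration at all.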

No runtime deps