6 releases (stable)
Uses old Rust 2015
Version | Released |
---|---|
2.2.0 | Jan 26, 2022 |
2.1.0 | Jan 29, 2020 |
2.0.0 | Oct 9, 2016 |
1.0.1 | Oct 9, 2016 |
0.1.0 | Aug 31, 2015 |
#1478 in Text processing
62 downloads per month
Used in 3 crates
14KB
194 lines
xhtmlchardet
Basic character set detection for XML and HTML in Rust.
Minimum Supported Rust Version: 1.24.0
Example
extern crate xhtmlchardet;

use std::io::Cursor;

fn main() {
    // An XML document whose declaration names its encoding.
    let text = b"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><channel><title>Example</title></channel>";
    let mut text_cursor = Cursor::new(text.to_vec());
    // The second argument is left as None here.
    let detected_charsets: Vec<String> = xhtmlchardet::detect(&mut text_cursor, None).unwrap();
    assert_eq!(detected_charsets, vec!["iso-8859-1".to_string()]);
}
Rationale
I wrote a feed crawler that needed to determine the character set of fetched content so that it could be normalised to UTF-8. Initially I used the uchardet crate, but in some situations it misdetected the charset. I collected these edge cases into a test suite and then implemented this crate, which passes all of those tests. It uses a fairly naïve approach derived from Appendix F (Autodetection of Character Encodings) of the XML specification.
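To give a feel for the kind of check the XML specification describes (Appendix F), here is a minimal sketch that guesses an encoding family from the first bytes of a document. It is not this crate's implementation: `sniff_encoding_family` and the labels it returns are hypothetical names chosen for illustration, and only the byte patterns come from the specification's autodetection table.

```rust
/// Guess an encoding family from the leading bytes of a document, using the
/// byte patterns listed in Appendix F of the XML specification.
/// Hypothetical helper for illustration; not part of xhtmlchardet.
fn sniff_encoding_family(bytes: &[u8]) -> Option<&'static str> {
    // Byte order marks. The 4-byte UCS-4 marks are tested before the 2-byte
    // UTF-16 marks because FF FE 00 00 also begins with FF FE.
    if bytes.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        return Some("ucs-4be");
    }
    if bytes.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        return Some("ucs-4le");
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return Some("utf-16be");
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return Some("utf-16le");
    }
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return Some("utf-8");
    }

    // No byte order mark: look at how "<?" or "<?xm" is laid out in memory.
    if bytes.starts_with(&[0x00, 0x00, 0x00, 0x3C]) {
        return Some("ucs-4be");
    }
    if bytes.starts_with(&[0x3C, 0x00, 0x00, 0x00]) {
        return Some("ucs-4le");
    }
    if bytes.starts_with(&[0x00, 0x3C, 0x00, 0x3F]) {
        return Some("utf-16be");
    }
    if bytes.starts_with(&[0x3C, 0x00, 0x3F, 0x00]) {
        return Some("utf-16le");
    }
    if bytes.starts_with(&[0x3C, 0x3F, 0x78, 0x6D]) {
        // "<?xm" in an ASCII-compatible encoding (UTF-8, ISO 8859, and so on).
        // The encoding declaration must be read to tell these apart.
        return Some("ascii-compatible");
    }
    if bytes.starts_with(&[0x4C, 0x6F, 0xA7, 0x94]) {
        // "<?xm" in some flavour of EBCDIC.
        return Some("ebcdic");
    }

    // No recognisable pattern, for example an HTML page with no XML declaration.
    None
}
```

For a document that starts with an XML declaration in an ASCII-compatible encoding, this sketch only narrows things down to a family; a second step would have to read the `encoding` value from the declaration to get a concrete charset name.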