6 releases (stable)

Uses the Rust 2015 edition

2.2.0 Jan 26, 2022
2.1.0 Jan 29, 2020
2.0.0 Oct 9, 2016
1.0.1 Oct 9, 2016
0.1.0 Aug 31, 2015

5,110 downloads per month
Used in 19 crates (3 directly)

MIT license

14KB
194 lines

xhtmlchardet

Basic character set detection for XML and HTML in Rust.

Minimum Supported Rust Version: 1.24.0

Example

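extern crate xhtmlchardet;

use std::io::Cursor;

fn main() {
    // An XML document whose declaration names its encoding.
    let text = b"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><channel><title>Example</title></channel>";
    // detect() reads from a reader, so wrap the bytes in a Cursor.
    let mut text_cursor = Cursor::new(text.to_vec());
    // The second argument is an optional charset hint; None supplies no hint.
    let detected_charsets: Vec<String> = xhtmlchardet::detect(&mut text_cursor, None).unwrap();
    assert_eq!(detected_charsets, vec!["iso-8859-1".to_string()]);
}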

Rationale

I wrote a feed crawler that needed to determine the character set of fetched content so that it could be normalised to UTF-8. Initially I used the uchardet crate, but in some situations it misdetected the charset. I collected these edge cases into a test suite, then implemented this crate, which passes all of those tests. It uses a fairly naïve approach derived from Appendix F of the XML specification ("Autodetection of Character Encodings").
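
To give a feel for what an Appendix F style approach looks like, here is a minimal sketch. It is not this crate's implementation: the function names sniff_signature, declared_encoding and guess_charset are hypothetical, and the sketch only covers the byte signatures listed in Appendix F plus a simplified read of the encoding pseudo-attribute (it assumes no whitespace around the equals sign and an ASCII-compatible document).

```rust
/// Match the leading bytes against the signatures listed in Appendix F of
/// the XML 1.0 specification. Hypothetical helper, not part of xhtmlchardet.
fn sniff_signature(bytes: &[u8]) -> Option<&'static str> {
    // Four-byte signatures are tested first: the UTF-32LE byte order mark
    // (FF FE 00 00) begins with the UTF-16LE byte order mark (FF FE).
    if bytes.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) { return Some("utf-32be"); }
    if bytes.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) { return Some("utf-32le"); }
    // No byte order mark, but "<" or "<?" encoded in a wide encoding.
    if bytes.starts_with(&[0x00, 0x00, 0x00, 0x3C]) { return Some("utf-32be"); }
    if bytes.starts_with(&[0x3C, 0x00, 0x00, 0x00]) { return Some("utf-32le"); }
    if bytes.starts_with(&[0x00, 0x3C, 0x00, 0x3F]) { return Some("utf-16be"); }
    if bytes.starts_with(&[0x3C, 0x00, 0x3F, 0x00]) { return Some("utf-16le"); }
    // "<?xm" in EBCDIC.
    if bytes.starts_with(&[0x4C, 0x6F, 0xA7, 0x94]) { return Some("ebcdic"); }
    // Shorter byte order marks.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) { return Some("utf-8"); }
    if bytes.starts_with(&[0xFE, 0xFF]) { return Some("utf-16be"); }
    if bytes.starts_with(&[0xFF, 0xFE]) { return Some("utf-16le"); }
    None
}

/// Read the encoding pseudo-attribute from an ASCII-compatible XML
/// declaration. Simplified: whitespace around '=' is not handled.
fn declared_encoding(bytes: &[u8]) -> Option<String> {
    // The declaration itself is ASCII, so a lossy conversion of the first
    // few hundred bytes is enough to inspect it.
    let head = String::from_utf8_lossy(&bytes[..bytes.len().min(512)]);
    if !head.starts_with("<?xml") {
        return None;
    }
    let decl = &head[..head.find("?>")?];
    let after = &decl[decl.find("encoding=")? + "encoding=".len()..];
    let quote = after.chars().next()?;
    if quote != '"' && quote != '\'' {
        return None;
    }
    let value = &after[1..];
    let end = value.find(quote)?;
    Some(value[..end].to_ascii_lowercase())
}

/// Try the byte signatures first, then fall back to the declaration.
fn guess_charset(bytes: &[u8]) -> Option<String> {
    sniff_signature(bytes)
        .map(String::from)
        .or_else(|| declared_encoding(bytes))
}
```

Calling guess_charset on the bytes from the example above would take the fallback path, since the document has no byte order mark and starts with an ASCII "<?xml". A real detector would also need to handle HTML meta tags and externally supplied hints such as an HTTP header, which this sketch leaves out.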

No runtime dependencies