#xml #html #character #detection #set

xhtmlchardet

Character set detection for XML and HTML

6 releases (stable)

Uses old Rust 2015

2.2.0 Jan 26, 2022
2.1.0 Jan 29, 2020
2.0.0 Oct 9, 2016
1.0.1 Oct 9, 2016
0.1.0 Aug 31, 2015

#706 in Text processing


61 downloads per month
Used in 2 crates

MIT license

14KB
194 lines

xhtmlchardet

Basic character set detection for XML and HTML in Rust.


Minimum Supported Rust Version: 1.24.0

Example

extern crate xhtmlchardet;

use std::io::Cursor;

fn main() {
    // A feed whose XML declaration names its encoding.
    let text = b"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><channel><title>Example</title></channel>";
    let mut text_cursor = Cursor::new(text.to_vec());
    // The result is a list of candidate charset names; here it is the declared encoding, lowercased.
    let detected_charsets: Vec<String> = xhtmlchardet::detect(&mut text_cursor, None).unwrap();
    assert_eq!(detected_charsets, vec!["iso-8859-1".to_string()]);
}

Rationale

I wrote a feed crawler that needed to determine the character set of fetched content so that it could be normalised to UTF-8. Initially I used the uchardet crate, but in some situations it misdetected the charset. I collected these edge cases into a test suite, then implemented this crate, which passes all of those tests. It uses a fairly naïve approach derived from Appendix F ("Autodetection of Character Encodings") of the XML specification.
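To give a feel for the approach the XML specification's autodetection appendix describes, here is a minimal sketch: look at the first four bytes for a byte order mark or an encoded "<?xm", and, when the document is ASCII-compatible, read the encoding name from the XML declaration. This is not the crate's actual implementation. The names Guess, sniff and declared_encoding are hypothetical, the unusual UCS-4 byte orders listed in the specification are left out, and lowercasing the result is an assumption made only to mirror the example above.

```rust
/// Encoding family guessed from the first bytes of a document.
/// Hypothetical type for illustration; not part of xhtmlchardet.
#[derive(Debug, PartialEq)]
pub enum Guess {
    Ucs4Be,
    Ucs4Le,
    Utf16Be,
    Utf16Le,
    Utf8,
    /// Some ASCII-compatible encoding; read the declaration to narrow it down.
    AsciiCompatible,
    Ebcdic,
    Unknown,
}

/// Inspect at most the first four bytes, as in the XML autodetection tables.
/// The unusual UCS-4 byte orders (2143, 3412) are deliberately omitted.
pub fn sniff(bytes: &[u8]) -> Guess {
    let head = &bytes[..bytes.len().min(4)];
    match head {
        // With a byte order mark. Four-byte marks come first because
        // FF FE 00 00 (UCS-4 LE) also starts with the UTF-16 LE mark FF FE.
        [0x00, 0x00, 0xFE, 0xFF] => Guess::Ucs4Be,
        [0xFF, 0xFE, 0x00, 0x00] => Guess::Ucs4Le,
        [0xFE, 0xFF, ..] => Guess::Utf16Be,
        [0xFF, 0xFE, ..] => Guess::Utf16Le,
        [0xEF, 0xBB, 0xBF, ..] => Guess::Utf8,
        // Without a byte order mark: "<" or "<?xm" in various encodings.
        [0x00, 0x00, 0x00, 0x3C] => Guess::Ucs4Be,
        [0x3C, 0x00, 0x00, 0x00] => Guess::Ucs4Le,
        [0x00, 0x3C, 0x00, 0x3F] => Guess::Utf16Be,
        [0x3C, 0x00, 0x3F, 0x00] => Guess::Utf16Le,
        [0x3C, 0x3F, 0x78, 0x6D] => Guess::AsciiCompatible,
        [0x4C, 0x6F, 0xA7, 0x94] => Guess::Ebcdic,
        _ => Guess::Unknown,
    }
}

/// Extract the `encoding` pseudo-attribute from an ASCII-compatible XML
/// declaration, lowercased. Returns None if there is no usable declaration.
pub fn declared_encoding(bytes: &[u8]) -> Option<String> {
    // Only the declaration matters; it ends at the first "?>".
    let end = bytes.windows(2).position(|w| w == b"?>")?;
    let decl = std::str::from_utf8(&bytes[..end]).ok()?;
    let after = &decl[decl.find("encoding")? + "encoding".len()..];
    let after = after.trim_start().strip_prefix('=')?.trim_start();
    // The value may be quoted with either single or double quotes.
    let quote = after.chars().next().filter(|c| *c == '"' || *c == '\'')?;
    let value = &after[1..];
    let close = value.find(quote)?;
    Some(value[..close].to_ascii_lowercase())
}
```

A caller would run sniff first and call declared_encoding only when the guess is AsciiCompatible or Utf8. For the UTF-16 and UCS-4 guesses the declaration would have to be transcoded before it could be parsed this way.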

No runtime deps