
yore

Rust library for decoding/encoding character sets according to OEM code pages

10 releases (4 stable)

1.1.0 Jul 26, 2023
1.0.2 Apr 22, 2023
1.0.1 Nov 2, 2022
0.3.3 Jan 18, 2022
0.1.0 Jul 2, 2021

#512 in Encoding


3,408 downloads per month
Used in 5 crates

MIT license

1MB
32K SLoC

Yore

A Rust library for decoding and encoding character sets based on OEM code pages.


Features

  • Fast performance *
  • Minimal memory usage with Cow and shrink_to_fit
  • Easy-to-use API
  • Broad range of supported code pages
  • Handles code pages with redefined ASCII characters (<0x80), such as '٪' in CP864
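The Cow and shrink_to_fit point can be illustrated with the standard library alone. The sketch below is not yore's implementation; it assumes a hypothetical function, decode_latin1, that uses the ISO-8859-1 style mapping (each byte becomes the Unicode scalar with the same value) only because that mapping needs no table. Pure ASCII input is returned borrowed, and only other input allocates.

```rust
use std::borrow::Cow;

/// Hypothetical decoder for illustration only (not part of yore).
/// Each byte maps to the Unicode scalar value with the same number.
fn decode_latin1(bytes: &[u8]) -> Cow<'_, str> {
    if bytes.is_ascii() {
        // ASCII is already valid UTF-8, so the input can be borrowed as-is.
        return Cow::Borrowed(std::str::from_utf8(bytes).expect("ASCII is valid UTF-8"));
    }
    // Bytes >= 0x80 need two UTF-8 bytes each, so reserve the worst case...
    let mut out = String::with_capacity(bytes.len() * 2);
    out.extend(bytes.iter().map(|&b| char::from(b)));
    // ...and release whatever was not needed.
    out.shrink_to_fit();
    Cow::Owned(out)
}
```

With this sketch, decode_latin1(b"text") is a Cow::Borrowed, while decode_latin1(&[0x74, 0xE9]) allocates and yields the owned string "té".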

Usage

Add yore to your Cargo.toml file.

[dependencies]
yore = "1.1.0"

Examples

Using a specific code page

use yore::code_pages::{CP857, CP850};

// Vec contains ASCII "text"
let bytes = vec![116, 101, 120, 116];
// Vec contains ASCII "text " followed by byte 231, which CP857 leaves undefined
let bytes_undefined = vec![116, 101, 120, 116, 32, 231];

// Decoding CP850 can't fail because every byte is defined
assert_eq!(CP850.decode(&bytes), "text");

// However, CP857 has undefined bytes, so decoding returns a Result
assert_eq!(CP857.decode(&bytes).unwrap(), "text");

// "text " + byte 231 yields an Err (a DecodeError)
assert!(CP857.decode(&bytes_undefined).is_err());

// Lossy decoding won't fail: undefined bytes fall back to U+FFFD
assert_eq!(CP857.decode_lossy(&bytes_undefined), "text �");

// Encoding
assert_eq!(CP850.encode("text").unwrap(), bytes);
// '🦀' is not in CP850, so strict encoding yields an Err (an EncodeError)
assert!(CP850.encode("text 🦀").is_err());
// Lossy encoding substitutes the given fallback byte (231 here)
assert_eq!(CP850.encode_lossy("text 🦀", 231), bytes_undefined);
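The difference between strict and lossy conversion can be sketched without the crate. Everything below is hypothetical and for illustration only: lookup is a toy table (ASCII plus a single extra byte), and UndefinedByte, decode_strict, decode_lossy, and encode_lossy are stand-in names, not yore's types or its implementation.

```rust
/// Toy table (an assumption, not a real code page): ASCII maps to itself,
/// 0x80 maps to 'Ç', and every other byte is undefined.
fn lookup(byte: u8) -> Option<char> {
    match byte {
        0x00..=0x7F => Some(byte as char),
        0x80 => Some('Ç'),
        _ => None,
    }
}

/// Hypothetical error type reporting where decoding stopped.
#[derive(Debug, PartialEq)]
struct UndefinedByte {
    position: usize,
    byte: u8,
}

/// Strict decoding: the first undefined byte aborts with an error.
fn decode_strict(bytes: &[u8]) -> Result<String, UndefinedByte> {
    bytes
        .iter()
        .enumerate()
        .map(|(position, &byte)| lookup(byte).ok_or(UndefinedByte { position, byte }))
        .collect()
}

/// Lossy decoding: undefined bytes become U+FFFD instead of failing.
fn decode_lossy(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| lookup(b).unwrap_or('\u{FFFD}')).collect()
}

/// Lossy encoding: characters missing from the table become `fallback`.
/// The linear reverse search keeps the sketch short; it is not efficient.
fn encode_lossy(text: &str, fallback: u8) -> Vec<u8> {
    text.chars()
        .map(|c| (0u8..=255).find(|&b| lookup(b) == Some(c)).unwrap_or(fallback))
        .collect()
}
```

With this toy table, the byte sequence 116, 101, 120, 116, 32, 231 fails in decode_strict at position 5, while decode_lossy returns "text " followed by U+FFFD.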

Using a trait object

use yore::CodePage;
fn do_something(code_page: &dyn CodePage, bytes: &[u8]) {
    println!("{}", code_page.decode(bytes).unwrap());
}

Supported code pages

Identifier  Name          Description
437         ibm437        OEM United States
737         ibm737        OEM Greek (formerly 437G); Greek (DOS)
775         ibm775        OEM Baltic; Baltic (DOS)
850         ibm850        OEM Multilingual Latin 1; Western European (DOS)
852         ibm852        OEM Latin 2; Central European (DOS)
855         ibm855        OEM Cyrillic (primarily Russian)
857         ibm857        OEM Turkish; Turkish (DOS)
860         ibm860        OEM Portuguese; Portuguese (DOS)
861         ibm861        OEM Icelandic; Icelandic (DOS)
862         dos-862       OEM Hebrew; Hebrew (DOS)
863         ibm863        OEM French Canadian; French Canadian (DOS)
864         ibm864        OEM Arabic; Arabic (864)
865         ibm865        OEM Nordic; Nordic (DOS)
866         cp866         OEM Russian; Cyrillic (DOS)
869         ibm869        OEM Modern Greek; Greek, Modern (DOS)
874         windows-874   Thai (Windows)
910         ibm910        IBM-PC APL2
1250        windows-1250  ANSI Central European; Central European (Windows)
1251        windows-1251  ANSI Cyrillic; Cyrillic (Windows)
1252        windows-1252  ANSI Latin 1; Western European (Windows)
1253        windows-1253  ANSI Greek; Greek (Windows)
1254        windows-1254  ANSI Turkish; Turkish (Windows)
1255        windows-1255  ANSI Hebrew; Hebrew (Windows)
1256        windows-1256  ANSI Arabic; Arabic (Windows)
1257        windows-1257  ANSI Baltic; Baltic (Windows)
1258        windows-1258  ANSI/OEM Vietnamese; Vietnamese (Windows)

* Benchmarks

encoding_rs supports only a few of the encodings that oem_cp and yore support. Additionally, encoding_rs focuses on streaming use cases.

Refer to the bench crate for more details.

Dependencies

~245–700KB
~17K SLoC