#character-encoding #io-read #read-write #codec #io-write #io

encoding_rs_rw

Space-efficient std::io::{Read, Write} wrappers for encoding_rs

6 releases

0.4.2 Dec 3, 2023
0.4.1 Dec 2, 2023
0.3.2 Nov 28, 2023
0.2.1 Nov 11, 2023
0.1.2 Nov 8, 2023

#328 in Internationalization (i18n)

Apache-2.0

110KB
2K SLoC

This crate provides std::io::Read and std::io::Write implementations for encoding_rs::Decoder and encoding_rs::Encoder, respectively, to support Rust's standard streaming API.

use std::{fs, io, io::prelude::*};

use encoding_rs::{EUC_JP, SHIFT_JIS};
use encoding_rs_rw::{DecodingReader, EncodingWriter};

let file_r = io::BufReader::new(fs::File::open("foo.txt")?);
let mut reader = DecodingReader::new(file_r, EUC_JP.new_decoder());
let mut utf8 = String::new();
reader.read_to_string(&mut utf8)?;

let file_w = fs::File::create("bar.txt")?;
let mut writer = EncodingWriter::new(file_w, SHIFT_JIS.new_encoder());
write!(writer, "{}", utf8)?;
writer.flush()?;

This crate is an alternative to encoding_rs_io but provides a simpler API and more flexible error semantics.

This crate also provides a lossy variant of the decoding reader, which replaces malformed byte sequences with the replacement character (U+FFFD), and a with_unmappable_handler variant of the encoding writer, which passes each unmappable character to the specified handler.

use std::{fs, io, io::prelude::*};

use encoding_rs::{EUC_KR, ISO_8859_7};
use encoding_rs_rw::{DecodingReader, EncodingWriter};

let file_r = io::BufReader::new(fs::File::open("baz.txt")?);
let mut reader = DecodingReader::new(file_r, EUC_KR.new_decoder());
let mut utf8 = String::new();
reader.lossy().read_to_string(&mut utf8)?;

let file_w = fs::File::create("qux.txt")?;
let mut writer = EncodingWriter::new(file_w, ISO_8859_7.new_encoder());
{
    let mut writer =
        writer.with_unmappable_handler(|e, w| write!(w, "&#{};", u32::from(e.value())));
    write!(writer, "{}", utf8)?;
    writer.flush()?;
}
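
To make the handler idea concrete without depending on encoding_rs, here is a minimal standard-library sketch. It is not this crate's implementation: write_ascii_with_handler and to_ascii_with_ncr are hypothetical names, and plain ASCII stands in for a real target encoding. It only illustrates the pattern of handing each unmappable character to a closure that writes a substitute, here a decimal numeric character reference.

```rust
use std::io::{self, Write};

/// Toy stand-in for an encoder (hypothetical, not part of encoding_rs_rw):
/// writes `text` as ASCII and calls `handler` for every character that
/// ASCII cannot represent.
fn write_ascii_with_handler<W, F>(mut out: W, text: &str, mut handler: F) -> io::Result<()>
where
    W: Write,
    F: FnMut(char, &mut W) -> io::Result<()>,
{
    for ch in text.chars() {
        if ch.is_ascii() {
            // Mappable: emit the single ASCII byte.
            out.write_all(&[ch as u8])?;
        } else {
            // Unmappable: let the caller decide what to write instead.
            handler(ch, &mut out)?;
        }
    }
    Ok(())
}

/// Encodes `text`, replacing each unmappable character with a decimal
/// numeric character reference such as `&#233;`.
fn to_ascii_with_ncr(text: &str) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    write_ascii_with_handler(&mut out, text, |ch, w| write!(w, "&#{};", u32::from(ch)))?;
    Ok(out)
}
```

Calling to_ascii_with_ncr("aé") yields the bytes of "a&#233;", since é is U+00E9 (decimal 233).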

Design

Implementing Rust's Read and Write traits for character encoding conversion essentially requires byte buffers before and after the converter. The read and write methods must support byte-by-byte operation, whereas character encoders and decoders consume and produce multiple bytes at a time to handle multi-byte characters. The types in this crate therefore keep small internal buffers so that they can operate byte-by-byte, but they bypass those buffers and work directly on the caller-supplied buffers whenever possible, to minimize double-buffering and memory consumption.
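
The buffering strategy described above can be sketched with the standard library alone. The following is an illustration under stated assumptions, not this crate's code: Latin1Reader and read_in_chunks are hypothetical names, and ISO-8859-1 to UTF-8 stands in for a real encoding_rs decoder because every input byte maps to one or two output bytes. The reader decodes straight from the source's own buffer into the caller's buffer, and falls back to a two-byte internal buffer only when the caller's buffer cannot hold a whole character.

```rust
use std::io::{self, BufRead, Read};

/// Toy decoding reader (hypothetical): ISO-8859-1 bytes in, UTF-8 bytes out.
struct Latin1Reader<R> {
    inner: R,
    // Small fallback buffer for a character that did not fit the caller's buffer.
    pending: [u8; 2],
    pending_len: usize,
    pending_pos: usize,
}

impl<R: BufRead> Latin1Reader<R> {
    fn new(inner: R) -> Self {
        Self { inner, pending: [0; 2], pending_len: 0, pending_pos: 0 }
    }
}

impl<R: BufRead> Read for Latin1Reader<R> {
    fn read(&mut self, dst: &mut [u8]) -> io::Result<usize> {
        if dst.is_empty() {
            return Ok(0);
        }
        // Serve a leftover byte from the internal buffer first.
        if self.pending_pos < self.pending_len {
            dst[0] = self.pending[self.pending_pos];
            self.pending_pos += 1;
            return Ok(1);
        }
        // Borrow the source's buffer instead of copying it into our own.
        let src = self.inner.fill_buf()?;
        let (mut read, mut written) = (0, 0);
        while read < src.len() && written < dst.len() {
            let ch = char::from(src[read]); // Latin-1 byte == code point
            if dst.len() - written >= ch.len_utf8() {
                // Fast path: decode directly into the caller's buffer.
                written += ch.encode_utf8(&mut dst[written..]).len();
                read += 1;
            } else if written == 0 {
                // The caller's buffer is too small for even one character:
                // decode into the internal buffer and hand out one byte.
                self.pending_len = ch.encode_utf8(&mut self.pending).len();
                dst[0] = self.pending[0];
                self.pending_pos = 1;
                written = 1;
                read += 1;
                break;
            } else {
                break; // Return what fits; the rest waits for the next call.
            }
        }
        self.inner.consume(read);
        Ok(written)
    }
}

/// Reads everything from `reader` using a caller buffer of `chunk` bytes
/// (`chunk` must be non-zero).
fn read_in_chunks<R: Read>(mut reader: R, chunk: usize) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf = vec![0u8; chunk];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            return Ok(out);
        }
        out.extend_from_slice(&buf[..n]);
    }
}
```

With a 64-byte caller buffer the internal buffer is never touched; with a 1-byte caller buffer the two-byte encoding of a character such as é is split across two read calls. Both produce the same UTF-8 output.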

License

Licensed under the Apache License, Version 2.0.

Dependencies

~3.5MB
~119K SLoC