1.0.0	~~Apr 5, 2021~~
0.1.5	~~Mar 14, 2021~~

Apache-2.0

UTF-8 Buffered Reader

This crate provides functions to read UTF-8 text from any type implementing io::BufRead, through a trait named BufRead, without waiting for newline delimiters. These functions take advantage of buffering and return either &str or char values. Each has an associated iterator, and some also have a map-style equivalent that avoids allocating and cloning.

Usage

Add this crate as a dependency in your Cargo.toml:

[dependencies]
utf8-bufread = "1.0.0"

The simplest way to read a file using this crate may be something along the lines of the following:

use std::io::{Cursor, ErrorKind};
use utf8_bufread::BufRead;

// Reader may be any type implementing io::BufRead
// We'll just use a cursor wrapping a slice for this example
let mut reader = Cursor::new("Löwe 老虎 Léopard");
loop { // Loop until EOF
    match reader.read_str() {
        Ok(s) => {
            if s.is_empty() {
                break; // EOF
            }
            // Do something with `s` ...
            print!("{}", s);
        }
        Err(e) => {
            // We should try again if we get interrupted
            if e.kind() != ErrorKind::Interrupted {
                break;
            }
        }
    }
}

Reading arbitrary-length string slices

The read_str function returns a &str of arbitrary length (up to the reader's buffer capacity) read from the inner reader, without cloning data, unless a valid codepoint ends up cut at the end of the reader's buffer. Its associated iterator can be obtained by calling str_iter; since that iterator clones the data at each iteration, str_map is also provided.
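
To illustrate the idea, here is a minimal std-only sketch of how a run of valid UTF-8 might be borrowed straight from a reader's buffer, with a copy made only when a codepoint is cut at the buffer's end. The helper name with_next_str and its closure-based signature are assumptions made for this example; this is not the crate's implementation of read_str.

```rust
use std::io::{self, BufRead, Read};
use std::str;

/// Hypothetical helper, not part of utf8-bufread: hands the next run of valid
/// UTF-8 to `f` and returns its result, or `Ok(None)` at EOF.
fn with_next_str<R: BufRead, T>(
    reader: &mut R,
    f: impl FnOnce(&str) -> T,
) -> io::Result<Option<T>> {
    let buf = reader.fill_buf()?;
    if buf.is_empty() {
        return Ok(None); // EOF
    }
    // Length of the longest valid UTF-8 prefix of the buffered bytes.
    let valid_len = match str::from_utf8(buf) {
        Ok(s) => s.len(),
        Err(e) => e.valid_up_to(),
    };
    if valid_len > 0 {
        // Fast path: borrow directly from the reader's buffer, no copy.
        let s = str::from_utf8(&buf[..valid_len]).expect("prefix was validated");
        let out = f(s);
        reader.consume(valid_len);
        return Ok(Some(out));
    }
    // Slow path: the buffer starts with a codepoint that is cut (or invalid).
    // Copy bytes one at a time into a small array until they form a codepoint.
    let mut bytes = [0u8; 4];
    for len in 1..=bytes.len() {
        reader.read_exact(&mut bytes[len - 1..len])?;
        match str::from_utf8(&bytes[..len]) {
            Ok(s) => return Ok(Some(f(s))),
            // `error_len()` is `Some` when the bytes can never become valid.
            Err(e) if e.error_len().is_some() => break,
            Err(_) => {} // still incomplete: read one more byte
        }
    }
    // Simplification: the bytes consumed above are lost in this sketch.
    Err(io::Error::new(io::ErrorKind::InvalidData, "invalid UTF-8 sequence"))
}
```

Wrapping the input in a reader with a very small buffer, such as std::io::BufReader::with_capacity(4, text.as_bytes()), should exercise both paths. The closure-based shape sidesteps the lifetime questions that come with returning a borrowed &str, at the cost of being less convenient to call.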

Reading codepoints

The read_char function returns a char read from the inner reader. Its associated iterator can be obtained by calling char_iter.
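
As a rough illustration of reading a single codepoint from a buffered reader, the sketch below uses the leading byte to work out how many bytes the codepoint occupies, then lets the standard library validate them. The function name read_one_char is hypothetical and the code uses only std; the crate's read_char may well work differently.

```rust
use std::io::{self, BufRead, Read};
use std::str;

/// Hypothetical helper, not part of utf8-bufread: reads exactly one codepoint.
/// Returns `Ok(None)` if the reader is already at EOF.
fn read_one_char<R: BufRead>(reader: &mut R) -> io::Result<Option<char>> {
    let mut bytes = [0u8; 4];
    // A clean EOF is only possible before the first byte.
    if reader.read(&mut bytes[..1])? == 0 {
        return Ok(None);
    }
    // The leading byte tells how many bytes the codepoint occupies.
    let width = match bytes[0] {
        0x00..=0x7F => 1,
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "invalid UTF-8 leading byte",
            ))
        }
    };
    // EOF in the middle of a codepoint surfaces as `UnexpectedEof`.
    reader.read_exact(&mut bytes[1..width])?;
    // Full validation (continuation bytes, overlong forms, surrogates).
    let s = str::from_utf8(&bytes[..width])
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(s.chars().next())
}
```

Unlike the iterators described in this document, this sketch does not retry on Interrupted; a caller would have to handle that error kind itself.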

Iterator types

This crate provides several structs for several ways of iterating over the inner reader's data:

StrIter and CodepointIter clone the data on each iteration, but use an Rc to check if the returned String buffer is still used. If not, it is re-used to avoid re-allocating.

let mut reader = Cursor::new("Löwe 老虎 Léopard");
for s in reader.str_iter().filter_map(|r| r.ok()) {
    // Do something with s ...
    print!("{}", s);
}
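
The buffer re-use described for StrIter and CodepointIter can be pictured with the following std-only sketch. It assumes the check is done with something like Rc::get_mut, which succeeds only when no other handle to the buffer is alive; the type ReusableBuf and its fill method are hypothetical names, not part of this crate.

```rust
use std::rc::Rc;

/// Hypothetical illustration of the "Rc trick": a String buffer that is
/// re-used whenever the previously returned handle has been dropped.
struct ReusableBuf {
    buf: Rc<String>,
}

impl ReusableBuf {
    fn new() -> Self {
        ReusableBuf { buf: Rc::new(String::new()) }
    }

    /// Copies `chunk` into the buffer and returns a shared handle to it.
    fn fill(&mut self, chunk: &str) -> Rc<String> {
        if let Some(s) = Rc::get_mut(&mut self.buf) {
            // Nobody else holds the buffer: overwrite it in place.
            s.clear();
            s.push_str(chunk);
        } else {
            // A previously returned handle is still alive: use a fresh buffer.
            self.buf = Rc::new(String::from(chunk));
        }
        Rc::clone(&self.buf)
    }
}
```

If the caller drops each handle before asking for the next one, as a plain for loop does, the same String is overwritten every time; if the caller keeps handles around, for example by collecting them, each one keeps its own data.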

StrMap and CodepointMap give access to the read data without allocating or copying, but the borrowed data itself cannot be passed on to further iterator adapters; only the value returned by the closure can.

let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count: usize = reader
    .str_map(|s| s.len())
    .filter_map(Result::ok)
    .sum();
println!("There are {} valid UTF-8 bytes in {}", count, s);

CharIter is similar to StrIter and the others, except that it relies on char implementing Copy and thus needs neither a buffer nor the "Rc trick".

let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count = reader
    .char_iter()
    .filter_map(Result::ok)
    .filter(|c| c.is_lowercase())
    .count();
assert_eq!(count, 9);

All these iterators read data until EOF is reached or an invalid codepoint is found. Valid codepoints already read from the inner reader are returned before an error is reported. After encountering an error or EOF, they always return None. They always ignore Interrupted errors.
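
The termination rules above can be sketched as a small std-only iterator adapter: it retries on Interrupted, yields the first real error once, and returns None from then on. The type FusedReads and its fields are hypothetical and only mirror the behavior described here; they are not this crate's types.

```rust
use std::io::{self, ErrorKind};

/// Hypothetical adapter, not part of utf8-bufread. `read` is any function
/// returning `Ok(Some(item))`, `Ok(None)` at EOF, or an I/O error.
struct FusedReads<F> {
    read: F,
    done: bool,
}

impl<T, F> Iterator for FusedReads<F>
where
    F: FnMut() -> io::Result<Option<T>>,
{
    type Item = io::Result<T>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None; // after an error or EOF, always `None`
        }
        loop {
            match (self.read)() {
                Ok(Some(item)) => return Some(Ok(item)),
                Ok(None) => {
                    self.done = true; // EOF
                    return None;
                }
                // Interrupted reads are simply retried.
                Err(e) if e.kind() == ErrorKind::Interrupted => continue,
                Err(e) => {
                    self.done = true; // report the error once, then stop
                    return Some(Err(e));
                }
            }
        }
    }
}
```

Because the error is yielded as an item, adapters such as filter_map(Result::ok) silently drop it; matching on each item is the way to observe it.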

Work in progress

This crate is still a work in progress. Part of its API can be considered stable:

read_str, read_codepoint and read_char's behavior and signature.
str_iter, str_map, codepoints_iter, codepoints_map and char_iter's behavior and signature.
StrIter, StrMap, CodepointIter, CodepointMap and CharIter's API.

However, some features are still considered unstable:

Error's behavior, particularly regarding its kind and how it avoids data loss (see leftovers).

And some features still have to be added:

A lossy and unchecked version of read_* (see from_utf8_lossy & from_utf8_unchecked).
(Optional) Support for grapheme clusters using the unicode-segmentation crate, in the same fashion as read_codepoint.
I'm open to suggestions, if you have ideas 😉

Since I'm far from the most experienced developer, you are very welcome to submit issues and pull requests in this repository.

License

Utf8-BufRead is distributed under the terms of the Apache License 2.0; see the LICENSE file in the root directory of this repository.
