#endianness #utf-16 #string #byte-string #encoded-string #wstring

utf16string

String types to work directly with UTF-16 encoded strings

2 unstable releases

0.2.0 Oct 10, 2020
0.1.0 Oct 9, 2020

#1162 in Encoding


76,335 downloads per month
Used in 13 crates (4 directly)

MIT/Apache

74KB
1.5K SLoC

UTF-16 string types

This crate provides two string types to work with UTF-16 encoded bytes. They are directly analogous to how String and &str work with UTF-8 encoded bytes.

UTF-16 can be encoded in little- or big-endian byte order. This crate identifies which encoding the types contain using a generic byteorder type, so the main types exposed are:

  • &WStr<ByteOrder>
  • WString<ByteOrder>

These types aim to behave very similarly to the standard library &str and String types. While many APIs are already covered, feel free to contribute more methods.

Documentation is at docs.rs. Currently much of the documentation is rather terse; in those cases it is best to refer to the matching methods on the string types in the standard library. Feel free to contribute more exhaustive in-line docs.


lib.rs:

A UTF-16 little-endian string type.

This crate provides two string types to handle UTF-16 encoded bytes directly as strings: WString and WStr. They are to UTF-16 exactly what String and str are to UTF-8. Some of the concepts and functions here are rather tersely documented. In those cases you can look up their equivalents on String or str; the behaviour should be exactly the same, only the underlying byte encoding is different.

Thus WString is a type which owns the bytes containing the string. Just like String and the underlying Vec it is built on, it distinguishes length (WString::len) and capacity (WString::capacity). Here length is the number of bytes used, while capacity is the number of bytes the string can hold without reallocating.
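The difference between length and capacity in bytes can be illustrated with the standard library alone. The following is a minimal sketch, not this crate's implementation: the helper utf16le_bytes_with_capacity is a hypothetical name, and a plain Vec<u8> stands in for the buffer owned by a WString.

```rust
/// Encode `s` as UTF-16LE bytes into a buffer that has at least
/// `capacity` bytes reserved up front.
/// Hypothetical helper for illustration; not part of utf16string.
fn utf16le_bytes_with_capacity(s: &str, capacity: usize) -> Vec<u8> {
    let mut buf = Vec::with_capacity(capacity);
    for unit in s.encode_utf16() {
        // Each UTF-16 code unit occupies exactly two bytes.
        buf.extend_from_slice(&unit.to_le_bytes());
    }
    buf
}
```

Under this sketch, encoding "hello" with a reserved capacity of 64 yields a buffer whose len() is 10 (five code units, two bytes each), while its capacity() is at least 64.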

The WStr type does not own any bytes; it can only point to a slice of bytes containing valid UTF-16. As such you will only ever use it as a reference like &WStr, just as you only use str as &str.

The WString type implements Deref<Target = WStr<ByteOrder>>.

UTF-16 ByteOrder

UTF-16 encodes text as unsigned 16-bit integers (u16), called code units. However, different CPU architectures store these u16 integers using different byte orders: little-endian and big-endian. Thus, when handling UTF-16 strings you need to be aware of the byte order of the encoding. The two variants are commonly known as UTF-16LE and UTF-16BE respectively.
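To make the two byte orders concrete, here is a small sketch using only the standard library. The function names encode_utf16le and encode_utf16be are hypothetical helpers chosen for this illustration and are not part of this crate.

```rust
/// Encode `s` as UTF-16 with the low byte of each code unit first.
/// Hypothetical helper for illustration.
fn encode_utf16le(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Encode `s` as UTF-16 with the high byte of each code unit first.
/// Hypothetical helper for illustration.
fn encode_utf16be(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|unit| unit.to_be_bytes()).collect()
}
```

For the input "hi" the little-endian bytes are 68 00 69 00 and the big-endian bytes are 00 68 00 69: the same code units, with the two bytes of each unit swapped.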

For this crate this means the types need to be aware of the byte order, which is done using the byteorder::ByteOrder trait as a generic parameter to the types: WString<ByteOrder> and WStr<ByteOrder>, commonly written as WString<E> and WStr<E>, where E stands for "endianness".

This crate exports BigEndian, BE, LittleEndian and LE in case you need to denote the type:
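The general idea of carrying the byte order as a type parameter can be sketched with the standard library. Everything below is an assumption made for illustration: the Endian trait, the Little and Big markers and the TaggedBytes type are hypothetical stand-ins, not the byteorder crate's API and not how WString is actually implemented.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for a byte-order trait.
trait Endian {
    fn read_u16(bytes: [u8; 2]) -> u16;
}

/// Marker types; they carry no data and exist only at the type level.
struct Little;
struct Big;

impl Endian for Little {
    fn read_u16(bytes: [u8; 2]) -> u16 {
        u16::from_le_bytes(bytes)
    }
}

impl Endian for Big {
    fn read_u16(bytes: [u8; 2]) -> u16 {
        u16::from_be_bytes(bytes)
    }
}

/// Bytes tagged at the type level with their byte order.
struct TaggedBytes<E: Endian> {
    bytes: Vec<u8>,
    _order: PhantomData<E>,
}

impl<E: Endian> TaggedBytes<E> {
    fn new(bytes: Vec<u8>) -> Self {
        TaggedBytes { bytes, _order: PhantomData }
    }

    /// Code units read according to `E`; a trailing odd byte is ignored.
    fn code_units(&self) -> Vec<u16> {
        self.bytes
            .chunks_exact(2)
            .map(|pair| E::read_u16([pair[0], pair[1]]))
            .collect()
    }
}
```

Because the byte order is part of the type, mixing up TaggedBytes<Little> and TaggedBytes<Big> becomes a compile-time error rather than a silent mis-decoding at run time.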

use utf16string::{BigEndian, BE, WString};

let s0: WString<BigEndian> = WString::from("hello");
assert_eq!(s0.len(), 10);

let s1: WString<BE> = WString::from("hello");
assert_eq!(s0, s1);

As these types can be cumbersome to write, they can often be inferred instead, especially with the help of the shorthand constructors like WString::from_utf16le, WString::from_utf16be, WStr::from_utf16le, WStr::from_utf16be and related. For example:

use utf16string::{LE, WStr};

let b = b"h\x00e\x00l\x00l\x00o\x00";

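// The constructors validate the bytes and return a Result;
// unwrap() is fine here because the input is known to be valid UTF-16LE.
let s0: &WStr<LE> = WStr::from_utf16(b).unwrap();
let s1 = WStr::from_utf16le(b).unwrap();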

assert_eq!(s0, s1);
assert_eq!(s0.to_utf8(), "hello");
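Constructors of this kind return a Result because arbitrary bytes are not necessarily valid UTF-16. As a rough sketch of what such validation involves, the following uses only the standard library. The function decode_utf16le is a hypothetical helper, and the checks this crate actually performs may differ in detail.

```rust
/// Decode UTF-16LE bytes into a `String`.
/// Returns `None` if the byte length is odd or if the code units
/// contain an unpaired surrogate.
/// Hypothetical helper for illustration; not part of utf16string.
fn decode_utf16le(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        // A complete code unit always needs two bytes.
        return None;
    }
    let units = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    // decode_utf16 yields an Err item for every unpaired surrogate.
    std::char::decode_utf16(units)
        .collect::<Result<String, _>>()
        .ok()
}
```

With this sketch the ten bytes from the example above decode to "hello", whereas a truncated input or a lone surrogate such as the bytes 00 D8 yields None.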

Dependencies

~115KB