/*!
An experimental byte string library.

Byte strings are just like standard Unicode strings with one very important
difference: byte strings are only *conventionally* UTF-8 while Rust's standard
Unicode strings are *guaranteed* to be valid UTF-8. The primary motivation for
byte strings is for handling arbitrary bytes that are mostly UTF-8.

# Overview

This crate provides two important traits that provide string oriented methods
on `&[u8]` and `Vec<u8>` types:

* [`ByteSlice`](trait.ByteSlice.html) extends the `[u8]` type with additional
  string oriented methods.
* [`ByteVec`](trait.ByteVec.html) extends the `Vec<u8>` type with additional
  string oriented methods.

Additionally, this crate provides two concrete byte string types that deref to
`[u8]` and `Vec<u8>`. These are useful for storing byte string types, and come
with convenient `std::fmt::Debug` implementations:

* [`BStr`](struct.BStr.html) is a byte string slice, analogous to `str`.
* [`BString`](struct.BString.html) is an owned growable byte string buffer,
  analogous to `String`.

Additionally, the free function [`B`](fn.B.html) serves as a convenient short
hand for writing byte string literals.

# Quick examples

Byte strings build on the existing APIs for `Vec<u8>` and `&[u8]`, with
additional string oriented methods. Operations such as iterating over
graphemes, searching for substrings, replacing substrings, trimming and case
conversion are examples of things not provided on the standard library `&[u8]`
APIs but are provided by this crate. For example, this code iterates over all
occurrences of a substring:

```
use bstr::ByteSlice;

let s = b"foo bar foo foo quux foo";

let mut matches = vec![];
for start in s.find_iter("foo") {
    matches.push(start);
}
assert_eq!(matches, [0, 8, 12, 21]);
```

Here's another example showing how to do a search and replace (and also showing
use of the `B` function):

```
use bstr::{B, ByteSlice};

let old = B("foo ☃☃☃ foo foo quux foo");
let new = old.replace("foo", "hello");
assert_eq!(new, B("hello ☃☃☃ hello hello quux hello"));
```

And here's an example that shows case conversion, even in the presence of
invalid UTF-8:

```
use bstr::{ByteSlice, ByteVec};

let mut lower = Vec::from("hello β");
lower[0] = b'\xFF';
// lowercase β is uppercased to Β
assert_eq!(lower.to_uppercase(), b"\xFFELLO \xCE\x92");
```

# Convenient debug representation

When working with byte strings, it is often useful to be able to print them
as if they were byte strings and not sequences of integers. While this crate
cannot affect the `std::fmt::Debug` implementations for `[u8]` and `Vec<u8>`,
this crate does provide the `BStr` and `BString` types which have convenient
`std::fmt::Debug` implementations.
81 
For example, this

```
use bstr::ByteSlice;

let mut bytes = Vec::from("hello β");
bytes[0] = b'\xFF';

println!("{:?}", bytes.as_bstr());
```

will output `"\xFFello β"`.

This example works because the
[`ByteSlice::as_bstr`](trait.ByteSlice.html#method.as_bstr)
method converts any `&[u8]` to a `&BStr`.

# When should I use byte strings?

This library is somewhat of an experiment that reflects my hypothesis that
UTF-8 by convention is a better trade off in some circumstances than guaranteed
UTF-8. It's possible, perhaps even likely, that this is a niche concern for
folks working closely with core text primitives.

The first time this idea hit me was in the implementation of Rust's regex
engine. In particular, very little of the internal implementation cares at all
about searching valid UTF-8 encoded strings. Indeed, internally, the
implementation converts `&str` from the API to `&[u8]` fairly quickly and
just deals with raw bytes. UTF-8 match boundaries are then guaranteed by the
finite state machine itself rather than any specific string type. This makes it
possible to not only run regexes on `&str` values, but also on `&[u8]` values.

Why would you ever want to run a regex on a `&[u8]` though? Well, `&[u8]` is
the fundamental way in which one reads data from all sorts of streams, via the
standard library's [`Read`](https://doc.rust-lang.org/std/io/trait.Read.html)
trait. In particular, there is no platform independent way to determine whether
what you're reading from is some binary file or a human readable text file.
Therefore, if you're writing a program to search files, you probably need to
deal with `&[u8]` directly unless you're okay with first converting it to a
`&str` and dropping any bytes that aren't valid UTF-8. (Or otherwise determine
the encoding---which is often impractical---and perform a transcoding step.)
Often, the simplest and most robust way to approach this is to simply treat the
contents of a file as if it were mostly valid UTF-8 and pass through invalid
UTF-8 untouched. This may not be the most correct approach though!

One case in particular exacerbates these issues, and that's memory mapping
a file. When you memory map a file, that file may be gigabytes big, but all
you get is a `&[u8]`. Converting that to a `&str` all in one go is generally
not a good idea because of the costs associated with doing so, and also
because it generally causes one to do two passes over the data instead of
one, which is quite undesirable. It is of course usually possible to do it
in an incremental way by only parsing chunks at a time, but this is often
complex to do or impractical. For example, many regex engines only accept one
contiguous sequence of bytes at a time with no way to perform incremental
matching.

In summary, the conventional UTF-8 byte strings provided by this library are
an experiment. They are definitely useful in some limited circumstances, but
how useful they are more broadly isn't clear yet.

# `bstr` in public APIs

Since this library is still experimental, you should not use it in the public
API of your crates until it hits `1.0` (unless you're OK with tracking
breaking releases of `bstr`).

In general, it should be possible to avoid putting anything in this crate into
your public APIs. Namely, you should never need to use the `ByteSlice` or
`ByteVec` traits as bounds on public APIs, since their only purpose is to
extend the methods on the concrete types `[u8]` and `Vec<u8>`, respectively.
Similarly, it should not be necessary to put either the `BStr` or `BString`
types into public APIs. If you want to use them internally, then they can
be converted to/from `[u8]`/`Vec<u8>` as needed.

# Differences with standard strings

The primary difference between `[u8]` and `str` is that the former is
conventionally UTF-8 while the latter is guaranteed to be UTF-8. The phrase
"conventionally UTF-8" means that a `[u8]` may contain bytes that do not form
a valid UTF-8 sequence, but operations defined on the type in this crate are
generally most useful on valid UTF-8 sequences. For example, iterating over
Unicode codepoints or grapheme clusters is an operation that is only defined
on valid UTF-8. Therefore, when invalid UTF-8 is encountered, the Unicode
replacement codepoint is substituted. Thus, a byte string that is not UTF-8 at
all is of limited utility when using this crate.

However, not all operations on byte strings are specifically Unicode aware. For
example, substring search has no specific Unicode semantics ascribed to it. It
works just as well for byte strings that are completely valid UTF-8 as for byte
strings that contain no valid UTF-8 at all. Similarly for replacements and
various other operations that do not need any Unicode specific tailoring.

Aside from the difference in how UTF-8 is handled, the APIs between `[u8]` and
`str` (and `Vec<u8>` and `String`) are intentionally very similar, including
maintaining the same behavior for corner cases in things like substring
splitting. There are, however, some differences:

* Substring search is not done with `matches`, but instead, `find_iter`.
  In general, this crate does not define any generic
  [`Pattern`](https://doc.rust-lang.org/std/str/pattern/trait.Pattern.html)
  infrastructure, and instead prefers adding new methods for different
  argument types. For example, `matches` can search by a `char` or a `&str`,
  whereas `find_iter` can only search by a byte string. `find_char` can be
  used for searching by a `char`.
* Since `SliceConcatExt` in the standard library is unstable, it is not
  possible to reuse that to implement `join` and `concat` methods. Instead,
  [`join`](fn.join.html) and [`concat`](fn.concat.html) are provided as free
  functions that perform a similar task.
* This library bundles in a few more Unicode operations, such as grapheme,
  word and sentence iterators. More operations, such as normalization and
  case folding, may be provided in the future.
* Some `String`/`str` APIs will panic if a particular index was not on a valid
  UTF-8 code unit sequence boundary. Conversely, no such checking is performed
  in this crate, as is consistent with treating byte strings as a sequence of
  bytes. This means callers are responsible for maintaining a UTF-8 invariant
  if that's important.
* Some routines provided by this crate, such as `starts_with_str`, have a
  `_str` suffix to differentiate them from similar routines already defined
  on the `[u8]` type. The difference is that `starts_with` requires its
  parameter to be a `&[u8]`, whereas `starts_with_str` permits its parameter
  to be anything that implements `AsRef<[u8]>`, which is more flexible. This
  means you can write `bytes.starts_with_str("☃")` instead of
  `bytes.starts_with("☃".as_bytes())`.

Otherwise, you should find most of the APIs between this crate and the standard
library string APIs to be very similar, if not identical.

# Handling of invalid UTF-8

Since byte strings are only *conventionally* UTF-8, there is no guarantee
that byte strings contain valid UTF-8. Indeed, it is perfectly legal for a
byte string to contain arbitrary bytes. However, since this library defines
a *string* type, it provides many operations specified by Unicode. These
operations are typically only defined over codepoints, and thus have no real
meaning on bytes that are invalid UTF-8 because they do not map to a particular
codepoint.

For this reason, whenever operations defined only on codepoints are used, this
library will automatically convert invalid UTF-8 to the Unicode replacement
codepoint, `U+FFFD`, which looks like this: `�`. For example, an
[iterator over codepoints](struct.Chars.html) will yield a Unicode
replacement codepoint whenever it comes across bytes that are not valid UTF-8:

```
use bstr::ByteSlice;

let bs = b"a\xFF\xFFz";
let chars: Vec<char> = bs.chars().collect();
assert_eq!(vec!['a', '\u{FFFD}', '\u{FFFD}', 'z'], chars);
```

There are a few ways in which invalid bytes can be substituted with a Unicode
replacement codepoint. One way, not used by this crate, is to replace every
individual invalid byte with a single replacement codepoint. In contrast, the
approach this crate uses is called the "substitution of maximal subparts," as
specified by the Unicode Standard (Chapter 3, Section 9). (This approach is
also used by [W3C's Encoding Standard](https://www.w3.org/TR/encoding/).) In
this strategy, a replacement codepoint is inserted whenever a byte is found
that cannot possibly lead to a valid UTF-8 code unit sequence. If there were
previous bytes that represented a *prefix* of a well-formed UTF-8 code unit
sequence, then all of those bytes (up to 3) are substituted with a single
replacement codepoint. For example:

```
use bstr::ByteSlice;

let bs = b"a\xF0\x9F\x87z";
let chars: Vec<char> = bs.chars().collect();
// The bytes \xF0\x9F\x87 could lead to a valid UTF-8 sequence, but 3 of them
// on their own are invalid. Only one replacement codepoint is substituted,
// which demonstrates the "substitution of maximal subparts" strategy.
assert_eq!(vec!['a', '\u{FFFD}', 'z'], chars);
```

If you do need to access the raw bytes for some reason in an iterator like
`Chars`, then you should use the iterator's "indices" variant, which gives
the byte offsets containing the invalid UTF-8 bytes that were substituted with
the replacement codepoint. For example:

```
use bstr::{B, ByteSlice};

let bs = b"a\xE2\x98z";
let chars: Vec<(usize, usize, char)> = bs.char_indices().collect();
// Even though the replacement codepoint is encoded as 3 bytes itself, the
// byte range given here is only two bytes, corresponding to the original
// raw bytes.
assert_eq!(vec![(0, 1, 'a'), (1, 3, '\u{FFFD}'), (3, 4, 'z')], chars);

// Thus, getting the original raw bytes is as simple as slicing the original
// byte string:
let chars: Vec<&[u8]> = bs.char_indices().map(|(s, e, _)| &bs[s..e]).collect();
assert_eq!(vec![B("a"), B(b"\xE2\x98"), B("z")], chars);
```

# File paths and OS strings

One of the premiere features of Rust's standard library is how it handles file
paths. In particular, it makes it very hard to write incorrect code while
simultaneously providing a correct cross platform abstraction for manipulating
file paths. The key challenge that one faces with file paths across platforms
is derived from the following observations:

* On most Unix-like systems, file paths are an arbitrary sequence of bytes.
* On Windows, file paths are an arbitrary sequence of 16-bit integers.

(In both cases, certain sequences aren't allowed. For example a `NUL` byte is
not allowed in either case. But we can ignore this for the purposes of this
section.)

Byte strings, like the ones provided in this crate, line up really well with
file paths on Unix like systems, which are themselves just arbitrary sequences
of bytes. It turns out that if you treat them as "mostly UTF-8," then things
work out pretty well. By contrast, byte strings _don't_ really work
that well on Windows because it's not possible to correctly roundtrip file
paths between 16-bit integers and something that looks like UTF-8 _without_
explicitly defining an encoding to do this for you, which is anathema to byte
strings, which are just bytes.

Rust's standard library elegantly solves this problem by specifying an
internal encoding for file paths that's only used on Windows called
[WTF-8](https://simonsapin.github.io/wtf-8/). Its key property is that it
permits losslessly roundtripping file paths on Windows by extending UTF-8 to
support an encoding of surrogate codepoints, while simultaneously supporting
zero-cost conversion from Rust's Unicode strings to file paths. (Since UTF-8 is
a proper subset of WTF-8.)

The fundamental point at which the above strategy fails is when you want to
treat file paths as things that look like strings in a zero cost way. In most
cases, this is actually the wrong thing to do, but some cases call for it,
for example, glob or regex matching on file paths. This is because WTF-8 is
treated as an internal implementation detail, and there is no way to access
those bytes via a public API. Therefore, such consumers are limited in what
they can do:

1. One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8
   by accessing their underlying 16-bit integer representation. Unfortunately,
   this isn't zero cost (it introduces a second WTF-8 decoding step) and it's
   not clear this is a good thing to do, since WTF-8 should ideally remain an
   internal implementation detail.
2. One could instead declare that they will not handle paths on Windows that
   are not valid UTF-16, and return an error when one is encountered.
3. Like (2), but instead of returning an error, lossily decode the file path
   on Windows that isn't valid UTF-16 by replacing invalid code units with
   the Unicode replacement codepoint.

While this library may provide facilities for (1) in the future, currently,
this library only provides facilities for (2) and (3). In particular, a suite
of conversion functions are provided that permit converting between byte
strings, OS strings and file paths. For owned byte strings, they are:

* [`ByteVec::from_os_string`](trait.ByteVec.html#method.from_os_string)
* [`ByteVec::from_os_str_lossy`](trait.ByteVec.html#method.from_os_str_lossy)
* [`ByteVec::from_path_buf`](trait.ByteVec.html#method.from_path_buf)
* [`ByteVec::from_path_lossy`](trait.ByteVec.html#method.from_path_lossy)
* [`ByteVec::into_os_string`](trait.ByteVec.html#method.into_os_string)
* [`ByteVec::into_os_string_lossy`](trait.ByteVec.html#method.into_os_string_lossy)
* [`ByteVec::into_path_buf`](trait.ByteVec.html#method.into_path_buf)
* [`ByteVec::into_path_buf_lossy`](trait.ByteVec.html#method.into_path_buf_lossy)

For byte string slices, they are:

* [`ByteSlice::from_os_str`](trait.ByteSlice.html#method.from_os_str)
* [`ByteSlice::from_path`](trait.ByteSlice.html#method.from_path)
* [`ByteSlice::to_os_str`](trait.ByteSlice.html#method.to_os_str)
* [`ByteSlice::to_os_str_lossy`](trait.ByteSlice.html#method.to_os_str_lossy)
* [`ByteSlice::to_path`](trait.ByteSlice.html#method.to_path)
* [`ByteSlice::to_path_lossy`](trait.ByteSlice.html#method.to_path_lossy)

On Unix, all of these conversions are rigorously zero cost, which gives one
a way to ergonomically deal with raw file paths exactly as they are using
normal string-related functions. On Windows, these conversion routines perform
a UTF-8 check and either return an error or lossily decode the file path
into valid UTF-8, depending on which function you use. This means that you
cannot roundtrip all file paths on Windows correctly using these conversion
routines. However, this may be an acceptable downside since such file paths
are exceptionally rare. Moreover, roundtripping isn't always necessary, for
example, if all you're doing is filtering based on file paths.

The reason why using byte strings for this is potentially superior to the
standard library's approach is that a lot of Rust code is already lossily
converting file paths to Rust's Unicode strings, which are required to be valid
UTF-8, and thus contain latent bugs on Unix where paths with invalid UTF-8 are
not terribly uncommon. If you instead use byte strings, then you're guaranteed
to write correct code for Unix, at the cost of getting a corner case wrong on
Windows.
*/

#![cfg_attr(not(feature = "std"), no_std)]
#![allow(dead_code)]

#[cfg(feature = "std")]
extern crate core;

#[cfg(feature = "unicode")]
#[macro_use]
extern crate lazy_static;
extern crate memchr;
#[cfg(test)]
#[macro_use]
extern crate quickcheck;
#[cfg(feature = "unicode")]
extern crate regex_automata;
#[cfg(feature = "serde1-nostd")]
extern crate serde;
#[cfg(test)]
extern crate ucd_parse;

pub use bstr::BStr;
#[cfg(feature = "std")]
pub use bstring::BString;
pub use ext_slice::{
    ByteSlice, Bytes, Fields, FieldsWith, Find, FindReverse, Finder,
    FinderReverse, Lines, LinesWithTerminator, Split, SplitN, SplitNReverse,
    SplitReverse, B,
};
#[cfg(feature = "std")]
pub use ext_vec::{concat, join, ByteVec, DrainBytes, FromUtf8Error};
#[cfg(feature = "unicode")]
pub use unicode::{
    GraphemeIndices, Graphemes, SentenceIndices, Sentences, WordIndices,
    Words, WordsWithBreakIndices, WordsWithBreaks,
};
pub use utf8::{
    decode as decode_utf8, decode_last as decode_last_utf8, CharIndices,
    Chars, Utf8Chunk, Utf8Chunks, Utf8Error,
};

mod ascii;
mod bstr;
#[cfg(feature = "std")]
mod bstring;
mod byteset;
mod cow;
mod ext_slice;
#[cfg(feature = "std")]
mod ext_vec;
mod impls;
#[cfg(feature = "std")]
pub mod io;
mod search;
#[cfg(test)]
mod tests;
#[cfg(feature = "unicode")]
mod unicode;
mod utf8;

#[cfg(test)]
mod apitests {
    use bstr::BStr;
    use bstring::BString;
    use ext_slice::{Finder, FinderReverse};

    #[test]
    fn oibits() {
        use std::panic::{RefUnwindSafe, UnwindSafe};

        fn assert_send<T: Send>() {}
        fn assert_sync<T: Sync>() {}
        fn assert_unwind_safe<T: RefUnwindSafe + UnwindSafe>() {}

        assert_send::<&BStr>();
        assert_sync::<&BStr>();
        assert_unwind_safe::<&BStr>();
        assert_send::<BString>();
        assert_sync::<BString>();
        assert_unwind_safe::<BString>();

        assert_send::<Finder>();
        assert_sync::<Finder>();
        assert_unwind_safe::<Finder>();
        assert_send::<FinderReverse>();
        assert_sync::<FinderReverse>();
        assert_unwind_safe::<FinderReverse>();
    }
}