bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8. This
differs from the standard library's `String` and `str` types in that byte
strings are not required to be valid UTF-8, but may be fully or partially
valid UTF-8.

[![Build status](https://github.com/BurntSushi/bstr/workflows/ci/badge.svg)](https://github.com/BurntSushi/bstr/actions)
[![](https://meritbadge.herokuapp.com/bstr)](https://crates.io/crates/bstr)


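To make the distinction between byte strings and `str` concrete, here is a
minimal sketch that uses only the standard library (no `bstr` APIs). The helper
`valid_prefix_len` is a hypothetical name introduced for illustration. It shows
that a byte slice can hold data that `str` rejects, while the valid UTF-8
prefix remains identifiable:

```rust
use std::borrow::Cow;

/// Returns the number of leading bytes of `bytes` that form valid UTF-8.
/// (Hypothetical helper, not part of `bstr` or the standard library.)
fn valid_prefix_len(bytes: &[u8]) -> usize {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.len(),
        Err(err) => err.valid_up_to(),
    }
}

fn main() {
    // Mostly UTF-8, with a single invalid byte (0xFF) in the middle.
    let bytes: &[u8] = b"foo\xFFbar";

    // `str` requires valid UTF-8, so a strict conversion fails.
    assert!(std::str::from_utf8(bytes).is_err());

    // The bytes themselves are still usable as a slice.
    println!("valid prefix: {} of {} bytes", valid_prefix_len(bytes), bytes.len());

    // A lossy conversion substitutes U+FFFD for the invalid byte.
    let lossy: Cow<str> = String::from_utf8_lossy(bytes);
    println!("{}", lossy);
}
```
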
### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
https://docs.rs/bstr/0.2.*/bstr/#when-should-i-use-byte-strings.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
bstr = "0.2"
```


### Examples

The following examples exhibit both the API features of byte strings and
the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in
stdin, and print out lines containing a particular substring:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::error::Error;
use std::io;

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard
ASCII text, this is quite a bit faster than what you can (easily) do with
standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through
unchanged.)

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde
and Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included.
* `serde1` - **Disabled** by default. Enables implementations of serde traits
  for the `BStr` and `BString` types.
* `serde1-nostd` - **Disabled** by default. Enables implementations of serde
  traits for the `BStr` type only, intended for use without the standard
  library. Generally, you either want `serde1` or `serde1-nostd`, not both.


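As a sketch of how these features might be selected in `Cargo.toml`, assuming
the feature names listed above and Cargo's standard `default-features` and
`features` keys (a dependency may appear only once, so the second form is shown
commented out):

```toml
[dependencies]
# Opt out of the default `std` and `unicode` features:
bstr = { version = "0.2", default-features = false }

# Alternatively, keep the defaults and additionally enable serde support:
# bstr = { version = "0.2", features = ["serde1"] }
```
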
### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.28.0`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since this is meant to be a core crate, getting a `1.0` release is a priority.
My hope is to move to `1.0` within the next year and commit to its API so that
`bstr` can be used as a public dependency.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be mature.
The main differences from the standard library are in how the various substring
search routines work. The standard library provides generic infrastructure for
supporting different types of searches with a single method, whereas this
library prefers to define new methods for each type of search and drop the
generic infrastructure.

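The contrast between the two search designs can be sketched with the standard
library alone. The trait `ByteSearch` and its methods `find_byte_at` and
`find_bytes_at` are hypothetical names introduced for illustration; they are
not `bstr`'s actual API, and the naive search shown here only stands in for a
real substring search:

```rust
/// Hypothetical extension trait with one method per kind of search.
/// (Illustrative only; this is not `bstr`'s API.)
trait ByteSearch {
    fn find_byte_at(&self, needle: u8) -> Option<usize>;
    fn find_bytes_at(&self, needle: &[u8]) -> Option<usize>;
}

impl ByteSearch for [u8] {
    fn find_byte_at(&self, needle: u8) -> Option<usize> {
        self.iter().position(|&b| b == needle)
    }

    fn find_bytes_at(&self, needle: &[u8]) -> Option<usize> {
        // `windows(0)` would panic, so handle the empty needle explicitly.
        if needle.is_empty() {
            return Some(0);
        }
        self.windows(needle.len()).position(|w| w == needle)
    }
}

fn main() {
    // Standard library: a single generic `find` accepts several pattern types.
    let s = "foo bar";
    assert_eq!(s.find('b'), Some(4));
    assert_eq!(s.find("bar"), Some(4));
    assert_eq!(s.find(|c: char| c.is_whitespace()), Some(3));

    // Dedicated methods: each kind of search has its own name, and the
    // haystack does not need to be valid UTF-8.
    let b: &[u8] = b"foo \xFF bar";
    assert_eq!(b.find_byte_at(0xFF), Some(4));
    assert_eq!(b.find_bytes_at(b"bar"), Some(6));
}
```
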
Some _probable_ future considerations for APIs include, but are not limited to:

* A convenience layer on top of the `aho-corasick` crate.
* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](https://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).
* Add facilities for dealing with OS strings and file paths, probably via
  simple conversion routines.

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate is an experiment in bringing lots of
related APIs together into a single crate while simultaneously attempting to
keep the total number of dependencies low. Indeed, every dependency of `bstr`,
except for `memchr`, is optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html)
  can be used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway) crate can be used for
  fast substring searching on `&[u8]`.

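As an illustration of the first point, incremental lossy decoding with
`Utf8Error` might look like the following sketch. It uses only the standard
library; the function name `decode_lossy` is a hypothetical choice made here,
and the code is not taken from `bstr`:

```rust
use std::str;

/// Lossily decodes `bytes`, substituting U+FFFD for each invalid sequence.
/// (Hypothetical helper built on `Utf8Error`; not part of `bstr`.)
fn decode_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // This prefix was just reported as valid, so this cannot fail.
                out.push_str(str::from_utf8(valid).unwrap());
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    let input = b"foo\xFFbar";
    let decoded = decode_lossy(input);
    // Agrees with the standard library's one-shot lossy conversion.
    assert_eq!(decoded, String::from_utf8_lossy(input));
    println!("{}", decoded);
}
```
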
So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem. For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.

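To give a sense of the glue code in question, here is a minimal sketch of a
byte-level search and replace. It substitutes a naive standard-library search
(`windows` plus `position`) for the `twoway` crate, and the function name
`replace_all` is a hypothetical choice made for illustration:

```rust
/// Replaces every non-overlapping occurrence of `needle` in `haystack`.
/// (Hypothetical helper using a naive search; not part of `bstr`.)
fn replace_all(haystack: &[u8], needle: &[u8], replacement: &[u8]) -> Vec<u8> {
    // An empty needle would match everywhere; treat it as nothing to replace.
    if needle.is_empty() {
        return haystack.to_vec();
    }
    let mut out = Vec::with_capacity(haystack.len());
    let mut rest = haystack;
    // Find the next match in whatever remains of the haystack.
    while let Some(pos) = rest.windows(needle.len()).position(|w| w == needle) {
        out.extend_from_slice(&rest[..pos]);
        out.extend_from_slice(replacement);
        rest = &rest[pos + needle.len()..];
    }
    // Copy the tail that contains no further matches.
    out.extend_from_slice(rest);
    out
}

fn main() {
    // The haystack contains an invalid UTF-8 byte, which is left untouched.
    let replaced = replace_all(b"foo\xFFfoo", b"foo", b"quux");
    assert_eq!(replaced, b"quux\xFFquux".to_vec());
}
```
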
In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving. It's not clear to me whether
this experiment will be successful or not, but it is definitely a goal of
`bstr` to keep its dependency list lightweight. For example, `serde` is an
optional dependency because there is no feasible alternative, but `twoway` is
not a dependency at all, since we instead prefer to implement our own substring
search. In service of this philosophy, currently, the only required dependency
of `bstr` is `memchr`.


### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   https://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   https://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)), although
this data is only used in tests.