Your friendly guide to understanding the performance characteristics of this
crate.

This guide assumes some familiarity with the public API of this crate, which
can be found here: https://docs.rs/regex

## Theory vs. Practice

One of the design goals of this crate is to provide worst case linear time
behavior with respect to the text searched using finite state automata. This
means that, *in theory*, the performance of this crate is much better than
most regex implementations, which typically use backtracking, which in turn
has worst case exponential time.

For example, try opening a Python interpreter and typing this:

    >>> import re
    >>> re.search('(a*)*c', 'a' * 30).span()

I'll wait.

At some point, you'll figure out that it won't terminate any time soon. ^C it.

The promise of this crate is that *this pathological behavior can't happen*.

With that said, just because we have protected ourselves against worst case
exponential behavior doesn't mean we are immune from large constant factors
or places where the current regex engine isn't quite optimal. This guide will
detail those cases and provide guidance on how to avoid them, among other
bits of general advice.

## Thou Shalt Not Compile Regular Expressions In A Loop

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.

Don't do it unless you really don't mind paying for it. Compiling a regular
expression in this crate is quite expensive. It is conceivable that it may get
faster some day, but I wouldn't hold out hope for, say, an order of magnitude
improvement. In particular, compilation can take anywhere from a few dozen
microseconds to a few dozen milliseconds. Yes, milliseconds. Unicode
character classes, in particular, have the largest impact on compilation
performance. At the time of writing, for example, `\pL{100}` takes around
44ms to compile.
This 43is because `\pL` corresponds to every letter in Unicode and compilation must 44turn it into a proper automaton that decodes a subset of UTF-8 which 45corresponds to those letters. Compilation also spends some cycles shrinking the 46size of the automaton. 47 48This means that in order to realize efficient regex matching, one must 49*amortize the cost of compilation*. Trivially, if a call to `is_match` is 50inside a loop, then make sure your call to `Regex::new` is *outside* that loop. 51 52In many programming languages, regular expressions can be conveniently defined 53and compiled in a global scope, and code can reach out and use them as if 54they were global static variables. In Rust, there is really no concept of 55life-before-main, and therefore, one cannot utter this: 56 57 static MY_REGEX: Regex = Regex::new("...").unwrap(); 58 59Unfortunately, this would seem to imply that one must pass `Regex` objects 60around to everywhere they are used, which can be especially painful depending 61on how your program is structured. Thankfully, the 62[`lazy_static`](https://crates.io/crates/lazy_static) 63crate provides an answer that works well: 64 65 #[macro_use] extern crate lazy_static; 66 extern crate regex; 67 68 use regex::Regex; 69 70 fn some_helper_function(text: &str) -> bool { 71 lazy_static! { 72 static ref MY_REGEX: Regex = Regex::new("...").unwrap(); 73 } 74 MY_REGEX.is_match(text) 75 } 76 77In other words, the `lazy_static!` macro enables us to define a `Regex` *as if* 78it were a global static value. What is actually happening under the covers is 79that the code inside the macro (i.e., `Regex::new(...)`) is run on *first use* 80of `MY_REGEX` via a `Deref` impl. The implementation is admittedly magical, but 81it's self contained and everything works exactly as you expect. In particular, 82`MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or 83a `Mutex`. On that note... 

## Using a regex from multiple threads

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex incurs no additional memory overhead beyond
what sharing a single `Regex` across multiple threads would use. *Its only
cost is ergonomics.*

It is supported and encouraged to define your regexes using `lazy_static!` as
if they were global static values, and then use them to search text from
multiple threads simultaneously.

One might imagine that this is possible because a `Regex` represents a
*compiled* program, so that any allocation or mutation is already done, and is
therefore read-only. Unfortunately, this is not true. Each type of search
strategy in this crate requires some kind of mutable scratch space to use
*during search*. For example, when executing a DFA, its states are computed
lazily and reused on subsequent searches. Those states go into that mutable
scratch space.

The mutable scratch space is an implementation detail, and in general, its
mutation should not be observable from users of this crate. Therefore, it
uses interior mutability. This implies that a `Regex` can either be used from
only one thread, or it must do some sort of synchronization. Either choice is
reasonable, but this crate chooses the latter, in particular because it is
ergonomic and makes use with `lazy_static!` straightforward.

Synchronization implies *some* amount of overhead. When a `Regex` is used
from a single thread, this overhead is negligible. When a `Regex` is used
from multiple threads simultaneously, it is possible for the overhead of
synchronization from contention to impact performance.
The specific cases where contention may happen are when you are calling any
of these methods repeatedly from multiple threads simultaneously:

* shortest_match
* is_match
* find
* captures

In particular, every invocation of one of these methods must synchronize with
other threads to retrieve its mutable scratch space before searching can
start. If, however, you are using one of these methods:

* find_iter
* captures_iter

then you may not suffer from contention, since the cost of synchronization is
amortized on *construction of the iterator*. That is, the mutable scratch
space is obtained when the iterator is created and retained throughout its
lifetime.

## Only ask for what you need

**Advice**: Prefer in this order: `is_match`, `find`, `captures`.

There are three primary search methods on a `Regex`:

* is_match
* find
* captures

In general, these are ordered from fastest to slowest.

`is_match` is fastest because it doesn't actually need to find the start or
the end of the leftmost-first match. It can quit immediately after it knows
there is a match. For example, given the regex `a+` and the haystack `aaaaa`,
the search will quit after examining the first byte.

In contrast, `find` must return both the start and end location of the
leftmost-first match. It can use the DFA matcher for this, but must run it
forwards once to find the end of the match *and then run it backwards* to
find the start of the match. The two scans and the cost of finding the real
end of the leftmost-first match make this more expensive than `is_match`.

`captures` is the most expensive of them all because it must do what `find`
does, and then run either the bounded backtracker or the Pike VM to fill in
the capture group locations. Both of these are simulations of an NFA, which
must spend a lot of time shuffling states around.
The DFA limits the performance hit somewhat by restricting the amount of text
that must be searched via an NFA simulation.

One other method not mentioned is `shortest_match`. This method has precisely
the same performance characteristics as `is_match`, except that it returns
the end location at which it discovered a match. For example, given the regex
`a+` and the haystack `aaaaa`, `shortest_match` may return `1` as opposed to
`5`, the latter being the correct end location of the leftmost-first match.

## Literals in your regex may make it faster

**Advice**: Literals can reduce the work that the regex engine needs to do.
Use them if you can, especially as prefixes.

In particular, if your regex starts with a prefix literal, the prefix is
quickly searched before entering the (much slower) regex engine. For example,
given the regex `foo\w+`, the literal `foo` will be searched for using
Boyer-Moore. If there's no match, then no regex engine is ever used. Only
when there's a match is the regex engine invoked at the location of the
match, which effectively permits the regex engine to skip large portions of a
haystack. If a regex is comprised entirely of literals (possibly more than
one), then it's possible that the regex engine can be avoided entirely even
when there's a match.

When one literal is found, Boyer-Moore is used. When multiple literals are
found, then an optimized version of Aho-Corasick is used.

This optimization is in particular extended quite a bit in this crate.
Here are 190a few examples of regexes that get literal prefixes detected: 191 192* `(foo|bar)` detects `foo` and `bar` 193* `(a|b)c` detects `ac` and `bc` 194* `[ab]foo[yz]` detects `afooy`, `afooz`, `bfooy` and `bfooz` 195* `a?b` detects `a` and `b` 196* `a*b` detects `a` and `b` 197* `(ab){3,6}` detects `ababab` 198 199Literals in anchored regexes can also be used for detecting non-matches very 200quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match 201just by examing the first (or last) three bytes of the haystack. 202 203## Unicode word boundaries may prevent the DFA from being used 204 205**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)` 206instead of `\b` if you care about consistent performance more than correctness. 207 208It's a sad state of the current implementation. At the moment, the DFA will try 209to interpret Unicode word boundaries as if they were ASCII word boundaries. 210If the DFA comes across any non-ASCII byte, it will quit and fall back to an 211alternative matching engine that can handle Unicode word boundaries correctly. 212The alternate matching engine is generally quite a bit slower (perhaps by an 213order of magnitude). If necessary, this can be ameliorated in two ways. 214 215The first way is to add some number of literal prefixes to your regular 216expression. Even though the DFA may not be used, specialized routines will 217still kick in to find prefix literals quickly, which limits how much work the 218NFA simulation will need to do. 219 220The second way is to give up on Unicode and use an ASCII word boundary instead. 221One can use an ASCII word boundary by disabling Unicode support. That is, 222instead of using `\b`, use `(?-u:\b)`. Namely, given the regex `\b.+\b`, it 223can be transformed into a regex that uses the DFA with `(?-u:\b).+(?-u:\b)`. 
It 224is important to limit the scope of disabling the `u` flag, since it might lead 225to a syntax error if the regex could match arbitrary bytes. For example, if one 226wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches 227any *byte* when the Unicode flag is disabled. 228 229The second way isn't appreciably different than just using a Unicode word 230boundary in the first place, since the DFA will speculatively interpret it as 231an ASCII word boundary anyway. The key difference is that if an ASCII word 232boundary is used explicitly, then the DFA won't quit in the presence of 233non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for 234more consistent performance. 235 236N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one 237can simply write `\b` to get an ASCII word boundary. 238 239## Excessive counting can lead to exponential state blow up in the DFA 240 241**Advice**: Don't write regexes that cause DFA state blow up if you care about 242match performance. 243 244Wait, didn't I say that this crate guards against exponential worst cases? 245Well, it turns out that the process of converting an NFA to a DFA can lead to 246an exponential blow up in the number of states. This crate specifically guards 247against exponential blow up by doing two things: 248 2491. The DFA is computed lazily. That is, a state in the DFA only exists in 250 memory if it is visited. In particular, the lazy DFA guarantees that *at 251 most* one state is created for every byte of input. This, on its own, 252 guarantees linear time complexity. 2532. Of course, creating a new state for *every* byte of input means that search 254 will go incredibly slow because of very large constant factors. On top of 255 that, creating a state for every byte in a large haystack could result in 256 exorbitant memory usage. To ameliorate this, the DFA bounds the number of 257 states it can store. 
   Once it reaches its limit, it flushes its cache. This prevents reuse of
   states that it already computed. If the cache is flushed too frequently,
   then the DFA will give up and execution will fall back to one of the NFA
   simulations.

In effect, this crate will detect exponential state blow up and fall back to
a search routine with fixed memory requirements. This does, however, mean
that searching will be much slower than one might expect. Regexes that rely
on counting in particular are strong aggravators of this behavior. For
example, matching `[01]*1[01]{20}$` against a random sequence of `0`s and
`1`s.

In the future, it may be possible to increase the bound that the DFA uses,
which would allow the caller to choose how much memory they're willing to
spend.

## Resist the temptation to "optimize" regexes

**Advice**: This ain't a backtracking engine.

An entire book was written on how to optimize Perl-style regular expressions.
Most of those techniques are not applicable for this library. For example,
there is no problem with using non-greedy matching or having lots of
alternations in your regex.