1-----------------------------------------------------------------------------
2This file contains a concatenation of the PCRE2 man pages, converted to plain
3text format for ease of searching with a text editor, or for use on systems
4that do not have a man page processor. The small individual files that give
5synopses of each function in the library have not been included. Neither has
6the pcre2demo program. There are separate text files for the pcre2grep and
7pcre2test commands.
8-----------------------------------------------------------------------------
9
10
11PCRE2(3)                   Library Functions Manual                   PCRE2(3)
12
13
14
15NAME
16       PCRE2 - Perl-compatible regular expressions (revised API)
17
18INTRODUCTION
19
20       PCRE2 is the name used for a revised API for the PCRE library, which is
21       a set of functions, written in C,  that  implement  regular  expression
22       pattern matching using the same syntax and semantics as Perl, with just
23       a few differences. After nearly two decades,  the  limitations  of  the
24       original  API  were  making development increasingly difficult. The new
25       API is more extensible, and it was simplified by abolishing  the  sepa-
26       rate  "study" optimizing function; in PCRE2, patterns are automatically
27       optimized where possible. Since forking from PCRE1, the code  has  been
28       extensively refactored and new features introduced.
29
30       As  well  as Perl-style regular expression patterns, some features that
31       appeared in Python and the original PCRE before they appeared  in  Perl
32       are  available  using the Python syntax. There is also some support for
33       one or two .NET and Oniguruma syntax items, and there are  options  for
34       requesting  some  minor  changes that give better ECMAScript (aka Java-
35       Script) compatibility.
36
37       The source code for PCRE2 can be compiled to support 8-bit, 16-bit,  or
38       32-bit  code units, which means that up to three separate libraries may
39       be installed.  The original work to extend PCRE to  16-bit  and  32-bit
40       code  units  was  done  by Zoltan Herczeg and Christian Persch, respec-
41       tively. In all three cases, strings can be interpreted  either  as  one
42       character  per  code  unit, or as UTF-encoded Unicode, with support for
43       Unicode general category properties. Unicode  support  is  optional  at
44       build  time  (but  is  the default). However, processing strings as UTF
45       code units must be enabled explicitly at run time. The version of  Uni-
46       code in use can be discovered by running
47
48         pcre2test -C
49
50       The  three  libraries  contain  identical sets of functions, with names
51       ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
52       pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
53       32, a program that uses just one code unit width can be  written  using
54       generic names such as pcre2_compile(), and the documentation is written
55       assuming that this is the case.
56
57       In addition to the Perl-compatible matching function, PCRE2 contains an
58       alternative  function that matches the same compiled patterns in a dif-
59       ferent way. In certain circumstances, the alternative function has some
60       advantages.   For  a discussion of the two matching algorithms, see the
61       pcre2matching page.
62
63       Details of exactly which Perl regular expression features are  and  are
64       not  supported  by  PCRE2  are  given  in  separate  documents. See the
65       pcre2pattern and pcre2compat pages. There is a syntax  summary  in  the
66       pcre2syntax page.
67
68       Some  features  of PCRE2 can be included, excluded, or changed when the
69       library is built. The pcre2_config() function makes it possible  for  a
70       client  to  discover  which  features are available. The features them-
71       selves are described in the pcre2build page. Documentation about build-
72       ing  PCRE2 for various operating systems can be found in the README and
73       NON-AUTOTOOLS_BUILD files in the source distribution.
74
75       The libraries contain a number of undocumented internal functions  and
76       data  tables  that  are  used by more than one of the exported external
77       functions, but which are not intended  for  use  by  external  callers.
78       Their  names  all begin with "_pcre2", which hopefully will not provoke
79       any name clashes. In some environments, it is possible to control which
80       external  symbols  are  exported when a shared library is built, and in
81       these cases the undocumented symbols are not exported.
82
83
84SECURITY CONSIDERATIONS
85
86       If you are using PCRE2 in a non-UTF application that permits  users  to
87       supply  arbitrary  patterns  for  compilation, you should be aware of a
88       feature that allows users to turn on UTF support from within a pattern.
89       For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
90       mode, which interprets patterns and subjects as strings of  UTF-8  code
91       units instead of individual 8-bit characters. This causes both the pat-
92       tern and any data against which it is matched to be checked  for  UTF-8
93       validity.  If the data string is very long, such a check might  consume
94       enough  resources  to  noticeably  degrade  the  performance  of   your
95       application.
96
97       One  way  of guarding against this possibility is to use the pcre2_pat-
98       tern_info() function  to  check  the  compiled  pattern's  options  for
99       PCRE2_UTF.  Alternatively,  you can set the PCRE2_NEVER_UTF option when
100       calling pcre2_compile(). This causes a compile time error if  the  pat-
101       tern contains a UTF-setting sequence.
102
103       The  use  of Unicode properties for character types such as \d can also
104       be enabled from within the pattern, by specifying "(*UCP)".  This  fea-
105       ture can be disallowed by setting the PCRE2_NEVER_UCP option.
106
107       If  your  application  is one that supports UTF, be aware that validity
108       checking can take time. If the same data string is to be  matched  many
109       times,  you  can  use  the PCRE2_NO_UTF_CHECK option for the second and
110       subsequent matches to avoid running redundant checks.
111
112       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
113       to  problems,  because  it  may leave the current matching point in the
114       middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C  op-
115       tion can be used by an application to lock out the use of \C, causing a
116       compile-time error if it is encountered. It is also possible  to  build
117       PCRE2 with the use of \C permanently disabled.
118
119       Performance can also suffer when a pattern that has a very large search
120       tree  is  run  against  a  subject  string  that   will   never  match.
121       Nested  unlimited repeats in a pattern are a common example. PCRE2 pro-
122       vides some protection against  this:  see  the  pcre2_set_match_limit()
123       function  in  the  pcre2api  page.  There  is a similar function called
124       pcre2_set_depth_limit() that can be used to restrict the amount of mem-
125       ory that is used.
126
127
128USER DOCUMENTATION
129
130       The  user  documentation for PCRE2 comprises a number of different sec-
131       tions. In the "man" format, each of these is a separate "man page".  In
132       the  HTML  format, each is a separate page, linked from the index page.
133       In the plain  text  format,  the  descriptions  of  the  pcre2grep  and
134       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
135       respectively. The remaining sections, except for the pcre2demo  section
136       (which  is a program listing), and the short pages for individual func-
137       tions, are concatenated in pcre2.txt, for ease of searching.  The  sec-
138       tions are as follows:
139
140         pcre2              this document
141         pcre2-config       show PCRE2 installation configuration information
142         pcre2api           details of PCRE2's native C API
143         pcre2build         building PCRE2
144         pcre2callout       details of the pattern callout feature
145         pcre2compat        discussion of Perl compatibility
146         pcre2convert       details of pattern conversion functions
147         pcre2demo          a demonstration C program that uses PCRE2
148         pcre2grep          description of the pcre2grep command (8-bit only)
149         pcre2jit           discussion of just-in-time optimization support
150         pcre2limits        details of size and other limits
151         pcre2matching      discussion of the two matching algorithms
152         pcre2partial       details of the partial matching facility
153         pcre2pattern       syntax and semantics of supported regular
154                              expression patterns
155         pcre2perform       discussion of performance issues
156         pcre2posix         the POSIX-compatible C API for the 8-bit library
157         pcre2sample        discussion of the pcre2demo program
158         pcre2serialize     details of pattern serialization
159         pcre2syntax        quick syntax reference
160         pcre2test          description of the pcre2test command
161         pcre2unicode       discussion of Unicode and UTF support
162
163       In  the  "man"  and HTML formats, there is also a short page for each C
164       library function, listing its arguments and results.
165
166
167AUTHOR
168
169       Philip Hazel
170       University Computing Service
171       Cambridge, England.
172
173       Putting an actual email address here is a spam magnet. If you  want  to
174       email  me,  use  my two initials, followed by the two digits 10, at the
175       domain cam.ac.uk.
176
177
178REVISION
179
180       Last updated: 17 September 2018
181       Copyright (c) 1997-2018 University of Cambridge.
182------------------------------------------------------------------------------
183
184
185PCRE2API(3)                Library Functions Manual                PCRE2API(3)
186
187
188
189NAME
190       PCRE2 - Perl-compatible regular expressions (revised API)
191
192       #include <pcre2.h>
193
194       PCRE2  is  a  new API for PCRE, starting at release 10.0. This document
195       contains a description of all its native functions. See the pcre2 docu-
196       ment for an overview of all the PCRE2 documentation.
197
198
199PCRE2 NATIVE API BASIC FUNCTIONS
200
201       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
202         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
203         pcre2_compile_context *ccontext);
204
205       void pcre2_code_free(pcre2_code *code);
206
207       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
208         pcre2_general_context *gcontext);
209
210       pcre2_match_data *pcre2_match_data_create_from_pattern(
211         const pcre2_code *code, pcre2_general_context *gcontext);
212
213       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
214         PCRE2_SIZE length, PCRE2_SIZE startoffset,
215         uint32_t options, pcre2_match_data *match_data,
216         pcre2_match_context *mcontext);
217
218       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
219         PCRE2_SIZE length, PCRE2_SIZE startoffset,
220         uint32_t options, pcre2_match_data *match_data,
221         pcre2_match_context *mcontext,
222         int *workspace, PCRE2_SIZE wscount);
223
224       void pcre2_match_data_free(pcre2_match_data *match_data);
225
226
227PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS
228
229       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
230
231       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
232
233       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
234
235       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
236
237
238PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS
239
240       pcre2_general_context *pcre2_general_context_create(
241         void *(*private_malloc)(PCRE2_SIZE, void *),
242         void (*private_free)(void *, void *), void *memory_data);
243
244       pcre2_general_context *pcre2_general_context_copy(
245         pcre2_general_context *gcontext);
246
247       void pcre2_general_context_free(pcre2_general_context *gcontext);
248
249
250PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
251
252       pcre2_compile_context *pcre2_compile_context_create(
253         pcre2_general_context *gcontext);
254
255       pcre2_compile_context *pcre2_compile_context_copy(
256         pcre2_compile_context *ccontext);
257
258       void pcre2_compile_context_free(pcre2_compile_context *ccontext);
259
260       int pcre2_set_bsr(pcre2_compile_context *ccontext,
261         uint32_t value);
262
263       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
264         const uint8_t *tables);
265
266       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
267         uint32_t extra_options);
268
269       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
270         PCRE2_SIZE value);
271
272       int pcre2_set_newline(pcre2_compile_context *ccontext,
273         uint32_t value);
274
275       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
276         uint32_t value);
277
278       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
279         int (*guard_function)(uint32_t, void *), void *user_data);
280
281
282PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
283
284       pcre2_match_context *pcre2_match_context_create(
285         pcre2_general_context *gcontext);
286
287       pcre2_match_context *pcre2_match_context_copy(
288         pcre2_match_context *mcontext);
289
290       void pcre2_match_context_free(pcre2_match_context *mcontext);
291
292       int pcre2_set_callout(pcre2_match_context *mcontext,
293         int (*callout_function)(pcre2_callout_block *, void *),
294         void *callout_data);
295
296       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
297         int (*callout_function)(pcre2_substitute_callout_block *, void *),
298         void *callout_data);
299
300       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
301         PCRE2_SIZE value);
302
303       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
304         uint32_t value);
305
306       int pcre2_set_match_limit(pcre2_match_context *mcontext,
307         uint32_t value);
308
309       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
310         uint32_t value);
311
312
313PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS
314
315       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
316         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
317
318       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
319         uint32_t number, PCRE2_UCHAR *buffer,
320         PCRE2_SIZE *bufflen);
321
322       void pcre2_substring_free(PCRE2_UCHAR *buffer);
323
324       int pcre2_substring_get_byname(pcre2_match_data *match_data,
325         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
326
327       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
328         uint32_t number, PCRE2_UCHAR **bufferptr,
329         PCRE2_SIZE *bufflen);
330
331       int pcre2_substring_length_byname(pcre2_match_data *match_data,
332         PCRE2_SPTR name, PCRE2_SIZE *length);
333
334       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
335         uint32_t number, PCRE2_SIZE *length);
336
337       int pcre2_substring_nametable_scan(const pcre2_code *code,
338         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
339
340       int pcre2_substring_number_from_name(const pcre2_code *code,
341         PCRE2_SPTR name);
342
343       void pcre2_substring_list_free(PCRE2_SPTR *list);
344
345       int pcre2_substring_list_get(pcre2_match_data *match_data,
346         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
347
348
349PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION
350
351       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
352         PCRE2_SIZE length, PCRE2_SIZE startoffset,
353         uint32_t options, pcre2_match_data *match_data,
354         pcre2_match_context *mcontext, PCRE2_SPTR replacementz,
355         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
356         PCRE2_SIZE *outlengthptr);
357
358
359PCRE2 NATIVE API JIT FUNCTIONS
360
361       int pcre2_jit_compile(pcre2_code *code, uint32_t options);
362
363       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
364         PCRE2_SIZE length, PCRE2_SIZE startoffset,
365         uint32_t options, pcre2_match_data *match_data,
366         pcre2_match_context *mcontext);
367
368       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
369
370       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
371         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
372
373       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
374         pcre2_jit_callback callback_function, void *callback_data);
375
376       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
377
378
379PCRE2 NATIVE API SERIALIZATION FUNCTIONS
380
381       int32_t pcre2_serialize_decode(pcre2_code **codes,
382         int32_t number_of_codes, const uint8_t *bytes,
383         pcre2_general_context *gcontext);
384
385       int32_t pcre2_serialize_encode(const pcre2_code **codes,
386         int32_t number_of_codes, uint8_t **serialized_bytes,
387         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
388
389       void pcre2_serialize_free(uint8_t *bytes);
390
391       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
392
393
394PCRE2 NATIVE API AUXILIARY FUNCTIONS
395
396       pcre2_code *pcre2_code_copy(const pcre2_code *code);
397
398       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
399
400       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
401         PCRE2_SIZE bufflen);
402
403       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);
404
405       void pcre2_maketables_free(pcre2_general_context *gcontext,
406         const uint8_t *tables);
407
408       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
409         void *where);
410
411       int pcre2_callout_enumerate(const pcre2_code *code,
412         int (*callback)(pcre2_callout_enumerate_block *, void *),
413         void *user_data);
414
415       int pcre2_config(uint32_t what, void *where);
416
417
418PCRE2 NATIVE API OBSOLETE FUNCTIONS
419
420       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
421         uint32_t value);
422
423       int pcre2_set_recursion_memory_management(
424         pcre2_match_context *mcontext,
425         void *(*private_malloc)(PCRE2_SIZE, void *),
426         void (*private_free)(void *, void *), void *memory_data);
427
428       These  functions became obsolete at release 10.30 and are retained only
429       for backward compatibility. They should not be used in  new  code.  The
430       first  is  replaced by pcre2_set_depth_limit(); the second is no longer
431       needed and has no effect (it always returns zero).
432
433
434PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS
435
436       pcre2_convert_context *pcre2_convert_context_create(
437         pcre2_general_context *gcontext);
438
439       pcre2_convert_context *pcre2_convert_context_copy(
440         pcre2_convert_context *cvcontext);
441
442       void pcre2_convert_context_free(pcre2_convert_context *cvcontext);
443
444       int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
445         uint32_t escape_char);
446
447       int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
448         uint32_t separator_char);
449
450       int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
451         uint32_t options, PCRE2_UCHAR **buffer,
452         PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);
453
454       void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);
455
456       These functions provide a way of  converting  non-PCRE2  patterns  into
457       patterns that can be processed by pcre2_compile(). This facility is ex-
458       perimental and may be changed in future releases. At  present,  "globs"
459       and  POSIX  basic  and  extended patterns can be converted. Details are
460       given in the pcre2convert documentation.
461
462
463PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
464
465       There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
466       code  units,  respectively.  However,  there  is  just one header file,
467       pcre2.h.  This contains the function prototypes and  other  definitions
468       for all three libraries. One, two, or all three can be installed simul-
469       taneously. On Unix-like systems the libraries  are  called  libpcre2-8,
470       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
471       inal PCRE libraries.
472
473       Character strings are passed to and from a PCRE2 library as a  sequence
474       of  unsigned  integers  in  code  units of the appropriate width. Every
475       PCRE2 function comes in three different forms, one  for  each  library,
476       for example:
477
478         pcre2_compile_8()
479         pcre2_compile_16()
480         pcre2_compile_32()
481
482       There are also three different sets of data types:
483
484         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
485         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32
486
487       The  UCHAR  types define unsigned code units of the appropriate widths.
488       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'.  The  SPTR
489       types  are  constant  pointers  to the equivalent UCHAR types, that is,
490       they are pointers to vectors of unsigned code units.
491
492       Many applications use only one code unit width. For their  convenience,
493       macros are defined whose names are the generic forms such as pcre2_com-
494       pile() and  PCRE2_SPTR.  These  macros  use  the  value  of  the  macro
495       PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func-
496       tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
497       An  application  must  define  it  to  be 8, 16, or 32 before including
498       pcre2.h in order to make use of the generic names.
499
500       Applications that use more than one code unit width can be linked  with
501       more  than  one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
502       be 0 before including pcre2.h, and then use the  real  function  names.
503       Any  code  that  is to be included in an environment where the value of
504       PCRE2_CODE_UNIT_WIDTH is unknown should  also  use  the  real  function
505       names. (Unfortunately, it is not possible in C code to save and restore
506       the value of a macro.)
507
508       If PCRE2_CODE_UNIT_WIDTH is not defined  before  including  pcre2.h,  a
509       compiler error occurs.
510
511       When  using  multiple  libraries  in an application, you must take care
512       when processing any particular pattern to use  only  functions  from  a
513       single  library.   For example, if you want to run a match using a pat-
514       tern that was compiled with pcre2_compile_16(), you  must  do  so  with
515       pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().
516
517       In  the  function summaries above, and in the rest of this document and
518       other PCRE2 documents, functions and data  types  are  described  using
519       their generic names, without the _8, _16, or _32 suffix.
520
521
522PCRE2 API OVERVIEW
523
524       PCRE2  has  its  own  native  API, which is described in this document.
525       There are also some wrapper functions for the 8-bit library that corre-
526       spond  to the POSIX regular expression API, but they do not give access
527       to all the functionality of PCRE2. They are described in the pcre2posix
528       documentation. Both these APIs define a set of C function calls.
529
530       The  native  API  C data types, function prototypes, option values, and
531       error codes are defined in the header file pcre2.h, which also contains
532       definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
533       numbers for the library. Applications can use these to include  support
534       for different releases of PCRE2.
535
536       In a Windows environment, if you want to statically link an application
537       program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC
538       before including pcre2.h.
539
540       The  functions pcre2_compile() and pcre2_match() are used for compiling
541       and matching regular expressions in a Perl-compatible manner. A  sample
542       program that demonstrates the simplest way of using them is provided in
543       the file called pcre2demo.c in the PCRE2 source distribution. A listing
544       of  this  program  is  given  in  the  pcre2demo documentation, and the
545       pcre2sample documentation describes how to compile and run it.
546
547       The compiling and matching functions recognize various options that are
548       passed as bits in an options argument. There are also some more compli-
549       cated parameters such as custom memory  management  functions  and  re-
550       source  limits  that  are  passed  in "contexts" (which are just memory
551       blocks, described below). Simple applications do not need to  make  use
552       of contexts.
553
554       Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2
555       that can be built in  appropriate  hardware  environments.  It  greatly
556       speeds  up  the matching performance of many patterns. Programs can re-
557       quest that it be used if available by calling pcre2_jit_compile() after
558       a  pattern has been successfully compiled by pcre2_compile(). This does
559       nothing if JIT support is not available.
560
561       More complicated programs might need to  make  use  of  the  specialist
562       functions    pcre2_jit_stack_create(),    pcre2_jit_stack_free(),   and
563       pcre2_jit_stack_assign() in order to control the JIT code's memory  us-
564       age.
565
566       JIT matching is automatically used by pcre2_match() if it is available,
567       unless the PCRE2_NO_JIT option is set. There is also a direct interface
568       for  JIT  matching,  which gives improved performance at the expense of
569       less sanity checking. The JIT-specific functions are discussed  in  the
570       pcre2jit documentation.
571
572       A  second  matching function, pcre2_dfa_match(), which is not Perl-com-
573       patible, is also provided. This uses  a  different  algorithm  for  the
574       matching.  The  alternative  algorithm finds all possible matches (at a
575       given point in the subject), and scans the subject  just  once  (unless
576       there  are lookaround assertions). However, this algorithm does not re-
577       turn captured substrings. A description of the two matching  algorithms
578       and  their  advantages  and disadvantages is given in the pcre2matching
579       documentation. There is no JIT support for pcre2_dfa_match().
580
581       In addition to the main compiling and  matching  functions,  there  are
582       convenience functions for extracting captured substrings from a subject
583       string that has been matched by pcre2_match(). They are:
584
585         pcre2_substring_copy_byname()
586         pcre2_substring_copy_bynumber()
587         pcre2_substring_get_byname()
588         pcre2_substring_get_bynumber()
589         pcre2_substring_list_get()
590         pcre2_substring_length_byname()
591         pcre2_substring_length_bynumber()
592         pcre2_substring_nametable_scan()
593         pcre2_substring_number_from_name()
594
595       pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
596       vided,  to  free  memory used for extracted strings. If either of these
597       functions is called with a NULL argument, the function returns  immedi-
598       ately without doing anything.
599
600       The  function  pcre2_substitute()  can be called to match a pattern and
601       return a copy of the subject string with substitutions for  parts  that
602       were matched.
603
604       Functions  whose  names begin with pcre2_serialize_ are used for saving
605       compiled patterns on disc or elsewhere, and reloading them later.
606
607       Finally, there are functions for finding out information about  a  com-
608       piled  pattern  (pcre2_pattern_info()) and about the configuration with
609       which PCRE2 was built (pcre2_config()).
610
611       Functions with names ending with _free() are used  for  freeing  memory
612       blocks  of  various  sorts.  In all cases, if one of these functions is
613       called with a NULL argument, it does nothing.
614
615
616STRING LENGTHS AND OFFSETS
617
618       The PCRE2 API uses string lengths and  offsets  into  strings  of  code
619       units  in  several  places. These values are always of type PCRE2_SIZE,
620       which is an unsigned integer type, currently always defined as  size_t.
621       The  largest  value  that  can  be  stored  in  such  a  type  (that is
622       ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated
623       strings  and  unset offsets.  Therefore, the longest string that can be
624       handled is one less than this maximum.
625
626
627NEWLINES
628
629       PCRE2 supports five different conventions for indicating line breaks in
630       strings:  a  single  CR (carriage return) character, a single LF (line-
631       feed) character, the two-character sequence CRLF, any of the three pre-
632       ceding,  or any Unicode newline sequence. The Unicode newline sequences
633       are the three just mentioned, plus the single characters  VT  (vertical
634       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
635       separator, U+2028), and PS (paragraph separator, U+2029).
636
637       Each of the first three conventions is used by at least  one  operating
638       system as its standard newline sequence. When PCRE2 is built, a default
639       can be specified.  If it is not, the default is set to LF, which is the
640       Unix standard. However, the newline convention can be changed by an ap-
641       plication when calling pcre2_compile(), or it can be specified by  spe-
642       cial  text at the start of the pattern itself; this overrides any other
643       settings. See the pcre2pattern page for details of the special  charac-
644       ter sequences.
645
646       In  the  PCRE2  documentation  the  word "newline" is used to mean "the
647       character or pair of characters that indicate a line break". The choice
648       of  newline convention affects the handling of the dot, circumflex, and
649       dollar metacharacters, the handling of #-comments in /x mode, and, when
650       CRLF  is a recognized line ending sequence, the match position advance-
651       ment for a non-anchored pattern. There is more detail about this in the
652       section on pcre2_match() options below.
653
654       The  choice of newline convention does not affect the interpretation of
655       the \n or \r escape sequences, nor does it affect what \R matches; this
656       has its own separate convention.
657
658
659MULTITHREADING
660
661       In  a multithreaded application it is important to keep thread-specific
662       data separate from data that can be shared between threads.  The  PCRE2
663       library  code  itself  is  thread-safe: it contains no static or global
664       variables. The API is designed to be fairly simple for non-threaded ap-
665       plications  while at the same time ensuring that multithreaded applica-
666       tions can use it.
667
668       There are several different blocks of data that are used to pass infor-
669       mation between the application and the PCRE2 libraries.
670
671   The compiled pattern
672
673       A  pointer  to  the  compiled form of a pattern is returned to the user
674       when pcre2_compile() is successful. The data in the compiled pattern is
675       fixed,  and  does not change when the pattern is matched. Therefore, it
676       is thread-safe, that is, the same compiled pattern can be used by  more
677       than one thread simultaneously. For example, an application can compile
678       all its patterns at the start, before forking off multiple threads that
679       use  them.  However,  if the just-in-time (JIT) optimization feature is
680       being used, it needs separate memory stack areas for each  thread.  See
681       the pcre2jit documentation for more details.
682
683       In  a more complicated situation, where patterns are compiled only when
684       they are first needed, but are still shared between  threads,  pointers
685       to  compiled  patterns  must  be protected from simultaneous writing by
686       multiple threads. This is somewhat tricky to do correctly. If you  know
687       that  writing  to  a pointer is atomic in your environment, you can use
688       logic like this:
689
690         Get a read-only (shared) lock (mutex) for pointer
691         if (pointer == NULL)
692           {
693           Get a write (unique) lock for pointer
694           if (pointer == NULL) pointer = pcre2_compile(...
695           }
696         Release the lock
697         Use pointer in pcre2_match()
698
699       Of course, testing for compilation errors should also  be  included  in
700       the code.
701
702       The  reason  for checking the pointer a second time is as follows: Sev-
703       eral threads may have acquired the shared lock and tested  the  pointer
704       for being NULL, but only one of them will be given the write lock, with
705       the rest kept waiting. The winning thread will compile the pattern  and
706       store  the  result.  After this thread releases the write lock, another
707       thread will get it, and if it does not retest pointer for  being  NULL,
708       will recompile the pattern and overwrite the pointer, creating a memory
709       leak and possibly causing other issues.
710
711       In an environment where writing to a pointer may  not  be  atomic,  the
712       above  logic  is not sufficient. The thread that is doing the compiling
713       may be descheduled after writing only part of the pointer, which  could
714       cause  other  threads  to use an invalid value. Instead of checking the
715       pointer itself, a separate "pointer is valid" flag (that can be updated
716       atomically) must be used:
717
718         Get a read-only (shared) lock (mutex) for pointer
719         if (!pointer_is_valid)
720           {
721           Get a write (unique) lock for pointer
722           if (!pointer_is_valid)
723             {
724             pointer = pcre2_compile(...
725             pointer_is_valid = TRUE
726             }
727           }
728         Release the lock
729         Use pointer in pcre2_match()
730
731       If JIT is being used, but the JIT compilation is not being done immedi-
732       ately (perhaps waiting to see if the pattern  is  used  often  enough),
733       similar  logic  is required. JIT compilation updates a value within the
734       compiled code block, so a thread must gain unique write access  to  the
735       pointer     before    calling    pcre2_jit_compile().    Alternatively,
736       pcre2_code_copy() or pcre2_code_copy_with_tables() can be used  to  ob-
737       tain  a  private  copy of the compiled code before calling the JIT com-
738       piler.
739
740   Context blocks
741
742       The next main section below introduces the idea of "contexts" in  which
743       PCRE2 functions are called. A context is nothing more than a collection
744       of parameters that control the way PCRE2 operates. Grouping a number of
745       parameters together in a context is a convenient way of passing them to
746       a PCRE2 function without using lots of arguments. The  parameters  that
747       are  stored  in  contexts  are in some sense "advanced features" of the
748       API. Many straightforward applications will not need to use contexts.
749
750       In a multithreaded application, if the parameters in a context are val-
751       ues  that  are  never  changed, the same context can be used by all the
752       threads. However, if any thread needs to change any value in a context,
753       it must make its own thread-specific copy.
754
755   Match blocks
756
757       The  matching  functions need a block of memory for storing the results
758       of a match. This includes details of what was matched, as well as addi-
759       tional  information  such as the name of a (*MARK) setting. Each thread
760       must provide its own copy of this memory.
761
762
763PCRE2 CONTEXTS
764
765       Some PCRE2 functions have a lot of parameters, many of which  are  used
766       only  by  specialist  applications,  for example, those that use custom
767       memory management or non-standard character tables.  To  keep  function
768       argument  lists  at a reasonable size, and at the same time to keep the
769       API extensible, "uncommon" parameters are passed to  certain  functions
770       in  a  context instead of directly. A context is just a block of memory
771       that holds the parameter values.  Applications that do not need to  ad-
772       just any of the context parameters can pass NULL when a context pointer
773       is required.
774
775       There are three different types of context: a general context  that  is
776       relevant  for  several  PCRE2 operations, a compile-time context, and a
777       match-time context.
778
779   The general context
780
781       At present, this context just contains pointers to (and data  for)  ex-
782       ternal  memory management functions that are called from several places
       in the PCRE2 library.  The  context  is  named  "general"  rather  than
       specifically  "memory"  because in future other fields may be added. If
785       you do not want to supply your own custom memory management  functions,
786       you  do not need to bother with a general context. A general context is
787       created by:
788
789       pcre2_general_context *pcre2_general_context_create(
790         void *(*private_malloc)(PCRE2_SIZE, void *),
791         void (*private_free)(void *, void *), void *memory_data);
792
793       The two function pointers specify custom memory  management  functions,
794       whose prototypes are:
795
796         void *private_malloc(PCRE2_SIZE, void *);
797         void  private_free(void *, void *);
798
799       Whenever code in PCRE2 calls these functions, the final argument is the
800       value of memory_data. Either of the first two arguments of the creation
801       function  may be NULL, in which case the system memory management func-
802       tions malloc() and free() are used. (This is not currently  useful,  as
803       there  are  no  other  fields in a general context, but in future there
804       might be.)  The private_malloc() function is used (if supplied) to  ob-
805       tain  memory for storing the context, and all three values are saved as
806       part of the context.
807
808       Whenever PCRE2 creates a data block of any kind, the block  contains  a
809       pointer  to the free() function that matches the malloc() function that
810       was used. When the time comes to  free  the  block,  this  function  is
811       called.
812
813       A general context can be copied by calling:
814
815       pcre2_general_context *pcre2_general_context_copy(
816         pcre2_general_context *gcontext);
817
818       The memory used for a general context should be freed by calling:
819
820       void pcre2_general_context_free(pcre2_general_context *gcontext);
821
822       If  this  function  is  passed  a NULL argument, it returns immediately
823       without doing anything.
824
825   The compile context
826
827       A compile context is required if you want to provide an external  func-
828       tion  for  stack  checking  during compilation or to change the default
829       values of any of the following compile-time parameters:
830
831         What \R matches (Unicode newlines or CR, LF, CRLF only)
832         PCRE2's character tables
833         The newline character sequence
834         The compile time nested parentheses limit
835         The maximum length of the pattern string
836         The extra options bits (none set by default)
837
838       A compile context is also required if you are using custom memory  man-
839       agement.   If  none of these apply, just pass NULL as the context argu-
840       ment of pcre2_compile().
841
842       A compile context is created, copied, and freed by the following  func-
843       tions:
844
845       pcre2_compile_context *pcre2_compile_context_create(
846         pcre2_general_context *gcontext);
847
848       pcre2_compile_context *pcre2_compile_context_copy(
849         pcre2_compile_context *ccontext);
850
851       void pcre2_compile_context_free(pcre2_compile_context *ccontext);
852
853       A  compile  context  is created with default values for its parameters.
854       These can be changed by calling the following functions, which return 0
855       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
856
857       int pcre2_set_bsr(pcre2_compile_context *ccontext,
858         uint32_t value);
859
860       The  value  must  be PCRE2_BSR_ANYCRLF, to specify that \R matches only
861       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R  matches  any
862       Unicode line ending sequence. The value is used by the JIT compiler and
863       by  the  two  interpreted   matching   functions,   pcre2_match()   and
864       pcre2_dfa_match().
865
866       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
867         const uint8_t *tables);
868
869       The  value  must  be  the result of a call to pcre2_maketables(), whose
870       only argument is a general context. This function builds a set of char-
871       acter tables in the current locale.
872
873       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
874         uint32_t extra_options);
875
876       As  PCRE2  has developed, almost all the 32 option bits that are avail-
877       able in the options argument of pcre2_compile() have been used  up.  To
878       avoid  running  out, the compile context contains a set of extra option
       bits which are used for some newer, assumed rarer, options. This
       function sets those bits. It always sets all the bits (either on or
       off), replacing the previous value as a whole; it cannot be used to
       change individual bits while leaving the others as they were. The
       available options are defined in the section entitled "Extra compile
       options" below.
883
884       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
885         PCRE2_SIZE value);
886
887       This  sets a maximum length, in code units, for any pattern string that
888       is compiled with this context. If the pattern is longer,  an  error  is
889       generated.   This facility is provided so that applications that accept
890       patterns from external sources can limit their size. The default is the
891       largest  number  that  a  PCRE2_SIZE variable can hold, which is effec-
892       tively unlimited.
893
894       int pcre2_set_newline(pcre2_compile_context *ccontext,
895         uint32_t value);
896
897       This specifies which characters or character sequences are to be recog-
898       nized  as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
899       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
900       two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
901       of the above), PCRE2_NEWLINE_ANY (any  Unicode  newline  sequence),  or
902       PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
903
904       A pattern can override the value set in the compile context by starting
905       with a sequence such as (*CRLF). See the pcre2pattern page for details.
906
907       When a  pattern  is  compiled  with  the  PCRE2_EXTENDED  or  PCRE2_EX-
908       TENDED_MORE  option,  the newline convention affects the recognition of
909       the end of internal comments starting with #. The value is  saved  with
910       the  compiled pattern for subsequent use by the JIT compiler and by the
911       two    interpreted    matching     functions,     pcre2_match()     and
912       pcre2_dfa_match().
913
914       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
915         uint32_t value);
916
917       This  parameter  adjusts  the  limit,  set when PCRE2 is built (default
918       250), on the depth of parenthesis nesting  in  a  pattern.  This  limit
919       stops  rogue  patterns  using  up too much system stack when being com-
920       piled. The limit applies to parentheses of all kinds, not just  captur-
921       ing parentheses.
922
923       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
924         int (*guard_function)(uint32_t, void *), void *user_data);
925
926       There  is at least one application that runs PCRE2 in threads with very
927       limited system stack, where running out of stack is to  be  avoided  at
928       all  costs. The parenthesis limit above cannot take account of how much
929       stack is actually available during compilation. For  a  finer  control,
930       you  can  supply  a  function  that  is called whenever pcre2_compile()
931       starts to compile a parenthesized part of a pattern. This function  can
932       check  the  actual  stack  size  (or anything else that it wants to, of
933       course).
934
       The first argument to the guard function gives the current depth of
       nesting, and the second is user data that is set up by the last
       argument of pcre2_set_compile_recursion_guard(). The guard function
       should return zero if all is well, or non-zero to force an error.
939
940   The match context
941
942       A match context is required if you want to:
943
944         Set up a callout function
945         Set an offset limit for matching an unanchored pattern
946         Change the limit on the amount of heap used when matching
947         Change the backtracking match limit
948         Change the backtracking depth limit
949         Set custom memory management specifically for the match
950
951       If  none  of  these  apply,  just  pass NULL as the context argument of
952       pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
953
954       A match context is created, copied, and freed by  the  following  func-
955       tions:
956
957       pcre2_match_context *pcre2_match_context_create(
958         pcre2_general_context *gcontext);
959
960       pcre2_match_context *pcre2_match_context_copy(
961         pcre2_match_context *mcontext);
962
963       void pcre2_match_context_free(pcre2_match_context *mcontext);
964
965       A  match  context  is  created  with default values for its parameters.
966       These can be changed by calling the following functions, which return 0
967       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
968
969       int pcre2_set_callout(pcre2_match_context *mcontext,
970         int (*callout_function)(pcre2_callout_block *, void *),
971         void *callout_data);
972
973       This  sets  up a callout function for PCRE2 to call at specified points
974       during a matching operation. Details are given in the pcre2callout doc-
975       umentation.
976
977       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
978         int (*callout_function)(pcre2_substitute_callout_block *, void *),
979         void *callout_data);
980
981       This  sets up a callout function for PCRE2 to call after each substitu-
982       tion made by pcre2_substitute(). Details are given in the section enti-
983       tled "Creating a new string with substitutions" below.
984
985       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
986         PCRE2_SIZE value);
987
988       The  offset_limit parameter limits how far an unanchored search can ad-
989       vance in the subject string. The  default  value  is  PCRE2_UNSET.  The
990       pcre2_match()  and  pcre2_dfa_match()  functions return PCRE2_ERROR_NO-
991       MATCH if a match with a starting point before or at the given offset is
992       not found. The pcre2_substitute() function makes no more substitutions.
993
994       For  example,  if the pattern /abc/ is matched against "123abc" with an
995       offset limit less than 3, the result is  PCRE2_ERROR_NOMATCH.  A  match
996       can  never  be  found  if  the  startoffset  argument of pcre2_match(),
997       pcre2_dfa_match(), or pcre2_substitute() is  greater  than  the  offset
998       limit set in the match context.
999
1000       When  using  this facility, you must set the PCRE2_USE_OFFSET_LIMIT op-
1001       tion when calling pcre2_compile() so that when JIT is in use, different
       code  can  be  compiled. If a match is started with a non-default offset
       limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
1004
1005       The offset limit facility can be used to track progress when  searching
1006       large  subject  strings or to limit the extent of global substitutions.
1007       See also the PCRE2_FIRSTLINE option, which requires a  match  to  start
1008       before  or  at  the first newline that follows the start of matching in
1009       the subject. If this is set with an offset limit, a match must occur in
1010       the first line and also within the offset limit. In other words, which-
1011       ever limit comes first is used.
1012
1013       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
1014         uint32_t value);
1015
1016       The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
1017       the  maximum  amount  of heap memory that pcre2_match() may use to hold
1018       backtracking information when running an interpretive match. This limit
1019       also applies to pcre2_dfa_match(), which may use the heap when process-
1020       ing patterns with a lot of nested pattern recursion or  lookarounds  or
1021       atomic groups. This limit does not apply to matching with the JIT opti-
1022       mization, which has  its  own  memory  control  arrangements  (see  the
1023       pcre2jit  documentation for more details). If the limit is reached, the
1024       negative error code  PCRE2_ERROR_HEAPLIMIT  is  returned.  The  default
1025       limit  can be set when PCRE2 is built; if it is not, the default is set
1026       very large and is essentially "unlimited".
1027
1028       A value for the heap limit may also be supplied by an item at the start
1029       of a pattern of the form
1030
1031         (*LIMIT_HEAP=ddd)
1032
1033       where  ddd  is a decimal number. However, such a setting is ignored un-
1034       less ddd is less than the limit set by the caller of pcre2_match()  or,
1035       if no such limit is set, less than the default.
1036
1037       The  pcre2_match() function starts out using a 20KiB vector on the sys-
1038       tem stack for recording backtracking points. The more nested backtrack-
1039       ing  points  there  are (that is, the deeper the search tree), the more
1040       memory is needed.  Heap memory is used only if the  initial  vector  is
1041       too small. If the heap limit is set to a value less than 21 (in partic-
1042       ular, zero) no heap memory will be used. In this  case,  only  patterns
1043       that  do not have a lot of nested backtracking can be successfully pro-
1044       cessed.
1045
1046       Similarly, for pcre2_dfa_match(), a vector on the system stack is  used
1047       when  processing pattern recursions, lookarounds, or atomic groups, and
1048       only if this is not big enough is heap memory used. In this case,  too,
1049       setting a value of zero disables the use of the heap.
1050
1051       int pcre2_set_match_limit(pcre2_match_context *mcontext,
1052         uint32_t value);
1053
1054       The match_limit parameter provides a means of preventing PCRE2 from us-
1055       ing up too many computing resources when processing patterns  that  are
1056       not going to match, but which have a very large number of possibilities
1057       in their search trees. The classic  example  is  a  pattern  that  uses
1058       nested unlimited repeats.
1059
1060       There  is an internal counter in pcre2_match() that is incremented each
1061       time round its main matching loop. If  this  value  reaches  the  match
1062       limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
1063       This has the effect of limiting the amount  of  backtracking  that  can
1064       take place. For patterns that are not anchored, the count restarts from
1065       zero for each position in the subject string. This limit  also  applies
1066       to pcre2_dfa_match(), though the counting is done in a different way.
1067
1068       When  pcre2_match() is called with a pattern that was successfully pro-
1069       cessed by pcre2_jit_compile(), the way in which matching is executed is
1070       entirely  different. However, there is still the possibility of runaway
1071       matching that goes on for a very long  time,  and  so  the  match_limit
1072       value  is  also used in this case (but in a different way) to limit how
1073       long the matching can continue.
1074
       The default value for the limit can be set when PCRE2 is built; if it
       is not, the default is 10 million, which handles all but the most
       extreme cases. A value for the match limit may also be supplied by an
       item at the start of a pattern of the form
1079
1080         (*LIMIT_MATCH=ddd)
1081
1082       where  ddd  is a decimal number. However, such a setting is ignored un-
1083       less ddd is less than the limit set by the caller of  pcre2_match()  or
1084       pcre2_dfa_match() or, if no such limit is set, less than the default.
1085
1086       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
1087         uint32_t value);
1088
1089       This   parameter   limits   the   depth   of   nested  backtracking  in
1090       pcre2_match().  Each time a nested backtracking point is passed, a  new
1091       memory "frame" is used to remember the state of matching at that point.
1092       Thus, this parameter indirectly limits the amount  of  memory  that  is
1093       used  in  a match. However, because the size of each memory "frame" de-
1094       pends on the number of capturing parentheses, the actual  memory  limit
1095       varies  from pattern to pattern. This limit was more useful in versions
1096       before 10.30, where function recursion was used for backtracking.
1097
1098       The depth limit is not relevant, and is ignored, when matching is  done
1099       using JIT compiled code. However, it is supported by pcre2_dfa_match(),
1100       which uses it to limit the depth of nested internal recursive  function
1101       calls  that implement atomic groups, lookaround assertions, and pattern
1102       recursions. This limits, indirectly, the amount of system stack that is
1103       used.  It  was  more useful in versions before 10.32, when stack memory
1104       was used for local workspace vectors for recursive function calls. From
1105       version  10.32,  only local variables are allocated on the stack and as
1106       each call uses only a few hundred bytes, even a small stack can support
1107       quite a lot of recursion.
1108
1109       If  the depth of internal recursive function calls is great enough, lo-
1110       cal workspace vectors are allocated on the heap from version 10.32  on-
1111       wards,  so  the  depth  limit also indirectly limits the amount of heap
1112       memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
1113       matched  to a very long string using pcre2_dfa_match(), can use a great
1114       deal of memory. However, it is probably better to limit heap usage  di-
1115       rectly by calling pcre2_set_heap_limit().
1116
1117       The  default  value for the depth limit can be set when PCRE2 is built;
1118       if it is not, the default is set to the same value as the  default  for
1119       the   match   limit.   If  the  limit  is  exceeded,  pcre2_match()  or
1120       pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
1121       limit  may also be supplied by an item at the start of a pattern of the
1122       form
1123
1124         (*LIMIT_DEPTH=ddd)
1125
1126       where ddd is a decimal number. However, such a setting is  ignored  un-
1127       less  ddd  is less than the limit set by the caller of pcre2_match() or
1128       pcre2_dfa_match() or, if no such limit is set, less than the default.
1129
1130
1131CHECKING BUILD-TIME OPTIONS
1132
1133       int pcre2_config(uint32_t what, void *where);
1134
1135       The function pcre2_config() makes it possible for  a  PCRE2  client  to
1136       find  the  value  of  certain  configuration parameters and to discover
1137       which optional features have been compiled into the PCRE2 library.  The
1138       pcre2build documentation has more details about these features.
1139
1140       The  first  argument  for pcre2_config() specifies which information is
       required. The second argument is a pointer to memory into which the
       information is placed. If NULL is passed, the function returns the
       amount of memory that is needed for the requested information. For
       calls that return numerical values, this amount is in bytes; when
       requesting these values, where should point to appropriately aligned
       memory. For calls that return strings, the required length is given in
       code units, not counting the terminating zero.
1148
1149       When requesting information, the returned value from pcre2_config()  is
1150       non-negative  on success, or the negative error code PCRE2_ERROR_BADOP-
1151       TION if the value in the first argument is not recognized. The  follow-
1152       ing information is available:
1153
1154         PCRE2_CONFIG_BSR
1155
1156       The  output  is a uint32_t integer whose value indicates what character
1157       sequences the \R  escape  sequence  matches  by  default.  A  value  of
1158       PCRE2_BSR_UNICODE  means  that  \R  matches any Unicode line ending se-
1159       quence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF,
1160       or CRLF. The default can be overridden when a pattern is compiled.
1161
1162         PCRE2_CONFIG_COMPILED_WIDTHS
1163
1164       The  output  is a uint32_t integer whose lower bits indicate which code
1165       unit widths were selected when PCRE2 was  built.  The  1-bit  indicates
1166       8-bit  support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
1167       port, respectively.
1168
1169         PCRE2_CONFIG_DEPTHLIMIT
1170
1171       The output is a uint32_t integer that gives the default limit  for  the
1172       depth  of  nested  backtracking in pcre2_match() or the depth of nested
1173       recursions, lookarounds, and atomic groups in  pcre2_dfa_match().  Fur-
1174       ther details are given with pcre2_set_depth_limit() above.
1175
1176         PCRE2_CONFIG_HEAPLIMIT
1177
1178       The  output is a uint32_t integer that gives, in kibibytes, the default
1179       limit  for  the  amount  of  heap  memory  used  by  pcre2_match()   or
1180       pcre2_dfa_match().      Further      details     are     given     with
1181       pcre2_set_heap_limit() above.
1182
1183         PCRE2_CONFIG_JIT
1184
1185       The output is a uint32_t integer that is set  to  one  if  support  for
1186       just-in-time compiling is available; otherwise it is set to zero.
1187
1188         PCRE2_CONFIG_JITTARGET
1189
1190       The  where  argument  should point to a buffer that is at least 48 code
1191       units long.  (The  exact  length  required  can  be  found  by  calling
1192       pcre2_config()  with  where  set  to NULL.) The buffer is filled with a
1193       string that contains the name of the architecture  for  which  the  JIT
1194       compiler  is  configured,  for  example "x86 32bit (little endian + un-
1195       aligned)". If JIT support is not  available,  PCRE2_ERROR_BADOPTION  is
1196       returned,  otherwise the number of code units used is returned. This is
1197       the length of the string, plus one unit for the terminating zero.
1198
1199         PCRE2_CONFIG_LINKSIZE
1200
1201       The output is a uint32_t integer that contains the number of bytes used
1202       for  internal  linkage  in  compiled regular expressions. When PCRE2 is
1203       configured, the value can be set to 2, 3, or 4, with the default  being
1204       2.  This is the value that is returned by pcre2_config(). However, when
1205       the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
1206       when  the  32-bit  library  is compiled, internal linkages always use 4
1207       bytes, so the configured value is not relevant.
1208
1209       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1210       for  all but the most massive patterns, since it allows the size of the
1211       compiled pattern to be up to 65535  code  units.  Larger  values  allow
1212       larger  regular  expressions to be compiled by those two libraries, but
1213       at the expense of slower matching.
1214
1215         PCRE2_CONFIG_MATCHLIMIT
1216
1217       The output is a uint32_t integer that gives the default match limit for
1218       pcre2_match().  Further  details are given with pcre2_set_match_limit()
1219       above.
1220
1221         PCRE2_CONFIG_NEWLINE
1222
1223       The output is a uint32_t integer  whose  value  specifies  the  default
1224       character  sequence that is recognized as meaning "newline". The values
1225       are:
1226
1227         PCRE2_NEWLINE_CR       Carriage return (CR)
1228         PCRE2_NEWLINE_LF       Linefeed (LF)
1229         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
1230         PCRE2_NEWLINE_ANY      Any Unicode line ending
1231         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
1232         PCRE2_NEWLINE_NUL      The NUL character (binary zero)
1233
1234       The default should normally correspond to  the  standard  sequence  for
1235       your operating system.
1236
1237         PCRE2_CONFIG_NEVER_BACKSLASH_C
1238
1239       The  output  is  a uint32_t integer that is set to one if the use of \C
1240       was permanently disabled when PCRE2 was built; otherwise it is  set  to
1241       zero.
1242
1243         PCRE2_CONFIG_PARENSLIMIT
1244
1245       The  output is a uint32_t integer that gives the maximum depth of nest-
1246       ing of parentheses (of any kind) in a pattern. This limit is imposed to
1247       cap  the  amount of system stack used when a pattern is compiled. It is
1248       specified when PCRE2 is built; the default is 250. This limit does  not
1249       take into account the stack that may already be used by the calling ap-
1250       plication.  For  finer  control  over  compilation  stack  usage,   see
1251       pcre2_set_compile_recursion_guard().
1252
1253         PCRE2_CONFIG_STACKRECURSE
1254
1255       This parameter is obsolete and should not be used in new code. The out-
1256       put is a uint32_t integer that is always set to zero.
1257
1258         PCRE2_CONFIG_TABLES_LENGTH
1259
1260       The output is a uint32_t integer that gives the length of PCRE2's char-
1261       acter  processing  tables in bytes. For details of these tables see the
1262       section on locale support below.
1263
1264         PCRE2_CONFIG_UNICODE_VERSION
1265
1266       The where argument should point to a buffer that is at  least  24  code
1267       units  long.  (The  exact  length  required  can  be  found  by calling
1268       pcre2_config() with where set to NULL.)  If  PCRE2  has  been  compiled
1269       without  Unicode  support,  the buffer is filled with the text "Unicode
1270       not supported". Otherwise, the Unicode  version  string  (for  example,
1271       "8.0.0")  is  inserted. The number of code units used is returned. This
1272       is the length of the string plus one unit for the terminating zero.
1273
1274         PCRE2_CONFIG_UNICODE
1275
1276       The output is a uint32_t integer that is set to one if Unicode  support
1277       is  available; otherwise it is set to zero. Unicode support implies UTF
1278       support.
1279
1280         PCRE2_CONFIG_VERSION
1281
1282       The where argument should point to a buffer that is at  least  24  code
1283       units  long.  (The  exact  length  required  can  be  found  by calling
1284       pcre2_config() with where set to NULL.) The buffer is filled  with  the
1285       PCRE2 version string, zero-terminated. The number of code units used is
1286       returned. This is the length of the string plus one unit for the termi-
1287       nating zero.
1288
1289
1290COMPILING A PATTERN
1291
1292       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
1293         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
1294         pcre2_compile_context *ccontext);
1295
1296       void pcre2_code_free(pcre2_code *code);
1297
1298       pcre2_code *pcre2_code_copy(const pcre2_code *code);
1299
1300       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
1301
1302       The  pcre2_compile() function compiles a pattern into an internal form.
1303       The pattern is defined by a pointer to a string of  code  units  and  a
1304       length  (in  code units). If the pattern is zero-terminated, the length
1305       can be specified  as  PCRE2_ZERO_TERMINATED.  The  function  returns  a
1306       pointer to a block of memory that contains the compiled pattern and re-
1307       lated data, or NULL if an error occurred.
1308
1309       If the compile context argument ccontext is NULL, memory for  the  com-
1310       piled  pattern  is  obtained  by calling malloc(). Otherwise, it is ob-
1311       tained from the same memory function that was used for the compile con-
1312       text. The caller must free the memory by calling pcre2_code_free() when
1313       it is no longer needed.  If pcre2_code_free() is called with a NULL ar-
1314       gument, it returns immediately, without doing anything.
1315
1316       The function pcre2_code_copy() makes a copy of the compiled code in new
1317       memory, using the same memory allocator as was used for  the  original.
1318       However,  if  the  code has been processed by the JIT compiler (see be-
1319       low), the JIT information cannot be copied (because it is  position-de-
1320       pendent).   The  new copy can initially be used only for non-JIT match-
1321       ing, though it can be passed to  pcre2_jit_compile()  if  required.  If
1322       pcre2_code_copy() is called with a NULL argument, it returns NULL.
1323
1324       The pcre2_code_copy() function provides a way for individual threads in
1325       a multithreaded application to acquire a private copy  of  shared  com-
1326       piled  code.   However, it does not make a copy of the character tables
1327       used by the compiled pattern; the new pattern code points to  the  same
1328       tables  as  the original code.  (See "Locale Support" below for details
1329       of these character tables.) In many applications the  same  tables  are
1330       used  throughout, so this behaviour is appropriate. Nevertheless, there
1331       are occasions when copies of both a compiled pattern and the relevant
1332       tables are needed. The pcre2_code_copy_with_tables() function provides
1333       this facility. Copies of both the code and the tables are made, with
1334       the new code pointing to the new tables. The memory for the new tables
1335       is automatically freed when pcre2_code_free() is called for the new
1336       copy of the compiled code. If pcre2_code_copy_with_tables() is called
1337       with a NULL argument, it returns NULL.
1338
1339       NOTE: When one of the matching functions is  called,  pointers  to  the
1340       compiled pattern and the subject string are set in the match data block
1341       so that they can be referenced by the  substring  extraction  functions
1342       after  a  successful match.  After running a match, you must not free a
1343       compiled pattern or a subject string until after all operations on  the
1344       match  data  block have taken place, unless, in the case of the subject
1345       string, you have used the PCRE2_COPY_MATCHED_SUBJECT option,  which  is
1346       described  in  the section entitled "Option bits for pcre2_match()" be-
1347       low.
1348
1349       The options argument for pcre2_compile() contains various bit  settings
1350       that  affect the compilation. It should be zero if none of them are re-
1351       quired. The available options are described below.  Some  of  them  (in
1352       particular,  those  that  are  compatible with Perl, but some others as
1353       well) can also be set and unset from within the pattern  (see  the  de-
1354       tailed description in the pcre2pattern documentation).
1355
1356       For those options that can be different in different parts of the
1357       pattern, the options argument specifies their settings at the start of
1358       compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
1359       PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
1360       as at compile time.
1361
1362       Some  additional  options and less frequently required compile-time pa-
1363       rameters (for example, the newline setting) can be provided in  a  com-
1364       pile context (as described above).
1365
1366       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1367       diately. Otherwise, the variables to which these point are  set  to  an
1368       error code and an offset (number of code units) within the pattern, re-
1369       spectively, when pcre2_compile() returns NULL because a compilation er-
1370       ror  has  occurred. The values are not defined when compilation is suc-
1371       cessful and pcre2_compile() returns a non-NULL value.
1372
1373       There are nearly 100 positive error codes that pcre2_compile() may  re-
1374       turn  if it finds an error in the pattern. There are also some negative
1375       error codes that are used for invalid UTF strings when validity  check-
1376       ing  is  in  force.  These  are  the same as given by pcre2_match() and
1377       pcre2_dfa_match(), and are described in the pcre2unicode documentation.
1378       There  is  no  separate documentation for the positive error codes, be-
1379       cause the textual error messages  that  are  obtained  by  calling  the
1380       pcre2_get_error_message() function (see "Obtaining a textual error mes-
1381       sage" below) should be  self-explanatory.  Macro  names  starting  with
1382       PCRE2_ERROR_  are defined for both positive and negative error codes in
1383       pcre2.h.
1384
1385       The value returned in erroroffset is an indication of where in the pat-
1386       tern  the  error  occurred. It is not necessarily the furthest point in
1387       the pattern that was read. For example, after the error "lookbehind as-
1388       sertion  is  not fixed length", the error offset points to the start of
1389       the failing assertion. For an invalid UTF-8 or UTF-16 string, the  off-
1390       set is that of the first code unit of the failing character.
1391
1392       Some  errors are not detected until the whole pattern has been scanned;
1393       in these cases, the offset passed back is the length  of  the  pattern.
1394       Note  that  the  offset is in code units, not characters, even in a UTF
1395       mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1396       acter.
1397
1398       This  code  fragment shows a typical straightforward call to pcre2_com-
1399       pile():
1400
1401         pcre2_code *re;
1402         PCRE2_SIZE erroffset;
1403         int errorcode;
1404         re = pcre2_compile(
1405           (PCRE2_SPTR)"^A.*Z",    /* the pattern */
1406           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
1407           0,                      /* default options */
1408           &errorcode,             /* for error code */
1409           &erroffset,             /* for error offset */
1410           NULL);                  /* no compile context */
1411
1412
1413   Main compile options
1414
1415       The following names for option bits are defined in the  pcre2.h  header
1416       file:
1417
1418         PCRE2_ANCHORED
1419
1420       If this bit is set, the pattern is forced to be "anchored", that is, it
1421       is constrained to match only at the first matching point in the  string
1422       that  is being searched (the "subject string"). This effect can also be
1423       achieved by appropriate constructs in the pattern itself, which is  the
1424       only way to do it in Perl.
1425
1426         PCRE2_ALLOW_EMPTY_CLASS
1427
1428       By  default, for compatibility with Perl, a closing square bracket that
1429       immediately follows an opening one is treated as a data  character  for
1430       the  class.  When  PCRE2_ALLOW_EMPTY_CLASS  is  set,  it terminates the
1431       class, which therefore contains no characters and so can never match.
1432
1433         PCRE2_ALT_BSUX
1434
1435       This option requests alternative handling of three escape sequences,
1436       which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
1437       When it is set:
1438
1439       (1) \U matches an upper case "U" character; by default \U causes a com-
1440       pile time error (Perl uses \U to upper case subsequent characters).
1441
1442       (2) \u matches a lower case "u" character unless it is followed by four
1443       hexadecimal digits, in which case the hexadecimal  number  defines  the
1444       code  point  to match. By default, \u causes a compile time error (Perl
1445       uses it to upper case the following character).
1446
1447       (3) \x matches a lower case "x" character unless it is followed by  two
1448       hexadecimal  digits,  in  which case the hexadecimal number defines the
1449       code point to match. By default, as in Perl, a  hexadecimal  number  is
1450       always expected after \x, but it may have zero, one, or two digits (so,
1451       for example, \xz matches a binary zero character followed by z).
1452
1453       ECMAScript 6 added further functionality to \u. This can be accessed
1454       using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile
1455       options" below). Note that this alternative escape handling applies
1456       only to patterns. Neither of these options affects the processing of
1457       replacement strings passed to pcre2_substitute().
1458
1459         PCRE2_ALT_CIRCUMFLEX
1460
1461       In  multiline  mode  (when  PCRE2_MULTILINE  is  set),  the  circumflex
1462       metacharacter  matches at the start of the subject (unless PCRE2_NOTBOL
1463       is set), and also after any internal  newline.  However,  it  does  not
1464       match after a newline at the end of the subject, for compatibility with
1465       Perl. If you want a multiline circumflex also to match after  a  termi-
1466       nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
1467
1468         PCRE2_ALT_VERBNAMES
1469
1470       By  default, for compatibility with Perl, the name in any verb sequence
1471       such as (*MARK:NAME) is any sequence of characters that  does  not  in-
1472       clude  a closing parenthesis. The name is not processed in any way, and
1473       it is not possible to include a closing parenthesis in the  name.  How-
1474       ever,  if  the PCRE2_ALT_VERBNAMES option is set, normal backslash pro-
1475       cessing is applied to verb names and only an unescaped  closing  paren-
1476       thesis  terminates the name. A closing parenthesis can be included in a
1477       name either as \) or between  \Q  and  \E.  If  the  PCRE2_EXTENDED  or
1478       PCRE2_EXTENDED_MORE  option  is set with PCRE2_ALT_VERBNAMES, unescaped
1479       whitespace in verb names is skipped and #-comments are recognized,  ex-
1480       actly as in the rest of the pattern.
1481
1482         PCRE2_AUTO_CALLOUT
1483
1484       If  this  bit  is  set,  pcre2_compile()  automatically inserts callout
1485       items, all with number 255, before each pattern  item,  except  immedi-
1486       ately  before  or after an explicit callout in the pattern. For discus-
1487       sion of the callout facility, see the pcre2callout documentation.
1488
1489         PCRE2_CASELESS
1490
1491       If this bit is set, letters in the pattern match both upper  and  lower
1492       case  letters in the subject. It is equivalent to Perl's /i option, and
1493       it can be changed within a pattern by a (?i) option setting. If  either
1494       PCRE2_UTF  or  PCRE2_UCP  is  set,  Unicode properties are used for all
1495       characters with more than one other case, and for all characters  whose
1496       code  points  are  greater  than  U+007F. Note that there are two ASCII
1497       characters, K and S, that, in addition to their lower case ASCII equiv-
1498       alents,  are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1499       S) respectively. For lower valued characters with only one other  case,
1500       a  lookup table is used for speed. When neither PCRE2_UTF nor PCRE2_UCP
1501       is set, a lookup table is used for all code points less than  256,  and
1502       higher  code  points  (available  only  in  16-bit  or 32-bit mode) are
1503       treated as not having another case.
1504
1505         PCRE2_DOLLAR_ENDONLY
1506
1507       If this bit is set, a dollar metacharacter in the pattern matches  only
1508       at  the  end  of the subject string. Without this option, a dollar also
1509       matches immediately before a newline at the end of the string (but  not
1510       before  any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
1511       if PCRE2_MULTILINE is set. There is no equivalent  to  this  option  in
1512       Perl, and no way to set it within a pattern.
1513
1514         PCRE2_DOTALL
1515
1516       If  this  bit  is  set,  a dot metacharacter in the pattern matches any
1517       character, including one that indicates a  newline.  However,  it  only
1518       ever matches one character, even if newlines are coded as CRLF. Without
1519       this option, a dot does not match when the current position in the sub-
1520       ject  is  at  a newline. This option is equivalent to Perl's /s option,
1521       and it can be changed within a pattern by a (?s) option setting. A neg-
1522       ative  class such as [^a] always matches newline characters, and the \N
1523       escape sequence always matches a non-newline character, independent  of
1524       the setting of PCRE2_DOTALL.
1525
1526         PCRE2_DUPNAMES
1527
1528       If  this  bit is set, names used to identify capture groups need not be
1529       unique.  This can be helpful for certain types of pattern  when  it  is
1530       known  that  only  one instance of the named group can ever be matched.
1531       There are more details of named capture  groups  below;  see  also  the
1532       pcre2pattern documentation.
1533
1534         PCRE2_ENDANCHORED
1535
1536       If  this  bit is set, the end of any pattern match must be right at the
1537       end of the string being searched (the "subject string"). If the pattern
1538       match succeeds by reaching (*ACCEPT), but does not reach the end of the
1539       subject, the match fails at the current starting point. For  unanchored
1540       patterns,  a  new  match is then tried at the next starting point. How-
1541       ever, if the match succeeds by reaching the end of the pattern, but not
1542       the  end  of  the subject, backtracking occurs and an alternative match
1543       may be found. Consider these two patterns:
1544
1545         .(*ACCEPT)|..
1546         .|..
1547
1548       If matched against "abc" with PCRE2_ENDANCHORED set, the first  matches
1549       "c"  whereas  the  second matches "bc". The effect of PCRE2_ENDANCHORED
1550       can also be achieved by appropriate constructs in the  pattern  itself,
1551       which is the only way to do it in Perl.
1552
1553       For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
1554       to the first (that is, the  longest)  matched  string.  Other  parallel
1555       matches,  which are necessarily substrings of the first one, must obvi-
1556       ously end before the end of the subject.
1557
1558         PCRE2_EXTENDED
1559
1560       If this bit is set, most white space characters in the pattern are  to-
1561       tally ignored except when escaped or inside a character class. However,
1562       white space is not allowed within sequences such as (?> that  introduce
1563       various  parenthesized groups, nor within numerical quantifiers such as
1564       {1,3}. Ignorable white space is permitted between an item and a follow-
1565       ing  quantifier  and  between a quantifier and a following + that indi-
1566       cates possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option,
1567       and it can be changed within a pattern by a (?x) option setting.
1568
1569       When  PCRE2  is compiled without Unicode support, PCRE2_EXTENDED recog-
1570       nizes as white space only those characters with code points  less  than
1571       256 that are flagged as white space in its low-character table. The ta-
1572       ble is normally created by pcre2_maketables(), which uses the isspace()
1573       function  to identify space characters. In most ASCII environments, the
1574       relevant characters are those with code  points  0x0009  (tab),  0x000A
1575       (linefeed),  0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage
1576       return), and 0x0020 (space).
1577
1578       When PCRE2 is compiled with Unicode support, in addition to these char-
1579       acters,  five  more Unicode "Pattern White Space" characters are recog-
1580       nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1581       right  mark), U+200F (right-to-left mark), U+2028 (line separator), and
1582       U+2029 (paragraph separator). This set of characters  is  the  same  as
1583       recognized  by  Perl's /x option. Note that the horizontal and vertical
1584       space characters that are matched by the \h and \v escapes in  patterns
1585       are a much bigger set.
1586
1587       As  well as ignoring most white space, PCRE2_EXTENDED also causes char-
1588       acters between an unescaped # outside a character class  and  the  next
1589       newline,  inclusive,  to be ignored, which makes it possible to include
1590       comments inside complicated patterns. Note that the end of this type of
1591       comment  is a literal newline sequence in the pattern; escape sequences
1592       that happen to represent a newline do not count.
1593
1594       Which characters are interpreted as newlines can be specified by a set-
1595       ting  in  the compile context that is passed to pcre2_compile() or by a
1596       special sequence at the start of the pattern, as described in the  sec-
1597       tion  entitled "Newline conventions" in the pcre2pattern documentation.
1598       A default is defined when PCRE2 is built.
1599
1600         PCRE2_EXTENDED_MORE
1601
1602       This option has the effect of PCRE2_EXTENDED,  but,  in  addition,  un-
1603       escaped  space and horizontal tab characters are ignored inside a char-
1604       acter class. Note: only these two characters are ignored, not the  full
1605       set  of pattern white space characters that are ignored outside a char-
1606       acter class. PCRE2_EXTENDED_MORE is equivalent to  Perl's  /xx  option,
1607       and it can be changed within a pattern by a (?xx) option setting.
1608
1609         PCRE2_FIRSTLINE
1610
1611       If this option is set, the start of an unanchored pattern match must be
1612       before or at the first newline in  the  subject  string  following  the
1613       start  of  matching, though the matched text may continue over the new-
1614       line. If startoffset is non-zero, the limiting newline is not necessar-
1615       ily  the  first  newline  in  the  subject. For example, if the subject
1616       string is "abc\nxyz" (where \n represents a single-character newline) a
1617       pattern  match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
1618       greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a  more
1619       general  limiting  facility.  If  PCRE2_FIRSTLINE is set with an offset
1620       limit, a match must occur in the first line and also within the  offset
1621       limit. In other words, whichever limit comes first is used.
1622
1623         PCRE2_LITERAL
1624
1625       If this option is set, all meta-characters in the pattern are disabled,
1626       and it is treated as a literal string. Matching literal strings with  a
1627       regular expression engine is not the most efficient way of doing it. If
1628       you are doing a lot of literal matching and  are  worried  about  effi-
1629       ciency, you should consider using other approaches. The only other main
1630       options  that  are  allowed  with  PCRE2_LITERAL  are:  PCRE2_ANCHORED,
1631       PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
1632       PCRE2_MATCH_INVALID_UTF,  PCRE2_NO_START_OPTIMIZE,  PCRE2_NO_UTF_CHECK,
1633       PCRE2_UTF,  and  PCRE2_USE_OFFSET_LIMIT.  The  extra  options PCRE2_EX-
1634       TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other
1635       options cause an error.
1636
1637         PCRE2_MATCH_INVALID_UTF
1638
1639       This  option  forces PCRE2_UTF (see below) and also enables support for
1640       matching by pcre2_match() in subject strings that contain  invalid  UTF
1641       sequences.   This  facility  is not supported for DFA matching. For de-
1642       tails, see the pcre2unicode documentation.
1643
1644         PCRE2_MATCH_UNSET_BACKREF
1645
1646       If this option is set,  a  backreference  to  an  unset  capture  group
1647       matches  an  empty  string (by default this causes the current matching
1648       alternative to fail).  A pattern such as (\1)(a) succeeds when this op-
1649       tion  is  set  (assuming it can find an "a" in the subject), whereas it
1650       fails by default, for Perl compatibility.  Setting  this  option  makes
1651       PCRE2 behave more like ECMAScript (aka JavaScript).
1652
1653         PCRE2_MULTILINE
1654
1655       By  default,  for  the purposes of matching "start of line" and "end of
1656       line", PCRE2 treats the subject string as consisting of a  single  line
1657       of  characters,  even  if  it actually contains newlines. The "start of
1658       line" metacharacter (^) matches only at the start of  the  string,  and
1659       the  "end  of  line"  metacharacter  ($) matches only at the end of the
1660       string, or before a terminating newline (except  when  PCRE2_DOLLAR_EN-
1661       DONLY is set). Note, however, that unless PCRE2_DOTALL is set, the "any
1662       character" metacharacter (.) does not match at a newline.  This  behav-
1663       iour (for ^, $, and dot) is the same as Perl.
1664
1665       When PCRE2_MULTILINE is set, the "start of line" and "end of line"
1666       constructs match immediately following or immediately  before  internal
1667       newlines  in  the  subject string, respectively, as well as at the very
1668       start and end. This is equivalent to Perl's /m option, and  it  can  be
1669       changed within a pattern by a (?m) option setting. Note that the "start
1670       of line" metacharacter does not match after a newline at the end of the
1671       subject,  for compatibility with Perl.  However, you can change this by
1672       setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in  a
1673       subject  string,  or  no  occurrences  of  ^ or $ in a pattern, setting
1674       PCRE2_MULTILINE has no effect.
1675
1676         PCRE2_NEVER_BACKSLASH_C
1677
1678       This option locks out the use of \C in the pattern that is  being  com-
1679       piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
1680       UTF-16 modes, because it may leave the current matching  point  in  the
1681       middle of a multi-code-unit character. This option may be useful in ap-
1682       plications that process patterns from external sources. Note that there
1683       is also a build-time option that permanently locks out the use of \C.
1684
1685         PCRE2_NEVER_UCP
1686
1687       This  option  locks  out the use of Unicode properties for handling \B,
1688       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
1689       described  for  the  PCRE2_UCP option below. In particular, it prevents
1690       the creator of the pattern from enabling this facility by starting  the
1691       pattern  with  (*UCP).  This  option may be useful in applications that
1692       process patterns from external sources. The combination of PCRE2_UCP
1693       and PCRE2_NEVER_UCP causes an error.
1694
1695         PCRE2_NEVER_UTF
1696
1697       This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
1698       or UTF-32, depending on which library is in use. In particular, it pre-
1699       vents  the  creator of the pattern from switching to UTF interpretation
1700       by starting the pattern with (*UTF). This option may be useful  in  ap-
1701       plications that process patterns from external sources. The combination
1702       of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
1703
1704         PCRE2_NO_AUTO_CAPTURE
1705
1706       If this option is set, it disables the use of numbered capturing paren-
1707       theses  in the pattern. Any opening parenthesis that is not followed by
1708       ? behaves as if it were followed by ?: but named parentheses can  still
1709       be used for capturing (and they acquire numbers in the usual way). This
1710       is the same as Perl's /n option.  Note that, when this option  is  set,
1711       references  to  capture  groups (backreferences or recursion/subroutine
1712       calls) may only refer to named groups, though the reference can  be  by
1713       name or by number.
1714
1715         PCRE2_NO_AUTO_POSSESS
1716
1717       If this option is set, it disables "auto-possessification", which is an
1718       optimization that, for example, turns a+b into a++b in order  to  avoid
1719       backtracks  into  a+ that can never be successful. However, if callouts
1720       are in use, auto-possessification means that some  callouts  are  never
1721       taken. You can set this option if you want the matching functions to do
1722       a full unoptimized search and run all the callouts, but  it  is  mainly
1723       provided for testing purposes.
1724
1725         PCRE2_NO_DOTSTAR_ANCHOR
1726
1727       If this option is set, it disables an optimization that is applied when
1728       .* is the first significant item in a top-level branch  of  a  pattern,
1729       and  all  the  other branches also start with .* or with \A or \G or ^.
1730       The optimization is automatically disabled for .* if it  is  inside  an
1731       atomic group or a capture group that is the subject of a backreference,
1732       or if the pattern contains (*PRUNE) or (*SKIP). When  the  optimization
1733       is   not   disabled,  such  a  pattern  is  automatically  anchored  if
1734       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
1735       for  any  ^ items. Otherwise, the fact that any match must start either
1736       at the start of the subject or following a newline is remembered.  Like
1737       other optimizations, this can cause callouts to be skipped.
1738
1739         PCRE2_NO_START_OPTIMIZE
1740
1741       This  is  an  option whose main effect is at matching time. It does not
1742       change what pcre2_compile() generates, but it does affect the output of
1743       the JIT compiler.
1744
1745       There  are  a  number of optimizations that may occur at the start of a
1746       match, in order to speed up the process. For example, if  it  is  known
1747       that  an  unanchored  match must start with a specific code unit value,
1748       the matching code searches the subject for that value, and fails  imme-
1749       diately  if it cannot find it, without actually running the main match-
1750       ing function. This means that a special item such as (*COMMIT)  at  the
1751       start  of  a  pattern is not considered until after a suitable starting
1752       point for the match has been found.  Also,  when  callouts  or  (*MARK)
1753       items  are  in use, these "start-up" optimizations can cause them to be
1754       skipped if the pattern is never actually used. The  start-up  optimiza-
1755       tions  are  in effect a pre-scan of the subject that takes place before
1756       the pattern is run.
1757
1758       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1759       possibly  causing  performance  to  suffer,  but ensuring that in cases
1760       where the result is "no match", the callouts do occur, and  that  items
1761       such as (*COMMIT) and (*MARK) are considered at every possible starting
1762       position in the subject string.
1763
1764       Setting PCRE2_NO_START_OPTIMIZE may change the outcome  of  a  matching
1765       operation.  Consider the pattern
1766
1767         (*COMMIT)ABC
1768
1769       When  this  is compiled, PCRE2 records the fact that a match must start
1770       with the character "A". Suppose the subject  string  is  "DEFABC".  The
1771       start-up  optimization  scans along the subject, finds "A" and runs the
1772       first match attempt from there. The (*COMMIT) item means that the  pat-
1773       tern  must  match the current starting position, which in this case, it
1774       does. However, if the same match is  run  with  PCRE2_NO_START_OPTIMIZE
1775       set,  the  initial  scan  along the subject string does not happen. The
1776       first match attempt is run starting  from  "D"  and  when  this  fails,
1777       (*COMMIT)  prevents any further matches being tried, so the overall re-
1778       sult is "no match".
1779
1780       Another start-up optimization makes use of  a  minimum  length  for  a
1781       matching subject, which is recorded when possible. Consider the pattern
1782
1783         (*MARK:1)B(*MARK:2)(X|Y)
1784
1785       The  minimum  length  for  a match is two characters. If the subject is
1786       "XXBB", the "starting character" optimization skips "XX", then tries to
1787       match  "BB", which is long enough. In the process, (*MARK:2) is encoun-
1788       tered and remembered. When the match attempt fails,  the  next  "B"  is
1789       found,  but  there is only one character left, so there are no more at-
1790       tempts, and "no match" is returned with the "last  mark  seen"  set  to
1791       "2". If PCRE2_NO_START_OPTIMIZE is set, however, matches are tried at every
1792       possible starting position, including at the end of the subject,  where
1793       (*MARK:1)  is encountered, but there is no "B", so the "last mark seen"
1794       that is returned is "1". In this case, the optimizations do not  affect
1795       the overall match result, which is still "no match", but they do affect
1796       the auxiliary information that is returned.
1797
1798         PCRE2_NO_UTF_CHECK
1799
1800       When PCRE2_UTF is set, the validity of the pattern as a UTF  string  is
1801       automatically  checked.  There  are  discussions  about the validity of
1802       UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode
1803       document.  If an invalid UTF sequence is found, pcre2_compile() returns
1804       a negative error code.
1805
1806       If you know that your pattern is a valid UTF string, and  you  want  to
1807       skip   this   check   for   performance   reasons,   you  can  set  the
1808       PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
1809       valid  UTF  string as a pattern is undefined. It may cause your program
1810       to crash or loop.
1811
1812       Note  that  this  option  can  also  be  passed  to  pcre2_match()  and
1813       pcre2_dfa_match(), to  suppress  UTF  validity  checking of the subject
1814       string.
1815
1816       Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1817       able  the error that is given if an escape sequence for an invalid Uni-
1818       code code point is encountered in the pattern. In particular,  the  so-
1819       called  "surrogate"  code points (0xd800 to 0xdfff) are invalid. If you
1820       want to allow escape  sequences  such  as  \x{d800}  you  can  set  the
1821       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  extra  option, as described in the
1822       section entitled "Extra compile options" below.  However, this is  pos-
1823       sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1824       resentable in UTF-16.
1825
1826         PCRE2_UCP
1827
1828       This option has two effects. Firstly, it changes the way PCRE2 processes
1829       \B,  \b,  \D,  \d,  \S,  \s,  \W,  \w,  and some of the POSIX character
1830       classes. By default, only  ASCII  characters  are  recognized,  but  if
1831       PCRE2_UCP is set, Unicode properties are used instead to classify char-
1832       acters. More details are given in  the  section  on  generic  character
1833       types  in  the pcre2pattern page. If you set PCRE2_UCP, matching one of
1834       the items it affects takes much longer.
1835
1836       The second effect of PCRE2_UCP is to force the use of  Unicode  proper-
1837       ties  for  upper/lower casing operations on characters with code points
1838       greater than 127, even when PCRE2_UTF is not set. This makes it  possi-
1839       ble, for example, to process strings in the 16-bit UCS-2 code. This op-
1840       tion is available only if PCRE2 has been compiled with Unicode  support
1841       (which is the default).
1842
1843         PCRE2_UNGREEDY
1844
1845       This  option  inverts  the "greediness" of the quantifiers so that they
1846       are not greedy by default, but become greedy if followed by "?". It  is
1847       not  compatible  with Perl. It can also be set by a (?U) option setting
1848       within the pattern.
1849
1850         PCRE2_USE_OFFSET_LIMIT
1851
1852       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
1853       is  going  to be used to set a non-default offset limit in a match con-
1854       text for matches that use this pattern. An error  is  generated  if  an
1855       offset  limit is set without this option. For more details, see the de-
1856       scription of pcre2_set_offset_limit() in  the  section  that  describes
1857       match contexts. See also the PCRE2_FIRSTLINE option above.
1858
1859         PCRE2_UTF
1860
1861       This  option  causes  PCRE2  to regard both the pattern and the subject
1862       strings that are subsequently processed as strings  of  UTF  characters
1863       instead  of  single-code-unit  strings.  It  is available when PCRE2 is
1864       built to include Unicode support (which is  the  default).  If  Unicode
1865       support is not available, the use of this option provokes an error. De-
1866       tails of how PCRE2_UTF changes the behaviour of PCRE2 are given in  the
1867       pcre2unicode  page.  In  particular,  note  that  it  changes  the  way
1868       PCRE2_CASELESS handles characters with code points greater than 127.
1869
1870   Extra compile options
1871
1872       The option bits that can be set in a compile  context  by  calling  the
1873       pcre2_set_compile_extra_options() function are as follows:
1874
1875         PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
1876
1877       This  option  applies when compiling a pattern in UTF-8 or UTF-32 mode.
1878       It is forbidden in UTF-16 mode, and ignored in non-UTF  modes.  Unicode
1879       "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
1880       in UTF-16 to encode code points with values in  the  range  0x10000  to
1881       0x10ffff.  The  surrogates  cannot  therefore be represented in UTF-16.
1882       They can be represented in UTF-8 and UTF-32, but are defined as invalid
1883       code  points,  and  cause  errors  if  encountered in a UTF-8 or UTF-32
1884       string that is being checked for validity by PCRE2.
1885
1886       These values also cause errors if encountered in escape sequences  such
1887       as \x{d912} within a pattern. However, it seems that some applications,
1888       when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
1889       plicitly   test   for   the  surrogates  using  escape  sequences.  The
1890       PCRE2_NO_UTF_CHECK option does not disable the error that  occurs,  be-
1891       cause it applies only to the testing of input strings for UTF validity.
1892
1893       If  the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
1894       gate code point values in UTF-8 and UTF-32 patterns no  longer  provoke
1895       errors  and are incorporated in the compiled pattern. However, they can
1896       only match subject characters if the matching function is  called  with
1897       PCRE2_NO_UTF_CHECK set.
1898
1899         PCRE2_EXTRA_ALT_BSUX
1900
1901       The  original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and
1902       \x in the way that ECMAScript (aka JavaScript) does.  Additional  func-
1903       tionality was defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has
1904       the effect of PCRE2_ALT_BSUX, but in addition it  recognizes  \u{hhh..}
1905       as a hexadecimal character code, where hhh.. is any number of hexadeci-
1906       mal digits.
1907
1908         PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
1909
1910       This is a dangerous option. Use with care. By default, an  unrecognized
1911       escape  such  as \j or a malformed one such as \x{2z} causes a compile-
1912       time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1913       tent  in  handling  such items: for example, \j is treated as a literal
1914       "j", and non-hexadecimal digits in \x{} are just ignored, though  warn-
1915       ings  are given in both cases if Perl's warning switch is enabled. How-
1916       ever, a malformed octal number after \o{  always  causes  an  error  in
1917       Perl.
1918
1919       If  the  PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL  extra  option  is passed to
1920       pcre2_compile(), all unrecognized or  malformed  escape  sequences  are
1921       treated  as  single-character escapes. For example, \j is a literal "j"
1922       and \x{2z} is treated as the literal string "x{2z}". Setting  this  op-
1923       tion means that typos in patterns may go undetected and have unexpected
1924       results. Also note that a sequence such as [\N{] is  interpreted  as  a
1925       malformed  attempt  at [\N{...}] and so is treated as [N{] whereas [\N]
1926       gives an error because an unqualified \N is a valid escape sequence but
1927       is  not supported in a character class. To reiterate: this is a danger-
1928       ous option. Use with great care.
1929
1930         PCRE2_EXTRA_ESCAPED_CR_IS_LF
1931
1932       There are some legacy applications where the escape sequence  \r  in  a
1933       pattern  is expected to match a newline. If this option is set, \r in a
1934       pattern is converted to \n so that it matches a LF  (linefeed)  instead
1935       of  a CR (carriage return) character. The option does not affect a lit-
1936       eral CR in the pattern, nor does it affect CR specified as an  explicit
1937       code point such as \x{0D}.
1938
1939         PCRE2_EXTRA_MATCH_LINE
1940
1941       This  option  is  provided  for  use  by the -x option of pcre2grep. It
1942       causes the pattern only to match complete lines. This  is  achieved  by
1943       automatically  inserting  the  code for "^(?:" at the start of the com-
1944       piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE  is  set,
1945       the  matched  line may be in the middle of the subject string. This op-
1946       tion can be used with PCRE2_LITERAL.
1947
1948         PCRE2_EXTRA_MATCH_WORD
1949
1950       This option is provided for use by  the  -w  option  of  pcre2grep.  It
1951       causes  the  pattern only to match strings that have a word boundary at
1952       the start and the end. This is achieved by automatically inserting  the
1953       code  for "\b(?:" at the start of the compiled pattern and ")\b" at the
1954       end. The option may be used with PCRE2_LITERAL. However, it is  ignored
1955       if PCRE2_EXTRA_MATCH_LINE is also set.
1956
1957
1958JUST-IN-TIME (JIT) COMPILATION
1959
1960       int pcre2_jit_compile(pcre2_code *code, uint32_t options);
1961
1962       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
1963         PCRE2_SIZE length, PCRE2_SIZE startoffset,
1964         uint32_t options, pcre2_match_data *match_data,
1965         pcre2_match_context *mcontext);
1966
1967       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
1968
1969       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
1970         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
1971
1972       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
1973         pcre2_jit_callback callback_function, void *callback_data);
1974
1975       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
1976
1977       These  functions  provide  support  for  JIT compilation, which, if the
1978       just-in-time compiler is available, further processes a  compiled  pat-
1979       tern into machine code that executes much faster than the pcre2_match()
1980       interpretive matching function. Full details are given in the  pcre2jit
1981       documentation.
1982
1983       JIT  compilation  is  a heavyweight optimization. It can take some time
1984       for patterns to be analyzed, and for one-off matches  and  simple  pat-
1985       terns  the benefit of faster execution might be offset by a much slower
1986       compilation time.  Most (but not all) patterns can be optimized by  the
1987       JIT compiler.
1988
1989
1990LOCALE SUPPORT
1991
1992       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);
1993
1994       void pcre2_maketables_free(pcre2_general_context *gcontext,
1995         const uint8_t *tables);
1996
1997       PCRE2  handles caseless matching, and determines whether characters are
1998       letters, digits, or whatever, by reference to a set of tables,  indexed
1999       by character code point. However, this applies only to characters whose
2000       code points are less than 256. By default,  higher-valued  code  points
2001       never match escapes such as \w or \d.
2002
2003       When  PCRE2  is  built  with Unicode support (the default), the Unicode
2004       properties of all characters can be tested with \p and \P, or, alterna-
2005       tively,  the  PCRE2_UCP  option  can be set when a pattern is compiled;
2006       this causes \w and friends to use Unicode property support  instead  of
2007       the  built-in  tables.  PCRE2_UCP also causes upper/lower casing opera-
2008       tions on characters with code points greater than 127  to  use  Unicode
2009       properties. These effects apply even when PCRE2_UTF is not set.
2010
2011       The  use  of  locales  with Unicode is discouraged. If you are handling
2012       characters with code points greater than 127,  you  should  either  use
2013       Unicode support, or use locales, but not try to mix the two.
2014
2015       PCRE2  contains a built-in set of character tables that are used by de-
2016       fault.  These are sufficient for many applications. Normally,  the  in-
2017       ternal  tables  recognize only ASCII characters. However, when PCRE2 is
2018       built, it is possible to cause the internal tables to be rebuilt in the
2019       default "C" locale of the local system, which may cause them to be dif-
2020       ferent.
2021
2022       The built-in tables can be overridden by tables supplied by the  appli-
2023       cation  that  calls  PCRE2.  These may be created in a different locale
2024       from the default.  As more and more applications change to  using  Uni-
2025       code, the need for this locale support is expected to die away.
2026
2027       External  tables  are built by calling the pcre2_maketables() function,
2028       in the relevant locale. The only argument to this function is a general
2029       context,  which  can  be used to pass a custom memory allocator. If the
2030       argument is NULL, the system malloc() is used. The result can be passed
2031       to pcre2_compile() as often as necessary, by creating a compile context
2032       and calling pcre2_set_character_tables()  to  set  the  tables  pointer
2033       therein.
2034
2035       For  example,  to  build  and  use  tables that are appropriate for the
2036       French locale (where accented characters with values greater  than  127
2037       are treated as letters), the following code could be used:
2038
2039         setlocale(LC_CTYPE, "fr_FR");
2040         tables = pcre2_maketables(NULL);
2041         ccontext = pcre2_compile_context_create(NULL);
2042         pcre2_set_character_tables(ccontext, tables);
2043         re = pcre2_compile(..., ccontext);
2044
2045       The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
2046       if you are using Windows, the name for the French locale is "french".
2047
2048       The pointer that is passed (via the compile context) to pcre2_compile()
2049       is saved with the compiled pattern, and the same tables are used by the
2050       matching functions. Thus,  for  any  single  pattern,  compilation  and
2051       matching  both happen in the same locale, but different patterns can be
2052       processed in different locales.
2053
2054       It is the caller's responsibility to ensure that the memory  containing
2055       the tables remains available while they are still in use. When they are
2056       no longer needed, you can discard them  using  pcre2_maketables_free(),
2057       passing as its first parameter the  same  general  context  that
2058       was used to create the tables.
2059
2060   Saving locale tables
2061
2062       The tables described above are just a sequence of binary  bytes,  which
2063       makes  them  independent of hardware characteristics such as endianness
2064       or whether the processor is 32-bit or 64-bit. A copy of the  result  of
2065       pcre2_maketables()  can  therefore  be saved in a file or elsewhere and
2066       re-used later, even in a different program or on another computer.  The
2067       size  of  the  tables  (number  of  bytes)  must be obtained by calling
2068       pcre2_config()  with  the  PCRE2_CONFIG_TABLES_LENGTH  option   because
2069       pcre2_maketables()   does   not   return  this  value.  Note  that  the
2070       pcre2_dftables program, which is part of the PCRE2 build system, can be
2071       used stand-alone to create a file that contains a set of binary tables.
2072       See the pcre2build documentation for details.
2073
2074
2075INFORMATION ABOUT A COMPILED PATTERN
2076
2077       int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
2078
2079       The pcre2_pattern_info() function returns general information  about  a
2080       compiled pattern. For information about callouts, see the next section.
2081       The first argument for pcre2_pattern_info() is a pointer  to  the  com-
2082       piled pattern. The second argument specifies which piece of information
2083       is required, and the third argument is a pointer to a variable  to  re-
2084       ceive  the  data.  If the third argument is NULL, the first argument is
2085       ignored, and the function returns the size in  bytes  of  the  variable
2086       that is required for the information requested. Otherwise, the yield of
2087       the function is zero for success, or one of the following negative num-
2088       bers:
2089
2090         PCRE2_ERROR_NULL           the argument code was NULL
2091         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
2092         PCRE2_ERROR_BADOPTION      the value of what was invalid
2093         PCRE2_ERROR_UNSET          the requested field is not set
2094
2095       The "magic number" is placed at the start of each compiled pattern as a
2096       simple check against passing an arbitrary memory  pointer.  Here  is  a
2097       typical  call of pcre2_pattern_info(), to obtain the length of the com-
2098       piled pattern:
2099
2100         int rc;
2101         size_t length;
2102         rc = pcre2_pattern_info(
2103           re,               /* result of pcre2_compile() */
2104           PCRE2_INFO_SIZE,  /* what is required */
2105           &length);         /* where to put the data */
2106
2107       The possible values for the second argument are defined in pcre2.h, and
2108       are as follows:
2109
2110         PCRE2_INFO_ALLOPTIONS
2111         PCRE2_INFO_ARGOPTIONS
2112         PCRE2_INFO_EXTRAOPTIONS
2113
2114       Return copies of the pattern's options. The third argument should point
2115       to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly  the  op-
2116       tions  that  were  passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
2117       TIONS returns the compile options as modified by any  top-level  (*XXX)
2118       option  settings  such  as  (*UTF)  at the start of the pattern itself.
2119       PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in  the
2120       compile  context by calling the pcre2_set_compile_extra_options() func-
2121       tion.
2122
2123       For example, if the pattern /(*UTF)abc/ is compiled with the  PCRE2_EX-
2124       TENDED  option,  the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED
2125       and PCRE2_UTF.  Option settings such as (?i) that can change  within  a
2126       pattern do not affect the result of PCRE2_INFO_ALLOPTIONS, even if they
2127       appear right at the start of the pattern. (This was different  in  some
2128       earlier releases.)
2129
2130       A  pattern compiled without PCRE2_ANCHORED is automatically anchored by
2131       PCRE2 if the first significant item in every top-level branch is one of
2132       the following:
2133
2134         ^     unless PCRE2_MULTILINE is set
2135         \A    always
2136         \G    always
2137         .*    sometimes - see below
2138
2139       When  .* is the first significant item, anchoring is possible only when
2140       all the following are true:
2141
2142         .* is not in an atomic group
2143         .* is not in a capture group that is the subject
2144              of a backreference
2145         PCRE2_DOTALL is in force for .*
2146         Neither (*PRUNE) nor (*SKIP) appears in the pattern
2147         PCRE2_NO_DOTSTAR_ANCHOR is not set
2148
2149       For patterns that are auto-anchored, the PCRE2_ANCHORED bit is  set  in
2150       the options returned for PCRE2_INFO_ALLOPTIONS.
2151
2152         PCRE2_INFO_BACKREFMAX
2153
2154       Return  the  number  of  the  highest backreference in the pattern. The
2155       third argument should point  to  a  uint32_t  variable.  Named  capture
2156       groups  acquire  numbers  as well as names, and these count towards the
2157       highest backreference. Backreferences such as \4 or  \g{12}  match  the
2158       captured characters of the given group, but in addition, the check that
2159       a capture group is set in a conditional group such as (?(3)a|b) is also
2160       a backreference.  Zero is returned if there are no backreferences.
2161
2162         PCRE2_INFO_BSR
2163
2164       The  output  is a uint32_t integer whose value indicates what character
2165       sequences the \R escape sequence matches. A value of  PCRE2_BSR_UNICODE
2166       means  that  \R  matches  any  Unicode line ending sequence; a value of
2167       PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.
2168
2169         PCRE2_INFO_CAPTURECOUNT
2170
2171       Return the highest capture group number in  the  pattern.  In  patterns
2172       where (?| is not used, this is also the total number of capture groups.
2173       The third argument should point to a uint32_t variable.
2174
2175         PCRE2_INFO_DEPTHLIMIT
2176
2177       If the pattern set a backtracking depth limit by including an  item  of
2178       the  form  (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
2179       third argument should point to a uint32_t integer. If no such value has
2180       been  set, the call to pcre2_pattern_info() returns the error PCRE2_ER-
2181       ROR_UNSET. Note that this limit will only be used during matching if it
2182       is  less  than  the  limit  set or defaulted by the caller of the match
2183       function.
2184
2185         PCRE2_INFO_FIRSTBITMAP
2186
2187       In the absence of a single first code unit for a non-anchored  pattern,
2188       pcre2_compile()  may construct a 256-bit table that defines a fixed set
2189       of values for the first code unit in any match. For example, a  pattern
2190       that  starts  with  [abc]  results in a table with three bits set. When
2191       code unit values greater than 255 are supported, the flag bit  for  255
2192       means  "any  code unit of value 255 or above". If such a table was con-
2193       structed, a pointer to it is returned. Otherwise NULL is returned.  The
2194       third argument should point to a const uint8_t * variable.
2195
2196         PCRE2_INFO_FIRSTCODETYPE
2197
2198       Return information about the first code unit of any matched string, for
2199       a non-anchored pattern. The third argument should point to  a  uint32_t
2200       variable.  If there is a fixed first value, for example, the letter "c"
2201       from a pattern such as (cat|cow|coyote), 1 is returned, and  the  value
2202       can  be  retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
2203       first value, but it is known that a match can occur only at  the  start
2204       of  the  subject  or following a newline in the subject, 2 is returned.
2205       Otherwise, and for anchored patterns, 0 is returned.
2206
2207         PCRE2_INFO_FIRSTCODEUNIT
2208
2209       Return the value of the first code unit of any  matched  string  for  a
2210       pattern  where  PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
2211       The third argument should point to a uint32_t variable.  In  the  8-bit
2212       library,  the  value is always less than 256. In the 16-bit library the
2213       value can be up to 0xffff. In the 32-bit library  in  UTF-32  mode  the
2214       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2215       mode.
2216
2217         PCRE2_INFO_FRAMESIZE
2218
2219       Return the size (in bytes) of the data frames that are used to remember
2220       backtracking  positions  when the pattern is processed by pcre2_match()
2221       without the use of JIT. The third argument should  point  to  a  size_t
2222       variable. The frame size depends on the number of capturing parentheses
2223       in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
2224       ables.
2225
2226         PCRE2_INFO_HASBACKSLASHC
2227
2228       Return  1 if the pattern contains any instances of \C, otherwise 0. The
2229       third argument should point to a uint32_t variable.
2230
2231         PCRE2_INFO_HASCRORLF
2232
2233       Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
2234       characters,  otherwise 0. The third argument should point to a uint32_t
2235       variable. An explicit match is either a literal CR or LF character,  or
2236       \r  or  \n  or  one  of  the equivalent hexadecimal or octal escape se-
2237       quences.
2238
2239         PCRE2_INFO_HEAPLIMIT
2240
2241       If the pattern set a heap memory limit by including an item of the form
2242       (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2243       ment should point to a uint32_t integer. If no such value has been set,
2244       the  call  to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET.
2245       Note that this limit will only be used during matching if  it  is  less
2246       than the limit set or defaulted by the caller of the match function.
2247
2248         PCRE2_INFO_JCHANGED
2249
2250       Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
2251       otherwise 0. The third argument should point to  a  uint32_t  variable.
2252       (?J)  and  (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
2253       tively.
2254
2255         PCRE2_INFO_JITSIZE
2256
2257       If the compiled pattern was successfully  processed  by  pcre2_jit_com-
2258       pile(),  return  the  size  of  the JIT compiled code, otherwise return
2259       zero. The third argument should point to a size_t variable.
2260
2261         PCRE2_INFO_LASTCODETYPE
2262
2263       Return  1 if there is a rightmost literal code unit that must exist  in
2264       any  matched string, other than at its start. The third argument should
2265       point to a uint32_t variable. If there is no such value, 0 is returned.
2266       When  1  is returned, the code unit value itself can be retrieved using
2267       PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
2268       recorded  only if it follows something of variable length. For example,
2269       for the pattern /^a\d+z\d+/ the returned value is 1 (with "z"  returned
2270       from  PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is
2271       0.
2272
2273         PCRE2_INFO_LASTCODEUNIT
2274
2275       Return the value of the rightmost literal code unit that must exist  in
2276       any  matched  string,  other  than  at  its  start, for a pattern where
2277       PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2278       ment should point to a uint32_t variable.
2279
2280         PCRE2_INFO_MATCHEMPTY
2281
2282       Return  1  if the pattern might match an empty string, otherwise 0. The
2283       third argument should point to a uint32_t variable. When a pattern con-
2284       tains recursive subroutine calls it is not always possible to determine
2285       whether or not it can match an empty string. PCRE2 takes a cautious ap-
2286       proach and returns 1 in such cases.
2287
2288         PCRE2_INFO_MATCHLIMIT
2289
2290       If  the  pattern  set  a  match  limit by including an item of the form
2291       (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third  ar-
2292       gument  should  point  to a uint32_t integer. If no such value has been
2293       set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
2294       SET.  Note  that  this limit will only be used during matching if it is
2295       less than the limit set or defaulted by the caller of the  match  func-
2296       tion.
2297
2298         PCRE2_INFO_MAXLOOKBEHIND
2299
2300       A  lookbehind  assertion moves back a certain number of characters (not
2301       code units) when it starts to process each of its  branches.  This  re-
2302       quest  returns  the largest of these backward moves. The third argument
2303       should point to a uint32_t integer. The simple assertions \b and \B re-
2304       quire  a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
2305       return 1 in the absence of anything longer. \A also  registers  a  one-
2306       character  lookbehind, though it does not actually inspect the previous
2307       character.
2308
2309       Note that this information is useful for multi-segment matching only if
2310       the  pattern  contains  no nested lookbehinds. For example, the pattern
2311       (?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it  is  pro-
2312       cessed,  the first lookbehind moves back by two characters, matches one
2313       character, then the nested lookbehind also moves back  by  two  charac-
2314       ters. This puts the matching point three characters earlier than it was
2315       at the start.  PCRE2_INFO_MAXLOOKBEHIND is really only useful as a  de-
2316       bugging  tool.  See  the pcre2partial documentation for a discussion of
2317       multi-segment matching.
2318
2319         PCRE2_INFO_MINLENGTH
2320
2321       If a minimum length for matching  subject  strings  was  computed,  its
2322       value is returned. Otherwise the returned value is 0. This value is not
2323       computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number  of
2324       characters,  which in UTF mode may be different from the number of code
2325       units. The third argument should point  to  a  uint32_t  variable.  The
2326       value  is a lower bound to the length of any matching string. There may
2327       not be any strings of that length that do  actually  match,  but  every
2328       string that does match is at least that long.
2329
2330         PCRE2_INFO_NAMECOUNT
2331         PCRE2_INFO_NAMEENTRYSIZE
2332         PCRE2_INFO_NAMETABLE
2333
2334       PCRE2 supports the use of named as well as numbered capturing parenthe-
2335       ses. The names are just an additional way of identifying the  parenthe-
2336       ses, which still acquire numbers. Several convenience functions such as
2337       pcre2_substring_get_byname() are provided for extracting captured  sub-
2338       strings  by  name. It is also possible to extract the data directly, by
2339       first converting the name to a number in order to  access  the  correct
2340       pointers  in the output vector (described with pcre2_match() below). To
2341       do the conversion, you need to use the name-to-number map, which is de-
2342       scribed by these three values.
2343
2344       The  map  consists  of a number of fixed-size entries. PCRE2_INFO_NAME-
2345       COUNT gives the number of entries, and  PCRE2_INFO_NAMEENTRYSIZE  gives
2346       the  size  of each entry in code units; both of these return a uint32_t
2347       value. The entry size depends on the length of the longest name.
2348
2349       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
2350       This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
2351       brary, the first two bytes of each entry are the number of the  captur-
2352       ing  parenthesis,  most  significant byte first. In the 16-bit library,
2353       the pointer points to 16-bit code units, the first  of  which  contains
2354       the  parenthesis  number.  In the 32-bit library, the pointer points to
2355       32-bit code units, the first of which contains the parenthesis  number.
2356       The rest of the entry is the corresponding name, zero terminated.
2357
2358       The  names are in alphabetical order. If (?| is used to create multiple
2359       capture groups with the same number, as described in the section on du-
2360       plicate group numbers in the pcre2pattern page, the groups may be given
2361       the same name, but there is only one  entry  in  the  table.  Different
2362       names for groups of the same number are not permitted.
2363
2364       Duplicate  names  for capture groups with different numbers are permit-
2365       ted, but only if PCRE2_DUPNAMES is set. They appear in the table in the
2366       order  in  which  they were found in the pattern. In the absence of (?|
2367       this is the order of increasing number; when (?| is used  this  is  not
2368       necessarily  the  case because later capture groups may have lower num-
2369       bers.
2370
2371       As a simple example of the name/number table,  consider  the  following
2372       pattern  after  compilation by the 8-bit library (assume PCRE2_EXTENDED
2373       is set, so white space - including newlines - is ignored):
2374
2375         (?<date> (?<year>(\d\d)?\d\d) -
2376         (?<month>\d\d) - (?<day>\d\d) )
2377
       There are four named capture groups, so the table has four entries, and
       each  entry  in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??
2387
2388       When  writing  code to extract data from named capture groups using the
2389       name-to-number map, remember that the length of the entries  is  likely
2390       to be different for each compiled pattern.
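
       The lookup described above can be sketched in C for the 8-bit  library.
       This  is a minimal illustration, not library code: find_group_number()
       is a hypothetical helper, and its table, count, and  entry_size  argu-
       ments  are  assumed  to be the values obtained for PCRE2_INFO_NAMETABLE,
       PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE respectively.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Search an 8-bit name table for a group name. Returns the group
   number, or 0 if the name is absent (group numbers start at 1). */
static uint32_t find_group_number(const uint8_t *table, uint32_t count,
                                  uint32_t entry_size, const char *name)
{
  for (uint32_t i = 0; i < count; i++) {
    /* Entries are fixed-size, so entry i starts at i * entry_size. */
    const uint8_t *entry = table + (size_t)i * entry_size;

    /* Bytes 0 and 1: group number, most significant byte first. */
    uint32_t number = ((uint32_t)entry[0] << 8) | entry[1];

    /* Bytes 2 onwards: the zero-terminated name. */
    if (strcmp((const char *)(entry + 2), name) == 0)
      return number;
  }
  return 0;
}
```

       With  the  table  shown  above,  looking up "month" yields 4. When du-
       plicate names are in use, this sketch reports only the first entry for
       the name.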
2391
2392         PCRE2_INFO_NEWLINE
2393
2394       The output is one of the following uint32_t values:
2395
2396         PCRE2_NEWLINE_CR       Carriage return (CR)
2397         PCRE2_NEWLINE_LF       Linefeed (LF)
2398         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
2399         PCRE2_NEWLINE_ANY      Any Unicode line ending
2400         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
2401         PCRE2_NEWLINE_NUL      The NUL character (binary zero)
2402
2403       This identifies the character sequence that will be recognized as mean-
2404       ing "newline" while matching.
2405
2406         PCRE2_INFO_SIZE
2407
2408       Return the size of the compiled pattern in bytes  (for  all  three  li-
2409       braries).  The  third  argument should point to a size_t variable. This
2410       value includes the size of the general data  block  that  precedes  the
2411       code  units of the compiled pattern itself. The value that is used when
2412       pcre2_compile() is getting memory in which to place the  compiled  pat-
2413       tern may be slightly larger than the value returned by this option, be-
2414       cause there are cases where the code that calculates the  size  has  to
2415       over-estimate.  Processing a pattern with the JIT compiler does not al-
2416       ter the value returned by this option.
2417
2418
2419INFORMATION ABOUT A PATTERN'S CALLOUTS
2420
2421       int pcre2_callout_enumerate(const pcre2_code *code,
2422         int (*callback)(pcre2_callout_enumerate_block *, void *),
2423         void *user_data);
2424
2425       A script language that supports the use of string arguments in callouts
2426       might  like  to  scan  all the callouts in a pattern before running the
2427       match. This can be done by calling pcre2_callout_enumerate(). The first
2428       argument  is  a  pointer  to a compiled pattern, the second points to a
2429       callback function, and the third is arbitrary user data.  The  callback
2430       function  is  called  for  every callout in the pattern in the order in
2431       which they appear. Its first argument is a pointer to a callout enumer-
2432       ation  block,  and  its second argument is the user_data value that was
2433       passed to pcre2_callout_enumerate(). The contents of the  callout  enu-
2434       meration  block  are described in the pcre2callout documentation, which
2435       also gives further details about callouts.
2436
2437
2438SERIALIZATION AND PRECOMPILING
2439
2440       It is possible to save compiled patterns  on  disc  or  elsewhere,  and
2441       reload  them  later,  subject  to a number of restrictions. The host on
2442       which the patterns are reloaded must be running  the  same  version  of
2443       PCRE2, with the same code unit width, and must also have the same endi-
2444       anness, pointer width, and PCRE2_SIZE type.  Before  compiled  patterns
2445       can  be  saved, they must be converted to a "serialized" form, which in
2446       the case of PCRE2 is really just a bytecode dump.  The functions  whose
2447       names  begin  with pcre2_serialize_ are used for converting to and from
2448       the serialized form. They are described in the pcre2serialize  documen-
2449       tation.  Note  that  PCRE2 serialization does not convert compiled pat-
2450       terns to an abstract format like Java or .NET serialization.
2451
2452
2453THE MATCH DATA BLOCK
2454
2455       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
2456         pcre2_general_context *gcontext);
2457
2458       pcre2_match_data *pcre2_match_data_create_from_pattern(
2459         const pcre2_code *code, pcre2_general_context *gcontext);
2460
2461       void pcre2_match_data_free(pcre2_match_data *match_data);
2462
2463       Information about a successful or unsuccessful match  is  placed  in  a
2464       match  data  block,  which  is  an opaque structure that is accessed by
2465       function calls. In particular, the match data block contains  a  vector
2466       of  offsets into the subject string that define the matched part of the
2467       subject and any substrings that were captured. This  is  known  as  the
2468       ovector.
2469
2470       Before  calling  pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
2471       you must create a match data block by calling one of the creation func-
2472       tions  above.  For pcre2_match_data_create(), the first argument is the
2473       number of pairs of offsets in the ovector. One pair of offsets  is  re-
2474       quired  to  identify the string that matched the whole pattern, with an
2475       additional pair for each captured substring. For example, a value of  4
2476       creates  enough space to record the matched portion of the subject plus
       three captured substrings. A minimum of one  pair  is  imposed  by
       pcre2_match_data_create(),  so  it  is  always possible to return the
       overall matched string.
2480
2481       The second argument of pcre2_match_data_create() is a pointer to a gen-
2482       eral  context, which can specify custom memory management for obtaining
2483       the memory for the match data block. If you are not using custom memory
2484       management, pass NULL, which causes malloc() to be used.
2485
2486       For  pcre2_match_data_create_from_pattern(),  the  first  argument is a
2487       pointer to a compiled pattern. The ovector is created to be exactly the
2488       right size to hold all the substrings a pattern might capture. The sec-
2489       ond argument is again a pointer to a general context, but in this  case
2490       if NULL is passed, the memory is obtained using the same allocator that
2491       was used for the compiled pattern (custom or default).
2492
2493       A match data block can be used many times, with the same  or  different
2494       compiled  patterns. You can extract information from a match data block
2495       after a match operation has finished,  using  functions  that  are  de-
2496       scribed in the sections on matched strings and other match data below.
2497
2498       When  a  call  of  pcre2_match()  fails, valid data is available in the
2499       match block only  when  the  error  is  PCRE2_ERROR_NOMATCH,  PCRE2_ER-
2500       ROR_PARTIAL,  or  one of the error codes for an invalid UTF string. Ex-
2501       actly what is available depends on the error, and is detailed below.
2502
2503       When one of the matching functions is called, pointers to the  compiled
2504       pattern  and the subject string are set in the match data block so that
2505       they can be referenced by the extraction functions after  a  successful
2506       match. After running a match, you must not free a compiled pattern or a
2507       subject string until after all operations on the match data block  (for
2508       that  match)  have  taken  place,  unless,  in  the case of the subject
2509       string, you have used the PCRE2_COPY_MATCHED_SUBJECT option,  which  is
2510       described  in  the section entitled "Option bits for pcre2_match()" be-
2511       low.
2512
2513       When a match data block itself is no longer needed, it should be  freed
2514       by  calling  pcre2_match_data_free(). If this function is called with a
2515       NULL argument, it returns immediately, without doing anything.
2516
2517
2518MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2519
2520       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
2521         PCRE2_SIZE length, PCRE2_SIZE startoffset,
2522         uint32_t options, pcre2_match_data *match_data,
2523         pcre2_match_context *mcontext);
2524
2525       The function pcre2_match() is called to match a subject string  against
2526       a  compiled pattern, which is passed in the code argument. You can call
2527       pcre2_match() with the same code argument as many times as you like, in
2528       order  to  find multiple matches in the subject string or to match dif-
2529       ferent subject strings with the same pattern.
2530
2531       This function is the main matching facility of the library, and it  op-
2532       erates  in  a Perl-like manner. For specialist use there is also an al-
2533       ternative matching function, which is described below  in  the  section
2534       about the pcre2_dfa_match() function.
2535
2536       Here is an example of a simple call to pcre2_match():
2537
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_match(
           re,                         /* result of pcre2_compile() */
           (PCRE2_SPTR)"some string",  /* the subject string */
           11,                         /* the length of the subject string */
           0,                          /* start at offset 0 in the subject */
           0,                          /* default options */
           md,                         /* the match data block */
           NULL);                      /* match context; NULL means defaults */
2547
2548       If  the  subject  string is zero-terminated, the length can be given as
2549       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
2550       common matching parameters are to be changed. For details, see the sec-
2551       tion on the match context above.
2552
2553   The string to be matched by pcre2_match()
2554
2555       The subject string is passed to pcre2_match() as a pointer in  subject,
2556       a  length  in  length, and a starting offset in startoffset. The length
2557       and offset are in code units, not characters.  That  is,  they  are  in
2558       bytes  for the 8-bit library, 16-bit code units for the 16-bit library,
2559       and 32-bit code units for the 32-bit library, whether or not  UTF  pro-
2560       cessing is enabled.
2561
2562       If startoffset is greater than the length of the subject, pcre2_match()
2563       returns PCRE2_ERROR_BADOFFSET. When the starting offset  is  zero,  the
2564       search  for a match starts at the beginning of the subject, and this is
2565       by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2566       set  must  point to the start of a character, or to the end of the sub-
2567       ject (in UTF-32 mode, one code unit equals one character, so  all  off-
2568       sets  are  valid). Like the pattern string, the subject may contain bi-
2569       nary zeros.
2570
2571       A non-zero starting offset is useful when searching for  another  match
2572       in  the  same  subject  by calling pcre2_match() again after a previous
2573       success.  Setting startoffset differs from  passing  over  a  shortened
2574       string  and  setting  PCRE2_NOTBOL in the case of a pattern that begins
2575       with any kind of lookbehind. For example, consider the pattern
2576
2577         \Biss\B
2578
       which finds occurrences of "iss" in the middle of  words.  (\B  matches
       only  if  the  current position in the subject is not a word boundary.)
       When  applied  to  the  string  "Mississippi"  the   first   call   to
       pcre2_match()  finds  the  first occurrence. If pcre2_match() is called
       again with just the remainder of the  subject,  namely  "issippi",  it
       does  not  match,  because \B is false at the start of a subject that
       begins with a word character: that position is deemed  to  be  a  word
       boundary. However, if pcre2_match() is passed the entire string again,
       but with startoffset set to 4, it finds the second occurrence of "iss"
       because  it is able to look behind the starting point to discover that
       it is preceded by a letter.
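
       The  difference  can  be  seen  with a small model of the \B test. This
       is a sketch for illustration, not PCRE2's  implementation:  not_word_
       boundary()  is  a hypothetical helper, it handles only single-byte
       characters, and it assumes that a position outside the subject  counts
       as a non-word character.

```c
#include <ctype.h>
#include <stddef.h>

/* A word character in the default sense: letter, digit, or underscore. */
static int is_word(unsigned char c)
{
  return isalnum(c) || c == '_';
}

/* Model of \B at offset pos: true when the characters on either side
   of pos are of the same kind (both word or both non-word). Anything
   outside the subject is treated as a non-word character. */
static int not_word_boundary(const char *subject, size_t length, size_t pos)
{
  int before = pos > 0 && is_word((unsigned char)subject[pos - 1]);
  int after = pos < length && is_word((unsigned char)subject[pos]);
  return before == after;
}
```

       For the whole subject with pos set to 4 the function returns 1, because
       the character before the starting point is  visible.  For  a  subject
       that  consists only of the remainder, with pos set to 0, it returns 0,
       because nothing precedes the first letter.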
2589
2590       Finding  all  the  matches  in a subject is tricky when the pattern can
2591       match an empty string. It is possible to emulate Perl's /g behaviour by
2592       first   trying   the   match   again  at  the  same  offset,  with  the
2593       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options,  and  then  if  that
2594       fails,  advancing  the  starting  offset  and  trying an ordinary match
2595       again. There is some code that demonstrates  how  to  do  this  in  the
2596       pcre2demo  sample  program. In the most general case, you have to check
2597       to see if the newline convention recognizes CRLF as a newline,  and  if
2598       so,  and the current character is CR followed by LF, advance the start-
2599       ing offset by two characters instead of one.
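
       The  advancing  step  described above can be sketched as follows. This
       is a minimal illustration for the 8-bit  library  under  stated  as-
       sumptions:  advance_start() is a hypothetical helper, crlf_is_newline
       is assumed to have been worked out by the caller from the newline con-
       vention, and utf8 is assumed to be non-zero when the pattern was com-
       piled for UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the offset at which to try an ordinary match after an empty
   match at `offset` could not be replaced by a non-empty one. */
static size_t advance_start(const uint8_t *subject, size_t length,
                            size_t offset, int crlf_is_newline, int utf8)
{
  if (offset >= length) return length;   /* already at the end */

  /* Step over CR LF as a unit when CRLF is a recognized newline. */
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;

  offset++;

  /* In UTF-8, also step over continuation bytes (10xxxxxx) so that
     the new offset points to the start of a character. */
  if (utf8)
    while (offset < length && (subject[offset] & 0xC0) == 0x80)
      offset++;

  return offset;
}
```

       When the returned offset equals the subject length and the last  match
       was empty, there are no further matches to find.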
2600
2601       If a non-zero starting offset is passed when the pattern is anchored, a
2602       single attempt to match at the given offset is made. This can only suc-
2603       ceed if the pattern does not require the match to be at  the  start  of
2604       the  subject.  In other words, the anchoring must be the result of set-
2605       ting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL,  not
2606       by starting the pattern with ^ or \A.
2607
2608   Option bits for pcre2_match()
2609
2610       The unused bits of the options argument for pcre2_match() must be zero.
2611       The   only   bits    that    may    be    set    are    PCRE2_ANCHORED,
2612       PCRE2_COPY_MATCHED_SUBJECT,  PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
2613       TEOL,     PCRE2_NOTEMPTY,     PCRE2_NOTEMPTY_ATSTART,     PCRE2_NO_JIT,
2614       PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and PCRE2_PARTIAL_SOFT. Their
2615       action is described below.
2616
2617       Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is  not  sup-
2618       ported  by  the just-in-time (JIT) compiler. If it is set, JIT matching
2619       is disabled and the interpretive code in pcre2_match()  is  run.  Apart
2620       from  PCRE2_NO_JIT (obviously), the remaining options are supported for
2621       JIT matching.
2622
2623         PCRE2_ANCHORED
2624
2625       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
2626       matching  position.  If  a pattern was compiled with PCRE2_ANCHORED, or
2627       turned out to be anchored by virtue of its contents, it cannot be  made
       unanchored at matching time. Note that setting the option at match time
2629       disables JIT matching.
2630
2631         PCRE2_COPY_MATCHED_SUBJECT
2632
2633       By default, a pointer to the subject is remembered in  the  match  data
2634       block  so  that,  after a successful match, it can be referenced by the
2635       substring extraction functions. This means that  the  subject's  memory
2636       must  not be freed until all such operations are complete. For some ap-
2637       plications where the lifetime of the subject string is not  guaranteed,
2638       it  may  be  necessary  to make a copy of the subject string, but it is
2639       wasteful to do this unless the match is successful. After a  successful
2640       match,  if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied and
2641       the new pointer is remembered in the match data block  instead  of  the
2642       original  subject  pointer.  The memory allocator that was used for the
2643       match block itself is  used.  The  copy  is  automatically  freed  when
2644       pcre2_match_data_free()  is  called to free the match data block. It is
2645       also automatically freed if the match data block is re-used for another
2646       match operation.
2647
2648         PCRE2_ENDANCHORED
2649
2650       If  the  PCRE2_ENDANCHORED option is set, any string that pcre2_match()
2651       matches must be right at the end of the subject string. Note that  set-
2652       ting the option at match time disables JIT matching.
2653
2654         PCRE2_NOTBOL
2655
       This option specifies that the first character of the subject string is
       not the beginning of a line, so the circumflex metacharacter should not
       match  before  it.  Setting  this without having set PCRE2_MULTILINE at
       compile time causes circumflex never to match. This option affects only
       the behaviour of the circumflex metacharacter. It does not affect \A.
2661
2662         PCRE2_NOTEOL
2663
2664       This option specifies that the end of the subject string is not the end
2665       of a line, so the dollar metacharacter should not match it nor  (except
2666       in  multiline mode) a newline immediately before it. Setting this with-
2667       out having set PCRE2_MULTILINE at compile time causes dollar  never  to
2668       match. This option affects only the behaviour of the dollar metacharac-
2669       ter. It does not affect \Z or \z.
2670
2671         PCRE2_NOTEMPTY
2672
2673       An empty string is not considered to be a valid match if this option is
2674       set.  If  there are alternatives in the pattern, they are tried. If all
2675       the alternatives match the empty string, the entire  match  fails.  For
2676       example, if the pattern
2677
2678         a?b?
2679
2680       is  applied  to  a  string not beginning with "a" or "b", it matches an
2681       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
2682       match  is  not valid, so pcre2_match() searches further into the string
2683       for occurrences of "a" or "b".
2684
2685         PCRE2_NOTEMPTY_ATSTART
2686
2687       This is like PCRE2_NOTEMPTY, except that it locks out an  empty  string
2688       match only at the first matching position, that is, at the start of the
2689       subject plus the starting offset. An empty string match  later  in  the
2690       subject is permitted.  If the pattern is anchored, such a match can oc-
2691       cur only if the pattern contains \K.
2692
2693         PCRE2_NO_JIT
2694
2695       By  default,  if  a  pattern  has  been   successfully   processed   by
2696       pcre2_jit_compile(),  JIT  is  automatically used when pcre2_match() is
2697       called with options that JIT supports.  Setting  PCRE2_NO_JIT  disables
2698       the use of JIT; it forces matching to be done by the interpreter.
2699
2700         PCRE2_NO_UTF_CHECK
2701
2702       When PCRE2_UTF is set at compile time, the validity of the subject as a
2703       UTF  string  is  checked  unless  PCRE2_NO_UTF_CHECK   is   passed   to
2704       pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile().
2705       The latter special case is discussed in detail in the pcre2unicode doc-
2706       umentation.
2707
2708       In  the default case, if a non-zero starting offset is given, the check
2709       is applied only to that part of the subject  that  could  be  inspected
2710       during  matching,  and there is a check that the starting offset points
2711       to the first code unit of a character or to the end of the subject.  If
2712       there  are no lookbehind assertions in the pattern, the check starts at
2713       the starting offset.  Otherwise, it starts at the length of the longest
2714       lookbehind  before  the starting offset, or at the start of the subject
2715       if there are not that many characters before the starting offset.  Note
2716       that the sequences \b and \B are one-character lookbehinds.
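
       The  way the start of the checked region is found can be sketched for
       UTF-8 as follows. This is an illustration under  stated  assumptions,
       not  the  library's  code:  utf_check_start() is a hypothetical helper,
       and max_lookbehind is assumed to be the value  that  PCRE2_INFO_MAX-
       LOOKBEHIND would return for the pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the code unit offset at which a UTF-8 validity check would
   begin: max_lookbehind characters before startoffset, or offset 0
   if there are not that many characters before it. */
static size_t utf_check_start(const uint8_t *subject, size_t startoffset,
                              uint32_t max_lookbehind)
{
  size_t pos = startoffset;
  while (max_lookbehind > 0 && pos > 0) {
    pos--;
    /* Move back over continuation bytes (10xxxxxx) to reach the
       first code unit of the character. */
    while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
    max_lookbehind--;
  }
  return pos;
}
```

       With  no  lookbehind  the function returns startoffset unchanged, which
       corresponds to the case where the check starts at the starting offset.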
2717
2718       The check is carried out before any other processing takes place, and a
2719       negative error code is returned if the check fails. There  are  several
2720       UTF  error  codes  for each code unit width, corresponding to different
2721       problems with the code unit sequence. There are discussions  about  the
2722       validity  of  UTF-8  strings, UTF-16 strings, and UTF-32 strings in the
2723       pcre2unicode documentation.
2724
2725       If you know that your subject is valid, and you want to skip this check
2726       for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when
2727       calling pcre2_match(). You might want to do this  for  the  second  and
2728       subsequent  calls  to pcre2_match() if you are making repeated calls to
2729       find multiple matches in the same subject string.
2730
2731       Warning: Unless PCRE2_MATCH_INVALID_UTF was set at compile  time,  when
2732       PCRE2_NO_UTF_CHECK  is  set  at match time the effect of passing an in-
2733       valid string as a subject, or an invalid value of startoffset, is unde-
2734       fined.   Your  program may crash or loop indefinitely or give wrong re-
2735       sults.
2736
2737         PCRE2_PARTIAL_HARD
2738         PCRE2_PARTIAL_SOFT
2739
2740       These options turn on the partial matching feature. A partial match oc-
2741       curs  if  the  end  of  the subject string is reached successfully, but
2742       there are not enough subject characters to complete the match. In addi-
2743       tion,  either  at  least  one character must have been inspected or the
2744       pattern must contain a lookbehind, or the  pattern  must  be  one  that
2745       could match an empty string.
2746
2747       If  this  situation  arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PAR-
2748       TIAL_HARD) is set, matching continues by testing any remaining alterna-
2749       tives.  Only  if  no complete match can be found is PCRE2_ERROR_PARTIAL
2750       returned instead of PCRE2_ERROR_NOMATCH.  In  other  words,  PCRE2_PAR-
2751       TIAL_SOFT  specifies  that  the  caller is prepared to handle a partial
2752       match, but only if no complete match can be found.
2753
2754       If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In  this
2755       case,  if  a  partial match is found, pcre2_match() immediately returns
2756       PCRE2_ERROR_PARTIAL, without considering  any  other  alternatives.  In
       other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
       ered to be more important than an alternative complete match.
2759
2760       There is a more detailed discussion of partial and multi-segment match-
2761       ing, with examples, in the pcre2partial documentation.
2762
2763
2764NEWLINE HANDLING WHEN MATCHING
2765
2766       When  PCRE2 is built, a default newline convention is set; this is usu-
2767       ally the standard convention for the operating system. The default  can
2768       be  overridden  in a compile context by calling pcre2_set_newline(). It
2769       can also be overridden by starting a pattern string with, for  example,
2770       (*CRLF),  as  described  in  the  section on newline conventions in the
2771       pcre2pattern page. During matching, the newline choice affects the  be-
2772       haviour  of the dot, circumflex, and dollar metacharacters. It may also
2773       alter the way the match starting position is  advanced  after  a  match
2774       failure for an unanchored pattern.
2775
2776       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
2777       set as the newline convention, and a match attempt  for  an  unanchored
2778       pattern fails when the current starting position is at a CRLF sequence,
2779       and the pattern contains no explicit matches for CR or  LF  characters,
2780       the  match  position  is  advanced by two characters instead of one, in
2781       other words, to after the CRLF.
2782
2783       The above rule is a compromise that makes the most common cases work as
2784       expected.  For example, if the pattern is .+A (and the PCRE2_DOTALL op-
2785       tion is not set), it does not match the string "\r\nA"  because,  after
2786       failing  at the start, it skips both the CR and the LF before retrying.
2787       However, the pattern [\r\n]A does match that string,  because  it  con-
2788       tains an explicit CR or LF reference, and so advances only by one char-
2789       acter after the first failure.
2790
       An explicit match for CR or LF is either a literal appearance of one of
2792       those  characters  in the pattern, or one of the \r or \n or equivalent
2793       octal or hexadecimal escape sequences. Implicit matches such as [^X] do
2794       not  count, nor does \s, even though it includes CR and LF in the char-
2795       acters that it matches.
2796
2797       Notwithstanding the above, anomalous effects may still occur when  CRLF
2798       is a valid newline sequence and explicit \r or \n escapes appear in the
2799       pattern.
2800
2801
2802HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
2803
2804       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
2805
2806       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
2807
2808       In general, a pattern matches a certain portion of the subject, and  in
2809       addition,  further  substrings  from  the  subject may be picked out by
2810       parenthesized parts of the pattern.  Following  the  usage  in  Jeffrey
2811       Friedl's  book,  this  is  called  "capturing" in what follows, and the
2812       phrase "capture group" (Perl terminology) is used for a fragment  of  a
2813       pattern  that picks out a substring. PCRE2 supports several other kinds
2814       of parenthesized group that do not cause substrings to be captured. The
2815       pcre2_pattern_info()  function can be used to find out how many capture
2816       groups there are in a compiled pattern.
2817
2818       You can use auxiliary functions for accessing  captured  substrings  by
2819       number or by name, as described in sections below.
2820
2821       Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2822       ues, called  the  ovector,  which  contains  the  offsets  of  captured
2823       strings.   It   is   part  of  the  match  data  block.   The  function
2824       pcre2_get_ovector_pointer() returns the address  of  the  ovector,  and
2825       pcre2_get_ovector_count() returns the number of pairs of values it con-
2826       tains.
2827
2828       Within the ovector, the first in each pair of values is set to the off-
2829       set of the first code unit of a substring, and the second is set to the
       offset  of  the  first  code unit after the end of a substring. These
       values are always code unit offsets, not character offsets.  That  is,
       they are counted in bytes in the 8-bit library, in 16-bit code units in
       the 16-bit library, and in 32-bit code units in the 32-bit library.
2834
2835       After  a  partial  match  (error  return PCRE2_ERROR_PARTIAL), only the
2836       first pair of offsets (that is, ovector[0]  and  ovector[1])  are  set.
2837       They  identify  the part of the subject that was partially matched. See
2838       the pcre2partial documentation for details of partial matching.
2839
2840       After a fully successful match, the first pair  of  offsets  identifies
2841       the  portion  of the subject string that was matched by the entire pat-
2842       tern. The next pair is used for the first captured  substring,  and  so
2843       on.  The  value  returned by pcre2_match() is one more than the highest
2844       numbered pair that has been set. For example, if  two  substrings  have
2845       been  captured,  the returned value is 3. If there are no captured sub-
2846       strings, the return value from a successful match is 1, indicating that
2847       just the first pair of offsets has been set.
2848
2849       If  a  pattern uses the \K escape sequence within a positive assertion,
2850       the reported start of a successful match can be greater than the end of
2851       the  match.   For  example,  if the pattern (?=ab\K) is matched against
2852       "ab", the start and end offset values for the match are 2 and 0.
2853
2854       If a capture group is matched repeatedly within a single  match  opera-
2855       tion, it is the last portion of the subject that it matched that is re-
2856       turned.
2857
2858       If the ovector is too small to hold all the captured substring offsets,
2859       as  much  as possible is filled in, and the function returns a value of
2860       zero. If captured substrings are not of interest, pcre2_match() may  be
2861       called with a match data block whose ovector is of minimum length (that
2862       is, one pair).
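
       The  return  value  conventions  described so far can be gathered into
       one helper. This is a sketch, not part of the API: usable_pairs() is a
       hypothetical  function,  and  ERROR_PARTIAL_STANDIN  is assumed to equal
       PCRE2_ERROR_PARTIAL; real code should include pcre2.h and use the  li-
       brary's own constant.

```c
#include <stdint.h>

/* Assumed to mirror PCRE2_ERROR_PARTIAL from pcre2.h. */
#define ERROR_PARTIAL_STANDIN (-2)

/* Given the value returned by pcre2_match() and the number of pairs
   in the ovector, return how many leading pairs hold data that the
   caller may read. */
static uint32_t usable_pairs(int rc, uint32_t ovector_count)
{
  if (rc > 0) return (uint32_t)rc;   /* whole match plus rc-1 groups */
  if (rc == 0) return ovector_count; /* ovector too small: all filled */
  if (rc == ERROR_PARTIAL_STANDIN) return 1; /* only the first pair */
  return 0;                          /* no match or another error */
}
```

       A  caller  can  loop over pairs 0 to usable_pairs() - 1, still checking
       each pair for PCRE2_UNSET before using it.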
2863
2864       It is possible for capture group number n+1 to match some part  of  the
2865       subject  when  group  n  has  not been used at all. For example, if the
2866       string "abc" is matched against the pattern (a|(z))(bc) the return from
2867       the  function  is 4, and groups 1 and 3 are matched, but 2 is not. When
2868       this happens, both values in the offset pairs corresponding  to  unused
2869       groups are set to PCRE2_UNSET.
2870
2871       Offset  values  that  correspond to unused groups at the end of the ex-
2872       pression are also set to PCRE2_UNSET. For example, if the string  "abc"
2873       is  matched  against  the pattern (abc)(x(yz)?)? groups 2 and 3 are not
2874       matched. The return from the function is 2, because  the  highest  used
2875       capture group number is 1. The offsets for  the  second  and  third
2876       capture  groups  (assuming the vector is large enough, of course) are
2877       set to PCRE2_UNSET.
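
       Both cases, a gap in the middle and unused groups at the end, can be
       tested in the same way. The following sketch does not call PCRE2: it
       takes the return value and a plain array laid out like the ovector
       (as obtained from pcre2_get_ovector_pointer()). UNSET is a local
       stand-in for PCRE2_UNSET, and group_is_set() is a hypothetical name.

```c
#include <stddef.h>

/* Stand-in for PCRE2_UNSET, which pcre2.h defines as (~(PCRE2_SIZE)0). */
#define UNSET (~(size_t)0)

/* Return 1 if capture group n took part in the match, else 0.
   rc is the positive return value of pcre2_match(); ovector holds the
   offset pairs. Hypothetical helper, not a PCRE2 function. */
static int group_is_set(int rc, const size_t *ovector, int n)
{
    if (n < 0 || n >= rc) return 0;    /* beyond the highest set pair */
    return ovector[2 * n] != UNSET;    /* catches a gap such as group 2 */
}
```

       With the values from the (a|(z))(bc) example (return value 4), the
       function reports groups 1 and 3 as set and group 2 as unset.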
2878
2879       Elements in the ovector that do not correspond to capturing parentheses
2880       in the pattern are never changed. That is, if a pattern contains n cap-
2881       turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
2882       pcre2_match(). The other elements retain whatever  values  they  previ-
2883       ously  had.  After  a failed match attempt, the contents of the ovector
2884       are unchanged.
2885
2886
2887OTHER INFORMATION ABOUT A MATCH
2888
2889       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
2890
2891       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
2892
2893       As well as the offsets in the ovector, other information about a  match
2894       is  retained  in the match data block and can be retrieved by the above
2895       functions in appropriate circumstances. If they  are  called  at  other
2896       times, the result is undefined.
2897
2898       After  a  successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
2899       failure to match (PCRE2_ERROR_NOMATCH), a mark name may  be  available.
2900       The  function pcre2_get_mark() can be called to access this name, which
2901       can be specified in the pattern by  any  of  the  backtracking  control
2902       verbs, not just (*MARK). The same function applies to all the verbs. It
2903       returns a pointer to the zero-terminated name, which is within the com-
2904       piled pattern. If no name is available, NULL is returned. The length of
2905       the name (excluding the terminating zero) is stored in  the  code  unit
2906       that  precedes  the name. You should use this length instead of relying
2907       on the terminating zero if the name might contain a binary zero.
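
       Reading the stored length might look like the sketch below. It
       assumes the 8-bit library, where a code unit is one byte, and models
       the returned pointer as a const uint8_t pointer; mark_length8() is a
       hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of a mark name in the 8-bit library, read from the code unit
   that precedes the name. `mark` plays the role of the pointer returned
   by pcre2_get_mark() and must not be NULL. Hypothetical helper. */
static size_t mark_length8(const uint8_t *mark)
{
    return (size_t)mark[-1];
}
```

       For a name that contains a binary zero, this gives the full length,
       whereas strlen() would stop at the embedded zero.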
2908
2909       After a successful match, the name that is returned is  the  last  mark
2910       name encountered on the matching path through the pattern. Instances of
2911       backtracking verbs without names do not count. Thus,  for  example,  if
2912       the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
2913       After a "no match" or a partial match, the last encountered name is re-
2914       turned. For example, consider this pattern:
2915
2916         ^(*MARK:A)((*MARK:B)a|b)c
2917
2918       When  it  matches "bc", the returned name is A. The B mark is "seen" in
2919       the first branch of the group, but it is not on the matching  path.  On
2920       the  other  hand,  when  this pattern fails to match "bx", the returned
2921       name is B.
2922
2923       Warning: By default, certain start-of-match optimizations are  used  to
2924       give  a  fast "no match" result in some situations. For example, if the
2925       anchoring is removed from the pattern above, there is an initial  check
2926       for  the presence of "c" in the subject before running the matching en-
2927       gine. This check fails for "bx", causing a match failure without seeing
2928       any  marks. You can disable the start-of-match optimizations by setting
2929       the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or  by  starting
2930       the pattern with (*NO_START_OPT).
2931
2932       After  a  successful  match, a partial match, or one of the invalid UTF
2933       errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar()  can
2934       be called. After a successful or partial match it returns the code unit
2935       offset of the character at which the match started. For  a  non-partial
2936       match,  this can be different to the value of ovector[0] if the pattern
2937       contains the \K escape sequence. After a partial match,  however,  this
2938       value  is  always the same as ovector[0] because \K does not affect the
2939       result of a partial match.
2940
2941       After a UTF check failure, pcre2_get_startchar() can be used to  obtain
2942       the code unit offset of the invalid UTF character. Details are given in
2943       the pcre2unicode page.
2944
2945
2946ERROR RETURNS FROM pcre2_match()
2947
2948       If pcre2_match() fails, it returns a negative number. This can be  con-
2949       verted  to a text string by calling the pcre2_get_error_message() func-
2950       tion (see "Obtaining a textual error message" below).   Negative  error
2951       codes  are  also  returned  by other functions, and are documented with
2952       them. The codes are given names in the header file. If UTF checking  is
2953       in force and an invalid UTF subject string is detected, one of a number
2954       of UTF-specific negative error codes is returned. Details are given  in
2955       the  pcre2unicode  page. The following are the other errors that may be
2956       returned by pcre2_match():
2957
2958         PCRE2_ERROR_NOMATCH
2959
2960       The subject string did not match the pattern.
2961
2962         PCRE2_ERROR_PARTIAL
2963
2964       The subject string did not match, but it did match partially.  See  the
2965       pcre2partial documentation for details of partial matching.
2966
2967         PCRE2_ERROR_BADMAGIC
2968
2969       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
2970       to catch the case when it is passed a junk pointer. This is  the  error
2971       that is returned when the magic number is not present.
2972
2973         PCRE2_ERROR_BADMODE
2974
2975       This  error is given when a compiled pattern is passed to a function in
2976       a library of a different code unit width, for example, a  pattern  com-
2977       piled  by  the  8-bit  library  is passed to a 16-bit or 32-bit library
2978       function.
2979
2980         PCRE2_ERROR_BADOFFSET
2981
2982       The value of startoffset was greater than the length of the subject.
2983
2984         PCRE2_ERROR_BADOPTION
2985
2986       An unrecognized bit was set in the options argument.
2987
2988         PCRE2_ERROR_BADUTFOFFSET
2989
2990       The UTF code unit sequence that was passed as a subject was checked and
2991       found  to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
2992       value of startoffset did not point to the beginning of a UTF  character
2993       or the end of the subject.
2994
2995         PCRE2_ERROR_CALLOUT
2996
2997       This  error  is never generated by pcre2_match() itself. It is provided
2998       for use by callout  functions  that  want  to  cause  pcre2_match()  or
2999       pcre2_callout_enumerate()  to  return a distinctive error code. See the
3000       pcre2callout documentation for details.
3001
3002         PCRE2_ERROR_DEPTHLIMIT
3003
3004       The nested backtracking depth limit was reached.
3005
3006         PCRE2_ERROR_HEAPLIMIT
3007
3008       The heap limit was reached.
3009
3010         PCRE2_ERROR_INTERNAL
3011
3012       An unexpected internal error has occurred. This error could  be  caused
3013       by a bug in PCRE2 or by overwriting of the compiled pattern.
3014
3015         PCRE2_ERROR_JIT_STACKLIMIT
3016
3017       This error is returned when a pattern that was  successfully  compiled
3018       using  JIT  is  being  matched,  but  the  memory  available  for  the
3019       just-in-time processing stack is not large enough.  See  the  pcre2jit
3020       documentation for more details.
3021
3022         PCRE2_ERROR_MATCHLIMIT
3023
3024       The backtracking match limit was reached.
3025
3026         PCRE2_ERROR_NOMEMORY
3027
3028       If a pattern contains many nested backtracking points, heap  memory  is
3029       used  to  remember them. This error is given when the memory allocation
3030       function (default or  custom)  fails.  Note  that  a  different  error,
3031       PCRE2_ERROR_HEAPLIMIT,  is given if the amount of memory needed exceeds
3032       the   heap   limit.   PCRE2_ERROR_NOMEMORY   is   also   returned    if
3033       PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
3034
3035         PCRE2_ERROR_NULL
3036
3037       Either the code, subject, or match_data argument was passed as NULL.
3038
3039         PCRE2_ERROR_RECURSELOOP
3040
3041       This  error  is  returned  when  pcre2_match() detects a recursion loop
3042       within the pattern. Specifically, it means that either the  whole  pat-
3043       tern or a capture group has been called recursively for the second time
3044       at the same position in the subject string. Some simple  patterns  that
3045       might  do  this are detected and faulted at compile time, but more com-
3046       plicated cases, in particular mutual recursions between  two  different
3047       groups, cannot be detected until matching is attempted.
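
       An application usually separates "no match" and "partial match" from
       the other negative codes before reporting an error. The sketch below
       shows one possible classification. It repeats the two numeric values
       that pcre2.h gives for PCRE2_ERROR_NOMATCH and PCRE2_ERROR_PARTIAL
       only so that it compiles on its own; real code should include the
       header instead. All names in it are hypothetical.

```c
/* Stand-ins for PCRE2_ERROR_NOMATCH and PCRE2_ERROR_PARTIAL. Real code
   should take these from pcre2.h rather than repeating the values. */
#define LOCAL_ERROR_NOMATCH (-1)
#define LOCAL_ERROR_PARTIAL (-2)

enum match_outcome {
    OUTCOME_MATCH, OUTCOME_NOMATCH, OUTCOME_PARTIAL, OUTCOME_ERROR
};

/* Classify the value returned by pcre2_match(). */
static enum match_outcome classify_match_result(int rc)
{
    if (rc >= 0) return OUTCOME_MATCH;    /* 0: matched, ovector too small */
    if (rc == LOCAL_ERROR_NOMATCH) return OUTCOME_NOMATCH;
    if (rc == LOCAL_ERROR_PARTIAL) return OUTCOME_PARTIAL;
    return OUTCOME_ERROR;                 /* report via pcre2_get_error_message() */
}
```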
3048
3049
3050OBTAINING A TEXTUAL ERROR MESSAGE
3051
3052       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
3053         PCRE2_SIZE bufflen);
3054
3055       A  text  message  for  an  error code from any PCRE2 function (compile,
3056       match, or auxiliary) can be obtained  by  calling  pcre2_get_error_mes-
3057       sage().  The  code  is passed as the first argument, with the remaining
3058       two arguments specifying a code unit buffer  and  its  length  in  code
3059       units,  into  which the text message is placed. The message is returned
3060       in code units of the appropriate width for the library  that  is  being
3061       used.
3062
3063       The  returned message is terminated with a trailing zero, and the func-
3064       tion returns the number of code  units  used,  excluding  the  trailing
3065       zero. If the error number is unknown, the negative error code PCRE2_ER-
3066       ROR_BADDATA is returned. If the buffer is too  small,  the  message  is
3067       truncated (but still with a trailing zero), and the negative error code
3068       PCRE2_ERROR_NOMEMORY is returned.  None of the messages are very  long;
3069       a buffer size of 120 code units is ample.
3070
3071
3072EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3073
3074       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
3075         uint32_t number, PCRE2_SIZE *length);
3076
3077       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
3078         uint32_t number, PCRE2_UCHAR *buffer,
3079         PCRE2_SIZE *bufflen);
3080
3081       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
3082         uint32_t number, PCRE2_UCHAR **bufferptr,
3083         PCRE2_SIZE *bufflen);
3084
3085       void pcre2_substring_free(PCRE2_UCHAR *buffer);
3086
3087       Captured  substrings  can  be accessed directly by using the ovector as
3088       described above.  For convenience, auxiliary functions are provided for
3089       extracting   captured  substrings  as  new,  separate,  zero-terminated
3090       strings. A substring that contains a binary zero is correctly extracted
3091       and  has  a  further  zero  added on the end, but the result is not, of
3092       course, a C string.
3093
3094       The functions in this section identify substrings by number. The number
3095       zero refers to the entire matched substring, with higher numbers refer-
3096       ring to substrings captured by parenthesized groups.  After  a  partial
3097       match,  only  substring  zero  is  available. An attempt to extract any
3098       other substring gives the error PCRE2_ERROR_PARTIAL. The  next  section
3099       describes similar functions for extracting captured substrings by name.
3100
3101       If  a  pattern uses the \K escape sequence within a positive assertion,
3102       the reported start of a successful match can be greater than the end of
3103       the  match.   For  example,  if the pattern (?=ab\K) is matched against
3104       "ab", the start and end offset values for the match are  2  and  0.  In
3105       this  situation,  calling  these functions with a zero substring number
3106       extracts a zero-length empty string.
3107
3108       You can find the length in code units of a captured  substring  without
3109       extracting  it  by calling pcre2_substring_length_bynumber(). The first
3110       argument is a pointer to the match data block, the second is the  group
3111       number,  and the third is a pointer to a variable into which the length
3112       is placed. If you just want to know whether or not  the  substring  has
3113       been captured, you can pass the third argument as NULL.
3114
3115       The  pcre2_substring_copy_bynumber()  function  copies  a captured sub-
3116       string into a supplied buffer,  whereas  pcre2_substring_get_bynumber()
3117       copies  it  into  new memory, obtained using the same memory allocation
3118       function that was used for the match data block. The  first  two  argu-
3119       ments  of  these  functions are a pointer to the match data block and a
3120       capture group number.
3121
3122       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
3123       the buffer and a pointer to a variable that contains its length in code
3124       units.  This is updated to contain the actual number of code units used
3125       for the extracted substring, excluding the terminating zero.
3126
3127       For pcre2_substring_get_bynumber() the third and fourth arguments point
3128       to variables that are updated with a pointer to the new memory and  the
3129       number  of  code units that comprise the substring, again excluding the
3130       terminating zero. When the substring is no longer  needed,  the  memory
3131       should be freed by calling pcre2_substring_free().
3132
3133       The  return  value  from  all these functions is zero for success, or a
3134       negative error code. If the pattern match  failed,  the  match  failure
3135       code  is returned.  If a substring number greater than zero is used af-
3136       ter a partial match, PCRE2_ERROR_PARTIAL is  returned.  Other  possible
3137       error codes are:
3138
3139         PCRE2_ERROR_NOMEMORY
3140
3141       The  buffer  was  too small for pcre2_substring_copy_bynumber(), or the
3142       attempt to get memory failed for pcre2_substring_get_bynumber().
3143
3144         PCRE2_ERROR_NOSUBSTRING
3145
3146       There is no substring with that number in the  pattern,  that  is,  the
3147       number is greater than the number of capturing parentheses.
3148
3149         PCRE2_ERROR_UNAVAILABLE
3150
3151       The substring number, though not greater than the number of captures in
3152       the pattern, is greater than the number of slots in the ovector, so the
3153       substring could not be captured.
3154
3155         PCRE2_ERROR_UNSET
3156
3157       The  substring  did  not  participate in the match. For example, if the
3158       pattern is (abc)|(def) and the subject is "def", and the  ovector  con-
3159       tains at least two capturing slots, substring number 1 is unset.
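
       The buffer length convention and most of the error returns above can
       be pictured with a library-independent sketch that works directly on
       an ovector-like array. It is not the PCRE2 implementation: the result
       codes are local to the example, UNSET stands in for PCRE2_UNSET, and
       the check against the number of groups in the pattern is left out
       because it needs the compiled pattern.

```c
#include <stddef.h>
#include <string.h>

#define UNSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* Local result codes for this sketch; they are NOT the PCRE2 values. */
enum { COPY_OK = 0, COPY_NOMEMORY = -1, COPY_UNAVAILABLE = -2, COPY_UNSET = -3 };

/* Copy group n of `subject` into `buffer`. On entry *bufflen is the buffer
   size; on success it becomes the substring length, excluding the
   terminating zero. pair_capacity is the number of pairs in the ovector. */
static int copy_group(const char *subject, const size_t *ovector,
                      size_t pair_capacity, size_t n,
                      char *buffer, size_t *bufflen)
{
    size_t start, end, len;

    if (n >= pair_capacity) return COPY_UNAVAILABLE;
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start == UNSET) return COPY_UNSET;
    len = (start > end) ? 0 : end - start;         /* \K can make start > end */
    if (len + 1 > *bufflen) return COPY_NOMEMORY;  /* room for the zero too */
    memcpy(buffer, subject + start, len);
    buffer[len] = '\0';
    *bufflen = len;
    return COPY_OK;
}
```

       Note that a buffer of exactly the substring length is too small,
       because the terminating zero needs one more code unit.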
3160
3161
3162EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
3163
3164       int pcre2_substring_list_get(pcre2_match_data *match_data,
3165         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
3166
3167       void pcre2_substring_list_free(PCRE2_SPTR *list);
3168
3169       The  pcre2_substring_list_get()  function  extracts  all available sub-
3170       strings and builds a list of pointers to  them.  It  also  (optionally)
3171       builds  a  second list that contains their lengths (in code units), ex-
3172       cluding a terminating zero that is added to each of them. All  this  is
3173       done in a single block of memory that is obtained using the same memory
3174       allocation function that was used to get the match data block.
3175
3176       This function must be called only after a successful match.  If  called
3177       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
3178
3179       The  address of the memory block is returned via listptr, which is also
3180       the start of the list of string pointers. The end of the list is marked
3181       by  a  NULL pointer. The address of the list of lengths is returned via
3182       lengthsptr. If your strings do not contain binary zeros and you do  not
3183       therefore need the lengths, you may supply NULL as the lengthsptr argu-
3184       ment to disable the creation of a list of lengths.  The  yield  of  the
3185       function  is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
3186       ory block could not be obtained. When the list is no longer needed,  it
3187       should be freed by calling pcre2_substring_list_free().
3188
3189       If this function encounters a substring that is unset, which can happen
3190       when capture group number n+1 matches some part  of  the  subject,  but
3191       group  n has not been used at all, it returns an empty string. This can
3192       be distinguished from a genuine zero-length substring by inspecting the
3193       appropriate offset in the ovector, which contains PCRE2_UNSET for unset
3194       substrings, or by calling pcre2_substring_length_bynumber().
3195
3196
3197EXTRACTING CAPTURED SUBSTRINGS BY NAME
3198
3199       int pcre2_substring_number_from_name(const pcre2_code *code,
3200         PCRE2_SPTR name);
3201
3202       int pcre2_substring_length_byname(pcre2_match_data *match_data,
3203         PCRE2_SPTR name, PCRE2_SIZE *length);
3204
3205       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
3206         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
3207
3208       int pcre2_substring_get_byname(pcre2_match_data *match_data,
3209         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
3210
3211       void pcre2_substring_free(PCRE2_UCHAR *buffer);
3212
3213       To extract a substring by name, you first have to find  the  associated
3214       number. For example, for this pattern:
3215
3216         (a+)b(?<xxx>\d+)...
3217
3218       the number of the capture group called "xxx" is 2. If the name is known
3219       to be unique (PCRE2_DUPNAMES was not set), you can find the number from
3220       the name by calling pcre2_substring_number_from_name(). The first argu-
3221       ment is the compiled pattern, and the second is the name. The yield  of
3222       the  function  is the group number, PCRE2_ERROR_NOSUBSTRING if there is
3223       no group with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if  there  is
3224       more  than one group with that name.  Given the number, you can extract
3225       the substring directly from the ovector, or use one of  the  "bynumber"
3226       functions described above.
3227
3228       For  convenience,  there are also "byname" functions that correspond to
3229       the "bynumber" functions, the only difference being that the second ar-
3230       gument  is  a  name  instead  of a number. If PCRE2_DUPNAMES is set and
3231       there are duplicate names, these functions scan all the groups with the
3232       given  name,  and  return  the  captured substring from the first named
3233       group that is set.
3234
3235       If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING  is
3236       returned.  If  all  groups  with the name have numbers that are greater
3237       than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
3238       turned.  If there is at least one group with a slot in the ovector, but
3239       no group is found to be set, PCRE2_ERROR_UNSET is returned.
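
       The selection rule for duplicate names and the three error cases can
       be sketched without the library. In the following, the name table is
       modelled as a simple array of name/number pairs, which is an
       assumption made for the example and not the format PCRE2 uses. The
       result codes are local, and UNSET stands in for PCRE2_UNSET.

```c
#include <stddef.h>
#include <string.h>

#define UNSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* Local result codes for this sketch; they are NOT the PCRE2 values. */
enum { NAME_NOSUBSTRING = -1, NAME_UNAVAILABLE = -2, NAME_UNSET = -3 };

/* Hypothetical flattened view of the pattern's name table. */
struct named_group { const char *name; size_t number; };

/* Return the number of the first group called `name` that is set, or a
   negative local code following the rules described in the text. */
static int first_set_group(const struct named_group *table, size_t entries,
                           const char *name, const size_t *ovector,
                           size_t pair_capacity)
{
    int result = NAME_NOSUBSTRING;
    size_t i;

    for (i = 0; i < entries; i++) {
        size_t n = table[i].number;
        if (strcmp(table[i].name, name) != 0) continue;
        if (n >= pair_capacity) {               /* no slot in the ovector */
            if (result == NAME_NOSUBSTRING) result = NAME_UNAVAILABLE;
            continue;
        }
        if (ovector[2 * n] != UNSET) return (int)n;
        result = NAME_UNSET;                    /* has a slot, but is unset */
    }
    return result;
}
```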
3240
3241       Warning: If the pattern uses the (?| feature to set up multiple capture
3242       groups  with  the same number, as described in the section on duplicate
3243       group numbers in the pcre2pattern page, you cannot use names to distin-
3244       guish  the  different capture groups, because names are not included in
3245       the compiled code. The matching process uses  only  numbers.  For  this
3246       reason,  the  use  of  different  names for groups with the same number
3247       causes an error at compile time.
3248
3249
3250CREATING A NEW STRING WITH SUBSTITUTIONS
3251
3252       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
3253         PCRE2_SIZE length, PCRE2_SIZE startoffset,
3254         uint32_t options, pcre2_match_data *match_data,
3255         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
3256         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
3257         PCRE2_SIZE *outlengthptr);
3258
3259       This function optionally calls pcre2_match() and then makes a  copy  of
3260       the  subject  string in outputbuffer, replacing parts that were matched
3261       with the replacement string, whose length is supplied in rlength.  This
3262       can  be  given  as  PCRE2_ZERO_TERMINATED for a zero-terminated string.
3263       There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
3264       turn  just  the replacement string(s). The default action is to perform
3265       just one replacement if the pattern matches, but  there  is  an  option
3266       that  requests  multiple  replacements (see PCRE2_SUBSTITUTE_GLOBAL be-
3267       low).
3268
3269       If successful, pcre2_substitute() returns the number  of  substitutions
3270       that  were  carried out. This may be zero if no match was found, and is
3271       never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set.  A  nega-
3272       tive value is returned if an error is detected.
3273
3274       Matches  in  which  a  \K item in a lookahead in the pattern causes the
3275       match to end before it starts are not supported, and give  rise  to  an
3276       error return. For global replacements, matches in which \K in a lookbe-
3277       hind causes the match to start earlier than the point that was  reached
3278       in the previous iteration are also not supported.
3279
3280       The  first  seven  arguments  of pcre2_substitute() are the same as for
3281       pcre2_match(), except that the partial matching options are not permit-
3282       ted,  and  match_data may be passed as NULL, in which case a match data
3283       block is obtained and freed within this function, using memory  manage-
3284       ment  functions from the match context, if provided, or else those that
3285       were used to allocate memory for the compiled code.
3286
3287       If match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set,  the
3288       provided block is used for all calls to pcre2_match(), and its contents
3289       afterwards are the result of the final call. For global  changes,  this
3290       will always be a no-match error. The contents of the ovector within the
3291       match data block may or may not have been changed.
3292
3293       As well as the usual options for pcre2_match(), a number of  additional
3294       options  can be set in the options argument of pcre2_substitute().  One
3295       such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an  external
3296       match_data  block  must  be provided, and it must have been used for an
3297       external call to pcre2_match(). The data in the match_data  block  (re-
3298       turn code, offset vector) is used for the first substitution instead of
3299       calling pcre2_match() from within pcre2_substitute().  This  allows  an
3300       application to check for a match before choosing to substitute, without
3301       having to repeat the match.
3302
3303       The contents of the  externally  supplied  match  data  block  are  not
3304       changed   when   PCRE2_SUBSTITUTE_MATCHED   is  set.  If  PCRE2_SUBSTI-
3305       TUTE_GLOBAL is also set, pcre2_match() is called after the  first  sub-
3306       stitution  to  check for further matches, but this is done using an in-
3307       ternally obtained match data block, thus always  leaving  the  external
3308       block unchanged.
3309
3310       The  code  argument is not used for matching before the first substitu-
3311       tion when PCRE2_SUBSTITUTE_MATCHED is set, but  it  must  be  provided,
3312       even  when  PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
3313       formation such as the UTF setting and the number of capturing parenthe-
3314       ses in the pattern.
3315
3316       The  default  action  of  pcre2_substitute() is to return a copy of the
3317       subject string with matched substrings replaced. However, if PCRE2_SUB-
3318       STITUTE_REPLACEMENT_ONLY  is  set,  only the replacement substrings are
3319       returned. In the global case, multiple replacements are concatenated in
3320       the  output  buffer.  Substitution  callouts (see below) can be used to
3321       separate them if necessary.
3322
3323       The outlengthptr argument of pcre2_substitute() must point to  a  vari-
3324       able  that contains the length, in code units, of the output buffer. If
3325       the function is successful, the value is updated to contain the  length
3326       in  code  units  of the new string, excluding the trailing zero that is
3327       automatically added.
3328
3329       If the function is not successful, the value set via  outlengthptr  de-
3330       pends  on  the  type  of  error.  For  syntax errors in the replacement
3331       string, the value is the offset in the replacement string where the er-
3332       ror  was  detected.  For  other errors, the value is PCRE2_UNSET by de-
3333       fault. This includes the case of the output buffer being too small, un-
3334       less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
3335
3336       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH  changes  what happens when the output
3337       buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3338       ORY  immediately.  If  this  option is set, however, pcre2_substitute()
3339       continues to go through the motions of matching and substituting (with-
3340       out,  of course, writing anything) in order to compute the size of buf-
3341       fer that is needed. This value is  passed  back  via  the  outlengthptr
3342       variable,  with  the  result  of  the  function  still  being PCRE2_ER-
3343       ROR_NOMEMORY.
3344
3345       Passing a buffer size of zero is a permitted way  of  finding  out  how
3346       much memory is needed for a given substitution. However, this does mean
3347       that the entire operation is carried out twice. Depending on the appli-
3348       cation,  it  may  be more efficient to allocate a large buffer and free
3349       the  excess  afterwards,  instead   of   using   PCRE2_SUBSTITUTE_OVER-
3350       FLOW_LENGTH.
3351
3352       The  replacement  string,  which  is interpreted as a UTF string in UTF
3353       mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set.  An
3354       invalid UTF replacement string causes an immediate return with the rel-
3355       evant UTF error code.
3356
3357       If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is  not  in-
3358       terpreted in any way. By default, however, a dollar character is an es-
3359       cape character that can specify the insertion of characters  from  cap-
3360       ture  groups  and names from (*MARK) or other control verbs in the pat-
3361       tern. The following forms are always recognized:
3362
3363         $$                  insert a dollar character
3364         $<n> or ${<n>}      insert the contents of group <n>
3365         $*MARK or ${*MARK}  insert a control verb name
3366
3367       Either a group number or a group name  can  be  given  for  <n>.  Curly
3368       brackets  are  required only if the following character would be inter-
3369       preted as part of the number or name. The number may be zero to include
3370       the  entire  matched  string.   For  example,  if  the pattern a(b)c is
3371       matched with "=abc=" and the replacement string "+$1$0$1+", the  result
3372       is "=+babcb+=".
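
       To make the three forms concrete, here is a toy expander that handles
       only $$, $<digits> and ${<digits>}. It is a sketch for illustration
       and not how pcre2_substitute() is implemented: group names, $*MARK
       and unset groups are not handled, and the captured strings are passed
       in as an ordinary array with the whole match at index 0.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Expand $$, $<digits> and ${<digits>} in `repl` using groups[0..ngroups-1].
   Returns 0 on success, or -1 on a syntax error, an unknown group number,
   or lack of space in `out`. Toy code, not PCRE2's implementation. */
static int expand_replacement(const char *repl, const char *const *groups,
                              size_t ngroups, char *out, size_t outsize)
{
    size_t used = 0;

    while (*repl != '\0') {
        const char *text = repl;   /* default: copy one literal character */
        size_t len = 1;

        if (*repl == '$') {
            int braced;
            size_t n = 0;

            repl++;
            if (*repl == '$') {
                text = "$";        /* $$ inserts a dollar character */
                repl++;
            } else {
                braced = (*repl == '{');
                if (braced) repl++;
                if (!isdigit((unsigned char)*repl)) return -1;
                while (isdigit((unsigned char)*repl))
                    n = n * 10 + (size_t)(*repl++ - '0');
                if (braced && *repl++ != '}') return -1;
                if (n >= ngroups) return -1;
                text = groups[n];
                len = strlen(text);
            }
        } else {
            repl++;
        }
        if (used + len + 1 > outsize) return -1;
        memcpy(out + used, text, len);
        used += len;
    }
    if (outsize == 0) return -1;
    out[used] = '\0';
    return 0;
}
```

       With groups {"abc", "b"}, as captured by a(b)c, the replacement
       "+$1$0$1+" expands to "+babcb+", which is the text placed between
       the two "=" characters in the example above.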
3373
3374       $*MARK  inserts the name from the last encountered backtracking control
3375       verb on the matching path that has a name. (*MARK) must always  include
3376       a  name,  but  the  other  verbs  need not. For example, in the case of
3377       (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B)
3378       the  relevant  name is "B". This facility can be used to perform simple
3379       simultaneous substitutions, as this pcre2test example shows:
3380
3381         /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
3382             apple lemon
3383          2: pear orange
3384
3385       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
3386       string,  replacing every matching substring. If this option is not set,
3387       only the first matching substring is replaced. The search  for  matches
3388       takes  place in the original subject string (that is, previous replace-
3389       ments do not affect it).  Iteration is  implemented  by  advancing  the
3390       startoffset  value  for  each search, which is always passed the entire
3391       subject string. If an offset limit is set in the match context, search-
3392       ing stops when that limit is reached.
3393
3394       You  can  restrict  the effect of a global substitution to a portion of
3395       the subject string by setting either or both of startoffset and an off-
3396       set limit. Here is a pcre2test example:
3397
3398         /B/g,replace=!,use_offset_limit
3399         ABC ABC ABC ABC\=offset=3,offset_limit=12
3400          2: ABC A!C A!C ABC
3401
3402       When  continuing  with  global substitutions after matching a substring
3403       with zero length, an attempt to find a non-empty match at the same off-
3404       set is performed.  If this is not successful, the offset is advanced by
3405       one character except when CRLF is a valid newline sequence and the next
3406       two  characters are CR, LF. In this case, the offset is advanced by two
3407       characters.
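
       The advance rule can be written as a small function. This sketch
       assumes one code unit per character (that is, no UTF mode) and an
       8-bit subject; advance_after_empty() is a hypothetical name.

```c
#include <stddef.h>

/* Offset at which to continue after an empty match at `offset` when no
   non-empty match was found there. crlf_is_newline is nonzero when CRLF
   is a valid newline sequence. Assumes one code unit per character. */
static size_t advance_after_empty(const char *subject, size_t length,
                                  size_t offset, int crlf_is_newline)
{
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;   /* step over CR LF as a unit */
    return offset + 1;
}
```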
3408
3409       PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that
3410       do not appear in the pattern to be treated as unset groups. This option
3411       should be used with care, because it means that a typo in a group  name
3412       or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.
3413
3414       PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
3415       known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be  treated
3416       as  empty  strings  when inserted as described above. If this option is
3417       not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
3418       SET  error.  This  option  does not influence the extended substitution
3419       syntax described below.
3420
3421       PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to  the
3422       replacement  string.  Without this option, only the dollar character is
3423       special, and only the group insertion forms  listed  above  are  valid.
3424       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
3425
3426       Firstly,  backslash in a replacement string is interpreted as an escape
3427       character. The usual forms such as \n or \x{ddd} can be used to specify
3428       particular  character codes, and backslash followed by any non-alphanu-
3429       meric character quotes that character. Extended quoting  can  be  coded
3430       using \Q...\E, exactly as in pattern strings.
3431
3432       There  are  also four escape sequences for forcing the case of inserted
3433       letters.  The insertion mechanism has three states:  no  case  forcing,
3434       force upper case, and force lower case. The escape sequences change the
3435       current state: \U and \L change to upper or lower case forcing, respec-
3436       tively,  and  \E (when not terminating a \Q quoted sequence) reverts to
3437       no case forcing. The sequences \u and \l force the next  character  (if
3438       it  is  a  letter)  to  upper or lower case, respectively, and then the
3439       state automatically reverts to no case forcing. Case forcing applies to
3440       all  inserted  characters, including those from capture groups and let-
3441       ters within \Q...\E quoted sequences. If either PCRE2_UTF or  PCRE2_UCP
3442       was  set when the pattern was compiled, Unicode properties are used for
3443       case forcing characters whose code points are greater than 127.
3444
3445       Note that case forcing sequences such as \U...\E do not nest. For exam-
3446       ple,  the  result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
3447       \E has no effect. Note  also  that  the  PCRE2_ALT_BSUX  and  PCRE2_EX-
3448       TRA_ALT_BSUX options do not apply to replacement strings.
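
       The case forcing rules can be illustrated by a small stand-alone model.
       The  following is a sketch under stated assumptions, not PCRE2's imple-
       mentation: the function and enum names are hypothetical, only ASCII in-
       put  is  handled,  and escapes other than \U, \L, \E, \u and \l (and in
       particular \Q...\E quoting) are not modelled.

```c
#include <ctype.h>

/* Hypothetical names. The two *_ONCE states model \u and \l, which
   affect one character and then revert to no case forcing. */
enum case_state { NO_FORCING, FORCE_UPPER, FORCE_LOWER,
                  UPPER_ONCE, LOWER_ONCE };

/* Applies the case forcing escapes found in "in" and writes the result
   to "out", which must have room for at least strlen(in) + 1 bytes. */
void force_case(const char *in, char *out)
{
    enum case_state state = NO_FORCING;

    while (*in != '\0') {
        if (in[0] == '\\' && in[1] != '\0') {
            int handled = 1;
            switch (in[1]) {
            case 'U': state = FORCE_UPPER; break;
            case 'L': state = FORCE_LOWER; break;
            case 'E': state = NO_FORCING;  break;
            case 'u': state = UPPER_ONCE;  break;
            case 'l': state = LOWER_ONCE;  break;
            default:  handled = 0;         break;  /* not modelled */
            }
            if (handled) { in += 2; continue; }
        }
        unsigned char c = (unsigned char)*in++;
        if (state == FORCE_UPPER || state == UPPER_ONCE)
            c = (unsigned char)toupper(c);
        else if (state == FORCE_LOWER || state == LOWER_ONCE)
            c = (unsigned char)tolower(c);
        if (state == UPPER_ONCE || state == LOWER_ONCE)
            state = NO_FORCING;   /* one-shot forcing has been used up */
        *out++ = (char)c;
    }
    *out = '\0';
}
```

       Because  there  is  a single current state rather than a stack, a later
       \L simply replaces an earlier \U, which is why the sequences do not ap-
       pear to nest.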
3449
3450       The  second  effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
3451       flexibility to capture group substitution. The  syntax  is  similar  to
3452       that used by Bash:
3453
3454         ${<n>:-<string>}
3455         ${<n>:+<string1>:<string2>}
3456
3457       As  before,  <n> may be a group number or a name. The first form speci-
3458       fies a default value. If group <n> is set, its value  is  inserted;  if
3459       not,  <string>  is  expanded  and  the result inserted. The second form
3460       specifies strings that are expanded and inserted when group <n> is  set
3461       or  unset,  respectively. The first form is just a convenient shorthand
3462       for
3463
3464         ${<n>:+${<n>}:<string>}
3465
3466       Backslash can be used to escape colons and closing  curly  brackets  in
3467       the  replacement  strings.  A change of the case forcing state within a
3468       replacement string remains  in  force  afterwards,  as  shown  in  this
3469       pcre2test example:
3470
3471         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
3472             body
3473          1: hello
3474             somebody
3475          1: HELLO
3476
3477       The  PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
3478       substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does  cause  un-
3479       known groups in the extended syntax forms to be treated as unset.
3480
3481       If  PCRE2_SUBSTITUTE_LITERAL  is  set,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
3482       PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3483       vant and are ignored.
3484
3485   Substitution errors
3486
3487       In  the  event of an error, pcre2_substitute() returns a negative error
3488       code. Except for PCRE2_ERROR_NOMATCH (which is never returned),  errors
3489       from pcre2_match() are passed straight back.
3490
3491       PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3492       tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
3493
3494       PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3495       ing  an  unknown  substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
3496       when the simple (non-extended) syntax is used and  PCRE2_SUBSTITUTE_UN-
3497       SET_EMPTY is not set.
3498
3499       PCRE2_ERROR_NOMEMORY  is  returned  if  the  output  buffer  is not big
3500       enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
3501       of  buffer  that is needed is returned via outlengthptr. Note that this
3502       does not happen by default.
3503
3504       PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
3505       match_data argument is NULL.
3506
3507       PCRE2_ERROR_BADREPLACEMENT  is  used for miscellaneous syntax errors in
3508       the replacement string, with more  particular  errors  being  PCRE2_ER-
3509       ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE
3510       (closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION  (syntax
3511       error  in  extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN
3512       (the pattern match ended before it started or the match started earlier
3513       than  the  current  position  in the subject, which can happen if \K is
3514       used in an assertion).
3515
3516       As for all PCRE2 errors, a text message that describes the error can be
3517       obtained  by  calling  the pcre2_get_error_message() function (see "Ob-
3518       taining a textual error message" above).
3519
3520   Substitution callouts
3521
3522       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
3523         int (*callout_function)(pcre2_substitute_callout_block *, void *),
3524         void *callout_data);
3525
3526       The  pcre2_set_substitute_callout()  function can be used to specify  a
3527       callout  function for pcre2_substitute(). This information is passed in
3528       a match context. The callout function is called after each substitution
3529       has been processed, but it can cause the replacement not to happen. The
3530       callout function is not called for simulated substitutions that  happen
3531       as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.
3532
3533       The first argument of the callout function is a pointer to a substitute
3534       callout block structure, which contains the following fields, not  nec-
3535       essarily in this order:
3536
3537         uint32_t    version;
3538         uint32_t    subscount;
3539         PCRE2_SPTR  input;
3540         PCRE2_SPTR  output;
3541         PCRE2_SIZE *ovector;
3542         uint32_t    oveccount;
3543         PCRE2_SIZE  output_offsets[2];
3544
3545       The  version field contains the version number of the block format. The
3546       current version is 0. The version number will  increase  in  future  if
3547       more  fields are added, but the intention is never to remove any of the
3548       existing fields.
3549
3550       The subscount field is the number of the current match. It is 1 for the
3551       first callout, 2 for the second, and so on. The input and output point-
3552       ers are copies of the values passed to pcre2_substitute().
3553
3554       The ovector field points to the ovector, which contains the  result  of
3555       the most recent match. The oveccount field contains the number of pairs
3556       that are set in the ovector, and is always greater than zero.
3557
3558       The output_offsets vector contains the offsets of  the  replacement  in
3559       the  output  string. This has already been processed for dollar and (if
3560       requested) backslash substitutions as described above.
3561
3562       The second argument of the callout function  is  the  value  passed  as
3563       callout_data  when  the  function was registered. The value returned by
3564       the callout function is interpreted as follows:
3565
3566       If the value is zero, the replacement is accepted, and,  if  PCRE2_SUB-
3567       STITUTE_GLOBAL  is set, processing continues with a search for the next
3568       match. If the value is not zero, the current  replacement  is  not  ac-
3569       cepted.  If  the  value is greater than zero, processing continues when
3570       PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than  zero
3571       or  PCRE2_SUBSTITUTE_GLOBAL  is  not  set),  the  rest of the  input is
3572       copied to the output and the call to pcre2_substitute() exits,  return-
3573       ing the number of matches so far.
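
       The rules for interpreting the callout's return value can be summarized
       as a small decision function. This is only a sketch of the rules stated
       above: the enum and function names are hypothetical and are not part of
       the PCRE2 API, and the global argument stands for whether PCRE2_SUBSTI-
       TUTE_GLOBAL was set.

```c
/* Hypothetical names, used only to illustrate the rules above. */
enum subst_action {
    SUBST_ACCEPT_CONTINUE,  /* keep replacement, search for next match */
    SUBST_ACCEPT_STOP,      /* keep replacement, no further matching */
    SUBST_REJECT_CONTINUE,  /* drop replacement, search for next match */
    SUBST_REJECT_STOP       /* drop replacement, copy the rest and exit */
};

/* rc is the value returned by the substitution callout function;
   global is non-zero when PCRE2_SUBSTITUTE_GLOBAL is in effect. */
enum subst_action classify_callout_result(int rc, int global)
{
    if (rc == 0)
        return global ? SUBST_ACCEPT_CONTINUE : SUBST_ACCEPT_STOP;
    if (rc > 0 && global)
        return SUBST_REJECT_CONTINUE;
    return SUBST_REJECT_STOP;   /* rc < 0, or a non-global call */
}
```

       A  callout  that wants to skip one match but keep scanning therefore
       returns a positive value, while a negative value abandons the rest of
       the subject.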
3574
3575
3576DUPLICATE CAPTURE GROUP NAMES
3577
3578       int pcre2_substring_nametable_scan(const pcre2_code *code,
3579         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
3580
3581       When  a  pattern  is compiled with the PCRE2_DUPNAMES option, names for
3582       capture groups are not required to be unique. Duplicate names  are  al-
3583       ways  allowed for groups with the same number, created by using the (?|
3584       feature. Indeed, if such groups are named, they are required to use the
3585       same names.
3586
3587       Normally,  patterns  that  use duplicate names are such that in any one
3588       match, only one of each set of identically-named  groups  participates.
3589       An example is shown in the pcre2pattern documentation.
3590
3591       When   duplicates   are   present,   pcre2_substring_copy_byname()  and
3592       pcre2_substring_get_byname() return the first  substring  corresponding
3593       to  the given name that is set. Only if none are set is PCRE2_ERROR_UN-
3594       SET  returned.  The  pcre2_substring_number_from_name()  function   re-
3595       turns  the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate
3596       names.
3597
3598       If you want to get full details of all captured substrings for a  given
3599       name,  you  must use the pcre2_substring_nametable_scan() function. The
3600       first argument is the compiled pattern, and the second is the name.  If
3601       the  third  and fourth arguments are NULL, the function returns a group
3602       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
3603
3604       When the third and fourth arguments are not NULL, they must be pointers
3605       to  variables  that are updated by the function. After it has run, they
3606       point to the first and last entries in the name-to-number table for the
3607       given  name,  and the function returns the length of each entry in code
3608       units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there  are
3609       no entries for the given name.
3610
3611       The format of the name table is described above in the section entitled
3612       Information about a pattern. Given all the  relevant  entries  for  the
3613       name,  you  can  extract  each of their numbers, and hence the captured
3614       data.
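
       As  a  sketch  of that last step, the following function walks the name
       table entries between the first and last pointers and collects the cor-
       responding group numbers. It assumes the 8-bit library, where each en-
       try starts with a two-byte group number, most  significant  byte  first,
       followed  by  the zero-terminated name. The function name and its output
       parameters are hypothetical; entry_size is the value that was  returned
       by pcre2_substring_nametable_scan().

```c
#include <stddef.h>

/* Illustrative helper for the 8-bit library only. Stores up to
   max_numbers group numbers from the entries first..last (inclusive)
   and returns how many were stored. */
size_t collect_group_numbers(const unsigned char *first,
                             const unsigned char *last,
                             int entry_size,
                             unsigned int *numbers, size_t max_numbers)
{
    size_t count = 0;
    const unsigned char *entry = first;

    while (count < max_numbers) {
        /* Two-byte group number, most significant byte first. */
        numbers[count++] = ((unsigned int)entry[0] << 8) | entry[1];
        if (entry == last)
            break;               /* that was the final matching entry */
        entry += entry_size;     /* entries for one name are adjacent */
    }
    return count;
}
```

       Each  number  obtained in this way can then be passed to one of the by-
       number substring functions to retrieve the captured data, if any.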
3615
3616
3617FINDING ALL POSSIBLE MATCHES AT ONE POSITION
3618
3619       The traditional matching function uses a  similar  algorithm  to  Perl,
3620       which  stops when it finds the first match at a given point in the sub-
3621       ject. If you want to find all possible matches, or the longest possible
3622       match  at  a  given  position,  consider using the alternative matching
3623       function (see below) instead. If you cannot use the  alternative  func-
3624       tion, you can kludge it up by making use of the callout facility, which
3625       is described in the pcre2callout documentation.
3626
3627       What you have to do is to insert a callout right at the end of the pat-
3628       tern.   When your callout function is called, extract and save the cur-
3629       rent matched substring. Then return 1, which  forces  pcre2_match()  to
3630       backtrack  and  try other alternatives. Ultimately, when it runs out of
3631       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
3632
3633
3634MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
3635
3636       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
3637         PCRE2_SIZE length, PCRE2_SIZE startoffset,
3638         uint32_t options, pcre2_match_data *match_data,
3639         pcre2_match_context *mcontext,
3640         int *workspace, PCRE2_SIZE wscount);
3641
3642       The function pcre2_dfa_match() is called  to  match  a  subject  string
3643       against  a  compiled pattern, using a matching algorithm that scans the
3644       subject string just once (not counting lookaround assertions), and does
3645       not  backtrack.  This has different characteristics to the normal algo-
3646       rithm, and is not compatible with Perl. Some of the features  of  PCRE2
3647       patterns  are  not  supported.  Nevertheless, there are times when this
3648       kind of matching can be useful. For a discussion of  the  two  matching
3649       algorithms, and a list of features that pcre2_dfa_match() does not sup-
3650       port, see the pcre2matching documentation.
3651
3652       The arguments for the pcre2_dfa_match() function are the  same  as  for
3653       pcre2_match(), plus two extras. The ovector within the match data block
3654       is used in a different way, and this is described below. The other com-
3655       mon  arguments  are used in the same way as for pcre2_match(), so their
3656       description is not repeated here.
3657
3658       The two additional arguments provide workspace for  the  function.  The
3659       workspace  vector  should  contain at least 20 elements. It is used for
3660       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
3661       workspace  is needed for patterns and subjects where there are a lot of
3662       potential matches.
3663
3664       Here is an example of a simple call to pcre2_dfa_match():
3665
3666         int wspace[20];
3667         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
3668         int rc = pcre2_dfa_match(
3669           re,             /* result of pcre2_compile() */
3670           "some string",  /* the subject string */
3671           11,             /* the length of the subject string */
3672           0,              /* start at offset 0 in the subject */
3673           0,              /* default options */
3674           md,             /* the match data block */
3675           NULL,           /* a match context; NULL means use defaults */
3676           wspace,         /* working space vector */
3677           20);            /* number of elements (NOT size in bytes) */
3678
3679   Option bits for pcre2_dfa_match()
3680
3681       The unused bits of the options argument for pcre2_dfa_match()  must  be
3682       zero.   The   only   bits   that   may   be   set  are  PCRE2_ANCHORED,
3683       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL,  PCRE2_NO-
3684       TEOL,   PCRE2_NOTEMPTY,   PCRE2_NOTEMPTY_ATSTART,   PCRE2_NO_UTF_CHECK,
3685       PCRE2_PARTIAL_HARD,   PCRE2_PARTIAL_SOFT,    PCRE2_DFA_SHORTEST,    and
3686       PCRE2_DFA_RESTART.  All but the last four of these are exactly the same
3687       as for pcre2_match(), so their description is not repeated here.
3688
3689         PCRE2_PARTIAL_HARD
3690         PCRE2_PARTIAL_SOFT
3691
3692       These have the same general effect as they do  for  pcre2_match(),  but
3693       the  details are slightly different. When PCRE2_PARTIAL_HARD is set for
3694       pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if  the  end  of  the
3695       subject is reached and there is still at least one matching possibility
3696       that requires additional characters. This happens even if some complete
3697       matches  have  already  been found. When PCRE2_PARTIAL_SOFT is set, the
3698       return code PCRE2_ERROR_NOMATCH is converted  into  PCRE2_ERROR_PARTIAL
3699       if  the  end  of  the  subject  is reached, there have been no complete
3700       matches, but there is still at least one matching possibility. The por-
3701       tion  of  the  string that was inspected when the longest partial match
3702       was found is set as the first matching string in both cases. There is a
3703       more  detailed  discussion  of partial and multi-segment matching, with
3704       examples, in the pcre2partial documentation.
3705
3706         PCRE2_DFA_SHORTEST
3707
3708       Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm  to
3709       stop as soon as it has found one match. Because of the way the alterna-
3710       tive algorithm works, this is necessarily the shortest  possible  match
3711       at the first possible matching point in the subject string.
3712
3713         PCRE2_DFA_RESTART
3714
3715       When  pcre2_dfa_match() returns a partial match, it is possible to call
3716       it again, with additional subject characters, and have it continue with
3717       the same match. The PCRE2_DFA_RESTART option requests this action; when
3718       it is set, the workspace and wscount arguments must reference the  same
3719       vector  as  before  because data about the match so far is left in them
3720       after a partial match. There is more discussion of this facility in the
3721       pcre2partial documentation.
3722
3723   Successful returns from pcre2_dfa_match()
3724
3725       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3726       string in the subject. Note, however, that all the matches from one run
3727       of  the  function  start  at the same point in the subject. The shorter
3728       matches are all initial substrings of the longer matches. For  example,
3729       if the pattern
3730
3731         <.*>
3732
3733       is matched against the string
3734
3735         This is <something> <something else> <something further> no more
3736
3737       the three matched strings are
3738
3739         <something> <something else> <something further>
3740         <something> <something else>
3741         <something>
3742
3743       On  success,  the  yield of the function is a number greater than zero,
3744       which is the number of matched substrings.  The  offsets  of  the  sub-
3745       strings  are returned in the ovector, and can be extracted by number in
3746       the same way as for pcre2_match(), but the numbers bear no relation  to
3747       any  capture groups that may exist in the pattern, because DFA matching
3748       does not support capturing.
3749
3750       Calls to the convenience functions that extract substrings by name  re-
3751       turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
3752       ter a DFA match. The convenience functions that extract  substrings  by
3753       number never return PCRE2_ERROR_NOSUBSTRING.
3754
3755       The  matched  strings  are  stored  in  the ovector in reverse order of
3756       length; that is, the longest matching string is first.  If  there  were
3757       too  many matches to fit into the ovector, the yield of the function is
3758       zero, and the vector is filled with the longest matches.
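
       The layout of the ovector after a DFA match can be illustrated  with  a
       stand-alone  helper that copies one of the matched strings out of a sub-
       ject. This is a sketch, not part of PCRE2: the function name is hypo-
       thetical,  and  it works on a plain array of offsets so that it can be
       tried without the library. In a real program the offsets would be ob-
       tained  with  pcre2_get_ovector_pointer(),  and a yield of zero would
       have to be replaced by the number of pairs in the ovector.

```c
#include <stddef.h>
#include <string.h>

/* Copies match number idx (0 is the longest match) into buf as a
   zero-terminated string. ovector holds match_count pairs of start and
   end offsets. Returns the length copied, or -1 if idx is out of range
   or buf is too small. */
int copy_dfa_match(const char *subject, const size_t *ovector,
                   int match_count, int idx, char *buf, size_t bufsize)
{
    if (idx < 0 || idx >= match_count)
        return -1;

    size_t start = ovector[2 * idx];      /* same for every match */
    size_t end = ovector[2 * idx + 1];    /* decreases as idx grows */
    size_t len = end - start;

    if (len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

       For the <.*> example above, the three pairs  would  be  (8,56),  (8,36)
       and  (8,19),  so index 0 selects the longest string and index 2 selects
       "<something>".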
3759
3760       NOTE: PCRE2's "auto-possessification" optimization usually  applies  to
3761       character  repeats at the end of a pattern (as well as internally). For
3762       example, the pattern "a\d+" is compiled as if it were "a\d++". For  DFA
3763       matching,  this means that only one possible match is found. If you re-
3764       ally do want multiple matches in such cases, either use an ungreedy re-
3765       peat  such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com-
3766       piling.
3767
3768   Error returns from pcre2_dfa_match()
3769
3770       The pcre2_dfa_match() function returns a negative number when it fails.
3771       Many  of  the  errors  are  the same as for pcre2_match(), as described
3772       above.  There are in addition the following errors that are specific to
3773       pcre2_dfa_match():
3774
3775         PCRE2_ERROR_DFA_UITEM
3776
3777       This  return  is  given  if pcre2_dfa_match() encounters an item in the
3778       pattern that it does not support, for instance, the use of \C in a  UTF
3779       mode or a backreference.
3780
3781         PCRE2_ERROR_DFA_UCOND
3782
3783       This  return  is given if pcre2_dfa_match() encounters a condition item
3784       that uses a backreference for the condition, or a test for recursion in
3785       a specific capture group. These are not supported.
3786
3787         PCRE2_ERROR_DFA_UINVALID_UTF
3788
3789       This  return is given if pcre2_dfa_match() is called for a pattern that
3790       was compiled with PCRE2_MATCH_INVALID_UTF. This is  not  supported  for
3791       DFA matching.
3792
3793         PCRE2_ERROR_DFA_WSSIZE
3794
3795       This  return  is  given  if  pcre2_dfa_match() runs out of space in the
3796       workspace vector.
3797
3798         PCRE2_ERROR_DFA_RECURSE
3799
3800       When a recursion or subroutine call is processed, the matching function
3801       calls  itself  recursively,  using  private  memory for the ovector and
3802       workspace.  This error is given if the internal ovector  is  not  large
3803       enough.  This  should  be  extremely  rare, as a vector of size 1000 is
3804       used.
3805
3806         PCRE2_ERROR_DFA_BADRESTART
3807
3808       When pcre2_dfa_match() is called  with  the  PCRE2_DFA_RESTART  option,
3809       some  plausibility  checks  are  made on the contents of the workspace,
3810       which should contain data about the previous partial match. If  any  of
3811       these checks fail, this error is given.
3812
3813
3814SEE ALSO
3815
3816       pcre2build(3),    pcre2callout(3),    pcre2demo(3),   pcre2matching(3),
3817       pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
3818
3819
3820AUTHOR
3821
3822       Philip Hazel
3823       University Computing Service
3824       Cambridge, England.
3825
3826
3827REVISION
3828
3829       Last updated: 04 November 2020
3830       Copyright (c) 1997-2020 University of Cambridge.
3831------------------------------------------------------------------------------
3832
3833
3834PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)
3835
3836
3837
3838NAME
3839       PCRE2 - Perl-compatible regular expressions (revised API)
3840
3841BUILDING PCRE2
3842
3843       PCRE2  is distributed with a configure script that can be used to build
3844       the library in Unix-like environments using the applications  known  as
3845       Autotools. Also in the distribution are files to support building using
3846       CMake instead of configure. The text file README contains  general  in-
3847       formation  about building with Autotools (some of which is repeated be-
3848       low), and also has some comments about building  on  various  operating
3849       systems.  There  is a lot more information about building PCRE2 without
3850       using Autotools (including information about using CMake  and  building
3851       "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
3852       consult this file as well as the README file if you are building  in  a
3853       non-Unix-like environment.
3854
3855
3856PCRE2 BUILD-TIME OPTIONS
3857
3858       The rest of this document describes the optional features of PCRE2 that
3859       can be selected when the library is compiled. It  assumes  use  of  the
3860       configure  script,  where  the  optional features are selected or dese-
3861       lected by providing options to configure before running the  make  com-
3862       mand.  However,  the same options can be selected in both Unix-like and
3863       non-Unix-like environments if you are using CMake instead of  configure
3864       to build PCRE2.
3865
3866       If  you  are not using Autotools or CMake, option selection can be done
3867       by editing the config.h file, or by passing parameter settings  to  the
3868       compiler, as described in NON-AUTOTOOLS-BUILD.
3869
3870       The complete list of options for configure (which includes the standard
3871       ones such as the selection of the installation directory)  can  be  ob-
3872       tained by running
3873
3874         ./configure --help
3875
3876       The  following  sections include descriptions of "on/off" options whose
3877       names begin with --enable or --disable. Because of the way that config-
3878       ure  works, --enable and --disable always come in pairs, so the comple-
3879       mentary option always exists as well, but as it specifies the  default,
3880       it is not described.  Options that specify values have names that start
3881       with --with. At the end of a configure run, a summary of the configura-
3882       tion is output.
3883
3884
3885BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3886
3887       By  default, a library called libpcre2-8 is built, containing functions
3888       that take string arguments contained in arrays  of  bytes,  interpreted
3889       either  as single-byte characters, or UTF-8 strings. You can also build
3890       two other libraries, called libpcre2-16 and libpcre2-32, which  process
3891       strings  that  are contained in arrays of 16-bit and 32-bit code units,
3892       respectively. These can be interpreted either as single-unit characters
3893       or  UTF-16/UTF-32 strings. To build these additional libraries, add one
3894       or both of the following to the configure command:
3895
3896         --enable-pcre2-16
3897         --enable-pcre2-32
3898
3899       If you do not want the 8-bit library, add
3900
3901         --disable-pcre2-8
3902
3903       as well. At least one of the three libraries must be built.  Note  that
3904       the  POSIX wrapper is for the 8-bit library only, and that pcre2grep is
3905       an 8-bit program. Neither of these is built  if  you  select  only  the
3906       16-bit or 32-bit libraries.
3907
3908
3909BUILDING SHARED AND STATIC LIBRARIES
3910
3911       The  Autotools PCRE2 building process uses libtool to build both shared
3912       and static libraries by default. You can suppress an  unwanted  library
3913       by adding one of
3914
3915         --disable-shared
3916         --disable-static
3917
3918       to the configure command.
3919
3920
3921UNICODE AND UTF SUPPORT
3922
3923       By  default,  PCRE2 is built with support for Unicode and UTF character
3924       strings.  To build it without Unicode support, add
3925
3926         --disable-unicode
3927
3928       to the configure command. This setting applies to all three  libraries.
3929       It  is  not  possible to build one library with Unicode support and an-
3930       other without in the same configuration.
3931
3932       Of itself, Unicode support does not make PCRE2 treat strings as  UTF-8,
3933       UTF-16 or UTF-32. To do that, applications that use the library can set
3934       the PCRE2_UTF option when they call pcre2_compile() to compile  a  pat-
3935       tern.   Alternatively,  patterns  may be started with (*UTF) unless the
3936       application has locked this out by setting PCRE2_NEVER_UTF.
3937
3938       UTF support allows the libraries to process character code points up to
3939       0x10ffff  in  the  strings that they handle. Unicode support also gives
3940       access to the Unicode properties of characters, using  pattern  escapes
3941       such as \P, \p, and \X. Only the general category properties such as Lu
3942       and Nd are supported. Details are given in the pcre2pattern  documenta-
3943       tion.
3944
3945       Pattern escapes such as \d and \w do not by default make use of Unicode
3946       properties. The application can request that they  do  by  setting  the
3947       PCRE2_UCP  option.  Unless  the  application has set PCRE2_NEVER_UCP, a
3948       pattern may also request this by starting with (*UCP).
3949
3950
3951DISABLING THE USE OF \C
3952
3953       The \C escape sequence, which matches a single code unit, even in a UTF
3954       mode,  can  cause unpredictable behaviour because it may leave the cur-
3955       rent matching point in the middle of a multi-code-unit  character.  The
3956       application  can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
3957       tion when calling pcre2_compile(). There is also a build-time option
3958
3959         --enable-never-backslash-C
3960
3961       (note the upper case C) which locks out the use of \C entirely.
3962
3963
3964JUST-IN-TIME COMPILER SUPPORT
3965
3966       Just-in-time (JIT) compiler support is included in the build by  speci-
3967       fying
3968
3969         --enable-jit
3970
3971       This  support  is available only for certain hardware architectures. If
3972       this option is set for an unsupported architecture,  a  building  error
3973       occurs.  If in doubt, use
3974
3975         --enable-jit=auto
3976
3977       which  enables  JIT  only if the current hardware is supported. You can
3978       check if JIT is enabled in the configuration summary that is output  at
3979       the  end  of a configure run. If you are enabling JIT under SELinux you
3980       may also want to add
3981
3982         --enable-jit-sealloc
3983
3984       which enables the use of an execmem allocator in JIT that is compatible
3985       with  SELinux.  This  has  no  effect  if  JIT  is not enabled. See the
3986       pcre2jit documentation for a discussion of JIT usage. When JIT  support
3987       is enabled, pcre2grep automatically makes use of it, unless you add
3988
3989         --disable-pcre2grep-jit
3990
3991       to the configure command.
3992
3993
3994NEWLINE RECOGNITION
3995
3996       By  default, PCRE2 interprets the linefeed (LF) character as indicating
3997       the end of a line. This is the normal newline  character  on  Unix-like
3998       systems.  You can compile PCRE2 to use carriage return (CR) instead, by
3999       adding
4000
4001         --enable-newline-is-cr
4002
4003       to the configure command. There is also an  --enable-newline-is-lf  op-
4004       tion, which explicitly specifies linefeed as the newline character.
4005
4006       Alternatively, you can specify that line endings are to be indicated by
4007       the two-character sequence CRLF (CR immediately followed by LF). If you
4008       want this, add
4009
4010         --enable-newline-is-crlf
4011
4012       to the configure command. There is a fourth option, specified by
4013
4014         --enable-newline-is-anycrlf
4015
4016       which  causes  PCRE2 to recognize any of the three sequences CR, LF, or
4017       CRLF as indicating a line ending. A fifth option, specified by
4018
4019         --enable-newline-is-any
4020
4021       causes PCRE2 to recognize any Unicode  newline  sequence.  The  Unicode
4022       newline sequences are the three just mentioned, plus the single charac-
4023       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
4024       U+0085),  LS  (line  separator,  U+2028),  and PS (paragraph separator,
4025       U+2029). The final option is
4026
4027         --enable-newline-is-nul
4028
4029       which causes NUL (binary zero) to be set  as  the  default  line-ending
4030       character.
4031
4032       Whatever default line ending convention is selected when PCRE2 is built
4033       can be overridden by applications that use the library. At  build  time
4034       it is recommended to use the standard for your operating system.
4035
4036
4037WHAT \R MATCHES
4038
4039       By  default,  the  sequence \R in a pattern matches any Unicode newline
4040       sequence, independently of what has been selected as  the  line  ending
4041       sequence. If you specify
4042
4043         --enable-bsr-anycrlf
4044
4045       the  default  is changed so that \R matches only CR, LF, or CRLF. What-
4046       ever is selected when PCRE2 is built can be overridden by  applications
4047       that use the library.
4048
4049
4050HANDLING VERY LARGE PATTERNS
4051
4052       Within  a  compiled  pattern,  offset values are used to point from one
4053       part to another (for example, from an opening parenthesis to an  alter-
4054       nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
4055       two-byte values are used for these offsets, leading to a  maximum  size
4056       for a compiled pattern of around 64 thousand code units. This is suffi-
4057       cient to handle all but the most gigantic patterns. Nevertheless,  some
4058       people do want to process truly enormous patterns, so it is possible to
4059       compile PCRE2 to use three-byte or four-byte offsets by adding  a  set-
4060       ting such as
4061
4062         --with-link-size=3
4063
4064       to  the  configure command. The value given must be 2, 3, or 4. For the
4065       16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
4066       using  longer  offsets slows down the operation of PCRE2 because it has
4067       to load additional data when handling them. For the 32-bit library  the
4068       value  is  always 4 and cannot be overridden; the value of --with-link-
4069       size is ignored.
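
       The  rules  for  the  link size can be summarized in a short C sketch.
       Both function names are hypothetical, and the second function only com-
       putes how many distinct values an offset of a given width can hold; the
       real upper bound on pattern size may be lower for other reasons.

```c
#include <stdint.h>

/* Link size actually used for a requested --with-link-size value in a
   library with the given code unit width (8, 16, or 32 bits).
   Returns 0 if the request is not 2, 3, or 4. */
int effective_link_size(int requested, int code_unit_bits)
{
    if (code_unit_bits == 32) return 4;       /* the setting is ignored */
    if (requested < 2 || requested > 4) return 0;
    if (code_unit_bits == 16 && requested == 3)
        return 4;                             /* 3 is rounded up to 4 */
    return requested;
}

/* Number of distinct values that an offset of link_bytes bytes can hold. */
uint64_t link_offset_range(int link_bytes)
{
    return (uint64_t)1 << (8 * link_bytes);
}
```

       For  a two-byte offset the range is 65536, which is where the figure of
       around 64 thousand code units comes from.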
4070
4071
4072LIMITING PCRE2 RESOURCE USAGE
4073
4074       The pcre2_match() function increments a counter each time it goes round
4075       its  main  loop. Putting a limit on this counter controls the amount of
4076       computing resource used by a single call to  pcre2_match().  The  limit
4077       can be changed at run time, as described in the pcre2api documentation.
4078       The default is 10 million, but this can be changed by adding a  setting
4079       such as
4080
4081         --with-match-limit=500000
4082
4083       to   the   configure   command.   This  setting  also  applies  to  the
4084       pcre2_dfa_match() matching function, and to JIT  matching  (though  the
4085       counting is done differently).
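
       The effect of a counter limit can be illustrated with a toy  backtrack-
       ing  matcher.  The sketch below is not PCRE2's algorithm: it is a mini-
       mal recursive matcher that supports only literal characters, "." and  a
       following  "*",  and all of its names are invented for the example. It
       counts each matching step and gives up when the count exceeds the limit
       supplied by the caller.

```c
#include <stddef.h>

#define TOY_NOMATCH   0
#define TOY_MATCH     1
#define TOY_LIMIT   (-1)    /* the step limit was exceeded */

struct toy_state {
    unsigned long steps;    /* steps taken so far */
    unsigned long limit;    /* maximum number of steps allowed */
};

static int toy_here(const char *pat, const char *text, struct toy_state *st);

/* Match c* followed by the rest of the pattern: take the longest run of c
   first, then backtrack one character at a time. */
static int toy_star(int c, const char *pat, const char *text,
                    struct toy_state *st)
{
    const char *t = text;
    while (*t != '\0' && (c == '.' || *t == c)) t++;
    for (;;) {
        int rc = toy_here(pat, t, st);
        if (rc != TOY_NOMATCH) return rc;     /* match, or limit reached */
        if (t == text) return TOY_NOMATCH;
        t--;
    }
}

/* Match the pattern at exactly this position in the text. */
static int toy_here(const char *pat, const char *text, struct toy_state *st)
{
    if (++st->steps > st->limit) return TOY_LIMIT;
    if (pat[0] == '\0') return TOY_MATCH;
    if (pat[1] == '*') return toy_star(pat[0], pat + 2, text, st);
    if (*text != '\0' && (pat[0] == '.' || pat[0] == *text))
        return toy_here(pat + 1, text + 1, st);
    return TOY_NOMATCH;
}

/* Unanchored search: try each starting position in turn, sharing one
   step counter across all attempts. */
int toy_match(const char *pat, const char *text, unsigned long limit)
{
    struct toy_state st = { 0, limit };
    do {
        int rc = toy_here(pat, text, &st);
        if (rc != TOY_NOMATCH) return rc;
    } while (*text++ != '\0');
    return TOY_NOMATCH;
}
```

       A  pattern  such  as a*a*a*b applied to a long run of "a" characters
       backtracks heavily before failing, so a small limit stops it early with
       TOY_LIMIT, whereas a generous limit lets it run to a "no match" result.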
4086
4087       The  pcre2_match() function starts out using a 20KiB vector on the sys-
4088       tem stack to record backtracking points. The more  nested  backtracking
4089       points there are (that is, the deeper the search tree), the more memory
4090       is needed. If the initial vector is not large enough,  heap  memory  is
4091       used,  up to a certain limit, which is specified in kibibytes (units of
4092       1024 bytes). The limit can be changed at run time, as described in  the
4093       pcre2api  documentation.  The default limit (in effect unlimited) is 20
4094       million. You can change this by a setting such as
4095
4096         --with-heap-limit=500
4097
4098       which limits the amount of heap to 500 KiB. This limit applies only  to
4099       interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
4100       also use the heap for internal workspace  when  processing  complicated
4101       patterns.  This limit does not apply when JIT (which has its own memory
4102       arrangements) is used.
4103
4104       You can also explicitly limit the depth of nested backtracking  in  the
4105       pcre2_match() interpreter. This limit defaults to the value that is set
4106       for --with-match-limit. You can set a lower default  limit  by  adding,
4107       for example,
4108
4109         --with-match-limit-depth=10000
4110
4111       to  the  configure  command.  This value can be overridden at run time.
4112       This depth limit indirectly limits the amount of heap  memory  that  is
4113       used,  but because the size of each backtracking "frame" depends on the
4114       number of capturing parentheses in a pattern, the amount of  heap  that
4115       is  used  before  the  limit is reached varies from pattern to pattern.
4116       This limit was more useful in versions before 10.30, where function re-
4117       cursion was used for backtracking.
4118
4119       As well as applying to pcre2_match(), the depth limit also controls the
4120       depth of recursive function calls in pcre2_dfa_match(). These are  used
4121       for  lookaround  assertions,  atomic  groups, and recursion within pat-
4122       terns.  The limit does not apply to JIT matching.
4123
4124
4125CREATING CHARACTER TABLES AT BUILD TIME
4126
4127       PCRE2 uses fixed tables for processing characters whose code points are
4128       less than 256. By default, PCRE2 is built with a set of tables that are
4129       distributed in the file src/pcre2_chartables.c.dist. These  tables  are
4130       for ASCII codes only. If you add
4131
4132         --enable-rebuild-chartables
4133
4134       to  the  configure  command, the distributed tables are no longer used.
4135       Instead, a program called pcre2_dftables is compiled and run. This
4136       outputs the source for a new set of tables, created in the default locale
4137       of your C run-time system. This method of replacing the tables does not
4138       work if you are cross compiling, because pcre2_dftables needs to be run
4139       on the local host and therefore not compiled with the cross compiler.
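
       To  show  why  the tables depend on the locale, here is a sketch of how
       one such table, a lower-casing map for the 256 byte values,  could  be
       built  with  the standard tolower() function. This is an assumption-la-
       den illustration, not the source of pcre2_dftables; the  function  name
       is invented.

```c
#include <ctype.h>

/* Fill table[c] with the lower-case form of each byte value c, as reported
   by tolower() in whatever locale is currently in effect. */
void build_lowercase_table(unsigned char table[256])
{
    int c;
    for (c = 0; c < 256; c++)
        table[c] = (unsigned char)tolower(c);
}
```

       A  C  program  starts  in  the  "C"  locale, so without a prior call to
       setlocale() only the ASCII letters are affected. After setlocale()  se-
       lects  another locale, the same loop may also map byte values above 127.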
4140
4141       If you need to create alternative tables when cross compiling, you will
4142       have  to  do so "by hand". There may also be other reasons for creating
4143       tables manually.  To cause pcre2_dftables to  be  built  on  the  local
4144       host, run a normal compiling command, and then run the program with the
4145       output file as its argument, for example:
4146
4147         cc src/pcre2_dftables.c -o pcre2_dftables
4148         ./pcre2_dftables src/pcre2_chartables.c
4149
4150       This builds the tables in the default locale of the local host. If  you
4151       want to specify a locale, you must use the -L option:
4152
4153         LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
4154
4155       You can also specify -b (with or without -L). This causes the tables to
4156       be written in binary instead of as source code. A set of binary  tables
4157       can  be  loaded  into memory by an application and passed to pcre2_com-
4158       pile() in the same way as tables created by calling pcre2_maketables().
4159       The  tables are just a string of bytes, independent of hardware charac-
4160       teristics such as endianness. This means they can be  bundled  with  an
4161       application  that  runs in different environments, to ensure consistent
4162       behaviour.
4163
4164
4165USING EBCDIC CODE
4166
4167       PCRE2 assumes by default that it will run in an environment  where  the
4168       character  code is ASCII or Unicode, which is a superset of ASCII. This
4169       is the case for most computer operating systems. PCRE2 can, however, be
4170       compiled to run in an 8-bit EBCDIC environment by adding
4171
4172         --enable-ebcdic --disable-unicode
4173
4174       to the configure command. This setting implies --enable-rebuild-charta-
4175       bles. You should only use it if you know that you are in an EBCDIC  en-
4176       vironment (for example, an IBM mainframe operating system).
4177
4178       It  is  not possible to support both EBCDIC and UTF-8 codes in the same
4179       version of the library. Consequently,  --enable-unicode  and  --enable-
4180       ebcdic are mutually exclusive.
4181
4182       The EBCDIC character that corresponds to an ASCII LF is assumed to have
4183       the value 0x15 by default. However, in some EBCDIC  environments,  0x25
4184       is used. In such an environment you should use
4185
4186         --enable-ebcdic-nl25
4187
4188       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
4189       has the same value as in ASCII, namely, 0x0d.  Whichever  of  0x15  and
4190       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4191       acter (which, in Unicode, is 0x85).
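
       The resulting assignment of values can be written down as a small  C
       sketch.  The structure and function names are hypothetical; the values
       are those given above.

```c
/* EBCDIC values used for the three newline-related characters. */
struct ebcdic_newlines {
    unsigned char cr;
    unsigned char lf;
    unsigned char nel;
};

/* nl25 is nonzero when --enable-ebcdic-nl25 was given at build time. */
struct ebcdic_newlines ebcdic_newline_values(int nl25)
{
    struct ebcdic_newlines v;
    v.cr  = 0x0d;                   /* same value as in ASCII */
    v.lf  = nl25 ? 0x25 : 0x15;     /* the value chosen as LF */
    v.nel = nl25 ? 0x15 : 0x25;     /* the other value becomes NEL */
    return v;
}
```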
4192
4193       The options that select newline behaviour, such as --enable-newline-is-
4194       cr, and equivalent run-time options, refer to these character values in
4195       an EBCDIC environment.
4196
4197
4198PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
4199
4200       By default pcre2grep supports the use of callouts with string arguments
4201       within  the patterns it is matching. There are two kinds: one that gen-
4202       erates output using local code, and another that calls an external pro-
4203       gram  or  script.   If --disable-pcre2grep-callout-fork is added to the
4204       configure command, only the first kind  of  callout  is  supported;  if
4205       --disable-pcre2grep-callout  is  used,  all callouts are completely ig-
4206       nored. For more details of pcre2grep callouts, see the pcre2grep  docu-
4207       mentation.
4208
4209
4210PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
4211
4212       By  default,  pcre2grep reads all files as plain text. You can build it
4213       so that it recognizes files whose names end in .gz or .bz2,  and  reads
4214       them with libz or libbz2, respectively, by adding one or both of
4215
4216         --enable-pcre2grep-libz
4217         --enable-pcre2grep-libbz2
4218
4219       to the configure command. These options naturally require that the rel-
4220       evant libraries are installed on your system. Configuration  will  fail
4221       if they are not.
4222
4223
4224PCRE2GREP BUFFER SIZE
4225
4226       pcre2grep  uses an internal buffer to hold a "window" on the file it is
4227       scanning, in order to be able to output "before" and "after" lines when
4228       it finds a match. The default starting size of the buffer is 20KiB. The
4229       buffer itself is three times this size, but because of the  way  it  is
4230       used for holding "before" lines, the longest line that is guaranteed to
4231       be processable is the notional buffer size. If a longer line is encoun-
4232       tered,  pcre2grep  automatically  expands the buffer, up to a specified
4233       maximum size, whose default is 1MiB or the starting size, whichever  is
4234       the  larger. You can change the default parameter values by adding, for
4235       example,
4236
4237         --with-pcre2grep-bufsize=51200
4238         --with-pcre2grep-max-bufsize=2097152
4239
4240       to the configure command. The caller of pcre2grep  can  override  these
4241       values  by  using  --buffer-size  and  --max-buffer-size on the command
4242       line.
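
       The  arithmetic  behind  these sizes can be sketched as follows. The
       function and macro names are invented for the example and do  not  ap-
       pear in the pcre2grep source.

```c
#include <stddef.h>

#define ONE_MIB ((size_t)1024 * 1024)

/* Memory actually allocated for a given notional buffer size. */
size_t grep_allocated_bytes(size_t bufsize)
{
    return 3 * bufsize;
}

/* Default maximum buffer size: 1MiB or the starting size, whichever is
   the larger. */
size_t grep_default_max(size_t bufsize)
{
    return bufsize > ONE_MIB ? bufsize : ONE_MIB;
}
```

       With  the default starting size of 20KiB (20480 bytes), the allocation
       is 61440 bytes and the default maximum is 1048576 bytes.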
4243
4244
4245PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
4246
4247       If you add one of
4248
4249         --enable-pcre2test-libreadline
4250         --enable-pcre2test-libedit
4251
4252       to the configure command, pcre2test is linked with the  libreadline  or
4253       libedit  library,  respectively, and when its input is from a terminal,
4254       it reads it using the readline() function. This  provides  line-editing
4255       and  history  facilities.  Note that libreadline is GPL-licensed, so if
4256       you distribute a binary of pcre2test linked in this way, there  may  be
4257       licensing issues. These can be avoided by linking instead with libedit,
4258       which has a BSD licence.
4259
4260       Setting --enable-pcre2test-libreadline causes the -lreadline option  to
4261       be  added to the pcre2test build. In many operating environments with a
4262       system-installed readline library this is sufficient. However, in some
4263       environments (e.g. if an unmodified distribution version of readline is
4264       in use), some extra configuration may be necessary.  The  INSTALL  file
4265       for libreadline says this:
4266
4267         "Readline uses the termcap functions, but does not link with
4268         the termcap or curses library itself, allowing applications
4269         which link with readline the to choose an appropriate library."
4270
4271       If  your environment has not been set up so that an appropriate library
4272       is automatically included, you may need to add something like
4273
4274         LIBS="-lncurses"
4275
4276       immediately before the configure command.
4277
4278
4279INCLUDING DEBUGGING CODE
4280
4281       If you add
4282
4283         --enable-debug
4284
4285       to the configure command, additional debugging code is included in  the
4286       build. This feature is intended for use by the PCRE2 maintainers.
4287
4288
4289DEBUGGING WITH VALGRIND SUPPORT
4290
4291       If you add
4292
4293         --enable-valgrind
4294
4295       to  the  configure command, PCRE2 will use valgrind annotations to mark
4296       certain memory regions as unaddressable. This allows it to  detect  in-
4297       valid memory accesses, and is mostly useful for debugging PCRE2 itself.
4298
4299
4300CODE COVERAGE REPORTING
4301
4302       If  your  C  compiler is gcc, you can build a version of PCRE2 that can
4303       generate a code coverage report for its test suite. To enable this, you
4304       must install lcov version 1.6 or above. Then specify
4305
4306         --enable-coverage
4307
4308       to the configure command and build PCRE2 in the usual way.
4309
4310       Note that using ccache (a caching C compiler) is incompatible with code
4311       coverage reporting. If you have configured ccache to run  automatically
4312       on your system, you must set the environment variable
4313
4314         CCACHE_DISABLE=1
4315
4316       before running make to build PCRE2, so that ccache is not used.
4317
4318       When  --enable-coverage  is  used,  the following additional targets are
4319       added to the Makefile:
4320
4321         make coverage
4322
4323       This creates a fresh coverage report for the PCRE2 test  suite.  It  is
4324       equivalent  to running "make coverage-reset", "make coverage-baseline",
4325       "make check", and then "make coverage-report".
4326
4327         make coverage-reset
4328
4329       This zeroes the coverage counters, but does nothing else.
4330
4331         make coverage-baseline
4332
4333       This captures baseline coverage information.
4334
4335         make coverage-report
4336
4337       This creates the coverage report.
4338
4339         make coverage-clean-report
4340
4341       This removes the generated coverage report without cleaning the  cover-
4342       age data itself.
4343
4344         make coverage-clean-data
4345
4346       This  removes  the captured coverage data without removing the coverage
4347       files created at compile time (*.gcno).
4348
4349         make coverage-clean
4350
4351       This cleans all coverage data including the generated coverage  report.
4352       For  more  information about code coverage, see the gcov and lcov docu-
4353       mentation.
4354
4355
4356DISABLING THE Z AND T FORMATTING MODIFIERS
4357
4358       The C99 standard defines formatting modifiers z and t  for  size_t  and
4359       ptrdiff_t  values, respectively. By default, PCRE2 uses these modifiers
4360       in environments other than Microsoft  Visual  Studio  when  __STDC_VER-
4361       SION__ is defined and has a value greater than or equal to 199901L (in-
4362       dicating C99).  However, there is at least one environment that  claims
4363       to be C99 but does not support these modifiers. If
4364
4365         --disable-percent-zt
4366
4367       is specified, no use is made of the z or t modifiers. Instead of %td or
4368       %zu, %lu is used, with a cast for size_t values.
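
       The  choice  between the two styles of formatting can be sketched with
       the preprocessor. The macro names SIZE_FMT and SIZE_CAST and the  func-
       tion  name are invented for this example, and the use of _MSC_VER to de-
       tect the Microsoft compiler is an  assumption;  PCRE2's  own  condition
       may be written differently.

```c
#include <stddef.h>
#include <stdio.h>

#if !defined(_MSC_VER) && defined(__STDC_VERSION__) && \
    __STDC_VERSION__ >= 199901L
#define SIZE_FMT      "%zu"                   /* C99 z modifier */
#define SIZE_CAST(x)  (x)
#else
#define SIZE_FMT      "%lu"                   /* fallback, needs a cast */
#define SIZE_CAST(x)  ((unsigned long)(x))
#endif

/* Write a size_t value as decimal text; returns what snprintf() returns. */
int format_size(char *buf, size_t buflen, size_t value)
{
    return snprintf(buf, buflen, SIZE_FMT, SIZE_CAST(value));
}
```

       Either branch produces the same text for values  that  fit  in  an  un-
       signed long; only the format string and the cast differ.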
4369
4370
4371SUPPORT FOR FUZZERS
4372
4373       There is a special option for use by people who  want  to  run  fuzzing
4374       tests on PCRE2:
4375
4376         --enable-fuzz-support
4377
4378       At present this applies only to the 8-bit library. If set, it causes an
4379       extra library called libpcre2-fuzzsupport.a to be built,  but  not  in-
4380       stalled.  This  contains  a single function called LLVMFuzzerTestOneIn-
4381       put() whose arguments are a pointer to a string and the length  of  the
4382       string.  When  called,  this  function tries to compile the string as a
4383       pattern, and if that succeeds, to match it.  This is done both with  no
4384       options  and  with  some random option bits that are generated from the
4385       string.
4386
4387       Setting --enable-fuzz-support also causes  a  binary  called  pcre2fuz-
4388       zcheck  to be created. This is normally run under valgrind or used when
4389       PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
4390       function  and  outputs  information  about  what it is doing. The input
4391       strings are specified by arguments: if an argument starts with "="  the
4392       rest  of it is a literal input string. Otherwise, it is assumed to be a
4393       file name, and the contents of the file are the test string.
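
       The argument convention just described can be sketched in C.  This  is
       not  the  source  of pcre2fuzzcheck; the function name and its interface
       are assumptions made for the illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Load the test string named by one command-line argument. "=text" yields
   the literal text; anything else is treated as a file name. Returns a
   malloc'ed, zero-terminated buffer that the caller must free, or NULL on
   failure. The length excludes the terminating zero. */
unsigned char *load_test_string(const char *arg, size_t *length)
{
    FILE *f;
    long size;
    unsigned char *buf;

    if (arg[0] == '=') {                      /* literal input string */
        size_t n = strlen(arg + 1);
        buf = malloc(n + 1);
        if (buf == NULL) return NULL;
        memcpy(buf, arg + 1, n + 1);
        *length = n;
        return buf;
    }

    f = fopen(arg, "rb");                     /* otherwise a file name */
    if (f == NULL) return NULL;
    if (fseek(f, 0, SEEK_END) != 0 || (size = ftell(f)) < 0 ||
        fseek(f, 0, SEEK_SET) != 0) {
        fclose(f);
        return NULL;
    }
    buf = malloc((size_t)size + 1);
    if (buf != NULL && fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        buf = NULL;
    }
    fclose(f);
    if (buf != NULL) {
        buf[size] = '\0';
        *length = (size_t)size;
    }
    return buf;
}
```

       The returned buffer and length are in the form that  a  function  with
       the  arguments  described  above  (a pointer to a string and its length)
       expects.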
4394
4395
4396OBSOLETE OPTION
4397
4398       In versions of PCRE2 prior to 10.30, there were two  ways  of  handling
4399       backtracking  in the pcre2_match() function. The default was to use the
4400       system stack, but if
4401
4402         --disable-stack-for-recursion
4403
4404       was set, memory on the heap was used. From release 10.30  onwards  this
4405       has  changed  (the  stack  is  no longer used) and this option now does
4406       nothing except give a warning.
4407
4408
4409SEE ALSO
4410
4411       pcre2api(3), pcre2-config(3).
4412
4413
4414AUTHOR
4415
4416       Philip Hazel
4417       University Computing Service
4418       Cambridge, England.
4419
4420
4421REVISION
4422
4423       Last updated: 20 March 2020
4424       Copyright (c) 1997-2020 University of Cambridge.
4425------------------------------------------------------------------------------
4426
4427
4428PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)
4429
4430
4431
4432NAME
4433       PCRE2 - Perl-compatible regular expressions (revised API)
4434
4435SYNOPSIS
4436
4437       #include <pcre2.h>
4438
4439       int (*pcre2_callout)(pcre2_callout_block *, void *);
4440
4441       int pcre2_callout_enumerate(const pcre2_code *code,
4442         int (*callback)(pcre2_callout_enumerate_block *, void *),
4443         void *user_data);
4444
4445
4446DESCRIPTION
4447
4448       PCRE2  provides  a feature called "callout", which is a means of tempo-
4449       rarily passing control to the caller of PCRE2 in the middle of  pattern
4450       matching.  The caller of PCRE2 provides an external function by putting
4451       its entry point in a match  context  (see  pcre2_set_callout()  in  the
4452       pcre2api documentation).
4453
4454       When  using the pcre2_substitute() function, an additional callout fea-
4455       ture is available. This does a callout after each change to the subject
4456       string and is described in the pcre2api documentation; the rest of this
4457       document is concerned with callouts during pattern matching.
4458
4459       Within a regular expression, (?C<arg>) indicates a point at  which  the
4460       external  function  is  to  be  called. Different callout points can be
4461       identified by putting a number less than 256 after the  letter  C.  The
4462       default  value is zero.  Alternatively, the argument may be a delimited
4463       string. The starting delimiter must be one of ` ' " ^ % # $ {  and  the
4464       ending delimiter is the same as the start, except for {, where the end-
4465       ing delimiter is }. If  the  ending  delimiter  is  needed  within  the
4466       string,  it  must be doubled. For example, this pattern has two callout
4467       points:
4468
4469         (?C1)abc(?C"some ""arbitrary"" text")def
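
       An application that generates patterns may need to  apply  the  dou-
       bling rule itself. The following C sketch builds a string callout item
       from arbitrary text; the function name and interface are  hypothetical
       and are not part of the PCRE2 API.

```c
#include <stddef.h>
#include <string.h>

/* Write "(?C" + delimited text + ")" to out, doubling every occurrence of
   the ending delimiter inside the text. open must be one of the permitted
   starting delimiters. Returns the length written, or -1 on error. */
int build_string_callout(char *out, size_t outlen, const char *text,
                         char open)
{
    const char *allowed = "`'\"^%#${";
    char close = (open == '{') ? '}' : open;
    size_t need = 6;                /* "(?C", two delimiters, and ")" */
    size_t n = 0;
    const char *p;

    if (open == '\0' || strchr(allowed, open) == NULL) return -1;
    for (p = text; *p != '\0'; p++)
        need += (*p == close) ? 2 : 1;
    if (need + 1 > outlen) return -1;         /* no room for the result */

    memcpy(out, "(?C", 3);
    n = 3;
    out[n++] = open;
    for (p = text; *p != '\0'; p++) {
        if (*p == close) out[n++] = close;    /* double the delimiter */
        out[n++] = *p;
    }
    out[n++] = close;
    out[n++] = ')';
    out[n] = '\0';
    return (int)n;
}
```

       Given  the  text  some "arbitrary" text and the delimiter ", the func-
       tion produces the second callout item of the pattern shown above.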
4470
4471       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
4472       PCRE2  automatically inserts callouts, all with number 255, before each
4473       item in the pattern except for immediately before or after an  explicit
4474       callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
4475
4476         A(?C3)B
4477
4478       it is processed as if it were
4479
4480         (?C255)A(?C3)B(?C255)
4481
4482       Here is a more complicated example:
4483
4484         A(\d{2}|--)
4485
4486       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
4487
4488         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4489
4490       Notice  that  there  is a callout before and after each parenthesis and
4491       alternation bar. If the pattern contains a conditional group whose con-
4492       dition  is  an  assertion, an automatic callout is inserted immediately
4493       before the condition. Such a callout may also be  inserted  explicitly,
4494       for example:
4495
4496         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)
4497
4498       This  applies only to assertion conditions (because they are themselves
4499       independent groups).
4500
4501       Callouts can be useful for tracking the progress of  pattern  matching.
4502       The pcre2test program has a pattern qualifier (/auto_callout) that sets
4503       automatic callouts.  When any callouts are  present,  the  output  from
4504       pcre2test  indicates  how  the pattern is being matched. This is useful
4505       information when you are trying to optimize the performance of  a  par-
4506       ticular pattern.
4507
4508
4509MISSING CALLOUTS
4510
4511       You  should  be  aware  that, because of optimizations in the way PCRE2
4512       compiles and matches patterns, callouts sometimes do not happen exactly
4513       as you might expect.
4514
4515   Auto-possessification
4516
4517       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4518       that what follows cannot be part of the repeat. For example, a+[bc]  is
4519       compiled  as if it were a++[bc]. The pcre2test output when this pattern
4520       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
4521       to the string "aaaa" is:
4522
4523         --->aaaa
4524          +0 ^        a+
4525          +2 ^   ^    [bc]
4526         No match
4527
4528       This  indicates that when matching [bc] fails, there is no backtracking
4529       into a+ (because it is being treated as a++) and therefore the callouts
4530       that  would  be  taken for the backtracks do not occur. You can disable
4531       the  auto-possessify  feature  by  passing   PCRE2_NO_AUTO_POSSESS   to
4532       pcre2_compile(),  or  starting  the pattern with (*NO_AUTO_POSSESS). In
4533       this case, the output changes to this:
4534
4535         --->aaaa
4536          +0 ^        a+
4537          +2 ^   ^    [bc]
4538          +2 ^  ^     [bc]
4539          +2 ^ ^      [bc]
4540          +2 ^^       [bc]
4541         No match
4542
4543       This time, when matching [bc] fails, the matcher backtracks into a+ and
4544       tries again, repeatedly, until a+ itself fails.
4545
4546   Automatic .* anchoring
4547
4548       By default, an optimization is applied when .* is the first significant
4549       item in a pattern. If PCRE2_DOTALL is set, so that the  dot  can  match
4550       any  character,  the pattern is automatically anchored. If PCRE2_DOTALL
4551       is not set, a match can start only after an internal newline or at  the
4552       beginning of the subject, and pcre2_compile() remembers this. If a pat-
4553       tern has more than one top-level branch, automatic anchoring occurs  if
4554       all branches are anchorable.
4555
4556       This  optimization is disabled, however, if .* is in an atomic group or
4557       if there is a backreference to the capture group in which  it  appears.
4558       It  is  also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4559       ever, the presence of callouts does not affect it.
4560
4561       For example, if the pattern .*\d is  compiled  with  PCRE2_AUTO_CALLOUT
4562       and applied to the string "aa", the pcre2test output is:
4563
4564         --->aa
4565          +0 ^      .*
4566          +2 ^ ^    \d
4567          +2 ^^     \d
4568          +2 ^      \d
4569         No match
4570
4571       This  shows  that all match attempts start at the beginning of the sub-
4572       ject. In other words, the pattern is anchored. You can disable this op-
4573       timization  by  passing  PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
4574       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out-
4575       put changes to:
4576
4577         --->aa
4578          +0 ^      .*
4579          +2 ^ ^    \d
4580          +2 ^^     \d
4581          +2 ^      \d
4582          +0  ^     .*
4583          +2  ^^    \d
4584          +2  ^     \d
4585         No match
4586
4587       This  shows more match attempts, starting at the second subject charac-
4588       ter.  Another optimization, described in the next section,  means  that
4589       there is no subsequent attempt to match with an empty subject.
4590
4591   Other optimizations
4592
4593       Other  optimizations  that  provide fast "no match" results also affect
4594       callouts.  For example, if the pattern is
4595
4596         ab(?C4)cd
4597
4598       PCRE2 knows that any matching string must contain the  letter  "d".  If
4599       the  subject  string  is  "abyz",  the  lack of "d" means that matching
4600       doesn't ever start, and the callout is  never  reached.  However,  with
4601       "abyd", though the result is still no match, the callout is obeyed.
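
       The kind of pre-check involved can be sketched in C. The function  be-
       low  is  a simplified illustration with an invented name; PCRE2's own
       start-of-match optimizations are more elaborate.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if matching can be skipped because the subject lacks a code
   unit that every match must contain, or 0 if a real match attempt is
   needed. */
int can_skip_match(const char *subject, size_t length, char required)
{
    return memchr(subject, required, length) == NULL;
}
```

       For the required letter "d", the subject "abyz" can be skipped, so  no
       callout  is  taken,  whereas  "abyd"  cannot, so matching starts and the
       callout is reached.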
4602
4603       For  most  patterns  PCRE2  also knows the minimum length of a matching
4604       string, and will immediately give a "no match" return without  actually
4605       running  a  match if the subject is not long enough, or, for unanchored
4606       patterns, if it has been scanned far enough.
4607
4608       You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4609       MIZE  option  to  pcre2_compile(),  or  by  starting  the  pattern with
4610       (*NO_START_OPT). This slows down the matching process, but does  ensure
4611       that callouts such as the example above are obeyed.
4612
4613
4614THE CALLOUT INTERFACE
4615
4616       During  matching,  when  PCRE2  reaches a callout point, if an external
4617       function is provided in the match context, it is called.  This  applies
4618       to  normal,  DFA,  and JIT matching. The first argument to the callout
4619       function is a pointer to a pcre2_callout_block. The second argument
4620       is  the  void * callout data that was supplied when the callout was set
4621       up by calling pcre2_set_callout() (see the pcre2api documentation). The
4622       callout  block structure contains the following fields, not necessarily
4623       in this order:
4624
4625         uint32_t      version;
4626         uint32_t      callout_number;
4627         uint32_t      capture_top;
4628         uint32_t      capture_last;
4629         uint32_t      callout_flags;
4630         PCRE2_SIZE   *offset_vector;
4631         PCRE2_SPTR    mark;
4632         PCRE2_SPTR    subject;
4633         PCRE2_SIZE    subject_length;
4634         PCRE2_SIZE    start_match;
4635         PCRE2_SIZE    current_position;
4636         PCRE2_SIZE    pattern_position;
4637         PCRE2_SIZE    next_item_length;
4638         PCRE2_SIZE    callout_string_offset;
4639         PCRE2_SIZE    callout_string_length;
4640         PCRE2_SPTR    callout_string;
4641
4642       The version field contains the version number of the block format.  The
4643       current  version  is  2; the three callout string fields were added for
4644       version 1, and the callout_flags field for version 2. If you are  writ-
4645       ing  an  application  that  might  use an earlier release of PCRE2, you
4646       should check the version number before accessing any of  these  fields.
4647       The  version  number  will increase in future if more fields are added,
4648       but the intention is never to remove any of the existing fields.
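
       The version checks can be expressed as two small predicates. The func-
       tion names are hypothetical; in a real callout function  the  argument
       would be the version field of the callout block.

```c
#include <stdint.h>

/* The three callout string fields exist from block version 1 onwards. */
int callout_has_string_fields(uint32_t version)
{
    return version >= 1;
}

/* The callout_flags field exists from block version 2 onwards. */
int callout_has_flags_field(uint32_t version)
{
    return version >= 2;
}
```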
4649
4650   Fields for numerical callouts
4651
4652       For a numerical callout, callout_string  is  NULL,  and  callout_number
4653       contains  the  number  of  the callout, in the range 0-255. This is the
4654       number that follows (?C for callouts that are part of the pattern; it is
4655       255 for automatically generated callouts.
4656
4657   Fields for string callouts
4658
4659       For  callouts with string arguments, callout_number is always zero, and
4660       callout_string points to the string that is contained within  the  com-
4661       piled pattern. Its length is given by callout_string_length. Duplicated
4662       ending delimiters that were present in the original pattern string have
4663       been turned into single characters, but there is no other processing of
4664       the callout string argument. An additional code unit containing  binary
4665       zero  is  present  after the string, but is not included in the length.
4666       The delimiter that was used to start the string is also  stored  within
4667       the  pattern, immediately before the string itself. You can access this
4668       delimiter as callout_string[-1] if you need it.
4669
4670       The callout_string_offset field is the code unit offset to the start of
4671       the callout argument string within the original pattern string. This is
4672       provided for the benefit of applications such as script languages  that
4673       might need to report errors in the callout string within the pattern.
4674
4675   Fields for all callouts
4676
4677       The  remaining  fields in the callout block are the same for both kinds
4678       of callout.
4679
4680       The offset_vector field is a pointer to a vector of  capturing  offsets
4681       (the "ovector"). You may read the elements in this vector, but you must
4682       not change any of them.
4683
4684       For calls to pcre2_match(), the offset_vector field is not  (since  re-
4685       lease  10.30)  a  pointer  to the actual ovector that was passed to the
4686       matching function in the match data block. Instead it points to an  in-
4687       ternal  ovector  of  a  size large enough to hold all possible captured
4688       substrings in the pattern. Note that whenever a recursion or subroutine
4689       call  within  a pattern completes, the capturing state is reset to what
4690       it was before.
4691
4692       The capture_last field contains the number of the  most  recently  cap-
4693       tured  substring,  and the capture_top field contains one more than the
4694       number of the highest numbered captured substring so far.  If  no  sub-
4695       strings  have yet been captured, the value of capture_last is 0 and the
4696       value of capture_top is 1. The values of these  fields  do  not  always
4697       differ   by   one;  for  example,  when  the  callout  in  the  pattern
4698       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
4699
4700       The contents of ovector[2] to  ovector[<capture_top>*2-1]  can  be  in-
4701       spected  in  order to extract substrings that have been matched so far,
4702       in the same way as extracting substrings after a match  has  completed.
4703       The  values in ovector[0] and ovector[1] are always PCRE2_UNSET because
4704       the match is by definition not complete. Substrings that have not  been
4705       captured  but whose numbers are less than capture_top also have both of
4706       their ovector slots set to PCRE2_UNSET.
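
       The  following C sketch shows one way to read a captured substring from
       the ovector inside a callout. So that it compiles  without  the  PCRE2
       headers,  it uses size_t in place of PCRE2_SIZE and a local UNSET_OFFSET
       macro as a stand-in for PCRE2_UNSET; these substitutions and the  func-
       tion name are assumptions made for the example.

```c
#include <stddef.h>

#define UNSET_OFFSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* Look up capture group n in an ovector seen from a callout. On success,
   store the start offset and length of the substring and return 1.
   Return 0 for group 0 (never set during a callout), for groups at or
   beyond capture_top, and for groups that are unset. */
int callout_capture(const size_t *ovector, unsigned capture_top,
                    unsigned n, size_t *start, size_t *length)
{
    if (n == 0 || n >= capture_top) return 0;
    if (ovector[2 * n] == UNSET_OFFSET || ovector[2 * n + 1] == UNSET_OFFSET)
        return 0;
    *start = ovector[2 * n];
    *length = ovector[2 * n + 1] - ovector[2 * n];
    return 1;
}
```

       For  the  pattern ((a)(b))(?C2) applied to "ab", capture_top is 4 at the
       callout, so groups 1 to 3 can be read: group 1 is "ab", group 2 is "a",
       and group 3 is "b".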
4707
4708       For DFA matching, the offset_vector field points to  the  ovector  that
4709       was  passed  to the matching function in the match data block for call-
4710       outs at the top level, but to an internal ovector during the processing
4711       of  pattern  recursions, lookarounds, and atomic groups. However, these
4712       ovectors hold no useful information because pcre2_dfa_match() does  not
4713       support  substring  capturing. The value of capture_top is always 1 and
4714       the value of capture_last is always 0 for DFA matching.
4715
4716       The subject and subject_length fields contain copies of the values that
4717       were passed to the matching function.
4718
4719       The  start_match  field normally contains the offset within the subject
4720       at which the current match attempt started. However, if the escape  se-
4721       quence  \K  has  been encountered, this value is changed to reflect the
4722       modified starting point. If the pattern is not  anchored,  the  callout
4723       function may be called several times from the same point in the pattern
4724       for different starting points in the subject.
4725
4726       The current_position field contains the offset within  the  subject  of
4727       the current match pointer.
4728
4729       The pattern_position field contains the offset in the pattern string to
4730       the next item to be matched.
4731
4732       The next_item_length field contains the length of the next item  to  be
4733       processed  in the pattern string. When the callout is at the end of the
4734       pattern, the length is zero.  When  the  callout  precedes  an  opening
4735       parenthesis, the length includes meta characters that follow the paren-
4736       thesis. For example, in a callout before an assertion  such  as  (?=ab)
4737       the  length  is  3.  For an alternation bar or a closing parenthesis,
4738       the length is one, unless a closing parenthesis is followed by a  quan-
4739       tifier, in which case its length is included.  (This changed in release
4740       10.23. In earlier releases, before an opening  parenthesis  the  length
4741       was  that of the entire group, and before an alternation bar or a clos-
4742       ing parenthesis the length was zero.)
4743
4744       The pattern_position and next_item_length fields are intended  to  help
4745       in  distinguishing between different automatic callouts, which all have
4746       the same callout number. However, they are set for  all  callouts,  and
4747       are used by pcre2test to show the next item to be matched when display-
4748       ing callout information.
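
       One way an application might use these two fields is to copy out the
       text of the next pattern item for display. The sketch below assumes an
       8-bit, zero-terminated pattern string; copy_next_item() is a
       hypothetical helper, not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the next pattern item into out as a zero-terminated string.
   Returns 0 on success, or -1 if the item lies outside the pattern or
   does not fit in the output buffer. */
int copy_next_item(const char *pattern, size_t pattern_position,
                   size_t next_item_length, char *out, size_t out_size)
{
    size_t pattern_length;
    assert(pattern != NULL && out != NULL);
    pattern_length = strlen(pattern);
    if (pattern_position > pattern_length ||
        next_item_length > pattern_length - pattern_position ||
        next_item_length + 1 > out_size)
        return -1;
    memcpy(out, pattern + pattern_position, next_item_length);
    out[next_item_length] = '\0';
    return 0;
}
```

       For a callout just before the assertion in x(?=ab)y, a position of 1
       and a length of 3 yield the text "(?=". At the end of the pattern the
       length is zero, so the result is an empty string.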
4749
4750       In callouts from pcre2_match() the mark field contains a pointer to the
4751       zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
4752       (*THEN) item in the match, or NULL if no such items have  been  passed.
4753       Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
4754       previous (*MARK). In callouts from the DFA matching function this field
4755       always contains NULL.
4756
4757       The   callout_flags   field   is   always   zero   in   callouts   from
4758       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4759       JIT is used, the following bits may be set:
4760
4761         PCRE2_CALLOUT_STARTMATCH
4762
4763       This  is set for the first callout after the start of matching for each
4764       new starting position in the subject.
4765
4766         PCRE2_CALLOUT_BACKTRACK
4767
4768       This is set if there has been a matching backtrack since  the  previous
4769       callout,  or  since  the start of matching if this is the first callout
4770       from a pcre2_match() run.
4771
4772       Both bits are set when a backtrack has caused a "bumpalong"  to  a  new
4773       starting  position in the subject. Output from pcre2test does not indi-
4774       cate the presence of these bits unless the  callout_extra  modifier  is
4775       set.
4776
4777       The information in the callout_flags field is provided so that applica-
4778       tions can track and tell their users how matching with backtracking  is
4779       done.  This  can be useful when trying to optimize patterns, or just to
4780       understand how PCRE2 works. There is no  support  in  pcre2_dfa_match()
4781       because there is no backtracking in DFA matching, and there is no
4782       support in JIT because JIT is all about maximizing matching performance.
4783       In both these cases the callout_flags field is always zero.
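
       A callout function could tally these bits to report how much back-
       tracking a match needed. The following is a minimal standalone sketch:
       the DEMO_CALLOUT_* values are placeholders for the PCRE2_CALLOUT_*
       bits defined in pcre2.h, and demo_tally and tally_callout() are hypo-
       thetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values for this sketch only; real code should use
   PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK from pcre2.h. */
#define DEMO_CALLOUT_STARTMATCH 0x1u
#define DEMO_CALLOUT_BACKTRACK  0x2u

typedef struct {
    unsigned callouts;    /* every callout seen */
    unsigned starts;      /* callouts with the start-of-match bit set */
    unsigned backtracks;  /* callouts with the backtrack bit set */
    unsigned bumpalongs;  /* callouts with both bits set */
} demo_tally;

/* Update the counters from the callout_flags value of one callout. */
void tally_callout(uint32_t callout_flags, demo_tally *tally)
{
    int start = (callout_flags & DEMO_CALLOUT_STARTMATCH) != 0;
    int back  = (callout_flags & DEMO_CALLOUT_BACKTRACK) != 0;
    assert(tally != NULL);
    tally->callouts++;
    if (start) tally->starts++;
    if (back) tally->backtracks++;
    if (start && back) tally->bumpalongs++;
}
```

       Because the field is always zero under JIT and in DFA matching, such
       counters would stay at zero there apart from the callout count.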
4784
4785
4786RETURN VALUES FROM CALLOUTS
4787
4788       The external callout function returns an integer to PCRE2. If the value
4789       is zero, matching proceeds as normal. If  the  value  is  greater  than
4790       zero,  matching  fails  at  the current point, but the testing of other
4791       matching possibilities goes ahead, just as if a lookahead assertion had
4792       failed. If the value is less than zero, the match is abandoned, and the
4793       matching function returns the negative value.
4794
4795       Negative values should normally be chosen from  the  set  of  PCRE2_ER-
4796       ROR_xxx  values.  In  particular, PCRE2_ERROR_NOMATCH forces a standard
4797       "no match" failure. The error number  PCRE2_ERROR_CALLOUT  is  reserved
4798       for use by callout functions; it will never be used by PCRE2 itself.
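
       The three kinds of return value can be illustrated with a small deci-
       sion function. This is a standalone sketch under stated assumptions:
       DEMO_ERROR_CALLOUT is a placeholder for PCRE2_ERROR_CALLOUT, and the
       policy (a call budget and a position limit) is invented purely for
       illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder negative value; real code should return a PCRE2_ERROR_xxx
   value from pcre2.h, such as PCRE2_ERROR_CALLOUT. */
#define DEMO_ERROR_CALLOUT (-37)

typedef struct {
    unsigned calls_left;    /* callouts allowed before giving up */
    size_t   max_position;  /* reject match paths that pass this offset */
} demo_policy;

/* Decide what a callout should return for the current subject offset. */
int decide_callout_result(size_t current_position, demo_policy *policy)
{
    assert(policy != NULL);
    if (policy->calls_left == 0)
        return DEMO_ERROR_CALLOUT;  /* < 0: abandon the whole match */
    policy->calls_left--;
    if (current_position > policy->max_position)
        return 1;                   /* > 0: fail here, try alternatives */
    return 0;                       /* 0: matching proceeds as normal */
}
```

       In a real callout function, the position would come from the cur-
       rent_position field and the policy from the callout data pointer.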
4799
4800
4801CALLOUT ENUMERATION
4802
4803       int pcre2_callout_enumerate(const pcre2_code *code,
4804         int (*callback)(pcre2_callout_enumerate_block *, void *),
4805         void *user_data);
4806
4807       A script language that supports the use of string arguments in callouts
4808       might like to scan all the callouts in a  pattern  before  running  the
4809       match. This can be done by calling pcre2_callout_enumerate(). The first
4810       argument is a pointer to a compiled pattern, the  second  points  to  a
4811       callback  function,  and the third is arbitrary user data. The callback
4812       function is called for every callout in the pattern  in  the  order  in
4813       which they appear. Its first argument is a pointer to a callout enumer-
4814       ation block, and its second argument is the user_data  value  that  was
4815       passed  to  pcre2_callout_enumerate(). The data block contains the fol-
4816       lowing fields:
4817
4818         version                Block version number
4819         pattern_position       Offset to next item in pattern
4820         next_item_length       Length of next item in pattern
4821         callout_number         Number for numbered callouts
4822         callout_string_offset  Offset to string within pattern
4823         callout_string_length  Length of callout string
4824         callout_string         Points to callout string or is NULL
4825
4826       The version number is currently 0. It will increase if new  fields  are
4827       ever  added  to  the  block. The remaining fields are the same as their
4828       namesakes in the pcre2_callout block that is used for  callouts  during
4829       matching, as described above.
4830
4831       Note  that  the  value  of pattern_position is unique for each callout.
4832       However, if a callout occurs inside a group that is quantified  with  a
4833       non-zero minimum or a fixed maximum, the group is replicated inside the
4834       compiled pattern. For example, a pattern such as /(a){2}/  is  compiled
4835       as  if it were /(a)(a)/. This means that the callout will be enumerated
4836       more than once, but with the same value for  pattern_position  in  each
4837       case.
4838
4839       The callback function should normally return zero. If it returns a non-
4840       zero value, scanning the pattern stops, and that value is returned from
4841       pcre2_callout_enumerate().
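
       Because replicated groups cause the same callout to be enumerated
       more than once, a callback may want to record each pattern_position
       only once. The sketch below shows one way to do that. It is a stand-
       alone model: demo_enumerate_block holds only two of the fields listed
       above and stands in for pcre2_callout_enumerate_block, and the other
       names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_POSITIONS 64

/* Stand-in for the enumeration block, with a subset of its fields. */
typedef struct {
    size_t   pattern_position;
    uint32_t callout_number;
} demo_enumerate_block;

/* User data: the distinct pattern positions seen so far. */
typedef struct {
    size_t seen[DEMO_MAX_POSITIONS];
    size_t count;
} demo_positions;

/* Callback: returns 0 to continue, non-zero to stop the scan. */
int collect_callout(const demo_enumerate_block *block, void *user_data)
{
    demo_positions *p = (demo_positions *)user_data;
    assert(block != NULL && p != NULL);
    for (size_t i = 0; i < p->count; i++) {
        if (p->seen[i] == block->pattern_position)
            return 0;  /* same callout again, from a replicated group */
    }
    if (p->count == DEMO_MAX_POSITIONS)
        return 1;      /* table full: stop scanning the pattern */
    p->seen[p->count++] = block->pattern_position;
    return 0;
}
```

       The non-zero return when the table is full relies on the rule above
       that such a value stops the scan and is passed back to the caller.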
4842
4843
4844AUTHOR
4845
4846       Philip Hazel
4847       University Computing Service
4848       Cambridge, England.
4849
4850
4851REVISION
4852
4853       Last updated: 03 February 2019
4854       Copyright (c) 1997-2019 University of Cambridge.
4855------------------------------------------------------------------------------
4856
4857
4858PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)
4859
4860
4861
4862NAME
4863       PCRE2 - Perl-compatible regular expressions (revised API)
4864
4865DIFFERENCES BETWEEN PCRE2 AND PERL
4866
4867       This  document describes some of the differences in the ways that PCRE2
4868       and Perl handle regular expressions. The differences described here are
4869       with  respect  to  Perl  version 5.32.0, but as both Perl and PCRE2 are
4870       continually changing, the information may at times be out of date.
4871
4872       1. PCRE2 has only a subset of Perl's Unicode support. Details  of  what
4873       it does have are given in the pcre2unicode page.
4874
4875       2.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4876       tions, but they do not mean what you might think. For example, (?!a){3}
4877       does not assert that the next three characters are not "a". It just as-
4878       serts that the next character is not "a"  three  times  (in  principle;
4879       PCRE2  optimizes this to run the assertion just once). Perl allows some
4880       repeat quantifiers on other  assertions,  for  example,  \b*  (but  not
4881       \b{3},  though oddly it does allow ^{3}), but these do not seem to have
4882       any use. PCRE2 does not allow any kind of quantifier on  non-lookaround
4883       assertions.
4884
4885       3.  Capture groups that occur inside negative lookaround assertions are
4886       counted, but their entries in the offsets vector are set  only  when  a
4887       negative  assertion is a condition that has a matching branch (that is,
4888       the condition is false).  Perl may set such  capture  groups  in  other
4889       circumstances.
4890
4891       4.  The  following Perl escape sequences are not supported: \F, \l, \L,
4892       \u, \U, and \N when followed by a character name. \N on its own, match-
4893       ing  a  non-newline  character, and \N{U+dd..}, matching a Unicode code
4894       point, are supported. The escapes that modify  the  case  of  following
4895       letters  are  implemented by Perl's general string-handling and are not
4896       part of its pattern matching engine. If any of these are encountered by
4897       PCRE2,  an  error  is  generated  by default. However, if either of the
4898       PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U  and  \u  are
4899       interpreted as ECMAScript interprets them.
4900
4901       5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
4902       is built with Unicode support (the default). The properties that can be
4903       tested  with  \p  and \P are limited to the general category properties
4904       such as Lu and Nd, script names such as Greek or Han, and  the  derived
4905       properties  Any and L&.  Both PCRE2 and Perl support the Cs (surrogate)
4906       property, but in PCRE2 its use is limited. See the  pcre2pattern  docu-
4907       mentation  for  details. The long synonyms for property names that Perl
4908       supports (such as \p{Letter}) are not supported by  PCRE2,  nor  is  it
4909       permitted to prefix any of these properties with "Is".
4910
4911       6. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
4912       in between are treated as literals. However, this is slightly different
4913       from  Perl  in  that  $  and  @ are also handled as literals inside the
4914       quotes. In Perl, they cause variable interpolation (but of course PCRE2
4915       does not have variables). Also, Perl does "double-quotish backslash in-
4916       terpolation" on any backslashes between \Q and \E which, its documenta-
4917       tion  says,  "may  lead to confusing results". PCRE2 treats a backslash
4918       between \Q and \E just like any other character. Note the following ex-
4919       amples:
4920
4921           Pattern            PCRE2 matches     Perl matches
4922
4923           \Qabc$xyz\E        abc$xyz           abc followed by the
4924                                                  contents of $xyz
4925           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
4926           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
4927           \QA\B\E            A\B               A\B
4928           \Q\\E              \                 \\E
4929
4930       The  \Q...\E  sequence  is recognized both inside and outside character
4931       classes by both PCRE2 and Perl.
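
       Since PCRE2 treats everything between \Q and \E as literal, an appli-
       cation can wrap arbitrary text in \Q...\E to match it literally. The
       only sequence that needs care is a \E inside the text itself. The
       sketch below handles it by closing the quote, adding an escaped back-
       slash and a literal E, and reopening the quote. quote_literal() is a
       hypothetical helper, not part of PCRE2, and the result is intended
       for PCRE2 only, given the Perl differences described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append n bytes of s to out, keeping out zero-terminated.
   Returns 1 on success, 0 if the buffer is too small. */
static int append(char *out, size_t out_size, size_t *used,
                  const char *s, size_t n)
{
    if (*used + n + 1 > out_size)
        return 0;
    memcpy(out + *used, s, n);
    *used += n;
    out[*used] = '\0';
    return 1;
}

/* Write a pattern that matches text literally. Returns the length
   written (excluding the terminating zero), or -1 if out is too small. */
long quote_literal(const char *text, char *out, size_t out_size)
{
    size_t used = 0;
    assert(text != NULL && out != NULL);
    if (!append(out, out_size, &used, "\\Q", 2))
        return -1;
    for (const char *p = text; *p != '\0'; ) {
        if (p[0] == '\\' && p[1] == 'E') {
            /* Emit \E, then \\ and E as ordinary items, then \Q again. */
            if (!append(out, out_size, &used, "\\E\\\\E\\Q", 7))
                return -1;
            p += 2;
        } else {
            if (!append(out, out_size, &used, p, 1))
                return -1;
            p += 1;
        }
    }
    if (!append(out, out_size, &used, "\\E", 2))
        return -1;
    return (long)used;
}
```

       For the text abc$xyz this produces \Qabc$xyz\E, the first pattern in
       the table above. For the text x\Ey it produces \Qx\E\\E\Qy\E.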
4932
4933       7.  Fairly  obviously,  PCRE2  does  not  support  the  (?{code})   and
4934       (??{code}) constructions. However, PCRE2 does have a "callout" feature,
4935       which allows an external function to be called during pattern matching.
4936       See the pcre2callout documentation for details.
4937
4938       8.  Subroutine  calls (whether recursive or not) were treated as atomic
4939       groups up to PCRE2 release 10.23, but from release 10.30 this  changed,
4940       and backtracking into subroutine calls is now supported, as in Perl.
4941
4942       9.  In  PCRE2,  if  any of the backtracking control verbs are used in a
4943       group that is called as a  subroutine  (whether  or  not  recursively),
4944       their  effect is confined to that group; it does not extend to the sur-
4945       rounding pattern. This is not always the case in Perl.  In  particular,
4946       if  (*THEN)  is  present in a group that is called as a subroutine, its
4947       action is limited to that group, even if the group does not contain any
4948       |  characters.  Note  that such groups are processed as anchored at the
4949       point where they are tested.
4950
4951       10. If a pattern contains more than one backtracking control verb,  the
4952       first  one  that  is backtracked onto acts. For example, in the pattern
4953       A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but  a  failure
4954       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4955       it is the same as PCRE2, but there are cases where it differs.
4956
4957       11. There are some differences that are concerned with the settings  of
4958       captured  strings  when  part  of  a  pattern is repeated. For example,
4959       matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves  $2  un-
4960       set, but in PCRE2 it is set to "b".
4961
4962       12.  PCRE2's  handling  of duplicate capture group numbers and names is
4963       not as general as Perl's. This is a consequence of the fact that PCRE2
4964       works  internally  just with numbers, using an external table to trans-
4965       late between numbers and  names.  In  particular,  a  pattern  such  as
4966       (?|(?<a>A)|(?<b>B)),  where the two capture groups have the same number
4967       but different names, is not supported, and causes an error  at  compile
4968       time. If it were allowed, it would not be possible to distinguish which
4969       group matched, because both names map to capture  group  number  1.  To
4970       avoid this confusing situation, an error is given at compile time.
4971
4972       13. Perl used to recognize comments in some places that PCRE2 does not,
4973       for example, between the ( and ? at the start of a  group.  If  the  /x
4974       modifier  was set,  Perl allowed white space between ( and ? though the
4975       latest Perls give an error (for a while it was just deprecated).  There
4976       may still be some cases where Perl behaves differently.
4977
4978       14.  Perl,  when  in warning mode, gives warnings for character classes
4979       such as [A-\d] or [a-[:digit:]]. It then treats the hyphens  as  liter-
4980       als. PCRE2 has no warning features, so it gives an error in these cases
4981       because they are almost certainly user mistakes.
4982
4983       15. In PCRE2, the upper/lower case character properties Lu and  Ll  are
4984       not  affected when case-independent matching is specified. For example,
4985       \p{Lu} always matches an upper case letter. I think Perl has changed in
4986       this  respect; in the release at the time of writing (5.32), \p{Lu} and
4987       \p{Ll} match all letters, regardless of case, when case independence is
4988       specified.
4989
4990       16. From release 5.32.0, Perl locks out the use of \K in lookaround as-
4991       sertions. In PCRE2, \K is acted on when it occurs  in  positive  asser-
4992       tions, but is ignored in negative assertions.
4993
4994       17.  PCRE2  provides some extensions to the Perl regular expression fa-
4995       cilities.  Perl 5.10 included new features that  were  not  in  earlier
4996       versions  of  Perl,  some  of which (such as named parentheses) were in
4997       PCRE2 for some time before. This list is with respect to Perl 5.32:
4998
4999       (a) Although lookbehind assertions in PCRE2  must  match  fixed  length
5000       strings, each alternative toplevel branch of a lookbehind assertion can
5001       match a different length of string. Perl requires them all to have  the
5002       same length.
5003
5004       (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
5005       ported in lookbehinds, provided that there is no possibility of  refer-
5006       encing  a  non-unique  number or name. Perl does not support backrefer-
5007       ences in lookbehinds.
5008
5009       (c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set,  the
5010       $ meta-character matches only at the very end of the string.
5011
5012       (d)  A  backslash  followed  by  a  letter  with  no special meaning is
5013       faulted. (Perl can be made to issue a warning.)
5014
5015       (e) If PCRE2_UNGREEDY is set, the greediness of the repetition  quanti-
5016       fiers is inverted, that is, by default they are not greedy, but if fol-
5017       lowed by a question mark they are.
5018
5019       (f) PCRE2_ANCHORED can be used at matching time to force a  pattern  to
5020       be tried only at the first matching position in the subject string.
5021
5022       (g)     The     PCRE2_NOTBOL,    PCRE2_NOTEOL,    PCRE2_NOTEMPTY    and
5023       PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
5024
5025       (h) The \R escape sequence can be restricted to match only CR,  LF,  or
5026       CRLF by the PCRE2_BSR_ANYCRLF option.
5027
5028       (i)  The  callout  facility is PCRE2-specific. Perl supports codeblocks
5029       and variable interpolation, but not general hooks on every match.
5030
5031       (j) The partial matching facility is PCRE2-specific.
5032
5033       (k) The alternative matching function (pcre2_dfa_match()) matches in  a
5034       different way and is not Perl-compatible.
5035
5036       (l)  PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
5037       at the start of a pattern. These set overall  options  that  cannot  be
5038       changed within the pattern.
5039
5040       (m)  PCRE2  supports non-atomic positive lookaround assertions. This is
5041       an extension to the lookaround facilities. The default, Perl-compatible
5042       lookarounds are atomic.
5043
5044       18.  The  Perl  /a modifier restricts \d numbers to pure ASCII, and the
5045       /aa modifier restricts /i case-insensitive matching to pure ASCII,  ig-
5046       noring  Unicode  rules.  This  separation  cannot  be  represented with
5047       PCRE2_UCP.
5048
5049       19. Perl has different limits than PCRE2. See the pcre2limit documenta-
5050       tion for details. In release 5.10, Perl changed from recursion to
5051       iteration, keeping the intermediate matches on the heap, which is about
5052       10% slower but is not subject to any stack-overflow limit. PCRE2 made a
5053       similar change at release 10.30, and also has many build-time and
5054       run-time customizable limits.
5055
5056
5057AUTHOR
5058
5059       Philip Hazel
5060       University Computing Service
5061       Cambridge, England.
5062
5063
5064REVISION
5065
5066       Last updated: 06 October 2020
5067       Copyright (c) 1997-2019 University of Cambridge.
5068------------------------------------------------------------------------------
5069
5070
5071PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)
5072
5073
5074
5075NAME
5076       PCRE2 - Perl-compatible regular expressions (revised API)
5077
5078PCRE2 JUST-IN-TIME COMPILER SUPPORT
5079
5080       Just-in-time  compiling  is a heavyweight optimization that can greatly
5081       speed up pattern matching. However, it comes at the cost of extra  pro-
5082       cessing  before  the  match is performed, so it is of most benefit when
5083       the same pattern is going to be matched many times. This does not  nec-
5084       essarily  mean many calls of a matching function; if the pattern is not
5085       anchored, matching attempts may take place many times at various  posi-
5086       tions in the subject, even for a single call. Therefore, if the subject
5087       string is very long, it may still pay  to  use  JIT  even  for  one-off
5088       matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
5089       32-bit PCRE2 libraries.
5090
5091       JIT support applies only to the  traditional  Perl-compatible  matching
5092       function.   It  does  not apply when the DFA matching function is being
5093       used. The code for this support was written by Zoltan Herczeg.
5094
5095
5096AVAILABILITY OF JIT SUPPORT
5097
5098       JIT support is an optional feature of  PCRE2.  The  "configure"  option
5099       --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
5100       built if you want to use JIT. The support is limited to  the  following
5101       hardware platforms:
5102
5103         ARM 32-bit (v5, v7, and Thumb2)
5104         ARM 64-bit
5105         Intel x86 32-bit and 64-bit
5106         MIPS 32-bit and 64-bit
5107         Power PC 32-bit and 64-bit
5108         SPARC 32-bit
5109
5110       If --enable-jit is set on an unsupported platform, compilation fails.
5111
5112       A  program  can  tell if JIT support is available by calling pcre2_con-
5113       fig() with the PCRE2_CONFIG_JIT option. The result is  1  when  JIT  is
5114       available,  and 0 otherwise. However, a simple program does not need to
5115       check this in order to use JIT. The API is implemented in  a  way  that
5116       falls  back  to the interpretive code if JIT is not available. For pro-
5117       grams that need the best possible performance, there is  also  a  "fast
5118       path" API that is JIT-specific.
5119
5120
5121SIMPLE USE OF JIT
5122
5123       To  make use of the JIT support in the simplest way, all you have to do
5124       is to call pcre2_jit_compile() after successfully compiling  a  pattern
5125       with pcre2_compile(). This function has two arguments: the first is the
5126       compiled pattern pointer that was returned by pcre2_compile(), and  the
5127       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
5128       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
5129
5130       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
5131       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
5132       pattern is passed to the JIT compiler, which turns it into machine code
5133       that executes much faster than the normal interpretive code, but yields
5134       exactly the same results. The returned value  from  pcre2_jit_compile()
5135       is zero on success, or a negative error code.
5136
5137       There  is  a limit to the size of pattern that JIT supports, imposed by
5138       the size of machine stack that it uses. The exact rules are  not  docu-
5139       mented because they may change at any time, in particular, when new op-
5140       timizations are introduced.  If  a  pattern  is  too  big,  a  call  to
5141       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
5142
5143       PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
5144       plete matches. If you want to run partial matches using the  PCRE2_PAR-
5145       TIAL_HARD  or  PCRE2_PARTIAL_SOFT  options of pcre2_match(), you should
5146       set one or both of  the  other  options  as  well  as,  or instead  of,
5147       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
5148       for each of the three modes (normal, soft partial, hard partial).  When
5149       pcre2_match()  is  called,  the appropriate code is run if it is avail-
5150       able. Otherwise, the pattern is matched using interpretive code.
5151
5152       You can call pcre2_jit_compile() multiple times for the  same  compiled
5153       pattern.  It does nothing if it has previously compiled code for any of
5154       the option bits. For example, you can call it once with  PCRE2_JIT_COM-
5155       PLETE  and  (perhaps  later,  when  you find you need partial matching)
5156       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
5157       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
5158       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5159       diately returns zero. This is an alternative way of testing whether JIT
5160       is available.
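
       The behaviour of repeated calls described above can be modelled with
       a bit mask. This is only an illustration of the description, not
       PCRE2's internal code: the DEMO_JIT_* values are placeholders for the
       PCRE2_JIT_* option bits, and modes_to_compile() is a hypothetical
       helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bits for this sketch; real code uses PCRE2_JIT_COMPLETE,
   PCRE2_JIT_PARTIAL_SOFT and PCRE2_JIT_PARTIAL_HARD from pcre2.h. */
#define DEMO_JIT_COMPLETE      0x1u
#define DEMO_JIT_PARTIAL_SOFT  0x2u
#define DEMO_JIT_PARTIAL_HARD  0x4u

/* Return the requested modes that have no compiled code yet, and record
   them as compiled. Modes that are already present are ignored. */
uint32_t modes_to_compile(uint32_t *compiled, uint32_t requested)
{
    uint32_t missing;
    assert(compiled != NULL);
    missing = requested & ~*compiled;
    *compiled |= missing;
    return missing;
}
```

       In this model, a first request for complete matching followed by a
       request for complete plus hard partial matching leaves only the hard
       partial mode to be compiled on the second call.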
5161
5162       At present, it is not possible to free JIT compiled  code  except  when
5163       the entire compiled pattern is freed by calling pcre2_code_free().
5164
5165       In  some circumstances you may need to call additional functions. These
5166       are described in the section entitled "Controlling the JIT  stack"  be-
5167       low.
5168
5169       There are some pcre2_match() options that are not supported by JIT, and
5170       there are also some pattern items that JIT cannot handle.  Details  are
5171       given  below.  In  both cases, matching automatically falls back to the
5172       interpretive code. If you want to know whether JIT  was  actually  used
5173       for  a particular match, you should arrange for a JIT callback function
5174       to be set up as described in the section entitled "Controlling the  JIT
5175       stack"  below,  even  if  you  do  not need to supply a non-default JIT
5176       stack. Such a callback function is called whenever JIT code is about to
5177       be  obeyed.  If the match-time options are not right for JIT execution,
5178       the callback function is not obeyed.
5179
5180       If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
5181       ated.  You  can find out if JIT matching is available after compiling a
5182       pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
5183       tion.  A  non-zero  result means that JIT compilation was successful. A
5184       result of 0 means that JIT support is not available, or the pattern was
5185       not  processed by pcre2_jit_compile(), or the JIT compiler was not able
5186       to handle the pattern.
5187
5188
5189MATCHING SUBJECTS CONTAINING INVALID UTF
5190
5191       When a pattern is compiled with the PCRE2_UTF option,  subject  strings
5192       are  normally expected to be a valid sequence of UTF code units. By de-
5193       fault, this is checked at the start of matching and an error is  gener-
5194       ated  if  invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can be
5195       passed to pcre2_match() to skip the check (for improved performance) if
5196       you  are  sure  that  a subject string is valid. If this option is used
5197       with an invalid string, the result is undefined.
5198
5199       However, a way of running matches on strings that may  contain  invalid
5200       UTF   sequences   is   available.   Calling  pcre2_compile()  with  the
5201       PCRE2_MATCH_INVALID_UTF option has two effects:  it  tells  the  inter-
5202       preter  in pcre2_match() to support invalid UTF, and, if pcre2_jit_com-
5203       pile() is called, the compiled JIT code also supports invalid UTF.  De-
5204       tails  of  how this support works, in both the JIT and the interpretive
5205       cases, are given in the pcre2unicode documentation.
5206
5207       There  is  also  an  obsolete  option  for  pcre2_jit_compile()  called
5208       PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
5209       ibility.    It   is   superseded   by   the   pcre2_compile()    option
5210       PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be removed
5211       in future.
5212
5213
5214UNSUPPORTED OPTIONS AND PATTERN ITEMS
5215
5216       The pcre2_match() options that  are  supported  for  JIT  matching  are
5217       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
5218       PCRE2_NOTEMPTY_ATSTART,  PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,   and
5219       PCRE2_PARTIAL_SOFT.  The  PCRE2_ANCHORED  and PCRE2_ENDANCHORED options
5220       are not supported at match time.
5221
5222       If the PCRE2_NO_JIT option is passed to pcre2_match() it  disables  the
5223       use of JIT, forcing matching by the interpreter code.
5224
5225       The  only  unsupported  pattern items are \C (match a single data unit)
5226       when running in a UTF mode, and a callout immediately before an  asser-
5227       tion condition in a conditional group.
5228
5229
5230RETURN VALUES FROM JIT MATCHING
5231
5232       When a pattern is matched using JIT matching, the return values are the
5233       same as those given by the interpretive pcre2_match()  code,  with  the
5234       addition  of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
5235       that the memory used for the JIT stack was insufficient. See  "Control-
5236       ling the JIT stack" below for a discussion of JIT stack usage.
5237
5238       The  error  code  PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
5239       searching a very large pattern tree goes on for too long, as it  is  in
5240       the  same circumstance when JIT is not used, but the details of exactly
5241       what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
5242       is never returned when JIT matching is used.
5243
5244
5245CONTROLLING THE JIT STACK
5246
5247       When the compiled JIT code runs, it needs a block of memory to use as a
5248       stack.  By default, it uses 32KiB on the machine stack.  However,  some
5249       large  or complicated patterns need more than this. The error PCRE2_ER-
5250       ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5251       tions are provided for managing blocks of memory for use as JIT stacks.
5252       There is further discussion about the use of JIT stacks in the  section
5253       entitled "JIT stack FAQ" below.
5254
5255       The  pcre2_jit_stack_create()  function  creates a JIT stack. Its argu-
5256       ments are a starting size, a maximum size, and a general  context  (for
5257       memory  allocation  functions, or NULL for standard memory allocation).
5258       It returns a pointer to an opaque structure of type pcre2_jit_stack, or
5259       NULL  if there is an error. The pcre2_jit_stack_free() function is used
5260       to free a stack that is no longer needed. If its argument is NULL, this
5261       function  returns immediately, without doing anything. (For the techni-
5262       cally minded: the address space is allocated by mmap or  VirtualAlloc.)
5263       A  maximum  stack size of 512KiB to 1MiB should be more than enough for
5264       any pattern.
5265
5266       The pcre2_jit_stack_assign() function specifies which  stack  JIT  code
5267       should use. Its arguments are as follows:
5268
5269         pcre2_match_context  *mcontext
5270         pcre2_jit_callback    callback
5271         void                 *data
5272
5273       The first argument is a pointer to a match context. When this is subse-
5274       quently passed to a matching function, its information determines which
5275       JIT stack is used. If this argument is NULL, the function returns imme-
5276       diately, without doing anything. There are three cases for  the  values
5277       of the other two arguments:
5278
5279         (1) If callback is NULL and data is NULL, an internal 32KiB block
5280             on the machine stack is used. This is the default when a match
5281             context is created.
5282
5283         (2) If callback is NULL and data is not NULL, data must be
5284             a pointer to a valid JIT stack, the result of calling
5285             pcre2_jit_stack_create().
5286
5287         (3) If callback is not NULL, it must point to a function that is
5288             called with data as an argument at the start of matching, in
5289             order to set up a JIT stack. If the return from the callback
5290             function is NULL, the internal 32KiB stack is used; otherwise the
5291             return value must be a valid JIT stack, the result of calling
5292             pcre2_jit_stack_create().
5293
5294       A  callback function is obeyed whenever JIT code is about to be run; it
5295       is not obeyed when pcre2_match() is called with options that are incom-
5296       patible with JIT matching. A callback function can therefore be used to
5297       determine whether a match operation was executed by JIT or by  the  in-
5298       terpreter.
5299
5300       You may safely use the same JIT stack for more than one pattern (either
5301       by assigning directly or by callback), as  long  as  the  patterns  are
5302       matched sequentially in the same thread. Currently, the only way to set
5303       up non-sequential matches in one thread is to use callouts: if a  call-
5304       out  function starts another match, that match must use a different JIT
5305       stack to the one used for currently suspended match(es).
5306
5307       In a multithread application, if you do not specify a JIT stack, or  if
5308       you  assign or pass back NULL from a callback, that is thread-safe, be-
5309       cause each thread has its own machine stack. However, if you assign  or
5310       pass back a non-NULL JIT stack, this must be a different stack for each
5311       thread so that the application is thread-safe.
5312
5313       Strictly speaking, even more is allowed. You can assign the  same  non-
5314       NULL  stack  to a match context that is used by any number of patterns,
5315       as long as they are not used for matching by multiple  threads  at  the
5316       same  time.  For  example, you could use the same stack in all compiled
5317       patterns, with a global mutex in the callback to wait until  the  stack
5318       is available for use. However, this is an inefficient solution, and not
5319       recommended.
5320
5321       This is a suggestion for how a multithreaded program that needs to  set
5322       up non-default JIT stacks might operate:
5323
5324         During thread initialization
5325           thread_local_var = pcre2_jit_stack_create(...)
5326
5327         During thread exit
5328           pcre2_jit_stack_free(thread_local_var)
5329
5330         Use a one-line callback function
5331           return thread_local_var
5332
5333       All  the  functions  described in this section do nothing if JIT is not
5334       available.
5335
5336
5337JIT STACK FAQ
5338
5339       (1) Why do we need JIT stacks?
5340
5341       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
5342       where  the local data of the current node is pushed before checking its
5343       child nodes. Allocating real machine stack on some platforms is diffi-
5344       cult. For example, on PowerPC the stack chain must be updated every time
5345       the stack is extended. Although this is possible, the overhead of updat-
5346       ing it decreases performance. So we do the recursion in memory.
5347
5348       (2) Why don't we simply allocate blocks of memory with malloc()?
5349
5350       Modern  operating  systems have a nice feature: they can reserve an ad-
5351       dress space instead of allocating memory. We can safely allocate memory
5352       pages inside this address space, so the stack could grow without moving
5353       memory data (this is important because of pointers). Thus we can allo-
5354       cate 1MiB of address space, and use only a single memory page (usually
5355       4KiB) if that is enough. However, we can still grow up to 1MiB at any
5356       time if needed.
5357
5358       (3) Who "owns" a JIT stack?
5359
5360       The owner of the stack is the user program, not the JIT-compiled pattern
5361       or anything else. The user program must ensure that while a stack is in
5362       use by pcre2_match() (that is, it is assigned to a match context that
5363       is passed to the pattern currently running), it is not used by any other
5364       thread (to avoid overwriting the same memory area).
5365       The best practice for multithreaded programs is to allocate a stack for
5366       each thread, and return this stack through the JIT callback function.
5367
5368       (4) When should a JIT stack be freed?
5369
5370       You can free a JIT stack at any time, as long as it will not be used by
5371       pcre2_match() again. When you assign the stack to a match context, only
5372       a  pointer  is  set. There is no reference counting or any other magic.
5373       You can free compiled patterns, contexts, and stacks in any order, any-
5374       time. Just do not call pcre2_match() with a match context pointing to an
5375       already freed stack, as that will cause a segmentation fault. (Also, do
5376       not free a stack currently used by pcre2_match() in another thread.) You
5377       can also replace the stack in a context at any time when it is not in
5378       use. You should free the previous stack before assigning a replacement.
5379
5380       (5)  Should  I  allocate/free  a  stack every time before/after calling
5381       pcre2_match()?
5382
5383       No, because this is too costly in terms of resources. However, you
5384       could implement some scheme that releases the stack if it has not been
5385       used for, say, two minutes. The JIT callback can help to achieve
5386       this without keeping a list of patterns.
5387
5388       (6)  OK, the stack is for long term memory allocation. But what happens
5389       if a pattern causes stack overflow with a stack of 1MiB? Is  that  1MiB
5390       kept until the stack is freed?
5391
5392       Especially on embedded systems, it might be a good idea to release mem-
5393       ory sometimes without freeing the stack. There is no API for this at
5394       the moment. A function that returns the memory currently allocated for
5395       a stack, and another that releases memory (shrinking the stack), would
5396       probably be a good idea if someone needs this.
5397
5398       (7) This is too much of a headache. Isn't there any better solution for
5399       JIT stack handling?
5400
5401       No, thanks to Windows. If POSIX threads were used everywhere, we  could
5402       throw out this complicated API.
5403
5404
5405FREEING JIT SPECULATIVE MEMORY
5406
5407       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
5408
5409       The JIT executable allocator does not free all memory as soon as it can.
5410       It expects new allocations, and keeps some free memory around to
5411       improve  allocation  speed. However, in low memory conditions, it might
5412       be better to free all possible memory. You can cause this to happen  by
5413       calling  pcre2_jit_free_unused_memory(). Its argument is a general con-
5414       text, for custom memory management, or NULL for standard memory manage-
5415       ment.
5416
5417
5418EXAMPLE CODE
5419
5420       This  is  a  single-threaded example that specifies a JIT stack without
5421       using a callback. A real program should include  error  checking  after
5422       all the function calls.
5423
5424         int rc;
5425         pcre2_code *re;
5426         pcre2_match_data *match_data;
5427         pcre2_match_context *mcontext;
5428         pcre2_jit_stack *jit_stack;
5429
5430         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
5431           &errornumber, &erroffset, NULL);
5432         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
5433         mcontext = pcre2_match_context_create(NULL);
5434         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
5435         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
5436         match_data = pcre2_match_data_create(re, 10);
5437         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
5438         /* Process result */
5439
5440         pcre2_code_free(re);
5441         pcre2_match_data_free(match_data);
5442         pcre2_match_context_free(mcontext);
5443         pcre2_jit_stack_free(jit_stack);
5444
5445
5446JIT FAST PATH API
5447
5448       Because the API described above falls back to interpreted matching when
5449       JIT is not available, it is convenient for programs  that  are  written
5450       for  general  use  in  many  environments.  However,  calling  JIT  via
5451       pcre2_match() does have a performance impact. Programs that are written
5452       for  use  where  JIT  is known to be available, and which need the best
5453       possible performance, can instead use a "fast path"  API  to  call  JIT
5454       matching  directly instead of calling pcre2_match() (obviously only for
5455       patterns that have been successfully processed by pcre2_jit_compile()).
5456
5457       The fast path function is called pcre2_jit_match(), and  it  takes  ex-
5458       actly  the same arguments as pcre2_match(). However, the subject string
5459       must be specified with a  length;  PCRE2_ZERO_TERMINATED  is  not  sup-
5460       ported. Unsupported option bits (for example, PCRE2_ANCHORED, PCRE2_EN-
5461       DANCHORED  and  PCRE2_COPY_MATCHED_SUBJECT)  are  ignored,  as  is  the
5462       PCRE2_NO_JIT  option.  The  return  values  are  also  the  same as for
5463       pcre2_match(), plus PCRE2_ERROR_JIT_BADOPTION if a matching mode  (par-
5464       tial or complete) is requested that was not compiled.
5465
5466       When  you call pcre2_match(), as well as testing for invalid options, a
5467       number of other sanity checks are performed on the arguments. For exam-
5468       ple, if the subject pointer is NULL, an immediate error is given. Also,
5469       unless PCRE2_NO_UTF_CHECK is set, a UTF subject string  is  tested  for
5470       validity.  In the interests of speed, these checks do not happen on the
5471       JIT fast path, and if invalid data is passed, the result is undefined.
5472
5473       Bypassing the sanity checks and the  pcre2_match()  wrapping  can  give
5474       speedups of more than 10%.
5475
5476
5477SEE ALSO
5478
5479       pcre2api(3)
5480
5481
5482AUTHOR
5483
5484       Philip Hazel (FAQ by Zoltan Herczeg)
5485       University Computing Service
5486       Cambridge, England.
5487
5488
5489REVISION
5490
5491       Last updated: 23 May 2019
5492       Copyright (c) 1997-2019 University of Cambridge.
5493------------------------------------------------------------------------------
5494
5495
5496PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)
5497
5498
5499
5500NAME
5501       PCRE2 - Perl-compatible regular expressions (revised API)
5502
5503SIZE AND OTHER LIMITATIONS
5504
5505       There are some size limitations in PCRE2 but it is hoped that they will
5506       never in practice be relevant.
5507
5508       The maximum size of a compiled pattern  is  approximately  64  thousand
5509       code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5510       the default internal linkage size, which  is  2  bytes  for  these  li-
5511       braries.  If  you  want  to  process regular expressions that are truly
5512       enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
5513       (when  building  the  16-bit  library,  3  is rounded up to 4). See the
5514       README file in the source distribution and the pcre2build documentation
5515       for  details.  In  these cases the limit is substantially larger.  How-
5516       ever, the speed of execution is slower. In the 32-bit library, the  in-
5517       ternal linkage size is always 4.
5518
5519       The maximum length of a source pattern string is essentially unlimited;
5520       it is the largest number a PCRE2_SIZE variable can hold.  However,  the
5521       program that calls pcre2_compile() can specify a smaller limit.
5522
5523       The maximum length (in code units) of a subject string is one less than
5524       the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5525       signed integer type, usually defined as size_t. Its maximum value (that
5526       is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-termi-
5527       nated strings and unset offsets.
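
       The arithmetic behind this limit can be illustrated without the library
       itself. The sketch below assumes that PCRE2_SIZE is size_t, which is
       the usual case; the names sketch_pcre2_size, SKETCH_RESERVED, max_sub-
       ject_length(), and is_explicit_length() are hypothetical and are not
       part of the PCRE2 API.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: PCRE2_SIZE is defined as size_t. */
typedef size_t sketch_pcre2_size;

/* All bits set: the value reserved as a special indicator. */
#define SKETCH_RESERVED (~(sketch_pcre2_size)0)

/* The longest subject, in code units, is one less than the reserved value. */
sketch_pcre2_size max_subject_length(void)
{
    return SKETCH_RESERVED - 1;
}

/* Returns 1 if length can be given as an explicit subject length, or 0 if
   it collides with the reserved indicator value. */
int is_explicit_length(sketch_pcre2_size length)
{
    return length != SKETCH_RESERVED;
}
```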
5528
5529       All values in repeating quantifiers must be less than 65536.
5530
5531       The maximum length of a lookbehind assertion is 65535 characters.
5532
5533       There  is no limit to the number of parenthesized groups, but there can
5534       be no more than 65535 capture groups, and there is a limit to the depth
5535       of  nesting  of parenthesized subpatterns of all kinds. This is imposed
5536       in order to limit the amount of system stack used at compile time.  The
5537       default limit can be specified when PCRE2 is built; if not, the default
5538       is set to  250.  An  application  can  change  this  limit  by  calling
5539       pcre2_set_parens_nest_limit() to set the limit in a compile context.
5540
5541       The maximum length of a name for a named capture group is 32 code units,
5542       and the maximum number of such groups is 10000.
5543
5544       The maximum length of a  name  in  a  (*MARK),  (*PRUNE),  (*SKIP),  or
5545       (*THEN)  verb  is  255  code units for the 8-bit library and 65535 code
5546       units for the 16-bit and 32-bit libraries.
5547
5548       The maximum length of a string argument to a  callout  is  the  largest
5549       number a 32-bit unsigned integer can hold.
5550
5551
5552AUTHOR
5553
5554       Philip Hazel
5555       University Computing Service
5556       Cambridge, England.
5557
5558
5559REVISION
5560
5561       Last updated: 02 February 2019
5562       Copyright (c) 1997-2019 University of Cambridge.
5563------------------------------------------------------------------------------
5564
5565
5566PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)
5567
5568
5569
5570NAME
5571       PCRE2 - Perl-compatible regular expressions (revised API)
5572
5573PCRE2 MATCHING ALGORITHMS
5574
5575       This document describes the two different algorithms that are available
5576       in PCRE2 for matching a compiled regular  expression  against  a  given
5577       subject  string.  The  "standard"  algorithm is the one provided by the
5578       pcre2_match() function. This works in the same way as Perl's matching
5579       function, and provides a Perl-compatible matching operation. The just-
5580       in-time (JIT) optimization that is described in the pcre2jit documenta-
5581       tion is compatible with this function.
5582
5583       An alternative algorithm is provided by the pcre2_dfa_match() function;
5584       it operates in a different way, and is not Perl-compatible. This alter-
5585       native  has advantages and disadvantages compared with the standard al-
5586       gorithm, and these are described below.
5587
5588       When there is only one possible way in which a given subject string can
5589       match  a pattern, the two algorithms give the same answer. A difference
5590       arises, however, when there are multiple possibilities. For example, if
5591       the pattern
5592
5593         ^<.*>
5594
5595       is matched against the string
5596
5597         <something> <something else> <something further>
5598
5599       there are three possible answers. The standard algorithm finds only one
5600       of them, whereas the alternative algorithm finds all three.
5601
5602
5603REGULAR EXPRESSIONS AS TREES
5604
5605       The set of strings that are matched by a regular expression can be rep-
5606       resented  as  a  tree structure. An unlimited repetition in the pattern
5607       makes the tree of infinite size, but it is still a tree.  Matching  the
5608       pattern  to a given subject string (from a given starting point) can be
5609       thought of as a search of the tree.  There are two  ways  to  search  a
5610       tree:  depth-first  and  breadth-first, and these correspond to the two
5611       matching algorithms provided by PCRE2.
5612
5613
5614THE STANDARD MATCHING ALGORITHM
5615
5616       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
5617       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
5618       depth-first search of the pattern tree. That is, it  proceeds  along  a
5619       single path through the tree, checking that the subject matches what is
5620       required. When there is a mismatch, the algorithm  tries  any  alterna-
5621       tives  at  the  current point, and if they all fail, it backs up to the
5622       previous branch point in the  tree,  and  tries  the  next  alternative
5623       branch  at  that  level.  This often involves backing up (moving to the
5624       left) in the subject string as well.  The  order  in  which  repetition
5625       branches  are  tried  is controlled by the greedy or ungreedy nature of
5626       the quantifier.
5627
5628       If a leaf node is reached, a matching string has  been  found,  and  at
5629       that  point the algorithm stops. Thus, if there is more than one possi-
5630       ble match, this algorithm returns the first one that it finds.  Whether
5631       this  is the shortest, the longest, or some intermediate length depends
5632       on the way the greedy and ungreedy repetition quantifiers are specified
5633       in the pattern.
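
       The effect of greediness on which match is found first can be shown
       with a toy matcher. The sketch below is not PCRE2 code: it hard-wires
       the earlier example pattern ^<.*> (or its ungreedy form ^<.*?>), ig-
       nores the special treatment of newlines, and uses the hypothetical
       function name first_match_angle(). It tries the alternatives for the
       repetition in the order a depth-first search would, and stops at the
       first success.

```c
#include <stddef.h>
#include <string.h>

/* Toy depth-first matcher for ^<.*> (greedy != 0) or ^<.*?> (greedy == 0).
   Returns the length of the first match found, or -1 if there is none. */
long first_match_angle(const char *subject, int greedy)
{
    size_t len = strlen(subject);
    size_t p;

    if (len < 2 || subject[0] != '<') return -1;

    if (greedy) {
        /* Let ".*" take as much as possible, then back up one character
           at a time until the final '>' can match. */
        for (p = len - 1; p >= 1; p--)
            if (subject[p] == '>') return (long)(p + 1);
    } else {
        /* Let ".*?" take as little as possible, extending it by one
           character each time the final '>' fails to match. */
        for (p = 1; p < len; p++)
            if (subject[p] == '>') return (long)(p + 1);
    }
    return -1;
}
```

       For the subject "<something> <something else> <something further>", the
       greedy form returns the length of the whole string, and the ungreedy
       form returns 11, the length of "<something>".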
5634
5635       Because  it  ends  up  with a single path through the tree, it is rela-
5636       tively straightforward for this algorithm to keep  track  of  the  sub-
5637       strings  that  are  matched  by portions of the pattern in parentheses.
5638       This provides support for capturing parentheses and backreferences.
5639
5640
5641THE ALTERNATIVE MATCHING ALGORITHM
5642
5643       This algorithm conducts a breadth-first search of  the  tree.  Starting
5644       from  the  first  matching  point  in the subject, it scans the subject
5645       string from left to right, once, character by character, and as it does
5646       this,  it remembers all the paths through the tree that represent valid
5647       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
5648       though  it is not implemented as a traditional finite state machine (it
5649       keeps multiple states active simultaneously).
5650
5651       Although the general principle of this matching algorithm  is  that  it
5652       scans  the subject string only once, without backtracking, there is one
5653       exception: when a lookaround assertion is encountered,  the  characters
5654       following  or  preceding the current point have to be independently in-
5655       spected.
5656
5657       The scan continues until either the end of the subject is  reached,  or
5658       there  are  no more unterminated paths. At this point, terminated paths
5659       represent the different matching possibilities (if there are none,  the
5660       match  has  failed).   Thus,  if there is more than one possible match,
5661       this algorithm finds all of them, and in particular, it finds the long-
5662       est.  The  matches are returned in decreasing order of length. There is
5663       an option to stop the algorithm after the first match (which is  neces-
5664       sarily the shortest) is found.
5665
5666       Note that all the matches that are found start at the same point in the
5667       subject. If the pattern
5668
5669         cat(er(pillar)?)?
5670
5671       is matched against the string "the caterpillar catchment",  the  result
5672       is  the  three  strings "caterpillar", "cater", and "cat" that start at
5673       the fifth character of the subject. The algorithm  does  not  automati-
5674       cally move on to find matches that start at later positions.
5675
5676       PCRE2's "auto-possessification" optimization usually applies to charac-
5677       ter repeats at the end of a pattern (as well as internally). For  exam-
5678       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
5679       is no point even considering the possibility of backtracking  into  the
5680       repeated  digits.  For  DFA matching, this means that only one possible
5681       match is found. If you really do want multiple matches in  such  cases,
5682       either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5683       SESS option when compiling.
5684
5685       There are a number of features of PCRE2 regular  expressions  that  are
5686       not  supported  or behave differently in the alternative matching func-
5687       tion. Those that are not supported cause an error if encountered.
5688
5689       1. Because the algorithm finds all possible matches, the greedy or  un-
5690       greedy  nature of repetition quantifiers is not relevant (though it may
5691       affect auto-possessification,  as  just  described).  During  matching,
5692       greedy  and  ungreedy  quantifiers are treated in exactly the same way.
5693       However, possessive quantifiers can make a difference when what follows
5694       could  also  match  what  is  quantified, for example in a pattern like
5695       this:
5696
5697         ^a++\w!
5698
5699       This pattern matches "aaab!" but not "aaa!", which would be matched  by
5700       a  non-possessive quantifier. Similarly, if an atomic group is present,
5701       it is matched as if it were a standalone pattern at the current  point,
5702       and  the  longest match is then "locked in" for the rest of the overall
5703       pattern.
5704
5705       2. When dealing with multiple paths through the tree simultaneously, it
5706       is  not  straightforward  to  keep track of captured substrings for the
5707       different matching possibilities, and PCRE2's  implementation  of  this
5708       algorithm does not attempt to do this. This means that no captured sub-
5709       strings are available.
5710
5711       3. Because no substrings are captured, backreferences within  the  pat-
5712       tern are not supported.
5713
5714       4.  For  the same reason, conditional expressions that use a backrefer-
5715       ence as the condition or test for a specific group  recursion  are  not
5716       supported.
5717
5718       5. Again for the same reason, script runs are not supported.
5719
5720       6. Because many paths through the tree may be active, the \K escape se-
5721       quence, which resets the start of the match when encountered  (but  may
5722       be on some paths and not on others), is not supported.
5723
5724       7.  Callouts  are  supported, but the value of the capture_top field is
5725       always 1, and the value of the capture_last field is always 0.
5726
5727       8. The \C escape sequence, which (in  the  standard  algorithm)  always
5728       matches  a  single  code  unit, even in a UTF mode, is not supported in
5729       these modes, because the alternative algorithm moves through  the  sub-
5730       ject  string  one  character  (not code unit) at a time, for all active
5731       paths through the tree.
5732
5733       9. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
5734       are  not  supported.  (*FAIL)  is supported, and behaves like a failing
5735       negative assertion.
5736
5737       10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not  sup-
5738       ported by pcre2_dfa_match().
5739
5740
5741ADVANTAGES OF THE ALTERNATIVE ALGORITHM
5742
5743       Using  the alternative matching algorithm provides the following advan-
5744       tages:
5745
5746       1. All possible matches (at a single point in the subject) are automat-
5747       ically  found,  and  in particular, the longest match is found. To find
5748       more than one match using the standard algorithm, you have to do kludgy
5749       things with callouts.
5750
5751       2.  Because  the  alternative  algorithm  scans the subject string just
5752       once, and never needs to backtrack (except for lookbehinds), it is pos-
5753       sible  to  pass  very  long subject strings to the matching function in
5754       several pieces, checking for partial matching each time. Although it is
5755       also  possible  to  do  multi-segment matching using the standard algo-
5756       rithm, by retaining partially matched substrings, it  is  more  compli-
5757       cated. The pcre2partial documentation gives details of partial matching
5758       and discusses multi-segment matching.
5759
5760
5761DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
5762
5763       The alternative algorithm suffers from a number of disadvantages:
5764
5765       1. It is substantially slower than  the  standard  algorithm.  This  is
5766       partly  because  it has to search for all possible matches, but is also
5767       because it is less susceptible to optimization.
5768
5769       2. Capturing parentheses, backreferences,  script  runs,  and  matching
5770       within invalid UTF strings are not supported.
5771
5772       3. Although atomic groups are supported, their use does not provide the
5773       performance advantage that it does for the standard algorithm.
5774
5775
5776AUTHOR
5777
5778       Philip Hazel
5779       University Computing Service
5780       Cambridge, England.
5781
5782
5783REVISION
5784
5785       Last updated: 23 May 2019
5786       Copyright (c) 1997-2019 University of Cambridge.
5787------------------------------------------------------------------------------
5788
5789
5790PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)
5791
5792
5793
5794NAME
5795       PCRE2 - Perl-compatible regular expressions
5796
5797PARTIAL MATCHING IN PCRE2
5798
5799       In  normal use of PCRE2, if there is a match up to the end of a subject
5800       string, but more characters are needed to  match  the  entire  pattern,
5801       PCRE2_ERROR_NOMATCH  is  returned,  just  like any other failing match.
5802       There are circumstances where it might be helpful to  distinguish  this
5803       "partial match" case.
5804
5805       One  example  is  an application where the subject string is very long,
5806       and not all available at once. The requirement here is to be able to do
5807       the  matching  segment  by segment, but special action is needed when a
5808       matched substring spans the boundary between two segments.
5809
5810       Another example is checking a user input string as it is typed, to  en-
5811       sure  that  it conforms to a required format. Invalid characters can be
5812       immediately diagnosed and rejected, giving instant feedback.
5813
5814       Partial matching is a PCRE2-specific feature; it is  not  Perl-compati-
5815       ble.  It  is  requested  by  setting  one  of the PCRE2_PARTIAL_HARD or
5816       PCRE2_PARTIAL_SOFT options when calling a matching function.  The  dif-
5817       ference  between  the  two options is whether or not a partial match is
5818       preferred to an alternative complete match, though the  details  differ
5819       between  the  two  types of matching function. If both options are set,
5820       PCRE2_PARTIAL_HARD takes precedence.
5821
5822       If you want to use partial matching with just-in-time  optimized  code,
5823       as  well  as  setting a partial match option for the matching function,
5824       you must also call pcre2_jit_compile() with one or both  of  these  op-
5825       tions:
5826
5827         PCRE2_JIT_PARTIAL_HARD
5828         PCRE2_JIT_PARTIAL_SOFT
5829
5830       PCRE2_JIT_COMPLETE  should also be set if you are going to run non-par-
5831       tial matches on the same pattern. Separate code is  compiled  for  each
5832       mode.  If  the appropriate JIT mode has not been compiled, interpretive
5833       matching code is used.
5834
5835       Setting a partial matching option disables two of PCRE2's standard  op-
5836       timization  hints. PCRE2 remembers the last literal code unit in a pat-
5837       tern, and abandons matching immediately if it is  not  present  in  the
5838       subject  string.  This optimization cannot be used for a subject string
5839       that might match only partially. PCRE2 also remembers a minimum  length
5840       of  a matching string, and does not bother to run the matching function
5841       on shorter strings. This optimization  is  also  disabled  for  partial
5842       matching.
5843
5844
5845REQUIREMENTS FOR A PARTIAL MATCH
5846
5847       A  possible  partial  match  occurs during matching when the end of the
5848       subject string is reached successfully, but either more characters  are
5849       needed  to complete the match, or the addition of more characters might
5850       change what is matched.
5851
5852       Example 1: if the pattern is /abc/ and the subject is "ab", more  char-
5853       acters  are  definitely  needed  to complete a match. In this case both
5854       hard and soft matching options yield a partial match.
5855
5856       Example 2: if the pattern is /ab+/ and the subject is "ab", a  complete
5857       match  can  be  found, but the addition of more characters might change
5858       what is matched. In this case, only PCRE2_PARTIAL_HARD returns  a  par-
5859       tial match; PCRE2_PARTIAL_SOFT returns the complete match.
5860
5861       On  reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if
5862       the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
5863       match.   Otherwise, for both options, the next pattern item must be one
5864       that inspects a character, and at least one of the  following  must  be
5865       true:
5866
5867       (1)  At  least  one  character has already been inspected. An inspected
5868       character need not form part of the final  matched  string;  lookbehind
5869       assertions  and the \K escape sequence provide ways of inspecting char-
5870       acters before the start of a matched string.
5871
5872       (2) The pattern contains one or more lookbehind assertions. This condi-
5873       tion  exists in case there is a lookbehind that inspects characters be-
5874       fore the start of the match.
5875
5876       (3) There is a special case when the whole pattern can match  an  empty
5877       string.   When  the  starting  point  is at the end of the subject, the
5878       empty string match is a possibility, and if PCRE2_PARTIAL_SOFT  is  set
5879       and  neither  of the above conditions is true, it is returned. However,
5880       because adding more characters  might  result  in  a  non-empty  match,
5881       PCRE2_PARTIAL_HARD  returns  a  partial match, which in this case means
5882       "there is going to be a match at this point, but until some more  char-
5883       acters are added, we do not know if it will be an empty string or some-
5884       thing longer".
5885
5886
5887PARTIAL MATCHING USING pcre2_match()
5888
5889       When  a  partial  matching  option  is  set,  the  result  of   calling
5890       pcre2_match() can be one of the following:
5891
5892       A successful match
5893         A complete match has been found, starting and ending within this sub-
5894         ject.
5895
5896       PCRE2_ERROR_NOMATCH
5897         No match can start anywhere in this subject.
5898
5899       PCRE2_ERROR_PARTIAL
5900         Adding more characters may result in a complete match that  uses  one
5901         or more characters from the end of this subject.
5902
5903       When a partial match is returned, the first two elements in the ovector
5904       point to the portion of the subject that was matched, but the values in
5905       the rest of the ovector are undefined. The appearance of \K in the pat-
5906       tern has no effect for a partial match. Consider this pattern:
5907
5908         /abc\K123/
5909
5910       If it is matched against "456abc123xyz" the result is a complete match,
5911       and  the ovector defines the matched string as "123", because \K resets
5912       the "start of match" point. However, if a partial  match  is  requested
5913       and  the subject string is "456abc12", a partial match is found for the
5914       string "abc12", because all these characters are needed  for  a  subse-
5915       quent re-match with additional characters.
5916
5917       If  there  is more than one partial match, the first one that was found
5918       provides the data that is returned. Consider this pattern:
5919
5920         /123\w+X|dogY/
5921
5922       If this is matched against the subject string "abc123dog", both  alter-
5923       natives  fail  to  match,  but the end of the subject is reached during
5924       matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to  3
5925       and  9, identifying "123dog" as the first partial match. (In this exam-
5926       ple, there are two partial matches, because "dog" on its own  partially
5927       matches the second alternative.)
5928
5929   How a partial match is processed by pcre2_match()
5930
5931       What happens when a partial match is identified depends on which of the
5932       two partial matching options is set.
5933
5934       If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned  as  soon
5935       as  a partial match is found, without continuing to search for possible
5936       complete matches. This option is "hard" because it prefers  an  earlier
5937       partial match over a later complete match. For this reason, the assump-
5938       tion is made that the end of the supplied subject  string  is  not  the
5939       true  end of the available data, which is why \z, \Z, \b, \B, and $ al-
5940       ways give a partial match.
5941
5942       If PCRE2_PARTIAL_SOFT is set, the  partial  match  is  remembered,  but
5943       matching continues as normal, and other alternatives in the pattern are
5944       tried. If no complete match can be found,  PCRE2_ERROR_PARTIAL  is  re-
5945       turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
5946       prefers a complete match over a partial match. All the various matching
5947       items  in a pattern behave as if the subject string is potentially com-
5948       plete; \z, \Z, and $ match at the end of the subject,  as  normal,  and
5949       for \b and \B the end of the subject is treated as a non-alphanumeric.
5950
5951       The  difference  between the two partial matching options can be illus-
5952       trated by a pattern such as:
5953
5954         /dog(sbody)?/
5955
5956       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
5957       the  longer  string  if  possible). If it is matched against the string
5958       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
5959       However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
5960       TIAL. On the other hand, if the pattern is made ungreedy the result  is
5961       different:
5962
5963         /dog(sbody)??/
5964
5965       In  this  case  the  result  is always a complete match because that is
5966       found first, and matching never  continues  after  finding  a  complete
5967       match. It might be easier to follow this explanation by thinking of the
5968       two patterns like this:
5969
5970         /dog(sbody)?/    is the same as  /dogsbody|dog/
5971         /dog(sbody)??/   is the same as  /dog|dogsbody/
5972
5973       The second pattern will never match "dogsbody", because it will  always
5974       find the shorter match first.
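
       The effect of branch order on the two options can be made concrete
       with a toy model. The C sketch below is not PCRE2 code and does not
       call the library: it treats a pattern as a list of literal
       alternatives tried in order at the start of the subject. The names
       toy_match() and toy_result are invented for illustration, and the
       sketch assumes that a partial match means the subject ran out, after
       at least one character, while an alternative was still matching.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOY_NOMATCH, TOY_PARTIAL, TOY_MATCH } toy_result;

/* Toy model only. alts[] holds literal alternatives that are tried in
   order at the start of subject, the way a backtracking matcher tries
   branches. A non-zero "hard" models PCRE2_PARTIAL_HARD; zero models
   PCRE2_PARTIAL_SOFT. */
toy_result toy_match(const char *const alts[], size_t n_alts,
                     const char *subject, int hard)
{
    size_t slen, i;
    int saw_partial = 0;

    assert(alts != NULL && subject != NULL);
    slen = strlen(subject);

    for (i = 0; i < n_alts; i++) {
        size_t alen = strlen(alts[i]);

        /* A complete match always ends the search. */
        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            return TOY_MATCH;

        /* The subject ran out while this alternative was still matching. */
        if (slen > 0 && alen > slen && strncmp(subject, alts[i], slen) == 0) {
            if (hard)
                return TOY_PARTIAL;   /* hard: the first partial match wins */
            saw_partial = 1;          /* soft: remember it and keep trying */
        }
    }
    return saw_partial ? TOY_PARTIAL : TOY_NOMATCH;
}
```

       With the alternatives in the order "dogsbody", "dog" and the subject
       "dog", the sketch reports a complete match in soft mode and a partial
       match in hard mode. With the order reversed it reports a complete
       match in both modes, mirroring the two patterns above.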
5975
5976   Example of partial matching using pcre2test
5977
5978       The  pcre2test data modifiers partial_hard (or ph) and partial_soft (or
5979       ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,  respectively,  when
5980       calling  pcre2_match(). Here is a run of pcre2test using a pattern that
5981       matches the whole subject in the form of a date:
5982
5983           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5984         data> 25dec3\=ph
5985         Partial match: 25dec3
5986         data> 3ju\=ph
5987         Partial match: 3ju
5988         data> 3juj\=ph
5989         No match
5990
5991       This example gives the same results for  both  hard  and  soft  partial
5992       matching options. Here is an example where there is a difference:
5993
5994           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5995         data> 25jun04\=ps
5996          0: 25jun04
5997          1: jun
5998         data> 25jun04\=ph
5999         Partial match: 25jun04
6000
6001       With   PCRE2_PARTIAL_SOFT,  the  subject  is  matched  completely.  For
6002       PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
6003       so there is only a partial match.
6004
6005
6006MULTI-SEGMENT MATCHING WITH pcre2_match()
6007
6008       PCRE  was  not originally designed with multi-segment matching in mind.
6009       However, over time, features (including  partial  matching)  that  make
6010       multi-segment matching possible have been added. A very long string can
6011       be searched segment by segment  by  calling  pcre2_match()  repeatedly,
6012       with the aim of achieving the same results that would happen if the en-
6013       tire string was available for searching all  the  time.  Normally,  the
6014       strings  that  are  being  sought are much shorter than each individual
6015       segment, and are in the middle of very long strings, so the pattern  is
6016       normally not anchored.
6017
6018       Special  logic  must  be implemented to handle a matched substring that
6019       spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
6020       returns  a  partial match at the end of a segment whenever there is the
6021       possibility of changing  the  match  by  adding  more  characters.  The
6022       PCRE2_NOTBOL option should also be set for all but the first segment.
6023
6024       When a partial match occurs, the next segment must be added to the cur-
6025       rent subject and the match re-run, using the  startoffset  argument  of
6026       pcre2_match()  to  begin  at the point where the partial match started.
6027       For example:
6028
6029           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
6030         data> ...the date is 23ja\=ph
6031         Partial match: 23ja
6032         data> ...the date is 23jan19 and on that day...\=offset=15
6033          0: 23jan19
6034          1: jan
6035
6036       Note the use of the offset modifier to start the new  match  where  the
6037       partial match was found. In this example, the next segment was added to
6038       the one in which  the  partial  match  was  found.  This  is  the  most
6039       straightforward approach, typically using a memory buffer that is twice
6040       the size of each segment. After a partial match, the first half of  the
6041       buffer  is discarded, the second half is moved to the start of the buf-
6042       fer, and a new segment is added before repeating the match  as  in  the
6043       example above. After a no match, the entire buffer can be discarded.
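
       The buffer handling just described can be sketched in C as follows.
       This is a minimal illustration under stated assumptions, not part of
       PCRE2: the window type, the window_extend() function, and the tiny
       SEG_SIZE value are all hypothetical, and the matching call itself
       (pcre2_match() with PCRE2_PARTIAL_HARD and a startoffset equal to the
       returned start value) is left to the caller. The sketch also assumes
       that a partial match never starts in the half that is about to be
       discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_SIZE 8              /* hypothetical segment size, kept tiny */

typedef struct {
    char   data[2 * SEG_SIZE];  /* room for two segments */
    size_t len;                 /* number of bytes currently held */
} window;

/* Called after a partial match that began at offset *start. If the window
   already holds more than one segment's worth of data, the first half is
   discarded and the remainder is moved to the start. The next segment is
   then appended. On return, *start is the offset in the updated window at
   which the match should be re-run. */
void window_extend(window *w, const char *seg, size_t seg_len, size_t *start)
{
    assert(w != NULL && seg != NULL && start != NULL);
    assert(seg_len <= SEG_SIZE && *start <= w->len);

    if (w->len > SEG_SIZE) {
        /* The partial match is assumed to lie in the second half. */
        assert(*start >= SEG_SIZE);
        memmove(w->data, w->data + SEG_SIZE, w->len - SEG_SIZE);
        w->len -= SEG_SIZE;
        *start -= SEG_SIZE;
    }
    memcpy(w->data + w->len, seg, seg_len);
    w->len += seg_len;
}
```

       After a "no match" result the caller of this sketch would simply set
       the len field to zero before loading the next segment.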
6044
6045       If there are memory constraints, you may want to discard text that pre-
6046       cedes a partial match before adding the  next  segment.  Unfortunately,
6047       this  is  not  at  present straightforward. In cases such as the above,
6048       where the pattern does not contain any lookbehinds, it is sufficient to
6049       retain  only  the  partially matched substring. However, if the pattern
6050       contains a lookbehind assertion, characters that precede the  start  of
6051       the  partial match may have been inspected during the matching process.
6052       When pcre2test displays a partial match, it indicates these  characters
6053       with '<' if the allusedtext modifier is set:
6054
6055           re> "(?<=123)abc"
6056         data> xx123ab\=ph,allusedtext
6057         Partial match: 123ab
6058                        <<<
6059
6060       However,  the  allusedtext  modifier is not available for JIT matching,
6061       because JIT matching does not record  the  first  (or  last)  consulted
6062       characters.  For this reason, this information is not available via the
6063       API. It is therefore not possible in general to obtain the exact number
6064       of characters that must be retained in order to get the right match re-
6065       sult. If you cannot retain the  entire  segment,  you  must  find  some
6066       heuristic way of choosing.
6067
6068       If  you know the approximate length of the matching substrings, you can
6069       use that to decide how much text to retain. The only lookbehind  infor-
6070       mation  that  is  currently  available via the API is the length of the
6071       longest individual lookbehind in a pattern, but this can be  misleading
6072       if  there  are  nested  lookbehinds.  The  value  returned  by  calling
6073       pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND  option  is  the
6074       maximum number of characters (not code units) that any individual look-
6075       behind  moves  back  when  it  is  processed.   A   pattern   such   as
6076       "(?<=(?<!b)a)"  has a maximum lookbehind value of one, but inspects two
6077       characters before its starting point.
6078
6079       In a non-UTF or a 32-bit case, moving back is just a  subtraction,  but
6080       in  UTF-8  or  UTF-16  you  have  to count characters while moving back
6081       through the code units.
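
       For the UTF-8 case, counting characters while moving back amounts to
       skipping continuation bytes. The following C function is a sketch of
       that step only; utf8_move_back() is a hypothetical helper, not a
       PCRE2 function, and it assumes that the subject is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Starting at code unit offset "offset" in a valid UTF-8 subject, move back
   by up to nchars characters and return the resulting code unit offset.
   The result never goes below zero. */
size_t utf8_move_back(const unsigned char *subject, size_t offset,
                      size_t nchars)
{
    assert(subject != NULL);

    while (nchars > 0 && offset > 0) {
        offset--;
        /* Continuation bytes have the form 10xxxxxx; skip them to reach
           the first byte of the character. */
        while (offset > 0 && (subject[offset] & 0xC0) == 0x80)
            offset--;
        nchars--;
    }
    return offset;
}
```

       A caller might pass the start of the partial match (the first ovector
       element) as the offset and the PCRE2_INFO_MAXLOOKBEHIND value as
       nchars. Because of the nested lookbehind caveat described above, the
       result is only a heuristic for how much text to retain.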
6082
6083
6084PARTIAL MATCHING USING pcre2_dfa_match()
6085
6086       The DFA function moves along the subject string character by character,
6087       without  backtracking,  searching  for  all possible matches simultane-
6088       ously. If the end of the subject is reached before the end of the  pat-
6089       tern, there is the possibility of a partial match.
6090
6091       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
6092       there have been no complete matches. Otherwise,  the  complete  matches
6093       are  returned.   If  PCRE2_PARTIAL_HARD  is  set, a partial match takes
6094       precedence over any complete matches. The portion of  the  string  that
6095       was  matched  when  the  longest  partial match was found is set as the
6096       first matching string.
6097
6098       Because the DFA function always searches for all possible matches,  and
6099       there is no difference between greedy and ungreedy repetition, its
6100       behaviour is different from that of pcre2_match(). Consider the string
6101       "dog" matched against this ungreedy pattern:
6102
6103         /dog(sbody)??/
6104
6105       Whereas  the  standard  function stops as soon as it finds the complete
6106       match for "dog", the DFA function also  finds  the  partial  match  for
6107       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
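
       The precedence rule for the DFA function can be modelled in a few
       lines of C. This is a toy illustration, not PCRE2 code: it treats the
       pattern as a set of literal alternatives anchored at the start of the
       subject, examines all of them, and only then decides what to report.
       The names dfa_toy_match() and dfa_toy_result are invented for the
       example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { DFA_TOY_NOMATCH, DFA_TOY_PARTIAL, DFA_TOY_MATCH } dfa_toy_result;

/* Toy model only. Every alternative is examined, so their order does not
   matter. A non-zero "hard" models PCRE2_PARTIAL_HARD; zero models
   PCRE2_PARTIAL_SOFT. */
dfa_toy_result dfa_toy_match(const char *const alts[], size_t n_alts,
                             const char *subject, int hard)
{
    size_t slen, i;
    int any_complete = 0, any_partial = 0;

    assert(alts != NULL && subject != NULL);
    slen = strlen(subject);

    for (i = 0; i < n_alts; i++) {
        size_t alen = strlen(alts[i]);

        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            any_complete = 1;    /* alternative matched in full */
        else if (slen > 0 && strncmp(subject, alts[i], slen) == 0)
            any_partial = 1;     /* subject ended part-way through it */
    }

    if (hard)   /* a partial match takes precedence over complete matches */
        return any_partial ? DFA_TOY_PARTIAL
             : any_complete ? DFA_TOY_MATCH : DFA_TOY_NOMATCH;

    /* soft: a partial match is reported only if nothing matched in full */
    return any_complete ? DFA_TOY_MATCH
         : any_partial ? DFA_TOY_PARTIAL : DFA_TOY_NOMATCH;
}
```

       For the alternatives "dog" and "dogsbody" and the subject "dog", the
       sketch reports a partial match in hard mode and a complete match in
       soft mode, whichever way round the alternatives are listed.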
6108
6109
6110MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
6111
6112       When a partial match has been found using the DFA matching function, it
6113       is possible to continue the match by providing additional subject  data
6114       and  calling  the function again with the same compiled regular expres-
6115       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
6116       same working space as before, because this is where details of the pre-
6117       vious partial match are stored. You can set the  PCRE2_PARTIAL_SOFT  or
6118       PCRE2_PARTIAL_HARD  options  with PCRE2_DFA_RESTART to continue partial
6119       matching over multiple segments. Here is an example using pcre2test:
6120
6121           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6122         data> 23ja\=dfa,ps
6123         Partial match: 23ja
6124         data> n05\=dfa,dfa_restart
6125          0: n05
6126
6127       The first call has "23ja" as the subject, and requests  partial  match-
6128       ing;  the  second  call  has  "n05"  as  the  subject for the continued
6129       (restarted) match.  Notice that when the match is  complete,  only  the
6130       last  part  is  shown;  PCRE2 does not retain the previously partially-
6131       matched string. It is up to the calling program to do that if it  needs
6132       to.  This  means  that, for an unanchored pattern, if a continued match
6133       fails, it is not possible to try again at a  new  starting  point.  All
6134       this facility is capable of doing is continuing with the previous match
6135       attempt. For example, consider this pattern:
6136
6137         1234|3789
6138
6139       If the first part of the subject is "ABC123", a partial  match  of  the
6140       first  alternative  is found at offset 3. There is no partial match for
6141       the second alternative, because such a match does not start at the same
6142       point  in  the  subject  string. Attempting to continue with the string
6143       "7890" does not yield a match  because  only  those  alternatives  that
6144       match  at one point in the subject are remembered. Depending on the ap-
6145       plication, this may or may not be what you want.
6146
6147       If you do want to allow for starting again at the next  character,  one
6148       way  of  doing it is to retain some or all of the segment and try a new
6149       complete match, as described for pcre2_match() above. Another possibil-
6150       ity  is to work with two buffers. If a partial match at offset n in the
6151       first buffer is followed by "no match" when PCRE2_DFA_RESTART  is  used
6152       on  the  second buffer, you can then try a new match starting at offset
6153       n+1 in the first buffer.
6154
6155
6156AUTHOR
6157
6158       Philip Hazel
6159       University Computing Service
6160       Cambridge, England.
6161
6162
6163REVISION
6164
6165       Last updated: 04 September 2019
6166       Copyright (c) 1997-2019 University of Cambridge.
6167------------------------------------------------------------------------------
6168
6169
6170PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)
6171
6172
6173
6174NAME
6175       PCRE2 - Perl-compatible regular expressions (revised API)
6176
6177PCRE2 REGULAR EXPRESSION DETAILS
6178
6179       The  syntax and semantics of the regular expressions that are supported
6180       by PCRE2 are described in detail below. There is a quick-reference syn-
6181       tax  summary  in the pcre2syntax page. PCRE2 tries to match Perl syntax
6182       and semantics as closely as it can.  PCRE2 also supports some  alterna-
6183       tive  regular  expression syntax (which does not conflict with the Perl
6184       syntax) in order to provide some compatibility with regular expressions
6185       in Python, .NET, and Oniguruma.
6186
6187       Perl's  regular expressions are described in its own documentation, and
6188       regular expressions in general are covered in a number of  books,  some
6189       of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
6190       pressions", published by O'Reilly, covers regular expressions in  great
6191       detail.  This description of PCRE2's regular expressions is intended as
6192       reference material.
6193
6194       This document discusses the regular expression patterns that  are  sup-
6195       ported  by  PCRE2  when  its  main matching function, pcre2_match(), is
6196       used.   PCRE2   also   has   an    alternative    matching    function,
6197       pcre2_dfa_match(),  which  matches  using a different algorithm that is
6198       not Perl-compatible. Some of  the  features  discussed  below  are  not
6199       available  when  DFA matching is used. The advantages and disadvantages
6200       of the alternative function, and how it differs from the  normal  func-
6201       tion, are discussed in the pcre2matching page.
6202
6203
6204SPECIAL START-OF-PATTERN ITEMS
6205
6206       A  number  of options that can be passed to pcre2_compile() can also be
6207       set by special items at the start of a pattern. These are not Perl-com-
6208       patible,  but  are provided to make these options accessible to pattern
6209       writers who are not able to change the program that processes the  pat-
6210       tern.  Any  number  of these items may appear, but they must all be to-
6211       gether right at the start of the pattern string, and the  letters  must
6212       be in upper case.
6213
6214   UTF support
6215
6216       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
6217       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
6218       can  be  specified  for the 32-bit library, in which case it constrains
6219       the character values to valid  Unicode  code  points.  To  process  UTF
6220       strings,  PCRE2  must be built to include Unicode support (which is the
6221       default). When using UTF strings you must  either  call  the  compiling
6222       function  with  one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF
6223       options, or the pattern must start with the  special  sequence  (*UTF),
6224       which is equivalent to setting the PCRE2_UTF option. How  setting  a
6225       UTF mode affects pattern matching is mentioned in several places below.
6226       There is also a summary of features in the pcre2unicode page.
6227
6228       Some applications that allow their users to supply patterns may wish to
6229       restrict  them  to  non-UTF  data  for   security   reasons.   If   the
6230       PCRE2_NEVER_UTF  option is passed to pcre2_compile(), (*UTF) is not al-
6231       lowed, and its appearance in a pattern causes an error.
6232
6233   Unicode property support
6234
6235       Another special sequence that may appear at the start of a  pattern  is
6236       (*UCP).   This  has the same effect as setting the PCRE2_UCP option: it
6237       causes sequences such as \d and \w to use Unicode properties to  deter-
6238       mine character types, instead of recognizing only characters with codes
6239       less than 256 via a lookup table. It also causes upper/lower casing
6240       operations to use Unicode properties for characters with code points
6241       greater than 127, even when UTF is not set.
6242
6243       Some applications that allow their users to supply patterns may wish to
6244       restrict  them  for  security reasons. If the PCRE2_NEVER_UCP option is
6245       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
6246       a pattern causes an error.
6247
6248   Locking out empty string matching
6249
6250       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
6251       effect as passing the PCRE2_NOTEMPTY or  PCRE2_NOTEMPTY_ATSTART  option
6252       to whichever matching function is subsequently called to match the pat-
6253       tern. These options lock out the matching of empty strings, either  en-
6254       tirely, or only at the start of the subject.
6255
6256   Disabling auto-possessification
6257
6258       If  a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
6259       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from  making
6260       quantifiers  possessive  when  what  follows  cannot match the repeated
6261       item. For example, by default a+b is treated as a++b. For more details,
6262       see the pcre2api documentation.
6263
6264   Disabling start-up optimizations
6265
6266       If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
6267       setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
6268       mizations  for  quickly  reaching "no match" results. For more details,
6269       see the pcre2api documentation.
6270
6271   Disabling automatic anchoring
6272
6273       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the  same  effect
6274       as  setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
6275       tions that apply to patterns whose top-level branches all start with .*
6276       (match  any  number of arbitrary characters). For more details, see the
6277       pcre2api documentation.
6278
6279   Disabling JIT compilation
6280
6281       If a pattern that starts with (*NO_JIT) is  successfully  compiled,  an
6282       attempt  by  the  application  to apply the JIT optimization by calling
6283       pcre2_jit_compile() is ignored.
6284
6285   Setting match resource limits
6286
6287       The pcre2_match() function contains a counter that is incremented every
6288       time it goes round its main loop. The caller of pcre2_match() can set a
6289       limit on this counter, which therefore limits the amount  of  computing
6290       resource used for a match. The maximum depth of nested backtracking can
6291       also be limited; this indirectly restricts the amount  of  heap  memory
6292       that  is  used,  but there is also an explicit memory limit that can be
6293       set.
6294
6295       These facilities are provided to catch runaway matches  that  are  pro-
6296       voked  by patterns with huge matching trees. A common example is a pat-
6297       tern with nested unlimited repeats applied to a long string  that  does
6298       not  match. When one of these limits is reached, pcre2_match() gives an
6299       error return. The limits can also be set by items at the start  of  the
6300       pattern of the form
6301
6302         (*LIMIT_HEAP=d)
6303         (*LIMIT_MATCH=d)
6304         (*LIMIT_DEPTH=d)
6305
6306       where d is any number of decimal digits. However, the value of the set-
6307       ting must be less than the value set (or defaulted) by  the  caller  of
6308       pcre2_match()  for  it  to have any effect. In other words, the pattern
6309       writer can lower the limits set by the programmer, but not raise  them.
6310       If  there  is  more  than one setting of one of these limits, the lower
6311       value is used. The heap limit is specified in kibibytes (units of  1024
6312       bytes).
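
       The way these settings combine amounts to taking a minimum. The C
       sketch below models only that arithmetic and is not how PCRE2 is
       implemented; effective_limit() is a hypothetical helper, and
       caller_limit stands for whatever value the caller set or left at its
       default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the limit that applies when the caller's limit is combined with
   zero or more values given by (*LIMIT_...=d) items in a pattern. A pattern
   value can lower the limit but can never raise it. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_values, size_t count)
{
    uint32_t limit = caller_limit;
    size_t i;

    assert(count == 0 || pattern_values != NULL);

    for (i = 0; i < count; i++) {
        if (pattern_values[i] < limit)
            limit = pattern_values[i];
    }
    return limit;
}
```

       For example, with a caller limit of 1000 and pattern settings of
       5000, 200, and 300, the value that applies is 200; a lone pattern
       setting of 5000 leaves the limit at 1000.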
6313
6314       Prior  to  release  10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
6315       name is still recognized for backwards compatibility.
6316
6317       The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
6318       interpreters are used for matching. It does not apply to JIT. The match
6319       limit is used (but in a different way) when JIT is being used, or  when
6320       pcre2_dfa_match() is called, to limit computing resource usage by those
6321       matching functions. The depth limit is ignored by JIT but  is  relevant
6322       for  DFA  matching, which uses function recursion for recursions within
6323       the pattern and for lookaround assertions and atomic  groups.  In  this
6324       case, the depth limit controls the depth of such recursion.
6325
6326   Newline conventions
6327
6328       PCRE2  supports six different conventions for indicating line breaks in
6329       strings: a single CR (carriage return) character, a  single  LF  (line-
6330       feed) character, the two-character sequence CRLF, any of the three pre-
6331       ceding, any Unicode newline sequence,  or  the  NUL  character  (binary
6332       zero).  The  pcre2api  page  has further discussion about newlines, and
6333       shows how to set the newline convention when calling pcre2_compile().
6334
6335       It is also possible to specify a newline convention by starting a  pat-
6336       tern string with one of the following sequences:
6337
6338         (*CR)        carriage return
6339         (*LF)        linefeed
6340         (*CRLF)      carriage return, followed by linefeed
6341         (*ANYCRLF)   any of the three above
6342         (*ANY)       all Unicode newline sequences
6343         (*NUL)       the NUL character (binary zero)
6344
6345       These override the default and the options given to the compiling func-
6346       tion. For example, on a Unix system where LF is the default newline se-
6347       quence, the pattern
6348
6349         (*CR)a.b
6350
6351       changes the convention to CR. That pattern matches "a\nb" because LF is
6352       no longer a newline. If more than one of these settings is present, the
6353       last one is used.
6354
6355       The  newline  convention affects where the circumflex and dollar asser-
6356       tions are true. It also affects the interpretation of the dot metachar-
6357       acter  when  PCRE2_DOTALL  is not set, and the behaviour of \N when not
6358       followed by an opening brace. However, it does not affect what  the  \R
6359       escape  sequence  matches.  By default, this is any Unicode newline se-
6360       quence, for Perl compatibility. However, this can be changed;  see  the
6361       next section and the description of \R in the section entitled "Newline
6362       sequences" below. A change of \R setting can be combined with a  change
6363       of newline convention.
6364
6365   Specifying what \R matches
6366
6367       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6368       the complete set  of  Unicode  line  endings)  by  setting  the  option
6369       PCRE2_BSR_ANYCRLF  at compile time. This effect can also be achieved by
6370       starting a pattern with (*BSR_ANYCRLF).  For  completeness,  (*BSR_UNI-
6371       CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
6372
6373
6374EBCDIC CHARACTER CODES
6375
6376       PCRE2  can be compiled to run in an environment that uses EBCDIC as its
6377       character code instead of ASCII or Unicode (typically a mainframe  sys-
6378       tem).  In  the  sections below, character code values are ASCII or Uni-
6379       code; in an EBCDIC environment these characters may have different code
6380       values, and there are no code points greater than 255.
6381
6382
6383CHARACTERS AND METACHARACTERS
6384
6385       A  regular  expression  is  a pattern that is matched against a subject
6386       string from left to right. Most characters stand for  themselves  in  a
6387       pattern,  and  match  the corresponding characters in the subject. As a
6388       trivial example, the pattern
6389
6390         The quick brown fox
6391
6392       matches a portion of a subject string that is identical to itself. When
6393       caseless  matching  is  specified  (the  PCRE2_CASELESS  option or (?i)
6394       within the pattern), letters are matched independently  of  case.  Note
6395       that  there  are  two  ASCII  characters, K and S, that, in addition to
6396       their lower case ASCII equivalents, are  case-equivalent  with  Unicode
6397       U+212A  (Kelvin  sign)  and  U+017F  (long  S) respectively when either
6398       PCRE2_UTF or PCRE2_UCP is set.
6399
6400       The power of regular expressions comes from the ability to include wild
6401       cards, character classes, alternatives, and repetitions in the pattern.
6402       These are encoded in the pattern by the use of metacharacters, which do
6403       not  stand  for  themselves but instead are interpreted in some special
6404       way.
6405
6406       There are two different sets of metacharacters: those that  are  recog-
6407       nized  anywhere in the pattern except within square brackets, and those
6408       that are recognized within square brackets.  Outside  square  brackets,
6409       the metacharacters are as follows:
6410
6411         \      general escape character with several uses
6412         ^      assert start of string (or line, in multiline mode)
6413         $      assert end of string (or line, in multiline mode)
6414         .      match any character except newline (by default)
6415         [      start character class definition
6416         |      start of alternative branch
6417         (      start group or control verb
6418         )      end group or control verb
6419         *      0 or more quantifier
6420         +      1 or more quantifier; also "possessive quantifier"
6421         ?      0 or 1 quantifier; also quantifier minimizer
6422         {      start min/max quantifier
6423
6424       Part  of  a  pattern  that is in square brackets is called a "character
6425       class". In a character class the only metacharacters are:
6426
6427         \      general escape character
6428         ^      negate the class, but only if the first character
6429         -      indicates character range
6430         [      POSIX character class (if followed by POSIX syntax)
6431         ]      terminates the character class
6432
6433       If a pattern is compiled with the  PCRE2_EXTENDED  option,  most  white
6434       space  in  the pattern, other than in a character class, and characters
6435       between a # outside a character class and the next newline,  inclusive,
6436       are ignored. An escaping backslash can be used to include a white space
6437       or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE op-
6438       tion is set, the same applies, but in addition unescaped space and hor-
6439       izontal tab characters are ignored inside a character class. Note: only
6440       these  two  characters  are  ignored, not the full set of pattern white
6441       space characters that are ignored outside  a  character  class.  Option
6442       settings can be changed within a pattern; see the section entitled "In-
6443       ternal Option Setting" below.
6444
6445       The following sections describe the use of each of the metacharacters.
6446
6447
6448BACKSLASH
6449
6450       The backslash character has several uses. Firstly, if it is followed by
6451       a  character that is not a digit or a letter, it takes away any special
6452       meaning that character may have. This use of  backslash  as  an  escape
6453       character applies both inside and outside character classes.
6454
6455       For  example,  if you want to match a * character, you must write \* in
6456       the pattern. This escaping action applies whether or not the  following
6457       character  would  otherwise be interpreted as a metacharacter, so it is
6458       always safe to precede a non-alphanumeric  with  backslash  to  specify
6459       that it stands for itself.  In particular, if you want to match a back-
6460       slash, you write \\.
6461
6462       Only ASCII digits and letters have any special meaning  after  a  back-
6463       slash. All other characters (in particular, those whose code points are
6464       greater than 127) are treated as literals.
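
       These rules make it straightforward for a program to turn arbitrary
       text into a pattern that matches it literally. The C function below
       is a sketch of one way to do so, assuming an ASCII environment and a
       NUL-terminated input; escape_literal() and is_ascii_alnum() are
       hypothetical names, not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Non-zero for ASCII letters and digits (ASCII environment assumed). */
static int is_ascii_alnum(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z');
}

/* Copies src to dst, putting a backslash before every ASCII character that
   is not a letter or digit. Bytes above 127 are copied unchanged. Returns
   the number of bytes written, not counting the terminating NUL, or -1 if
   dst is too small. */
long escape_literal(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0, i, n;

    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return -1;

    n = strlen(src);
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)src[i];
        size_t need = (c < 128 && !is_ascii_alnum(c)) ? 2 : 1;

        if (out + need >= dst_size)   /* keep room for the NUL */
            return -1;
        if (need == 2)
            dst[out++] = '\\';
        dst[out++] = (char)c;
    }
    dst[out] = '\0';
    return (long)out;
}
```

       Given the text a*b.c, this sketch produces a\*b\.c, which matches
       the original text and nothing else.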
6465
6466       If you want to treat all characters in a sequence as literals, you  can
6467       do so by putting them between \Q and \E. This is different from Perl in
6468       that $ and @ are handled as literals in  \Q...\E  sequences  in  PCRE2,
6469       whereas  in Perl, $ and @ cause variable interpolation. Also, Perl does
6470       "double-quotish backslash interpolation" on any backslashes between  \Q
6471       and  \E which, its documentation says, "may lead to confusing results".
6472       PCRE2 treats a backslash between \Q and \E just like any other  charac-
6473       ter. Note the following examples:
6474
6475         Pattern            PCRE2 matches  Perl matches
6476
6477         \Qabc$xyz\E        abc$xyz        abc followed by the
6478                                             contents of $xyz
6479         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
6480         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
6481         \QA\B\E            A\B            A\B
6482         \Q\\E              \              \\E
6483
6484       The  \Q...\E  sequence  is recognized both inside and outside character
6485       classes.  An isolated \E that is not preceded by \Q is ignored.  If  \Q
6486       is  not followed by \E later in the pattern, the literal interpretation
6487       continues to the end of the pattern (that is,  \E  is  assumed  at  the
6488       end).  If  the  isolated \Q is inside a character class, this causes an
6489       error, because the character class  is  not  terminated  by  a  closing
6490       square bracket.
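
       The quoting rules above can be modelled without PCRE2. The following
       Python sketch is an illustration only: it does not call the library,
       the name expand_quoted is hypothetical, and it ignores the character
       class case. It rewrites each \Q...\E run as explicitly escaped
       literals, drops an isolated \E, and treats an unterminated \Q as
       running to the end of the pattern.

```python
import re


def expand_quoted(pattern: str) -> str:
    """Rewrite \\Q...\\E runs as backslash-escaped literals (model only)."""
    out = []
    i = 0
    quoting = False
    n = len(pattern)
    while i < n:
        two = pattern[i:i + 2]
        if quoting:
            if two == r"\E":
                quoting = False          # end of the literal run
                i += 2
            else:
                ch = pattern[i]
                # ASCII letters and digits need no escape; everything else
                # is made literal by a preceding backslash.
                if ch.isascii() and ch.isalnum():
                    out.append(ch)
                else:
                    out.append("\\" + ch)
                i += 1
        elif two == r"\Q":
            quoting = True               # start of a literal run
            i += 2
        elif two == r"\E":
            i += 2                       # isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(two)              # ordinary escape: copy it whole
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


# The expanded pattern contains only escaped literals, so Python's own re
# module can be used to try it against the subject from the table.
match = re.fullmatch(expand_quoted(r"\Qabc$xyz\E"), "abc$xyz")
```

       Here match is a successful match object, in line with the "PCRE2
       matches" column of the table. Perl's interpolation behaviour is not
       modelled.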
6491
6492   Non-printing characters
6493
6494       A second use of backslash provides a way of encoding non-printing char-
6495       acters in patterns in a visible manner. There is no restriction on  the
6496       appearance  of non-printing characters in a pattern, but when a pattern
6497       is being prepared by text editing, it is often easier to use one of the
6498       following  escape  sequences  instead of the binary character it repre-
6499       sents. In an ASCII or Unicode environment, these escapes  are  as  fol-
6500       lows:
6501
6502         \a          alarm, that is, the BEL character (hex 07)
6503         \cx         "control-x", where x is any printable ASCII character
6504         \e          escape (hex 1B)
6505         \f          form feed (hex 0C)
6506         \n          linefeed (hex 0A)
6507         \r          carriage return (hex 0D) (but see below)
6508         \t          tab (hex 09)
6509         \0dd        character with octal code 0dd
6510         \ddd        character with octal code ddd, or backreference
6511         \o{ddd..}   character with octal code ddd..
6512         \xhh        character with hex code hh
6513         \x{hhh..}   character with hex code hhh..
6514         \N{U+hhh..} character with Unicode hex code point hhh..
6515
6516       By  default, after \x that is not followed by {, from zero to two hexa-
6517       decimal digits are read (letters can be in upper or  lower  case).  Any
6518       number of hexadecimal digits may appear between \x{ and }. If a charac-
6519       ter other than a hexadecimal digit appears between \x{  and  },  or  if
6520       there is no terminating }, an error occurs.
6521
6522       Characters whose code points are less than 256 can be defined by either
6523       of the two syntaxes for \x or by an octal sequence. There is no differ-
6524       ence in the way they are handled. For example, \xdc is exactly the same
6525       as \x{dc} or \334.  However, using the braced versions does  make  such
6526       sequences easier to read.
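
       As a quick check of the equivalence stated above, the digits of the
       three forms can be converted by hand. This Python fragment is only
       arithmetic on the digit strings; it does not call PCRE2, and the
       variable names are illustrative.

```python
# \xdc, \x{dc} and \334 name the same code point.
from_hex = int("dc", 16)     # the digits of \xdc or \x{dc}
from_octal = int("334", 8)   # the digits of \334
same = from_hex == from_octal
```

       Both conversions give 220 (hex DC), so same is True.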
6527
6528       Support  is  available  for some ECMAScript (aka JavaScript) escape se-
6529       quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6530       quence  \x  followed  by { is not recognized. Only if \x is followed by
       two hexadecimal digits is it recognized as a character escape.
       Otherwise it is interpreted as a literal "x" character. In this mode,
       support for code points greater than 255 is provided by \u, which must
       be followed by four hexadecimal digits; otherwise it is interpreted as
       a literal "u" character.
6536
6537       PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in  ad-
6538       dition, \u{hhh..} is recognized as the character specified by hexadeci-
6539       mal code point.  There may be any number of  hexadecimal  digits.  This
6540       syntax is from ECMAScript 6.
6541
6542       The  \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6543       ating in UTF mode. Perl also uses \N{name}  to  specify  characters  by
6544       Unicode  name;  PCRE2  does  not support this. Note that when \N is not
6545       followed by an opening brace (curly bracket) it has an entirely differ-
6546       ent meaning, matching any character that is not a newline.
6547
6548       There  are some legacy applications where the escape sequence \r is ex-
6549       pected to match a newline. If the  PCRE2_EXTRA_ESCAPED_CR_IS_LF  option
6550       is  set,  \r  in  a  pattern is converted to \n so that it matches a LF
6551       (linefeed) instead of a CR (carriage return) character.
6552
6553       The precise effect of \cx on ASCII characters is as follows: if x is  a
6554       lower  case  letter,  it  is converted to upper case. Then bit 6 of the
6555       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
6556       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
6557       hex 7B (; is 3B). If the code unit following \c has a value  less  than
6558       32 or greater than 126, a compile-time error occurs.
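
       The ASCII rule for \cx amounts to one case conversion and one bit
       flip. The following Python sketch restates it; the function name
       control_escape is hypothetical, and the ValueError stands in for
       PCRE2's compile-time error.

```python
def control_escape(x: str) -> int:
    """Return the code value that \\cx denotes for the ASCII character x."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("\\c must be followed by a printable ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())    # lower case letters are upper-cased first
    return code ^ 0x40           # then bit 6 (hex 40) is inverted
```

       For example, control_escape("A") is 1, control_escape("{") is hex 3B,
       and control_escape(";") is hex 7B, as in the text above.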
6559
6560       When  PCRE2  is  compiled in EBCDIC mode, \N{U+hhh..} is not supported.
6561       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
6562       The \c escape is processed as specified for Perl in the perlebcdic doc-
6563       ument. The only characters that are allowed after \c are A-Z,  a-z,  or
6564       one  of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
6565       time error. The sequence \c@ encodes character code  0;  after  \c  the
6566       letters  (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
6567       \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?  be-
6568       comes either 255 (hex FF) or 95 (hex 5F).
6569
6570       Thus,  apart  from  \c?, these escapes generate the same character code
6571       values as they do in an ASCII environment, though the meanings  of  the
6572       values  mostly  differ. For example, \cG always generates code value 7,
6573       which is BEL in ASCII but DEL in EBCDIC.
6574
6575       The sequence \c? generates DEL (127, hex 7F) in an  ASCII  environment,
6576       but  because  127  is  not a control character in EBCDIC, Perl makes it
6577       generate the APC character. Unfortunately, there are  several  variants
6578       of  EBCDIC.  In  most  of them the APC character has the value 255 (hex
6579       FF), but in the one Perl calls POSIX-BC its value is 95  (hex  5F).  If
6580       certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6581       95; otherwise it generates 255.
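
       The EBCDIC handling of \c described above can be summarized as a small
       lookup. This Python sketch is a model of the text, not PCRE2 code: the
       function name is hypothetical, and the posix_bc flag is an assumed
       stand-in for PCRE2's own detection of the POSIX-BC variant.

```python
def ebcdic_control_escape(x: str, posix_bc: bool = False) -> int:
    """Return the code value that \\cx denotes when PCRE2 is in EBCDIC mode."""
    if len(x) != 1:
        raise ValueError("exactly one character must follow \\c")
    if x == "@":
        return 0
    if x == "?":
        # APC: 95 (hex 5F) in POSIX-BC, 255 (hex FF) in other variants
        return 0x5F if posix_bc else 0xFF
    if x.isascii() and x.isalpha():
        return ord(x.upper()) - ord("A") + 1     # letters give 1 to 26
    specials = "[\\]^_"                          # these give 27 to 31
    if x in specials:
        return 27 + specials.index(x)
    raise ValueError("character not allowed after \\c in EBCDIC mode")
```

       With this model, ebcdic_control_escape("G") gives 7 and
       ebcdic_control_escape("?") gives 255, or 95 when posix_bc is true.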
6582
       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used. Thus the
       sequence \0\x\015 specifies two binary zeros followed by a CR character
       (code value 13): \0 and \x are each followed by no digits and so yield
       zero, and \015 is octal for 13. Make sure you supply two digits after
       the initial zero if the pattern character that follows is itself an
       octal digit.
6588
6589       The escape \o must be followed by a sequence of octal digits,  enclosed
6590       in  braces.  An  error occurs if this is not the case. This escape is a
       recent addition to Perl; it provides a way of specifying character code
6592       points  as  octal  numbers  greater than 0777, and it also allows octal
6593       numbers and backreferences to be unambiguously specified.
6594
6595       For greater clarity and unambiguity, it is best to avoid following \ by
6596       a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6597       cal character code points, and \g{} to specify backreferences. The fol-
6598       lowing paragraphs describe the old, ambiguous syntax.
6599
6600       The handling of a backslash followed by a digit other than 0 is compli-
6601       cated, and Perl has changed over time, causing PCRE2 also to change.
6602
6603       Outside a character class, PCRE2 reads the digit and any following dig-
6604       its as a decimal number. If the number is less than 10, begins with the
6605       digit 8 or 9, or if there are  at  least  that  many  previous  capture
6606       groups  in the expression, the entire sequence is taken as a backrefer-
6607       ence. A description of how this works is  given  later,  following  the
6608       discussion  of parenthesized groups.  Otherwise, up to three octal dig-
6609       its are read to form a character code.
6610
6611       Inside a character class, PCRE2 handles \8 and \9 as the literal  char-
6612       acters  "8"  and "9", and otherwise reads up to three octal digits fol-
6613       lowing the backslash, using them to generate a data character. Any sub-
6614       sequent  digits  stand for themselves. For example, outside a character
6615       class:
6616
6617         \040   is another way of writing an ASCII space
6618         \40    is the same, provided there are fewer than 40
6619                   previous capture groups
6620         \7     is always a backreference
6621         \11    might be a backreference, or another way of
6622                   writing a tab
6623         \011   is always a tab
6624         \0113  is a tab followed by the character "3"
6625         \113   might be a backreference, otherwise the
6626                   character with octal code 113
6627         \377   might be a backreference, otherwise
6628                   the value 255 (decimal)
6629         \81    is always a backreference
6630
6631       Note that octal values of 100 or greater that are specified using  this
6632       syntax  must  not be introduced by a leading zero, because no more than
6633       three octal digits are ever read.
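
       The decision between a backreference and an octal code outside a
       character class can be written out as follows. This Python sketch is a
       model of the rules in this section, not PCRE2 code: the function name
       is hypothetical, digits is the run of decimal digits after the
       backslash, and capture_groups is assumed to be the number of capture
       groups that precede the escape.

```python
OCTAL_DIGITS = "01234567"


def interpret_backslash_digits(digits: str, capture_groups: int):
    """Return (kind, value, literal_rest) for a backslash followed by digits."""
    if digits[0] == "0":
        # \0 rule: the zero plus up to two further octal digits.
        used = 1
        while used < 3 and used < len(digits) and digits[used] in OCTAL_DIGITS:
            used += 1
        return ("char", int(digits[:used], 8), digits[used:])
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= capture_groups:
        return ("backref", number, "")
    # Otherwise up to three octal digits form a character code; any
    # remaining digits stand for themselves.
    used = 0
    while used < 3 and used < len(digits) and digits[used] in OCTAL_DIGITS:
        used += 1
    return ("char", int(digits[:used], 8), digits[used:])
```

       For instance, interpret_backslash_digits("11", 3) yields a tab (code 9),
       whereas interpret_backslash_digits("11", 11) yields backreference 11.
       This ambiguity is the reason for preferring \o{}, \x{}, and \g{}.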
6634
6635   Constraints on character values
6636
6637       Characters that are specified using octal or  hexadecimal  numbers  are
6638       limited to certain values, as follows:
6639
6640         8-bit non-UTF mode    no greater than 0xff
6641         16-bit non-UTF mode   no greater than 0xffff
6642         32-bit non-UTF mode   no greater than 0xffffffff
6643         All UTF modes         no greater than 0x10ffff and a valid code point
6644
6645       Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
6646       (the so-called "surrogate" code points). The check  for  these  can  be
6647       disabled  by  the  caller  of  pcre2_compile()  by  setting  the option
6648       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only  in
6649       UTF-8  and  UTF-32 modes, because these values are not representable in
6650       UTF-16.
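
       The limits in the table above, together with the surrogate rule, can
       be expressed as a single check. This Python sketch is illustrative
       only; the function and parameter names are hypothetical, and
       allow_surrogates models the effect of setting
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.

```python
# Largest value allowed in each non-UTF mode, keyed by code unit width.
NON_UTF_LIMIT = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}


def escape_value_allowed(value: int, width: int, utf: bool,
                         allow_surrogates: bool = False) -> bool:
    """Report whether an octal or hex escape may specify this value."""
    if not utf:
        return value <= NON_UTF_LIMIT[width]
    if value > 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates are invalid unless the check is disabled, which is
        # possible only in UTF-8 and UTF-32 modes.
        return allow_surrogates and width in (8, 32)
    return True
```

       Under this model, hex 110000 is rejected in every UTF mode but
       accepted in 32-bit non-UTF mode.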
6651
6652   Escape sequences in character classes
6653
6654       All the sequences that define a single character value can be used both
6655       inside  and  outside character classes. In addition, inside a character
6656       class, \b is interpreted as the backspace character (hex 08).
6657
6658       When not followed by an opening brace, \N is not allowed in a character
6659       class.   \B,  \R, and \X are not special inside a character class. Like
6660       other unrecognized alphabetic escape sequences, they  cause  an  error.
6661       Outside a character class, these sequences have different meanings.
6662
6663   Unsupported escape sequences
6664
6665       In  Perl,  the  sequences  \F, \l, \L, \u, and \U are recognized by its
6666       string handler and used to modify the case of following characters.  By
6667       default,  PCRE2  does  not  support these escape sequences in patterns.
6668       However, if either of the PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX  op-
6669       tions  is set, \U matches a "U" character, and \u can be used to define
6670       a character by code point, as described above.
6671
6672   Absolute and relative backreferences
6673
6674       The sequence \g followed by a signed or unsigned number, optionally en-
6675       closed  in  braces,  is  an absolute or relative backreference. A named
6676       backreference can be coded as \g{name}.  Backreferences  are  discussed
6677       later, following the discussion of parenthesized groups.
6678
6679   Absolute and relative subroutine calls
6680
6681       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
6682       name or a number enclosed either in angle brackets or single quotes, is
6683       an  alternative syntax for referencing a capture group as a subroutine.
6684       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
6685       \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6686       erence; the latter is a subroutine call.
6687
6688   Generic character types
6689
6690       Another use of backslash is for specifying generic character types:
6691
6692         \d     any decimal digit
6693         \D     any character that is not a decimal digit
6694         \h     any horizontal white space character
6695         \H     any character that is not a horizontal white space character
6696         \N     any character that is not a newline
6697         \s     any white space character
6698         \S     any character that is not a white space character
6699         \v     any vertical white space character
6700         \V     any character that is not a vertical white space character
6701         \w     any "word" character
6702         \W     any "non-word" character
6703
6704       The \N escape sequence has the same meaning as  the  "."  metacharacter
6705       when  PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
6706       the meaning of \N. Note that when \N is followed by an opening brace it
6707       has a different meaning. See the section entitled "Non-printing charac-
6708       ters" above for details. Perl also uses \N{name} to specify  characters
6709       by Unicode name; PCRE2 does not support this.
6710
6711       Each  pair of lower and upper case escape sequences partitions the com-
6712       plete set of characters into two disjoint  sets.  Any  given  character
6713       matches  one, and only one, of each pair. The sequences can appear both
6714       inside and outside character classes. They each match one character  of
6715       the  appropriate  type.  If the current matching point is at the end of
6716       the subject string, all of them fail, because there is no character  to
6717       match.
6718
6719       The  default  \s  characters  are HT (9), LF (10), VT (11), FF (12), CR
6720       (13), and space (32), which are defined as white space in the  "C"  lo-
6721       cale.  This  list may vary if locale-specific matching is taking place.
6722       For example, in some locales the "non-breaking space" character  (\xA0)
6723       is recognized as white space, and in others the VT character is not.
6724
6725       A  "word"  character is an underscore or any character that is a letter
6726       or digit.  By default, the definition of letters  and  digits  is  con-
6727       trolled by PCRE2's low-valued character tables, and may vary if locale-
6728       specific matching is taking place (see "Locale support" in the pcre2api
6729       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
6730       systems, or "french" in Windows, some character codes greater than  127
6731       are  used  for  accented letters, and these are then matched by \w. The
6732       use of locales with Unicode is discouraged.
6733
6734       By default, characters whose code points are  greater  than  127  never
6735       match \d, \s, or \w, and always match \D, \S, and \W, although this may
6736       be different for characters in the range 128-255  when  locale-specific
6737       matching  is  happening.   These escape sequences retain their original
6738       meanings from before Unicode support was available,  mainly  for  effi-
6739       ciency  reasons.  If  the  PCRE2_UCP  option  is  set, the behaviour is
6740       changed so that Unicode properties  are  used  to  determine  character
6741       types, as follows:
6742
6743         \d  any character that matches \p{Nd} (decimal digit)
6744         \s  any character that matches \p{Z} or \h or \v
6745         \w  any character that matches \p{L} or \p{N}, plus underscore
6746
6747       The  upper case escapes match the inverse sets of characters. Note that
6748       \d matches only decimal digits, whereas \w matches any  Unicode  digit,
6749       as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
       affects \b and \B, because they are defined in terms of \w and \W.
6751       Matching these sequences is noticeably slower when PCRE2_UCP is set.
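
       The difference between \d and \w under PCRE2_UCP can be illustrated
       with Python's unicodedata module, which exposes the Unicode general
       category of a character. This is a model only: it does not call PCRE2,
       the function names are hypothetical, \s is not modelled, and Python's
       Unicode tables may correspond to a different Unicode version from the
       one PCRE2 was built with.

```python
import unicodedata


def ucp_digit(ch: str) -> bool:
    """Model of \\d with PCRE2_UCP set: general category Nd only."""
    return unicodedata.category(ch) == "Nd"


def ucp_word(ch: str) -> bool:
    """Model of \\w with PCRE2_UCP set: any letter or number, or underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

       For example, the superscript two (U+00B2) has category No, so in this
       model it is a word character but not a decimal digit, whereas the
       Arabic-Indic digit three (U+0663) is both.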
6752
6753       The  sequences  \h, \H, \v, and \V, in contrast to the other sequences,
6754       which match only ASCII characters by default, always match  a  specific
6755       list  of  code  points, whether or not PCRE2_UCP is set. The horizontal
6756       space characters are:
6757
6758         U+0009     Horizontal tab (HT)
6759         U+0020     Space
6760         U+00A0     Non-break space
6761         U+1680     Ogham space mark
6762         U+180E     Mongolian vowel separator
6763         U+2000     En quad
6764         U+2001     Em quad
6765         U+2002     En space
6766         U+2003     Em space
6767         U+2004     Three-per-em space
6768         U+2005     Four-per-em space
6769         U+2006     Six-per-em space
6770         U+2007     Figure space
6771         U+2008     Punctuation space
6772         U+2009     Thin space
6773         U+200A     Hair space
6774         U+202F     Narrow no-break space
6775         U+205F     Medium mathematical space
6776         U+3000     Ideographic space
6777
6778       The vertical space characters are:
6779
6780         U+000A     Linefeed (LF)
6781         U+000B     Vertical tab (VT)
6782         U+000C     Form feed (FF)
6783         U+000D     Carriage return (CR)
6784         U+0085     Next line (NEL)
6785         U+2028     Line separator
6786         U+2029     Paragraph separator
6787
6788       In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
6789       than 256 are relevant.
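
       Because \h and \v are defined by fixed lists, they can be represented
       as plain sets of code points. The Python sketch below copies the two
       lists above; the names are illustrative, and the helper shows which
       entries remain relevant when code points are limited, as in 8-bit
       non-UTF mode.

```python
# Code points matched by \h (horizontal white space), from the list above.
HORIZONTAL = frozenset([
    0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200B),          # U+2000 to U+200A inclusive
    0x202F, 0x205F, 0x3000,
])

# Code points matched by \v (vertical white space), from the list above.
VERTICAL = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])


def relevant(code_points, max_code_point):
    """Return the sorted members that do not exceed max_code_point."""
    return sorted(cp for cp in code_points if cp <= max_code_point)


horizontal_8bit = relevant(HORIZONTAL, 0xFF)
vertical_8bit = relevant(VERTICAL, 0xFF)
```

       In the 8-bit case only tab, space, and non-break space remain for \h,
       and LF, VT, FF, CR, and NEL for \v.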
6790
6791   Newline sequences
6792
6793       Outside  a  character class, by default, the escape sequence \R matches
6794       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
6795       to the following:
6796
6797         (?>\r\n|\n|\x0b|\f|\r|\x85)
6798
6799       This is an example of an "atomic group", details of which are given be-
6800       low.  This particular group matches either the  two-character  sequence
6801       CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
6802       U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
6803       riage  return,  U+000D), or NEL (next line, U+0085). Because this is an
6804       atomic group, the two-character sequence is treated as  a  single  unit
6805       that cannot be split.
6806
6807       In other modes, two additional characters whose code points are greater
6808       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6809       rator,  U+2029).  Unicode support is not needed for these characters to
6810       be recognized.
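
       The alternation that \R stands for can be tried in other regex engines
       too. The sketch below uses Python's re module as a stand-in for PCRE2,
       which is an assumption made for illustration. A plain non-capturing
       group is used, so the atomic behaviour of \R is not reproduced; the
       two-character CRLF alternative is simply listed first. All names are
       hypothetical.

```python
import re

# Equivalent of \R in 8-bit non-UTF mode.
NEWLINE_8BIT = r"(?:\r\n|\n|\x0b|\f|\r|\x85)"

# In other modes LS (U+2028) and PS (U+2029) are recognized as well.
NEWLINE_WIDE = r"(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)"


def split_lines(text: str, wide: bool = True) -> list:
    """Split text at every newline sequence that \\R would match."""
    return re.split(NEWLINE_WIDE if wide else NEWLINE_8BIT, text)


parts = split_lines("a\r\nb\nc\x85d\u2028e")
```

       Here parts is ['a', 'b', 'c', 'd', 'e']: CRLF is consumed as one
       separator, not as two.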
6811
       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
       "backslash R".) This can be made the default when PCRE2 is built; if
       this is the case, the other behaviour can be requested via the
       PCRE2_BSR_UNICODE option. It is also possible to specify these settings
       by starting a pattern string with one of the following sequences:
6819
6820         (*BSR_ANYCRLF)   CR, LF, or CRLF only
6821         (*BSR_UNICODE)   any Unicode newline sequence
6822
6823       These override the default and the options given to the compiling func-
6824       tion.  Note that these special settings, which are not Perl-compatible,
6825       are recognized only at the very start of a pattern, and that they  must
6826       be  in upper case. If more than one of them is present, the last one is
6827       used. They can be combined with a change of newline convention; for ex-
6828       ample, a pattern can start with:
6829
6830         (*ANY)(*BSR_ANYCRLF)
6831
6832       They  can also be combined with the (*UTF) or (*UCP) special sequences.
6833       Inside a character class, \R is treated as an unrecognized  escape  se-
6834       quence, and causes an error.
6835
6836   Unicode character properties
6837
6838       When  PCRE2  is  built  with Unicode support (the default), three addi-
6839       tional escape sequences that match characters with specific  properties
6840       are available. They can be used in any mode, though in 8-bit and 16-bit
6841       non-UTF modes these sequences are of course limited to testing  charac-
6842       ters  whose code points are less than U+0100 and U+10000, respectively.
6843       In 32-bit non-UTF mode, code points greater than 0x10ffff (the  Unicode
6844       limit)  may  be  encountered. These are all treated as being in the Un-
6845       known script and with an unassigned type. The  extra  escape  sequences
6846       are:
6847
6848         \p{xx}   a character with the xx property
6849         \P{xx}   a character without the xx property
6850         \X       a Unicode extended grapheme cluster
6851
6852       The property names represented by xx above are case-sensitive. There is
6853       support for Unicode script names, Unicode general category  properties,
6854       "Any",  which  matches any character (including newline), and some spe-
6855       cial PCRE2 properties (described in  the  next  section).   Other  Perl
6856       properties such as "InMusicalSymbols" are not supported by PCRE2.  Note
6857       that \P{Any} does not match any characters, so always  causes  a  match
6858       failure.
6859
6860       Sets of Unicode characters are defined as belonging to certain scripts.
6861       A character from one of these sets can be matched using a script  name.
6862       For example:
6863
6864         \p{Greek}
6865         \P{Han}
6866
6867       Unassigned characters (and in non-UTF 32-bit mode, characters with code
6868       points greater than 0x10FFFF) are assigned the "Unknown" script. Others
6869       that  are not part of an identified script are lumped together as "Com-
6870       mon". The current list of scripts is:
6871
6872       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic,  Armenian,  Avestan,  Bali-
6873       nese,  Bamum,  Bassa_Vah,  Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
6874       Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Caucasian_Alba-
6875       nian,  Chakma,  Cham,  Cherokee, Chorasmian, Common, Coptic, Cuneiform,
6876       Cypriot, Cyrillic, Deseret, Devanagari, Dives_Akuru,  Dogra,  Duployan,
6877       Egyptian_Hieroglyphs, Elbasan, Elymaic, Ethiopic, Georgian, Glagolitic,
6878       Gothic, Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
6879       Hanifi_Rohingya,  Hanunoo,  Hatran, Hebrew, Hiragana, Imperial_Aramaic,
6880       Inherited,  Inscriptional_Pahlavi,  Inscriptional_Parthian,   Javanese,
6881       Kaithi,  Kannada,  Katakana, Kayah_Li, Kharoshthi, Khitan_Small_Script,
6882       Khmer, Khojki, Khudawadi, Lao, Latin,  Lepcha,  Limbu,  Linear_A,  Lin-
6883       ear_B,  Lisu,  Lycian,  Lydian,  Mahajani, Makasar, Malayalam, Mandaic,
6884       Manichaean,   Marchen,   Masaram_Gondi,   Medefaidrin,    Meetei_Mayek,
6885       Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mon-
6886       golian, Mro, Multani,  Myanmar,  Nabataean,  Nandinagari,  New_Tai_Lue,
6887       Newa,  Nko,  Nushu, Nyakeng_Puachue_Hmong, Ogham, Ol_Chiki, Old_Hungar-
6888       ian, Old_Italic, Old_North_Arabian, Old_Permic,  Old_Persian,  Old_Sog-
6889       dian,   Old_South_Arabian,   Old_Turkic,  Oriya,  Osage,  Osmanya,  Pa-
6890       hawh_Hmong,    Palmyrene,    Pau_Cin_Hau,     Phags_Pa,     Phoenician,
6891       Psalter_Pahlavi,  Rejang,  Runic,  Samaritan, Saurashtra, Sharada, Sha-
6892       vian, Siddham, SignWriting, Sinhala,  Sogdian,  Sora_Sompeng,  Soyombo,
6893       Sundanese,  Syloti_Nagri,  Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
6894       Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana,  Thai,  Tibetan,  Tifi-
6895       nagh, Tirhuta, Ugaritic, Unknown, Vai, Wancho, Warang_Citi, Yezidi, Yi,
6896       Zanabazar_Square.
6897
6898       Each character has exactly one Unicode general category property, spec-
6899       ified  by a two-letter abbreviation. For compatibility with Perl, nega-
6900       tion can be specified by including a  circumflex  between  the  opening
6901       brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
6902       \P{Lu}.
6903
6904       If only one letter is specified with \p or \P, it includes all the gen-
6905       eral  category properties that start with that letter. In this case, in
6906       the absence of negation, the curly brackets in the escape sequence  are
6907       optional; these two examples have the same effect:
6908
6909         \p{L}
6910         \pL
6911
6912       The following general category property codes are supported:
6913
6914         C     Other
6915         Cc    Control
6916         Cf    Format
6917         Cn    Unassigned
6918         Co    Private use
6919         Cs    Surrogate
6920
6921         L     Letter
6922         Ll    Lower case letter
6923         Lm    Modifier letter
6924         Lo    Other letter
6925         Lt    Title case letter
6926         Lu    Upper case letter
6927
6928         M     Mark
6929         Mc    Spacing mark
6930         Me    Enclosing mark
6931         Mn    Non-spacing mark
6932
6933         N     Number
6934         Nd    Decimal number
6935         Nl    Letter number
6936         No    Other number
6937
6938         P     Punctuation
6939         Pc    Connector punctuation
6940         Pd    Dash punctuation
6941         Pe    Close punctuation
6942         Pf    Final punctuation
6943         Pi    Initial punctuation
6944         Po    Other punctuation
6945         Ps    Open punctuation
6946
6947         S     Symbol
6948         Sc    Currency symbol
6949         Sk    Modifier symbol
6950         Sm    Mathematical symbol
6951         So    Other symbol
6952
6953         Z     Separator
6954         Zl    Line separator
6955         Zp    Paragraph separator
6956         Zs    Space separator
6957
6958       The  special property L& is also supported: it matches a character that
6959       has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
6960       classified as a modifier or "other".
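
       The general category rules described above (two-letter codes,
       single-letter groups, ^ negation, "Any", and L&) can be modelled with
       Python's unicodedata module. This sketch does not call PCRE2, does not
       cover script names, and uses a hypothetical function name; Python's
       Unicode tables may also differ in version from those in PCRE2.

```python
import unicodedata


def has_property(ch: str, name: str) -> bool:
    """Model of \\p{name} for general category names only."""
    negate = name.startswith("^")        # \p{^Lu} is the same as \P{Lu}
    if negate:
        name = name[1:]
    category = unicodedata.category(ch)
    if name == "Any":
        result = True
    elif name == "L&":
        result = category in ("Lu", "Ll", "Lt")
    elif len(name) == 1:
        result = category.startswith(name)   # e.g. L covers Ll, Lm, Lo, ...
    else:
        result = category == name
    return result != negate
```

       For example, has_property("a", "L") and has_property("a", "^Lu") are
       both true, while a modifier letter such as U+02B0 (category Lm) does
       not have the L& property.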
6961
6962       The  Cs  (Surrogate)  property  applies  only  to characters whose code
6963       points are in the range U+D800 to U+DFFF. These characters are no  dif-
6964       ferent  to any other character when PCRE2 is not in UTF mode (using the
6965       16-bit or 32-bit library).  However, they  are  not  valid  in  Unicode
6966       strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
6967       ity  checking  has   been   turned   off   (see   the   discussion   of
6968       PCRE2_NO_UTF_CHECK in the pcre2api page).
6969
6970       The  long  synonyms  for  property  names  that  Perl supports (such as
6971       \p{Letter}) are not supported by PCRE2, nor is it permitted  to  prefix
6972       any of these properties with "Is".
6973
6974       No character that is in the Unicode table has the Cn (unassigned) prop-
6975       erty.  Instead, this property is assumed for any code point that is not
6976       in the Unicode table.
6977
6978       Specifying  caseless  matching  does not affect these escape sequences.
6979       For example, \p{Lu} always matches only upper  case  letters.  This  is
6980       different from the behaviour of current versions of Perl.
6981
6982       Matching  characters by Unicode property is not fast, because PCRE2 has
6983       to do a multistage table lookup in order to find  a  character's  prop-
6984       erty. That is why the traditional escape sequences such as \d and \w do
6985       not use Unicode properties in PCRE2 by default,  though  you  can  make
6986       them  do  so by setting the PCRE2_UCP option or by starting the pattern
6987       with (*UCP).
6988
6989   Extended grapheme clusters
6990
6991       The \X escape matches any number of Unicode  characters  that  form  an
6992       "extended grapheme cluster", and treats the sequence as an atomic group
6993       (see below).  Unicode supports various kinds of composite character  by
6994       giving  each  character  a grapheme breaking property, and having rules
6995       that use these properties to define the boundaries of extended grapheme
6996       clusters.  The rules are defined in Unicode Standard Annex 29, "Unicode
6997       Text Segmentation". Unicode 11.0.0 abandoned the use of  some  previous
6998       properties  that had been used for emojis.  Instead it introduced vari-
6999       ous emoji-specific properties. PCRE2  uses  only  the  Extended  Picto-
7000       graphic property.
7001
7002       \X  always  matches  at least one character. Then it decides whether to
7003       add additional characters according to the following rules for ending a
7004       cluster:
7005
7006       1. End at the end of the subject string.
7007
7008       2.  Do not end between CR and LF; otherwise end after any control char-
7009       acter.
7010
7011       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
7012       characters  are of five types: L, V, T, LV, and LVT. An L character may
7013       be followed by an L, V, LV, or LVT character; an LV or V character  may
       be followed by a V or T character; an LVT or T character may be followed
7015       only by a T character.
7016
7017       4. Do not end before extending  characters  or  spacing  marks  or  the
7018       "zero-width  joiner" character. Characters with the "mark" property al-
7019       ways have the "extend" grapheme breaking property.
7020
7021       5. Do not end after prepend characters.
7022
       6. Do not break within emoji modifier sequences or emoji ZWJ sequences.
7024       That is, do not break between characters with the Extended_Pictographic
7025       property.  Extend and ZWJ characters are allowed  between  the  charac-
7026       ters.
7027
7028       7.  Do not break within emoji flag sequences. That is, do not break be-
7029       tween regional indicator (RI) characters if there are an odd number  of
7030       RI characters before the break point.
7031
7032       8. Otherwise, end the cluster.
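
       Two of the rules above, rule 3 (Hangul) and rule 7 (flag sequences),
       are simple enough to restate as code. The following Python sketch is a
       partial model, not a grapheme cluster implementation: it does not
       classify Hangul characters itself but takes the type labels L, V, T,
       LV, and LVT as input, and all names are hypothetical.

```python
# Rule 3: which Hangul syllable types may follow each type without a break.
HANGUL_MAY_FOLLOW = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_keeps_together(prev_type: str, next_type: str) -> bool:
    """True if rule 3 forbids a break between the two Hangul types."""
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())


def is_regional_indicator(ch: str) -> bool:
    """Regional indicator symbols occupy U+1F1E6 to U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def split_flags(run: str) -> list:
    """Rule 7: split a run made up only of RI characters into clusters.

    A break is allowed only where an even number of RI characters
    precedes the break point, so the run falls into pairs.
    """
    return [run[i:i + 2] for i in range(0, len(run), 2)]
```

       A run of four regional indicators therefore forms two clusters of two
       characters each; a run of three forms a pair followed by a single
       character.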
7033
7034   PCRE2's additional properties
7035
7036       As  well as the standard Unicode properties described above, PCRE2 sup-
7037       ports four more that make it possible to convert traditional escape se-
7038       quences  such  as \w and \s to use Unicode properties. PCRE2 uses these
7039       non-standard, non-Perl properties internally  when  PCRE2_UCP  is  set.
7040       However, they may also be used explicitly. These properties are:
7041
7042         Xan   Any alphanumeric character
7043         Xps   Any POSIX space character
7044         Xsp   Any Perl space character
7045         Xwd   Any Perl "word" character
7046
7047       Xan  matches  characters that have either the L (letter) or the N (num-
7048       ber) property. Xps matches the characters tab, linefeed, vertical  tab,
7049       form  feed,  or carriage return, and any other character that has the Z
7050       (separator) property.  Xsp is the same as Xps; in PCRE1 it used to  ex-
7051       clude  vertical  tab,  for  Perl  compatibility,  but Perl changed. Xwd
7052       matches the same characters as Xan, plus underscore.
7053
7054       There is another non-standard property, Xuc, which matches any  charac-
7055       ter  that  can  be represented by a Universal Character Name in C++ and
7056       other programming languages. These are the characters $,  @,  `  (grave
7057       accent),  and  all  characters with Unicode code points greater than or
7058       equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note  that
7059       most  base  (ASCII) characters are excluded. (Universal Character Names
7060       are of the form \uHHHH or \UHHHHHHHH where H is  a  hexadecimal  digit.
7061       Note that the Xuc property does not match these sequences but the char-
7062       acters that they represent.)
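
       The set described for Xuc can be written as a simple test on the code
       point. This is a Python sketch of the description above, with a
       hypothetical function name; it is not taken from the PCRE2 source.

```python
def is_xuc(cp):
    """True if code point cp is in the set described for Xuc."""
    if cp in (0x24, 0x40, 0x60):      # $, @, ` (grave accent)
        return True
    if 0xD800 <= cp <= 0xDFFF:        # surrogates are excluded
        return False
    return 0xA0 <= cp <= 0x10FFFF     # everything else from U+00A0 upwards
```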
7063
7064   Resetting the match start
7065
7066       In normal use, the escape sequence \K  causes  any  previously  matched
7067       characters not to be included in the final matched sequence that is re-
7068       turned. For example, the pattern:
7069
7070         foo\Kbar
7071
7072       matches "foobar", but reports that it has matched "bar".  \K  does  not
7073       interact with anchoring in any way. The pattern:
7074
7075         ^foo\Kbar
7076
7077       matches  only  when  the  subject  begins with "foobar" (in single line
7078       mode), though it again reports the matched string as "bar".  This  fea-
7079       ture  is similar to a lookbehind assertion (described below).  However,
7080       in this case, the part of the subject before the real  match  does  not
7081       have  to be of fixed length, as lookbehind assertions do. The use of \K
7082       does not interfere with the setting of captured substrings.  For  exam-
7083       ple, when the pattern
7084
7085         (foo)\Kbar
7086
7087       matches "foobar", the first substring is still set to "foo".
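
       Python's re module has no \K, so the following sketch only emulates the
       offsets that PCRE2 would report: it matches both parts of a pattern of
       the form prefix\Krest and returns the span of the part after \K. The
       helper name search_keep and the group name "kept" are inventions for
       this example.

```python
import re

def search_keep(prefix, rest, subject):
    """Match prefix followed by rest; report only where rest matched."""
    m = re.search("(?:" + prefix + ")(?P<kept>" + rest + ")", subject)
    return None if m is None else m.span("kept")

fixed = search_keep("foo", "bar", "foobar")        # foo\Kbar
variable = search_keep("fo+", "bar", "xfooobar")   # fo+\Kbar
```

       The second call has a prefix of variable length, which is the case that
       the text above contrasts with lookbehind assertions.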
7088
7089       Perl  used  to document that the use of \K within lookaround assertions
7090       is "not well defined", but from version 5.32.0 Perl  does  not  support
7091       this  usage  at  all.  In PCRE2, \K is acted upon when it occurs inside
7092       positive assertions, but is ignored in negative assertions.  Note  that
7093       when  a  pattern  such  as  (?=ab\K) matches, the reported start of the
7094       match can be greater than the end of the match. Using \K in  a  lookbe-
7095       hind  assertion at the start of a pattern can also lead to odd effects.
7096       For example, consider this pattern:
7097
7098         (?<=\Kfoo)bar
7099
7100       If the subject is "foobar", a call to  pcre2_match()  with  a  starting
7101       offset  of 3 succeeds and reports the matching string as "foobar", that
7102       is, the start of the reported match is earlier  than  where  the  match
7103       started.
7104
7105   Simple assertions
7106
7107       The  final use of backslash is for certain simple assertions. An asser-
7108       tion specifies a condition that has to be met at a particular point  in
7109       a  match, without consuming any characters from the subject string. The
7110       use of groups for more complicated assertions is described below.   The
7111       backslashed assertions are:
7112
7113         \b     matches at a word boundary
7114         \B     matches when not at a word boundary
7115         \A     matches at the start of the subject
7116         \Z     matches at the end of the subject
7117                 also matches before a newline at the end of the subject
7118         \z     matches only at the end of the subject
7119         \G     matches at the first matching position in the subject
7120
7121       Inside  a  character  class, \b has a different meaning; it matches the
7122       backspace character. If any other of  these  assertions  appears  in  a
7123       character class, an "invalid escape sequence" error is generated.
7124
7125       A  word  boundary is a position in the subject string where the current
7126       character and the previous character do not both match \w or  \W  (i.e.
7127       one  matches  \w  and the other matches \W), or the start or end of the
7128       string if the first or last character matches  \w,  respectively.  When
7129       PCRE2  is  built with Unicode support, the meanings of \w and \W can be
7130       changed by setting the PCRE2_UCP option. When this is done, it also af-
7131       fects  \b and \B. Neither PCRE2 nor Perl has a separate "start of word"
7132       or "end of word" metasequence. However, whatever  follows  \b  normally
7133       determines  which  it  is. For example, the fragment \ba matches "a" at
7134       the start of a word.
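
       The definition of a word boundary given above can be written out as a
       test on the characters either side of a position. This Python sketch
       assumes the default ASCII meaning of \w (no PCRE2_UCP); the function
       names are hypothetical.

```python
def is_word_char(ch):
    # ASCII-only word character: letter, digit, or underscore
    return ch.isascii() and (ch.isalnum() or ch == "_")

def at_word_boundary(subject, i):
    # True where a word boundary assertion would succeed at offset i
    before = i > 0 and is_word_char(subject[i - 1])
    after = i < len(subject) and is_word_char(subject[i])
    return before != after

boundaries = [i for i in range(len("ab cd") + 1) if at_word_boundary("ab cd", i)]
```

       For the subject "ab cd" the boundaries are at offsets 0, 2, 3, and 5.
       The start and end of the string count only because the first and last
       characters are word characters.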
7135
7136       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
7137       and dollar (described in the next section) in that they only ever match
7138       at the very start and end of the subject string, whatever  options  are
7139       set.  Thus,  they are independent of multiline mode. These three asser-
7140       tions are not affected by the  PCRE2_NOTBOL  or  PCRE2_NOTEOL  options,
7141       which  affect only the behaviour of the circumflex and dollar metachar-
7142       acters. However, if the startoffset argument of pcre2_match()  is  non-
7143       zero,  indicating  that  matching is to start at a point other than the
7144       beginning of the subject, \A can never match.  The  difference  between
7145       \Z  and \z is that \Z matches before a newline at the end of the string
7146       as well as at the very end, whereas \z matches only at the end.
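
       The three assertions can be summarized as tests on the current
       position. This Python sketch assumes that a newline is a single
       linefeed, and its function names are hypothetical. Python's own \Z is
       not used because it behaves like the \z described here.

```python
def matches_upper_a(subject, pos):
    # \A: only at offset 0, which is never reached if startoffset is non-zero
    return pos == 0

def matches_lower_z(subject, pos):
    # \z: only at the very end
    return pos == len(subject)

def matches_upper_z(subject, pos):
    # \Z: at the very end, or just before a newline that is the last character
    at_end = pos == len(subject)
    return at_end or (pos == len(subject) - 1 and subject[pos] == "\n")
```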
7147
       The \G assertion is true only when the current matching position is at
       the start point of the matching process, as specified by the startoffset
       argument of pcre2_match(). It differs from \A when the value of
       startoffset is non-zero. By calling pcre2_match() multiple times with
       appropriate arguments, you can mimic Perl's /g option, and it is in this
       kind of implementation that \G can be useful.
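
       The repeated-call technique can be sketched in Python, where
       Pattern.match(string, pos) requires the match to begin at pos and so
       plays the role of a pattern that starts with \G being run with a
       non-zero startoffset. PCRE2 is not called here, the helper name is
       hypothetical, and stopping at an empty match is a simplification.

```python
import re

def match_all_anchored(pattern, subject):
    """Collect successive matches, each starting where the previous one ended."""
    compiled = re.compile(pattern)
    pos = 0
    found = []
    while pos < len(subject):
        m = compiled.match(subject, pos)  # must match at pos, like \G
        if m is None or m.end() == pos:
            break  # nothing at the start point, or an empty match: stop
        found.append(m.group())
        pos = m.end()
    return found

items = match_all_anchored(r"\d+,?", "12,345,x6")
```

       The final "6" is not returned: once the match at the current start
       point fails, the loop ends instead of searching further along the
       subject.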
7154
7155       Note,  however,  that  PCRE2's  implementation of \G, being true at the
7156       starting character of the matching process, is  subtly  different  from
7157       Perl's,  which  defines it as true at the end of the previous match. In
7158       Perl, these can be different when the  previously  matched  string  was
7159       empty. Because PCRE2 does just one match at a time, it cannot reproduce
7160       this behaviour.
7161
7162       If all the alternatives of a pattern begin with \G, the  expression  is
7163       anchored to the starting match position, and the "anchored" flag is set
7164       in the compiled regular expression.
7165
7166
7167CIRCUMFLEX AND DOLLAR
7168
7169       The circumflex and dollar  metacharacters  are  zero-width  assertions.
7170       That  is,  they test for a particular condition being true without con-
7171       suming any characters from the subject string. These two metacharacters
7172       are  concerned  with matching the starts and ends of lines. If the new-
7173       line convention is set so that only the two-character sequence CRLF  is
7174       recognized  as  a newline, isolated CR and LF characters are treated as
7175       ordinary data characters, and are not recognized as newlines.
7176
7177       Outside a character class, in the default matching mode, the circumflex
7178       character  is  an  assertion  that is true only if the current matching
7179       point is at the start of the subject string. If the  startoffset  argu-
7180       ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
7181       flex can never match if the PCRE2_MULTILINE option is unset.  Inside  a
7182       character  class, circumflex has an entirely different meaning (see be-
7183       low).
7184
7185       Circumflex need not be the first character of the pattern if  a  number
7186       of  alternatives are involved, but it should be the first thing in each
7187       alternative in which it appears if the pattern is ever  to  match  that
7188       branch.  If all possible alternatives start with a circumflex, that is,
7189       if the pattern is constrained to match only at the start  of  the  sub-
7190       ject,  it  is  said  to be an "anchored" pattern. (There are also other
7191       constructs that can cause a pattern to be anchored.)
7192
7193       The dollar character is an assertion that is true only if  the  current
7194       matching  point is at the end of the subject string, or immediately be-
7195       fore a newline at the end of the string (by default), unless  PCRE2_NO-
7196       TEOL  is  set.  Note, however, that it does not actually match the new-
7197       line. Dollar need not be the last character of the pattern if a  number
7198       of  alternatives  are  involved,  but it should be the last item in any
7199       branch in which it appears. Dollar has no special meaning in a  charac-
7200       ter class.
7201
7202       The  meaning  of  dollar  can be changed so that it matches only at the
7203       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY  option  at
7204       compile time. This does not affect the \Z assertion.
7205
7206       The meanings of the circumflex and dollar metacharacters are changed if
7207       the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar
7208       character  matches before any newlines in the string, as well as at the
7209       very end, and a circumflex matches immediately after internal  newlines
7210       as  well as at the start of the subject string. It does not match after
7211       a newline that ends the string, for compatibility with  Perl.  However,
7212       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
7213
7214       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
7215       (where \n represents a newline) in multiline mode, but  not  otherwise.
7216       Consequently,  patterns  that  are anchored in single line mode because
7217       all branches start with ^ are not anchored in  multiline  mode,  and  a
7218       match  for  circumflex  is  possible  when  the startoffset argument of
7219       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
7220       if PCRE2_MULTILINE is set.
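
       Python's re module follows the same Perl rules for circumflex and
       dollar when a newline is a single linefeed, so it is used below as a
       stand-in to show the example. PCRE2 itself is not called, and re.M is
       only Python's counterpart of PCRE2_MULTILINE.

```python
import re

subject = "def\nabc"
single = re.search(r"^abc$", subject)          # default mode: no match
multi = re.search(r"^abc$", subject, re.M)     # multiline: second line matches
final_newline = re.search(r"abc$", "abc\n")    # dollar before a final newline
```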
7221
7222       When  the  newline  convention (see "Newline conventions" below) recog-
7223       nizes the two-character sequence CRLF as a newline, this is  preferred,
7224       even  if  the  single  characters CR and LF are also recognized as new-
7225       lines. For example, if the newline convention  is  "any",  a  multiline
7226       mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather
7227       than after CR, even though CR on its own is a valid newline.  (It  also
7228       matches at the very start of the string, of course.)
7229
7230       Note  that  the sequences \A, \Z, and \z can be used to match the start
7231       and end of the subject in both modes, and if all branches of a  pattern
7232       start  with \A it is always anchored, whether or not PCRE2_MULTILINE is
7233       set.
7234
7235
7236FULL STOP (PERIOD, DOT) AND \N
7237
7238       Outside a character class, a dot in the pattern matches any one charac-
7239       ter  in  the subject string except (by default) a character that signi-
7240       fies the end of a line.
7241
7242       When a line ending is defined as a single character, dot never  matches
7243       that  character; when the two-character sequence CRLF is used, dot does
7244       not match CR if it is immediately followed  by  LF,  but  otherwise  it
7245       matches  all characters (including isolated CRs and LFs). When any Uni-
7246       code line endings are being recognized, dot does not match CR or LF  or
7247       any of the other line ending characters.
7248
7249       The  behaviour  of  dot  with regard to newlines can be changed. If the
7250       PCRE2_DOTALL option is set, a dot matches any  one  character,  without
7251       exception.   If  the two-character sequence CRLF is present in the sub-
7252       ject string, it takes two dots to match it.
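
       The effect of the dot-all setting can be shown with Python's re module,
       whose dot excludes only linefeed by default; that corresponds to the
       case above where a line ending is the single character LF. This is a
       stand-in and not a call to PCRE2; re.S is Python's counterpart of
       PCRE2_DOTALL.

```python
import re

default_dot = re.fullmatch(r"a.c", "a\nc")          # None: dot skips newline
dotall_dot = re.fullmatch(r"a.c", "a\nc", re.S)     # matches
crlf_one = re.fullmatch(r"a.c", "a\r\nc", re.S)     # None: CRLF needs two dots
crlf_two = re.fullmatch(r"a..c", "a\r\nc", re.S)    # matches
```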
7253
7254       The handling of dot is entirely independent of the handling of  circum-
7255       flex  and  dollar,  the  only relationship being that they both involve
7256       newlines. Dot has no special meaning in a character class.
7257
7258       The escape sequence \N when not followed by an  opening  brace  behaves
7259       like  a dot, except that it is not affected by the PCRE2_DOTALL option.
7260       In other words, it matches any character except one that signifies  the
7261       end of a line.
7262
7263       When \N is followed by an opening brace it has a different meaning. See
7264       the section entitled "Non-printing characters" above for details.  Perl
7265       also  uses  \N{name}  to specify characters by Unicode name; PCRE2 does
7266       not support this.
7267
7268
7269MATCHING A SINGLE CODE UNIT
7270
7271       Outside a character class, the escape sequence \C matches any one  code
7272       unit,  whether or not a UTF mode is set. In the 8-bit library, one code
7273       unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
7274       32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
7275       line-ending characters. The feature is provided in  Perl  in  order  to
7276       match individual bytes in UTF-8 mode, but it is unclear how it can use-
7277       fully be used.
7278
7279       Because \C breaks up characters into individual  code  units,  matching
7280       one  unit  with  \C  in UTF-8 or UTF-16 mode means that the rest of the
7281       string may start with a malformed UTF character. This has undefined re-
7282       sults, because PCRE2 assumes that it is matching character by character
7283       in a valid UTF string (by default it checks the subject string's valid-
7284       ity  at  the  start  of  processing  unless  the  PCRE2_NO_UTF_CHECK or
7285       PCRE2_MATCH_INVALID_UTF option is used).
7286
7287       An  application  can  lock  out  the  use  of   \C   by   setting   the
7288       PCRE2_NEVER_BACKSLASH_C  option  when  compiling  a pattern. It is also
7289       possible to build PCRE2 with the use of \C permanently disabled.
7290
       PCRE2 does not allow \C to appear in lookbehind assertions (described
       below) in UTF-8 or UTF-16 modes, because this would make it impossible
       to calculate the length of the lookbehind. Neither the alternative
       matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
       these UTF modes. The former gives a match-time error; the latter fails
       to optimize, so the match is always run using the interpreter.
7297
7298       In  the  32-bit  library, however, \C is always supported (when not ex-
7299       plicitly locked out) because it always  matches  a  single  code  unit,
7300       whether or not UTF-32 is specified.
7301
7302       In general, the \C escape sequence is best avoided. However, one way of
7303       using it that avoids the problem of malformed UTF-8 or  UTF-16  charac-
7304       ters  is  to use a lookahead to check the length of the next character,
7305       as in this pattern, which could be used with  a  UTF-8  string  (ignore
7306       white space and line breaks):
7307
7308         (?| (?=[\x00-\x7f])(\C) |
7309             (?=[\x80-\x{7ff}])(\C)(\C) |
7310             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
7311             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
7312
7313       In  this  example,  a  group  that starts with (?| resets the capturing
7314       parentheses numbers in each alternative (see "Duplicate Group  Numbers"
7315       below). The assertions at the start of each branch check the next UTF-8
7316       character for values whose encoding uses 1, 2, 3, or 4  bytes,  respec-
7317       tively.  The  character's individual bytes are then captured by the ap-
7318       propriate number of \C groups.
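
       The code point ranges tested by the four lookaheads are the boundaries
       at which the UTF-8 encoding grows by one byte. The following Python
       sketch (hypothetical function name) states them and checks them against
       Python's own UTF-8 encoder.

```python
def utf8_length(cp):
    """Number of bytes in the UTF-8 encoding of code point cp."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4

samples = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
agrees = all(utf8_length(cp) == len(chr(cp).encode("utf-8")) for cp in samples)
```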
7319
7320
7321SQUARE BRACKETS AND CHARACTER CLASSES
7322
7323       An opening square bracket introduces a character class, terminated by a
7324       closing square bracket. A closing square bracket on its own is not spe-
7325       cial by default.  If a closing square bracket is required as  a  member
7326       of the class, it should be the first data character in the class (after
7327       an initial circumflex, if present) or escaped with  a  backslash.  This
7328       means  that,  by default, an empty class cannot be defined. However, if
7329       the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket  at
7330       the start does end the (empty) class.
7331
7332       A  character class matches a single character in the subject. A matched
7333       character must be in the set of characters defined by the class, unless
7334       the  first  character in the class definition is a circumflex, in which
7335       case the subject character must not be in the set defined by the class.
7336       If  a  circumflex is actually required as a member of the class, ensure
7337       it is not the first character, or escape it with a backslash.
7338
7339       For example, the character class [aeiou] matches any lower case  vowel,
7340       while  [^aeiou]  matches  any character that is not a lower case vowel.
7341       Note that a circumflex is just a convenient notation for specifying the
7342       characters  that  are in the class by enumerating those that are not. A
7343       class that starts with a circumflex is not an assertion; it still  con-
7344       sumes  a  character  from the subject string, and therefore it fails if
7345       the current pointer is at the end of the string.
7346
7347       Characters in a class may be specified by their code points  using  \o,
7348       \x,  or \N{U+hh..} in the usual way. When caseless matching is set, any
7349       letters in a class represent both their upper case and lower case  ver-
7350       sions,  so  for example, a caseless [aeiou] matches "A" as well as "a",
7351       and a caseless [^aeiou] does not match "A", whereas a  caseful  version
7352       would.  Note that there are two ASCII characters, K and S, that, in ad-
7353       dition to their lower case ASCII equivalents, are case-equivalent  with
7354       Unicode  U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
7355       ther PCRE2_UTF or PCRE2_UCP is set.
7356
7357       Characters that might indicate line breaks are  never  treated  in  any
7358       special  way  when matching character classes, whatever line-ending se-
7359       quence is  in  use,  and  whatever  setting  of  the  PCRE2_DOTALL  and
7360       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
7361       one of these characters.
7362
       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
       \S, \v, \V, \w, and \W may appear in a character class, and add the
       characters that they match to the class. For example, [\dABCDEF] matches
       any hexadecimal digit (with upper case letters only, unless caseless
       matching is set). In UTF modes, the PCRE2_UCP option affects the
       meanings of \d, \s, \w and their upper case partners, just as it does
       when they appear outside a character class, as described in the section
       entitled "Generic character types" above. The escape sequence \b has a
       different meaning inside a character class; it matches the backspace
       character. The sequences \B, \R, and \X are not special inside a
       character class. Like any other unrecognized escape sequence, each of
       them causes an error. The same is true for \N when not followed by an
       opening brace.
7375
7376       The  minus (hyphen) character can be used to specify a range of charac-
7377       ters in a character class. For example, [d-m] matches  any  letter  be-
7378       tween  d and m, inclusive. If a minus character is required in a class,
7379       it must be escaped with a backslash or appear in a  position  where  it
7380       cannot  be interpreted as indicating a range, typically as the first or
7381       last character in the class, or immediately after a range. For example,
7382       [b-d-z] matches letters in the range b to d, a hyphen character, or z.
7383
       Perl treats a hyphen as a literal if it appears before or after a POSIX
       class (see below) or before or after a character type escape such as \d
       or \H. However, unless the hyphen is the last character in the class,
       Perl outputs a warning in its warning mode, as this is most likely a
       user error. As PCRE2 has no facility for warning, an error is given in
       these cases.
7390
7391       It is not possible to have the literal character "]" as the end charac-
7392       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
7393       two characters ("W" and "-") followed by a literal string "46]", so  it
7394       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
7395       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
7396       preted  as a class containing a range followed by two other characters.
7397       The octal or hexadecimal representation of "]" can also be used to  end
7398       a range.
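
       The hyphen and closing bracket examples above can be tried with
       Python's re module, which parses these particular classes in the same
       way. It is used only as a stand-in; the helper name is hypothetical and
       nothing here calls PCRE2.

```python
import re

def members(cls, candidates):
    """Return the candidate characters that the class matches."""
    return [c for c in candidates if re.fullmatch(cls, c)]

after_range = members(r"[b-d-z]", "abcd-ez")          # b to d, hyphen, z
literal_tail = re.fullmatch(r"[W-]46]", "-46]")       # class [W-] then "46]"
escaped_end = members(r"[W-\]46]", "W[]-46")          # range W to ], plus 4, 6
```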
7399
7400       Ranges normally include all code points between the start and end char-
7401       acters, inclusive. They can also be used for code points specified  nu-
7402       merically,  for  example [\000-\037]. Ranges can include any characters
7403       that are valid for the current mode. In any  UTF  mode,  the  so-called
7404       "surrogate"  characters (those whose code points lie between 0xd800 and
7405       0xdfff inclusive) may not  be  specified  explicitly  by  default  (the
7406       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How-
7407       ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7408       are always permitted.
7409
7410       There  is  a  special  case in EBCDIC environments for ranges whose end
7411       points are both specified as literal letters in the same case. For com-
7412       patibility  with Perl, EBCDIC code points within the range that are not
7413       letters are omitted. For example, [h-k] matches only  four  characters,
7414       even though the codes for h and k are 0x88 and 0x92, a range of 11 code
7415       points. However, if the range is specified  numerically,  for  example,
7416       [\x88-\x92] or [h-\x92], all code points are included.
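
       The figures in the [h-k] example can be checked with Python's cp037
       codec. Treating code page 037 as the EBCDIC variant is an assumption
       (the text above does not name one), and the function is a hypothetical
       helper that only models the letter filtering described here.

```python
def ebcdic_letter_range(first, last, codec="cp037"):
    """Code point span and ASCII letters for an EBCDIC range of letters."""
    lo = first.encode(codec)[0]
    hi = last.encode(codec)[0]
    span = [bytes([b]).decode(codec) for b in range(lo, hi + 1)]
    letters = [ch for ch in span if ch.isascii() and ch.isalpha()]
    return lo, hi, len(span), letters

lo, hi, width, letters = ebcdic_letter_range("h", "k")
```

       The end points are 0x88 and 0x92, a span of 11 code points, of which
       only h, i, j, and k are letters of the basic alphabet.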
7417
7418       If a range that includes letters is used when caseless matching is set,
7419       it matches the letters in either case. For example, [W-c] is equivalent
7420       to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
7421       character tables for a French locale are in  use,  [\xc8-\xcb]  matches
7422       accented E characters in both cases.
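
       The stated equivalence for [W-c] can be checked over the ASCII
       characters with Python's re module, used here as a stand-in for a
       caseless PCRE2 match. Only code points below 128 are compared.

```python
import re

by_range = re.compile(r"[W-c]", re.IGNORECASE)
by_list = re.compile(r"[][\\^_`wxyzabc]", re.IGNORECASE)

ascii_chars = [chr(c) for c in range(128)]
same = all(
    bool(by_range.fullmatch(ch)) == bool(by_list.fullmatch(ch))
    for ch in ascii_chars
)
matched = "".join(ch for ch in ascii_chars if by_range.fullmatch(ch))
```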
7423
7424       A  circumflex  can  conveniently  be used with the upper case character
7425       types to specify a more restricted set of characters than the  matching
7426       lower  case  type.  For example, the class [^\W_] matches any letter or
7427       digit, but not underscore, whereas [\w] includes underscore. A positive
7428       character class should be read as "something OR something OR ..." and a
7429       negative class as "NOT something AND NOT something AND NOT ...".
7430
7431       The only metacharacters that are recognized in  character  classes  are
7432       backslash,  hyphen  (only  where  it can be interpreted as specifying a
7433       range), circumflex (only at the start), opening  square  bracket  (only
7434       when  it can be interpreted as introducing a POSIX class name, or for a
7435       special compatibility feature - see the next  two  sections),  and  the
7436       terminating  closing  square  bracket.  However, escaping other non-al-
7437       phanumeric characters does no harm.
7438
7439
7440POSIX CHARACTER CLASSES
7441
7442       Perl supports the POSIX notation for character classes. This uses names
7443       enclosed  by [: and :] within the enclosing square brackets. PCRE2 also
7444       supports this notation. For example,
7445
7446         [01[:alpha:]%]
7447
7448       matches "0", "1", any alphabetic character, or "%". The supported class
7449       names are:
7450
         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits and space
         space    white space (the same as \s from PCRE 8.34)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits
7465
7466       The  default  "space" characters are HT (9), LF (10), VT (11), FF (12),
7467       CR (13), and space (32). If locale-specific matching is  taking  place,
7468       the  list  of  space characters may be different; there may be fewer or
7469       more of them. "Space" and \s match the same set of characters.
7470
7471       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
7472       from  Perl  5.8. Another Perl extension is negation, which is indicated
7473       by a ^ character after the colon. For example,
7474
7475         [12[:^digit:]]
7476
7477       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7478       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
7479       these are not supported, and an error is given if they are encountered.
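
       Python's re module does not support POSIX class names, so the sketch
       below models the default classes as plain sets of ASCII characters. It
       assumes the default tables with no locale or PCRE2_UCP in effect; the
       names POSIX and posix_class are hypothetical, and negation is modelled
       only within the ASCII range although a real negated class also matches
       characters above 127.

```python
import string

ASCII = {chr(c) for c in range(128)}
POSIX = {
    "alnum": set(string.ascii_letters + string.digits),
    "alpha": set(string.ascii_letters),
    "ascii": set(ASCII),
    "blank": {" ", "\t"},
    "cntrl": {chr(c) for c in range(32)} | {"\x7f"},
    "digit": set(string.digits),
    "graph": {chr(c) for c in range(33, 127)},
    "lower": set(string.ascii_lowercase),
    "print": {chr(c) for c in range(32, 127)},
    "punct": set(string.punctuation),
    "space": set("\t\n\v\f\r "),
    "upper": set(string.ascii_uppercase),
    "word": set(string.ascii_letters + string.digits + "_"),
    "xdigit": set(string.hexdigits),
}

def posix_class(name):
    """Members of [:name:], or of [:^name:] if name starts with ^."""
    if name.startswith("^"):
        return ASCII - POSIX[name[1:]]
    return set(POSIX[name])

example = {"0", "1", "%"} | posix_class("alpha")   # [01[:alpha:]%]
negated = {"1", "2"} | posix_class("^digit")       # [12[:^digit:]]
```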
7480
7481       By default, characters with values greater than 127 do not match any of
7482       the POSIX character classes, although this may be different for charac-
7483       ters in the range 128-255 when locale-specific matching  is  happening.
7484       However,  if the PCRE2_UCP option is passed to pcre2_compile(), some of
7485       the classes are changed so that Unicode character properties are  used.
7486       This  is  achieved  by  replacing  certain POSIX classes with other se-
7487       quences, as follows:
7488
7489         [:alnum:]  becomes  \p{Xan}
7490         [:alpha:]  becomes  \p{L}
7491         [:blank:]  becomes  \h
7492         [:cntrl:]  becomes  \p{Cc}
7493         [:digit:]  becomes  \p{Nd}
7494         [:lower:]  becomes  \p{Ll}
7495         [:space:]  becomes  \p{Xps}
7496         [:upper:]  becomes  \p{Lu}
7497         [:word:]   becomes  \p{Xwd}
7498
7499       Negated versions, such as [:^alpha:] use \P instead of \p. Three  other
7500       POSIX classes are handled specially in UCP mode:
7501
7502       [:graph:] This  matches  characters that have glyphs that mark the page
7503                 when printed. In Unicode property terms, it matches all char-
7504                 acters with the L, M, N, P, S, or Cf properties, except for:
7505
7506                   U+061C           Arabic Letter Mark
7507                   U+180E           Mongolian Vowel Separator
7508                   U+2066 - U+2069  Various "isolate"s
7509
7510
7511       [:print:] This  matches  the  same  characters  as [:graph:] plus space
7512                 characters that are not controls, that  is,  characters  with
7513                 the Zs property.
7514
7515       [:punct:] This matches all characters that have the Unicode P (punctua-
7516                 tion) property, plus those characters with code  points  less
7517                 than 256 that have the S (Symbol) property.
7518
7519       The  other  POSIX classes are unchanged, and match only characters with
7520       code points less than 256.
7521
7522
7523COMPATIBILITY FEATURE FOR WORD BOUNDARIES
7524
7525       In the POSIX.2 compliant library that was included in 4.4BSD Unix,  the
7526       ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word"
7527       and "end of word". PCRE2 treats these items as follows:
7528
7529         [[:<:]]  is converted to  \b(?=\w)
7530         [[:>:]]  is converted to  \b(?<=\w)
7531
       Only these exact character sequences are recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
       support is not compatible with Perl. It is provided to help migrations
       from other environments, and is best not used in any new patterns. Note
       that \b matches at the start and the end of a word (see "Simple
       assertions" above), and in a Perl-style pattern the preceding or
       following character normally shows which is wanted, without the need for
       the assertions that are used above in order to give exactly the POSIX
       behaviour.
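
       The two converted forms use only constructs that Python's re module
       also supports, so their effect can be shown with it. Python does not
       accept the [[:<:]] and [[:>:]] syntax itself; only the replacement
       patterns are exercised, as a stand-in for PCRE2.

```python
import re

subject = "ab cd"
# [[:<:]] is converted to \b(?=\w): a boundary followed by a word character
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]
# [[:>:]] is converted to \b(?<=\w): a boundary preceded by a word character
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]
```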
7541
7542
7543VERTICAL BAR
7544
7545       Vertical bar characters are used to separate alternative patterns.  For
7546       example, the pattern
7547
7548         gilbert|sullivan
7549
7550       matches  either "gilbert" or "sullivan". Any number of alternatives may
7551       appear, and an empty  alternative  is  permitted  (matching  the  empty
7552       string). The matching process tries each alternative in turn, from left
7553       to right, and the first one that succeeds is used. If the  alternatives
7554       are  within a group (defined below), "succeeds" means matching the rest
7555       of the main pattern as well as the alternative in the group.
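
       The left-to-right rule can be seen with Python's re module, which
       tries alternatives in the same order. It is a stand-in for PCRE2 in
       this sketch.

```python
import re

either = re.search(r"gilbert|sullivan", "arthur sullivan").group()
first_wins = re.match(r"a|ab", "ab").group()         # "a": first success is used
needs_rest = re.match(r"(a|ab)c", "abc").group(1)    # "ab": "a" fails before "c"
empty_alt = re.fullmatch(r"colou?r|", "") is not None  # empty alternative
```

       In the third pattern the alternative "a" matches, but the rest of the
       pattern does not, so the second alternative is tried and used.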
7556
7557
7558INTERNAL OPTION SETTING
7559
7560       The settings  of  the  PCRE2_CASELESS,  PCRE2_MULTILINE,  PCRE2_DOTALL,
7561       PCRE2_EXTENDED,  PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
7562       can be changed from within the pattern by a  sequence  of  letters  en-
7563       closed  between  "(?"   and ")". These options are Perl-compatible, and
7564       are described in detail in the pcre2api documentation. The option  let-
7565       ters are:
7566
7567         i  for PCRE2_CASELESS
7568         m  for PCRE2_MULTILINE
7569         n  for PCRE2_NO_AUTO_CAPTURE
7570         s  for PCRE2_DOTALL
7571         x  for PCRE2_EXTENDED
7572         xx for PCRE2_EXTENDED_MORE
7573
7574       For example, (?im) sets caseless, multiline matching. It is also possi-
7575       ble to unset these options by preceding the relevant letters with a hy-
7576       phen,  for  example (?-im). The two "extended" options are not indepen-
7577       dent; unsetting either one cancels the effects of both of them.
7578
7579       A  combined  setting  and  unsetting  such  as  (?im-sx),  which   sets
7580       PCRE2_CASELESS  and  PCRE2_MULTILINE  while  unsetting PCRE2_DOTALL and
7581       PCRE2_EXTENDED, is also permitted. Only one hyphen may  appear  in  the
7582       options  string.  If a letter appears both before and after the hyphen,
7583       the option is unset. An empty options setting "(?)" is  allowed.  Need-
7584       less to say, it has no effect.
7585
7586       If  the  first character following (? is a circumflex, it causes all of
7587       the above options to be unset. Thus, (?^) is equivalent  to  (?-imnsx).
7588       Letters  may  follow  the circumflex to cause some options to be re-in-
7589       stated, but a hyphen may not appear.
7590
7591       The PCRE2-specific options PCRE2_DUPNAMES  and  PCRE2_UNGREEDY  can  be
7592       changed  in  the  same  way as the Perl-compatible options by using the
7593       characters J and U respectively. However, these are not unset by (?^).
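
       The rules for an option setting string can be summarized in a small
       Python model. This is a sketch of the rules as described above, not
       PCRE2's parser: the function names are hypothetical, options are
       represented by their names as strings, and whether "xx" also implies
       "x" is not modelled.

```python
PERL_LETTERS = {
    "i": "PCRE2_CASELESS",
    "m": "PCRE2_MULTILINE",
    "n": "PCRE2_NO_AUTO_CAPTURE",
    "s": "PCRE2_DOTALL",
    "x": "PCRE2_EXTENDED",
    "xx": "PCRE2_EXTENDED_MORE",
}
EXTRA_LETTERS = {"J": "PCRE2_DUPNAMES", "U": "PCRE2_UNGREEDY"}  # kept by (?^)
EXTENDED = {"PCRE2_EXTENDED", "PCRE2_EXTENDED_MORE"}

def _names(letters):
    """Translate a run of option letters into option names."""
    table = {**PERL_LETTERS, **EXTRA_LETTERS}
    names = []
    i = 0
    while i < len(letters):
        token = "xx" if letters.startswith("xx", i) else letters[i]
        if token not in table:
            raise ValueError("unknown option letter: " + token)
        names.append(table[token])
        i += len(token)
    return names

def apply_setting(active, setting):
    """Options in force after (?setting), given those already in active."""
    active = set(active)
    if setting.startswith("^"):
        if "-" in setting:
            raise ValueError("a hyphen may not follow a circumflex")
        active -= set(PERL_LETTERS.values())
        setting = setting[1:]
    if setting.count("-") > 1:
        raise ValueError("only one hyphen may appear")
    on, _, off = setting.partition("-")
    active |= set(_names(on))
    for name in _names(off):
        # unsetting either "extended" option cancels both
        active -= EXTENDED if name in EXTENDED else {name}
    return active
```

       For example, apply_setting({"PCRE2_DOTALL"}, "im-sx") leaves only
       PCRE2_CASELESS and PCRE2_MULTILINE set.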
7594
7595       When one of these option changes occurs at top level (that is, not  in-
7596       side  group  parentheses),  the  change applies to the remainder of the
7597       pattern that follows. An option change within a group (see below for  a
7598       description of groups) affects only that part of the group that follows
7599       it, so
7600
7601         (a(?i)b)c
7602
7603       matches abc and aBc and no other strings  (assuming  PCRE2_CASELESS  is
7604       not  used).   By this means, options can be made to have different set-
7605       tings in different parts of the pattern. Any changes made in one alter-
7606       native  do carry on into subsequent branches within the same group. For
7607       example,
7608
7609         (a(?i)b|c)
7610
7611       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
7612       first  branch  is  abandoned before the option setting. This is because
7613       the effects of option settings happen at compile time. There  would  be
7614       some very weird behaviour otherwise.
7615
7616       As  a  convenient shorthand, if any option settings are required at the
7617       start of a non-capturing group (see the next section), the option  let-
7618       ters may appear between the "?" and the ":". Thus the two patterns
7619
7620         (?i:saturday|sunday)
7621         (?:(?i)saturday|sunday)
7622
7623       match exactly the same set of strings.
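
       The scoping of the shorthand form can be checked with a small
       sketch. It uses Python's re module rather than PCRE2 (an assumption
       made purely for illustration; the helper name is hypothetical).
       Python accepts the (?i:...) form, but from release 3.11 it rejects
       an (?i) setting that is not at the very start of the pattern, so
       only the first of the two patterns above is exercised.

```python
import re

# Scoped option: caseless matching applies only inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")


def is_weekend(word):
    """Return True if the whole word matches the scoped-caseless pattern."""
    return weekend.fullmatch(word) is not None


# The option does not leak past the closing parenthesis of the group:
# "a" is matched caselessly, "b" is still matched casefully.
scoped = re.compile(r"(?i:a)b")

print(is_weekend("SUNDAY"), is_weekend("Saturday"), is_weekend("monday"))
print(scoped.fullmatch("Ab") is not None, scoped.fullmatch("AB") is not None)
```

       The second pattern shows that the setting ends with the group, which
       is the same rule that the text gives for option changes inside
       groups.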
7624
7625       Note:  There  are  other  PCRE2-specific options, applying to the whole
7626       pattern, which can be set by the application when the  compiling  func-
7627       tion  is  called.  In addition, the pattern can contain special leading
7628       sequences such as (*CRLF) to override what the application has  set  or
7629       what  has  been  defaulted.   Details are given in the section entitled
7630       "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
7631       sequences  that can be used to set UTF and Unicode property modes; they
7632       are equivalent to setting the PCRE2_UTF and PCRE2_UCP options,  respec-
7633       tively.  However,  the  application  can  set  the  PCRE2_NEVER_UTF and
7634       PCRE2_NEVER_UCP options, which lock out  the  use  of  the  (*UTF)  and
7635       (*UCP) sequences.
7636
7637
7638GROUPS
7639
7640       Groups  are  delimited  by  parentheses  (round brackets), which can be
7641       nested.  Turning part of a pattern into a group does two things:
7642
7643       1. It localizes a set of alternatives. For example, the pattern
7644
7645         cat(aract|erpillar|)
7646
7647       matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
7648       it would match "cataract", "erpillar" or an empty string.
7649
7650       2.  It  creates a "capture group". This means that, when the whole pat-
7651       tern matches, the portion of the subject string that matched the  group
7652       is  passed back to the caller, separately from the portion that matched
7653       the whole pattern.  (This applies  only  to  the  traditional  matching
7654       function; the DFA matching function does not support capturing.)
7655
7656       Opening parentheses are counted from left to right (starting from 1) to
7657       obtain numbers for capture groups. For example, if the string "the  red
7658       king" is matched against the pattern
7659
7660         the ((red|white) (king|queen))
7661
7662       the captured substrings are "red king", "red", and "king", and are num-
7663       bered 1, 2, and 3, respectively.
7664
7665       The fact that plain parentheses fulfil  two  functions  is  not  always
7666       helpful.   There are often times when grouping is required without cap-
7667       turing. If an opening parenthesis is followed by a question mark and  a
7668       colon,  the  group  does  not do any capturing, and is not counted when
7669       computing the number of any subsequent capture groups. For example,  if
7670       the string "the white queen" is matched against the pattern
7671
7672         the ((?:red|white) (king|queen))
7673
7674       the captured substrings are "white queen" and "queen", and are numbered
7675       1 and 2. The maximum number of capture groups is 65535.
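
       The numbering rules can be confirmed with a short sketch. It assumes
       Python's re module as a stand-in for PCRE2; for these two patterns
       the group-counting rule is the same (count opening parentheses from
       the left, skipping (?: groups).

```python
import re

# Three capturing groups: the outer one is number 1.
capturing = re.compile(r"the ((red|white) (king|queen))")

# The (?: group is not counted, so (king|queen) becomes number 2.
mixed = re.compile(r"the ((?:red|white) (king|queen))")

groups_1 = capturing.fullmatch("the red king").groups()
groups_2 = mixed.fullmatch("the white queen").groups()

print(capturing.groups, groups_1)
print(mixed.groups, groups_2)
```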
7676
7677       As a convenient shorthand, if any option settings are required  at  the
7678       start  of  a non-capturing group, the option letters may appear between
7679       the "?" and the ":". Thus the two patterns
7680
7681         (?i:saturday|sunday)
7682         (?:(?i)saturday|sunday)
7683
7684       match exactly the same set of strings. Because alternative branches are
7685       tried  from  left  to right, and options are not reset until the end of
7686       the group is reached, an option setting in one branch does affect  sub-
7687       sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
7688       urday".
7689
7690
7691DUPLICATE GROUP NUMBERS
7692
7693       Perl 5.10 introduced a feature whereby each alternative in a group uses
7694       the  same  numbers  for  its capturing parentheses. Such a group starts
7695       with (?| and is itself a non-capturing  group.  For  example,  consider
7696       this pattern:
7697
7698         (?|(Sat)ur|(Sun))day
7699
7700       Because  the two alternatives are inside a (?| group, both sets of cap-
7701       turing parentheses are numbered one. Thus, when  the  pattern  matches,
7702       you  can  look  at captured substring number one, whichever alternative
7703       matched. This construct is useful when you want to  capture  part,  but
7704       not all, of one of a number of alternatives. Inside a (?| group, paren-
7705       theses are numbered as usual, but the number is reset at the  start  of
7706       each  branch.  The numbers of any capturing parentheses that follow the
7707       whole group start after the highest number used in any branch. The fol-
7708       lowing example is taken from the Perl documentation. The numbers under-
7709       neath show in which buffer the captured content will be stored.
7710
7711         # before  ---------------branch-reset----------- after
7712         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
7713         # 1            2         2  3        2     3     4
7714
7715       A backreference to a capture group uses the most recent value  that  is
7716       set for the group. The following pattern matches "abcabc" or "defdef":
7717
7718         /(?|(abc)|(def))\1/
7719
7720       In  contrast, a subroutine call to a capture group always refers to the
7721       first one in the pattern with the given number. The  following  pattern
7722       matches "abcabc" or "defabc":
7723
7724         /(?|(abc)|(def))(?1)/
7725
7726       A relative reference such as (?-1) is no different: it is just a conve-
7727       nient way of computing an absolute group number.
7728
7729       If a condition test for a group's having matched refers to a non-unique
7730       number, the test is true if any group with that number has matched.
7731
7732       An  alternative approach to using this "branch reset" feature is to use
7733       duplicate named groups, as described in the next section.
7734
7735
7736NAMED CAPTURE GROUPS
7737
7738       Identifying capture groups by number is simple, but it can be very hard
7739       to  keep  track of the numbers in complicated patterns. Furthermore, if
7740       an expression is modified, the numbers may change. To  help  with  this
7741       difficulty,  PCRE2  supports the naming of capture groups. This feature
7742       was not added to Perl until release 5.10. Python had the  feature  ear-
7743       lier,  and PCRE1 introduced it at release 4.0, using the Python syntax.
7744       PCRE2 supports both the Perl and the Python syntax.
7745
7746       In PCRE2,  a  capture  group  can  be  named  in  one  of  three  ways:
7747       (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
7748       Names may be up to 32 code units long. When PCRE2_UTF is not set,  they
7749       may  contain  only  ASCII  alphanumeric characters and underscores, but
7750       must start with a non-digit. When PCRE2_UTF is set, the syntax of group
7751       names is extended to allow any Unicode letter or Unicode decimal digit.
7752       In other words, group names must match one of these patterns:
7753
7754         ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
7755         ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set
7756
7757       References to capture groups from other parts of the pattern,  such  as
7758       backreferences,  recursion,  and conditions, can all be made by name as
7759       well as by number.
7760
7761       Named capture groups are allocated numbers as well as names, exactly as
7762       if  the  names were not present. In both PCRE2 and Perl, capture groups
7763       are primarily identified by numbers; any names  are  just  aliases  for
7764       these numbers. The PCRE2 API provides function calls for extracting the
7765       complete name-to-number translation table from a compiled  pattern,  as
7766       well  as  convenience  functions  for extracting captured substrings by
7767       name.
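
       As an illustration of names being aliases for numbers, the sketch
       below uses the Python syntax (?P<name>...) with Python's re module.
       This is an assumption for demonstration only: the groupindex
       attribute is Python's analogue of a name-to-number table and is not
       part of the PCRE2 API, whose own functions are described in the
       pcre2api documentation.

```python
import re

# Two named groups followed by one unnamed group.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Name-to-number table: names are allocated numbers as if they were absent.
name_table = dict(date.groupindex)

m = date.fullmatch("2024-02-29")

# A named group can be fetched by name or by its number.
by_name = m.group("year")
by_number = m.group(1)

print(name_table, by_name, by_number, m.group(3))
```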
7768
7769       Warning: When more than one capture group has the same number,  as  de-
7770       scribed in the previous section, a name given to one of them applies to
7771       all of them. Perl allows identically numbered groups to have  different
7772       names.  Consider this pattern, where there are two capture groups, both
7773       numbered 1:
7774
7775         (?|(?<AA>aa)|(?<BB>bb))
7776
7777       Perl allows this, with both names AA and BB  as  aliases  of  group  1.
7778       Thus, after a successful match, both names yield the same value (either
7779       "aa" or "bb").
7780
7781       In an attempt to reduce confusion, PCRE2 does not allow the same  group
7782       number to be associated with more than one name. The example above pro-
7783       vokes a compile-time error. However, there is still  scope  for  confu-
7784       sion. Consider this pattern:
7785
7786         (?|(?<AA>aa)|(bb))
7787
7788       Although the second group number 1 is not explicitly named, the name AA
7789       is still an alias for any group 1. Whether the pattern matches "aa"  or
7790       "bb", a reference by name to group AA yields the matched string.
7791
7792       By  default, a name must be unique within a pattern, except that dupli-
7793       cate names are permitted for groups with the same number, for example:
7794
7795         (?|(?<AA>aa)|(?<AA>bb))
7796
7797       The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7798       NAMES option at compile time, or by the use of (?J) within the pattern,
7799       as described in the section entitled "Internal Option Setting" above.
7800
7801       Duplicate names can be useful for patterns where only one  instance  of
7802       the  named  capture group can match. Suppose you want to match the name
7803       of a weekday, either as a 3-letter abbreviation or as  the  full  name,
7804       and  in  both  cases you want to extract the abbreviation. This pattern
7805       (ignoring the line breaks) does the job:
7806
7807         (?J)
7808         (?<DN>Mon|Fri|Sun)(?:day)?|
7809         (?<DN>Tue)(?:sday)?|
7810         (?<DN>Wed)(?:nesday)?|
7811         (?<DN>Thu)(?:rsday)?|
7812         (?<DN>Sat)(?:urday)?
7813
7814       There are five capture groups, but only one is ever set after a  match.
7815       The convenience functions for extracting the data by  name  return  the
7816       substring for the first (and in this example, the only) group  of  that
7817       name that matched. This saves searching to find which numbered group it
7818       was. (An alternative way of solving this problem is to  use  a  "branch
7819       reset" group, as described in the previous section.)
7820
7821       If  you make a backreference to a non-unique named group from elsewhere
7822       in the pattern, the groups to which the name refers are checked in  the
7823       order  in  which they appear in the overall pattern. The first one that
7824       is set is used for the reference. For  example,  this  pattern  matches
7825       both "foofoo" and "barbar" but not "foobar" or "barfoo":
7826
7827         (?J)(?:(?<n>foo)|(?<n>bar))\k<n>
7828
7829
7830       If you make a subroutine call to a non-unique named group, the one that
7831       corresponds to the first occurrence of the name is used. In the absence
7832       of duplicate numbers this is the one with the lowest number.
7833
7834       If you use a named reference in a condition test (see the section about
7835       conditions below), either to check whether a capture group has matched,
7836       or to check for recursion, all groups with the same name are tested. If
7837       the condition is true for any one of them,  the  overall  condition  is
7838       true.  This is the same behaviour as testing by number. For further de-
7839       tails of the interfaces for handling  named  capture  groups,  see  the
7840       pcre2api documentation.
7841
7842
7843REPETITION
7844
7845       Repetition  is  specified  by  quantifiers, which can follow any of the
7846       following items:
7847
7848         a literal data character
7849         the dot metacharacter
7850         the \C escape sequence
7851         the \R escape sequence
7852         the \X escape sequence
7853         an escape such as \d or \pL that matches a single character
7854         a character class
7855         a backreference
7856         a parenthesized group (including lookaround assertions)
7857         a subroutine call (recursive or otherwise)
7858
7859       The general repetition quantifier specifies a minimum and maximum  num-
7860       ber  of  permitted matches, by giving the two numbers in curly brackets
7861       (braces), separated by a comma. The numbers must be  less  than  65536,
7862       and the first must be less than or equal to the second. For example,
7863
7864         z{2,4}
7865
7866       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
7867       special character. If the second number is omitted, but  the  comma  is
7868       present,  there  is  no upper limit; if the second number and the comma
7869       are both omitted, the quantifier specifies an exact number of  required
7870       matches. Thus
7871
7872         [aeiou]{3,}
7873
7874       matches at least 3 successive vowels, but may match many more, whereas
7875
7876         \d{8}
7877
7878       matches  exactly  8  digits. An opening curly bracket that appears in a
7879       position where a quantifier is not allowed, or one that does not  match
7880       the  syntax of a quantifier, is taken as a literal character. For exam-
7881       ple, {,6} is not a quantifier, but a literal string of four characters.
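
       The three well-formed quantifier shapes above can be tried out with
       a short sketch. It assumes Python's re module in place of PCRE2, and
       the helper name is hypothetical. It deliberately leaves out {,6}:
       Python treats that form as a quantifier with a minimum of zero, so
       it is not a faithful model of the literal-string rule described
       here.

```python
import re


def full(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


# z{2,4}: between two and four repetitions (tested with 1 to 5 letters).
z_results = [full(r"z{2,4}", "z" * n) for n in range(1, 6)]

# [aeiou]{3,}: at least three, with no upper limit.
many_vowels = full(r"[aeiou]{3,}", "aeiouaeiou")
two_vowels = full(r"[aeiou]{3,}", "ae")

# \d{8}: exactly eight.
eight_digits = full(r"\d{8}", "20240229")
seven_digits = full(r"\d{8}", "2024022")

print(z_results, many_vowels, two_vowels, eight_digits, seven_digits)
```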
7882
7883       In UTF modes, quantifiers apply to characters rather than to individual
7884       code  units. Thus, for example, \x{100}{2} matches two characters, each
7885       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7886       larly,  \X{3} matches three Unicode extended grapheme clusters, each of
7887       which may be several code units long (and  they  may  be  of  different
7888       lengths).
7889
7890       The quantifier {0} is permitted, causing the expression to behave as if
7891       the previous item and the quantifier were not present. This may be use-
7892       ful  for  capture  groups that are referenced as subroutines from else-
7893       where in the pattern (but see also the section entitled "Defining  cap-
7894       ture groups for use by reference only" below). Except for parenthesized
7895       groups, items that have a {0} quantifier are omitted from the  compiled
7896       pattern.
7897
7898       For  convenience, the three most common quantifiers have single-charac-
7899       ter abbreviations:
7900
7901         *    is equivalent to {0,}
7902         +    is equivalent to {1,}
7903         ?    is equivalent to {0,1}
7904
7905       It is possible to construct infinite loops by following  a  group  that
7906       can  match no characters with a quantifier that has no upper limit, for
7907       example:
7908
7909         (a?)*
7910
7911       Earlier versions of Perl and PCRE1 used to give  an  error  at  compile
7912       time for such patterns. However, because there are cases where this can
7913       be useful, such patterns are now accepted, but whenever an iteration of
7914       such  a group matches no characters, matching moves on to the next item
7915       in the pattern instead of repeatedly matching  an  empty  string.  This
7916       does  not  prevent  backtracking into any of the iterations if a subse-
7917       quent item fails to match.
7918
7919       By default, quantifiers are "greedy", that is, they match  as  much  as
7920       possible (up to the maximum number of permitted times), without causing
7921       the rest of the pattern to fail. The  classic  example  of  where  this
7922       gives  problems is in trying to match comments in C programs. These ap-
7923       pear between /* and */ and within the comment, individual * and / char-
7924       acters  may appear. An attempt to match C comments by applying the pat-
7925       tern
7926
7927         /\*.*\*/
7928
7929       to the string
7930
7931         /* first comment */  not comment  /* second comment */
7932
7933       fails, because it matches the entire string owing to the greediness  of
7934       the  .*  item. However, if a quantifier is followed by a question mark,
7935       it ceases to be greedy, and instead matches the minimum number of times
7936       possible, so the pattern
7937
7938         /\*.*?\*/
7939
7940       does  the  right  thing with the C comments. The meaning of the various
7941       quantifiers is not otherwise changed,  just  the  preferred  number  of
7942       matches.   Do  not  confuse this use of question mark with its use as a
7943       quantifier in its own right. Because it has two uses, it can  sometimes
7944       appear doubled, as in
7945
7946         \d??\d
7947
7948       which matches one digit by preference, but can match two if that is the
7949       only way the rest of the pattern matches.
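
       The difference between the greedy and the minimizing forms can be
       seen in a small sketch. It assumes Python's re module, which shares
       this part of the Perl quantifier semantics; it is not PCRE2 itself.

```python
import re

source = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the string.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing .*? stops at the first */ after each /*.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the match
# (here, reaching the end of the string) cannot succeed otherwise.
preferred = re.match(r"\d??\d", "12").group()
forced = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(preferred, forced)
```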
7950
7951       If the PCRE2_UNGREEDY option is set (an option that is not available in
7952       Perl),  the  quantifiers are not greedy by default, but individual ones
7953       can be made greedy by following them with a  question  mark.  In  other
7954       words, it inverts the default behaviour.
7955
7956       When  a  parenthesized  group is quantified with a minimum repeat count
7957       that is greater than 1 or with a limited maximum, more  memory  is  re-
7958       quired for the compiled pattern, in proportion to the size of the mini-
7959       mum or maximum.
7960
7961       If a pattern starts with  .*  or  .{0,}  and  the  PCRE2_DOTALL  option
7962       (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
7963       lines, the pattern is implicitly  anchored,  because  whatever  follows
7964       will  be  tried against every character position in the subject string,
7965       so there is no point in retrying the overall match at any position  af-
7966       ter  the  first. PCRE2 normally treats such a pattern as though it were
7967       preceded by \A.
7968
7969       In cases where it is known that the subject  string  contains  no  new-
7970       lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
7971       mization, or alternatively, using ^ to indicate anchoring explicitly.
7972
7973       However, there are some cases where the optimization  cannot  be  used.
7974       When  .*   is  inside  capturing  parentheses that are the subject of a
7975       backreference elsewhere in the pattern, a match at the start  may  fail
7976       where a later one succeeds. Consider, for example:
7977
7978         (.*)abc\1
7979
7980       If  the subject is "xyz123abc123" the match point is the fourth charac-
7981       ter. For this reason, such a pattern is not implicitly anchored.
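
       A quick check of this example, sketched with Python's re module as
       an assumed stand-in for PCRE2. Note that the reported offset is
       zero-based, so offset 3 is the fourth character.

```python
import re

subject = "xyz123abc123"
pattern = re.compile(r"(.*)abc\1")

# Unanchored search: the match is found at a later starting position.
found = pattern.search(subject)

# Anchored attempt at the start of the subject: this one fails, which is
# why treating the pattern as if it began with \A would be wrong.
anchored = pattern.match(subject)

print(found.start(), found.group(0), found.group(1), anchored)
```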
7982
7983       Another case where implicit anchoring is not applied is when the  lead-
7984       ing  .* is inside an atomic group. Once again, a match at the start may
7985       fail where a later one succeeds. Consider this pattern:
7986
7987         (?>.*?a)b
7988
7989       It matches "ab" in the subject "aab". The use of  the  backtracking
7990       control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
7991       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
7992
7993       When a capture group is repeated, the value captured is  the  substring
7994       that matched the final iteration. For example, after
7995
7996         (tweedle[dume]{3}\s*)+
7997
7998       has matched "tweedledum tweedledee" the value of the captured substring
7999       is "tweedledee". However, if there are nested capture groups, the  cor-
8000       responding  captured  values  may have been set in previous iterations.
8001       For example, after
8002
8003         (a|(b))+
8004
8005       matches "aba" the value of the second captured substring is "b".
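
       Both examples can be reproduced in a short sketch. It assumes
       Python's re module, which keeps the same values for repeated and
       nested groups in these two cases; other engines (ECMAScript, for
       one) reset nested groups at each iteration, so this is not a
       universal rule.

```python
import re

# A repeated group reports the substring from its final iteration.
m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The nested group (b) took part only in the second iteration, but its
# value is still set after the third iteration matches "a".
m2 = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = m2.groups()

print(last_iteration, outer, inner)
```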
8006
8007
8008ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
8009
8010       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
8011       repetition,  failure  of what follows normally causes the repeated item
8012       to be re-evaluated to see if a different number of repeats  allows  the
8013       rest  of  the pattern to match. Sometimes it is useful to prevent this,
8014       either to change the nature of the match, or to cause it to fail earlier
8015       than  it otherwise might, when the author of the pattern knows there is
8016       no point in carrying on.
8017
8018       Consider, for example, the pattern \d+foo when applied to  the  subject
8019       line
8020
8021         123456bar
8022
8023       After matching all 6 digits and then failing to match "foo", the normal
8024       action of the matcher is to try again with only 5 digits  matching  the
8025       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
8026       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
8027       the means for specifying that once a group has matched, it is not to be
8028       re-evaluated in this way.
8029
8030       If we use atomic grouping for the previous example, the  matcher  gives
8031       up  immediately  on failing to match "foo" the first time. The notation
8032       is a kind of special parenthesis, starting with (?> as in this example:
8033
8034         (?>\d+)foo
8035
8036       Perl 5.28 introduced an experimental alphabetic form starting  with  (*
8037       which may be easier to remember:
8038
8039         (*atomic:\d+)foo
8040
8041       This kind of parenthesized group "locks up" the  part of the pattern it
8042       contains once it has matched, and a failure further into the pattern is
8043       prevented  from  backtracking into it. Backtracking past it to previous
8044       items, however, works as normal.
8045
8046       An alternative description is that a group of this type matches exactly
8047       the  string  of  characters  that an identical standalone pattern would
8048       match, if anchored at the current point in the subject string.
8049
8050       Atomic groups are not capture groups. Simple cases such  as  the  above
8051       example  can be thought of as a maximizing repeat that must swallow ev-
8052       erything it can.  So, while both \d+ and \d+? are  prepared  to  adjust
8053       the  number  of digits they match in order to make the rest of the pat-
8054       tern match, (?>\d+) can only match an entire sequence of digits.
8055
8056       Atomic groups in general can of course contain arbitrarily  complicated
8057       expressions, and can be nested. However, when the contents of an atomic
8058       group are just a single repeated item, as in the example  above,  a
8059       simpler notation, called a "possessive quantifier", can be used. This
8060       consists of an additional + character following a  quantifier.  Using
8061       this notation, the previous example can be rewritten as
8062
8063         \d++foo
8064
8065       Note that a possessive quantifier can be used with an entire group, for
8066       example:
8067
8068         (abc|xyz){2,3}+
8069
8070       Possessive quantifiers are always greedy; the setting of the  PCRE2_UN-
8071       GREEDY  option  is ignored. They are a convenient notation for the sim-
8072       pler forms of atomic group. However, there  is  no  difference  in  the
8073       meaning  of  a  possessive  quantifier and the equivalent atomic group,
8074       though there may be a performance  difference;  possessive  quantifiers
8075       should be slightly faster.
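
       The \d+foo example fails with or without an atomic group; only the
       amount of work differs. To make the change in meaning visible, the
       sketch below uses a subject on which the two forms give different
       answers. It assumes Python's re module, which accepts (?>...) and
       possessive quantifiers only from release 3.11, so those cases are
       guarded by a version check.

```python
import re
import sys

# Atomic groups and possessive quantifiers need Python 3.11 or later.
HAS_ATOMIC = sys.version_info >= (3, 11)

subject = "12345"

# Ordinary greedy repeat: \d+ first takes all five digits, then gives
# back the final "5" so that the literal 5 can match.
backtracking = re.fullmatch(r"\d+5", subject) is not None

if HAS_ATOMIC:
    # The atomic group keeps all five digits, so the literal 5 cannot match.
    atomic = re.fullmatch(r"(?>\d+)5", subject) is not None
    # The possessive quantifier is the shorter notation for the same thing.
    possessive = re.fullmatch(r"\d++5", subject) is not None
else:
    atomic = possessive = None

print(backtracking, atomic, possessive)
```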
8076
8077       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
8078       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
8079       edition of his book. Mike McCloskey liked it, so implemented it when he
8080       built Sun's Java package, and PCRE1 copied it from there. It found  its
8081       way into Perl at release 5.10.
8082
8083       PCRE2  has  an  optimization  that automatically "possessifies" certain
8084       simple pattern constructs. For example, the sequence A+B is treated  as
8085       A++B  because  there is no point in backtracking into a sequence of A's
8086       when B must follow. To disable this feature, set PCRE2_NO_AUTO_POSSESS
8087       or start the pattern with (*NO_AUTO_POSSESS).
8088
8089       When a pattern contains an unlimited repeat inside a group that can it-
8090       self be repeated an unlimited number of times, the  use  of  an  atomic
8091       group  is the only way to avoid some failing matches taking a very long
8092       time indeed. The pattern
8093
8094         (\D+|<\d+>)*[!?]
8095
8096       matches an unlimited number of substrings that either consist  of  non-
8097       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
8098       matches, it runs quickly. However, if it is applied to
8099
8100         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
8101
8102       it takes a long time before reporting  failure.  This  is  because  the
8103       string  can be divided between the internal \D+ repeat and the external
8104       * repeat in a large number of ways, and all have to be tried. (The  ex-
8105       ample uses [!?] rather than a single character at the end, because both
8106       PCRE2 and Perl have an optimization that allows for fast failure when a
8107       single  character is used. They remember the last single character that
8108       is required for a match, and fail early if it is  not  present  in  the
8109       string.)  If  the  pattern  is changed so that it uses an atomic group,
8110       like this:
8111
8112         ((?>\D+)|<\d+>)*[!?]
8113
8114       sequences of non-digits cannot be broken, and failure happens quickly.
8115
8116
8117BACKREFERENCES
8118
8119       Outside a character class, a backslash followed by a digit greater than
8120       0  (and  possibly further digits) is a backreference to a capture group
8121       earlier (that is, to its left) in the pattern, provided there have been
8122       that many previous capture groups.
8123
8124       However,  if the decimal number following the backslash is less than 8,
8125       it is always taken as a backreference, and  causes  an  error  only  if
8126       there  are not that many capture groups in the entire pattern. In other
8127       words, the group that is referenced need not be to the left of the ref-
8128       erence  for numbers less than 8. A "forward backreference" of this type
8129       can make sense when a repetition is involved and the group to the right
8130       has participated in an earlier iteration.
8131
8132       It  is  not  possible  to have a numerical "forward backreference" to a
8133       group whose number is 8 or more using this syntax  because  a  sequence
8134       such  as  \50  is  interpreted as a character defined in octal. See the
8135       subsection entitled "Non-printing characters" above for further details
8136       of  the  handling of digits following a backslash. Other forms of back-
8137       referencing do not suffer from this restriction. In  particular,  there
8138       is no problem when named capture groups are used (see below).
8139
8140       Another  way  of  avoiding  the ambiguity inherent in the use of digits
8141       following a backslash is to use the \g  escape  sequence.  This  escape
8142       must be followed by a signed or unsigned number, optionally enclosed in
8143       braces. These examples are all identical:
8144
8145         (ring), \1
8146         (ring), \g1
8147         (ring), \g{1}
8148
8149       An unsigned number specifies an absolute reference without the  ambigu-
8150       ity that is present in the older syntax. It is also useful when literal
8151       digits follow the reference. A signed number is a  relative  reference.
8152       Consider this example:
8153
8154         (abc(def)ghi)\g{-1}
8155
8156       The sequence \g{-1} is a reference to the most recently started capture
8157       group before \g, that is, it is equivalent to \2 in this example.
8158       Similarly, \g{-2} would be equivalent to \1.  The  use  of  relative
8159       references can be helpful in long patterns, and also in patterns that
8160       are created by joining together fragments that  contain  references
8161       within themselves.
8162
8163       The sequence \g{+1} is a reference to the next capture group. This kind
8164       of  forward  reference can be useful in patterns that repeat. Perl does
8165       not support the use of + in this way.
8166
8167       A backreference matches whatever actually  most  recently  matched  the
8168       capture  group  in  the current subject string, rather than anything at
8169       all that matches the group (see "Groups as subroutines" below for a way
8170       of doing that). So the pattern
8171
8172         (sens|respons)e and \1ibility
8173
8174       matches  "sense and sensibility" and "response and responsibility", but
8175       not "sense and responsibility". If caseful matching is in force at  the
8176       time  of  the backreference, the case of letters is relevant. For exam-
8177       ple,
8178
8179         ((?i)rah)\s+\1
8180
8181       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
8182       original capture group is matched caselessly.
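
       Both examples can be sketched with Python's re module, used here as
       an assumed stand-in for PCRE2 (the helper name is hypothetical).
       Because Python 3.11 and later reject an (?i) setting in the middle
       of a pattern, the second pattern is written with the scoped form
       (?i:rah), which likewise confines caseless matching to the inside of
       the capture group.

```python
import re


def matches(pattern, text):
    """Return True if the compiled pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None


# The backreference must repeat what the group actually matched.
austen = re.compile(r"(sens|respons)e and \1ibility")

# The group is matched caselessly, but the backreference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")

print(matches(austen, "sense and sensibility"),
      matches(austen, "sense and responsibility"))
print(matches(rah, "RAH RAH"), matches(rah, "RAH rah"))
```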
8183
8184       There  are  several  different  ways of writing backreferences to named
8185       capture groups. The .NET syntax \k{name} and the Perl  syntax  \k<name>
8186       or  \k'name'  are  supported,  as  is the Python syntax (?P=name). Perl
8187       5.10's unified backreference syntax, in which \g can be used  for  both
8188       numeric  and  named references, is also supported. We could rewrite the
8189       above example in any of the following ways:
8190
8191         (?<p1>(?i)rah)\s+\k<p1>
8192         (?'p1'(?i)rah)\s+\k{p1}
8193         (?P<p1>(?i)rah)\s+(?P=p1)
8194         (?<p1>(?i)rah)\s+\g{p1}
8195
8196       A capture group that is referenced by name may appear  in  the  pattern
8197       before or after the reference.
8198
8199       There  may be more than one backreference to the same group. If a group
8200       has not actually been used in a particular match, backreferences to  it
8201       always fail by default. For example, the pattern
8202
8203         (a|(bc))\2
8204
8205       always  fails  if  it starts to match "a" rather than "bc". However, if
8206       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
8207       erence to an unset value matches an empty string.
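
       The default behaviour (a backreference to an unset group fails) can  be
       sketched as follows. Python's re module is used as a stand-in and is an
       assumption, not PCRE2; it has no equivalent of
       PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown. The
       helper find() is hypothetical.

```python
import re

# Group 2 is set only when the "bc" alternative is taken.
UNSET = re.compile(r"(a|(bc))\2")


def find(subject):
    """Return the matched text, or None when the pattern does not match."""
    m = UNSET.search(subject)
    return m.group(0) if m else None
```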
8208
8209       Because  there may be many capture groups in a pattern, all digits fol-
8210       lowing a backslash are taken as part of a potential backreference  num-
8211       ber.  If  the  pattern continues with a digit character, some delimiter
8212       must be used to terminate the backreference. If the  PCRE2_EXTENDED  or
8213       PCRE2_EXTENDED_MORE  option is set, this can be white space. Otherwise,
8214       the \g{} syntax or an empty comment (see "Comments" below) can be used.
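
       The following sketch shows two of the delimiters mentioned above: white
       space in extended mode, and an empty comment. It is written for Python's
       re module, used here as a stand-in (an assumption); that module does not
       support \g{} in patterns, and it reports an error for the undelimited
       form, which may not be how PCRE2 itself treats it. The names are
       hypothetical.

```python
import re

# Extended (verbose) mode: the space ends the backreference \1.
WITH_SPACE = re.compile(r"(a)\1 0", re.VERBOSE)

# An empty comment also ends the backreference.
WITH_COMMENT = re.compile(r"(a)\1(?#)0")


def undelimited_compiles():
    """Report whether (a)\\10 compiles; the digits are read as one number."""
    try:
        re.compile(r"(a)\10")
    except re.error:
        return False
    return True
```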
8215
8216   Recursive backreferences
8217
8218       A backreference that occurs inside the group to which it  refers  fails
8219       when  the  group  is  first used, so, for example, (a\1) never matches.
8220       However, such references can be useful inside repeated groups. For  ex-
8221       ample, the pattern
8222
8223         (a|b\1)+
8224
8225       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
8226       ation of the group, the backreference matches the character string cor-
8227       responding  to  the  previous iteration. In order for this to work, the
8228       pattern must be such that the first iteration does not  need  to  match
8229       the  backreference. This can be done using alternation, as in the exam-
8230       ple above, or by a quantifier with a minimum of zero.
8231
8232       In versions of PCRE2 earlier than 10.25, backreferences of this type used
8233       to  cause  the  group  that  they  reference to be treated as an atomic
8234       group.  This restriction no longer applies, and backtracking into  such
8235       groups can occur as normal.
8236
8237
8238ASSERTIONS
8239
8240       An  assertion  is  a  test on the characters following or preceding the
8241       current matching point that does not consume any characters. The simple
8242       assertions  coded  as  \b,  \B,  \A,  \G, \Z, \z, ^ and $ are described
8243       above.
8244
8245       More complicated assertions are coded as  parenthesized  groups.  There
8246       are  two  kinds:  those  that look ahead of the current position in the
8247       subject string, and those that look behind it, and in each case an  as-
8248       sertion  may  be  positive (must match for the assertion to be true) or
8249       negative (must not match for the assertion to be  true).  An  assertion
8250       group is matched in the normal way, and if it is true, matching contin-
8251       ues after it, but with the matching position in the subject string  re-
8252       set to what it was before the assertion was processed.
8253
8254       The  Perl-compatible  lookaround assertions are atomic. If an assertion
8255       is true, but there is a subsequent matching failure, there is no  back-
8256       tracking  into  the assertion. However, there are some cases where non-
8257       atomic assertions can be useful. PCRE2 has some support for these,  de-
8258       scribed in the section entitled "Non-atomic assertions" below, but they
8259       are not Perl-compatible.
8260
8261       A lookaround assertion may appear as the  condition  in  a  conditional
8262       group  (see  below). In this case, the result of matching the assertion
8263       determines which branch of the condition is followed.
8264
8265       Assertion groups are not capture groups. If an assertion contains  cap-
8266       ture  groups within it, these are counted for the purposes of numbering
8267       the capture groups in the whole pattern. Within each branch of  an  as-
8268       sertion,  locally  captured  substrings  may be referenced in the usual
8269       way. For example, a sequence such as (.)\g{-1} can  be  used  to  check
8270       that two adjacent characters are the same.
8271
8272       When  a  branch within an assertion fails to match, any substrings that
8273       were captured are discarded (as happens with any  pattern  branch  that
8274       fails  to  match).  A  negative  assertion  is  true  only when all its
8275       branches fail to match; this means that no captured substrings are ever
8276       retained  after a successful negative assertion. When an assertion con-
8277       tains a matching branch, what happens depends on the type of assertion.
8278
8279       For a positive assertion, internally captured substrings  in  the  suc-
8280       cessful  branch are retained, and matching continues with the next pat-
8281       tern item after the assertion. For a  negative  assertion,  a  matching
8282       branch  means  that  the assertion is not true. If such an assertion is
8283       being used as a condition in a conditional group (see below),  captured
8284       substrings  are  retained,  because  matching  continues  with the "no"
8285       branch of the condition. For other failing negative assertions, control
8286       passes to the previous backtracking point, thus discarding any captured
8287       strings within the assertion.
8288
8289       Most assertion groups may be repeated; though it makes no sense to  as-
8290       sert the same thing several times, the side effect of capturing in pos-
8291       itive assertions may occasionally be useful. However, an assertion that
8292       forms  the  condition  for  a  conditional group may not be quantified.
8293       PCRE2 used to restrict the repetition of assertions, but  from  release
8294       10.35  the  only restriction is that an unlimited maximum repetition is
8295       changed to be one more than the minimum. For example, {3,}  is  treated
8296       as {3,4}.
8297
8298   Alphabetic assertion names
8299
8300       Traditionally,  symbolic  sequences such as (?= and (?<= have been used
8301       to specify lookaround assertions. Perl 5.28 introduced some  experimen-
8302       tal alphabetic alternatives which might be easier to remember. They all
8303       start with (* instead of (? and must be written using lower  case  let-
8304       ters. PCRE2 supports the following synonyms:
8305
8306         (*positive_lookahead:  or (*pla: is the same as (?=
8307         (*negative_lookahead:  or (*nla: is the same as (?!
8308         (*positive_lookbehind: or (*plb: is the same as (?<=
8309         (*negative_lookbehind: or (*nlb: is the same as (?<!
8310
8311       For  example,  (*pla:foo) is the same assertion as (?=foo). In the fol-
8312       lowing sections, the various assertions are described using the  origi-
8313       nal symbolic forms.
8314
8315   Lookahead assertions
8316
8317       Lookahead assertions start with (?= for positive assertions and (?! for
8318       negative assertions. For example,
8319
8320         \w+(?=;)
8321
8322       matches a word followed by a semicolon, but does not include the  semi-
8323       colon in the match, and
8324
8325         foo(?!bar)
8326
8327       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
8328       that the apparently similar pattern
8329
8330         (?!foo)bar
8331
8332       does not find an occurrence of "bar"  that  is  preceded  by  something
8333       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
8334       the assertion (?!foo) is always true when the next three characters are
8335       "bar". A lookbehind assertion is needed to achieve the other effect.
8336
8337       If you want to force a matching failure at some point in a pattern, the
8338       most convenient way to do it is with (?!) because an empty  string  al-
8339       ways  matches,  so  an assertion that requires there not to be an empty
8340       string must always fail.  The backtracking control verb (*FAIL) or (*F)
8341       is a synonym for (?!).
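
       The lookahead examples in this section can be checked with a small
       script. This is a sketch that assumes Python's re module as a stand-in
       engine (it is not PCRE2, and it has no (*FAIL) verb); the helper spans()
       is hypothetical.

```python
import re


def spans(pattern, subject):
    """Return (start offset, matched text) for each non-overlapping match."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, subject)]
```

       For instance, spans(r"(?!foo)bar", "foobar") still reports the "bar" at
       offset 3, because the assertion is tested at the position where "bar"
       starts, not before it.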
8342
8343   Lookbehind assertions
8344
8345       Lookbehind  assertions start with (?<= for positive assertions and (?<!
8346       for negative assertions. For example,
8347
8348         (?<!foo)bar
8349
8350       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
8351       contents  of  a  lookbehind  assertion are restricted such that all the
8352       strings it matches must have a fixed length. However, if there are sev-
8353       eral  top-level  alternatives,  they  do  not all have to have the same
8354       fixed length. Thus
8355
8356         (?<=bullock|donkey)
8357
8358       is permitted, but
8359
8360         (?<!dogs?|cats?)
8361
8362       causes an error at compile time. Branches that match  different  length
8363       strings  are permitted only at the top level of a lookbehind assertion.
8364       This is an extension compared with Perl, which requires all branches to
8365       match the same length of string. An assertion such as
8366
8367         (?<=ab(c|de))
8368
8369       is  not  permitted,  because  its single top-level branch can match two
8370       different lengths, but it is acceptable to PCRE2 if  rewritten  to  use
8371       two top-level branches:
8372
8373         (?<=abc|abde)
8374
8375       In  some  cases, the escape sequence \K (see above) can be used instead
8376       of a lookbehind assertion to get round the fixed-length restriction.
8377
8378       The implementation of lookbehind assertions is, for  each  alternative,
8379       to  temporarily  move the current position back by the fixed length and
8380       then try to match. If there are insufficient characters before the cur-
8381       rent position, the assertion fails.
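
       A minimal sketch of the behaviour described so far, assuming  Python's
       re module as a stand-in engine (not PCRE2). That module requires the
       whole lookbehind to have a single fixed length, so the top-level
       alternatives of different lengths that PCRE2 permits are not shown. The
       helper starts() is hypothetical.

```python
import re

NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")
AFTER_ABC = re.compile(r"(?<=abc)d")


def starts(pattern, subject):
    """Return the start offset of every match of a compiled pattern."""
    return [m.start() for m in pattern.finditer(subject)]
```

       When fewer than three characters precede the current position, the
       positive lookbehind fails, and the negative one therefore succeeds.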
8382
8383       In  UTF-8  and  UTF-16 modes, PCRE2 does not allow the \C escape (which
8384       matches a single code unit even in a UTF mode) to appear in  lookbehind
8385       assertions,  because  it makes it impossible to calculate the length of
8386       the lookbehind. The \X and \R escapes, which can match  different  num-
8387       bers of code units, are never permitted in lookbehinds.
8388
8389       "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in
8390       lookbehinds, as long as the called capture group matches a fixed-length
8391       string.  However,  recursion, that is, a "subroutine" call into a group
8392       that is already active, is not supported.
8393
8394       Perl does not support backreferences in lookbehinds. PCRE2 does support
8395       them,  but  only  if  certain  conditions  are met. The PCRE2_MATCH_UN-
8396       SET_BACKREF option must not be set, there must be no use of (?| in  the
8397       pattern  (it creates duplicate group numbers), and if the backreference
8398       is by name, the name must be unique. Of course,  the  referenced  group
8399       must  itself  match  a  fixed  length  substring. The following pattern
8400       matches words containing at least two characters  that  begin  and  end
8401       with the same character:
8402
8403          \b(\w)\w++(?<=\1)
8404
8405       Possessive  quantifiers  can be used in conjunction with lookbehind as-
8406       sertions to specify efficient matching of fixed-length strings  at  the
8407       end of subject strings. Consider a simple pattern such as
8408
8409         abcd$
8410
8411       when  applied  to  a  long string that does not match. Because matching
8412       proceeds from left to right, PCRE2 will look for each "a" in  the  sub-
8413       ject  and  then see if what follows matches the rest of the pattern. If
8414       the pattern is specified as
8415
8416         ^.*abcd$
8417
8418       the initial .* matches the entire string at first, but when this  fails
8419       (because there is no following "a"), it backtracks to match all but the
8420       last character, then all but the last two characters, and so  on.  Once
8421       again  the search for "a" covers the entire string, from right to left,
8422       so we are no better off. However, if the pattern is written as
8423
8424         ^.*+(?<=abcd)
8425
8426       there can be no backtracking for the .*+ item because of the possessive
8427       quantifier; it can match only the entire string. The subsequent lookbe-
8428       hind assertion does a single test on the last four  characters.  If  it
8429       fails,  the  match  fails  immediately. For long strings, this approach
8430       makes a significant difference to the processing time.
8431
8432   Using multiple assertions
8433
8434       Several assertions (of any sort) may occur in succession. For example,
8435
8436         (?<=\d{3})(?<!999)foo
8437
8438       matches "foo" preceded by three digits that are not "999". Notice  that
8439       each  of  the  assertions is applied independently at the same point in
8440       the subject string. First there is a  check  that  the  previous  three
8441       characters  are  all  digits,  and  then there is a check that the same
8442       three characters are not "999". This pattern does not match "foo"
8443       preceded by six characters, the first three of which are digits and
8444       the last three of which are not "999". For example, it doesn't match
8445       "123abcfoo". A pattern to do that is
8446
8447         (?<=\d{3}...)(?<!999)foo
8448
8449       This  time  the  first assertion looks at the preceding six characters,
8450       checking that the first three are digits, and then the second assertion
8451       checks that the preceding three characters are not "999".
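
       The difference between the two patterns can be  seen  with  a  sketch
       like this one, which assumes Python's re module as a stand-in engine
       (not PCRE2). The names THREE, SIX, and found() are hypothetical.

```python
import re

# Both assertions look at the same three preceding characters.
THREE = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks at six characters, the second at the last three.
SIX = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None
```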
8452
8453       Assertions can be nested in any combination. For example,
8454
8455         (?<=(?<!foo)bar)baz
8456
8457       matches  an occurrence of "baz" that is preceded by "bar" which in turn
8458       is not preceded by "foo", while
8459
8460         (?<=\d{3}(?!999)...)foo
8461
8462       is another pattern that matches "foo" preceded by three digits and  any
8463       three characters that are not "999".
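
       Nested assertions can be exercised in the same way. The sketch  below
       assumes Python's re module as a stand-in engine (not PCRE2); the names
       are hypothetical.

```python
import re

# "baz" preceded by "bar" that is not itself preceded by "foo".
NESTED_BEHIND = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and three characters that are not "999".
NESTED_AHEAD = re.compile(r"(?<=\d{3}(?!999)...)foo")


def found(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None
```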
8464
8465
8466NON-ATOMIC ASSERTIONS
8467
8468       The  traditional Perl-compatible lookaround assertions are atomic. That
8469       is, if an assertion is true, but there is a subsequent  matching  fail-
8470       ure,  there  is  no backtracking into the assertion. However, there are
8471       some cases where non-atomic positive assertions can  be  useful.  PCRE2
8472       provides these using the following syntax:
8473
8474         (*non_atomic_positive_lookahead:  or (*napla: or (?*
8475         (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
8476
8477       Consider  the  problem  of finding the right-most word in a string that
8478       also appears earlier in the string, that is, it must  appear  at  least
8479       twice  in  total.  This pattern returns the required result as captured
8480       substring 1:
8481
8482         ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
8483
8484       For a subject such as "word1 word2 word3 word2 word3 word4" the  result
8485       is  "word3".  How does it work? At the start, ^(?x) anchors the pattern
8486       and sets the "x" option, which causes white space (introduced for read-
8487       ability)  to  be  ignored. Inside the assertion, the greedy .* at first
8488       consumes the entire string, but then has to backtrack until the rest of
8489       the  assertion can match a word, which is captured by group 1. In other
8490       words, when the assertion first succeeds, it  captures  the  right-most
8491       word in the string.
8492
8493       The  current  matching point is then reset to the start of the subject,
8494       and the rest of the pattern match checks for  two  occurrences  of  the
8495       captured  word,  using  an  ungreedy .*? to scan from the left. If this
8496       succeeds, we are done, but if the last word in the string does not occur
8497       twice, this part of the pattern fails. If a traditional atomic lookahead
8498       (?= or (*pla: had been used, the assertion could not be re-entered, and
8499       the whole match would fail. The pattern would succeed only if the very
8500       last word in the subject was found twice.
8501
8502       Using a non-atomic lookahead, however, means that when  the  last  word
8503       does  not  occur  twice  in the string, the lookahead can backtrack and
8504       find the second-last word, and so on, until either the match  succeeds,
8505       or all words have been tested.
8506
8507       Two conditions must be met for a non-atomic assertion to be useful: the
8508       contents of one or more capturing groups must change after a  backtrack
8509       into  the  assertion,  and  there  must be a backreference to a changed
8510       group later in the pattern. If this is not the case, the  rest  of  the
8511       pattern  match  fails exactly as before because nothing has changed, so
8512       using a non-atomic assertion just wastes resources.
8513
8514       There is one exception to backtracking into a non-atomic assertion.  If
8515       an  (*ACCEPT)  control verb is triggered, the assertion succeeds atomi-
8516       cally. That is, a subsequent match failure cannot  backtrack  into  the
8517       assertion.
8518
8519       Non-atomic  assertions  are  not  supported by the alternative matching
8520       function pcre2_dfa_match(). They are supported by JIT, but only if they
8521       do not contain any control verbs such as (*ACCEPT). (This may change in
8522       the future.) Note that assertions that appear as conditions for
8523       conditional groups (see below) must be atomic.
8524
8525
8526SCRIPT RUNS
8527
8528       In  concept, a script run is a sequence of characters that are all from
8529       the same Unicode script such as Latin or Greek. However,  because  some
8530       scripts  are  commonly  used together, and because some diacritical and
8531       other marks are used with multiple scripts,  it  is  not  that  simple.
8532       There is a full description of the rules that PCRE2 uses in the section
8533       entitled "Script Runs" in the pcre2unicode documentation.
8534
8535       If part of a pattern is enclosed between (*script_run: or (*sr:  and  a
8536       closing parenthesis, it fails if the sequence of characters that it
8537       matches is not a script run. After a failure, normal backtracking occurs.
8538       Script runs can be used to detect spoofing attacks using characters that
8539       look the same, but are from different scripts. The string
8540       "paypal.com"  is an infamous example, where the letters could be a mix-
8541       ture of Latin and Cyrillic. This pattern ensures that the matched char-
8542       acters in a sequence of non-spaces that follow white space are a script
8543       run:
8544
8545         \s+(*sr:\S+)
8546
8547       To be sure that they are all from the Latin  script  (for  example),  a
8548       lookahead can be used:
8549
8550         \s+(?=\p{Latin})(*sr:\S+)
8551
8552       This works as long as the first character is expected to be a character
8553       in that script, and not (for example)  punctuation,  which  is  allowed
8554       with  any script. If this is not the case, a more creative lookahead is
8555       needed. For example, if digits, underscore, and dots are  permitted  at
8556       the start:
8557
8558         \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
8559
8560
8561       In  many  cases, backtracking into a script run pattern fragment is not
8562       desirable. The script run can employ an atomic group to  prevent  this.
8563       Because  this is a common requirement, a shorthand notation is provided
8564       by (*atomic_script_run: or (*asr:
8565
8566         (*asr:...) is the same as (*sr:(?>...))
8567
8568       Note that the atomic group is inside the script run. Putting it outside
8569       would not prevent backtracking into the script run pattern.
8570
8571       Support  for  script runs is not available if PCRE2 is compiled without
8572       Unicode support. A compile-time error is given if any of the above
8573       constructs is encountered. Script runs are not supported by the
8574       alternative matching function, pcre2_dfa_match(), because they use the
8575       same mechanism as capturing parentheses.
8576
8577       Warning:  The  (*ACCEPT)  control  verb  (see below) should not be used
8578       within a script run group, because it causes an immediate exit from the
8579       group, bypassing the script run checking.
8580
8581
8582CONDITIONAL GROUPS
8583
8584       It is possible to cause the matching process to obey a pattern fragment
8585       conditionally or to choose between two alternative fragments, depending
8586       on  the result of an assertion, or whether a specific capture group has
8587       already been matched. The two possible forms of conditional group are:
8588
8589         (?(condition)yes-pattern)
8590         (?(condition)yes-pattern|no-pattern)
8591
8592       If the condition is satisfied, the yes-pattern is used;  otherwise  the
8593       no-pattern  (if present) is used. An absent no-pattern is equivalent to
8594       an empty string (it always matches). If there are more than two  alter-
8595       natives  in the group, a compile-time error occurs. Each of the two al-
8596       ternatives may itself contain nested groups of any form, including con-
8597       ditional  groups;  the  restriction to two alternatives applies only at
8598       the level of the condition itself. This pattern fragment is an  example
8599       where the alternatives are complex:
8600
8601         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
8602
8603
8604       There are five kinds of condition: references to capture groups, refer-
8605       ences to recursion, two pseudo-conditions called  DEFINE  and  VERSION,
8606       and assertions.
8607
8608   Checking for a used capture group by number
8609
8610       If  the  text between the parentheses consists of a sequence of digits,
8611       the condition is true if a capture group of that number has  previously
8612       matched.  If  there is more than one capture group with the same number
8613       (see the earlier section about duplicate group numbers), the  condition
8614       is true if any of them have matched. An alternative notation is to pre-
8615       cede the digits with a plus or minus sign. In this case, the group num-
8616       ber  is relative rather than absolute. The most recently opened capture
8617       group can be referenced by (?(-1), the next most recent by (?(-2),  and
8618       so  on.  Inside  loops  it  can  also make sense to refer to subsequent
8619       groups. The next capture group can be referenced as (?(+1), and so  on.
8620       (The  value  zero in any of these forms is not used; it provokes a com-
8621       pile-time error.)
8622
8623       Consider the following pattern, which  contains  non-significant  white
8624       space  to  make it more readable (assume the PCRE2_EXTENDED option) and
8625       to divide it into three parts for ease of discussion:
8626
8627         ( \( )?    [^()]+    (?(1) \) )
8628
8629       The first part matches an optional opening  parenthesis,  and  if  that
8630       character is present, sets it as the first captured substring. The sec-
8631       ond part matches one or more characters that are not  parentheses.  The
8632       third  part  is a conditional group that tests whether or not the first
8633       capture group matched. If it did, that is, if the subject started with an
8634       opening  parenthesis,  the condition is true, and so the yes-pattern is
8635       executed and a closing parenthesis is required.  Otherwise,  since  no-
8636       pattern is not present, the conditional group matches nothing. In other
8637       words, this pattern matches a sequence of  non-parentheses,  optionally
8638       enclosed in parentheses.
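
       The three-part pattern above can be tried out as  follows.  This  is  a
       sketch that assumes Python's re module as a stand-in engine (not PCRE2),
       with re.VERBOSE playing the role of PCRE2_EXTENDED; the helper
       accepts() is hypothetical.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
PAREN = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the conditional pattern."""
    return PAREN.fullmatch(subject) is not None
```

       A subject with only one of the two parentheses is rejected, because  the
       condition ties the closing parenthesis to the opening one.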
8639
8640       If  you  were  embedding  this pattern in a larger one, you could use a
8641       relative reference:
8642
8643         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
8644
8645       This makes the fragment independent of the parentheses  in  the  larger
8646       pattern.
8647
8648   Checking for a used capture group by name
8649
8650       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
8651       used capture group by name. For compatibility with earlier versions  of
8652       PCRE1,  which had this facility before Perl, the syntax (?(name)...) is
8653       also recognized.  Note, however, that undelimited names  consisting  of
8654       the  letter  R followed by digits are ambiguous (see the following sec-
8655       tion). Rewriting the above example to use a named group gives this:
8656
8657         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
8658
8659       If the name used in a condition of this kind is a duplicate,  the  test
8660       is  applied  to  all groups of the same name, and is true if any one of
8661       them has matched.
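
       A named condition can be sketched in the same way. Python's re  module
       (a stand-in engine, not PCRE2) is assumed here; it accepts only the
       undelimited (?(name)...) form together with (?P<name>...) groups, so
       those spellings replace the Perl ones. The helper opened() is
       hypothetical.

```python
import re

NAMED_PAREN = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE
)


def opened(subject):
    """Return True/False for whether OPEN was set, or None if no match."""
    m = NAMED_PAREN.fullmatch(subject)
    if m is None:
        return None
    return m.group("OPEN") is not None
```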
8662
8663   Checking for pattern recursion
8664
8665       "Recursion" in this sense refers to any subroutine-like call  from  one
8666       part  of  the  pattern to another, whether or not it is actually recur-
8667       sive. See the sections entitled "Recursive  patterns"  and  "Groups  as
8668       subroutines" below for details of recursion and subroutine calls.
8669
8670       If  a  condition  is the string (R), and there is no capture group with
8671       the name R, the condition is true if matching is currently in a  recur-
8672       sion  or  subroutine call to the whole pattern or any capture group. If
8673       digits follow the letter R, and there is no group with that  name,  the
8674       condition  is  true  if  the  most recent call is into a group with the
8675       given number, which must exist somewhere in the overall  pattern.  This
8676       is a contrived example that is equivalent to a+b:
8677
8678         ((?(R1)a+|(?1)b))
8679
8680       However,  in  both  cases,  if there is a capture group with a matching
8681       name, the condition tests for its being set, as described in  the  sec-
8682       tion  above,  instead of testing for recursion. For example, creating a
8683       group with the name R1 by adding (?<R1>)  to  the  above  pattern  com-
8684       pletely changes its meaning.
8685
8686       If a name preceded by ampersand follows the letter R, for example:
8687
8688         (?(R&name)...)
8689
8690       the  condition  is true if the most recent recursion is into a group of
8691       that name (which must exist within the pattern).
8692
8693       This condition does not check the entire recursion stack. It tests only
8694       the  current  level.  If the name used in a condition of this kind is a
8695       duplicate, the test is applied to all groups of the same name,  and  is
8696       true if any one of them is the most recent recursion.
8697
8698       At "top level", all these recursion test conditions are false.
8699
8700   Defining capture groups for use by reference only
8701
8702       If the condition is the string (DEFINE), the condition is always false,
8703       even if there is a group with the name DEFINE. In this case, there  may
8704       be only one alternative in the rest of the conditional group. It is al-
8705       ways skipped if control reaches this point in the pattern; the idea  of
8706       DEFINE  is that it can be used to define subroutines that can be refer-
8707       enced from elsewhere. (The use of subroutines is described below.)  For
8708       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
8709       could be written like this (ignore white space and line breaks):
8710
8711         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
8712         \b (?&byte) (\.(?&byte)){3} \b
8713
8714       The first part of the pattern is a DEFINE group inside which another
8715       group  named "byte" is defined. This matches an individual component of
8716       an IPv4 address (a number less than 256). When  matching  takes  place,
8717       this  part  of  the pattern is skipped because DEFINE acts like a false
8718       condition. The rest of the pattern uses references to the  named  group
8719       to  match the four dot-separated components of an IPv4 address, insist-
8720       ing on a word boundary at each end.
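
       Engines without DEFINE can approximate this reuse by  building  the
       pattern from a shared fragment. The sketch below does that with plain
       string substitution in Python's re module; this is a different technique
       from a PCRE2 subroutine call, shown only because the "byte" group here
       is non-recursive. The names BYTE, IPV4, and is_ipv4() are hypothetical.

```python
import re

# The fragment that the DEFINE group names "byte" (a number below 256).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(subject):
    """Return True if the whole subject is a dotted IPv4 address."""
    return IPV4.fullmatch(subject) is not None
```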
8721
8722   Checking the PCRE2 version
8723
8724       Programs that link with a PCRE2 library can check the version by  call-
8725       ing  pcre2_config()  with  appropriate arguments. Users of applications
8726       that do not have access to the underlying code cannot do this.  A  spe-
8727       cial  "condition" called VERSION exists to allow such users to discover
8728       which version of PCRE2 they are dealing with by using this condition to
8729       match  a string such as "yesno". VERSION must be followed either by "="
8730       or ">=" and a version number.  For example:
8731
8732         (?(VERSION>=10.4)yes|no)
8733
8734       This pattern matches "yes" if the PCRE2 version is greater than or equal
8735       to 10.4, or "no" otherwise. The fractional part of the version number may
8736       not contain more than two digits.
8737
8738   Assertion conditions
8739
8740       If the condition is not in any of the  above  formats,  it  must  be  a
8741       parenthesized  assertion.  This may be a positive or negative lookahead
8742       or lookbehind assertion. However, it must be a traditional  atomic  as-
8743       sertion, not one of the PCRE2-specific non-atomic assertions.
8744
8745       Consider  this  pattern,  again containing non-significant white space,
8746       and with the two alternatives on the second line:
8747
8748         (?(?=[^a-z]*[a-z])
8749         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
8750
8751       The condition is a positive lookahead assertion  that  matches  an  op-
8752       tional sequence of non-letters followed by a letter. In other words, it
8753       tests for the presence of at least one letter in the subject. If a let-
8754       ter  is  found,  the  subject is matched against the first alternative;
8755       otherwise it is  matched  against  the  second.  This  pattern  matches
8756       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8757       letters and dd are digits.
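
       Where assertion conditions are not available, the same set of strings
       can be matched by guarding each alternative with the assertion or its
       negation. The sketch below uses that substitute technique in Python's re
       module (a stand-in engine, not PCRE2, which does not accept a lookahead
       as a condition); it is not the conditional group itself, and the names
       are hypothetical.

```python
import re

# (?(?=A)X|Y) rewritten as (?:(?=A)X|(?!A)Y), with A = [^a-z]*[a-z].
DATE_LIKE = re.compile(
    r"(?:(?=[^a-z]*[a-z])\d{2}-[a-z]{3}-\d{2}"
    r"|(?![^a-z]*[a-z])\d{2}-\d{2}-\d{2})"
)


def accepts(subject):
    """Return True if the whole subject is dd-aaa-dd or dd-dd-dd."""
    return DATE_LIKE.fullmatch(subject) is not None
```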
8758
8759       When an assertion that is a condition contains capture groups, any cap-
8760       turing  that  occurs  in  a matching branch is retained afterwards, for
8761       both positive and negative assertions, because matching always  contin-
8762       ues  after  the  assertion, whether it succeeds or fails. (Compare non-
8763       conditional assertions, for which captures are retained only for  posi-
8764       tive assertions that succeed.)
8765
8766
8767COMMENTS
8768
8769       There are two ways of including comments in patterns that are processed
8770       by PCRE2. In both cases, the start of the comment  must  not  be  in  a
8771       character  class,  nor  in  the middle of any other sequence of related
8772       characters such as (?: or a group name or number. The  characters  that
8773       make up a comment play no part in the pattern matching.
8774
8775       The  sequence (?# marks the start of a comment that continues up to the
8776       next closing parenthesis. Nested parentheses are not permitted. If  the
8777       PCRE2_EXTENDED  or  PCRE2_EXTENDED_MORE  option  is set, an unescaped #
8778       character also introduces a comment, which in this  case  continues  to
8779       immediately  after  the next newline character or character sequence in
8780       the pattern. Which characters are interpreted as newlines is controlled
8781       by  an option passed to the compiling function or by a special sequence
8782       at the start of the pattern, as described in the section entitled "New-
8783       line conventions" above. Note that the end of this type of comment is a
8784       literal newline sequence in the pattern; escape sequences  that  happen
8785       to represent a newline do not count. For example, consider this pattern
8786       when PCRE2_EXTENDED is set, and the default newline convention (a  sin-
8787       gle linefeed character) is in force:
8788
8789         abc #comment \n still comment
8790
8791       On  encountering  the # character, pcre2_compile() skips along, looking
8792       for a newline in the pattern. The sequence \n is still literal at  this
8793       stage,  so  it does not terminate the comment. Only an actual character
8794       with the code value 0x0a (the default newline) does so.
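
       The same rule can be observed in the verbose mode of Python's standard
       re module, used here as a stand-in for PCRE2_EXTENDED. That substitu-
       tion is an assumption made for illustration; the sketch does not call
       PCRE2. In the raw string the comment contains the two characters
       backslash and n, whereas the ordinary string contains a real linefeed.

```python
import re

# Raw string: the comment holds a backslash followed by "n", not a real
# newline, so everything after "#" is ignored by the pattern compiler.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed (0x0a), which ends the comment,
# so "def" becomes part of the pattern again.
literal = re.compile("abc #comment \n def", re.VERBOSE)

print(bool(escaped.fullmatch("abc")))     # the comment swallowed the rest
print(bool(literal.fullmatch("abcdef")))  # the comment ended at the linefeed
```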
8795
8796
8797RECURSIVE PATTERNS
8798
8799       Consider the problem of matching a string in parentheses, allowing  for
8800       unlimited  nested  parentheses.  Without the use of recursion, the best
8801       that can be done is to use a pattern that  matches  up  to  some  fixed
8802       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
8803       depth.
8804
8805       For some time, Perl has provided a facility that allows regular expres-
8806       sions  to recurse (amongst other things). It does this by interpolating
8807       Perl code in the expression at run time, and the code can refer to  the
8808       expression itself. A Perl pattern using code interpolation to solve the
8809       parentheses problem can be created like this:
8810
8811         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
8812
8813       The (?p{...}) item interpolates Perl code at run time, and in this case
8814       refers recursively to the pattern in which it appears.
8815
8816       Obviously,  PCRE2  cannot  support  the interpolation of Perl code. In-
8817       stead, it supports special syntax for recursion of the entire  pattern,
8818       and also for individual capture group recursion. After its introduction
8819       in PCRE1 and Python, this kind of recursion was subsequently introduced
8820       into Perl at release 5.10.
8821
8822       A  special  item  that consists of (? followed by a number greater than
8823       zero and a closing parenthesis is a recursive subroutine  call  of  the
8824       capture  group of the given number, provided that it occurs inside that
8825       group. (If not, it is a non-recursive subroutine  call,  which  is  de-
8826       scribed in the next section.) The special item (?R) or (?0) is a recur-
8827       sive call of the entire regular expression.
8828
8829       This PCRE2 pattern solves the nested parentheses  problem  (assume  the
8830       PCRE2_EXTENDED option is set so that white space is ignored):
8831
8832         \( ( [^()]++ | (?R) )* \)
8833
8834       First  it matches an opening parenthesis. Then it matches any number of
8835       substrings which can either be a sequence of non-parentheses, or a  re-
8836       cursive match of the pattern itself (that is, a correctly parenthesized
8837       substring).  Finally there is a closing parenthesis. Note the use of  a
8838       possessive  quantifier  to  avoid  backtracking  into sequences of non-
8839       parentheses.
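
       The way the recursion works can be pictured as a recursive-descent
       function. The following is a minimal sketch in plain Python, not the
       PCRE2 API; the function name match_parens is invented for the illus-
       tration, and unlike an unanchored pattern it tests only the position
       it is given.

```python
def match_parens(s, i=0):
    """Return the index just past a balanced group starting at s[i], or None."""
    if i >= len(s) or s[i] != "(":
        return None                       # \(  : an opening parenthesis
    i += 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_parens(s, i)        # (?R) : recurse into a nested group
            if i is None:
                return None
        else:
            i += 1                        # [^()]++ : consume a non-parenthesis
    return i + 1 if i < len(s) else None  # \)  : the closing parenthesis


print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```

       The first call returns the length of the whole string; the second re-
       turns None because the outer parenthesis is never closed.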
8840
8841       If this were part of a larger pattern, you would not  want  to  recurse
8842       the entire pattern, so instead you could use this:
8843
8844         ( \( ( [^()]++ | (?1) )* \) )
8845
8846       We  have  put the pattern into parentheses, and caused the recursion to
8847       refer to them instead of the whole pattern.
8848
8849       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
8850       tricky.  This is made easier by the use of relative references. Instead
8851       of (?1) in the pattern above you can write (?-2) to refer to the second
8852       most  recently  opened  parentheses  preceding  the recursion. In other
8853       words, a negative number counts capturing  parentheses  leftwards  from
8854       the point at which it is encountered.
8855
8856       Be aware, however, that if duplicate capture group numbers are in use,
8857       relative references refer to the earliest group  with  the  appropriate
8858       number. Consider, for example:
8859
8860         (?|(a)|(b)) (c) (?-2)
8861
8862       The first two capture groups (a) and (b) are both numbered 1, and group
8863       (c) is number 2. When the reference (?-2) is  encountered,  the  second
8864       most recently opened group has the number 1, but it is the first
8865       such group (the (a) group) to which the recursion refers. This would be
8866       the  same if an absolute reference (?1) was used. In other words, rela-
8867       tive references are just a shorthand for computing a group number.
8868
8869       It is also possible to refer to subsequent capture groups,  by  writing
8870       references  such  as  (?+2). However, these cannot be recursive because
8871       the reference is not inside the parentheses that are  referenced.  They
8872       are  always  non-recursive  subroutine  calls, as described in the next
8873       section.
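
       The arithmetic behind relative references can be sketched as follows.
       This is an illustration in Python, not code from PCRE2: the function
       name and its highest_so_far argument (the highest capture group number
       opened to the left of the reference) are assumptions of the sketch.

```python
def absolute_group(highest_so_far, relative):
    """Convert a relative reference such as -2 or +1 to a group number."""
    if relative < 0:
        number = highest_so_far + relative + 1   # count leftwards
    elif relative > 0:
        number = highest_so_far + relative       # count groups still to come
    else:
        raise ValueError("a relative reference needs a non-zero value")
    if number < 1:
        raise ValueError("reference to a non-existent group")
    return number


# ( \( ( [^()]++ | (?-2) )* \) )  : two groups are open at the reference
print(absolute_group(2, -2))
# (?|(a)|(b)) (c) (?-2)           : the highest number so far is 2
print(absolute_group(2, -2))
# (...(?+1)...(relative)...       : one group is open, the next one is called
print(absolute_group(1, +1))
```

       The first two calls give 1 and the third gives 2, in line with the
       examples in the text.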
8874
8875       An alternative approach is to use named parentheses.  The  Perl  syntax
8876       for  this  is  (?&name);  PCRE1's earlier syntax (?P>name) is also sup-
8877       ported. We could rewrite the above example as follows:
8878
8879         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
8880
8881       If there is more than one group with the same name, the earliest one is
8882       used.
8883
8884       The example pattern that we have been looking at contains nested unlim-
8885       ited repeats, and so the use of a possessive  quantifier  for  matching
8886       strings  of  non-parentheses  is important when applying the pattern to
8887       strings that do not match. For example, when this pattern is applied to
8888
8889         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
8890
8891       it yields "no match" quickly. However, if a  possessive  quantifier  is
8892       not  used, the match runs for a very long time indeed because there are
8893       so many different ways the + and * repeats can carve  up  the  subject,
8894       and all have to be tested before failure can be reported.
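
       A rough way to see the scale of the problem is to count the ways in
       which a run of n non-parenthesis characters can be cut into consecu-
       tive non-empty pieces, each piece being one iteration of the inner +
       repeat under the outer * repeat. The sketch below is plain Python
       arithmetic, not a measurement of PCRE2, and the function name is in-
       vented for the illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways_to_split(n):
    """Ways to cut a run of n characters into consecutive non-empty pieces."""
    if n == 0:
        return 1          # nothing left to cut: exactly one way
    # The first piece takes k characters; the rest is cut up recursively.
    return sum(ways_to_split(n - k) for k in range(1, n + 1))


for n in (5, 10, 20):
    print(n, ways_to_split(n))
```

       The count doubles with every extra character (it is 2 to the power
       n-1), which is why a subject only a few dozen characters long can keep
       a non-possessive pattern busy for a very long time.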
8895
8896       At  the  end  of a match, the values of capturing parentheses are those
8897       from the outermost level. If you want to obtain intermediate values,  a
8898       callout function can be used (see below and the pcre2callout documenta-
8899       tion). If the pattern above is matched against
8900
8901         (ab(cd)ef)
8902
8903       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
8904       which  is  the last value taken on at the top level. If a capture group
8905       is not matched at the top level, its final  captured  value  is  unset,
8906       even  if it was (temporarily) set at a deeper level during the matching
8907       process.
8908
8909       Do not confuse the (?R) item with the condition (R),  which  tests  for
8910       recursion.   Consider  this pattern, which matches text in angle brack-
8911       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
8912       brackets  (that is, when recursing), whereas any characters are permit-
8913       ted at the outer level.
8914
8915         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
8916
8917       In this pattern, (?(R) is the start of a conditional  group,  with  two
8918       different  alternatives  for the recursive and non-recursive cases. The
8919       (?R) item is the actual recursive call.
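
       The difference between the call and the condition can be pictured
       with a recursive function that carries a flag. This is a sketch in
       plain Python, not the PCRE2 API; the names are invented, only ASCII
       digits are treated as \d, and the function tests a single starting po-
       sition rather than searching along the subject.

```python
DIGITS = "0123456789"


def match_angle(s, i=0, recursing=False):
    """Return the index just past a bracketed item starting at s[i], or None."""
    if i >= len(s) or s[i] != "<":
        return None
    i += 1
    while i < len(s) and s[i] != ">":
        if s[i] == "<":
            i = match_angle(s, i, recursing=True)  # (?R): the actual call
            if i is None:
                return None
        elif recursing and s[i] not in DIGITS:
            return None                            # (?(R) \d++ : digits only
        else:
            i += 1                                 # a digit, or [^<>] at top level
    return i + 1 if i < len(s) else None


print(match_angle("<abc<123>def>"))
print(match_angle("<abc<12x>def>"))
```

       The first subject is accepted in full; the second is rejected at this
       starting position because a letter appears inside nested brackets.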
8920
8921   Differences in recursion processing between PCRE2 and Perl
8922
8923       Some former differences between PCRE2 and Perl no longer exist.
8924
8925       Before release 10.30, recursion processing in PCRE2 differed from  Perl
8926       in  that  a  recursive  subroutine call was always treated as an atomic
8927       group. That is, once it had matched some of the subject string, it  was
8928       never  re-entered,  even if it contained untried alternatives and there
8929       was a subsequent matching failure. (Historical note:  PCRE  implemented
8930       recursion before Perl did.)
8931
8932       Starting  with  release 10.30, recursive subroutine calls are no longer
8933       treated as atomic. That is, they can be re-entered to try unused alter-
8934       natives  if  there  is a matching failure later in the pattern. This is
8935       now compatible with the way Perl works. If you want a  subroutine  call
8936       to be atomic, you must explicitly enclose it in an atomic group.
8937
8938       Supporting backtracking into recursions simplifies certain types of re-
8939       cursive pattern. For example, this pattern matches palindromic strings:
8940
8941         ^((.)(?1)\2|.?)$
8942
8943       The second branch in the group matches a single  central  character  in
8944       the  palindrome  when there are an odd number of characters, or nothing
8945       when there are an even number of characters, but in order  to  work  it
8946       has  to  be  able  to  try the second case when the rest of the pattern
8947       match fails. If you want to match typical palindromic phrases, the pat-
8948       tern  has  to  ignore  all  non-word characters, which can be done like
8949       this:
8950
8951         ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
8952
8953       If run with the PCRE2_CASELESS option,  this  pattern  matches  phrases
8954       such  as "A man, a plan, a canal: Panama!". Note the use of the posses-
8955       sive quantifier *+ to avoid backtracking  into  sequences  of  non-word
8956       characters. Without this, PCRE2 takes a great deal longer (ten times or
8957       more) to match typical phrases, and Perl takes so long that  you  think
8958       it has gone into a loop.
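
       The strings accepted by the two palindrome patterns can be described
       by a short recursive function. The following is a sketch in Python of
       what the patterns recognize, not of how PCRE2 backtracks; the function
       names are invented, and casefold() stands in for PCRE2_CASELESS.

```python
import re


def is_palindrome(s):
    """Peel off matching outer characters, as (.) (?1) \\2 does."""
    if len(s) <= 1:
        return True       # second branch: a central character, or nothing
    return s[0] == s[-1] and is_palindrome(s[1:-1])


def is_palindromic_phrase(text):
    """Ignore non-word characters and case before testing."""
    letters = re.sub(r"\W+", "", text).casefold()
    return is_palindrome(letters)


print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
```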
8959
8960       Another  way  in which PCRE2 and Perl used to differ in their recursion
8961       processing is in the handling of captured  values.  Formerly  in  Perl,
8962       when  a  group  was called recursively or as a subroutine (see the next
8963       section), it had no access to any values that were captured outside the
8964       recursion,  whereas  in  PCRE2 these values can be referenced. Consider
8965       this pattern:
8966
8967         ^(.)(\1|a(?2))
8968
8969       This pattern matches "bab". The first capturing parentheses match  "b",
8970       then in the second group, when the backreference \1 fails to match "b",
8971       the second alternative matches "a" and then recurses. In the recursion,
8972       \1  does now match "b" and so the whole match succeeds. This match used
8973       to fail in Perl, but in later versions (I tried 5.024) it now works.
8974
8975
8976GROUPS AS SUBROUTINES
8977
8978       If the syntax for a recursive group call (either by number or by  name)
8979       is  used  outside the parentheses to which it refers, it operates a bit
8980       like a subroutine in a programming  language.  More  accurately,  PCRE2
8981       treats the referenced group as an independent subpattern which it tries
8982       to match at the current matching position. The called group may be  de-
8983       fined  before or after the reference. A numbered reference can be abso-
8984       lute or relative, as in these examples:
8985
8986         (...(absolute)...)...(?2)...
8987         (...(relative)...)...(?-1)...
8988         (...(?+1)...(relative)...
8989
8990       An earlier example pointed out that the pattern
8991
8992         (sens|respons)e and \1ibility
8993
8994       matches "sense and sensibility" and "response and responsibility",  but
8995       not "sense and responsibility". If instead the pattern
8996
8997         (sens|respons)e and (?1)ibility
8998
8999       is  used, it does match "sense and responsibility" as well as the other
9000       two strings. Another example is  given  in  the  discussion  of  DEFINE
9001       above.
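
       The contrast can be reproduced with Python's standard re module. It
       has no subroutine calls, so in this sketch the effect of (?1) is imi-
       tated by writing the text of group 1 out a second time; that substitu-
       tion is an assumption made for the illustration.

```python
import re

# Backreference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (?1): the pattern of group 1 is tried again from scratch,
# so either alternative may match the second time.
inlined = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

for text in ("sense and sensibility", "sense and responsibility"):
    print(text, bool(backref.fullmatch(text)), bool(inlined.fullmatch(text)))
```

       Only the second pattern accepts "sense and responsibility".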
9002
9003       Like  recursions,  subroutine  calls  used to be treated as atomic, but
9004       this changed at PCRE2 release 10.30, so  backtracking  into  subroutine
9005       calls  can  now  occur. However, any capturing parentheses that are set
9006       during the subroutine call revert to their previous values afterwards.
9007
9008       Processing options such as case-independence are fixed when a group  is
9009       defined,  so  if  it  is  used  as a subroutine, such options cannot be
9010       changed for different calls. For example, consider this pattern:
9011
9012         (abc)(?i:(?-1))
9013
9014       It matches "abcabc". It does not match "abcABC" because the  change  of
9015       processing option does not affect the called group.
9016
9017       The  behaviour  of  backtracking control verbs in groups when called as
9018       subroutines is described in the section entitled "Backtracking verbs in
9019       subroutines" below.
9020
9021
9022ONIGURUMA SUBROUTINE SYNTAX
9023
9024       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
9025       name or a number enclosed either in angle brackets or single quotes, is
9026       an alternative syntax for calling a group as a subroutine, possibly re-
9027       cursively. Here are two of the examples  used  above,  rewritten  using
9028       this syntax:
9029
9030         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
9031         (sens|respons)e and \g'1'ibility
9032
9033       PCRE2  supports an extension to Oniguruma: if a number is preceded by a
9034       plus or a minus sign it is taken as a relative reference. For example:
9035
9036         (abc)(?i:\g<-1>)
9037
9038       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
9039       synonymous.  The  former is a backreference; the latter is a subroutine
9040       call.
9041
9042
9043CALLOUTS
9044
9045       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
9046       Perl  code to be obeyed in the middle of matching a regular expression.
9047       This makes it possible, amongst other things, to extract different sub-
9048       strings that match the same pair of parentheses when there is a repeti-
9049       tion.
9050
9051       PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
9052       trary  Perl  code. The feature is called "callout". The caller of PCRE2
9053       provides an external function by putting its entry  point  in  a  match
9054       context  using  the function pcre2_set_callout(), and then passing that
9055       context to pcre2_match() or pcre2_dfa_match(). If no match  context  is
9056       passed, or if the callout entry point is set to NULL, callouts are dis-
9057       abled.
9058
9059       Within a regular expression, (?C<arg>) indicates a point at  which  the
9060       external  function  is  to  be  called. There are two kinds of callout:
9061       those with a numerical argument and those with a string argument.  (?C)
9062       on  its  own with no argument is treated as (?C0). A numerical argument
9063       allows the  application  to  distinguish  between  different  callouts.
9064       String  arguments  were added for release 10.20 to make it possible for
9065       script languages that use PCRE2 to embed short scripts within  patterns
9066       in a similar way to Perl.
9067
9068       During matching, when PCRE2 reaches a callout point, the external func-
9069       tion is called. It is provided with the number or  string  argument  of
9070       the  callout, the position in the pattern, and one item of data that is
9071       also set in the match block. The callout function may cause matching to
9072       proceed, to backtrack, or to fail.
9073
9074       By  default,  PCRE2  implements  a  number of optimizations at matching
9075       time, and one side-effect is that sometimes callouts  are  skipped.  If
9076       you  need all possible callouts to happen, you need to set options that
9077       disable the relevant optimizations. More details, including a  complete
9078       description  of  the programming interface to the callout function, are
9079       given in the pcre2callout documentation.
9080
9081   Callouts with numerical arguments
9082
9083       If you just want to have  a  means  of  identifying  different  callout
9084       points,  put  a  number  less than 256 after the letter C. For example,
9085       this pattern has two callout points:
9086
9087         (?C1)abc(?C2)def
9088
9089       If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(),  numerical
9090       callouts  are  automatically installed before each item in the pattern.
9091       They are all numbered 255. If there is a conditional group in the  pat-
9092       tern whose condition is an assertion, an additional callout is inserted
9093       just before the condition. An explicit callout may also be set at  this
9094       position, as in this example:
9095
9096         (?(?C9)(?=a)abc|def)
9097
9098       Note that this applies only to assertion conditions, not to other types
9099       of condition.
9100
9101   Callouts with string arguments
9102
9103       A delimited string may be used instead of a number as a  callout  argu-
9104       ment.  The  starting  delimiter  must be one of ` ' " ^ % # $ { and the
9105       ending delimiter is the same as the start, except for {, where the end-
9106       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
9107       string, it must be doubled. For example:
9108
9109         (?C'ab ''c'' d')xyz(?C{any text})pqr
9110
9111       The doubling is removed before the string  is  passed  to  the  callout
9112       function.
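
       The delimiter rules can be sketched as a small decoder. This is an
       illustration in Python, not PCRE2's own parser; the function name is
       invented, and its argument is assumed to be the text found between
       (?C and the closing parenthesis of the callout.

```python
CLOSERS = {"`": "`", "'": "'", '"': '"', "^": "^", "%": "%",
           "#": "#", "$": "$", "{": "}"}


def parse_callout_string(arg):
    """Decode a delimited callout argument, removing doubled delimiters."""
    if not arg or arg[0] not in CLOSERS:
        raise ValueError("unrecognized starting delimiter")
    close = CLOSERS[arg[0]]
    out, i = [], 1
    while i < len(arg):
        if arg[i] == close:
            if arg[i + 1:i + 2] == close:   # doubled delimiter: keep one copy
                out.append(close)
                i += 2
                continue
            if i != len(arg) - 1:
                raise ValueError("text after the ending delimiter")
            return "".join(out)
        out.append(arg[i])
        i += 1
    raise ValueError("missing ending delimiter")


print(parse_callout_string("'ab ''c'' d'"))
print(parse_callout_string("{any text}"))
```

       For the two callouts in the example above, the decoded strings are
       ab 'c' d and any text.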
9113
9114
9115BACKTRACKING CONTROL
9116
9117       There  are  a  number  of  special "Backtracking Control Verbs" (to use
9118       Perl's terminology) that modify the behaviour  of  backtracking  during
9119       matching.  They are generally of the form (*VERB) or (*VERB:NAME). Some
9120       verbs take either form, and may behave differently depending on whether
9121       or  not  a  name  argument is present. The names are not required to be
9122       unique within the pattern.
9123
9124       By default, for compatibility with Perl, a  name  is  any  sequence  of
9125       characters that does not include a closing parenthesis. The name is not
9126       processed in any way, and it is  not  possible  to  include  a  closing
9127       parenthesis   in  the  name.   This  can  be  changed  by  setting  the
9128       PCRE2_ALT_VERBNAMES option, but the result is no  longer  Perl-compati-
9129       ble.
9130
9131       When  PCRE2_ALT_VERBNAMES  is  set,  backslash processing is applied to
9132       verb names and only an unescaped  closing  parenthesis  terminates  the
9133       name.  However, the only backslash items that are permitted are \Q, \E,
9134       and sequences such as \x{100} that define character code points.  Char-
9135       acter type escapes such as \d are faulted.
9136
9137       A closing parenthesis can be included in a name either as \) or between
9138       \Q and \E. In addition to backslash processing, if  the  PCRE2_EXTENDED
9139       or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
9140       names is skipped, and #-comments are recognized, exactly as in the rest
9141       of  the  pattern.  PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
9142       verb names unless PCRE2_ALT_VERBNAMES is also set.
9143
9144       The maximum length of a name is 255 in the 8-bit library and  65535  in
9145       the  16-bit and 32-bit libraries. If the name is empty, that is, if the
9146       closing parenthesis immediately follows the colon, the effect is as  if
9147       the colon were not there. Any number of these verbs may occur in a pat-
9148       tern. Except for (*ACCEPT), they may not be quantified.
9149
9150       Since these verbs are specifically related  to  backtracking,  most  of
9151       them  can be used only when the pattern is to be matched using the tra-
9152       ditional matching function, because that uses a backtracking algorithm.
9153       With  the  exception  of (*FAIL), which behaves like a failing negative
9154       assertion, the backtracking control verbs cause an error if encountered
9155       by the DFA matching function.
9156
9157       The  behaviour  of  these  verbs in repeated groups, assertions, and in
9158       capture groups called as subroutines (whether or  not  recursively)  is
9159       documented below.
9160
9161   Optimizations that affect backtracking verbs
9162
9163       PCRE2 contains some optimizations that are used to speed up matching by
9164       running some checks at the start of each match attempt. For example, it
9165       may  know  the minimum length of matching subject, or that a particular
9166       character must be present. When one of these optimizations bypasses the
9167       running  of  a  match,  any  included  backtracking  verbs will not, of
9168       course, be processed. You can suppress the start-of-match optimizations
9169       by  setting  the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
9170       pile(), or by starting the pattern with (*NO_START_OPT). There is  more
9171       discussion of this option in the section entitled "Compiling a pattern"
9172       in the pcre2api documentation.
9173
9174       Experiments with Perl suggest that it too  has  similar  optimizations,
9175       and, as in PCRE2, turning them off can change the result of a match.
9176
9177   Verbs that act immediately
9178
9179       The following verbs act as soon as they are encountered.
9180
9181         (*ACCEPT) or (*ACCEPT:NAME)
9182
9183       This  verb causes the match to end successfully, skipping the remainder
9184       of the pattern. However, when it is inside  a  capture  group  that  is
9185       called as a subroutine, only that group is ended successfully. Matching
9186       then continues at the outer level. If (*ACCEPT) is triggered in a posi-
9187       tive  assertion,  the  assertion succeeds; in a negative assertion, the
9188       assertion fails.
9189
9190       If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap-
9191       tured. For example:
9192
9193         A((?:A|B(*ACCEPT)|C)D)
9194
9195       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
9196       tured by the outer parentheses.
9197
9198       (*ACCEPT) is the only backtracking verb that is allowed to  be  quanti-
9199       fied  because  an  ungreedy  quantification with a minimum of zero acts
9200       only when a backtrack happens. Consider, for example,
9201
9202         (A(*ACCEPT)??B)C
9203
9204       where A, B, and C may be complex expressions. After matching  "A",  the
9205       matcher  processes  "BC"; if that fails, causing a backtrack, (*ACCEPT)
9206       is triggered and the match succeeds. In both cases, all but C  is  cap-
9207       tured.  Whereas  (*COMMIT) (see below) means "fail on backtrack", a re-
9208       peated (*ACCEPT) of this type means "succeed on backtrack".
9209
9210       Warning: (*ACCEPT) should not be used within a script  run  group,  be-
9211       cause  it causes an immediate exit from the group, bypassing the script
9212       run checking.
9213
9214         (*FAIL) or (*FAIL:NAME)
9215
9216       This verb causes a matching failure, forcing backtracking to occur.  It
9217       may  be  abbreviated  to  (*F).  It is equivalent to (?!) but easier to
9218       read. The Perl documentation notes that it is probably useful only when
9219       combined with (?{}) or (??{}). Those are, of course, Perl features that
9220       are not present in PCRE2. The nearest equivalent is  the  callout  fea-
9221       ture, as for example in this pattern:
9222
9223         a+(?C)(*FAIL)
9224
9225       A  match  with the string "aaaa" always fails, but the callout is taken
9226       before each backtrack happens (in this example, 10 times).
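
       The figure of 10 can be checked by counting. The sketch below is
       plain Python that imitates the backtracking of a+ at each starting po-
       sition; it is not the PCRE2 matcher, the function name is invented,
       and it assumes that no start-of-match optimization skips an attempt.

```python
def count_callouts(subject):
    """Count how often (?C) is reached in a+(?C)(*FAIL)."""
    calls = 0
    for start in range(len(subject) + 1):      # bumpalong over start positions
        longest = start
        while longest < len(subject) and subject[longest] == "a":
            longest += 1
        # Greedy a+ takes the longest run first, then gives back one
        # character per backtrack; the callout is reached for each length.
        for _end in range(longest, start, -1):
            calls += 1                         # (?C) reached, then (*FAIL)
    return calls


print(count_callouts("aaaa"))   # 4 + 3 + 2 + 1 = 10
```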
9227
9228       (*ACCEPT:NAME) and (*FAIL:NAME) behave the  same  as  (*MARK:NAME)(*AC-
9229       CEPT)  and  (*MARK:NAME)(*FAIL),  respectively,  that  is, a (*MARK) is
9230       recorded just before the verb acts.
9231
9232   Recording which path was taken
9233
9234       There is one verb whose main purpose is to track how a  match  was  ar-
9235       rived  at,  though  it also has a secondary use in conjunction with ad-
9236       vancing the match starting point (see (*SKIP) below).
9237
9238         (*MARK:NAME) or (*:NAME)
9239
9240       A name is always required with this verb. For all the other  backtrack-
9241       ing control verbs, a NAME argument is optional.
9242
9243       When a match succeeds, the last-encountered mark name on the matching
9244       path is passed back to the caller as described in the section entitled
9245       "Other information about the match" in the pcre2api documentation. This
9246       applies to all instances of (*MARK) and other verbs,
9247       including those inside assertions and atomic groups. However, there are
9248       differences in those cases when (*MARK) is  used  in  conjunction  with
9249       (*SKIP) as described below.
9250
9251       The  mark name that was last encountered on the matching path is passed
9252       back. A verb without a NAME argument is ignored for this purpose.  Here
9253       is  an  example of pcre2test output, where the "mark" modifier requests
9254       the retrieval and outputting of (*MARK) data:
9255
9256           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
9257         data> XY
9258          0: XY
9259         MK: A
9260         data> XZ
9261          0: XZ
9262         MK: B
9263
9264       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
9265       ple  it indicates which of the two alternatives matched. This is a more
9266       efficient way of obtaining this information than putting each  alterna-
9267       tive in its own capturing parentheses.
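
       For comparison, the capturing-parentheses approach mentioned above
       looks like this in an engine that has no (*MARK). The sketch uses
       Python's standard re module; the group names A and B and the function
       name are chosen only to mirror the mark names in the example.

```python
import re

# Each alternative gets its own named capture group; lastgroup then
# reports which of them took part in the match.
pattern = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    m = pattern.search(subject)
    return m.lastgroup if m else None


for subject in ("XY", "XZ", "XP"):
    print(subject, which_alternative(subject))
```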
9268
9269       If  a  verb  with a name is encountered in a positive assertion that is
9270       true, the name is recorded and passed back if it  is  the  last-encoun-
9271       tered. This does not happen for negative assertions or failing positive
9272       assertions.
9273
9274       After a partial match or a failed match, the last encountered  name  in
9275       the entire match process is returned. For example:
9276
9277           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
9278         data> XP
9279         No match, mark = B
9280
9281       Note  that  in  this  unanchored  example the mark is retained from the
9282       match attempt that started at the letter "X" in the subject. Subsequent
9283       match attempts starting at "P" and then with an empty string do not get
9284       as far as the (*MARK) item, but nevertheless do not reset it.
9285
9286       If you are interested in  (*MARK)  values  after  failed  matches,  you
9287       should  probably  set the PCRE2_NO_START_OPTIMIZE option (see above) to
9288       ensure that the match is always attempted.
9289
9290   Verbs that act after backtracking
9291
9292       The following verbs do nothing when they are encountered. Matching con-
9293       tinues  with  what follows, but if there is a subsequent match failure,
9294       causing a backtrack to the verb, a failure is forced.  That  is,  back-
9295       tracking  cannot  pass  to  the  left of the verb. However, when one of
9296       these verbs appears inside an atomic group or in a lookaround assertion
9297       that  is  true,  its effect is confined to that group, because once the
9298       group has been matched, there is never any backtracking into it.  Back-
9299       tracking from beyond an assertion or an atomic group ignores the entire
9300       group, and seeks a preceding backtracking point.
9301
9302       These verbs differ in exactly what kind of failure  occurs  when  back-
9303       tracking  reaches  them.  The behaviour described below is what happens
9304       when the verb is not in a subroutine or an assertion.  Subsequent  sec-
9305       tions cover these special cases.
9306
9307         (*COMMIT) or (*COMMIT:NAME)
9308
9309       This  verb  causes the whole match to fail outright if there is a later
9310       matching failure that causes backtracking to reach it. Even if the pat-
9311       tern  is  unanchored,  no further attempts to find a match by advancing
9312       the starting point take place. If (*COMMIT) is  the  only  backtracking
9313       verb that is encountered, once it has been passed pcre2_match() is com-
9314       mitted to finding a match at the current starting point, or not at all.
9315       For example:
9316
9317         a+(*COMMIT)b
9318
9319       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
9320       of dynamic anchor, or "I've started, so I must finish."
9321
9322       The behaviour of (*COMMIT:NAME) is not the same  as  (*MARK:NAME)(*COM-
9323       MIT).  It is like (*MARK:NAME) in that the name is remembered for pass-
9324       ing back to the caller. However, (*SKIP:NAME) searches only  for  names
9325       that are set with (*MARK), ignoring those set by any of the other back-
9326       tracking verbs.
9327
9328       If there is more than one backtracking verb in a pattern,  a  different
9329       one  that  follows  (*COMMIT) may be triggered first, so merely passing
9330       (*COMMIT) during a match does not always guarantee that a match must be
9331       at this starting point.
9332
9333       Note that (*COMMIT) at the start of a pattern is not the same as an an-
9334       chor, unless PCRE2's start-of-match optimizations are  turned  off,  as
9335       shown in this output from pcre2test:
9336
9337           re> /(*COMMIT)abc/
9338         data> xyzabc
9339          0: abc
9340         data>
9341           re> /(*COMMIT)abc/no_start_optimize
9342         data> xyzabc
9343         No match
9344
9345       For  the first pattern, PCRE2 knows that any match must start with "a",
9346       so the optimization skips along the subject to "a" before applying  the
9347       pattern  to the first set of data. The match attempt then succeeds. The
9348       second pattern disables the optimization that skips along to the  first
9349       character.  The  pattern  is  now  applied  starting at "x", and so the
9350       (*COMMIT) causes the match to fail without trying  any  other  starting
9351       points.
9352
9353         (*PRUNE) or (*PRUNE:NAME)
9354
9355       This  verb causes the match to fail at the current starting position in
9356       the subject if there is a later matching failure that causes backtrack-
9357       ing  to  reach it. If the pattern is unanchored, the normal "bumpalong"
9358       advance to the next starting character then happens.  Backtracking  can
9359       occur  as  usual to the left of (*PRUNE), before it is reached, or when
9360       matching to the right of (*PRUNE), but if there  is  no  match  to  the
9361       right,  backtracking cannot cross (*PRUNE). In simple cases, the use of
9362       (*PRUNE) is just an alternative to an atomic group or possessive  quan-
9363       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
9364       any other way. In an anchored pattern (*PRUNE) has the same  effect  as
9365       (*COMMIT).
9366
9367       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
9368       It is like (*MARK:NAME) in that the name is remembered for passing back
9369       to  the  caller. However, (*SKIP:NAME) searches only for names set with
9370       (*MARK), ignoring those set by other backtracking verbs.
9371
9372         (*SKIP)
9373
9374       This verb, when given without a name, is like (*PRUNE), except that  if
9375       the  pattern  is unanchored, the "bumpalong" advance is not to the next
9376       character, but to the position in the subject where (*SKIP) was encoun-
9377       tered.  (*SKIP)  signifies that whatever text was matched leading up to
9378       it cannot be part of a successful match if there is a  later  mismatch.
9379       Consider:
9380
9381         a+(*SKIP)b
9382
9383       If  the  subject  is  "aaaac...",  after  the first match attempt fails
9384       (starting at the first character in the  string),  the  starting  point
9385       skips on to start the next attempt at "c". Note that a possessive quan-
9386       tifier does not have the same effect as this example; although it would
9387       suppress  backtracking  during  the first match attempt, the second at-
9388       tempt would start at the second character instead  of  skipping  on  to
9389       "c".
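
       The effect on the starting positions can be sketched with a small model
       written in Python. This is an illustration only, not PCRE2 code: the
       function start_offsets() is hypothetical, it is hard-wired to the
       patterns a+b and a+(*SKIP)b, and it ignores the start-of-match
       optimizations that PCRE2 would normally apply.

```python
def start_offsets(subject, skip):
    """Return the subject offsets at which a match attempt is started.

    Models a+(*SKIP)b when skip is True, and a+b (or a++b) when it is False.
    The scan stops at the first successful match.
    """
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                      # a+ takes the whole run of "a"s
        if end > start and subject[end:end + 1] == "b":
            break                         # overall match found
        if skip and end > start:
            start = end                   # (*SKIP) was passed at offset `end`
        else:
            start += 1                    # normal "bumpalong"
    return tried


print(start_offsets("aaaac", skip=False))   # [0, 1, 2, 3, 4]
print(start_offsets("aaaac", skip=True))    # [0, 4]
```

       In this model a possessive a++b behaves like the skip=False case: it
       changes what happens inside one attempt, not where the next attempt
       starts.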
9390
9391       If  (*SKIP) is used to specify a new starting position that is the same
9392       as the starting position of the current match, or (by  being  inside  a
9393       lookbehind)  earlier, the position specified by (*SKIP) is ignored, and
9394       instead the normal "bumpalong" occurs.
9395
9396         (*SKIP:NAME)
9397
9398       When (*SKIP) has an associated name, its behaviour  is  modified.  When
9399       such  a  (*SKIP) is triggered, the previous path through the pattern is
9400       searched for the most recent (*MARK) that has the same name. If one  is
9401       found,  the  "bumpalong" advance is to the subject position that corre-
9402       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
9403       no (*MARK) with a matching name is found, the (*SKIP) is ignored.
9404
9405       The  search  for a (*MARK) name uses the normal backtracking mechanism,
9406       which means that it does not  see  (*MARK)  settings  that  are  inside
9407       atomic groups or assertions, because they are never re-entered by back-
9408       tracking. Compare the following pcre2test examples:
9409
9410           re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
9411         data: abc
9412          0: a
9413          1: a
9414         data:
9415           re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
9416         data: abc
9417          0: b
9418          1: b
9419
9420       In the first example, the (*MARK) setting is in an atomic group, so  it
9421       is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
9422       This allows the second branch of the pattern to be tried at  the  first
9423       character  position.  In the second example, the (*MARK) setting is not
9424       in an atomic group. This allows (*SKIP:X) to find the (*MARK)  when  it
9425       backtracks, and this causes a new matching attempt to start at the sec-
9426       ond character. This time, the (*MARK) is never seen  because  "a"  does
9427       not match "b", so the matcher immediately jumps to the second branch of
9428       the pattern.
9429
9430       Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME).  It
9431       ignores names that are set by other backtracking verbs.
9432
9433         (*THEN) or (*THEN:NAME)
9434
9435       This  verb  causes  a skip to the next innermost alternative when back-
9436       tracking reaches it. That  is,  it  cancels  any  further  backtracking
9437       within  the  current  alternative.  Its name comes from the observation
9438       that it can be used for a pattern-based if-then-else block:
9439
9440         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
9441
9442       If the COND1 pattern matches, FOO is tried (and possibly further  items
9443       after  the  end  of the group if FOO succeeds); on failure, the matcher
9444       skips to the second alternative and tries COND2,  without  backtracking
9445       into  COND1.  If that succeeds and BAR fails, COND3 is tried. If subse-
9446       quently BAZ fails, there are no more alternatives, so there is a  back-
9447       track  to  whatever came before the entire group. If (*THEN) is not in-
9448       side an alternation, it acts like (*PRUNE).
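
       The if-then-else behaviour can be modelled outside a regular expres-
       sion. The following Python sketch is an analogy, not PCRE2 code: the
       function try_alternatives() and the sample data are hypothetical. Each
       alternative is represented by the list of ways its condition can match
       and by a test for the rest of the alternative.

```python
def try_alternatives(alternatives, use_then):
    """Return the index of the first alternative that succeeds, or None.

    Each alternative is a pair (cond_matches, rest): cond_matches lists the
    ways COND can match, in the order backtracking would try them, and
    rest(m) reports whether the remainder of the alternative matches after m.
    """
    for index, (cond_matches, rest) in enumerate(alternatives):
        for m in cond_matches:
            if rest(m):
                return index
            if use_then:
                break        # (*THEN): no backtracking into COND
    return None              # caller backtracks to whatever precedes the group


alternatives = [
    (["ab", "a"], lambda m: m == "a"),    # COND1 (*THEN) FOO
    (["x"], lambda m: True),              # COND2 (*THEN) BAR
]
print(try_alternatives(alternatives, use_then=False))   # 0
print(try_alternatives(alternatives, use_then=True))    # 1
```

       Without (*THEN), the failure of FOO lets the matcher retry COND1 with
       its shorter match, so the first alternative wins. With (*THEN), that
       retry is cancelled and the second alternative is used.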
9449
9450       The behaviour of (*THEN:NAME) is not the same  as  (*MARK:NAME)(*THEN).
9451       It is like (*MARK:NAME) in that the name is remembered for passing back
9452       to the caller. However, (*SKIP:NAME) searches only for names  set  with
9453       (*MARK), ignoring those set by other backtracking verbs.
9454
9455       A  group  that does not contain a | character is just a part of the en-
9456       closing alternative; it is not a nested alternation with only  one  al-
9457       ternative. The effect of (*THEN) extends beyond such a group to the en-
9458       closing alternative.  Consider this pattern, where A, B, etc. are  com-
9459       plex  pattern  fragments  that  do not contain any | characters at this
9460       level:
9461
9462         A (B(*THEN)C) | D
9463
9464       If A and B are matched, but there is a failure in C, matching does  not
9465       backtrack into A; instead it moves to the next alternative, that is, D.
9466       However, if the group containing (*THEN) is given  an  alternative,  it
9467       behaves differently:
9468
9469         A (B(*THEN)C | (*FAIL)) | D
9470
9471       The effect of (*THEN) is now confined to the inner group. After a fail-
9472       ure in C, matching moves to (*FAIL), which causes the  whole  group  to
9473       fail  because  there  are  no  more  alternatives to try. In this case,
9474       matching does backtrack into A.
9475
9476       Note that a conditional group is not considered as having two  alterna-
9477       tives,  because  only one is ever used. In other words, the | character
9478       in a conditional group has a different meaning. Ignoring  white  space,
9479       consider:
9480
9481         ^.*? (?(?=a) a | b(*THEN)c )
9482
9483       If the subject is "ba", this pattern does not match. Because .*? is un-
9484       greedy, it initially matches zero characters. The condition (?=a)  then
9485       fails,  the  character  "b"  is matched, but "c" is not. At this point,
9486       matching does not backtrack to .*? as might perhaps  be  expected  from
9487       the  presence  of the | character. The conditional group is part of the
9488       single alternative that comprises the whole pattern, and so  the  match
9489       fails.  (If  there  was a backtrack into .*?, allowing it to match "b",
9490       the match would succeed.)
9491
9492       The verbs just described provide four different "strengths" of  control
9493       when subsequent matching fails. (*THEN) is the weakest, carrying on the
9494       match at the next alternative. (*PRUNE) comes next, failing  the  match
9495       at  the  current starting position, but allowing an advance to the next
9496       character (for an unanchored pattern). (*SKIP) is similar, except  that
9497       the advance may be more than one character. (*COMMIT) is the strongest,
9498       causing the entire match to fail.
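
       For the three verbs that affect the starting position, the choice of
       the next match attempt can be written as a small Python function. This
       is a model of the rules stated above, not PCRE2 code: next_start() is
       hypothetical and offsets are counted in characters. (*THEN) is left
       out because it selects another alternative rather than another start-
       ing position.

```python
def next_start(verb, start, verb_offset=None, anchored=False):
    """Return the offset of the next match attempt, or None if there is none.

    `start` is where the failed attempt began and `verb_offset` is the
    subject position at which a (*SKIP) was encountered.
    """
    if verb not in ("PRUNE", "SKIP", "COMMIT"):
        raise ValueError("unsupported verb: " + verb)
    if anchored or verb == "COMMIT":
        return None                   # no other starting point is tried
    if verb == "SKIP" and verb_offset is not None and verb_offset > start:
        return verb_offset            # jump to where (*SKIP) was passed
    return start + 1                  # normal "bumpalong"


print(next_start("PRUNE", 0))                    # 1
print(next_start("SKIP", 0, verb_offset=4))      # 4
print(next_start("SKIP", 2, verb_offset=2))      # 3
print(next_start("COMMIT", 0))                   # None
```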
9499
9500   More than one backtracking verb
9501
9502       If more than one backtracking verb is present in  a  pattern,  the  one
9503       that  is  backtracked  onto first acts. For example, consider this pat-
9504       tern, where A, B, etc. are complex pattern fragments:
9505
9506         (A(*COMMIT)B(*THEN)C|ABD)
9507
9508       If A matches but B fails, the backtrack to (*COMMIT) causes the  entire
9509       match to fail. However, if A and B match, but C fails, the backtrack to
9510       (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour
9511       is  consistent,  but is not always the same as Perl's. It means that if
9512       two or more backtracking verbs appear in succession, all but  the  last
9513       of them have no effect. Consider this example:
9514
9515         ...(*COMMIT)(*PRUNE)...
9516
9517       If there is a matching failure to the right, backtracking onto (*PRUNE)
9518       causes it to be triggered, and its action is taken. There can never  be
9519       a backtrack onto (*COMMIT).
9520
9521   Backtracking verbs in repeated groups
9522
9523       PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9524       in repeated groups. For example, consider:
9525
9526         /(a(*COMMIT)b)+ac/
9527
9528       If the subject is "abac", Perl matches  unless  its  optimizations  are
9529       disabled,  but  PCRE2  always fails because the (*COMMIT) in the second
9530       repeat of the group acts.
9531
9532   Backtracking verbs in assertions
9533
9534       (*FAIL) in any assertion has its normal effect: it forces an  immediate
9535       backtrack.  The  behaviour  of  the other backtracking verbs depends on
9536       whether the assertion is standalone or is acting as  the  condition  in
9537       a conditional group.
9538
9539       (*ACCEPT)  in  a  standalone positive assertion causes the assertion to
9540       succeed without any further processing; captured  strings  and  a  mark
9541       name  (if  set) are retained. In a standalone negative assertion, (*AC-
9542       CEPT) causes the assertion to fail without any further processing; cap-
9543       tured substrings and any mark name are discarded.
9544
9545       If  the  assertion is a condition, (*ACCEPT) causes the condition to be
9546       true for a positive assertion and false for a  negative  one;  captured
9547       substrings are retained in both cases.
9548
9549       The remaining verbs act only when a later failure causes a backtrack to
9550       reach them. This means that, for the Perl-compatible assertions,  their
9551       effect is confined to the assertion, because Perl lookaround assertions
9552       are atomic. A backtrack that occurs after such an assertion is complete
9553       does  not  jump  back  into  the  assertion.  Note in particular that a
9554       (*MARK) name that is set in an assertion is not "seen" by  an  instance
9555       of (*SKIP:NAME) later in the pattern.
9556
9557       PCRE2  now supports non-atomic positive assertions, as described in the
9558       section entitled "Non-atomic assertions" above. These  assertions  must
9559       be  standalone  (not used as conditions). They are not Perl-compatible.
9560       For these assertions, a later backtrack does jump back into the  asser-
9561       tion,  and  therefore verbs such as (*COMMIT) can be triggered by back-
9562       tracks from later in the pattern.
9563
9564       The effect of (*THEN) is not allowed to escape beyond an assertion.  If
9565       there  are no more branches to try, (*THEN) causes a positive assertion
9566       to be false, and a negative assertion to be true.
9567
9568       The other backtracking verbs are not treated specially if  they  appear
9569       in  a  standalone  positive assertion. In a conditional positive asser-
9570       tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
9571       or  (*PRUNE) causes the condition to be false. However, for both stand-
9572       alone and conditional negative assertions, backtracking into (*COMMIT),
9573       (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9574       ing any further alternative branches.
9575
9576   Backtracking verbs in subroutines
9577
9578       These behaviours occur whether or not the group is called recursively.
9579
9580       (*ACCEPT) in a group called as a subroutine causes the subroutine match
9581       to  succeed without any further processing. Matching then continues af-
9582       ter the subroutine call. Perl documents this behaviour.  Perl's  treat-
9583       ment of the other verbs in subroutines is different in some cases.
9584
9585       (*FAIL)  in  a  group  called as a subroutine has its normal effect: it
9586       forces an immediate backtrack.
9587
9588       (*COMMIT), (*SKIP), and (*PRUNE) cause the  subroutine  match  to  fail
9589       when  triggered  by being backtracked to in a group called as a subrou-
9590       tine. There is then a backtrack at the outer level.
9591
9592       (*THEN), when triggered, skips to the next alternative in the innermost
9593       enclosing  group that has alternatives (its normal behaviour). However,
9594       if there is no such group within the subroutine's group, the subroutine
9595       match fails and there is a backtrack at the outer level.
9596
9597
9598SEE ALSO
9599
9600       pcre2api(3),    pcre2callout(3),    pcre2matching(3),   pcre2syntax(3),
9601       pcre2(3).
9602
9603
9604AUTHOR
9605
9606       Philip Hazel
9607       University Computing Service
9608       Cambridge, England.
9609
9610
9611REVISION
9612
9613       Last updated: 06 October 2020
9614       Copyright (c) 1997-2020 University of Cambridge.
9615------------------------------------------------------------------------------
9616
9617
9618PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
9619
9620
9621
9622NAME
9623       PCRE2 - Perl-compatible regular expressions (revised API)
9624
9625PCRE2 PERFORMANCE
9626
9627       Two  aspects  of performance are discussed below: memory usage and pro-
9628       cessing time. The way you express your pattern as a regular  expression
9629       can affect both of them.
9630
9631
9632COMPILED PATTERN MEMORY USAGE
9633
9634       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
9635       code, so that most simple patterns do not use much memory  for  storing
9636       the compiled version. However, there is one case where the memory usage
9637       of a compiled pattern can be unexpectedly  large.  If  a  parenthesized
9638       group  has  a quantifier with a minimum greater than 1 and/or a limited
9639       maximum, the whole group is repeated in the compiled code. For example,
9640       the pattern
9641
9642         (abc|def){2,4}
9643
9644       is compiled as if it were
9645
9646         (abc|def)(abc|def)((abc|def)(abc|def)?)?
9647
9648       (Technical  aside:  It is done this way so that backtrack points within
9649       each of the repetitions can be independently maintained.)
9650
9651       For regular expressions whose quantifiers use only small numbers,  this
9652       is  not  usually a problem. However, if the numbers are large, and par-
9653       ticularly if such repetitions are nested, the memory usage  can  become
9654       an embarrassment. For example, the very simple pattern
9655
9656         ((ab){1,1000}c){1,3}
9657
9658       uses  over  50KiB  when compiled using the 8-bit library. When PCRE2 is
9659       compiled with its default internal pointer size of two bytes, the  size
9660       limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9661       libraries, and this is reached with the above pattern if the outer rep-
9662       etition  is  increased from 3 to 4. PCRE2 can be compiled to use larger
9663       internal pointers and thus handle larger compiled patterns, but  it  is
9664       better to try to rewrite your pattern to use less memory if you can.
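
       The replication described above can be made concrete with a short
       Python sketch. It models the "as if" rewriting only and is not how
       PCRE2's compiler is implemented: expand_repeat() is a hypothetical
       helper, it handles only finite maxima, and it counts copies of a
       group rather than code units of compiled pattern.

```python
def expand_repeat(group, minimum, maximum):
    """Spell out group{minimum,maximum} as repeated copies of `group`.

    `group` must already be parenthesized, and `maximum` must be finite and
    at least as large as `minimum`.
    """
    optional = ""
    for _ in range(maximum - minimum):
        if optional:
            # Wrap the optional copies built so far inside one more copy.
            optional = "(" + group + optional + ")?"
        else:
            optional = group + "?"
    return group * minimum + optional


print(expand_repeat("(abc|def)", 2, 4))
# (abc|def)(abc|def)((abc|def)(abc|def)?)?

inner = expand_repeat("(ab)", 1, 1000)
print(expand_repeat("(" + inner + "c)", 1, 3).count("(ab)"))   # 3000
```

       The counts show why nesting is expensive: the copies multiply, giving
       3000 copies of (ab) for an outer limit of 3 and 4000 for an outer
       limit of 4.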
9665
9666       One  way  of reducing the memory usage for such patterns is to make use
9667       of PCRE2's "subroutine" facility. Re-writing the above pattern as
9668
9669         ((ab)(?2){0,999}c)(?1){0,2}
9670
9671       reduces the memory requirements to around 16KiB, and indeed it  remains
9672       under  20KiB  even with the outer repetition increased to 100. However,
9673       this kind of pattern is not always exactly equivalent, because any cap-
9674       tures  within  subroutine calls are lost when the subroutine completes.
9675       If this is not a problem, this kind of  rewriting  will  allow  you  to
9676       process  patterns that PCRE2 cannot otherwise handle. The matching per-
9677       formance of the two different versions of the pattern  is  roughly  the
9678       same.  (This applies from release 10.30 - things were different in ear-
9679       lier releases.)
9680
9681
9682STACK AND HEAP USAGE AT RUN TIME
9683
9684       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9685       uses  very  little system stack at run time. In earlier releases recur-
9686       sive function calls could use a great deal of  stack,  and  this  could
9687       cause  problems, but this usage has been eliminated. Backtracking posi-
9688       tions are now explicitly remembered in memory frames controlled by  the
9689       code.  An  initial  20KiB  vector  of frames is allocated on the system
9690       stack (enough for about 100 frames for small patterns), but if this  is
9691       insufficient,  heap  memory  is  used. The amount of heap memory can be
9692       limited; if the limit is set to zero, only the initial stack vector  is
9693       used.  Rewriting patterns to be time-efficient, as described below, may
9694       also reduce the memory requirements.
9695
9696       In contrast to  pcre2_match(),  pcre2_dfa_match()  does  use  recursive
9697       function  calls,  but only for processing atomic groups, lookaround as-
9698       sertions, and recursion within the pattern. The original version of the
9699       code  used  to  allocate  quite large internal workspace vectors on the
9700       stack, which caused some problems for  some  patterns  in  environments
9701       with  small  stacks.  From release 10.32 the code for pcre2_dfa_match()
9702       has been re-factored to use heap memory  when  necessary  for  internal
9703       workspace  when  recursing,  though  recursive function calls are still
9704       used.
9705
9706       The "match depth" parameter can be used to limit the depth of  function
9707       recursion,  and  the  "match  heap"  parameter  to limit heap memory in
9708       pcre2_dfa_match().
9709
9710
9711PROCESSING TIME
9712
9713       Certain items in regular expression patterns are processed  more  effi-
9714       ciently than others. It is more efficient to use a character class like
9715       [aeiou]  than  a  set  of   single-character   alternatives   such   as
9716       (a|e|i|o|u).  In  general,  the simplest construction that provides the
9717       required behaviour is usually the most efficient. Jeffrey Friedl's book
9718       "Mastering Regular Expressions" contains a lot of useful general discus-
9719       sion about optimizing regular expressions for efficient performance.
9720       This document contains a few observations about PCRE2.
9721
9722       Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
9723       slow, because PCRE2 has to use a multi-stage table lookup  whenever  it
9724       needs  a  character's  property. If you can find an alternative pattern
9725       that does not use character properties, it will probably be faster.
9726
9727       By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
9728       character  classes  such  as  [:alpha:]  do not use Unicode properties,
9729       partly for backwards compatibility, and partly for performance reasons.
9730       However,  you  can  set  the PCRE2_UCP option or start the pattern with
9731       (*UCP) if you want Unicode character properties to be  used.  This  can
9732       double  the  matching  time  for  items  such  as \d, when matched with
9733       pcre2_match(); the performance loss is less with a DFA  matching  func-
9734       tion, and in both cases there is not much difference for \b.
9735
9736       When  a pattern begins with .* not in atomic parentheses, nor in paren-
9737       theses that are the subject of a backreference,  and  the  PCRE2_DOTALL
9738       option  is  set,  the pattern is implicitly anchored by PCRE2, since it
9739       can match only at the start of a subject string.  If  the  pattern  has
9740       multiple top-level branches, they must all be anchorable. The optimiza-
9741       tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is  au-
9742       tomatically disabled if the pattern contains (*PRUNE) or (*SKIP).
9743
9744       If  PCRE2_DOTALL  is  not set, PCRE2 cannot make this optimization, be-
9745       cause the dot metacharacter does not then match a newline, and  if  the
9746       subject  string contains newlines, the pattern may match from the char-
9747       acter immediately following one of them instead of from the very start.
9748       For example, the pattern
9749
9750         .*second
9751
9752       matches  the subject "first\nand second" (where \n stands for a newline
9753       character), with the match starting at the seventh character. In  order
9754       to  do  this, PCRE2 has to retry the match starting after every newline
9755       in the subject.
9756
9757       If you are using such a pattern with subject strings that do  not  con-
9758       tain   newlines,   the   best   performance   is  obtained  by  setting
9759       PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate  ex-
9760       plicit  anchoring.  That saves PCRE2 from having to scan along the sub-
9761       ject looking for a newline to restart at.
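
       The example can be reproduced with Python's built-in re module, used
       here as a stand-in on the assumption that it treats dot and a linefeed
       newline in the same way as PCRE2 does for this pattern. The helper
       match_start() is hypothetical, and the sketch shows only where the
       match begins, not how much scanning each engine does.

```python
import re


def match_start(pattern, subject, dotall=False):
    """Return the offset at which the first match of `pattern` begins."""
    flags = re.DOTALL if dotall else 0
    return re.search(pattern, subject, flags).start()


subject = "first\nand second"
print(match_start(r".*second", subject))                  # 6 (seventh character)
print(match_start(r".*second", subject, dotall=True))     # 0
print(match_start(r"^.*second", "no newlines, second"))   # 0
```

       Offset 6 is the seventh character, immediately after the newline. With
       the dotall setting, or with an explicit ^ on a subject that has no
       newlines, the only possible starting point is offset 0.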
9762
9763       Beware of patterns that contain nested indefinite  repeats.  These  can
9764       take  a  long time to run when applied to a string that does not match.
9765       Consider the pattern fragment
9766
9767         ^(a+)*
9768
9769       This can match "aaaa" in 16 different ways, and this  number  increases
9770       very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
9771       2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
9772       repeats  can  match  different numbers of times.) When the remainder of
9773       the pattern is such that the entire match is going to fail,  PCRE2  has
9774       in  principle to try every possible variation, and this can take an ex-
9775       tremely long time, even for relatively short strings.
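
       The figure of 16 can be checked by counting. The Python sketch below
       is a counting argument, not a model of the matcher itself, and
       count_prefix_matches() is a hypothetical name. It counts the ways of
       splitting each prefix of the subject into non-empty runs, one run per
       iteration of the group.

```python
def count_prefix_matches(n):
    """Count the distinct ways (a+)* can match a prefix of a run of n "a"s."""
    # ways[k] is the number of ways to split exactly k characters into
    # non-empty runs; the empty prefix can be matched in exactly one way.
    ways = [1] + [0] * n
    for k in range(1, n + 1):
        ways[k] = sum(ways[k - run] for run in range(1, k + 1))
    return sum(ways)


print(count_prefix_matches(4))    # 16
print(count_prefix_matches(20))   # 1048576
```

       The count doubles with each extra character, which is why a failing
       match can take so long even on a short subject.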
9776
9777       An optimization catches some of the more simple cases such as
9778
9779         (a+)*b
9780
9781       where a literal character follows. Before  embarking  on  the  standard
9782       matching  procedure, PCRE2 checks that there is a "b" later in the sub-
9783       ject string, and if there is not, it fails the match immediately.  How-
9784       ever,  when  there  is no following literal this optimization cannot be
9785       used. You can see the difference by comparing the behaviour of
9786
9787         (a+)*\d
9788
9789       with the pattern above. The pattern ending in "b" fails almost instantly
9790       when  applied to a whole line of "a" characters, whereas the \d version
9791       takes an appreciable time with strings longer than about 20 characters.
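
       The idea behind the check for a following literal can be sketched in
       Python. This is not PCRE2's implementation: search_with_precheck() and
       its "required" parameter are hypothetical, and Python's re module
       stands in for the matcher. PCRE2 works out the required character for
       itself when it compiles the pattern.

```python
import re


def search_with_precheck(pattern, required, subject):
    """Fail at once if `required` is absent; otherwise run the real search."""
    if required not in subject:
        return None          # no backtracking search is attempted at all
    return re.search(pattern, subject)


print(search_with_precheck(r"(a+)*b", "b", "a" * 40))          # None
print(search_with_precheck(r"(a+)*b", "b", "aaab").group())    # aaab
```

       The first call returns immediately because the expensive search is
       never started. No such shortcut is available for (a+)*\d, where no
       single literal character is required.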
9792
9793       In many cases, the solution to this kind of performance issue is to use
9794       an  atomic group or a possessive quantifier. This can often reduce mem-
9795       ory requirements as well. As another example, consider this pattern:
9796
9797         ([^<]|<(?!inet))+
9798
9799       It matches from wherever it starts until it encounters "<inet"  or  the
9800       end  of  the  data,  and is the kind of pattern that might be used when
9801       processing an XML file. Each iteration of the outer parentheses matches
9802       either  one  character that is not "<" or a "<" that is not followed by
9803       "inet". However, each time a parenthesis is processed,  a  backtracking
9804       position  is  passed,  so this formulation uses a memory frame for each
9805       matched character. For a long string, a lot of memory is required. Con-
9806       sider  now  this  rewritten  pattern,  which  matches  exactly the same
9807       strings:
9808
9809         ([^<]++|<(?!inet))+
9810
9811       This runs much faster, because sequences of characters that do not con-
9812       tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9813       sessive quantifier is used to stop any backtracking into  the  runs  of
9814       non-"<"  characters.  This  version also uses a lot less memory because
9815       entry to a new set of parentheses happens only  when  a  "<"  character
9816       that  is  not  followed by "inet" is encountered (and we assume this is
9817       relatively rare).
9818
9819       This example shows that one way of optimizing performance when matching
9820       long  subject strings is to write repeated parenthesized subpatterns to
9821       match more than one character whenever possible.
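
       The saving can be estimated by counting how many times the body of
       the repeated group has to match. The Python sketch below uses the re
       module as a stand-in and a hypothetical helper, group_iterations(). It
       writes + where the PCRE2 pattern has ++, on the assumption that the
       count for a successful left-to-right walk is the same. It counts it-
       erations only and does not measure PCRE2's memory frames.

```python
import re


def group_iterations(item_pattern, subject):
    """Count consecutive matches of the group body, starting at offset 0."""
    item = re.compile(item_pattern)
    pos = count = 0
    while True:
        m = item.match(subject, pos)
        if m is None or m.end() == pos:
            return count
        pos = m.end()
        count += 1


subject = "ab<b>cd<inet>x"
print(group_iterations(r"[^<]|<(?!inet)", subject))    # 7
print(group_iterations(r"[^<]+|<(?!inet)", subject))   # 3
```

       The first form needs one iteration per character up to "<inet". The
       second needs one per run of non-"<" characters plus one per "<" that
       is not followed by "inet".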
9822
9823   SETTING RESOURCE LIMITS
9824
9825       You can set limits on the amount of processing that  takes  place  when
9826       matching,  and  on  the amount of heap memory that is used. The default
9827       values of the limits are very large, and unlikely ever to operate. They
9828       can  be  changed  when  PCRE2  is  built, and they can also be set when
9829       pcre2_match() or pcre2_dfa_match() is called. For details of these  in-
9830       terfaces,  see  the  pcre2build  documentation and the section entitled
9831       "The match context" in the pcre2api documentation.
9832
9833       The pcre2test test program has a modifier called  "find_limits"  which,
9834       if  applied  to  a  subject line, causes it to find the smallest limits
9835       that allow a pattern to match. This is done by repeatedly matching with
9836       different limits.
9837
9838
9839AUTHOR
9840
9841       Philip Hazel
9842       University Computing Service
9843       Cambridge, England.
9844
9845
9846REVISION
9847
9848       Last updated: 03 February 2019
9849       Copyright (c) 1997-2019 University of Cambridge.
9850------------------------------------------------------------------------------
9851
9852
9853PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)
9854
9855
9856
9857NAME
9858       PCRE2 - Perl-compatible regular expressions (revised API)
9859
9860SYNOPSIS
9861
9862       #include <pcre2posix.h>
9863
9864       int pcre2_regcomp(regex_t *preg, const char *pattern,
9865            int cflags);
9866
9867       int pcre2_regexec(const regex_t *preg, const char *string,
9868            size_t nmatch, regmatch_t pmatch[], int eflags);
9869
9870       size_t pcre2_regerror(int errcode, const regex_t *preg,
9871            char *errbuf, size_t errbuf_size);
9872
9873       void pcre2_regfree(regex_t *preg);
9874
9875
9876DESCRIPTION
9877
9878       This  set of functions provides a POSIX-style API for the PCRE2 regular
9879       expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
9880       16-bit  and  32-bit libraries. See the pcre2api documentation for a de-
9881       scription of PCRE2's native API, which contains much  additional  func-
9882       tionality.
9883
9884       The functions described here are wrapper functions that ultimately call
9885       the PCRE2 native API. Their prototypes are defined in the  pcre2posix.h
9886       header  file, and they all have unique names starting with pcre2_. How-
9887       ever, the pcre2posix.h header also contains macro definitions that con-
9888       vert the standard POSIX names such as  regcomp()  into  pcre2_regcomp()
9889       etc.  This  means that a program can use the usual POSIX names without
9890       running the risk of accidentally linking with POSIX  functions  from  a
9891       different library.
9892
9893       On Unix-like systems the PCRE2 POSIX library is called  libpcre2-posix,
9894       so  can  be accessed by adding -lpcre2-posix to the command for linking
9895       an application. Because the POSIX functions call the native ones, it is
9896       also necessary to add -lpcre2-8.
9897
9898       Although they are not defined as prototypes in pcre2posix.h, the library
9899       does contain functions with the POSIX names regcomp() etc. These simply
9900       pass  their  arguments to the PCRE2 functions. These functions are pro-
9901       vided for backwards compatibility with earlier versions  of  PCRE2,  so
9902       that existing programs do not have to be recompiled.
9903
9904       Calling  the  header  file  pcre2posix.h avoids any conflict with other
9905       POSIX libraries. It can, of course, be renamed or aliased  as  regex.h,
9906       which  is  the  "correct"  name,  if there is no clash. It provides two
9907       structure types, regex_t for compiled internal  forms,  and  regmatch_t
9908       for returning captured substrings. It also defines some constants whose
9909       names start with "REG_"; these are used for setting options and identi-
9910       fying error codes.
9911
9912
9913USING THE POSIX FUNCTIONS
9914
9915       Those  POSIX  option bits that can reasonably be mapped to PCRE2 native
9916       options have been implemented. In addition, the option REG_EXTENDED  is
9917       defined  with  the  value  zero. This has no effect, but since programs
9918       that are written to the POSIX interface often use  it,  this  makes  it
9919       easier  to  slot in PCRE2 as a replacement library. Other POSIX options
9920       are not even defined.
9921
9922       There are also some options that are not defined by POSIX.  These  have
9923       been  added  at  the  request  of users who want to make use of certain
9924       PCRE2-specific features via the POSIX calling interface or to  add  BSD
9925       or GNU functionality.
9926
9927       When  PCRE2  is  called via these functions, it is only the API that is
9928       POSIX-like in style. The syntax and semantics of  the  regular  expres-
9929       sions  themselves  are  still  those of Perl, subject to the setting of
9930       various PCRE2 options, as described below. "POSIX-like in style"  means
9931       that  the  API  approximates  to  the POSIX definition; it is not fully
9932       POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
9933       even less compatible.
9934
9935       The  descriptions  below use the actual names of the functions, but, as
9936       described above, the standard POSIX names (without the  pcre2_  prefix)
9937       may also be used.
9938
9939
9940COMPILING A PATTERN
9941
9942       The function pcre2_regcomp() is called to compile a pattern into an in-
9943       ternal form. By default, the pattern is a C string terminated by a  bi-
9944       nary zero (but see REG_PEND below). The preg argument is a pointer to a
9945       regex_t structure that is used as a base for storing information  about
9946       the  compiled  regular  expression.  (It  is  also  used for input when
9947       REG_PEND is set.)
9948
9949       The argument cflags is either zero, or contains one or more of the bits
9950       defined by the following macros:
9951
9952         REG_DOTALL
9953
9954       The  PCRE2_DOTALL  option  is set when the regular expression is passed
9955       for compilation to the native function. Note  that  REG_DOTALL  is  not
9956       part of the POSIX standard.
9957
9958         REG_ICASE
9959
9960       The  PCRE2_CASELESS option is set when the regular expression is passed
9961       for compilation to the native function.
9962
9963         REG_NEWLINE
9964
9965       The PCRE2_MULTILINE option is set when the regular expression is passed
9966       for  compilation  to the native function. Note that this does not mimic
9967       the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
9968       tion).
9969
9970         REG_NOSPEC
9971
9972       The  PCRE2_LITERAL  option is set when the regular expression is passed
9973       for compilation to the native function. This disables all meta  charac-
9974       ters  in the pattern, causing it to be treated as a literal string. The
9975       only other options that are  allowed  with  REG_NOSPEC  are  REG_ICASE,
9976       REG_NOSUB,  REG_PEND,  and REG_UTF. Note that REG_NOSPEC is not part of
9977       the POSIX standard.
9978
9979         REG_NOSUB
9980
9981       When  a  pattern  that  is  compiled  with  this  flag  is  passed   to
9982       pcre2_regexec()  for  matching, the nmatch and pmatch arguments are ig-
9983       nored, and no captured strings are returned. Versions of the PCRE2  li-
9984       brary  prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op-
9985       tion, but this no longer happens because it disables the use  of  back-
9986       references.
9987
9988         REG_PEND
9989
       If this option is set, the re_endp field in the preg structure (which
9991       has the type const char *) must be set to point to the character beyond
9992       the  end of the pattern before calling pcre2_regcomp(). The pattern it-
9993       self may now contain binary zeros, which are treated  as  data  charac-
9994       ters.  Without  REG_PEND,  a binary zero terminates the pattern and the
9995       re_endp field is ignored. This is a GNU extension to the POSIX standard
9996       and  should be used with caution in software intended to be portable to
9997       other systems.
9998
9999         REG_UCP
10000
10001       The PCRE2_UCP option is set when the regular expression is  passed  for
10002       compilation  to  the  native function. This causes PCRE2 to use Unicode
       properties when matching \d, \w,  etc.,  instead  of  just  recognizing
10004       ASCII values. Note that REG_UCP is not part of the POSIX standard.
10005
10006         REG_UNGREEDY
10007
10008       The  PCRE2_UNGREEDY option is set when the regular expression is passed
10009       for compilation to the native function. Note that REG_UNGREEDY  is  not
10010       part of the POSIX standard.
10011
10012         REG_UTF
10013
10014       The  PCRE2_UTF  option is set when the regular expression is passed for
10015       compilation to the native function. This causes the pattern itself  and
10016       all  data  strings used for matching it to be treated as UTF-8 strings.
10017       Note that REG_UTF is not part of the POSIX standard.
10018
10019       In the absence of these flags, no options  are  passed  to  the  native
       function. This means that the regex is compiled with PCRE2 default
       semantics. In particular, the way it handles newline characters in the
10022       subject  string  is  the Perl way, not the POSIX way. Note that setting
10023       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
10024       It  does not affect the way newlines are matched by the dot metacharac-
10025       ter (they are not) or by a negative class such as [^a] (they are).
10026
10027       The yield of pcre2_regcomp() is zero on success,  and  non-zero  other-
10028       wise.  The preg structure is filled in on success, and one other member
10029       of the structure (as well as re_endp) is public: re_nsub  contains  the
10030       number  of capturing subpatterns in the regular expression. Various er-
10031       ror codes are defined in the header file.
10032
10033       NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10034       to use the contents of the preg structure. If, for example, you pass it
10035       to pcre2_regexec(), the result is undefined and your program is  likely
10036       to crash.
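
       The rule in the NOTE above can be illustrated by a short sketch. It is
       not part of the PCRE2 distribution: count_groups() is a hypothetical
       helper, and the code is written against the standard POSIX names in
       <regex.h> so that it builds without PCRE2 installed. The assumption is
       that a program using the wrapper described here would include
       pcre2posix.h and call pcre2_regcomp() and pcre2_regfree() in the same
       way. REG_EXTENDED is passed only because a system regcomp() needs it
       for parenthesized groups.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile a pattern, report its number of capturing
   subpatterns through *nsub, and release the compiled form again.
   Returns 0 on success or the non-zero yield of regcomp() on failure. */
int count_groups(const char *pattern, size_t *nsub)
{
    regex_t preg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);

    if (rc != 0) {
        /* Compilation failed: preg must not be passed to regexec(). */
        return rc;
    }

    *nsub = preg.re_nsub;   /* public member, filled in on success */
    regfree(&preg);         /* release the memory associated with preg */
    return 0;
}
```

       For example, count_groups("(a)(b(c))", &n) should yield zero and set n
       to 3, whereas an unbalanced pattern such as "(unclosed" should give a
       non-zero yield, after which the helper does not touch preg again.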
10037
10038
10039MATCHING NEWLINE CHARACTERS
10040
10041       This area is not simple, because POSIX and Perl take different views of
10042       things.  It is not possible to get PCRE2 to obey POSIX  semantics,  but
10043       then PCRE2 was never intended to be a POSIX engine. The following table
10044       lists the different possibilities for matching  newline  characters  in
10045       Perl and PCRE2:
10046
10047                                 Default   Change with
10048
10049         . matches newline          no     PCRE2_DOTALL
10050         newline matches [^a]       yes    not changeable
10051         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
10052         $ matches \n in middle     no     PCRE2_MULTILINE
10053         ^ matches \n in middle     no     PCRE2_MULTILINE
10054
10055       This is the equivalent table for a POSIX-compatible pattern matcher:
10056
10057                                 Default   Change with
10058
10059         . matches newline          yes    REG_NEWLINE
10060         newline matches [^a]       yes    REG_NEWLINE
10061         $ matches \n at end        no     REG_NEWLINE
10062         $ matches \n in middle     no     REG_NEWLINE
10063         ^ matches \n in middle     no     REG_NEWLINE
10064
10065       This  behaviour  is not what happens when PCRE2 is called via its POSIX
10066       API. By default, PCRE2's behaviour is the same as Perl's,  except  that
10067       there  is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
10068       and Perl, there is no way to stop newline from matching [^a].
10069
10070       Default POSIX newline handling can be obtained by setting  PCRE2_DOTALL
10071       and  PCRE2_DOLLAR_ENDONLY  when  calling  pcre2_compile() directly, but
10072       there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10073       tion.  When  using  the  POSIX  API,  passing  REG_NEWLINE  to  PCRE2's
10074       pcre2_regcomp()  function  causes  PCRE2_MULTILINE  to  be  passed   to
10075       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
10076       pass PCRE2_DOLLAR_ENDONLY.
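
       One effect of REG_NEWLINE is the same in both tables above: it allows
       ^ and $ to match at newlines in the middle of the subject. The
       following sketch checks just that effect. The helper matches() is
       hypothetical, and the code uses the standard POSIX names in <regex.h>
       so that it builds without PCRE2; it is an assumption that the same
       calls would be made through pcre2posix.h when PCRE2 is used. The
       behaviour of dot and of [^a] is deliberately not exercised, because
       that is where the two kinds of matcher differ.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if pattern matches subject, 0 if it does
   not, and -1 if the pattern cannot be compiled. */
int matches(const char *pattern, int cflags, const char *subject)
{
    regex_t preg;
    int rc;

    /* REG_NOSUB: only a yes/no answer is wanted, so no offsets are needed. */
    if (regcomp(&preg, pattern, cflags | REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&preg, subject, 0, NULL, 0);
    regfree(&preg);
    return rc == 0;
}
```

       With the subject "one\ntwo\nthree", the pattern ^two$ should fail when
       cflags is zero and succeed when cflags is REG_NEWLINE.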
10077
10078
10079MATCHING A PATTERN
10080
10081       The function pcre2_regexec() is called to match a compiled pattern preg
10082       against  a  given string, which is by default terminated by a zero byte
10083       (but see REG_STARTEND below), subject to the options in eflags.   These
10084       can be:
10085
10086         REG_NOTBOL
10087
10088       The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10089       ing function.
10090
10091         REG_NOTEMPTY
10092
10093       The PCRE2_NOTEMPTY option is set  when  calling  the  underlying  PCRE2
10094       matching  function.  Note  that  REG_NOTEMPTY  is not part of the POSIX
10095       standard. However, setting this option can give more POSIX-like  behav-
10096       iour in some situations.
10097
10098         REG_NOTEOL
10099
10100       The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10101       ing function.
10102
10103         REG_STARTEND
10104
10105       When this option  is  set,  the  subject  string  starts  at  string  +
10106       pmatch[0].rm_so  and  ends  at  string  + pmatch[0].rm_eo, which should
10107       point to the first character beyond the string. There may be binary ze-
10108       ros  within  the  subject string, and indeed, using REG_STARTEND is the
10109       only way to pass a subject string that contains a binary zero.
10110
10111       Whatever the value of  pmatch[0].rm_so,  the  offsets  of  the  matched
10112       string  and  any  captured  substrings  are still given relative to the
10113       start of string itself. (Before PCRE2 release 10.30  these  were  given
10114       relative  to  string + pmatch[0].rm_so, but this differs from other im-
10115       plementations.)
10116
10117       This is a BSD extension, compatible with  but  not  specified  by  IEEE
10118       Standard  1003.2 (POSIX.2), and should be used with caution in software
10119       intended to be portable to other systems. Note that  a  non-zero  rm_so
10120       does  not  imply REG_NOTBOL; REG_STARTEND affects only the location and
10121       length of the string, not how it is matched. Setting  REG_STARTEND  and
10122       passing  pmatch as NULL are mutually exclusive; the error REG_INVARG is
10123       returned.
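
       A minimal sketch of REG_STARTEND follows. The helper match_in_range()
       is hypothetical, and the code assumes a system <regex.h> that provides
       REG_STARTEND (it is an extension, as noted above); with PCRE2 the same
       calls would be made through pcre2posix.h. The point shown is that
       pmatch[0] is used for input as well as output, and that the offsets
       returned are counted from the start of the whole string.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search only the bytes subject[start..end) and
   store the offsets of the match, which are relative to the start of
   subject. Returns the yield of regexec(), or -1 if compilation fails. */
int match_in_range(const char *pattern, const char *subject,
                   size_t start, size_t end, regmatch_t *found)
{
    regex_t preg;
    regmatch_t pmatch[1];
    int rc;

    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    /* With REG_STARTEND, pmatch[0] delimits the subject on input. */
    pmatch[0].rm_so = (regoff_t)start;
    pmatch[0].rm_eo = (regoff_t)end;

    rc = regexec(&preg, subject, 1, pmatch, REG_STARTEND);
    if (rc == 0)
        *found = pmatch[0];     /* on output it describes the match */

    regfree(&preg);
    return rc;
}
```

       With the subject "one two three", start 4 and end 7, the pattern
       [a-z]+ should report the offsets 4 and 7 (the word "two"), and the
       pattern three should not match, because it lies beyond the end of the
       range.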
10124
10125       If the pattern was compiled with the REG_NOSUB flag, no data about  any
10126       matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
10127       pcre2_regexec() are ignored (except possibly  as  input  for  REG_STAR-
10128       TEND).
10129
       The value of nmatch may be zero, and the value of pmatch may be NULL
       (unless REG_STARTEND is set); in both these cases no data about any
10132       matched strings is returned.
10133
10134       Otherwise,  the  portion  of  the string that was matched, and also any
10135       captured substrings, are returned via the pmatch argument, which points
10136       to  an  array  of  nmatch structures of type regmatch_t, containing the
10137       members rm_so and rm_eo. These contain the byte  offset  to  the  first
10138       character of each substring and the offset to the first character after
10139       the end of each substring, respectively. The 0th element of the  vector
10140       relates  to  the  entire portion of string that was matched; subsequent
10141       elements relate to the capturing subpatterns of the regular expression.
10142       Unused entries in the array have both structure members set to -1.
10143
10144       A  successful  match  yields a zero return; various error codes are de-
10145       fined in the header file, of which REG_NOMATCH is the "expected"  fail-
10146       ure code.
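
       The use of rm_so and rm_eo can be sketched as follows. The helper
       copy_group() and its limit of ten groups are hypothetical, and the
       code is written against the standard POSIX names in <regex.h> so that
       it builds without PCRE2; it is an assumption that a PCRE2 program
       would make the same calls through pcre2posix.h.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy capture group number `group` of the first
   match of pattern in subject into out (always zero-terminated).
   Returns the number of bytes copied, -1 if there is no match or the
   group is unset, and -2 on any other failure. */
int copy_group(const char *pattern, const char *subject, size_t group,
               char *out, size_t outsize)
{
    enum { MAXGROUPS = 10 };
    regex_t preg;
    regmatch_t pmatch[MAXGROUPS];
    int result = -1;

    if (group >= MAXGROUPS || outsize == 0)
        return -2;
    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -2;

    /* Unused entries have rm_so and rm_eo set to -1. */
    if (regexec(&preg, subject, MAXGROUPS, pmatch, 0) == 0 &&
        pmatch[group].rm_so != -1) {
        /* rm_so is the first byte of the substring, rm_eo the byte
           just after it, so the difference is the length. */
        size_t len = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);
        if (len >= outsize)
            len = outsize - 1;          /* truncate to fit */
        memcpy(out, subject + pmatch[group].rm_so, len);
        out[len] = '\0';
        result = (int)len;
    }

    regfree(&preg);
    return result;
}
```

       Group 0 is the whole match. For the pattern ([0-9]+)-([0-9]+) and the
       subject "range 10-25 ok", group 0 should be "10-25" and group 2 should
       be "25". For the pattern (a)|(b) and the subject "b", group 1 takes no
       part in the match, so the helper should return -1 for it.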
10147
10148
10149ERROR MESSAGES
10150
10151       The  pcre2_regerror()  function  maps  a non-zero errorcode from either
10152       pcre2_regcomp() or pcre2_regexec() to a printable message. If  preg  is
10153       not  NULL, the error should have arisen from the use of that structure.
       A message terminated by a binary zero is placed in errbuf. If the
       buffer is too short, only the first errbuf_size - 1 characters of the
       error message are used. The yield of the function is the size of the
       buffer needed to hold the whole message, including the terminating
       zero. This value is greater than errbuf_size if the message was
       truncated.
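
       Because the yield is the size needed for the whole message, a caller
       can obtain an untruncated message with two calls. The following is a
       sketch only: full_error_text() is a hypothetical helper, and the code
       uses the standard POSIX name regerror() from <regex.h> so that it
       builds without PCRE2, on the assumption that pcre2_regerror() would be
       called in the same way.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return the complete message for errcode in newly
   allocated memory (the caller frees it), or NULL if malloc() fails. */
char *full_error_text(int errcode, const regex_t *preg)
{
    char small[4];   /* deliberately tiny, to exercise truncation */
    char *text;

    /* First call: the yield is the buffer size the whole message needs,
       even though only sizeof small - 1 characters were stored. */
    size_t needed = regerror(errcode, preg, small, sizeof small);

    text = malloc(needed);
    if (text == NULL)
        return NULL;

    /* Second call: the buffer is now large enough, so nothing is lost. */
    (void)regerror(errcode, preg, text, needed);
    return text;
}
```

       A real application would normally start with a buffer of a more
       realistic size and make the second call only when the yield exceeds
       that size.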
10159
10160
10161MEMORY USAGE
10162
10163       Compiling a regular expression causes memory to be allocated and  asso-
10164       ciated  with the preg structure. The function pcre2_regfree() frees all
10165       such memory, after which preg may no longer be used as a  compiled  ex-
10166       pression.
10167
10168
10169AUTHOR
10170
10171       Philip Hazel
10172       University Computing Service
10173       Cambridge, England.
10174
10175
10176REVISION
10177
10178       Last updated: 30 January 2019
10179       Copyright (c) 1997-2019 University of Cambridge.
10180------------------------------------------------------------------------------
10181
10182
10183PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)
10184
10185
10186
10187NAME
10188       PCRE2 - Perl-compatible regular expressions (revised API)
10189
10190PCRE2 SAMPLE PROGRAM
10191
10192       A  simple, complete demonstration program to get you started with using
10193       PCRE2 is supplied in the file pcre2demo.c in the src directory  in  the
10194       PCRE2 distribution. A listing of this program is given in the pcre2demo
10195       documentation. If you do not have a copy of the PCRE2 distribution, you
10196       can save this listing to re-create the contents of pcre2demo.c.
10197
10198       The  demonstration  program compiles the regular expression that is its
10199       first argument, and matches it against the subject string in its second
10200       argument.  No  PCRE2  options are set, and default character tables are
10201       used. If matching succeeds, the program outputs the portion of the sub-
10202       ject  that  matched,  together  with  the contents of any captured sub-
10203       strings.
10204
10205       If the -g option is given on the command line, the program then goes on
10206       to check for further matches of the same regular expression in the same
10207       subject string. The logic is a little bit tricky because of the  possi-
10208       bility  of  matching an empty string. Comments in the code explain what
10209       is going on.
10210
10211       The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
10212       library.  It  handles  strings  and characters that are stored in 8-bit
10213       code units.  By default, one character corresponds to  one  code  unit,
10214       but  if  the  pattern starts with "(*UTF)", both it and the subject are
10215       treated as UTF-8 strings, where characters  may  occupy  multiple  code
10216       units.
10217
10218       If  PCRE2  is installed in the standard include and library directories
10219       for your operating system, you should be able to compile the demonstra-
10220       tion program using a command like this:
10221
10222         cc -o pcre2demo pcre2demo.c -lpcre2-8
10223
10224       If PCRE2 is installed elsewhere, you may need to add additional options
10225       to the command line. For example, on a Unix-like system that has  PCRE2
10226       installed  in /usr/local, you can compile the demonstration program us-
10227       ing a command like this:
10228
10229         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
10230            -L/usr/local/lib -lpcre2-8
10231
10232       Once you have built the demonstration program, you can run simple tests
10233       like this:
10234
10235         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
10236         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
10237
10238       Note  that  there  is  a  much  more comprehensive test program, called
10239       pcre2test, which supports many more facilities for testing regular  ex-
10240       pressions  using  all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
10241       though not all three need be installed). The pcre2demo program is  pro-
10242       vided as a relatively simple coding example.
10243
10244       If you try to run pcre2demo when PCRE2 is not installed in the standard
10245       library directory, you may get an error like  this  on  some  operating
10246       systems (e.g. Solaris):
10247
         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
10250
10251       This is caused by the way shared library support works  on  those  sys-
10252       tems. You need to add
10253
10254         -R/usr/local/lib
10255
10256       (for example) to the compile command to get round this problem.
10257
10258
10259AUTHOR
10260
10261       Philip Hazel
10262       University Computing Service
10263       Cambridge, England.
10264
10265
10266REVISION
10267
10268       Last updated: 02 February 2016
10269       Copyright (c) 1997-2016 University of Cambridge.
10270------------------------------------------------------------------------------
10271PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)
10272
10273
10274
10275NAME
10276       PCRE2 - Perl-compatible regular expressions (revised API)
10277
10278SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
10279
10280       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
10282         pcre2_general_context *gcontext);
10283
10284       int32_t pcre2_serialize_encode(pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
10286         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
10287
10288       void pcre2_serialize_free(uint8_t *bytes);
10289
10290       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
10291
10292       If  you  are running an application that uses a large number of regular
10293       expression patterns, it may be useful to store them  in  a  precompiled
10294       form  instead  of  having to compile them every time the application is
10295       run. However, if you are using the just-in-time  optimization  feature,
10296       it is not possible to save and reload the JIT data, because it is posi-
10297       tion-dependent. The host on which the patterns  are  reloaded  must  be
10298       running  the  same version of PCRE2, with the same code unit width, and
10299       must also have the same endianness, pointer width and PCRE2_SIZE  type.
10300       For  example, patterns compiled on a 32-bit system using PCRE2's 16-bit
10301       library cannot be reloaded on a 64-bit system, nor can they be reloaded
10302       using the 8-bit library.
10303
10304       Note  that  "serialization" in PCRE2 does not convert compiled patterns
10305       to an abstract format like Java or .NET serialization.  The  serialized
10306       output  is  really  just  a  bytecode dump, which is why it can only be
10307       reloaded in the same environment as the one that created it. Hence  the
10308       restrictions  mentioned  above.   Applications  that are not statically
10309       linked with a fixed version of PCRE2 must be prepared to recompile pat-
10310       terns from their sources, in order to be immune to PCRE2 upgrades.
10311
10312
10313SECURITY CONCERNS
10314
10315       The facility for saving and restoring compiled patterns is intended for
10316       use within individual applications.  As  such,  the  data  supplied  to
10317       pcre2_serialize_decode()  is expected to be trusted data, not data from
10318       arbitrary external sources.  There  is  only  some  simple  consistency
10319       checking, not complete validation of what is being re-loaded. Corrupted
10320       data may cause undefined results. For example, if the length field of a
10321       pattern in the serialized data is corrupted, the deserializing code may
10322       read beyond the end of the byte stream that is passed to it.
10323
10324
10325SAVING COMPILED PATTERNS
10326
10327       Before compiled patterns can be saved they must be serialized, which in
10328       PCRE2  means converting the pattern to a stream of bytes. A single byte
10329       stream may contain any number of compiled patterns, but they  must  all
10330       use  the same character tables. A single copy of the tables is included
10331       in the byte stream (its size is 1088 bytes). For more details of  char-
10332       acter  tables,  see the section on locale support in the pcre2api docu-
10333       mentation.
10334
10335       The function pcre2_serialize_encode() creates a serialized byte  stream
10336       from  a  list of compiled patterns. Its first two arguments specify the
10337       list, being a pointer to a vector of pointers to compiled patterns, and
10338       the length of the vector. The third and fourth arguments point to vari-
10339       ables which are set to point to the created byte stream and its length,
10340       respectively.  The  final  argument  is a pointer to a general context,
       which can be used to specify custom memory  management  functions.  If
10342       this  argument  is NULL, malloc() is used to obtain memory for the byte
10343       stream. The yield of the function is the number of serialized patterns,
10344       or one of the following negative error codes:
10345
10346         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
10347         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
10348         PCRE2_ERROR_MEMORY       memory allocation failed
10349         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
10350         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL
10351
10352       PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
10353       rupted, or that a slot in the vector does not point to a compiled  pat-
10354       tern.
10355
10356       Once a set of patterns has been serialized you can save the data in any
10357       appropriate manner. Here is sample code that compiles two patterns  and
10358       writes them to a file. It assumes that the variable fd refers to a file
10359       that is open for output. The error checking that should be present in a
10360       real application has been omitted for simplicity.
10361
         int errorcode;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];

         /* Compile the two patterns that are to be saved. */
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);

         /* Serialize both patterns into a single byte stream. */
         errorcode = pcre2_serialize_encode((const pcre2_code **)list_of_codes,
           2, &bytes, &bytescount, NULL);

         /* fwrite() yields the number of bytes written, not an error code. */
         errorcode = fwrite(bytes, 1, bytescount, fd);
10374
10375       Note  that  the  serialized data is binary data that may contain any of
10376       the 256 possible byte values. On systems that make  a  distinction  be-
10377       tween  binary  and non-binary data, be sure that the file is opened for
10378       binary output.
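
       The file handling on its own can be sketched as follows, with the
       error checking that the sample above omits. The helper save_stream()
       and its arguments are hypothetical; bytes and size stand for the byte
       stream and length set by pcre2_serialize_encode(). Only standard C
       library calls are used, so the sketch does not depend on PCRE2.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write size bytes to the named file in binary mode.
   Returns 0 on success and -1 on any failure. */
int save_stream(const char *path, const uint8_t *bytes, size_t size)
{
    FILE *f = fopen(path, "wb");      /* "b": no newline translation */
    size_t written;

    if (f == NULL)
        return -1;

    written = fwrite(bytes, 1, size, f);

    /* fclose() flushes buffered data, so its result matters too. */
    if (fclose(f) != 0 || written != size)
        return -1;
    return 0;
}
```

       The "b" in the mode string is what requests binary output on systems
       that make the distinction; elsewhere it has no effect.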
10379
       Serializing a set of patterns leaves the originals untouched, so
10381       they  can  still  be used for matching. Their memory must eventually be
10382       freed in the usual way by calling pcre2_code_free(). When you have fin-
10383       ished with the byte stream, it too must be freed by calling pcre2_seri-
10384       alize_free(). If this function is called with a NULL argument,  it  re-
10385       turns immediately without doing anything.
10386
10387
10388RE-USING PRECOMPILED PATTERNS
10389
10390       In  order to re-use a set of saved patterns you must first make the se-
10391       rialized byte stream available in main memory (for example, by  reading
10392       from a file). The management of this memory block is up to the applica-
10393       tion. You can use the pcre2_serialize_get_number_of_codes() function to
10394       find  out how many compiled patterns are in the serialized data without
10395       actually decoding the patterns:
10396
10397         uint8_t *bytes = <serialized data>;
10398         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
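
       One way of making the byte stream available in main memory is sketched
       below. The helper load_stream() is hypothetical and uses only standard
       C library calls, so it does not depend on PCRE2; the buffer it returns
       is what would take the place of <serialized data> above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into newly allocated memory.
   On success returns the buffer (the caller frees it) and stores its
   length in *size; on failure, or for an empty file, returns NULL. */
uint8_t *load_stream(const char *path, size_t *size)
{
    FILE *f = fopen(path, "rb");      /* binary mode, as for saving */
    uint8_t *bytes = NULL;
    long length = 0;

    if (f == NULL)
        return NULL;

    /* Find the file length by seeking to the end and back. */
    if (fseek(f, 0, SEEK_END) == 0 && (length = ftell(f)) > 0 &&
        fseek(f, 0, SEEK_SET) == 0) {
        bytes = malloc((size_t)length);
        if (bytes != NULL &&
            fread(bytes, 1, (size_t)length, f) != (size_t)length) {
            free(bytes);              /* short read: treat as failure */
            bytes = NULL;
        }
    }

    fclose(f);
    if (bytes != NULL)
        *size = (size_t)length;
    return bytes;
}
```

       The buffer is obtained with malloc(), so the application releases it
       with free() once the patterns have been decoded. Such a helper should
       be used only on files that the application wrote itself, for the
       reasons given under SECURITY CONCERNS.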
10399
10400       The pcre2_serialize_decode() function reads a byte stream and recreates
10401       the compiled patterns in new memory blocks, setting pointers to them in
10402       a vector. The first two arguments are a pointer to  a  suitable  vector
10403       and its length, and the third argument points to a byte stream. The fi-
10404       nal argument is a pointer to a general context, which can  be  used  to
       specify  custom  memory management functions for the decoded patterns.
10406       If this argument is NULL, malloc() and free() are used. After deserial-
10407       ization, the byte stream is no longer needed and can be discarded.
10408
         pcre2_code *list_of_codes[2];
         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes =
           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
10414
10415       If  the  vector  is  not  large enough for all the patterns in the byte
10416       stream, it is filled with those that fit, and  the  remainder  are  ig-
10417       nored.  The yield of the function is the number of decoded patterns, or
10418       one of the following negative error codes:
10419
10420         PCRE2_ERROR_BADDATA    second argument is zero or less
10421         PCRE2_ERROR_BADMAGIC   mismatch of id bytes in the data
10422         PCRE2_ERROR_BADMODE    mismatch of code unit size or PCRE2 version
10423         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
10424         PCRE2_ERROR_MEMORY     memory allocation failed
10425         PCRE2_ERROR_NULL       first or third argument is NULL
10426
10427       PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it  was
10428       compiled on a system with different endianness.
10429
10430       Decoded patterns can be used for matching in the usual way, and must be
10431       freed by calling pcre2_code_free(). However, be aware that there  is  a
10432       potential  race  issue if you are using multiple patterns that were de-
10433       coded from a single byte stream in a multithreaded application. A  sin-
10434       gle  copy  of  the character tables is used by all the decoded patterns
10435       and a reference count is used to arrange for its memory to be automati-
10436       cally  freed when the last pattern is freed, but there is no locking on
10437       this reference count. Therefore, if you want to call  pcre2_code_free()
10438       for  these  patterns  in  different  threads, you must arrange your own
10439       locking, and ensure that pcre2_code_free()  cannot  be  called  by  two
10440       threads at the same time.
10441
10442       If  a pattern was processed by pcre2_jit_compile() before being serial-
10443       ized, the JIT data is discarded and so is no longer available  after  a
10444       save/restore  cycle.  You can, however, process a restored pattern with
10445       pcre2_jit_compile() if you wish.
10446
10447
10448AUTHOR
10449
10450       Philip Hazel
10451       University Computing Service
10452       Cambridge, England.
10453
10454
10455REVISION
10456
10457       Last updated: 27 June 2018
10458       Copyright (c) 1997-2018 University of Cambridge.
10459------------------------------------------------------------------------------
10460
10461
10462PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)
10463
10464
10465
10466NAME
10467       PCRE2 - Perl-compatible regular expressions (revised API)
10468
10469PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
10470
10471       The  full syntax and semantics of the regular expressions that are sup-
10472       ported by PCRE2 are described in the pcre2pattern  documentation.  This
10473       document contains a quick-reference summary of the syntax.
10474
10475
10476QUOTING
10477
10478         \x         where x is non-alphanumeric is a literal x
10479         \Q...\E    treat enclosed characters as literal
10480
10481
10482ESCAPED CHARACTERS
10483
10484       This  table  applies to ASCII and Unicode environments. An unrecognized
10485       escape sequence causes an error.
10486
10487         \a         alarm, that is, the BEL character (hex 07)
10488         \cx        "control-x", where x is any ASCII printing character
10489         \e         escape (hex 1B)
10490         \f         form feed (hex 0C)
10491         \n         newline (hex 0A)
10492         \r         carriage return (hex 0D)
10493         \t         tab (hex 09)
10494         \0dd       character with octal code 0dd
10495         \ddd       character with octal code ddd, or backreference
10496         \o{ddd..}  character with octal code ddd..
10497         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
10498         \xhh       character with hex code hh
10499         \x{hh..}   character with hex code hh..
10500
10501       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
10502       following are also recognized:
10503
10504         \U         the character "U"
10505         \uhhhh     character with hex code hhhh
10506         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX
10507
10508       When  \x  is not followed by {, from zero to two hexadecimal digits are
10509       read, but in ALT_BSUX mode \x must be followed by two hexadecimal  dig-
10510       its  to  be  recognized as a hexadecimal escape; otherwise it matches a
10511       literal "x".  Likewise, if \u (in ALT_BSUX mode)  is  not  followed  by
10512       four  hexadecimal  digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
10513       digits in curly brackets, it matches a literal "u".
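
       The rule for \x when it is not followed by { can be restated as a
       short sketch. This is an illustration of the rule as described above
       and not PCRE2's own code; parse_x_escape() and hex_value() are
       hypothetical helpers, and alt_bsux stands for ALT_BSUX mode being in
       force.

```c
#include <ctype.h>

/* Value of one hexadecimal digit; the caller has checked isxdigit(). */
static int hex_value(int c)
{
    if (isdigit(c))
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical helper: s points just after "\x" (not followed by "{").
   Stores the character code in *code and returns the number of digits
   consumed. In ALT_BSUX mode with fewer than two digits, *code is the
   literal 'x' and nothing is consumed. */
int parse_x_escape(const char *s, int alt_bsux, int *code)
{
    int digits = 0;
    int value = 0;

    /* Read from zero to two hexadecimal digits. */
    while (digits < 2 && isxdigit((unsigned char)s[digits])) {
        value = value * 16 + hex_value((unsigned char)s[digits]);
        digits++;
    }

    /* ALT_BSUX mode insists on exactly two digits. */
    if (alt_bsux && digits < 2) {
        *code = 'x';
        return 0;
    }

    *code = value;
    return digits;
}
```

       Under this sketch, the text "7g" after \x gives the code 07 with one
       digit consumed in the default mode, but a literal "x" with nothing
       consumed in ALT_BSUX mode.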
10514
10515       Note that \0dd is always an octal code. The treatment of backslash fol-
10516       lowed  by  a non-zero digit is complicated; for details see the section
10517       "Non-printing characters" in the pcre2pattern documentation, where  de-
10518       tails  of  escape  processing  in  EBCDIC  environments are also given.
10519       \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
10520       EBCDIC  environments.  Note  that  \N  not followed by an opening curly
10521       bracket has a different meaning (see below).
10522
10523
10524CHARACTER TYPES
10525
10526         .          any character except newline;
10527                      in dotall mode, any character whatsoever
10528         \C         one code unit, even in UTF mode (best avoided)
10529         \d         a decimal digit
10530         \D         a character that is not a decimal digit
10531         \h         a horizontal white space character
10532         \H         a character that is not a horizontal white space character
10533         \N         a character that is not a newline
10534         \p{xx}     a character with the xx property
10535         \P{xx}     a character without the xx property
10536         \R         a newline sequence
10537         \s         a white space character
10538         \S         a character that is not a white space character
10539         \v         a vertical white space character
10540         \V         a character that is not a vertical white space character
10541         \w         a "word" character
10542         \W         a "non-word" character
10543         \X         a Unicode extended grapheme cluster
10544
10545       \C is dangerous because it may leave the current matching point in  the
10546       middle of a UTF-8 or UTF-16 character. The application can lock out the
10547       use of \C by setting the PCRE2_NEVER_BACKSLASH_C  option.  It  is  also
10548       possible to build PCRE2 with the use of \C permanently disabled.
10549
10550       By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
10551       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10552       matching  is  happening,  \s and \w may also match characters with code
10553       points in the range 128-255. If the PCRE2_UCP option is set, the behav-
10554       iour of these escape sequences is changed to use Unicode properties and
10555       they match many more characters.
10556
10557
10558GENERAL CATEGORY PROPERTIES FOR \p and \P
10559
10560         C          Other
10561         Cc         Control
10562         Cf         Format
10563         Cn         Unassigned
10564         Co         Private use
10565         Cs         Surrogate
10566
10567         L          Letter
10568         Ll         Lower case letter
10569         Lm         Modifier letter
10570         Lo         Other letter
10571         Lt         Title case letter
10572         Lu         Upper case letter
10573         L&         Ll, Lu, or Lt
10574
10575         M          Mark
10576         Mc         Spacing mark
10577         Me         Enclosing mark
10578         Mn         Non-spacing mark
10579
10580         N          Number
10581         Nd         Decimal number
10582         Nl         Letter number
10583         No         Other number
10584
10585         P          Punctuation
10586         Pc         Connector punctuation
10587         Pd         Dash punctuation
10588         Pe         Close punctuation
10589         Pf         Final punctuation
10590         Pi         Initial punctuation
10591         Po         Other punctuation
10592         Ps         Open punctuation
10593
10594         S          Symbol
10595         Sc         Currency symbol
10596         Sk         Modifier symbol
10597         Sm         Mathematical symbol
10598         So         Other symbol
10599
10600         Z          Separator
10601         Zl         Line separator
10602         Zp         Paragraph separator
10603         Zs         Space separator
10604
10605
10606PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
10607
10608         Xan        Alphanumeric: union of properties L and N
10609         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
10610         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
10612                      represented by a Universal Character Name
10613         Xwd        Perl word: property Xan or underscore
10614
10615       Perl and POSIX space are now the same. Perl added VT to its space char-
10616       acter set at release 5.18.
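
       The definitions of Xan, Xsp, Xps, and Xwd above can be restated as a
       short sketch. It is written in Python with the unicodedata module
       purely for illustration; the function names are hypothetical and
       this is not how PCRE2 implements the properties.

```python
import unicodedata

# tab, NL, VT, FF, CR as listed for Xps and Xsp
_EXTRA_SPACES = {"\t", "\n", "\v", "\f", "\r"}


def is_xan(ch: str) -> bool:
    """Alphanumeric: union of properties L and N."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xsp(ch: str) -> bool:
    """Perl space: property Z or tab, NL, VT, FF, CR."""
    return unicodedata.category(ch)[0] == "Z" or ch in _EXTRA_SPACES


# POSIX space is documented as the same set as Perl space.
is_xps = is_xsp


def is_xwd(ch: str) -> bool:
    """Perl word: property Xan or underscore."""
    return is_xan(ch) or ch == "_"
```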
10617
10618
10619SCRIPT NAMES FOR \p AND \P
10620
10621       Adlam,  Ahom,  Anatolian_Hieroglyphs,  Arabic, Armenian, Avestan, Bali-
10622       nese, Bamum, Bassa_Vah, Batak, Bengali,  Bhaiksuki,  Bopomofo,  Brahmi,
10623       Braille,  Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
10624       nian, Chakma, Cham, Cherokee, Chorasmian,  Common,  Coptic,  Cuneiform,
10625       Cypriot,  Cyrillic,  Deseret, Devanagari, Dives_Akuru, Dogra, Duployan,
10626       Egyptian_Hieroglyphs, Elbasan, Elymaic, Ethiopic, Georgian, Glagolitic,
10627       Gothic, Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
10628       Hanifi_Rohingya, Hanunoo, Hatran, Hebrew,  Hiragana,  Imperial_Aramaic,
10629       Inherited,   Inscriptional_Pahlavi,  Inscriptional_Parthian,  Javanese,
10630       Kaithi, Kannada, Katakana, Kayah_Li,  Kharoshthi,  Khitan_Small_Script,
10631       Khmer,  Khojki,  Khudawadi,  Lao,  Latin, Lepcha, Limbu, Linear_A, Lin-
10632       ear_B, Lisu, Lycian, Lydian,  Mahajani,  Makasar,  Malayalam,  Mandaic,
10633       Manichaean,    Marchen,   Masaram_Gondi,   Medefaidrin,   Meetei_Mayek,
10634       Mende_Kikakui, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mon-
10635       golian,  Mro,  Multani,  Myanmar,  Nabataean, Nandinagari, New_Tai_Lue,
10636       Newa, Nko, Nushu, Nyakeng_Puachue_Hmong, Ogham,  Ol_Chiki,  Old_Hungar-
10637       ian,  Old_Italic,  Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
10638       dian,  Old_South_Arabian,  Old_Turkic,  Oriya,  Osage,   Osmanya,   Pa-
10639       hawh_Hmong,     Palmyrene,     Pau_Cin_Hau,    Phags_Pa,    Phoenician,
10640       Psalter_Pahlavi, Rejang, Runic, Samaritan,  Saurashtra,  Sharada,  Sha-
10641       vian,  Siddham,  SignWriting,  Sinhala, Sogdian, Sora_Sompeng, Soyombo,
10642       Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,  Tai_Le,  Tai_Tham,
10643       Tai_Viet,  Takri,  Tamil,  Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
10644       nagh, Tirhuta, Ugaritic, Vai, Wancho,  Warang_Citi,  Yezidi,  Yi,  Zan-
10645       abazar_Square.
10646
10647
10648CHARACTER CLASSES
10649
10650         [...]       positive character class
10651         [^...]      negative character class
10652         [x-y]       range (can be used for hex characters)
10653         [[:xxx:]]   positive POSIX named set
10654         [[:^xxx:]]  negative POSIX named set
10655
10656         alnum       alphanumeric
10657         alpha       alphabetic
10658         ascii       0-127
10659         blank       space or tab
10660         cntrl       control character
10661         digit       decimal digit
10662         graph       printing, excluding space
10663         lower       lower case letter
10664         print       printing, including space
10665         punct       printing, excluding alphanumeric
10666         space       white space
10667         upper       upper case letter
10668         word        same as \w
10669         xdigit      hexadecimal digit
10670
10671       In  PCRE2, POSIX character set names recognize only ASCII characters by
10672       default, but some of them use Unicode properties if PCRE2_UCP  is  set.
10673       You can use \Q...\E inside a character class.
10674
10675
10676QUANTIFIERS
10677
10678         ?           0 or 1, greedy
10679         ?+          0 or 1, possessive
10680         ??          0 or 1, lazy
10681         *           0 or more, greedy
10682         *+          0 or more, possessive
10683         *?          0 or more, lazy
10684         +           1 or more, greedy
10685         ++          1 or more, possessive
10686         +?          1 or more, lazy
10687         {n}         exactly n
10688         {n,m}       at least n, no more than m, greedy
10689         {n,m}+      at least n, no more than m, possessive
10690         {n,m}?      at least n, no more than m, lazy
10691         {n,}        n or more, greedy
10692         {n,}+       n or more, possessive
10693         {n,}?       n or more, lazy
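
       The difference between greedy and lazy quantifiers can be seen with
       a small experiment. The sketch below uses Python's re module as a
       stand-in, on the assumption that it behaves like PCRE2 for these
       particular forms. Possessive quantifiers are left out because they
       need a recent Python version.

```python
import re

SUBJECT = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back only what is needed.
greedy = re.match(r"<.+>", SUBJECT).group()

# Lazy: .+? takes as little as it can, extending only when forced to.
lazy = re.match(r"<.+?>", SUBJECT).group()

# The same distinction applies to bounded repeats.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

if __name__ == "__main__":
    print(greedy, lazy, bounded_greedy, bounded_lazy)
```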
10694
10695
10696ANCHORS AND SIMPLE ASSERTIONS
10697
10698         \b          word boundary
10699         \B          not a word boundary
10700         ^           start of subject
10701                       also after an internal newline in multiline mode
10702                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
10703         \A          start of subject
10704         $           end of subject
10705                       also before newline at end of subject
10706                       also before internal newline in multiline mode
10707         \Z          end of subject
10708                       also before newline at end of subject
10709         \z          end of subject
10710         \G          first matching position in subject
10711
10712
10713REPORTED MATCH POINT SETTING
10714
10715         \K          set reported start of match
10716
10717       \K is honoured in positive assertions, but ignored in negative ones.
10718
10719
10720ALTERNATION
10721
10722         expr|expr|expr...
10723
10724
10725CAPTURING
10726
10727         (...)           capture group
10728         (?<name>...)    named capture group (Perl)
10729         (?'name'...)    named capture group (Perl)
10730         (?P<name>...)   named capture group (Python)
10731         (?:...)         non-capture group
10732         (?|...)         non-capture group; reset group numbers for
10733                          capture groups in each alternative
10734
10735       In  non-UTF  modes, names may contain underscores and ASCII letters and
10736       digits; in UTF modes, any Unicode letters and  Unicode  decimal  digits
10737       are permitted. In both cases, a name must not start with a digit.
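
       As a minimal sketch of the non-UTF naming rule just described, the
       following check accepts underscores, ASCII letters, and ASCII
       digits, and rejects a leading digit. The function name is
       hypothetical. The sketch does not model the UTF-mode rule or any
       length limit the library may impose.

```python
import re

# First character: ASCII letter or underscore; the rest may include digits.
_NON_UTF_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_group_name_non_utf(name: str) -> bool:
    """Check a capture group name against the non-UTF rule."""
    return _NON_UTF_NAME.fullmatch(name) is not None
```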
10738
10739
10740ATOMIC GROUPS
10741
10742         (?>...)         atomic non-capture group
10743         (*atomic:...)   atomic non-capture group
10744
10745
10746COMMENT
10747
10748         (?#....)        comment (not nestable)
10749
10750
10751OPTION SETTING
10752       Changes  of these options within a group are automatically cancelled at
10753       the end of the group.
10754
10755         (?i)            caseless
10756         (?J)            allow duplicate named groups
10757         (?m)            multiline
10758         (?n)            no auto capture
10759         (?s)            single line (dotall)
10760         (?U)            default ungreedy (lazy)
10761         (?x)            extended: ignore white space except in classes
10762         (?xx)           as (?x) but also ignore space and tab in classes
10763         (?-...)         unset option(s)
10764         (?^)            unset imnsx options
10765
       Unsetting x or xx unsets both. Several options may be set at once, and
       a mixture of setting and unsetting such as (?i-x) is allowed, but there
       may be only one hyphen. Setting (but not unsetting) is allowed after
       (?^, for example (?^in). An option setting may appear at the start of a
       non-capture group, for example (?i:...).
10771
10772       The following are recognized only at the very start of a pattern or af-
10773       ter one of the newline or \R options with similar syntax. More than one
10774       of them may appear. For the first three, d is a decimal number.
10775
10776         (*LIMIT_DEPTH=d) set the backtracking limit to d
10777         (*LIMIT_HEAP=d)  set the heap size limit to d * 1024 bytes
10778         (*LIMIT_MATCH=d) set the match limit to d
10779         (*NOTEMPTY)      set PCRE2_NOTEMPTY when matching
10780         (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
10781         (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
10782         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
10783         (*NO_JIT)       disable JIT optimization
10784         (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
10785         (*UTF)          set appropriate UTF mode for the library in use
10786         (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
10787
10788       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce  the
10789       value   of   the   limits   set  by  the  caller  of  pcre2_match()  or
10790       pcre2_dfa_match(), not increase them. LIMIT_RECURSION  is  an  obsolete
10791       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
10792       and (*UCP) by setting the PCRE2_NEVER_UTF or  PCRE2_NEVER_UCP  options,
10793       respectively, at compile time.
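
       The rule that a pattern can only reduce a limit amounts to taking
       the smaller of the two values. The sketch below states that rule in
       Python; both function names are hypothetical and the code is not
       part of the PCRE2 API.

```python
from typing import Optional


def effective_limit(caller_limit: int, pattern_limit: Optional[int]) -> int:
    """Combine the caller's limit with an optional in-pattern limit.

    A pattern item such as (*LIMIT_MATCH=d) is honoured only when it
    lowers the caller's value; a larger value has no effect.
    """
    if pattern_limit is None:
        return caller_limit
    return min(caller_limit, pattern_limit)


def heap_limit_bytes(d: int) -> int:
    """(*LIMIT_HEAP=d) is expressed in units of 1024 bytes."""
    return d * 1024
```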
10794
10795
10796NEWLINE CONVENTION
10797
10798       These are recognized only at the very start of the pattern or after op-
10799       tion settings with a similar syntax.
10800
10801         (*CR)           carriage return only
10802         (*LF)           linefeed only
10803         (*CRLF)         carriage return followed by linefeed
10804         (*ANYCRLF)      all three of the above
10805         (*ANY)          any Unicode newline sequence
10806         (*NUL)          the NUL character (binary zero)
10807
10808
10809WHAT \R MATCHES
10810
       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.
10813
10814         (*BSR_ANYCRLF)  CR, LF, or CRLF
10815         (*BSR_UNICODE)  any Unicode newline sequence
10816
10817
10818LOOKAHEAD AND LOOKBEHIND ASSERTIONS
10819
10820         (?=...)                     )
10821         (*pla:...)                  ) positive lookahead
10822         (*positive_lookahead:...)   )
10823
10824         (?!...)                     )
10825         (*nla:...)                  ) negative lookahead
10826         (*negative_lookahead:...)   )
10827
10828         (?<=...)                    )
10829         (*plb:...)                  ) positive lookbehind
10830         (*positive_lookbehind:...)  )
10831
10832         (?<!...)                    )
10833         (*nlb:...)                  ) negative lookbehind
10834         (*negative_lookbehind:...)  )
10835
10836       Each top-level branch of a lookbehind must be of a fixed length.
10837
10838
10839NON-ATOMIC LOOKAROUND ASSERTIONS
10840
10841       These assertions are specific to PCRE2 and are not Perl-compatible.
10842
10843         (?*...)                                )
10844         (*napla:...)                           ) synonyms
10845         (*non_atomic_positive_lookahead:...)   )
10846
10847         (?<*...)                               )
10848         (*naplb:...)                           ) synonyms
10849         (*non_atomic_positive_lookbehind:...)  )
10850
10851
10852SCRIPT RUNS
10853
10854         (*script_run:...)           ) script run, can be backtracked into
10855         (*sr:...)                   )
10856
10857         (*atomic_script_run:...)    ) atomic script run
10858         (*asr:...)                  )
10859
10860
10861BACKREFERENCES
10862
10863         \n              reference by number (can be ambiguous)
10864         \gn             reference by number
10865         \g{n}           reference by number
10866         \g+n            relative reference by number (PCRE2 extension)
10867         \g-n            relative reference by number
10868         \g{+n}          relative reference by number (PCRE2 extension)
10869         \g{-n}          relative reference by number
10870         \k<name>        reference by name (Perl)
10871         \k'name'        reference by name (Perl)
10872         \g{name}        reference by name (Perl)
10873         \k{name}        reference by name (.NET)
10874         (?P=name)       reference by name (Python)
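
       Two of the forms in this table, \n and the Python-style (?P=name),
       are also accepted by Python's re module, so they can be tried there.
       This is an illustration under that assumption; the other forms
       listed are not available in Python.

```python
import re

# \1 refers back to whatever group 1 captured: here, a doubled character.
doubled = re.search(r"(\w)\1", "hello").group()

# (?P=q) must match the same quote character that (?P<q>...) captured.
QUOTED = re.compile(r"""(?P<q>['"]).*?(?P=q)""")
quoted_ok = QUOTED.fullmatch('"hi"')
quoted_bad = QUOTED.fullmatch("\"hi'")

if __name__ == "__main__":
    print(doubled, bool(quoted_ok), bool(quoted_bad))
```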
10875
10876
10877SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
10878
10879         (?R)            recurse whole pattern
10880         (?n)            call subroutine by absolute number
10881         (?+n)           call subroutine by relative number
10882         (?-n)           call subroutine by relative number
10883         (?&name)        call subroutine by name (Perl)
10884         (?P>name)       call subroutine by name (Python)
10885         \g<name>        call subroutine by name (Oniguruma)
10886         \g'name'        call subroutine by name (Oniguruma)
10887         \g<n>           call subroutine by absolute number (Oniguruma)
10888         \g'n'           call subroutine by absolute number (Oniguruma)
10889         \g<+n>          call subroutine by relative number (PCRE2 extension)
10890         \g'+n'          call subroutine by relative number (PCRE2 extension)
10891         \g<-n>          call subroutine by relative number (PCRE2 extension)
10892         \g'-n'          call subroutine by relative number (PCRE2 extension)
10893
10894
10895CONDITIONAL PATTERNS
10896
10897         (?(condition)yes-pattern)
10898         (?(condition)yes-pattern|no-pattern)
10899
10900         (?(n)               absolute reference condition
10901         (?(+n)              relative reference condition
10902         (?(-n)              relative reference condition
10903         (?(<name>)          named reference condition (Perl)
10904         (?('name')          named reference condition (Perl)
10905         (?(name)            named reference condition (PCRE2, deprecated)
10906         (?(R)               overall recursion condition
10907         (?(Rn)              specific numbered group recursion condition
10908         (?(R&name)          specific named group recursion condition
10909         (?(DEFINE)          define groups for reference
10910         (?(VERSION[>]=n.m)  test PCRE2 version
10911         (?(assert)          assertion condition
10912
10913       Note  the  ambiguity of (?(R) and (?(Rn) which might be named reference
10914       conditions or recursion tests. Such a condition  is  interpreted  as  a
10915       reference condition if the relevant named group exists.
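
       A small example of the absolute reference condition (?(n)...): the
       closing bracket is required only if the opening one was captured.
       Python's re module accepts this particular form, so it is used here
       as a stand-in; the other condition types in the table are not
       available there.

```python
import re

# Group 1 optionally captures "<". The condition (?(1)>) then demands ">"
# only when group 1 took part in the match.
BRACKETED = re.compile(r"(<)?\w+(?(1)>)")

with_brackets = BRACKETED.fullmatch("<abc>")
without_brackets = BRACKETED.fullmatch("abc")
unbalanced_open = BRACKETED.fullmatch("<abc")
unbalanced_close = BRACKETED.fullmatch("abc>")
```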
10916
10917
10918BACKTRACKING CONTROL
10919
10920       All  backtracking  control  verbs  may be in the form (*VERB:NAME). For
10921       (*MARK) the name is mandatory, for the others it is  optional.  (*SKIP)
10922       changes  its  behaviour if :NAME is present. The others just set a name
10923       for passing back to the caller, but this is not a name that (*SKIP) can
       see. The following act as soon as they are reached:
10925
10926         (*ACCEPT)       force successful match
10927         (*FAIL)         force backtrack; synonym (*F)
10928         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
10929
10930       The  following  act only when a subsequent match failure causes a back-
10931       track to reach them. They all force a match failure, but they differ in
10932       what happens afterwards. Those that advance the start-of-match point do
10933       so only if the pattern is not anchored.
10934
10935         (*COMMIT)       overall failure, no advance of starting point
10936         (*PRUNE)        advance to next starting character
10937         (*SKIP)         advance to current matching position
10938         (*SKIP:NAME)    advance to position corresponding to an earlier
10939                         (*MARK:NAME); if not found, the (*SKIP) is ignored
10940         (*THEN)         local failure, backtrack to next alternation
10941
10942       The effect of one of these verbs in a group called as a  subroutine  is
10943       confined to the subroutine call.
10944
10945
10946CALLOUTS
10947
10948         (?C)            callout (assumed number 0)
10949         (?Cn)           callout with numerical data n
10950         (?C"text")      callout with string data
10951
10952       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
10953       the start and the end), and the starting delimiter { matched  with  the
10954       ending  delimiter  }. To encode the ending delimiter within the string,
10955       double it.
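
       The delimiter rules above can be sketched as a helper that builds a
       string callout item and doubles the ending delimiter wherever it
       occurs in the text. The function name is hypothetical; it only
       constructs pattern text and does not call PCRE2.

```python
# Delimiters that are the same at the start and the end of the string.
_SAME_DELIMITERS = set("`'\"^%#$")


def callout_with_string(text: str, delimiter: str = '"') -> str:
    """Build a (?C...) item with string data, escaping by doubling."""
    if delimiter == "{":
        end = "}"
    elif delimiter in _SAME_DELIMITERS:
        end = delimiter
    else:
        raise ValueError("unsupported callout delimiter: %r" % delimiter)
    # Only the ending delimiter needs to be doubled inside the string.
    return "(?C" + delimiter + text.replace(end, end + end) + end + ")"
```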
10956
10957
10958SEE ALSO
10959
10960       pcre2pattern(3),   pcre2api(3),   pcre2callout(3),    pcre2matching(3),
10961       pcre2(3).
10962
10963
10964AUTHOR
10965
10966       Philip Hazel
10967       University Computing Service
10968       Cambridge, England.
10969
10970
10971REVISION
10972
10973       Last updated: 28 December 2019
10974       Copyright (c) 1997-2019 University of Cambridge.
10975------------------------------------------------------------------------------
10976
10977
10978PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)
10979
10980
10981
10982NAME
       PCRE2 - Perl-compatible regular expressions (revised API)
10984
10985UNICODE AND UTF SUPPORT
10986
10987       PCRE2 is normally built with Unicode support, though if you do not need
10988       it, you can build it  without,  in  which  case  the  library  will  be
10989       smaller. With Unicode support, PCRE2 has knowledge of Unicode character
10990       properties and can process strings of text in UTF-8, UTF-16, and UTF-32
10991       format (depending on the code unit width), but this is not the default.
10992       Unless specifically requested, PCRE2 treats each code unit in a  string
10993       as one character.
10994
10995       There  are two ways of telling PCRE2 to switch to UTF mode, where char-
10996       acters may consist of more than one code unit and the range  of  values
10997       is constrained. The program can call pcre2_compile() with the PCRE2_UTF
10998       option, or the pattern may start with the  sequence  (*UTF).   However,
10999       the  latter  facility  can be locked out by the PCRE2_NEVER_UTF option.
11000       That is, the programmer can prevent the supplier of  the  pattern  from
11001       switching to UTF mode.
11002
11003       Note   that  the  PCRE2_MATCH_INVALID_UTF  option  (see  below)  forces
11004       PCRE2_UTF to be set.
11005
11006       In UTF mode, both the pattern and any subject strings that are  matched
11007       against  it are treated as UTF strings instead of strings of individual
11008       one-code-unit characters. There are also some other changes to the  way
11009       characters are handled, as documented below.
11010
11011
11012UNICODE PROPERTY SUPPORT
11013
11014       When  PCRE2 is built with Unicode support, the escape sequences \p{..},
11015       \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11016       ting.   The  Unicode  properties  that can be tested are limited to the
11017       general category properties such as Lu for an upper case letter  or  Nd
11018       for  a  decimal number, the Unicode script names such as Arabic or Han,
11019       and the derived properties Any and L&. Full  lists  are  given  in  the
11020       pcre2pattern  and  pcre2syntax  documentation. Only the short names for
11021       properties are supported. For example, \p{L} matches a letter. Its Perl
11022       synonym,  \p{Letter},  is  not  supported.   Furthermore, in Perl, many
11023       properties may optionally be prefixed by "Is", for  compatibility  with
11024       Perl 5.6. PCRE2 does not support this.
11025
11026
11027WIDE CHARACTERS AND UTF MODES
11028
11029       Code points less than 256 can be specified in patterns by either braced
11030       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
11031       Larger  values have to use braced sequences. Unbraced octal code points
11032       up to \777 are also recognized; larger ones can be coded using \o{...}.
11033
11034       The escape sequence \N{U+<hex digits>} is recognized as another way  of
11035       specifying  a  Unicode character by code point in a UTF mode. It is not
11036       allowed in non-UTF mode.
11037
11038       In UTF mode, repeat quantifiers apply to complete UTF  characters,  not
11039       to individual code units.
11040
11041       In UTF mode, the dot metacharacter matches one UTF character instead of
11042       a single code unit.
11043
11044       In UTF mode, capture group names are not restricted to ASCII,  and  may
11045       contain any Unicode letters and decimal digits, as well as underscore.
11046
11047       The  escape  sequence \C can be used to match a single code unit in UTF
11048       mode, but its use can lead to some strange effects because it breaks up
11049       multi-unit  characters  (see  the description of \C in the pcre2pattern
11050       documentation). For this reason, there is a build-time option that dis-
11051       ables  support  for  \C completely. There is also a less draconian com-
11052       pile-time option for locking out the use of \C when a pattern  is  com-
11053       piled.
11054
11055       The  use  of  \C  is not supported by the alternative matching function
11056       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11057       ter  may  consist  of  more  than one code unit. The use of \C in these
11058       modes provokes a match-time error. Also, the JIT optimization does  not
11059       support \C in these modes. If JIT optimization is requested for a UTF-8
11060       or UTF-16 pattern that contains \C, it will not succeed,  and  so  when
11061       pcre2_match() is called, the matching will be carried out by the inter-
11062       pretive function.
11063
11064       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
11065       characters  of  any  code  value,  but, by default, the characters that
11066       PCRE2 recognizes as digits, spaces, or word characters remain the  same
11067       set  as  in  non-UTF mode, all with code points less than 256. This re-
11068       mains true even when PCRE2 is built to include Unicode support, because
11069       to  do  otherwise  would  slow down matching in many common cases. Note
11070       that this also applies to \b and \B, because they are defined in  terms
11071       of  \w  and \W. If you want to test for a wider sense of, say, "digit",
11072       you can use explicit Unicode property tests such  as  \p{Nd}.  Alterna-
11073       tively, if you set the PCRE2_UCP option, the way that the character es-
11074       capes work is changed so that Unicode properties are used to  determine
11075       which  characters  match.  There  are  more  details  in the section on
11076       generic character types in the pcre2pattern documentation.
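
       By way of analogy only, Python's re module has a similar switch:
       with the re.ASCII flag \d recognizes only ASCII digits, and without
       it Unicode properties are used. The sketch below is not PCRE2 code.
       It assumes that this Python behaviour is a fair stand-in for leaving
       PCRE2_UCP unset and set, respectively.

```python
import re
import unicodedata

ARABIC_INDIC_THREE = "\u0663"

# Restricted interpretation: only the ASCII digits 0-9 count as digits.
restricted = re.fullmatch(r"\d", ARABIC_INDIC_THREE, re.ASCII)

# Property-based interpretation: any decimal number counts.
property_based = re.fullmatch(r"\d", ARABIC_INDIC_THREE)

# The explicit property test corresponding to \p{Nd}.
is_nd = unicodedata.category(ARABIC_INDIC_THREE) == "Nd"
```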
11077
11078       Similarly, characters that match the POSIX named character classes  are
11079       all low-valued characters, unless the PCRE2_UCP option is set.
11080
11081       However,  the  special horizontal and vertical white space matching es-
11082       capes (\h, \H, \v, and \V) do match all the appropriate Unicode charac-
11083       ters, whether or not PCRE2_UCP is set.
11084
11085
11086UNICODE CASE-EQUIVALENCE
11087
11088       If  either  PCRE2_UTF  or PCRE2_UCP is set, upper/lower case processing
11089       makes use of Unicode properties except for characters whose code points
11090       are less than 128 and that have at most two case-equivalent values. For
11091       these, a direct table lookup is used for speed. A few  Unicode  charac-
11092       ters  such as Greek sigma have more than two code points that are case-
11093       equivalent, and these are treated specially. Setting PCRE2_UCP  without
11094       PCRE2_UTF  allows  Unicode-style  case processing for non-UTF character
11095       encodings such as UCS-2.
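
       To see why some characters need special treatment, the sketch below
       collects the characters in the Basic Multilingual Plane that share a
       case folding. It uses Python's str.casefold(), which applies full
       Unicode case folding and may therefore differ in detail from the
       case-equivalence tables that PCRE2 uses. The function name is
       hypothetical.

```python
def case_equivalents(ch: str) -> set:
    """Return the BMP characters whose case folding equals that of ch."""
    key = ch.casefold()
    return {chr(cp) for cp in range(0x10000) if chr(cp).casefold() == key}


if __name__ == "__main__":
    # Greek sigma: capital, small, and final small forms.
    print(sorted(hex(ord(c)) for c in case_equivalents("\u03c3")))
    # Letter k: the Kelvin sign U+212A folds to it as well.
    print(sorted(hex(ord(c)) for c in case_equivalents("k")))
```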
11096
11097
11098SCRIPT RUNS
11099
11100       The pattern constructs (*script_run:...) and  (*atomic_script_run:...),
11101       with  synonyms (*sr:...) and (*asr:...), verify that the string matched
11102       within the parentheses is a script run. In concept, a script run  is  a
11103       sequence  of characters that are all from the same Unicode script. How-
11104       ever, because some scripts are commonly used together, and because some
11105       diacritical  and  other marks are used with multiple scripts, it is not
11106       that simple.
11107
11108       Every Unicode character has a Script property, mostly with a value cor-
11109       responding  to the name of a script, such as Latin, Greek, or Cyrillic.
11110       There are also three special values:
11111
11112       "Unknown" is used for code points that have not been assigned, and also
11113       for  the surrogate code points. In the PCRE2 32-bit library, characters
11114       whose code points are greater  than  the  Unicode  maximum  (U+10FFFF),
11115       which  are  accessible  only  in non-UTF mode, are assigned the Unknown
11116       script.
11117
11118       "Common" is used for characters that are used with many scripts.  These
11119       include  punctuation,  emoji,  mathematical, musical, and currency sym-
11120       bols, and the ASCII digits 0 to 9.
11121
11122       "Inherited" is used for characters such as diacritical marks that  mod-
11123       ify a previous character. These are considered to take on the script of
11124       the character that they modify.
11125
11126       Some Inherited characters are used with many scripts, but many of  them
11127       are  only  normally  used  with a small number of scripts. For example,
11128       U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11129       tic.  In  order  to  make it possible to check this, a Unicode property
11130       called Script Extension exists. Its value is a list of scripts that ap-
11131       ply to the character. For the majority of characters, the list contains
11132       just one script, the same one as  the  Script  property.  However,  for
11133       characters  such  as  U+102E0 more than one Script is listed. There are
11134       also some Common characters that have a single,  non-Common  script  in
11135       their Script Extension list.
11136
11137       The next section describes the basic rules for deciding whether a given
11138       string of characters is a script run. Note,  however,  that  there  are
11139       some  special cases involving the Chinese Han script, and an additional
11140       constraint for decimal digits. These are  covered  in  subsequent  sec-
11141       tions.
11142
11143   Basic script run rules
11144
11145       A string that is less than two characters long is a script run. This is
11146       the only case in which an Unknown character can be  part  of  a  script
11147       run.  Longer strings are checked using only the Script Extensions prop-
11148       erty, not the basic Script property.
11149
11150       If a character's Script Extension property is the single value  "Inher-
11151       ited", it is always accepted as part of a script run. This is also true
11152       for the property "Common", subject to the checking  of  decimal  digits
11153       described below. All the remaining characters in a script run must have
11154       at least one script in common in their Script Extension lists. In  set-
11155       theoretic terminology, the intersection of all the sets of scripts must
11156       not be empty.
11157
11158       A simple example is an Internet name such as "google.com". The  letters
11159       are all in the Latin script, and the dot is Common, so this string is a
11160       script run.  However, the Cyrillic letter "o" looks exactly the same as
11161       the  Latin "o"; a string that looks the same, but with Cyrillic "o"s is
11162       not a script run.
11163
11164       More interesting examples involve characters with more than one  script
11165       in their Script Extension. Consider the following characters:
11166
11167         U+060C  Arabic comma
11168         U+06D4  Arabic full stop
11169
11170       The  first  has the Script Extension list Arabic, Hanifi Rohingya, Syr-
11171       iac, and Thaana; the second has just Arabic and Hanifi  Rohingya.  Both
11172       of  them  could  appear  in  script runs of either Arabic or Hanifi Ro-
11173       hingya. The first could also appear in Syriac or  Thaana  script  runs,
11174       but the second could not.
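
       The basic rules can be sketched as an intersection of Script
       Extension sets. The table below is a tiny hand-written excerpt for
       illustration, not real Unicode data: the two Arabic punctuation
       entries follow the lists given above, and the single-script entries
       are assumptions. The sketch ignores the Han and decimal digit rules
       described in the following sections.

```python
from string import ascii_lowercase

# Hypothetical excerpt of Script Extension data. A real implementation
# would load the full table from the Unicode Character Database.
SCRIPT_EXTENSIONS = {ch: {"Latin"} for ch in ascii_lowercase}
SCRIPT_EXTENSIONS.update({
    ".": {"Common"},
    "\u043e": {"Cyrillic"},  # Cyrillic small letter o
    "\u060c": {"Arabic", "Hanifi_Rohingya", "Syriac", "Thaana"},  # comma
    "\u06d4": {"Arabic", "Hanifi_Rohingya"},  # Arabic full stop
    "\u0710": {"Syriac"},  # Syriac letter alaph (assumed single script)
})


def is_script_run(s: str) -> bool:
    """Apply the basic script run rules to a string."""
    if len(s) < 2:
        return True  # the only case in which Unknown is acceptable
    common = None
    for ch in s:
        scx = SCRIPT_EXTENSIONS.get(ch, {"Unknown"})
        if scx == {"Unknown"}:
            return False
        if scx == {"Inherited"} or scx == {"Common"}:
            continue  # always accepted
        common = set(scx) if common is None else common & scx
        if not common:
            return False  # the intersection has become empty
    return True
```

       With this table, "google.com" is a script run, while the same
       string with a Cyrillic "o" substituted is not.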
11175
11176   The Chinese Han script
11177
11178       The  Chinese  Han  script  is  commonly  used in conjunction with other
11179       scripts for writing certain languages. Japanese uses the  Hiragana  and
11180       Katakana  scripts  together  with Han; Korean uses Hangul and Han; Tai-
11181       wanese Mandarin uses Bopomofo and Han.  These  three  combinations  are
11182       treated  as special cases when checking script runs and are, in effect,
11183       "virtual scripts". Thus, a script run may contain a  mixture  of  Hira-
11184       gana,  Katakana,  and Han, or a mixture of Hangul and Han, or a mixture
11185       of Bopomofo and Han, but not, for example,  a  mixture  of  Hangul  and
       Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical Standard
       39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/) in
       allowing such mixtures.
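
       The three "virtual scripts" can be sketched as sets. The function
       below is a simplification for illustration: it takes one script
       name per character and ignores multi-entry Script Extension lists.
       It also assumes that any subset of a virtual script is acceptable.
       The names used are hypothetical.

```python
VIRTUAL_SCRIPTS = (
    frozenset({"Han", "Hiragana", "Katakana"}),  # Japanese
    frozenset({"Han", "Hangul"}),                # Korean
    frozenset({"Han", "Bopomofo"}),              # Taiwanese Mandarin
)


def scripts_form_run(scripts) -> bool:
    """Decide whether a collection of script names may share a run."""
    used = set(scripts) - {"Common", "Inherited"}
    if len(used) <= 1:
        return True
    # More than one script: all must belong to one virtual script.
    return any(used <= virtual for virtual in VIRTUAL_SCRIPTS)
```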
11189
11190   Decimal digits
11191
11192       Unicode  contains  many sets of 10 decimal digits in different scripts,
11193       and some scripts (including the Common script) contain  more  than  one
       set. Some of these decimal digits are visually indistinguishable
11195       from the common ASCII digits. In addition to the  script  checking  de-
11196       scribed  above,  if a script run contains any decimal digits, they must
11197       all come from the same set of 10 adjacent characters.
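
       A minimal sketch of the digit constraint: each decimal digit is
       mapped to the code point of the zero in its set, and a run is
       acceptable only if all digits share the same zero. This relies on
       the assumption that every set of decimal digits occupies 10
       adjacent code points in ascending order. The function name is
       hypothetical.

```python
import unicodedata


def digits_from_one_set(s: str) -> bool:
    """Check that all decimal digits in s come from one set of 10."""
    zeros = set()
    for ch in s:
        if unicodedata.category(ch) == "Nd":
            # Subtracting the digit's value gives the zero of its set.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return len(zeros) <= 1
```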
11198
11199
11200VALIDITY OF UTF STRINGS
11201
11202       When the PCRE2_UTF option is set, the strings passed  as  patterns  and
11203       subjects are (by default) checked for validity on entry to the relevant
11204       functions. If an invalid UTF string is passed, a negative error code is
11205       returned.  The  code  unit offset to the offending character can be ex-
11206       tracted from the match data  block  by  calling  pcre2_get_startchar(),
11207       which is used for this purpose after a UTF error.
11208
11209       In  some  situations, you may already know that your strings are valid,
11210       and therefore want to skip these checks in  order  to  improve  perfor-
11211       mance,  for  example in the case of a long subject string that is being
11212       scanned repeatedly.  If you set the PCRE2_NO_UTF_CHECK option  at  com-
11213       pile  time  or at match time, PCRE2 assumes that the pattern or subject
11214       it is given (respectively) contains only valid UTF code unit sequences.
11215
11216       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
11217       result  is undefined and your program may crash or loop indefinitely or
11218       give incorrect results. There is, however, one mode  of  matching  that
11219       can  handle  invalid  UTF  subject  strings. This is enabled by passing
11220       PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is  discussed  below  in
11221       the  next  section.  The  rest  of  this  section  covers the case when
11222       PCRE2_MATCH_INVALID_UTF is not set.
11223
11224       Passing PCRE2_NO_UTF_CHECK to pcre2_compile()  just  disables  the  UTF
11225       check  for  the  pattern; it does not also apply to subject strings. If
11226       you want to disable the check for a subject string you must  pass  this
11227       same option to pcre2_match() or pcre2_dfa_match().
11228
       UTF-16 and UTF-32 strings can indicate their endianness by a special code
       known as a byte-order mark (BOM). The PCRE2 functions do not handle
       this, expecting strings to be in host byte order.
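
       Since the library expects host byte order, an application that
       receives UTF-16 data with a BOM has to normalize it first. The
       sketch below shows one way to do that in Python before handing the
       data to a binding. The function name is hypothetical, and input
       without a BOM is assumed to be in host order already.

```python
import codecs
import sys

HOST_UTF16 = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def utf16_to_host_order(data: bytes) -> bytes:
    """Strip a UTF-16 BOM and return the text in host byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be")
    else:
        return data  # no BOM: assumed to be in host order already
    return text.encode(HOST_UTF16)
```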
11232
11233       Unless  PCRE2_NO_UTF_CHECK  is  set, a UTF string is checked before any
11234       other  processing  takes  place.  In  the  case  of  pcre2_match()  and
11235       pcre2_dfa_match()  calls  with a non-zero starting offset, the check is
11236       applied only to that part of the subject that could be inspected during
11237       matching,  and  there is a check that the starting offset points to the
11238       first code unit of a character or to the end of the subject.  If  there
11239       are  no  lookbehind  assertions in the pattern, the check starts at the
11240       starting offset.  Otherwise, it starts at the  length  of  the  longest
11241       lookbehind  before  the starting offset, or at the start of the subject
11242       if there are not that many characters before the starting offset.  Note
11243       that the sequences \b and \B are one-character lookbehinds.
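
       The  rule  for where the check starts can be sketched in C for a UTF-8
       subject as follows. This is only an illustration of  the  description
       above,  not  PCRE2's own code; the function name utf8_check_start() and
       its parameters are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for UTF-8 subjects, not part of PCRE2.
   Returns -1 if start_offset is beyond the subject or points at a
   continuation byte (that is, into the middle of a character). Otherwise
   returns 0 and sets *check_start to the offset at which a validity check
   would begin: max_lookbehind characters before start_offset, or 0 if there
   are not that many characters. */
int utf8_check_start(const uint8_t *subject, size_t length,
                     size_t start_offset, size_t max_lookbehind,
                     size_t *check_start)
{
    if (start_offset > length) return -1;
    if (start_offset < length && (subject[start_offset] & 0xc0) == 0x80)
        return -1;

    size_t pos = start_offset;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Move back over continuation bytes to the character's first byte. */
        while (pos > 0 && (subject[pos] & 0xc0) == 0x80) pos--;
        max_lookbehind--;
    }

    *check_start = pos;
    return 0;
}
```

       For a pattern that uses \b but no other lookbehind, max_lookbehind
       would be 1 in this sketch.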
11244
11245       In  addition  to checking the format of the string, there is a check to
11246       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
11247       the  surrogate  area. The so-called "non-character" code points are not
11248       excluded because Unicode corrigendum #9 makes it clear that they should
11249       not be.
11250
11251       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
11252       UTF-16, where they are used in pairs to encode code points with  values
11253       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
11254       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
11255       other  words, the whole surrogate thing is a fudge for UTF-16 which un-
11256       fortunately messes up UTF-8 and UTF-32.)
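
       The way UTF-16 uses surrogate pairs can be shown with a  short  C
       sketch.  The  function utf16_encode() is hypothetical and is not part
       of PCRE2; it follows the standard UTF-16 encoding arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Encodes one code point as UTF-16.
   Returns the number of code units written to out (1 or 2), or 0 if cp is
   a surrogate or is greater than 0x10FFFF and so cannot be encoded. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* reserved for pairs */

    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

       For  example,  U+1F600  is  encoded  as  the pair 0xD83D, 0xDE00. The
       values 0xD800 to 0xDFFF themselves have no encoding, which is why they
       are rejected in all three UTF forms.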
11257
11258       Setting PCRE2_NO_UTF_CHECK at compile time does not disable  the  error
11259       that  is  given if an escape sequence for an invalid Unicode code point
11260       is encountered in the pattern. If you want to  allow  escape  sequences
11261       such  as  \x{d800}  (a  surrogate code point) you can set the PCRE2_EX-
11262       TRA_ALLOW_SURROGATE_ESCAPES extra option.  However,  this  is  possible
11263       only  in  UTF-8  and  UTF-32 modes, because these values are not repre-
11264       sentable in UTF-16.
11265
11266   Errors in UTF-8 strings
11267
11268       The following negative error codes are given for invalid UTF-8 strings:
11269
11270         PCRE2_ERROR_UTF8_ERR1
11271         PCRE2_ERROR_UTF8_ERR2
11272         PCRE2_ERROR_UTF8_ERR3
11273         PCRE2_ERROR_UTF8_ERR4
11274         PCRE2_ERROR_UTF8_ERR5
11275
11276       The string ends with a truncated UTF-8 character;  the  code  specifies
11277       how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
11278       characters to be no longer than 4 bytes, the  encoding  scheme  (origi-
11279       nally  defined  by  RFC  2279)  allows  for  up to 6 bytes, and this is
11280       checked first; hence the possibility of 4 or 5 missing bytes.
11281
11282         PCRE2_ERROR_UTF8_ERR6
11283         PCRE2_ERROR_UTF8_ERR7
11284         PCRE2_ERROR_UTF8_ERR8
11285         PCRE2_ERROR_UTF8_ERR9
11286         PCRE2_ERROR_UTF8_ERR10
11287
11288       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
11289       the  character  do  not have the binary value 0b10 (that is, either the
11290       most significant bit is 0, or the next bit is 1).
11291
11292         PCRE2_ERROR_UTF8_ERR11
11293         PCRE2_ERROR_UTF8_ERR12
11294
11295       A character that is valid by the RFC 2279 rules is either 5 or 6  bytes
11296       long; these code points are excluded by RFC 3629.
11297
11298         PCRE2_ERROR_UTF8_ERR13
11299
11300       A 4-byte character has a value greater than 0x10ffff; these code points
11301       are excluded by RFC 3629.
11302
11303         PCRE2_ERROR_UTF8_ERR14
11304
11305       A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this
11306       range  of  code points is reserved by RFC 3629 for use with UTF-16, and
11307       so is excluded from UTF-8.
11308
11309         PCRE2_ERROR_UTF8_ERR15
11310         PCRE2_ERROR_UTF8_ERR16
11311         PCRE2_ERROR_UTF8_ERR17
11312         PCRE2_ERROR_UTF8_ERR18
11313         PCRE2_ERROR_UTF8_ERR19
11314
11315       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes
11316       for  a  value that can be represented by fewer bytes, which is invalid.
11317       For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-
11318       rect coding uses just one byte.
11319
11320         PCRE2_ERROR_UTF8_ERR20
11321
11322       The two most significant bits of the first byte of a character have the
11323       binary value 0b10 (that is, the most significant bit is 1 and the  sec-
11324       ond  is  0). Such a byte can only validly occur as the second or subse-
11325       quent byte of a multi-byte character.
11326
11327         PCRE2_ERROR_UTF8_ERR21
11328
11329       The first byte of a character has the value 0xfe or 0xff. These  values
11330       can never occur in a valid UTF-8 string.
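
       The classification above can be sketched as a small stand-alone  C
       function.  This  is  an  illustration  written  from the descriptions
       in this section, not PCRE2's own validator, and the order in  which  it
       applies  the  checks is an assumption. The function utf8_check() is
       hypothetical; it returns 0 for a valid string, or  the  number  n  of
       the  matching  PCRE2_ERROR_UTF8_ERRn  description, rather than the real
       (negative) error codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Returns 0 if s[0..len) is valid
   UTF-8, otherwise n (1 to 21) following the numbering of the ERRn
   descriptions. On error, *erroff is the offset of the offending character. */
int utf8_check(const uint8_t *s, size_t len, size_t *erroff)
{
    size_t i = 0;
    while (i < len) {
        uint8_t c = s[i];
        *erroff = i;
        if (c < 0x80) { i++; continue; }        /* one-byte character */
        if (c < 0xc0) return 20;                /* isolated 0b10 byte */
        if (c >= 0xfe) return 21;               /* 0xfe or 0xff */

        /* Number of continuation bytes by the RFC 2279 scheme. */
        size_t extra = c < 0xe0 ? 1 : c < 0xf0 ? 2 :
                       c < 0xf8 ? 3 : c < 0xfc ? 4 : 5;
        size_t avail = len - i - 1;
        if (avail < extra) return (int)(extra - avail);     /* 1 to 5 */

        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xc0) != 0x80)
                return 5 + (int)k;                          /* 6 to 10 */

        uint8_t d = s[i + 1];
        switch (extra) {
        case 1:
            if ((c & 0x3e) == 0) return 15;                 /* overlong */
            break;
        case 2:
            if (c == 0xe0 && (d & 0x20) == 0) return 16;    /* overlong */
            if (c == 0xed && d >= 0xa0) return 14;          /* surrogate */
            break;
        case 3:
            if (c == 0xf0 && (d & 0x30) == 0) return 17;    /* overlong */
            if (c > 0xf4 || (c == 0xf4 && d > 0x8f)) return 13;
            break;
        case 4:
            if (c == 0xf8 && (d & 0x38) == 0) return 18;    /* overlong */
            return 11;                                      /* 5 bytes */
        case 5:
            if (c == 0xfc && (d & 0x3c) == 0) return 19;    /* overlong */
            return 12;                                      /* 6 bytes */
        }
        i += extra + 1;
    }
    return 0;
}
```

       With  this  sketch,  the  two  bytes 0xc0, 0xae from the example above
       give 15, the number of the first "overlong" description.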
11331
11332   Errors in UTF-16 strings
11333
11334       The  following  negative  error  codes  are  given  for  invalid UTF-16
11335       strings:
11336
11337         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
11338         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
11339         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
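
       The  three UTF-16 cases can be distinguished with a check along these
       lines. The function utf16_check() is a hypothetical sketch, not  part
       of  PCRE2;  it  returns  0 for a valid string, or 1, 2, or 3 to mirror
       the numbering of the descriptions above, rather than the real  error
       codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Returns 0 if s[0..len) is valid
   UTF-16 in host byte order, otherwise 1, 2, or 3 as described above.
   On error, *erroff is the offset (in code units) of the offending unit. */
int utf16_check(const uint16_t *s, size_t len, size_t *erroff)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t c = s[i];
        if ((c & 0xf800) != 0xd800) continue;     /* not a surrogate */

        *erroff = i;
        if (c & 0x0400) return 3;                 /* low surrogate first */
        if (i + 1 >= len) return 1;               /* high surrogate at end */
        if ((s[i + 1] & 0xfc00) != 0xdc00) return 2;
        i++;                                      /* skip the valid pair */
    }
    return 0;
}
```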
11340
11341
11342   Errors in UTF-32 strings
11343
11344       The following  negative  error  codes  are  given  for  invalid  UTF-32
11345       strings:
11346
11347         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
11348         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
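
       For UTF-32, each code unit is a whole code point, so the check reduces
       to  a  range  test.  The  function utf32_check() below is a hypothetical
       sketch, not part of PCRE2; it returns 0 for a valid string, or 1 or  2
       to mirror the numbering above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Returns 0 if every code unit in
   s[0..len) is a valid code point, 1 for a surrogate, 2 for a value above
   0x10ffff. Non-character code points such as U+FFFE are accepted.
   On error, *erroff is the offset of the offending code unit. */
int utf32_check(const uint32_t *s, size_t len, size_t *erroff)
{
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xfffff800u) == 0xd800u) { *erroff = i; return 1; }
        if (s[i] > 0x10ffffu)                { *erroff = i; return 2; }
    }
    return 0;
}
```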
11349
11350
11351MATCHING IN INVALID UTF STRINGS
11352
11353       You can run pattern matches on subject strings that may contain invalid
11354       UTF sequences if you  call  pcre2_compile()  with  the  PCRE2_MATCH_IN-
11355       VALID_UTF  option.  This  is  supported by pcre2_match(), including JIT
11356       matching, but not by pcre2_dfa_match(). When PCRE2_MATCH_INVALID_UTF is
11357       set,  it  forces  PCRE2_UTF  to be set as well. Note, however, that the
11358       pattern itself must be a valid UTF string.
11359
11360       Setting PCRE2_MATCH_INVALID_UTF does not  affect  what  pcre2_compile()
11361       generates,  but  if pcre2_jit_compile() is subsequently called, it does
11362       generate different code. If JIT is not used, the option affects the be-
11363       haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
11364       VALID_UTF is set at compile  time,  PCRE2_NO_UTF_CHECK  is  ignored  at
11365       match time.
11366
11367       In  this  mode,  an  invalid  code  unit  sequence in the subject never
11368       matches any pattern item. It does not match  dot,  it  does  not  match
11369       \p{Any},  it does not even match negative items such as [^X]. A lookbe-
11370       hind assertion fails if it encounters an invalid sequence while  moving
11371       the  current  point backwards. In other words, an invalid UTF code unit
11372       sequence acts as a barrier which no match can cross.
11373
11374       You can also think of this as the subject being split up into fragments
11375       of  valid UTF, delimited internally by invalid code unit sequences. The
11376       pattern is matched fragment by fragment. The  result  of  a  successful
11377       match,  however,  is  given  as code unit offsets in the entire subject
11378       string in the usual way. There are a few points to consider:
11379
11380       The internal boundaries are not interpreted as the beginnings  or  ends
11381       of  lines  and  so  do not match circumflex or dollar characters in the
11382       pattern.
11383
11384       If pcre2_match() is called with an offset that  points  to  an  invalid
11385       UTF-sequence,  that  sequence  is  skipped, and the match starts at the
11386       next valid UTF character, or the end of the subject.
11387
11388       At internal fragment boundaries, \b and \B behave in the same way as at
11389       the  beginning  and end of the subject. For example, a sequence such as
11390       \bWORD\b would match an instance of WORD that is surrounded by  invalid
11391       UTF code units.
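
       The  "fragment"  view  of  the  subject can be modelled with a short C
       sketch. UTF-32 is used here only because validity is  then  a  simple
       per-unit  test;  the  functions cp_ok() and next_fragment() are
       hypothetical and do not reflect how PCRE2 is implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* A code unit is valid UTF-32 if it is at most 0x10ffff and not a surrogate. */
static int cp_ok(uint32_t c)
{
    return c <= 0x10ffffu && (c < 0xd800u || c > 0xdfffu);
}

/* Hypothetical helper, not part of PCRE2. Starting at pos, skips any
   invalid code units and then finds the extent of the next run of valid
   ones. Returns 1 and sets [*frag_start, *frag_end) if a fragment exists,
   or 0 if only invalid units (or nothing) remain. */
int next_fragment(const uint32_t *s, size_t len, size_t pos,
                  size_t *frag_start, size_t *frag_end)
{
    while (pos < len && !cp_ok(s[pos])) pos++;    /* skip the barrier */
    if (pos >= len) return 0;

    *frag_start = pos;
    while (pos < len && cp_ok(s[pos])) pos++;
    *frag_end = pos;
    return 1;
}
```

       In  this  model  a  match  lies  wholly  inside one fragment, and the
       offsets reported are relative to the whole subject, as described above.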
11392
11393       Using  PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
11394       trary data, knowing that any matched  strings  that  are  returned  are
11395       valid UTF. This can be useful when searching for UTF text in executable
11396       or other binary files.
11397
11398
11399AUTHOR
11400
11401       Philip Hazel
11402       University Computing Service
11403       Cambridge, England.
11404
11405
11406REVISION
11407
11408       Last updated: 23 February 2020
11409       Copyright (c) 1997-2020 University of Cambridge.
11410------------------------------------------------------------------------------
11411
11412
11413