1-----------------------------------------------------------------------------
2This file contains a concatenation of the PCRE2 man pages, converted to plain
3text format for ease of searching with a text editor, or for use on systems
4that do not have a man page processor. The small individual files that give
5synopses of each function in the library have not been included. Neither has
6the pcre2demo program. There are separate text files for the pcre2grep and
7pcre2test commands.
8-----------------------------------------------------------------------------
9
10
11PCRE2(3)                   Library Functions Manual                   PCRE2(3)
12
13
14
15NAME
16       PCRE2 - Perl-compatible regular expressions (revised API)
17
18INTRODUCTION
19
20       PCRE2 is the name used for a revised API for the PCRE library, which is
21       a set of functions, written in C,  that  implement  regular  expression
22       pattern matching using the same syntax and semantics as Perl, with just
23       a few differences. Some features that appeared in Python and the origi-
24       nal  PCRE  before  they  appeared  in Perl are also available using the
25       Python syntax. There is also some support for one or two .NET and Onig-
26       uruma  syntax  items,  and  there are options for requesting some minor
27       changes that give better ECMAScript (aka JavaScript) compatibility.
28
29       The source code for PCRE2 can be compiled to support 8-bit, 16-bit,  or
30       32-bit  code units, which means that up to three separate libraries may
31       be installed.  The original work to extend PCRE to  16-bit  and  32-bit
32       code  units  was  done  by Zoltan Herczeg and Christian Persch, respec-
33       tively. In all three cases, strings can be interpreted  either  as  one
34       character  per  code  unit, or as UTF-encoded Unicode, with support for
35       Unicode general category properties. Unicode  support  is  optional  at
36       build  time  (but  is  the default). However, processing strings as UTF
37       code units must be enabled explicitly at run time. The version of  Uni-
38       code in use can be discovered by running
39
40         pcre2test -C
41
42       The  three  libraries  contain  identical sets of functions, with names
43       ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
44       pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
45       32, a program that uses just one code unit width can be  written  using
46       generic names such as pcre2_compile(), and the documentation is written
47       assuming that this is the case.
48
49       In addition to the Perl-compatible matching function, PCRE2 contains an
50       alternative  function that matches the same compiled patterns in a dif-
51       ferent way. In certain circumstances, the alternative function has some
52       advantages.   For  a discussion of the two matching algorithms, see the
53       pcre2matching page.
54
55       Details of exactly which Perl regular expression features are  and  are
56       not  supported  by  PCRE2  are  given  in  separate  documents. See the
57       pcre2pattern and pcre2compat pages. There is a syntax  summary  in  the
58       pcre2syntax page.
59
60       Some  features  of PCRE2 can be included, excluded, or changed when the
61       library is built. The pcre2_config() function makes it possible  for  a
62       client  to  discover  which  features are available. The features them-
63       selves are described in the pcre2build page. Documentation about build-
64       ing  PCRE2 for various operating systems can be found in the README and
65       NON-AUTOTOOLS_BUILD files in the source distribution.
66
67       The libraries contain a number of undocumented internal functions  and
68       data  tables  that  are  used by more than one of the exported external
69       functions, but which are not intended  for  use  by  external  callers.
70       Their  names  all begin with "_pcre2", which hopefully will not provoke
71       any name clashes. In some environments, it is possible to control which
72       external  symbols  are  exported when a shared library is built, and in
73       these cases the undocumented symbols are not exported.
74
75
76SECURITY CONSIDERATIONS
77
78       If you are using PCRE2 in a non-UTF application that permits  users  to
79       supply  arbitrary  patterns  for  compilation, you should be aware of a
80       feature that allows users to turn on UTF support from within a pattern.
81       For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
82       mode, which interprets patterns and subjects as strings of  UTF-8  code
83       units instead of individual 8-bit characters. This causes both the pat-
84       tern and any data against which it is matched to be checked  for  UTF-8
85       validity.  If the data string is very long, such a check might use suf-
86       ficiently many resources as to cause your application to  lose  perfor-
87       mance.
88
89       One  way  of guarding against this possibility is to use the pcre2_pat-
90       tern_info() function  to  check  the  compiled  pattern's  options  for
91       PCRE2_UTF.  Alternatively,  you can set the PCRE2_NEVER_UTF option when
92       calling pcre2_compile(). This causes a compile-time error if a pattern
93       contains a UTF-setting sequence.
94
95       The  use  of Unicode properties for character types such as \d can also
96       be enabled from within the pattern, by specifying "(*UCP)".  This  fea-
97       ture can be disallowed by setting the PCRE2_NEVER_UCP option.
98
99       If  your  application  is one that supports UTF, be aware that validity
100       checking can take time. If the same data string is to be  matched  many
101       times,  you  can  use  the PCRE2_NO_UTF_CHECK option for the second and
102       subsequent matches to avoid running redundant checks.
103
104       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
105       to  problems,  because  it  may leave the current matching point in the
106       middle of  a  multi-code-unit  character.  The  PCRE2_NEVER_BACKSLASH_C
107       option can be used by an application to lock out the use of \C, causing
108       a compile-time error if it is encountered. It is also possible to build
109       PCRE2 with the use of \C permanently disabled.
110
111       Another  way  that  performance can be hit is by running a pattern that
112       has a very large search tree against a string that  will  never  match.
113       Nested  unlimited repeats in a pattern are a common example. PCRE2 pro-
114       vides some protection against  this:  see  the  pcre2_set_match_limit()
115       function in the pcre2api page.
116
117
118USER DOCUMENTATION
119
120       The  user  documentation for PCRE2 comprises a number of different sec-
121       tions. In the "man" format, each of these is a separate "man page".  In
122       the  HTML  format, each is a separate page, linked from the index page.
123       In the plain  text  format,  the  descriptions  of  the  pcre2grep  and
124       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
125       respectively. The remaining sections, except for the pcre2demo  section
126       (which  is a program listing), and the short pages for individual func-
127       tions, are concatenated in pcre2.txt, for ease of searching.  The  sec-
128       tions are as follows:
129
130         pcre2              this document
131         pcre2-config       show PCRE2 installation configuration information
132         pcre2api           details of PCRE2's native C API
133         pcre2build         building PCRE2
134         pcre2callout       details of the callout feature
135         pcre2compat        discussion of Perl compatibility
136         pcre2demo          a demonstration C program that uses PCRE2
137         pcre2grep          description of the pcre2grep command (8-bit only)
138         pcre2jit           discussion of just-in-time optimization support
139         pcre2limits        details of size and other limits
140         pcre2matching      discussion of the two matching algorithms
141         pcre2partial       details of the partial matching facility
142         pcre2pattern       syntax and semantics of supported regular
143                              expression patterns
144         pcre2perform       discussion of performance issues
145         pcre2posix         the POSIX-compatible C API for the 8-bit library
146         pcre2sample        discussion of the pcre2demo program
147         pcre2stack         discussion of stack usage
148         pcre2syntax        quick syntax reference
149         pcre2test          description of the pcre2test command
150         pcre2unicode       discussion of Unicode and UTF support
151
152       In  the  "man"  and HTML formats, there is also a short page for each C
153       library function, listing its arguments and results.
154
155
156AUTHOR
157
158       Philip Hazel
159       University Computing Service
160       Cambridge, England.
161
162       Putting an actual email address here is a spam magnet. If you  want  to
163       email  me,  use  my two initials, followed by the two digits 10, at the
164       domain cam.ac.uk.
165
166
167REVISION
168
169       Last updated: 16 October 2015
170       Copyright (c) 1997-2015 University of Cambridge.
171------------------------------------------------------------------------------
172
173
174PCRE2API(3)                Library Functions Manual                PCRE2API(3)
175
176
177
178NAME
179       PCRE2 - Perl-compatible regular expressions (revised API)
180
181       #include <pcre2.h>
182
183       PCRE2  is  a  new API for PCRE. This document contains a description of
184       all its functions. See the pcre2 document for an overview  of  all  the
185       PCRE2 documentation.
186
187
188PCRE2 NATIVE API BASIC FUNCTIONS
189
190       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
191         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
192         pcre2_compile_context *ccontext);
193
194       void pcre2_code_free(pcre2_code *code);
195
196       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
197         pcre2_general_context *gcontext);
198
199       pcre2_match_data *pcre2_match_data_create_from_pattern(
200         const pcre2_code *code, pcre2_general_context *gcontext);
201
202       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
203         PCRE2_SIZE length, PCRE2_SIZE startoffset,
204         uint32_t options, pcre2_match_data *match_data,
205         pcre2_match_context *mcontext);
206
207       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
208         PCRE2_SIZE length, PCRE2_SIZE startoffset,
209         uint32_t options, pcre2_match_data *match_data,
210         pcre2_match_context *mcontext,
211         int *workspace, PCRE2_SIZE wscount);
212
213       void pcre2_match_data_free(pcre2_match_data *match_data);
214
215
216PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS
217
218       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
219
220       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
221
222       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
223
224       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
225
226
227PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS
228
229       pcre2_general_context *pcre2_general_context_create(
230         void *(*private_malloc)(PCRE2_SIZE, void *),
231         void (*private_free)(void *, void *), void *memory_data);
232
233       pcre2_general_context *pcre2_general_context_copy(
234         pcre2_general_context *gcontext);
235
236       void pcre2_general_context_free(pcre2_general_context *gcontext);
237
238
239PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
240
241       pcre2_compile_context *pcre2_compile_context_create(
242         pcre2_general_context *gcontext);
243
244       pcre2_compile_context *pcre2_compile_context_copy(
245         pcre2_compile_context *ccontext);
246
247       void pcre2_compile_context_free(pcre2_compile_context *ccontext);
248
249       int pcre2_set_bsr(pcre2_compile_context *ccontext,
250         uint32_t value);
251
252       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
253         const unsigned char *tables);
254
255       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
256         PCRE2_SIZE value);
257
258       int pcre2_set_newline(pcre2_compile_context *ccontext,
259         uint32_t value);
260
261       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
262         uint32_t value);
263
264       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
265         int (*guard_function)(uint32_t, void *), void *user_data);
266
267
268PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
269
270       pcre2_match_context *pcre2_match_context_create(
271         pcre2_general_context *gcontext);
272
273       pcre2_match_context *pcre2_match_context_copy(
274         pcre2_match_context *mcontext);
275
276       void pcre2_match_context_free(pcre2_match_context *mcontext);
277
278       int pcre2_set_callout(pcre2_match_context *mcontext,
279         int (*callout_function)(pcre2_callout_block *, void *),
280         void *callout_data);
281
282       int pcre2_set_match_limit(pcre2_match_context *mcontext,
283         uint32_t value);
284
285       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
286         PCRE2_SIZE value);
287
288       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
289         uint32_t value);
290
291       int pcre2_set_recursion_memory_management(
292         pcre2_match_context *mcontext,
293         void *(*private_malloc)(PCRE2_SIZE, void *),
294         void (*private_free)(void *, void *), void *memory_data);
295
296
297PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS
298
299       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
300         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
301
302       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
303         uint32_t number, PCRE2_UCHAR *buffer,
304         PCRE2_SIZE *bufflen);
305
306       void pcre2_substring_free(PCRE2_UCHAR *buffer);
307
308       int pcre2_substring_get_byname(pcre2_match_data *match_data,
309         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
310
311       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
312         uint32_t number, PCRE2_UCHAR **bufferptr,
313         PCRE2_SIZE *bufflen);
314
315       int pcre2_substring_length_byname(pcre2_match_data *match_data,
316         PCRE2_SPTR name, PCRE2_SIZE *length);
317
318       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
319         uint32_t number, PCRE2_SIZE *length);
320
321       int pcre2_substring_nametable_scan(const pcre2_code *code,
322         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
323
324       int pcre2_substring_number_from_name(const pcre2_code *code,
325         PCRE2_SPTR name);
326
327       void pcre2_substring_list_free(PCRE2_SPTR *list);
328
329       int pcre2_substring_list_get(pcre2_match_data *match_data,
330         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
331
332
333PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION
334
335       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
336         PCRE2_SIZE length, PCRE2_SIZE startoffset,
337         uint32_t options, pcre2_match_data *match_data,
338         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
339         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
340         PCRE2_SIZE *outlengthptr);
341
342
343PCRE2 NATIVE API JIT FUNCTIONS
344
345       int pcre2_jit_compile(pcre2_code *code, uint32_t options);
346
347       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
348         PCRE2_SIZE length, PCRE2_SIZE startoffset,
349         uint32_t options, pcre2_match_data *match_data,
350         pcre2_match_context *mcontext);
351
352       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
353
354       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
355         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
356
357       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
358         pcre2_jit_callback callback_function, void *callback_data);
359
360       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
361
362
363PCRE2 NATIVE API SERIALIZATION FUNCTIONS
364
365       int32_t pcre2_serialize_decode(pcre2_code **codes,
366         int32_t number_of_codes, const uint8_t *bytes,
367         pcre2_general_context *gcontext);
368
369       int32_t pcre2_serialize_encode(const pcre2_code **codes,
370         int32_t number_of_codes, uint8_t **serialized_bytes,
371         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
372
373       void pcre2_serialize_free(uint8_t *bytes);
374
375       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
376
377
378PCRE2 NATIVE API AUXILIARY FUNCTIONS
379
380       pcre2_code *pcre2_code_copy(const pcre2_code *code);
381
382       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
383         PCRE2_SIZE bufflen);
384
385       const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);
386
387       int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
388
389       int pcre2_callout_enumerate(const pcre2_code *code,
390         int (*callback)(pcre2_callout_enumerate_block *, void *),
391         void *user_data);
392
393       int pcre2_config(uint32_t what, void *where);
394
395
396PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
397
398       There  are  three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
399       code units, respectively. However,  there  is  just  one  header  file,
400       pcre2.h.   This  contains the function prototypes and other definitions
401       for all three libraries. One, two, or all three can be installed simul-
402       taneously.  On  Unix-like  systems the libraries are called libpcre2-8,
403       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
404       inal PCRE libraries.
405
406       Character  strings are passed to and from a PCRE2 library as a sequence
407       of unsigned integers in code units  of  the  appropriate  width.  Every
408       PCRE2  function  comes  in three different forms, one for each library,
409       for example:
410
411         pcre2_compile_8()
412         pcre2_compile_16()
413         pcre2_compile_32()
414
415       There are also three different sets of data types:
416
417         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
418         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32
419
420       The UCHAR types define unsigned code units of the  appropriate  widths.
421       For  example,  PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
422       types are constant pointers to the equivalent  UCHAR  types,  that  is,
423       they are pointers to vectors of unsigned code units.
424
425       Many  applications use only one code unit width. For their convenience,
426       macros are defined whose names are the generic forms such as pcre2_com-
427       pile()  and  PCRE2_SPTR.  These  macros  use  the  value  of  the macro
428       PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific  func-
429       tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
430       An application must define it to be  8,  16,  or  32  before  including
431       pcre2.h in order to make use of the generic names.
432
433       Applications  that use more than one code unit width can be linked with
434       more than one PCRE2 library, but must define  PCRE2_CODE_UNIT_WIDTH  to
435       be  0  before  including pcre2.h, and then use the real function names.
436       Any code that is to be included in an environment where  the  value  of
437       PCRE2_CODE_UNIT_WIDTH  is  unknown  should  also  use the real function
438       names. (Unfortunately, it is not possible in C code to save and restore
439       the value of a macro.)
440
441       If  PCRE2_CODE_UNIT_WIDTH  is  not  defined before including pcre2.h, a
442       compiler error occurs.
443
444       When using multiple libraries in an application,  you  must  take  care
445       when  processing  any  particular  pattern to use only functions from a
446       single library.  For example, if you want to run a match using  a  pat-
447       tern  that  was  compiled  with pcre2_compile_16(), you must do so with
448       pcre2_match_16(), not pcre2_match_8().
449
450       In the function summaries above, and in the rest of this  document  and
451       other  PCRE2  documents,  functions  and data types are described using
452       their generic names, without the 8, 16, or 32 suffix.
453
454
455PCRE2 API OVERVIEW
456
457       PCRE2 has its own native API, which  is  described  in  this  document.
458       There are also some wrapper functions for the 8-bit library that corre-
459       spond to the POSIX regular expression API, but they do not give  access
460       to all the functionality. They are described in the pcre2posix documen-
461       tation. Both these APIs define a set of C function calls.
462
463       The native API C data types, function prototypes,  option  values,  and
464       error codes are defined in the header file pcre2.h, which contains def-
465       initions of PCRE2_MAJOR and PCRE2_MINOR, the major  and  minor  release
466       numbers  for the library. Applications can use these to include support
467       for different releases of PCRE2.
468
469       In a Windows environment, if you want to statically link an application
470       program  against  a non-dll PCRE2 library, you must define PCRE2_STATIC
471       before including pcre2.h.
472
473       The functions pcre2_compile() and pcre2_match() are used for  compiling
474       and  matching regular expressions in a Perl-compatible manner. A sample
475       program that demonstrates the simplest way of using them is provided in
476       the file called pcre2demo.c in the PCRE2 source distribution. A listing
477       of this program is  given  in  the  pcre2demo  documentation,  and  the
478       pcre2sample documentation describes how to compile and run it.
479
480       Just-in-time  compiler support is an optional feature of PCRE2 that can
481       be built in appropriate hardware environments. It greatly speeds up the
482       matching  performance of many patterns. Programs can request that it be
483       used if available, by calling pcre2_jit_compile() after a  pattern  has
484       been successfully compiled by pcre2_compile(). This does nothing if JIT
485       support is not available.
486
487       More complicated programs might need to  make  use  of  the  specialist
488       functions    pcre2_jit_stack_create(),    pcre2_jit_stack_free(),   and
489       pcre2_jit_stack_assign() in order to  control  the  JIT  code's  memory
490       usage.
491
492       JIT matching is automatically used by pcre2_match() if it is available,
493       unless the PCRE2_NO_JIT option is set. There is also a direct interface
494       for  JIT  matching,  which gives improved performance. The JIT-specific
495       functions are discussed in the pcre2jit documentation.
496
497       A second matching function, pcre2_dfa_match(), which is  not  Perl-com-
498       patible,  is  also  provided.  This  uses a different algorithm for the
499       matching. The alternative algorithm finds all possible  matches  (at  a
500       given  point  in  the subject), and scans the subject just once (unless
501       there are lookbehind assertions).  However,  this  algorithm  does  not
502       return  captured  substrings.  A  description of the two matching algo-
503       rithms  and  their  advantages  and  disadvantages  is  given  in   the
504       pcre2matching    documentation.   There   is   no   JIT   support   for
505       pcre2_dfa_match().
506
507       In addition to the main compiling and  matching  functions,  there  are
508       convenience functions for extracting captured substrings from a subject
509       string that has been matched by pcre2_match(). They are:
510
511         pcre2_substring_copy_byname()
512         pcre2_substring_copy_bynumber()
513         pcre2_substring_get_byname()
514         pcre2_substring_get_bynumber()
515         pcre2_substring_list_get()
516         pcre2_substring_length_byname()
517         pcre2_substring_length_bynumber()
518         pcre2_substring_nametable_scan()
519         pcre2_substring_number_from_name()
520
521       pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
522       vided, to free the memory used for extracted strings.
523
524       The  function  pcre2_substitute()  can be called to match a pattern and
525       return a copy of the subject string with substitutions for  parts  that
526       were matched.
527
528       Functions  whose  names begin with pcre2_serialize_ are used for saving
529       compiled patterns on disc or elsewhere, and reloading them later.
530
531       Finally, there are functions for finding out information about  a  com-
532       piled  pattern  (pcre2_pattern_info()) and about the configuration with
533       which PCRE2 was built (pcre2_config()).
534
535       Functions with names ending with _free() are used  for  freeing  memory
536       blocks  of  various  sorts.  In all cases, if one of these functions is
537       called with a NULL argument, it does nothing.
538
539
540STRING LENGTHS AND OFFSETS
541
542       The PCRE2 API uses string lengths and  offsets  into  strings  of  code
543       units  in  several  places. These values are always of type PCRE2_SIZE,
544       which is an unsigned integer type, currently always defined as  size_t.
545       The  largest  value  that  can  be  stored  in  such  a  type  (that is
546       ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated
547       strings  and  unset offsets.  Therefore, the longest string that can be
548       handled is one less than this maximum.
549
550
551NEWLINES
552
553       PCRE2 supports five different conventions for indicating line breaks in
554       strings:  a  single  CR (carriage return) character, a single LF (line-
555       feed) character, the two-character sequence CRLF, any of the three pre-
556       ceding,  or any Unicode newline sequence. The Unicode newline sequences
557       are the three just mentioned, plus the single characters  VT  (vertical
558       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
559       separator, U+2028), and PS (paragraph separator, U+2029).
560
561       Each of the first three conventions is used by at least  one  operating
562       system as its standard newline sequence. When PCRE2 is built, a default
563       can be specified. If none is given, the default is LF, the Unix  stan-
564       dard.  However, the newline convention can be changed by an application
565       when calling pcre2_compile(), or it can be specified by special text at
566       the start of the pattern itself; this overrides any other settings. See
567       the pcre2pattern page for details of the special character sequences.
568
569       In the PCRE2 documentation the word "newline"  is  used  to  mean  "the
570       character or pair of characters that indicate a line break". The choice
571       of newline convention affects the handling of the dot, circumflex,  and
572       dollar metacharacters, the handling of #-comments in /x mode, and, when
573       CRLF is a recognized line ending sequence, the match position  advance-
574       ment for a non-anchored pattern. There is more detail about this in the
575       section on pcre2_match() options below.
576
577       The choice of newline convention does not affect the interpretation  of
578       the \n or \r escape sequences, nor does it affect what \R matches; this
579       has its own separate convention.
580
581
582MULTITHREADING
583
584       In a multithreaded application it is important to keep  thread-specific
585       data  separate  from data that can be shared between threads. The PCRE2
586       library code itself is thread-safe: it contains  no  static  or  global
587       variables.  The  API  is  designed to be fairly simple for non-threaded
588       applications while at the same time ensuring that multithreaded  appli-
589       cations can use it.
590
591       There are several different blocks of data that are used to pass infor-
592       mation between the application and the PCRE2 libraries.
593
594   The compiled pattern
595
596       A pointer to the compiled form of a pattern is  returned  to  the  user
597       when pcre2_compile() is successful. The data in the compiled pattern is
598       fixed, and does not change when the pattern is matched.  Therefore,  it
599       is  thread-safe, that is, the same compiled pattern can be used by more
600       than one thread simultaneously. For example, an application can compile
601       all its patterns at the start, before forking off multiple threads that
602       use them. However, if the just-in-time optimization  feature  is  being
603       used,  it  needs  separate  memory stack areas for each thread. See the
604       pcre2jit documentation for more details.
605
606       In a more complicated situation, where patterns are compiled only  when
607       they  are  first needed, but are still shared between threads, pointers
608       to compiled patterns must be protected  from  simultaneous  writing  by
609       multiple threads, at least until a pattern has been compiled. The logic
610       can be something like this:
611
612         Get a read-only (shared) lock (mutex) for pointer
613         if (pointer == NULL)
614           {
615           Get a write (unique) lock for pointer
616           pointer = pcre2_compile(...
617           }
618         Release the lock
619         Use pointer in pcre2_match()
620
621       Of course, testing for compilation errors should also  be  included  in
622       the code.
623
624       If JIT is being used, but the JIT compilation is not being done immedi-
625       ately (perhaps waiting to see if the pattern is  used  often  enough),
626       similar logic is required. JIT compilation updates a pointer within the
627       compiled code block, so a thread must gain unique write access  to  the
628       pointer     before    calling    pcre2_jit_compile().    Alternatively,
629       pcre2_code_copy() can be used to obtain a private copy of the  compiled
630       code.
631
632   Context blocks
633
634       The  next main section below introduces the idea of "contexts" in which
635       PCRE2 functions are called. A context is nothing more than a collection
636       of parameters that control the way PCRE2 operates. Grouping a number of
637       parameters together in a context is a convenient way of passing them to
638       a  PCRE2  function without using lots of arguments. The parameters that
639       are stored in contexts are in some sense  "advanced  features"  of  the
640       API. Many straightforward applications will not need to use contexts.
641
642       In a multithreaded application, if the parameters in a context are val-
643       ues that are never changed, the same context can be  used  by  all  the
644       threads. However, if any thread needs to change any value in a context,
645       it must make its own thread-specific copy.
646
647   Match blocks
648
649       The matching functions need a block of memory for working space and for
650       storing  the  results  of  a  match.  This includes details of what was
651       matched, as well as additional  information  such  as  the  name  of  a
652       (*MARK) setting. Each thread must provide its own copy of this memory.
653
654
655PCRE2 CONTEXTS
656
657       Some  PCRE2  functions have a lot of parameters, many of which are used
658       only by specialist applications, for example,  those  that  use  custom
659       memory  management  or  non-standard character tables. To keep function
660       argument lists at a reasonable size, and at the same time to  keep  the
661       API  extensible,  "uncommon" parameters are passed to certain functions
662       in a context instead of directly. A context is just a block  of  memory
663       that  holds  the  parameter  values.   Applications that do not need to
664       adjust any of the context parameters  can  pass  NULL  when  a  context
665       pointer is required.
666
667       There  are  three different types of context: a general context that is
668       relevant for several PCRE2 operations, a compile-time  context,  and  a
669       match-time context.
670
671   The general context
672
673       At  present,  this  context  just  contains  pointers to (and data for)
674       external memory management  functions  that  are  called  from  several
675       places in the PCRE2 library. The context is named `general' rather than
676       specifically `memory' because in future other fields may be  added.  If
677       you  do not want to supply your own custom memory management functions,
678       you do not need to bother with a general context. A general context  is
679       created by:
680
681       pcre2_general_context *pcre2_general_context_create(
682         void *(*private_malloc)(PCRE2_SIZE, void *),
683         void (*private_free)(void *, void *), void *memory_data);
684
685       The  two  function pointers specify custom memory management functions,
686       whose prototypes are:
687
688         void *private_malloc(PCRE2_SIZE, void *);
689         void  private_free(void *, void *);
690
691       Whenever code in PCRE2 calls these functions, the final argument is the
692       value of memory_data. Either of the first two arguments of the creation
693       function may be NULL, in which case the system memory management  func-
694       tions  malloc()  and free() are used. (This is not currently useful, as
695       there are no other fields in a general context,  but  in  future  there
696       might  be.)   The  private_malloc()  function  is used (if supplied) to
697       obtain memory for storing the context, and all three values  are  saved
698       as part of the context.
699
700       Whenever  PCRE2  creates a data block of any kind, the block contains a
701       pointer to the free() function that matches the malloc() function  that
702       was  used.  When  the  time  comes  to free the block, this function is
703       called.
704
705       A general context can be copied by calling:
706
707       pcre2_general_context *pcre2_general_context_copy(
708         pcre2_general_context *gcontext);
709
710       The memory used for a general context should be freed by calling:
711
712       void pcre2_general_context_free(pcre2_general_context *gcontext);
713
714
715   The compile context
716
717       A compile context is required if you want to change the default  values
718       of any of the following compile-time parameters:
719
720         What \R matches (Unicode newlines or CR, LF, CRLF only)
721         PCRE2's character tables
722         The newline character sequence
723         The compile time nested parentheses limit
724         The maximum length of the pattern string
725         An external function for stack checking
726
727       A  compile context is also required if you are using custom memory man-
728       agement.  If none of these apply, just pass NULL as the  context  argu-
729       ment of pcre2_compile().
730
731       A  compile context is created, copied, and freed by the following func-
732       tions:
733
734       pcre2_compile_context *pcre2_compile_context_create(
735         pcre2_general_context *gcontext);
736
737       pcre2_compile_context *pcre2_compile_context_copy(
738         pcre2_compile_context *ccontext);
739
740       void pcre2_compile_context_free(pcre2_compile_context *ccontext);
741
742       A compile context is created with default values  for  its  parameters.
743       These can be changed by calling the following functions, which return 0
744       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
745
746       int pcre2_set_bsr(pcre2_compile_context *ccontext,
747         uint32_t value);
748
749       The value must be PCRE2_BSR_ANYCRLF, to specify that  \R  matches  only
750       CR,  LF,  or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
751       Unicode line ending sequence. The value is used by the JIT compiler and
752       by   the   two   interpreted   matching  functions,  pcre2_match()  and
753       pcre2_dfa_match().
754
755       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
756         const unsigned char *tables);
757
758       The value must be the result of a  call  to  pcre2_maketables(),  whose
759       only argument is a general context. This function builds a set of char-
760       acter tables in the current locale.
761
762       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
763         PCRE2_SIZE value);
764
765       This sets a maximum length, in code units, for the pattern string  that
766       is  to  be  compiled.  If the pattern is longer, an error is generated.
767       This facility is provided so that  applications  that  accept  patterns
768       from  external sources can limit their size. The default is the largest
769       number that a PCRE2_SIZE variable can hold, which is effectively unlim-
770       ited.
771
772       int pcre2_set_newline(pcre2_compile_context *ccontext,
773         uint32_t value);
774
775       This specifies which characters or character sequences are to be recog-
776       nized as newlines. The value must be one of PCRE2_NEWLINE_CR  (carriage
777       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
778       two-character sequence CR followed by LF),  PCRE2_NEWLINE_ANYCRLF  (any
779       of the above), or PCRE2_NEWLINE_ANY (any Unicode newline sequence).
780
781       When a pattern is compiled with the PCRE2_EXTENDED option, the value of
782       this parameter affects the recognition of white space and  the  end  of
783       internal comments starting with #. The value is saved with the compiled
784       pattern for subsequent use by the JIT compiler and by  the  two  inter-
785       preted matching functions, pcre2_match() and pcre2_dfa_match().
786
787       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
788         uint32_t value);
789
       This parameter adjusts the limit, set when PCRE2 is built (default 250),
791       on the depth of parenthesis nesting in  a  pattern.  This  limit  stops
792       rogue patterns using up too much system stack when being compiled.
793
794       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
795         int (*guard_function)(uint32_t, void *), void *user_data);
796
797       There  is at least one application that runs PCRE2 in threads with very
798       limited system stack, where running out of stack is to  be  avoided  at
799       all  costs. The parenthesis limit above cannot take account of how much
800       stack is actually available. For a finer  control,  you  can  supply  a
801       function  that  is  called whenever pcre2_compile() starts to compile a
802       parenthesized part of a pattern. This function  can  check  the  actual
803       stack size (or anything else that it wants to, of course).
804
       The first argument to the guard function gives the current depth of
       nesting, and the second is user data that is set up by the last
       argument of pcre2_set_compile_recursion_guard(). The guard function
       should return zero if all is well, or non-zero to force an error.
809
810   The match context
811
812       A match context is required if you want to change the default values of
813       any of the following match-time parameters:
814
815         A callout function
816         The offset limit for matching an unanchored pattern
817         The limit for calling match() (see below)
818         The limit for calling match() recursively
819
820       A match context is also required if you are using custom memory manage-
821       ment.  If none of these apply, just pass NULL as the  context  argument
822       of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
823
824       A  match  context  is created, copied, and freed by the following func-
825       tions:
826
827       pcre2_match_context *pcre2_match_context_create(
828         pcre2_general_context *gcontext);
829
830       pcre2_match_context *pcre2_match_context_copy(
831         pcre2_match_context *mcontext);
832
833       void pcre2_match_context_free(pcre2_match_context *mcontext);
834
835       A match context is created with  default  values  for  its  parameters.
836       These can be changed by calling the following functions, which return 0
837       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
838
839       int pcre2_set_callout(pcre2_match_context *mcontext,
840         int (*callout_function)(pcre2_callout_block *, void *),
841         void *callout_data);
842
       This sets up a "callout" function, which PCRE2 will call at specified
       points during a matching operation. Details are given in the
       pcre2callout documentation.
846
847       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
848         PCRE2_SIZE value);
849
       The offset_limit parameter limits how far an unanchored search can
       advance in the subject string. The default value is PCRE2_UNSET. The
       pcre2_match() and pcre2_dfa_match() functions return
       PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
       given offset is not found. For example, if the pattern /abc/ is matched
       against "123abc" with an offset limit less than 3, the result is
       PCRE2_ERROR_NOMATCH. A match can never be found if the startoffset
       argument of pcre2_match() or pcre2_dfa_match() is greater than the
       offset limit.
859
       When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when
       calling pcre2_compile() so that when JIT is in use, different code can
       be compiled. If a match is started with a non-default offset limit when
       PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
864
865       The  offset limit facility can be used to track progress when searching
866       large subject strings.  See  also  the  PCRE2_FIRSTLINE  option,  which
867       requires a match to start within the first line of the subject. If this
868       is set with an offset limit, a match must occur in the first  line  and
869       also  within  the  offset limit.  In other words, whichever limit comes
870       first is used.
871
872       int pcre2_set_match_limit(pcre2_match_context *mcontext,
873         uint32_t value);
874
875       The match_limit parameter provides a means  of  preventing  PCRE2  from
876       using up too many resources when processing patterns that are not going
877       to match, but which have a very large number of possibilities in  their
878       search  trees. The classic example is a pattern that uses nested unlim-
879       ited repeats.
880
881       Internally, pcre2_match() uses a  function  called  match(),  which  it
882       calls  repeatedly (sometimes recursively). The limit set by match_limit
883       is imposed on the number of times this  function  is  called  during  a
884       match, which has the effect of limiting the amount of backtracking that
885       can take place. For patterns that are not anchored, the count  restarts
886       from  zero  for  each position in the subject string. This limit is not
887       relevant to pcre2_dfa_match(), which ignores it.
888
889       When pcre2_match() is called with a pattern that was successfully  pro-
890       cessed by pcre2_jit_compile(), the way in which matching is executed is
891       entirely different. However, there is still the possibility of  runaway
892       matching  that  goes  on  for  a very long time, and so the match_limit
893       value is also used in this case (but in a different way) to  limit  how
894       long the matching can continue.
895
       The default value for the limit can be set when PCRE2 is built; if it
       is not set at build time, the default is 10 million, which handles all
       but the most extreme cases. If the limit is exceeded, pcre2_match()
       returns PCRE2_ERROR_MATCHLIMIT. A value for the match limit may also be
       supplied by an item at the start of a pattern of the form
901
902         (*LIMIT_MATCH=ddd)
903
904       where  ddd  is  a  decimal  number.  However, such a setting is ignored
905       unless ddd is less than the limit set by the  caller  of  pcre2_match()
906       or, if no such limit is set, less than the default.
907
908       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
909         uint32_t value);
910
911       The recursion_limit parameter is similar to match_limit, but instead of
912       limiting the total number of times that match() is  called,  it  limits
913       the  depth  of  recursion. The recursion depth is a smaller number than
914       the total number of calls, because not all calls to match() are  recur-
915       sive.  This limit is of use only if it is set smaller than match_limit.
916
917       Limiting the recursion depth limits the amount of system stack that can
918       be used, or, when PCRE2 has been compiled to use  memory  on  the  heap
919       instead  of the stack, the amount of heap memory that can be used. This
920       limit is not relevant, and is ignored, when matching is done using  JIT
921       compiled code or by the pcre2_dfa_match() function.
922
       The default value for recursion_limit can be set when PCRE2 is built;
       if it is not set at build time, the default is the same value as the
       default for match_limit. If the limit is exceeded, pcre2_match()
       returns PCRE2_ERROR_RECURSIONLIMIT. A value for the recursion limit may
       also be supplied by an item at the start of a pattern of the form
928
929         (*LIMIT_RECURSION=ddd)
930
931       where  ddd  is  a  decimal  number.  However, such a setting is ignored
932       unless ddd is less than the limit set by the  caller  of  pcre2_match()
933       or, if no such limit is set, less than the default.
934
935       int pcre2_set_recursion_memory_management(
936         pcre2_match_context *mcontext,
937         void *(*private_malloc)(PCRE2_SIZE, void *),
938         void (*private_free)(void *, void *), void *memory_data);
939
940       This function sets up two additional custom memory management functions
941       for use by pcre2_match() when PCRE2 is compiled to  use  the  heap  for
942       remembering backtracking data, instead of recursive function calls that
943       use the system stack. There is a discussion about PCRE2's  stack  usage
944       in  the  pcre2stack documentation. See the pcre2build documentation for
945       details of how to build PCRE2.
946
947       Using the heap for recursion is a non-standard way of  building  PCRE2,
948       for  use  in  environments  that  have  limited  stacks. Because of the
949       greater use of memory management, pcre2_match() runs more slowly. Func-
950       tions  that  are  different  to the general custom memory functions are
951       provided so that special-purpose external code can  be  used  for  this
952       case,  because  the memory blocks are all the same size. The blocks are
953       retained by pcre2_match() until it is about to exit so that they can be
954       re-used  when  possible during the match. In the absence of these func-
955       tions, the normal custom memory management functions are used, if  sup-
956       plied, otherwise the system functions.
957
958
959CHECKING BUILD-TIME OPTIONS
960
961       int pcre2_config(uint32_t what, void *where);
962
963       The  function  pcre2_config()  makes  it possible for a PCRE2 client to
964       discover which optional features have  been  compiled  into  the  PCRE2
965       library.  The  pcre2build  documentation  has  more details about these
966       optional features.
967
968       The first argument for pcre2_config() specifies  which  information  is
969       required.  The  second  argument  is a pointer to memory into which the
970       information is placed. If NULL is  passed,  the  function  returns  the
971       amount  of  memory  that  is  needed for the requested information. For
972       calls that return  numerical  values,  the  value  is  in  bytes;  when
973       requesting  these  values,  where should point to appropriately aligned
974       memory. For calls that return strings, the required length is given  in
975       code units, not counting the terminating zero.
976
       When requesting information, the returned value from pcre2_config() is
       non-negative on success, or the negative error code
       PCRE2_ERROR_BADOPTION if the value in the first argument is not
       recognized. The following information is available:
981
982         PCRE2_CONFIG_BSR
983
984       The output is a uint32_t integer whose value indicates  what  character
985       sequences  the  \R  escape  sequence  matches  by  default.  A value of
986       PCRE2_BSR_UNICODE  means  that  \R  matches  any  Unicode  line  ending
987       sequence;  a  value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
988       LF, or CRLF. The default can be overridden when a pattern is compiled.
989
990         PCRE2_CONFIG_JIT
991
992       The output is a uint32_t integer that is set  to  one  if  support  for
993       just-in-time compiling is available; otherwise it is set to zero.
994
995         PCRE2_CONFIG_JITTARGET
996
997       The  where  argument  should point to a buffer that is at least 48 code
998       units long.  (The  exact  length  required  can  be  found  by  calling
999       pcre2_config()  with  where  set  to NULL.) The buffer is filled with a
1000       string that contains the name of the architecture  for  which  the  JIT
1001       compiler  is  configured,  for  example  "x86  32bit  (little  endian +
1002       unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION  is
1003       returned,  otherwise the number of code units used is returned. This is
1004       the length of the string, plus one unit for the terminating zero.
1005
1006         PCRE2_CONFIG_LINKSIZE
1007
1008       The output is a uint32_t integer that contains the number of bytes used
1009       for  internal  linkage  in  compiled regular expressions. When PCRE2 is
1010       configured, the value can be set to 2, 3, or 4, with the default  being
1011       2.  This is the value that is returned by pcre2_config(). However, when
1012       the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
1013       when  the  32-bit  library  is compiled, internal linkages always use 4
1014       bytes, so the configured value is not relevant.
1015
1016       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1017       for  all but the most massive patterns, since it allows the size of the
1018       compiled pattern to be up to 64K code units. Larger values allow larger
1019       regular  expressions  to be compiled by those two libraries, but at the
1020       expense of slower matching.
1021
1022         PCRE2_CONFIG_MATCHLIMIT
1023
1024       The output is a uint32_t integer that gives the default limit  for  the
1025       number  of  internal  matching function calls in a pcre2_match() execu-
1026       tion. Further details are given with pcre2_match() below.
1027
1028         PCRE2_CONFIG_NEWLINE
1029
1030       The output is a uint32_t integer  whose  value  specifies  the  default
1031       character  sequence that is recognized as meaning "newline". The values
1032       are:
1033
1034         PCRE2_NEWLINE_CR       Carriage return (CR)
1035         PCRE2_NEWLINE_LF       Linefeed (LF)
1036         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
1037         PCRE2_NEWLINE_ANY      Any Unicode line ending
1038         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
1039
1040       The default should normally correspond to  the  standard  sequence  for
1041       your operating system.
1042
1043         PCRE2_CONFIG_PARENSLIMIT
1044
1045       The  output is a uint32_t integer that gives the maximum depth of nest-
1046       ing of parentheses (of any kind) in a pattern. This limit is imposed to
1047       cap  the  amount of system stack used when a pattern is compiled. It is
1048       specified when PCRE2 is built; the default is 250. This limit does  not
1049       take  into  account  the  stack that may already be used by the calling
1050       application. For  finer  control  over  compilation  stack  usage,  see
1051       pcre2_set_compile_recursion_guard().
1052
1053         PCRE2_CONFIG_RECURSIONLIMIT
1054
1055       The  output  is a uint32_t integer that gives the default limit for the
1056       depth of recursion when calling the internal  matching  function  in  a
1057       pcre2_match()  execution.  Further details are given with pcre2_match()
1058       below.
1059
1060         PCRE2_CONFIG_STACKRECURSE
1061
1062       The output is a uint32_t integer that is set to one if internal  recur-
1063       sion  when  running  pcre2_match() is implemented by recursive function
1064       calls that use the system stack to remember their state.  This  is  the
1065       usual  way that PCRE2 is compiled. The output is zero if PCRE2 was com-
1066       piled to use blocks of data on the heap instead of  recursive  function
1067       calls.
1068
1069         PCRE2_CONFIG_UNICODE_VERSION
1070
1071       The  where  argument  should point to a buffer that is at least 24 code
1072       units long.  (The  exact  length  required  can  be  found  by  calling
1073       pcre2_config()  with  where  set  to  NULL.) If PCRE2 has been compiled
1074       without Unicode support, the buffer is filled with  the  text  "Unicode
1075       not  supported".  Otherwise,  the  Unicode version string (for example,
1076       "8.0.0") is inserted. The number of code units used is  returned.  This
1077       is the length of the string plus one unit for the terminating zero.
1078
1079         PCRE2_CONFIG_UNICODE
1080
1081       The  output is a uint32_t integer that is set to one if Unicode support
1082       is available; otherwise it is set to zero. Unicode support implies  UTF
1083       support.
1084
1085         PCRE2_CONFIG_VERSION
1086
1087       The  where  argument  should point to a buffer that is at least 12 code
1088       units long.  (The  exact  length  required  can  be  found  by  calling
1089       pcre2_config()  with  where set to NULL.) The buffer is filled with the
1090       PCRE2 version string, zero-terminated. The number of code units used is
1091       returned. This is the length of the string plus one unit for the termi-
1092       nating zero.
1093
1094
1095COMPILING A PATTERN
1096
1097       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
1098         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
1099         pcre2_compile_context *ccontext);
1100
1101       void pcre2_code_free(pcre2_code *code);
1102
1103       pcre2_code *pcre2_code_copy(const pcre2_code *code);
1104
1105       The pcre2_compile() function compiles a pattern into an internal  form.
1106       The  pattern  is  defined  by a pointer to a string of code units and a
1107       length. If the pattern is zero-terminated, the length can be  specified
1108       as  PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of
1109       memory that contains the compiled pattern and related data, or NULL  if
1110       an error occurred.
1111
1112       If  the  compile context argument ccontext is NULL, memory for the com-
1113       piled pattern  is  obtained  by  calling  malloc().  Otherwise,  it  is
1114       obtained  from  the  same memory function that was used for the compile
1115       context. The caller must free the memory by  calling  pcre2_code_free()
1116       when it is no longer needed.
1117
1118       The function pcre2_code_copy() makes a copy of the compiled code in new
1119       memory, using the same memory allocator as was used for  the  original.
1120       However,  if  the  code  has  been  processed  by the JIT compiler (see
1121       below), the JIT information cannot be copied (because it  is  position-
1122       dependent).  The new copy can initially be used only for non-JIT match-
1123       ing, though it can be passed to pcre2_jit_compile()  if  required.  The
1124       pcre2_code_copy()  function  provides a way for individual threads in a
1125       multithreaded application to acquire a private copy of shared  compiled
1126       code.
1127
1128       NOTE:  When  one  of  the matching functions is called, pointers to the
1129       compiled pattern and the subject string are set in the match data block
1130       so  that  they can be referenced by the substring extraction functions.
1131       After running a match, you must not free a compiled pattern (or a  sub-
1132       ject  string)  until  after all operations on the match data block have
1133       taken place.
1134
1135       The options argument for pcre2_compile() contains various bit  settings
1136       that  affect  the  compilation.  It  should  be  zero if no options are
1137       required. The available options are described below. Some of  them  (in
1138       particular,  those  that  are  compatible with Perl, but some others as
1139       well) can also be set and  unset  from  within  the  pattern  (see  the
1140       detailed description in the pcre2pattern documentation).
1141
1142       For  those options that can be different in different parts of the pat-
1143       tern, the contents of the options argument specifies their settings  at
1144       the  start  of  compilation.  The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK
1145       options can be set at the time of matching as well as at compile time.
1146
1147       Other, less frequently required compile-time parameters  (for  example,
1148       the newline setting) can be provided in a compile context (as described
1149       above).
1150
1151       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1152       diately.  Otherwise,  the  variables to which these point are set to an
1153       error code and an offset (number of code  units)  within  the  pattern,
1154       respectively,  when  pcre2_compile() returns NULL because a compilation
1155       error has occurred. The values are not defined when compilation is suc-
1156       cessful and pcre2_compile() returns a non-NULL value.
1157
1158       The  pcre2_get_error_message() function (see "Obtaining a textual error
1159       message" below) provides a textual message for each error code.  Compi-
1160       lation errors have positive error codes; UTF formatting error codes are
1161       negative. For an invalid UTF-8 or UTF-16 string, the offset is that  of
1162       the first code unit of the failing character.
1163
1164       Some  errors are not detected until the whole pattern has been scanned;
1165       in these cases, the offset passed back is the length  of  the  pattern.
1166       Note  that  the  offset is in code units, not characters, even in a UTF
1167       mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1168       acter.
1169
       This code fragment shows a typical straightforward call to
       pcre2_compile():
1172
         pcre2_code *re;
         PCRE2_SIZE erroffset;
         int errorcode;
         re = pcre2_compile(
           (PCRE2_SPTR)"^A.*Z",    /* the pattern */
           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
           0,                      /* default options */
           &errorcode,             /* for error code */
           &erroffset,             /* for error offset */
           NULL);                  /* no compile context */
1183
1184       The following names for option bits are defined in the  pcre2.h  header
1185       file:
1186
1187         PCRE2_ANCHORED
1188
1189       If this bit is set, the pattern is forced to be "anchored", that is, it
1190       is constrained to match only at the first matching point in the  string
1191       that  is being searched (the "subject string"). This effect can also be
1192       achieved by appropriate constructs in the pattern itself, which is  the
1193       only way to do it in Perl.
1194
1195         PCRE2_ALLOW_EMPTY_CLASS
1196
1197       By  default, for compatibility with Perl, a closing square bracket that
1198       immediately follows an opening one is treated as a data  character  for
1199       the  class.  When  PCRE2_ALLOW_EMPTY_CLASS  is  set,  it terminates the
1200       class, which therefore contains no characters and so can never match.
1201
1202         PCRE2_ALT_BSUX
1203
       This option requests alternative handling of three escape sequences,
       which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
       When it is set:
1207
1208       (1) \U matches an upper case "U" character; by default \U causes a com-
1209       pile time error (Perl uses \U to upper case subsequent characters).
1210
1211       (2) \u matches a lower case "u" character unless it is followed by four
1212       hexadecimal digits, in which case the hexadecimal  number  defines  the
1213       code  point  to match. By default, \u causes a compile time error (Perl
1214       uses it to upper case the following character).
1215
1216       (3) \x matches a lower case "x" character unless it is followed by  two
1217       hexadecimal  digits,  in  which case the hexadecimal number defines the
1218       code point to match. By default, as in Perl, a  hexadecimal  number  is
1219       always expected after \x, but it may have zero, one, or two digits (so,
1220       for example, \xz matches a binary zero character followed by z).
1221
1222         PCRE2_ALT_CIRCUMFLEX
1223
1224       In  multiline  mode  (when  PCRE2_MULTILINE  is  set),  the  circumflex
1225       metacharacter  matches at the start of the subject (unless PCRE2_NOTBOL
1226       is set), and also after any internal  newline.  However,  it  does  not
1227       match after a newline at the end of the subject, for compatibility with
1228       Perl. If you want a multiline circumflex also to match after  a  termi-
1229       nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
1230
1231         PCRE2_ALT_VERBNAMES
1232
1233       By  default, for compatibility with Perl, the name in any verb sequence
1234       such as (*MARK:NAME) is  any  sequence  of  characters  that  does  not
1235       include  a  closing  parenthesis. The name is not processed in any way,
1236       and it is not possible to include a closing parenthesis  in  the  name.
1237       However,  if  the  PCRE2_ALT_VERBNAMES  option is set, normal backslash
1238       processing is applied to verb  names  and  only  an  unescaped  closing
1239       parenthesis  terminates the name. A closing parenthesis can be included
1240       in a name either as \) or between \Q  and  \E.  If  the  PCRE2_EXTENDED
1241       option is set, unescaped whitespace in verb names is skipped and #-com-
1242       ments are recognized, exactly as in the rest of the pattern.
1243
1244         PCRE2_AUTO_CALLOUT
1245
1246       If this bit  is  set,  pcre2_compile()  automatically  inserts  callout
1247       items, all with number 255, before each pattern item. For discussion of
1248       the callout facility, see the pcre2callout documentation.
1249
1250         PCRE2_CASELESS
1251
1252       If this bit is set, letters in the pattern match both upper  and  lower
1253       case  letters in the subject. It is equivalent to Perl's /i option, and
1254       it can be changed within a pattern by a (?i) option setting.
1255
1256         PCRE2_DOLLAR_ENDONLY
1257
1258       If this bit is set, a dollar metacharacter in the pattern matches  only
1259       at  the  end  of the subject string. Without this option, a dollar also
1260       matches immediately before a newline at the end of the string (but  not
1261       before  any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
1262       if PCRE2_MULTILINE is set. There is no equivalent  to  this  option  in
1263       Perl, and no way to set it within a pattern.
1264
1265         PCRE2_DOTALL
1266
1267       If  this  bit  is  set,  a dot metacharacter in the pattern matches any
1268       character, including one that indicates a  newline.  However,  it  only
1269       ever matches one character, even if newlines are coded as CRLF. Without
1270       this option, a dot does not match when the current position in the sub-
1271       ject  is  at  a newline. This option is equivalent to Perl's /s option,
1272       and it can be changed within a pattern by a (?s) option setting. A neg-
1273       ative class such as [^a] always matches newline characters, independent
1274       of the setting of this option.
1275
1276         PCRE2_DUPNAMES
1277
1278       If this bit is set, names used to identify capturing  subpatterns  need
1279       not be unique. This can be helpful for certain types of pattern when it
1280       is known that only one instance of the named  subpattern  can  ever  be
1281       matched.  There  are  more details of named subpatterns below; see also
1282       the pcre2pattern documentation.
1283
1284         PCRE2_EXTENDED
1285
1286       If this bit is set, most white space  characters  in  the  pattern  are
1287       totally  ignored  except when escaped or inside a character class. How-
1288       ever, white space is not allowed within  sequences  such  as  (?>  that
1289       introduce various parenthesized subpatterns, nor within numerical quan-
1290       tifiers such as {1,3}.  Ignorable white space is permitted  between  an
1291       item  and a following quantifier and between a quantifier and a follow-
1292       ing + that indicates possessiveness.
1293
1294       PCRE2_EXTENDED also causes characters between an unescaped # outside  a
1295       character  class  and the next newline, inclusive, to be ignored, which
1296       makes it possible to include comments inside complicated patterns. Note
1297       that  the  end of this type of comment is a literal newline sequence in
1298       the pattern; escape sequences that happen to represent a newline do not
1299       count.  PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
1300       changed within a pattern by a (?x) option setting.
1301
1302       Which characters are interpreted as newlines can be specified by a set-
1303       ting  in  the compile context that is passed to pcre2_compile() or by a
1304       special sequence at the start of the pattern, as described in the  sec-
1305       tion  entitled "Newline conventions" in the pcre2pattern documentation.
1306       A default is defined when PCRE2 is built.
1307
1308         PCRE2_FIRSTLINE
1309
1310       If this option is set, an  unanchored  pattern  is  required  to  match
1311       before  or  at  the  first  newline  in  the subject string, though the
1312       matched text may continue over the  newline.  See  also  PCRE2_USE_OFF-
1313       SET_LIMIT,   which  provides  a  more  general  limiting  facility.  If
1314       PCRE2_FIRSTLINE is set with an offset limit, a match must occur in  the
1315       first  line and also within the offset limit. In other words, whichever
1316       limit comes first is used.
1317
1318         PCRE2_MATCH_UNSET_BACKREF
1319
1320       If this option is set, a back reference to an  unset  subpattern  group
1321       matches  an  empty  string (by default this causes the current matching
1322       alternative to fail).  A pattern such as  (\1)(a)  succeeds  when  this
1323       option  is set (assuming it can find an "a" in the subject), whereas it
1324       fails by default, for Perl compatibility.  Setting  this  option  makes
       PCRE2 behave more like ECMAScript (aka JavaScript).
1326
1327         PCRE2_MULTILINE
1328
1329       By  default,  for  the purposes of matching "start of line" and "end of
1330       line", PCRE2 treats the subject string as consisting of a  single  line
1331       of  characters,  even  if  it actually contains newlines. The "start of
1332       line" metacharacter (^) matches only at the start of  the  string,  and
1333       the  "end  of  line"  metacharacter  ($) matches only at the end of the
1334       string,  or  before  a  terminating  newline  (except  when  PCRE2_DOL-
1335       LAR_ENDONLY  is  set).  Note, however, that unless PCRE2_DOTALL is set,
1336       the "any character" metacharacter (.) does not match at a newline. This
1337       behaviour (for ^, $, and dot) is the same as Perl.
1338
       When PCRE2_MULTILINE is set, the "start of line" and "end of line"
1340       constructs match immediately following or immediately  before  internal
1341       newlines  in  the  subject string, respectively, as well as at the very
1342       start and end. This is equivalent to Perl's /m option, and  it  can  be
1343       changed within a pattern by a (?m) option setting. Note that the "start
1344       of line" metacharacter does not match after a newline at the end of the
1345       subject,  for compatibility with Perl.  However, you can change this by
1346       setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in  a
1347       subject  string,  or  no  occurrences  of  ^ or $ in a pattern, setting
1348       PCRE2_MULTILINE has no effect.
1349
1350         PCRE2_NEVER_BACKSLASH_C
1351
1352       This option locks out the use of \C in the pattern that is  being  com-
1353       piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
1354       UTF-16 modes, because it may leave the current matching  point  in  the
1355       middle  of  a  multi-code-unit  character. This option may be useful in
1356       applications that process patterns from  external  sources.  Note  that
1357       there is also a build-time option that permanently locks out the use of
1358       \C.
1359
1360         PCRE2_NEVER_UCP
1361
1362       This option locks out the use of Unicode properties  for  handling  \B,
1363       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
1364       described for the PCRE2_UCP option below. In  particular,  it  prevents
1365       the  creator of the pattern from enabling this facility by starting the
1366       pattern with (*UCP). This option may be  useful  in  applications  that
       process patterns from external sources. The option combination
       PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
1369
1370         PCRE2_NEVER_UTF
1371
1372       This option locks out interpretation of the pattern as  UTF-8,  UTF-16,
1373       or UTF-32, depending on which library is in use. In particular, it pre-
1374       vents the creator of the pattern from switching to  UTF  interpretation
1375       by  starting  the  pattern  with  (*UTF).  This option may be useful in
1376       applications that process patterns from external sources. The  combina-
1377       tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
1378
1379         PCRE2_NO_AUTO_CAPTURE
1380
1381       If this option is set, it disables the use of numbered capturing paren-
1382       theses in the pattern. Any opening parenthesis that is not followed  by
1383       ?  behaves as if it were followed by ?: but named parentheses can still
1384       be used for capturing (and they acquire  numbers  in  the  usual  way).
1385       There  is  no  equivalent  of  this  option in Perl. Note that, if this
1386       option is set, references  to  capturing  groups  (back  references  or
1387       recursion/subroutine  calls) may only refer to named groups, though the
1388       reference can be by name or by number.
1389
1390         PCRE2_NO_AUTO_POSSESS
1391
1392       If this option is set, it disables "auto-possessification", which is an
1393       optimization  that,  for example, turns a+b into a++b in order to avoid
1394       backtracks into a+ that can never be successful. However,  if  callouts
1395       are  in  use,  auto-possessification means that some callouts are never
1396       taken. You can set this option if you want the matching functions to do
1397       a  full  unoptimized  search and run all the callouts, but it is mainly
1398       provided for testing purposes.
1399
1400         PCRE2_NO_DOTSTAR_ANCHOR
1401
1402       If this option is set, it disables an optimization that is applied when
1403       .*  is  the  first significant item in a top-level branch of a pattern,
1404       and all the other branches also start with .* or with \A or  \G  or  ^.
1405       The  optimization  is  automatically disabled for .* if it is inside an
1406       atomic group or a capturing group that is the subject of a back  refer-
1407       ence,  or  if  the pattern contains (*PRUNE) or (*SKIP). When the opti-
1408       mization is not disabled, such a pattern is automatically  anchored  if
1409       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
1410       for any ^ items. Otherwise, the fact that any match must  start  either
1411       at  the start of the subject or following a newline is remembered. Like
1412       other optimizations, this can cause callouts to be skipped.
1413
1414         PCRE2_NO_START_OPTIMIZE
1415
1416       This is an option whose main effect is at matching time.  It  does  not
1417       change what pcre2_compile() generates, but it does affect the output of
1418       the JIT compiler.
1419
1420       There are a number of optimizations that may occur at the  start  of  a
1421       match,  in  order  to speed up the process. For example, if it is known
1422       that an unanchored match must start  with  a  specific  character,  the
1423       matching  code searches the subject for that character, and fails imme-
1424       diately if it cannot find it, without actually running the main  match-
1425       ing  function.  This means that a special item such as (*COMMIT) at the
1426       start of a pattern is not considered until after  a  suitable  starting
1427       point  for  the  match  has  been found. Also, when callouts or (*MARK)
1428       items are in use, these "start-up" optimizations can cause them  to  be
1429       skipped  if  the pattern is never actually used. The start-up optimiza-
1430       tions are in effect a pre-scan of the subject that takes  place  before
1431       the pattern is run.
1432
1433       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1434       possibly causing performance to suffer,  but  ensuring  that  in  cases
1435       where  the  result is "no match", the callouts do occur, and that items
1436       such as (*COMMIT) and (*MARK) are considered at every possible starting
1437       position in the subject string.
1438
1439       Setting  PCRE2_NO_START_OPTIMIZE  may  change the outcome of a matching
1440       operation.  Consider the pattern
1441
1442         (*COMMIT)ABC
1443
1444       When this is compiled, PCRE2 records the fact that a match  must  start
1445       with  the  character  "A".  Suppose the subject string is "DEFABC". The
1446       start-up optimization scans along the subject, finds "A" and  runs  the
1447       first  match attempt from there. The (*COMMIT) item means that the pat-
1448       tern must match the current starting position, which in this  case,  it
1449       does.  However,  if  the same match is run with PCRE2_NO_START_OPTIMIZE
1450       set, the initial scan along the subject string  does  not  happen.  The
1451       first  match  attempt  is  run  starting  from "D" and when this fails,
1452       (*COMMIT) prevents any further matches  being  tried,  so  the  overall
1453       result is "no match". There are also other start-up optimizations.  For
1454       example, a minimum length for the subject may be recorded. Consider the
1455       pattern
1456
1457         (*MARK:A)(X|Y)
1458
1459       The  minimum  length  for  a  match is one character. If the subject is
1460       "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
1461       to match an empty string at the end of the subject does not take place,
1462       because PCRE2 knows that the subject is  now  too  short,  and  so  the
1463       (*MARK)  is  never encountered. In this case, the optimization does not
1464       affect the overall match result, which is still "no match", but it does
1465       affect the auxiliary information that is returned.
1466
1467         PCRE2_NO_UTF_CHECK
1468
1469       When  PCRE2_UTF  is set, the validity of the pattern as a UTF string is
1470       automatically checked. There are  discussions  about  the  validity  of
1471       UTF-8  strings,  UTF-16 strings, and UTF-32 strings in the pcre2unicode
1472       document.  If an invalid UTF sequence is found, pcre2_compile() returns
1473       a negative error code.
1474
1475       If you know that your pattern is valid, and you want to skip this check
1476       for performance reasons, you can  set  the  PCRE2_NO_UTF_CHECK  option.
1477       When  it  is set, the effect of passing an invalid UTF string as a pat-
1478       tern is undefined. It may cause your program to  crash  or  loop.  Note
1479       that   this   option   can   also   be   passed  to  pcre2_match()  and
       pcre2_dfa_match(), to suppress validity checking of the subject string.
1481
1482         PCRE2_UCP
1483
1484       This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
1485       \w,  and  some  of  the POSIX character classes. By default, only ASCII
1486       characters are recognized, but if PCRE2_UCP is set, Unicode  properties
1487       are  used instead to classify characters. More details are given in the
1488       section on generic character types in the pcre2pattern page. If you set
1489       PCRE2_UCP,  matching one of the items it affects takes much longer. The
1490       option is available only if PCRE2 has been compiled with  Unicode  sup-
1491       port.
1492
1493         PCRE2_UNGREEDY
1494
1495       This  option  inverts  the "greediness" of the quantifiers so that they
1496       are not greedy by default, but become greedy if followed by "?". It  is
1497       not  compatible  with Perl. It can also be set by a (?U) option setting
1498       within the pattern.
1499
1500         PCRE2_USE_OFFSET_LIMIT
1501
1502       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
1503       is  going  to be used to set a non-default offset limit in a match con-
1504       text for matches that use this pattern. An error  is  generated  if  an
1505       offset  limit  is  set  without  this option. For more details, see the
1506       description of pcre2_set_offset_limit() in the section  that  describes
1507       match contexts. See also the PCRE2_FIRSTLINE option above.
1508
1509         PCRE2_UTF
1510
1511       This  option  causes  PCRE2  to regard both the pattern and the subject
1512       strings that are subsequently processed as strings  of  UTF  characters
1513       instead  of  single-code-unit  strings.  It  is available when PCRE2 is
1514       built to include Unicode support (which is  the  default).  If  Unicode
1515       support  is  not  available,  the use of this option provokes an error.
1516       Details of how this option changes the behaviour of PCRE2 are given  in
1517       the pcre2unicode page.
1518
1519
1520COMPILATION ERROR CODES
1521
1522       There  are over 80 positive error codes that pcre2_compile() may return
1523       (via errorcode) if it finds an error in the  pattern.  There  are  also
1524       some  negative error codes that are used for invalid UTF strings. These
1525       are the same as given by pcre2_match() and pcre2_dfa_match(),  and  are
1526       described in the pcre2unicode page. The pcre2_get_error_message() func-
1527       tion (see "Obtaining a textual error message" below) can be  called  to
1528       obtain a textual error message from any error code.
1529
1530
1531JUST-IN-TIME (JIT) COMPILATION
1532
1533       int pcre2_jit_compile(pcre2_code *code, uint32_t options);
1534
1535       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
1536         PCRE2_SIZE length, PCRE2_SIZE startoffset,
1537         uint32_t options, pcre2_match_data *match_data,
1538         pcre2_match_context *mcontext);
1539
1540       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
1541
1542       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
1543         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
1544
1545       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
1546         pcre2_jit_callback callback_function, void *callback_data);
1547
1548       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
1549
1550       These  functions  provide  support  for  JIT compilation, which, if the
1551       just-in-time compiler is available, further processes a  compiled  pat-
1552       tern into machine code that executes much faster than the pcre2_match()
1553       interpretive matching function. Full details are given in the  pcre2jit
1554       documentation.
1555
1556       JIT  compilation  is  a heavyweight optimization. It can take some time
1557       for patterns to be analyzed, and for one-off matches  and  simple  pat-
1558       terns  the benefit of faster execution might be offset by a much slower
       compilation time.  Most, but not all, patterns can be optimized by the
       JIT compiler.
1561
1562
1563LOCALE SUPPORT
1564
1565       PCRE2  handles caseless matching, and determines whether characters are
1566       letters, digits, or whatever, by reference to a set of tables,  indexed
1567       by  character  code  point.  This applies only to characters whose code
1568       points are less than 256. By default, higher-valued code  points  never
1569       match  escapes  such  as \w or \d.  However, if PCRE2 is built with UTF
1570       support, all characters can be tested with  \p  and  \P,  or,  alterna-
1571       tively,  the  PCRE2_UCP  option  can be set when a pattern is compiled;
1572       this causes \w and friends to use Unicode property support  instead  of
1573       the built-in tables.
1574
1575       The  use  of  locales  with Unicode is discouraged. If you are handling
1576       characters with code points greater than 128,  you  should  either  use
1577       Unicode support, or use locales, but not try to mix the two.
1578
1579       PCRE2  contains  an  internal  set of character tables that are used by
1580       default.  These are sufficient for  many  applications.  Normally,  the
1581       internal tables recognize only ASCII characters. However, when PCRE2 is
1582       built, it is possible to cause the internal tables to be rebuilt in the
1583       default "C" locale of the local system, which may cause them to be dif-
1584       ferent.
1585
1586       The internal tables can be overridden by tables supplied by the  appli-
1587       cation  that  calls  PCRE2.  These may be created in a different locale
1588       from the default.  As more and more applications change to  using  Uni-
1589       code, the need for this locale support is expected to die away.
1590
1591       External  tables  are built by calling the pcre2_maketables() function,
1592       in the relevant locale. The result can be passed to pcre2_compile()  as
1593       often   as  necessary,  by  creating  a  compile  context  and  calling
1594       pcre2_set_character_tables() to set the  tables  pointer  therein.  For
1595       example,  to  build  and use tables that are appropriate for the French
1596       locale (where accented characters with  values  greater  than  128  are
1597       treated as letters), the following code could be used:
1598
         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre2_maketables(NULL);
         ccontext = pcre2_compile_context_create(NULL);
         pcre2_set_character_tables(ccontext, tables);
         re = pcre2_compile(..., ccontext);
1604
1605       The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1606       if you are using Windows, the name for the French locale  is  "french".
1607       It  is the caller's responsibility to ensure that the memory containing
1608       the tables remains available for as long as it is needed.
1609
1610       The pointer that is passed (via the compile context) to pcre2_compile()
1611       is  saved  with  the  compiled pattern, and the same tables are used by
       pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern,
       compilation and matching both happen in the same locale, but different
       patterns can be processed in different locales.
1615
1616
1617INFORMATION ABOUT A COMPILED PATTERN
1618
       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);
1620
1621       The pcre2_pattern_info() function returns general information  about  a
1622       compiled pattern. For information about callouts, see the next section.
1623       The first argument for pcre2_pattern_info() is a pointer  to  the  com-
1624       piled pattern. The second argument specifies which piece of information
1625       is required, and the third argument is  a  pointer  to  a  variable  to
1626       receive  the data. If the third argument is NULL, the first argument is
1627       ignored, and the function returns the size in  bytes  of  the  variable
       that is required for the information requested. Otherwise, the yield of
1629       the function is zero for success, or one of the following negative num-
1630       bers:
1631
1632         PCRE2_ERROR_NULL           the argument code was NULL
1633         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
1634         PCRE2_ERROR_BADOPTION      the value of what was invalid
1635         PCRE2_ERROR_UNSET          the requested field is not set
1636
1637       The  "magic  number" is placed at the start of each compiled pattern as
       a simple check against passing an arbitrary memory pointer. Here is a
1639       typical  call of pcre2_pattern_info(), to obtain the length of the com-
1640       piled pattern:
1641
         int rc;
         size_t length;
         rc = pcre2_pattern_info(
           re,               /* result of pcre2_compile() */
           PCRE2_INFO_SIZE,  /* what is required */
           &length);         /* where to put the data */
1648
1649       The possible values for the second argument are defined in pcre2.h, and
1650       are as follows:
1651
1652         PCRE2_INFO_ALLOPTIONS
1653         PCRE2_INFO_ARGOPTIONS
1654
1655       Return a copy of the pattern's options. The third argument should point
1656       to a  uint32_t  variable.  PCRE2_INFO_ARGOPTIONS  returns  exactly  the
1657       options  that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
1658       TIONS returns the compile options as modified by any  top-level  (*XXX)
1659       option settings such as (*UTF) at the start of the pattern itself.
1660
1661       For   example,   if  the  pattern  /(*UTF)abc/  is  compiled  with  the
1662       PCRE2_EXTENDED  option,  the  result   for   PCRE2_INFO_ALLOPTIONS   is
1663       PCRE2_EXTENDED  and  PCRE2_UTF.   Option settings such as (?i) that can
1664       change within a pattern do not affect the result  of  PCRE2_INFO_ALLOP-
1665       TIONS, even if they appear right at the start of the pattern. (This was
1666       different in some earlier releases.)
1667
1668       A pattern compiled without PCRE2_ANCHORED is automatically anchored  by
1669       PCRE2 if the first significant item in every top-level branch is one of
1670       the following:
1671
1672         ^     unless PCRE2_MULTILINE is set
1673         \A    always
1674         \G    always
1675         .*    sometimes - see below
1676
1677       When .* is the first significant item, anchoring is possible only  when
1678       all the following are true:
1679
1680         .* is not in an atomic group
1681         .* is not in a capturing group that is the subject
1682              of a back reference
1683         PCRE2_DOTALL is in force for .*
1684         Neither (*PRUNE) nor (*SKIP) appears in the pattern.
1685         PCRE2_NO_DOTSTAR_ANCHOR is not set.
1686
1687       For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
1688       the options returned for PCRE2_INFO_ALLOPTIONS.
1689
1690         PCRE2_INFO_BACKREFMAX
1691
1692       Return the number of the highest back reference  in  the  pattern.  The
       third argument should point to a uint32_t variable. Named subpatterns
1694       acquire numbers as well as names, and these count towards  the  highest
1695       back  reference.   Back  references such as \4 or \g{12} match the cap-
1696       tured characters of the given group, but in addition, the check that  a
1697       capturing group is set in a conditional subpattern such as (?(3)a|b) is
1698       also a back reference. Zero is returned if there  are  no  back  refer-
1699       ences.
1700
1701         PCRE2_INFO_BSR
1702
1703       The output is a uint32_t whose value indicates what character sequences
1704       the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that
1705       \R  matches any Unicode line ending sequence; a value of PCRE2_BSR_ANY-
1706       CRLF means that \R matches only CR, LF, or CRLF.
1707
1708         PCRE2_INFO_CAPTURECOUNT
1709
1710       Return the highest capturing subpattern number in the pattern. In  pat-
1711       terns where (?| is not used, this is also the total number of capturing
       subpatterns. The third argument should point to a uint32_t variable.
1713
1714         PCRE2_INFO_FIRSTBITMAP
1715
1716       In the absence of a single first code unit for a non-anchored  pattern,
1717       pcre2_compile()  may construct a 256-bit table that defines a fixed set
1718       of values for the first code unit in any match. For example, a  pattern
1719       that  starts  with  [abc]  results in a table with three bits set. When
1720       code unit values greater than 255 are supported, the flag bit  for  255
1721       means  "any  code unit of value 255 or above". If such a table was con-
1722       structed, a pointer to it is returned. Otherwise NULL is returned.  The
       third argument should point to a const uint8_t * variable.
1724
1725         PCRE2_INFO_FIRSTCODETYPE
1726
1727       Return information about the first code unit of any matched string, for
       a non-anchored pattern. The third argument should point to a uint32_t
1729       variable.  If there is a fixed first value, for example, the letter "c"
1730       from a pattern such as (cat|cow|coyote), 1 is returned, and the charac-
1731       ter  value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is
1732       no fixed first value, but it is known that a match can  occur  only  at
1733       the  start  of  the subject or following a newline in the subject, 2 is
1734       returned. Otherwise, and for anchored patterns, 0 is returned.
1735
1736         PCRE2_INFO_FIRSTCODEUNIT
1737
1738       Return the value of the first code unit of any matched  string  in  the
1739       situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
       The third argument should point to a uint32_t variable. In the 8-bit
1741       library,  the  value is always less than 256. In the 16-bit library the
1742       value can be up to 0xffff. In the 32-bit library  in  UTF-32  mode  the
1743       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
1744       mode.
1745
1746         PCRE2_INFO_HASBACKSLASHC
1747
1748       Return 1 if the pattern contains any instances of \C, otherwise 0.  The
       third argument should point to a uint32_t variable.
1750
1751         PCRE2_INFO_HASCRORLF
1752
1753       Return  1  if  the  pattern  contains any explicit matches for CR or LF
       characters, otherwise 0. The third argument should point to a uint32_t
1755       variable.  An explicit match is either a literal CR or LF character, or
1756       \r or \n.
1757
1758         PCRE2_INFO_JCHANGED
1759
1760       Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
       otherwise 0. The third argument should point to a uint32_t variable.
1762       (?J) and (?-J) set and unset the local PCRE2_DUPNAMES  option,  respec-
1763       tively.
1764
1765         PCRE2_INFO_JITSIZE
1766
1767       If  the  compiled  pattern was successfully processed by pcre2_jit_com-
1768       pile(), return the size of the  JIT  compiled  code,  otherwise  return
1769       zero. The third argument should point to a size_t variable.
1770
1771         PCRE2_INFO_LASTCODETYPE
1772
       Return 1 if there is a rightmost literal code unit that must exist in
       any matched string, other than at its start. The third argument should
       point to a uint32_t variable. If there is no such value, 0 is
1776       returned. When 1 is  returned,  the  code  unit  value  itself  can  be
1777       retrieved  using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last
1778       literal value is recorded only if  it  follows  something  of  variable
1779       length.  For example, for the pattern /^a\d+z\d+/ the returned value is
1780       1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but  for  /^a\dz\d/
1781       the returned value is 0.
1782
1783         PCRE2_INFO_LASTCODEUNIT
1784
       Return the value of the rightmost literal code unit that must exist in
       any matched string, other than at its start, if such a value has been
       recorded. The third argument should point to a uint32_t variable. If
1788       there is no such value, 0 is returned.
1789
1790         PCRE2_INFO_MATCHEMPTY
1791
1792       Return 1 if the pattern might match an empty string, otherwise  0.  The
       third argument should point to a uint32_t variable. When a pattern
1794       contains recursive subroutine calls it is not always possible to deter-
1795       mine  whether  or  not it can match an empty string. PCRE2 takes a cau-
1796       tious approach and returns 1 in such cases.
1797
1798         PCRE2_INFO_MATCHLIMIT
1799
1800       If the pattern set a match limit by  including  an  item  of  the  form
1801       (*LIMIT_MATCH=nnnn)  at  the  start,  the  value is returned. The third
1802       argument should point to an unsigned 32-bit integer. If no  such  value
1803       has  been  set,  the  call  to  pcre2_pattern_info()  returns the error
1804       PCRE2_ERROR_UNSET.
1805
1806         PCRE2_INFO_MAXLOOKBEHIND
1807
1808       Return the number of characters (not code units) in the longest lookbe-
1809       hind  assertion  in  the pattern. The third argument should point to an
1810       unsigned 32-bit integer. This information is useful when  doing  multi-
1811       segment  matching  using the partial matching facilities. Note that the
1812       simple assertions \b and \B require a one-character lookbehind. \A also
1813       registers  a  one-character  lookbehind,  though  it  does not actually
1814       inspect the previous character. This is to ensure  that  at  least  one
1815       character  from  the old segment is retained when a new segment is pro-
1816       cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
1817       match incorrectly at the start of a new segment.
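
       As a rough illustration of how this value might be used in multi-
       segment matching, the sketch below works out how much of an old segment
       to keep. It assumes the 8-bit library with a UTF-8 subject.
       retain_offset() is a hypothetical helper, not a PCRE2 function, and its
       max_lookbehind argument is assumed to be the value obtained with
       PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for multi-segment matching with UTF-8 subjects:
   return the offset in seg from which bytes should be kept so that the
   last max_lookbehind characters remain available to the next segment. */
static size_t retain_offset(const unsigned char *seg, size_t len,
                            uint32_t max_lookbehind)
{
  size_t pos = len;
  assert(seg != NULL || len == 0);
  while (max_lookbehind > 0 && pos > 0) {
    pos--;
    /* Step back over UTF-8 continuation bytes (10xxxxxx). */
    while (pos > 0 && (seg[pos] & 0xC0) == 0x80) pos--;
    max_lookbehind--;
  }
  return pos;
}
```

       The bytes from the returned offset to the end of the old segment would
       be placed in front of the new segment before the next match attempt.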
1818
1819         PCRE2_INFO_MINLENGTH
1820
1821       If  a  minimum  length  for  matching subject strings was computed, its
1822       value is returned. Otherwise the returned value is 0. The  value  is  a
1823       number  of characters, which in UTF mode may be different from the num-
1824       ber of code units.  The third argument  should  point  to  an  uint32_t
1825       variable.  The  value  is  a  lower bound to the length of any matching
1826       string. There may not be any strings of that length  that  do  actually
1827       match, but every string that does match is at least that long.
1828
1829         PCRE2_INFO_NAMECOUNT
1830         PCRE2_INFO_NAMEENTRYSIZE
1831         PCRE2_INFO_NAMETABLE
1832
1833       PCRE2 supports the use of named as well as numbered capturing parenthe-
1834       ses. The names are just an additional way of identifying the  parenthe-
1835       ses, which still acquire numbers. Several convenience functions such as
1836       pcre2_substring_get_byname() are provided for extracting captured  sub-
1837       strings  by  name. It is also possible to extract the data directly, by
1838       first converting the name to a number in order to  access  the  correct
1839       pointers  in the output vector (described with pcre2_match() below). To
1840       do the conversion, you need to use the  name-to-number  map,  which  is
1841       described by these three values.
1842
1843       The  map  consists  of a number of fixed-size entries. PCRE2_INFO_NAME-
1844       COUNT gives the number of entries, and  PCRE2_INFO_NAMEENTRYSIZE  gives
1845       the  size  of each entry in code units; both of these return a uint32_t
1846       value. The entry size depends on the length of the longest name.
1847
1848       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
1849       This  is  a  PCRE2_SPTR  pointer to a block of code units. In the 8-bit
1850       library, the first two bytes of each entry are the number of  the  cap-
1851       turing parenthesis, most significant byte first. In the 16-bit library,
1852       the pointer points to 16-bit code units, the first  of  which  contains
1853       the  parenthesis  number.  In the 32-bit library, the pointer points to
1854       32-bit code units, the first of which contains the parenthesis  number.
1855       The rest of the entry is the corresponding name, zero terminated.
1856
1857       The  names are in alphabetical order. If (?| is used to create multiple
1858       groups with the same number, as described in the section  on  duplicate
1859       subpattern  numbers  in  the pcre2pattern page, the groups may be given
1860       the same name, but there is only one  entry  in  the  table.  Different
1861       names for groups of the same number are not permitted.
1862
1863       Duplicate  names  for subpatterns with different numbers are permitted,
1864       but only if PCRE2_DUPNAMES is set. They appear  in  the  table  in  the
1865       order  in  which  they were found in the pattern. In the absence of (?|
1866       this is the order of increasing number; when (?| is used  this  is  not
1867       necessarily the case because later subpatterns may have lower numbers.
1868
1869       As  a  simple  example of the name/number table, consider the following
1870       pattern after compilation by the 8-bit library  (assume  PCRE2_EXTENDED
1871       is set, so white space - including newlines - is ignored):
1872
1873         (?<date> (?<year>(\d\d)?\d\d) -
1874         (?<month>\d\d) - (?<day>\d\d) )
1875
1876       There  are  four  named subpatterns, so the table has four entries, and
1877       each entry in the table is eight bytes long. The table is  as  follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
1879       as ??:
1880
1881         00 01 d  a  t  e  00 ??
1882         00 05 d  a  y  00 ?? ??
1883         00 04 m  o  n  t  h  00
1884         00 02 y  e  a  r  00 ??
1885
1886       When writing code to extract data  from  named  subpatterns  using  the
1887       name-to-number  map,  remember that the length of the entries is likely
1888       to be different for each compiled pattern.
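
       The lookup that the name-to-number map supports can be sketched in C
       as follows. This is a minimal illustration for the 8-bit library only.
       name_to_number() and example_table are hypothetical names, and the
       table reproduces the four entries shown above with the undefined bytes
       written as zero. In a real program the table pointer, entry count, and
       entry size would come from pcre2_pattern_info() with
       PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and
       PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The example table from the text: four 8-byte entries (8-bit library).
   Undefined trailing bytes are written as 0 here. */
static const unsigned char example_table[4 * 8] = {
  0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
  0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
  0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
  0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};

/* Hypothetical helper: return the group number for name, or 0 if the
   name is not in the table (group 0 is never a named group). */
static uint32_t name_to_number(const unsigned char *table, uint32_t count,
                               uint32_t entry_size, const char *name)
{
  assert(table != NULL && name != NULL);
  for (uint32_t i = 0; i < count; i++) {
    const unsigned char *entry = table + (size_t)i * entry_size;
    /* Bytes 0-1: group number, most significant byte first.
       Bytes 2 onwards: the zero-terminated name. */
    if (strcmp((const char *)(entry + 2), name) == 0)
      return ((uint32_t)entry[0] << 8) | entry[1];
  }
  return 0;
}
```

       The entry size is passed in rather than fixed, because it differs from
       one compiled pattern to another. A linear scan is used for simplicity.
       When duplicate names are present, this sketch returns only the first
       entry that it finds.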
1889
1890         PCRE2_INFO_NEWLINE
1891
1892       The output is a uint32_t with one of the following values:
1893
1894         PCRE2_NEWLINE_CR       Carriage return (CR)
1895         PCRE2_NEWLINE_LF       Linefeed (LF)
1896         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
1897         PCRE2_NEWLINE_ANY      Any Unicode line ending
1898         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
1899
1900       This specifies the default character sequence that will  be  recognized
1901       as meaning "newline" while matching.
1902
1903         PCRE2_INFO_RECURSIONLIMIT
1904
1905       If  the  pattern set a recursion limit by including an item of the form
1906       (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The  third
1907       argument  should  point to an unsigned 32-bit integer. If no such value
1908       has been set,  the  call  to  pcre2_pattern_info()  returns  the  error
1909       PCRE2_ERROR_UNSET.
1910
1911         PCRE2_INFO_SIZE
1912
1913       Return  the  size  of  the  compiled  pattern  in  bytes (for all three
1914       libraries). The third argument should point to a size_t variable.  This
1915       value  includes  the  size  of the general data block that precedes the
1916       code units of the compiled pattern itself. The value that is used  when
1917       pcre2_compile()  is  getting memory in which to place the compiled pat-
1918       tern may be slightly larger than the value  returned  by  this  option,
1919       because  there are cases where the code that calculates the size has to
1920       over-estimate. Processing a pattern with  the  JIT  compiler  does  not
1921       alter the value returned by this option.
1922
1923
1924INFORMATION ABOUT A PATTERN'S CALLOUTS
1925
1926       int pcre2_callout_enumerate(const pcre2_code *code,
1927         int (*callback)(pcre2_callout_enumerate_block *, void *),
1928         void *user_data);
1929
1930       A script language that supports the use of string arguments in callouts
1931       might like to scan all the callouts in a  pattern  before  running  the
1932       match. This can be done by calling pcre2_callout_enumerate(). The first
1933       argument is a pointer to a compiled pattern, the  second  points  to  a
1934       callback  function,  and the third is arbitrary user data. The callback
1935       function is called for every callout in the pattern  in  the  order  in
1936       which they appear. Its first argument is a pointer to a callout enumer-
1937       ation block, and its second argument is the user_data  value  that  was
1938       passed  to  pcre2_callout_enumerate(). The contents of the callout enu-
1939       meration block are described in the pcre2callout  documentation,  which
1940       also gives further details about callouts.
1941
1942
1943SERIALIZATION AND PRECOMPILING
1944
1945       It  is  possible  to  save  compiled patterns on disc or elsewhere, and
1946       reload them later, subject to a number of restrictions.  The  functions
1947       whose names begin with pcre2_serialize_ are used for this purpose. They
1948       are described in the pcre2serialize documentation.
1949
1950
1951THE MATCH DATA BLOCK
1952
1953       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
1954         pcre2_general_context *gcontext);
1955
1956       pcre2_match_data *pcre2_match_data_create_from_pattern(
1957         const pcre2_code *code, pcre2_general_context *gcontext);
1958
1959       void pcre2_match_data_free(pcre2_match_data *match_data);
1960
1961       Information about a successful or unsuccessful match  is  placed  in  a
1962       match  data  block,  which  is  an opaque structure that is accessed by
1963       function calls. In particular, the match data block contains  a  vector
1964       of  offsets into the subject string that define the matched part of the
       subject and any substrings that were captured. This is known as the
1966       ovector.
1967
1968       Before  calling  pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
1969       you must create a match data block by calling one of the creation func-
1970       tions  above.  For pcre2_match_data_create(), the first argument is the
1971       number of pairs of offsets in the  ovector.  One  pair  of  offsets  is
1972       required  to  identify  the string that matched the whole pattern, with
1973       another pair for each captured substring. For example,  a  value  of  4
1974       creates  enough space to record the matched portion of the subject plus
       three captured substrings. A minimum of one pair is imposed by
1976       pcre2_match_data_create(), so it is always possible to return the over-
1977       all matched string.
1978
1979       The second argument of pcre2_match_data_create() is a pointer to a gen-
1980       eral  context, which can specify custom memory management for obtaining
1981       the memory for the match data block. If you are not using custom memory
1982       management, pass NULL, which causes malloc() to be used.
1983
1984       For  pcre2_match_data_create_from_pattern(),  the  first  argument is a
1985       pointer to a compiled pattern. The ovector is created to be exactly the
1986       right size to hold all the substrings a pattern might capture. The sec-
1987       ond argument is again a pointer to a general context, but in this  case
1988       if NULL is passed, the memory is obtained using the same allocator that
1989       was used for the compiled pattern (custom or default).
1990
1991       A match data block can be used many times, with the same  or  different
1992       compiled  patterns. You can extract information from a match data block
1993       after  a  match  operation  has  finished,  using  functions  that  are
1994       described  in  the  sections  on  matched  strings and other match data
1995       below.
1996
1997       When a call of pcre2_match() fails, valid  data  is  available  in  the
1998       match    block    only   when   the   error   is   PCRE2_ERROR_NOMATCH,
1999       PCRE2_ERROR_PARTIAL, or one of the  error  codes  for  an  invalid  UTF
2000       string. Exactly what is available depends on the error, and is detailed
2001       below.
2002
2003       When one of the matching functions is called, pointers to the  compiled
2004       pattern  and the subject string are set in the match data block so that
2005       they can be referenced by the extraction  functions.  After  running  a
2006       match,  you  must not free a compiled pattern or a subject string until
2007       after all operations on the match data  block  (for  that  match)  have
2008       taken place.
2009
2010       When  a match data block itself is no longer needed, it should be freed
2011       by calling pcre2_match_data_free().
2012
2013
2014MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2015
2016       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
2017         PCRE2_SIZE length, PCRE2_SIZE startoffset,
2018         uint32_t options, pcre2_match_data *match_data,
2019         pcre2_match_context *mcontext);
2020
2021       The function pcre2_match() is called to match a subject string  against
2022       a  compiled pattern, which is passed in the code argument. You can call
2023       pcre2_match() with the same code argument as many times as you like, in
2024       order  to  find multiple matches in the subject string or to match dif-
2025       ferent subject strings with the same pattern.
2026
2027       This function is the main matching facility  of  the  library,  and  it
2028       operates  in  a  Perl-like  manner. For specialist use there is also an
2029       alternative matching function, which is described below in the  section
2030       about the pcre2_dfa_match() function.
2031
2032       Here is an example of a simple call to pcre2_match():
2033
2034         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
2035         int rc = pcre2_match(
2036           re,             /* result of pcre2_compile() */
2037           "some string",  /* the subject string */
2038           11,             /* the length of the subject string */
2039           0,              /* start at offset 0 in the subject */
2040           0,              /* default options */
           md,             /* the match data block */
2042           NULL);          /* a match context; NULL means use defaults */
2043
2044       If  the  subject  string is zero-terminated, the length can be given as
2045       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
2046       common matching parameters are to be changed. For details, see the sec-
2047       tion on the match context above.
2048
2049   The string to be matched by pcre2_match()
2050
2051       The subject string is passed to pcre2_match() as a pointer in  subject,
2052       a  length  in  length, and a starting offset in startoffset. The length
2053       and offset are in code units, not characters.  That  is,  they  are  in
2054       bytes  for the 8-bit library, 16-bit code units for the 16-bit library,
2055       and 32-bit code units for the 32-bit library, whether or not  UTF  pro-
2056       cessing is enabled.
2057
2058       If startoffset is greater than the length of the subject, pcre2_match()
2059       returns PCRE2_ERROR_BADOFFSET. When the starting offset  is  zero,  the
2060       search  for a match starts at the beginning of the subject, and this is
2061       by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2062       set  must  point to the start of a character, or to the end of the sub-
2063       ject (in UTF-32 mode, one code unit equals one character, so  all  off-
2064       sets  are  valid).  Like  the  pattern  string, the subject may contain
2065       binary zeroes.
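
       The constraint on the starting offset in UTF-8 mode can be checked by
       the caller before a match is attempted. The sketch below assumes the
       8-bit library. utf8_start_offset_ok() is a hypothetical helper, and it
       tests only the offset, not the validity of the subject as a whole.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if offset is acceptable as a starting
   offset for a UTF-8 subject of the given length, otherwise 0. */
static int utf8_start_offset_ok(const unsigned char *subject, size_t length,
                                size_t offset)
{
  assert(subject != NULL);
  if (offset > length) return 0;   /* beyond the end of the subject */
  if (offset == length) return 1;  /* the end of the subject is allowed */
  /* A byte of the form 10xxxxxx continues a character; it cannot start one. */
  return (subject[offset] & 0xC0) != 0x80;
}
```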
2066
2067       A non-zero starting offset is useful when searching for  another  match
2068       in  the  same  subject  by calling pcre2_match() again after a previous
2069       success.  Setting startoffset differs from  passing  over  a  shortened
2070       string  and  setting  PCRE2_NOTBOL in the case of a pattern that begins
2071       with any kind of lookbehind. For example, consider the pattern
2072
2073         \Biss\B
2074
2075       which finds occurrences of "iss" in the middle of  words.  (\B  matches
2076       only  if  the  current position in the subject is not a word boundary.)
       When applied to the string "Mississippi" the first call to pcre2_match()
       finds the first occurrence. If pcre2_match() is called again with just
       the remainder of the subject, namely "issippi", it does not match,
2080       because \B is always false at the start of the subject, which is deemed
2081       to be a word boundary. However, if pcre2_match() is passed  the  entire
2082       string again, but with startoffset set to 4, it finds the second occur-
2083       rence of "iss" because it is able to look behind the starting point  to
2084       discover that it is preceded by a letter.
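
       The difference between the two calls comes down to what lies before
       the starting point. The sketch below is an ASCII-only model of the \B
       test, written for illustration. It is not how PCRE2 implements the
       assertion, and is_word() and not_word_boundary() are hypothetical
       names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASCII-only model of a "word" character: letter, digit, or underscore. */
static int is_word(unsigned char c)
{
  return isalnum(c) || c == '_';
}

/* Model of \B at position pos: true when the characters on each side are
   both word characters or both non-word characters. Positions outside the
   subject count as non-word. */
static int not_word_boundary(const char *subject, size_t length, size_t pos)
{
  assert(subject != NULL && pos <= length);
  int before = pos > 0 && is_word((unsigned char)subject[pos - 1]);
  int after = pos < length && is_word((unsigned char)subject[pos]);
  return before == after;
}
```

       With the whole subject and a position of 4, the character before the
       position is a letter, so the model reports no boundary. With only the
       remainder of the subject and a position of 0, there is nothing before
       the position, so a boundary is reported and \B fails.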
2085
2086       Finding  all  the  matches  in a subject is tricky when the pattern can
2087       match an empty string. It is possible to emulate Perl's /g behaviour by
2088       first   trying   the   match   again  at  the  same  offset,  with  the
2089       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options,  and  then  if  that
2090       fails,  advancing  the  starting  offset  and  trying an ordinary match
2091       again. There is some code that demonstrates  how  to  do  this  in  the
2092       pcre2demo  sample  program. In the most general case, you have to check
2093       to see if the newline convention recognizes CRLF as a newline,  and  if
2094       so,  and the current character is CR followed by LF, advance the start-
2095       ing offset by two characters instead of one.
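
       The step of advancing the starting offset can be sketched as below,
       assuming the 8-bit library. advance_after_empty() is a hypothetical
       helper. Its crlf_is_newline and utf8 flags are assumed to have been
       worked out by the caller from the compiled pattern. The UTF-8 branch is
       an addition that keeps the new offset at the start of a character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the offset at which to try an ordinary match
   after the anchored, not-empty retry at offset has failed. */
static size_t advance_after_empty(const unsigned char *subject, size_t length,
                                  size_t offset, int crlf_is_newline, int utf8)
{
  assert(subject != NULL && offset < length);
  size_t next = offset + 1;
  if (crlf_is_newline && next < length &&
      subject[offset] == '\r' && subject[next] == '\n') {
    next++;  /* step over CRLF as a single newline */
  } else if (utf8) {
    /* Step over the remaining bytes of a multi-byte UTF-8 character. */
    while (next < length && (subject[next] & 0xC0) == 0x80) next++;
  }
  return next;
}
```

       The caller is expected to stop the loop when the offset has reached the
       end of the subject, which is why the sketch requires offset to be less
       than length.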
2096
2097       If a non-zero starting offset is passed when the pattern  is  anchored,
2098       one attempt to match at the given offset is made. This can only succeed
2099       if the pattern does not require the match to be at  the  start  of  the
2100       subject.
2101
2102   Option bits for pcre2_match()
2103
2104       The unused bits of the options argument for pcre2_match() must be zero.
2105       The only  bits  that  may  be  set  are  PCRE2_ANCHORED,  PCRE2_NOTBOL,
2106       PCRE2_NOTEOL,   PCRE2_NOTEMPTY,  PCRE2_NOTEMPTY_ATSTART,  PCRE2_NO_JIT,
2107       PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and  PCRE2_PARTIAL_SOFT.  Their
2108       action is described below.
2109
2110       Setting  PCRE2_ANCHORED  at match time is not supported by the just-in-
2111       time (JIT) compiler. If it is set, JIT matching  is  disabled  and  the
2112       normal   interpretive   code   in  pcre2_match()  is  run.  Apart  from
2113       PCRE2_NO_JIT (obviously), the remaining options are supported  for  JIT
2114       matching.
2115
2116         PCRE2_ANCHORED
2117
2118       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
2119       matching position. If a pattern was compiled  with  PCRE2_ANCHORED,  or
2120       turned  out to be anchored by virtue of its contents, it cannot be made
       unanchored at matching time. Note that setting the option at match time
2122       disables JIT matching.
2123
2124         PCRE2_NOTBOL
2125
       This option specifies that the first character of the subject string is
       not the beginning of a line, so the circumflex metacharacter should not
2128       match  before  it.  Setting  this without having set PCRE2_MULTILINE at
2129       compile time causes circumflex never to match. This option affects only
2130       the behaviour of the circumflex metacharacter. It does not affect \A.
2131
2132         PCRE2_NOTEOL
2133
2134       This option specifies that the end of the subject string is not the end
2135       of a line, so the dollar metacharacter should not match it nor  (except
2136       in  multiline mode) a newline immediately before it. Setting this with-
2137       out having set PCRE2_MULTILINE at compile time causes dollar  never  to
2138       match. This option affects only the behaviour of the dollar metacharac-
2139       ter. It does not affect \Z or \z.
2140
2141         PCRE2_NOTEMPTY
2142
2143       An empty string is not considered to be a valid match if this option is
2144       set.  If  there are alternatives in the pattern, they are tried. If all
2145       the alternatives match the empty string, the entire  match  fails.  For
2146       example, if the pattern
2147
2148         a?b?
2149
2150       is  applied  to  a  string not beginning with "a" or "b", it matches an
2151       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
2152       match  is  not valid, so pcre2_match() searches further into the string
2153       for occurrences of "a" or "b".
2154
2155         PCRE2_NOTEMPTY_ATSTART
2156
2157       This is like PCRE2_NOTEMPTY, except that it locks out an  empty  string
2158       match only at the first matching position, that is, at the start of the
2159       subject plus the starting offset. An empty string match  later  in  the
2160       subject  is  permitted.   If  the pattern is anchored, such a match can
2161       occur only if the pattern contains \K.
2162
2163         PCRE2_NO_JIT
2164
2165       By  default,  if  a  pattern  has  been   successfully   processed   by
2166       pcre2_jit_compile(),  JIT  is  automatically used when pcre2_match() is
2167       called with options that JIT supports.  Setting  PCRE2_NO_JIT  disables
2168       the use of JIT; it forces matching to be done by the interpreter.
2169
2170         PCRE2_NO_UTF_CHECK
2171
2172       When PCRE2_UTF is set at compile time, the validity of the subject as a
2173       UTF string is checked by default  when  pcre2_match()  is  subsequently
2174       called.   If  a non-zero starting offset is given, the check is applied
2175       only to that part of the subject that could be inspected during  match-
2176       ing,  and there is a check that the starting offset points to the first
2177       code unit of a character or to the end of the subject. If there are  no
2178       lookbehind  assertions in the pattern, the check starts at the starting
       offset. Otherwise, it starts before the starting offset by the length
       of the longest lookbehind, or at the start of the subject if there are
       not that many characters before the starting offset. Note that the
2182       sequences \b and \B are one-character lookbehinds.
2183
2184       The check is carried out before any other processing takes place, and a
2185       negative error code is returned if the check fails. There  are  several
2186       UTF  error  codes  for each code unit width, corresponding to different
2187       problems with the code unit sequence. There are discussions  about  the
2188       validity  of  UTF-8  strings, UTF-16 strings, and UTF-32 strings in the
2189       pcre2unicode page.
2190
2191       If you know that your subject is valid, and  you  want  to  skip  these
2192       checks  for  performance  reasons,  you  can set the PCRE2_NO_UTF_CHECK
2193       option when calling pcre2_match(). You might want to do  this  for  the
2194       second and subsequent calls to pcre2_match() if you are making repeated
2195       calls to find all the matches in a single subject string.
2196
2197       NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an  invalid
2198       string  as a subject, or an invalid value of startoffset, is undefined.
2199       Your program may crash or loop indefinitely.
2200
2201         PCRE2_PARTIAL_HARD
2202         PCRE2_PARTIAL_SOFT
2203
2204       These options turn on the partial matching  feature.  A  partial  match
2205       occurs  if  the  end of the subject string is reached successfully, but
2206       there are not enough subject characters to complete the match. If  this
2207       happens  when  PCRE2_PARTIAL_SOFT  (but not PCRE2_PARTIAL_HARD) is set,
2208       matching continues by testing any remaining alternatives.  Only  if  no
2209       complete  match can be found is PCRE2_ERROR_PARTIAL returned instead of
2210       PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies  that
2211       the  caller  is prepared to handle a partial match, but only if no com-
2212       plete match can be found.
2213
2214       If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In  this
2215       case,  if  a  partial match is found, pcre2_match() immediately returns
2216       PCRE2_ERROR_PARTIAL, without considering  any  other  alternatives.  In
       other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
       ered to be more important than an alternative complete match.
2219
2220       There is a more detailed discussion of partial and multi-segment match-
2221       ing, with examples, in the pcre2partial documentation.
2222
2223
2224NEWLINE HANDLING WHEN MATCHING
2225
2226       When  PCRE2 is built, a default newline convention is set; this is usu-
2227       ally the standard convention for the operating system. The default  can
2228       be  overridden  in a compile context by calling pcre2_set_newline(). It
2229       can also be overridden by starting a pattern string with, for  example,
2230       (*CRLF),  as  described  in  the  section on newline conventions in the
2231       pcre2pattern page. During matching, the newline choice affects the  be-
2232       haviour  of the dot, circumflex, and dollar metacharacters. It may also
2233       alter the way the match starting position is  advanced  after  a  match
2234       failure for an unanchored pattern.
2235
2236       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
2237       set as the newline convention, and a match attempt  for  an  unanchored
2238       pattern fails when the current starting position is at a CRLF sequence,
2239       and the pattern contains no explicit matches for CR or  LF  characters,
2240       the  match  position  is  advanced by two characters instead of one, in
2241       other words, to after the CRLF.
2242
2243       The above rule is a compromise that makes the most common cases work as
2244       expected.  For  example,  if  the  pattern is .+A (and the PCRE2_DOTALL
2245       option is not set), it does not match the string "\r\nA" because, after
2246       failing  at the start, it skips both the CR and the LF before retrying.
2247       However, the pattern [\r\n]A does match that string,  because  it  con-
2248       tains an explicit CR or LF reference, and so advances only by one char-
2249       acter after the first failure.
2250
       An explicit match for CR or LF is either a literal appearance of one of
2252       those  characters  in  the  pattern,  or  one  of  the  \r or \n escape
2253       sequences. Implicit matches such as [^X] do not  count,  nor  does  \s,
2254       even though it includes CR and LF in the characters that it matches.
2255
2256       Notwithstanding  the above, anomalous effects may still occur when CRLF
2257       is a valid newline sequence and explicit \r or \n escapes appear in the
2258       pattern.
2259
2260
2261HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
2262
2263       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
2264
2265       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
2266
2267       In  general, a pattern matches a certain portion of the subject, and in
2268       addition, further substrings from the subject  may  be  picked  out  by
2269       parenthesized  parts  of  the  pattern.  Following the usage in Jeffrey
2270       Friedl's book, this is called "capturing"  in  what  follows,  and  the
2271       phrase  "capturing subpattern" or "capturing group" is used for a frag-
2272       ment of a pattern that picks out a substring.  PCRE2  supports  several
2273       other kinds of parenthesized subpattern that do not cause substrings to
2274       be captured. The pcre2_pattern_info() function can be used to find  out
2275       how many capturing subpatterns there are in a compiled pattern.
2276
2277       You  can  use  auxiliary functions for accessing captured substrings by
2278       number or by name, as described in sections below.
2279
2280       Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2281       ues,  called  the  ovector,  which  contains  the  offsets  of captured
2282       strings.  It  is  part  of  the  match  data   block.    The   function
2283       pcre2_get_ovector_pointer()  returns  the  address  of the ovector, and
2284       pcre2_get_ovector_count() returns the number of pairs of values it con-
2285       tains.
2286
2287       Within the ovector, the first in each pair of values is set to the off-
2288       set of the first code unit of a substring, and the second is set to the
2289       offset  of the first code unit after the end of a substring. These val-
2290       ues are always code unit offsets, not character offsets. That is,  they
2291       are  byte  offsets  in  the 8-bit library, 16-bit offsets in the 16-bit
2292       library, and 32-bit offsets in the 32-bit library.
2293
2294       After a partial match  (error  return  PCRE2_ERROR_PARTIAL),  only  the
2295       first  pair  of  offsets  (that is, ovector[0] and ovector[1]) are set.
2296       They identify the part of the subject that was partially  matched.  See
2297       the pcre2partial documentation for details of partial matching.
2298
2299       After a successful match, the first pair of offsets identifies the por-
2300       tion of the subject string that was matched by the entire pattern.  The
2301       next  pair  is  used for the first capturing subpattern, and so on. The
2302       value returned by pcre2_match() is one more than the  highest  numbered
2303       pair  that  has been set. For example, if two substrings have been cap-
2304       tured, the returned value is 3. If there are no capturing  subpatterns,
2305       the return value from a successful match is 1, indicating that just the
2306       first pair of offsets has been set.
2307
2308       If a pattern uses the \K escape sequence within a  positive  assertion,
2309       the reported start of a successful match can be greater than the end of
2310       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
2311       "ab", the start and end offset values for the match are 2 and 0.
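
       Code that computes a length from the first pair of offsets may want to
       guard against this case. The sketch below uses size_t in place of
       PCRE2_SIZE so that it builds without pcre2.h. span_is_reversed() and
       span_length() are hypothetical helpers, and treating a reversed span as
       having length zero is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: true when the reported start is beyond the end. */
static int span_is_reversed(const size_t *ovector)
{
  assert(ovector != NULL);
  return ovector[0] > ovector[1];
}

/* Hypothetical helper: length of the matched span in code units, or 0 when
   the span is reversed, so that the subtraction cannot wrap around. */
static size_t span_length(const size_t *ovector)
{
  return span_is_reversed(ovector) ? 0 : ovector[1] - ovector[0];
}
```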
2312
       If a capturing subpattern is matched repeatedly within a single
2314       match operation, it is the last portion of the subject that it  matched
2315       that is returned.
2316
2317       If the ovector is too small to hold all the captured substring offsets,
2318       as much as possible is filled in, and the function returns a  value  of
2319       zero.  If captured substrings are not of interest, pcre2_match() may be
2320       called with a match data block whose ovector is of minimum length (that
2321       is, one pair). However, if the pattern contains back references and the
2322       ovector is not big enough to remember the related substrings, PCRE2 has
2323       to  get  additional  memory for use during matching. Thus it is usually
2324       advisable to set up a match data block containing an ovector of reason-
2325       able size.
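
       The way a caller might turn the return value into a count of usable
       offset pairs is sketched below. pairs_available() is a hypothetical
       helper. Its ovector_count argument is assumed to be the value returned
       by pcre2_get_ovector_count(). Negative return values are treated here
       as leaving nothing to read, which ignores the special case of a partial
       match.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of offset pairs that can be read after a
   call that returned rc, given an ovector holding ovector_count pairs. */
static uint32_t pairs_available(int rc, uint32_t ovector_count)
{
  assert(ovector_count >= 1);
  if (rc < 0) return 0;               /* no match, or an error */
  if (rc == 0) return ovector_count;  /* ovector too small, but filled */
  return (uint32_t)rc;                /* one more than the highest pair set */
}
```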
2326
2327       It  is  possible for capturing subpattern number n+1 to match some part
2328       of the subject when subpattern n has not been used at all. For example,
2329       if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
2330       return from the function is 4, and subpatterns 1 and 3 are matched, but
2331       2  is  not.  When  this happens, both values in the offset pairs corre-
2332       sponding to unused subpatterns are set to PCRE2_UNSET.
2333
2334       Offset values that correspond to unused subpatterns at the end  of  the
2335       expression  are  also  set  to  PCRE2_UNSET. For example, if the string
2336       "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
       are not matched. The return from the function is 2, because the high-
       est used capturing subpattern number is 1. The offsets for the second
       and third capturing subpatterns (assuming the vector is large
2340       enough, of course) are set to PCRE2_UNSET.
2341
2342       Elements in the ovector that do not correspond to capturing parentheses
2343       in the pattern are never changed. That is, if a pattern contains n cap-
2344       turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
2345       pcre2_match().  The  other  elements retain whatever values they previ-
2346       ously had.
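
       The conventions above can be gathered into two small helper functions.
       The following is only a sketch, written so that it builds without
       PCRE2: OFFSET_UNSET is a local stand-in that assumes PCRE2_UNSET is the
       all-ones PCRE2_SIZE value, and group_is_set(), group_length(), and the
       sample data are hypothetical names. The sample offsets are the ones
       described above for (a|(z))(bc) matched against "abc" and for (?=ab\K)
       matched against "ab".

```c
#include <stddef.h>

/* Local stand-in for PCRE2_UNSET (assumed to be ~(PCRE2_SIZE)0, with
   PCRE2_SIZE being size_t) so that this sketch builds without pcre2.h. */
#define OFFSET_UNSET (~(size_t)0)

/* Offsets for (a|(z))(bc) matched against "abc"; the return value is 4. */
const size_t sample_ovector[] = { 0, 3, 0, 1, OFFSET_UNSET, OFFSET_UNSET, 1, 3 };
const int sample_rc = 4;

/* Offsets for a match whose reported start (2) is after its end (0). */
const size_t reversed_ovector[] = { 2, 0 };

/* Nonzero if group n took part in the match. rc is the positive return
   value of the match call, that is, one more than the highest numbered
   group that matched. */
int group_is_set(const size_t *ovector, int rc, int n)
{
    if (n < 0 || n >= rc) return 0;
    return ovector[2 * n] != OFFSET_UNSET;
}

/* Length in code units of group n, or 0 if it is unset or if the reported
   start is greater than the end. */
size_t group_length(const size_t *ovector, int rc, int n)
{
    size_t start, end;
    if (!group_is_set(ovector, rc, n)) return 0;
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    return (start > end) ? 0 : end - start;
}
```

       In a real program the offsets would come from the match data block and
       rc from pcre2_match(); a return of zero (ovector too small) needs
       separate handling because these helpers then report every group as
       unset.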
2347
2348
2349OTHER INFORMATION ABOUT A MATCH
2350
2351       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
2352
2353       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
2354
2355       As well as the offsets in the ovector, other information about a  match
2356       is  retained  in the match data block and can be retrieved by the above
2357       functions in appropriate circumstances. If they  are  called  at  other
2358       times, the result is undefined.
2359
       After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
       failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available,
       and pcre2_get_mark() can be called. It returns a pointer to the
       zero-terminated name, which is within the compiled pattern. Otherwise
       NULL is returned. The length of the (*MARK) name (excluding the
       terminating zero) is stored in the code unit that precedes the name.
       You should use this instead of relying on the terminating zero if the
       (*MARK) name might contain a binary zero.
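
       As a sketch of that advice for the 8-bit library, where a PCRE2_SPTR
       points to unsigned 8-bit code units, the length can be read from the
       code unit just before the name. The function names and the sample
       buffer are hypothetical; the buffer imitates the layout described
       above and is not taken from a real compiled pattern.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Imitation of the stored form of a (*MARK) name: a length code unit, the
   name itself (here 'A', a binary zero, 'B'), and a terminating zero. */
const uint8_t sample_mark_storage[] = { 3, 'A', 0, 'B', 0 };

/* The pointer that pcre2_get_mark() would yield for such a name: the
   address of its first code unit. */
const uint8_t *sample_mark = sample_mark_storage + 1;

/* Length of a (*MARK) name, taken from the preceding code unit. */
size_t mark_length(const uint8_t *mark)
{
    return (size_t)mark[-1];
}

/* Length found by scanning for a zero; too short if the name contains a
   binary zero. */
size_t mark_length_by_scan(const uint8_t *mark)
{
    return strlen((const char *)mark);
}
```

       With the sample data, mark_length() gives 3, whereas scanning for the
       terminating zero stops after one code unit.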
2368
2369       After a successful match, the (*MARK) name that is returned is the last
2370       one  encountered  on the matching path through the pattern. After a "no
2371       match" or a  partial  match,  the  last  encountered  (*MARK)  name  is
2372       returned. For example, consider this pattern:
2373
2374         ^(*MARK:A)((*MARK:B)a|b)c
2375
2376       When  it  matches "bc", the returned mark is A. The B mark is "seen" in
2377       the first branch of the group, but it is not on the matching  path.  On
2378       the  other  hand,  when  this pattern fails to match "bx", the returned
2379       mark is B.
2380
2381       After a successful match, a partial match, or one of  the  invalid  UTF
2382       errors  (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
2383       be called. After a successful or partial match it returns the code unit
2384       offset  of  the character at which the match started. For a non-partial
2385       match, this can be different to the value of ovector[0] if the  pattern
2386       contains  the  \K escape sequence. After a partial match, however, this
2387       value is always the same as ovector[0] because \K does not  affect  the
2388       result of a partial match.
2389
2390       After  a UTF check failure, pcre2_get_startchar() can be used to obtain
2391       the code unit offset of the invalid UTF character. Details are given in
2392       the pcre2unicode page.
2393
2394
2395ERROR RETURNS FROM pcre2_match()
2396
2397       If  pcre2_match() fails, it returns a negative number. This can be con-
2398       verted to a text string by calling the pcre2_get_error_message()  func-
2399       tion  (see  "Obtaining a textual error message" below).  Negative error
2400       codes are also returned by other functions,  and  are  documented  with
2401       them.  The codes are given names in the header file. If UTF checking is
2402       in force and an invalid UTF subject string is detected, one of a number
2403       of  UTF-specific negative error codes is returned. Details are given in
2404       the pcre2unicode page. The following are the other errors that  may  be
2405       returned by pcre2_match():
2406
2407         PCRE2_ERROR_NOMATCH
2408
2409       The subject string did not match the pattern.
2410
2411         PCRE2_ERROR_PARTIAL
2412
2413       The  subject  string did not match, but it did match partially. See the
2414       pcre2partial documentation for details of partial matching.
2415
2416         PCRE2_ERROR_BADMAGIC
2417
2418       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
2419       to  catch  the case when it is passed a junk pointer. This is the error
2420       that is returned when the magic number is not present.
2421
2422         PCRE2_ERROR_BADMODE
2423
2424       This error is given when a pattern  that  was  compiled  by  the  8-bit
2425       library  is  passed  to  a  16-bit  or 32-bit library function, or vice
2426       versa.
2427
2428         PCRE2_ERROR_BADOFFSET
2429
2430       The value of startoffset was greater than the length of the subject.
2431
2432         PCRE2_ERROR_BADOPTION
2433
2434       An unrecognized bit was set in the options argument.
2435
2436         PCRE2_ERROR_BADUTFOFFSET
2437
2438       The UTF code unit sequence that was passed as a subject was checked and
2439       found  to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
2440       value of startoffset did not point to the beginning of a UTF  character
2441       or the end of the subject.
2442
2443         PCRE2_ERROR_CALLOUT
2444
2445       This  error  is never generated by pcre2_match() itself. It is provided
2446       for use by callout  functions  that  want  to  cause  pcre2_match()  or
2447       pcre2_callout_enumerate()  to  return a distinctive error code. See the
2448       pcre2callout documentation for details.
2449
2450         PCRE2_ERROR_INTERNAL
2451
2452       An unexpected internal error has occurred. This error could  be  caused
2453       by a bug in PCRE2 or by overwriting of the compiled pattern.
2454
2455         PCRE2_ERROR_JIT_BADOPTION
2456
       This error is returned when a pattern that was successfully studied
       using JIT is being matched, but the matching mode (partial or complete
       match) does not correspond to any JIT compilation mode. When the JIT
       fast path function is used, this error may also be given for invalid
       options. See the pcre2jit documentation for more details.
2462
2463         PCRE2_ERROR_JIT_STACKLIMIT
2464
2465       This  error  is  returned  when a pattern that was successfully studied
2466       using JIT is being matched, but the memory available for  the  just-in-
2467       time  processing stack is not large enough. See the pcre2jit documenta-
2468       tion for more details.
2469
2470         PCRE2_ERROR_MATCHLIMIT
2471
2472       The backtracking limit was reached.
2473
2474         PCRE2_ERROR_NOMEMORY
2475
2476       If a pattern contains back references,  but  the  ovector  is  not  big
2477       enough  to  remember  the  referenced substrings, PCRE2 gets a block of
2478       memory at the start of matching to use for this purpose. There are some
2479       other  special cases where extra memory is needed during matching. This
2480       error is given when memory cannot be obtained.
2481
2482         PCRE2_ERROR_NULL
2483
2484       Either the code, subject, or match_data argument was passed as NULL.
2485
2486         PCRE2_ERROR_RECURSELOOP
2487
2488       This error is returned when  pcre2_match()  detects  a  recursion  loop
2489       within  the  pattern. Specifically, it means that either the whole pat-
2490       tern or a subpattern has been called recursively for the second time at
2491       the  same  position  in  the  subject string. Some simple patterns that
2492       might do this are detected and faulted at compile time, but  more  com-
2493       plicated  cases,  in particular mutual recursions between two different
2494       subpatterns, cannot be detected until matching is attempted.
2495
2496         PCRE2_ERROR_RECURSIONLIMIT
2497
2498       The internal recursion limit was reached.
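
       The return values of pcre2_match() described in this and the preceding
       sections fall into a handful of cases that calling code usually needs
       to tell apart. The following sketch shows one way of doing so. It
       builds without PCRE2, so ERR_NOMATCH and ERR_PARTIAL are local
       stand-ins that assume the values -1 and -2 for PCRE2_ERROR_NOMATCH and
       PCRE2_ERROR_PARTIAL; the enum and function names are hypothetical.

```c
/* Local stand-ins so the sketch builds without pcre2.h; they assume the
   values that header gives PCRE2_ERROR_NOMATCH and PCRE2_ERROR_PARTIAL. */
#define ERR_NOMATCH (-1)
#define ERR_PARTIAL (-2)

enum match_outcome {
    OUTCOME_MATCH,            /* rc > 0: rc - 1 is the highest group used */
    OUTCOME_MATCH_TRUNCATED,  /* rc == 0: matched, but ovector too small */
    OUTCOME_NOMATCH,          /* the subject did not match */
    OUTCOME_PARTIAL,          /* only a partial match was found */
    OUTCOME_ERROR             /* any other negative value */
};

/* Sorts a pcre2_match() return value into one of the cases above. */
enum match_outcome classify_match_result(int rc)
{
    if (rc > 0) return OUTCOME_MATCH;
    if (rc == 0) return OUTCOME_MATCH_TRUNCATED;
    if (rc == ERR_NOMATCH) return OUTCOME_NOMATCH;
    if (rc == ERR_PARTIAL) return OUTCOME_PARTIAL;
    return OUTCOME_ERROR;
}
```

       For OUTCOME_ERROR, a program would typically pass the original return
       value to pcre2_get_error_message() to obtain a description.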
2499
2500
2501OBTAINING A TEXTUAL ERROR MESSAGE
2502
2503       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
2504         PCRE2_SIZE bufflen);
2505
2506       A text message for an error code  from  any  PCRE2  function  (compile,
2507       match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes-
2508       sage(). The code is passed as the first argument,  with  the  remaining
2509       two  arguments specifying a code unit buffer and its length, into which
2510       the text message is placed. Note that the message is returned  in  code
2511       units of the appropriate width for the library that is being used.
2512
2513       The  returned message is terminated with a trailing zero, and the func-
2514       tion returns the number of code  units  used,  excluding  the  trailing
2515       zero.  If  the  error  number  is  unknown,  the  negative  error  code
2516       PCRE2_ERROR_BADDATA is returned. If the buffer is too small,  the  mes-
2517       sage  is  truncated  (but still with a trailing zero), and the negative
2518       error code PCRE2_ERROR_NOMEMORY is returned.  None of the messages  are
2519       very long; a buffer size of 120 code units is ample.
2520
2521
2522EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2523
2524       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
2525         uint32_t number, PCRE2_SIZE *length);
2526
2527       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
2528         uint32_t number, PCRE2_UCHAR *buffer,
2529         PCRE2_SIZE *bufflen);
2530
2531       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
2532         uint32_t number, PCRE2_UCHAR **bufferptr,
2533         PCRE2_SIZE *bufflen);
2534
2535       void pcre2_substring_free(PCRE2_UCHAR *buffer);
2536
2537       Captured  substrings  can  be accessed directly by using the ovector as
2538       described above.  For convenience, auxiliary functions are provided for
2539       extracting   captured  substrings  as  new,  separate,  zero-terminated
2540       strings. A substring that contains a binary zero is correctly extracted
2541       and  has  a  further  zero  added on the end, but the result is not, of
2542       course, a C string.
2543
2544       The functions in this section identify substrings by number. The number
2545       zero refers to the entire matched substring, with higher numbers refer-
2546       ring to substrings captured by parenthesized groups.  After  a  partial
2547       match,  only  substring  zero  is  available. An attempt to extract any
2548       other substring gives the error PCRE2_ERROR_PARTIAL. The  next  section
2549       describes similar functions for extracting captured substrings by name.
2550
2551       If  a  pattern uses the \K escape sequence within a positive assertion,
2552       the reported start of a successful match can be greater than the end of
2553       the  match.   For  example,  if the pattern (?=ab\K) is matched against
2554       "ab", the start and end offset values for the match are  2  and  0.  In
2555       this  situation,  calling  these functions with a zero substring number
2556       extracts a zero-length empty string.
2557
2558       You can find the length in code units of a captured  substring  without
2559       extracting  it  by calling pcre2_substring_length_bynumber(). The first
2560       argument is a pointer to the match data block, the second is the  group
2561       number,  and the third is a pointer to a variable into which the length
2562       is placed. If you just want to know whether or not  the  substring  has
2563       been captured, you can pass the third argument as NULL.
2564
2565       The  pcre2_substring_copy_bynumber()  function  copies  a captured sub-
2566       string into a supplied buffer,  whereas  pcre2_substring_get_bynumber()
2567       copies  it  into  new memory, obtained using the same memory allocation
2568       function that was used for the match data block. The  first  two  argu-
2569       ments  of  these  functions are a pointer to the match data block and a
2570       capturing group number.
2571
2572       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
2573       the buffer and a pointer to a variable that contains its length in code
2574       units.  This is updated to contain the actual number of code units used
2575       for the extracted substring, excluding the terminating zero.
2576
2577       For pcre2_substring_get_bynumber() the third and fourth arguments point
2578       to variables that are updated with a pointer to the new memory and  the
2579       number  of  code units that comprise the substring, again excluding the
2580       terminating zero. When the substring is no longer  needed,  the  memory
2581       should be freed by calling pcre2_substring_free().
2582
2583       The  return  value  from  all these functions is zero for success, or a
2584       negative error code. If the pattern match  failed,  the  match  failure
2585       code  is  returned.   If  a  substring number greater than zero is used
2586       after a partial match, PCRE2_ERROR_PARTIAL is returned. Other  possible
2587       error codes are:
2588
2589         PCRE2_ERROR_NOMEMORY
2590
2591       The  buffer  was  too small for pcre2_substring_copy_bynumber(), or the
2592       attempt to get memory failed for pcre2_substring_get_bynumber().
2593
2594         PCRE2_ERROR_NOSUBSTRING
2595
2596       There is no substring with that number in the  pattern,  that  is,  the
2597       number is greater than the number of capturing parentheses.
2598
2599         PCRE2_ERROR_UNAVAILABLE
2600
2601       The substring number, though not greater than the number of captures in
2602       the pattern, is greater than the number of slots in the ovector, so the
2603       substring could not be captured.
2604
2605         PCRE2_ERROR_UNSET
2606
2607       The  substring  did  not  participate in the match. For example, if the
2608       pattern is (abc)|(def) and the subject is "def", and the  ovector  con-
2609       tains at least two capturing slots, substring number 1 is unset.
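
       The buffer length convention of pcre2_substring_copy_bynumber() (size
       on entry, substring length on exit, room needed for the added zero) can
       be modelled without the library. The following is such a model and not
       PCRE2's implementation: copy_group() and its status codes are
       hypothetical, it takes the subject and ovector directly instead of a
       match data block, and it cannot distinguish a group that is missing
       from the pattern from one that merely has no slot. The sample offsets
       are those for (abc)|(def) matched against "def".

```c
#include <stddef.h>
#include <string.h>

#define OFFSET_UNSET (~(size_t)0)  /* stand-in, assumed equal to PCRE2_UNSET */

/* Hypothetical status codes; the real functions return PCRE2_ERROR_xxx. */
enum { COPY_OK = 0, COPY_NOMEMORY = -1, COPY_UNAVAILABLE = -2, COPY_UNSET = -3 };

/* Offsets for (abc)|(def) matched against "def": group 1 is unset. */
const size_t def_ovector[] = { 0, 3, OFFSET_UNSET, OFFSET_UNSET, 0, 3 };

/* Copies group n into buffer and adds a terminating zero. On entry *bufflen
   is the buffer size in code units; on success it becomes the substring
   length, excluding the zero. paircount is the number of offset pairs. */
int copy_group(const char *subject, const size_t *ovector, size_t paircount,
               size_t n, char *buffer, size_t *bufflen)
{
    size_t start, end, length;
    if (n >= paircount) return COPY_UNAVAILABLE;
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start == OFFSET_UNSET) return COPY_UNSET;
    length = (start > end) ? 0 : end - start;
    if (length + 1 > *bufflen) return COPY_NOMEMORY;  /* no room for zero */
    memcpy(buffer, subject + start, length);
    buffer[length] = 0;
    *bufflen = length;
    return COPY_OK;
}
```

       Note that a buffer of exactly the substring's length is too small,
       because the terminating zero must also fit.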
2610
2611
2612EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
2613
2614       int pcre2_substring_list_get(pcre2_match_data *match_data,
2615         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
2616
2617       void pcre2_substring_list_free(PCRE2_SPTR *list);
2618
2619       The  pcre2_substring_list_get()  function  extracts  all available sub-
2620       strings and builds a list of pointers to  them.  It  also  (optionally)
2621       builds  a  second  list  that  contains  their lengths (in code units),
2622       excluding a terminating zero that is added to each of them. All this is
2623       done in a single block of memory that is obtained using the same memory
2624       allocation function that was used to get the match data block.
2625
2626       This function must be called only after a successful match.  If  called
2627       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
2628
2629       The  address of the memory block is returned via listptr, which is also
2630       the start of the list of string pointers. The end of the list is marked
2631       by  a  NULL pointer. The address of the list of lengths is returned via
2632       lengthsptr. If your strings do not contain binary zeros and you do  not
2633       therefore need the lengths, you may supply NULL as the lengthsptr argu-
2634       ment to disable the creation of a list of lengths.  The  yield  of  the
2635       function  is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
2636       ory block could not be obtained. When the list is no longer needed,  it
2637       should be freed by calling pcre2_substring_list_free().
2638
       If this function encounters a substring that is unset, which can happen
       when capturing subpattern number n+1 matches some part of the subject,
       but subpattern n has not been used at all, it returns an empty string.
       This can be distinguished from a genuine zero-length substring by
       inspecting the appropriate offset in the ovector, which contains
       PCRE2_UNSET for unset substrings, or by calling
       pcre2_substring_length_bynumber().
2646
2647
2648EXTRACTING CAPTURED SUBSTRINGS BY NAME
2649
2650       int pcre2_substring_number_from_name(const pcre2_code *code,
2651         PCRE2_SPTR name);
2652
2653       int pcre2_substring_length_byname(pcre2_match_data *match_data,
2654         PCRE2_SPTR name, PCRE2_SIZE *length);
2655
2656       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
2657         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
2658
2659       int pcre2_substring_get_byname(pcre2_match_data *match_data,
2660         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
2661
2662       void pcre2_substring_free(PCRE2_UCHAR *buffer);
2663
       To extract a substring by name, you first have to find the associated
       number. For example, for this pattern:
2666
2667         (a+)b(?<xxx>\d+)...
2668
2669       the number of the subpattern called "xxx" is 2. If the name is known to
2670       be  unique  (PCRE2_DUPNAMES  was not set), you can find the number from
2671       the name by calling pcre2_substring_number_from_name(). The first argu-
2672       ment  is the compiled pattern, and the second is the name. The yield of
2673       the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
2674       is  no  subpattern  of  that  name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
2675       there is more than one subpattern of that name. Given the  number,  you
2676       can  extract  the  substring  directly,  or  use  one  of the functions
2677       described above.
2678
2679       For convenience, there are also "byname" functions that  correspond  to
2680       the  "bynumber"  functions,  the  only difference being that the second
2681       argument is a name instead of a number. If PCRE2_DUPNAMES  is  set  and
2682       there are duplicate names, these functions scan all the groups with the
2683       given name, and return the first named string that is set.
2684
2685       If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING  is
2686       returned.  If  all  groups  with the name have numbers that are greater
2687       than the number of slots in  the  ovector,  PCRE2_ERROR_UNAVAILABLE  is
2688       returned.  If  there  is at least one group with a slot in the ovector,
2689       but no group is found to be set, PCRE2_ERROR_UNSET is returned.
2690
2691       Warning: If the pattern uses the (?| feature to set up multiple subpat-
2692       terns  with  the  same number, as described in the section on duplicate
2693       subpattern numbers in the pcre2pattern page, you cannot  use  names  to
2694       distinguish  the  different subpatterns, because names are not included
2695       in the compiled code. The matching process uses only numbers. For  this
2696       reason,  the  use of different names for subpatterns of the same number
2697       causes an error at compile time.
2698
2699
2700CREATING A NEW STRING WITH SUBSTITUTIONS
2701
       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);
2708
2709       This function calls pcre2_match() and then makes a copy of the  subject
2710       string  in  outputbuffer,  replacing the part that was matched with the
2711       replacement string, whose length is supplied in rlength.  This  can  be
2712       given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
2713       which a \K item in a lookahead in the pattern causes the match  to  end
2714       before it starts are not supported, and give rise to an error return.
2715
2716       The  first  seven  arguments  of pcre2_substitute() are the same as for
2717       pcre2_match(), except that the partial matching options are not permit-
2718       ted,  and  match_data may be passed as NULL, in which case a match data
2719       block is obtained and freed within this function, using memory  manage-
2720       ment  functions from the match context, if provided, or else those that
2721       were used to allocate memory for the compiled code.
2722
2723       The outlengthptr argument must point to a variable  that  contains  the
2724       length,  in  code  units, of the output buffer. If the function is suc-
2725       cessful, the value is updated to contain the length of the new  string,
2726       excluding the trailing zero that is automatically added.
2727
2728       If  the  function  is  not  successful,  the value set via outlengthptr
2729       depends on the type of error. For  syntax  errors  in  the  replacement
2730       string,  the  value  is  the offset in the replacement string where the
2731       error was detected. For other  errors,  the  value  is  PCRE2_UNSET  by
2732       default.  This  includes the case of the output buffer being too small,
2733       unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see  below),  in  which
2734       case  the  value  is the minimum length needed, including space for the
2735       trailing zero. Note that in  order  to  compute  the  required  length,
2736       pcre2_substitute()  has  to  simulate  all  the  matching  and copying,
2737       instead of giving an error return as soon as the buffer overflows. Note
2738       also that the length is in code units, not bytes.
2739
2740       In  the replacement string, which is interpreted as a UTF string in UTF
2741       mode, and is checked for UTF  validity  unless  the  PCRE2_NO_UTF_CHECK
2742       option is set, a dollar character is an escape character that can spec-
2743       ify the insertion of characters from capturing groups or (*MARK)  items
2744       in the pattern. The following forms are always recognized:
2745
2746         $$                  insert a dollar character
2747         $<n> or ${<n>}      insert the contents of group <n>
2748         $*MARK or ${*MARK}  insert the name of the last (*MARK) encountered
2749
2750       Either  a  group  number  or  a  group name can be given for <n>. Curly
2751       brackets are required only if the following character would  be  inter-
2752       preted as part of the number or name. The number may be zero to include
2753       the entire matched string.   For  example,  if  the  pattern  a(b)c  is
2754       matched  with "=abc=" and the replacement string "+$1$0$1+", the result
2755       is "=+babcb+=".
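
       To make the basic forms concrete, here is a small library-free model of
       the expansion for one match. It is a sketch under simplifying
       assumptions, not PCRE2's code: substitute_once() and append() are
       hypothetical, only numeric groups are handled (no names and no
       $*MARK), one code unit is one character, and every problem is reported
       as a plain failure. The sample offsets are those for a(b)c matched
       against "=abc=".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define OFFSET_UNSET (~(size_t)0)  /* stand-in, assumed equal to PCRE2_UNSET */

/* Offsets for a(b)c matched against "=abc="; the return value is 2. */
const size_t abc_ovector[] = { 1, 4, 2, 3 };

/* Appends n code units plus a zero if they fit; returns 0 on overflow. */
static int append(char *out, size_t outsize, size_t *used,
                  const char *text, size_t n)
{
    if (n >= outsize - *used) return 0;
    memcpy(out + *used, text, n);
    *used += n;
    out[*used] = 0;
    return 1;
}

/* Copies subject to out, replacing the matched part with the expansion of
   replacement. Returns 1 on success, or 0 on a syntax error, a reference
   to an unknown or unset group, or output overflow. */
int substitute_once(const char *subject, const size_t *ovector, int rc,
                    const char *replacement, char *out, size_t outsize)
{
    size_t used = 0;
    const char *p = replacement;

    if (outsize == 0 || ovector[0] > ovector[1]) return 0;
    out[0] = 0;
    if (!append(out, outsize, &used, subject, ovector[0])) return 0;
    while (*p != 0) {
        if (*p != '$') {                       /* literal character */
            if (!append(out, outsize, &used, p++, 1)) return 0;
        } else if (p[1] == '$') {              /* two dollars insert one */
            if (!append(out, outsize, &used, "$", 1)) return 0;
            p += 2;
        } else {                               /* group, with or without {} */
            int braced = (p[1] == '{');
            int n = 0, digits = 0;
            size_t start, end;
            p += braced ? 2 : 1;
            while (isdigit((unsigned char)*p)) {
                if (++digits > 4) return 0;    /* keeps n within int range */
                n = n * 10 + (*p++ - '0');
            }
            if (digits == 0 || (braced && *p++ != '}')) return 0;
            if (n >= rc) return 0;
            start = ovector[2 * n];
            end = ovector[2 * n + 1];
            if (start == OFFSET_UNSET || start > end) return 0;
            if (!append(out, outsize, &used, subject + start, end - start))
                return 0;
        }
    }
    return append(out, outsize, &used, subject + ovector[1],
                  strlen(subject + ovector[1]));
}
```

       Because digits are consumed greedily, a group reference that is
       followed by a literal digit has to be written with curly brackets, as
       the text above explains.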
2756
2757       The facility for inserting a (*MARK) name can be used to perform simple
2758       simultaneous substitutions, as this pcre2test example shows:
2759
2760         /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
2761             apple lemon
2762          2: pear orange
2763
2764       As  well as the usual options for pcre2_match(), a number of additional
2765       options can be set in the options argument.
2766
2767       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
2768       string,  replacing  every  matching substring. If this is not set, only
2769       the first matching substring is replaced. If any matched substring  has
2770       zero  length, after the substitution has happened, an attempt to find a
2771       non-empty match at the same position is performed. If this is not  suc-
2772       cessful,  the current position is advanced by one character except when
2773       CRLF is a valid newline sequence and the next two  characters  are  CR,
2774       LF. In this case, the current position is advanced by two characters.
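
       The rule for moving on after an empty match can be written as a small
       function. This is a sketch of the rule as stated above, not PCRE2's
       code: the function name is hypothetical, and it assumes that one code
       unit is one character (that is, no UTF mode) and that pos is less than
       length.

```c
#include <stddef.h>

/* Position at which to continue when an empty match at pos could not be
   followed by a non-empty match at the same position. crlf_is_newline is
   nonzero when CRLF is a valid newline sequence. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t pos, int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;     /* step over CR and LF together */
    return pos + 1;         /* step over one character */
}
```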
2775
2776       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH  changes  what happens when the output
2777       buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
2778       ORY  immediately.  If  this  option is set, however, pcre2_substitute()
2779       continues to go through the motions of matching and substituting (with-
2780       out,  of course, writing anything) in order to compute the size of buf-
2781       fer that is needed. This value is  passed  back  via  the  outlengthptr
2782       variable,    with    the   result   of   the   function   still   being
2783       PCRE2_ERROR_NOMEMORY.
2784
       Passing a buffer size of zero is a permitted way of finding out how
       much memory is needed for a given substitution. However, this does mean
       that the entire operation is carried out twice. Depending on the
       application, it may be more efficient to allocate a large buffer and
       free the excess afterwards, instead of using
       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
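
       The idea of going through the motions without writing anything can be
       illustrated with a bounded output buffer that keeps counting after it
       has overflowed. The following is a library-free model of that idea
       only; the structure and function names are hypothetical and nothing
       here is taken from PCRE2's implementation.

```c
#include <stddef.h>
#include <string.h>

/* A bounded output buffer that keeps counting after it overflows. */
struct counting_sink {
    char *buffer;      /* may be NULL when capacity is 0 */
    size_t capacity;   /* size of buffer in code units */
    size_t needed;     /* code units that the output requires so far */
};

/* Adds n code units: written if they fit, otherwise only counted. */
void sink_append(struct counting_sink *sink, const char *text, size_t n)
{
    if (sink->buffer != NULL && sink->needed <= sink->capacity &&
        n <= sink->capacity - sink->needed)
        memcpy(sink->buffer + sink->needed, text, n);
    sink->needed += n;
}

/* Adds the terminating zero. Returns 1 if everything fitted; otherwise
   returns 0, and sink->needed is the minimum buffer size, including the
   terminating zero. */
int sink_finish(struct counting_sink *sink)
{
    sink_append(sink, "", 1);
    return sink->needed <= sink->capacity;
}
```

       A first pass with a capacity of zero yields the required size, and a
       second pass with a buffer of that size then succeeds, which mirrors the
       two-call pattern described above and its cost of doing the work twice.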
2791
2792       PCRE2_SUBSTITUTE_UNKNOWN_UNSET  causes  references  to capturing groups
2793       that do not appear in the pattern to be treated as unset  groups.  This
2794       option  should  be  used  with  care, because it means that a typo in a
2795       group name or  number  no  longer  causes  the  PCRE2_ERROR_NOSUBSTRING
2796       error.
2797
2798       PCRE2_SUBSTITUTE_UNSET_EMPTY  causes  unset capturing groups (including
2799       unknown  groups  when  PCRE2_SUBSTITUTE_UNKNOWN_UNSET  is  set)  to  be
2800       treated  as  empty  strings  when  inserted as described above. If this
2801       option is not set, an attempt to  insert  an  unset  group  causes  the
2802       PCRE2_ERROR_UNSET  error.  This  option does not influence the extended
2803       substitution syntax described below.
2804
2805       PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to  the
2806       replacement  string.  Without this option, only the dollar character is
2807       special, and only the group insertion forms  listed  above  are  valid.
2808       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
2809
2810       Firstly,  backslash in a replacement string is interpreted as an escape
2811       character. The usual forms such as \n or \x{ddd} can be used to specify
2812       particular  character codes, and backslash followed by any non-alphanu-
2813       meric character quotes that character. Extended quoting  can  be  coded
2814       using \Q...\E, exactly as in pattern strings.
2815
2816       There  are  also four escape sequences for forcing the case of inserted
2817       letters.  The insertion mechanism has three states:  no  case  forcing,
2818       force upper case, and force lower case. The escape sequences change the
2819       current state: \U and \L change to upper or lower case forcing, respec-
2820       tively,  and  \E (when not terminating a \Q quoted sequence) reverts to
2821       no case forcing. The sequences \u and \l force the next  character  (if
2822       it  is  a  letter)  to  upper or lower case, respectively, and then the
2823       state automatically reverts to no case forcing. Case forcing applies to
2824       all inserted  characters, including those from captured groups and let-
2825       ters within \Q...\E quoted sequences.
2826
2827       Note that case forcing sequences such as \U...\E do not nest. For exam-
2828       ple,  the  result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
2829       \E has no effect.
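
       The case forcing states can be modelled as a small state machine. The
       sketch below follows the description above but is not PCRE2's code:
       force_case() is a hypothetical name, only the five case forcing
       escapes are interpreted, any other escaped character is copied as it
       stands, \Q quoting and group insertion are not handled, and the
       single-character escapes revert after the next character whether or
       not it is a letter.

```c
#include <ctype.h>
#include <stddef.h>

enum case_state { CASE_NONE, CASE_UPPER, CASE_LOWER };

/* Applies the U, L, E, u and l backslash escapes to the text in src and
   writes the zero-terminated result to dst. Returns 0 if dst is too
   small, otherwise 1. A trailing lone backslash is copied unchanged. */
int force_case(const char *src, char *dst, size_t dstsize)
{
    enum case_state state = CASE_NONE;
    int one_shot = 0;          /* nonzero after a single-character escape */
    size_t used = 0;

    for (; *src != 0; src++) {
        int c = (unsigned char)*src;
        if (c == '\\' && src[1] != 0) {
            c = (unsigned char)*++src;
            switch (c) {
            case 'U': state = CASE_UPPER; one_shot = 0; continue;
            case 'L': state = CASE_LOWER; one_shot = 0; continue;
            case 'E': state = CASE_NONE;  one_shot = 0; continue;
            case 'u': state = CASE_UPPER; one_shot = 1; continue;
            case 'l': state = CASE_LOWER; one_shot = 1; continue;
            default: break;    /* any other escaped character is copied */
            }
        }
        if (state == CASE_UPPER) c = toupper(c);
        else if (state == CASE_LOWER) c = tolower(c);
        if (dstsize - used < 2) return 0;   /* keep room for the zero */
        dst[used++] = (char)c;
        if (one_shot) { state = CASE_NONE; one_shot = 0; }
    }
    if (used >= dstsize) return 0;
    dst[used] = 0;
    return 1;
}
```

       Because there is a single current state and no stack, a second \E
       after the state has already been reset simply sets it to no case
       forcing again, which is why the sequences do not nest.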
2830
2831       The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to  add  more
2832       flexibility  to  group substitution. The syntax is similar to that used
2833       by Bash:
2834
2835         ${<n>:-<string>}
2836         ${<n>:+<string1>:<string2>}
2837
2838       As before, <n> may be a group number or a name. The first  form  speci-
2839       fies  a  default  value. If group <n> is set, its value is inserted; if
2840       not, <string> is expanded and the  result  inserted.  The  second  form
2841       specifies  strings that are expanded and inserted when group <n> is set
2842       or unset, respectively. The first form is just a  convenient  shorthand
2843       for
2844
2845         ${<n>:+${<n>}:<string>}
2846
2847       Backslash  can  be  used to escape colons and closing curly brackets in
2848       the replacement strings. A change of the case forcing  state  within  a
2849       replacement  string  remains  in  force  afterwards,  as  shown in this
2850       pcre2test example:
2851
2852         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
2853             body
2854          1: hello
2855             somebody
2856          1: HELLO
2857
2858       The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these  extended
2859       substitutions.   However,   PCRE2_SUBSTITUTE_UNKNOWN_UNSET  does  cause
2860       unknown groups in the extended syntax forms to be treated as unset.
2861
2862       If successful, pcre2_substitute() returns the  number  of  replacements
2863       that were made. This may be zero if no matches were found, and is never
2864       greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
2865
2866       In the event of an error, a negative error code is returned. Except for
2867       PCRE2_ERROR_NOMATCH    (which   is   never   returned),   errors   from
2868       pcre2_match() are passed straight back.
2869
2870       PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
2871       tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
2872
2873       PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
2874       ing an unknown substring when  PCRE2_SUBSTITUTE_UNKNOWN_UNSET  is  set)
2875       when  the  simple  (non-extended)  syntax  is  used  and  PCRE2_SUBSTI-
2876       TUTE_UNSET_EMPTY is not set.
2877
2878       PCRE2_ERROR_NOMEMORY is returned  if  the  output  buffer  is  not  big
2879       enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
2880       of buffer that is needed is returned via outlengthptr. Note  that  this
2881       does not happen by default.
2882
       PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
       the replacement string, with more particular errors being
       PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
       PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket not found),
       PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
       substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended
       before it started, which can happen if \K is used in an assertion).
2890
2891       As for all PCRE2 errors, a text message that describes the error can be
2892       obtained  by  calling  the  pcre2_get_error_message()   function   (see
2893       "Obtaining a textual error message" above).
2894
2895
2896DUPLICATE SUBPATTERN NAMES
2897
2898       int pcre2_substring_nametable_scan(const pcre2_code *code,
2899         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
2900
2901       When  a  pattern  is compiled with the PCRE2_DUPNAMES option, names for
2902       subpatterns are not required to be unique. Duplicate names  are  always
2903       allowed  for subpatterns with the same number, created by using the (?|
2904       feature. Indeed, if such subpatterns are named, they  are  required  to
2905       use the same names.
2906
2907       Normally, patterns with duplicate names are such that in any one match,
2908       only one of the named subpatterns participates. An example is shown  in
2909       the pcre2pattern documentation.
2910
2911       When   duplicates   are   present,   pcre2_substring_copy_byname()  and
2912       pcre2_substring_get_byname() return the first  substring  corresponding
2913       to   the   given   name   that   is  set.  Only  if  none  are  set  is
2914       PCRE2_ERROR_UNSET is returned.  The  pcre2_substring_number_from_name()
2915       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
2916       duplicate names.
2917
2918       If you want to get full details of all captured substrings for a  given
2919       name,  you  must use the pcre2_substring_nametable_scan() function. The
2920       first argument is the compiled pattern, and the second is the name.  If
2921       the  third  and fourth arguments are NULL, the function returns a group
2922       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
2923
2924       When the third and fourth arguments are not NULL, they must be pointers
2925       to  variables  that are updated by the function. After it has run, they
2926       point to the first and last entries in the name-to-number table for the
2927       given  name,  and the function returns the length of each entry in code
2928       units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there  are
2929       no entries for the given name.
2930
2931       The format of the name table is described above in the section entitled
2932       "Information about a pattern". Given all the relevant entries  for  the
2933       name,  you  can  extract  each of their numbers, and hence the captured
2934       data.
2935
2936
2937FINDING ALL POSSIBLE MATCHES AT ONE POSITION
2938
2939       The traditional matching function uses a  similar  algorithm  to  Perl,
2940       which  stops when it finds the first match at a given point in the sub-
2941       ject. If you want to find all possible matches, or the longest possible
2942       match  at  a  given  position,  consider using the alternative matching
2943       function (see below) instead. If you cannot use the  alternative  func-
2944       tion, you can kludge it up by making use of the callout facility, which
2945       is described in the pcre2callout documentation.
2946
2947       What you have to do is to insert a callout right at the end of the pat-
2948       tern.   When your callout function is called, extract and save the cur-
2949       rent matched substring. Then return 1, which  forces  pcre2_match()  to
2950       backtrack  and  try other alternatives. Ultimately, when it runs out of
2951       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
2952
2953
2954MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
2955
2956       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
2957         PCRE2_SIZE length, PCRE2_SIZE startoffset,
2958         uint32_t options, pcre2_match_data *match_data,
2959         pcre2_match_context *mcontext,
2960         int *workspace, PCRE2_SIZE wscount);
2961
2962       The function pcre2_dfa_match() is called  to  match  a  subject  string
2963       against  a  compiled pattern, using a matching algorithm that scans the
2964       subject string just once, and does not backtrack.  This  has  different
2965       characteristics  to  the  normal  algorithm, and is not compatible with
2966       Perl. Some of the features of PCRE2 patterns are not supported.  Never-
2967       theless,  there are times when this kind of matching can be useful. For
2968       a discussion of the two matching algorithms, and  a  list  of  features
2969       that pcre2_dfa_match() does not support, see the pcre2matching documen-
2970       tation.
2971
2972       The arguments for the pcre2_dfa_match() function are the  same  as  for
2973       pcre2_match(), plus two extras. The ovector within the match data block
2974       is used in a different way, and this is described below. The other com-
2975       mon  arguments  are used in the same way as for pcre2_match(), so their
2976       description is not repeated here.
2977
2978       The two additional arguments provide workspace for  the  function.  The
2979       workspace  vector  should  contain at least 20 elements. It is used for
2980       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2981       workspace  is needed for patterns and subjects where there are a lot of
2982       potential matches.
2983
2984       Here is an example of a simple call to pcre2_dfa_match():
2985
2986         int wspace[20];
2987         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
2988         int rc = pcre2_dfa_match(
2989           re,             /* result of pcre2_compile() */
2990           "some string",  /* the subject string */
2991           11,             /* the length of the subject string */
2992           0,              /* start at offset 0 in the subject */
2993           0,              /* default options */
2994           md,             /* the match data block */
2995           NULL,           /* a match context; NULL means use defaults */
2996           wspace,         /* working space vector */
2997           20);            /* number of elements (NOT size in bytes) */
2998
2999   Option bits for pcre2_dfa_match()
3000
3001       The unused bits of the options argument for pcre2_dfa_match()  must  be
3002       zero.  The  only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
3003       PCRE2_NOTEOL,          PCRE2_NOTEMPTY,          PCRE2_NOTEMPTY_ATSTART,
3004       PCRE2_NO_UTF_CHECK,       PCRE2_PARTIAL_HARD,       PCRE2_PARTIAL_SOFT,
3005       PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but  the  last  four  of
3006       these  are  exactly the same as for pcre2_match(), so their description
3007       is not repeated here.
3008
3009         PCRE2_PARTIAL_HARD
3010         PCRE2_PARTIAL_SOFT
3011
3012       These have the same general effect as they do  for  pcre2_match(),  but
3013       the  details are slightly different. When PCRE2_PARTIAL_HARD is set for
3014       pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if  the  end  of  the
3015       subject is reached and there is still at least one matching possibility
3016       that requires additional characters. This happens even if some complete
3017       matches  have  already  been found. When PCRE2_PARTIAL_SOFT is set, the
3018       return code PCRE2_ERROR_NOMATCH is converted  into  PCRE2_ERROR_PARTIAL
3019       if  the  end  of  the  subject  is reached, there have been no complete
3020       matches, but there is still at least one matching possibility. The por-
3021       tion  of  the  string that was inspected when the longest partial match
3022       was found is set as the first matching string in both cases. There is a
3023       more  detailed  discussion  of partial and multi-segment matching, with
3024       examples, in the pcre2partial documentation.
3025
3026         PCRE2_DFA_SHORTEST
3027
3028       Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm  to
3029       stop as soon as it has found one match. Because of the way the alterna-
3030       tive algorithm works, this is necessarily the shortest  possible  match
3031       at the first possible matching point in the subject string.
3032
3033         PCRE2_DFA_RESTART
3034
3035       When  pcre2_dfa_match() returns a partial match, it is possible to call
3036       it again, with additional subject characters, and have it continue with
3037       the same match. The PCRE2_DFA_RESTART option requests this action; when
3038       it is set, the workspace and wscount arguments must reference the  same
3039       vector  as  before,  because  data about the match so far is left in it
3040       after a partial match. There is more discussion of this facility in the
3041       pcre2partial documentation.
3042
3043   Successful returns from pcre2_dfa_match()
3044
3045       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3046       string in the subject. Note, however, that all the matches from one run
3047       of  the  function  start  at the same point in the subject. The shorter
3048       matches are all initial substrings of the longer matches. For  example,
3049       if the pattern
3050
3051         <.*>
3052
3053       is matched against the string
3054
3055         This is <something> <something else> <something further> no more
3056
3057       the three matched strings are
3058
3059         <something> <something else> <something further>
3060         <something> <something else>
3061         <something>
3062
3063       On  success,  the  yield of the function is a number greater than zero,
3064       which is the number of matched substrings.  The  offsets  of  the  sub-
3065       strings  are returned in the ovector, and can be extracted by number in
3066       the same way as for pcre2_match(), but the numbers bear no relation  to
3067       any  capturing groups that may exist in the pattern, because DFA match-
3068       ing does not support group capture.
3069
3070       Calls to the convenience functions  that  extract  substrings  by  name
3071       return  the  error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used
3072       after a DFA match. The convenience functions that extract substrings by
3073       number  never  return PCRE2_ERROR_NOSUBSTRING, and the meanings of some
3074       other errors are slightly different:
3075
3076         PCRE2_ERROR_UNAVAILABLE
3077
3078       The ovector is not big enough to include a slot for the given substring
3079       number.
3080
3081         PCRE2_ERROR_UNSET
3082
3083       There  is  a  slot  in  the  ovector for this substring, but there were
3084       insufficient matches to fill it.
3085
3086       The matched strings are stored in  the  ovector  in  reverse  order  of
3087       length;  that  is,  the longest matching string is first. If there were
3088       too many matches to fit into the ovector, the yield of the function  is
3089       zero, and the vector is filled with the longest matches.
3090
3091       NOTE:  PCRE2's  "auto-possessification" optimization usually applies to
3092       character repeats at the end of a pattern (as well as internally).  For
3093       example,  the pattern "a\d+" is compiled as if it were "a\d++". For DFA
3094       matching, this means that only one possible  match  is  found.  If  you
3095       really  do  want multiple matches in such cases, either use an ungreedy
3096       repeat such as "a\d+?" or set  the  PCRE2_NO_AUTO_POSSESS  option  when
3097       compiling.
3098
3099   Error returns from pcre2_dfa_match()
3100
3101       The pcre2_dfa_match() function returns a negative number when it fails.
3102       Many of the errors are the same  as  for  pcre2_match(),  as  described
3103       above.  There are in addition the following errors that are specific to
3104       pcre2_dfa_match():
3105
3106         PCRE2_ERROR_DFA_UITEM
3107
3108       This return is given if pcre2_dfa_match() encounters  an  item  in  the
3109       pattern  that it does not support, for instance, the use of \C in a UTF
3110       mode or a back reference.
3111
3112         PCRE2_ERROR_DFA_UCOND
3113
3114       This return is given if pcre2_dfa_match() encounters a  condition  item
3115       that  uses  a back reference for the condition, or a test for recursion
3116       in a specific group. These are not supported.
3117
3118         PCRE2_ERROR_DFA_WSSIZE
3119
3120       This return is given if pcre2_dfa_match() runs  out  of  space  in  the
3121       workspace vector.
3122
3123         PCRE2_ERROR_DFA_RECURSE
3124
3125       When  a  recursive subpattern is processed, the matching function calls
3126       itself recursively, using private memory for the ovector and workspace.
3127       This  error  is given if the internal ovector is not large enough. This
3128       should be extremely rare, as a vector of size 1000 is used.
3129
3130         PCRE2_ERROR_DFA_BADRESTART
3131
3132       When pcre2_dfa_match() is called  with  the  PCRE2_DFA_RESTART  option,
3133       some  plausibility  checks  are  made on the contents of the workspace,
3134       which should contain data about the previous partial match. If  any  of
3135       these checks fail, this error is given.
3136
3137
3138SEE ALSO
3139
3140       pcre2build(3),    pcre2callout(3),    pcre2demo(3),   pcre2matching(3),
3141       pcre2partial(3),    pcre2posix(3),    pcre2sample(3),    pcre2stack(3),
3142       pcre2unicode(3).
3143
3144
3145AUTHOR
3146
3147       Philip Hazel
3148       University Computing Service
3149       Cambridge, England.
3150
3151
3152REVISION
3153
3154       Last updated: 17 June 2016
3155       Copyright (c) 1997-2016 University of Cambridge.
3156------------------------------------------------------------------------------
3157
3158
3159PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)
3160
3161
3162
3163NAME
3164       PCRE2 - Perl-compatible regular expressions (revised API)
3165
3166BUILDING PCRE2
3167
3168       PCRE2  is distributed with a configure script that can be used to build
3169       the library in Unix-like environments using the applications  known  as
3170       Autotools. Also in the distribution are files to support building using
3171       CMake instead of configure.  The  text  file  README  contains  general
3172       information  about  building  with Autotools (some of which is repeated
3173       below), and also has some comments about building on various  operating
3174       systems.  There  is a lot more information about building PCRE2 without
3175       using Autotools (including information about using CMake  and  building
3176       "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
3177       consult this file as well as the README file if you are building  in  a
3178       non-Unix-like environment.
3179
3180
3181PCRE2 BUILD-TIME OPTIONS
3182
3183       The rest of this document describes the optional features of PCRE2 that
3184       can be selected when the library is compiled. It  assumes  use  of  the
3185       configure  script,  where  the  optional features are selected or dese-
3186       lected by providing options to configure before running the  make  com-
3187       mand.  However,  the same options can be selected in both Unix-like and
3188       non-Unix-like environments if you are using CMake instead of  configure
3189       to build PCRE2.
3190
3191       If  you  are not using Autotools or CMake, option selection can be done
3192       by editing the config.h file, or by passing parameter settings  to  the
3193       compiler, as described in NON-AUTOTOOLS-BUILD.
3194
3195       The complete list of options for configure (which includes the standard
3196       ones such as the  selection  of  the  installation  directory)  can  be
3197       obtained by running
3198
3199         ./configure --help
3200
3201       The  following  sections  include  descriptions  of options whose names
3202       begin with --enable or --disable. These settings specify changes to the
3203       defaults  for  the configure command. Because of the way that configure
3204       works, --enable and --disable always come in pairs, so  the  complemen-
3205       tary  option always exists as well, but as it specifies the default, it
3206       is not described.
3207
3208
3209BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3210
3211       By default, a library called libpcre2-8 is built, containing  functions
3212       that  take  string arguments contained in vectors of bytes, interpreted
3213       either as single-byte characters, or UTF-8 strings. You can also  build
3214       two  other libraries, called libpcre2-16 and libpcre2-32, which process
3215       strings that are contained in vectors of 16-bit and 32-bit code  units,
3216       respectively. These can be interpreted either as single-unit characters
3217       or UTF-16/UTF-32 strings. To build these additional libraries, add  one
3218       or both of the following to the configure command:
3219
3220         --enable-pcre2-16
3221         --enable-pcre2-32
3222
3223       If you do not want the 8-bit library, add
3224
3225         --disable-pcre2-8
3226
3227       as  well.  At least one of the three libraries must be built. Note that
3228       the POSIX wrapper is for the 8-bit library only, and that pcre2grep  is
3229       an  8-bit  program.  Neither  of these is built if you select only  the
3230       16-bit or 32-bit libraries.
3231
3232
3233BUILDING SHARED AND STATIC LIBRARIES
3234
3235       The Autotools PCRE2 building process uses libtool to build both  shared
3236       and  static  libraries by default. You can suppress an unwanted library
3237       by adding one of
3238
3239         --disable-shared
3240         --disable-static
3241
3242       to the configure command.
3243
3244
3245UNICODE AND UTF SUPPORT
3246
3247       By default, PCRE2 is built with support for Unicode and  UTF  character
3248       strings.  To build it without Unicode support, add
3249
3250         --disable-unicode
3251
3252       to  the configure command. This setting applies to all three libraries.
3253       It is not possible to build  one  library  with  Unicode  support,  and
3254       another without, in the same configuration.
3255
3256       Of  itself, Unicode support does not make PCRE2 treat strings as UTF-8,
3257       UTF-16 or UTF-32. To do that, applications that use the library can set
3258       the  PCRE2_UTF  option when they call pcre2_compile() to compile a pat-
3259       tern.  Alternatively, patterns may be started with  (*UTF)  unless  the
3260       application has locked this out by setting PCRE2_NEVER_UTF.
3261
3262       UTF support allows the libraries to process character code points up to
3263       0x10ffff in the strings that they handle. It also provides support  for
3264       accessing  the  Unicode  properties  of  such characters, using pattern
3265       escapes such as \P, \p, and \X. Only the  general  category  properties
3266       such  as Lu and Nd are supported. Details are given in the pcre2pattern
3267       documentation.
3268
3269       Pattern escapes such as \d and \w do not by default make use of Unicode
3270       properties.  The  application  can  request that they do by setting the
3271       PCRE2_UCP option. Unless the application  has  set  PCRE2_NEVER_UCP,  a
3272       pattern may also request this by starting with (*UCP).
3273
3274
3275DISABLING THE USE OF \C
3276
3277       The \C escape sequence, which matches a single code unit, even in a UTF
3278       mode, can cause unpredictable behaviour because it may leave  the  cur-
3279       rent  matching  point in the middle of a multi-code-unit character. The
3280       application can lock it  out  by  setting  the  PCRE2_NEVER_BACKSLASH_C
3281       option when calling pcre2_compile(). There is also a build-time option
3282
3283         --enable-never-backslash-C
3284
3285       (note the upper case C) which locks out the use of \C entirely.
3286
3287
3288JUST-IN-TIME COMPILER SUPPORT
3289
3290       Just-in-time compiler support is included in the build by specifying
3291
3292         --enable-jit
3293
3294       This  support  is available only for certain hardware architectures. If
3295       this option is set for an unsupported architecture,  a  building  error
3296       occurs.   See the pcre2jit documentation for a discussion of JIT usage.
3297       When JIT support is enabled, pcre2grep automatically makes use  of  it,
3298       unless you add
3299
3300         --disable-pcre2grep-jit
3301
3302       to the configure command.
3303
3304
3305NEWLINE RECOGNITION
3306
3307       By  default, PCRE2 interprets the linefeed (LF) character as indicating
3308       the end of a line. This is the normal newline  character  on  Unix-like
3309       systems.  You can compile PCRE2 to use carriage return (CR) instead, by
3310       adding
3311
3312         --enable-newline-is-cr
3313
3314       to the configure  command.  There  is  also  an  --enable-newline-is-lf
3315       option, which explicitly specifies linefeed as the newline character.
3316
3317       Alternatively, you can specify that line endings are to be indicated by
3318       the two-character sequence CRLF (CR immediately followed by LF). If you
3319       want this, add
3320
3321         --enable-newline-is-crlf
3322
3323       to the configure command. There is a fourth option, specified by
3324
3325         --enable-newline-is-anycrlf
3326
3327       which  causes  PCRE2 to recognize any of the three sequences CR, LF, or
3328       CRLF as indicating a line ending. Finally, a fifth option, specified by
3329
3330         --enable-newline-is-any
3331
3332       causes PCRE2 to recognize any Unicode  newline  sequence.  The  Unicode
3333       newline sequences are the three just mentioned, plus the single charac-
3334       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
3335       U+0085),  LS  (line  separator,  U+2028),  and PS (paragraph separator,
3336       U+2029).
3337
3338       Whatever default line ending convention is selected when PCRE2 is built
3339       can  be  overridden by applications that use the library. At build time
3340       it is conventional to use the standard for your operating system.
3341
3342
3343WHAT \R MATCHES
3344
3345       By default, the sequence \R in a pattern matches  any  Unicode  newline
3346       sequence,  independently  of  what has been selected as the line ending
3347       sequence. If you specify
3348
3349         --enable-bsr-anycrlf
3350
3351       the default is changed so that \R matches only CR, LF, or  CRLF.  What-
3352       ever  is selected when PCRE2 is built can be overridden by applications
3353       that use the library.
3354
3355
3356HANDLING VERY LARGE PATTERNS
3357
3358       Within a compiled pattern, offset values are used  to  point  from  one
3359       part  to another (for example, from an opening parenthesis to an alter-
3360       nation metacharacter). By default, in the 8-bit and  16-bit  libraries,
3361       two-byte  values  are used for these offsets, leading to a maximum size
3362       for a compiled pattern of around 64K code units. This is sufficient  to
3363       handle all but the most gigantic patterns. Nevertheless, some people do
3364       want to process truly enormous patterns, so it is possible  to  compile
3365       PCRE2  to  use three-byte or four-byte offsets by adding a setting such
3366       as
3367
3368         --with-link-size=3
3369
3370       to the configure command. The value given must be 2, 3, or 4.  For  the
3371       16-bit  library,  a  value of 3 is rounded up to 4. In these libraries,
3372       using longer offsets slows down the operation of PCRE2 because  it  has
3373       to  load additional data when handling them. For the 32-bit library the
3374       value is always 4 and cannot be overridden; the value  of  --with-link-
3375       size is ignored.
3376
3377
3378AVOIDING EXCESSIVE STACK USAGE
3379
3380       When  matching  with the pcre2_match() function, PCRE2 implements back-
3381       tracking by making recursive  calls  to  an  internal  function  called
3382       match().  In  environments where the size of the stack is limited, this
3383       can severely limit PCRE2's operation. (The Unix  environment  does  not
3384       usually  suffer from this problem, but it may sometimes be necessary to
3385       increase  the  maximum  stack  size.  There  is  a  discussion  in  the
3386       pcre2stack  documentation.)  An  alternative approach to recursion that
3387       uses memory from the heap to remember data, instead of using  recursive
3388       function  calls, has been implemented to work round the problem of lim-
3389       ited stack size. If you want to build a version  of  PCRE2  that  works
3390       this way, add
3391
3392         --disable-stack-for-recursion
3393
3394       to the configure command. By default, the system functions malloc() and
3395       free() are called to manage the heap memory that is required, but  cus-
3396       tom  memory  management  functions  can  be  called instead. PCRE2 runs
3397       noticeably more slowly when built in this way. This option affects only
3398       the pcre2_match() function; it is not relevant for pcre2_dfa_match().
3399
3400
3401LIMITING PCRE2 RESOURCE USAGE
3402
3403       Internally, PCRE2 has a function called match(), which it calls repeat-
3404       edly  (sometimes  recursively)  when  matching  a  pattern   with   the
3405       pcre2_match() function. By controlling the maximum number of times this
3406       function may be called during a single matching operation, a limit  can
3407       be  placed on the resources used by a single call to pcre2_match(). The
3408       limit can be changed at run time, as described in the pcre2api documen-
3409       tation.  The default is 10 million, but this can be changed by adding a
3410       setting such as
3411
3412         --with-match-limit=500000
3413
3414       to  the  configure  command.  This  setting  has  no  effect   on   the
3415       pcre2_dfa_match() matching function.
3416
3417       In  some  environments  it is desirable to limit the depth of recursive
3418       calls of match() more strictly than the total number of calls, in order
3419       to  restrict  the maximum amount of stack (or heap, if --disable-stack-
3420       for-recursion is specified) that is used. A second limit controls this;
3421       it  defaults  to  the  value  that is set for --with-match-limit, which
3422       imposes no additional constraints. However, you can set a  lower  limit
3423       by adding, for example,
3424
3425         --with-match-limit-recursion=10000
3426
3427       to  the  configure  command.  This  value can also be overridden at run
3428       time.
3429
3430
3431CREATING CHARACTER TABLES AT BUILD TIME
3432
3433       PCRE2 uses fixed tables for processing characters whose code points are
3434       less than 256. By default, PCRE2 is built with a set of tables that are
3435       distributed in the file src/pcre2_chartables.c.dist. These  tables  are
3436       for ASCII codes only. If you add
3437
3438         --enable-rebuild-chartables
3439
3440       to  the  configure  command, the distributed tables are no longer used.
3441       Instead, a program called dftables is compiled and  run.  This  outputs
3442       the source for a new set of tables, created in the  default  locale  of  your
3443       C run-time system. (This method of replacing the tables does  not  work
3444       if  you are cross compiling, because dftables is run on the local host.
3445       If you need to create alternative tables when cross compiling, you will
3446       have to do so "by hand".)
3447
3448
3449USING EBCDIC CODE
3450
3451       PCRE2  assumes  by default that it will run in an environment where the
3452       character code is ASCII or Unicode, which is a superset of ASCII.  This
3453       is the case for most computer operating systems. PCRE2 can, however, be
3454       compiled to run in an 8-bit EBCDIC environment by adding
3455
3456         --enable-ebcdic --disable-unicode
3457
3458       to the configure command. This setting implies --enable-rebuild-charta-
3459       bles.  You  should  only  use  it if you know that you are in an EBCDIC
3460       environment (for example, an IBM mainframe operating system).
3461
3462       It is not possible to support both EBCDIC and UTF-8 codes in  the  same
3463       version  of  the  library. Consequently, --enable-unicode and --enable-
3464       ebcdic are mutually exclusive.
3465
3466       The EBCDIC character that corresponds to an ASCII LF is assumed to have
3467       the  value  0x15 by default. However, in some EBCDIC environments, 0x25
3468       is used. In such an environment you should use
3469
3470         --enable-ebcdic-nl25
3471
3472       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
3473       has  the  same  value  as in ASCII, namely, 0x0d. Whichever of 0x15 and
3474       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
3475       acter (which, in Unicode, is 0x85).
3476
3477       The options that select newline behaviour, such as --enable-newline-is-
3478       cr, and equivalent run-time options, refer to these character values in
3479       an EBCDIC environment.
3480
3481
3482PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
3483
3484       By default, on non-Windows systems, pcre2grep supports the use of call-
3485       outs with string arguments within the patterns it is matching, in order
3486       to  run external scripts. For details, see the pcre2grep documentation.
3487       This support can be disabled by adding  --disable-pcre2grep-callout  to
3488       the configure command.
3489
3490
3491PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
3492
3493       By  default,  pcre2grep reads all files as plain text. You can build it
3494       so that it recognizes files whose names end in .gz or .bz2,  and  reads
3495       them with libz or libbz2, respectively, by adding one or both of
3496
3497         --enable-pcre2grep-libz
3498         --enable-pcre2grep-libbz2
3499
3500       to the configure command. These options naturally require that the rel-
3501       evant libraries are installed on your system. Configuration  will  fail
3502       if they are not.
3503
3504
3505PCRE2GREP BUFFER SIZE
3506
3507       pcre2grep  uses an internal buffer to hold a "window" on the file it is
3508       scanning, in order to be able to output "before" and "after" lines when
3509       it  finds  a match. The size of the buffer is controlled by a parameter
3510       whose default value is 20K. The buffer itself is three times this size,
3511       but because of the way it is used for holding "before" lines, the long-
3512       est line that is guaranteed to be processable is  the  parameter  size.
3513       You can change the default parameter value by adding, for example,
3514
3515         --with-pcre2grep-bufsize=50K
3516
3517       to  the  configure  command.  The caller of pcre2grep can override this
3518       value by using --buffer-size on the command line.
3519
3520
3521PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
3522
3523       If you add one of
3524
3525         --enable-pcre2test-libreadline
3526         --enable-pcre2test-libedit
3527
3528       to the configure command, pcre2test  is  linked  with  the  libreadline
3529       or libedit library, respectively, and when its input is from a terminal,
3530       it reads it using the readline() function. This  provides  line-editing
3531       and  history  facilities.  Note that libreadline is GPL-licensed, so if
3532       you distribute a binary of pcre2test linked in this way, there  may  be
3533       licensing issues. These can be avoided by linking instead with libedit,
3534       which has a BSD licence.
3535
3536       Setting --enable-pcre2test-libreadline causes the -lreadline option  to
3537       be  added to the pcre2test build. In many operating environments with a
3538       system-installed readline library this is sufficient. However, in  some
3539       environments (e.g. if an unmodified distribution version of readline is
3540       in use), some extra configuration may be necessary.  The  INSTALL  file
3541       for libreadline says this:
3542
3543         "Readline uses the termcap functions, but does not link with
3544         the termcap or curses library itself, allowing applications
3545         which link with readline the to choose an appropriate library."
3546
3547       If  your environment has not been set up so that an appropriate library
3548       is automatically included, you may need to add something like
3549
3550         LIBS="-lncurses"
3551
3552       immediately before the configure command.
3553
3554
3555INCLUDING DEBUGGING CODE
3556
3557       If you add
3558
3559         --enable-debug
3560
3561       to the configure command, additional debugging code is included in  the
3562       build. This feature is intended for use by the PCRE2 maintainers.
3563
3564
3565DEBUGGING WITH VALGRIND SUPPORT
3566
3567       If you add
3568
3569         --enable-valgrind
3570
3571       to  the  configure command, PCRE2 will use valgrind annotations to mark
3572       certain memory regions as  unaddressable.  This  allows  it  to  detect
3573       invalid  memory  accesses,  and  is  mostly  useful for debugging PCRE2
3574       itself.
3575
3576
3577CODE COVERAGE REPORTING
3578
3579       If your C compiler is gcc, you can build a version of  PCRE2  that  can
3580       generate a code coverage report for its test suite. To enable this, you
3581       must install lcov version 1.6 or above. Then specify
3582
3583         --enable-coverage
3584
3585       to the configure command and build PCRE2 in the usual way.
3586
3587       Note that using ccache (a caching C compiler) is incompatible with code
3588       coverage  reporting. If you have configured ccache to run automatically
3589       on your system, you must set the environment variable
3590
3591         CCACHE_DISABLE=1
3592
3593       before running make to build PCRE2, so that ccache is not used.
3594
       When --enable-coverage is used, the following additional  targets  are
       added to the Makefile:
3597
3598         make coverage
3599
3600       This  creates  a  fresh coverage report for the PCRE2 test suite. It is
3601       equivalent to running "make coverage-reset", "make  coverage-baseline",
3602       "make check", and then "make coverage-report".
3603
3604         make coverage-reset
3605
3606       This zeroes the coverage counters, but does nothing else.
3607
3608         make coverage-baseline
3609
3610       This captures baseline coverage information.
3611
3612         make coverage-report
3613
3614       This creates the coverage report.
3615
3616         make coverage-clean-report
3617
3618       This  removes the generated coverage report without cleaning the cover-
3619       age data itself.
3620
3621         make coverage-clean-data
3622
3623       This removes the captured coverage data without removing  the  coverage
3624       files created at compile time (*.gcno).
3625
3626         make coverage-clean
3627
3628       This  cleans all coverage data including the generated coverage report.
3629       For more information about code coverage, see the gcov and  lcov  docu-
3630       mentation.
3631
3632
3633SEE ALSO
3634
3635       pcre2api(3), pcre2-config(3).
3636
3637
3638AUTHOR
3639
3640       Philip Hazel
3641       University Computing Service
3642       Cambridge, England.
3643
3644
3645REVISION
3646
3647       Last updated: 01 April 2016
3648       Copyright (c) 1997-2016 University of Cambridge.
3649------------------------------------------------------------------------------
3650
3651
3652PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)
3653
3654
3655
3656NAME
3657       PCRE2 - Perl-compatible regular expressions (revised API)
3658
3659SYNOPSIS
3660
3661       #include <pcre2.h>
3662
3663       int (*pcre2_callout)(pcre2_callout_block *, void *);
3664
3665       int pcre2_callout_enumerate(const pcre2_code *code,
3666         int (*callback)(pcre2_callout_enumerate_block *, void *),
3667         void *user_data);
3668
3669
3670DESCRIPTION
3671
3672       PCRE2  provides  a feature called "callout", which is a means of tempo-
3673       rarily passing control to the caller of PCRE2 in the middle of  pattern
3674       matching.  The caller of PCRE2 provides an external function by putting
3675       its entry point in a match  context  (see  pcre2_set_callout()  in  the
3676       pcre2api documentation).
3677
3678       Within  a  regular expression, (?C<arg>) indicates a point at which the
3679       external function is to be called.  Different  callout  points  can  be
3680       identified  by  putting  a number less than 256 after the letter C. The
3681       default value is zero.  Alternatively, the argument may be a  delimited
3682       string.  The  starting delimiter must be one of ` ' " ^ % # $ { and the
3683       ending delimiter is the same as the start, except for {, where the end-
3684       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
3685       string, it must be doubled. For example, this pattern has  two  callout
3686       points:
3687
3688         (?C1)abc(?C"some ""arbitrary"" text")def
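
       As a minimal sketch of the delimiter rules above, the following C
       function copies a callout string argument into a buffer, reducing
       doubled ending delimiters to single ones. The function name
       decode_callout_arg() is hypothetical; it is not part of the PCRE2 API,
       and PCRE2 performs this processing itself when it compiles a pattern.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2. "arg" points at the starting
   delimiter of a callout string argument, that is, just after "(?C".
   The decoded string is written to "out" with a terminating zero, and
   its length is returned. The result is -1 for an unrecognized starting
   delimiter, a missing ending delimiter, or a buffer that is too small. */
static long decode_callout_arg(const char *arg, char *out, size_t outsize)
{
  static const char starters[] = "`'\"^%#${";
  const char *p;
  size_t n = 0;
  char end;

  if (arg[0] == '\0' || strchr(starters, arg[0]) == NULL) return -1;
  end = (arg[0] == '{') ? '}' : arg[0];

  for (p = arg + 1; *p != '\0'; p++) {
    if (*p == end) {
      if (p[1] != end) {          /* Single ending delimiter: finished */
        if (n >= outsize) return -1;
        out[n] = '\0';
        return (long)n;
      }
      p++;                        /* Doubled delimiter: keep one copy */
    }
    if (n + 1 >= outsize) return -1;
    out[n++] = *p;
  }
  return -1;                      /* No ending delimiter was found */
}
```

       Applied to the text that follows the second (?C in the pattern above,
       this sketch yields the string: some "arbitrary" text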
3689
3690       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
3691       PCRE2 automatically inserts callouts, all with number 255, before  each
3692       item  in  the  pattern. For example, if PCRE2_AUTO_CALLOUT is used with
3693       the pattern
3694
3695         A(\d{2}|--)
3696
3697       it is processed as if it were
3698
3699       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
3700
3701       Notice that there is a callout before and after  each  parenthesis  and
3702       alternation bar. If the pattern contains a conditional group whose con-
3703       dition is an assertion, an automatic callout  is  inserted  immediately
3704       before  the  condition. Such a callout may also be inserted explicitly,
3705       for example:
3706
3707         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)
3708
3709       This applies only to assertion conditions (because they are  themselves
3710       independent groups).
3711
3712       Callouts  can  be useful for tracking the progress of pattern matching.
3713       The pcre2test program has a pattern qualifier (/auto_callout) that sets
3714       automatic  callouts.   When  any  callouts are present, the output from
3715       pcre2test indicates how the pattern is being matched.  This  is  useful
3716       information  when  you are trying to optimize the performance of a par-
3717       ticular pattern.
3718
3719
3720MISSING CALLOUTS
3721
3722       You should be aware that, because of optimizations  in  the  way  PCRE2
3723       compiles and matches patterns, callouts sometimes do not happen exactly
3724       as you might expect.
3725
3726   Auto-possessification
3727
3728       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
3729       that  what follows cannot be part of the repeat. For example, a+[bc] is
3730       compiled as if it were a++[bc]. The pcre2test output when this  pattern
3731       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
3732       to the string "aaaa" is:
3733
3734         --->aaaa
3735          +0 ^        a+
3736          +2 ^   ^    [bc]
3737         No match
3738
3739       This indicates that when matching [bc] fails, there is no  backtracking
3740       into  a+  and  therefore the callouts that would be taken for the back-
3741       tracks do not occur.  You can disable the  auto-possessify  feature  by
3742       passing  PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
3743       tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
3744
3745         --->aaaa
3746          +0 ^        a+
3747          +2 ^   ^    [bc]
3748          +2 ^  ^     [bc]
3749          +2 ^ ^      [bc]
3750          +2 ^^       [bc]
3751         No match
3752
3753       This time, when matching [bc] fails, the matcher backtracks into a+ and
3754       tries again, repeatedly, until a+ itself fails.
3755
3756   Automatic .* anchoring
3757
3758       By default, an optimization is applied when .* is the first significant
3759       item in a pattern. If PCRE2_DOTALL is set, so that the  dot  can  match
3760       any  character,  the pattern is automatically anchored. If PCRE2_DOTALL
3761       is not set, a match can start only after an internal newline or at  the
3762       beginning  of  the  subject,  and  pcre2_compile() remembers this. This
3763       optimization is disabled, however, if .* is in an atomic  group  or  if
3764       there  is  a back reference to the capturing group in which it appears.
3765       It is also disabled if the pattern contains (*PRUNE) or  (*SKIP).  How-
3766       ever, the presence of callouts does not affect it.
3767
3768       For  example,  if  the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
3769       and applied to the string "aa", the pcre2test output is:
3770
3771         --->aa
3772          +0 ^      .*
3773          +2 ^ ^    \d
3774          +2 ^^     \d
3775          +2 ^      \d
3776         No match
3777
3778       This shows that all match attempts start at the beginning of  the  sub-
3779       ject.  In  other  words,  the pattern is anchored. You can disable this
3780       optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(),  or
3781       starting  the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
3782       put changes to:
3783
3784         --->aa
3785          +0 ^      .*
3786          +2 ^ ^    \d
3787          +2 ^^     \d
3788          +2 ^      \d
3789          +0  ^     .*
3790          +2  ^^    \d
3791          +2  ^     \d
3792         No match
3793
3794       This shows more match attempts, starting at the second subject  charac-
3795       ter.   Another  optimization, described in the next section, means that
3796       there is no subsequent attempt to match with an empty subject.
3797
3798       If a pattern has more than one top-level  branch,  automatic  anchoring
3799       occurs if all branches are anchorable.
3800
3801   Other optimizations
3802
3803       Other  optimizations  that  provide fast "no match" results also affect
3804       callouts.  For example, if the pattern is
3805
3806         ab(?C4)cd
3807
3808       PCRE2 knows that any matching string must contain the  letter  "d".  If
3809       the  subject  string  is  "abyz",  the  lack of "d" means that matching
3810       doesn't ever start, and the callout is  never  reached.  However,  with
3811       "abyd", though the result is still no match, the callout is obeyed.
3812
3813       PCRE2  also  knows  the  minimum  length of a matching string, and will
3814       immediately give a "no match" return without actually running  a  match
3815       if  the  subject is not long enough, or, for unanchored patterns, if it
3816       has been scanned far enough.
3817
3818       You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
3819       MIZE  option  to  pcre2_compile(),  or  by  starting  the  pattern with
3820       (*NO_START_OPT). This slows down the matching process, but does  ensure
3821       that callouts such as the example above are obeyed.
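
       The effect of these checks on callouts can be pictured with a
       simplified pre-check. The function below is an illustration only and
       is not how PCRE2 implements its start optimizations; the name
       worth_matching() and its parameters are assumptions. For the pattern
       ab(?C4)cd the required code unit would be "d" and the minimum length
       would be 4.

```c
#include <stddef.h>
#include <string.h>

/* Simplified sketch of a "fast no match" test. Returns 1 if a match
   attempt is worth starting, or 0 if the subject is too short or lacks
   a code unit that every matching string must contain. When the result
   is 0, no matching takes place and so no callout can be reached. */
static int worth_matching(const char *subject, size_t length,
  char required_unit, size_t min_length)
{
  if (length < min_length) return 0;
  if (memchr(subject, required_unit, length) == NULL) return 0;
  return 1;
}
```

       In this model "abyz" is rejected before matching starts, whereas
       "abyd" passes the pre-check and is rejected only by the matcher
       itself, after the callout has been reached.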
3822
3823
3824THE CALLOUT INTERFACE
3825
3826       During  matching,  when  PCRE2  reaches a callout point, if an external
3827       function is set in the match context, it is  called.  This  applies  to
       both  normal  and DFA matching. The first argument to the callout func-
       tion is a pointer to a pcre2_callout_block. The second argument is  the
3830       void  *  callout  data that was supplied when the callout was set up by
3831       calling pcre2_set_callout() (see the pcre2api documentation). The call-
3832       out block structure contains the following fields:
3833
3834         uint32_t      version;
3835         uint32_t      callout_number;
3836         uint32_t      capture_top;
3837         uint32_t      capture_last;
3838         PCRE2_SIZE   *offset_vector;
3839         PCRE2_SPTR    mark;
3840         PCRE2_SPTR    subject;
3841         PCRE2_SIZE    subject_length;
3842         PCRE2_SIZE    start_match;
3843         PCRE2_SIZE    current_position;
3844         PCRE2_SIZE    pattern_position;
3845         PCRE2_SIZE    next_item_length;
3846         PCRE2_SIZE    callout_string_offset;
3847         PCRE2_SIZE    callout_string_length;
3848         PCRE2_SPTR    callout_string;
3849
3850       The  version field contains the version number of the block format. The
3851       current version is 1; the three callout string fields  were  added  for
3852       this  version. If you are writing an application that might use an ear-
3853       lier release of PCRE2, you  should  check  the  version  number  before
3854       accessing  any  of  these  fields.  The version number will increase in
3855       future if more fields are added, but the intention is never  to  remove
3856       any of the existing fields.
3857
3858   Fields for numerical callouts
3859
3860       For  a  numerical  callout,  callout_string is NULL, and callout_number
3861       contains the number of the callout, in the range  0-255.  This  is  the
3862       number  that  follows  (?C for manual callouts; it is 255 for automati-
3863       cally generated callouts.
3864
3865   Fields for string callouts
3866
3867       For callouts with string arguments, callout_number is always zero,  and
3868       callout_string  points  to the string that is contained within the com-
3869       piled pattern. Its length is given by callout_string_length. Duplicated
3870       ending delimiters that were present in the original pattern string have
3871       been turned into single characters, but there is no other processing of
3872       the  callout string argument. An additional code unit containing binary
3873       zero is present after the string, but is not included  in  the  length.
3874       The  delimiter  that was used to start the string is also stored within
3875       the pattern, immediately before the string itself. You can access  this
3876       delimiter as callout_string[-1] if you need it.
3877
3878       The callout_string_offset field is the code unit offset to the start of
3879       the callout argument string within the original pattern string. This is
3880       provided  for the benefit of applications such as script languages that
3881       might need to report errors in the callout string within the pattern.
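
       The distinction between the two kinds of callout might be handled as
       in the following sketch. It assumes the 8-bit library, so that the
       callout string can be treated as a char string, and it takes the
       relevant field values as separate arguments; describe_callout() is a
       hypothetical name. A real callout function would pass the values of
       block->callout_number, block->callout_string, and
       block->callout_string_length.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats a one-line description of a callout into
   "out" and returns the value given by snprintf(). A NULL string means
   a numerical callout. For a string callout, the starting delimiter is
   read from the code unit that immediately precedes the string. */
static int describe_callout(uint32_t callout_number,
  const char *callout_string, size_t callout_string_length,
  char *out, size_t outsize)
{
  if (callout_string != NULL) {
    char start = callout_string[-1];
    char end = (start == '{') ? '}' : start;
    return snprintf(out, outsize, "string callout %c%.*s%c", start,
      (int)callout_string_length, callout_string, end);
  }
  return snprintf(out, outsize, "numerical callout %u",
    (unsigned)callout_number);
}
```

       Because callout_string[-1] is read, the pointer passed must be one
       that was obtained from a callout block, or one that is known to have
       a delimiter stored in front of it.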
3882
3883   Fields for all callouts
3884
3885       The remaining fields in the callout block are the same for  both  kinds
3886       of callout.
3887
3888       The offset_vector field is a pointer to the vector of capturing offsets
3889       (the "ovector") that was passed to the matching function in  the  match
3890       data  block.  When pcre2_match() is used, the contents can be inspected
3891       in order to extract substrings that have been matched so  far,  in  the
3892       same  way as for extracting substrings after a match has completed. For
3893       the DFA matching function, this field is not useful.
3894
3895       The subject and subject_length fields contain copies of the values that
3896       were passed to the matching function.
3897
3898       The  start_match  field normally contains the offset within the subject
3899       at which the current match attempt  started.  However,  if  the  escape
3900       sequence  \K has been encountered, this value is changed to reflect the
3901       modified starting point. If the pattern is not  anchored,  the  callout
3902       function may be called several times from the same point in the pattern
3903       for different starting points in the subject.
3904
3905       The current_position field contains the offset within  the  subject  of
3906       the current match pointer.
3907
       When pcre2_match() is used, the capture_top field  contains  one  more
3909       than the number of the highest numbered captured substring so  far.  If
3910       no substrings have been captured, the value of capture_top is one. This
3911       is always the case when the DFA functions are used, because they do not
3912       support captured substrings.
3913
3914       The  capture_last  field  contains the number of the most recently cap-
3915       tured substring. However, when a recursion exits, the value reverts  to
3916       what  it  was  outside  the recursion, as do the values of all captured
3917       substrings. If no substrings have been  captured,  the  value  of  cap-
3918       ture_last is 0. This is always the case for the DFA matching functions.
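
       As a sketch of how offset_vector and capture_top might be used
       together inside a callout from pcre2_match(), the function below
       copies one numbered substring that has been captured so far. The name
       copy_capture() is hypothetical, the 8-bit library is assumed, and
       unset groups are assumed to be marked with PCRE2_UNSET, as they are
       after a completed match. Group 0 is deliberately refused, because the
       overall match is not yet complete when a callout is taken.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper. Copies captured substring "n" into "out" with a
   terminating zero and returns its length, or -1 if the group has not
   been captured so far or does not fit in the buffer. The arguments
   correspond to the subject, offset_vector, and capture_top fields. */
static long copy_capture(const char *subject, const size_t *ovector,
  uint32_t capture_top, uint32_t n, char *out, size_t outsize)
{
  const size_t unset = ~(size_t)0;    /* Same value as PCRE2_UNSET */
  size_t start, end, len;

  if (n == 0 || n >= capture_top) return -1;
  start = ovector[2 * n];
  end = ovector[2 * n + 1];
  if (start == unset || end == unset || end < start) return -1;

  len = end - start;
  if (len >= outsize) return -1;
  memcpy(out, subject + start, len);
  out[len] = '\0';
  return (long)len;
}
```

       When no substrings have been captured, capture_top is one, so every
       group number is refused.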
3919
3920       The pattern_position field contains the offset in the pattern string to
3921       the next item to be matched.
3922
3923       The next_item_length field contains the length of the next item  to  be
3924       matched in the pattern string. When the callout immediately precedes an
3925       alternation bar, a closing parenthesis, or the end of the pattern,  the
3926       length  is  zero. When the callout precedes an opening parenthesis, the
3927       length is that of the entire subpattern.
3928
3929       The pattern_position and next_item_length fields are intended  to  help
3930       in  distinguishing between different automatic callouts, which all have
3931       the same callout number. However, they are set for  all  callouts,  and
3932       are used by pcre2test to show the next item to be matched when display-
3933       ing callout information.
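
       A callout function that wants to display the next item in a similar
       way could use something like the sketch below. The callout block does
       not contain the pattern itself, so the application is assumed to have
       kept its own copy (for example, reachable through the callout data
       pointer). The name copy_next_item() is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. Copies the item that starts at pattern_position
   and is next_item_length code units long into "out", with a terminating
   zero. Returns the length copied, or -1 if the values lie outside the
   pattern or the item does not fit in the buffer. */
static long copy_next_item(const char *pattern, size_t pattern_length,
  size_t pattern_position, size_t next_item_length,
  char *out, size_t outsize)
{
  if (pattern_position > pattern_length ||
      next_item_length > pattern_length - pattern_position) return -1;
  if (next_item_length >= outsize) return -1;
  memcpy(out, pattern + pattern_position, next_item_length);
  out[next_item_length] = '\0';
  return (long)next_item_length;
}
```

       A zero length, as reported before an alternation bar, a closing
       parenthesis, or the end of the pattern, gives an empty string.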
3934
3935       In callouts from pcre2_match() the mark field contains a pointer to the
3936       zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
3937       (*THEN) item in the match, or NULL if no such items have  been  passed.
3938       Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
3939       previous (*MARK). In callouts from the DFA matching function this field
3940       always contains NULL.
3941
3942
3943RETURN VALUES FROM CALLOUTS
3944
3945       The external callout function returns an integer to PCRE2. If the value
3946       is zero, matching proceeds as normal. If  the  value  is  greater  than
3947       zero,  matching  fails  at  the current point, but the testing of other
3948       matching possibilities goes ahead, just as if a lookahead assertion had
3949       failed. If the value is less than zero, the match is abandoned, and the
3950       matching function returns the negative value.
3951
3952       Negative  values  should  normally  be   chosen   from   the   set   of
3953       PCRE2_ERROR_xxx  values.  In  particular,  PCRE2_ERROR_NOMATCH forces a
3954       standard "no match" failure. The error  number  PCRE2_ERROR_CALLOUT  is
3955       reserved  for  use by callout functions; it will never be used by PCRE2
3956       itself.
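
       One possible use of a negative return value is to abandon a match
       that is taking too many callouts. The sketch below keeps a budget in
       a structure that would be supplied as the callout data; the structure
       and function names are hypothetical. The abort code is chosen by the
       application and is expected to be negative, for example
       PCRE2_ERROR_CALLOUT.

```c
/* Hypothetical callout data: the number of callouts still allowed, and
   the negative value to return once the budget has been used up. */
struct callout_budget {
  unsigned remaining;
  int abort_code;
};

/* Returns 0 (matching proceeds as normal) while the budget lasts, and
   the abort code (the match is abandoned) after that. */
static int budget_decision(struct callout_budget *budget)
{
  if (budget->remaining == 0) return budget->abort_code;
  budget->remaining--;
  return 0;
}
```

       A callout function would simply return the result of calling
       budget_decision() on its second argument. Returning a positive value
       instead would make only the current path fail, leaving the matcher
       free to try other possibilities.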
3957
3958
3959CALLOUT ENUMERATION
3960
3961       int pcre2_callout_enumerate(const pcre2_code *code,
3962         int (*callback)(pcre2_callout_enumerate_block *, void *),
3963         void *user_data);
3964
3965       A script language that supports the use of string arguments in callouts
3966       might  like  to  scan  all the callouts in a pattern before running the
3967       match. This can be done by calling pcre2_callout_enumerate(). The first
3968       argument  is  a  pointer  to a compiled pattern, the second points to a
3969       callback function, and the third is arbitrary user data.  The  callback
3970       function  is  called  for  every callout in the pattern in the order in
3971       which they appear. Its first argument is a pointer to a callout enumer-
3972       ation  block,  and  its second argument is the user_data value that was
3973       passed to pcre2_callout_enumerate(). The data block contains  the  fol-
3974       lowing fields:
3975
3976         version                Block version number
3977         pattern_position       Offset to next item in pattern
3978         next_item_length       Length of next item in pattern
3979         callout_number         Number for numbered callouts
3980         callout_string_offset  Offset to string within pattern
3981         callout_string_length  Length of callout string
3982         callout_string         Points to callout string or is NULL
3983
3984       The  version  number is currently 0. It will increase if new fields are
3985       ever added to the block. The remaining fields are  the  same  as  their
3986       namesakes  in  the pcre2_callout block that is used for callouts during
3987       matching, as described above.
3988
3989       Note that the value of pattern_position is  unique  for  each  callout.
3990       However,  if  a callout occurs inside a group that is quantified with a
3991       non-zero minimum or a fixed maximum, the group is replicated inside the
3992       compiled  pattern.  For example, a pattern such as /(a){2}/ is compiled
3993       as if it were /(a)(a)/. This means that the callout will be  enumerated
3994       more  than  once,  but with the same value for pattern_position in each
3995       case.
3996
3997       The callback function should normally return zero. If it returns a non-
3998       zero value, scanning the pattern stops, and that value is returned from
3999       pcre2_callout_enumerate().
4000
4001
4002AUTHOR
4003
4004       Philip Hazel
4005       University Computing Service
4006       Cambridge, England.
4007
4008
4009REVISION
4010
4011       Last updated: 23 March 2015
4012       Copyright (c) 1997-2015 University of Cambridge.
4013------------------------------------------------------------------------------
4014
4015
4016PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)
4017
4018
4019
4020NAME
4021       PCRE2 - Perl-compatible regular expressions (revised API)
4022
4023DIFFERENCES BETWEEN PCRE2 AND PERL
4024
4025       This document describes the differences in the ways that PCRE2 and Perl
4026       handle regular expressions. The differences  described  here  are  with
4027       respect to Perl versions 5.10 and above.
4028
4029       1.  PCRE2  has only a subset of Perl's Unicode support. Details of what
4030       it does have are given in the pcre2unicode page.
4031
4032       2. PCRE2 allows repeat quantifiers only  on  parenthesized  assertions,
4033       but  they  do not mean what you might think. For example, (?!a){3} does
4034       not assert that the next three characters are not "a". It just  asserts
4035       that  the  next  character  is not "a" three times (in principle: PCRE2
4036       optimizes this to run the assertion  just  once).  Perl  allows  repeat
4037       quantifiers  on  other  assertions such as \b, but these do not seem to
4038       have any use.
4039
4040       3. Capturing subpatterns that occur inside  negative  lookahead  asser-
4041       tions  are  counted,  but their entries in the offsets vector are never
4042       set. Perl sometimes (but not always) sets its numerical variables  from
4043       inside negative assertions.
4044
4045       4.  The  following Perl escape sequences are not supported: \l, \u, \L,
4046       \U, and \N when followed by a character name or Unicode value.  (\N  on
4047       its own, matching a non-newline character, is supported.) In fact these
4048       are implemented by Perl's general string-handling and are not  part  of
4049       its  pattern matching engine. If any of these are encountered by PCRE2,
4050       an error is generated by default. However, if the PCRE2_ALT_BSUX option
4051       is set, \U and \u are interpreted as ECMAScript interprets them.
4052
4053       5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
4054       is built with Unicode support. The properties that can be  tested  with
4055       \p and \P are limited to the general category properties such as Lu and
4056       Nd, script names such as Greek or Han, and the derived  properties  Any
4057       and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
4058       not; the Perl documentation says "Because Perl hides the need  for  the
4059       user  to  understand the internal representation of Unicode characters,
4060       there is no need to implement the  somewhat  messy  concept  of  surro-
4061       gates."
4062
4063       6.  PCRE2 does support the \Q...\E escape for quoting substrings. Char-
4064       acters in between are treated as literals. This is  slightly  different
4065       from  Perl  in  that  $  and  @ are also handled as literals inside the
4066       quotes. In Perl, they cause variable interpolation (but of course PCRE2
4067       does not have variables).  Note the following examples:
4068
           Pattern            PCRE2 matches      Perl matches

           \Qabc$xyz\E        abc$xyz            abc followed by the
                                                   contents of $xyz
           \Qabc\$xyz\E       abc\$xyz           abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz            abc$xyz
4075
4076       The  \Q...\E  sequence  is recognized both inside and outside character
4077       classes.
4078
4079       7.  Fairly  obviously,  PCRE2  does  not  support  the  (?{code})   and
4080       (??{code})  constructions. However, there is support for recursive pat-
4081       terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
4082       the  PCRE2  "callout"  feature allows an external function to be called
4083       during  pattern  matching.  See  the  pcre2callout  documentation   for
4084       details.
4085
4086       8.  Subroutine  calls  (whether recursive or not) are treated as atomic
4087       groups.  Atomic recursion is like Python,  but  unlike  Perl.  Captured
4088       values  that  are  set outside a subroutine call can be referenced from
4089       inside in PCRE2, but not in Perl. There is a discussion  that  explains
4090       these  differences  in  more detail in the section on recursion differ-
4091       ences from Perl in the pcre2pattern page.
4092
4093       9. If any of the backtracking control verbs are used  in  a  subpattern
4094       that  is  called  as  a  subroutine (whether or not recursively), their
4095       effect is confined to that subpattern; it does not extend to  the  sur-
4096       rounding  pattern.  This is not always the case in Perl. In particular,
4097       if (*THEN) is present in a group that is called as  a  subroutine,  its
4098       action is limited to that group, even if the group does not contain any
4099       | characters. Note that such subpatterns are processed as  anchored  at
4100       the point where they are tested.
4101
4102       10.  If a pattern contains more than one backtracking control verb, the
4103       first one that is backtracked onto acts. For example,  in  the  pattern
4104       A(*COMMIT)B(*PRUNE)C  a  failure in B triggers (*COMMIT), but a failure
4105       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4106       it is the same as PCRE2, but there are examples where it differs.
4107
4108       11.  Most  backtracking  verbs in assertions have their normal actions.
4109       They are not confined to the assertion.
4110
4111       12. There are some differences that are concerned with the settings  of
4112       captured  strings  when  part  of  a  pattern is repeated. For example,
4113       matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
4114       unset, but in PCRE2 it is set to "b".
4115
       13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
       pattern names is not as general as Perl's. This is a consequence of the
       fact that PCRE2 works internally just with numbers, using  an  external
       table to translate between numbers and names. In particular, a  pattern
       such as (?|(?<a>A)|(?<b>B)), where the two capturing  parentheses  have
       the same number but different names, is not supported,  and  causes  an
       error  at compile time. If it were allowed, it would not be possible to
       distinguish which parentheses matched, because both names map  to  cap-
       turing subpattern number 1.
4126
4127       14. Perl recognizes comments in some places that PCRE2  does  not,  for
4128       example,  between  the  ( and ? at the start of a subpattern. If the /x
4129       modifier is set, Perl allows white space between ( and ?  (though  cur-
4130       rent  Perls warn that this is deprecated) but PCRE2 never does, even if
4131       the PCRE2_EXTENDED option is set.
4132
4133       15. Perl, when in warning mode, gives warnings  for  character  classes
4134       such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
4135       als. PCRE2 has no warning features, so it gives an error in these cases
4136       because they are almost certainly user mistakes.
4137
4138       16.  In  PCRE2, the upper/lower case character properties Lu and Ll are
4139       not affected when case-independent matching is specified. For  example,
4140       \p{Lu} always matches an upper case letter. I think Perl has changed in
4141       this respect; in the release at the time of writing (5.16), \p{Lu}  and
4142       \p{Ll} match all letters, regardless of case, when case independence is
4143       specified.
4144
4145       17. PCRE2 provides some  extensions  to  the  Perl  regular  expression
4146       facilities.   Perl  5.10  includes new features that are not in earlier
4147       versions of Perl, some of which (such as named parentheses)  have  been
4148       in PCRE2 for some time. This list is with respect to Perl 5.10:
4149
4150       (a)  Although  lookbehind  assertions  in PCRE2 must match fixed length
4151       strings, each alternative branch of a lookbehind assertion can match  a
4152       different  length  of  string.  Perl requires them all to have the same
4153       length.
4154
4155       (b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set,  the
4156       $ meta-character matches only at the very end of the string.
4157
4158       (c)  A  backslash  followed  by  a  letter  with  no special meaning is
4159       faulted. (Perl can be made to issue a warning.)
4160
4161       (d) If PCRE2_UNGREEDY is set, the greediness of the repetition  quanti-
4162       fiers is inverted, that is, by default they are not greedy, but if fol-
4163       lowed by a question mark they are.
4164
4165       (e) PCRE2_ANCHORED can be used at matching time to force a  pattern  to
4166       be tried only at the first matching position in the subject string.
4167
4168       (f)      The      PCRE2_NOTBOL,      PCRE2_NOTEOL,      PCRE2_NOTEMPTY,
4169       PCRE2_NOTEMPTY_ATSTART, and PCRE2_NO_AUTO_CAPTURE options have no  Perl
4170       equivalents.
4171
4172       (g)  The  \R escape sequence can be restricted to match only CR, LF, or
4173       CRLF by the PCRE2_BSR_ANYCRLF option.
4174
4175       (h) The callout facility is PCRE2-specific.
4176
4177       (i) The partial matching facility is PCRE2-specific.
4178
       (j) The alternative matching function (pcre2_dfa_match()) matches in  a
       different way and is not Perl-compatible.
4181
4182       (k)  PCRE2 recognizes some special sequences such as (*CR) at the start
4183       of a pattern that set overall options that cannot be changed within the
4184       pattern.
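
       As an illustration of item 6 above, an application that wants PCRE2
       to treat arbitrary text literally could wrap it in \Q...\E. The sketch
       below does this; the names append() and quote_literal() are
       hypothetical. The one sequence that needs care is \E inside the text,
       which would end the quoting early, so it is emitted outside the
       quotes as an escaped backslash followed by E.

```c
#include <stddef.h>
#include <string.h>

/* Appends "s" to "out" at offset *n, keeping the buffer zero-terminated.
   Returns 0 on success or -1 if the buffer is too small. */
static int append(char *out, size_t outsize, size_t *n, const char *s)
{
  size_t len = strlen(s);
  if (*n + len >= outsize) return -1;
  memcpy(out + *n, s, len);
  *n += len;
  out[*n] = '\0';
  return 0;
}

/* Hypothetical helper. Writes a pattern that matches "literal" exactly,
   using \Q...\E, and returns its length, or -1 if it does not fit. */
static long quote_literal(const char *literal, char *out, size_t outsize)
{
  size_t n = 0;
  const char *p = literal;

  if (append(out, outsize, &n, "\\Q") != 0) return -1;
  while (*p != '\0') {
    if (p[0] == '\\' && p[1] == 'E') {
      /* Close the quotes, match backslash and E, reopen the quotes */
      if (append(out, outsize, &n, "\\E\\\\E\\Q") != 0) return -1;
      p += 2;
    } else {
      char one[2];
      one[0] = *p++;
      one[1] = '\0';
      if (append(out, outsize, &n, one) != 0) return -1;
    }
  }
  if (append(out, outsize, &n, "\\E") != 0) return -1;
  return (long)n;
}
```

       For the text abc$xyz this produces \Qabc$xyz\E, which corresponds to
       the first row of the table in item 6.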
4185
4186
4187AUTHOR
4188
4189       Philip Hazel
4190       University Computing Service
4191       Cambridge, England.
4192
4193
4194REVISION
4195
4196       Last updated: 15 March 2015
4197       Copyright (c) 1997-2015 University of Cambridge.
4198------------------------------------------------------------------------------
4199
4200
4201PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)
4202
4203
4204
4205NAME
4206       PCRE2 - Perl-compatible regular expressions (revised API)
4207
4208PCRE2 JUST-IN-TIME COMPILER SUPPORT
4209
4210       Just-in-time  compiling  is a heavyweight optimization that can greatly
4211       speed up pattern matching. However, it comes at the cost of extra  pro-
4212       cessing  before  the  match is performed, so it is of most benefit when
4213       the same pattern is going to be matched many times. This does not  nec-
4214       essarily  mean many calls of a matching function; if the pattern is not
4215       anchored, matching attempts may take place many times at various  posi-
4216       tions in the subject, even for a single call. Therefore, if the subject
4217       string is very long, it may still pay  to  use  JIT  even  for  one-off
4218       matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
4219       32-bit PCRE2 libraries.
4220
4221       JIT support applies only to the  traditional  Perl-compatible  matching
4222       function.   It  does  not apply when the DFA matching function is being
4223       used. The code for this support was written by Zoltan Herczeg.
4224
4225
4226AVAILABILITY OF JIT SUPPORT
4227
4228       JIT support is an optional feature of  PCRE2.  The  "configure"  option
4229       --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
4230       built if you want to use JIT. The support is limited to  the  following
4231       hardware platforms:
4232
4233         ARM 32-bit (v5, v7, and Thumb2)
4234         ARM 64-bit
4235         Intel x86 32-bit and 64-bit
4236         MIPS 32-bit and 64-bit
4237         Power PC 32-bit and 64-bit
4238         SPARC 32-bit
4239
4240       If --enable-jit is set on an unsupported platform, compilation fails.
4241
4242       A  program  can  tell if JIT support is available by calling pcre2_con-
4243       fig() with the PCRE2_CONFIG_JIT option. The result is  1  when  JIT  is
4244       available,  and 0 otherwise. However, a simple program does not need to
4245       check this in order to use JIT. The API is implemented in  a  way  that
4246       falls  back  to the interpretive code if JIT is not available. For pro-
4247       grams that need the best possible performance, there is  also  a  "fast
4248       path" API that is JIT-specific.
4249
4250
4251SIMPLE USE OF JIT
4252
4253       To  make use of the JIT support in the simplest way, all you have to do
4254       is to call pcre2_jit_compile() after successfully compiling  a  pattern
4255       with pcre2_compile(). This function has two arguments: the first is the
4256       compiled pattern pointer that was returned by pcre2_compile(), and  the
4257       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
4258       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
4259
4260       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
4261       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
4262       pattern is passed to the JIT compiler, which turns it into machine code
4263       that executes much faster than the normal interpretive code, but yields
4264       exactly the same results. The returned value  from  pcre2_jit_compile()
4265       is zero on success, or a negative error code.
4266
4267       There  is  a limit to the size of pattern that JIT supports, imposed by
4268       the size of machine stack that it uses. The exact rules are  not  docu-
4269       mented  because  they  may  change at any time, in particular, when new
4270       optimizations are introduced.  If a pattern  is  too  big,  a  call  to
4271       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
4272
       PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for
       complete matches. If you want to run partial matches using the
       PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you
       should set one or both of the other options as well as, or instead of,
       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
       for each of the three modes (normal, soft partial, hard partial). When
       pcre2_match() is called, the appropriate code is run if it is
       available. Otherwise, the pattern is matched using interpretive code.
4281
4282       You can call pcre2_jit_compile() multiple times for the  same  compiled
4283       pattern.  It does nothing if it has previously compiled code for any of
4284       the option bits. For example, you can call it once with  PCRE2_JIT_COM-
4285       PLETE  and  (perhaps  later,  when  you find you need partial matching)
4286       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
4287       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
4288       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
4289       diately returns zero. This is an alternative way of testing whether JIT
4290       is available.
4291
4292       At present, it is not possible to free JIT compiled  code  except  when
4293       the entire compiled pattern is freed by calling pcre2_code_free().
4294
4295       In  some circumstances you may need to call additional functions. These
4296       are described in the  section  entitled  "Controlling  the  JIT  stack"
4297       below.
4298
       There are some pcre2_match() options that are not supported by JIT, and
       there are also some pattern items that JIT cannot handle. Details are
       given below. In both cases, matching automatically falls back to the
       interpretive code. If you want to know whether JIT was actually used
       for a particular match, you should arrange for a JIT callback function
       to be set up as described in the section entitled "Controlling the JIT
       stack" below, even if you do not need to supply a non-default JIT
       stack. Such a callback function is called whenever JIT code is about to
       be run. If the match-time options are not right for JIT execution, the
       callback function is not called.
4309
4310       If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
4311       ated.  You  can find out if JIT matching is available after compiling a
4312       pattern by calling  pcre2_pattern_info()  with  the  PCRE2_INFO_JITSIZE
4313       option.  A non-zero result means that JIT compilation was successful. A
4314       result of 0 means that JIT support is not available, or the pattern was
4315       not  processed by pcre2_jit_compile(), or the JIT compiler was not able
4316       to handle the pattern.
4317
4318
4319UNSUPPORTED OPTIONS AND PATTERN ITEMS
4320
4321       The pcre2_match() options that  are  supported  for  JIT  matching  are
4322       PCRE2_NOTBOL,   PCRE2_NOTEOL,  PCRE2_NOTEMPTY,  PCRE2_NOTEMPTY_ATSTART,
4323       PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and  PCRE2_PARTIAL_SOFT.  The
4324       PCRE2_ANCHORED option is not supported at match time.
4325
4326       If  the  PCRE2_NO_JIT option is passed to pcre2_match() it disables the
4327       use of JIT, forcing matching by the interpreter code.
4328
4329       The only unsupported pattern items are \C (match a  single  data  unit)
4330       when  running in a UTF mode, and a callout immediately before an asser-
4331       tion condition in a conditional group.
4332
4333
4334RETURN VALUES FROM JIT MATCHING
4335
4336       When a pattern is matched using JIT matching, the return values are the
4337       same  as  those  given by the interpretive pcre2_match() code, with the
4338       addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This  means
4339       that  the memory used for the JIT stack was insufficient. See "Control-
4340       ling the JIT stack" below for a discussion of JIT stack usage.
4341
4342       The error code PCRE2_ERROR_MATCHLIMIT is returned by the  JIT  code  if
4343       searching  a  very large pattern tree goes on for too long, as it is in
4344       the same circumstance when JIT is not used, but the details of  exactly
4345       what  is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
4346       code is never returned when JIT matching is used.
4347
4348
4349CONTROLLING THE JIT STACK
4350
4351       When the compiled JIT code runs, it needs a block of memory to use as a
4352       stack.   By  default,  it  uses 32K on the machine stack. However, some
4353       large  or  complicated  patterns  need  more  than  this.   The   error
4354       PCRE2_ERROR_JIT_STACKLIMIT  is  given  when  there is not enough stack.
4355       Three functions are provided for managing blocks of memory for  use  as
4356       JIT  stacks. There is further discussion about the use of JIT stacks in
4357       the section entitled "JIT stack FAQ" below.
4358
4359       The pcre2_jit_stack_create() function creates a JIT  stack.  Its  argu-
4360       ments  are  a starting size, a maximum size, and a general context (for
4361       memory allocation functions, or NULL for standard  memory  allocation).
4362       It returns a pointer to an opaque structure of type pcre2_jit_stack, or
4363       NULL if there is an error. The pcre2_jit_stack_free() function is  used
4364       to  free a stack that is no longer needed. (For the technically minded:
4365       the address space is allocated by mmap or VirtualAlloc.)
4366
4367       JIT uses far less memory for recursion than the interpretive code,  and
4368       a  maximum  stack size of 512K to 1M should be more than enough for any
4369       pattern.
4370
4371       The pcre2_jit_stack_assign() function specifies which  stack  JIT  code
4372       should use. Its arguments are as follows:
4373
4374         pcre2_match_context  *mcontext
4375         pcre2_jit_callback    callback
4376         void                 *data
4377
       The first argument is a pointer to a match context. When this is
       subsequently passed to a matching function, its information determines
       which JIT stack is used. There are three cases for the values of the
       other two arguments:
4382
4383         (1) If callback is NULL and data is NULL, an internal 32K block
4384             on the machine stack is used. This is the default when a match
4385             context is created.
4386
4387         (2) If callback is NULL and data is not NULL, data must be
4388             a pointer to a valid JIT stack, the result of calling
4389             pcre2_jit_stack_create().
4390
4391         (3) If callback is not NULL, it must point to a function that is
4392             called with data as an argument at the start of matching, in
4393             order to set up a JIT stack. If the return from the callback
4394             function is NULL, the internal 32K stack is used; otherwise the
4395             return value must be a valid JIT stack, the result of calling
4396             pcre2_jit_stack_create().
4397
       A callback function is called whenever JIT code is about to be run; it
       is not called when pcre2_match() is called with options that are
       incompatible with JIT matching. A callback function can therefore be
       used to determine whether a match operation was executed by JIT or by
       the interpreter.
4403
4404       You may safely use the same JIT stack for more than one pattern (either
4405       by  assigning  directly  or  by  callback), as long as the patterns are
4406       matched sequentially in the same thread. Currently, the only way to set
4407       up  non-sequential matches in one thread is to use callouts: if a call-
4408       out function starts another match, that match must use a different  JIT
4409       stack to the one used for currently suspended match(es).
4410
4411       In  a multithread application, if you do not specify a JIT stack, or if
4412       you assign or pass back NULL from  a  callback,  that  is  thread-safe,
4413       because  each  thread has its own machine stack. However, if you assign
4414       or pass back a non-NULL JIT stack, this must be a different  stack  for
4415       each thread so that the application is thread-safe.
4416
4417       Strictly  speaking,  even more is allowed. You can assign the same non-
4418       NULL stack to a match context that is used by any number  of  patterns,
4419       as  long  as  they are not used for matching by multiple threads at the
4420       same time. For example, you could use the same stack  in  all  compiled
4421       patterns,  with  a global mutex in the callback to wait until the stack
4422       is available for use. However, this is an inefficient solution, and not
4423       recommended.
4424
4425       This  is a suggestion for how a multithreaded program that needs to set
4426       up non-default JIT stacks might operate:
4427
         During thread initialization
4429           thread_local_var = pcre2_jit_stack_create(...)
4430
4431         During thread exit
4432           pcre2_jit_stack_free(thread_local_var)
4433
4434         Use a one-line callback function
4435           return thread_local_var
4436
4437       All the functions described in this section do nothing if  JIT  is  not
4438       available.
4439
4440
4441JIT STACK FAQ
4442
4443       (1) Why do we need JIT stacks?
4444
       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Extending the real machine stack is difficult on some
       platforms. On PowerPC, for example, the stack chain has to be updated
       every time the stack is extended. This is possible, but the overhead
       of the updates reduces performance, so the recursion is done in a
       separately managed block of memory instead.
4451
4452       (2) Why don't we simply allocate blocks of memory with malloc()?
4453
4454       Modern operating systems have a  nice  feature:  they  can  reserve  an
4455       address space instead of allocating memory. We can safely allocate mem-
4456       ory pages inside this address space, so the stack  could  grow  without
4457       moving memory data (this is important because of pointers). Thus we can
4458       allocate 1M address space, and use only a single memory  page  (usually
4459       4K)  if  that is enough. However, we can still grow up to 1M anytime if
4460       needed.
4461
4462       (3) Who "owns" a JIT stack?
4463
       The owner of the stack is the user program, not the JIT-compiled
       pattern or anything else. The user program must ensure that while a
       stack is being used by pcre2_match() (that is, it is assigned to a
       match context that is passed to a match that is currently running),
       that stack is not used by any other thread (to avoid overwriting the
       same memory area). The best practice for multithreaded programs is to
       allocate a stack for each thread, and return this stack through the JIT
       callback function.
4471
4472       (4) When should a JIT stack be freed?
4473
       You can free a JIT stack at any time, as long as it will not be used by
       pcre2_match() again. When you assign the stack to a match context, only
       a pointer is set. There is no reference counting or any other magic.
       You can free compiled patterns, contexts, and stacks in any order, at
       any time. Just do not call pcre2_match() with a match context pointing
       to an already freed stack, as that will cause a segmentation fault.
       (Also, do not free a stack currently used by pcre2_match() in another
       thread.) You can also replace the stack in a context at any time when
       it is not in use. You should free the previous stack before assigning a
       replacement.
4483
4484       (5) Should I allocate/free a  stack  every  time  before/after  calling
4485       pcre2_match()?
4486
       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve this
       without keeping a list of patterns.
4491
4492       (6) OK, the stack is for long term memory allocation. But what  happens
4493       if  a pattern causes stack overflow with a stack of 1M? Is that 1M kept
4494       until the stack is freed?
4495
       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
       the moment. A function that returns the memory currently allocated for
       a stack, and another that releases memory (shrinking the stack), would
       probably be a good idea if someone needs this.
4501
4502       (7) This is too much of a headache. Isn't there any better solution for
4503       JIT stack handling?
4504
4505       No,  thanks to Windows. If POSIX threads were used everywhere, we could
4506       throw out this complicated API.
4507
4508
4509FREEING JIT SPECULATIVE MEMORY
4510
4511       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
4512
       The JIT executable allocator does not free all memory as soon as it is
       possible to do so. It expects new allocations, and keeps some free
       memory around to improve allocation speed. However, in low memory
       conditions, it might be better to free all possible memory. You can
       cause this to happen by calling pcre2_jit_free_unused_memory(). Its
       argument is a general context, for custom memory management, or NULL
       for standard memory management.
4520
4521
4522EXAMPLE CODE
4523
4524       This is a single-threaded example that specifies a  JIT  stack  without
4525       using  a  callback.  A real program should include error checking after
4526       all the function calls.
4527
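         #define PCRE2_CODE_UNIT_WIDTH 8
         #include <pcre2.h>

         /* pattern and subject (both PCRE2_SPTR) and length (PCRE2_SIZE)
            are assumed to be set up elsewhere in the program. */
         int rc;
         int errornumber;
         PCRE2_SIZE erroffset;
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;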
4533
4534         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
4535           &errornumber, &erroffset, NULL);
4536         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
4537         mcontext = pcre2_match_context_create(NULL);
4538         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
4539         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
4540         match_data = pcre2_match_data_create(re, 10);
4541         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
4542         /* Process result */
4543
4544         pcre2_code_free(re);
4545         pcre2_match_data_free(match_data);
4546         pcre2_match_context_free(mcontext);
4547         pcre2_jit_stack_free(jit_stack);
4548
4549
4550JIT FAST PATH API
4551
       Because the API described above falls back to interpreted matching when
       JIT is not available, it is convenient for programs that are written
       for general use in many environments. However, calling JIT via
       pcre2_match() does have a performance impact. Programs that are written
       for use where JIT is known to be available, and which need the best
       possible performance, can use a "fast path" API to call JIT matching
       directly instead of calling pcre2_match() (obviously only for patterns
       that have been successfully processed by pcre2_jit_compile()).
4560
4561       The  fast  path  function  is  called  pcre2_jit_match(),  and it takes
4562       exactly the same arguments as pcre2_match(). The return values are also
4563       the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
4564       complete) is requested that was not compiled. Unsupported  option  bits
4565       (for  example,  PCRE2_ANCHORED)  are  ignored,  as  is the PCRE2_NO_JIT
4566       option.
4567
4568       When you call pcre2_match(), as well as testing for invalid options,  a
4569       number of other sanity checks are performed on the arguments. For exam-
4570       ple, if the subject pointer is NULL, an immediate error is given. Also,
4571       unless  PCRE2_NO_UTF_CHECK  is  set, a UTF subject string is tested for
4572       validity. In the interests of speed, these checks do not happen on  the
4573       JIT fast path, and if invalid data is passed, the result is undefined.
4574
4575       Bypassing  the  sanity  checks  and the pcre2_match() wrapping can give
4576       speedups of more than 10%.
4577
4578
4579SEE ALSO
4580
4581       pcre2api(3)
4582
4583
4584AUTHOR
4585
4586       Philip Hazel (FAQ by Zoltan Herczeg)
4587       University Computing Service
4588       Cambridge, England.
4589
4590
4591REVISION
4592
4593       Last updated: 05 June 2016
4594       Copyright (c) 1997-2016 University of Cambridge.
4595------------------------------------------------------------------------------
4596
4597
4598PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)
4599
4600
4601
4602NAME
4603       PCRE2 - Perl-compatible regular expressions (revised API)
4604
4605SIZE AND OTHER LIMITATIONS
4606
4607       There are some size limitations in PCRE2 but it is hoped that they will
4608       never in practice be relevant.
4609
4610       The maximum size of a compiled pattern is approximately 64K code  units
4611       for  the  8-bit  and  16-bit  libraries  if  PCRE2 is compiled with the
4612       default internal linkage size, which is 2 bytes for these libraries. If
4613       you  want  to  process regular expressions that are truly enormous, you
4614       can compile PCRE2 with an internal linkage size of 3 or 4 (when  build-
4615       ing  the  16-bit library, 3 is rounded up to 4). See the README file in
4616       the source distribution and the pcre2build documentation  for  details.
4617       In  these  cases the limit is substantially larger.  However, the speed
4618       of execution is slower. In the 32-bit  library,  the  internal  linkage
4619       size is always 4.
4620
4621       The maximum length of a source pattern string is essentially unlimited;
4622       it is the largest number a PCRE2_SIZE variable can hold.  However,  the
4623       program that calls pcre2_compile() can specify a smaller limit.
4624
4625       The maximum length (in code units) of a subject string is one less than
4626       the largest number a PCRE2_SIZE variable can  hold.  PCRE2_SIZE  is  an
4627       unsigned  integer  type,  usually  defined as size_t. Its maximum value
4628       (that is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-
4629       terminated strings and unset offsets.
4630
4631       Note  that  when  using  the  traditional matching function, PCRE2 uses
4632       recursion to handle subpatterns and indefinite repetition.  This  means
4633       that  the  available stack space may limit the size of a subject string
4634       that can be processed by certain patterns. For a  discussion  of  stack
4635       issues, see the pcre2stack documentation.
4636
4637       All values in repeating quantifiers must be less than 65536.
4638
4639       The maximum length of a lookbehind assertion is 65535 characters.
4640
4641       There is no limit to the number of parenthesized subpatterns, but there
4642       can be no more than 65535 capturing subpatterns. There is,  however,  a
4643       limit  to  the  depth  of  nesting  of parenthesized subpatterns of all
4644       kinds. This is imposed in order to limit the  amount  of  system  stack
4645       used  at  compile time. The limit can be specified when PCRE2 is built;
4646       the default is 250.
4647
4648       There is a limit to the number of forward references to subsequent sub-
4649       patterns  of  around  200,000.  Repeated  forward references with fixed
4650       upper limits, for example, (?2){0,100} when subpattern number 2  is  to
4651       the  right,  are included in the count. There is no limit to the number
4652       of backward references.
4653
       The maximum length of a name for a named subpattern is 32 code units,
       and the maximum number of named subpatterns is 10000.
4656
4657       The  maximum  length  of  a  name  in  a (*MARK), (*PRUNE), (*SKIP), or
4658       (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit  and
4659       32-bit libraries.
4660
4661
4662AUTHOR
4663
4664       Philip Hazel
4665       University Computing Service
4666       Cambridge, England.
4667
4668
4669REVISION
4670
4671       Last updated: 05 November 2015
4672       Copyright (c) 1997-2015 University of Cambridge.
4673------------------------------------------------------------------------------
4674
4675
4676PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)
4677
4678
4679
4680NAME
4681       PCRE2 - Perl-compatible regular expressions (revised API)
4682
4683PCRE2 MATCHING ALGORITHMS
4684
       This document describes the two different algorithms that are available
       in PCRE2 for matching a compiled regular expression against a given
       subject string. The "standard" algorithm is the one provided by the
       pcre2_match() function. This works in the same way as Perl's matching
       function, and provides a Perl-compatible matching operation. The
       just-in-time (JIT) optimization that is described in the pcre2jit
       documentation is compatible with this function.
4692
4693       An alternative algorithm is provided by the pcre2_dfa_match() function;
4694       it operates in a different way, and is not Perl-compatible. This alter-
4695       native  has  advantages  and  disadvantages  compared with the standard
4696       algorithm, and these are described below.
4697
4698       When there is only one possible way in which a given subject string can
4699       match  a pattern, the two algorithms give the same answer. A difference
4700       arises, however, when there are multiple possibilities. For example, if
4701       the pattern
4702
4703         ^<.*>
4704
4705       is matched against the string
4706
4707         <something> <something else> <something further>
4708
4709       there are three possible answers. The standard algorithm finds only one
4710       of them, whereas the alternative algorithm finds all three.
4711
4712
4713REGULAR EXPRESSIONS AS TREES
4714
4715       The set of strings that are matched by a regular expression can be rep-
4716       resented  as  a  tree structure. An unlimited repetition in the pattern
4717       makes the tree of infinite size, but it is still a tree.  Matching  the
4718       pattern  to a given subject string (from a given starting point) can be
4719       thought of as a search of the tree.  There are two  ways  to  search  a
4720       tree:  depth-first  and  breadth-first, and these correspond to the two
4721       matching algorithms provided by PCRE2.
4722
4723
4724THE STANDARD MATCHING ALGORITHM
4725
4726       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
4727       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
4728       depth-first search of the pattern tree. That is, it  proceeds  along  a
4729       single path through the tree, checking that the subject matches what is
4730       required. When there is a mismatch, the algorithm  tries  any  alterna-
4731       tives  at  the  current point, and if they all fail, it backs up to the
4732       previous branch point in the  tree,  and  tries  the  next  alternative
4733       branch  at  that  level.  This often involves backing up (moving to the
4734       left) in the subject string as well.  The  order  in  which  repetition
4735       branches  are  tried  is controlled by the greedy or ungreedy nature of
4736       the quantifier.
4737
4738       If a leaf node is reached, a matching string has  been  found,  and  at
4739       that  point the algorithm stops. Thus, if there is more than one possi-
4740       ble match, this algorithm returns the first one that it finds.  Whether
4741       this  is the shortest, the longest, or some intermediate length depends
4742       on the way the greedy and ungreedy repetition quantifiers are specified
4743       in the pattern.
4744
4745       Because  it  ends  up  with a single path through the tree, it is rela-
4746       tively straightforward for this algorithm to keep  track  of  the  sub-
4747       strings  that  are  matched  by portions of the pattern in parentheses.
4748       This provides support for capturing parentheses and back references.
4749
4750
4751THE ALTERNATIVE MATCHING ALGORITHM
4752
4753       This algorithm conducts a breadth-first search of  the  tree.  Starting
4754       from  the  first  matching  point  in the subject, it scans the subject
4755       string from left to right, once, character by character, and as it does
4756       this,  it remembers all the paths through the tree that represent valid
4757       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
4758       though  it is not implemented as a traditional finite state machine (it
4759       keeps multiple states active simultaneously).
4760
4761       Although the general principle of this matching algorithm  is  that  it
4762       scans  the subject string only once, without backtracking, there is one
4763       exception: when a lookaround assertion is encountered,  the  characters
4764       following  or  preceding  the  current  point  have to be independently
4765       inspected.
4766
4767       The scan continues until either the end of the subject is  reached,  or
4768       there  are  no more unterminated paths. At this point, terminated paths
4769       represent the different matching possibilities (if there are none,  the
4770       match  has  failed).   Thus,  if there is more than one possible match,
4771       this algorithm finds all of them, and in particular, it finds the long-
4772       est.  The  matches are returned in decreasing order of length. There is
4773       an option to stop the algorithm after the first match (which is  neces-
4774       sarily the shortest) is found.
4775
4776       Note that all the matches that are found start at the same point in the
4777       subject. If the pattern
4778
4779         cat(er(pillar)?)?
4780
4781       is matched against the string "the caterpillar catchment",  the  result
4782       is  the  three  strings "caterpillar", "cater", and "cat" that start at
4783       the fifth character of the subject. The algorithm  does  not  automati-
4784       cally move on to find matches that start at later positions.
4785
4786       PCRE2's "auto-possessification" optimization usually applies to charac-
4787       ter repeats at the end of a pattern (as well as internally). For  exam-
4788       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
4789       is no point even considering the possibility of backtracking  into  the
4790       repeated  digits.  For  DFA matching, this means that only one possible
4791       match is found. If you really do want multiple matches in  such  cases,
4792       either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
4793       SESS option when compiling.
4794
4795       There are a number of features of PCRE2 regular  expressions  that  are
4796       not  supported  by the alternative matching algorithm. They are as fol-
4797       lows:
4798
4799       1. Because the algorithm finds all  possible  matches,  the  greedy  or
4800       ungreedy  nature  of  repetition quantifiers is not relevant (though it
4801       may affect auto-possessification, as just described). During  matching,
4802       greedy  and  ungreedy  quantifiers are treated in exactly the same way.
4803       However, possessive quantifiers can make a difference when what follows
4804       could  also  match  what  is  quantified, for example in a pattern like
4805       this:
4806
4807         ^a++\w!
4808
4809       This pattern matches "aaab!" but not "aaa!", which would be matched  by
4810       a  non-possessive quantifier. Similarly, if an atomic group is present,
4811       it is matched as if it were a standalone pattern at the current  point,
4812       and  the  longest match is then "locked in" for the rest of the overall
4813       pattern.
4814
4815       2. When dealing with multiple paths through the tree simultaneously, it
4816       is  not  straightforward  to  keep track of captured substrings for the
4817       different matching possibilities, and PCRE2's  implementation  of  this
4818       algorithm does not attempt to do this. This means that no captured sub-
4819       strings are available.
4820
4821       3. Because no substrings are captured, back references within the  pat-
4822       tern are not supported, and cause errors if encountered.
4823
4824       4.  For  the same reason, conditional expressions that use a backrefer-
4825       ence as the condition or test for a specific group  recursion  are  not
4826       supported.
4827
4828       5.  Because  many  paths  through the tree may be active, the \K escape
4829       sequence, which resets the start of the match when encountered (but may
4830       be  on  some  paths  and not on others), is not supported. It causes an
4831       error if encountered.
4832
4833       6. Callouts are supported, but the value of the  capture_top  field  is
4834       always 1, and the value of the capture_last field is always 0.
4835
4836       7.  The  \C  escape  sequence, which (in the standard algorithm) always
4837       matches a single code unit, even in a UTF mode,  is  not  supported  in
4838       these  modes,  because the alternative algorithm moves through the sub-
4839       ject string one character (not code unit) at a  time,  for  all  active
4840       paths through the tree.
4841
4842       8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
4843       are not supported. (*FAIL) is supported, and  behaves  like  a  failing
4844       negative assertion.
4845
4846
4847ADVANTAGES OF THE ALTERNATIVE ALGORITHM
4848
4849       Using  the alternative matching algorithm provides the following advan-
4850       tages:
4851
4852       1. All possible matches (at a single point in the subject) are automat-
4853       ically  found,  and  in particular, the longest match is found. To find
4854       more than one match using the standard algorithm, you have to do kludgy
4855       things with callouts.
4856
4857       2.  Because  the  alternative  algorithm  scans the subject string just
4858       once, and never needs to backtrack (except for lookbehinds), it is pos-
4859       sible  to  pass  very  long subject strings to the matching function in
4860       several pieces, checking for partial matching each time. Although it is
4861       also  possible  to  do  multi-segment matching using the standard algo-
4862       rithm, by retaining partially matched substrings, it  is  more  compli-
4863       cated. The pcre2partial documentation gives details of partial matching
4864       and discusses multi-segment matching.
4865
4866
4867DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
4868
4869       The alternative algorithm suffers from a number of disadvantages:
4870
4871       1. It is substantially slower than  the  standard  algorithm.  This  is
4872       partly  because  it has to search for all possible matches, but is also
4873       because it is less susceptible to optimization.
4874
4875       2. Capturing parentheses and back references are not supported.
4876
4877       3. Although atomic groups are supported, their use does not provide the
4878       performance advantage that it does for the standard algorithm.
4879
4880
4881AUTHOR
4882
4883       Philip Hazel
4884       University Computing Service
4885       Cambridge, England.
4886
4887
4888REVISION
4889
4890       Last updated: 29 September 2014
4891       Copyright (c) 1997-2014 University of Cambridge.
4892------------------------------------------------------------------------------
4893
4894
4895PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)
4896
4897
4898
4899NAME
4900       PCRE2 - Perl-compatible regular expressions
4901
4902PARTIAL MATCHING IN PCRE2
4903
4904       In  normal  use  of  PCRE2,  if  the subject string that is passed to a
4905       matching function matches as far as it goes, but is too short to  match
4906       the  entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
4907       stances where it might be helpful to distinguish this case  from  other
4908       cases in which there is no match.
4909
4910       Consider, for example, an application where a human is required to type
4911       in data for a field with specific formatting requirements.  An  example
4912       might be a date in the form ddmmmyy, defined by this pattern:
4913
4914         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
4915
4916       If the application sees the user's keystrokes one by one, and can check
4917       that what has been typed so far is potentially valid,  it  is  able  to
4918       raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
4919       reflecting the character that has been typed, for example. This immedi-
4920       ate  feedback is likely to be a better user interface than a check that
4921       is delayed until the entire string has been entered.  Partial  matching
4922       can  also be useful when the subject string is very long and is not all
4923       available at once.
4924
4925       PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT  and
4926       PCRE2_PARTIAL_HARD  options,  which  can be set when calling a matching
4927       function.  The difference between the two options is whether or  not  a
4928       partial match is preferred to an alternative complete match, though the
4929       details differ between the two types  of  matching  function.  If  both
4930       options are set, PCRE2_PARTIAL_HARD takes precedence.
4931
4932       If  you  want to use partial matching with just-in-time optimized code,
4933       you must call pcre2_jit_compile() with one or both of these options:
4934
4935         PCRE2_JIT_PARTIAL_SOFT
4936         PCRE2_JIT_PARTIAL_HARD
4937
4938       PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
4939       tial  matches  on the same pattern. If the appropriate JIT mode has not
4940       been compiled, interpretive matching code is used.
4941
4942       Setting a partial matching option  disables  two  of  PCRE2's  standard
4943       optimizations. PCRE2 remembers the last literal code unit in a pattern,
4944       and abandons matching immediately if it is not present in  the  subject
4945       string.  This  optimization  cannot  be  used for a subject string that
4946       might match only partially. PCRE2 also knows the minimum  length  of  a
4947       matching  string,  and  does not bother to run the matching function on
4948       shorter strings. This optimization is also disabled for partial  match-
4949       ing.
4950
4951
4952PARTIAL MATCHING USING pcre2_match()
4953
4954       A  partial  match occurs during a call to pcre2_match() when the end of
4955       the subject string is reached successfully, but  matching  cannot  con-
4956       tinue because more characters are needed. However, at least one charac-
4957       ter in the subject must have been inspected. This  character  need  not
4958       form part of the final matched string; lookbehind assertions and the \K
4959       escape sequence provide ways of inspecting characters before the  start
4960       of  a matched string. The requirement for inspecting at least one char-
4961       acter exists because an empty string can  always  be  matched;  without
4962       such  a  restriction  there would always be a partial match of an empty
4963       string at the end of the subject.
4964
4965       When a partial match is returned, the first two elements in the ovector
4966       point to the portion of the subject that was matched, but the values in
4967       the rest of the ovector are undefined. The appearance of \K in the pat-
4968       tern has no effect for a partial match. Consider this pattern:
4969
4970         /abc\K123/
4971
4972       If it is matched against "456abc123xyz" the result is a complete match,
4973       and the ovector defines the matched string as "123", because \K  resets
4974       the  "start  of  match" point. However, if a partial match is requested
4975       and the subject string is "456abc12", a partial match is found for  the
4976       string  "abc12",  because  all these characters are needed for a subse-
4977       quent re-match with additional characters.
4978
4979       What happens when a partial match is identified depends on which of the
4980       two partial matching options is set.
4981
4982   PCRE2_PARTIAL_SOFT WITH pcre2_match()
4983
4984       If  PCRE2_PARTIAL_SOFT  is  set when pcre2_match() identifies a partial
4985       match, the partial match is remembered, but matching continues as  nor-
4986       mal,  and  other  alternatives in the pattern are tried. If no complete
4987       match  can  be  found,  PCRE2_ERROR_PARTIAL  is  returned  instead   of
4988       PCRE2_ERROR_NOMATCH.
4989
4990       This  option  is "soft" because it prefers a complete match over a par-
4991       tial match.  All the various matching items in a pattern behave  as  if
4992       the  subject string is potentially complete. For example, \z, \Z, and $
4993       match at the end of the subject, as normal, and for \b and \B  the  end
4994       of the subject is treated as a non-alphanumeric.
4995
4996       If  there  is more than one partial match, the first one that was found
4997       provides the data that is returned. Consider this pattern:
4998
4999         /123\w+X|dogY/
5000
5001       If this is matched against the subject string "abc123dog", both  alter-
5002       natives  fail  to  match,  but the end of the subject is reached during
5003       matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to  3
5004       and  9, identifying "123dog" as the first partial match that was found.
5005       (In this example, there are two partial matches, because "dog"  on  its
5006       own partially matches the second alternative.)
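
       As a minimal sketch of how an application might use the two offsets
       returned with a partial match, the C fragment below copies out the
       text that they delimit. It does not call PCRE2: the helper name
       copy_span() is hypothetical, and the offsets 3 and 9 used in the note
       that follows are simply the values quoted in the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: copy subject[start..end) into
   out as a NUL-terminated string. start and end play the role of the
   first two ovector elements after PCRE2_ERROR_PARTIAL. Returns the
   number of code units copied, or 0 if out is too small. */
size_t copy_span(const char *subject, size_t start, size_t end,
                 char *out, size_t outsize)
{
    assert(start <= end);                 /* offsets must be ordered */
    size_t n = end - start;
    if (outsize == 0 || n >= outsize)
        return 0;                         /* no room for text plus NUL */
    memcpy(out, subject + start, n);
    out[n] = '\0';
    return n;
}
```

       With the subject "abc123dog" and offsets 3 and 9, the copied text is
       "123dog", the first partial match described above.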
5007
5008   PCRE2_PARTIAL_HARD WITH pcre2_match()
5009
5010       If  PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
5011       returned as soon as a partial match is  found,  without  continuing  to
5012       search  for possible complete matches. This option is "hard" because it
5013       prefers an earlier partial match over a later complete match. For  this
5014       reason,  the  assumption  is  made that the end of the supplied subject
5015       string may not be the true end of the available data, and  so,  if  \z,
5016       \Z,  \b, \B, or $ are encountered at the end of the subject, the result
5017       is PCRE2_ERROR_PARTIAL, provided that at least  one  character  in  the
5018       subject has been inspected.
5019
5020   Comparing hard and soft partial matching
5021
5022       The  difference  between the two partial matching options can be illus-
5023       trated by a pattern such as:
5024
5025         /dog(sbody)?/
5026
5027       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
5028       the  longer  string  if  possible). If it is matched against the string
5029       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
5030       However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
5031       TIAL. On the other hand, if the pattern is made ungreedy the result  is
5032       different:
5033
5034         /dog(sbody)??/
5035
5036       In  this  case  the  result  is always a complete match because that is
5037       found first, and matching never  continues  after  finding  a  complete
5038       match. It might be easier to follow this explanation by thinking of the
5039       two patterns like this:
5040
5041         /dog(sbody)?/    is the same as  /dogsbody|dog/
5042         /dog(sbody)??/   is the same as  /dog|dogsbody/
5043
5044       The second pattern will never match "dogsbody", because it will  always
5045       find the shorter match first.
5046
5047
5048PARTIAL MATCHING USING pcre2_dfa_match()
5049
5050       The DFA functions move along the subject string character by character,
5051       without backtracking, searching for  all  possible  matches  simultane-
5052       ously.  If the end of the subject is reached before the end of the pat-
5053       tern, there is the possibility of a partial match, again provided  that
5054       at least one character has been inspected.
5055
5056       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
5057       there have been no complete matches. Otherwise,  the  complete  matches
5058       are  returned.   However, if PCRE2_PARTIAL_HARD is set, a partial match
5059       takes precedence over any complete matches. The portion of  the  string
5060       that was matched when the longest partial match was found is set as the
5061       first matching string.
5062
5063       Because the DFA functions always search for all possible  matches,  and
5064       there  is  no  difference between greedy and ungreedy repetition, their
5065       behaviour is different from  the  standard  functions  when  PCRE2_PAR-
5066       TIAL_HARD  is  set.  Consider  the  string  "dog"  matched  against the
5067       ungreedy pattern shown above:
5068
5069         /dog(sbody)??/
5070
5071       Whereas the standard function stops as soon as it  finds  the  complete
5072       match  for  "dog",  the  DFA  function also finds the partial match for
5073       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
5074
5075
5076PARTIAL MATCHING AND WORD BOUNDARIES
5077
5078       If a pattern ends with one of the sequences \b or \B, which test for
5079       word boundaries, partial matching with PCRE2_PARTIAL_SOFT can give
5080       counter-intuitive results. Consider this pattern:
5081
5082         /\bcat\b/
5083
5084       This matches "cat", provided there is a word boundary at either end. If
5085       the subject string is "the cat", the comparison of the final "t" with a
5086       following character cannot take place, so a  partial  match  is  found.
5087       However,  normal  matching carries on, and \b matches at the end of the
5088       subject when the last character is a letter, so  a  complete  match  is
5089       found.   The  result,  therefore,  is  not  PCRE2_ERROR_PARTIAL.  Using
5090       PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
5091       then the partial match takes precedence.
5092
5093
5094EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST
5095
5096       If  the  partial_soft  (or  ps) modifier is present on a pcre2test data
5097       line, the PCRE2_PARTIAL_SOFT option is used for the match.  Here  is  a
5098       run of pcre2test that uses the date example quoted above:
5099
5100           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5101         data> 25jun04\=ps
5102          0: 25jun04
5103          1: jun
5104         data> 25dec3\=ps
5105         Partial match: 25dec3
5106         data> 3ju\=ps
5107         Partial match: 3ju
5108         data> 3juj\=ps
5109         No match
5110         data> j\=ps
5111         No match
5112
5113       The  first  data  string  is matched completely, so pcre2test shows the
5114       matched substrings. The remaining four strings do not  match  the  com-
5115       plete pattern, but the first two are partial matches. Similar output is
5116       obtained if DFA matching is used.
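
       The three outcomes in the session above can be mimicked, for this one
       date pattern only, by a hand-written classifier. The sketch below is
       not how PCRE2 works internally and does not call the library; the
       names classify_date(), try_day_digits(), eat_digits(), and the
       date_result enum are all hypothetical. Like PCRE2_PARTIAL_SOFT, it
       prefers a complete match to a partial one.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { DATE_NOMATCH, DATE_PARTIAL, DATE_COMPLETE } date_result;

static const char *const months[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Consume `count` digits at *pos. DATE_COMPLETE if all were present,
   DATE_PARTIAL if the input ended first, DATE_NOMATCH on a non-digit. */
static date_result eat_digits(const char *s, size_t len, size_t *pos,
                              size_t count)
{
    for (size_t k = 0; k < count; k++) {
        if (*pos == len) return DATE_PARTIAL;
        if (!isdigit((unsigned char)s[*pos])) return DATE_NOMATCH;
        (*pos)++;
    }
    return DATE_COMPLETE;
}

/* Try the pattern with a fixed number of day digits (1 or 2). */
static date_result try_day_digits(const char *s, size_t len,
                                  size_t day_digits)
{
    size_t pos = 0;
    date_result r = eat_digits(s, len, &pos, day_digits);
    if (r != DATE_COMPLETE) return r;

    /* Month: what is left, up to three letters, must be a prefix of
       (or equal to) one of the month names. */
    size_t avail = len - pos;
    size_t n = avail < 3 ? avail : 3;
    int found = 0;
    for (int m = 0; m < 12 && !found; m++)
        found = (strncmp(s + pos, months[m], n) == 0);
    if (!found) return DATE_NOMATCH;
    if (n < 3) return DATE_PARTIAL;       /* input ended inside month */
    pos += 3;

    r = eat_digits(s, len, &pos, 2);      /* two-digit year */
    if (r != DATE_COMPLETE) return r;
    return pos == len ? DATE_COMPLETE : DATE_NOMATCH;   /* the final $ */
}

date_result classify_date(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len == 0) return DATE_NOMATCH;    /* no character inspected */
    date_result one = try_day_digits(s, len, 1);
    date_result two = try_day_digits(s, len, 2);
    return one > two ? one : two;         /* complete beats partial */
}
```

       Given the five data lines above, classify_date() reports a complete
       match for "25jun04", a partial match for "25dec3" and "3ju", and no
       match for "3juj" and "j".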
5117
5118       If the partial_hard (or ph) modifier is present  on  a  pcre2test  data
5119       line, the PCRE2_PARTIAL_HARD option is set for the match.
5120
5121
5122MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
5123
5124       When  a  partial match has been found using a DFA matching function, it
5125       is possible to continue the match by providing additional subject  data
5126       and  calling  the function again with the same compiled regular expres-
5127       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
5128       same working space as before, because this is where details of the pre-
5129       vious partial match are stored. Here is an example using pcre2test:
5130
5131           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5132         data> 23ja\=dfa,ps
5133         Partial match: 23ja
5134         data> n05\=dfa,dfa_restart
5135          0: n05
5136
5137       The first call has "23ja" as the subject, and requests  partial  match-
5138       ing;  the  second  call  has  "n05"  as  the  subject for the continued
5139       (restarted) match.  Notice that when the match is  complete,  only  the
5140       last  part  is  shown;  PCRE2 does not retain the previously partially-
5141       matched string. It is up to the calling program to do that if it  needs
5142       to.
5143
5144       That means that, for an unanchored pattern, if a continued match fails,
5145       it is not possible to try again at  a  new  starting  point.  All  this
5146       facility  is  capable  of  doing  is continuing with the previous match
5147       attempt. In the previous example, if the second set of data  is  "ug23"
5148       the  result is no match, even though there would be a match for "aug23"
5149       if the entire string were given at once. Depending on the  application,
5150       this may or may not be what you want.  The only way to allow for start-
5151       ing again at the next character is to retain the matched  part  of  the
5152       subject and try a new complete match.
5153
5154       You  can  set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
5155       PCRE2_DFA_RESTART to continue partial matching over multiple  segments.
5156       This  facility can be used to pass very long subject strings to the DFA
5157       matching functions.
5158
5159
5160MULTI-SEGMENT MATCHING WITH pcre2_match()
5161
5162       With pcre2_match(), unlike with the DFA function, it is not possible to
5163       restart the previous match with a new segment of data. Instead, new
5164       data must be added to the previous subject string, and the entire match
5165       re-run,  starting from the point where the partial match occurred. Ear-
5166       lier data can be discarded.
5167
5168       It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
5169       not  treat the end of a segment as the end of the subject when matching
5170       \z, \Z, \b, \B, and $. Consider  an  unanchored  pattern  that  matches
5171       dates:
5172
5173           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
5174         data> The date is 23ja\=ph
5175         Partial match: 23ja
5176
5177       At  this stage, an application could discard the text preceding "23ja",
5178       add on text from the next  segment,  and  call  the  matching  function
5179       again. Unlike with the DFA matching function, the entire matching string
5180       must always be available, and the complete matching process occurs for
5181       each call, so more memory and more processing time are needed.
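
       The buffer handling described above might look like the following
       sketch. It covers only the application's side (discarding the text
       before the partial match and appending the next segment) and does not
       call PCRE2. The function name carry_over() and its parameters are
       hypothetical, and keep_from is assumed to be ovector[0] from a call
       that returned PCRE2_ERROR_PARTIAL. Patterns with lookbehinds need
       extra characters retained, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2. buf holds `used` code units of
   subject data and has room for `cap`. Drop everything before keep_from,
   slide the rest to the front, and append the next segment. Returns the
   new length, or (size_t)-1 if the result would not fit in buf. */
size_t carry_over(char *buf, size_t cap, size_t used, size_t keep_from,
                  const char *seg, size_t seglen)
{
    assert(keep_from <= used && used <= cap);
    size_t kept = used - keep_from;
    if (seglen > cap - kept)
        return (size_t)-1;
    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, seg, seglen);
    return kept + seglen;
}
```

       For the example above, a buffer holding "The date is 23ja" with
       keep_from set to 12 and a next segment of "n05 and more" ends up
       holding "23jan05 and more". The next matching call would then be run
       over the whole of the new buffer contents.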
5182
5183
5184ISSUES WITH MULTI-SEGMENT MATCHING
5185
5186       Certain types of pattern may give problems with multi-segment matching,
5187       whichever matching function is used.
5188
5189       1. If the pattern contains a test for the beginning of a line, you need
5190       to  pass  the  PCRE2_NOTBOL option when the subject string for any call
5191       does not start at the beginning of a line. There is also a PCRE2_NOTEOL
5192       option, but in practice when doing multi-segment matching you should be
5193       using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.
5194
5195       2. If a pattern contains a lookbehind assertion, characters  that  pre-
5196       cede  the start of the partial match may have been inspected during the
5197       matching process.  When using pcre2_match(), sufficient characters must
5198       be  retained  for  the  next  match attempt. You can ensure that enough
5199       characters are retained by doing the following:
5200
5201       Before doing any matching, find the length of the longest lookbehind in
5202       the     pattern    by    calling    pcre2_pattern_info()    with    the
5203       PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting  count  is  in
5204       characters, not code units. After a partial match, moving back from the
5205       ovector[0] offset in the subject by the number of characters given  for
5206       the  maximum lookbehind gets you to the earliest character that must be
5207       retained. In a non-UTF or a 32-bit situation, moving  back  is  just  a
5208       subtraction,  but in UTF-8 or UTF-16 you have to count characters while
5209       moving back through the code units.
5210
5211       Characters before the point you have now reached can be discarded,  and
5212       after  the  next segment has been added to what is retained, you should
5213       run the next match with the startoffset argument set so that the  match
5214       begins at the same point as before.
5215
5216       For  example, if the pattern "(?<=123)abc" is partially matched against
5217       the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
5218       mum  lookbehind  count  is  3, so all characters before offset 2 can be
5219       discarded. The value of startoffset for the next  match  should  be  3.
5220       When  pcre2test  displays  a partial match, it indicates the lookbehind
5221       characters with '<' characters:
5222
5223           re> "(?<=123)abc"
5224         data> xx123ab\=ph
5225         Partial match: 123ab
5226                        <<<
5227
5228       3. Because a partial match must always contain at least one  character,
5229       what  might  be  considered a partial match of an empty string actually
5230       gives a "no match" result. For example:
5231
5232           re> /c(?<=abc)x/
5233         data> ab\=ps
5234         No match
5235
5236       If the next segment begins "cx", a match should be found, but this will
5237       only  happen  if characters from the previous segment are retained. For
5238       this reason, a "no match" result  should  be  interpreted  as  "partial
5239       match of an empty string" when the pattern contains lookbehinds.
5240
5241       4.  Matching  a subject string that is split into multiple segments may
5242       not always produce exactly the same result as matching over one  single
5243       long  string,  especially  when PCRE2_PARTIAL_SOFT is used. The section
5244       "Partial Matching and Word Boundaries" above describes  an  issue  that
5245       arises  if  the  pattern ends with \b or \B. Another kind of difference
5246       may occur when there are multiple matching possibilities, because  (for
5247       PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
5248       no completed matches. This means that as soon as the shortest match has
5249       been  found,  continuation to a new subject segment is no longer possi-
5250       ble. Consider this pcre2test example:
5251
5252           re> /dog(sbody)?/
5253         data> dogsb\=ps
5254          0: dog
5255         data> do\=ps,dfa
5256         Partial match: do
5257         data> gsb\=ps,dfa,dfa_restart
5258          0: g
5259         data> dogsbody\=dfa
5260          0: dogsbody
5261          1: dog
5262
5263       The first data line passes the string "dogsb" to  a  standard  matching
5264       function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
5265       a partial match for "dogsbody", the result is not  PCRE2_ERROR_PARTIAL,
5266       because  the  shorter string "dog" is a complete match. Similarly, when
5267       the subject is presented to a DFA matching function  in  several  parts
5268       ("do"  and  "gsb"  being  the first two) the match stops when "dog" has
5269       been found, and it is not possible to continue.  On the other hand,  if
5270       "dogsbody"  is  presented  as  a single string, a DFA matching function
5271       finds both matches.
5272
5273       Because of these problems, it is best to  use  PCRE2_PARTIAL_HARD  when
5274       matching  multi-segment  data.  The  example above then behaves differ-
5275       ently:
5276
5277           re> /dog(sbody)?/
5278         data> dogsb\=ph
5279         Partial match: dogsb
5280         data> do\=ps,dfa
5281         Partial match: do
5282         data> gsb\=ph,dfa,dfa_restart
5283         Partial match: gsb
5284
5285       5. Patterns that contain alternatives at the top level which do not all
5286       start  with  the  same  pattern  item  may  not  work  as expected when
5287       PCRE2_DFA_RESTART is used. For example, consider this pattern:
5288
5289         1234|3789
5290
5291       If the first part of the subject is "ABC123", a partial  match  of  the
5292       first  alternative  is found at offset 3. There is no partial match for
5293       the second alternative, because such a match does not start at the same
5294       point  in  the  subject  string. Attempting to continue with the string
5295       "7890" does not yield a match  because  only  those  alternatives  that
5296       match  at  one  point in the subject are remembered. The problem arises
5297       because the start of the second alternative matches  within  the  first
5298       alternative.  There  is  no  problem with anchored patterns or patterns
5299       such as:
5300
5301         1234|ABCD
5302
5303       where no string can be a partial match for both alternatives.  This  is
5304       not  a  problem  if  a  standard matching function is used, because the
5305       entire match has to be rerun each time:
5306
5307           re> /1234|3789/
5308         data> ABC123\=ph
5309         Partial match: 123
5310         data> 1237890
5311          0: 3789
5312
5313       Of course, instead of using PCRE2_DFA_RESTART, the  same  technique  of
5314       re-running  the  entire  match  can  also be used with the DFA matching
5315       function. Another possibility is to work with two buffers. If a partial
5316       match  at  offset  n in the first buffer is followed by "no match" when
5317       PCRE2_DFA_RESTART is used on the second buffer, you can then try a  new
5318       match starting at offset n+1 in the first buffer.
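
       For the lookbehind case in item 2 above, the step of moving back from
       ovector[0] by the PCRE2_INFO_MAXLOOKBEHIND count can be sketched as
       follows for an 8-bit subject. The function names retain_from() and
       next_startoffset() are hypothetical, the code does not call PCRE2,
       and the UTF-8 branch assumes that the subject is already valid UTF-8,
       in which continuation bytes have the form 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: return the offset of the
   earliest code unit that must be retained, given the start of the
   partial match and the maximum lookbehind measured in characters. */
size_t retain_from(const unsigned char *subject, size_t match_start,
                   size_t max_lookbehind, int utf8)
{
    size_t pos = match_start;

    if (!utf8)                    /* one code unit per character */
        return max_lookbehind < pos ? pos - max_lookbehind : 0;

    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Skip continuation bytes to reach the character's first byte. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}

/* The startoffset for the next match, counted from the start of the
   retained data rather than from the start of the old buffer. */
size_t next_startoffset(size_t match_start, size_t keep_from)
{
    assert(keep_from <= match_start);
    return match_start - keep_from;
}
```

       For the "xx123ab" example in item 2, retain_from() with a match start
       of 5 and a lookbehind count of 3 gives 2, and next_startoffset(5, 2)
       gives 3, which agrees with the figures quoted there.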
5319
5320
5321AUTHOR
5322
5323       Philip Hazel
5324       University Computing Service
5325       Cambridge, England.
5326
5327
5328REVISION
5329
5330       Last updated: 22 December 2014
5331       Copyright (c) 1997-2014 University of Cambridge.
5332------------------------------------------------------------------------------
5333
5334
5335PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)
5336
5337
5338
5339NAME
5340       PCRE2 - Perl-compatible regular expressions (revised API)
5341
5342PCRE2 REGULAR EXPRESSION DETAILS
5343
5344       The  syntax and semantics of the regular expressions that are supported
5345       by PCRE2 are described in detail below. There is a quick-reference syn-
5346       tax  summary  in the pcre2syntax page. PCRE2 tries to match Perl syntax
5347       and semantics as closely as it can.  PCRE2 also supports some  alterna-
5348       tive  regular  expression syntax (which does not conflict with the Perl
5349       syntax) in order to provide some compatibility with regular expressions
5350       in Python, .NET, and Oniguruma.
5351
5352       Perl's  regular expressions are described in its own documentation, and
5353       regular expressions in general are covered in a number of  books,  some
5354       of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
5355       Expressions", published by  O'Reilly,  covers  regular  expressions  in
5356       great  detail.  This  description  of  PCRE2's  regular  expressions is
5357       intended as reference material.
5358
5359       This document discusses the patterns that are supported by  PCRE2  when
5360       its  main  matching function, pcre2_match(), is used. PCRE2 also has an
5361       alternative matching function, pcre2_dfa_match(), which matches using a
5362       different  algorithm  that is not Perl-compatible. Some of the features
5363       discussed below are not available when DFA matching is used. The advan-
5364       tages and disadvantages of the alternative function, and how it differs
5365       from the normal function, are discussed in the pcre2matching page.
5366
5367
5368SPECIAL START-OF-PATTERN ITEMS
5369
5370       A number of options that can be passed to pcre2_compile() can  also  be
5371       set by special items at the start of a pattern. These are not Perl-com-
5372       patible, but are provided to make these options accessible  to  pattern
5373       writers  who are not able to change the program that processes the pat-
5374       tern. Any number of these items  may  appear,  but  they  must  all  be
5375       together right at the start of the pattern string, and the letters must
5376       be in upper case.
5377
5378   UTF support
5379
5380       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
5381       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
5382       can be specified for the 32-bit library, in which  case  it  constrains
5383       the  character  values  to  valid  Unicode  code points. To process UTF
5384       strings, PCRE2 must be built to include Unicode support (which  is  the
5385       default).  When  using  UTF  strings you must either call the compiling
5386       function with the PCRE2_UTF option, or the pattern must start with  the
5387       special  sequence  (*UTF),  which is equivalent to setting the relevant
5388       option. How setting a UTF mode affects pattern matching is mentioned in
5389       several  places  below.  There  is  also  a  summary of features in the
5390       pcre2unicode page.
5391
5392       Some applications that allow their users to supply patterns may wish to
5393       restrict   them   to   non-UTF   data  for  security  reasons.  If  the
5394       PCRE2_NEVER_UTF option is passed  to  pcre2_compile(),  (*UTF)  is  not
5395       allowed, and its appearance in a pattern causes an error.
5396
5397   Unicode property support
5398
5399       Another  special  sequence that may appear at the start of a pattern is
5400       (*UCP).  This has the same effect as setting the PCRE2_UCP  option:  it
5401       causes  sequences such as \d and \w to use Unicode properties to deter-
5402       mine character types, instead of recognizing only characters with codes
5403       less than 256 via a lookup table.
5404
5405       Some applications that allow their users to supply patterns may wish to
5406       restrict them for security reasons. If the  PCRE2_NEVER_UCP  option  is
5407       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
5408       a pattern causes an error.
5409
5410   Locking out empty string matching
5411
5412       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
5413       effect  as  passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
5414       to whichever matching function is subsequently called to match the pat-
5415       tern.  These  options  lock  out  the matching of empty strings, either
5416       entirely, or only at the start of the subject.
5417
5418   Disabling auto-possessification
5419
5420       If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect  as
5421       setting  the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
5422       quantifiers possessive when what  follows  cannot  match  the  repeated
5423       item. For example, by default a+b is treated as a++b. For more details,
5424       see the pcre2api documentation.
5425
5426   Disabling start-up optimizations
5427
5428       If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as
5429       setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
5430       mizations for quickly reaching "no match" results.  For  more  details,
5431       see the pcre2api documentation.
5432
5433   Disabling automatic anchoring
5434
5435       If  a  pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
5436       as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables  optimiza-
5437       tions that apply to patterns whose top-level branches all start with .*
5438       (match any number of arbitrary characters). For more details,  see  the
5439       pcre2api documentation.
5440
5441   Disabling JIT compilation
5442
5443       If  a  pattern  that starts with (*NO_JIT) is successfully compiled, an
5444       attempt by the application to apply the  JIT  optimization  by  calling
5445       pcre2_jit_compile() is ignored.
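
       The correspondence between the leading items described so far and  the
       compile  options  they stand for can be summarised as a lookup table.
       The following Python sketch is illustrative only: it  does  not  call
       PCRE2,  the function name options_from_pattern is hypothetical, (*NO_JIT)
       is left out because the text gives it no option  name,  and  scanning
       stops at the first leading item that is not in the table.

```python
import re

# Leading items from the text above, mapped to the option each one stands for.
ITEM_TO_OPTION = {
    "UTF": "PCRE2_UTF",
    "UCP": "PCRE2_UCP",
    "NOTEMPTY": "PCRE2_NOTEMPTY",
    "NOTEMPTY_ATSTART": "PCRE2_NOTEMPTY_ATSTART",
    "NO_AUTO_POSSESS": "PCRE2_NO_AUTO_POSSESS",
    "NO_START_OPT": "PCRE2_NO_START_OPTIMIZE",
    "NO_DOTSTAR_ANCHOR": "PCRE2_NO_DOTSTAR_ANCHOR",
}
_ITEM = re.compile(r"\(\*([A-Z_]+)\)")


def options_from_pattern(pattern, never_utf=False, never_ucp=False):
    """Return (set of option names, remainder of the pattern)."""
    options = set()
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None or m.group(1) not in ITEM_TO_OPTION:
            break
        name = m.group(1)
        # never_utf / never_ucp model PCRE2_NEVER_UTF and PCRE2_NEVER_UCP.
        if (name == "UTF" and never_utf) or (name == "UCP" and never_ucp):
            raise ValueError("(*%s) is locked out by the caller" % name)
        options.add(ITEM_TO_OPTION[name])
        pos = m.end()
    return options, pattern[pos:]
```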
5446
5447   Setting match and recursion limits
5448
5449       The  caller of pcre2_match() can set a limit on the number of times the
5450       internal match() function is called and on the maximum depth of  recur-
5451       sive calls. These facilities are provided to catch runaway matches that
5452       are provoked by patterns with huge matching trees (a typical example is
5453       a  pattern  with  nested unlimited repeats) and to avoid running out of
5454       system stack by too  much  recursion.  When  one  of  these  limits  is
5455       reached,  pcre2_match()  gives  an error return. The limits can also be
5456       set by items at the start of the pattern of the form
5457
5458         (*LIMIT_MATCH=d)
5459         (*LIMIT_RECURSION=d)
5460
5461       where d is any number of decimal digits. However, the value of the set-
5462       ting  must  be  less than the value set (or defaulted) by the caller of
5463       pcre2_match() for it to have any effect. In other  words,  the  pattern
5464       writer  can lower the limits set by the programmer, but not raise them.
5465       If there is more than one setting of one of  these  limits,  the  lower
5466       value is used.
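
       The rule that a pattern can lower a limit but never raise it, and that
       the lowest of several settings wins, amounts to taking a minimum.  The
       Python sketch below models only that rule; it does not call PCRE2, the
       name effective_limits is hypothetical, and for simplicity  it  assumes
       that the limit items are the first items in the pattern.

```python
import re

_LIMIT_ITEM = re.compile(r"\(\*LIMIT_(MATCH|RECURSION)=(\d+)\)")


def effective_limits(pattern, caller_match, caller_recursion):
    """Combine the caller's limits with leading (*LIMIT_...=d) items."""
    limits = {"MATCH": caller_match, "RECURSION": caller_recursion}
    pos = 0
    while True:
        m = _LIMIT_ITEM.match(pattern, pos)
        if m is None:
            break
        name, value = m.group(1), int(m.group(2))
        # A larger value has no effect; a smaller one lowers the limit.
        limits[name] = min(limits[name], value)
        pos = m.end()
    return limits
```

       For example, with a caller limit of 1000, a  pattern  that  starts  with
       (*LIMIT_MATCH=5000) leaves the match limit at 1000, whereas one that
       starts with (*LIMIT_MATCH=500)(*LIMIT_MATCH=200) lowers it to 200.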
5467
5468   Newline conventions
5469
5470       PCRE2 supports five different conventions for indicating line breaks in
5471       strings: a single CR (carriage return) character, a  single  LF  (line-
5472       feed) character, the two-character sequence CRLF, any of the three pre-
5473       ceding, or any Unicode newline sequence. The pcre2api page has  further
5474       discussion  about newlines, and shows how to set the newline convention
5475       when calling pcre2_compile().
5476
5477       It is also possible to specify a newline convention by starting a  pat-
5478       tern string with one of the following five sequences:
5479
5480         (*CR)        carriage return
5481         (*LF)        linefeed
5482         (*CRLF)      carriage return, followed by linefeed
5483         (*ANYCRLF)   any of the three above
5484         (*ANY)       all Unicode newline sequences
5485
5486       These override the default and the options given to the compiling func-
5487       tion. For example, on a Unix system where LF  is  the  default  newline
5488       sequence, the pattern
5489
5490         (*CR)a.b
5491
5492       changes the convention to CR. That pattern matches "a\nb" because LF is
5493       no longer a newline. If more than one of these settings is present, the
5494       last one is used.
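
       The selection rule (a leading item overrides the default, and the last
       of several items wins) can be modelled as follows. This  is  a  Python
       sketch  under  stated assumptions, not PCRE2 code: newline_convention is
       a hypothetical name, and only items of the form (*NAME) or  (*NAME=d)
       at the very start of the pattern are scanned.

```python
import re

NEWLINE_ITEMS = ("CR", "LF", "CRLF", "ANYCRLF", "ANY")
_LEADING_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=\d+)?\)")


def newline_convention(pattern, default):
    """Return the newline convention in force for the pattern."""
    chosen = default
    pos = 0
    while True:
        m = _LEADING_ITEM.match(pattern, pos)
        if m is None:
            break
        if m.group(1) in NEWLINE_ITEMS:
            chosen = m.group(1)  # a later setting replaces an earlier one
        pos = m.end()
    return chosen
```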
5495
5496       The  newline  convention affects where the circumflex and dollar asser-
5497       tions are true. It also affects the interpretation of the dot metachar-
5498       acter  when  PCRE2_DOTALL is not set, and the behaviour of \N. However,
5499       it does not affect what the \R escape  sequence  matches.  By  default,
5500       this  is any Unicode newline sequence, for Perl compatibility. However,
5501       this can be changed; see the description of \R in the section  entitled
5502       "Newline  sequences" below. A change of \R setting can be combined with
5503       a change of newline convention.
5504
5505   Specifying what \R matches
5506
5507       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
5508       the  complete  set  of  Unicode  line  endings)  by  setting the option
5509       PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved  by
5510       starting  a  pattern  with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
5511       CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
5512
5513
5514EBCDIC CHARACTER CODES
5515
5516       PCRE2 can be compiled to run in an environment that uses EBCDIC as  its
5517       character code rather than ASCII or Unicode (typically a mainframe sys-
5518       tem). In the sections below, character code values are  ASCII  or  Uni-
5519       code; in an EBCDIC environment these characters may have different code
5520       values, and there are no code points greater than 255.
5521
5522
5523CHARACTERS AND METACHARACTERS
5524
5525       A regular expression is a pattern that is  matched  against  a  subject
5526       string  from  left  to right. Most characters stand for themselves in a
5527       pattern, and match the corresponding characters in the  subject.  As  a
5528       trivial example, the pattern
5529
5530         The quick brown fox
5531
5532       matches a portion of a subject string that is identical to itself. When
5533       caseless matching is specified (the PCRE2_CASELESS option), letters are
5534       matched independently of case.
5535
5536       The  power  of  regular  expressions  comes from the ability to include
5537       alternatives and repetitions in the pattern. These are encoded  in  the
5538       pattern by the use of metacharacters, which do not stand for themselves
5539       but instead are interpreted in some special way.
5540
5541       There are two different sets of metacharacters: those that  are  recog-
5542       nized  anywhere in the pattern except within square brackets, and those
5543       that are recognized within square brackets.  Outside  square  brackets,
5544       the metacharacters are as follows:
5545
5546         \      general escape character with several uses
5547         ^      assert start of string (or line, in multiline mode)
5548         $      assert end of string (or line, in multiline mode)
5549         .      match any character except newline (by default)
5550         [      start character class definition
5551         |      start of alternative branch
5552         (      start subpattern
5553         )      end subpattern
5554         ?      extends the meaning of (
5555                also 0 or 1 quantifier
5556                also quantifier minimizer
5557         *      0 or more quantifier
5558         +      1 or more quantifier
5559                also "possessive quantifier"
5560         {      start min/max quantifier
5561
5562       Part  of  a  pattern  that is in square brackets is called a "character
5563       class". In a character class the only metacharacters are:
5564
5565         \      general escape character
5566         ^      negate the class, but only if the first character
5567         -      indicates character range
5568         [      POSIX character class (only if followed by POSIX
5569                  syntax)
5570         ]      terminates the character class
5571
5572       The following sections describe the use of each of the metacharacters.
5573
5574
5575BACKSLASH
5576
5577       The backslash character has several uses. Firstly, if it is followed by
5578       a character that is not a digit or a letter, it takes away any  special
5579       meaning that character may have. This use of  backslash  as  an  escape
5580       character applies both inside and outside character classes.
5581
5582       For  example,  if  you want to match a * character, you write \* in the
5583       pattern.  This escaping action applies whether  or  not  the  following
5584       character  would  otherwise be interpreted as a metacharacter, so it is
5585       always safe to precede a non-alphanumeric  with  backslash  to  specify
5586       that  it stands for itself. In particular, if you want to match a back-
5587       slash, you write \\.
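
       Because  it  is  always safe to precede a non-alphanumeric with a back-
       slash, an application can turn arbitrary text into a pattern that matches
       that  text  literally.  The  Python  sketch  below is one way to do this
       outside PCRE2; quote_literal is a hypothetical name, and only ASCII non-
       alphanumerics  are escaped, which is a deliberately conservative choice.

```python
def quote_literal(text):
    """Backslash-escape every ASCII character that is not a letter or digit."""
    out = []
    for ch in text:
        if ord(ch) < 128 and not ch.isalnum():
            out.append("\\" + ch)  # e.g. * becomes \*
        else:
            out.append(ch)
    return "".join(out)
```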
5588
5589       In  a UTF mode, only ASCII digits and letters have any special meaning
5590       after  a  backslash.  All  other characters (in particular, those whose
5591       codepoints are greater than 127) are treated as literals.
5592
5593       If a pattern is compiled with the  PCRE2_EXTENDED  option,  most  white
5594       space  in the pattern (other than in a character class), and characters
5595       between a # outside a character class and the next newline,  inclusive,
5596       are ignored. An escaping backslash can be used to include a white space
5597       or # character as part of the pattern.
5598
5599       If you want to remove the special meaning from a  sequence  of  charac-
5600       ters,  you can do so by putting them between \Q and \E. This is differ-
5601       ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
5602       sequences  in PCRE2, whereas in Perl, $ and @ cause variable interpola-
5603       tion. Note the following examples:
5604
5605         Pattern            PCRE2 matches   Perl matches
5606
5607         \Qabc$xyz\E        abc$xyz        abc followed by the
5608                                             contents of $xyz
5609         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
5610         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
5611
5612       The \Q...\E sequence is recognized both inside  and  outside  character
5613       classes.   An  isolated \E that is not preceded by \Q is ignored. If \Q
5614       is not followed by \E later in the pattern, the literal  interpretation
5615       continues  to  the  end  of  the pattern (that is, \E is assumed at the
5616       end). If the isolated \Q is inside a character class,  this  causes  an
5617       error, because the character class is not terminated.
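
       The  \Q...\E  rules  above (literal text between the two, an isolated \E
       ignored, an unterminated \Q running to the end of the  pattern)  can  be
       modelled  by  rewriting  the  pattern with explicit backslashes. This is
       a Python sketch of the PCRE2 column of the table, not  PCRE2's  own  im-
       plementation;  expand_quoting is a hypothetical name, and character
       classes are not treated specially.

```python
def expand_quoting(pattern):
    """Rewrite quoted runs as individually escaped characters."""
    out = []
    i = 0
    n = len(pattern)
    quoting = False
    while i < n:
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False
                i += 2
            else:
                ch = pattern[i]
                # Inside the quoted run every character is literal.
                out.append(ch if ch.isalnum() else "\\" + ch)
                i += 1
        elif two == "\\Q":
            quoting = True
            i += 2
        elif two == "\\E":
            i += 2  # an isolated end marker is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(two)  # leave other escapes untouched
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```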
5618
5619   Non-printing characters
5620
5621       A second use of backslash provides a way of encoding non-printing char-
5622       acters in patterns in a visible manner. There is no restriction on  the
5623       appearance  of non-printing characters in a pattern, but when a pattern
5624       is being prepared by text editing, it is often easier to use one of the
5625       following  escape sequences than the binary character it represents. In
5626       an ASCII or Unicode environment, these escapes are as follows:
5627
5628         \a        alarm, that is, the BEL character (hex 07)
5629         \cx       "control-x", where x is any printable ASCII character
5630         \e        escape (hex 1B)
5631         \f        form feed (hex 0C)
5632         \n        linefeed (hex 0A)
5633         \r        carriage return (hex 0D)
5634         \t        tab (hex 09)
5635         \0dd      character with octal code 0dd
5636         \ddd      character with octal code ddd, or back reference
5637         \o{ddd..} character with octal code ddd..
5638         \xhh      character with hex code hh
5639         \x{hhh..} character with hex code hhh.. (default mode)
5640         \uhhhh    character with hex code hhhh (when PCRE2_ALT_BSUX is set)
5641
5642       The precise effect of \cx on ASCII characters is as follows: if x is  a
5643       lower  case  letter,  it  is converted to upper case. Then bit 6 of the
5644       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
5645       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
5646       hex 7B (; is 3B). If the code unit following \c has a value  less  than
5647       32  or  greater  than  126, a compile-time error occurs. This locks out
5648       non-printable ASCII characters in all modes.
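
       The \cx calculation for ASCII can be written out  directly.  The  Python
       sketch below follows the steps just described (fold lower case to upper
       case, then invert the hex 40 bit); control_code is a  hypothetical  name
       and the function does not call PCRE2.

```python
def control_code(x):
    """Return the character code produced by \\c followed by x (ASCII)."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are converted first
    return code ^ 0x40  # invert bit 6 (hex 40)
```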
5649
5650       When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t  gen-
5651       erate the appropriate EBCDIC code values. The \c escape is processed as
5652       specified for Perl in the perlebcdic document. The only characters that
5653       are  allowed  after  \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?.
5654       Any other character provokes a compile-time  error.  The  sequence  \c@
5655       encodes  character  code 0; the letters (in either case) encode charac-
5656       ters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
5657       (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
5658
5659       Thus,  apart  from  \c?, these escapes generate the same character code
5660       values as they do in an ASCII environment, though the meanings  of  the
5661       values  mostly  differ. For example, \cG always generates code value 7,
5662       which is BEL in ASCII but DEL in EBCDIC.
5663
5664       The sequence \c? generates DEL (127, hex 7F) in an  ASCII  environment,
5665       but  because  127  is  not a control character in EBCDIC, Perl makes it
5666       generate the APC character. Unfortunately, there are  several  variants
5667       of  EBCDIC.  In  most  of them the APC character has the value 255 (hex
5668       FF), but in the one Perl calls POSIX-BC its value is 95  (hex  5F).  If
5669       certain  other characters have POSIX-BC values, PCRE2 makes \? generate
5670       95; otherwise it generates 255.
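
       The EBCDIC mapping for the \c escape described above is small enough to
       tabulate.  The Python sketch below is illustrative only: the name
       ebcdic_control_code and the posix_bc flag are hypothetical, and the flag
       merely  stands  in for PCRE2's own detection of the POSIX-BC variant.

```python
def ebcdic_control_code(x, posix_bc=False):
    """Return the code generated by the control escape for x in EBCDIC mode."""
    if len(x) != 1:
        raise ValueError("exactly one character is expected")
    if "A" <= x <= "Z" or "a" <= x <= "z":
        return ord(x.upper()) - ord("A") + 1  # letters give 1-26
    if x == "@":
        return 0
    if x in "[\\]^_":
        return 27 + "[\\]^_".index(x)  # 27-31 in the listed order
    if x == "?":
        return 95 if posix_bc else 255  # the APC character
    raise ValueError("character not allowed after the control escape")
```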
5671
5672       After \0 up to two further octal digits are read. If  there  are  fewer
5673       than  two  digits,  just  those  that  are  present  are used. Thus the
5674       sequence \0\x\015 specifies two binary zeros followed by a CR character
5675       (code value 13). Make sure you supply two digits after the initial zero
5676       if the pattern character that follows is itself an octal digit.
5677
5678       The escape \o must be followed by a sequence of octal digits,  enclosed
5679       in  braces.  An  error occurs if this is not the case. This escape is a
5680       recent addition to Perl; it provides a way of specifying character code
5681       points  as  octal  numbers  greater than 0777, and it also allows octal
5682       numbers and back references to be unambiguously specified.
5683
5684       For greater clarity and unambiguity, it is best to avoid following \ by
5685       a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
5686       ter numbers, and \g{} to specify back references. The  following  para-
5687       graphs describe the old, ambiguous syntax.
5688
5689       The handling of a backslash followed by a digit other than 0 is compli-
5690       cated, and Perl has changed over time, causing PCRE2 also to change.
5691
5692       Outside a character class, PCRE2 reads the digit and any following dig-
5693       its as a decimal number. If the number is less than 10, begins with the
5694       digit 8 or 9, or if there are at least  that  many  previous  capturing
5695       left  parentheses  in the expression, the entire sequence is taken as a
5696       back reference. A description of how this works is given later, follow-
5697       ing  the  discussion  of  parenthesized  subpatterns.  Otherwise, up to
5698       three octal digits are read to form a character code.
5699
5700       Inside a character class, PCRE2 handles \8 and \9 as the literal  char-
5701       acters  "8"  and "9", and otherwise reads up to three octal digits fol-
5702       lowing the backslash, using them to generate a data character. Any sub-
5703       sequent  digits  stand for themselves. For example, outside a character
5704       class:
5705
5706         \040   is another way of writing an ASCII space
5707         \40    is the same, provided there are fewer than 40
5708                   previous capturing subpatterns
5709         \7     is always a back reference
5710         \11    might be a back reference, or another way of
5711                   writing a tab
5712         \011   is always a tab
5713         \0113  is a tab followed by the character "3"
5714         \113   might be a back reference, otherwise the
5715                   character with octal code 113
5716         \377   might be a back reference, otherwise
5717                   the value 255 (decimal)
5718         \81    is always a back reference
5719
5720       Note that octal values of 100 or greater that are specified using  this
5721       syntax  must  not be introduced by a leading zero, because no more than
5722       three octal digits are ever read.
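
       The decision procedure for a backslash followed by digits,  outside  a
       character  class,  can  be modelled as below. This is a Python sketch of
       the rules as stated in this section, not PCRE2's parser; the name inter-
       pret_backslash_digits  is  hypothetical, and range checks on the result-
       ing character code are not modelled.

```python
OCTAL = "01234567"


def interpret_backslash_digits(digits, previous_groups):
    """digits: the run of decimal digits that follows the backslash.

    Returns ("backref", number, "") or ("char", code, leftover digits).
    """
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected one or more decimal digits")
    if digits[0] == "0":
        limit = 3  # the zero plus up to two further octal digits
    else:
        number = int(digits)
        if number < 10 or digits[0] in "89" or previous_groups >= number:
            return ("backref", number, "")
        limit = 3  # otherwise up to three octal digits
    j = 1 if digits[0] == "0" else 0
    while j < limit and j < len(digits) and digits[j] in OCTAL:
        j += 1
    # Digits that were not consumed stand for themselves.
    return ("char", int(digits[:j], 8), digits[j:])
```

       With  this  model,  interpret_backslash_digits("0113", 0) yields a tab
       (code 9) followed by the leftover digit "3", and "81" is a back refer-
       ence whatever the number of previous groups.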
5723
5724       By default, after \x that is not followed by {, from zero to two  hexa-
5725       decimal  digits  are  read (letters can be in upper or lower case). Any
5726       number of hexadecimal digits may appear between \x{ and }. If a charac-
5727       ter  other  than  a  hexadecimal digit appears between \x{ and }, or if
5728       there is no terminating }, an error occurs.
5729
5730       If the PCRE2_ALT_BSUX option is set, the interpretation  of  \x  is  as
5731       just described only when it is followed by two hexadecimal digits. Oth-
5732       erwise, it matches a literal "x" character. In this mode,  support  for
5733       code  points  greater than 255 is provided by \u, which must be fol-
5734       lowed by four hexadecimal digits; otherwise it matches  a  literal  "u"
5735       character.
5736
5737       Characters whose value is less than 256 can be defined by either of the
5738       two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
5739       ference  in  the way they are handled. For example, \xdc is exactly the
5740       same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).
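
       The  difference  between the default handling of \x and the handling of
       \x and \u under PCRE2_ALT_BSUX can be modelled as follows. This  Python
       sketch  follows the rules stated above and does not call PCRE2; the name
       read_hex_escape is hypothetical, and rejecting empty braces is  an  as-
       sumption because the text does not cover that case.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def _hex_run(text, limit):
    """Count leading hexadecimal digits in text, up to limit."""
    n = 0
    while n < limit and n < len(text) and text[n] in HEX_DIGITS:
        n += 1
    return n


def read_hex_escape(letter, following, alt_bsux=False):
    """Return (code point, number of characters of `following` consumed)."""
    if letter == "x" and not alt_bsux:
        if following.startswith("{"):
            end = following.find("}")
            body = following[1:end] if end != -1 else ""
            # Assumption: empty braces are treated as an error here.
            if not body or any(c not in HEX_DIGITS for c in body):
                raise ValueError("malformed braced hex escape")
            return int(body, 16), end + 1
        n = _hex_run(following, 2)  # zero to two digits are read
        return (int(following[:n], 16) if n else 0), n
    if letter == "x":
        if _hex_run(following, 2) == 2:
            return int(following[:2], 16), 2
        return ord("x"), 0  # literal "x"
    if letter == "u" and alt_bsux:
        if _hex_run(following, 4) == 4:
            return int(following[:4], 16), 4
        return ord("u"), 0  # literal "u"
    raise ValueError("not a hex escape in this mode")
```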
5741
5742   Constraints on character values
5743
5744       Characters that are specified using octal or  hexadecimal  numbers  are
5745       limited to certain values, as follows:
5746
5747         8-bit non-UTF mode    less than 0x100
5748         8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint
5749         16-bit non-UTF mode   less than 0x10000
5750         16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint
5751         32-bit non-UTF mode   less than 0x100000000
5752         32-bit UTF-32 mode    no greater than 0x10ffff and a valid codepoint
5753
5754       Invalid  Unicode  codepoints  are  the  range 0xd800 to 0xdfff (the so-
5755       called "surrogate" codepoints), and 0xffef.
5756
5757   Escape sequences in character classes
5758
5759       All the sequences that define a single character value can be used both
5760       inside  and  outside character classes. In addition, inside a character
5761       class, \b is interpreted as the backspace character (hex 08).
5762
5763       \N is not allowed in a character class. \B, \R, and \X are not  special
5764       inside  a  character  class.  Like other unrecognized alphabetic escape
5765       sequences, they cause  an  error.  Outside  a  character  class,  these
5766       sequences have different meanings.
5767
5768   Unsupported escape sequences
5769
5770       In  Perl, the sequences \l, \L, \u, and \U are recognized by its string
5771       handler and used  to  modify  the  case  of  following  characters.  By
5772       default, PCRE2 does not support these escape sequences. However, if the
5773       PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
5774       used to define a character by code point, as described  above  in  the
5775       discussion of non-printing characters.
5776
5777   Absolute and relative back references
5778
5779       The sequence \g followed by an unsigned or a negative  number,  option-
5780       ally  enclosed  in braces, is an absolute or relative back reference. A
5781       named back reference can be coded as \g{name}. Back references are dis-
5782       cussed later, following the discussion of parenthesized subpatterns.
5783
5784   Absolute and relative subroutine calls
5785
5786       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
5787       name or a number enclosed either in angle brackets or single quotes, is
5788       an  alternative  syntax for referencing a subpattern as a "subroutine".
5789       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
5790       \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
5791       reference; the latter is a subroutine call.
5792
5793   Generic character types
5794
5795       Another use of backslash is for specifying generic character types:
5796
5797         \d     any decimal digit
5798         \D     any character that is not a decimal digit
5799         \h     any horizontal white space character
5800         \H     any character that is not a horizontal white space character
5801         \s     any white space character
5802         \S     any character that is not a white space character
5803         \v     any vertical white space character
5804         \V     any character that is not a vertical white space character
5805         \w     any "word" character
5806         \W     any "non-word" character
5807
5808       There is also the single sequence \N, which matches a non-newline char-
5809       acter.   This is the same as the "." metacharacter when PCRE2_DOTALL is
5810       not set. Perl also uses \N to match characters by name; PCRE2 does  not
5811       support this.
5812
5813       Each  pair of lower and upper case escape sequences partitions the com-
5814       plete set of characters into two disjoint  sets.  Any  given  character
5815       matches  one, and only one, of each pair. The sequences can appear both
5816       inside and outside character classes. They each match one character  of
5817       the  appropriate  type.  If the current matching point is at the end of
5818       the subject string, all of them fail, because there is no character  to
5819       match.
5820
5821       The  default  \s  characters  are HT (9), LF (10), VT (11), FF (12), CR
5822       (13), and space (32), which are defined  as  white  space  in  the  "C"
5823       locale. This list may vary if locale-specific matching is taking place.
5824       For example, in some locales the "non-breaking space" character  (\xA0)
5825       is recognized as white space, and in others the VT character is not.
5826
5827       A  "word"  character is an underscore or any character that is a letter
5828       or digit.  By default, the definition of letters  and  digits  is  con-
5829       trolled by PCRE2's low-valued character tables, and may vary if locale-
5830       specific matching is taking place (see "Locale support" in the pcre2api
5831       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
5832       systems, or "french" in Windows, some character codes greater than  127
5833       are  used  for  accented letters, and these are then matched by \w. The
5834       use of locales with Unicode is discouraged.
5835
5836       By default, characters whose code points are  greater  than  127  never
5837       match \d, \s, or \w, and always match \D, \S, and \W, although this may
5838       be different for characters in the range 128-255  when  locale-specific
5839       matching  is  happening.   These escape sequences retain their original
5840       meanings from before Unicode support was available,  mainly  for  effi-
5841       ciency  reasons.  If  the  PCRE2_UCP  option  is  set, the behaviour is
5842       changed so that Unicode properties  are  used  to  determine  character
5843       types, as follows:
5844
5845         \d  any character that matches \p{Nd} (decimal digit)
5846         \s  any character that matches \p{Z} or \h or \v
5847         \w  any character that matches \p{L} or \p{N}, plus underscore
5848
5849       The  upper case escapes match the inverse sets of characters. Note that
5850       \d matches only decimal digits, whereas \w matches any  Unicode  digit,
5851       as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
5852       affects  \b  and  \B  because they are defined in terms of \w and \W.
5853       Matching these sequences is noticeably slower when PCRE2_UCP is set.
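
       The effect of PCRE2_UCP on \d, \s, and \w can be approximated  with  the
       Unicode  data  that ships with Python. The sketch below is illustrative
       only: it does not call PCRE2, matches_type is a hypothetical name,  lo-
       cale-specific  tables  are ignored, and Python's Unicode version may dif-
       fer from the one PCRE2 was built with. The \h and \v sets are  the  ones
       listed later in this section.

```python
import unicodedata

HSPACE = {0x09, 0x20, 0xA0, 0x1680, 0x180E, 0x202F, 0x205F, 0x3000}
HSPACE |= set(range(0x2000, 0x200B))  # U+2000 to U+200A
VSPACE = {0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029}
DEFAULT_SPACE = {9, 10, 11, 12, 13, 32}  # HT, LF, VT, FF, CR, space


def matches_type(ch, kind, ucp=False):
    """kind is "d", "s" or "w"; negate the result for the upper case escape."""
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if kind == "d":
        return cat == "Nd" if ucp else "0" <= ch <= "9"
    if kind == "s":
        if ucp:
            return cat.startswith("Z") or cp in HSPACE or cp in VSPACE
        return cp in DEFAULT_SPACE
    if kind == "w":
        if ucp:
            return ch == "_" or cat[0] in "LN"
        return ch == "_" or (cp < 128 and ch.isalnum())
    raise ValueError("kind must be 'd', 's' or 'w'")
```

       For  example,  U+00BD  (vulgar  fraction  one half, category No) matches
       \w but not \d when ucp is true, and matches neither by default.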
5854
5855       The  sequences  \h, \H, \v, and \V, in contrast to the other sequences,
5856       which match only ASCII characters by default, always match  a  specific
5857       list  of  code  points, whether or not PCRE2_UCP is set. The horizontal
5858       space characters are:
5859
5860         U+0009     Horizontal tab (HT)
5861         U+0020     Space
5862         U+00A0     Non-break space
5863         U+1680     Ogham space mark
5864         U+180E     Mongolian vowel separator
5865         U+2000     En quad
5866         U+2001     Em quad
5867         U+2002     En space
5868         U+2003     Em space
5869         U+2004     Three-per-em space
5870         U+2005     Four-per-em space
5871         U+2006     Six-per-em space
5872         U+2007     Figure space
5873         U+2008     Punctuation space
5874         U+2009     Thin space
5875         U+200A     Hair space
5876         U+202F     Narrow no-break space
5877         U+205F     Medium mathematical space
5878         U+3000     Ideographic space
5879
5880       The vertical space characters are:
5881
5882         U+000A     Linefeed (LF)
5883         U+000B     Vertical tab (VT)
5884         U+000C     Form feed (FF)
5885         U+000D     Carriage return (CR)
5886         U+0085     Next line (NEL)
5887         U+2028     Line separator
5888         U+2029     Paragraph separator
5889
5890       In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
5891       than 256 are relevant.
5892
5893   Newline sequences
5894
5895       Outside  a  character class, by default, the escape sequence \R matches
5896       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
5897       to the following:
5898
5899         (?>\r\n|\n|\x0b|\f|\r|\x85)
5900
5901       This  is  an  example  of an "atomic group", details of which are given
5902       below.  This particular group matches either the two-character sequence
5903       CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
5904       U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
5905       riage  return,  U+000D), or NEL (next line, U+0085). Because this is an
5906       atomic group, the two-character sequence is treated as  a  single  unit
5907       that cannot be split.
5908
5909       In  other modes, two additional characters whose codepoints are greater
5910       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
5911       rator,  U+2029).  Unicode support is not needed for these characters to
5912       be recognized.
5913
5914       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
5915       the  complete  set  of  Unicode  line  endings)  by  setting the option
5916       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation  for  "back-
5917       slash R".) This can be made the default when PCRE2 is built; if this is
5918       the case, the other behaviour can be requested via  the  PCRE2_BSR_UNI-
5919       CODE  option. It is also possible to specify these settings by starting
5920       a pattern string with one of the following sequences:
5921
5922         (*BSR_ANYCRLF)   CR, LF, or CRLF only
5923         (*BSR_UNICODE)   any Unicode newline sequence
5924
5925       These override the default and the options given to the compiling func-
5926       tion.  Note that these special settings, which are not Perl-compatible,
5927       are recognized only at the very start of a pattern, and that they  must
5928       be  in upper case. If more than one of them is present, the last one is
5929       used. They can be combined with a change  of  newline  convention;  for
5930       example, a pattern can start with:
5931
5932         (*ANY)(*BSR_ANYCRLF)
5933
5934       They  can also be combined with the (*UTF) or (*UCP) special sequences.
5935       Inside a character class, \R  is  treated  as  an  unrecognized  escape
5936       sequence, and causes an error.
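
       What  \R  matches  at  a  given position, outside a character class, can
       be modelled as below. This is a Python sketch of the  behaviour  de-
       scribed  in  this section, not PCRE2 code; the name match_newline_se-
       quence and its parameters are hypothetical, with anycrlf standing in for
       PCRE2_BSR_ANYCRLF or (*BSR_ANYCRLF).

```python
def match_newline_sequence(subject, pos, anycrlf=False, eight_bit_non_utf=False):
    """Return the length of the newline sequence at pos, or 0 if none."""
    # CRLF is tried first and taken as a unit, like the atomic group:
    # a CR that is followed by LF is never matched on its own.
    if subject.startswith("\r\n", pos):
        return 2
    if anycrlf:
        singles = "\n\r"
    else:
        singles = "\n\x0b\x0c\r\x85"
        if not eight_bit_non_utf:
            singles += "\u2028\u2029"  # LS and PS
    ch = subject[pos:pos + 1]
    return 1 if ch and ch in singles else 0
```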
5937
5938   Unicode character properties
5939
5940       When  PCRE2  is  built  with Unicode support (the default), three addi-
5941       tional escape sequences that match characters with specific  properties
5942       are  available.  In 8-bit non-UTF-8 mode, these sequences are of course
5943       limited to testing characters whose codepoints are less than  256,  but
5944       they do work in this mode.  The extra escape sequences are:
5945
5946         \p{xx}   a character with the xx property
5947         \P{xx}   a character without the xx property
5948         \X       a Unicode extended grapheme cluster
5949
5950       The  property  names represented by xx above are limited to the Unicode
5951       script names, the general category properties, "Any", which matches any
5952       character  (including  newline),  and  some  special  PCRE2  properties
5953       (described in the next section).  Other Perl properties such as  "InMu-
5954       sicalSymbols"  are  not supported by PCRE2.  Note that \P{Any} does not
5955       match any characters, so always causes a match failure.
5956
5957       Sets of Unicode characters are defined as belonging to certain scripts.
5958       A  character from one of these sets can be matched using a script name.
5959       For example:
5960
5961         \p{Greek}
5962         \P{Han}
5963
5964       Those that are not part of an identified script are lumped together  as
5965       "Common". The current list of scripts is:
5966
5967       Ahom,   Anatolian_Hieroglyphs,  Arabic,  Armenian,  Avestan,  Balinese,
5968       Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille,  Buginese,
5969       Buhid,  Canadian_Aboriginal,  Carian, Caucasian_Albanian, Chakma, Cham,
5970       Cherokee,  Common,  Coptic,  Cuneiform,  Cypriot,  Cyrillic,   Deseret,
5971       Devanagari,  Duployan,  Egyptian_Hieroglyphs,  Elbasan, Ethiopic, Geor-
5972       gian, Glagolitic, Gothic,  Grantha,  Greek,  Gujarati,  Gurmukhi,  Han,
5973       Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
5974       Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,  Kaithi,  Kan-
5975       nada,  Katakana,  Kayah_Li,  Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
5976       Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,  Lydian,  Maha-
5977       jani,  Malayalam,  Mandaic,  Manichaean,  Meetei_Mayek,  Mende_Kikakui,
5978       Meroitic_Cursive, Meroitic_Hieroglyphs,  Miao,  Modi,  Mongolian,  Mro,
5979       Multani,   Myanmar,   Nabataean,  New_Tai_Lue,  Nko,  Ogham,  Ol_Chiki,
5980       Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic,  Old_Persian,
5981       Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
5982       Pau_Cin_Hau,  Phags_Pa,  Phoenician,  Psalter_Pahlavi,  Rejang,  Runic,
5983       Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
5984       Sora_Sompeng,  Sundanese,  Syloti_Nagri,  Syriac,  Tagalog,   Tagbanwa,
5985       Tai_Le,   Tai_Tham,  Tai_Viet,  Takri,  Tamil,  Telugu,  Thaana,  Thai,
5986       Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.
5987
5988       Each character has exactly one Unicode general category property, spec-
5989       ified  by a two-letter abbreviation. For compatibility with Perl, nega-
5990       tion can be specified by including a  circumflex  between  the  opening
5991       brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
5992       \P{Lu}.
5993
5994       If only one letter is specified with \p or \P, it includes all the gen-
5995       eral  category properties that start with that letter. In this case, in
5996       the absence of negation, the curly brackets in the escape sequence  are
5997       optional; these two examples have the same effect:
5998
5999         \p{L}
6000         \pL
6001
6002       The following general category property codes are supported:
6003
6004         C     Other
6005         Cc    Control
6006         Cf    Format
6007         Cn    Unassigned
6008         Co    Private use
6009         Cs    Surrogate
6010
6011         L     Letter
6012         Ll    Lower case letter
6013         Lm    Modifier letter
6014         Lo    Other letter
6015         Lt    Title case letter
6016         Lu    Upper case letter
6017
6018         M     Mark
6019         Mc    Spacing mark
6020         Me    Enclosing mark
6021         Mn    Non-spacing mark
6022
6023         N     Number
6024         Nd    Decimal number
6025         Nl    Letter number
6026         No    Other number
6027
6028         P     Punctuation
6029         Pc    Connector punctuation
6030         Pd    Dash punctuation
6031         Pe    Close punctuation
6032         Pf    Final punctuation
6033         Pi    Initial punctuation
6034         Po    Other punctuation
6035         Ps    Open punctuation
6036
6037         S     Symbol
6038         Sc    Currency symbol
6039         Sk    Modifier symbol
6040         Sm    Mathematical symbol
6041         So    Other symbol
6042
6043         Z     Separator
6044         Zl    Line separator
6045         Zp    Paragraph separator
6046         Zs    Space separator
6047
6048       The  special property L& is also supported: it matches a character that
6049       has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
6050       classified as a modifier or "other".
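
       The naming rules above (one-letter names, two-letter names, the  L&
       special case, and circumflex negation) can be modelled in a few lines
       of C. This is only a sketch of how a property  name  relates  to  a
       two-letter  category  code,  not  PCRE2's  implementation. The name
       gc_property_matches is hypothetical, and the caller is  assumed  to
       know the character's general category already.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if a character whose two-letter general category is `cat`
   (for example "Lu") satisfies the name `prop` as written inside \p{...}.
   `prop` may be a one-letter name, a two-letter name, the special name
   "L&", and may start with a circumflex for negation. */
bool gc_property_matches(const char *prop, const char *cat)
{
    bool negate = (prop[0] == '^');
    if (negate) prop++;

    bool hit;
    if (strcmp(prop, "L&") == 0) {
        /* L& covers cased letters only: Lu, Ll, or Lt. */
        hit = strcmp(cat, "Lu") == 0 || strcmp(cat, "Ll") == 0 ||
              strcmp(cat, "Lt") == 0;
    } else if (strlen(prop) == 1) {
        /* A single letter covers every category starting with it. */
        hit = (cat[0] == prop[0]);
    } else {
        hit = (strcmp(prop, cat) == 0);
    }
    return negate ? !hit : hit;
}
```

       With this model, the name "^Lu" applied to category "Ll" gives true,
       which mirrors the statement that \p{^Lu} is the same as \P{Lu}.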
6051
6052       The  Cs  (Surrogate)  property  applies only to characters in the range
6053       U+D800 to U+DFFF. Such characters are not valid in Unicode strings  and
6054       so  cannot  be  tested  by PCRE2, unless UTF validity checking has been
6055       turned off (see the discussion of PCRE2_NO_UTF_CHECK  in  the  pcre2api
6056       page). Perl does not support the Cs property.
6057
6058       The  long  synonyms  for  property  names  that  Perl supports (such as
6059       \p{Letter}) are not supported by PCRE2, nor is it permitted  to  prefix
6060       any of these properties with "Is".
6061
6062       No character that is in the Unicode table has the Cn (unassigned) prop-
6063       erty.  Instead, this property is assumed for any code point that is not
6064       in the Unicode table.
6065
6066       Specifying  caseless  matching  does not affect these escape sequences.
6067       For example, \p{Lu} always matches only upper  case  letters.  This  is
6068       different from the behaviour of current versions of Perl.
6069
6070       Matching  characters by Unicode property is not fast, because PCRE2 has
6071       to do a multistage table lookup in order to find  a  character's  prop-
6072       erty. That is why the traditional escape sequences such as \d and \w do
6073       not use Unicode properties in PCRE2 by default,  though  you  can  make
6074       them  do  so by setting the PCRE2_UCP option or by starting the pattern
6075       with (*UCP).
6076
6077   Extended grapheme clusters
6078
6079       The \X escape matches any number of Unicode  characters  that  form  an
6080       "extended grapheme cluster", and treats the sequence as an atomic group
6081       (see below).  Unicode supports various kinds of composite character  by
6082       giving  each  character  a grapheme breaking property, and having rules
6083       that use these properties to define the boundaries of extended grapheme
6084       clusters.  \X  always  matches  at least one character. Then it decides
6085       whether to add additional characters according to the  following  rules
6086       for ending a cluster:
6087
6088       1. End at the end of the subject string.
6089
6090       2.  Do not end between CR and LF; otherwise end after any control char-
6091       acter.
6092
6093       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
6094       characters  are of five types: L, V, T, LV, and LVT. An L character may
6095       be followed by an L, V, LV, or LVT character; an LV or V character  may
6096       be followed by a V or T character; an LVT or T character may be followed
6097       only by a T character.
6098
6099       4. Do not end before extending characters or spacing marks.  Characters
6100       with  the  "mark"  property  always have the "extend" grapheme breaking
6101       property.
6102
6103       5. Do not end after prepend characters.
6104
6105       6. Otherwise, end the cluster.
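
       Rule 3 is the most intricate of these rules. As  an  illustration,
       the  following  C  sketch encodes just the Hangul pairs listed in that
       rule. The enum and function names are hypothetical, and the  sketch
       is not taken from the PCRE2 source.

```c
#include <stdbool.h>

/* The five Hangul syllable types named in rule 3. */
typedef enum { HG_L, HG_V, HG_T, HG_LV, HG_LVT } hangul_type;

/* Return true if `next` may follow `prev` without ending the cluster. */
bool hangul_keeps_cluster(hangul_type prev, hangul_type next)
{
    switch (prev) {
    case HG_L:                    /* L may be followed by L, V, LV, LVT */
        return next == HG_L || next == HG_V ||
               next == HG_LV || next == HG_LVT;
    case HG_LV:
    case HG_V:                    /* LV or V may be followed by V or T */
        return next == HG_V || next == HG_T;
    case HG_LVT:
    case HG_T:                    /* LVT or T may be followed only by T */
        return next == HG_T;
    }
    return false;
}
```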
6106
6107   PCRE2's additional properties
6108
6109       As well as the standard Unicode properties described above, PCRE2  sup-
6110       ports  four  more  that  make it possible to convert traditional escape
6111       sequences such as \w and \s to use Unicode properties. PCRE2 uses these
6112       non-standard,  non-Perl  properties  internally  when PCRE2_UCP is set.
6113       However, they may also be used explicitly. These properties are:
6114
6115         Xan   Any alphanumeric character
6116         Xps   Any POSIX space character
6117         Xsp   Any Perl space character
6118         Xwd   Any Perl "word" character
6119
6120       Xan matches characters that have either the L (letter) or the  N  (num-
6121       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
6122       form feed, or carriage return, and any other character that has  the  Z
6123       (separator)  property.   Xsp  is  the  same as Xps; in PCRE1 it used to
6124       exclude vertical tab, for Perl compatibility,  but  Perl  changed.  Xwd
6125       matches the same characters as Xan, plus underscore.
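
       The definitions of these four properties can be written as  simple
       predicates.  The  following  C  sketch  assumes the caller supplies the
       first letter of the character's general category (for example  'L'
       for  Lu);  the  function  names  are  hypothetical and this is not how
       PCRE2 itself stores or tests the properties.

```c
#include <stdbool.h>
#include <stdint.h>

/* `gc` is the first letter of the character's general category. */

/* Xan: any character with the L (letter) or N (number) property. */
bool is_xan(char gc) { return gc == 'L' || gc == 'N'; }

/* Xps: tab, linefeed, vertical tab, form feed, carriage return
   (code points 9 to 13), or any character with the Z property. */
bool is_xps(uint32_t c, char gc)
{
    return (c >= 0x09 && c <= 0x0D) || gc == 'Z';
}

/* Xsp: the same set as Xps. */
bool is_xsp(uint32_t c, char gc) { return is_xps(c, gc); }

/* Xwd: the Xan characters plus underscore. */
bool is_xwd(uint32_t c, char gc) { return is_xan(gc) || c == '_'; }
```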
6126
6127       There  is another non-standard property, Xuc, which matches any charac-
6128       ter that can be represented by a Universal Character Name  in  C++  and
6129       other  programming  languages.  These are the characters $, @, ` (grave
6130       accent), and all characters with Unicode code points  greater  than  or
6131       equal  to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
6132       most base (ASCII) characters are excluded. (Universal  Character  Names
6133       are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
6134       Note that the Xuc property does not match these sequences but the char-
6135       acters that they represent.)
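
       As a sketch, the Xuc definition translates directly into a range test
       on the code point. The function name is hypothetical, and the upper
       bound of U+10FFFF is an assumption added here because it is the limit
       of the Unicode code space; the text above does not state it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if code point `c` has the Xuc property as described:
   $, @, grave accent, or any code point from U+00A0 upwards that is
   not a surrogate. */
bool is_xuc(uint32_t c)
{
    if (c == '$' || c == '@' || c == '`') return true;
    if (c >= 0xD800 && c <= 0xDFFF) return false;   /* surrogates */
    return c >= 0x00A0 && c <= 0x10FFFF;            /* assumed upper bound */
}
```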
6136
6137   Resetting the match start
6138
6139       The  escape sequence \K causes any previously matched characters not to
6140       be included in the final matched sequence. For example, the pattern:
6141
6142         foo\Kbar
6143
6144       matches "foobar", but reports that it has matched "bar".  This  feature
6145       is  similar  to  a lookbehind assertion (described below).  However, in
6146       this case, the part of the subject before the real match does not  have
6147       to  be of fixed length, as lookbehind assertions do. The use of \K does
6148       not interfere with the setting of captured  substrings.   For  example,
6149       when the pattern
6150
6151         (foo)\Kbar
6152
6153       matches "foobar", the first substring is still set to "foo".
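
       The effect of \K on the reported offsets can be  illustrated  with  a
       hand-written  C  equivalent of the pattern foo\Kbar. This sketch does
       not call PCRE2; the span type and the function name are hypothetical
       and exist only to show which offsets are reported.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct { size_t start, end; } span;

/* Behaves like foo\Kbar: the whole of "foobar" must be present, but
   the reported match starts where \K was reached, after "foo". */
bool match_foo_K_bar(const char *subject, span *reported)
{
    const char *hit = strstr(subject, "foobar");
    if (hit == NULL) return false;

    size_t real_start = (size_t)(hit - subject);
    reported->start = real_start + 3;   /* \K discards "foo" */
    reported->end   = real_start + 6;   /* end of "bar" */
    return true;
}
```

       For the subject "xxfoobar" the real match begins at offset 2, but the
       reported span is offsets 5 to 8, which is "bar".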
6154
6155       Perl  documents  that  the  use  of  \K  within assertions is "not well
6156       defined". In PCRE2, \K is acted upon when  it  occurs  inside  positive
6157       assertions,  but  is  ignored  in negative assertions. Note that when a
6158       pattern such as (?=ab\K) matches, the reported start of the  match  can
6159       be greater than the end of the match.
6160
6161   Simple assertions
6162
6163       The  final use of backslash is for certain simple assertions. An asser-
6164       tion specifies a condition that has to be met at a particular point  in
6165       a  match, without consuming any characters from the subject string. The
6166       use of subpatterns for more complicated assertions is described  below.
6167       The backslashed assertions are:
6168
6169         \b     matches at a word boundary
6170         \B     matches when not at a word boundary
6171         \A     matches at the start of the subject
6172         \Z     matches at the end of the subject
6173                 also matches before a newline at the end of the subject
6174         \z     matches only at the end of the subject
6175         \G     matches at the first matching position in the subject
6176
6177       Inside  a  character  class, \b has a different meaning; it matches the
6178       backspace character. If any other of  these  assertions  appears  in  a
6179       character class, an "invalid escape sequence" error is generated.
6180
6181       A  word  boundary is a position in the subject string where the current
6182       character and the previous character do not both match \w or  \W  (i.e.
6183       one  matches  \w  and the other matches \W), or the start or end of the
6184       string if the first or last character matches \w,  respectively.  In  a
6185       UTF  mode,  the  meanings  of  \w  and \W can be changed by setting the
6186       PCRE2_UCP option. When this is done, it also affects \b and \B. Neither
6187       PCRE2  nor Perl has a separate "start of word" or "end of word" metase-
6188       quence. However, whatever follows \b normally determines which  it  is.
6189       For example, the fragment \ba matches "a" at the start of a word.
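
       The word boundary definition can be sketched in C as follows.  This
       model assumes the default (non-UCP) meaning of \w, taken here to be
       ASCII letters, digits, and underscore in the "C" locale; the function
       names are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

static bool is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Return true if \b would succeed at offset `pos` of a subject of
   length `len`. Positions before the start and after the end are
   treated as non-word, so the start or end of the string is a
   boundary only when the first or last character matches \w. */
bool at_word_boundary(const char *s, size_t len, size_t pos)
{
    bool prev_is_word = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
    bool next_is_word = pos < len && is_word_char((unsigned char)s[pos]);
    return prev_is_word != next_is_word;
}
```

       The \B assertion is simply the negation of this test.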
6190
6191       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
6192       and dollar (described in the next section) in that they only ever match
6193       at  the  very start and end of the subject string, whatever options are
6194       set. Thus, they are independent of multiline mode. These  three  asser-
6195       tions  are  not  affected  by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
6196       which affect only the behaviour of the circumflex and dollar  metachar-
6197       acters.  However,  if the startoffset argument of pcre2_match() is non-
6198       zero, indicating that matching is to start at a point  other  than  the
6199       beginning  of  the subject, \A can never match.  The difference between
6200       \Z and \z is that \Z matches before a newline at the end of the  string
6201       as well as at the very end, whereas \z matches only at the end.
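
       The three assertions can be compared side by side  in  a  small  C
       sketch.  It  assumes  that  the newline convention is a single linefeed
       character; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* \A: only at the very start of the subject. */
bool at_A(size_t pos) { return pos == 0; }

/* \z: only at the very end of the subject. */
bool at_z(size_t len, size_t pos) { return pos == len; }

/* \Z: at the very end, or just before a newline that ends the subject. */
bool at_Z(const char *s, size_t len, size_t pos)
{
    return pos == len || (pos + 1 == len && s[pos] == '\n');
}
```

       Because every position tried is at or after startoffset, at_A() can
       never be true when startoffset is non-zero, which is the behaviour
       described above for \A.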
6202
6203       The  \G assertion is true only when the current matching position is at
6204       the start point of the match, as specified by the startoffset  argument
6205       of  pcre2_match().  It differs from \A when the value of startoffset is
6206       non-zero. By calling  pcre2_match()  multiple  times  with  appropriate
6207       arguments,  you  can  mimic Perl's /g option, and it is in this kind of
6208       implementation where \G can be useful.
6209
6210       Note, however, that PCRE2's interpretation of \G, as the start  of  the
6211       current match, is subtly different from Perl's, which defines it as the
6212       end of the previous match. In Perl, these can  be  different  when  the
6213       previously  matched string was empty. Because PCRE2 does just one match
6214       at a time, it cannot reproduce this behaviour.
6215
6216       If all the alternatives of a pattern begin with \G, the  expression  is
6217       anchored to the starting match position, and the "anchored" flag is set
6218       in the compiled regular expression.
6219
6220
6221CIRCUMFLEX AND DOLLAR
6222
6223       The circumflex and dollar  metacharacters  are  zero-width  assertions.
6224       That  is,  they test for a particular condition being true without con-
6225       suming any characters from the subject string. These two metacharacters
6226       are  concerned  with matching the starts and ends of lines. If the new-
6227       line convention is set so that only the two-character sequence CRLF  is
6228       recognized  as  a newline, isolated CR and LF characters are treated as
6229       ordinary data characters, and are not recognized as newlines.
6230
6231       Outside a character class, in the default matching mode, the circumflex
6232       character  is  an  assertion  that is true only if the current matching
6233       point is at the start of the subject string. If the  startoffset  argu-
6234       ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
6235       flex can never match if the PCRE2_MULTILINE option is unset.  Inside  a
6236       character  class,  circumflex  has  an  entirely different meaning (see
6237       below).
6238
6239       Circumflex need not be the first character of the pattern if  a  number
6240       of  alternatives are involved, but it should be the first thing in each
6241       alternative in which it appears if the pattern is ever  to  match  that
6242       branch.  If all possible alternatives start with a circumflex, that is,
6243       if the pattern is constrained to match only at the start  of  the  sub-
6244       ject,  it  is  said  to be an "anchored" pattern. (There are also other
6245       constructs that can cause a pattern to be anchored.)
6246
6247       The dollar character is an assertion that is true only if  the  current
6248       matching  point  is  at  the  end of the subject string, or immediately
6249       before a newline  at  the  end  of  the  string  (by  default),  unless
6250       PCRE2_NOTEOL is set. Note, however, that it does not actually match the
6251       newline. Dollar need not be the last character of the pattern if a num-
6252       ber of alternatives are involved, but it should be the last item in any
6253       branch in which it appears. Dollar has no special meaning in a  charac-
6254       ter class.
6255
6256       The  meaning  of  dollar  can be changed so that it matches only at the
6257       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY  option  at
6258       compile time. This does not affect the \Z assertion.
6259
6260       The meanings of the circumflex and dollar metacharacters are changed if
6261       the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar
6262       character  matches before any newlines in the string, as well as at the
6263       very end, and a circumflex matches immediately after internal  newlines
6264       as  well as at the start of the subject string. It does not match after
6265       a newline that ends the string, for compatibility with  Perl.  However,
6266       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
6267
6268       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
6269       (where \n represents a newline) in multiline mode, but  not  otherwise.
6270       Consequently,  patterns  that  are anchored in single line mode because
6271       all branches start with ^ are not anchored in  multiline  mode,  and  a
6272       match  for  circumflex  is  possible  when  the startoffset argument of
6273       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
6274       if PCRE2_MULTILINE is set.
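
       The rules for circumflex and dollar, with and without  multiline
       mode,  can  be  summarized  in a C sketch. It assumes a single linefeed
       newline convention and ignores PCRE2_NOTBOL, PCRE2_NOTEOL,  a  non-
       zero  startoffset,  and  PCRE2_ALT_CIRCUMFLEX.  The function names and
       boolean parameters are hypothetical stand-ins for the options.

```c
#include <stdbool.h>
#include <stddef.h>

/* Would ^ succeed at offset `pos` (0 <= pos <= len)? */
bool circumflex_matches(const char *s, size_t len, size_t pos,
                        bool multiline)
{
    if (pos == 0) return true;
    /* Multiline: after an internal newline, but not after a newline
       that ends the string. */
    return multiline && pos < len && s[pos - 1] == '\n';
}

/* Would $ succeed at offset `pos` (0 <= pos <= len)? */
bool dollar_matches(const char *s, size_t len, size_t pos,
                    bool multiline, bool dollar_endonly)
{
    if (pos == len) return true;
    if (multiline) return s[pos] == '\n';   /* before any newline */
    if (dollar_endonly) return false;       /* very end only */
    return pos + 1 == len && s[pos] == '\n';
}
```

       For the subject "def\nabc", circumflex succeeds at offset 4 only when
       multiline is true, which is why /^abc$/ matches there in  multiline
       mode but not otherwise.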
6275
6276       When  the  newline  convention (see "Newline conventions" below) recog-
6277       nizes the two-character sequence CRLF as a newline, this is  preferred,
6278       even  if  the  single  characters CR and LF are also recognized as new-
6279       lines. For example, if the newline convention  is  "any",  a  multiline
6280       mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather
6281       than after CR, even though CR on its own is a valid newline.  (It  also
6282       matches at the very start of the string, of course.)
6283
6284       Note  that  the sequences \A, \Z, and \z can be used to match the start
6285       and end of the subject in both modes, and if all branches of a  pattern
6286       start  with \A it is always anchored, whether or not PCRE2_MULTILINE is
6287       set.
6288
6289
6290FULL STOP (PERIOD, DOT) AND \N
6291
6292       Outside a character class, a dot in the pattern matches any one charac-
6293       ter  in  the subject string except (by default) a character that signi-
6294       fies the end of a line.
6295
6296       When a line ending is defined as a single character, dot never  matches
6297       that  character; when the two-character sequence CRLF is used, dot does
6298       not match CR if it is immediately followed  by  LF,  but  otherwise  it
6299       matches  all characters (including isolated CRs and LFs). When any Uni-
6300       code line endings are being recognized, dot does not match CR or LF  or
6301       any of the other line ending characters.
6302
6303       The  behaviour  of  dot  with regard to newlines can be changed. If the
6304       PCRE2_DOTALL option is set, a dot matches any  one  character,  without
6305       exception.   If  the two-character sequence CRLF is present in the sub-
6306       ject string, it takes two dots to match it.
6307
6308       The handling of dot is entirely independent of the handling of  circum-
6309       flex  and  dollar,  the  only relationship being that they both involve
6310       newlines. Dot has no special meaning in a character class.
6311
6312       The escape sequence \N behaves like  a  dot,  except  that  it  is  not
6313       affected  by  the  PCRE2_DOTALL  option. In other words, it matches any
6314       character except one that signifies the end of a line. Perl  also  uses
6315       \N to match characters by name; PCRE2 does not support this.
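
       The behaviour of dot and \N can be modelled in C as follows. Only two
       newline conventions are covered (a single linefeed, and  CRLF);  the
       enum,  the  function  names,  and  the dotall parameter are hypothetical
       stand-ins, and the sketch works on single-byte characters.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { NEWLINE_LF, NEWLINE_CRLF } newline_kind;

/* Would a dot match the character at offset `pos`? */
bool dot_matches(const char *s, size_t len, size_t pos,
                 newline_kind nl, bool dotall)
{
    if (pos >= len) return false;   /* dot must consume one character */
    if (dotall) return true;        /* any character, without exception */
    if (nl == NEWLINE_LF) return s[pos] != '\n';
    /* CRLF convention: only a CR immediately followed by LF is refused. */
    return !(s[pos] == '\r' && pos + 1 < len && s[pos + 1] == '\n');
}

/* \N behaves like a dot that ignores the dotall setting. */
bool backslash_N_matches(const char *s, size_t len, size_t pos,
                         newline_kind nl)
{
    return dot_matches(s, len, pos, nl, false);
}
```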
6316
6317
6318MATCHING A SINGLE CODE UNIT
6319
6320       Outside  a character class, the escape sequence \C matches any one code
6321       unit, whether or not a UTF mode is set. In the 8-bit library, one  code
6322       unit  is  one  byte;  in the 16-bit library it is a 16-bit unit; in the
6323       32-bit library it is a 32-bit unit. Unlike a  dot,  \C  always  matches
6324       line-ending  characters.  The  feature  is provided in Perl in order to
6325       match individual bytes in UTF-8 mode, but it is unclear how it can use-
6326       fully be used.
6327
6328       Because  \C  breaks  up characters into individual code units, matching
6329       one unit with \C in UTF-8 or UTF-16 mode means that  the  rest  of  the
6330       string  may  start  with  a malformed UTF character. This has undefined
6331       results, because PCRE2 assumes that it is matching character by charac-
6332       ter  in  a  valid UTF string (by default it checks the subject string's
6333       validity at the  start  of  processing  unless  the  PCRE2_NO_UTF_CHECK
6334       option is used).
6335
6336       An   application   can   lock   out  the  use  of  \C  by  setting  the
6337       PCRE2_NEVER_BACKSLASH_C option when compiling a  pattern.  It  is  also
6338       possible to build PCRE2 with the use of \C permanently disabled.
6339
6340       PCRE2  does  not allow \C to appear in lookbehind assertions (described
6341       below) in UTF-8 or UTF-16 modes, because this would make it  impossible
6342       to  calculate  the  length  of  the lookbehind. Neither the alternative
6343       matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
6344       these UTF modes.  The former gives a match-time error; the latter fails
6345       to optimize and so the match is always run using the interpreter.
6346
6347       In the 32-bit library,  however,  \C  is  always  supported  (when  not
6348       explicitly  locked  out)  because it always matches a single code unit,
6349       whether or not UTF-32 is specified.
6350
6351       In general, the \C escape sequence is best avoided. However, one way of
6352       using  it  that avoids the problem of malformed UTF-8 or UTF-16 charac-
6353       ters is to use a lookahead to check the length of the  next  character,
6354       as  in  this  pattern,  which could be used with a UTF-8 string (ignore
6355       white space and line breaks):
6356
6357         (?| (?=[\x00-\x7f])(\C) |
6358             (?=[\x80-\x{7ff}])(\C)(\C) |
6359             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
6360             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
6361
6362       In this example, a group that starts  with  (?|  resets  the  capturing
6363       parentheses numbers in each alternative (see "Duplicate Subpattern Num-
6364       bers" below). The assertions at the start of each branch check the next
6365       UTF-8  character  for  values  whose encoding uses 1, 2, 3, or 4 bytes,
6366       respectively. The character's individual bytes are then captured by the
6367       appropriate number of \C groups.
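
       The four lookahead ranges in the pattern correspond to the number of
       bytes in a character's UTF-8 encoding. The same mapping can be written
       as a C function; the name utf8_length is hypothetical and the sketch
       only restates the ranges used in the pattern above.

```c
#include <stdint.h>

/* Return the number of \C items the pattern above uses for code
   point `c`, or 0 if `c` is outside the ranges it lists. */
int utf8_length(uint32_t c)
{
    if (c <= 0x7f)     return 1;   /* [\x00-\x7f]           */
    if (c <= 0x7ff)    return 2;   /* [\x80-\x{7ff}]        */
    if (c <= 0xffff)   return 3;   /* [\x{800}-\x{ffff}]    */
    if (c <= 0x1fffff) return 4;   /* [\x{10000}-\x{1fffff}] */
    return 0;
}
```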
6368
6369
6370SQUARE BRACKETS AND CHARACTER CLASSES
6371
6372       An opening square bracket introduces a character class, terminated by a
6373       closing square bracket. A closing square bracket on its own is not spe-
6374       cial  by  default.  If a closing square bracket is required as a member
6375       of the class, it should be the first data character in the class (after
6376       an  initial  circumflex,  if present) or escaped with a backslash. This
6377       means that, by default, an empty class cannot be defined.  However,  if
6378       the  PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
6379       the start does end the (empty) class.
6380
6381       A character class matches a single character in the subject. A  matched
6382       character must be in the set of characters defined by the class, unless
6383       the first character in the class definition is a circumflex,  in  which
6384       case the subject character must not be in the set defined by the class.
6385       If a circumflex is actually required as a member of the  class,  ensure
6386       it is not the first character, or escape it with a backslash.
6387
6388       For  example, the character class [aeiou] matches any lower case vowel,
6389       while [^aeiou] matches any character that is not a  lower  case  vowel.
6390       Note that a circumflex is just a convenient notation for specifying the
6391       characters that are in the class by enumerating those that are  not.  A
6392       class  that starts with a circumflex is not an assertion; it still con-
6393       sumes a character from the subject string, and therefore  it  fails  if
6394       the current pointer is at the end of the string.
6395
6396       When  caseless  matching  is set, any letters in a class represent both
6397       their upper case and lower case versions, so for  example,  a  caseless
6398       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
6399       match "A", whereas a caseful version would.
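
       The interaction of negation, caseless matching, and the end  of  the
       subject  can  be  sketched  in C. The model handles only classes that
       are plain lists of single-byte characters; the function  name  and
       parameters are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* `members` is a NUL-terminated list of the literal characters in the
   class, without the brackets or the leading circumflex. */
bool class_matches(const char *members, bool negated, bool caseless,
                   const char *s, size_t len, size_t pos)
{
    /* A class is not an assertion: it must consume a character, so it
       fails at the end of the subject even when negated. */
    if (pos >= len) return false;

    unsigned char c = (unsigned char)s[pos];
    bool in_set = false;
    for (const char *m = members; *m != '\0'; m++) {
        unsigned char mc = (unsigned char)*m;
        if (c == mc || (caseless && tolower(c) == tolower(mc))) {
            in_set = true;
            break;
        }
    }
    return negated ? !in_set : in_set;
}
```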
6400
6401       Characters that might indicate line breaks are  never  treated  in  any
6402       special  way  when  matching  character  classes,  whatever line-ending
6403       sequence is in use,  and  whatever  setting  of  the  PCRE2_DOTALL  and
6404       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
6405       one of these characters.
6406
6407       The minus (hyphen) character can be used to specify a range of  charac-
6408       ters  in  a  character  class.  For  example,  [d-m] matches any letter
6409       between d and m, inclusive. If a  minus  character  is  required  in  a
6410       class,  it  must  be  escaped  with a backslash or appear in a position
6411       where it cannot be interpreted as indicating a range, typically as  the
6412       first or last character in the class, or immediately after a range. For
6413       example, [b-d-z] matches letters in the range b to d, a hyphen  charac-
6414       ter, or z.
6415
6416       It is not possible to have the literal character "]" as the end charac-
6417       ter of a range. A pattern such as [W-]46] is interpreted as a class  of
6418       two  characters ("W" and "-") followed by a literal string "46]", so it
6419       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
6420       backslash  it is interpreted as the end of range, so [W-\]46] is inter-
6421       preted as a class containing a range followed by two other  characters.
6422       The  octal or hexadecimal representation of "]" can also be used to end
6423       a range.
6424
6425       An error is generated if a POSIX character  class  (see  below)  or  an
6426       escape  sequence other than one that defines a single character appears
6427       at a point where a range ending character  is  expected.  For  example,
6428       [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
6429
6430       Ranges normally include all code points between the start and end char-
6431       acters, inclusive. They can also be  used  for  code  points  specified
6432       numerically, for example [\000-\037]. Ranges can include any characters
6433       that are valid for the current mode.
6434
6435       There is a special case in EBCDIC environments  for  ranges  whose  end
6436       points are both specified as literal letters in the same case. For com-
6437       patibility with Perl, EBCDIC code points within the range that are  not
6438       letters  are  omitted. For example, [h-k] matches only four characters,
6439       even though the codes for h and k are 0x88 and 0x92, a range of 11 code
6440       points.  However,  if  the range is specified numerically, for example,
6441       [\x88-\x92] or [h-\x92], all code points are included.
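
       The [h-k] example can be checked with a short C sketch. It assumes the
       usual  EBCDIC  layout  in  which the lower case letters occupy 0x81 to
       0x89 (a-i), 0x91 to 0x99 (j-r), and 0xA2 to 0xA9 (s-z), and it  han-
       dles only lower case ranges. The function names are hypothetical.

```c
#include <stdbool.h>

static bool ebcdic_is_lower(unsigned c)
{
    return (c >= 0x81 && c <= 0x89) ||   /* a-i */
           (c >= 0x91 && c <= 0x99) ||   /* j-r */
           (c >= 0xA2 && c <= 0xA9);     /* s-z */
}

/* Count the code points matched by a range from `lo` to `hi`. When both
   end points were written as literal lower case letters, non-letters
   inside the range are omitted; otherwise every code point counts. */
int ebcdic_range_size(unsigned lo, unsigned hi, bool literal_letters)
{
    int n = 0;
    for (unsigned c = lo; c <= hi; c++) {
        if (!literal_letters || ebcdic_is_lower(c)) n++;
    }
    return n;
}
```

       Under these assumptions, the range 0x88 to 0x92 contains four letters
       (h, i, j, k) when written as [h-k], and 11 code points when written
       numerically.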
6442
6443       If a range that includes letters is used when caseless matching is set,
6444       it matches the letters in either case. For example, [W-c] is equivalent
6445       to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
6446       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
6447       accented E characters in both cases.
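
       The [W-c] example can be verified with a small C sketch  that  tests
       a  character  and  its  other-case  form against the range. It assumes
       ASCII and the "C" locale; the function name is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* Return true if `c` matches the range lo-hi under caseless matching:
   either `c` itself or its other-case form lies inside the range. */
bool caseless_range_matches(unsigned char lo, unsigned char hi,
                            unsigned char c)
{
    int lower = tolower(c);
    int upper = toupper(c);
    return (c >= lo && c <= hi) ||
           (lower >= lo && lower <= hi) ||
           (upper >= lo && upper <= hi);
}
```

       For the range W to c, this accepts w, x, y, z, a, b, c in either case
       and the punctuation characters between Z and a, but rejects d and V.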
6448
6449       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
6450       \w, and \W may appear in a character class, and add the characters that
6451       they match to the class. For example, [\dABCDEF] matches any  hexadeci-
6452       mal  digit.  In UTF modes, the PCRE2_UCP option affects the meanings of
6453       \d, \s, \w and their upper case partners, just as  it  does  when  they
6454       appear  outside a character class, as described in the section entitled
6455       "Generic character types" above. The escape sequence \b has a different
6456       meaning  inside  a character class; it matches the backspace character.
6457       The sequences \B, \N, \R, and \X are not  special  inside  a  character
6458       class.  Like  any  other  unrecognized  escape sequences, they cause an
6459       error.
6460
6461       A circumflex can conveniently be used with  the  upper  case  character
6462       types  to specify a more restricted set of characters than the matching
6463       lower case type.  For example, the class [^\W_] matches any  letter  or
6464       digit, but not underscore, whereas [\w] includes underscore. A positive
6465       character class should be read as "something OR something OR ..." and a
6466       negative class as "NOT something AND NOT something AND NOT ...".
6467
6468       The  only  metacharacters  that are recognized in character classes are
6469       backslash, hyphen (only where it can be  interpreted  as  specifying  a
6470       range),  circumflex  (only  at the start), opening square bracket (only
6471       when it can be interpreted as introducing a POSIX class name, or for  a
6472       special  compatibility  feature  -  see the next two sections), and the
6473       terminating  closing  square  bracket.  However,  escaping  other  non-
6474       alphanumeric characters does no harm.
6475
6476
6477POSIX CHARACTER CLASSES
6478
6479       Perl supports the POSIX notation for character classes. This uses names
6480       enclosed by [: and :] within the enclosing square brackets. PCRE2  also
6481       supports this notation. For example,
6482
6483         [01[:alpha:]%]
6484
6485       matches "0", "1", any alphabetic character, or "%". The supported class
6486       names are:
6487
6488         alnum    letters and digits
6489         alpha    letters
6490         ascii    character codes 0 - 127
6491         blank    space or tab only
6492         cntrl    control characters
6493         digit    decimal digits (same as \d)
6494         graph    printing characters, excluding space
6495         lower    lower case letters
6496         print    printing characters, including space
6497         punct    printing characters, excluding letters and digits and space
6498         space    white space (the same as \s from PCRE 8.34)
6499         upper    upper case letters
6500         word     "word" characters (same as \w)
6501         xdigit   hexadecimal digits
6502
6503       The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
6504       CR  (13),  and space (32). If locale-specific matching is taking place,
6505       the list of space characters may be different; there may  be  fewer  or
6506       more of them. "Space" and \s match the same set of characters.
6507
6508       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
6509       from Perl 5.8. Another Perl extension is negation, which  is  indicated
6510       by a ^ character after the colon. For example,
6511
6512         [12[:^digit:]]
6513
6514       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognizes the
6515       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
6516       these are not supported, and an error is given if they are encountered.
6517
6518       By default, characters with values greater than 127 do not match any of
6519       the POSIX character classes, although this may be different for charac-
6520       ters  in  the range 128-255 when locale-specific matching is happening.
6521       However, if the PCRE2_UCP option is passed to pcre2_compile(), some  of
6522       the  classes are changed so that Unicode character properties are used.
6523       This  is  achieved  by  replacing  certain  POSIX  classes  with  other
6524       sequences, as follows:
6525
6526         [:alnum:]  becomes  \p{Xan}
6527         [:alpha:]  becomes  \p{L}
6528         [:blank:]  becomes  \h
6529         [:cntrl:]  becomes  \p{Cc}
6530         [:digit:]  becomes  \p{Nd}
6531         [:lower:]  becomes  \p{Ll}
6532         [:space:]  becomes  \p{Xps}
6533         [:upper:]  becomes  \p{Lu}
6534         [:word:]   becomes  \p{Xwd}
6535
6536       Negated versions, such as [:^alpha:], use \P instead of \p. Three other
6537       POSIX classes are handled specially in UCP mode:
6538
6539       [:graph:] This matches characters that have glyphs that mark  the  page
6540                 when printed. In Unicode property terms, it matches all char-
6541                 acters with the L, M, N, P, S, or Cf properties, except for:
6542
6543                   U+061C           Arabic Letter Mark
6544                   U+180E           Mongolian Vowel Separator
6545                   U+2066 - U+2069  Various "isolate"s
6546
6547
6548       [:print:] This matches the same  characters  as  [:graph:]  plus  space
6549                 characters  that  are  not controls, that is, characters with
6550                 the Zs property.
6551
6552       [:punct:] This matches all characters that have the Unicode P (punctua-
6553                 tion)  property,  plus those characters with code points less
6554                 than 256 that have the S (Symbol) property.
6555
6556       The other POSIX classes are unchanged, and match only  characters  with
6557       code points less than 256.
6558
6559
6560COMPATIBILITY FEATURE FOR WORD BOUNDARIES
6561
6562       In  the POSIX.2 compliant library that was included in 4.4BSD Unix, the
6563       ugly syntax [[:<:]] and [[:>:]] is used for matching  "start  of  word"
6564       and "end of word". PCRE2 treats these items as follows:
6565
6566         [[:<:]]  is converted to  \b(?=\w)
6567         [[:>:]]  is converted to  \b(?<=\w)
6568
6569       Only these exact character sequences are recognized. A sequence such as
6570       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
6571       support  is not compatible with Perl. It is provided to help migrations
6572       from other environments, and is best not used in any new patterns. Note
6573       that  \b matches at the start and the end of a word (see "Simple asser-
6574       tions" above), and in a Perl-style pattern the preceding  or  following
6575       character  normally  shows  which  is  wanted, without the need for the
6576       assertions that are used above in order to give exactly the  POSIX  be-
6577       haviour.
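
       The effect of the two converted forms can be checked in any engine
       that has \b and lookaround assertions. The following minimal sketch
       uses Python's built-in re module as a stand-in; this is an assumption
       made for illustration, because re is not PCRE2 and does not recognize
       [[:<:]] or [[:>:]] itself, so the converted forms are written out by
       hand. The helper name boundary_positions is hypothetical.

```python
import re

START_OF_WORD = r"\b(?=\w)"   # the form substituted for [[:<:]]
END_OF_WORD = r"\b(?<=\w)"    # the form substituted for [[:>:]]

def boundary_positions(pattern, subject):
    """Return the offsets at which a zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, subject)]

subject = "hello world"
starts = boundary_positions(START_OF_WORD, subject)
ends = boundary_positions(END_OF_WORD, subject)
print(starts, ends)
```

       For this subject the first list holds the offsets at which words
       begin, and the second holds the offsets just past their last
       characters.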
6578
6579
6580VERTICAL BAR
6581
6582       Vertical  bar characters are used to separate alternative patterns. For
6583       example, the pattern
6584
6585         gilbert|sullivan
6586
6587       matches either "gilbert" or "sullivan". Any number of alternatives  may
6588       appear,  and  an  empty  alternative  is  permitted (matching the empty
6589       string). The matching process tries each alternative in turn, from left
6590       to  right, and the first one that succeeds is used. If the alternatives
6591       are within a subpattern (defined below), "succeeds" means matching  the
6592       rest of the main pattern as well as the alternative in the subpattern.
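
       The left-to-right rule can be observed directly. The following is a
       minimal sketch that uses Python's built-in re module as a stand-in
       for PCRE2 (an assumption made for illustration: the two are separate
       implementations, but both try alternatives in order). The variable
       names are hypothetical.

```python
import re

# Alternatives are tried left to right; the first that succeeds is used,
# even when a later alternative would match more text.
first = re.match(r"gil|gilbert", "gilbert").group()

# Inside a group, "succeeds" includes matching what follows the group,
# so the shorter alternative is abandoned when "!" cannot follow it.
inner = re.match(r"(gil|gilbert)!", "gilbert!").group(1)

# An empty alternative matches the empty string.
empty = re.fullmatch(r"cat(aract|erpillar|)", "cat").group(1)

print(repr(first), repr(inner), repr(empty))
```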
6593
6594
6595INTERNAL OPTION SETTING
6596
6597       The  settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
6598       PCRE2_EXTENDED options (which are Perl-compatible) can be changed  from
6599       within  the  pattern  by  a  sequence  of  Perl option letters enclosed
6600       between "(?" and ")".  The option letters are
6601
6602         i  for PCRE2_CASELESS
6603         m  for PCRE2_MULTILINE
6604         s  for PCRE2_DOTALL
6605         x  for PCRE2_EXTENDED
6606
6607       For example, (?im) sets caseless, multiline matching. It is also possi-
6608       ble to unset these options by preceding the letter with a hyphen, and a
6609       combined setting and unsetting such as (?im-sx), which sets PCRE2_CASE-
6610       LESS    and    PCRE2_MULTILINE   while   unsetting   PCRE2_DOTALL   and
6611       PCRE2_EXTENDED, is also permitted. If a letter appears both before  and
6612       after  the  hyphen, the option is unset. An empty options setting "(?)"
6613       is allowed. Needless to say, it has no effect.
6614
6615       The PCRE2-specific options PCRE2_DUPNAMES  and  PCRE2_UNGREEDY  can  be
6616       changed  in  the  same  way as the Perl-compatible options by using the
6617       characters J and U respectively.
6618
6619       When one of these option changes occurs at  top  level  (that  is,  not
6620       inside  subpattern parentheses), the change applies to the remainder of
6621       the pattern that follows. If the change is placed right at the start of
6622       a  pattern,  PCRE2  extracts  it  into  the global options (and it will
6623       therefore show up in data extracted by the  pcre2_pattern_info()  func-
6624       tion).
6625
6626       An  option  change  within a subpattern (see below for a description of
6627       subpatterns) affects only that part of the subpattern that follows  it,
6628       so
6629
6630         (a(?i)b)c
6631
6632       matches  abc  and  aBc and no other strings (assuming PCRE2_CASELESS is
6633       not used).  By this means, options can be made to have  different  set-
6634       tings in different parts of the pattern. Any changes made in one alter-
6635       native do carry on into subsequent branches within the same subpattern.
6636       For example,
6637
6638         (a(?i)b|c)
6639
6640       matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
6641       first branch is abandoned before the option setting.  This  is  because
6642       the  effects  of option settings happen at compile time. There would be
6643       some very weird behaviour otherwise.
6644
6645       As a convenient shorthand, if any option settings are required  at  the
6646       start  of a non-capturing subpattern (see the next section), the option
6647       letters may appear between the "?" and the ":". Thus the two patterns
6648
6649         (?i:saturday|sunday)
6650         (?:(?i)saturday|sunday)
6651
6652       match exactly the same set of strings.
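
       A scoped option setting can be exercised as follows. This is a
       minimal sketch using Python's built-in re module, which is not PCRE2.
       Recent Python versions reject an option setting such as (?i) anywhere
       other than the start of a pattern, so only the (?i:...) form and a
       leading (?i) are shown, and a(?i:b)c stands in for the (a(?i)b)c
       example. The variable names are hypothetical.

```python
import re

# Scoped form: caseless matching applies to every branch inside the group.
days = re.compile(r"(?i:saturday|sunday)")
both_branches = [bool(days.fullmatch(s)) for s in ("Saturday", "SUNDAY")]

# Only the "b" is matched caselessly; "a" and "c" remain caseful.
partial = re.compile(r"a(?i:b)c")
results = {s: bool(partial.fullmatch(s))
           for s in ("abc", "aBc", "Abc", "abC")}

# A setting at the very start of the pattern applies to all of it.
whole = bool(re.fullmatch(r"(?i)abc", "ABC"))

print(both_branches, results, whole)
```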
6653
6654       Note: There are other PCRE2-specific options that can  be  set  by  the
6655       application when the compiling function is called. The pattern can con-
6656       tain special leading sequences such as (*CRLF)  to  override  what  the
6657       application  has  set  or what has been defaulted. Details are given in
6658       the section entitled "Newline sequences"  above.  There  are  also  the
6659       (*UTF)  and  (*UCP)  leading  sequences that can be used to set UTF and
6660       Unicode property modes; they are equivalent to  setting  the  PCRE2_UTF
6661       and  PCRE2_UCP  options, respectively. However, the application can set
6662       the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use
6663       of the (*UTF) and (*UCP) sequences.
6664
6665
6666SUBPATTERNS
6667
6668       Subpatterns are delimited by parentheses (round brackets), which can be
6669       nested.  Turning part of a pattern into a subpattern does two things:
6670
6671       1. It localizes a set of alternatives. For example, the pattern
6672
6673         cat(aract|erpillar|)
6674
6675       matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
6676       it would match "cataract", "erpillar" or an empty string.
6677
6678       2.  It  sets  up  the  subpattern as a capturing subpattern. This means
6679       that, when the whole pattern matches, the portion of the subject string
6680       that  matched  the  subpattern is passed back to the caller, separately
6681       from the portion that matched the whole pattern. (This applies only  to
6682       the  traditional  matching function; the DFA matching function does not
6683       support capturing.)
6684
6685       Opening parentheses are counted from left to right (starting from 1) to
6686       obtain  numbers  for  the  capturing  subpatterns.  For example, if the
6687       string "the red king" is matched against the pattern
6688
6689         the ((red|white) (king|queen))
6690
6691       the captured substrings are "red king", "red", and "king", and are num-
6692       bered 1, 2, and 3, respectively.
6693
6694       The  fact  that  plain  parentheses  fulfil two functions is not always
6695       helpful.  There are often times when a grouping subpattern is  required
6696       without  a capturing requirement. If an opening parenthesis is followed
6697       by a question mark and a colon, the subpattern does not do any  captur-
6698       ing,  and  is  not  counted when computing the number of any subsequent
6699       capturing subpatterns. For example, if the string "the white queen"  is
6700       matched against the pattern
6701
6702         the ((?:red|white) (king|queen))
6703
6704       the captured substrings are "white queen" and "queen", and are numbered
6705       1 and 2. The maximum number of capturing subpatterns is 65535.
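
       The numbering rules can be confirmed with the two patterns above.
       This minimal sketch uses Python's built-in re module as a stand-in
       for PCRE2 (an assumption made for illustration; group numbering by
       opening parenthesis works the same way in both). The variable names
       are hypothetical.

```python
import re

# Three capturing groups, numbered by the position of each "(".
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
numbered = capturing.groups()

# (?:...) groups without capturing, so later groups move down by one.
non_capturing = re.match(r"the ((?:red|white) (king|queen))",
                         "the white queen")
renumbered = non_capturing.groups()

print(numbered, renumbered)
```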
6706
6707       As a convenient shorthand, if any option settings are required  at  the
6708       start  of  a  non-capturing  subpattern,  the option letters may appear
6709       between the "?" and the ":". Thus the two patterns
6710
6711         (?i:saturday|sunday)
6712         (?:(?i)saturday|sunday)
6713
6714       match exactly the same set of strings. Because alternative branches are
6715       tried  from  left  to right, and options are not reset until the end of
6716       the subpattern is reached, an option setting in one branch does  affect
6717       subsequent  branches,  so  the above patterns match "SUNDAY" as well as
6718       "Saturday".
6719
6720
6721DUPLICATE SUBPATTERN NUMBERS
6722
6723       Perl 5.10 introduced a feature whereby each alternative in a subpattern
6724       uses  the same numbers for its capturing parentheses. Such a subpattern
6725       starts with (?| and is itself a non-capturing subpattern. For  example,
6726       consider this pattern:
6727
6728         (?|(Sat)ur|(Sun))day
6729
6730       Because  the two alternatives are inside a (?| group, both sets of cap-
6731       turing parentheses are numbered one. Thus, when  the  pattern  matches,
6732       you  can  look  at captured substring number one, whichever alternative
6733       matched. This construct is useful when you want to  capture  part,  but
6734       not all, of one of a number of alternatives. Inside a (?| group, paren-
6735       theses are numbered as usual, but the number is reset at the  start  of
6736       each  branch.  The numbers of any capturing parentheses that follow the
6737       subpattern start after the highest number used in any branch. The  fol-
6738       lowing example is taken from the Perl documentation. The numbers under-
6739       neath show in which buffer the captured content will be stored.
6740
6741         # before  ---------------branch-reset----------- after
6742         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
6743         # 1            2         2  3        2     3     4
6744
6745       A back reference to a numbered subpattern uses the  most  recent  value
6746       that  is  set  for that number by any subpattern. The following pattern
6747       matches "abcabc" or "defdef":
6748
6749         /(?|(abc)|(def))\1/
6750
6751       In contrast, a subroutine call to a numbered subpattern  always  refers
6752       to  the  first  one in the pattern with the given number. The following
6753       pattern matches "abcabc" or "defabc":
6754
6755         /(?|(abc)|(def))(?1)/
6756
6757       A relative reference such as (?-1) is no different: it is just a conve-
6758       nient way of computing an absolute group number.
6759
6760       If  a condition test for a subpattern's having matched refers to a non-
6761       unique number, the test is true if any of the subpatterns of that  num-
6762       ber have matched.
6763
6764       An  alternative approach to using this "branch reset" feature is to use
6765       duplicate named subpatterns, as described in the next section.
6766
6767
6768NAMED SUBPATTERNS
6769
6770       Identifying capturing parentheses by number is simple, but  it  can  be
6771       very  hard  to keep track of the numbers in complicated regular expres-
6772       sions. Furthermore, if an  expression  is  modified,  the  numbers  may
6773       change. To help with this difficulty, PCRE2 supports the naming of sub-
6774       patterns. This feature was not added to Perl until release 5.10. Python
6775       had  the feature earlier, and PCRE1 introduced it at release 4.0, using
6776       the Python syntax. PCRE2 supports both the Perl and the Python  syntax.
6777       Perl  allows  identically numbered subpatterns to have different names,
6778       but PCRE2 does not.
6779
6780       In PCRE2, a subpattern can be named in one of three ways:  (?<name>...)
6781       or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
6782       to capturing parentheses from other parts of the pattern, such as  back
6783       references,  recursion,  and conditions, can be made by name as well as
6784       by number.
6785
6786       Names consist of up to 32 alphanumeric characters and underscores,  but
6787       must  start  with  a  non-digit.  Named capturing parentheses are still
6788       allocated numbers as well as names, exactly as if the  names  were  not
6789       present. The PCRE2 API provides function calls for extracting the name-
6790       to-number translation table from a compiled  pattern.  There  are  also
6791       convenience functions for extracting a captured substring by name.
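
       The relationship between names and numbers can be illustrated with
       the Python syntax. The following minimal sketch uses Python's
       built-in re module rather than the PCRE2 API (an assumption made for
       illustration): its groupindex attribute plays the role of the
       name-to-number table, and group() accepts either a name or a number.
       The PCRE2 functions themselves are described in the pcre2api
       documentation.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups still receive numbers, as if the names were absent.
name_table = dict(date.groupindex)

m = date.match("2016-07")
by_name = m.group("year")
by_number = m.group(1)

print(name_table, by_name, by_number)
```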
6792
6793       By  default, a name must be unique within a pattern, but it is possible
6794       to relax this constraint by setting the PCRE2_DUPNAMES option  at  com-
6795       pile  time.  (Duplicate names are also always permitted for subpatterns
6796       with the same number, set up as described  in  the  previous  section.)
6797       Duplicate  names  can be useful for patterns where only one instance of
6798       the named parentheses can match.  Suppose you want to match the name of
6799       a  weekday,  either as a 3-letter abbreviation or as the full name, and
6800       in both cases you  want  to  extract  the  abbreviation.  This  pattern
6801       (ignoring the line breaks) does the job:
6802
6803         (?<DN>Mon|Fri|Sun)(?:day)?|
6804         (?<DN>Tue)(?:sday)?|
6805         (?<DN>Wed)(?:nesday)?|
6806         (?<DN>Thu)(?:rsday)?|
6807         (?<DN>Sat)(?:urday)?
6808
6809       There  are  five capturing substrings, but only one is ever set after a
6810       match.  (An alternative way of solving this problem is to use a "branch
6811       reset" subpattern, as described in the previous section.)
6812
6813       The convenience functions for extracting the data by name return the
6814       substring for the first (and in this example, the only)  subpattern  of
6815       that  name  that  matched.  This saves searching to find which numbered
6816       subpattern it was.
6817
6818       If you make a back reference to  a  non-unique  named  subpattern  from
6819       elsewhere  in the pattern, the subpatterns to which the name refers are
6820       checked in the order in which they appear in the overall  pattern.  The
6821       first one that is set is used for the reference. For example, this pat-
6822       tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
6823
6824         (?:(?<n>foo)|(?<n>bar))\k<n>
6825
6826
6827       If you make a subroutine call to a non-unique named subpattern, the one
6828       that  corresponds  to  the first occurrence of the name is used. In the
6829       absence of duplicate numbers (see the previous section) this is the one
6830       with the lowest number.
6831
6832       If you use a named reference in a condition test (see the section about
6833       conditions below), either to check whether a subpattern has matched, or
6834       to  check for recursion, all subpatterns with the same name are tested.
6835       If the condition is true for any one of them, the overall condition  is
6836       true.  This  is  the  same  behaviour as testing by number. For further
6837       details of the interfaces  for  handling  named  subpatterns,  see  the
6838       pcre2api documentation.
6839
6840       Warning: You cannot use different names to distinguish between two sub-
6841       patterns with the same number because PCRE2 uses only the numbers  when
6842       matching. For this reason, an error is given at compile time if differ-
6843       ent names are given to subpatterns with the same number.  However,  you
6844       can always give the same name to subpatterns with the same number, even
6845       when PCRE2_DUPNAMES is not set.
6846
6847
6848REPETITION
6849
6850       Repetition is specified by quantifiers, which can  follow  any  of  the
6851       following items:
6852
6853         a literal data character
6854         the dot metacharacter
6855         the \C escape sequence
6856         the \X escape sequence
6857         the \R escape sequence
6858         an escape such as \d or \pL that matches a single character
6859         a character class
6860         a back reference
6861         a parenthesized subpattern (including most assertions)
6862         a subroutine call to a subpattern (recursive or otherwise)
6863
6864       The  general repetition quantifier specifies a minimum and maximum num-
6865       ber of permitted matches, by giving the two numbers in  curly  brackets
6866       (braces),  separated  by  a comma. The numbers must be less than 65536,
6867       and the first must be less than or equal to the second. For example:
6868
6869         z{2,4}
6870
6871       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
6872       special  character.  If  the second number is omitted, but the comma is
6873       present, there is no upper limit; if the second number  and  the  comma
6874       are  both omitted, the quantifier specifies an exact number of required
6875       matches. Thus
6876
6877         [aeiou]{3,}
6878
6879       matches at least 3 successive vowels, but may match many more, whereas
6880
6881         \d{8}
6882
6883       matches exactly 8 digits. An opening curly bracket that  appears  in  a
6884       position  where a quantifier is not allowed, or one that does not match
6885       the syntax of a quantifier, is taken as a literal character. For  exam-
6886       ple, {,6} is not a quantifier, but a literal string of four characters.
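
       The three quantifier forms can be compared side by side. This is a
       minimal sketch using Python's built-in re module as a stand-in for
       PCRE2 (an assumption made for illustration). The {,6} case is left
       out deliberately, because Python treats it as a quantifier with a
       lower bound of zero, unlike the behaviour described above. The helper
       name matches is hypothetical.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

bounded = matches(r"z{2,4}", ["z", "zz", "zzz", "zzzz", "zzzzz"])
open_ended = matches(r"[aeiou]{3,}", ["ae", "aei", "aeiou"])
exact = matches(r"\d{8}", ["1234567", "12345678", "123456789"])

# A closing brace on its own is an ordinary character.
lone_brace = matches(r"a}", ["a}"])

print(bounded, open_ended, exact, lone_brace)
```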
6887
6888       In UTF modes, quantifiers apply to characters rather than to individual
6889       code units. Thus, for example, \x{100}{2} matches two characters,  each
6890       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
6891       larly, \X{3} matches three Unicode extended grapheme clusters, each  of
6892       which  may  be  several  code  units long (and they may be of different
6893       lengths).
6894
6895       The quantifier {0} is permitted, causing the expression to behave as if
6896       the previous item and the quantifier were not present. This may be use-
6897       ful for subpatterns that are referenced as subroutines  from  elsewhere
6898       in the pattern (but see also the section entitled "Defining subpatterns
6899       for use by reference only" below). Items other  than  subpatterns  that
6900       have a {0} quantifier are omitted from the compiled pattern.
6901
6902       For  convenience, the three most common quantifiers have single-charac-
6903       ter abbreviations:
6904
6905         *    is equivalent to {0,}
6906         +    is equivalent to {1,}
6907         ?    is equivalent to {0,1}
6908
6909       It is possible to construct infinite loops by  following  a  subpattern
6910       that can match no characters with a quantifier that has no upper limit,
6911       for example:
6912
6913         (a?)*
6914
6915       Earlier versions of Perl and PCRE1 used to give  an  error  at  compile
6916       time for such patterns. However, because there are cases where this can
6917       be useful, such patterns are now accepted, but if any repetition of the
6918       subpattern  does in fact match no characters, the loop is forcibly bro-
6919       ken.
6920
6921       By default, the quantifiers are "greedy", that is, they match  as  much
6922       as  possible  (up  to  the  maximum number of permitted times), without
6923       causing the rest of the pattern to fail. The classic example  of  where
6924       this gives problems is in trying to match comments in C programs. These
6925       appear between /* and */ and within the comment,  individual  *  and  /
6926       characters  may  appear. An attempt to match C comments by applying the
6927       pattern
6928
6929         /\*.*\*/
6930
6931       to the string
6932
6933         /* first comment */  not comment  /* second comment */
6934
6935       fails, because it matches the entire string owing to the greediness  of
6936       the .*  item.
6937
6938       If a quantifier is followed by a question mark, it ceases to be greedy,
6939       and instead matches the minimum number of times possible, so  the  pat-
6940       tern
6941
6942         /\*.*?\*/
6943
6944       does  the  right  thing with the C comments. The meaning of the various
6945       quantifiers is not otherwise changed,  just  the  preferred  number  of
6946       matches.   Do  not  confuse this use of question mark with its use as a
6947       quantifier in its own right. Because it has two uses, it can  sometimes
6948       appear doubled, as in
6949
6950         \d??\d
6951
6952       which matches one digit by preference, but can match two if that is the
6953       only way the rest of the pattern matches.
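
       The difference between greedy and minimizing repetition can be seen
       with the C comment subject. This minimal sketch uses Python's
       built-in re module as a stand-in for PCRE2 (an assumption made for
       illustration; both engines give these quantifiers the same meaning).
       The variable names are hypothetical.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/", so the whole string is one match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first "*/" that lets the match succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one_digit, two_digits)
```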
6954
6955       If the PCRE2_UNGREEDY option is set (an option that is not available in
6956       Perl),  the  quantifiers are not greedy by default, but individual ones
6957       can be made greedy by following them with a  question  mark.  In  other
6958       words, it inverts the default behaviour.
6959
6960       When  a  parenthesized  subpattern  is quantified with a minimum repeat
6961       count that is greater than 1 or with a limited maximum, more memory  is
6962       required  for  the  compiled  pattern, in proportion to the size of the
6963       minimum or maximum.
6964
6965       If a pattern starts with  .*  or  .{0,}  and  the  PCRE2_DOTALL  option
6966       (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
6967       lines, the pattern is implicitly  anchored,  because  whatever  follows
6968       will  be  tried against every character position in the subject string,
6969       so there is no point in retrying the  overall  match  at  any  position
6970       after the first. PCRE2 normally treats such a pattern as though it were
6971       preceded by \A.
6972
6973       In cases where it is known that the subject  string  contains  no  new-
6974       lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
6975       mization, or alternatively, using ^ to indicate anchoring explicitly.
6976
6977       However, there are some cases where the optimization  cannot  be  used.
6978       When .*  is inside capturing parentheses that are the subject of a back
6979       reference elsewhere in the pattern, a match at the start may fail where
6980       a later one succeeds. Consider, for example:
6981
6982         (.*)abc\1
6983
6984       If  the subject is "xyz123abc123" the match point is the fourth charac-
6985       ter. For this reason, such a pattern is not implicitly anchored.
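
       The match point can be checked by experiment. This minimal sketch
       uses Python's built-in re module as a stand-in for PCRE2 (an
       assumption made for illustration; re performs no implicit anchoring,
       so re.match() is used to show what forcing the match to start at
       offset zero would do). The variable names are hypothetical.

```python
import re

pattern = r"(.*)abc\1"
subject = "xyz123abc123"

# An unanchored search finds the match at the fourth character.
m = re.search(pattern, subject)
offset = m.start()
captured = m.group(1)

# Forcing the match to begin at offset zero makes it fail, because
# "xyz123" is not repeated after "abc".
anchored = re.match(pattern, subject)

print(offset, captured, anchored)
```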
6986
6987       Another case where implicit anchoring is not applied is when the  lead-
6988       ing  .* is inside an atomic group. Once again, a match at the start may
6989       fail where a later one succeeds. Consider this pattern:
6990
6991         (?>.*?a)b
6992
6993       It matches "ab" in the subject "aab". The use of the backtracking  con-
6994       trol  verbs  (*PRUNE)  and  (*SKIP) also disables this optimization, and
6995       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
6996
6997       When a capturing subpattern is repeated, the value captured is the sub-
6998       string that matched the final iteration. For example, after
6999
7000         (tweedle[dume]{3}\s*)+
7001
7002       has matched "tweedledum tweedledee" the value of the captured substring
7003       is "tweedledee". However, if there are  nested  capturing  subpatterns,
7004       the  corresponding captured values may have been set in previous itera-
7005       tions. For example, after
7006
7007         (a|(b))+
7008
7009       matches "aba" the value of the second captured substring is "b".
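
       Both examples can be reproduced as follows. This minimal sketch uses
       Python's built-in re module as a stand-in for PCRE2 (an assumption
       made for illustration; in this respect re also keeps the value set
       by an earlier iteration). The variable names are hypothetical.

```python
import re

# The outer group reports what its final iteration matched.
last = re.match(r"(tweedle[dume]{3}\s*)+",
                "tweedledum tweedledee").group(1)

# Group 2 takes no part in the final iteration, which matches "a",
# so it still holds the "b" from the previous iteration.
nested = re.match(r"(a|(b))+", "aba").groups()

print(last, nested)
```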
7010
7011
7012ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
7013
7014       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
7015       repetition,  failure  of what follows normally causes the repeated item
7016       to be re-evaluated to see if a different number of repeats  allows  the
7017       rest  of  the pattern to match. Sometimes it is useful to prevent this,
7018       either to change the nature of the match, or to cause it to fail earlier
7019       than  it otherwise might, when the author of the pattern knows there is
7020       no point in carrying on.
7021
7022       Consider, for example, the pattern \d+foo when applied to  the  subject
7023       line
7024
7025         123456bar
7026
7027       After matching all 6 digits and then failing to match "foo", the normal
7028       action of the matcher is to try again with only 5 digits  matching  the
7029       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
7030       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
7031       the  means for specifying that once a subpattern has matched, it is not
7032       to be re-evaluated in this way.
7033
7034       If we use atomic grouping for the previous example, the  matcher  gives
7035       up  immediately  on failing to match "foo" the first time. The notation
7036       is a kind of special parenthesis, starting with (?> as in this example:
7037
7038         (?>\d+)foo
7039
7040       This kind of parenthesis "locks up" the  part of the  pattern  it  con-
7041       tains  once  it  has matched, and a failure further into the pattern is
7042       prevented from backtracking into it. Backtracking past it  to  previous
7043       items, however, works as normal.
7044
7045       An  alternative  description  is that a subpattern of this type matches
7046       exactly the string of characters that an identical  standalone  pattern
7047       would match, if anchored at the current point in the subject string.
7048
7049       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
7050       such as the above example can be thought of as a maximizing repeat that
7051       must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
7052       pared to adjust the number of digits they match in order  to  make  the
7053       rest of the pattern match, (?>\d+) can only match an entire sequence of
7054       digits.
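
       The way an atomic group changes the nature of a match can be shown
       with a pattern that needs one digit to be given back. This is a
       minimal sketch using Python's built-in re module, not PCRE2 (an
       assumption made for illustration). Python accepts (?>...) only from
       version 3.11, so the hypothetical helper atomic() falls back to a
       lookahead followed by a back reference, a well-known emulation that
       adds a capturing group and therefore suits only this small
       demonstration.

```python
import re

def atomic(inner):
    """Return a pattern fragment that matches `inner` atomically."""
    try:
        re.compile("(?>x)")  # native syntax, Python 3.11 and later
        return "(?>" + inner + ")"
    except re.error:
        # Fallback: a lookahead is not re-entered once it has succeeded,
        # so capture inside it and consume the text with \1. This assumes
        # the fragment introduces the first capturing group.
        return "(?=(" + inner + r"))\1"

# Ordinary repetition gives back one digit so that the final \d can match.
ordinary = re.search(r"\d+\d", "123")

# The atomic version keeps every digit it swallowed, so \d cannot match.
locked = re.search(atomic(r"\d+") + r"\d", "123")

print(ordinary.group(), locked)
```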
7055
7056       Atomic groups in general can of course contain arbitrarily  complicated
7057       subpatterns,  and  can  be  nested. However, when the subpattern for an
7058       atomic group is just a single repeated item, as in the example above, a
7059       simpler  notation,  called  a "possessive quantifier" can be used. This
7060       consists of an additional + character  following  a  quantifier.  Using
7061       this notation, the previous example can be rewritten as
7062
7063         \d++foo
7064
7065       Note that a possessive quantifier can be used with an entire group, for
7066       example:
7067
7068         (abc|xyz){2,3}+
7069
7070       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
7071       PCRE2_UNGREEDY  option  is  ignored. They are a convenient notation for
7072       the simpler forms of atomic group. However, there is no  difference  in
7073       the meaning of a possessive quantifier and the equivalent atomic group,
7074       though there may be a performance  difference;  possessive  quantifiers
7075       should be slightly faster.
7076
7077       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
7078       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
7079       edition of his book. Mike McCloskey liked it, so implemented it when he
7080       built Sun's Java package, and PCRE1 copied it from there. It ultimately
7081       found its way into Perl at release 5.10.
7082
7083       PCRE2  has  an  optimization  that automatically "possessifies" certain
7084       simple pattern constructs. For example, the sequence A+B is treated  as
7085       A++B  because  there is no point in backtracking into a sequence of A's
7086       when B must follow. This feature can be disabled by setting the option
7087       PCRE2_NO_AUTO_POSSESS, or by starting the pattern with (*NO_AUTO_POSSESS).
7088
7089       When  a  pattern  contains an unlimited repeat inside a subpattern that
7090       can itself be repeated an unlimited number of  times,  the  use  of  an
7091       atomic  group  is  the  only way to avoid some failing matches taking a
7092       very long time indeed. The pattern
7093
7094         (\D+|<\d+>)*[!?]
7095
7096       matches an unlimited number of substrings that either consist  of  non-
7097       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
7098       matches, it runs quickly. However, if it is applied to
7099
7100         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
7101
7102       it takes a long time before reporting  failure.  This  is  because  the
7103       string  can be divided between the internal \D+ repeat and the external
7104       * repeat in a large number of ways, and all  have  to  be  tried.  (The
7105       example  uses  [!?]  rather than a single character at the end, because
7106       both PCRE2 and Perl have an optimization that allows for  fast  failure
7107       when  a single character is used. They remember the last single charac-
7108       ter that is required for a match, and fail early if it is  not  present
7109       in  the  string.)  If  the pattern is changed so that it uses an atomic
7110       group, like this:
7111
7112         ((?>\D+)|<\d+>)*[!?]
7113
7114       sequences of non-digits cannot be broken, and failure happens quickly.
7115
7116
7117BACK REFERENCES
7118
7119       Outside a character class, a backslash followed by a digit greater than
7120       0 (and possibly further digits) is a back reference to a capturing sub-
7121       pattern earlier (that is, to its left) in the pattern,  provided  there
7122       have been that many previous capturing left parentheses.
7123
7124       However,  if the decimal number following the backslash is less than 8,
7125       it is always taken as a back reference, and causes  an  error  only  if
7126       there  are  not that many capturing left parentheses in the entire pat-
7127       tern. In other words, the parentheses that are referenced need  not  be
7128       to  the  left of the reference for numbers less than 8. A "forward back
7129       reference" of this type can make sense when a  repetition  is  involved
7130       and  the  subpattern to the right has participated in an earlier itera-
7131       tion.
7132
7133       It is not possible to have a numerical "forward back  reference"  to  a
7134       subpattern  whose  number  is  8  or  more  using this syntax because a
7135       sequence such as \50 is interpreted as a character  defined  in  octal.
7136       See the subsection entitled "Non-printing characters" above for further
7137       details of the handling of digits following a backslash.  There  is  no
7138       such  problem  when named parentheses are used. A back reference to any
7139       subpattern is possible using named parentheses (see below).
7140
7141       Another way of avoiding the ambiguity inherent in  the  use  of  digits
7142       following  a  backslash  is  to use the \g escape sequence. This escape
7143       must be followed by an unsigned number or a negative number, optionally
7144       enclosed in braces. These examples are all identical:
7145
7146         (ring), \1
7147         (ring), \g1
7148         (ring), \g{1}
7149
7150       An  unsigned number specifies an absolute reference without the ambigu-
7151       ity that is present in the older syntax. It is also useful when literal
7152       digits follow the reference. A negative number is a relative reference.
7153       Consider this example:
7154
7155         (abc(def)ghi)\g{-1}
7156
       The sequence \g{-1} is a reference to the most recently started
       capturing subpattern before \g; that is, it is equivalent to \2 in this
       example. Similarly, \g{-2} would be equivalent to \1. Relative
       references can be helpful in long patterns, and also in patterns that
       are created by joining together fragments that contain references
       within themselves.
7163
7164       A  back  reference matches whatever actually matched the capturing sub-
7165       pattern in the current subject string, rather  than  anything  matching
7166       the subpattern itself (see "Subpatterns as subroutines" below for a way
7167       of doing that). So the pattern
7168
7169         (sens|respons)e and \1ibility
7170
7171       matches "sense and sensibility" and "response and responsibility",  but
7172       not  "sense and responsibility". If caseful matching is in force at the
7173       time of the back reference, the case of letters is relevant. For  exam-
7174       ple,
7175
7176         ((?i)rah)\s+\1
7177
7178       matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
7179       original capturing subpattern is matched caselessly.
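
       The difference can be checked with a short experiment. The sketch below
       is only an illustration and rests on an assumption: it uses Python's
       built-in re module, a different engine from PCRE2 that shares the \1
       syntax. Recent Python versions reject an unscoped (?i) in the middle of
       a pattern, so the second example uses the scoped form (?i:rah) instead.

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# The back reference \1 must repeat the text that group 1 captured.
sense = re.compile(r"(sens|respons)e and \1ibility")
same_word = sense.fullmatch("sense and sensibility") is not None
mixed_words = sense.fullmatch("sense and responsibility") is not None

# Group 1 is matched caselessly, but \1 itself is matched casefully.
rah = re.compile(r"((?i:rah))\s+\1")
results = {s: rah.fullmatch(s) is not None
           for s in ("rah rah", "RAH RAH", "RAH rah")}
```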
7180
7181       There are several different ways of writing back  references  to  named
7182       subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
7183       \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
7184       unified back reference syntax, in which \g can be used for both numeric
7185       and named references, is also supported. We  could  rewrite  the  above
7186       example in any of the following ways:
7187
7188         (?<p1>(?i)rah)\s+\k<p1>
7189         (?'p1'(?i)rah)\s+\k{p1}
7190         (?P<p1>(?i)rah)\s+(?P=p1)
7191         (?<p1>(?i)rah)\s+\g{p1}
7192
7193       A  subpattern  that  is  referenced  by  name may appear in the pattern
7194       before or after the reference.
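
       As a rough illustration, the Python spelling from the list above can be
       run directly in Python's own re module. This is an assumption-laden
       stand-in rather than PCRE2 itself: as far as we are aware, re accepts
       only the (?P<name>...) and (?P=name) forms, not \k or \g{name}, and the
       caseless option is again written in its scoped form.

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# A named group followed by a named back reference (Python spelling).
rah = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

m = rah.fullmatch("RAH RAH")
captured = m.group("p1") if m else None

# The back reference is caseful, so a change of case is rejected.
mismatch = rah.fullmatch("RAH rah")
```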
7195
7196       There may be more than one back reference to the same subpattern. If  a
7197       subpattern  has  not actually been used in a particular match, any back
7198       references to it always fail by default. For example, the pattern
7199
7200         (a|(bc))\2
7201
7202       always fails if it starts to match "a" rather than  "bc".  However,  if
7203       the  PCRE2_MATCH_UNSET_BACKREF  option  is  set at compile time, a back
7204       reference to an unset value matches an empty string.
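
       The default behaviour can be observed with the following sketch. It
       assumes Python's re module as a stand-in, which also fails a back
       reference to an unset group; it does not demonstrate the
       PCRE2_MATCH_UNSET_BACKREF option, for which re has no counterpart that
       we know of.

```python
import re

# Assumption: Python's re stands in for PCRE2's default behaviour.
pattern = re.compile(r"(a|(bc))\2")

# Group 2 takes part in the match, so \2 can repeat it.
set_case = pattern.fullmatch("bcbc")

# Group 1 matches "a" but group 2 is unset, so \2 fails.
unset_case = pattern.search("aa")
```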
7205
7206       Because there may be many capturing parentheses in a pattern, all  dig-
7207       its  following a backslash are taken as part of a potential back refer-
7208       ence number.  If the pattern continues with  a  digit  character,  some
7209       delimiter  must  be  used  to  terminate  the  back  reference.  If the
7210       PCRE2_EXTENDED option is set, this can be white space.  Otherwise,  the
7211       \g{ syntax or an empty comment (see "Comments" below) can be used.
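
       Two of these delimiters can be tried in Python's re module, used here
       as an assumed stand-in for PCRE2: the empty comment, and white space
       under the extended option (re.VERBOSE in Python). The \g{ form is not
       shown because re does not accept it inside a pattern.

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# An empty comment (?#) ends the reference, so the final "1" is a literal.
with_comment = re.compile(r"(a)\1(?#)1")

# In extended (verbose) mode, white space does the same job.
with_space = re.compile(r"(a)\1 1", re.VERBOSE)

matches = [p.fullmatch("aa1") is not None
           for p in (with_comment, with_space)]
```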
7212
7213   Recursive back references
7214
7215       A  back reference that occurs inside the parentheses to which it refers
7216       fails when the subpattern is first used, so, for example,  (a\1)  never
7217       matches.   However,  such references can be useful inside repeated sub-
7218       patterns. For example, the pattern
7219
7220         (a|b\1)+
7221
7222       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
7223       ation  of  the  subpattern,  the  back  reference matches the character
7224       string corresponding to the previous iteration. In order  for  this  to
7225       work,  the  pattern must be such that the first iteration does not need
7226       to match the back reference. This can be done using alternation, as  in
7227       the example above, or by a quantifier with a minimum of zero.
7228
7229       Back  references of this type cause the group that they reference to be
7230       treated as an atomic group.  Once the whole group has been  matched,  a
7231       subsequent  matching  failure cannot cause backtracking into the middle
7232       of the group.
7233
7234
7235ASSERTIONS
7236
7237       An assertion is a test on the characters  following  or  preceding  the
7238       current matching point that does not consume any characters. The simple
7239       assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are  described
7240       above.
7241
7242       More  complicated  assertions  are  coded as subpatterns. There are two
7243       kinds: those that look ahead of the current  position  in  the  subject
7244       string,  and  those  that  look  behind  it. An assertion subpattern is
7245       matched in the normal way, except that it does not  cause  the  current
7246       matching position to be changed.
7247
7248       Assertion  subpatterns are not capturing subpatterns. If such an asser-
7249       tion contains capturing subpatterns within it, these  are  counted  for
7250       the  purposes  of numbering the capturing subpatterns in the whole pat-
7251       tern. However, substring capturing is carried  out  only  for  positive
7252       assertions. (Perl sometimes, but not always, does do capturing in nega-
7253       tive assertions.)
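
       The numbering and capturing rules can be seen in a small experiment.
       The sketch assumes Python's re module as a stand-in for PCRE2; it
       agrees with the description above in that a group inside a negative
       assertion is counted but left unset.

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# The lookahead captures all the digits but consumes nothing;
# the trailing \d then consumes just one character.
m = re.match(r"(?=(\d+))\d", "123")
whole, captured = m.group(0), m.group(1)

# A group inside a negative assertion is numbered but never set.
n = re.match(r"(?!(a))b", "b")
groups_in_pattern = n.re.groups
negative_capture = n.group(1)
```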
7254
       For compatibility with Perl, most assertion subpatterns may be
       repeated; though it makes no sense to assert the same thing several
       times, the side effect of capturing parentheses may occasionally be
       useful. However, an assertion that forms the condition for a
       conditional subpattern may not be quantified. In practice, for other
       assertions, there are only three cases:
7261
7262       (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during
7263       matching.  However, it may  contain  internal  capturing  parenthesized
7264       groups that are called from elsewhere via the subroutine mechanism.
7265
       (2) If the quantifier is {0,n} where n is greater than zero, it is
       treated as if it were {0,1}. At run time, the rest of the pattern match
       is tried with and without the assertion, the order depending on the
       greediness of the quantifier.
7270
7271       (3) If the minimum repetition is greater than zero, the  quantifier  is
7272       ignored.   The  assertion  is  obeyed just once when encountered during
7273       matching.
7274
7275   Lookahead assertions
7276
7277       Lookahead assertions start with (?= for positive assertions and (?! for
7278       negative assertions. For example,
7279
7280         \w+(?=;)
7281
7282       matches  a word followed by a semicolon, but does not include the semi-
7283       colon in the match, and
7284
7285         foo(?!bar)
7286
7287       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
7288       that the apparently similar pattern
7289
7290         (?!foo)bar
7291
7292       does  not  find  an  occurrence  of "bar" that is preceded by something
7293       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
7294       the assertion (?!foo) is always true when the next three characters are
7295       "bar". A lookbehind assertion is needed to achieve the other effect.
7296
7297       If you want to force a matching failure at some point in a pattern, the
7298       most  convenient  way  to  do  it  is with (?!) because an empty string
7299       always matches, so an assertion that requires there not to be an  empty
7300       string must always fail.  The backtracking control verb (*FAIL) or (*F)
7301       is a synonym for (?!).
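
       The lookahead examples in this subsection can be exercised as follows.
       The sketch assumes Python's re module as a stand-in for PCRE2; the
       (*FAIL) verb is not shown because re does not provide it.

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# \w+(?=;) stops before the semicolon and leaves it unconsumed.
word = re.search(r"\w+(?=;)", "int x;").group()

# foo(?!bar) skips the "foo" that starts "foobar".
foo_starts = [m.start()
              for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds the "bar" in "foobar": at that point the next
# three characters are "bar", so the assertion is true.
bar_start = re.search(r"(?!foo)bar", "foobar").start()

# (?!) can never succeed, so the search finds nothing.
forced_failure = re.search(r"(?!)", "any subject")
```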
7302
7303   Lookbehind assertions
7304
7305       Lookbehind assertions start with (?<= for positive assertions and  (?<!
7306       for negative assertions. For example,
7307
7308         (?<!foo)bar
7309
7310       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
7311       contents of a lookbehind assertion are restricted  such  that  all  the
7312       strings it matches must have a fixed length. However, if there are sev-
7313       eral top-level alternatives, they do not all  have  to  have  the  same
7314       fixed length. Thus
7315
7316         (?<=bullock|donkey)
7317
7318       is permitted, but
7319
7320         (?<!dogs?|cats?)
7321
7322       causes  an  error at compile time. Branches that match different length
7323       strings are permitted only at the top level of a lookbehind  assertion.
7324       This is an extension compared with Perl, which requires all branches to
7325       match the same length of string. An assertion such as
7326
7327         (?<=ab(c|de))
7328
7329       is not permitted, because its single top-level  branch  can  match  two
7330       different  lengths,  but  it is acceptable to PCRE2 if rewritten to use
7331       two top-level branches:
7332
7333         (?<=abc|abde)
7334
7335       In some cases, the escape sequence \K (see above) can be  used  instead
7336       of a lookbehind assertion to get round the fixed-length restriction.
7337
7338       The  implementation  of lookbehind assertions is, for each alternative,
7339       to temporarily move the current position back by the fixed  length  and
7340       then try to match. If there are insufficient characters before the cur-
7341       rent position, the assertion fails.
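
       One consequence is worth spelling out: when too few characters precede
       the current position, the attempt to match the contents of the
       lookbehind fails, so a positive lookbehind is false and a negative
       lookbehind is true. The sketch below shows this using Python's re
       module, which is assumed here to behave like PCRE2 in this respect.

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# Only two characters precede "d", so the three-character lookbehind
# cannot be satisfied and the positive assertion fails.
too_short = re.search(r"(?<=abc)d", "bcd")

# The same shortage makes a negative lookbehind succeed.
at_start = re.search(r"(?<!foo)bar", "bar")

# With enough context the negative lookbehind rejects "foobar".
after_foo = re.search(r"(?<!foo)bar", "foobar")
```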
7342
7343       In a UTF mode, PCRE2 does not allow the \C escape (which matches a sin-
7344       gle  code  unit even in a UTF mode) to appear in lookbehind assertions,
7345       because it makes it impossible to calculate the length of  the  lookbe-
7346       hind.  The \X and \R escapes, which can match different numbers of code
7347       units, are also not permitted.
7348
7349       "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
7350       lookbehinds,  as  long as the subpattern matches a fixed-length string.
7351       Recursion, however, is not supported.
7352
7353       Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
7354       assertions to specify efficient matching of fixed-length strings at the
7355       end of subject strings. Consider a simple pattern such as
7356
7357         abcd$
7358
7359       when applied to a long string that does  not  match.  Because  matching
7360       proceeds  from  left to right, PCRE2 will look for each "a" in the sub-
7361       ject and then see if what follows matches the rest of the  pattern.  If
7362       the pattern is specified as
7363
7364         ^.*abcd$
7365
7366       the  initial .* matches the entire string at first, but when this fails
7367       (because there is no following "a"), it backtracks to match all but the
7368       last  character,  then all but the last two characters, and so on. Once
7369       again the search for "a" covers the entire string, from right to  left,
7370       so we are no better off. However, if the pattern is written as
7371
7372         ^.*+(?<=abcd)
7373
7374       there can be no backtracking for the .*+ item because of the possessive
7375       quantifier; it can match only the entire string. The subsequent lookbe-
7376       hind  assertion  does  a single test on the last four characters. If it
7377       fails, the match fails immediately. For  long  strings,  this  approach
7378       makes a significant difference to the processing time.
7379
7380   Using multiple assertions
7381
7382       Several assertions (of any sort) may occur in succession. For example,
7383
7384         (?<=\d{3})(?<!999)foo
7385
       matches "foo" preceded by three digits that are not "999". Notice that
       each of the assertions is applied independently at the same point in
       the subject string. First there is a check that the previous three
       characters are all digits, and then there is a check that the same
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and the
       last three of which are not "999". For example, it doesn't match
       "123abcfoo". A pattern to do that is
7394
7395         (?<=\d{3}...)(?<!999)foo
7396
7397       This time the first assertion looks at the  preceding  six  characters,
7398       checking that the first three are digits, and then the second assertion
7399       checks that the preceding three characters are not "999".
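
       The two patterns can be compared side by side on a few subjects. The
       sketch assumes Python's re module as a stand-in for PCRE2; both
       lookbehinds are fixed-length, so they are accepted by either engine.

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.
three_digits = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

# One True/False result per subject, in the order listed above.
first = [three_digits.search(s) is not None for s in subjects]
second = [six_back.search(s) is not None for s in subjects]
```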
7400
7401       Assertions can be nested in any combination. For example,
7402
7403         (?<=(?<!foo)bar)baz
7404
7405       matches an occurrence of "baz" that is preceded by "bar" which in  turn
7406       is not preceded by "foo", while
7407
7408         (?<=\d{3}(?!999)...)foo
7409
7410       is  another pattern that matches "foo" preceded by three digits and any
7411       three characters that are not "999".
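
       Both nested forms can be tried out in the same way. As before, Python's
       re module is an assumed stand-in for PCRE2 in this sketch.

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# A lookbehind that itself contains a negative lookbehind.
baz = re.compile(r"(?<=(?<!foo)bar)baz")

# A lookbehind that contains a negative lookahead.
foo = re.compile(r"(?<=\d{3}(?!999)...)foo")

baz_results = [baz.search(s) is not None
               for s in ("barbaz", "foobarbaz")]
foo_results = [foo.search(s) is not None
               for s in ("123abcfoo", "123999foo")]
```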
7412
7413
7414CONDITIONAL SUBPATTERNS
7415
7416       It is possible to cause the matching process to obey a subpattern  con-
7417       ditionally  or to choose between two alternative subpatterns, depending
7418       on the result of an assertion, or whether a specific capturing  subpat-
7419       tern  has  already  been matched. The two possible forms of conditional
7420       subpattern are:
7421
7422         (?(condition)yes-pattern)
7423         (?(condition)yes-pattern|no-pattern)
7424
7425       If the condition is satisfied, the yes-pattern is used;  otherwise  the
7426       no-pattern  (if  present)  is used. If there are more than two alterna-
7427       tives in the subpattern, a compile-time error occurs. Each of  the  two
7428       alternatives may itself contain nested subpatterns of any form, includ-
7429       ing  conditional  subpatterns;  the  restriction  to  two  alternatives
7430       applies only at the level of the condition. This pattern fragment is an
7431       example where the alternatives are complex:
7432
7433         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
7434
7435
7436       There are five kinds of condition: references  to  subpatterns,  refer-
7437       ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
7438       and assertions.
7439
7440   Checking for a used subpattern by number
7441
7442       If the text between the parentheses consists of a sequence  of  digits,
7443       the condition is true if a capturing subpattern of that number has pre-
7444       viously matched. If there is more than one  capturing  subpattern  with
7445       the  same  number  (see  the earlier section about duplicate subpattern
7446       numbers), the condition is true if any of them have matched. An  alter-
7447       native  notation is to precede the digits with a plus or minus sign. In
7448       this case, the subpattern number is relative rather than absolute.  The
7449       most  recently opened parentheses can be referenced by (?(-1), the next
7450       most recent by (?(-2), and so on. Inside loops it can also  make  sense
7451       to refer to subsequent groups. The next parentheses to be opened can be
7452       referenced as (?(+1), and so on. (The value zero in any of these  forms
7453       is not used; it provokes a compile-time error.)
7454
7455       Consider  the  following  pattern, which contains non-significant white
7456       space to make it more readable (assume the PCRE2_EXTENDED  option)  and
7457       to divide it into three parts for ease of discussion:
7458
7459         ( \( )?    [^()]+    (?(1) \) )
7460
       The first part matches an optional opening parenthesis, and if that
       character is present, sets it as the first captured substring. The
       second part matches one or more characters that are not parentheses.
       The third part is a conditional subpattern that tests whether or not
       the first set of parentheses matched. If they did, that is, if the
       subject started with an opening parenthesis, the condition is true, and
       so the yes-pattern is executed and a closing parenthesis is required.
       Otherwise, since no-pattern is not present, the subpattern matches
       nothing. In other words, this pattern matches a sequence of
       non-parentheses, optionally enclosed in parentheses.
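
       A quick way to see the condition at work is to apply the pattern to
       whole subjects. The sketch assumes Python's re module as a stand-in for
       PCRE2, with re.VERBOSE playing the part of PCRE2_EXTENDED; relative
       conditions such as (?(-1) are not shown because re does not accept
       them.

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# Extended (verbose) mode lets the three parts be spaced out.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )",
                             re.VERBOSE)

# Each subject must be matched in full to count as a success.
outcomes = {s: optional_parens.fullmatch(s) is not None
            for s in ("abc", "(abc)", "(abc", "abc)")}
```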
7471
7472       If you were embedding this pattern in a larger one,  you  could  use  a
7473       relative reference:
7474
7475         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
7476
7477       This  makes  the  fragment independent of the parentheses in the larger
7478       pattern.
7479
7480   Checking for a used subpattern by name
7481
7482       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
7483       used  subpattern  by  name.  For compatibility with earlier versions of
7484       PCRE1, which had this facility before Perl, the syntax (?(name)...)  is
7485       also recognized.
7486
7487       Rewriting the above example to use a named subpattern gives this:
7488
7489         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
7490
7491       If  the  name used in a condition of this kind is a duplicate, the test
7492       is applied to all subpatterns of the same name, and is true if any  one
7493       of them has matched.
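
       The bare-name form (?(name)...) happens to be the one that Python's re
       module understands, so it can be used for a small experiment. This is a
       sketch under that assumption; the group is declared with the Python
       spelling (?P<OPEN>...) because re does not accept (?<OPEN>...).

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# The condition names the group instead of giving its number.
named = re.compile(r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )",
                   re.VERBOSE)

closed = named.fullmatch("(abc)")
unclosed = named.fullmatch("(abc")
```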
7494
7495   Checking for pattern recursion
7496
7497       If the condition is the string (R), and there is no subpattern with the
7498       name R, the condition is true if a recursive call to the whole  pattern
7499       or any subpattern has been made. If digits or a name preceded by amper-
7500       sand follow the letter R, for example:
7501
7502         (?(R3)...) or (?(R&name)...)
7503
7504       the condition is true if the most recent recursion is into a subpattern
7505       whose number or name is given. This condition does not check the entire
7506       recursion stack. If the name used in a condition  of  this  kind  is  a
7507       duplicate, the test is applied to all subpatterns of the same name, and
7508       is true if any one of them is the most recent recursion.
7509
7510       At "top level", all these recursion test  conditions  are  false.   The
7511       syntax for recursive patterns is described below.
7512
7513   Defining subpatterns for use by reference only
7514
7515       If  the  condition  is  the string (DEFINE), and there is no subpattern
7516       with the name DEFINE, the condition is  always  false.  In  this  case,
7517       there  may  be  only  one  alternative  in the subpattern. It is always
7518       skipped if control reaches this point  in  the  pattern;  the  idea  of
7519       DEFINE  is that it can be used to define subroutines that can be refer-
7520       enced from elsewhere. (The use of subroutines is described below.)  For
7521       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
7522       could be written like this (ignore white space and line breaks):
7523
7524         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
7525         \b (?&byte) (\.(?&byte)){3} \b
7526
       The first part of the pattern is a DEFINE group inside which another
       group named "byte" is defined. This matches an individual component of
       an IPv4 address (a number less than 256). When matching takes place,
       this part of the pattern is skipped because DEFINE acts like a false
       condition. The rest of the pattern uses references to the named group
       to match the four dot-separated components of an IPv4 address,
       insisting on a word boundary at each end.
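
       The idea of defining a fragment once and reusing it can be imitated in
       engines that lack DEFINE. The sketch below does so in Python's re
       module by plain string substitution; this is an emulation chosen for
       illustration, not the PCRE2 mechanism, and the names BYTE and ipv4 are
       our own.

```python
import re

# Assumption: Python's re has no DEFINE group or (?&name) call, so the
# "byte" definition is reused by string substitution instead.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)

valid = ipv4.fullmatch("192.168.23.245") is not None
too_big = ipv4.fullmatch("256.1.1.1") is not None
too_few = ipv4.fullmatch("1.2.3") is not None
```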
7534
7535   Checking the PCRE2 version
7536
7537       Programs that link with a PCRE2 library can check the version by  call-
7538       ing  pcre2_config()  with  appropriate arguments. Users of applications
7539       that do not have access to the underlying code cannot do this.  A  spe-
7540       cial  "condition" called VERSION exists to allow such users to discover
7541       which version of PCRE2 they are dealing with by using this condition to
7542       match  a string such as "yesno". VERSION must be followed either by "="
7543       or ">=" and a version number.  For example:
7544
7545         (?(VERSION>=10.4)yes|no)
7546
       This pattern matches "yes" if the PCRE2 version is greater than or
       equal to 10.4, or "no" otherwise. The fractional part of the version
       number may not contain more than two digits.
7550
7551   Assertion conditions
7552
7553       If the condition is not in any of the above  formats,  it  must  be  an
7554       assertion.   This may be a positive or negative lookahead or lookbehind
7555       assertion. Consider  this  pattern,  again  containing  non-significant
7556       white space, and with the two alternatives on the second line:
7557
7558         (?(?=[^a-z]*[a-z])
7559         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
7560
7561       The  condition  is  a  positive  lookahead  assertion  that  matches an
7562       optional sequence of non-letters followed by a letter. In other  words,
7563       it  tests  for the presence of at least one letter in the subject. If a
7564       letter is found, the subject is matched against the first  alternative;
7565       otherwise  it  is  matched  against  the  second.  This pattern matches
7566       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
7567       letters and dd are digits.
7568
7569
7570COMMENTS
7571
7572       There are two ways of including comments in patterns that are processed
7573       by PCRE2. In both cases, the start of the comment  must  not  be  in  a
7574       character  class,  nor  in  the middle of any other sequence of related
7575       characters such as (?: or a subpattern name or number.  The  characters
7576       that make up a comment play no part in the pattern matching.
7577
7578       The  sequence (?# marks the start of a comment that continues up to the
7579       next closing parenthesis. Nested parentheses are not permitted. If  the
7580       PCRE2_EXTENDED  option is set, an unescaped # character also introduces
7581       a comment, which in this case continues to immediately after  the  next
7582       newline  character  or character sequence in the pattern. Which charac-
7583       ters are interpreted as newlines is controlled by an option  passed  to
7584       the  compiling  function  or  by a special sequence at the start of the
7585       pattern, as described in the  section  entitled  "Newline  conventions"
7586       above.  Note  that the end of this type of comment is a literal newline
7587       sequence in the pattern; escape sequences that happen  to  represent  a
7588       newline   do  not  count.  For  example,  consider  this  pattern  when
7589       PCRE2_EXTENDED is set, and the default  newline  convention  (a  single
7590       linefeed character) is in force:
7591
7592         abc #comment \n still comment
7593
7594       On  encountering  the # character, pcre2_compile() skips along, looking
7595       for a newline in the pattern. The sequence \n is still literal at  this
7596       stage,  so  it does not terminate the comment. Only an actual character
7597       with the code value 0x0a (the default newline) does so.
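
       The same distinction between an escape sequence and an actual newline
       can be observed in Python's re module, which is used below as an
       assumed stand-in for PCRE2 (re.VERBOSE corresponds to PCRE2_EXTENDED).

```python
import re

# Assumption: Python's re stands in for PCRE2 in this sketch.

# The raw string keeps \n as two characters, backslash and "n", so the
# comment runs to the end of the pattern and only "abc" is left.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Here the pattern contains a real linefeed, which ends the comment,
# so "def" on the next line is part of the pattern.
literal = re.compile("abc #comment\ndef", re.VERBOSE)

# The (?#...) form works without the extended option.
inline = re.compile(r"ab(?#a comment)c")
```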
7598
7599
7600RECURSIVE PATTERNS
7601
7602       Consider the problem of matching a string in parentheses, allowing  for
7603       unlimited  nested  parentheses.  Without the use of recursion, the best
7604       that can be done is to use a pattern that  matches  up  to  some  fixed
7605       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
7606       depth.
7607
7608       For some time, Perl has provided a facility that allows regular expres-
7609       sions  to recurse (amongst other things). It does this by interpolating
7610       Perl code in the expression at run time, and the code can refer to  the
7611       expression itself. A Perl pattern using code interpolation to solve the
7612       parentheses problem can be created like this:
7613
7614         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
7615
7616       The (?p{...}) item interpolates Perl code at run time, and in this case
7617       refers recursively to the pattern in which it appears.
7618
7619       Obviously,  PCRE2  cannot  support  the  interpolation  of  Perl  code.
7620       Instead, it supports special syntax for recursion of  the  entire  pat-
7621       tern, and also for individual subpattern recursion. After its introduc-
7622       tion in PCRE1 and Python,  this  kind  of  recursion  was  subsequently
7623       introduced into Perl at release 5.10.
7624
7625       A  special  item  that consists of (? followed by a number greater than
7626       zero and a closing parenthesis is a recursive subroutine  call  of  the
7627       subpattern  of  the  given  number, provided that it occurs inside that
7628       subpattern. (If not, it is a non-recursive subroutine  call,  which  is
7629       described  in  the  next  section.)  The special item (?R) or (?0) is a
7630       recursive call of the entire regular expression.
7631
7632       This PCRE2 pattern solves the nested parentheses  problem  (assume  the
7633       PCRE2_EXTENDED option is set so that white space is ignored):
7634
7635         \( ( [^()]++ | (?R) )* \)
7636
7637       First  it matches an opening parenthesis. Then it matches any number of
7638       substrings which can either be a  sequence  of  non-parentheses,  or  a
7639       recursive  match  of the pattern itself (that is, a correctly parenthe-
7640       sized substring).  Finally there is a closing parenthesis. Note the use
7641       of a possessive quantifier to avoid backtracking into sequences of non-
7642       parentheses.
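
       To make the structure of the recursion concrete, the three parts of the
       pattern can be mirrored by a hand-written recursive function. The
       Python sketch below is not how PCRE2 works internally; match_parens is
       a hypothetical helper that reports where a match attempt starting at a
       given index would end.

```python
def match_parens(s: str, i: int = 0) -> int:
    """Return the index just past a balanced group at s[i], or -1."""
    # \(  : the group must start with an opening parenthesis.
    if i >= len(s) or s[i] != "(":
        return -1
    i += 1
    # ( [^()]++ | (?R) )*  : non-parentheses or nested groups, repeated.
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_parens(s, i)   # plays the part of (?R)
            if i < 0:
                return -1
        else:
            i += 1                   # one character of [^()]++
    # \)  : the group must be closed before the subject runs out.
    return i + 1 if i < len(s) else -1
```

       For example, match_parens("(ab(cd)ef)") returns 10, the length of the
       subject, whereas match_parens("(a(b)") returns -1 because the outer
       group is never closed.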
7643
7644       If this were part of a larger pattern, you would not  want  to  recurse
7645       the entire pattern, so instead you could use this:
7646
7647         ( \( ( [^()]++ | (?1) )* \) )
7648
7649       We  have  put the pattern into parentheses, and caused the recursion to
7650       refer to them instead of the whole pattern.
7651
7652       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
7653       tricky.  This is made easier by the use of relative references. Instead
7654       of (?1) in the pattern above you can write (?-2) to refer to the second
7655       most  recently  opened  parentheses  preceding  the recursion. In other
7656       words, a negative number counts capturing  parentheses  leftwards  from
7657       the point at which it is encountered.
7658
       Be aware, however, that if duplicate subpattern numbers are in use,
       relative references refer to the earliest subpattern with the
       appropriate number. Consider, for example:
7662
7663         (?|(a)|(b)) (c) (?-2)
7664
       The first two capturing groups (a) and (b) are both numbered 1, and
       group (c) is number 2. When the reference (?-2) is encountered, the
       second most recently opened set of parentheses has the number 1, but it
       is the first such group (the (a) group) to which the recursion refers.
       This would be the same if an absolute reference (?1) were used. In
       other words, relative references are just a shorthand for computing a
       group number.
7672
7673       It  is  also  possible  to refer to subsequently opened parentheses, by
7674       writing references such as (?+2). However, these  cannot  be  recursive
7675       because  the  reference  is  not inside the parentheses that are refer-
7676       enced. They are always non-recursive subroutine calls, as described  in
7677       the next section.
7678
7679       An  alternative  approach  is to use named parentheses. The Perl syntax
7680       for this is (?&name); PCRE1's earlier syntax  (?P>name)  is  also  sup-
7681       ported. We could rewrite the above example as follows:
7682
7683         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
7684
7685       If  there  is more than one subpattern with the same name, the earliest
7686       one is used.
7687
7688       The example pattern that we have been looking at contains nested unlim-
7689       ited  repeats,  and  so the use of a possessive quantifier for matching
7690       strings of non-parentheses is important when applying  the  pattern  to
7691       strings that do not match. For example, when this pattern is applied to
7692
7693         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
7694
7695       it  yields  "no  match" quickly. However, if a possessive quantifier is
7696       not used, the match runs for a very long time indeed because there  are
7697       so  many  different  ways the + and * repeats can carve up the subject,
7698       and all have to be tested before failure can be reported.
7699
7700       At the end of a match, the values of capturing  parentheses  are  those
7701       from  the outermost level. If you want to obtain intermediate values, a
7702       callout function can be used (see below and the pcre2callout documenta-
7703       tion). If the pattern above is matched against
7704
7705         (ab(cd)ef)
7706
7707       the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
7708       which is the last value taken on at the top level. If a capturing  sub-
7709       pattern  is  not  matched at the top level, its final captured value is
7710       unset, even if it was (temporarily) set at a deeper  level  during  the
7711       matching process.
7712
7713       If there are more than 15 capturing parentheses in a pattern, PCRE2 has
7714       to obtain extra memory from the heap to store data during a  recursion.
7715       If   no   memory   can   be   obtained,   the   match  fails  with  the
7716       PCRE2_ERROR_NOMEMORY error.
7717
7718       Do not confuse the (?R) item with the condition (R),  which  tests  for
7719       recursion.   Consider  this pattern, which matches text in angle brack-
7720       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
7721       brackets  (that is, when recursing), whereas any characters are permit-
7722       ted at the outer level.
7723
7724         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
7725
7726       In this pattern, (?(R) is the start of a conditional  subpattern,  with
7727       two  different  alternatives for the recursive and non-recursive cases.
7728       The (?R) item is the actual recursive call.
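
       The roles of the two items can be separated in ordinary code. In the
       Python sketch below, which is an illustration rather than PCRE2's
       implementation, the recursive call corresponds to (?R) and the
       recursing flag corresponds to the (R) condition. The names match_angle
       and DIGITS are our own.

```python
DIGITS = "0123456789"


def match_angle(s: str, i: int = 0, recursing: bool = False) -> int:
    """Return the index just past the bracketed text at s[i], or -1."""
    if i >= len(s) or s[i] != "<":
        return -1
    i += 1
    while i < len(s) and s[i] != ">":
        if s[i] == "<":
            # Plays the part of (?R): a nested call with the flag set.
            i = match_angle(s, i, recursing=True)
            if i < 0:
                return -1
        elif recursing and s[i] not in DIGITS:
            # Plays the part of (?(R) \d++ ...): digits only when nested.
            return -1
        else:
            i += 1
    return i + 1 if i < len(s) else -1
```

       An attempt at the start of "<abc<123>def>" succeeds, while the same
       attempt on "<abc<12x>def>" fails because of the letter inside the
       nested brackets.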
7729
7730   Differences in recursion processing between PCRE2 and Perl
7731
7732       Recursion processing in PCRE2 differs from Perl in two important  ways.
7733       In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is
7734       always treated as an atomic group. That is, once it has matched some of
7735       the subject string, it is never re-entered, even if it contains untried
7736       alternatives and there is a subsequent matching failure.  This  can  be
7737       illustrated  by the following pattern, which purports to match a palin-
7738       dromic string that contains an odd number of characters  (for  example,
7739       "a", "aba", "abcba", "abcdcba"):
7740
7741         ^(.|(.)(?1)\2)$
7742
       The idea is that it either matches a single character, or two identical
       characters surrounding a sub-palindrome. In Perl, this pattern works;
       in PCRE2 it does not if the subject string is longer than three
       characters. Consider the subject string "abcba":
7747
7748       At the top level, the first character is matched, but as it is  not  at
7749       the end of the string, the first alternative fails; the second alterna-
7750       tive is taken and the recursion kicks in. The recursive call to subpat-
7751       tern  1  successfully  matches the next character ("b"). (Note that the
7752       beginning and end of line tests are not part of the recursion.)
7753
7754       Back at the top level, the next character ("c") is compared  with  what
7755       subpattern  2 matched, which was "a". This fails. Because the recursion
7756       is treated as an atomic group, there are now  no  backtracking  points,
7757       and  so  the  entire  match fails. (Perl is able, at this point, to re-
7758       enter the recursion and try the second alternative.)  However,  if  the
7759       pattern is written with the alternatives in the other order, things are
7760       different:
7761
7762         ^((.)(?1)\2|.)$
7763
7764       This time, the recursing alternative is tried first, and  continues  to
7765       recurse  until  it runs out of characters, at which point the recursion
7766       fails. But this time we do have  another  alternative  to  try  at  the
7767       higher  level.  That  is  the  big difference: in the previous case the
7768       remaining alternative is at a deeper recursion level, which PCRE2  can-
7769       not use.
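
       The difference between the two orderings can be reproduced with a
       small Python model of these two patterns. It is only a sketch under
       stated assumptions, not PCRE2 code: the names group1_ends() and
       is_match() are invented for this example, and "atomic" is modelled
       by keeping just the first way in which a recursive call matches.
       Passing atomic=False models the re-entrant behaviour described for
       Perl.

```python
def group1_ends(subject, pos, recurse_first, atomic):
    """Yield each offset at which group 1 can end when it starts at pos,
    in the order a backtracking matcher would find them."""
    def single_char():                      # the alternative "."
        if pos < len(subject):
            yield pos + 1

    def wrapped():                          # the alternative "(.)(?1)\2"
        if pos >= len(subject):
            return
        inner = group1_ends(subject, pos + 1, recurse_first, atomic)
        if atomic:
            # An atomic call keeps only the first way the callee matched.
            first = next(inner, None)
            inner = [] if first is None else [first]
        for end in inner:
            if subject[end:end + 1] == subject[pos]:    # back reference \2
                yield end + 1

    if recurse_first:
        alternatives = (wrapped, single_char)
    else:
        alternatives = (single_char, wrapped)
    for alternative in alternatives:
        yield from alternative()


def is_match(subject, recurse_first, atomic=True):
    """Model ^( ... )$: group 1 starts at 0 and must end at the subject end."""
    return any(end == len(subject)
               for end in group1_ends(subject, 0, recurse_first, atomic))
```

       With this model, is_match("abcba", recurse_first=False) is false,
       while the same call with recurse_first=True, or with atomic=False,
       is true, which agrees with the description above.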
7770
7771       To  change  the pattern so that it matches all palindromic strings, not
7772       just those with an odd number of characters, it is tempting  to  change
7773       the pattern to this:
7774
7775         ^((.)(?1)\2|.?)$
7776
7777       Again,  this  works in Perl, but not in PCRE2, and for the same reason.
7778       When a deeper recursion has matched a single character,  it  cannot  be
7779       entered  again  in  order  to match an empty string. The solution is to
7780       separate the two cases, and write out the odd and even cases as  alter-
7781       natives at the higher level:
7782
7783         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
7784
7785       If  you  want  to match typical palindromic phrases, the pattern has to
7786       ignore all non-word characters, which can be done like this:
7787
7788         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
7789
7790       If run with the PCRE2_CASELESS option,  this  pattern  matches  phrases
7791       such  as  "A  man, a plan, a canal: Panama!" and it works in both PCRE2
7792       and Perl. Note the use of the possessive quantifier *+ to  avoid  back-
7793       tracking  into  sequences  of  non-word characters. Without this, PCRE2
7794       takes a great deal longer (ten times or more) to match typical phrases,
7795       and Perl takes so long that you think it has gone into a loop.
7796
7797       WARNING:  The  palindrome-matching patterns above work only if the sub-
7798       ject string does not start with a palindrome that is shorter  than  the
7799       entire  string.  For example, although "abcba" is correctly matched, if
7800       the subject is "ababa", PCRE2 finds the palindrome "aba" at the  start,
7801       then  fails at top level because the end of the string does not follow.
7802       Once again, it cannot jump back into the recursion to try other  alter-
7803       natives, so the entire match fails.
7804
7805       The second way in which PCRE2 and Perl differ in their recursion
7806       processing is in the handling of captured values. In Perl, when a
7807       subpattern is called recursively or as a subroutine (see the next
7808       section), it has no access to any values that were captured outside the
7809       recursion, whereas in PCRE2 these values can be referenced. Consider
7810       this pattern:
7811
7812         ^(.)(\1|a(?2))
7813
7814       In PCRE2, this pattern matches "bab". The first capturing parentheses
7815       match "b"; then, in the second group, the back reference \1 (now "b")
7816       fails to match the next character "a", so the second alternative matches
7817       "a" and then recurses. In the recursion, \1 does now match "b" and so
7818       the whole match succeeds. In Perl, the pattern fails to match because
7819       inside the recursive call \1 cannot access the externally set value.
7820
7821
7822SUBPATTERNS AS SUBROUTINES
7823
7824       If  the  syntax for a recursive subpattern call (either by number or by
7825       name) is used outside the parentheses to which it refers,  it  operates
7826       like  a subroutine in a programming language. The called subpattern may
7827       be defined before or after the reference. A numbered reference  can  be
7828       absolute or relative, as in these examples:
7829
7830         (...(absolute)...)...(?2)...
7831         (...(relative)...)...(?-1)...
7832         (...(?+1)...(relative)...)
7833
7834       An earlier example pointed out that the pattern
7835
7836         (sens|respons)e and \1ibility
7837
7838       matches  "sense and sensibility" and "response and responsibility", but
7839       not "sense and responsibility". If instead the pattern
7840
7841         (sens|respons)e and (?1)ibility
7842
7843       is used, it does match "sense and responsibility" as well as the  other
7844       two  strings.  Another  example  is  given  in the discussion of DEFINE
7845       above.
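
       The distinction can be tried out in any regex engine. The sketch
       below uses Python's standard re module, which has no subroutine
       call syntax, so the call (?1) is emulated by writing the text of
       group 1 out a second time as a non-capturing group. That
       substitution, and the names used, are assumptions of this example
       rather than PCRE2 usage.

```python
import re

SUBJECTS = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# A back reference must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# A subroutine call re-runs the subpattern; here it is emulated by
# repeating the subpattern text in a non-capturing group.
subroutine = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

backref_hits = [s for s in SUBJECTS if backref.fullmatch(s)]
subroutine_hits = [s for s in SUBJECTS if subroutine.fullmatch(s)]
```

       Here backref_hits holds only the first two subjects, whereas
       subroutine_hits holds all three.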
7846
7847       All subroutine calls, whether recursive or not, are always  treated  as
7848       atomic  groups. That is, once a subroutine has matched some of the sub-
7849       ject string, it is never re-entered, even if it contains untried alter-
7850       natives  and  there  is  a  subsequent  matching failure. Any capturing
7851       parentheses that are set during the subroutine  call  revert  to  their
7852       previous values afterwards.
7853
7854       Processing  options  such as case-independence are fixed when a subpat-
7855       tern is defined, so if it is used as a subroutine, such options  cannot
7856       be changed for different calls. For example, consider this pattern:
7857
7858         (abc)(?i:(?-1))
7859
7860       It  matches  "abcabc". It does not match "abcABC" because the change of
7861       processing option does not affect the called subpattern.
7862
7863
7864ONIGURUMA SUBROUTINE SYNTAX
7865
7866       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
7867       name or a number enclosed either in angle brackets or single quotes, is
7868       an alternative syntax for referencing a  subpattern  as  a  subroutine,
7869       possibly  recursively. Here are two of the examples used above, rewrit-
7870       ten using this syntax:
7871
7872         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
7873         (sens|respons)e and \g'1'ibility
7874
7875       PCRE2 supports an extension to Oniguruma: if a number is preceded by  a
7876       plus or a minus sign it is taken as a relative reference. For example:
7877
7878         (abc)(?i:\g<-1>)
7879
7880       Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
7881       synonymous. The former is a back reference; the latter is a  subroutine
7882       call.
7883
7884
7885CALLOUTS
7886
7887       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
7888       Perl code to be obeyed in the middle of matching a regular  expression.
7889       This makes it possible, amongst other things, to extract different sub-
7890       strings that match the same pair of parentheses when there is a repeti-
7891       tion.
7892
7893       PCRE2  provides  a  similar feature, but of course it cannot obey arbi-
7894       trary Perl code. The feature is called "callout". The caller  of  PCRE2
7895       provides  an  external  function  by putting its entry point in a match
7896       context using the function pcre2_set_callout(), and then  passing  that
7897       context  to  pcre2_match() or pcre2_dfa_match(). If no match context is
7898       passed, or if the callout entry point is set to NULL, callouts are dis-
7899       abled.
7900
7901       Within  a  regular expression, (?C<arg>) indicates a point at which the
7902       external function is to be called. There  are  two  kinds  of  callout:
7903       those  with a numerical argument and those with a string argument. (?C)
7904       on its own with no argument is treated as (?C0). A  numerical  argument
7905       allows  the  application  to  distinguish  between  different callouts.
7906       String arguments were added for release 10.20 to make it  possible  for
7907       script  languages that use PCRE2 to embed short scripts within patterns
7908       in a similar way to Perl.
7909
7910       During matching, when PCRE2 reaches a callout point, the external func-
7911       tion  is  called.  It is provided with the number or string argument of
7912       the callout, the position in the pattern, and one item of data that  is
7913       also set in the match block. The callout function may cause matching to
7914       proceed, to backtrack, or to fail.
7915
7916       By default, PCRE2 implements a  number  of  optimizations  at  matching
7917       time,  and  one  side-effect is that sometimes callouts are skipped. If
7918       you need all possible callouts to happen, you need to set options  that
7919       disable  the relevant optimizations. More details, including a complete
7920       description of the programming interface to the callout  function,  are
7921       given in the pcre2callout documentation.
7922
7923   Callouts with numerical arguments
7924
7925       If  you  just  want  to  have  a means of identifying different callout
7926       points, put a number less than 256 after the  letter  C.  For  example,
7927       this pattern has two callout points:
7928
7929         (?C1)abc(?C2)def
7930
7931       If  the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
7932       callouts are automatically installed before each item in  the  pattern.
7933       They  are all numbered 255. If there is a conditional group in the pat-
7934       tern whose condition is an assertion, an additional callout is inserted
7935       just  before the condition. An explicit callout may also be set at this
7936       position, as in this example:
7937
7938         (?(?C9)(?=a)abc|def)
7939
7940       Note that this applies only to assertion conditions, not to other types
7941       of condition.
7942
7943   Callouts with string arguments
7944
7945       A  delimited  string may be used instead of a number as a callout argu-
7946       ment. The starting delimiter must be one of ` ' " ^ % #  $  {  and  the
7947       ending delimiter is the same as the start, except for {, where the end-
7948       ing delimiter is }. If  the  ending  delimiter  is  needed  within  the
7949       string, it must be doubled. For example:
7950
7951         (?C'ab ''c'' d')xyz(?C{any text})pqr
7952
7953       The  doubling  is  removed  before  the string is passed to the callout
7954       function.
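
       The delimiter rules can be summarized in code. The following
       Python function is a hypothetical helper written for this
       description (it is not part of PCRE2 or its API); it assumes it is
       given the pattern text starting at the opening delimiter, that is,
       just after "(?C".

```python
# Opening delimiter -> closing delimiter, as listed in the text above.
CLOSERS = {"`": "`", "'": "'", '"': '"', "^": "^",
           "%": "%", "#": "#", "$": "$", "{": "}"}


def parse_callout_string(text):
    """Parse a delimited callout string that starts at text[0].

    Returns (string_with_doubling_removed, offset_after_closing_delimiter).
    """
    if not text or text[0] not in CLOSERS:
        raise ValueError("not a callout string delimiter")
    closer = CLOSERS[text[0]]
    out = []
    i = 1
    while i < len(text):
        if text[i] == closer:
            if text[i + 1:i + 2] == closer:   # doubled delimiter: keep one
                out.append(closer)
                i += 2
                continue
            return "".join(out), i + 1        # single delimiter: end of string
        out.append(text[i])
        i += 1
    raise ValueError("missing terminating delimiter")
```

       Applied to the first callout in the example pattern, the helper
       returns the string "ab 'c' d" with the doubling removed.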
7955
7956
7957BACKTRACKING CONTROL
7958
7959       Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
7960       which  are  still  described in the Perl documentation as "experimental
7961       and subject to change or removal in a future version of Perl". It  goes
7962       on  to  say:  "Their  usage in production code should be noted to avoid
7963       problems during upgrades." The same remarks apply to the PCRE2 features
7964       described in this section.
7965
7966       The  new verbs make use of what was previously invalid syntax: an open-
7967       ing parenthesis followed by an asterisk. They are generally of the form
7968       (*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving
7969       differently depending on whether or not a name is present.
7970
7971       By default, for compatibility with Perl, a  name  is  any  sequence  of
7972       characters that does not include a closing parenthesis. The name is not
7973       processed in any way, and it is  not  possible  to  include  a  closing
7974       parenthesis in the name.  However, if the PCRE2_ALT_VERBNAMES option is
7975       set, normal backslash processing is applied to verb names and  only  an
7976       unescaped  closing parenthesis terminates the name. A closing parenthe-
7977       sis can be included in a name either as \) or between \Q and \E. If the
7978       PCRE2_EXTENDED  option  is  set,  unescaped whitespace in verb names is
7979       skipped and #-comments are recognized, exactly as in the  rest  of  the
7980       pattern.
7981
7982       The  maximum  length of a name is 255 in the 8-bit library and 65535 in
7983       the 16-bit and 32-bit libraries. If the name is empty, that is, if  the
7984       closing  parenthesis immediately follows the colon, the effect is as if
7985       the colon were not there. Any number of these verbs may occur in a pat-
7986       tern.
7987
7988       Since  these  verbs  are  specifically related to backtracking, most of
7989       them can be used only when the pattern is to be matched using the
7990       traditional matching function, because it uses a backtracking algorithm.
7991       With the exception of (*FAIL), which behaves like  a  failing  negative
7992       assertion, the backtracking control verbs cause an error if encountered
7993       by the DFA matching function.
7994
7995       The behaviour of these verbs in repeated  groups,  assertions,  and  in
7996       subpatterns called as subroutines (whether or not recursively) is docu-
7997       mented below.
7998
7999   Optimizations that affect backtracking verbs
8000
8001       PCRE2 contains some optimizations that are used to speed up matching by
8002       running some checks at the start of each match attempt. For example, it
8003       may know the minimum length of a matching subject, or that a particular
8004       character must be present. When one of these optimizations bypasses the
8005       running of a match,  any  included  backtracking  verbs  will  not,  of
8006       course, be processed. You can suppress the start-of-match optimizations
8007       by setting the PCRE2_NO_START_OPTIMIZE option when  calling  pcre2_com-
8008       pile(),  or by starting the pattern with (*NO_START_OPT). There is more
8009       discussion of this option in the section entitled "Compiling a pattern"
8010       in the pcre2api documentation.
8011
8012       Experiments  with  Perl  suggest that it too has similar optimizations,
8013       sometimes leading to anomalous results.
8014
8015   Verbs that act immediately
8016
8017       The following verbs act as soon as they are encountered. They  may  not
8018       be followed by a name.
8019
8020          (*ACCEPT)
8021
8022       This  verb causes the match to end successfully, skipping the remainder
8023       of the pattern. However, when it is inside a subpattern that is  called
8024       as  a  subroutine, only that subpattern is ended successfully. Matching
8025       then continues at the outer level. If (*ACCEPT) is triggered in a
8026       positive assertion, the assertion succeeds; in a negative assertion, the
8027       assertion fails.
8028
8029       If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap-
8030       tured. For example:
8031
8032         A((?:A|B(*ACCEPT)|C)D)
8033
8034       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
8035       tured by the outer parentheses.
8036
8037         (*FAIL) or (*F)
8038
8039       This verb causes a matching failure, forcing backtracking to occur.  It
8040       is  equivalent to (?!) but easier to read. The Perl documentation notes
8041       that it is probably useful only when combined  with  (?{})  or  (??{}).
8042       Those  are, of course, Perl features that are not present in PCRE2. The
8043       nearest equivalent is the callout feature, as for example in this  pat-
8044       tern:
8045
8046         a+(?C)(*FAIL)
8047
8048       A  match  with the string "aaaa" always fails, but the callout is taken
8049       before each backtrack happens (in this example, 10 times).
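
       The count of 10 can be checked by hand or with a short Python
       model. The sketch assumes that a match is attempted at every
       starting offset and that the callout function always lets matching
       proceed; count_callouts() is a name invented for this example.

```python
def count_callouts(subject):
    """Count how often the callout in a+(?C)(*FAIL) is reached."""
    calls = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters beginning at this offset.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" at a time;
        # the callout is reached once for each length from run down to 1.
        calls += run
    return calls
```

       For "aaaa" the four starting offsets contribute 4, 3, 2, and 1
       callouts respectively, giving the total of 10.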
8050
8051   Recording which path was taken
8052
8053       There is one verb whose main purpose  is  to  track  how  a  match  was
8054       arrived  at,  though  it  also  has a secondary use in conjunction with
8055       advancing the match starting point (see (*SKIP) below).
8056
8057         (*MARK:NAME) or (*:NAME)
8058
8059       A name is always  required  with  this  verb.  There  may  be  as  many
8060       instances  of  (*MARK) as you like in a pattern, and their names do not
8061       have to be unique.
8062
8063       When a match succeeds, the name of the  last-encountered  (*MARK:NAME),
8064       (*PRUNE:NAME),  or  (*THEN:NAME) on the matching path is passed back to
8065       the caller as described in  the  section  entitled  "Other  information
8066       about  the  match" in the pcre2api documentation. Here is an example of
8067       pcre2test output, where the "mark" modifier requests the retrieval  and
8068       outputting of (*MARK) data:
8069
8070           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
8071         data> XY
8072          0: XY
8073         MK: A
8074         data> XZ
8075          0: XZ
8076         MK: B
8077
8078       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
8079       ple it indicates which of the two alternatives matched. This is a  more
8080       efficient  way of obtaining this information than putting each alterna-
8081       tive in its own capturing parentheses.
8082
8083       If a verb with a name is encountered in a positive  assertion  that  is
8084       true,  the  name  is recorded and passed back if it is the last-encoun-
8085       tered. This does not happen for negative assertions or failing positive
8086       assertions.
8087
8088       After  a  partial match or a failed match, the last encountered name in
8089       the entire match process is returned. For example:
8090
8091           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
8092         data> XP
8093         No match, mark = B
8094
8095       Note that in this unanchored example the  mark  is  retained  from  the
8096       match attempt that started at the letter "X" in the subject. Subsequent
8097       match attempts starting at "P" and then with an empty string do not get
8098       as far as the (*MARK) item, but nevertheless do not reset it.
8099
8100       If  you  are  interested  in  (*MARK)  values after failed matches, you
8101       should probably set the PCRE2_NO_START_OPTIMIZE option (see  above)  to
8102       ensure that the match is always attempted.
8103
8104   Verbs that act after backtracking
8105
8106       The following verbs do nothing when they are encountered. Matching con-
8107       tinues with what follows, but if there is no subsequent match,  causing
8108       a  backtrack  to  the  verb, a failure is forced. That is, backtracking
8109       cannot pass to the left of the verb. However, when one of  these  verbs
8110       appears inside an atomic group (which includes any group that is called
8111       as a subroutine) or in an assertion that is true, its  effect  is  con-
8112       fined  to that group, because once the group has been matched, there is
8113       never any backtracking into it. In this situation, backtracking has  to
8114       jump to the left of the entire atomic group or assertion.
8115
8116       These  verbs  differ  in exactly what kind of failure occurs when back-
8117       tracking reaches them. The behaviour described below  is  what  happens
8118       when  the  verb is not in a subroutine or an assertion. Subsequent sec-
8119       tions cover these special cases.
8120
8121         (*COMMIT)
8122
8123       This verb, which may not be followed by a name, causes the whole  match
8124       to fail outright if there is a later matching failure that causes back-
8125       tracking to reach it. Even if the pattern  is  unanchored,  no  further
8126       attempts to find a match by advancing the starting point take place. If
8127       (*COMMIT) is the only backtracking verb that is  encountered,  once  it
8128       has  been  passed  pcre2_match() is committed to finding a match at the
8129       current starting point, or not at all. For example:
8130
8131         a+(*COMMIT)b
8132
8133       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
8134       of dynamic anchor, or "I've started, so I must finish." The name of the
8135       most recently passed (*MARK) in the path is passed back when  (*COMMIT)
8136       forces a match failure.
8137
8138       If  there  is more than one backtracking verb in a pattern, a different
8139       one that follows (*COMMIT) may be triggered first,  so  merely  passing
8140       (*COMMIT) during a match does not always guarantee that a match must be
8141       at this starting point.
8142
8143       Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
8144       anchor,  unless PCRE2's start-of-match optimizations are turned off, as
8145       shown in this output from pcre2test:
8146
8147           re> /(*COMMIT)abc/
8148         data> xyzabc
8149          0: abc
8150         data>
8151           re> /(*COMMIT)abc/no_start_optimize
8152         data> xyzabc
8153         No match
8154
8155       For the first pattern, PCRE2 knows that any match must start with  "a",
8156       so  the optimization skips along the subject to "a" before applying the
8157       pattern to the first set of data. The match attempt then succeeds.  The
8158       second  pattern disables the optimization that skips along to the first
8159       character. The pattern is now applied  starting  at  "x",  and  so  the
8160       (*COMMIT)  causes  the  match to fail without trying any other starting
8161       points.
8162
8163         (*PRUNE) or (*PRUNE:NAME)
8164
8165       This verb causes the match to fail at the current starting position  in
8166       the subject if there is a later matching failure that causes backtrack-
8167       ing to reach it. If the pattern is unanchored, the  normal  "bumpalong"
8168       advance  to  the next starting character then happens. Backtracking can
8169       occur as usual to the left of (*PRUNE), before it is reached,  or  when
8170       matching  to  the  right  of  (*PRUNE), but if there is no match to the
8171       right, backtracking cannot cross (*PRUNE). In simple cases, the use  of
8172       (*PRUNE)  is just an alternative to an atomic group or possessive quan-
8173       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
8174       any  other  way. In an anchored pattern (*PRUNE) has the same effect as
8175       (*COMMIT).
8176
8177       The   behaviour   of   (*PRUNE:NAME)   is    not    the    same    as
8178       (*MARK:NAME)(*PRUNE).   It  is  like  (*MARK:NAME)  in that the name is
8179       remembered for  passing  back  to  the  caller.  However,  (*SKIP:NAME)
8180       searches  only  for  names  set  with  (*MARK),  ignoring  those set by
8181       (*PRUNE) or (*THEN).
8182
8183         (*SKIP)
8184
8185       This verb, when given without a name, is like (*PRUNE), except that  if
8186       the  pattern  is unanchored, the "bumpalong" advance is not to the next
8187       character, but to the position in the subject where (*SKIP) was encoun-
8188       tered.  (*SKIP)  signifies that whatever text was matched leading up to
8189       it cannot be part of a successful match. Consider:
8190
8191         a+(*SKIP)b
8192
8193       If the subject is "aaaac...",  after  the  first  match  attempt  fails
8194       (starting  at  the  first  character in the string), the starting point
8195       skips on to start the next attempt at "c". Note that a possessive
8196       quantifier does not have the same effect as this example; although it
8197       would suppress backtracking during the first match attempt, the second
8198       attempt would start at the second character instead of skipping on to
8199       "c".
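
       The bumpalong behaviour of these verbs can be compared with a
       small Python model of the pattern a+(*VERB)b. It is a simplified
       sketch, not PCRE2 code: try_match() is an invented name, and the
       model assumes that start-of-match optimizations are disabled, so
       that every starting offset up to the end of the subject is tried
       unless a verb says otherwise.

```python
def try_match(subject, verb=None):
    """Model an unanchored match of a+(*VERB)b.

    verb is "COMMIT", "PRUNE", "SKIP", or None for the possessive a++b.
    Returns (start_offsets_tried, (match_start, match_end) or None).
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                      # a+ takes the whole run of "a"s
        if end > start:                   # a+ matched, so the verb was passed
            if subject[end:end + 1] == "b":
                return tried, (start, end + 1)
            if verb == "COMMIT":
                return tried, None        # no further starting points at all
            if verb == "SKIP":
                start = end               # bump along to where (*SKIP) was
                continue
        start += 1                        # normal one-character bumpalong
    return tried, None
```

       For the subject "aaaac", the model tries offsets 0, 4, and 5 with
       "SKIP", but offsets 0 to 5 inclusive with "PRUNE" or with no verb.
       With "COMMIT" and the subject "aacaab", only offset 0 is tried.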
8200
8201         (*SKIP:NAME)
8202
8203       When (*SKIP) has an associated name, its behaviour is modified. When it
8204       is triggered, the previous path through the pattern is searched for the
8205       most recent (*MARK) that has the  same  name.  If  one  is  found,  the
8206       "bumpalong" advance is to the subject position that corresponds to that
8207       (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
8208       a matching name is found, the (*SKIP) is ignored.
8209
8210       Note  that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
8211       ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
8212
8213         (*THEN) or (*THEN:NAME)
8214
8215       This verb causes a skip to the next innermost  alternative  when  back-
8216       tracking  reaches  it.  That  is,  it  cancels any further backtracking
8217       within the current alternative. Its name  comes  from  the  observation
8218       that it can be used for a pattern-based if-then-else block:
8219
8220         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
8221
8222       If  the COND1 pattern matches, FOO is tried (and possibly further items
8223       after the end of the group if FOO succeeds); on  failure,  the  matcher
8224       skips  to  the second alternative and tries COND2, without backtracking
8225       into COND1. If that succeeds and BAR fails, COND3 is tried.  If  subse-
8226       quently  BAZ fails, there are no more alternatives, so there is a back-
8227       track to whatever came before the  entire  group.  If  (*THEN)  is  not
8228       inside an alternation, it acts like (*PRUNE).
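
       The if-then-else reading can be modelled in Python. The sketch
       below is an illustration under stated assumptions, not PCRE2 code:
       first_then_branch() is an invented name, each COND and body is
       given as an ordinary regular expression for the standard re
       module, and nothing is assumed to follow the group (so the case
       where a later failure backtracks into FOO is not modelled).

```python
import re


def first_then_branch(subject, branches, pos=0):
    """Model ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | ... ) at offset pos.

    branches is a list of (cond, body) regex strings. Returns
    (branch_index, end_offset) for the first branch that succeeds, or None.
    """
    for index, (cond, body) in enumerate(branches):
        cond_match = re.compile(cond).match(subject, pos)
        if cond_match is None:
            continue                  # COND failed: ordinary next alternative
        # (*THEN) has been passed: only the first way COND matched is used.
        body_match = re.compile(body).match(subject, cond_match.end())
        if body_match is not None:
            return index, body_match.end()
        # Body failed: skip to the next alternative, never back into COND.
    return None
```

       For example, with the branches [(r"\d+", r"2em"), (r"\d", r"\d*em")]
       and the subject "12em", the first branch fails, even though the
       plain pattern \d+2em would match by backtracking into \d+, and the
       second branch then succeeds.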
8229
8230       The    behaviour   of   (*THEN:NAME)   is    not    the    same    as
8231       (*MARK:NAME)(*THEN).  It is like  (*MARK:NAME)  in  that  the  name  is
8232       remembered  for  passing  back  to  the  caller.  However, (*SKIP:NAME)
8233       searches only for  names  set  with  (*MARK),  ignoring  those  set  by
8234       (*PRUNE) and (*THEN).
8235
8236       A  subpattern that does not contain a | character is just a part of the
8237       enclosing alternative; it is not a nested  alternation  with  only  one
8238       alternative.  The effect of (*THEN) extends beyond such a subpattern to
8239       the enclosing alternative. Consider this pattern, where A, B, etc.  are
8240       complex  pattern fragments that do not contain any | characters at this
8241       level:
8242
8243         A (B(*THEN)C) | D
8244
8245       If A and B are matched, but there is a failure in C, matching does  not
8246       backtrack into A; instead it moves to the next alternative, that is, D.
8247       However, if the subpattern containing (*THEN) is given an  alternative,
8248       it behaves differently:
8249
8250         A (B(*THEN)C | (*FAIL)) | D
8251
8252       The  effect of (*THEN) is now confined to the inner subpattern. After a
8253       failure in C, matching moves to (*FAIL), which causes the whole subpat-
8254       tern  to  fail  because  there are no more alternatives to try. In this
8255       case, matching does now backtrack into A.
8256
8257       Note that a conditional subpattern is  not  considered  as  having  two
8258       alternatives,  because  only  one  is  ever used. In other words, the |
8259       character in a conditional subpattern has a different meaning. Ignoring
8260       white space, consider:
8261
8262         ^.*? (?(?=a) a | b(*THEN)c )
8263
8264       If  the  subject  is  "ba", this pattern does not match. Because .*? is
8265       ungreedy, it initially matches zero  characters.  The  condition  (?=a)
8266       then  fails,  the  character  "b"  is  matched, but "c" is not. At this
8267       point, matching does not backtrack to .*? as might perhaps be  expected
8268       from  the  presence  of  the | character. The conditional subpattern is
8269       part of the single alternative that comprises the whole pattern, and so
8270       the  match  fails.  (If  there was a backtrack into .*?, allowing it to
8271       match "b", the match would succeed.)
8272
8273       The verbs just described provide four different "strengths" of  control
8274       when subsequent matching fails. (*THEN) is the weakest, carrying on the
8275       match at the next alternative. (*PRUNE) comes next, failing  the  match
8276       at  the  current starting position, but allowing an advance to the next
8277       character (for an unanchored pattern). (*SKIP) is similar, except  that
8278       the advance may be more than one character. (*COMMIT) is the strongest,
8279       causing the entire match to fail.
8280
8281   More than one backtracking verb
8282
8283       If more than one backtracking verb is present in  a  pattern,  the  one
8284       that  is  backtracked  onto first acts. For example, consider this pat-
8285       tern, where A, B, etc. are complex pattern fragments:
8286
8287         (A(*COMMIT)B(*THEN)C|ABD)
8288
8289       If A matches but B fails, the backtrack to (*COMMIT) causes the  entire
8290       match to fail. However, if A and B match, but C fails, the backtrack to
8291       (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour
8292       is  consistent,  but is not always the same as Perl's. It means that if
8293       two or more backtracking verbs appear in succession, all but the last
8294       of them have no effect. Consider this example:
8295
8296         ...(*COMMIT)(*PRUNE)...
8297
8298       If there is a matching failure to the right, backtracking onto (*PRUNE)
8299       causes it to be triggered, and its action is taken. There can never  be
8300       a backtrack onto (*COMMIT).
8301
8302   Backtracking verbs in repeated groups
8303
8304       PCRE2  differs  from  Perl  in  its  handling  of backtracking verbs in
8305       repeated groups. For example, consider:
8306
8307         /(a(*COMMIT)b)+ac/
8308
8309       If the subject is "abac", Perl matches, but  PCRE2  fails  because  the
8310       (*COMMIT) in the second repeat of the group acts.
8311
8312   Backtracking verbs in assertions
8313
8314       (*FAIL)  in  an assertion has its normal effect: it forces an immediate
8315       backtrack.
8316
8317       (*ACCEPT) in a positive assertion causes the assertion to succeed with-
8318       out  any  further processing. In a negative assertion, (*ACCEPT) causes
8319       the assertion to fail without any further processing.
8320
8321       The other backtracking verbs are not treated specially if  they  appear
8322       in  a  positive  assertion.  In  particular,  (*THEN) skips to the next
8323       alternative in the innermost enclosing  group  that  has  alternations,
8324       whether or not this is within the assertion.
8325
8326       Negative  assertions  are,  however, different, in order to ensure that
8327       changing a positive assertion into a  negative  assertion  changes  its
8328       result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg-
8329       ative assertion to be true, without considering any further alternative
8330       branches in the assertion.  Backtracking into (*THEN) causes it to skip
8331       to the next enclosing alternative within the assertion (the normal  be-
8332       haviour),  but  if  the  assertion  does  not have such an alternative,
8333       (*THEN) behaves like (*PRUNE).
8334
8335   Backtracking verbs in subroutines
8336
8337       These behaviours occur whether or not the subpattern is  called  recur-
8338       sively.  Perl's treatment of subroutines is different in some cases.
8339
8340       (*FAIL)  in  a subpattern called as a subroutine has its normal effect:
8341       it forces an immediate backtrack.
8342
8343       (*ACCEPT) in a subpattern called as a subroutine causes the  subroutine
8344       match  to succeed without any further processing. Matching then contin-
8345       ues after the subroutine call.
8346
8347       (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
8348       cause the subroutine match to fail.
8349
8350       (*THEN)  skips to the next alternative in the innermost enclosing group
8351       within the subpattern that has alternatives. If there is no such  group
8352       within the subpattern, (*THEN) causes the subroutine match to fail.
8353
8354
8355SEE ALSO
8356
8357       pcre2api(3),    pcre2callout(3),    pcre2matching(3),   pcre2syntax(3),
8358       pcre2(3).
8359
8360
8361AUTHOR
8362
8363       Philip Hazel
8364       University Computing Service
8365       Cambridge, England.
8366
8367
8368REVISION
8369
8370       Last updated: 20 June 2016
8371       Copyright (c) 1997-2016 University of Cambridge.
8372------------------------------------------------------------------------------
8373
8374
8375PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
8376
8377
8378
8379NAME
8380       PCRE2 - Perl-compatible regular expressions (revised API)
8381
8382PCRE2 PERFORMANCE
8383
8384       Two  aspects  of performance are discussed below: memory usage and pro-
8385       cessing time. The way you express your pattern as a regular  expression
8386       can affect both of them.
8387
8388
8389COMPILED PATTERN MEMORY USAGE
8390
8391       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
8392       code, so that most simple patterns do not  use  much  memory.  However,
8393       there  is  one case where the memory usage of a compiled pattern can be
8394       unexpectedly large. If a parenthesized subpattern has a quantifier with
8395       a minimum greater than 1 and/or a limited maximum, the whole subpattern
8396       is repeated in the compiled code. For example, the pattern
8397
8398         (abc|def){2,4}
8399
8400       is compiled as if it were
8401
8402         (abc|def)(abc|def)((abc|def)(abc|def)?)?
8403
8404       (Technical aside: It is done this way so that backtrack  points  within
8405       each of the repetitions can be independently maintained.)
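
       The expansion described above can be mimicked on the pattern text.  The
       following C sketch is only an illustration: PCRE2 performs the  repeti-
       tion  on  its  compiled  code,  not on the source string, and the names
       append() and expand_repeat() are invented for the example (they are not
       part  of  the  PCRE2  API).  It assumes that the group passed in is al-
       ready parenthesized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src to the NUL-terminated string in dst, whose capacity is cap.
   Returns 0 on success, or -1 if the result would not fit. */
static int append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst);
    size_t need = strlen(src);

    if (need >= cap - used) return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Write into out the textual equivalent of group{min,max}, in the style
   shown above. Hypothetical helper, not a PCRE2 function.
   Returns 0 on success, or -1 if out is too small. */
int expand_repeat(const char *group, unsigned min, unsigned max,
                  char *out, size_t cap)
{
    unsigned optional, i;

    assert(group != NULL && out != NULL && min <= max);
    if (cap == 0) return -1;
    out[0] = '\0';

    /* The mandatory part: min copies of the group, one after another. */
    for (i = 0; i < min; i++)
        if (append(out, cap, group) != 0) return -1;

    /* The optional part: every copy except the innermost opens a wrapper,
       so that a later copy can be present only if the earlier ones are. */
    optional = max - min;
    for (i = 0; i < optional; i++) {
        if (i + 1 < optional && append(out, cap, "(") != 0) return -1;
        if (append(out, cap, group) != 0) return -1;
    }

    /* Close from the inside out: "?" on the innermost copy, then ")?"
       for each wrapper. */
    for (i = 0; i < optional; i++)
        if (append(out, cap, i == 0 ? "?" : ")?") != 0) return -1;

    return 0;
}
```

       Called with "(abc|def)", 2, and 4, the helper produces the string shown
       in the text. The output grows with the maximum, which is the effect  on
       compiled size that this section describes.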
8406
8407       For  regular expressions whose quantifiers use only small numbers, this
8408       is not usually a problem. However, if the numbers are large,  and  par-
8409       ticularly  if  such repetitions are nested, the memory usage can become
8410       an embarrassment. For example, the very simple pattern
8411
8412         ((ab){1,1000}c){1,3}
8413
8414       uses 51K bytes when compiled using the 8-bit  library.  When  PCRE2  is
8415       compiled  with its default internal pointer size of two bytes, the size
8416       limit on a compiled pattern is 64K code units in the 8-bit  and  16-bit
8417       libraries, and this is reached with the above pattern if the outer rep-
8418       etition is increased from 3 to 4. PCRE2 can be compiled to  use  larger
8419       internal  pointers  and thus handle larger compiled patterns, but it is
8420       better to try to rewrite your pattern to use less memory if you can.
8421
8422       One way of reducing the memory usage for such patterns is to  make  use
8423       of PCRE2's "subroutine" facility. Re-writing the above pattern as
8424
8425         ((ab)(?2){0,999}c)(?1){0,2}
8426
8427       reduces the memory requirements to 18K, and indeed it remains under 20K
8428       even with the outer repetition increased to 100. However, this  pattern
8429       is  not  exactly equivalent, because the "subroutine" calls are treated
8430       as atomic groups into which there can be no backtracking if there is  a
8431       subsequent  matching  failure.  Therefore, PCRE2 cannot do this kind of
8432       rewriting automatically.  Furthermore, there is a  noticeable  loss  of
8433       speed  when executing the modified pattern. Nevertheless, if the atomic
8434       grouping is not a problem and the loss of  speed  is  acceptable,  this
8435       kind  of rewriting will allow you to process patterns that PCRE2 cannot
8436       otherwise handle.
8437
8438
8439STACK USAGE AT RUN TIME
8440
8441       When pcre2_match() is used for matching, certain kinds of  pattern  can
8442       cause  it  to  use large amounts of the process stack. In some environ-
8443       ments the default process stack is quite small, and if it runs out  the
8444       result  is  often  SIGSEGV.  Rewriting your pattern can often help. The
8445       pcre2stack documentation discusses this issue in detail.
8446
8447
8448PROCESSING TIME
8449
8450       Certain items in regular expression patterns are processed  more  effi-
8451       ciently than others. It is more efficient to use a character class like
8452       [aeiou]  than  a  set  of   single-character   alternatives   such   as
8453       (a|e|i|o|u).  In  general,  the simplest construction that provides the
8454       required behaviour is usually the most efficient. Jeffrey Friedl's book
8455       "Mastering Regular Expressions" contains a lot of useful general
8456       discussion about optimizing regular expressions for efficient
8457       performance. This document contains a few observations about PCRE2.
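
       One way to see why a character class can be cheaper than  single-char-
       acter  alternatives:  a class can be tested with one table lookup, how-
       ever many members it has, whereas alternatives are tried one after  an-
       other.  The C sketch below assumes a 256-bit bitmap for 8-bit code units.
       That representation is an assumption made for the example (this document
       does not describe PCRE2's internal form), and  the  names  class_bitmap,
       class_build(), class_contains(), and alternatives_contain() are invented
       here.

```c
#include <assert.h>
#include <string.h>

/* One bit for each possible 8-bit code unit value. */
typedef struct {
    unsigned char bits[32];
} class_bitmap;

/* Build the bitmap for a class such as [aeiou] from its member list. */
void class_build(class_bitmap *cls, const char *members)
{
    assert(cls != NULL && members != NULL);
    memset(cls->bits, 0, sizeof cls->bits);
    for (; *members != '\0'; members++) {
        unsigned char c = (unsigned char)*members;
        cls->bits[c / 8] |= (unsigned char)(1u << (c % 8));
    }
}

/* Class test: a single lookup, whatever the number of members. */
int class_contains(const class_bitmap *cls, unsigned char c)
{
    return (cls->bits[c / 8] >> (c % 8)) & 1;
}

/* Alternation test: each alternative is tried in turn until one matches. */
int alternatives_contain(const char *alternatives, unsigned char c)
{
    for (; *alternatives != '\0'; alternatives++)
        if ((unsigned char)*alternatives == c) return 1;
    return 0;
}
```

       Both functions give the same answers; the difference is in the  amount
       of work done for a character that is late in the list or not in it.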
8458
8459       Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
8460       slow, because PCRE2 has to use a multi-stage table lookup  whenever  it
8461       needs  a  character's  property. If you can find an alternative pattern
8462       that does not use character properties, it will probably be faster.
8463
8464       By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
8465       character  classes  such  as  [:alpha:]  do not use Unicode properties,
8466       partly for backwards compatibility, and partly for performance reasons.
8467       However,  you  can  set  the PCRE2_UCP option or start the pattern with
8468       (*UCP) if you want Unicode character properties to be  used.  This  can
8469       double  the  matching  time  for  items  such  as \d, when matched with
8470       pcre2_match(); the performance loss is less with a DFA  matching  func-
8471       tion, and in both cases there is not much difference for \b.
8472
8473       When  a pattern begins with .* not in atomic parentheses, nor in paren-
8474       theses that are the subject of a backreference,  and  the  PCRE2_DOTALL
8475       option  is  set,  the pattern is implicitly anchored by PCRE2, since it
8476       can match only at the start of a subject string.  If  the  pattern  has
8477       multiple top-level branches, they must all be anchorable. The optimiza-
8478       tion can be disabled by  the  PCRE2_NO_DOTSTAR_ANCHOR  option,  and  is
8479       automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
8480
8481       If  PCRE2_DOTALL  is  not  set,  PCRE2  cannot  make this optimization,
8482       because the dot metacharacter does not then match a newline, and if the
8483       subject  string contains newlines, the pattern may match from the char-
8484       acter immediately following one of them instead of from the very start.
8485       For example, the pattern
8486
8487         .*second
8488
8489       matches  the subject "first\nand second" (where \n stands for a newline
8490       character), with the match starting at the seventh character. In  order
8491       to  do  this, PCRE2 has to retry the match starting after every newline
8492       in the subject.
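
       The restart behaviour described above can be modelled for the  special
       case  of  .*  followed  by a literal string. The following C sketch is a
       hand-written simulation, not a call into PCRE2.  The  names  NO_MATCH
       and dotstar_literal_start() are invented for the example, and the lit-
       eral is assumed not to contain a newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NO_MATCH ((size_t)-1)

/* Model of matching ".*literal" without the dotall setting: a match can
   start only at the start of the subject or just after a newline, and
   the literal must be found before the next newline. Returns the offset
   at which the match starts, or NO_MATCH. */
size_t dotstar_literal_start(const char *subject, const char *literal)
{
    const char *line = subject;
    size_t lit_len;

    assert(subject != NULL && literal != NULL);
    lit_len = strlen(literal);

    for (;;) {
        const char *eol = strchr(line, '\n');
        size_t line_len = eol ? (size_t)(eol - line) : strlen(line);
        size_t i;

        /* Does the literal occur entirely within this line? */
        for (i = 0; i + lit_len <= line_len; i++)
            if (memcmp(line + i, literal, lit_len) == 0)
                return (size_t)(line - subject);

        if (eol == NULL) return NO_MATCH;
        line = eol + 1;           /* retry just after the newline */
    }
}
```

       For the subject "first\nand second" and the literal "second",  the  re-
       sult  is  offset 6, that is, the seventh character, as stated above.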
8493
8494       If you are using such a pattern with subject strings that do  not  con-
8495       tain   newlines,   the   best   performance   is  obtained  by  setting
8496       PCRE2_DOTALL, or starting the pattern with  ^.*  or  ^.*?  to  indicate
8497       explicit anchoring. That saves PCRE2 from having to scan along the sub-
8498       ject looking for a newline to restart at.
8499
8500       Beware of patterns that contain nested indefinite  repeats.  These  can
8501       take  a  long time to run when applied to a string that does not match.
8502       Consider the pattern fragment
8503
8504         ^(a+)*
8505
8506       This can match "aaaa" in 16 different ways, and this  number  increases
8507       very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
8508       2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
8509       repeats  can  match  different numbers of times.) When the remainder of
8510       the pattern is such that the entire match is going to fail,  PCRE2  has
8511       in  principle  to  try  every  possible variation, and this can take an
8512       extremely long time, even for relatively short strings.
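
       The figure of 16 can be checked by counting. In the C sketch below (no
       PCRE2 calls are involved, and count_ways() is a name invented here) the
       group may stop after any prefix of the subject, and a prefix of k char-
       acters  can be divided among successive a+ iterations in every possible
       way.

```c
#include <assert.h>

/* Number of distinct ways in which ^(a+)* can match at the start of a
   subject consisting of n "a" characters. */
unsigned long long count_ways(unsigned n)
{
    /* splits[k] is the number of ways to cover exactly k characters with
       a sequence of non-empty a+ iterations (one way for k == 0: no
       iterations at all). */
    unsigned long long splits[63];
    unsigned long long total;
    unsigned k, first;

    assert(n < 63);               /* keeps the total within 64 bits */
    splits[0] = 1;
    total = 1;
    for (k = 1; k <= n; k++) {
        splits[k] = 0;
        /* The first iteration takes "first" characters; the remainder
           can be split in any of splits[k - first] ways. */
        for (first = 1; first <= k; first++)
            splits[k] += splits[k - first];
        total += splits[k];
    }
    return total;
}
```

       count_ways(4) yields 16, in agreement with the text, and the result dou-
       bles with each additional character.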
8513
8514       An optimization catches some of the more simple cases such as
8515
8516         (a+)*b
8517
8518       where a literal character follows. Before  embarking  on  the  standard
8519       matching  procedure, PCRE2 checks that there is a "b" later in the sub-
8520       ject string, and if there is not, it fails the match immediately.  How-
8521       ever,  when  there  is no following literal this optimization cannot be
8522       used. You can see the difference by comparing the behaviour of
8523
8524         (a+)*\d
8525
8526       with the pattern above. When applied to a whole line of "a" characters,
8527       (a+)*b fails almost instantly, whereas (a+)*\d takes an appreciable
8528       time with strings longer than about 20 characters.
8529
8530       In many cases, the solution to this kind of performance issue is to use
8531       an atomic group or a possessive quantifier.
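
       The effect of a possessive quantifier can be illustrated with an ideal-
       ized  model  of a backtracking matcher. This is not PCRE2's actual algo-
       rithm; it merely counts how often the rest of the pattern is  attempted
       for  ^(a+)*X  and for ^(a++)*X when X can never match. The function names
       are invented for the example.

```c
#include <assert.h>

/* Try zero or more further iterations of the group, starting at pos in a
   subject of n "a" characters. Each attempt to match the (always failing)
   remainder of the pattern is counted in *tries. */
static void repeat_group(unsigned pos, unsigned n, int possessive,
                         unsigned long long *tries)
{
    unsigned len;

    /* One more iteration: the inner repeat takes the longest run first,
       then gives characters back one at a time on backtracking. */
    for (len = n - pos; len >= 1; len--) {
        repeat_group(pos + len, n, possessive, tries);
        if (possessive) break;    /* a++ never gives anything back */
    }

    /* No further iteration: the remainder of the pattern is tried here,
       and fails. */
    (*tries)++;
}

/* Number of times the remainder of the pattern is tried before the whole
   match fails, in this model. */
unsigned long long count_tail_tries(unsigned n, int possessive)
{
    unsigned long long tries = 0;

    assert(n <= 24);              /* the greedy count is 2 to the power n */
    repeat_group(0, n, possessive, &tries);
    return tries;
}
```

       In this model a subject of 20 characters causes the remainder  of  the
       pattern to be tried 1048576 times with the greedy inner repeat, but only
       twice with the possessive one.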
8532
8533
8534AUTHOR
8535
8536       Philip Hazel
8537       University Computing Service
8538       Cambridge, England.
8539
8540
8541REVISION
8542
8543       Last updated: 02 January 2015
8544       Copyright (c) 1997-2015 University of Cambridge.
8545------------------------------------------------------------------------------
8546
8547
8548PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)
8549
8550
8551
8552NAME
8553       PCRE2 - Perl-compatible regular expressions (revised API)
8554
8555SYNOPSIS
8556
8557       #include <pcre2posix.h>
8558
8559       int regcomp(regex_t *preg, const char *pattern,
8560            int cflags);
8561
8562       int regexec(const regex_t *preg, const char *string,
8563            size_t nmatch, regmatch_t pmatch[], int eflags);
8564
8565       size_t regerror(int errcode, const regex_t *preg,
8566            char *errbuf, size_t errbuf_size);
8567
8568       void regfree(regex_t *preg);
8569
8570
8571DESCRIPTION
8572
8573       This  set of functions provides a POSIX-style API for the PCRE2 regular
8574       expression 8-bit library. See the pcre2api documentation for a descrip-
8575       tion  of PCRE2's native API, which contains much additional functional-
8576       ity. There are no POSIX-style wrappers for PCRE2's  16-bit  and  32-bit
8577       libraries.
8578
8579       The functions described here are just wrapper functions that ultimately
8580       call the  PCRE2  native  API.  Their  prototypes  are  defined  in  the
8581       pcre2posix.h  header  file,  and  on Unix systems the library itself is
8582       called libpcre2-posix.a, so can be accessed by adding -lpcre2-posix  to
8583       the  command  for  linking  an  application that uses them. Because the
8584       POSIX functions call the native ones,  it  is  also  necessary  to  add
8585       -lpcre2-8.
8586
8587       Those  POSIX  option bits that can reasonably be mapped to PCRE2 native
8588       options have been implemented. In addition, the option REG_EXTENDED  is
8589       defined  with  the  value  zero. This has no effect, but since programs
8590       that are written to the POSIX interface often use  it,  this  makes  it
8591       easier  to  slot in PCRE2 as a replacement library. Other POSIX options
8592       are not even defined.
8593
8594       There are also some options that are not defined by POSIX.  These  have
8595       been  added  at  the  request  of users who want to make use of certain
8596       PCRE2-specific features via the POSIX calling interface.
8597
8598       When PCRE2 is called via these functions, it is only the  API  that  is
8599       POSIX-like  in  style.  The syntax and semantics of the regular expres-
8600       sions themselves are still those of Perl, subject  to  the  setting  of
8601       various  PCRE2 options, as described below. "POSIX-like in style" means
8602       that the API approximates to the POSIX  definition;  it  is  not  fully
8603       POSIX-compatible,  and  in  multi-unit  encoding domains it is probably
8604       even less compatible.
8605
8606       The header for these functions is supplied as pcre2posix.h to avoid any
8607       potential  clash  with  other  POSIX  libraries.  It can, of course, be
8608       renamed or aliased as regex.h, which is the "correct" name. It provides
8609       two  structure  types,  regex_t  for  compiled internal forms, and reg-
8610       match_t for returning captured substrings. It also  defines  some  con-
8611       stants  whose  names  start  with  "REG_";  these  are used for setting
8612       options and identifying error codes.
8613
8614
8615COMPILING A PATTERN
8616
8617       The function regcomp() is called to compile a pattern into an  internal
8618       form.  The  pattern  is  a C string terminated by a binary zero, and is
8619       passed in the argument pattern. The preg argument is  a  pointer  to  a
8620       regex_t  structure that is used as a base for storing information about
8621       the compiled regular expression.
8622
8623       The argument cflags is either zero, or contains one or more of the bits
8624       defined by the following macros:
8625
8626         REG_DOTALL
8627
8628       The  PCRE2_DOTALL  option  is set when the regular expression is passed
8629       for compilation to the native function. Note  that  REG_DOTALL  is  not
8630       part of the POSIX standard.
8631
8632         REG_ICASE
8633
8634       The  PCRE2_CASELESS option is set when the regular expression is passed
8635       for compilation to the native function.
8636
8637         REG_NEWLINE
8638
8639       The PCRE2_MULTILINE option is set when the regular expression is passed
8640       for  compilation  to the native function. Note that this does not mimic
8641       the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
8642       tion).
8643
8644         REG_NOSUB
8645
8646       When  a  pattern that is compiled with this flag is passed to regexec()
8647       for matching, the nmatch and pmatch arguments are ignored, and no
8648       captured strings are returned. Versions of PCRE2 prior to 10.22 used
8649       to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer
8650       happens because it disables the use of back references.
8651
8652         REG_UCP
8653
8654       The  PCRE2_UCP  option is set when the regular expression is passed for
8655       compilation to the native function. This causes PCRE2  to  use  Unicode
8656       properties  when  matching  \d,  \w,  etc., instead of just recognizing
8657       ASCII values. Note that REG_UCP is not part of the POSIX standard.
8658
8659         REG_UNGREEDY
8660
8661       The PCRE2_UNGREEDY option is set when the regular expression is  passed
8662       for  compilation  to the native function. Note that REG_UNGREEDY is not
8663       part of the POSIX standard.
8664
8665         REG_UTF
8666
8667       The PCRE2_UTF option is set when the regular expression is  passed  for
8668       compilation  to the native function. This causes the pattern itself and
8669       all data strings used for matching it to be treated as  UTF-8  strings.
8670       Note that REG_UTF is not part of the POSIX standard.
8671
8672       In  the  absence  of  these  flags, no options are passed to the native
8673       function.  This means that the regex is compiled  with  PCRE2  default
8674       semantics.  In particular, the way it handles newline characters in the
8675       subject string is the Perl way, not the POSIX way.  Note  that  setting
8676       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
8677       It does not affect the way newlines are matched by the dot  metacharac-
8678       ter (they are not) or by a negative class such as [^a] (they are).
8679
8680       The  yield of regcomp() is zero on success, and non-zero otherwise. The
8681       preg structure is filled in on success, and one member of the structure
8682       is  public: re_nsub contains the number of capturing subpatterns in the
8683       regular expression. Various error codes are defined in the header file.
8684
8685       NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
8686       use the contents of the preg structure. If, for example, you pass it to
8687       regexec(), the result is undefined and your program is likely to crash.
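
       A minimal sketch of the checking that this implies. So that it can  be
       compiled  without  PCRE2  installed,  the example includes the system's
       regex.h; that is an assumption made for the  example,  and  with  PCRE2
       the  header  would  be  pcre2posix.h.  The helper count_groups() is hypo-
       thetical.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>      /* stand-in for pcre2posix.h (assumption) */

/* Compile pattern with the given cflags and report the number of
   capturing subpatterns. Returns the yield of regcomp(); *nsub is set
   only when that yield is zero. */
int count_groups(const char *pattern, int cflags, size_t *nsub)
{
    regex_t preg;
    int rc;

    assert(pattern != NULL && nsub != NULL);
    rc = regcomp(&preg, pattern, cflags);
    if (rc != 0)
        return rc;            /* preg must not be used after a failure */

    *nsub = preg.re_nsub;     /* the one public member */
    regfree(&preg);
    return 0;
}
```

       A call such as count_groups("(a)(b(c))", REG_EXTENDED, &n) sets n to 3.
       REG_EXTENDED  is  passed  only  for  the benefit of the system library;
       as noted above, it has no effect in PCRE2.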
8688
8689
8690MATCHING NEWLINE CHARACTERS
8691
8692       This area is not simple, because POSIX and Perl take different views of
8693       things.   It  is not possible to get PCRE2 to obey POSIX semantics, but
8694       then PCRE2 was never intended to be a POSIX engine. The following table
8695       lists  the  different  possibilities for matching newline characters in
8696       Perl and PCRE2:
8697
8698                                 Default   Change with
8699
8700         . matches newline          no     PCRE2_DOTALL
8701         newline matches [^a]       yes    not changeable
8702         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
8703         $ matches \n in middle     no     PCRE2_MULTILINE
8704         ^ matches \n in middle     no     PCRE2_MULTILINE
8705
8706       This is the equivalent table for a POSIX-compatible pattern matcher:
8707
8708                                 Default   Change with
8709
8710         . matches newline          yes    REG_NEWLINE
8711         newline matches [^a]       yes    REG_NEWLINE
8712         $ matches \n at end        no     REG_NEWLINE
8713         $ matches \n in middle     no     REG_NEWLINE
8714         ^ matches \n in middle     no     REG_NEWLINE
8715
8716       This behaviour is not what happens when PCRE2 is called via  its  POSIX
8717       API.  By  default, PCRE2's behaviour is the same as Perl's, except that
8718       there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both  PCRE2
8719       and Perl, there is no way to stop newline from matching [^a].
8720
8721       Default  POSIX newline handling can be obtained by setting PCRE2_DOTALL
8722       and PCRE2_DOLLAR_ENDONLY when  calling  pcre2_compile()  directly,  but
8723       there  is  no  way  to make PCRE2 behave exactly as for the REG_NEWLINE
8724       action. When using the POSIX API, passing REG_NEWLINE to  PCRE2's  reg-
8725       comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(),
8726       and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass  PCRE2_DOL-
8727       LAR_ENDONLY.
8728
8729
8730MATCHING A PATTERN
8731
8732       The  function  regexec()  is  called  to  match a compiled pattern preg
8733       against a given string, which is by default terminated by a  zero  byte
8734       (but  see  REG_STARTEND below), subject to the options in eflags. These
8735       can be:
8736
8737         REG_NOTBOL
8738
8739       The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
8740       ing function.
8741
8742         REG_NOTEMPTY
8743
8744       The  PCRE2_NOTEMPTY  option  is  set  when calling the underlying PCRE2
8745       matching function. Note that REG_NOTEMPTY is  not  part  of  the  POSIX
8746       standard.  However, setting this option can give more POSIX-like behav-
8747       iour in some situations.
8748
8749         REG_NOTEOL
8750
8751       The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
8752       ing function.
8753
8754         REG_STARTEND
8755
8756       The  string  is  considered to start at string + pmatch[0].rm_so and to
8757       have a terminating NUL located at string + pmatch[0].rm_eo (there  need
8758       not  actually  be  a  NUL at that location), regardless of the value of
8759       nmatch. This is a BSD extension, compatible with but not  specified  by
8760       IEEE  Standard  1003.2  (POSIX.2),  and  should be used with caution in
8761       software intended to be portable to other systems. Note that a non-zero
8762       rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
8763       of the string, not how it is matched. Setting REG_STARTEND and passing
8764       pmatch as NULL are mutually exclusive; if both are done, regexec()
8765       returns the error REG_INVARG.
8766
8767       If the pattern was compiled with the REG_NOSUB flag, no data about  any
8768       matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
8769       regexec() are ignored (except possibly as input for REG_STARTEND).
8770
8771       The value of nmatch may be zero, and the value of pmatch may be NULL
8772       (unless  REG_STARTEND  is  set);  in both these cases no data about any
8773       matched strings is returned.
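
       When only a yes/no answer is wanted, the two cases above  can  be  com-
       bined,  as  in  the following C sketch. It includes the system's regex.h
       so that it can be compiled without PCRE2 installed  (an  assumption  of
       the example; with PCRE2 the header would be pcre2posix.h), and is_match()
       is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>      /* stand-in for pcre2posix.h (assumption) */

/* Returns 1 if pattern matches somewhere in subject, 0 if it does not,
   and -1 if the pattern fails to compile or matching reports an error
   other than REG_NOMATCH. */
int is_match(const char *pattern, const char *subject)
{
    regex_t preg;
    int rc;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&preg, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* nmatch is zero and pmatch is NULL: no offsets are wanted. */
    rc = regexec(&preg, subject, 0, NULL, 0);
    regfree(&preg);

    if (rc == 0) return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

       For  example,  is_match("cat|dog", "the cat sat on the mat") returns 1.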
8774
8775       Otherwise, the portion of the string that was  matched,  and  also  any
8776       captured substrings, are returned via the pmatch argument, which points
8777       to an array of nmatch structures of  type  regmatch_t,  containing  the
8778       members  rm_so  and  rm_eo.  These contain the byte offset to the first
8779       character of each substring and the offset to the first character after
8780       the  end of each substring, respectively. The 0th element of the vector
8781       relates to the entire portion of string that  was  matched;  subsequent
8782       elements relate to the capturing subpatterns of the regular expression.
8783       Unused entries in the array have both structure members set to -1.
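
       The following C sketch shows the offsets being  read  back.  It  includes
       the  system's  regex.h so that it can be compiled without PCRE2 installed
       (an assumption of the example; with PCRE2 the header would be pcre2posix.h).
       The  pattern,  the name find_address(), and the macro NGROUPS are in-
       vented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>      /* stand-in for pcre2posix.h (assumption) */

#define NGROUPS 3       /* whole match plus two capturing subpatterns */

/* Look for "name@host" (lower-case letters only) in subject. On success
   returns 0 and fills pmatch[0] to pmatch[2]; otherwise returns the
   non-zero yield of regcomp() or regexec(). */
int find_address(const char *subject, regmatch_t pmatch[NGROUPS])
{
    regex_t preg;
    int rc;

    assert(subject != NULL && pmatch != NULL);
    rc = regcomp(&preg, "([a-z]+)@([a-z]+)", REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&preg, subject, NGROUPS, pmatch, 0);
    regfree(&preg);
    return rc;
}
```

       For the subject "mail bob@example now", pmatch[0] covers offsets  5  to
       16,  pmatch[1]  covers  5 to 8, and pmatch[2] covers 9 to 16. The length
       of substring i is pmatch[i].rm_eo - pmatch[i].rm_so.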
8784
8785       A successful match yields  a  zero  return;  various  error  codes  are
8786       defined  in  the  header  file,  of which REG_NOMATCH is the "expected"
8787       failure code.
8788
8789
8790ERROR MESSAGES
8791
8792       The regerror() function maps a non-zero errcode from  either  regcomp()
8793       or  regexec()  to  a  printable message. If preg is not NULL, the error
8794       should have arisen from the use of that structure. A message terminated
8795       by  a binary zero is placed in errbuf. If the buffer is too short, only
8796       the first errbuf_size - 1 characters of the error message are used. The
8797       yield  of  the  function is the size of buffer needed to hold the whole
8798       message, including the terminating zero. This  value  is  greater  than
8799       errbuf_size if the message was truncated.
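
       Because  the  yield reports the size that is needed, a complete message
       can be obtained with two calls, as in this C sketch.  It  includes  the
       system's  regex.h  so  that it can be compiled without PCRE2 installed (an
       assumption of the example; with PCRE2 the header would be pcre2posix.h),
       and full_error_message() is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <regex.h>      /* stand-in for pcre2posix.h (assumption) */

/* Return the complete message for errcode in freshly allocated memory
   (to be released with free()), or NULL if allocation fails. */
char *full_error_message(int errcode, const regex_t *preg)
{
    char probe[8];
    size_t needed;
    char *message;

    assert(errcode != 0);

    /* The yield is the size needed for the whole message, including the
       terminating zero, even when the buffer supplied is too short. */
    needed = regerror(errcode, preg, probe, sizeof probe);

    message = malloc(needed);
    if (message != NULL)
        (void)regerror(errcode, preg, message, needed);
    return message;
}
```

       The  small  probe buffer is deliberately too short for most messages; it
       serves only to obtain the required size.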
8800
8801
8802MEMORY USAGE
8803
8804       Compiling  a regular expression causes memory to be allocated and asso-
8805       ciated with the preg structure. The function regfree() frees  all  such
8806       memory,  after  which  preg may no longer be used as a compiled expres-
8807       sion.
8808
8809
8810AUTHOR
8811
8812       Philip Hazel
8813       University Computing Service
8814       Cambridge, England.
8815
8816
8817REVISION
8818
8819       Last updated: 31 January 2016
8820       Copyright (c) 1997-2016 University of Cambridge.
8821------------------------------------------------------------------------------
8822
8823
8824PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)
8825
8826
8827
8828NAME
8829       PCRE2 - Perl-compatible regular expressions (revised API)
8830
8831PCRE2 SAMPLE PROGRAM
8832
8833       A  simple, complete demonstration program to get you started with using
8834       PCRE2 is supplied in the file pcre2demo.c in the src directory  in  the
8835       PCRE2 distribution. A listing of this program is given in the pcre2demo
8836       documentation. If you do not have a copy of the PCRE2 distribution, you
8837       can save this listing to re-create the contents of pcre2demo.c.
8838
8839       The  demonstration  program compiles the regular expression that is its
8840       first argument, and matches it against the subject string in its second
8841       argument.  No  PCRE2  options are set, and default character tables are
8842       used. If matching succeeds, the program outputs the portion of the sub-
8843       ject  that  matched,  together  with  the contents of any captured sub-
8844       strings.
8845
8846       If the -g option is given on the command line, the program then goes on
8847       to check for further matches of the same regular expression in the same
8848       subject string. The logic is a little bit tricky because of the  possi-
8849       bility  of  matching an empty string. Comments in the code explain what
8850       is going on.
8851
8852       The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
8853       library.  It  handles  strings  and characters that are stored in 8-bit
8854       code units.  By default, one character corresponds to  one  code  unit,
8855       but  if  the  pattern starts with "(*UTF)", both it and the subject are
8856       treated as UTF-8 strings, where characters  may  occupy  multiple  code
8857       units.
8858
8859       If  PCRE2  is installed in the standard include and library directories
8860       for your operating system, you should be able to compile the demonstra-
8861       tion program using a command like this:
8862
8863         cc -o pcre2demo pcre2demo.c -lpcre2-8
8864
8865       If PCRE2 is installed elsewhere, you may need to add additional options
8866       to the command line. For example, on a Unix-like system that has  PCRE2
8867       installed  in  /usr/local,  you  can  compile the demonstration program
8868       using a command like this:
8869
8870         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
8871            -L/usr/local/lib -lpcre2-8
8872
8873       Once you have built the demonstration program, you can run simple tests
8874       like this:
8875
8876         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
8877         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
8878
8879       Note  that  there  is  a  much  more comprehensive test program, called
8880       pcre2test, which supports many  more  facilities  for  testing  regular
8881       expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
8882       though not all three need be installed). The pcre2demo program is  pro-
8883       vided as a relatively simple coding example.
8884
8885       If you try to run pcre2demo when PCRE2 is not installed in the standard
8886       library directory, you may get an error like  this  on  some  operating
8887       systems (e.g. Solaris):
8888
8889         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
8890         or directory
8891
8892       This is caused by the way shared library support works  on  those  sys-
8893       tems. You need to add
8894
8895         -R/usr/local/lib
8896
8897       (for example) to the compile command to get round this problem.
8898
8899
8900AUTHOR
8901
8902       Philip Hazel
8903       University Computing Service
8904       Cambridge, England.
8905
8906
8907REVISION
8908
8909       Last updated: 02 February 2016
8910       Copyright (c) 1997-2016 University of Cambridge.
8911------------------------------------------------------------------------------
8912PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)
8913
8914
8915
8916NAME
8917       PCRE2 - Perl-compatible regular expressions (revised API)
8918
8919SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
8920
8921       int32_t pcre2_serialize_decode(pcre2_code **codes,
8922         int32_t number_of_codes, const uint8_t *bytes,
8923         pcre2_general_context *gcontext);
8924
8925       int32_t pcre2_serialize_encode(pcre2_code **codes,
8926         int32_t number_of_codes, uint8_t **serialized_bytes,
8927         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
8928
8929       void pcre2_serialize_free(uint8_t *bytes);
8930
8931       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
8932
8933       If  you  are running an application that uses a large number of regular
8934       expression patterns, it may be useful to store them  in  a  precompiled
8935       form  instead  of  having to compile them every time the application is
8936       run. However, if you are using the just-in-time  optimization  feature,
8937       it is not possible to save and reload the JIT data, because it is posi-
8938       tion-dependent. The host on which the patterns  are  reloaded  must  be
8939       running  the  same version of PCRE2, with the same code unit width, and
8940       must also have the same endianness, pointer width and PCRE2_SIZE  type.
8941       For  example, patterns compiled on a 32-bit system using PCRE2's 16-bit
8942       library cannot be reloaded on a 64-bit system, nor can they be reloaded
8943       using the 8-bit library.
8944
8945
8946SECURITY CONCERNS
8947
8948       The facility for saving and restoring compiled patterns is intended for
8949       use within individual applications.  As  such,  the  data  supplied  to
8950       pcre2_serialize_decode()  is expected to be trusted data, not data from
8951       arbitrary external sources.  There  is  only  some  simple  consistency
8952       checking, not complete validation of what is being re-loaded.
8953
8954
8955SAVING COMPILED PATTERNS
8956
8957       Before compiled patterns can be saved they must be serialized, that is,
8958       converted to a stream of bytes. A single byte stream  may  contain  any
8959       number  of  compiled patterns, but they must all use the same character
8960       tables. A single copy of the tables is included in the byte stream (its
8961       size is 1088 bytes). For more details of character tables, see the sec-
8962       tion on locale support in the pcre2api documentation.
8963
8964       The function pcre2_serialize_encode() creates a serialized byte  stream
8965       from  a  list of compiled patterns. Its first two arguments specify the
8966       list, being a pointer to a vector of pointers to compiled patterns, and
8967       the length of the vector. The third and fourth arguments point to vari-
8968       ables which are set to point to the created byte stream and its length,
8969       respectively.  The  final  argument  is a pointer to a general context,
8970       which can be used to specify custom memory  management  functions.  If
8971       this  argument  is NULL, malloc() is used to obtain memory for the byte
8972       stream. The yield of the function is the number of serialized patterns,
8973       or one of the following negative error codes:
8974
8975         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
8976         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
8977         PCRE2_ERROR_MEMORY       memory allocation failed
8978         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
8979         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL
8980
8981       PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
8982       rupted, or that a slot in the vector does not point to a compiled  pat-
8983       tern.
8984
8985       Once a set of patterns has been serialized you can save the data in any
8986       appropriate manner. Here is sample code that compiles two patterns  and
8987       writes them to a file. It assumes that the variable fd refers to a file
8988       that is open for output. The error checking that should be present in a
8989       real application has been omitted for simplicity.
8990
8991         int errorcode;
8992         uint8_t *bytes;
8993         PCRE2_SIZE erroroffset;
8994         PCRE2_SIZE bytescount;
8995         pcre2_code *list_of_codes[2];
8996         list_of_codes[0] = pcre2_compile("first pattern",
8997           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
8998         list_of_codes[1] = pcre2_compile("second pattern",
8999           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
9000         errorcode = pcre2_serialize_encode(list_of_codes, 2, &bytes,
9001           &bytescount, NULL);
9002         errorcode = fwrite(bytes, 1, bytescount, fd);
9003
9004       Note  that  the  serialized data is binary data that may contain any of
9005       the 256 possible byte  values.  On  systems  that  make  a  distinction
9006       between binary and non-binary data, be sure that the file is opened for
9007       binary output.
9008
9009       Serializing a set of patterns leaves the original  data  untouched,  so
9010       they  can  still  be used for matching. Their memory must eventually be
9011       freed in the usual way by calling pcre2_code_free(). When you have fin-
9012       ished with the byte stream, it too must be freed by calling pcre2_seri-
9013       alize_free().
9014
9015
9016RE-USING PRECOMPILED PATTERNS
9017
9018       In order to re-use a set of saved patterns  you  must  first  make  the
9019       serialized  byte stream available in main memory (for example, by read-
9020       ing from a file). The management of this memory  block  is  up  to  the
9021       application.  You  can  use  the  pcre2_serialize_get_number_of_codes()
9022       function to find out how many compiled patterns are in  the  serialized
9023       data without actually decoding the patterns:
9024
9025         uint8_t *bytes = <serialized data>;
9026         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
9027
9028       The pcre2_serialize_decode() function reads a byte stream and recreates
9029       the compiled patterns in new memory blocks, setting pointers to them in
9030       a  vector.  The  first two arguments are a pointer to a suitable vector
9031       and its length, and the third argument points to  a  byte  stream.  The
9032       final  argument is a pointer to a general context, which can be used to
       specify custom memory management functions for  the  decoded  patterns.
9034       If this argument is NULL, malloc() and free() are used. After deserial-
9035       ization, the byte stream is no longer needed and can be discarded.
9036
         pcre2_code *list_of_codes[2];
         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes =
           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
9042
9043       If the vector is not large enough for all  the  patterns  in  the  byte
9044       stream,  it  is  filled  with  those  that  fit,  and the remainder are
9045       ignored. The yield of the function is the number of  decoded  patterns,
9046       or one of the following negative error codes:
9047
9048         PCRE2_ERROR_BADDATA    second argument is zero or less
9049         PCRE2_ERROR_BADMAGIC   mismatch of id bytes in the data
9050         PCRE2_ERROR_BADMODE    mismatch of code unit size or PCRE2 version
9051         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
         PCRE2_ERROR_NOMEMORY   memory allocation failed
9053         PCRE2_ERROR_NULL       first or third argument is NULL
9054
9055       PCRE2_ERROR_BADMAGIC  may mean that the data is corrupt, or that it was
9056       compiled on a system with different endianness.
9057
9058       Decoded patterns can be used for matching in the usual way, and must be
9059       freed  by  calling pcre2_code_free(). However, be aware that there is a
9060       potential race issue if you  are  using  multiple  patterns  that  were
9061       decoded  from  a  single  byte stream in a multithreaded application. A
9062       single copy of the character tables is used by all the decoded patterns
9063       and a reference count is used to arrange for its memory to be automati-
9064       cally freed when the last pattern is freed, but there is no locking  on
9065       this  reference count. Therefore, if you want to call pcre2_code_free()
9066       for these patterns in different threads,  you  must  arrange  your  own
9067       locking,  and  ensure  that  pcre2_code_free()  cannot be called by two
9068       threads at the same time.
9069
9070       If a pattern was processed by pcre2_jit_compile() before being  serial-
9071       ized,  the  JIT data is discarded and so is no longer available after a
9072       save/restore cycle. You can, however, process a restored  pattern  with
9073       pcre2_jit_compile() if you wish.
9074
9075
9076AUTHOR
9077
9078       Philip Hazel
9079       University Computing Service
9080       Cambridge, England.
9081
9082
9083REVISION
9084
9085       Last updated: 24 May 2016
9086       Copyright (c) 1997-2016 University of Cambridge.
9087------------------------------------------------------------------------------
9088
9089
9090PCRE2STACK(3)              Library Functions Manual              PCRE2STACK(3)
9091
9092
9093
9094NAME
9095       PCRE2 - Perl-compatible regular expressions (revised API)
9096
9097PCRE2 DISCUSSION OF STACK USAGE
9098
9099       When  you  call  pcre2_match(),  it  makes  use of an internal function
9100       called match(). This calls itself recursively at branch points  in  the
9101       pattern,  in  order  to  remember the state of the match so that it can
9102       back up and try a different alternative after a  failure.  As  matching
9103       proceeds  deeper  and deeper into the tree of possibilities, the recur-
9104       sion depth increases. The match() function is also called in other cir-
9105       cumstances,  for  example,  whenever  a  parenthesized  sub-pattern  is
9106       entered, and in certain cases of repetition.
9107
9108       Not all calls of match() increase the recursion depth; for an item such
9109       as  a* it may be called several times at the same level, after matching
9110       different numbers of a's. Furthermore, in a number of cases  where  the
9111       result  of  the  recursive call would immediately be passed back as the
9112       result of the current call (a "tail recursion"), the function  is  just
9113       restarted instead.
9114
9115       Each  time the internal match() function is called recursively, it uses
9116       memory from the process stack. For certain kinds of pattern  and  data,
9117       very  large  amounts of stack may be needed, despite the recognition of
9118       "tail recursion". Note that if  PCRE2  is  compiled  with  the  -fsani-
9119       tize=address  option  of  the  GCC compiler, the stack requirements are
9120       greatly increased.
9121
9122       The above comments apply when pcre2_match() is run in its normal inter-
9123       pretive manner. If the compiled pattern was processed by pcre2_jit_com-
9124       pile(), and just-in-time compiling  was  successful,  and  the  options
9125       passed  to  pcre2_match()  were  not incompatible, the matching process
9126       uses the JIT-compiled code instead of the  match()  function.  In  this
9127       case, the memory requirements are handled entirely differently. See the
9128       pcre2jit documentation for details.
9129
9130       The  pcre2_dfa_match()  function  operates  in  a  different   way   to
9131       pcre2_match(),  and uses recursion only when there is a regular expres-
9132       sion recursion or subroutine call in the  pattern.  This  includes  the
9133       processing  of assertion and "once-only" subpatterns, which are handled
9134       like subroutine calls.  Normally, these are never very  deep,  and  the
9135       limit  on  the  complexity  of  pcre2_dfa_match()  is controlled by the
9136       amount of workspace it is given.  However, it is possible to write pat-
9137       terns  with  runaway  infinite  recursions;  such  patterns  will cause
9138       pcre2_dfa_match() to run out of stack. At present, there is no  protec-
9139       tion against this.
9140
9141       The  comments  that  follow do NOT apply to pcre2_dfa_match(); they are
9142       relevant only for pcre2_match() without the JIT optimization.
9143
9144   Reducing pcre2_match()'s stack usage
9145
9146       You can often reduce the amount of recursion, and therefore the  amount
9147       of  stack  used,  by  modifying the pattern that is being matched. Con-
9148       sider, for example, this pattern:
9149
9150         ([^<]|<(?!inet))+
9151
9152       It matches from wherever it starts until it encounters "<inet"  or  the
9153       end  of  the  data,  and is the kind of pattern that might be used when
9154       processing an XML file. Each iteration of the outer parentheses matches
9155       either  one  character that is not "<" or a "<" that is not followed by
9156       "inet". However, each time a  parenthesis  is  processed,  a  recursion
9157       occurs, so this formulation uses a stack frame for each matched charac-
9158       ter. For a long string, a lot of stack is required. Consider  now  this
9159       rewritten pattern, which matches exactly the same strings:
9160
9161         ([^<]++|<(?!inet))+
9162
9163       This  uses very much less stack, because runs of characters that do not
9164       contain "<" are "swallowed" in one item inside the parentheses.  Recur-
9165       sion  happens  only when a "<" character that is not followed by "inet"
9166       is encountered (and we assume this is relatively  rare).  A  possessive
9167       quantifier  is  used  to stop any backtracking into the runs of non-"<"
9168       characters, but that is not related to stack usage.
9169
9170       This example shows that one way of avoiding stack problems when  match-
9171       ing long subject strings is to write repeated parenthesized subpatterns
9172       to match more than one character whenever possible.
9173
9174   Compiling PCRE2 to use heap instead of stack for pcre2_match()
9175
9176       In environments where stack memory is constrained, you  might  want  to
9177       compile PCRE2 to use heap memory instead of stack for remembering back-
9178       up points when pcre2_match() is running. This makes it run more slowly,
9179       however. Details of how to do this are given in the pcre2build documen-
9180       tation. When built in this way, instead of using the stack, PCRE2  gets
9181       memory  for  remembering  backup  points from the heap. By default, the
9182       memory is obtained by calling the system malloc() function, but you can
9183       arrange to supply your own memory management function. For details, see
9184       the section entitled "The match context" in the pcre2api documentation.
       Since the block sizes are always the same, it may be possible to imple-
       ment a customized memory handler that is more efficient than the  stan-
       dard function. The memory blocks obtained for this purpose are retained
9188       and re-used if possible while pcre2_match() is running.  They  are  all
9189       freed just before it exits.
9190
9191   Limiting pcre2_match()'s stack usage
9192
9193       You can set limits on the number of times the internal match() function
9194       is called, both in total and  recursively.  If  a  limit  is  exceeded,
9195       pcre2_match()  returns  an  error  code. Setting suitable limits should
9196       prevent it from running out of stack. The default values of the  limits
9197       are  very large, and unlikely ever to operate. They can be changed when
9198       PCRE2 is built, and they can also be set when pcre2_match() is  called.
9199       For  details  of these interfaces, see the pcre2build documentation and
9200       the section entitled "The match context" in the pcre2api documentation.
9201
9202       As a very rough rule of thumb, you should reckon on about 500 bytes per
9203       recursion.  Thus,  if  you  want  to limit your stack usage to 8Mb, you
9204       should set the limit at 16000 recursions. A 64Mb stack,  on  the  other
9205       hand, can support around 128000 recursions.
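
       The  arithmetic behind these figures is a simple division. The sketch
       below spells it out; the function name and the macro are hypothetical,
       and the 500-byte figure is only the rough estimate given above, not a
       measured value. Sizes are taken as decimal megabytes,  which  is  what
       the figures above imply.

```c
/* Rough per-recursion stack cost, taken from the rule of thumb above. */
#define ASSUMED_BYTES_PER_RECURSION 500UL

/* Hypothetical helper: estimate a recursion limit that keeps stack usage
   within stack_bytes, assuming the rough per-recursion cost above. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes)
{
  return stack_bytes / ASSUMED_BYTES_PER_RECURSION;
}
```

       With  this  estimate,  8000000  bytes  gives  16000  recursions   and
       64000000 bytes gives 128000. A real application should leave headroom
       for the rest of the program's stack usage.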
9206
9207       The  pcre2test  test program has a modifier called "find_limits" which,
9208       if applied to a subject line, causes it to  find  the  smallest  limits
       that  allow  a  pattern to match. This is done by calling pcre2_match()
9210       repeatedly with different limits.
9211
9212   Changing stack size in Unix-like systems
9213
9214       In Unix-like environments, there is not often a problem with the  stack
9215       unless  very  long  strings  are  involved, though the default limit on
9216       stack size varies from system to system. Values from 8Mb  to  64Mb  are
9217       common. You can find your default limit by running the command:
9218
9219         ulimit -s
9220
9221       Unfortunately,  the  effect  of  running out of stack is often SIGSEGV,
9222       though sometimes a more explicit error message is given. You  can  nor-
9223       mally increase the limit on stack size by code such as this:
9224
9225         struct rlimit rlim;
9226         getrlimit(RLIMIT_STACK, &rlim);
9227         rlim.rlim_cur = 100*1024*1024;
9228         setrlimit(RLIMIT_STACK, &rlim);
9229
9230       This  reads  the current limits (soft and hard) using getrlimit(), then
9231       attempts to increase the soft limit to  100Mb  using  setrlimit().  You
9232       must do this before calling pcre2_match().
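
       The  fragment  above  does not check for errors, and a request that is
       larger than the hard limit will normally fail.  The  following  sketch
       adds those checks. The function name ensure_stack_limit() is hypotheti-
       cal, and clamping the request to the hard limit is a design choice made
       for this example, not something PCRE2 requires.

```c
#include <sys/resource.h>

/* Hypothetical helper: make sure the soft stack limit is at least wanted
   bytes, clamped to the hard limit. Returns 0 on success and -1 if either
   getrlimit() or setrlimit() fails. */
int ensure_stack_limit(rlim_t wanted)
{
  struct rlimit rlim;

  if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;

  /* Nothing to do if the soft limit is already large enough. */
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted) return 0;

  /* The soft limit cannot be raised above the hard limit. */
  if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
    wanted = rlim.rlim_max;

  rlim.rlim_cur = wanted;
  return (setrlimit(RLIMIT_STACK, &rlim) == 0)? 0 : -1;
}
```

       A  program  might call ensure_stack_limit(100*1024*1024) early in main()
       and fall back to setting match limits if the call fails.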
9233
9234   Changing stack size in Mac OS X
9235
9236       Using setrlimit(), as described above, should also work on Mac OS X. It
9237       is also possible to set a stack size when linking a program. There is a
9238       discussion   about   stack  sizes  in  Mac  OS  X  at  this  web  site:
9239       http://developer.apple.com/qa/qa2005/qa1419.html.
9240
9241
9242AUTHOR
9243
9244       Philip Hazel
9245       University Computing Service
9246       Cambridge, England.
9247
9248
9249REVISION
9250
9251       Last updated: 21 November 2014
9252       Copyright (c) 1997-2014 University of Cambridge.
9253------------------------------------------------------------------------------
9254
9255
9256PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)
9257
9258
9259
9260NAME
9261       PCRE2 - Perl-compatible regular expressions (revised API)
9262
9263PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
9264
9265       The  full syntax and semantics of the regular expressions that are sup-
9266       ported by PCRE2 are described in the pcre2pattern  documentation.  This
9267       document contains a quick-reference summary of the syntax.
9268
9269
9270QUOTING
9271
9272         \x         where x is non-alphanumeric is a literal x
9273         \Q...\E    treat enclosed characters as literal
9274
9275
9276ESCAPED CHARACTERS
9277
9278       This table applies to ASCII and Unicode environments.
9279
9280         \a         alarm, that is, the BEL character (hex 07)
9281         \cx        "control-x", where x is any ASCII printing character
9282         \e         escape (hex 1B)
9283         \f         form feed (hex 0C)
9284         \n         newline (hex 0A)
9285         \r         carriage return (hex 0D)
9286         \t         tab (hex 09)
9287         \0dd       character with octal code 0dd
9288         \ddd       character with octal code ddd, or backreference
9289         \o{ddd..}  character with octal code ddd..
9290         \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
9291         \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
9292         \xhh       character with hex code hh
9293         \x{hhh..}  character with hex code hhh..
9294
9295       Note that \0dd is always an octal code. The treatment of backslash fol-
9296       lowed by a non-zero digit is complicated; for details see  the  section
9297       "Non-printing  characters"  in  the  pcre2pattern  documentation, where
9298       details of escape processing in EBCDIC environments are also given.
9299
9300       When \x is not followed by {, from zero to two hexadecimal  digits  are
9301       read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
9302       imal digits to be recognized as  a  hexadecimal  escape;  otherwise  it
9303       matches  a literal "x".  Likewise, if \u (in ALT_BSUX mode) is not fol-
9304       lowed by four hexadecimal digits, it matches a literal "u".
9305
9306
9307CHARACTER TYPES
9308
9309         .          any character except newline;
9310                      in dotall mode, any character whatsoever
9311         \C         one code unit, even in UTF mode (best avoided)
9312         \d         a decimal digit
9313         \D         a character that is not a decimal digit
9314         \h         a horizontal white space character
9315         \H         a character that is not a horizontal white space character
9316         \N         a character that is not a newline
9317         \p{xx}     a character with the xx property
9318         \P{xx}     a character without the xx property
9319         \R         a newline sequence
9320         \s         a white space character
9321         \S         a character that is not a white space character
9322         \v         a vertical white space character
9323         \V         a character that is not a vertical white space character
9324         \w         a "word" character
9325         \W         a "non-word" character
9326         \X         a Unicode extended grapheme cluster
9327
9328       \C is dangerous because it may leave the current matching point in  the
9329       middle of a UTF-8 or UTF-16 character. The application can lock out the
9330       use of \C by setting the PCRE2_NEVER_BACKSLASH_C  option.  It  is  also
9331       possible to build PCRE2 with the use of \C permanently disabled.
9332
9333       By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
9334       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
9335       matching  is  happening,  \s and \w may also match characters with code
9336       points in the range 128-255. If the PCRE2_UCP option is set, the behav-
9337       iour of these escape sequences is changed to use Unicode properties and
9338       they match many more characters.
9339
9340
9341GENERAL CATEGORY PROPERTIES FOR \p and \P
9342
9343         C          Other
9344         Cc         Control
9345         Cf         Format
9346         Cn         Unassigned
9347         Co         Private use
9348         Cs         Surrogate
9349
9350         L          Letter
9351         Ll         Lower case letter
9352         Lm         Modifier letter
9353         Lo         Other letter
9354         Lt         Title case letter
9355         Lu         Upper case letter
9356         L&         Ll, Lu, or Lt
9357
9358         M          Mark
9359         Mc         Spacing mark
9360         Me         Enclosing mark
9361         Mn         Non-spacing mark
9362
9363         N          Number
9364         Nd         Decimal number
9365         Nl         Letter number
9366         No         Other number
9367
9368         P          Punctuation
9369         Pc         Connector punctuation
9370         Pd         Dash punctuation
9371         Pe         Close punctuation
9372         Pf         Final punctuation
9373         Pi         Initial punctuation
9374         Po         Other punctuation
9375         Ps         Open punctuation
9376
9377         S          Symbol
9378         Sc         Currency symbol
9379         Sk         Modifier symbol
9380         Sm         Mathematical symbol
9381         So         Other symbol
9382
9383         Z          Separator
9384         Zl         Line separator
9385         Zp         Paragraph separator
9386         Zs         Space separator
9387
9388
9389PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
9390
9391         Xan        Alphanumeric: union of properties L and N
9392         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
9393         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
9395                      represented by a Universal Character Name
9396         Xwd        Perl word: property Xan or underscore
9397
9398       Perl and POSIX space are now the same. Perl added VT to its space char-
9399       acter set at release 5.18.
9400
9401
9402SCRIPT NAMES FOR \p AND \P
9403
9404       Ahom,   Anatolian_Hieroglyphs,  Arabic,  Armenian,  Avestan,  Balinese,
9405       Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille,  Buginese,
9406       Buhid,  Canadian_Aboriginal,  Carian, Caucasian_Albanian, Chakma, Cham,
9407       Cherokee,  Common,  Coptic,  Cuneiform,  Cypriot,  Cyrillic,   Deseret,
9408       Devanagari,  Duployan,  Egyptian_Hieroglyphs,  Elbasan, Ethiopic, Geor-
9409       gian, Glagolitic, Gothic,  Grantha,  Greek,  Gujarati,  Gurmukhi,  Han,
9410       Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
9411       Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,  Kaithi,  Kan-
9412       nada,  Katakana,  Kayah_Li,  Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
9413       Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,  Lydian,  Maha-
9414       jani,  Malayalam,  Mandaic,  Manichaean,  Meetei_Mayek,  Mende_Kikakui,
9415       Meroitic_Cursive, Meroitic_Hieroglyphs,  Miao,  Modi,  Mongolian,  Mro,
9416       Multani,   Myanmar,   Nabataean,  New_Tai_Lue,  Nko,  Ogham,  Ol_Chiki,
9417       Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic,  Old_Persian,
9418       Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
9419       Pau_Cin_Hau,  Phags_Pa,  Phoenician,  Psalter_Pahlavi,  Rejang,  Runic,
9420       Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
9421       Sora_Sompeng,  Sundanese,  Syloti_Nagri,  Syriac,  Tagalog,   Tagbanwa,
9422       Tai_Le,   Tai_Tham,  Tai_Viet,  Takri,  Tamil,  Telugu,  Thaana,  Thai,
9423       Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.
9424
9425
9426CHARACTER CLASSES
9427
9428         [...]       positive character class
9429         [^...]      negative character class
9430         [x-y]       range (can be used for hex characters)
9431         [[:xxx:]]   positive POSIX named set
9432         [[:^xxx:]]  negative POSIX named set
9433
9434         alnum       alphanumeric
9435         alpha       alphabetic
9436         ascii       0-127
9437         blank       space or tab
9438         cntrl       control character
9439         digit       decimal digit
9440         graph       printing, excluding space
9441         lower       lower case letter
9442         print       printing, including space
9443         punct       printing, excluding alphanumeric
9444         space       white space
9445         upper       upper case letter
9446         word        same as \w
9447         xdigit      hexadecimal digit
9448
9449       In PCRE2, POSIX character set names recognize only ASCII characters  by
9450       default,  but  some of them use Unicode properties if PCRE2_UCP is set.
9451       You can use \Q...\E inside a character class.
9452
9453
9454QUANTIFIERS
9455
9456         ?           0 or 1, greedy
9457         ?+          0 or 1, possessive
9458         ??          0 or 1, lazy
9459         *           0 or more, greedy
9460         *+          0 or more, possessive
9461         *?          0 or more, lazy
9462         +           1 or more, greedy
9463         ++          1 or more, possessive
9464         +?          1 or more, lazy
9465         {n}         exactly n
9466         {n,m}       at least n, no more than m, greedy
9467         {n,m}+      at least n, no more than m, possessive
9468         {n,m}?      at least n, no more than m, lazy
9469         {n,}        n or more, greedy
9470         {n,}+       n or more, possessive
9471         {n,}?       n or more, lazy
9472
9473
9474ANCHORS AND SIMPLE ASSERTIONS
9475
9476         \b          word boundary
9477         \B          not a word boundary
9478         ^           start of subject
9479                       also after an internal newline in multiline mode
9480                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
9481         \A          start of subject
9482         $           end of subject
9483                       also before newline at end of subject
9484                       also before internal newline in multiline mode
9485         \Z          end of subject
9486                       also before newline at end of subject
9487         \z          end of subject
9488         \G          first matching position in subject
9489
9490
9491MATCH POINT RESET
9492
9493         \K          reset start of match
9494
9495       \K is honoured in positive assertions, but ignored in negative ones.
9496
9497
9498ALTERNATION
9499
9500         expr|expr|expr...
9501
9502
9503CAPTURING
9504
9505         (...)           capturing group
9506         (?<name>...)    named capturing group (Perl)
9507         (?'name'...)    named capturing group (Perl)
9508         (?P<name>...)   named capturing group (Python)
9509         (?:...)         non-capturing group
9510         (?|...)         non-capturing group; reset group numbers for
9511                          capturing groups in each alternative
9512
9513
9514ATOMIC GROUPS
9515
9516         (?>...)         atomic, non-capturing group
9517
9518
9519COMMENT
9520
9521         (?#....)        comment (not nestable)
9522
9523
9524OPTION SETTING
9525
9526         (?i)            caseless
9527         (?J)            allow duplicate names
9528         (?m)            multiline
9529         (?s)            single line (dotall)
9530         (?U)            default ungreedy (lazy)
9531         (?x)            extended (ignore white space)
9532         (?-...)         unset option(s)
9533
9534       The following are recognized only at the very start  of  a  pattern  or
9535       after  one  of the newline or \R options with similar syntax. More than
9536       one of them may appear.
9537
9538         (*LIMIT_MATCH=d) set the match limit to d (decimal number)
9539         (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
9540         (*NOTEMPTY)     set PCRE2_NOTEMPTY when matching
9541         (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
9542         (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
9543         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
9544         (*NO_JIT)       disable JIT optimization
9545         (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
9546         (*UTF)          set appropriate UTF mode for the library in use
9547         (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
9548
9549       Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value  of
9550       the  limits  set by the caller of pcre2_match(), not increase them. The
9551       application can lock out the use of (*UTF) and (*UCP)  by  setting  the
9552       PCRE2_NEVER_UTF  or  PCRE2_NEVER_UCP  options, respectively, at compile
9553       time.
9554
9555
9556NEWLINE CONVENTION
9557
9558       These are recognized only at the very start of  the  pattern  or  after
9559       option settings with a similar syntax.
9560
9561         (*CR)           carriage return only
9562         (*LF)           linefeed only
9563         (*CRLF)         carriage return followed by linefeed
9564         (*ANYCRLF)      all three of the above
9565         (*ANY)          any Unicode newline sequence
9566
9567
9568WHAT \R MATCHES
9569
9570       These  are  recognized  only  at the very start of the pattern or after
9571       option setting with a similar syntax.
9572
9573         (*BSR_ANYCRLF)  CR, LF, or CRLF
9574         (*BSR_UNICODE)  any Unicode newline sequence
9575
9576
9577LOOKAHEAD AND LOOKBEHIND ASSERTIONS
9578
9579         (?=...)         positive look ahead
9580         (?!...)         negative look ahead
9581         (?<=...)        positive look behind
9582         (?<!...)        negative look behind
9583
9584       Each top-level branch of a look behind must be of a fixed length.
9585
9586
9587BACKREFERENCES
9588
9589         \n              reference by number (can be ambiguous)
9590         \gn             reference by number
9591         \g{n}           reference by number
9592         \g{-n}          relative reference by number
9593         \k<name>        reference by name (Perl)
9594         \k'name'        reference by name (Perl)
9595         \g{name}        reference by name (Perl)
9596         \k{name}        reference by name (.NET)
9597         (?P=name)       reference by name (Python)
9598
9599
9600SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
9601
9602         (?R)            recurse whole pattern
9603         (?n)            call subpattern by absolute number
9604         (?+n)           call subpattern by relative number
9605         (?-n)           call subpattern by relative number
9606         (?&name)        call subpattern by name (Perl)
9607         (?P>name)       call subpattern by name (Python)
9608         \g<name>        call subpattern by name (Oniguruma)
9609         \g'name'        call subpattern by name (Oniguruma)
9610         \g<n>           call subpattern by absolute number (Oniguruma)
9611         \g'n'           call subpattern by absolute number (Oniguruma)
9612         \g<+n>          call subpattern by relative number (PCRE2 extension)
9613         \g'+n'          call subpattern by relative number (PCRE2 extension)
9614         \g<-n>          call subpattern by relative number (PCRE2 extension)
9615         \g'-n'          call subpattern by relative number (PCRE2 extension)
9616
9617
9618CONDITIONAL PATTERNS
9619
9620         (?(condition)yes-pattern)
9621         (?(condition)yes-pattern|no-pattern)
9622
9623         (?(n)               absolute reference condition
9624         (?(+n)              relative reference condition
9625         (?(-n)              relative reference condition
9626         (?(<name>)          named reference condition (Perl)
9627         (?('name')          named reference condition (Perl)
9628         (?(name)            named reference condition (PCRE2)
9629         (?(R)               overall recursion condition
9630         (?(Rn)              specific group recursion condition
9631         (?(R&name)          specific recursion condition
9632         (?(DEFINE)          define subpattern for reference
9633         (?(VERSION[>]=n.m)  test PCRE2 version
9634         (?(assert)          assertion condition
9635
9636
9637BACKTRACKING CONTROL
9638
       The following act as soon as they are reached:
9640
9641         (*ACCEPT)       force successful match
9642         (*FAIL)         force backtrack; synonym (*F)
9643         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
9644
9645       The following act only when a subsequent match failure causes  a  back-
9646       track to reach them. They all force a match failure, but they differ in
9647       what happens afterwards. Those that advance the start-of-match point do
9648       so only if the pattern is not anchored.
9649
9650         (*COMMIT)       overall failure, no advance of starting point
9651         (*PRUNE)        advance to next starting character
9652         (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
9653         (*SKIP)         advance to current matching position
9654         (*SKIP:NAME)    advance to position corresponding to an earlier
9655                         (*MARK:NAME); if not found, the (*SKIP) is ignored
9656         (*THEN)         local failure, backtrack to next alternation
9657         (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
9658
9659
9660CALLOUTS
9661
9662         (?C)            callout (assumed number 0)
9663         (?Cn)           callout with numerical data n
9664         (?C"text")      callout with string data
9665
9666       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
9667       the start and the end), and the starting delimiter { matched  with  the
9668       ending  delimiter  }. To encode the ending delimiter within the string,
9669       double it.
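
       The delimiter rule above can be sketched as a small decoder. This is
       only an illustration of the doubling convention: the function name
       decode_callout_string and its return convention are assumptions made
       for this example, and it is not a PCRE2 function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2. `p` points at the opening
   delimiter of a (?C...) string argument. The decoded text is written to
   `out` as a NUL-terminated string. Returns the number of pattern
   characters consumed, including both delimiters, or -1 if the delimiter
   is not allowed, the string is unterminated, or `out` is too small. */
long decode_callout_string(const char *p, char *out, size_t outsz)
{
    assert(p != NULL && out != NULL && outsz > 0);

    if (*p == '\0') return -1;
    char open = *p;
    char close;
    if (open == '{') close = '}';
    else if (strchr("`'\"^%#$", open) != NULL) close = open;
    else return -1;

    size_t n = 0;
    const char *q = p + 1;
    for (;;) {
        if (*q == '\0') return -1;        /* no ending delimiter found */
        if (*q == close) {
            if (q[1] != close) break;     /* single: end of the string */
            q++;                          /* doubled: one literal delimiter */
        }
        if (n + 1 >= outsz) return -1;    /* keep room for the NUL */
        out[n++] = *q++;
    }
    out[n] = '\0';
    return (long)(q - p + 1);
}
```

       With this sketch, the argument "ab""c" decodes to the four characters
       ab"c, and {a}}b} decodes to a}b.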
9670
9671
9672SEE ALSO
9673
9674       pcre2pattern(3),   pcre2api(3),   pcre2callout(3),    pcre2matching(3),
9675       pcre2(3).
9676
9677
9678AUTHOR
9679
9680       Philip Hazel
9681       University Computing Service
9682       Cambridge, England.
9683
9684
9685REVISION
9686
9687       Last updated: 16 October 2015
9688       Copyright (c) 1997-2015 University of Cambridge.
9689------------------------------------------------------------------------------
9690
9691
9692PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)
9693
9694
9695
9696NAME
       PCRE2 - Perl-compatible regular expressions (revised API)
9698
9699UNICODE AND UTF SUPPORT
9700
9701       When PCRE2 is built with Unicode support (which is the default), it has
9702       knowledge of Unicode character properties and can process text  strings
9703       in  UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
9704       However, by default, PCRE2 assumes that one code unit is one character.
9705       To  process  a  pattern  as a UTF string, where a character may require
9706       more than one  code  unit,  you  must  call  pcre2_compile()  with  the
9707       PCRE2_UTF  option  flag,  or  the  pattern must start with the sequence
9708       (*UTF). When either of these is the case, both the pattern and any sub-
9709       ject  strings  that  are  matched against it are treated as UTF strings
9710       instead of strings of individual one-code-unit characters.
9711
9712       If you do not need Unicode support you can build PCRE2 without  it,  in
9713       which case the library will be smaller.
9714
9715
9716UNICODE PROPERTY SUPPORT
9717
       When PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. The Unicode properties that can be tested
       are limited to the general category properties such as Lu for an upper
       case letter or Nd for a decimal number, the Unicode script names such
       as Arabic or Han, and the derived properties Any and L&. Full lists are
       given in the pcre2pattern and pcre2syntax documentation. Only the short
       names for properties are supported. For example, \p{L} matches a
       letter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in
       Perl, many properties may optionally be prefixed by "Is", for
       compatibility with Perl 5.6. PCRE2 does not support this.
9728
9729
9730WIDE CHARACTERS AND UTF MODES
9731
9732       Codepoints less than 256 can be specified in patterns by either  braced
9733       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
9734       Larger values have to use braced sequences. Unbraced octal code  points
9735       up to \777 are also recognized; larger ones can be coded using \o{...}.
9736
9737       In  UTF modes, repeat quantifiers apply to complete UTF characters, not
9738       to individual code units.
9739
9740       In UTF modes, the dot metacharacter matches one UTF  character  instead
9741       of a single code unit.
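
       The difference between characters and code units can be made concrete
       with a short sketch for the 8-bit library. The helper name
       utf8_char_count is an assumption for illustration, and the input is
       assumed to be valid UTF-8 already. In UTF mode a pattern such as .{3}
       consumes three characters, which may be more than three code units.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a buffer that is assumed to hold valid UTF-8.
   Every byte that is not a continuation byte (10xxxxxx) starts a new
   character, so counting those bytes counts characters. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);

    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) count++;
    return count;
}
```

       For the six bytes 0x61 0xc3 0xa9 0xe2 0x82 0xac (the letter a, U+00E9,
       and U+20AC) this returns 3, so the subject is three characters long in
       UTF mode but six characters long in non-UTF mode.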
9742
9743       The escape sequence \C can be used to match a single code unit in a UTF
9744       mode, but its use can lead to some strange effects because it breaks up
9745       multi-unit  characters  (see  the description of \C in the pcre2pattern
9746       documentation).
9747
9748       The use of \C is not supported by  the  alternative  matching  function
9749       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
9750       ter may consist of more than one code unit. The  use  of  \C  in  these
9751       modes  provokes a match-time error. Also, the JIT optimization does not
9752       support \C in these modes. If JIT optimization is requested for a UTF-8
9753       or  UTF-16  pattern  that contains \C, it will not succeed, and so when
9754       pcre2_match() is called, the matching will be carried out by the normal
9755       interpretive function.
9756
9757       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
9758       characters of any code value, but,  by  default,  the  characters  that
9759       PCRE2  recognizes as digits, spaces, or word characters remain the same
9760       set as in non-UTF mode, all  with  code  points  less  than  256.  This
9761       remains  true  even  when  PCRE2  is  built to include Unicode support,
9762       because to do otherwise would slow down matching in many common  cases.
9763       Note  that  this also applies to \b and \B, because they are defined in
9764       terms of \w and \W. If you want to test for  a  wider  sense  of,  say,
9765       "digit",  you  can  use explicit Unicode property tests such as \p{Nd}.
9766       Alternatively, if you set the PCRE2_UCP option, the way that the  char-
9767       acter  escapes  work  is changed so that Unicode properties are used to
9768       determine which characters match. There are more details in the section
9769       on generic character types in the pcre2pattern documentation.
9770
9771       Similarly,  characters that match the POSIX named character classes are
9772       all low-valued characters, unless the PCRE2_UCP option is set.
9773
9774       However, the special  horizontal  and  vertical  white  space  matching
9775       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
9776       acters, whether or not PCRE2_UCP is set.
9777
9778       Case-insensitive matching in UTF mode makes use of Unicode  properties.
9779       A  few  Unicode characters such as Greek sigma have more than two code-
9780       points that are case-equivalent, and these are treated as such.
9781
9782
9783VALIDITY OF UTF STRINGS
9784
       When the PCRE2_UTF option is set, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the relevant
       functions. If an invalid UTF string is passed, a negative error code
       is returned. The code unit offset to the offending character can be
       extracted from the match data block by calling pcre2_get_startchar(),
       which is used for this purpose after a UTF error.
9791
       UTF-16 and UTF-32 strings can indicate their endianness by a special
       code known as a byte-order mark (BOM). The PCRE2 functions do not
       handle this, expecting strings to be in host byte order.
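
       Because the library does not interpret a BOM, an application that
       reads UTF-16 text from outside has to deal with it before calling
       PCRE2. The following is a minimal sketch of one way to do that. The
       name utf16_strip_bom is hypothetical, and the sketch assumes that a
       leading 0xfffe code unit is a byte-swapped BOM, which is reasonable
       because U+FFFE is a non-character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-processing step, not a PCRE2 function. If the buffer
   starts with a BOM, convert the text to host byte order in place when
   needed and return a pointer just past the BOM, with *len reduced to
   match. A buffer without a BOM is returned unchanged and is assumed to
   be in host byte order already. */
uint16_t *utf16_strip_bom(uint16_t *buf, size_t *len)
{
    assert(buf != NULL && len != NULL);
    if (*len == 0) return buf;

    if (buf[0] == 0xFFFE) {
        /* The BOM was read with the wrong byte order: swap every unit. */
        for (size_t i = 0; i < *len; i++)
            buf[i] = (uint16_t)((buf[i] << 8) | (buf[i] >> 8));
    }
    if (buf[0] == 0xFEFF) {
        (*len)--;
        return buf + 1;
    }
    return buf;
}
```

       The returned pointer and length are what would then be passed on as
       the pattern or subject.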
9795
9796       A UTF string is checked before any other processing takes place. In the
9797       case of pcre2_match()  and  pcre2_dfa_match()  calls  with  a  non-zero
9798       starting  offset, the check is applied only to that part of the subject
9799       that could be inspected during matching, and there is a check that  the
9800       starting  offset points to the first code unit of a character or to the
9801       end of the subject. If there are no lookbehind assertions in  the  pat-
9802       tern,  the check starts at the starting offset. Otherwise, it starts at
9803       the length of the longest lookbehind before the starting offset, or  at
9804       the  start  of the subject if there are not that many characters before
9805       the starting offset. Note that the sequences \b and \B are  one-charac-
9806       ter lookbehinds.
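
       The two rules in this paragraph can be sketched for UTF-8 as follows.
       Both function names are assumptions for illustration, and this is not
       the code that PCRE2 itself uses. The first function tests that a
       starting offset is on a character boundary; the second finds where a
       validity check would begin, given the longest lookbehind length in
       characters.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if `offset` is the end of the subject or the first code unit of
   a character, that is, not a UTF-8 continuation byte (10xxxxxx). */
int utf8_offset_ok(const unsigned char *s, size_t len, size_t offset)
{
    assert(s != NULL);
    if (offset > len) return 0;
    return offset == len || (s[offset] & 0xC0) != 0x80;
}

/* Move back up to `nchars` characters from `offset`, stopping early at
   the start of the subject, and return the resulting offset. `offset` is
   assumed to have passed utf8_offset_ok(). */
size_t utf8_check_start(const unsigned char *s, size_t offset, size_t nchars)
{
    assert(s != NULL);
    while (nchars > 0 && offset > 0) {
        offset--;
        /* Skip continuation bytes to reach the character's first byte. */
        while (offset > 0 && (s[offset] & 0xC0) == 0x80) offset--;
        nchars--;
    }
    return offset;
}
```

       With no lookbehind, nchars is zero and the check starts at the
       starting offset itself. A pattern containing \b would use nchars of at
       least one.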
9807
9808       In  addition  to checking the format of the string, there is a check to
9809       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
9810       the  surrogate  area. The so-called "non-character" code points are not
9811       excluded because Unicode corrigendum #9 makes it clear that they should
9812       not be.
9813
9814       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
9815       UTF-16, where they are used in pairs to encode code points with  values
9816       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
9817       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
9818       other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
9819       unfortunately messes up UTF-8 and UTF-32.)
9820
9821       In some situations, you may already know that your strings  are  valid,
9822       and  therefore  want  to  skip these checks in order to improve perfor-
9823       mance, for example in the case of a long subject string that  is  being
9824       scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
9825       pile time or at match time, PCRE2 assumes that the pattern  or  subject
9826       it is given (respectively) contains only valid UTF code unit sequences.
9827
9828       Passing  PCRE2_NO_UTF_CHECK  to pcre2_compile() just disables the check
9829       for the pattern; it does not also apply to subject strings. If you want
9830       to  disable the check for a subject string you must pass this option to
9831       pcre2_match() or pcre2_dfa_match().
9832
9833       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
9834       result is undefined and your program may crash or loop indefinitely.
9835
9836   Errors in UTF-8 strings
9837
9838       The following negative error codes are given for invalid UTF-8 strings:
9839
9840         PCRE2_ERROR_UTF8_ERR1
9841         PCRE2_ERROR_UTF8_ERR2
9842         PCRE2_ERROR_UTF8_ERR3
9843         PCRE2_ERROR_UTF8_ERR4
9844         PCRE2_ERROR_UTF8_ERR5
9845
9846       The  string  ends  with a truncated UTF-8 character; the code specifies
9847       how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
9848       characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
9849       nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
9850       checked first; hence the possibility of 4 or 5 missing bytes.
9851
9852         PCRE2_ERROR_UTF8_ERR6
9853         PCRE2_ERROR_UTF8_ERR7
9854         PCRE2_ERROR_UTF8_ERR8
9855         PCRE2_ERROR_UTF8_ERR9
9856         PCRE2_ERROR_UTF8_ERR10
9857
9858       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
9859       the character do not have the binary value 0b10 (that  is,  either  the
9860       most significant bit is 0, or the next bit is 1).
9861
9862         PCRE2_ERROR_UTF8_ERR11
9863         PCRE2_ERROR_UTF8_ERR12
9864
9865       A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes
9866       long; these code points are excluded by RFC 3629.
9867
9868         PCRE2_ERROR_UTF8_ERR13
9869
       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.
9872
9873         PCRE2_ERROR_UTF8_ERR14
9874
       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.
9878
9879         PCRE2_ERROR_UTF8_ERR15
9880         PCRE2_ERROR_UTF8_ERR16
9881         PCRE2_ERROR_UTF8_ERR17
9882         PCRE2_ERROR_UTF8_ERR18
9883         PCRE2_ERROR_UTF8_ERR19
9884
9885       A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
9886       for a value that can be represented by fewer bytes, which  is  invalid.
9887       For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
9888       rect coding uses just one byte.
9889
9890         PCRE2_ERROR_UTF8_ERR20
9891
9892       The two most significant bits of the first byte of a character have the
9893       binary  value 0b10 (that is, the most significant bit is 1 and the sec-
9894       ond is 0). Such a byte can only validly occur as the second  or  subse-
9895       quent byte of a multi-byte character.
9896
9897         PCRE2_ERROR_UTF8_ERR21
9898
9899       The  first byte of a character has the value 0xfe or 0xff. These values
9900       can never occur in a valid UTF-8 string.
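
       The error classes described above can be summarized in one sketch
       that examines a single character. This is an illustrative classifier
       written from the descriptions in this section, not PCRE2's own
       implementation. The name utf8_classify, its return convention (the
       number N of the matching PCRE2_ERROR_UTF8_ERRN, or 0), and the order
       of the later checks are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Examine the character that starts at s[0]. Returns 0 if it is valid
   UTF-8 and sets *used to its length in bytes; otherwise returns the
   number N of the PCRE2_ERROR_UTF8_ERRN description that applies. */
int utf8_classify(const unsigned char *s, size_t len, size_t *used)
{
    /* Smallest code point that needs 2, 3, 4, 5, or 6 bytes. */
    static const uint32_t min_value[] =
        { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

    assert(s != NULL && used != NULL && len > 0);

    unsigned c = s[0];
    if (c < 0x80) { *used = 1; return 0; }
    if (c < 0xC0) return 20;              /* continuation byte as first byte */
    if (c >= 0xFE) return 21;             /* 0xfe and 0xff never occur */

    /* Additional bytes implied by the first byte (RFC 2279 scheme). */
    size_t extra = c < 0xE0 ? 1 : c < 0xF0 ? 2 : c < 0xF8 ? 3 :
                   c < 0xFC ? 4 : 5;
    if (len - 1 < extra)
        return (int)(extra - (len - 1));  /* ERR1..ERR5: bytes missing */

    uint32_t value = c & (0x3Fu >> extra);    /* data bits of first byte */
    for (size_t i = 1; i <= extra; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return (int)(5 + i);          /* ERR6..ERR10: bad top bits */
        value = (value << 6) | (s[i] & 0x3Fu);
    }

    if (value < min_value[extra - 1])
        return (int)(14 + extra);         /* ERR15..ERR19: overlong */
    if (value >= 0xD800 && value <= 0xDFFF)
        return 14;                        /* surrogate, always 3 bytes */
    if (extra == 3 && value > 0x10FFFF)
        return 13;                        /* 4 bytes, beyond 0x10ffff */
    if (extra > 3)
        return extra == 4 ? 11 : 12;      /* 5 or 6 bytes long */

    *used = extra + 1;
    return 0;
}
```

       Applied to the two bytes 0xc0, 0xae from the example above, the sketch
       returns 15, because the decoded value 0x2e is below the two-byte
       minimum of 0x80.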
9901
9902   Errors in UTF-16 strings
9903
9904       The following  negative  error  codes  are  given  for  invalid  UTF-16
9905       strings:
9906
9907         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
9908         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
9909         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
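
       The three UTF-16 conditions can be sketched as a single scan. The
       function name utf16_check and its return values (0 for valid,
       otherwise 1, 2, or 3 to mirror the error numbers above) are
       assumptions for illustration; this is not PCRE2's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan UTF-16 text in host byte order. Returns 0 if it is valid. On an
   error, returns 1, 2, or 3 and stores the index of the offending code
   unit in *erroffset. */
int utf16_check(const uint16_t *s, size_t len, size_t *erroffset)
{
    assert((s != NULL || len == 0) && erroffset != NULL);

    for (size_t i = 0; i < len; i++) {
        uint16_t c = s[i];
        if ((c & 0xF800) != 0xD800) continue;     /* not a surrogate */

        int err = 0;
        if (c >= 0xDC00) err = 3;                 /* isolated low surrogate */
        else if (i + 1 == len) err = 1;           /* high surrogate at end */
        else if ((s[i + 1] & 0xFC00) != 0xDC00) err = 2;

        if (err != 0) {
            *erroffset = i;
            return err;
        }
        i++;                                      /* skip the low surrogate */
    }
    return 0;
}
```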
9910
9911
9912   Errors in UTF-32 strings
9913
9914       The  following  negative  error  codes  are  given  for  invalid UTF-32
9915       strings:
9916
9917         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
9918         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
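
       For UTF-32 the check reduces to a range test on each code unit. The
       following sketch uses the hypothetical name utf32_check and returns 0
       for valid text, or 1 or 2 to mirror the two error numbers above; it
       is not PCRE2's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan UTF-32 text in host byte order. Returns 0 if it is valid. On an
   error, returns 1 (surrogate) or 2 (greater than 0x10ffff) and stores
   the index of the offending code unit in *erroffset. */
int utf32_check(const uint32_t *s, size_t len, size_t *erroffset)
{
    assert((s != NULL || len == 0) && erroffset != NULL);

    for (size_t i = 0; i < len; i++) {
        uint32_t c = s[i];
        int err = (c >= 0xD800 && c <= 0xDFFF) ? 1 :
                  (c > 0x10FFFF) ? 2 : 0;
        if (err != 0) {
            *erroffset = i;
            return err;
        }
    }
    return 0;
}
```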
9919
9920
9921AUTHOR
9922
9923       Philip Hazel
9924       University Computing Service
9925       Cambridge, England.
9926
9927
9928REVISION
9929
9930       Last updated: 03 July 2016
9931       Copyright (c) 1997-2016 University of Cambridge.
9932------------------------------------------------------------------------------
9933
9934
9935