-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------


PCRE2(3)                   Library Functions Manual                   PCRE2(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a set of functions, written in C, that implement regular expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. Some features that appeared in Python and the
       original PCRE before they appeared in Perl are also available using the
       Python syntax. There is also some support for one or two .NET and
       Oniguruma syntax items, and there are options for requesting some minor
       changes that give better ECMAScript (aka JavaScript) compatibility.

       The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or
       32-bit code units, which means that up to three separate libraries may
       be installed. The original work to extend PCRE to 16-bit and 32-bit
       code units was done by Zoltan Herczeg and Christian Persch,
       respectively. In all three cases, strings can be interpreted either as
       one character per code unit, or as UTF-encoded Unicode, with support
       for Unicode general category properties. Unicode support is optional at
       build time (but is the default). However, processing strings as UTF
       code units must be enabled explicitly at run time.
       The version of Unicode in use can be discovered by running

         pcre2test -C

       The three libraries contain identical sets of functions, with names
       ending in _8, _16, or _32, respectively (for example,
       pcre2_compile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8,
       16, or 32, a program that uses just one code unit width can be written
       using generic names such as pcre2_compile(), and the documentation is
       written assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative function that matches the same compiled patterns in a
       different way. In certain circumstances, the alternative function has
       some advantages. For a discussion of the two matching algorithms, see
       the pcre2matching page.

       Details of exactly which Perl regular expression features are and are
       not supported by PCRE2 are given in separate documents. See the
       pcre2pattern and pcre2compat pages. There is a syntax summary in the
       pcre2syntax page.

       Some features of PCRE2 can be included, excluded, or changed when the
       library is built. The pcre2_config() function makes it possible for a
       client to discover which features are available. The features
       themselves are described in the pcre2build page. Documentation about
       building PCRE2 for various operating systems can be found in the README
       and NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal functions and
       data tables that are used by more than one of the exported external
       functions, but which are not intended for use by external callers.
       Their names all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external symbols are exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits users to
       supply arbitrary patterns for compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern.
       For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of UTF-8 code
       units instead of individual 8-bit characters. This causes both the
       pattern and any data against which it is matched to be checked for
       UTF-8 validity. If the data string is very long, such a check might use
       enough resources to degrade the performance of your application.

       One way of guarding against this possibility is to use the
       pcre2_pattern_info() function to check the compiled pattern's options
       for PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option
       when calling pcre2_compile(). This causes a compile-time error if a
       pattern contains a UTF-setting sequence.

       The use of Unicode properties for character types such as \d can also
       be enabled from within the pattern, by specifying "(*UCP)". This
       feature can be disallowed by setting the PCRE2_NEVER_UCP option.

       If your application is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be matched many
       times, you can use the PCRE2_NO_UTF_CHECK option for the second and
       subsequent matches to avoid running redundant checks.

       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
       to problems, because it may leave the current matching point in the
       middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
       option can be used by an application to lock out the use of \C, causing
       a compile-time error if it is encountered. It is also possible to build
       PCRE2 with the use of \C permanently disabled.
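
       The hazard with \C can be shown without calling PCRE2 at all. The
       following is a minimal sketch, not PCRE2 code: the helper name
       utf8_is_char_boundary() is hypothetical, and the subject is assumed
       to be valid UTF-8 already. It reports whether an offset falls at the
       start of a character, which is the property that a match position
       left behind by \C may violate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API. Assumes valid UTF-8.
   Returns 1 if offset is at the start of a character (or at the end of
   the string), and 0 if it points at a continuation byte (10xxxxxx). */
static int utf8_is_char_boundary(const uint8_t *s, size_t length,
  size_t offset)
{
  assert(s != NULL && offset <= length);
  if (offset == length) return 1;
  return (s[offset] & 0xC0) != 0x80;
}
```

       For the four-byte string "a", 0xC3, 0xA9, "b" (the middle two bytes
       encode one character), offsets 0, 1, 3, and 4 are boundaries, whereas
       offset 2 is inside a character.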

       Another way that performance can be hit is by running a pattern that
       has a very large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example. PCRE2
       provides some protection against this: see the pcre2_set_match_limit()
       function in the pcre2api page.


USER DOCUMENTATION

       The user documentation for PCRE2 comprises a number of different
       sections. In the "man" format, each of these is a separate "man page".
       In the HTML format, each is a separate page, linked from the index
       page. In the plain text format, the descriptions of the pcre2grep and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively. The remaining sections, except for the pcre2demo section
       (which is a program listing), and the short pages for individual
       functions, are concatenated in pcre2.txt, for ease of searching. The
       sections are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2stack         discussion of stack usage
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In the "man" and HTML formats, there is also a short page for each C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

       Putting an actual email address here is a spam magnet. If you want to
       email me, use my two initials, followed by the two digits 10, at the
       domain cam.ac.uk.


REVISION

       Last updated: 16 October 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2API(3)                Library Functions Manual                PCRE2API(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2 is a new API for PCRE. This document contains a description of
       all its functions. See the pcre2 document for an overview of all the
       PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       int pcre2_config(uint32_t what, void *where);


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
       code units, respectively. However, there is just one header file,
       pcre2.h. This contains the function prototypes and other definitions
       for all three libraries. One, two, or all three can be installed
       simultaneously. On Unix-like systems the libraries are called
       libpcre2-8, libpcre2-16, and libpcre2-32, and they can also co-exist
       with the original PCRE libraries.

       Character strings are passed to and from a PCRE2 library as a sequence
       of unsigned integers in code units of the appropriate width. Every
       PCRE2 function comes in three different forms, one for each library,
       for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The UCHAR types define unsigned code units of the appropriate widths.
       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
       types are constant pointers to the equivalent UCHAR types, that is,
       they are pointers to vectors of unsigned code units.

       Many applications use only one code unit width. For their convenience,
       macros are defined whose names are the generic forms such as
       pcre2_compile() and PCRE2_SPTR. These macros use the value of the macro
       PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific
       function and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by
       default. An application must define it to be 8, 16, or 32 before
       including pcre2.h in order to make use of the generic names.
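
       The way a width macro can select width-specific names may be sketched
       with ordinary token pasting. This illustrates the general technique
       only; it is not the contents of pcre2.h, and every name in it
       (DEMO_CODE_UNIT_WIDTH, DEMO_JOIN, DEMO_UCHAR, demo_length, and the
       _8 and _16 variants) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The application chooses a width before the generic names are defined. */
#define DEMO_CODE_UNIT_WIDTH 16

/* Two-level paste, so that DEMO_CODE_UNIT_WIDTH is expanded before ##. */
#define DEMO_GLUE(a, b) a##b
#define DEMO_JOIN(a, b) DEMO_GLUE(a, b)

/* Width-specific ("real") type names. */
typedef uint8_t  demo_uchar_8;
typedef uint16_t demo_uchar_16;

/* Width-specific ("real") functions: count code units before a zero. */
static size_t demo_length_8(const demo_uchar_8 *s)
{
  size_t n = 0;
  assert(s != NULL);
  while (s[n] != 0) n++;
  return n;
}

static size_t demo_length_16(const demo_uchar_16 *s)
{
  size_t n = 0;
  assert(s != NULL);
  while (s[n] != 0) n++;
  return n;
}

/* Generic names, resolved by the chosen width. */
#define DEMO_UCHAR  DEMO_JOIN(demo_uchar_, DEMO_CODE_UNIT_WIDTH)
#define demo_length DEMO_JOIN(demo_length_, DEMO_CODE_UNIT_WIDTH)
```

       With the width set to 16 as above, DEMO_UCHAR names the 16-bit type
       and a call to demo_length() is a call to demo_length_16(). The
       width-specific names remain usable directly.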

       Applications that use more than one code unit width can be linked with
       more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
       be 0 before including pcre2.h, and then use the real function names.
       Any code that is to be included in an environment where the value of
       PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
       compiler error occurs.

       When using multiple libraries in an application, you must take care
       when processing any particular pattern to use only functions from a
       single library. For example, if you want to run a match using a
       pattern that was compiled with pcre2_compile_16(), you must do so with
       pcre2_match_16(), not pcre2_match_8().

       In the function summaries above, and in the rest of this document and
       other PCRE2 documents, functions and data types are described using
       their generic names, without the 8, 16, or 32 suffix.


PCRE2 API OVERVIEW

       PCRE2 has its own native API, which is described in this document.
       There are also some wrapper functions for the 8-bit library that
       correspond to the POSIX regular expression API, but they do not give
       access to all the functionality. They are described in the pcre2posix
       documentation. Both these APIs define a set of C function calls.

       The native API C data types, function prototypes, option values, and
       error codes are defined in the header file pcre2.h, which contains
       definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
       numbers for the library. Applications can use these to include support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program against a non-dll PCRE2 library, you must define PCRE2_STATIC
       before including pcre2.h.

       The functions pcre2_compile() and pcre2_match() are used for compiling
       and matching regular expressions in a Perl-compatible manner. A sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of this program is given in the pcre2demo documentation, and the
       pcre2sample documentation describes how to compile and run it.

       Just-in-time compiler support is an optional feature of PCRE2 that can
       be built in appropriate hardware environments. It greatly speeds up the
       matching performance of many patterns. Programs can request that it be
       used if available, by calling pcre2_jit_compile() after a pattern has
       been successfully compiled by pcre2_compile(). This does nothing if JIT
       support is not available.

       More complicated programs might need to make use of the specialist
       functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
       pcre2_jit_stack_assign() in order to control the JIT code's memory
       usage.

       JIT matching is automatically used by pcre2_match() if it is available,
       unless the PCRE2_NO_JIT option is set. There is also a direct interface
       for JIT matching, which gives improved performance. The JIT-specific
       functions are discussed in the pcre2jit documentation.

       A second matching function, pcre2_dfa_match(), which is not
       Perl-compatible, is also provided. This uses a different algorithm for
       the matching. The alternative algorithm finds all possible matches (at
       a given point in the subject), and scans the subject just once (unless
       there are lookbehind assertions). However, this algorithm does not
       return captured substrings.
       A description of the two matching algorithms and their advantages and
       disadvantages is given in the pcre2matching documentation. There is no
       JIT support for pcre2_dfa_match().

       In addition to the main compiling and matching functions, there are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free() are also
       provided, to free the memory used for extracted strings.

       The function pcre2_substitute() can be called to match a pattern and
       return a copy of the subject string with substitutions for parts that
       were matched.

       Functions whose names begin with pcre2_serialize_ are used for saving
       compiled patterns on disc or elsewhere, and reloading them later.

       Finally, there are functions for finding out information about a
       compiled pattern (pcre2_pattern_info()) and about the configuration
       with which PCRE2 was built (pcre2_config()).

       Functions with names ending with _free() are used for freeing memory
       blocks of various sorts. In all cases, if one of these functions is
       called with a NULL argument, it does nothing.


STRING LENGTHS AND OFFSETS

       The PCRE2 API uses string lengths and offsets into strings of code
       units in several places. These values are always of type PCRE2_SIZE,
       which is an unsigned integer type, currently always defined as size_t.
       The largest value that can be stored in such a type (that is
       ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
       strings and unset offsets. Therefore, the longest string that can be
       handled is one less than this maximum.


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the three
       preceding, or any Unicode newline sequence. The Unicode newline
       sequences are the three just mentioned, plus the single characters VT
       (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
       U+0085), LS (line separator, U+2028), and PS (paragraph separator,
       U+2029).

       Each of the first three conventions is used by at least one operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified. If none is specified at build time, the default is
       LF, which is the Unix standard. However, the newline convention can be
       changed by an application when calling pcre2_compile(), or it can be
       specified by special text at the start of the pattern itself; this
       overrides any other settings. See the pcre2pattern page for details of
       the special character sequences.

       In the PCRE2 documentation the word "newline" is used to mean "the
       character or pair of characters that indicate a line break". The choice
       of newline convention affects the handling of the dot, circumflex, and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF is a recognized line ending sequence, the match position
       advancement for a non-anchored pattern. There is more detail about this
       in the section on pcre2_match() options below.
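
       The five conventions can be made concrete with a small recognizer.
       This is a sketch for illustration, not PCRE2's implementation: the
       enum values and the function newline_length() are hypothetical names,
       and the subject is assumed to be an array of code points (as in a
       32-bit, one character per code unit setting). The function returns
       the number of code units in the newline that starts at pos, or 0 if
       there is none under the chosen convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, standing in for the five conventions in the text. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY } nl_convention;

static size_t newline_length(const uint32_t *s, size_t length, size_t pos,
  nl_convention conv)
{
  uint32_t c;
  int crlf;

  assert(s != NULL && pos <= length);
  if (pos == length) return 0;

  c = s[pos];
  crlf = (c == 0x0D && pos + 1 < length && s[pos + 1] == 0x0A);

  switch (conv)
    {
    case NL_CR:   return (c == 0x0D)? 1 : 0;
    case NL_LF:   return (c == 0x0A)? 1 : 0;
    case NL_CRLF: return crlf? 2 : 0;

    case NL_ANYCRLF:                /* CR, LF, or CRLF */
    if (crlf) return 2;
    return (c == 0x0D || c == 0x0A)? 1 : 0;

    case NL_ANY:                    /* any Unicode newline sequence */
    if (crlf) return 2;
    switch (c)
      {
      case 0x0A: case 0x0B: case 0x0C: case 0x0D:
      case 0x85: case 0x2028: case 0x2029:
      return 1;
      default:
      return 0;
      }
    }
  return 0;
}
```

       Note that under the CRLF convention a lone CR or a lone LF is not a
       newline, whereas the "any of the three" and Unicode conventions treat
       CRLF as a single two-unit newline.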

       The choice of newline convention does not affect the interpretation of
       the \n or \r escape sequences, nor does it affect what \R matches; this
       has its own separate convention.


MULTITHREADING

       In a multithreaded application it is important to keep thread-specific
       data separate from data that can be shared between threads. The PCRE2
       library code itself is thread-safe: it contains no static or global
       variables. The API is designed to be fairly simple for non-threaded
       applications while at the same time ensuring that multithreaded
       applications can use it.

       There are several different blocks of data that are used to pass
       information between the application and the PCRE2 libraries.

   The compiled pattern

       A pointer to the compiled form of a pattern is returned to the user
       when pcre2_compile() is successful. The data in the compiled pattern is
       fixed, and does not change when the pattern is matched. Therefore, it
       is thread-safe, that is, the same compiled pattern can be used by more
       than one thread simultaneously. For example, an application can compile
       all its patterns at the start, before forking off multiple threads that
       use them. However, if the just-in-time optimization feature is being
       used, it needs separate memory stack areas for each thread. See the
       pcre2jit documentation for more details.

       In a more complicated situation, where patterns are compiled only when
       they are first needed, but are still shared between threads, pointers
       to compiled patterns must be protected from simultaneous writing by
       multiple threads, at least until a pattern has been compiled. The logic
       can be something like this:

         Get a read-only (shared) lock (mutex) for pointer
         if (pointer == NULL)
           {
           Get a write (unique) lock for pointer
           if (pointer == NULL) pointer = pcre2_compile(...
           }
         Release the lock
         Use pointer in pcre2_match()

       The second test is needed because another thread may have compiled the
       pattern while this one was waiting for the write lock. Of course,
       testing for compilation errors should also be included in the code.

       If JIT is being used, but the JIT compilation is not being done
       immediately (perhaps waiting to see if the pattern is used often
       enough), similar logic is required. JIT compilation updates a pointer
       within the compiled code block, so a thread must gain unique write
       access to the pointer before calling pcre2_jit_compile().
       Alternatively, pcre2_code_copy() can be used to obtain a private copy
       of the compiled code.

   Context blocks

       The next main section below introduces the idea of "contexts" in which
       PCRE2 functions are called. A context is nothing more than a collection
       of parameters that control the way PCRE2 operates. Grouping a number of
       parameters together in a context is a convenient way of passing them to
       a PCRE2 function without using lots of arguments. The parameters that
       are stored in contexts are in some sense "advanced features" of the
       API. Many straightforward applications will not need to use contexts.

       In a multithreaded application, if the parameters in a context are
       values that are never changed, the same context can be used by all the
       threads. However, if any thread needs to change any value in a context,
       it must make its own thread-specific copy.

   Match blocks

       The matching functions need a block of memory for working space and for
       storing the results of a match. This includes details of what was
       matched, as well as additional information such as the name of a
       (*MARK) setting. Each thread must provide its own copy of this memory.


PCRE2 CONTEXTS

       Some PCRE2 functions have a lot of parameters, many of which are used
       only by specialist applications, for example, those that use custom
       memory management or non-standard character tables.
       To keep function argument lists at a reasonable size, and at the same
       time to keep the API extensible, "uncommon" parameters are passed to
       certain functions in a context instead of directly. A context is just a
       block of memory that holds the parameter values. Applications that do
       not need to adjust any of the context parameters can pass NULL when a
       context pointer is required.

       There are three different types of context: a general context that is
       relevant for several PCRE2 operations, a compile-time context, and a
       match-time context.

   The general context

       At present, this context just contains pointers to (and data for)
       external memory management functions that are called from several
       places in the PCRE2 library. The context is named `general' rather than
       specifically `memory' because in future other fields may be added. If
       you do not want to supply your own custom memory management functions,
       you do not need to bother with a general context. A general context is
       created by:

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       The two function pointers specify custom memory management functions,
       whose prototypes are:

         void *private_malloc(PCRE2_SIZE, void *);
         void  private_free(void *, void *);

       Whenever code in PCRE2 calls these functions, the final argument is the
       value of memory_data. Either of the first two arguments of the creation
       function may be NULL, in which case the system memory management
       functions malloc() and free() are used. (This is not currently useful,
       as there are no other fields in a general context, but in future there
       might be.) The private_malloc() function is used (if supplied) to
       obtain memory for storing the context, and all three values are saved
       as part of the context.
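
       A pair of callbacks with the shapes given above might look like the
       following. This is a minimal sketch under stated assumptions: the
       names mem_stats, counting_malloc(), and counting_free() are
       hypothetical, and size_t is written in place of PCRE2_SIZE (which
       this document describes as currently always size_t) so that the
       fragment compiles without pcre2.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping block, passed to PCRE2 as memory_data. */
typedef struct {
  size_t allocations;
  size_t frees;
} mem_stats;

/* Same shape as private_malloc: the size first, memory_data last. */
static void *counting_malloc(size_t size, void *memory_data)
{
  mem_stats *stats = memory_data;
  void *block;
  assert(stats != NULL);
  block = malloc(size);
  if (block != NULL) stats->allocations++;
  return block;
}

/* Same shape as private_free: the block first, memory_data last. */
static void counting_free(void *block, void *memory_data)
{
  mem_stats *stats = memory_data;
  assert(stats != NULL);
  if (block != NULL) stats->frees++;
  free(block);
}
```

       In a program that includes pcre2.h, these would be supplied as the
       first two arguments of pcre2_general_context_create(), with the
       address of a mem_stats variable as memory_data.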

       Whenever PCRE2 creates a data block of any kind, the block contains a
       pointer to the free() function that matches the malloc() function that
       was used. When the time comes to free the block, this function is
       called.

       A general context can be copied by calling:

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       The memory used for a general context should be freed by calling:

       void pcre2_general_context_free(pcre2_general_context *gcontext);


   The compile context

       A compile context is required if you want to change the default values
       of any of the following compile-time parameters:

         What \R matches (Unicode newlines or CR, LF, CRLF only)
         PCRE2's character tables
         The newline character sequence
         The compile time nested parentheses limit
         The maximum length of the pattern string
         An external function for stack checking

       A compile context is also required if you are using custom memory
       management. If none of these apply, just pass NULL as the context
       argument of pcre2_compile().

       A compile context is created, copied, and freed by the following
       functions:

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       A compile context is created with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
       Unicode line ending sequence.
       The value is used by the JIT compiler and by the two interpreted
       matching functions, pcre2_match() and pcre2_dfa_match().

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       The value must be the result of a call to pcre2_maketables(), whose
       only argument is a general context. This function builds a set of
       character tables in the current locale.

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       This sets a maximum length, in code units, for the pattern string that
       is to be compiled. If the pattern is longer, an error is generated.
       This facility is provided so that applications that accept patterns
       from external sources can limit their size. The default is the largest
       number that a PCRE2_SIZE variable can hold, which is effectively
       unlimited.

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       This specifies which characters or character sequences are to be
       recognized as newlines. The value must be one of PCRE2_NEWLINE_CR
       (carriage return only), PCRE2_NEWLINE_LF (linefeed only),
       PCRE2_NEWLINE_CRLF (the two-character sequence CR followed by LF),
       PCRE2_NEWLINE_ANYCRLF (any of the above), or PCRE2_NEWLINE_ANY (any
       Unicode newline sequence).

       When a pattern is compiled with the PCRE2_EXTENDED option, the value of
       this parameter affects the recognition of white space and the end of
       internal comments starting with #. The value is saved with the compiled
       pattern for subsequent use by the JIT compiler and by the two
       interpreted matching functions, pcre2_match() and pcre2_dfa_match().

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       This parameter adjusts the limit, set when PCRE2 is built (default
       250), on the depth of parenthesis nesting in a pattern.
       This limit stops rogue patterns using up too much system stack when
       being compiled.

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);

       There is at least one application that runs PCRE2 in threads with very
       limited system stack, where running out of stack is to be avoided at
       all costs. The parenthesis limit above cannot take account of how much
       stack is actually available. For a finer control, you can supply a
       function that is called whenever pcre2_compile() starts to compile a
       parenthesized part of a pattern. This function can check the actual
       stack size (or anything else that it wants to, of course).

       The first argument to the guard function gives the current depth of
       nesting, and the second is user data that is set up by the last
       argument of pcre2_set_compile_recursion_guard(). The guard function
       should return zero if all is well, or non-zero to force an error.

   The match context

       A match context is required if you want to change the default values of
       any of the following match-time parameters:

         A callout function
         The offset limit for matching an unanchored pattern
         The limit for calling match() (see below)
         The limit for calling match() recursively

       A match context is also required if you are using custom memory
       management. If none of these apply, just pass NULL as the context
       argument of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().

       A match context is created, copied, and freed by the following
       functions:

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       A match context is created with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       This sets up a "callout" function, which PCRE2 will call at specified
       points during a matching operation. Details are given in the
       pcre2callout documentation.

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       The offset_limit parameter limits how far an unanchored search can
       advance in the subject string. The default value is PCRE2_UNSET. The
       pcre2_match() and pcre2_dfa_match() functions return
       PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
       given offset is not found. For example, if the pattern /abc/ is matched
       against "123abc" with an offset limit less than 3, the result is
       PCRE2_ERROR_NOMATCH. A match can never be found if the startoffset
       argument of pcre2_match() or pcre2_dfa_match() is greater than the
       offset limit.

       When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when
       calling pcre2_compile() so that when JIT is in use, different code can
       be compiled. If a match is started with a non-default offset limit when
       PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.

       The offset limit facility can be used to track progress when searching
       large subject strings. See also the PCRE2_FIRSTLINE option, which
       requires a match to start within the first line of the subject. If this
       is set with an offset limit, a match must occur in the first line and
       also within the offset limit. In other words, whichever limit comes
       first is used.

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The match_limit parameter provides a means of preventing PCRE2 from
       using up too many resources when processing patterns that are not going
       to match, but which have a very large number of possibilities in their
       search trees. The classic example is a pattern that uses nested
       unlimited repeats.

       Internally, pcre2_match() uses a function called match(), which it
       calls repeatedly (sometimes recursively). The limit set by match_limit
       is imposed on the number of times this function is called during a
       match, which has the effect of limiting the amount of backtracking that
       can take place. For patterns that are not anchored, the count restarts
       from zero for each position in the subject string. This limit is not
       relevant to pcre2_dfa_match(), which ignores it.

       When pcre2_match() is called with a pattern that was successfully
       processed by pcre2_jit_compile(), the way in which matching is executed
       is entirely different. However, there is still the possibility of
       runaway matching that goes on for a very long time, and so the
       match_limit value is also used in this case (but in a different way) to
       limit how long the matching can continue.

       The default value for the limit can be set when PCRE2 is built; if no
       value is given at build time, the default is 10 million, which handles
       all but the most extreme cases. If the limit is exceeded, pcre2_match()
       returns PCRE2_ERROR_MATCHLIMIT. A value for the match limit may also be
       supplied by an item at the start of a pattern of the form

         (*LIMIT_MATCH=ddd)

       where ddd is a decimal number. However, such a setting is ignored
       unless ddd is less than the limit set by the caller of pcre2_match()
       or, if no such limit is set, less than the default.

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The recursion_limit parameter is similar to match_limit, but instead of
       limiting the total number of times that match() is called, it limits
       the depth of recursion. The recursion depth is a smaller number than
       the total number of calls, because not all calls to match() are
       recursive. This limit is of use only if it is set smaller than
       match_limit.

       Limiting the recursion depth limits the amount of system stack that can
       be used, or, when PCRE2 has been compiled to use memory on the heap
       instead of the stack, the amount of heap memory that can be used. This
       limit is not relevant, and is ignored, when matching is done using JIT
       compiled code or by the pcre2_dfa_match() function.

       The default value for recursion_limit can be set when PCRE2 is built;
       if no value is given at build time, it is the same value as the default
       for match_limit. If the limit is exceeded, pcre2_match() returns
       PCRE2_ERROR_RECURSIONLIMIT. A value for the recursion limit may also be
       supplied by an item at the start of a pattern of the form

         (*LIMIT_RECURSION=ddd)

       where ddd is a decimal number. However, such a setting is ignored
       unless ddd is less than the limit set by the caller of pcre2_match()
       or, if no such limit is set, less than the default.

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       This function sets up two additional custom memory management functions
       for use by pcre2_match() when PCRE2 is compiled to use the heap for
       remembering backtracking data, instead of recursive function calls that
       use the system stack. There is a discussion about PCRE2's stack usage
       in the pcre2stack documentation.
       See the pcre2build documentation for details of how to build PCRE2.

       Using the heap for recursion is a non-standard way of building PCRE2,
       for use in environments that have limited stacks. Because of the
       greater use of memory management, pcre2_match() runs more slowly.
       Functions that are different to the general custom memory functions are
       provided so that special-purpose external code can be used for this
       case, because the memory blocks are all the same size. The blocks are
       retained by pcre2_match() until it is about to exit so that they can be
       re-used when possible during the match. In the absence of these
       functions, the normal custom memory management functions are used, if
       supplied, otherwise the system functions.


CHECKING BUILD-TIME OPTIONS

       int pcre2_config(uint32_t what, void *where);

       The function pcre2_config() makes it possible for a PCRE2 client to
       discover which optional features have been compiled into the PCRE2
       library. The pcre2build documentation has more details about these
       optional features.

       The first argument for pcre2_config() specifies which information is
       required. The second argument is a pointer to memory into which the
       information is placed. If NULL is passed, the function returns the
       amount of memory that is needed for the requested information. For
       calls that return numerical values, the value is in bytes; when
       requesting these values, where should point to appropriately aligned
       memory. For calls that return strings, the required length is given in
       code units, not counting the terminating zero.

       When requesting information, the returned value from pcre2_config() is
       non-negative on success, or the negative error code
       PCRE2_ERROR_BADOPTION if the value in the first argument is not
       recognized.
       The following information is available:

         PCRE2_CONFIG_BSR

       The output is a uint32_t integer whose value indicates what character
       sequences the \R escape sequence matches by default. A value of
       PCRE2_BSR_UNICODE means that \R matches any Unicode line ending
       sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
       LF, or CRLF. The default can be overridden when a pattern is compiled.

         PCRE2_CONFIG_JIT

       The output is a uint32_t integer that is set to one if support for
       just-in-time compiling is available; otherwise it is set to zero.

         PCRE2_CONFIG_JITTARGET

       The where argument should point to a buffer that is at least 48 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with a
       string that contains the name of the architecture for which the JIT
       compiler is configured, for example "x86 32bit (little endian +
       unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
       returned, otherwise the number of code units used is returned. This is
       the length of the string, plus one unit for the terminating zero.

         PCRE2_CONFIG_LINKSIZE

       The output is a uint32_t integer that contains the number of bytes used
       for internal linkage in compiled regular expressions. When PCRE2 is
       configured, the value can be set to 2, 3, or 4, with the default being
       2. This is the value that is returned by pcre2_config(). However, when
       the 16-bit library is compiled, a value of 3 is rounded up to 4, and
       when the 32-bit library is compiled, internal linkages always use 4
       bytes, so the configured value is not relevant.

       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
       for all but the most massive patterns, since it allows the size of the
       compiled pattern to be up to 64K code units.
       Larger values allow larger regular expressions to be compiled by those
       two libraries, but at the expense of slower matching.

         PCRE2_CONFIG_MATCHLIMIT

       The output is a uint32_t integer that gives the default limit for the
       number of internal matching function calls in a pcre2_match()
       execution. Further details are given with pcre2_match() below.

         PCRE2_CONFIG_NEWLINE

       The output is a uint32_t integer whose value specifies the default
       character sequence that is recognized as meaning "newline". The values
       are:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF

       The default should normally correspond to the standard sequence for
       your operating system.

         PCRE2_CONFIG_PARENSLIMIT

       The output is a uint32_t integer that gives the maximum depth of
       nesting of parentheses (of any kind) in a pattern. This limit is imposed
       to cap the amount of system stack used when a pattern is compiled. It
       is specified when PCRE2 is built; the default is 250. This limit does
       not take into account the stack that may already be used by the calling
       application. For finer control over compilation stack usage, see
       pcre2_set_compile_recursion_guard().

         PCRE2_CONFIG_RECURSIONLIMIT

       The output is a uint32_t integer that gives the default limit for the
       depth of recursion when calling the internal matching function in a
       pcre2_match() execution. Further details are given with pcre2_match()
       below.

         PCRE2_CONFIG_STACKRECURSE

       The output is a uint32_t integer that is set to one if internal
       recursion when running pcre2_match() is implemented by recursive
       function calls that use the system stack to remember their state.
       This is the usual way that PCRE2 is compiled. The output is zero if
       PCRE2 was compiled to use blocks of data on the heap instead of
       recursive function calls.

         PCRE2_CONFIG_UNICODE_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) If PCRE2 has been compiled
       without Unicode support, the buffer is filled with the text "Unicode
       not supported". Otherwise, the Unicode version string (for example,
       "8.0.0") is inserted. The number of code units used is returned. This
       is the length of the string plus one unit for the terminating zero.

         PCRE2_CONFIG_UNICODE

       The output is a uint32_t integer that is set to one if Unicode support
       is available; otherwise it is set to zero. Unicode support implies UTF
       support.

         PCRE2_CONFIG_VERSION

       The where argument should point to a buffer that is at least 12 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with the
       PCRE2 version string, zero-terminated. The number of code units used is
       returned. This is the length of the string plus one unit for the
       terminating zero.


COMPILING A PATTERN

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       The pcre2_compile() function compiles a pattern into an internal form.
       The pattern is defined by a pointer to a string of code units and a
       length. If the pattern is zero-terminated, the length can be specified
       as PCRE2_ZERO_TERMINATED.
       The function returns a pointer to a block of memory that contains the
       compiled pattern and related data, or NULL if an error occurred.

       If the compile context argument ccontext is NULL, memory for the
       compiled pattern is obtained by calling malloc(). Otherwise, it is
       obtained from the same memory function that was used for the compile
       context. The caller must free the memory by calling pcre2_code_free()
       when it is no longer needed.

       The function pcre2_code_copy() makes a copy of the compiled code in new
       memory, using the same memory allocator as was used for the original.
       However, if the code has been processed by the JIT compiler (see
       below), the JIT information cannot be copied (because it is
       position-dependent). The new copy can initially be used only for
       non-JIT matching, though it can be passed to pcre2_jit_compile() if
       required. The pcre2_code_copy() function provides a way for individual
       threads in a multithreaded application to acquire a private copy of
       shared compiled code.

       NOTE: When one of the matching functions is called, pointers to the
       compiled pattern and the subject string are set in the match data block
       so that they can be referenced by the substring extraction functions.
       After running a match, you must not free a compiled pattern (or a
       subject string) until after all operations on the match data block have
       taken place.

       The options argument for pcre2_compile() contains various bit settings
       that affect the compilation. It should be zero if no options are
       required. The available options are described below. Some of them (in
       particular, those that are compatible with Perl, but some others as
       well) can also be set and unset from within the pattern (see the
       detailed description in the pcre2pattern documentation).

       For those options that can be different in different parts of the
       pattern, the contents of the options argument specify their settings at
       the start of compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK
       options can be set at the time of matching as well as at compile time.

       Other, less frequently required compile-time parameters (for example,
       the newline setting) can be provided in a compile context (as described
       above).

       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL
       immediately. Otherwise, the variables to which these point are set to
       an error code and an offset (number of code units) within the pattern,
       respectively, when pcre2_compile() returns NULL because a compilation
       error has occurred. The values are not defined when compilation is
       successful and pcre2_compile() returns a non-NULL value.

       The pcre2_get_error_message() function (see "Obtaining a textual error
       message" below) provides a textual message for each error code.
       Compilation errors have positive error codes; UTF formatting error
       codes are negative. For an invalid UTF-8 or UTF-16 string, the offset
       is that of the first code unit of the failing character.

       Some errors are not detected until the whole pattern has been scanned;
       in these cases, the offset passed back is the length of the pattern.
       Note that the offset is in code units, not characters, even in a UTF
       mode. It may sometimes point into the middle of a UTF-8 or UTF-16
       character.

       This code fragment shows a typical straightforward call to
       pcre2_compile():

         pcre2_code *re;
         PCRE2_SIZE erroffset;
         int errorcode;
         re = pcre2_compile(
           (PCRE2_SPTR)"^A.*Z",    /* the pattern */
           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
           0,                      /* default options */
           &errorcode,             /* for error code */
           &erroffset,             /* for error offset */
           NULL);                  /* no compile context */

       The following names for option bits are defined in the pcre2.h header
       file:

         PCRE2_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the string
       that is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is the
       only way to do it in Perl.

         PCRE2_ALLOW_EMPTY_CLASS

       By default, for compatibility with Perl, a closing square bracket that
       immediately follows an opening one is treated as a data character for
       the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
       class, which therefore contains no characters and so can never match.

         PCRE2_ALT_BSUX

       This option requests alternative handling of three escape sequences,
       which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a
       compile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time error (Perl
       uses it to upper case the following character).

       (3) \x matches a lower case "x" character unless it is followed by two
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, as in Perl, a hexadecimal number is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

         PCRE2_ALT_CIRCUMFLEX

       In multiline mode (when PCRE2_MULTILINE is set), the circumflex
       metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
       is set), and also after any internal newline. However, it does not
       match after a newline at the end of the subject, for compatibility with
       Perl. If you want a multiline circumflex also to match after a
       terminating newline, you must set PCRE2_ALT_CIRCUMFLEX.

         PCRE2_ALT_VERBNAMES

       By default, for compatibility with Perl, the name in any verb sequence
       such as (*MARK:NAME) is any sequence of characters that does not
       include a closing parenthesis. The name is not processed in any way,
       and it is not possible to include a closing parenthesis in the name.
       However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
       processing is applied to verb names and only an unescaped closing
       parenthesis terminates the name. A closing parenthesis can be included
       in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
       option is set, unescaped whitespace in verb names is skipped and
       #-comments are recognized, exactly as in the rest of the pattern.

         PCRE2_AUTO_CALLOUT

       If this bit is set, pcre2_compile() automatically inserts callout
       items, all with number 255, before each pattern item. For discussion of
       the callout facility, see the pcre2callout documentation.

         PCRE2_CASELESS

       If this bit is set, letters in the pattern match both upper and lower
       case letters in the subject.
       It is equivalent to Perl's /i option, and it can be changed within a
       pattern by a (?i) option setting.

         PCRE2_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the pattern matches only
       at the end of the subject string. Without this option, a dollar also
       matches immediately before a newline at the end of the string (but not
       before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set. There is no equivalent to this option in
       Perl, and no way to set it within a pattern.

         PCRE2_DOTALL

       If this bit is set, a dot metacharacter in the pattern matches any
       character, including one that indicates a newline. However, it only
       ever matches one character, even if newlines are coded as CRLF. Without
       this option, a dot does not match when the current position in the
       subject is at a newline. This option is equivalent to Perl's /s option,
       and it can be changed within a pattern by a (?s) option setting. A
       negative class such as [^a] always matches newline characters,
       independent of the setting of this option.

         PCRE2_DUPNAMES

       If this bit is set, names used to identify capturing subpatterns need
       not be unique. This can be helpful for certain types of pattern when it
       is known that only one instance of the named subpattern can ever be
       matched. There are more details of named subpatterns below; see also
       the pcre2pattern documentation.

         PCRE2_EXTENDED

       If this bit is set, most white space characters in the pattern are
       totally ignored except when escaped or inside a character class.
       However, white space is not allowed within sequences such as (?> that
       introduce various parenthesized subpatterns, nor within numerical
       quantifiers such as {1,3}.
       Ignorable white space is permitted between an item and a following
       quantifier and between a quantifier and a following + that indicates
       possessiveness.

       PCRE2_EXTENDED also causes characters between an unescaped # outside a
       character class and the next newline, inclusive, to be ignored, which
       makes it possible to include comments inside complicated patterns. Note
       that the end of this type of comment is a literal newline sequence in
       the pattern; escape sequences that happen to represent a newline do not
       count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
       changed within a pattern by a (?x) option setting.

       Which characters are interpreted as newlines can be specified by a
       setting in the compile context that is passed to pcre2_compile() or by
       a special sequence at the start of the pattern, as described in the
       section entitled "Newline conventions" in the pcre2pattern
       documentation. A default is defined when PCRE2 is built.

         PCRE2_FIRSTLINE

       If this option is set, an unanchored pattern is required to match
       before or at the first newline in the subject string, though the
       matched text may continue over the newline. See also
       PCRE2_USE_OFFSET_LIMIT, which provides a more general limiting
       facility. If PCRE2_FIRSTLINE is set with an offset limit, a match must
       occur in the first line and also within the offset limit. In other
       words, whichever limit comes first is used.

         PCRE2_MATCH_UNSET_BACKREF

       If this option is set, a back reference to an unset subpattern group
       matches an empty string (by default this causes the current matching
       alternative to fail). A pattern such as (\1)(a) succeeds when this
       option is set (assuming it can find an "a" in the subject), whereas it
       fails by default, for Perl compatibility.
       Setting this option makes PCRE2 behave more like ECMAScript (aka
       JavaScript).

         PCRE2_MULTILINE

       By default, for the purposes of matching "start of line" and "end of
       line", PCRE2 treats the subject string as consisting of a single line
       of characters, even if it actually contains newlines. The "start of
       line" metacharacter (^) matches only at the start of the string, and
       the "end of line" metacharacter ($) matches only at the end of the
       string, or before a terminating newline (except when
       PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL
       is set, the "any character" metacharacter (.) does not match at a
       newline. This behaviour (for ^, $, and dot) is the same as Perl.

       When PCRE2_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before internal
       newlines in the subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and it can be
       changed within a pattern by a (?m) option setting. Note that the "start
       of line" metacharacter does not match after a newline at the end of the
       subject, for compatibility with Perl. However, you can change this by
       setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
       subject string, or no occurrences of ^ or $ in a pattern, setting
       PCRE2_MULTILINE has no effect.

         PCRE2_NEVER_BACKSLASH_C

       This option locks out the use of \C in the pattern that is being
       compiled. This escape can cause unpredictable behaviour in UTF-8 or
       UTF-16 modes, because it may leave the current matching point in the
       middle of a multi-code-unit character. This option may be useful in
       applications that process patterns from external sources. Note that
       there is also a build-time option that permanently locks out the use of
       \C.

         PCRE2_NEVER_UCP

       This option locks out the use of Unicode properties for handling \B,
       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
       described for the PCRE2_UCP option below. In particular, it prevents
       the creator of the pattern from enabling this facility by starting the
       pattern with (*UCP). This option may be useful in applications that
       process patterns from external sources. The option combination
       PCRE2_UCP and PCRE2_NEVER_UCP causes an error.

         PCRE2_NEVER_UTF

       This option locks out interpretation of the pattern as UTF-8, UTF-16,
       or UTF-32, depending on which library is in use. In particular, it
       prevents the creator of the pattern from switching to UTF
       interpretation by starting the pattern with (*UTF). This option may be
       useful in applications that process patterns from external sources. The
       combination of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.

         PCRE2_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing
       parentheses in the pattern. Any opening parenthesis that is not
       followed by ? behaves as if it were followed by ?: but named
       parentheses can still be used for capturing (and they acquire numbers
       in the usual way). There is no equivalent of this option in Perl. Note
       that, if this option is set, references to capturing groups (back
       references or recursion/subroutine calls) may only refer to named
       groups, though the reference can be by name or by number.

         PCRE2_NO_AUTO_POSSESS

       If this option is set, it disables "auto-possessification", which is an
       optimization that, for example, turns a+b into a++b in order to avoid
       backtracks into a+ that can never be successful. However, if callouts
       are in use, auto-possessification means that some callouts are never
       taken.
You can set this option if you want the matching functions to do
       a full unoptimized search and run all the callouts, but it is mainly
       provided for testing purposes.

         PCRE2_NO_DOTSTAR_ANCHOR

       If this option is set, it disables an optimization that is applied when
       .* is the first significant item in a top-level branch of a pattern,
       and all the other branches also start with .* or with \A or \G or ^.
       The optimization is automatically disabled for .* if it is inside an
       atomic group or a capturing group that is the subject of a back
       reference, or if the pattern contains (*PRUNE) or (*SKIP). When the
       optimization is not disabled, such a pattern is automatically anchored
       if PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not
       set for any ^ items. Otherwise, the fact that any match must start
       either at the start of the subject or following a newline is
       remembered. Like other optimizations, this can cause callouts to be
       skipped.

         PCRE2_NO_START_OPTIMIZE

       This is an option whose main effect is at matching time. It does not
       change what pcre2_compile() generates, but it does affect the output of
       the JIT compiler.

       There are a number of optimizations that may occur at the start of a
       match, in order to speed up the process. For example, if it is known
       that an unanchored match must start with a specific character, the
       matching code searches the subject for that character, and fails
       immediately if it cannot find it, without actually running the main
       matching function. This means that a special item such as (*COMMIT) at
       the start of a pattern is not considered until after a suitable
       starting point for the match has been found. Also, when callouts or
       (*MARK) items are in use, these "start-up" optimizations can cause them
       to be skipped if the pattern is never actually used.
The start-up
       optimizations are in effect a pre-scan of the subject that takes place
       before the pattern is run.

       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
       possibly causing performance to suffer, but ensuring that in cases
       where the result is "no match", the callouts do occur, and that items
       such as (*COMMIT) and (*MARK) are considered at every possible starting
       position in the subject string.

       Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
       operation. Consider the pattern

         (*COMMIT)ABC

       When this is compiled, PCRE2 records the fact that a match must start
       with the character "A". Suppose the subject string is "DEFABC". The
       start-up optimization scans along the subject, finds "A" and runs the
       first match attempt from there. The (*COMMIT) item means that the
       pattern must match the current starting position, which in this case,
       it does. However, if the same match is run with
       PCRE2_NO_START_OPTIMIZE set, the initial scan along the subject string
       does not happen. The first match attempt is run starting from "D" and
       when this fails, (*COMMIT) prevents any further matches being tried, so
       the overall result is "no match".

       There are also other start-up optimizations. For example, a minimum
       length for the subject may be recorded. Consider the pattern

         (*MARK:A)(X|Y)

       The minimum length for a match is one character. If the subject is
       "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
       to match an empty string at the end of the subject does not take place,
       because PCRE2 knows that the subject is now too short, and so the
       (*MARK) is never encountered. In this case, the optimization does not
       affect the overall match result, which is still "no match", but it does
       affect the auxiliary information that is returned.
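
       The (*COMMIT)ABC case above can be imitated by a toy model. The sketch
       below is not PCRE2 code and calls no PCRE2 function; the name
       toy_commit_abc() and its start_optimize flag are hypothetical. It only
       mimics the described outcome: with the pre-scan, the single committed
       attempt starts at the first "A"; without it, the attempt starts at
       offset 0.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only (hypothetical helper, not part of PCRE2): imitates the
   outcome described for the pattern (*COMMIT)ABC.
   Returns 1 for "match" and 0 for "no match". */
int toy_commit_abc(const char *subject, int start_optimize)
{
    size_t len = strlen(subject);
    size_t start = 0;

    if (start_optimize) {
        /* Start-up pre-scan: look for the required first character "A". */
        const char *a = memchr(subject, 'A', len);
        if (a == NULL)
            return 0;               /* fails without any match attempt */
        start = (size_t)(a - subject);
    }

    /* (*COMMIT): exactly one attempt at the chosen starting position;
       if it fails, no later starting position is tried. */
    return len - start >= 3 && memcmp(subject + start, "ABC", 3) == 0;
}
```

       With the subject "DEFABC" the model yields a match when start_optimize
       is non-zero and "no match" when it is zero, which is the difference the
       text describes.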

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set, the validity of the pattern as a UTF string is
       automatically checked. There are discussions about the validity of
       UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
       document. If an invalid UTF sequence is found, pcre2_compile() yields
       a negative error code (via errorcode).

       If you know that your pattern is valid, and you want to skip this check
       for performance reasons, you can set the PCRE2_NO_UTF_CHECK option.
       When it is set, the effect of passing an invalid UTF string as a
       pattern is undefined. It may cause your program to crash or loop. Note
       that this option can also be passed to pcre2_match() and
       pcre2_dfa_match(), to suppress validity checking of the subject string.

         PCRE2_UCP

       This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
       \w, and some of the POSIX character classes. By default, only ASCII
       characters are recognized, but if PCRE2_UCP is set, Unicode properties
       are used instead to classify characters. More details are given in the
       section on generic character types in the pcre2pattern page. If you set
       PCRE2_UCP, matching one of the items it affects takes much longer. The
       option is available only if PCRE2 has been compiled with Unicode
       support.

         PCRE2_UNGREEDY

       This option inverts the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It is
       not compatible with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE2_USE_OFFSET_LIMIT

       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
       is going to be used to set a non-default offset limit in a match
       context for matches that use this pattern. An error is generated if an
       offset limit is set without this option.
For more details, see the
       description of pcre2_set_offset_limit() in the section that describes
       match contexts. See also the PCRE2_FIRSTLINE option above.

         PCRE2_UTF

       This option causes PCRE2 to regard both the pattern and the subject
       strings that are subsequently processed as strings of UTF characters
       instead of single-code-unit strings. It is available when PCRE2 is
       built to include Unicode support (which is the default). If Unicode
       support is not available, the use of this option provokes an error.
       Details of how this option changes the behaviour of PCRE2 are given in
       the pcre2unicode page.


COMPILATION ERROR CODES

       There are over 80 positive error codes that pcre2_compile() may return
       (via errorcode) if it finds an error in the pattern. There are also
       some negative error codes that are used for invalid UTF strings. These
       are the same as given by pcre2_match() and pcre2_dfa_match(), and are
       described in the pcre2unicode page. The pcre2_get_error_message()
       function (see "Obtaining a textual error message" below) can be called
       to obtain a textual error message from any error code.


JUST-IN-TIME (JIT) COMPILATION

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);

       These functions provide support for JIT compilation, which, if the
       just-in-time compiler is available, further processes a compiled
       pattern into machine code that executes much faster than the
       pcre2_match() interpretive matching function. Full details are given in
       the pcre2jit documentation.

       JIT compilation is a heavyweight optimization. It can take some time
       for patterns to be analyzed, and for one-off matches and simple
       patterns the benefit of faster execution might be offset by a much
       slower compilation time. Most, but not all, patterns can be optimized
       by the JIT compiler.


LOCALE SUPPORT

       PCRE2 handles caseless matching, and determines whether characters are
       letters, digits, or whatever, by reference to a set of tables, indexed
       by character code point. This applies only to characters whose code
       points are less than 256. By default, higher-valued code points never
       match escapes such as \w or \d.
However, if PCRE2 is built with UTF
       support, all characters can be tested with \p and \P, or,
       alternatively, the PCRE2_UCP option can be set when a pattern is
       compiled; this causes \w and friends to use Unicode property support
       instead of the built-in tables.

       The use of locales with Unicode is discouraged. If you are handling
       characters with code points greater than 127, you should either use
       Unicode support, or use locales, but not try to mix the two.

       PCRE2 contains an internal set of character tables that are used by
       default. These are sufficient for many applications. Normally, the
       internal tables recognize only ASCII characters. However, when PCRE2 is
       built, it is possible to cause the internal tables to be rebuilt in the
       default "C" locale of the local system, which may cause them to be
       different.

       The internal tables can be overridden by tables supplied by the
       application that calls PCRE2. These may be created in a different
       locale from the default. As more and more applications change to using
       Unicode, the need for this locale support is expected to die away.

       External tables are built by calling the pcre2_maketables() function,
       in the relevant locale. The result can be passed to pcre2_compile() as
       often as necessary, by creating a compile context and calling
       pcre2_set_character_tables() to set the tables pointer therein.
For
       example, to build and use tables that are appropriate for the French
       locale (where accented characters with values greater than 128 are
       treated as letters), the following code could be used:

         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre2_maketables(NULL);
         ccontext = pcre2_compile_context_create(NULL);
         pcre2_set_character_tables(ccontext, tables);
         re = pcre2_compile(..., ccontext);

       The locale name "fr_FR" is used on Linux and other Unix-like systems;
       if you are using Windows, the name for the French locale is "french".
       It is the caller's responsibility to ensure that the memory containing
       the tables remains available for as long as it is needed.

       The pointer that is passed (via the compile context) to pcre2_compile()
       is saved with the compiled pattern, and the same tables are used by
       pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern,
       compilation and matching both happen in the same locale, but different
       patterns can be processed in different locales.


INFORMATION ABOUT A COMPILED PATTERN

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       The pcre2_pattern_info() function returns general information about a
       compiled pattern. For information about callouts, see the next section.
       The first argument for pcre2_pattern_info() is a pointer to the
       compiled pattern. The second argument specifies which piece of
       information is required, and the third argument is a pointer to a
       variable to receive the data. If the third argument is NULL, the first
       argument is ignored, and the function returns the size in bytes of the
       variable that is required for the information requested.
Otherwise, the yield of
       the function is zero for success, or one of the following negative
       numbers:

         PCRE2_ERROR_NULL           the argument code was NULL
         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
         PCRE2_ERROR_BADOPTION      the value of what was invalid
         PCRE2_ERROR_UNSET          the requested field is not set

       The "magic number" is placed at the start of each compiled pattern as
       a simple check against passing an arbitrary memory pointer. Here is a
       typical call of pcre2_pattern_info(), to obtain the length of the
       compiled pattern:

         int rc;
         size_t length;
         rc = pcre2_pattern_info(
           re,               /* result of pcre2_compile() */
           PCRE2_INFO_SIZE,  /* what is required */
           &length);         /* where to put the data */

       The possible values for the second argument are defined in pcre2.h, and
       are as follows:

         PCRE2_INFO_ALLOPTIONS
         PCRE2_INFO_ARGOPTIONS

       Return a copy of the pattern's options. The third argument should point
       to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
       options that were passed to pcre2_compile(), whereas
       PCRE2_INFO_ALLOPTIONS returns the compile options as modified by any
       top-level (*XXX) option settings such as (*UTF) at the start of the
       pattern itself.

       For example, if the pattern /(*UTF)abc/ is compiled with the
       PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
       PCRE2_EXTENDED and PCRE2_UTF. Option settings such as (?i) that can
       change within a pattern do not affect the result of
       PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the
       pattern. (This was different in some earlier releases.)

       A pattern compiled without PCRE2_ANCHORED is automatically anchored by
       PCRE2 if the first significant item in every top-level branch is one of
       the following:

         ^     unless PCRE2_MULTILINE is set
         \A    always
         \G    always
         .*    sometimes - see below

       When .* is the first significant item, anchoring is possible only when
       all the following are true:

         .* is not in an atomic group
         .* is not in a capturing group that is the subject
              of a back reference
         PCRE2_DOTALL is in force for .*
         Neither (*PRUNE) nor (*SKIP) appears in the pattern.
         PCRE2_NO_DOTSTAR_ANCHOR is not set.

       For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
       the options returned for PCRE2_INFO_ALLOPTIONS.

         PCRE2_INFO_BACKREFMAX

       Return the number of the highest back reference in the pattern. The
       third argument should point to a uint32_t variable. Named subpatterns
       acquire numbers as well as names, and these count towards the highest
       back reference. Back references such as \4 or \g{12} match the
       captured characters of the given group, but in addition, the check that
       a capturing group is set in a conditional subpattern such as (?(3)a|b)
       is also a back reference. Zero is returned if there are no back
       references.

         PCRE2_INFO_BSR

       The output is a uint32_t whose value indicates what character sequences
       the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that
       \R matches any Unicode line ending sequence; a value of
       PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.

         PCRE2_INFO_CAPTURECOUNT

       Return the highest capturing subpattern number in the pattern. In
       patterns where (?| is not used, this is also the total number of
       capturing subpatterns. The third argument should point to a uint32_t
       variable.

         PCRE2_INFO_FIRSTBITMAP

       In the absence of a single first code unit for a non-anchored pattern,
       pcre2_compile() may construct a 256-bit table that defines a fixed set
       of values for the first code unit in any match. For example, a pattern
       that starts with [abc] results in a table with three bits set. When
       code unit values greater than 255 are supported, the flag bit for 255
       means "any code unit of value 255 or above". If such a table was
       constructed, a pointer to it is returned. Otherwise NULL is returned.
       The third argument should point to a const uint8_t * variable.

         PCRE2_INFO_FIRSTCODETYPE

       Return information about the first code unit of any matched string, for
       a non-anchored pattern. The third argument should point to a uint32_t
       variable. If there is a fixed first value, for example, the letter "c"
       from a pattern such as (cat|cow|coyote), 1 is returned, and the
       character value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If
       there is no fixed first value, but it is known that a match can occur
       only at the start of the subject or following a newline in the subject,
       2 is returned. Otherwise, and for anchored patterns, 0 is returned.

         PCRE2_INFO_FIRSTCODEUNIT

       Return the value of the first code unit of any matched string in the
       situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
       The third argument should point to a uint32_t variable. In the 8-bit
       library, the value is always less than 256. In the 16-bit library the
       value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
       mode.

         PCRE2_INFO_HASBACKSLASHC

       Return 1 if the pattern contains any instances of \C, otherwise 0. The
       third argument should point to a uint32_t variable.
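
       As an illustration of the 256-bit table described under
       PCRE2_INFO_FIRSTBITMAP above, the following sketch tests whether a
       given code unit could start a match. The helper name
       first_bitmap_allows() is hypothetical, and the bit layout (byte c/8,
       bit c%8, least significant bit first) is an assumption that this
       document does not state; check it against the library before relying
       on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. ASSUMED layout: the flag for code unit value c is
   bit (c % 8) of byte (c / 8), counting from the least significant bit. */
int first_bitmap_allows(const uint8_t *bitmap, uint32_t code_unit)
{
    if (bitmap == NULL)
        return 1;             /* no table was built: nothing is ruled out */
    if (code_unit > 255)
        code_unit = 255;      /* bit 255 means "value 255 or above" */
    return (bitmap[code_unit / 8] >> (code_unit % 8)) & 1;
}
```

       In real use the bitmap pointer would be the value obtained from
       pcre2_pattern_info() with PCRE2_INFO_FIRSTBITMAP.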

         PCRE2_INFO_HASCRORLF

       Return 1 if the pattern contains any explicit matches for CR or LF
       characters, otherwise 0. The third argument should point to a uint32_t
       variable. An explicit match is either a literal CR or LF character, or
       \r or \n.

         PCRE2_INFO_JCHANGED

       Return 1 if the (?J) or (?-J) option setting is used in the pattern,
       otherwise 0. The third argument should point to a uint32_t variable.
       (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option,
       respectively.

         PCRE2_INFO_JITSIZE

       If the compiled pattern was successfully processed by
       pcre2_jit_compile(), return the size of the JIT compiled code,
       otherwise return zero. The third argument should point to a size_t
       variable.

         PCRE2_INFO_LASTCODETYPE

       Return 1 if there is a rightmost literal code unit that must exist in
       any matched string, other than at its start. The third argument should
       point to a uint32_t variable. If there is no such value, 0 is
       returned. When 1 is returned, the code unit value itself can be
       retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last
       literal value is recorded only if it follows something of variable
       length. For example, for the pattern /^a\d+z\d+/ the returned value is
       1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/
       the returned value is 0.

         PCRE2_INFO_LASTCODEUNIT

       Return the value of the rightmost literal code unit that must exist in
       any matched string, other than at its start, if such a value has been
       recorded. The third argument should point to a uint32_t variable. If
       there is no such value, 0 is returned.

         PCRE2_INFO_MATCHEMPTY

       Return 1 if the pattern might match an empty string, otherwise 0. The
       third argument should point to a uint32_t variable.
When a pattern
       contains recursive subroutine calls it is not always possible to
       determine whether or not it can match an empty string. PCRE2 takes a
       cautious approach and returns 1 in such cases.

         PCRE2_INFO_MATCHLIMIT

       If the pattern set a match limit by including an item of the form
       (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
       argument should point to an unsigned 32-bit integer. If no such value
       has been set, the call to pcre2_pattern_info() returns the error
       PCRE2_ERROR_UNSET.

         PCRE2_INFO_MAXLOOKBEHIND

       Return the number of characters (not code units) in the longest
       lookbehind assertion in the pattern. The third argument should point to
       an unsigned 32-bit integer. This information is useful when doing
       multi-segment matching using the partial matching facilities. Note that
       the simple assertions \b and \B require a one-character lookbehind. \A
       also registers a one-character lookbehind, though it does not actually
       inspect the previous character. This is to ensure that at least one
       character from the old segment is retained when a new segment is
       processed. Otherwise, if there are no lookbehinds in the pattern, \A
       might match incorrectly at the start of a new segment.

         PCRE2_INFO_MINLENGTH

       If a minimum length for matching subject strings was computed, its
       value is returned. Otherwise the returned value is 0. The value is a
       number of characters, which in UTF mode may be different from the
       number of code units. The third argument should point to a uint32_t
       variable. The value is a lower bound to the length of any matching
       string. There may not be any strings of that length that do actually
       match, but every string that does match is at least that long.
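
       Because PCRE2_INFO_MINLENGTH counts characters rather than code units,
       an application that wants to compare it with a UTF-8 subject has to
       count characters itself. The sketch below shows one way this might be
       done for the 8-bit library; the helper names utf8_char_count() and
       long_enough_to_match() are hypothetical, and the subject is assumed to
       be valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Count characters in a UTF-8 string by counting every byte that is not
   a continuation byte (10xxxxxx). Assumes valid UTF-8. */
size_t utf8_char_count(const uint8_t *subject, size_t length)
{
    size_t count = 0;
    for (size_t i = 0; i < length; i++)
        if ((subject[i] & 0xC0) != 0x80)
            count++;
    return count;
}

/* Hypothetical pre-filter: returns 0 when the subject is certainly too
   short to match, and 1 when a match is not ruled out by length alone.
   min_length is the value obtained for PCRE2_INFO_MINLENGTH. */
int long_enough_to_match(const uint8_t *subject, size_t length,
                         uint32_t min_length, int utf)
{
    size_t chars = utf ? utf8_char_count(subject, length) : length;
    return chars >= min_length;
}
```

       A result of 1 says nothing about whether the subject matches; as noted
       above, the minimum length is only a lower bound.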

         PCRE2_INFO_NAMECOUNT
         PCRE2_INFO_NAMEENTRYSIZE
         PCRE2_INFO_NAMETABLE

       PCRE2 supports the use of named as well as numbered capturing
       parentheses. The names are just an additional way of identifying the
       parentheses, which still acquire numbers. Several convenience functions
       such as pcre2_substring_get_byname() are provided for extracting
       captured substrings by name. It is also possible to extract the data
       directly, by first converting the name to a number in order to access
       the correct pointers in the output vector (described with pcre2_match()
       below). To do the conversion, you need to use the name-to-number map,
       which is described by these three values.

       The map consists of a number of fixed-size entries.
       PCRE2_INFO_NAMECOUNT gives the number of entries, and
       PCRE2_INFO_NAMEENTRYSIZE gives the size of each entry in code units;
       both of these return a uint32_t value. The entry size depends on the
       length of the longest name.

       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
       This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
       library, the first two bytes of each entry are the number of the
       capturing parenthesis, most significant byte first. In the 16-bit
       library, the pointer points to 16-bit code units, the first of which
       contains the parenthesis number. In the 32-bit library, the pointer
       points to 32-bit code units, the first of which contains the
       parenthesis number. The rest of the entry is the corresponding name,
       zero terminated.

       The names are in alphabetical order. If (?| is used to create multiple
       groups with the same number, as described in the section on duplicate
       subpattern numbers in the pcre2pattern page, the groups may be given
       the same name, but there is only one entry in the table. Different
       names for groups of the same number are not permitted.
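
       The 8-bit table layout just described can be walked with a few lines of
       code. The following is a minimal sketch, not library code: the function
       name lookup_group_number() is hypothetical, and its three table
       arguments are assumed to be the values obtained for
       PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and
       PCRE2_INFO_NAMEENTRYSIZE. It returns the first entry whose name
       matches, so it makes no attempt to handle duplicate names.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for the 8-bit library layout: each entry is
   entry_size bytes, starting with a two-byte group number (most
   significant byte first) followed by the zero-terminated name.
   Returns the group number, or -1 if the name is not in the table. */
int lookup_group_number(const uint8_t *table, uint32_t name_count,
                        uint32_t entry_size, const char *name)
{
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

       A linear scan keeps the sketch short; since the names are stored in
       alphabetical order, a binary search would also be possible when
       duplicate names are not in use.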

       Duplicate names for subpatterns with different numbers are permitted,
       but only if PCRE2_DUPNAMES is set. They appear in the table in the
       order in which they were found in the pattern. In the absence of (?|
       this is the order of increasing number; when (?| is used this is not
       necessarily the case because later subpatterns may have lower numbers.

       As a simple example of the name/number table, consider the following
       pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
       is set, so white space - including newlines - is ignored):

         (?<date> (?<year>(\d\d)?\d\d) -
         (?<month>\d\d) - (?<day>\d\d) )

       There are four named subpatterns, so the table has four entries, and
       each entry in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When writing code to extract data from named subpatterns using the
       name-to-number map, remember that the length of the entries is likely
       to be different for each compiled pattern.

         PCRE2_INFO_NEWLINE

       The output is a uint32_t with one of the following values:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF

       This specifies the default character sequence that will be recognized
       as meaning "newline" while matching.

         PCRE2_INFO_RECURSIONLIMIT

       If the pattern set a recursion limit by including an item of the form
       (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
       argument should point to an unsigned 32-bit integer.
If no such value
       has been set, the call to pcre2_pattern_info() returns the error
       PCRE2_ERROR_UNSET.

         PCRE2_INFO_SIZE

       Return the size of the compiled pattern in bytes (for all three
       libraries). The third argument should point to a size_t variable. This
       value includes the size of the general data block that precedes the
       code units of the compiled pattern itself. The value that is used when
       pcre2_compile() is getting memory in which to place the compiled
       pattern may be slightly larger than the value returned by this option,
       because there are cases where the code that calculates the size has to
       over-estimate. Processing a pattern with the JIT compiler does not
       alter the value returned by this option.


INFORMATION ABOUT A PATTERN'S CALLOUTS

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout
       enumeration block, and its second argument is the user_data value that
       was passed to pcre2_callout_enumerate(). The contents of the callout
       enumeration block are described in the pcre2callout documentation,
       which also gives further details about callouts.


SERIALIZATION AND PRECOMPILING

       It is possible to save compiled patterns on disc or elsewhere, and
       reload them later, subject to a number of restrictions.
The functions
       whose names begin with pcre2_serialize_ are used for this purpose. They
       are described in the pcre2serialize documentation.


THE MATCH DATA BLOCK

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       void pcre2_match_data_free(pcre2_match_data *match_data);

       Information about a successful or unsuccessful match is placed in a
       match data block, which is an opaque structure that is accessed by
       function calls. In particular, the match data block contains a vector
       of offsets into the subject string that define the matched part of the
       subject and any substrings that were captured. This is known as the
       ovector.

       Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
       you must create a match data block by calling one of the creation
       functions above. For pcre2_match_data_create(), the first argument is
       the number of pairs of offsets in the ovector. One pair of offsets is
       required to identify the string that matched the whole pattern, with
       another pair for each captured substring. For example, a value of 4
       creates enough space to record the matched portion of the subject plus
       three captured substrings. A minimum of 1 pair is imposed by
       pcre2_match_data_create(), so it is always possible to return the
       overall matched string.

       The second argument of pcre2_match_data_create() is a pointer to a
       general context, which can specify custom memory management for
       obtaining the memory for the match data block. If you are not using
       custom memory management, pass NULL, which causes malloc() to be used.

       For pcre2_match_data_create_from_pattern(), the first argument is a
       pointer to a compiled pattern.
The ovector is created to be exactly the
       right size to hold all the substrings a pattern might capture. The
       second argument is again a pointer to a general context, but in this
       case if NULL is passed, the memory is obtained using the same allocator
       that was used for the compiled pattern (custom or default).

       A match data block can be used many times, with the same or different
       compiled patterns. You can extract information from a match data block
       after a match operation has finished, using functions that are
       described in the sections on matched strings and other match data
       below.

       When a call of pcre2_match() fails, valid data is available in the
       match block only when the error is PCRE2_ERROR_NOMATCH,
       PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF
       string. Exactly what is available depends on the error, and is detailed
       below.

       When one of the matching functions is called, pointers to the compiled
       pattern and the subject string are set in the match data block so that
       they can be referenced by the extraction functions. After running a
       match, you must not free a compiled pattern or a subject string until
       after all operations on the match data block (for that match) have
       taken place.

       When a match data block itself is no longer needed, it should be freed
       by calling pcre2_match_data_free().


MATCHING A PATTERN: THE TRADITIONAL FUNCTION

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       The function pcre2_match() is called to match a subject string against
       a compiled pattern, which is passed in the code argument.
You can call
       pcre2_match() with the same code argument as many times as you like, in
       order to find multiple matches in the subject string or to match
       different subject strings with the same pattern.

       This function is the main matching facility of the library, and it
       operates in a Perl-like manner. For specialist use there is also an
       alternative matching function, which is described below in the section
       about the pcre2_dfa_match() function.

       Here is an example of a simple call to pcre2_match():

         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_match(
           re,                         /* result of pcre2_compile() */
           (PCRE2_SPTR)"some string",  /* the subject string */
           11,                         /* the length of the subject string */
           0,                          /* start at offset 0 in the subject */
           0,                          /* default options */
           md,                         /* the match data block */
           NULL);                      /* a match context; NULL means use
                                          defaults */

       If the subject string is zero-terminated, the length can be given as
       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
       common matching parameters are to be changed. For details, see the
       section on the match context above.

   The string to be matched by pcre2_match()

       The subject string is passed to pcre2_match() as a pointer in subject,
       a length in length, and a starting offset in startoffset. The length
       and offset are in code units, not characters. That is, they are in
       bytes for the 8-bit library, 16-bit code units for the 16-bit library,
       and 32-bit code units for the 32-bit library, whether or not UTF
       processing is enabled.

       If startoffset is greater than the length of the subject, pcre2_match()
       returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the
       search for a match starts at the beginning of the subject, and this is
       by far the most common case.
       In UTF-8 or UTF-16 mode, the starting offset must point to the start
       of a character, or to the end of the subject (in UTF-32 mode, one code
       unit equals one character, so all offsets are valid). Like the pattern
       string, the subject may contain binary zeroes.

       A non-zero starting offset is useful when searching for another match
       in the same subject by calling pcre2_match() again after a previous
       success. Setting startoffset differs from passing over a shortened
       string and setting PCRE2_NOTBOL in the case of a pattern that begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which finds occurrences of "iss" in the middle of words. (\B matches
       only if the current position in the subject is not a word boundary.)
       When applied to the string "Mississippi" the first call to
       pcre2_match() finds the first occurrence. If pcre2_match() is called
       again with just the remainder of the subject, namely "issippi", it
       does not match, because \B is always false at the start of the
       subject, which is deemed to be a word boundary. However, if
       pcre2_match() is passed the entire string again, but with startoffset
       set to 4, it finds the second occurrence of "iss" because it is able
       to look behind the starting point to discover that it is preceded by
       a letter.

       Finding all the matches in a subject is tricky when the pattern can
       match an empty string. It is possible to emulate Perl's /g behaviour
       by first trying the match again at the same offset, with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that
       fails, advancing the starting offset and trying an ordinary match
       again. There is some code that demonstrates how to do this in the
       pcre2demo sample program.
       In the most general case, you have to check to see if the newline
       convention recognizes CRLF as a newline, and if so, and the current
       character is CR followed by LF, advance the starting offset by two
       characters instead of one.

       If a non-zero starting offset is passed when the pattern is anchored,
       one attempt to match at the given offset is made. This can only
       succeed if the pattern does not require the match to be at the start
       of the subject.

   Option bits for pcre2_match()

       The unused bits of the options argument for pcre2_match() must be
       zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
       PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT,
       PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their
       action is described below.

       Setting PCRE2_ANCHORED at match time is not supported by the
       just-in-time (JIT) compiler. If it is set, JIT matching is disabled
       and the normal interpretive code in pcre2_match() is run. Apart from
       PCRE2_NO_JIT (obviously), the remaining options are supported for JIT
       matching.

         PCRE2_ANCHORED

       The PCRE2_ANCHORED option limits pcre2_match() to matching at the
       first matching position. If a pattern was compiled with
       PCRE2_ANCHORED, or turned out to be anchored by virtue of its
       contents, it cannot be made unanchored at matching time. Note that
       setting the option at match time disables JIT matching.

         PCRE2_NOTBOL

       This option specifies that the first character of the subject string
       is not the beginning of a line, so the circumflex metacharacter should
       not match before it. Setting this without having set PCRE2_MULTILINE
       at compile time causes circumflex never to match. This option affects
       only the behaviour of the circumflex metacharacter. It does not
       affect \A.
2131 2132 PCRE2_NOTEOL 2133 2134 This option specifies that the end of the subject string is not the end 2135 of a line, so the dollar metacharacter should not match it nor (except 2136 in multiline mode) a newline immediately before it. Setting this with- 2137 out having set PCRE2_MULTILINE at compile time causes dollar never to 2138 match. This option affects only the behaviour of the dollar metacharac- 2139 ter. It does not affect \Z or \z. 2140 2141 PCRE2_NOTEMPTY 2142 2143 An empty string is not considered to be a valid match if this option is 2144 set. If there are alternatives in the pattern, they are tried. If all 2145 the alternatives match the empty string, the entire match fails. For 2146 example, if the pattern 2147 2148 a?b? 2149 2150 is applied to a string not beginning with "a" or "b", it matches an 2151 empty string at the start of the subject. With PCRE2_NOTEMPTY set, this 2152 match is not valid, so pcre2_match() searches further into the string 2153 for occurrences of "a" or "b". 2154 2155 PCRE2_NOTEMPTY_ATSTART 2156 2157 This is like PCRE2_NOTEMPTY, except that it locks out an empty string 2158 match only at the first matching position, that is, at the start of the 2159 subject plus the starting offset. An empty string match later in the 2160 subject is permitted. If the pattern is anchored, such a match can 2161 occur only if the pattern contains \K. 2162 2163 PCRE2_NO_JIT 2164 2165 By default, if a pattern has been successfully processed by 2166 pcre2_jit_compile(), JIT is automatically used when pcre2_match() is 2167 called with options that JIT supports. Setting PCRE2_NO_JIT disables 2168 the use of JIT; it forces matching to be done by the interpreter. 2169 2170 PCRE2_NO_UTF_CHECK 2171 2172 When PCRE2_UTF is set at compile time, the validity of the subject as a 2173 UTF string is checked by default when pcre2_match() is subsequently 2174 called. 
If a non-zero starting offset is given, the check is applied 2175 only to that part of the subject that could be inspected during match- 2176 ing, and there is a check that the starting offset points to the first 2177 code unit of a character or to the end of the subject. If there are no 2178 lookbehind assertions in the pattern, the check starts at the starting 2179 offset. Otherwise, it starts at the length of the longest lookbehind 2180 before the starting offset, or at the start of the subject if there are 2181 not that many characters before the starting offset. Note that the 2182 sequences \b and \B are one-character lookbehinds. 2183 2184 The check is carried out before any other processing takes place, and a 2185 negative error code is returned if the check fails. There are several 2186 UTF error codes for each code unit width, corresponding to different 2187 problems with the code unit sequence. There are discussions about the 2188 validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the 2189 pcre2unicode page. 2190 2191 If you know that your subject is valid, and you want to skip these 2192 checks for performance reasons, you can set the PCRE2_NO_UTF_CHECK 2193 option when calling pcre2_match(). You might want to do this for the 2194 second and subsequent calls to pcre2_match() if you are making repeated 2195 calls to find all the matches in a single subject string. 2196 2197 NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid 2198 string as a subject, or an invalid value of startoffset, is undefined. 2199 Your program may crash or loop indefinitely. 2200 2201 PCRE2_PARTIAL_HARD 2202 PCRE2_PARTIAL_SOFT 2203 2204 These options turn on the partial matching feature. A partial match 2205 occurs if the end of the subject string is reached successfully, but 2206 there are not enough subject characters to complete the match. 
       If this happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
       is set, matching continues by testing any remaining alternatives. Only
       if no complete match can be found is PCRE2_ERROR_PARTIAL returned
       instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT
       specifies that the caller is prepared to handle a partial match, but
       only if no complete match can be found.

       If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
       case, if a partial match is found, pcre2_match() immediately returns
       PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
       other words, when PCRE2_PARTIAL_HARD is set, a partial match is
       considered to be more important than an alternative complete match.

       There is a more detailed discussion of partial and multi-segment
       matching, with examples, in the pcre2partial documentation.


NEWLINE HANDLING WHEN MATCHING

       When PCRE2 is built, a default newline convention is set; this is
       usually the standard convention for the operating system. The default
       can be overridden in a compile context by calling pcre2_set_newline().
       It can also be overridden by starting a pattern string with, for
       example, (*CRLF), as described in the section on newline conventions
       in the pcre2pattern page. During matching, the newline choice affects
       the behaviour of the dot, circumflex, and dollar metacharacters. It
       may also alter the way the match starting position is advanced after
       a match failure for an unanchored pattern.

       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY
       is set as the newline convention, and a match attempt for an
       unanchored pattern fails when the current starting position is at a
       CRLF sequence, and the pattern contains no explicit matches for CR or
       LF characters, the match position is advanced by two characters
       instead of one, in other words, to after the CRLF.

       The above rule is a compromise that makes the most common cases work
       as expected. For example, if the pattern is .+A (and the PCRE2_DOTALL
       option is not set), it does not match the string "\r\nA" because,
       after failing at the start, it skips both the CR and the LF before
       retrying. However, the pattern [\r\n]A does match that string,
       because it contains an explicit CR or LF reference, and so advances
       only by one character after the first failure.

       An explicit match for CR or LF is either a literal appearance of one
       of those characters in the pattern, or one of the \r or \n escape
       sequences. Implicit matches such as [^X] do not count, nor does \s,
       even though it includes CR and LF in the characters that it matches.

       Notwithstanding the above, anomalous effects may still occur when
       CRLF is a valid newline sequence and explicit \r or \n escapes appear
       in the pattern.


HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       In general, a pattern matches a certain portion of the subject, and in
       addition, further substrings from the subject may be picked out by
       parenthesized parts of the pattern.
Following the usage in Jeffrey 2270 Friedl's book, this is called "capturing" in what follows, and the 2271 phrase "capturing subpattern" or "capturing group" is used for a frag- 2272 ment of a pattern that picks out a substring. PCRE2 supports several 2273 other kinds of parenthesized subpattern that do not cause substrings to 2274 be captured. The pcre2_pattern_info() function can be used to find out 2275 how many capturing subpatterns there are in a compiled pattern. 2276 2277 You can use auxiliary functions for accessing captured substrings by 2278 number or by name, as described in sections below. 2279 2280 Alternatively, you can make direct use of the vector of PCRE2_SIZE val- 2281 ues, called the ovector, which contains the offsets of captured 2282 strings. It is part of the match data block. The function 2283 pcre2_get_ovector_pointer() returns the address of the ovector, and 2284 pcre2_get_ovector_count() returns the number of pairs of values it con- 2285 tains. 2286 2287 Within the ovector, the first in each pair of values is set to the off- 2288 set of the first code unit of a substring, and the second is set to the 2289 offset of the first code unit after the end of a substring. These val- 2290 ues are always code unit offsets, not character offsets. That is, they 2291 are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit 2292 library, and 32-bit offsets in the 32-bit library. 2293 2294 After a partial match (error return PCRE2_ERROR_PARTIAL), only the 2295 first pair of offsets (that is, ovector[0] and ovector[1]) are set. 2296 They identify the part of the subject that was partially matched. See 2297 the pcre2partial documentation for details of partial matching. 2298 2299 After a successful match, the first pair of offsets identifies the por- 2300 tion of the subject string that was matched by the entire pattern. The 2301 next pair is used for the first capturing subpattern, and so on. 
The 2302 value returned by pcre2_match() is one more than the highest numbered 2303 pair that has been set. For example, if two substrings have been cap- 2304 tured, the returned value is 3. If there are no capturing subpatterns, 2305 the return value from a successful match is 1, indicating that just the 2306 first pair of offsets has been set. 2307 2308 If a pattern uses the \K escape sequence within a positive assertion, 2309 the reported start of a successful match can be greater than the end of 2310 the match. For example, if the pattern (?=ab\K) is matched against 2311 "ab", the start and end offset values for the match are 2 and 0. 2312 2313 If a capturing subpattern group is matched repeatedly within a single 2314 match operation, it is the last portion of the subject that it matched 2315 that is returned. 2316 2317 If the ovector is too small to hold all the captured substring offsets, 2318 as much as possible is filled in, and the function returns a value of 2319 zero. If captured substrings are not of interest, pcre2_match() may be 2320 called with a match data block whose ovector is of minimum length (that 2321 is, one pair). However, if the pattern contains back references and the 2322 ovector is not big enough to remember the related substrings, PCRE2 has 2323 to get additional memory for use during matching. Thus it is usually 2324 advisable to set up a match data block containing an ovector of reason- 2325 able size. 2326 2327 It is possible for capturing subpattern number n+1 to match some part 2328 of the subject when subpattern n has not been used at all. For example, 2329 if the string "abc" is matched against the pattern (a|(z))(bc) the 2330 return from the function is 4, and subpatterns 1 and 3 are matched, but 2331 2 is not. When this happens, both values in the offset pairs corre- 2332 sponding to unused subpatterns are set to PCRE2_UNSET. 

       Offset values that correspond to unused subpatterns at the end of the
       expression are also set to PCRE2_UNSET. For example, if the string
       "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and
       3 are not matched. The return from the function is 2, because the
       highest used capturing subpattern number is 1. The offsets for the
       second and third capturing subpatterns (assuming the vector is large
       enough, of course) are set to PCRE2_UNSET.

       Elements in the ovector that do not correspond to capturing
       parentheses in the pattern are never changed. That is, if a pattern
       contains n capturing parentheses, no more than ovector[0] to
       ovector[2n+1] are set by pcre2_match(). The other elements retain
       whatever values they previously had.


OTHER INFORMATION ABOUT A MATCH

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);

       As well as the offsets in the ovector, other information about a match
       is retained in the match data block and can be retrieved by the above
       functions in appropriate circumstances. If they are called at other
       times, the result is undefined.

       After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
       failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be
       available, and pcre2_get_mark() can be called. It returns a pointer to
       the zero-terminated name, which is within the compiled pattern.
       Otherwise NULL is returned. The length of the (*MARK) name (excluding
       the terminating zero) is stored in the code unit that precedes the
       name. You should use this instead of relying on the terminating zero
       if the (*MARK) name might contain a binary zero.

       After a successful match, the (*MARK) name that is returned is the
       last one encountered on the matching path through the pattern.
After a "no 2371 match" or a partial match, the last encountered (*MARK) name is 2372 returned. For example, consider this pattern: 2373 2374 ^(*MARK:A)((*MARK:B)a|b)c 2375 2376 When it matches "bc", the returned mark is A. The B mark is "seen" in 2377 the first branch of the group, but it is not on the matching path. On 2378 the other hand, when this pattern fails to match "bx", the returned 2379 mark is B. 2380 2381 After a successful match, a partial match, or one of the invalid UTF 2382 errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can 2383 be called. After a successful or partial match it returns the code unit 2384 offset of the character at which the match started. For a non-partial 2385 match, this can be different to the value of ovector[0] if the pattern 2386 contains the \K escape sequence. After a partial match, however, this 2387 value is always the same as ovector[0] because \K does not affect the 2388 result of a partial match. 2389 2390 After a UTF check failure, pcre2_get_startchar() can be used to obtain 2391 the code unit offset of the invalid UTF character. Details are given in 2392 the pcre2unicode page. 2393 2394 2395ERROR RETURNS FROM pcre2_match() 2396 2397 If pcre2_match() fails, it returns a negative number. This can be con- 2398 verted to a text string by calling the pcre2_get_error_message() func- 2399 tion (see "Obtaining a textual error message" below). Negative error 2400 codes are also returned by other functions, and are documented with 2401 them. The codes are given names in the header file. If UTF checking is 2402 in force and an invalid UTF subject string is detected, one of a number 2403 of UTF-specific negative error codes is returned. Details are given in 2404 the pcre2unicode page. The following are the other errors that may be 2405 returned by pcre2_match(): 2406 2407 PCRE2_ERROR_NOMATCH 2408 2409 The subject string did not match the pattern. 
2410 2411 PCRE2_ERROR_PARTIAL 2412 2413 The subject string did not match, but it did match partially. See the 2414 pcre2partial documentation for details of partial matching. 2415 2416 PCRE2_ERROR_BADMAGIC 2417 2418 PCRE2 stores a 4-byte "magic number" at the start of the compiled code, 2419 to catch the case when it is passed a junk pointer. This is the error 2420 that is returned when the magic number is not present. 2421 2422 PCRE2_ERROR_BADMODE 2423 2424 This error is given when a pattern that was compiled by the 8-bit 2425 library is passed to a 16-bit or 32-bit library function, or vice 2426 versa. 2427 2428 PCRE2_ERROR_BADOFFSET 2429 2430 The value of startoffset was greater than the length of the subject. 2431 2432 PCRE2_ERROR_BADOPTION 2433 2434 An unrecognized bit was set in the options argument. 2435 2436 PCRE2_ERROR_BADUTFOFFSET 2437 2438 The UTF code unit sequence that was passed as a subject was checked and 2439 found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the 2440 value of startoffset did not point to the beginning of a UTF character 2441 or the end of the subject. 2442 2443 PCRE2_ERROR_CALLOUT 2444 2445 This error is never generated by pcre2_match() itself. It is provided 2446 for use by callout functions that want to cause pcre2_match() or 2447 pcre2_callout_enumerate() to return a distinctive error code. See the 2448 pcre2callout documentation for details. 2449 2450 PCRE2_ERROR_INTERNAL 2451 2452 An unexpected internal error has occurred. This error could be caused 2453 by a bug in PCRE2 or by overwriting of the compiled pattern. 2454 2455 PCRE2_ERROR_JIT_BADOPTION 2456 2457 This error is returned when a pattern that was successfully studied 2458 using JIT is being matched, but the matching mode (partial or complete 2459 match) does not correspond to any JIT compilation mode. When the JIT 2460 fast path function is used, this error may be also given for invalid 2461 options. See the pcre2jit documentation for more details. 
2462 2463 PCRE2_ERROR_JIT_STACKLIMIT 2464 2465 This error is returned when a pattern that was successfully studied 2466 using JIT is being matched, but the memory available for the just-in- 2467 time processing stack is not large enough. See the pcre2jit documenta- 2468 tion for more details. 2469 2470 PCRE2_ERROR_MATCHLIMIT 2471 2472 The backtracking limit was reached. 2473 2474 PCRE2_ERROR_NOMEMORY 2475 2476 If a pattern contains back references, but the ovector is not big 2477 enough to remember the referenced substrings, PCRE2 gets a block of 2478 memory at the start of matching to use for this purpose. There are some 2479 other special cases where extra memory is needed during matching. This 2480 error is given when memory cannot be obtained. 2481 2482 PCRE2_ERROR_NULL 2483 2484 Either the code, subject, or match_data argument was passed as NULL. 2485 2486 PCRE2_ERROR_RECURSELOOP 2487 2488 This error is returned when pcre2_match() detects a recursion loop 2489 within the pattern. Specifically, it means that either the whole pat- 2490 tern or a subpattern has been called recursively for the second time at 2491 the same position in the subject string. Some simple patterns that 2492 might do this are detected and faulted at compile time, but more com- 2493 plicated cases, in particular mutual recursions between two different 2494 subpatterns, cannot be detected until matching is attempted. 2495 2496 PCRE2_ERROR_RECURSIONLIMIT 2497 2498 The internal recursion limit was reached. 2499 2500 2501OBTAINING A TEXTUAL ERROR MESSAGE 2502 2503 int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, 2504 PCRE2_SIZE bufflen); 2505 2506 A text message for an error code from any PCRE2 function (compile, 2507 match, or auxiliary) can be obtained by calling pcre2_get_error_mes- 2508 sage(). The code is passed as the first argument, with the remaining 2509 two arguments specifying a code unit buffer and its length, into which 2510 the text message is placed. 
Note that the message is returned in code 2511 units of the appropriate width for the library that is being used. 2512 2513 The returned message is terminated with a trailing zero, and the func- 2514 tion returns the number of code units used, excluding the trailing 2515 zero. If the error number is unknown, the negative error code 2516 PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes- 2517 sage is truncated (but still with a trailing zero), and the negative 2518 error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are 2519 very long; a buffer size of 120 code units is ample. 2520 2521 2522EXTRACTING CAPTURED SUBSTRINGS BY NUMBER 2523 2524 int pcre2_substring_length_bynumber(pcre2_match_data *match_data, 2525 uint32_t number, PCRE2_SIZE *length); 2526 2527 int pcre2_substring_copy_bynumber(pcre2_match_data *match_data, 2528 uint32_t number, PCRE2_UCHAR *buffer, 2529 PCRE2_SIZE *bufflen); 2530 2531 int pcre2_substring_get_bynumber(pcre2_match_data *match_data, 2532 uint32_t number, PCRE2_UCHAR **bufferptr, 2533 PCRE2_SIZE *bufflen); 2534 2535 void pcre2_substring_free(PCRE2_UCHAR *buffer); 2536 2537 Captured substrings can be accessed directly by using the ovector as 2538 described above. For convenience, auxiliary functions are provided for 2539 extracting captured substrings as new, separate, zero-terminated 2540 strings. A substring that contains a binary zero is correctly extracted 2541 and has a further zero added on the end, but the result is not, of 2542 course, a C string. 2543 2544 The functions in this section identify substrings by number. The number 2545 zero refers to the entire matched substring, with higher numbers refer- 2546 ring to substrings captured by parenthesized groups. After a partial 2547 match, only substring zero is available. An attempt to extract any 2548 other substring gives the error PCRE2_ERROR_PARTIAL. The next section 2549 describes similar functions for extracting captured substrings by name. 
2550 2551 If a pattern uses the \K escape sequence within a positive assertion, 2552 the reported start of a successful match can be greater than the end of 2553 the match. For example, if the pattern (?=ab\K) is matched against 2554 "ab", the start and end offset values for the match are 2 and 0. In 2555 this situation, calling these functions with a zero substring number 2556 extracts a zero-length empty string. 2557 2558 You can find the length in code units of a captured substring without 2559 extracting it by calling pcre2_substring_length_bynumber(). The first 2560 argument is a pointer to the match data block, the second is the group 2561 number, and the third is a pointer to a variable into which the length 2562 is placed. If you just want to know whether or not the substring has 2563 been captured, you can pass the third argument as NULL. 2564 2565 The pcre2_substring_copy_bynumber() function copies a captured sub- 2566 string into a supplied buffer, whereas pcre2_substring_get_bynumber() 2567 copies it into new memory, obtained using the same memory allocation 2568 function that was used for the match data block. The first two argu- 2569 ments of these functions are a pointer to the match data block and a 2570 capturing group number. 2571 2572 The final arguments of pcre2_substring_copy_bynumber() are a pointer to 2573 the buffer and a pointer to a variable that contains its length in code 2574 units. This is updated to contain the actual number of code units used 2575 for the extracted substring, excluding the terminating zero. 2576 2577 For pcre2_substring_get_bynumber() the third and fourth arguments point 2578 to variables that are updated with a pointer to the new memory and the 2579 number of code units that comprise the substring, again excluding the 2580 terminating zero. When the substring is no longer needed, the memory 2581 should be freed by calling pcre2_substring_free(). 
2582 2583 The return value from all these functions is zero for success, or a 2584 negative error code. If the pattern match failed, the match failure 2585 code is returned. If a substring number greater than zero is used 2586 after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible 2587 error codes are: 2588 2589 PCRE2_ERROR_NOMEMORY 2590 2591 The buffer was too small for pcre2_substring_copy_bynumber(), or the 2592 attempt to get memory failed for pcre2_substring_get_bynumber(). 2593 2594 PCRE2_ERROR_NOSUBSTRING 2595 2596 There is no substring with that number in the pattern, that is, the 2597 number is greater than the number of capturing parentheses. 2598 2599 PCRE2_ERROR_UNAVAILABLE 2600 2601 The substring number, though not greater than the number of captures in 2602 the pattern, is greater than the number of slots in the ovector, so the 2603 substring could not be captured. 2604 2605 PCRE2_ERROR_UNSET 2606 2607 The substring did not participate in the match. For example, if the 2608 pattern is (abc)|(def) and the subject is "def", and the ovector con- 2609 tains at least two capturing slots, substring number 1 is unset. 2610 2611 2612EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS 2613 2614 int pcre2_substring_list_get(pcre2_match_data *match_data, 2615 PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr); 2616 2617 void pcre2_substring_list_free(PCRE2_SPTR *list); 2618 2619 The pcre2_substring_list_get() function extracts all available sub- 2620 strings and builds a list of pointers to them. It also (optionally) 2621 builds a second list that contains their lengths (in code units), 2622 excluding a terminating zero that is added to each of them. All this is 2623 done in a single block of memory that is obtained using the same memory 2624 allocation function that was used to get the match data block. 2625 2626 This function must be called only after a successful match. 
       If called after a partial match, the error code PCRE2_ERROR_PARTIAL is
       returned.

       The address of the memory block is returned via listptr, which is also
       the start of the list of string pointers. The end of the list is
       marked by a NULL pointer. The address of the list of lengths is
       returned via lengthsptr. If your strings do not contain binary zeros
       and you do not therefore need the lengths, you may supply NULL as the
       lengthsptr argument to disable the creation of a list of lengths. The
       yield of the function is zero if all went well, or
       PCRE2_ERROR_NOMEMORY if the memory block could not be obtained. When
       the list is no longer needed, it should be freed by calling
       pcre2_substring_list_free().

       If this function encounters a substring that is unset, which can
       happen when capturing subpattern number n+1 matches some part of the
       subject, but subpattern n has not been used at all, it returns an
       empty string. This can be distinguished from a genuine zero-length
       substring by inspecting the appropriate offset in the ovector, which
       contains PCRE2_UNSET for unset substrings, or by calling
       pcre2_substring_length_bynumber().


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       To extract a substring by name, you first have to find the associated
       number. For example, for this pattern:

         (a+)b(?<xxx>\d+)...
2668 2669 the number of the subpattern called "xxx" is 2. If the name is known to 2670 be unique (PCRE2_DUPNAMES was not set), you can find the number from 2671 the name by calling pcre2_substring_number_from_name(). The first argu- 2672 ment is the compiled pattern, and the second is the name. The yield of 2673 the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there 2674 is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if 2675 there is more than one subpattern of that name. Given the number, you 2676 can extract the substring directly, or use one of the functions 2677 described above. 2678 2679 For convenience, there are also "byname" functions that correspond to 2680 the "bynumber" functions, the only difference being that the second 2681 argument is a name instead of a number. If PCRE2_DUPNAMES is set and 2682 there are duplicate names, these functions scan all the groups with the 2683 given name, and return the first named string that is set. 2684 2685 If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is 2686 returned. If all groups with the name have numbers that are greater 2687 than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is 2688 returned. If there is at least one group with a slot in the ovector, 2689 but no group is found to be set, PCRE2_ERROR_UNSET is returned. 2690 2691 Warning: If the pattern uses the (?| feature to set up multiple subpat- 2692 terns with the same number, as described in the section on duplicate 2693 subpattern numbers in the pcre2pattern page, you cannot use names to 2694 distinguish the different subpatterns, because names are not included 2695 in the compiled code. The matching process uses only numbers. For this 2696 reason, the use of different names for subpatterns of the same number 2697 causes an error at compile time. 


CREATING A NEW STRING WITH SUBSTITUTIONS

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);

       This function calls pcre2_match() and then makes a copy of the subject
       string in outputbuffer, replacing the part that was matched with the
       replacement string, whose length is supplied in rlength. This can be
       given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
       which a \K item in a lookahead in the pattern causes the match to end
       before it starts are not supported, and give rise to an error return.

       The first seven arguments of pcre2_substitute() are the same as for
       pcre2_match(), except that the partial matching options are not
       permitted, and match_data may be passed as NULL, in which case a match
       data block is obtained and freed within this function, using memory
       management functions from the match context, if provided, or else those
       that were used to allocate memory for the compiled code.

       The outlengthptr argument must point to a variable that contains the
       length, in code units, of the output buffer. If the function is
       successful, the value is updated to contain the length of the new
       string, excluding the trailing zero that is automatically added.

       If the function is not successful, the value set via outlengthptr
       depends on the type of error. For syntax errors in the replacement
       string, the value is the offset in the replacement string where the
       error was detected. For other errors, the value is PCRE2_UNSET by
       default.
       This includes the case of the output buffer being too small, unless
       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which case the
       value is the minimum length needed, including space for the trailing
       zero. Note that in order to compute the required length,
       pcre2_substitute() has to simulate all the matching and copying,
       instead of giving an error return as soon as the buffer overflows. Note
       also that the length is in code units, not bytes.

       In the replacement string, which is interpreted as a UTF string in UTF
       mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
       option is set, a dollar character is an escape character that can
       specify the insertion of characters from capturing groups or (*MARK)
       items in the pattern. The following forms are always recognized:

         $$                  insert a dollar character
         $<n> or ${<n>}      insert the contents of group <n>
         $*MARK or ${*MARK}  insert the name of the last (*MARK) encountered

       Either a group number or a group name can be given for <n>. Curly
       brackets are required only if the following character would be
       interpreted as part of the number or name. The number may be zero to
       include the entire matched string. For example, if the pattern a(b)c is
       matched with "=abc=" and the replacement string "+$1$0$1+", the result
       is "=+babcb+=".

       The facility for inserting a (*MARK) name can be used to perform simple
       simultaneous substitutions, as this pcre2test example shows:

         /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
             apple lemon
          2: pear orange

       As well as the usual options for pcre2_match(), a number of additional
       options can be set in the options argument.

       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
       string, replacing every matching substring. If this is not set, only
       the first matching substring is replaced.
       If any matched substring has zero length, after the substitution has
       happened, an attempt to find a non-empty match at the same position is
       performed. If this is not successful, the current position is advanced
       by one character except when CRLF is a valid newline sequence and the
       next two characters are CR, LF. In this case, the current position is
       advanced by two characters.

       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
       buffer is too small. The default action is to return
       PCRE2_ERROR_NOMEMORY immediately. If this option is set, however,
       pcre2_substitute() continues to go through the motions of matching and
       substituting (without, of course, writing anything) in order to compute
       the size of buffer that is needed. This value is passed back via the
       outlengthptr variable, with the result of the function still being
       PCRE2_ERROR_NOMEMORY.

       Passing a buffer size of zero is a permitted way of finding out how
       much memory is needed for a given substitution. However, this does mean
       that the entire operation is carried out twice. Depending on the
       application, it may be more efficient to allocate a large buffer and
       free the excess afterwards, instead of using
       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.

       PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
       that do not appear in the pattern to be treated as unset groups. This
       option should be used with care, because it means that a typo in a
       group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
       error.

       PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
       unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be
       treated as empty strings when inserted as described above. If this
       option is not set, an attempt to insert an unset group causes the
       PCRE2_ERROR_UNSET error.
       This option does not influence the extended substitution syntax
       described below.

       PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
       replacement string. Without this option, only the dollar character is
       special, and only the group insertion forms listed above are valid.
       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:

       Firstly, backslash in a replacement string is interpreted as an escape
       character. The usual forms such as \n or \x{ddd} can be used to specify
       particular character codes, and backslash followed by any
       non-alphanumeric character quotes that character. Extended quoting can
       be coded using \Q...\E, exactly as in pattern strings.

       There are also four escape sequences for forcing the case of inserted
       letters. The insertion mechanism has three states: no case forcing,
       force upper case, and force lower case. The escape sequences change the
       current state: \U and \L change to upper or lower case forcing,
       respectively, and \E (when not terminating a \Q quoted sequence)
       reverts to no case forcing. The sequences \u and \l force the next
       character (if it is a letter) to upper or lower case, respectively, and
       then the state automatically reverts to no case forcing. Case forcing
       applies to all inserted characters, including those from captured
       groups and letters within \Q...\E quoted sequences.

       Note that case forcing sequences such as \U...\E do not nest. For
       example, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the
       final \E has no effect.

       The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
       flexibility to group substitution. The syntax is similar to that used
       by Bash:

         ${<n>:-<string>}
         ${<n>:+<string1>:<string2>}

       As before, <n> may be a group number or a name. The first form
       specifies a default value.
       If group <n> is set, its value is inserted; if not, <string> is
       expanded and the result inserted. The second form specifies strings
       that are expanded and inserted when group <n> is set or unset,
       respectively. The first form is just a convenient shorthand for

         ${<n>:+${<n>}:<string>}

       Backslash can be used to escape colons and closing curly brackets in
       the replacement strings. A change of the case forcing state within a
       replacement string remains in force afterwards, as shown in this
       pcre2test example:

         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
             body
          1: hello
             somebody
          1: HELLO

       The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
       substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
       unknown groups in the extended syntax forms to be treated as unset.

       If successful, pcre2_substitute() returns the number of replacements
       that were made. This may be zero if no matches were found, and is never
       greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.

       In the event of an error, a negative error code is returned. Except for
       PCRE2_ERROR_NOMATCH (which is never returned), errors from
       pcre2_match() are passed straight back.

       PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring
       insertion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.

       PCRE2_ERROR_UNSET is returned for an unset substring insertion
       (including an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is
       set) when the simple (non-extended) syntax is used and
       PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.

       PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
       enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
       of buffer that is needed is returned via outlengthptr. Note that this
       does not happen by default.

       PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
       the replacement string, with more particular errors being
       PCRE2_ERROR_BADREPESCAPE (invalid escape sequence),
       PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket not found),
       PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
       substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended
       before it started, which can happen if \K is used in an assertion).

       As for all PCRE2 errors, a text message that describes the error can be
       obtained by calling the pcre2_get_error_message() function (see
       "Obtaining a textual error message" above).


DUPLICATE SUBPATTERN NAMES

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       When a pattern is compiled with the PCRE2_DUPNAMES option, names for
       subpatterns are not required to be unique. Duplicate names are always
       allowed for subpatterns with the same number, created by using the (?|
       feature. Indeed, if such subpatterns are named, they are required to
       use the same names.

       Normally, patterns with duplicate names are such that in any one match,
       only one of the named subpatterns participates. An example is shown in
       the pcre2pattern documentation.

       When duplicates are present, pcre2_substring_copy_byname() and
       pcre2_substring_get_byname() return the first substring corresponding
       to the given name that is set. Only if none are set is
       PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
       duplicate names.

       If you want to get full details of all captured substrings for a given
       name, you must use the pcre2_substring_nametable_scan() function. The
       first argument is the compiled pattern, and the second is the name.
       If the third and fourth arguments are NULL, the function returns a
       group number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING
       otherwise.

       When the third and fourth arguments are not NULL, they must be pointers
       to variables that are updated by the function. After it has run, they
       point to the first and last entries in the name-to-number table for the
       given name, and the function returns the length of each entry in code
       units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
       no entries for the given name.

       The format of the name table is described above in the section entitled
       Information about a pattern. Given all the relevant entries for the
       name, you can extract each of their numbers, and hence the captured
       data.


FINDING ALL POSSIBLE MATCHES AT ONE POSITION

       The traditional matching function uses a similar algorithm to Perl,
       which stops when it finds the first match at a given point in the
       subject. If you want to find all possible matches, or the longest
       possible match at a given position, consider using the alternative
       matching function (see below) instead. If you cannot use the
       alternative function, you can kludge it up by making use of the callout
       facility, which is described in the pcre2callout documentation.

       What you have to do is to insert a callout right at the end of the
       pattern. When your callout function is called, extract and save the
       current matched substring. Then return 1, which forces pcre2_match() to
       backtrack and try other alternatives. Ultimately, when it runs out of
       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.


MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       The function pcre2_dfa_match() is called to match a subject string
       against a compiled pattern, using a matching algorithm that scans the
       subject string just once, and does not backtrack. This has different
       characteristics to the normal algorithm, and is not compatible with
       Perl. Some of the features of PCRE2 patterns are not supported.
       Nevertheless, there are times when this kind of matching can be useful.
       For a discussion of the two matching algorithms, and a list of features
       that pcre2_dfa_match() does not support, see the pcre2matching
       documentation.

       The arguments for the pcre2_dfa_match() function are the same as for
       pcre2_match(), plus two extras. The ovector within the match data block
       is used in a different way, and this is described below. The other
       common arguments are used in the same way as for pcre2_match(), so
       their description is not repeated here.

       The two additional arguments provide workspace for the function. The
       workspace vector should contain at least 20 elements. It is used for
       keeping track of multiple paths through the pattern tree. More
       workspace is needed for patterns and subjects where there are a lot of
       potential matches.

       Here is an example of a simple call to pcre2_dfa_match():

         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,             /* result of pcre2_compile() */
           (PCRE2_SPTR)"some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL,           /* a match context; NULL means use defaults */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */

   Option bits for pcre2_dfa_match()

       The unused bits of the options argument for pcre2_dfa_match() must be
       zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
       PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
       PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT,
       PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of
       these are exactly the same as for pcre2_match(), so their description
       is not repeated here.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These have the same general effect as they do for pcre2_match(), but
       the details are slightly different. When PCRE2_PARTIAL_HARD is set for
       pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
       subject is reached and there is still at least one matching possibility
       that requires additional characters. This happens even if some complete
       matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
       return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
       if the end of the subject is reached, there have been no complete
       matches, but there is still at least one matching possibility. The
       portion of the string that was inspected when the longest partial match
       was found is set as the first matching string in both cases.
       There is a more detailed discussion of partial and multi-segment
       matching, with examples, in the pcre2partial documentation.

         PCRE2_DFA_SHORTEST

       Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
       stop as soon as it has found one match. Because of the way the
       alternative algorithm works, this is necessarily the shortest possible
       match at the first possible matching point in the subject string.

         PCRE2_DFA_RESTART

       When pcre2_dfa_match() returns a partial match, it is possible to call
       it again, with additional subject characters, and have it continue with
       the same match. The PCRE2_DFA_RESTART option requests this action; when
       it is set, the workspace and wscount arguments must reference the same
       vector as before because data about the match so far is left in them
       after a partial match. There is more discussion of this facility in the
       pcre2partial documentation.

   Successful returns from pcre2_dfa_match()

       When pcre2_dfa_match() succeeds, it may have matched more than one
       substring in the subject. Note, however, that all the matches from one
       run of the function start at the same point in the subject. The shorter
       matches are all initial substrings of the longer matches. For example,
       if the pattern

         <.*>

       is matched against the string

         This is <something> <something else> <something further> no more

       the three matched strings are

         <something> <something else> <something further>
         <something> <something else>
         <something>

       On success, the yield of the function is a number greater than zero,
       which is the number of matched substrings.
       The offsets of the substrings are returned in the ovector, and can be
       extracted by number in the same way as for pcre2_match(), but the
       numbers bear no relation to any capturing groups that may exist in the
       pattern, because DFA matching does not support group capture.

       Calls to the convenience functions that extract substrings by name
       return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used
       after a DFA match. The convenience functions that extract substrings by
       number never return PCRE2_ERROR_NOSUBSTRING, and the meanings of some
       other errors are slightly different:

         PCRE2_ERROR_UNAVAILABLE

       The ovector is not big enough to include a slot for the given substring
       number.

         PCRE2_ERROR_UNSET

       There is a slot in the ovector for this substring, but there were
       insufficient matches to fill it.

       The matched strings are stored in the ovector in reverse order of
       length; that is, the longest matching string is first. If there were
       too many matches to fit into the ovector, the yield of the function is
       zero, and the vector is filled with the longest matches.

       NOTE: PCRE2's "auto-possessification" optimization usually applies to
       character repeats at the end of a pattern (as well as internally). For
       example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
       matching, this means that only one possible match is found. If you
       really do want multiple matches in such cases, either use an ungreedy
       repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
       compiling.

   Error returns from pcre2_dfa_match()

       The pcre2_dfa_match() function returns a negative number when it fails.
       Many of the errors are the same as for pcre2_match(), as described
       above.
       There are in addition the following errors that are specific to
       pcre2_dfa_match():

         PCRE2_ERROR_DFA_UITEM

       This return is given if pcre2_dfa_match() encounters an item in the
       pattern that it does not support, for instance, the use of \C in a UTF
       mode or a back reference.

         PCRE2_ERROR_DFA_UCOND

       This return is given if pcre2_dfa_match() encounters a condition item
       that uses a back reference for the condition, or a test for recursion
       in a specific group. These are not supported.

         PCRE2_ERROR_DFA_WSSIZE

       This return is given if pcre2_dfa_match() runs out of space in the
       workspace vector.

         PCRE2_ERROR_DFA_RECURSE

       When a recursive subpattern is processed, the matching function calls
       itself recursively, using private memory for the ovector and workspace.
       This error is given if the internal ovector is not large enough. This
       should be extremely rare, as a vector of size 1000 is used.

         PCRE2_ERROR_DFA_BADRESTART

       When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
       some plausibility checks are made on the contents of the workspace,
       which should contain data about the previous partial match. If any of
       these checks fail, this error is given.


SEE ALSO

       pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
       pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2stack(3),
       pcre2unicode(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 17 June 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------


PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

BUILDING PCRE2

       PCRE2 is distributed with a configure script that can be used to build
       the library in Unix-like environments using the applications known as
       Autotools. Also in the distribution are files to support building using
       CMake instead of configure. The text file README contains general
       information about building with Autotools (some of which is repeated
       below), and also has some comments about building on various operating
       systems. There is a lot more information about building PCRE2 without
       using Autotools (including information about using CMake and building
       "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
       consult this file as well as the README file if you are building in a
       non-Unix-like environment.


PCRE2 BUILD-TIME OPTIONS

       The rest of this document describes the optional features of PCRE2 that
       can be selected when the library is compiled. It assumes use of the
       configure script, where the optional features are selected or
       deselected by providing options to configure before running the make
       command. However, the same options can be selected in both Unix-like
       and non-Unix-like environments if you are using CMake instead of
       configure to build PCRE2.

       If you are not using Autotools or CMake, option selection can be done
       by editing the config.h file, or by passing parameter settings to the
       compiler, as described in NON-AUTOTOOLS-BUILD.

       The complete list of options for configure (which includes the standard
       ones such as the selection of the installation directory) can be
       obtained by running

         ./configure --help

       The following sections include descriptions of options whose names
       begin with --enable or --disable. These settings specify changes to the
       defaults for the configure command. Because of the way that configure
       works, --enable and --disable always come in pairs, so the
       complementary option always exists as well, but as it specifies the
       default, it is not described.


BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       By default, a library called libpcre2-8 is built, containing functions
       that take string arguments contained in vectors of bytes, interpreted
       either as single-byte characters, or UTF-8 strings. You can also build
       two other libraries, called libpcre2-16 and libpcre2-32, which process
       strings that are contained in vectors of 16-bit and 32-bit code units,
       respectively. These can be interpreted either as single-unit characters
       or UTF-16/UTF-32 strings. To build these additional libraries, add one
       or both of the following to the configure command:

         --enable-pcre2-16
         --enable-pcre2-32

       If you do not want the 8-bit library, add

         --disable-pcre2-8

       as well. At least one of the three libraries must be built. Note that
       the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
       an 8-bit program. Neither of these is built if you select only the
       16-bit or 32-bit libraries.


BUILDING SHARED AND STATIC LIBRARIES

       The Autotools PCRE2 building process uses libtool to build both shared
       and static libraries by default. You can suppress an unwanted library
       by adding one of

         --disable-shared
         --disable-static

       to the configure command.


UNICODE AND UTF SUPPORT

       By default, PCRE2 is built with support for Unicode and UTF character
       strings. To build it without Unicode support, add

         --disable-unicode

       to the configure command. This setting applies to all three libraries.
       It is not possible to build one library with Unicode support, and
       another without, in the same configuration.

       Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
       UTF-16 or UTF-32. To do that, applications that use the library can set
       the PCRE2_UTF option when they call pcre2_compile() to compile a
       pattern. Alternatively, patterns may be started with (*UTF) unless the
       application has locked this out by setting PCRE2_NEVER_UTF.

       UTF support allows the libraries to process character code points up to
       0x10ffff in the strings that they handle. It also provides support for
       accessing the Unicode properties of such characters, using pattern
       escapes such as \P, \p, and \X. Only the general category properties
       such as Lu and Nd are supported. Details are given in the pcre2pattern
       documentation.

       Pattern escapes such as \d and \w do not by default make use of Unicode
       properties. The application can request that they do by setting the
       PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
       pattern may also request this by starting with (*UCP).


DISABLING THE USE OF \C

       The \C escape sequence, which matches a single code unit, even in a UTF
       mode, can cause unpredictable behaviour because it may leave the
       current matching point in the middle of a multi-code-unit character.
       The application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C
       option when calling pcre2_compile(). There is also a build-time option

         --enable-never-backslash-C

       (note the upper case C) which locks out the use of \C entirely.


JUST-IN-TIME COMPILER SUPPORT

       Just-in-time compiler support is included in the build by specifying

         --enable-jit

       This support is available only for certain hardware architectures. If
       this option is set for an unsupported architecture, a building error
       occurs. See the pcre2jit documentation for a discussion of JIT usage.
       When JIT support is enabled, pcre2grep automatically makes use of it,
       unless you add

         --disable-pcre2grep-jit

       to the "configure" command.


NEWLINE RECOGNITION

       By default, PCRE2 interprets the linefeed (LF) character as indicating
       the end of a line. This is the normal newline character on Unix-like
       systems. You can compile PCRE2 to use carriage return (CR) instead, by
       adding

         --enable-newline-is-cr

       to the configure command. There is also an --enable-newline-is-lf
       option, which explicitly specifies linefeed as the newline character.

       Alternatively, you can specify that line endings are to be indicated by
       the two-character sequence CRLF (CR immediately followed by LF). If you
       want this, add

         --enable-newline-is-crlf

       to the configure command. There is a fourth option, specified by

         --enable-newline-is-anycrlf

       which causes PCRE2 to recognize any of the three sequences CR, LF, or
       CRLF as indicating a line ending. Finally, a fifth option, specified by

         --enable-newline-is-any

       causes PCRE2 to recognize any Unicode newline sequence. The Unicode
       newline sequences are the three just mentioned, plus the single
       characters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next
       line, U+0085), LS (line separator, U+2028), and PS (paragraph
       separator, U+2029).

       Whatever default line ending convention is selected when PCRE2 is built
       can be overridden by applications that use the library.
       At build time it is conventional to use the standard for your operating
       system.


WHAT \R MATCHES

       By default, the sequence \R in a pattern matches any Unicode newline
       sequence, independently of what has been selected as the line ending
       sequence. If you specify

         --enable-bsr-anycrlf

       the default is changed so that \R matches only CR, LF, or CRLF.
       Whatever is selected when PCRE2 is built can be overridden by
       applications that use the library.


HANDLING VERY LARGE PATTERNS

       Within a compiled pattern, offset values are used to point from one
       part to another (for example, from an opening parenthesis to an
       alternation metacharacter). By default, in the 8-bit and 16-bit
       libraries, two-byte values are used for these offsets, leading to a
       maximum size for a compiled pattern of around 64K code units. This is
       sufficient to handle all but the most gigantic patterns. Nevertheless,
       some people do want to process truly enormous patterns, so it is
       possible to compile PCRE2 to use three-byte or four-byte offsets by
       adding a setting such as

         --with-link-size=3

       to the configure command. The value given must be 2, 3, or 4. For the
       16-bit library, a value of 3 is rounded up to 4. In these libraries,
       using longer offsets slows down the operation of PCRE2 because it has
       to load additional data when handling them. For the 32-bit library the
       value is always 4 and cannot be overridden; the value of
       --with-link-size is ignored.


AVOIDING EXCESSIVE STACK USAGE

       When matching with the pcre2_match() function, PCRE2 implements
       backtracking by making recursive calls to an internal function called
       match(). In environments where the size of the stack is limited, this
       can severely limit PCRE2's operation.
       (The Unix environment does not
       usually suffer from this problem, but it may sometimes be necessary to
       increase the maximum stack size. There is a discussion in the
       pcre2stack documentation.) An alternative approach to recursion that
       uses memory from the heap to remember data, instead of using recursive
       function calls, has been implemented to work round the problem of lim-
       ited stack size. If you want to build a version of PCRE2 that works
       this way, add

         --disable-stack-for-recursion

       to the configure command. By default, the system functions malloc() and
       free() are called to manage the heap memory that is required, but cus-
       tom memory management functions can be called instead. PCRE2 runs
       noticeably more slowly when built in this way. This option affects only
       the pcre2_match() function; it is not relevant for pcre2_dfa_match().


LIMITING PCRE2 RESOURCE USAGE

       Internally, PCRE2 has a function called match(), which it calls repeat-
       edly (sometimes recursively) when matching a pattern with the
       pcre2_match() function. By controlling the maximum number of times this
       function may be called during a single matching operation, a limit can
       be placed on the resources used by a single call to pcre2_match(). The
       limit can be changed at run time, as described in the pcre2api documen-
       tation. The default is 10 million, but this can be changed by adding a
       setting such as

         --with-match-limit=500000

       to the configure command. This setting has no effect on the
       pcre2_dfa_match() matching function.

       In some environments it is desirable to limit the depth of recursive
       calls of match() more strictly than the total number of calls, in order
       to restrict the maximum amount of stack (or heap, if --disable-stack-
       for-recursion is specified) that is used.
       A second limit controls this; it defaults to the value that is set for
       --with-match-limit, which imposes no additional constraints. However,
       you can set a lower limit by adding, for example,

         --with-match-limit-recursion=10000

       to the configure command. This value can also be overridden at run
       time.


CREATING CHARACTER TABLES AT BUILD TIME

       PCRE2 uses fixed tables for processing characters whose code points are
       less than 256. By default, PCRE2 is built with a set of tables that are
       distributed in the file src/pcre2_chartables.c.dist. These tables are
       for ASCII codes only. If you add

         --enable-rebuild-chartables

       to the configure command, the distributed tables are no longer used.
       Instead, a program called dftables is compiled and run. This outputs
       the source for a new set of tables, created in the default locale of
       your C run-time system. (This method of replacing the tables does not
       work if you are cross compiling, because dftables is run on the local
       host. If you need to create alternative tables when cross compiling,
       you will have to do so "by hand".)


USING EBCDIC CODE

       PCRE2 assumes by default that it will run in an environment where the
       character code is ASCII or Unicode, which is a superset of ASCII. This
       is the case for most computer operating systems. PCRE2 can, however, be
       compiled to run in an 8-bit EBCDIC environment by adding

         --enable-ebcdic --disable-unicode

       to the configure command. This setting implies --enable-rebuild-charta-
       bles. You should only use it if you know that you are in an EBCDIC
       environment (for example, an IBM mainframe operating system).

       It is not possible to support both EBCDIC and UTF-8 codes in the same
       version of the library. Consequently, --enable-unicode and --enable-
       ebcdic are mutually exclusive.

       The EBCDIC character that corresponds to an ASCII LF is assumed to have
       the value 0x15 by default. However, in some EBCDIC environments, 0x25
       is used. In such an environment you should use

         --enable-ebcdic-nl25

       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
       has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
       acter (which, in Unicode, is 0x85).

       The options that select newline behaviour, such as --enable-newline-is-
       cr, and equivalent run-time options, refer to these character values in
       an EBCDIC environment.


PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS

       By default, on non-Windows systems, pcre2grep supports the use of call-
       outs with string arguments within the patterns it is matching, in order
       to run external scripts. For details, see the pcre2grep documentation.
       This support can be disabled by adding --disable-pcre2grep-callout to
       the configure command.


PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT

       By default, pcre2grep reads all files as plain text. You can build it
       so that it recognizes files whose names end in .gz or .bz2, and reads
       them with libz or libbz2, respectively, by adding one or both of

         --enable-pcre2grep-libz
         --enable-pcre2grep-libbz2

       to the configure command. These options naturally require that the rel-
       evant libraries are installed on your system. Configuration will fail
       if they are not.


PCRE2GREP BUFFER SIZE

       pcre2grep uses an internal buffer to hold a "window" on the file it is
       scanning, in order to be able to output "before" and "after" lines when
       it finds a match. The size of the buffer is controlled by a parameter
       whose default value is 20K.
       The buffer itself is three times this size,
       but because of the way it is used for holding "before" lines, the long-
       est line that is guaranteed to be processable is the parameter size.
       You can change the default parameter value by adding, for example,

         --with-pcre2grep-bufsize=50K

       to the configure command. The caller of pcre2grep can override this
       value by using --buffer-size on the command line.


PCRE2TEST OPTION FOR LIBREADLINE SUPPORT

       If you add one of

         --enable-pcre2test-libreadline
         --enable-pcre2test-libedit

       to the configure command, pcre2test is linked with the libreadline
       or libedit library, respectively, and when its input is from a terminal,
       it reads it using the readline() function. This provides line-editing
       and history facilities. Note that libreadline is GPL-licensed, so if
       you distribute a binary of pcre2test linked in this way, there may be
       licensing issues. These can be avoided by linking instead with libedit,
       which has a BSD licence.

       Setting --enable-pcre2test-libreadline causes the -lreadline option to
       be added to the pcre2test build. In many operating environments with a
       system-installed readline library this is sufficient. However, in some
       environments (e.g. if an unmodified distribution version of readline is
       in use), some extra configuration may be necessary. The INSTALL file
       for libreadline says this:

         "Readline uses the termcap functions, but does not link with
         the termcap or curses library itself, allowing applications
         which link with readline the to choose an appropriate library."

       If your environment has not been set up so that an appropriate library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


INCLUDING DEBUGGING CODE

       If you add

         --enable-debug

       to the configure command, additional debugging code is included in the
       build. This feature is intended for use by the PCRE2 maintainers.


DEBUGGING WITH VALGRIND SUPPORT

       If you add

         --enable-valgrind

       to the configure command, PCRE2 will use valgrind annotations to mark
       certain memory regions as unaddressable. This allows it to detect
       invalid memory accesses, and is mostly useful for debugging PCRE2
       itself.


CODE COVERAGE REPORTING

       If your C compiler is gcc, you can build a version of PCRE2 that can
       generate a code coverage report for its test suite. To enable this, you
       must install lcov version 1.6 or above. Then specify

         --enable-coverage

       to the configure command and build PCRE2 in the usual way.

       Note that using ccache (a caching C compiler) is incompatible with code
       coverage reporting. If you have configured ccache to run automatically
       on your system, you must set the environment variable

         CCACHE_DISABLE=1

       before running make to build PCRE2, so that ccache is not used.

       When --enable-coverage is used, the following additional targets are
       added to the Makefile:

         make coverage

       This creates a fresh coverage report for the PCRE2 test suite. It is
       equivalent to running "make coverage-reset", "make coverage-baseline",
       "make check", and then "make coverage-report".

         make coverage-reset

       This zeroes the coverage counters, but does nothing else.

         make coverage-baseline

       This captures baseline coverage information.

         make coverage-report

       This creates the coverage report.

         make coverage-clean-report

       This removes the generated coverage report without cleaning the cover-
       age data itself.

         make coverage-clean-data

       This removes the captured coverage data without removing the coverage
       files created at compile time (*.gcno).

         make coverage-clean

       This cleans all coverage data including the generated coverage report.
       For more information about code coverage, see the gcov and lcov docu-
       mentation.


SEE ALSO

       pcre2api(3), pcre2-config(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 01 April 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------


PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2.h>

       int (*pcre2_callout)(pcre2_callout_block *, void *);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);


DESCRIPTION

       PCRE2 provides a feature called "callout", which is a means of tempo-
       rarily passing control to the caller of PCRE2 in the middle of pattern
       matching. The caller of PCRE2 provides an external function by putting
       its entry point in a match context (see pcre2_set_callout() in the
       pcre2api documentation).

       Within a regular expression, (?C<arg>) indicates a point at which the
       external function is to be called. Different callout points can be
       identified by putting a number less than 256 after the letter C. The
       default value is zero. Alternatively, the argument may be a delimited
       string. The starting delimiter must be one of ` ' " ^ % # $ { and the
       ending delimiter is the same as the start, except for {, where the end-
       ing delimiter is }.
       If the ending delimiter is needed within the
       string, it must be doubled. For example, this pattern has two callout
       points:

         (?C1)abc(?C"some ""arbitrary"" text")def

       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE2 automatically inserts callouts, all with number 255, before each
       item in the pattern. For example, if PCRE2_AUTO_CALLOUT is used with
       the pattern

         A(\d{2}|--)

       it is processed as if it were

       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice that there is a callout before and after each parenthesis and
       alternation bar. If the pattern contains a conditional group whose con-
       dition is an assertion, an automatic callout is inserted immediately
       before the condition. Such a callout may also be inserted explicitly,
       for example:

         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)

       This applies only to assertion conditions (because they are themselves
       independent groups).

       Callouts can be useful for tracking the progress of pattern matching.
       The pcre2test program has a pattern qualifier (/auto_callout) that sets
       automatic callouts. When any callouts are present, the output from
       pcre2test indicates how the pattern is being matched. This is useful
       information when you are trying to optimize the performance of a par-
       ticular pattern.


MISSING CALLOUTS

       You should be aware that, because of optimizations in the way PCRE2
       compiles and matches patterns, callouts sometimes do not happen exactly
       as you might expect.

   Auto-possessification

       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
       that what follows cannot be part of the repeat. For example, a+[bc] is
       compiled as if it were a++[bc].
       The pcre2test output when this pattern
       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
       to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This indicates that when matching [bc] fails, there is no backtracking
       into a+ and therefore the callouts that would be taken for the back-
       tracks do not occur. You can disable the auto-possessify feature by
       passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
       tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
       any character, the pattern is automatically anchored. If PCRE2_DOTALL
       is not set, a match can start only after an internal newline or at the
       beginning of the subject, and pcre2_compile() remembers this. This
       optimization is disabled, however, if .* is in an atomic group or if
       there is a back reference to the capturing group in which it appears.
       It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
       ever, the presence of callouts does not affect it.

       For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This shows that all match attempts start at the beginning of the sub-
       ject. In other words, the pattern is anchored.
       You can disable this optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to
       pcre2_compile(), or starting the pattern with (*NO_DOTSTAR_ANCHOR). In
       this case, the output changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This shows more match attempts, starting at the second subject charac-
       ter. Another optimization, described in the next section, means that
       there is no subsequent attempt to match with an empty subject.

       If a pattern has more than one top-level branch, automatic anchoring
       occurs if all branches are anchorable.

   Other optimizations

       Other optimizations that provide fast "no match" results also affect
       callouts. For example, if the pattern is

         ab(?C4)cd

       PCRE2 knows that any matching string must contain the letter "d". If
       the subject string is "abyz", the lack of "d" means that matching
       doesn't ever start, and the callout is never reached. However, with
       "abyd", though the result is still no match, the callout is obeyed.

       PCRE2 also knows the minimum length of a matching string, and will
       immediately give a "no match" return without actually running a match
       if the subject is not long enough, or, for unanchored patterns, if it
       has been scanned far enough.

       You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
       MIZE option to pcre2_compile(), or by starting the pattern with
       (*NO_START_OPT). This slows down the matching process, but does ensure
       that callouts such as the example above are obeyed.


THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout point, if an external
       function is set in the match context, it is called. This applies to
       both normal and DFA matching. The first argument to the callout func-
       tion is a pointer to a pcre2_callout block.
       The second argument is the
       void * callout data that was supplied when the callout was set up by
       calling pcre2_set_callout() (see the pcre2api documentation). The call-
       out block structure contains the following fields:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;
         PCRE2_SIZE    callout_string_offset;
         PCRE2_SIZE    callout_string_length;
         PCRE2_SPTR    callout_string;

       The version field contains the version number of the block format. The
       current version is 1; the three callout string fields were added for
       this version. If you are writing an application that might use an ear-
       lier release of PCRE2, you should check the version number before
       accessing any of these fields. The version number will increase in
       future if more fields are added, but the intention is never to remove
       any of the existing fields.

   Fields for numerical callouts

       For a numerical callout, callout_string is NULL, and callout_number
       contains the number of the callout, in the range 0-255. This is the
       number that follows (?C for manual callouts; it is 255 for automati-
       cally generated callouts.

   Fields for string callouts

       For callouts with string arguments, callout_number is always zero, and
       callout_string points to the string that is contained within the com-
       piled pattern. Its length is given by callout_string_length. Duplicated
       ending delimiters that were present in the original pattern string have
       been turned into single characters, but there is no other processing of
       the callout string argument.
       An additional code unit containing binary
       zero is present after the string, but is not included in the length.
       The delimiter that was used to start the string is also stored within
       the pattern, immediately before the string itself. You can access this
       delimiter as callout_string[-1] if you need it.

       The callout_string_offset field is the code unit offset to the start of
       the callout argument string within the original pattern string. This is
       provided for the benefit of applications such as script languages that
       might need to report errors in the callout string within the pattern.

   Fields for all callouts

       The remaining fields in the callout block are the same for both kinds
       of callout.

       The offset_vector field is a pointer to the vector of capturing offsets
       (the "ovector") that was passed to the matching function in the match
       data block. When pcre2_match() is used, the contents can be inspected
       in order to extract substrings that have been matched so far, in the
       same way as for extracting substrings after a match has completed. For
       the DFA matching function, this field is not useful.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The start_match field normally contains the offset within the subject
       at which the current match attempt started. However, if the escape
       sequence \K has been encountered, this value is changed to reflect the
       modified starting point. If the pattern is not anchored, the callout
       function may be called several times from the same point in the pattern
       for different starting points in the subject.

       The current_position field contains the offset within the subject of
       the current match pointer.

       When pcre2_match() is used, the capture_top field contains one more
       than the number of the highest numbered captured substring so far. If
       no substrings have been captured, the value of capture_top is one. This
       is always the case when the DFA functions are used, because they do not
       support captured substrings.

       The capture_last field contains the number of the most recently cap-
       tured substring. However, when a recursion exits, the value reverts to
       what it was outside the recursion, as do the values of all captured
       substrings. If no substrings have been captured, the value of cap-
       ture_last is 0. This is always the case for the DFA matching functions.

       The pattern_position field contains the offset in the pattern string to
       the next item to be matched.

       The next_item_length field contains the length of the next item to be
       matched in the pattern string. When the callout immediately precedes an
       alternation bar, a closing parenthesis, or the end of the pattern, the
       length is zero. When the callout precedes an opening parenthesis, the
       length is that of the entire subpattern.

       The pattern_position and next_item_length fields are intended to help
       in distinguishing between different automatic callouts, which all have
       the same callout number. However, they are set for all callouts, and
       are used by pcre2test to show the next item to be matched when display-
       ing callout information.

       In callouts from pcre2_match() the mark field contains a pointer to the
       zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
       (*THEN) item in the match, or NULL if no such items have been passed.
       Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
       previous (*MARK). In callouts from the DFA matching function this field
       always contains NULL.
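
       The delimiter rule for string callouts (the ending delimiter must be
       doubled inside the argument, and PCRE2 turns the doubled delimiter
       back into a single character in callout_string) can be sketched as a
       small helper for an application that generates patterns. This is only
       an illustration under stated assumptions: make_string_callout() is a
       hypothetical name, it is not part of the PCRE2 API, and it handles
       8-bit strings only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of PCRE2: builds the pattern text for a
   string callout, for example (?C"some ""arbitrary"" text"), by doubling
   every occurrence of the ending delimiter inside the argument.
   Returns the number of characters written (excluding the terminating
   zero), or 0 if delim is not a permitted starting delimiter or the
   output buffer is too small for the worst case. */
static size_t make_string_callout(const char *arg, char delim,
                                  char *out, size_t outsize)
{
    assert(arg != NULL && out != NULL);

    /* Permitted starting delimiters: ` ' " ^ % # $ { */
    if (delim == '\0' || strchr("`'\"^%#${", delim) == NULL)
        return 0;

    /* The ending delimiter equals the start, except that { ends with } */
    char end = (delim == '{') ? '}' : delim;

    /* Worst case: "(?C" + delim + every char doubled + end + ")" + zero */
    if (outsize < 2 * strlen(arg) + 7)
        return 0;

    size_t n = 0;
    out[n++] = '(';
    out[n++] = '?';
    out[n++] = 'C';
    out[n++] = delim;
    for (const char *p = arg; *p != '\0'; p++) {
        out[n++] = *p;
        if (*p == end)
            out[n++] = end;        /* double the ending delimiter */
    }
    out[n++] = end;
    out[n++] = ')';
    out[n] = '\0';
    return n;
}
```

       With the argument text some "arbitrary" text and the delimiter ", the
       helper produces the callout item used in the example pattern in the
       DESCRIPTION section. The result can be embedded in a pattern string
       before it is passed to pcre2_compile(); the callout function then
       sees the argument with single delimiters in callout_string.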


RETURN VALUES FROM CALLOUTS

       The external callout function returns an integer to PCRE2. If the value
       is zero, matching proceeds as normal. If the value is greater than
       zero, matching fails at the current point, but the testing of other
       matching possibilities goes ahead, just as if a lookahead assertion had
       failed. If the value is less than zero, the match is abandoned, and the
       matching function returns the negative value.

       Negative values should normally be chosen from the set of
       PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
       standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
       reserved for use by callout functions; it will never be used by PCRE2
       itself.


CALLOUT ENUMERATION

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout enumer-
       ation block, and its second argument is the user_data value that was
       passed to pcre2_callout_enumerate().
       The data block contains the following fields:

         version                Block version number
         pattern_position       Offset to next item in pattern
         next_item_length       Length of next item in pattern
         callout_number         Number for numbered callouts
         callout_string_offset  Offset to string within pattern
         callout_string_length  Length of callout string
         callout_string         Points to callout string or is NULL

       The version number is currently 0. It will increase if new fields are
       ever added to the block. The remaining fields are the same as their
       namesakes in the pcre2_callout block that is used for callouts during
       matching, as described above.

       Note that the value of pattern_position is unique for each callout.
       However, if a callout occurs inside a group that is quantified with a
       non-zero minimum or a fixed maximum, the group is replicated inside the
       compiled pattern. For example, a pattern such as /(a){2}/ is compiled
       as if it were /(a)(a)/. This means that the callout will be enumerated
       more than once, but with the same value for pattern_position in each
       case.

       The callback function should normally return zero. If it returns a non-
       zero value, scanning the pattern stops, and that value is returned from
       pcre2_callout_enumerate().


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 23 March 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

DIFFERENCES BETWEEN PCRE2 AND PERL

       This document describes the differences in the ways that PCRE2 and Perl
       handle regular expressions.
       The differences described here are with respect to Perl versions 5.10
       and above.

       1. PCRE2 has only a subset of Perl's Unicode support. Details of what
       it does have are given in the pcre2unicode page.

       2. PCRE2 allows repeat quantifiers only on parenthesized assertions,
       but they do not mean what you might think. For example, (?!a){3} does
       not assert that the next three characters are not "a". It just asserts
       that the next character is not "a" three times (in principle: PCRE2
       optimizes this to run the assertion just once). Perl allows repeat
       quantifiers on other assertions such as \b, but these do not seem to
       have any use.

       3. Capturing subpatterns that occur inside negative lookahead asser-
       tions are counted, but their entries in the offsets vector are never
       set. Perl sometimes (but not always) sets its numerical variables from
       inside negative assertions.

       4. The following Perl escape sequences are not supported: \l, \u, \L,
       \U, and \N when followed by a character name or Unicode value. (\N on
       its own, matching a non-newline character, is supported.) In fact these
       are implemented by Perl's general string-handling and are not part of
       its pattern matching engine. If any of these are encountered by PCRE2,
       an error is generated by default. However, if the PCRE2_ALT_BSUX option
       is set, \U and \u are interpreted as ECMAScript interprets them.

       5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
       is built with Unicode support. The properties that can be tested with
       \p and \P are limited to the general category properties such as Lu and
       Nd, script names such as Greek or Han, and the derived properties Any
       and L&.
       PCRE2 does support the Cs (surrogate) property, which Perl does
       not; the Perl documentation says "Because Perl hides the need for the
       user to understand the internal representation of Unicode characters,
       there is no need to implement the somewhat messy concept of surro-
       gates."

       6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
       acters in between are treated as literals. This is slightly different
       from Perl in that $ and @ are also handled as literals inside the
       quotes. In Perl, they cause variable interpolation (but of course PCRE2
       does not have variables). Note the following examples:

           Pattern            PCRE2 matches     Perl matches

           \Qabc$xyz\E        abc$xyz           abc followed by the
                                                  contents of $xyz
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes.

       7. Fairly obviously, PCRE2 does not support the (?{code}) and
       (??{code}) constructions. However, there is support for recursive pat-
       terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
       the PCRE2 "callout" feature allows an external function to be called
       during pattern matching. See the pcre2callout documentation for
       details.

       8. Subroutine calls (whether recursive or not) are treated as atomic
       groups. Atomic recursion is like Python, but unlike Perl. Captured
       values that are set outside a subroutine call can be referenced from
       inside in PCRE2, but not in Perl. There is a discussion that explains
       these differences in more detail in the section on recursion differ-
       ences from Perl in the pcre2pattern page.

       9. If any of the backtracking control verbs are used in a subpattern
       that is called as a subroutine (whether or not recursively), their
       effect is confined to that subpattern; it does not extend to the sur-
       rounding pattern.
       This is not always the case in Perl. In particular,
       if (*THEN) is present in a group that is called as a subroutine, its
       action is limited to that group, even if the group does not contain any
       | characters. Note that such subpatterns are processed as anchored at
       the point where they are tested.

       10. If a pattern contains more than one backtracking control verb, the
       first one that is backtracked onto acts. For example, in the pattern
       A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
       it is the same as PCRE2, but there are examples where it differs.

       11. Most backtracking verbs in assertions have their normal actions.
       They are not confined to the assertion.

       12. There are some differences that are concerned with the settings of
       captured strings when part of a pattern is repeated. For example,
       matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
       unset, but in PCRE2 it is set to "b".

       13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
       pattern names is not as general as Perl's. This is a consequence of the
       fact that PCRE2 works internally just with numbers, using an external
       table to translate between numbers and names. In particular, a pattern
       such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
       the same number but different names, is not supported, and causes an
       error at compile time. If it were allowed, it would not be possible to
       distinguish which parentheses matched, because both names map to cap-
       turing subpattern number 1. To avoid this confusing situation, an error
       is given at compile time.

       14. Perl recognizes comments in some places that PCRE2 does not, for
       example, between the ( and ? at the start of a subpattern.
If the /x 4129 modifier is set, Perl allows white space between ( and ? (though cur- 4130 rent Perls warn that this is deprecated) but PCRE2 never does, even if 4131 the PCRE2_EXTENDED option is set. 4132 4133 15. Perl, when in warning mode, gives warnings for character classes 4134 such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter- 4135 als. PCRE2 has no warning features, so it gives an error in these cases 4136 because they are almost certainly user mistakes. 4137 4138 16. In PCRE2, the upper/lower case character properties Lu and Ll are 4139 not affected when case-independent matching is specified. For example, 4140 \p{Lu} always matches an upper case letter. I think Perl has changed in 4141 this respect; in the release at the time of writing (5.16), \p{Lu} and 4142 \p{Ll} match all letters, regardless of case, when case independence is 4143 specified. 4144 4145 17. PCRE2 provides some extensions to the Perl regular expression 4146 facilities. Perl 5.10 includes new features that are not in earlier 4147 versions of Perl, some of which (such as named parentheses) have been 4148 in PCRE2 for some time. This list is with respect to Perl 5.10: 4149 4150 (a) Although lookbehind assertions in PCRE2 must match fixed length 4151 strings, each alternative branch of a lookbehind assertion can match a 4152 different length of string. Perl requires them all to have the same 4153 length. 4154 4155 (b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the 4156 $ meta-character matches only at the very end of the string. 4157 4158 (c) A backslash followed by a letter with no special meaning is 4159 faulted. (Perl can be made to issue a warning.) 4160 4161 (d) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti- 4162 fiers is inverted, that is, by default they are not greedy, but if fol- 4163 lowed by a question mark they are. 

       (e) PCRE2_ANCHORED can be used at matching time to force a pattern to
       be tried only at the first matching position in the subject string.

       (f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
       PCRE2_NOTEMPTY_ATSTART, and PCRE2_NO_AUTO_CAPTURE options have no Perl
       equivalents.

       (g) The \R escape sequence can be restricted to match only CR, LF, or
       CRLF by the PCRE2_BSR_ANYCRLF option.

       (h) The callout facility is PCRE2-specific.

       (i) The partial matching facility is PCRE2-specific.

       (j) The alternative matching function (pcre2_dfa_match()) matches in a
       different way and is not Perl-compatible.

       (k) PCRE2 recognizes some special sequences such as (*CR) at the start
       of a pattern that set overall options that cannot be changed within the
       pattern.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 15 March 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 JUST-IN-TIME COMPILER SUPPORT

       Just-in-time compiling is a heavyweight optimization that can greatly
       speed up pattern matching. However, it comes at the cost of extra pro-
       cessing before the match is performed, so it is of most benefit when
       the same pattern is going to be matched many times. This does not nec-
       essarily mean many calls of a matching function; if the pattern is not
       anchored, matching attempts may take place many times at various posi-
       tions in the subject, even for a single call. Therefore, if the subject
       string is very long, it may still pay to use JIT even for one-off
       matches.
JIT support is available for all of the 8-bit, 16-bit and 4219 32-bit PCRE2 libraries. 4220 4221 JIT support applies only to the traditional Perl-compatible matching 4222 function. It does not apply when the DFA matching function is being 4223 used. The code for this support was written by Zoltan Herczeg. 4224 4225 4226AVAILABILITY OF JIT SUPPORT 4227 4228 JIT support is an optional feature of PCRE2. The "configure" option 4229 --enable-jit (or equivalent CMake option) must be set when PCRE2 is 4230 built if you want to use JIT. The support is limited to the following 4231 hardware platforms: 4232 4233 ARM 32-bit (v5, v7, and Thumb2) 4234 ARM 64-bit 4235 Intel x86 32-bit and 64-bit 4236 MIPS 32-bit and 64-bit 4237 Power PC 32-bit and 64-bit 4238 SPARC 32-bit 4239 4240 If --enable-jit is set on an unsupported platform, compilation fails. 4241 4242 A program can tell if JIT support is available by calling pcre2_con- 4243 fig() with the PCRE2_CONFIG_JIT option. The result is 1 when JIT is 4244 available, and 0 otherwise. However, a simple program does not need to 4245 check this in order to use JIT. The API is implemented in a way that 4246 falls back to the interpretive code if JIT is not available. For pro- 4247 grams that need the best possible performance, there is also a "fast 4248 path" API that is JIT-specific. 4249 4250 4251SIMPLE USE OF JIT 4252 4253 To make use of the JIT support in the simplest way, all you have to do 4254 is to call pcre2_jit_compile() after successfully compiling a pattern 4255 with pcre2_compile(). This function has two arguments: the first is the 4256 compiled pattern pointer that was returned by pcre2_compile(), and the 4257 second is zero or more of the following option bits: PCRE2_JIT_COM- 4258 PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT. 4259 4260 If JIT support is not available, a call to pcre2_jit_compile() does 4261 nothing and returns PCRE2_ERROR_JIT_BADOPTION. 
Otherwise, the compiled 4262 pattern is passed to the JIT compiler, which turns it into machine code 4263 that executes much faster than the normal interpretive code, but yields 4264 exactly the same results. The returned value from pcre2_jit_compile() 4265 is zero on success, or a negative error code. 4266 4267 There is a limit to the size of pattern that JIT supports, imposed by 4268 the size of machine stack that it uses. The exact rules are not docu- 4269 mented because they may change at any time, in particular, when new 4270 optimizations are introduced. If a pattern is too big, a call to 4271 pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY. 4272 4273 PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com- 4274 plete matches. If you want to run partial matches using the PCRE2_PAR- 4275 TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should 4276 set one or both of the other options as well as, or instead of 4277 PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code 4278 for each of the three modes (normal, soft partial, hard partial). When 4279 pcre2_match() is called, the appropriate code is run if it is avail- 4280 able. Otherwise, the pattern is matched using interpretive code. 4281 4282 You can call pcre2_jit_compile() multiple times for the same compiled 4283 pattern. It does nothing if it has previously compiled code for any of 4284 the option bits. For example, you can call it once with PCRE2_JIT_COM- 4285 PLETE and (perhaps later, when you find you need partial matching) 4286 again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it 4287 will ignore PCRE2_JIT_COMPLETE and just compile code for partial match- 4288 ing. If pcre2_jit_compile() is called with no option bits set, it imme- 4289 diately returns zero. This is an alternative way of testing whether JIT 4290 is available. 
4291 4292 At present, it is not possible to free JIT compiled code except when 4293 the entire compiled pattern is freed by calling pcre2_code_free(). 4294 4295 In some circumstances you may need to call additional functions. These 4296 are described in the section entitled "Controlling the JIT stack" 4297 below. 4298 4299 There are some pcre2_match() options that are not supported by JIT, and 4300 there are also some pattern items that JIT cannot handle. Details are 4301 given below. In both cases, matching automatically falls back to the 4302 interpretive code. If you want to know whether JIT was actually used 4303 for a particular match, you should arrange for a JIT callback function 4304 to be set up as described in the section entitled "Controlling the JIT 4305 stack" below, even if you do not need to supply a non-default JIT 4306 stack. Such a callback function is called whenever JIT code is about to 4307 be obeyed. If the match-time options are not right for JIT execution, 4308 the callback function is not obeyed. 4309 4310 If the JIT compiler finds an unsupported item, no JIT data is gener- 4311 ated. You can find out if JIT matching is available after compiling a 4312 pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE 4313 option. A non-zero result means that JIT compilation was successful. A 4314 result of 0 means that JIT support is not available, or the pattern was 4315 not processed by pcre2_jit_compile(), or the JIT compiler was not able 4316 to handle the pattern. 4317 4318 4319UNSUPPORTED OPTIONS AND PATTERN ITEMS 4320 4321 The pcre2_match() options that are supported for JIT matching are 4322 PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, 4323 PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The 4324 PCRE2_ANCHORED option is not supported at match time. 4325 4326 If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the 4327 use of JIT, forcing matching by the interpreter code. 
4328 4329 The only unsupported pattern items are \C (match a single data unit) 4330 when running in a UTF mode, and a callout immediately before an asser- 4331 tion condition in a conditional group. 4332 4333 4334RETURN VALUES FROM JIT MATCHING 4335 4336 When a pattern is matched using JIT matching, the return values are the 4337 same as those given by the interpretive pcre2_match() code, with the 4338 addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means 4339 that the memory used for the JIT stack was insufficient. See "Control- 4340 ling the JIT stack" below for a discussion of JIT stack usage. 4341 4342 The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if 4343 searching a very large pattern tree goes on for too long, as it is in 4344 the same circumstance when JIT is not used, but the details of exactly 4345 what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error 4346 code is never returned when JIT matching is used. 4347 4348 4349CONTROLLING THE JIT STACK 4350 4351 When the compiled JIT code runs, it needs a block of memory to use as a 4352 stack. By default, it uses 32K on the machine stack. However, some 4353 large or complicated patterns need more than this. The error 4354 PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack. 4355 Three functions are provided for managing blocks of memory for use as 4356 JIT stacks. There is further discussion about the use of JIT stacks in 4357 the section entitled "JIT stack FAQ" below. 4358 4359 The pcre2_jit_stack_create() function creates a JIT stack. Its argu- 4360 ments are a starting size, a maximum size, and a general context (for 4361 memory allocation functions, or NULL for standard memory allocation). 4362 It returns a pointer to an opaque structure of type pcre2_jit_stack, or 4363 NULL if there is an error. The pcre2_jit_stack_free() function is used 4364 to free a stack that is no longer needed. 
(For the technically minded: 4365 the address space is allocated by mmap or VirtualAlloc.) 4366 4367 JIT uses far less memory for recursion than the interpretive code, and 4368 a maximum stack size of 512K to 1M should be more than enough for any 4369 pattern. 4370 4371 The pcre2_jit_stack_assign() function specifies which stack JIT code 4372 should use. Its arguments are as follows: 4373 4374 pcre2_match_context *mcontext 4375 pcre2_jit_callback callback 4376 void *data 4377 4378 The first argument is a pointer to a match context. When this is subse- 4379 quently passed to a matching function, its information determines which 4380 JIT stack is used. There are three cases for the values of the other 4381 two options: 4382 4383 (1) If callback is NULL and data is NULL, an internal 32K block 4384 on the machine stack is used. This is the default when a match 4385 context is created. 4386 4387 (2) If callback is NULL and data is not NULL, data must be 4388 a pointer to a valid JIT stack, the result of calling 4389 pcre2_jit_stack_create(). 4390 4391 (3) If callback is not NULL, it must point to a function that is 4392 called with data as an argument at the start of matching, in 4393 order to set up a JIT stack. If the return from the callback 4394 function is NULL, the internal 32K stack is used; otherwise the 4395 return value must be a valid JIT stack, the result of calling 4396 pcre2_jit_stack_create(). 4397 4398 A callback function is obeyed whenever JIT code is about to be run; it 4399 is not obeyed when pcre2_match() is called with options that are incom- 4400 patible for JIT matching. A callback function can therefore be used to 4401 determine whether a match operation was executed by JIT or by the 4402 interpreter. 4403 4404 You may safely use the same JIT stack for more than one pattern (either 4405 by assigning directly or by callback), as long as the patterns are 4406 matched sequentially in the same thread. 
       Currently, the only way to set
       up non-sequential matches in one thread is to use callouts: if a call-
       out function starts another match, that match must use a different JIT
       stack to the one used for currently suspended match(es).

       In a multithreaded application, if you do not specify a JIT stack, or
       if you assign or pass back NULL from a callback, that is thread-safe,
       because each thread has its own machine stack. However, if you assign
       or pass back a non-NULL JIT stack, this must be a different stack for
       each thread so that the application is thread-safe.

       Strictly speaking, even more is allowed. You can assign the same non-
       NULL stack to a match context that is used by any number of patterns,
       as long as they are not used for matching by multiple threads at the
       same time. For example, you could use the same stack in all compiled
       patterns, with a global mutex in the callback to wait until the stack
       is available for use. However, this is an inefficient solution, and not
       recommended.

       This is a suggestion for how a multithreaded program that needs to set
       up non-default JIT stacks might operate:

         During thread initialization
           thread_local_var = pcre2_jit_stack_create(...)

         During thread exit
           pcre2_jit_stack_free(thread_local_var)

         Use a one-line callback function
           return thread_local_var

       All the functions described in this section do nothing if JIT is not
       available.


JIT STACK FAQ

       (1) Why do we need JIT stacks?

       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack on some platforms is diffi-
       cult. For example, the stack chain needs to be updated every time if we
       extend the stack on PowerPC.
Although it is possible, its updating 4450 time overhead decreases performance. So we do the recursion in memory. 4451 4452 (2) Why don't we simply allocate blocks of memory with malloc()? 4453 4454 Modern operating systems have a nice feature: they can reserve an 4455 address space instead of allocating memory. We can safely allocate mem- 4456 ory pages inside this address space, so the stack could grow without 4457 moving memory data (this is important because of pointers). Thus we can 4458 allocate 1M address space, and use only a single memory page (usually 4459 4K) if that is enough. However, we can still grow up to 1M anytime if 4460 needed. 4461 4462 (3) Who "owns" a JIT stack? 4463 4464 The owner of the stack is the user program, not the JIT studied pattern 4465 or anything else. The user program must ensure that if a stack is being 4466 used by pcre2_match(), (that is, it is assigned to a match context that 4467 is passed to the pattern currently running), that stack must not be 4468 used by any other threads (to avoid overwriting the same memory area). 4469 The best practice for multithreaded programs is to allocate a stack for 4470 each thread, and return this stack through the JIT callback function. 4471 4472 (4) When should a JIT stack be freed? 4473 4474 You can free a JIT stack at any time, as long as it will not be used by 4475 pcre2_match() again. When you assign the stack to a match context, only 4476 a pointer is set. There is no reference counting or any other magic. 4477 You can free compiled patterns, contexts, and stacks in any order, any- 4478 time. Just do not call pcre2_match() with a match context pointing to 4479 an already freed stack, as that will cause SEGFAULT. (Also, do not free 4480 a stack currently used by pcre2_match() in another thread). You can 4481 also replace the stack in a context at any time when it is not in use. 4482 You should free the previous stack before assigning a replacement. 

       (5) Should I allocate/free a stack every time before/after calling
       pcre2_match()?

       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve
       this without keeping a list of patterns.

       (6) OK, the stack is for long term memory allocation. But what happens
       if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
       until the stack is freed?

       Especially on embedded systems, it might be a good idea to release mem-
       ory sometimes without freeing the stack. There is no API for this at
       the moment. A function that returns the memory currently allocated
       for a stack, and another that releases memory (shrinking the stack),
       would probably be a good idea if someone needs this.

       (7) This is too much of a headache. Isn't there any better solution for
       JIT stack handling?

       No, thanks to Windows. If POSIX threads were used everywhere, we could
       throw out this complicated API.


FREEING JIT SPECULATIVE MEMORY

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       The JIT executable allocator does not free all memory when it is possi-
       ble to do so. It expects new allocations, and keeps some free memory
       around to improve allocation speed. However, in low memory conditions,
       it might be better to free all possible memory. You can cause this to
       happen by calling pcre2_jit_free_unused_memory(). Its argument is a
       general context, for custom memory management, or NULL for standard
       memory management.


EXAMPLE CODE

       This is a single-threaded example that specifies a JIT stack without
       using a callback. A real program should include error checking after
       all the function calls.

         int rc;
         int errornumber;
         PCRE2_SIZE erroffset;
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;

         /* pattern, subject, and length are assumed to be set up elsewhere */
         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
           &errornumber, &erroffset, NULL);
         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
         mcontext = pcre2_match_context_create(NULL);
         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
         /* Room for 10 offset pairs; NULL means standard memory allocation */
         match_data = pcre2_match_data_create(10, NULL);
         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
         /* Process result */

         pcre2_code_free(re);
         pcre2_match_data_free(match_data);
         pcre2_match_context_free(mcontext);
         pcre2_jit_stack_free(jit_stack);


JIT FAST PATH API

       Because the API described above falls back to interpreted matching when
       JIT is not available, it is convenient for programs that are written
       for general use in many environments. However, calling JIT via
       pcre2_match() does have a performance impact. Programs that are written
       for use where JIT is known to be available, and which need the best
       possible performance, can instead use a "fast path" API to call JIT
       matching directly instead of calling pcre2_match() (obviously only for
       patterns that have been successfully processed by pcre2_jit_compile()).

       The fast path function is called pcre2_jit_match(), and it takes
       exactly the same arguments as pcre2_match(). The return values are also
       the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
       complete) is requested that was not compiled. Unsupported option bits
       (for example, PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT
       option.

       When you call pcre2_match(), as well as testing for invalid options, a
       number of other sanity checks are performed on the arguments.
For exam- 4570 ple, if the subject pointer is NULL, an immediate error is given. Also, 4571 unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for 4572 validity. In the interests of speed, these checks do not happen on the 4573 JIT fast path, and if invalid data is passed, the result is undefined. 4574 4575 Bypassing the sanity checks and the pcre2_match() wrapping can give 4576 speedups of more than 10%. 4577 4578 4579SEE ALSO 4580 4581 pcre2api(3) 4582 4583 4584AUTHOR 4585 4586 Philip Hazel (FAQ by Zoltan Herczeg) 4587 University Computing Service 4588 Cambridge, England. 4589 4590 4591REVISION 4592 4593 Last updated: 05 June 2016 4594 Copyright (c) 1997-2016 University of Cambridge. 4595------------------------------------------------------------------------------ 4596 4597 4598PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3) 4599 4600 4601 4602NAME 4603 PCRE2 - Perl-compatible regular expressions (revised API) 4604 4605SIZE AND OTHER LIMITATIONS 4606 4607 There are some size limitations in PCRE2 but it is hoped that they will 4608 never in practice be relevant. 4609 4610 The maximum size of a compiled pattern is approximately 64K code units 4611 for the 8-bit and 16-bit libraries if PCRE2 is compiled with the 4612 default internal linkage size, which is 2 bytes for these libraries. If 4613 you want to process regular expressions that are truly enormous, you 4614 can compile PCRE2 with an internal linkage size of 3 or 4 (when build- 4615 ing the 16-bit library, 3 is rounded up to 4). See the README file in 4616 the source distribution and the pcre2build documentation for details. 4617 In these cases the limit is substantially larger. However, the speed 4618 of execution is slower. In the 32-bit library, the internal linkage 4619 size is always 4. 4620 4621 The maximum length of a source pattern string is essentially unlimited; 4622 it is the largest number a PCRE2_SIZE variable can hold. 
However, the 4623 program that calls pcre2_compile() can specify a smaller limit. 4624 4625 The maximum length (in code units) of a subject string is one less than 4626 the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an 4627 unsigned integer type, usually defined as size_t. Its maximum value 4628 (that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero- 4629 terminated strings and unset offsets. 4630 4631 Note that when using the traditional matching function, PCRE2 uses 4632 recursion to handle subpatterns and indefinite repetition. This means 4633 that the available stack space may limit the size of a subject string 4634 that can be processed by certain patterns. For a discussion of stack 4635 issues, see the pcre2stack documentation. 4636 4637 All values in repeating quantifiers must be less than 65536. 4638 4639 The maximum length of a lookbehind assertion is 65535 characters. 4640 4641 There is no limit to the number of parenthesized subpatterns, but there 4642 can be no more than 65535 capturing subpatterns. There is, however, a 4643 limit to the depth of nesting of parenthesized subpatterns of all 4644 kinds. This is imposed in order to limit the amount of system stack 4645 used at compile time. The limit can be specified when PCRE2 is built; 4646 the default is 250. 4647 4648 There is a limit to the number of forward references to subsequent sub- 4649 patterns of around 200,000. Repeated forward references with fixed 4650 upper limits, for example, (?2){0,100} when subpattern number 2 is to 4651 the right, are included in the count. There is no limit to the number 4652 of backward references. 4653 4654 The maximum length of name for a named subpattern is 32 code units, and 4655 the maximum number of named subpatterns is 10000. 4656 4657 The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or 4658 (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit and 4659 32-bit libraries. 


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 05 November 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 MATCHING ALGORITHMS

       This document describes the two different algorithms that are available
       in PCRE2 for matching a compiled regular expression against a given
       subject string. The "standard" algorithm is the one provided by the
       pcre2_match() function. This works in the same way as Perl's matching
       function, and provides a Perl-compatible matching operation. The just-
       in-time (JIT) optimization that is described in the pcre2jit documenta-
       tion is compatible with this function.

       An alternative algorithm is provided by the pcre2_dfa_match() function;
       it operates in a different way, and is not Perl-compatible. This alter-
       native has advantages and disadvantages compared with the standard
       algorithm, and these are described below.

       When there is only one possible way in which a given subject string can
       match a pattern, the two algorithms give the same answer. A difference
       arises, however, when there are multiple possibilities. For example, if
       the pattern

         ^<.*>

       is matched against the string

         <something> <something else> <something further>

       there are three possible answers. The standard algorithm finds only one
       of them, whereas the alternative algorithm finds all three.


REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a regular expression can be rep-
       resented as a tree structure.
An unlimited repetition in the pattern 4717 makes the tree of infinite size, but it is still a tree. Matching the 4718 pattern to a given subject string (from a given starting point) can be 4719 thought of as a search of the tree. There are two ways to search a 4720 tree: depth-first and breadth-first, and these correspond to the two 4721 matching algorithms provided by PCRE2. 4722 4723 4724THE STANDARD MATCHING ALGORITHM 4725 4726 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres- 4727 sions", the standard algorithm is an "NFA algorithm". It conducts a 4728 depth-first search of the pattern tree. That is, it proceeds along a 4729 single path through the tree, checking that the subject matches what is 4730 required. When there is a mismatch, the algorithm tries any alterna- 4731 tives at the current point, and if they all fail, it backs up to the 4732 previous branch point in the tree, and tries the next alternative 4733 branch at that level. This often involves backing up (moving to the 4734 left) in the subject string as well. The order in which repetition 4735 branches are tried is controlled by the greedy or ungreedy nature of 4736 the quantifier. 4737 4738 If a leaf node is reached, a matching string has been found, and at 4739 that point the algorithm stops. Thus, if there is more than one possi- 4740 ble match, this algorithm returns the first one that it finds. Whether 4741 this is the shortest, the longest, or some intermediate length depends 4742 on the way the greedy and ungreedy repetition quantifiers are specified 4743 in the pattern. 4744 4745 Because it ends up with a single path through the tree, it is rela- 4746 tively straightforward for this algorithm to keep track of the sub- 4747 strings that are matched by portions of the pattern in parentheses. 4748 This provides support for capturing parentheses and back references. 
4749 4750 4751THE ALTERNATIVE MATCHING ALGORITHM 4752 4753 This algorithm conducts a breadth-first search of the tree. Starting 4754 from the first matching point in the subject, it scans the subject 4755 string from left to right, once, character by character, and as it does 4756 this, it remembers all the paths through the tree that represent valid 4757 matches. In Friedl's terminology, this is a kind of "DFA algorithm", 4758 though it is not implemented as a traditional finite state machine (it 4759 keeps multiple states active simultaneously). 4760 4761 Although the general principle of this matching algorithm is that it 4762 scans the subject string only once, without backtracking, there is one 4763 exception: when a lookaround assertion is encountered, the characters 4764 following or preceding the current point have to be independently 4765 inspected. 4766 4767 The scan continues until either the end of the subject is reached, or 4768 there are no more unterminated paths. At this point, terminated paths 4769 represent the different matching possibilities (if there are none, the 4770 match has failed). Thus, if there is more than one possible match, 4771 this algorithm finds all of them, and in particular, it finds the long- 4772 est. The matches are returned in decreasing order of length. There is 4773 an option to stop the algorithm after the first match (which is neces- 4774 sarily the shortest) is found. 4775 4776 Note that all the matches that are found start at the same point in the 4777 subject. If the pattern 4778 4779 cat(er(pillar)?)? 4780 4781 is matched against the string "the caterpillar catchment", the result 4782 is the three strings "caterpillar", "cater", and "cat" that start at 4783 the fifth character of the subject. The algorithm does not automati- 4784 cally move on to find matches that start at later positions. 

       PCRE2's "auto-possessification" optimization usually applies to
       character repeats at the end of a pattern (as well as internally). For
       example, the pattern "a\d+" is compiled as if it were "a\d++" because
       there is no point even considering the possibility of backtracking into
       the repeated digits. For DFA matching, this means that only one
       possible match is found. If you really do want multiple matches in such
       cases, either use an ungreedy repeat ("a\d+?") or set the
       PCRE2_NO_AUTO_POSSESS option when compiling.

       There are a number of features of PCRE2 regular expressions that are
       not supported by, or behave differently in, the alternative matching
       algorithm. They are as follows:

       1. Because the algorithm finds all possible matches, the greedy or
       ungreedy nature of repetition quantifiers is not relevant (though it
       may affect auto-possessification, as just described). During matching,
       greedy and ungreedy quantifiers are treated in exactly the same way.
       However, possessive quantifiers can make a difference when what follows
       could also match what is quantified, for example in a pattern like
       this:

         ^a++\w!

       This pattern matches "aaab!" but not "aaa!", which would be matched by
       a non-possessive quantifier. Similarly, if an atomic group is present,
       it is matched as if it were a standalone pattern at the current point,
       and the longest match is then "locked in" for the rest of the overall
       pattern.

       2. When dealing with multiple paths through the tree simultaneously, it
       is not straightforward to keep track of captured substrings for the
       different matching possibilities, and PCRE2's implementation of this
       algorithm does not attempt to do this. This means that no captured
       substrings are available.

       3. Because no substrings are captured, back references within the
       pattern are not supported, and cause errors if encountered.

       4. For the same reason, conditional expressions that use a
       backreference as the condition or test for a specific group recursion
       are not supported.

       5. Because many paths through the tree may be active, the \K escape
       sequence, which resets the start of the match when encountered (but may
       be on some paths and not on others), is not supported. It causes an
       error if encountered.

       6. Callouts are supported, but the value of the capture_top field is
       always 1, and the value of the capture_last field is always 0.

       7. The \C escape sequence, which (in the standard algorithm) always
       matches a single code unit, even in a UTF mode, is not supported in
       these modes, because the alternative algorithm moves through the
       subject string one character (not code unit) at a time, for all active
       paths through the tree.

       8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
       are not supported. (*FAIL) is supported, and behaves like a failing
       negative assertion.


ADVANTAGES OF THE ALTERNATIVE ALGORITHM

       Using the alternative matching algorithm provides the following
       advantages:

       1. All possible matches (at a single point in the subject) are
       automatically found, and in particular, the longest match is found. To
       find more than one match using the standard algorithm, you have to do
       kludgy things with callouts.

       2. Because the alternative algorithm scans the subject string just
       once, and never needs to backtrack (except for lookbehinds), it is
       possible to pass very long subject strings to the matching function in
       several pieces, checking for partial matching each time. Although it is
       also possible to do multi-segment matching using the standard
       algorithm, by retaining partially matched substrings, it is more
       complicated. The pcre2partial documentation gives details of partial
       matching and discusses multi-segment matching.


DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than the standard algorithm. This is
       partly because it has to search for all possible matches, but is also
       because it is less susceptible to optimization.

       2. Capturing parentheses and back references are not supported.

       3. Although atomic groups are supported, their use does not provide the
       performance advantage that it does for the standard algorithm.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 29 September 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)



NAME
       PCRE2 - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE2

       In normal use of PCRE2, if the subject string that is passed to a
       matching function matches as far as it goes, but is too short to match
       the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are
       circumstances where it might be helpful to distinguish this case from
       other cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements. An example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid, it is able to
       raise an error as soon as a mistake is made, by beeping and not
       reflecting the character that has been typed, for example. This
       immediate feedback is likely to be a better user interface than a check
       that is delayed until the entire string has been entered. Partial
       matching can also be useful when the subject string is very long and is
       not all available at once.

       PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
       PCRE2_PARTIAL_HARD options, which can be set when calling a matching
       function. The difference between the two options is whether or not a
       partial match is preferred to an alternative complete match, though the
       details differ between the two types of matching function. If both
       options are set, PCRE2_PARTIAL_HARD takes precedence.

       If you want to use partial matching with just-in-time optimized code,
       you must call pcre2_jit_compile() with one or both of these options:

         PCRE2_JIT_PARTIAL_SOFT
         PCRE2_JIT_PARTIAL_HARD

       PCRE2_JIT_COMPLETE should also be set if you are going to run
       non-partial matches on the same pattern. If the appropriate JIT mode
       has not been compiled, interpretive matching code is used.

       Setting a partial matching option disables two of PCRE2's standard
       optimizations. PCRE2 remembers the last literal code unit in a pattern,
       and abandons matching immediately if it is not present in the subject
       string. This optimization cannot be used for a subject string that
       might match only partially. PCRE2 also knows the minimum length of a
       matching string, and does not bother to run the matching function on
       shorter strings. This optimization is also disabled for partial
       matching.


PARTIAL MATCHING USING pcre2_match()

       A partial match occurs during a call to pcre2_match() when the end of
       the subject string is reached successfully, but matching cannot
       continue because more characters are needed. However, at least one
       character in the subject must have been inspected. This character need
       not form part of the final matched string; lookbehind assertions and
       the \K escape sequence provide ways of inspecting characters before the
       start of a matched string. The requirement for inspecting at least one
       character exists because an empty string can always be matched; without
       such a restriction there would always be a partial match of an empty
       string at the end of the subject.

       When a partial match is returned, the first two elements in the ovector
       point to the portion of the subject that was matched, but the values in
       the rest of the ovector are undefined. The appearance of \K in the
       pattern has no effect for a partial match. Consider this pattern:

         /abc\K123/

       If it is matched against "456abc123xyz" the result is a complete match,
       and the ovector defines the matched string as "123", because \K resets
       the "start of match" point. However, if a partial match is requested
       and the subject string is "456abc12", a partial match is found for the
       string "abc12", because all these characters are needed for a
       subsequent re-match with additional characters.

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

   PCRE2_PARTIAL_SOFT WITH pcre2_match()

       If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial
       match, the partial match is remembered, but matching continues as
       normal, and other alternatives in the pattern are tried. If no complete
       match can be found, PCRE2_ERROR_PARTIAL is returned instead of
       PCRE2_ERROR_NOMATCH.

       This option is "soft" because it prefers a complete match over a
       partial match. All the various matching items in a pattern behave as if
       the subject string is potentially complete. For example, \z, \Z, and $
       match at the end of the subject, as normal, and for \b and \B the end
       of the subject is treated as a non-alphanumeric.

       If there is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both
       alternatives fail to match, but the end of the subject is reached
       during matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are
       set to 3 and 9, identifying "123dog" as the first partial match that
       was found. (In this example, there are two partial matches, because
       "dog" on its own partially matches the second alternative.)

   PCRE2_PARTIAL_HARD WITH pcre2_match()

       If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
       returned as soon as a partial match is found, without continuing to
       search for possible complete matches. This option is "hard" because it
       prefers an earlier partial match over a later complete match. For this
       reason, the assumption is made that the end of the supplied subject
       string may not be the true end of the available data, and so, if \z,
       \Z, \b, \B, or $ are encountered at the end of the subject, the result
       is PCRE2_ERROR_PARTIAL, provided that at least one character in the
       subject has been inspected.

   Comparing hard and soft partial matching

       The difference between the two partial matching options can be
       illustrated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it prefers
       the longer string if possible). If it is matched against the string
       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog".
       However, if PCRE2_PARTIAL_HARD is set, the result is
       PCRE2_ERROR_PARTIAL. On the other hand, if the pattern is made ungreedy
       the result is different:

         /dog(sbody)??/

       In this case the result is always a complete match because that is
       found first, and matching never continues after finding a complete
       match. It might be easier to follow this explanation by thinking of the
       two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will always
       find the shorter match first.


PARTIAL MATCHING USING pcre2_dfa_match()

       The DFA functions move along the subject string character by character,
       without backtracking, searching for all possible matches
       simultaneously. If the end of the subject is reached before the end of
       the pattern, there is the possibility of a partial match, again
       provided that at least one character has been inspected.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise, the complete matches
       are returned. However, if PCRE2_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of the string
       that was matched when the longest partial match was found is set as the
       first matching string.

       Because the DFA functions always search for all possible matches, and
       there is no difference between greedy and ungreedy repetition, their
       behaviour is different from the standard functions when
       PCRE2_PARTIAL_HARD is set. Consider the string "dog" matched against
       the ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas the standard function stops as soon as it finds the complete
       match for "dog", the DFA function also finds the partial match for
       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.


PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE2_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following character cannot take place, so a partial match is found.
       However, normal matching carries on, and \b matches at the end of the
       subject when the last character is a letter, so a complete match is
       found. The result, therefore, is not PCRE2_ERROR_PARTIAL. Using
       PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
       then the partial match takes precedence.


EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST

       If the partial_soft (or ps) modifier is present on a pcre2test data
       line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a
       run of pcre2test that uses the date example quoted above:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\=ps
          0: 25jun04
          1: jun
         data> 25dec3\=ps
         Partial match: 25dec3
         data> 3ju\=ps
         Partial match: 3ju
         data> 3juj\=ps
         No match
         data> j\=ps
         No match

       The first data string is matched completely, so pcre2test shows the
       matched substrings. The remaining four strings do not match the
       complete pattern, but the first two are partial matches. Similar output
       is obtained if DFA matching is used.

       If the partial_hard (or ph) modifier is present on a pcre2test data
       line, the PCRE2_PARTIAL_HARD option is set for the match.


MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

       When a partial match has been found using a DFA matching function, it
       is possible to continue the match by providing additional subject data
       and calling the function again with the same compiled regular
       expression, this time setting the PCRE2_DFA_RESTART option. You must
       pass the same working space as before, because this is where details of
       the previous partial match are stored. Here is an example using
       pcre2test:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=dfa,ps
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       The first call has "23ja" as the subject, and requests partial
       matching; the second call has "n05" as the subject for the continued
       (restarted) match. Notice that when the match is complete, only the
       last part is shown; PCRE2 does not retain the previously
       partially-matched string. It is up to the calling program to do that if
       it needs to.

       That means that, for an unanchored pattern, if a continued match fails,
       it is not possible to try again at a new starting point. All this
       facility is capable of doing is continuing with the previous match
       attempt. In the previous example, if the second set of data is "ug23"
       the result is no match, even though there would be a match for "aug23"
       if the entire string were given at once. Depending on the application,
       this may or may not be what you want. The only way to allow for
       starting again at the next character is to retain the matched part of
       the subject and try a new complete match.

       You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
       PCRE2_DFA_RESTART to continue partial matching over multiple segments.
       This facility can be used to pass very long subject strings to the DFA
       matching functions.


MULTI-SEGMENT MATCHING WITH pcre2_match()

       Unlike with the DFA function, it is not possible to restart the
       previous match with a new segment of data when using pcre2_match().
       Instead, new data must be added to the previous subject string, and the
       entire match re-run, starting from the point where the partial match
       occurred. Earlier data can be discarded.

       It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
       not treat the end of a segment as the end of the subject when matching
       \z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
       dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\=ph
         Partial match: 23ja

       At this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call the matching function
       again. Unlike the DFA matching function, the entire matching string
       must always be available, and the complete matching process occurs for
       each call, so more memory and more processing time are needed.
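
       The buffer handling described above (discard the text before the
       partial match, then add the next segment) can be sketched in C as
       follows. This is a minimal sketch, not part of the PCRE2 API: the
       function name carry_over and its parameters are invented for the
       illustration, and the call to the matching function itself is left
       out. It assumes that keep_from is the offset at which the partial
       match started (the first ovector element after PCRE2_ERROR_PARTIAL),
       and that the pattern contains no lookbehind, which would require
       retaining earlier characters as well.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: keep buf[keep_from .. *len) at the front of the
   buffer and append the next segment after it. `cap` is the size of
   `buf` in bytes. Returns 1 on success and updates *len; returns 0 and
   leaves the buffer untouched if the result would not fit. */
int carry_over(char *buf, size_t *len, size_t cap, size_t keep_from,
               const char *segment, size_t segment_len)
{
    size_t kept;

    assert(keep_from <= *len && *len <= cap);

    kept = *len - keep_from;          /* partially matched tail */
    if (segment_len > cap - kept)
        return 0;                     /* not enough room */

    memmove(buf, buf + keep_from, kept);      /* discard earlier data */
    memcpy(buf + kept, segment, segment_len); /* add the new segment */
    *len = kept + segment_len;
    return 1;
}
```

       With the buffer "The date is 23ja" and a partial match starting at
       offset 12, appending the segment "n05 today" leaves "23jan05 today" in
       the buffer, ready to be passed to the matching function as a new
       subject.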


ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE2_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE2_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.

       2. If a pattern contains a lookbehind assertion, characters that
       precede the start of the partial match may have been inspected during
       the matching process. When using pcre2_match(), sufficient characters
       must be retained for the next match attempt. You can ensure that enough
       characters are retained by doing the following:

       Before doing any matching, find the length of the longest lookbehind in
       the pattern by calling pcre2_pattern_info() with the
       PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
       characters, not code units. After a partial match, moving back from the
       ovector[0] offset in the subject by the number of characters given for
       the maximum lookbehind gets you to the earliest character that must be
       retained. In a non-UTF or a 32-bit situation, moving back is just a
       subtraction, but in UTF-8 or UTF-16 you have to count characters while
       moving back through the code units.

       Characters before the point you have now reached can be discarded, and
       after the next segment has been added to what is retained, you should
       run the next match with the startoffset argument set so that the match
       begins at the same point as before.

       For example, if the pattern "(?<=123)abc" is partially matched against
       the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The
       maximum lookbehind count is 3, so all characters before offset 2 can be
       discarded. The value of startoffset for the next match should be 3.
       When pcre2test displays a partial match, it indicates the lookbehind
       characters with '<' characters:

           re> "(?<=123)abc"
         data> xx123ab\=ph
         Partial match: 123ab
                        <<<

       3. Because a partial match must always contain at least one character,
       what might be considered a partial match of an empty string actually
       gives a "no match" result. For example:

           re> /c(?<=abc)x/
         data> ab\=ps
         No match

       If the next segment begins "cx", a match should be found, but this will
       only happen if characters from the previous segment are retained. For
       this reason, a "no match" result should be interpreted as "partial
       match of an empty string" when the pattern contains lookbehinds.

       4. Matching a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one single
       long string, especially when PCRE2_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes an issue that
       arises if the pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple matching possibilities, because (for
       PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
       no completed matches. This means that as soon as the shortest match has
       been found, continuation to a new subject segment is no longer
       possible. Consider this pcre2test example:

           re> /dog(sbody)?/
         data> dogsb\=ps
          0: dog
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ps,dfa,dfa_restart
          0: g
         data> dogsbody\=dfa
          0: dogsbody
          1: dog

       The first data line passes the string "dogsb" to a standard matching
       function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
       a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL,
       because the shorter string "dog" is a complete match. Similarly, when
       the subject is presented to a DFA matching function in several parts
       ("do" and "gsb" being the first two) the match stops when "dog" has
       been found, and it is not possible to continue. On the other hand, if
       "dogsbody" is presented as a single string, a DFA matching function
       finds both matches.

       Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
       matching multi-segment data. The example above then behaves
       differently:

           re> /dog(sbody)?/
         data> dogsb\=ph
         Partial match: dogsb
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ph,dfa,dfa_restart
         Partial match: gsb

       5. Patterns that contain alternatives at the top level which do not all
       start with the same pattern item may not work as expected when
       PCRE2_DFA_RESTART is used. For example, consider this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point in the subject string. Attempting to continue with the string
       "7890" does not yield a match because only those alternatives that
       match at one point in the subject are remembered. The problem arises
       because the start of the second alternative matches within the first
       alternative. There is no problem with anchored patterns or patterns
       such as:

         1234|ABCD

       where no string can be a partial match for both alternatives. This is
       not a problem if a standard matching function is used, because the
       entire match has to be rerun each time:

           re> /1234|3789/
         data> ABC123\=ph
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE2_DFA_RESTART, the same technique of
       re-running the entire match can also be used with the DFA matching
       function. Another possibility is to work with two buffers. If a partial
       match at offset n in the first buffer is followed by "no match" when
       PCRE2_DFA_RESTART is used on the second buffer, you can then try a new
       match starting at offset n+1 in the first buffer.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions that are supported
       by PCRE2 are described in detail below. There is a quick-reference
       syntax summary in the pcre2syntax page. PCRE2 tries to match Perl
       syntax and semantics as closely as it can. PCRE2 also supports some
       alternative regular expression syntax (which does not conflict with the
       Perl syntax) in order to provide some compatibility with regular
       expressions in Python, .NET, and Oniguruma.

       Perl's regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of books, some
       of which have copious examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by O'Reilly, covers regular expressions in
       great detail. This description of PCRE2's regular expressions is
       intended as reference material.

       This document discusses the patterns that are supported by PCRE2 when
       its main matching function, pcre2_match(), is used. PCRE2 also has an
       alternative matching function, pcre2_dfa_match(), which matches using a
       different algorithm that is not Perl-compatible. Some of the features
       discussed below are not available when DFA matching is used. The
       advantages and disadvantages of the alternative function, and how it
       differs from the normal function, are discussed in the pcre2matching
       page.


SPECIAL START-OF-PATTERN ITEMS

       A number of options that can be passed to pcre2_compile() can also be
       set by special items at the start of a pattern. These are not
       Perl-compatible, but are provided to make these options accessible to
       pattern writers who are not able to change the program that processes
       the pattern. Any number of these items may appear, but they must all be
       together right at the start of the pattern string, and the letters must
       be in upper case.

   UTF support

       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
       can be specified for the 32-bit library, in which case it constrains
       the character values to valid Unicode code points. To process UTF
       strings, PCRE2 must be built to include Unicode support (which is the
       default). When using UTF strings you must either call the compiling
       function with the PCRE2_UTF option, or the pattern must start with the
       special sequence (*UTF), which is equivalent to setting the relevant
       option. How setting a UTF mode affects pattern matching is mentioned in
       several places below. There is also a summary of features in the
       pcre2unicode page.

       Some applications that allow their users to supply patterns may wish to
       restrict them to non-UTF data for security reasons. If the
       PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
       allowed, and its appearance in a pattern causes an error.

   Unicode property support

       Another special sequence that may appear at the start of a pattern is
       (*UCP). This has the same effect as setting the PCRE2_UCP option: it
       causes sequences such as \d and \w to use Unicode properties to
       determine character types, instead of recognizing only characters with
       codes less than 256 via a lookup table.

       Some applications that allow their users to supply patterns may wish to
       restrict them for security reasons. If the PCRE2_NEVER_UCP option is
       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
       a pattern causes an error.

   Locking out empty string matching

       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
       effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
       to whichever matching function is subsequently called to match the
       pattern. These options lock out the matching of empty strings, either
       entirely, or only at the start of the subject.

   Disabling auto-possessification

       If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
       quantifiers possessive when what follows cannot match the repeated
       item. For example, by default a+b is treated as a++b. For more details,
       see the pcre2api documentation.

   Disabling start-up optimizations

       If a pattern starts with (*NO_START_OPT), it has the same effect as
       setting the PCRE2_NO_START_OPTIMIZE option. This disables several
       optimizations for quickly reaching "no match" results. For more
       details, see the pcre2api documentation.

   Disabling automatic anchoring

       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
       as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables
       optimizations that apply to patterns whose top-level branches all start
       with .* (match any number of arbitrary characters). For more details,
       see the pcre2api documentation.

   Disabling JIT compilation

       If a pattern that starts with (*NO_JIT) is successfully compiled, an
       attempt by the application to apply the JIT optimization by calling
       pcre2_jit_compile() is ignored.

   Setting match and recursion limits

       The caller of pcre2_match() can set a limit on the number of times the
       internal match() function is called and on the maximum depth of
       recursive calls. These facilities are provided to catch runaway matches
       that are provoked by patterns with huge matching trees (a typical
       example is a pattern with nested unlimited repeats) and to avoid
       running out of system stack by too much recursion. When one of these
       limits is reached, pcre2_match() gives an error return. The limits can
       also be set by items at the start of the pattern of the form

         (*LIMIT_MATCH=d)
         (*LIMIT_RECURSION=d)

       where d is any number of decimal digits. However, the value of the
       setting must be less than the value set (or defaulted) by the caller of
       pcre2_match() for it to have any effect. In other words, the pattern
       writer can lower the limits set by the programmer, but not raise them.
       If there is more than one setting of one of these limits, the lower
       value is used.
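
       The rule for combining a caller's limit with limits given in the
       pattern amounts to taking a minimum. The following C sketch states
       that rule in code. It is an illustration only, not PCRE2's
       implementation: effective_limit and its parameters are invented names,
       and the values in pattern_limits stand for the d values of however
       many (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) items of one kind appear
       at the start of the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the rule in the text: a setting in the pattern
   takes effect only if it is lower than the caller's limit, and when
   there are several settings the lowest one is used. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_limits, size_t count)
{
    uint32_t limit = caller_limit;

    assert(pattern_limits != NULL || count == 0);

    for (size_t i = 0; i < count; i++)
        if (pattern_limits[i] < limit)
            limit = pattern_limits[i];   /* the pattern may only lower it */
    return limit;
}
```

       For a caller limit of 10000, pattern settings of 5000 and 2000 give an
       effective limit of 2000, whereas a single pattern setting of 50000
       leaves the limit at 10000.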
5467 5468 Newline conventions 5469 5470 PCRE2 supports five different conventions for indicating line breaks in 5471 strings: a single CR (carriage return) character, a single LF (line- 5472 feed) character, the two-character sequence CRLF, any of the three pre- 5473 ceding, or any Unicode newline sequence. The pcre2api page has further 5474 discussion about newlines, and shows how to set the newline convention 5475 when calling pcre2_compile(). 5476 5477 It is also possible to specify a newline convention by starting a pat- 5478 tern string with one of the following five sequences: 5479 5480 (*CR) carriage return 5481 (*LF) linefeed 5482 (*CRLF) carriage return, followed by linefeed 5483 (*ANYCRLF) any of the three above 5484 (*ANY) all Unicode newline sequences 5485 5486 These override the default and the options given to the compiling func- 5487 tion. For example, on a Unix system where LF is the default newline 5488 sequence, the pattern 5489 5490 (*CR)a.b 5491 5492 changes the convention to CR. That pattern matches "a\nb" because LF is 5493 no longer a newline. If more than one of these settings is present, the 5494 last one is used. 5495 5496 The newline convention affects where the circumflex and dollar asser- 5497 tions are true. It also affects the interpretation of the dot metachar- 5498 acter when PCRE2_DOTALL is not set, and the behaviour of \N. However, 5499 it does not affect what the \R escape sequence matches. By default, 5500 this is any Unicode newline sequence, for Perl compatibility. However, 5501 this can be changed; see the description of \R in the section entitled 5502 "Newline sequences" below. A change of \R setting can be combined with 5503 a change of newline convention. 5504 5505 Specifying what \R matches 5506 5507 It is possible to restrict \R to match only CR, LF, or CRLF (instead of 5508 the complete set of Unicode line endings) by setting the option 5509 PCRE2_BSR_ANYCRLF at compile time. 
       This effect can also be achieved by starting a pattern with
       (*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
       corresponding to PCRE2_BSR_UNICODE.


EBCDIC CHARACTER CODES

       PCRE2 can be compiled to run in an environment that uses EBCDIC as its
       character code rather than ASCII or Unicode (typically a mainframe
       system). In the sections below, character code values are ASCII or
       Unicode; in an EBCDIC environment these characters may have different
       code values, and there are no code points greater than 255.


CHARACTERS AND METACHARACTERS

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless matching is specified (the PCRE2_CASELESS option), letters are
       matched independently of case.

       The power of regular expressions comes from the ability to include
       alternatives and repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for themselves
       but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are
       recognized anywhere in the pattern except within square brackets, and
       those that are recognized within square brackets. Outside square
       brackets, the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.


BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a character that is not a number or a letter, it takes away any special
       meaning that character may have. This use of backslash as an escape
       character applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a
       backslash, you write \\.

       In a UTF mode, only ASCII numbers and letters have any special meaning
       after a backslash. All other characters (in particular, those whose
       codepoints are greater than 127) are treated as literals.
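
       Because it is always safe to precede a non-alphanumeric character with
       a backslash, an application can turn arbitrary text into a pattern
       that matches that text literally. The following is a minimal sketch
       under stated assumptions: escape_literal() is a hypothetical helper
       (it is not a PCRE2 function), the input is a NUL-terminated 8-bit
       string in an ASCII environment, and bytes above 127 are copied
       unchanged because they have no special meaning.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE2 API.
   Copies src to dst, putting a backslash before every ASCII character that
   is not a letter or a digit. ASCII letters and digits are never escaped,
   because a backslash in front of them can have a special meaning.
   Returns 0 on success, or -1 if dst (of dstsize bytes) is too small. */
int escape_literal(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (n + 3 > dstsize)            /* room for '\\', c, and the NUL */
            return -1;
        if (c < 128 && !isalnum(c))
            dst[n++] = '\\';            /* escape ASCII non-alphanumerics */
        dst[n++] = (char)c;
    }
    if (n + 1 > dstsize)
        return -1;
    dst[n] = '\0';
    return 0;
}
```

       For the input a*b this writes a\*b to the output buffer. A buffer of
       twice the input length plus one byte is always large enough.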

       If a pattern is compiled with the PCRE2_EXTENDED option, most white
       space in the pattern (other than in a character class), and characters
       between a # outside a character class and the next newline, inclusive,
       are ignored. An escaping backslash can be used to include a white space
       or # character as part of the pattern.

       If you want to remove the special meaning from a sequence of
       characters, you can do so by putting them between \Q and \E. This is
       different from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE2, whereas in Perl, $ and @ cause variable
       interpolation. Note the following examples:

         Pattern            PCRE2 matches   Perl matches

         \Qabc$xyz\E        abc$xyz         abc followed by the
                                              contents of $xyz
         \Qabc\$xyz\E       abc\$xyz        abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes. An isolated \E that is not preceded by \Q is ignored. If \Q
       is not followed by \E later in the pattern, the literal interpretation
       continues to the end of the pattern (that is, \E is assumed at the
       end). If the isolated \Q is inside a character class, this causes an
       error, because the character class is not terminated.

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing
       characters in patterns in a visible manner. There is no restriction on
       the appearance of non-printing characters in a pattern, but when a
       pattern is being prepared by text editing, it is often easier to use
       one of the following escape sequences than the binary character it
       represents.
       In an ASCII or Unicode environment, these escapes are as follows:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any printable ASCII character
         \e        escape (hex 1B)
         \f        form feed (hex 0C)
         \n        linefeed (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \0dd      character with octal code 0dd
         \ddd      character with octal code ddd, or back reference
         \o{ddd..} character with octal code ddd..
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh.. (default mode)
         \uhhhh    character with hex code hhhh (when PCRE2_ALT_BSUX is set)

       The precise effect of \cx on ASCII characters is as follows: if x is a
       lower case letter, it is converted to upper case. Then bit 6 of the
       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
       hex 7B (; is 3B). If the code unit following \c has a value less than
       32 or greater than 126, a compile-time error occurs. This locks out
       non-printable ASCII characters in all modes.

       When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
       generate the appropriate EBCDIC code values. The \c escape is processed
       as specified for Perl in the perlebcdic document. The only characters
       that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or
       ?. Any other character provokes a compile-time error. The sequence \c@
       encodes character code 0; the letters (in either case) encode
       characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
       27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95
       (hex 5F).

       Thus, apart from \c?, these escapes generate the same character code
       values as they do in an ASCII environment, though the meanings of the
       values mostly differ. For example, \cG always generates code value 7,
       which is BEL in ASCII but DEL in EBCDIC.

       The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
       but because 127 is not a control character in EBCDIC, Perl makes it
       generate the APC character. Unfortunately, there are several variants
       of EBCDIC. In most of them the APC character has the value 255 (hex
       FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
       certain other characters have POSIX-BC values, PCRE2 makes \c? generate
       95; otherwise it generates 255.

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used. Thus the
       sequence \0\x\015 specifies two binary zeros followed by a CR character
       (code value 13). Make sure you supply two digits after the initial zero
       if the pattern character that follows is itself an octal digit.

       The escape \o must be followed by a sequence of octal digits, enclosed
       in braces. An error occurs if this is not the case. This escape is a
       recent addition to Perl; it provides a way of specifying character code
       points as octal numbers greater than 0777, and it also allows octal
       numbers and back references to be unambiguously specified.

       For greater clarity and unambiguity, it is best to avoid following \ by
       a digit greater than zero. Instead, use \o{} or \x{} to specify
       character numbers, and \g{} to specify back references. The following
       paragraphs describe the old, ambiguous syntax.

       The handling of a backslash followed by a digit other than 0 is
       complicated, and Perl has changed over time, causing PCRE2 also to
       change.

       Outside a character class, PCRE2 reads the digit and any following
       digits as a decimal number.
       If the number is less than 10, begins with the digit 8 or 9, or if
       there are at least that many previous capturing left parentheses in
       the expression, the entire sequence is taken as a back reference. A
       description of how this works is given later, following the discussion
       of parenthesized subpatterns. Otherwise, up to three octal digits are
       read to form a character code.

       Inside a character class, PCRE2 handles \8 and \9 as the literal
       characters "8" and "9", and otherwise reads up to three octal digits
       following the backslash, using them to generate a data character. Any
       subsequent digits stand for themselves. For example, outside a
       character class:

         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the value 255 (decimal)
         \81    is always a back reference

       Note that octal values of 100 or greater that are specified using this
       syntax must not be introduced by a leading zero, because no more than
       three octal digits are ever read.

       By default, after \x that is not followed by {, from zero to two
       hexadecimal digits are read (letters can be in upper or lower case).
       Any number of hexadecimal digits may appear between \x{ and }. If a
       character other than a hexadecimal digit appears between \x{ and }, or
       if there is no terminating }, an error occurs.

       If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
       just described only when it is followed by two hexadecimal digits.
       Otherwise, it matches a literal "x" character. In this mode, support
       for code points greater than 256 is provided by \u, which must be
       followed by four hexadecimal digits; otherwise it matches a literal "u"
       character.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no
       difference in the way they are handled. For example, \xdc is exactly
       the same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).

   Constraints on character values

       Characters that are specified using octal or hexadecimal numbers are
       limited to certain values, as follows:

         8-bit non-UTF mode    less than 0x100
         8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint
         16-bit non-UTF mode   less than 0x10000
         16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint
         32-bit non-UTF mode   less than 0x100000000
         32-bit UTF-32 mode    no greater than 0x10ffff and a valid codepoint

       Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the
       so-called "surrogate" codepoints), and 0xffef.

   Escape sequences in character classes

       All the sequences that define a single character value can be used both
       inside and outside character classes. In addition, inside a character
       class, \b is interpreted as the backspace character (hex 08).

       \N is not allowed in a character class. \B, \R, and \X are not special
       inside a character class. Like other unrecognized alphabetic escape
       sequences, they cause an error. Outside a character class, these
       sequences have different meanings.

   Unsupported escape sequences

       In Perl, the sequences \l, \L, \u, and \U are recognized by its string
       handler and used to modify the case of following characters. By
       default, PCRE2 does not support these escape sequences.
       However, if the PCRE2_ALT_BSUX option is set, \U matches a "U"
       character, and \u can be used to define a character by code point, as
       described in the previous section.

   Absolute and relative back references

       The sequence \g followed by an unsigned or a negative number,
       optionally enclosed in braces, is an absolute or relative back
       reference. A named back reference can be coded as \g{name}. Back
       references are discussed later, following the discussion of
       parenthesized subpatterns.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes, is
       an alternative syntax for referencing a subpattern as a "subroutine".
       Details are discussed later. Note that \g{...} (Perl syntax) and
       \g<...> (Oniguruma syntax) are not synonymous. The former is a back
       reference; the latter is a subroutine call.

   Generic character types

       Another use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character

       There is also the single sequence \N, which matches a non-newline
       character. This is the same as the "." metacharacter when PCRE2_DOTALL
       is not set. Perl also uses \N to match characters by name; PCRE2 does
       not support this.

       Each pair of lower and upper case escape sequences partitions the
       complete set of characters into two disjoint sets.
       Any given character matches one, and only one, of each pair. The
       sequences can appear both inside and outside character classes. They
       each match one character of the appropriate type. If the current
       matching point is at the end of the subject string, all of them fail,
       because there is no character to match.

       The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
       (13), and space (32), which are defined as white space in the "C"
       locale. This list may vary if locale-specific matching is taking place.
       For example, in some locales the "non-breaking space" character (\xA0)
       is recognized as white space, and in others the VT character is not.

       A "word" character is an underscore or any character that is a letter
       or digit. By default, the definition of letters and digits is
       controlled by PCRE2's low-valued character tables, and may vary if
       locale-specific matching is taking place (see "Locale support" in the
       pcre2api page). For example, in a French locale such as "fr_FR" in
       Unix-like systems, or "french" in Windows, some character codes greater
       than 127 are used for accented letters, and these are then matched by
       \w. The use of locales with Unicode is discouraged.

       By default, characters whose code points are greater than 127 never
       match \d, \s, or \w, and always match \D, \S, and \W, although this may
       be different for characters in the range 128-255 when locale-specific
       matching is happening. These escape sequences retain their original
       meanings from before Unicode support was available, mainly for
       efficiency reasons.
       If the PCRE2_UCP option is set, the behaviour is changed so that
       Unicode properties are used to determine character types, as follows:

         \d  any character that matches \p{Nd} (decimal digit)
         \s  any character that matches \p{Z} or \h or \v
         \w  any character that matches \p{L} or \p{N}, plus underscore

       The upper case escapes match the inverse sets of characters. Note that
       \d matches only decimal digits, whereas \w matches any Unicode digit,
       as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
       affects \b and \B because they are defined in terms of \w and \W.
       Matching these sequences is noticeably slower when PCRE2_UCP is set.

       The sequences \h, \H, \v, and \V, in contrast to the other sequences,
       which match only ASCII characters by default, always match a specific
       list of code points, whether or not PCRE2_UCP is set. The horizontal
       space characters are:

         U+0009     Horizontal tab (HT)
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed (LF)
         U+000B     Vertical tab (VT)
         U+000C     Form feed (FF)
         U+000D     Carriage return (CR)
         U+0085     Next line (NEL)
         U+2028     Line separator
         U+2029     Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters with code points less
       than 256 are relevant.

   Newline sequences

       Outside a character class, by default, the escape sequence \R matches
       any Unicode newline sequence.
       In 8-bit non-UTF-8 mode \R is equivalent to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below. This particular group matches either the two-character sequence
       CR followed by LF, or one of the single characters LF (linefeed,
       U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR
       (carriage return, U+000D), or NEL (next line, U+0085). Because this is
       an atomic group, the two-character sequence is treated as a single unit
       that cannot be split.

       In other modes, two additional characters whose codepoints are greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph
       separator, U+2029). Unicode support is not needed for these characters
       to be recognized.

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
       "backslash R".) This can be made the default when PCRE2 is built; if
       this is the case, the other behaviour can be requested via the
       PCRE2_BSR_UNICODE option. It is also possible to specify these settings
       by starting a pattern string with one of the following sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to the compiling
       function. Note that these special settings, which are not
       Perl-compatible, are recognized only at the very start of a pattern,
       and that they must be in upper case. If more than one of them is
       present, the last one is used. They can be combined with a change of
       newline convention; for example, a pattern can start with:

         (*ANY)(*BSR_ANYCRLF)

       They can also be combined with the (*UTF) or (*UCP) special sequences.
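
       What \R consumes in 8-bit non-UTF-8 mode can be expressed as a small
       C function. This is a sketch of the behaviour described above, not
       PCRE2's implementation: match_bsr() and its anycrlf parameter are
       hypothetical names, an ASCII environment is assumed, and a non-zero
       anycrlf stands for the PCRE2_BSR_ANYCRLF (or (*BSR_ANYCRLF)) setting.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE2 API.
   Returns the number of code units that \R would consume at s[pos] in
   8-bit non-UTF-8 mode, or 0 if \R does not match there. */
size_t match_bsr(const unsigned char *s, size_t len, size_t pos, int anycrlf)
{
    if (pos >= len)
        return 0;                      /* no character left to match */

    unsigned char c = s[pos];
    if (c == '\r') {
        /* CRLF is tried first and, because the group is atomic, it is
           never split into a lone CR afterwards. */
        return (pos + 1 < len && s[pos + 1] == '\n') ? 2 : 1;
    }
    if (c == '\n')
        return 1;
    if (!anycrlf && (c == 0x0b || c == 0x0c || c == 0x85))
        return 1;                      /* VT, FF, NEL: full set only */
    return 0;
}
```

       For the subject "a\r\nb" the function returns 2 at offset 1, so the CR
       and LF are consumed together. NEL (hex 85) matches only when anycrlf
       is zero.
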
       Inside a character class, \R is treated as an unrecognized escape
       sequence, and causes an error.

   Unicode character properties

       When PCRE2 is built with Unicode support (the default), three
       additional escape sequences that match characters with specific
       properties are available. In 8-bit non-UTF-8 mode, these sequences are
       of course limited to testing characters whose codepoints are less than
       256, but they do work in this mode. The extra escape sequences are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       a Unicode extended grapheme cluster

       The property names represented by xx above are limited to the Unicode
       script names, the general category properties, "Any", which matches any
       character (including newline), and some special PCRE2 properties
       (described in the next section). Other Perl properties such as
       "InMusicalSymbols" are not supported by PCRE2. Note that \P{Any} does
       not match any characters, so always causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from one of these sets can be matched using a script name.
       For example:

         \p{Greek}
         \P{Han}

       Those that are not part of an identified script are lumped together as
       "Common".
       The current list of scripts is:

       Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese,
       Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese,
       Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham,
       Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
       Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic,
       Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han,
       Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
       Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi,
       Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
       Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian,
       Mahajani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui,
       Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
       Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki,
       Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
       Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
       Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic,
       Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
       Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
       Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai,
       Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.

       Each character has exactly one Unicode general category property,
       specified by a two-letter abbreviation. For compatibility with Perl,
       negation can be specified by including a circumflex between the opening
       brace and the property name. For example, \p{^Lu} is the same as
       \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the
       general category properties that start with that letter.
       In this case, in the absence of negation, the curly brackets in the
       escape sequence are optional; these two examples have the same effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The special property L& is also supported: it matches a character that
       has the Lu, Ll, or Lt property, in other words, a letter that is not
       classified as a modifier or "other".

       The Cs (Surrogate) property applies only to characters in the range
       U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
       so cannot be tested by PCRE2, unless UTF validity checking has been
       turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
       page). Perl does not support the Cs property.

       The long synonyms for property names that Perl supports (such as
       \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned)
       property.
       Instead, this property is assumed for any code point that is not in
       the Unicode table.

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters. This is
       different from the behaviour of current versions of Perl.

       Matching characters by Unicode property is not fast, because PCRE2 has
       to do a multistage table lookup in order to find a character's
       property. That is why the traditional escape sequences such as \d and
       \w do not use Unicode properties in PCRE2 by default, though you can
       make them do so by setting the PCRE2_UCP option or by starting the
       pattern with (*UCP).

   Extended grapheme clusters

       The \X escape matches any number of Unicode characters that form an
       "extended grapheme cluster", and treats the sequence as an atomic group
       (see below). Unicode supports various kinds of composite character by
       giving each character a grapheme breaking property, and having rules
       that use these properties to define the boundaries of extended grapheme
       clusters. \X always matches at least one character. Then it decides
       whether to add additional characters according to the following rules
       for ending a cluster:

       1. End at the end of the subject string.

       2. Do not end between CR and LF; otherwise end after any control
       character.

       3. Do not break Hangul (a Korean script) syllable sequences. Hangul
       characters are of five types: L, V, T, LV, and LVT. An L character may
       be followed by an L, V, LV, or LVT character; an LV or V character may
       be followed by a V or T character; an LVT or T character may be
       followed only by a T character.

       4. Do not end before extending characters or spacing marks. Characters
       with the "mark" property always have the "extend" grapheme breaking
       property.

       5. Do not end after prepend characters.

       6. Otherwise, end the cluster.

   PCRE2's additional properties

       As well as the standard Unicode properties described above, PCRE2
       supports four more that make it possible to convert traditional escape
       sequences such as \w and \s to use Unicode properties. PCRE2 uses these
       non-standard, non-Perl properties internally when PCRE2_UCP is set.
       However, they may also be used explicitly. These properties are:

         Xan   Any alphanumeric character
         Xps   Any POSIX space character
         Xsp   Any Perl space character
         Xwd   Any Perl "word" character

       Xan matches characters that have either the L (letter) or the N
       (number) property. Xps matches the characters tab, linefeed, vertical
       tab, form feed, or carriage return, and any other character that has
       the Z (separator) property. Xsp is the same as Xps; in PCRE1 it used to
       exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
       matches the same characters as Xan, plus underscore.

       There is another non-standard property, Xuc, which matches any
       character that can be represented by a Universal Character Name in C++
       and other programming languages. These are the characters $, @, `
       (grave accent), and all characters with Unicode code points greater
       than or equal to U+00A0, except for the surrogates U+D800 to U+DFFF.
       Note that most base (ASCII) characters are excluded. (Universal
       Character Names are of the form \uHHHH or \UHHHHHHHH where H is a
       hexadecimal digit. Note that the Xuc property does not match these
       sequences but the characters that they represent.)

   Resetting the match start

       The escape sequence \K causes any previously matched characters not to
       be included in the final matched sequence. For example, the pattern:

         foo\Kbar

       matches "foobar", but reports that it has matched "bar".
       This feature is similar to a lookbehind assertion (described below).
       However, in this case, the part of the subject before the real match
       does not have to be of fixed length, as lookbehind assertions do. The
       use of \K does not interfere with the setting of captured substrings.
       For example, when the pattern

         (foo)\Kbar

       matches "foobar", the first substring is still set to "foo".

       Perl documents that the use of \K within assertions is "not well
       defined". In PCRE2, \K is acted upon when it occurs inside positive
       assertions, but is ignored in negative assertions. Note that when a
       pattern such as (?=ab\K) matches, the reported start of the match can
       be greater than the end of the match.

   Simple assertions

       The final use of backslash is for certain simple assertions. An
       assertion specifies a condition that has to be met at a particular
       point in a match, without consuming any characters from the subject
       string. The use of subpatterns for more complicated assertions is
       described below. The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at the start of the subject
         \Z     matches at the end of the subject
                 also matches before a newline at the end of the subject
         \z     matches only at the end of the subject
         \G     matches at the first matching position in the subject

       Inside a character class, \b has a different meaning; it matches the
       backspace character. If any other of these assertions appears in a
       character class, an "invalid escape sequence" error is generated.

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively.
       In a UTF mode, the meanings of \w and \W can be changed by setting the
       PCRE2_UCP option. When this is done, it also affects \b and \B. Neither
       PCRE2 nor Perl has a separate "start of word" or "end of word"
       metasequence. However, whatever follows \b normally determines which it
       is. For example, the fragment \ba matches "a" at the start of a word.

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at the very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These three
       assertions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL
       options, which affect only the behaviour of the circumflex and dollar
       metacharacters. However, if the startoffset argument of pcre2_match()
       is non-zero, indicating that matching is to start at a point other than
       the beginning of the subject, \A can never match. The difference
       between \Z and \z is that \Z matches before a newline at the end of the
       string as well as at the very end, whereas \z matches only at the end.

       The \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset argument
       of pcre2_match(). It differs from \A when the value of startoffset is
       non-zero. By calling pcre2_match() multiple times with appropriate
       arguments, you can mimic Perl's /g option, and it is in this kind of
       implementation where \G can be useful.

       Note, however, that PCRE2's interpretation of \G, as the start of the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can be different when the
       previously matched string was empty. Because PCRE2 does just one match
       at a time, it cannot reproduce this behaviour.
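
       The positions at which \A, \Z, \z, and \G are true can be summarized
       as simple tests on offsets. The sketch below only illustrates the
       rules described above and is not how PCRE2 is implemented: the
       function assertion_holds() and its parameters are hypothetical, and
       the newline convention is assumed to be a single LF.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE2 API.
   kind        : one of 'A', 'Z', 'z', 'G'
   subject     : the subject string, of the given length
   startoffset : the offset at which matching was asked to start
   pos         : the current matching position (pos <= length)
   Returns 1 if the assertion is true at pos, otherwise 0.
   Assumes that a newline is a single LF character. */
int assertion_holds(char kind, const char *subject, size_t length,
                    size_t startoffset, size_t pos)
{
    switch (kind) {
    case 'A':   /* only at the very start of the subject */
        return pos == 0;
    case 'z':   /* only at the very end of the subject */
        return pos == length;
    case 'Z':   /* at the very end, or before a newline that ends it */
        return pos == length ||
               (pos + 1 == length && subject[pos] == '\n');
    case 'G':   /* only at the start point of the match */
        return pos == startoffset;
    default:
        return 0;
    }
}
```

       For the four-character subject "abc\n", \Z is true at offsets 3 and 4,
       whereas \z is true only at offset 4. With a startoffset of 2, \G is
       true at offset 2 but \A is not.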

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       The circumflex and dollar metacharacters are zero-width assertions.
       That is, they test for a particular condition being true without
       consuming any characters from the subject string. These two
       metacharacters are concerned with matching the starts and ends of
       lines. If the newline convention is set so that only the two-character
       sequence CRLF is recognized as a newline, isolated CR and LF characters
       are treated as ordinary data characters, and are not recognized as
       newlines.

       Outside a character class, in the default matching mode, the circumflex
       character is an assertion that is true only if the current matching
       point is at the start of the subject string. If the startoffset
       argument of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set,
       circumflex can never match if the PCRE2_MULTILINE option is unset.
       Inside a character class, circumflex has an entirely different meaning
       (see below).

       Circumflex need not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start of the
       subject, it is said to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)

       The dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default), unless
       PCRE2_NOTEOL is set.
       Note, however, that it does not actually match the newline. Dollar need
       not be the last character of the pattern if a number of alternatives
       are involved, but it should be the last item in any branch in which it
       appears. Dollar has no special meaning in a character class.

       The meaning of dollar can be changed so that it matches only at the
       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar metacharacters are changed if
       the PCRE2_MULTILINE option is set. When this is the case, a dollar
       character matches before any newlines in the string, as well as at the
       very end, and a circumflex matches immediately after internal newlines
       as well as at the start of the subject string. It does not match after
       a newline that ends the string, for compatibility with Perl. However,
       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.

       For example, the pattern /^abc$/ matches the subject string "def\nabc"
       (where \n represents a newline) in multiline mode, but not otherwise.
       Consequently, patterns that are anchored in single line mode because
       all branches start with ^ are not anchored in multiline mode, and a
       match for circumflex is possible when the startoffset argument of
       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set.

       When the newline convention (see "Newline conventions" below)
       recognizes the two-character sequence CRLF as a newline, this is
       preferred, even if the single characters CR and LF are also recognized
       as newlines. For example, if the newline convention is "any", a
       multiline mode circumflex matches before "xyz" in the string
       "abc\r\nxyz" rather than after CR, even though CR on its own is a valid
       newline.
       (It also matches at the very start of the string, of course.)

       Note that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a pattern
       start with \A it is always anchored, whether or not PCRE2_MULTILINE is
       set.


FULL STOP (PERIOD, DOT) AND \N

       Outside a character class, a dot in the pattern matches any one
       character in the subject string except (by default) a character that
       signifies the end of a line.

       When a line ending is defined as a single character, dot never matches
       that character; when the two-character sequence CRLF is used, dot does
       not match CR if it is immediately followed by LF, but otherwise it
       matches all characters (including isolated CRs and LFs). When any
       Unicode line endings are being recognized, dot does not match CR or LF
       or any of the other line ending characters.

       The behaviour of dot with regard to newlines can be changed. If the
       PCRE2_DOTALL option is set, a dot matches any one character, without
       exception. If the two-character sequence CRLF is present in the
       subject string, it takes two dots to match it.

       The handling of dot is entirely independent of the handling of
       circumflex and dollar, the only relationship being that they both
       involve newlines. Dot has no special meaning in a character class.

       The escape sequence \N behaves like a dot, except that it is not
       affected by the PCRE2_DOTALL option. In other words, it matches any
       character except one that signifies the end of a line. Perl also uses
       \N to match characters by name; PCRE2 does not support this.


MATCHING A SINGLE CODE UNIT

       Outside a character class, the escape sequence \C matches any one code
       unit, whether or not a UTF mode is set.
       In the 8-bit library, one code unit is one byte; in the 16-bit library
       it is a 16-bit unit; in the 32-bit library it is a 32-bit unit. Unlike
       a dot, \C always matches line-ending characters. The feature is
       provided in Perl in order to match individual bytes in UTF-8 mode, but
       it is unclear how it can usefully be used.

       Because \C breaks up characters into individual code units, matching
       one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
       string may start with a malformed UTF character. This has undefined
       results, because PCRE2 assumes that it is matching character by
       character in a valid UTF string (by default it checks the subject
       string's validity at the start of processing unless the
       PCRE2_NO_UTF_CHECK option is used).

       An application can lock out the use of \C by setting the
       PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       PCRE2 does not allow \C to appear in lookbehind assertions (described
       below) in UTF-8 or UTF-16 modes, because this would make it impossible
       to calculate the length of the lookbehind. Neither the alternative
       matching function pcre2_dfa_match() nor the JIT optimizer supports \C
       in these UTF modes. The former gives a match-time error; the latter
       fails to optimize and so the match is always run using the interpreter.

       In the 32-bit library, however, \C is always supported (when not
       explicitly locked out) because it always matches a single code unit,
       whether or not UTF-32 is specified.

       In general, the \C escape sequence is best avoided.
       However, one way of using it that avoids the problem of malformed UTF-8
       or UTF-16 characters is to use a lookahead to check the length of the
       next character, as in this pattern, which could be used with a UTF-8
       string (ignore white space and line breaks):

         (?| (?=[\x00-\x7f])(\C) |
             (?=[\x80-\x{7ff}])(\C)(\C) |
             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       In this example, a group that starts with (?| resets the capturing
       parentheses numbers in each alternative (see "Duplicate Subpattern
       Numbers" below). The assertions at the start of each branch check the
       next UTF-8 character for values whose encoding uses 1, 2, 3, or 4
       bytes, respectively. The character's individual bytes are then captured
       by the appropriate number of \C groups.


SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not
       special by default. If a closing square bracket is required as a member
       of the class, it should be the first data character in the class (after
       an initial circumflex, if present) or escaped with a backslash. This
       means that, by default, an empty class cannot be defined. However, if
       the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
       the start does end the (empty) class.

       A character class matches a single character in the subject. A matched
       character must be in the set of characters defined by the class, unless
       the first character in the class definition is a circumflex, in which
       case the subject character must not be in the set defined by the class.
       If a circumflex is actually required as a member of the class, ensure
       it is not the first character, or escape it with a backslash.
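
       The placement rules for "]" and "^" inside a class can be tried out
       with a short sketch. It uses Python's built-in re module purely as a
       stand-in (it is not PCRE2), on the assumption that it follows the same
       default rules for these two characters. The helper name in_class is
       hypothetical.

```python
import re

# "]" placed first in the class (or right after "^") is a literal member.
bracket_or_a = re.compile(r"[]a]")
not_bracket_or_a = re.compile(r"[^]a]")

# "^" is only special at the start of the class; elsewhere it is a member.
caret_or_a = re.compile(r"[a^]")

def in_class(pattern, ch):
    """Return True if the single character ch is matched by the class."""
    return pattern.fullmatch(ch) is not None

print(in_class(bracket_or_a, "]"),
      in_class(not_bracket_or_a, "]"),
      in_class(caret_or_a, "^"))
```

       The empty-class behaviour selected by PCRE2_ALLOW_EMPTY_CLASS has no
       counterpart in this stand-in and is not shown.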

       For example, the character class [aeiou] matches any lower case vowel,
       while [^aeiou] matches any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters that are in the class by enumerating those that are not. A
       class that starts with a circumflex is not an assertion; it still
       consumes a character from the subject string, and therefore it fails if
       the current pointer is at the end of the string.

       When caseless matching is set, any letters in a class represent both
       their upper case and lower case versions, so for example, a caseless
       [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would.

       Characters that might indicate line breaks are never treated in any
       special way when matching character classes, whatever line-ending
       sequence is in use, and whatever setting of the PCRE2_DOTALL and
       PCRE2_MULTILINE options is used. A class such as [^a] always matches
       one of these characters.

       The minus (hyphen) character can be used to specify a range of
       characters in a character class. For example, [d-m] matches any letter
       between d and m, inclusive. If a minus character is required in a
       class, it must be escaped with a backslash or appear in a position
       where it cannot be interpreted as indicating a range, typically as the
       first or last character in the class, or immediately after a range. For
       example, [b-d-z] matches letters in the range b to d, a hyphen
       character, or z.

       It is not possible to have the literal character "]" as the end
       character of a range. A pattern such as [W-]46] is interpreted as a
       class of two characters ("W" and "-") followed by a literal string
       "46]", so it would match "W46]" or "-46]".
       However, if the "]" is escaped with a backslash it is interpreted as
       the end of range, so [W-\]46] is interpreted as a class containing a
       range followed by two other characters. The octal or hexadecimal
       representation of "]" can also be used to end a range.

       An error is generated if a POSIX character class (see below) or an
       escape sequence other than one that defines a single character appears
       at a point where a range ending character is expected. For example,
       [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.

       Ranges normally include all code points between the start and end
       characters, inclusive. They can also be used for code points specified
       numerically, for example [\000-\037]. Ranges can include any characters
       that are valid for the current mode.

       There is a special case in EBCDIC environments for ranges whose end
       points are both specified as literal letters in the same case. For
       compatibility with Perl, EBCDIC code points within the range that are
       not letters are omitted. For example, [h-k] matches only four
       characters, even though the codes for h and k are 0x88 and 0x92, a
       range of 11 code points. However, if the range is specified
       numerically, for example, [\x88-\x92] or [h-\x92], all code points are
       included.

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
       character tables for a French locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases.

       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
       \w, and \W may appear in a character class, and add the characters that
       they match to the class. For example, [\dABCDEF] matches any
       hexadecimal digit.
       In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w
       and their upper case partners, just as it does when they appear
       outside a character class, as described in the section entitled
       "Generic character types" above. The escape sequence \b has a different
       meaning inside a character class; it matches the backspace character.
       The sequences \B, \N, \R, and \X are not special inside a character
       class. Like any other unrecognized escape sequences, they cause an
       error.

       A circumflex can conveniently be used with the upper case character
       types to specify a more restricted set of characters than the matching
       lower case type. For example, the class [^\W_] matches any letter or
       digit, but not underscore, whereas [\w] includes underscore. A positive
       character class should be read as "something OR something OR ..." and a
       negative class as "NOT something AND NOT something AND NOT ...".

       The only metacharacters that are recognized in character classes are
       backslash, hyphen (only where it can be interpreted as specifying a
       range), circumflex (only at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name, or for a
       special compatibility feature - see the next two sections), and the
       terminating closing square bracket. However, escaping other
       non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed by [: and :] within the enclosing square brackets. PCRE2 also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%".
       The supported class names are:

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits and space
         space    white space (the same as \s from PCRE1 8.34)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The default "space" characters are HT (9), LF (10), VT (11), FF (12),
       CR (13), and space (32). If locale-specific matching is taking place,
       the list of space characters may be different; there may be fewer or
       more of them. "Space" and \s match the same set of characters.

       The name "word" is a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       By default, characters with values greater than 127 do not match any of
       the POSIX character classes, although this may be different for
       characters in the range 128-255 when locale-specific matching is
       happening. However, if the PCRE2_UCP option is passed to
       pcre2_compile(), some of the classes are changed so that Unicode
       character properties are used.
       This is achieved by replacing certain POSIX classes with other
       sequences, as follows:

         [:alnum:]  becomes  \p{Xan}
         [:alpha:]  becomes  \p{L}
         [:blank:]  becomes  \h
         [:cntrl:]  becomes  \p{Cc}
         [:digit:]  becomes  \p{Nd}
         [:lower:]  becomes  \p{Ll}
         [:space:]  becomes  \p{Xps}
         [:upper:]  becomes  \p{Lu}
         [:word:]   becomes  \p{Xwd}

       Negated versions, such as [:^alpha:], use \P instead of \p. Three other
       POSIX classes are handled specially in UCP mode:

       [:graph:] This matches characters that have glyphs that mark the page
                 when printed. In Unicode property terms, it matches all
                 characters with the L, M, N, P, S, or Cf properties, except
                 for:

                   U+061C           Arabic Letter Mark
                   U+180E           Mongolian Vowel Separator
                   U+2066 - U+2069  Various "isolate"s

       [:print:] This matches the same characters as [:graph:] plus space
                 characters that are not controls, that is, characters with
                 the Zs property.

       [:punct:] This matches all characters that have the Unicode P
                 (punctuation) property, plus those characters with code
                 points less than 256 that have the S (Symbol) property.

       The other POSIX classes are unchanged, and match only characters with
       code points less than 256.


COMPATIBILITY FEATURE FOR WORD BOUNDARIES

       In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
       ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
       and "end of word". PCRE2 treats these items as follows:

         [[:<:]]  is converted to  \b(?=\w)
         [[:>:]]  is converted to  \b(?<=\w)

       Only these exact character sequences are recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
       support is not compatible with Perl. It is provided to help migrations
       from other environments, and is best not used in any new patterns.
       Note that \b matches at the start and the end of a word (see "Simple
       assertions" above), and in a Perl-style pattern the preceding or
       following character normally shows which is wanted, without the need
       for the assertions that are used above in order to give exactly the
       POSIX behaviour.


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns. For
       example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty alternative is permitted (matching the empty
       string). The matching process tries each alternative in turn, from left
       to right, and the first one that succeeds is used. If the alternatives
       are within a subpattern (defined below), "succeeds" means matching the
       rest of the main pattern as well as the alternative in the subpattern.


INTERNAL OPTION SETTING

       The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
       PCRE2_EXTENDED options (which are Perl-compatible) can be changed from
       within the pattern by a sequence of Perl option letters enclosed
       between "(?" and ")". The option letters are

         i  for PCRE2_CASELESS
         m  for PCRE2_MULTILINE
         s  for PCRE2_DOTALL
         x  for PCRE2_EXTENDED

       For example, (?im) sets caseless, multiline matching. It is also
       possible to unset these options by preceding the letter with a hyphen,
       and a combined setting and unsetting such as (?im-sx), which sets
       PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
       PCRE2_EXTENDED, is also permitted. If a letter appears both before and
       after the hyphen, the option is unset. An empty options setting "(?)"
       is allowed. Needless to say, it has no effect.
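
       The effect of a leading (?im) setting can be illustrated with a small
       sketch. It is written for Python's built-in re module, which is used
       only as a stand-in and is not PCRE2. Unlike the behaviour described
       above, that module accepts an unscoped option setting only at the very
       start of a pattern, and it does not accept the unsetting form (?im-sx)
       or the empty setting "(?)", so only the leading case is shown. The
       names SUBJECT and finds are hypothetical.

```python
import re

SUBJECT = "def\nABC"

def finds(pattern):
    """Return the matched text, or None if the pattern does not match."""
    m = re.search(pattern, SUBJECT)
    return m.group() if m else None

print(finds(r"^abc$"))        # no options set
print(finds(r"(?i)^abc$"))    # caseless only
print(finds(r"(?im)^abc$"))   # caseless and multiline
```

       Only the last pattern matches: the subject needs both caseless
       matching (for "ABC") and multiline mode (so that the circumflex can
       match after the internal newline).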

       The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
       changed in the same way as the Perl-compatible options by using the
       characters J and U respectively.

       When one of these option changes occurs at top level (that is, not
       inside subpattern parentheses), the change applies to the remainder of
       the pattern that follows. If the change is placed right at the start of
       a pattern, PCRE2 extracts it into the global options (and it will
       therefore show up in data extracted by the pcre2_pattern_info()
       function).

       An option change within a subpattern (see below for a description of
       subpatterns) affects only that part of the subpattern that follows it,
       so

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
       not used). By this means, options can be made to have different
       settings in different parts of the pattern. Any changes made in one
       alternative do carry on into subsequent branches within the same
       subpattern. For example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though when matching "C" the
       first branch is abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There would be
       some very weird behaviour otherwise.

       As a convenient shorthand, if any option settings are required at the
       start of a non-capturing subpattern (see the next section), the option
       letters may appear between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings.

       Note: There are other PCRE2-specific options that can be set by the
       application when the compiling function is called. The pattern can
       contain special leading sequences such as (*CRLF) to override what the
       application has set or what has been defaulted.
       Details are given in the section entitled "Newline sequences" above.
       There are also the (*UTF) and (*UCP) leading sequences that can be used
       to set UTF and Unicode property modes; they are equivalent to setting
       the PCRE2_UTF and PCRE2_UCP options, respectively. However, the
       application can set the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options,
       which lock out the use of the (*UTF) and (*UCP) sequences.


SUBPATTERNS

       Subpatterns are delimited by parentheses (round brackets), which can be
       nested. Turning part of a pattern into a subpattern does two things:

       1. It localizes a set of alternatives. For example, the pattern

         cat(aract|erpillar|)

       matches "cataract", "caterpillar", or "cat". Without the parentheses,
       it would match "cataract", "erpillar" or an empty string.

       2. It sets up the subpattern as a capturing subpattern. This means
       that, when the whole pattern matches, the portion of the subject string
       that matched the subpattern is passed back to the caller, separately
       from the portion that matched the whole pattern. (This applies only to
       the traditional matching function; the DFA matching function does not
       support capturing.)

       Opening parentheses are counted from left to right (starting from 1) to
       obtain numbers for the capturing subpatterns. For example, if the
       string "the red king" is matched against the pattern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king", and are
       numbered 1, 2, and 3, respectively.

       The fact that plain parentheses fulfil two functions is not always
       helpful. There are often times when a grouping subpattern is required
       without a capturing requirement.
       If an opening parenthesis is followed by a question mark and a colon,
       the subpattern does not do any capturing, and is not counted when
       computing the number of any subsequent capturing subpatterns. For
       example, if the string "the white queen" is matched against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1 and 2. The maximum number of capturing subpatterns is 65535.

       As a convenient shorthand, if any option settings are required at the
       start of a non-capturing subpattern, the option letters may appear
       between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried from left to right, and options are not reset until the end of
       the subpattern is reached, an option setting in one branch does affect
       subsequent branches, so the above patterns match "SUNDAY" as well as
       "Saturday".


DUPLICATE SUBPATTERN NUMBERS

       Perl 5.10 introduced a feature whereby each alternative in a subpattern
       uses the same numbers for its capturing parentheses. Such a subpattern
       starts with (?| and is itself a non-capturing subpattern. For example,
       consider this pattern:

         (?|(Sat)ur|(Sun))day

       Because the two alternatives are inside a (?| group, both sets of
       capturing parentheses are numbered one. Thus, when the pattern matches,
       you can look at captured substring number one, whichever alternative
       matched. This construct is useful when you want to capture part, but
       not all, of one of a number of alternatives. Inside a (?| group,
       parentheses are numbered as usual, but the number is reset at the start
       of each branch.
       The numbers of any capturing parentheses that follow the subpattern
       start after the highest number used in any branch. The following
       example is taken from the Perl documentation. The numbers underneath
       show in which buffer the captured content will be stored.

         # before  ---------------branch-reset----------- after
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
         # 1            2         2  3        2     3     4

       A back reference to a numbered subpattern uses the most recent value
       that is set for that number by any subpattern. The following pattern
       matches "abcabc" or "defdef":

         /(?|(abc)|(def))\1/

       In contrast, a subroutine call to a numbered subpattern always refers
       to the first one in the pattern with the given number. The following
       pattern matches "abcabc" or "defabc":

         /(?|(abc)|(def))(?1)/

       A relative reference such as (?-1) is no different: it is just a
       convenient way of computing an absolute group number.

       If a condition test for a subpattern's having matched refers to a
       non-unique number, the test is true if any of the subpatterns of that
       number have matched.

       An alternative approach to using this "branch reset" feature is to use
       duplicate named subpatterns, as described in the next section.


NAMED SUBPATTERNS

       Identifying capturing parentheses by number is simple, but it can be
       very hard to keep track of the numbers in complicated regular
       expressions. Furthermore, if an expression is modified, the numbers may
       change. To help with this difficulty, PCRE2 supports the naming of
       subpatterns. This feature was not added to Perl until release 5.10.
       Python had the feature earlier, and PCRE1 introduced it at release 4.0,
       using the Python syntax. PCRE2 supports both the Perl and the Python
       syntax.
       Perl allows identically numbered subpatterns to have different names,
       but PCRE2 does not.

       In PCRE2, a subpattern can be named in one of three ways: (?<name>...)
       or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
       to capturing parentheses from other parts of the pattern, such as back
       references, recursion, and conditions, can be made by name as well as
       by number.

       Names consist of up to 32 alphanumeric characters and underscores, but
       must start with a non-digit. Named capturing parentheses are still
       allocated numbers as well as names, exactly as if the names were not
       present. The PCRE2 API provides function calls for extracting the
       name-to-number translation table from a compiled pattern. There are
       also convenience functions for extracting a captured substring by name.

       By default, a name must be unique within a pattern, but it is possible
       to relax this constraint by setting the PCRE2_DUPNAMES option at
       compile time. (Duplicate names are also always permitted for
       subpatterns with the same number, set up as described in the previous
       section.) Duplicate names can be useful for patterns where only one
       instance of the named parentheses can match. Suppose you want to match
       the name of a weekday, either as a 3-letter abbreviation or as the full
       name, and in both cases you want to extract the abbreviation. This
       pattern (ignoring the line breaks) does the job:

         (?<DN>Mon|Fri|Sun)(?:day)?|
         (?<DN>Tue)(?:sday)?|
         (?<DN>Wed)(?:nesday)?|
         (?<DN>Thu)(?:rsday)?|
         (?<DN>Sat)(?:urday)?

       There are five capturing substrings, but only one is ever set after a
       match. (An alternative way of solving this problem is to use a "branch
       reset" subpattern, as described in the previous section.)

       The convenience functions for extracting the data by name return the
       substring for the first (and in this example, the only) subpattern of
       that name that matched. This saves searching to find which numbered
       subpattern it was.

       If you make a back reference to a non-unique named subpattern from
       elsewhere in the pattern, the subpatterns to which the name refers are
       checked in the order in which they appear in the overall pattern. The
       first one that is set is used for the reference. For example, this
       pattern matches both "foofoo" and "barbar" but not "foobar" or
       "barfoo":

         (?:(?<n>foo)|(?<n>bar))\k<n>

       If you make a subroutine call to a non-unique named subpattern, the one
       that corresponds to the first occurrence of the name is used. In the
       absence of duplicate numbers (see the previous section) this is the one
       with the lowest number.

       If you use a named reference in a condition test (see the section about
       conditions below), either to check whether a subpattern has matched, or
       to check for recursion, all subpatterns with the same name are tested.
       If the condition is true for any one of them, the overall condition is
       true. This is the same behaviour as testing by number. For further
       details of the interfaces for handling named subpatterns, see the
       pcre2api documentation.

       Warning: You cannot use different names to distinguish between two
       subpatterns with the same number because PCRE2 uses only the numbers
       when matching. For this reason, an error is given at compile time if
       different names are given to subpatterns with the same number. However,
       you can always give the same name to subpatterns with the same number,
       even when PCRE2_DUPNAMES is not set.
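
       The point that named parentheses are still numbered, and that a
       name-to-number table can be read back from a compiled pattern, can be
       sketched as follows. The sketch uses Python's built-in re module as a
       stand-in (it is not the PCRE2 API): the (?P<name>...) spelling is the
       Python syntax mentioned in this section, and Pattern.groupindex is
       assumed here to be a fair analogue of the PCRE2 name table. Duplicate
       names are not shown because the stand-in does not allow them. The
       names DATE and name_table, and the sample date, are hypothetical.

```python
import re

# Three named groups; each one also receives the usual left-to-right number.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def name_table(pattern):
    """Name-to-number table of a compiled pattern, ordered by group number."""
    return sorted(pattern.groupindex.items(), key=lambda item: item[1])

m = DATE.fullmatch("2015-04-13")
print(name_table(DATE))
print(m.group("month"), m.group(2))   # the same substring, by name and number
```

       Extracting by name and by number gives the same substring, which is
       why adding names to an existing pattern does not disturb code that
       already refers to the groups by number.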


REPETITION

       Repetition is specified by quantifiers, which can follow any of the
       following items:

         a literal data character
         the dot metacharacter
         the \C escape sequence
         the \X escape sequence
         the \R escape sequence
         an escape such as \d or \pL that matches a single character
         a character class
         a back reference
         a parenthesized subpattern (including most assertions)
         a subroutine call to a subpattern (recursive or otherwise)

       The general repetition quantifier specifies a minimum and maximum
       number of permitted matches, by giving the two numbers in curly
       brackets (braces), separated by a comma. The numbers must be less than
       65536, and the first must be less than or equal to the second. For
       example:

         z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
       special character. If the second number is omitted, but the comma is
       present, there is no upper limit; if the second number and the comma
       are both omitted, the quantifier specifies an exact number of required
       matches. Thus

         [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, whereas

         \d{8}

       matches exactly 8 digits. An opening curly bracket that appears in a
       position where a quantifier is not allowed, or one that does not match
       the syntax of a quantifier, is taken as a literal character. For
       example, {,6} is not a quantifier, but a literal string of four
       characters.

       In UTF modes, quantifiers apply to characters rather than to individual
       code units. Thus, for example, \x{100}{2} matches two characters, each
       of which is represented by a two-byte sequence in a UTF-8 string.
       Similarly, \X{3} matches three Unicode extended grapheme clusters, each
       of which may be several code units long (and they may be of different
       lengths).
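
       The three quantifier forms described above ({min,max}, {min,} and {n})
       can be tried out with the short sketch below. It uses Python's re
       module as a stand-in, not PCRE2; the helper name whole_match() is
       invented for the example.

```python
import re


def whole_match(pattern, text):
    """True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


# {2,4}: between two and four repetitions.
print([whole_match(r"z{2,4}", "z" * n) for n in range(1, 6)])
# [False, True, True, True, False]

# {3,}: at least three repetitions, with no upper limit.
print(whole_match(r"[aeiou]{3,}", "ae"), whole_match(r"[aeiou]{3,}", "aeiouaeiou"))
# False True

# {8}: exactly eight repetitions.
print(whole_match(r"\d{8}", "1234567"), whole_match(r"\d{8}", "12345678"))
# False True
```

       The rule that {,6} is a literal string is not reproduced here, because
       Python's re treats {,6} as a quantifier meaning {0,6}.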

       The quantifier {0} is permitted, causing the expression to behave as if
       the previous item and the quantifier were not present. This may be
       useful for subpatterns that are referenced as subroutines from
       elsewhere in the pattern (but see also the section entitled "Defining
       subpatterns for use by reference only" below). Items other than
       subpatterns that have a {0} quantifier are omitted from the compiled
       pattern.

       For convenience, the three most common quantifiers have
       single-character abbreviations:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}

       It is possible to construct infinite loops by following a subpattern
       that can match no characters with a quantifier that has no upper limit,
       for example:

         (a?)*

       Earlier versions of Perl and PCRE1 used to give an error at compile
       time for such patterns. However, because there are cases where this can
       be useful, such patterns are now accepted, but if any repetition of the
       subpattern does in fact match no characters, the loop is forcibly
       broken.

       By default, the quantifiers are "greedy", that is, they match as much
       as possible (up to the maximum number of permitted times), without
       causing the rest of the pattern to fail. The classic example of where
       this gives problems is in trying to match comments in C programs. These
       appear between /* and */ and within the comment, individual * and /
       characters may appear. An attempt to match C comments by applying the
       pattern

         /\*.*\*/

       to the string

         /* first comment */ not comment /* second comment */

       fails, because it matches the entire string owing to the greediness of
       the .* item.
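
       The greedy behaviour described above can be observed with the sketch
       below, which uses Python's re module as a stand-in rather than PCRE2.
       The trailing " end" on the subject is an addition made for this
       example; it shows that .* gives back just enough for the final \*/ to
       match.

```python
import re

COMMENT = re.compile(r"/\*.*\*/")

subject = "/* first comment */ not comment /* second comment */ end"

# The greedy .* runs to the end of the subject and then backs off only as
# far as the last "*/", so both comments and the text between them are
# swallowed in a single match.
m = COMMENT.search(subject)
print(m.group())
# /* first comment */ not comment /* second comment */

# Only one match is found, although the subject holds two comments.
print(len(COMMENT.findall(subject)))   # 1
```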

       If a quantifier is followed by a question mark, it ceases to be greedy,
       and instead matches the minimum number of times possible, so the
       pattern

         /\*.*?\*/

       does the right thing with the C comments. The meaning of the various
       quantifiers is not otherwise changed, just the preferred number of
       matches. Do not confuse this use of question mark with its use as a
       quantifier in its own right. Because it has two uses, it can sometimes
       appear doubled, as in

         \d??\d

       which matches one digit by preference, but can match two if that is the
       only way the rest of the pattern matches.

       If the PCRE2_UNGREEDY option is set (an option that is not available in
       Perl), the quantifiers are not greedy by default, but individual ones
       can be made greedy by following them with a question mark. In other
       words, it inverts the default behaviour.

       When a parenthesized subpattern is quantified with a minimum repeat
       count that is greater than 1 or with a limited maximum, more memory is
       required for the compiled pattern, in proportion to the size of the
       minimum or maximum.

       If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
       (equivalent to Perl's /s) is set, thus allowing the dot to match
       newlines, the pattern is implicitly anchored, because whatever follows
       will be tried against every character position in the subject string,
       so there is no point in retrying the overall match at any position
       after the first. PCRE2 normally treats such a pattern as though it were
       preceded by \A.

       In cases where it is known that the subject string contains no
       newlines, it is worth setting PCRE2_DOTALL in order to obtain this
       optimization, or alternatively, using ^ to indicate anchoring
       explicitly.

       However, there are some cases where the optimization cannot be used.
       When .* is inside capturing parentheses that are the subject of a back
       reference elsewhere in the pattern, a match at the start may fail where
       a later one succeeds. Consider, for example:

         (.*)abc\1

       If the subject is "xyz123abc123" the match point is the fourth
       character. For this reason, such a pattern is not implicitly anchored.

       Another case where implicit anchoring is not applied is when the
       leading .* is inside an atomic group. Once again, a match at the start
       may fail where a later one succeeds. Consider this pattern:

         (?>.*?a)b

       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.

       When a capturing subpattern is repeated, the value captured is the
       substring that matched the final iteration. For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the captured substring
       is "tweedledee". However, if there are nested capturing subpatterns,
       the corresponding captured values may have been set in previous
       iterations. For example, after

         (a|(b))+

       matches "aba" the value of the second captured substring is "b".


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

       With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the repeated item
       to be re-evaluated to see if a different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail
       earlier than it otherwise might, when the author of the pattern knows
       there is no point in carrying on.
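
       The re-evaluation of a repeated item can be made visible by capturing
       what the repeat finally settles on. The sketch below uses Python's re
       module as a stand-in, not PCRE2, and the two patterns are made up for
       this illustration.

```python
import re

# Greedy: \d+ first takes all six digits, then gives digits back one at a
# time until the following \d\d can match.
greedy = re.fullmatch(r"(\d+)(\d\d)", "123456")
print(greedy.groups())   # ('1234', '56')

# Lazy: \d+? first takes a single digit, then is extended one digit at a
# time until the literal "56" can follow.
lazy = re.fullmatch(r"(\d+?)(56)", "123456")
print(lazy.groups())     # ('1234', '56')
```

       Both repeats end up with the same four digits, but they reach that
       count from opposite directions. It is this adjustment that an atomic
       group or possessive quantifier prevents.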

       Consider, for example, the pattern \d+foo when applied to the subject
       line

         123456bar

       After matching all 6 digits and then failing to match "foo", the normal
       action of the matcher is to try again with only 5 digits matching the
       \d+ item, and then with 4, and so on, before ultimately failing.
       "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
       the means for specifying that once a subpattern has matched, it is not
       to be re-evaluated in this way.

       If we use atomic grouping for the previous example, the matcher gives
       up immediately on failing to match "foo" the first time. The notation
       is a kind of special parenthesis, starting with (?> as in this example:

         (?>\d+)foo

       This kind of parenthesis "locks up" the part of the pattern it
       contains once it has matched, and a failure further into the pattern is
       prevented from backtracking into it. Backtracking past it to previous
       items, however, works as normal.

       An alternative description is that a subpattern of this type matches
       exactly the string of characters that an identical standalone pattern
       would match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be thought of as a maximizing repeat that
       must swallow everything it can. So, while both \d+ and \d+? are
       prepared to adjust the number of digits they match in order to make the
       rest of the pattern match, (?>\d+) can only match an entire sequence of
       digits.

       Atomic groups in general can of course contain arbitrarily complicated
       subpatterns, and can be nested. However, when the subpattern for an
       atomic group is just a single repeated item, as in the example above, a
       simpler notation, called a "possessive quantifier" can be used.
       This consists of an additional + character following a quantifier.
       Using this notation, the previous example can be rewritten as

         \d++foo

       Note that a possessive quantifier can be used with an entire group, for
       example:

         (abc|xyz){2,3}+

       Possessive quantifiers are always greedy; the setting of the
       PCRE2_UNGREEDY option is ignored. They are a convenient notation for
       the simpler forms of atomic group. However, there is no difference in
       the meaning of a possessive quantifier and the equivalent atomic group,
       though there may be a performance difference; possessive quantifiers
       should be slightly faster.

       The possessive quantifier syntax is an extension to the Perl 5.8
       syntax. Jeffrey Friedl originated the idea (and the name) in the first
       edition of his book. Mike McCloskey liked it, so implemented it when he
       built Sun's Java package, and PCRE1 copied it from there. It ultimately
       found its way into Perl at release 5.10.

       PCRE2 has an optimization that automatically "possessifies" certain
       simple pattern constructs. For example, the sequence A+B is treated as
       A++B because there is no point in backtracking into a sequence of A's
       when B must follow. This feature can be disabled by the
       PCRE2_NO_AUTO_POSSESS option, or by starting the pattern with
       (*NO_AUTO_POSSESS).

       When a pattern contains an unlimited repeat inside a subpattern that
       can itself be repeated an unlimited number of times, the use of an
       atomic group is the only way to avoid some failing matches taking a
       very long time indeed. The pattern

         (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist of
       non-digits, or digits enclosed in <>, followed by either ! or ?. When
       it matches, it runs quickly.
       However, if it is applied to

         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes a long time before reporting failure. This is because the
       string can be divided between the internal \D+ repeat and the external
       * repeat in a large number of ways, and all have to be tried. (The
       example uses [!?] rather than a single character at the end, because
       both PCRE2 and Perl have an optimization that allows for fast failure
       when a single character is used. They remember the last single
       character that is required for a match, and fail early if it is not
       present in the string.) If the pattern is changed so that it uses an
       atomic group, like this:

         ((?>\D+)|<\d+>)*[!?]

       sequences of non-digits cannot be broken, and failure happens quickly.


BACK REFERENCES

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a back reference to a capturing
       subpattern earlier (that is, to its left) in the pattern, provided
       there have been that many previous capturing left parentheses.

       However, if the decimal number following the backslash is less than 8,
       it is always taken as a back reference, and causes an error only if
       there are not that many capturing left parentheses in the entire
       pattern. In other words, the parentheses that are referenced need not
       be to the left of the reference for numbers less than 8. A "forward
       back reference" of this type can make sense when a repetition is
       involved and the subpattern to the right has participated in an earlier
       iteration.

       It is not possible to have a numerical "forward back reference" to a
       subpattern whose number is 8 or more using this syntax because a
       sequence such as \50 is interpreted as a character defined in octal.
       See the subsection entitled "Non-printing characters" above for further
       details of the handling of digits following a backslash. There is no
       such problem when named parentheses are used. A back reference to any
       subpattern is possible using named parentheses (see below).

       Another way of avoiding the ambiguity inherent in the use of digits
       following a backslash is to use the \g escape sequence. This escape
       must be followed by an unsigned number or a negative number, optionally
       enclosed in braces. These examples are all identical:

         (ring), \1
         (ring), \g1
         (ring), \g{1}

       An unsigned number specifies an absolute reference without the
       ambiguity that is present in the older syntax. It is also useful when
       literal digits follow the reference. A negative number is a relative
       reference. Consider this example:

         (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a reference to the most recently started
       capturing subpattern before \g, that is, it is equivalent to \2 in this
       example. Similarly, \g{-2} would be equivalent to \1. The use of
       relative references can be helpful in long patterns, and also in
       patterns that are created by joining together fragments that contain
       references within themselves.

       A back reference matches whatever actually matched the capturing
       subpattern in the current subject string, rather than anything matching
       the subpattern itself (see "Subpatterns as subroutines" below for a way
       of doing that). So the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If caseful matching is in force at the
       time of the back reference, the case of letters is relevant.
       For example,

         ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
       original capturing subpattern is matched caselessly.

       There are several different ways of writing back references to named
       subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
       \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
       unified back reference syntax, in which \g can be used for both numeric
       and named references, is also supported. We could rewrite the above
       example in any of the following ways:

         (?<p1>(?i)rah)\s+\k<p1>
         (?'p1'(?i)rah)\s+\k{p1}
         (?P<p1>(?i)rah)\s+(?P=p1)
         (?<p1>(?i)rah)\s+\g{p1}

       A subpattern that is referenced by name may appear in the pattern
       before or after the reference.

       There may be more than one back reference to the same subpattern. If a
       subpattern has not actually been used in a particular match, any back
       references to it always fail by default. For example, the pattern

         (a|(bc))\2

       always fails if it starts to match "a" rather than "bc". However, if
       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back
       reference to an unset value matches an empty string.

       Because there may be many capturing parentheses in a pattern, all
       digits following a backslash are taken as part of a potential back
       reference number. If the pattern continues with a digit character, some
       delimiter must be used to terminate the back reference. If the
       PCRE2_EXTENDED option is set, this can be white space. Otherwise, the
       \g{ syntax or an empty comment (see "Comments" below) can be used.

   Recursive back references

       A back reference that occurs inside the parentheses to which it refers
       fails when the subpattern is first used, so, for example, (a\1) never
       matches.
       However, such references can be useful inside repeated subpatterns.
       For example, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each
       iteration of the subpattern, the back reference matches the character
       string corresponding to the previous iteration. In order for this to
       work, the pattern must be such that the first iteration does not need
       to match the back reference. This can be done using alternation, as in
       the example above, or by a quantifier with a minimum of zero.

       Back references of this type cause the group that they reference to be
       treated as an atomic group. Once the whole group has been matched, a
       subsequent matching failure cannot cause backtracking into the middle
       of the group.


ASSERTIONS

       An assertion is a test on the characters following or preceding the
       current matching point that does not consume any characters. The simple
       assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
       above.

       More complicated assertions are coded as subpatterns. There are two
       kinds: those that look ahead of the current position in the subject
       string, and those that look behind it. An assertion subpattern is
       matched in the normal way, except that it does not cause the current
       matching position to be changed.

       Assertion subpatterns are not capturing subpatterns. If such an
       assertion contains capturing subpatterns within it, these are counted
       for the purposes of numbering the capturing subpatterns in the whole
       pattern. However, substring capturing is carried out only for positive
       assertions. (Perl sometimes, but not always, does do capturing in
       negative assertions.)
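
       The two points above (an assertion consumes nothing, and groups inside
       an assertion are numbered but only positive assertions set them) can
       be seen in the sketch below. It uses Python's re module as a stand-in,
       not PCRE2, and the patterns are made up for this illustration.

```python
import re

# Positive lookahead: the group inside the assertion captures "word", but
# the assertion itself consumes nothing, so the overall match is only the
# single character matched by the trailing \w.
pos = re.match(r"(?=(\w+);)\w", "word;")
print(pos.group(0), pos.group(1))   # w word

# Negative lookahead: the group inside it still counts as group 1, which
# makes (\w) group 2, but group 1 is not set after a successful match.
neg = re.match(r"(?!(\d))(\w)", "a")
print(neg.re.groups, neg.groups())  # 2 (None, 'a')
```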

       For compatibility with Perl, most assertion subpatterns may be
       repeated; though it makes no sense to assert the same thing several
       times, the side effect of capturing parentheses may occasionally be
       useful. However, an assertion that forms the condition for a
       conditional subpattern may not be quantified. In practice, for other
       assertions, there are only three cases:

       (1) If the quantifier is {0}, the assertion is never obeyed during
       matching. However, it may contain internal capturing parenthesized
       groups that are called from elsewhere via the subroutine mechanism.

       (2) If the quantifier is {0,n} where n is greater than zero, it is
       treated as if it were {0,1}. At run time, the rest of the pattern match
       is tried with and without the assertion, the order depending on the
       greediness of the quantifier.

       (3) If the minimum repetition is greater than zero, the quantifier is
       ignored. The assertion is obeyed just once when encountered during
       matching.

   Lookahead assertions

       Lookahead assertions start with (?= for positive assertions and (?! for
       negative assertions. For example,

         \w+(?=;)

       matches a word followed by a semicolon, but does not include the
       semicolon in the match, and

         foo(?!bar)

       matches any occurrence of "foo" that is not followed by "bar". Note
       that the apparently similar pattern

         (?!foo)bar

       does not find an occurrence of "bar" that is preceded by something
       other than "foo"; it finds any occurrence of "bar" whatsoever, because
       the assertion (?!foo) is always true when the next three characters are
       "bar". A lookbehind assertion is needed to achieve the other effect.

       If you want to force a matching failure at some point in a pattern, the
       most convenient way to do it is with (?!)
       because an empty string always matches, so an assertion that requires
       there not to be an empty string must always fail. The backtracking
       control verb (*FAIL) or (*F) is a synonym for (?!).

   Lookbehind assertions

       Lookbehind assertions start with (?<= for positive assertions and (?<!
       for negative assertions. For example,

         (?<!foo)bar

       does find an occurrence of "bar" that is not preceded by "foo". The
       contents of a lookbehind assertion are restricted such that all the
       strings it matches must have a fixed length. However, if there are
       several top-level alternatives, they do not all have to have the same
       fixed length. Thus

         (?<=bullock|donkey)

       is permitted, but

         (?<!dogs?|cats?)

       causes an error at compile time. Branches that match different length
       strings are permitted only at the top level of a lookbehind assertion.
       This is an extension compared with Perl, which requires all branches to
       match the same length of string. An assertion such as

         (?<=ab(c|de))

       is not permitted, because its single top-level branch can match two
       different lengths, but it is acceptable to PCRE2 if rewritten to use
       two top-level branches:

         (?<=abc|abde)

       In some cases, the escape sequence \K (see above) can be used instead
       of a lookbehind assertion to get round the fixed-length restriction.

       The implementation of lookbehind assertions is, for each alternative,
       to temporarily move the current position back by the fixed length and
       then try to match. If there are insufficient characters before the
       current position, the assertion fails.

       In a UTF mode, PCRE2 does not allow the \C escape (which matches a
       single code unit even in a UTF mode) to appear in lookbehind
       assertions, because it makes it impossible to calculate the length of
       the lookbehind.
       The \X and \R escapes, which can match different numbers of code
       units, are also not permitted.

       "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
       lookbehinds, as long as the subpattern matches a fixed-length string.
       Recursion, however, is not supported.

       Possessive quantifiers can be used in conjunction with lookbehind
       assertions to specify efficient matching of fixed-length strings at the
       end of subject strings. Consider a simple pattern such as

         abcd$

       when applied to a long string that does not match. Because matching
       proceeds from left to right, PCRE2 will look for each "a" in the
       subject and then see if what follows matches the rest of the pattern.
       If the pattern is specified as

         ^.*abcd$

       the initial .* matches the entire string at first, but when this fails
       (because there is no following "a"), it backtracks to match all but the
       last character, then all but the last two characters, and so on. Once
       again the search for "a" covers the entire string, from right to left,
       so we are no better off. However, if the pattern is written as

         ^.*+(?<=abcd)

       there can be no backtracking for the .*+ item because of the possessive
       quantifier; it can match only the entire string. The subsequent
       lookbehind assertion does a single test on the last four characters. If
       it fails, the match fails immediately. For long strings, this approach
       makes a significant difference to the processing time.

   Using multiple assertions

       Several assertions (of any sort) may occur in succession. For example,

         (?<=\d{3})(?<!999)foo

       matches "foo" preceded by three digits that are not "999". Notice that
       each of the assertions is applied independently at the same point in
       the subject string.
       First there is a check that the previous three characters are all
       digits, and then there is a check that the same three characters are
       not "999". This pattern does not match "foo" preceded by six
       characters, the first three of which are digits and the last three of
       which are not "999". For example, it doesn't match "123abcfoo". A
       pattern to do that is

         (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the preceding six characters,
       checking that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".

       Assertions can be nested in any combination. For example,

         (?<=(?<!foo)bar)baz

       matches an occurrence of "baz" that is preceded by "bar" which in turn
       is not preceded by "foo", while

         (?<=\d{3}(?!999)...)foo

       is another pattern that matches "foo" preceded by three digits and any
       three characters that are not "999".


CONDITIONAL SUBPATTERNS

       It is possible to cause the matching process to obey a subpattern
       conditionally or to choose between two alternative subpatterns,
       depending on the result of an assertion, or whether a specific
       capturing subpattern has already been matched. The two possible forms
       of conditional subpattern are:

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If the condition is satisfied, the yes-pattern is used; otherwise the
       no-pattern (if present) is used. If there are more than two
       alternatives in the subpattern, a compile-time error occurs. Each of
       the two alternatives may itself contain nested subpatterns of any form,
       including conditional subpatterns; the restriction to two alternatives
       applies only at the level of the condition.
       This pattern fragment is an example where the alternatives are complex:

         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )

       There are five kinds of condition: references to subpatterns,
       references to recursion, two pseudo-conditions called DEFINE and
       VERSION, and assertions.

   Checking for a used subpattern by number

       If the text between the parentheses consists of a sequence of digits,
       the condition is true if a capturing subpattern of that number has
       previously matched. If there is more than one capturing subpattern with
       the same number (see the earlier section about duplicate subpattern
       numbers), the condition is true if any of them have matched. An
       alternative notation is to precede the digits with a plus or minus
       sign. In this case, the subpattern number is relative rather than
       absolute. The most recently opened parentheses can be referenced by
       (?(-1), the next most recent by (?(-2), and so on. Inside loops it can
       also make sense to refer to subsequent groups. The next parentheses to
       be opened can be referenced as (?(+1), and so on. (The value zero in
       any of these forms is not used; it provokes a compile-time error.)

       Consider the following pattern, which contains non-significant white
       space to make it more readable (assume the PCRE2_EXTENDED option) and
       to divide it into three parts for ease of discussion:

         ( \( )?    [^()]+    (?(1) \) )

       The first part matches an optional opening parenthesis, and if that
       character is present, sets it as the first captured substring. The
       second part matches one or more characters that are not parentheses.
       The third part is a conditional subpattern that tests whether or not
       the first set of parentheses matched.
       If they did, that is, if the subject started with an opening
       parenthesis, the condition is true, and so the yes-pattern is executed
       and a closing parenthesis is required. Otherwise, since no-pattern is
       not present, the subpattern matches nothing. In other words, this
       pattern matches a sequence of non-parentheses, optionally enclosed in
       parentheses.

       If you were embedding this pattern in a larger one, you could use a
       relative reference:

         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

       This makes the fragment independent of the parentheses in the larger
       pattern.

   Checking for a used subpattern by name

       Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
       used subpattern by name. For compatibility with earlier versions of
       PCRE1, which had this facility before Perl, the syntax (?(name)...) is
       also recognized.

       Rewriting the above example to use a named subpattern gives this:

         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )

       If the name used in a condition of this kind is a duplicate, the test
       is applied to all subpatterns of the same name, and is true if any one
       of them has matched.

   Checking for pattern recursion

       If the condition is the string (R), and there is no subpattern with the
       name R, the condition is true if a recursive call to the whole pattern
       or any subpattern has been made. If digits or a name preceded by
       ampersand follow the letter R, for example:

         (?(R3)...) or (?(R&name)...)

       the condition is true if the most recent recursion is into a subpattern
       whose number or name is given. This condition does not check the entire
       recursion stack. If the name used in a condition of this kind is a
       duplicate, the test is applied to all subpatterns of the same name, and
       is true if any one of them is the most recent recursion.

       At "top level", all these recursion test conditions are false. The
       syntax for recursive patterns is described below.

   Defining subpatterns for use by reference only

       If the condition is the string (DEFINE), and there is no subpattern
       with the name DEFINE, the condition is always false. In this case,
       there may be only one alternative in the subpattern. It is always
       skipped if control reaches this point in the pattern; the idea of
       DEFINE is that it can be used to define subroutines that can be
       referenced from elsewhere. (The use of subroutines is described below.)
       For example, a pattern to match an IPv4 address such as
       "192.168.23.245" could be written like this (ignore white space and
       line breaks):

         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
         \b (?&byte) (\.(?&byte)){3} \b

       The first part of the pattern is a DEFINE group inside which another
       group named "byte" is defined. This matches an individual component of
       an IPv4 address (a number less than 256). When matching takes place,
       this part of the pattern is skipped because DEFINE acts like a false
       condition. The rest of the pattern uses references to the named group
       to match the four dot-separated components of an IPv4 address,
       insisting on a word boundary at each end.

   Checking the PCRE2 version

       Programs that link with a PCRE2 library can check the version by
       calling pcre2_config() with appropriate arguments. Users of
       applications that do not have access to the underlying code cannot do
       this. A special "condition" called VERSION exists to allow such users
       to discover which version of PCRE2 they are dealing with by using this
       condition to match a string such as "yesno". VERSION must be followed
       either by "=" or ">=" and a version number.
       For example:

         (?(VERSION>=10.4)yes|no)

       This pattern matches "yes" if the PCRE2 version is greater than or
       equal to 10.4, or "no" otherwise. The fractional part of the version
       number may not contain more than two digits.

   Assertion conditions

       If the condition is not in any of the above formats, it must be an
       assertion. This may be a positive or negative lookahead or lookbehind
       assertion. Consider this pattern, again containing non-significant
       white space, and with the two alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

       The condition is a positive lookahead assertion that matches an
       optional sequence of non-letters followed by a letter. In other words,
       it tests for the presence of at least one letter in the subject. If a
       letter is found, the subject is matched against the first alternative;
       otherwise it is matched against the second. This pattern matches
       strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
       letters and dd are digits.


COMMENTS

       There are two ways of including comments in patterns that are processed
       by PCRE2. In both cases, the start of the comment must not be in a
       character class, nor in the middle of any other sequence of related
       characters such as (?: or a subpattern name or number. The characters
       that make up a comment play no part in the pattern matching.

       The sequence (?# marks the start of a comment that continues up to the
       next closing parenthesis. Nested parentheses are not permitted. If the
       PCRE2_EXTENDED option is set, an unescaped # character also introduces
       a comment, which in this case continues to immediately after the next
       newline character or character sequence in the pattern.
       Which characters are interpreted as newlines is controlled by an
       option passed to the compiling function or by a special sequence at
       the start of the pattern, as described in the section entitled
       "Newline conventions" above. Note that the end of this type of comment
       is a literal newline sequence in the pattern; escape sequences that
       happen to represent a newline do not count. For example, consider this
       pattern when PCRE2_EXTENDED is set, and the default newline convention
       (a single linefeed character) is in force:

         abc #comment \n still comment

       On encountering the # character, pcre2_compile() skips along, looking
       for a newline in the pattern. The sequence \n is still literal at this
       stage, so it does not terminate the comment. Only an actual character
       with the code value 0x0a (the default newline) does so.


RECURSIVE PATTERNS

       Consider the problem of matching a string in parentheses, allowing for
       unlimited nested parentheses. Without the use of recursion, the best
       that can be done is to use a pattern that matches up to some fixed
       depth of nesting. It is not possible to handle an arbitrary nesting
       depth.

       For some time, Perl has provided a facility that allows regular
       expressions to recurse (amongst other things). It does this by
       interpolating Perl code in the expression at run time, and the code can
       refer to the expression itself. A Perl pattern using code interpolation
       to solve the parentheses problem can be created like this:

         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run time, and in this case
       refers recursively to the pattern in which it appears.

       Obviously, PCRE2 cannot support the interpolation of Perl code.
       Instead, it supports special syntax for recursion of the entire
       pattern, and also for individual subpattern recursion. After its
       introduction in PCRE1 and Python, this kind of recursion was
       subsequently introduced into Perl at release 5.10.

       A special item that consists of (? followed by a number greater than
       zero and a closing parenthesis is a recursive subroutine call of the
       subpattern of the given number, provided that it occurs inside that
       subpattern. (If not, it is a non-recursive subroutine call, which is
       described in the next section.) The special item (?R) or (?0) is a
       recursive call of the entire regular expression.

       This PCRE2 pattern solves the nested parentheses problem (assume the
       PCRE2_EXTENDED option is set so that white space is ignored):

         \( ( [^()]++ | (?R) )* \)

       First it matches an opening parenthesis. Then it matches any number of
       substrings which can either be a sequence of non-parentheses, or a
       recursive match of the pattern itself (that is, a correctly
       parenthesized substring). Finally there is a closing parenthesis. Note
       the use of a possessive quantifier to avoid backtracking into sequences
       of non-parentheses.

       If this were part of a larger pattern, you would not want to recurse
       the entire pattern, so instead you could use this:

         ( \( ( [^()]++ | (?1) )* \) )

       We have put the pattern into parentheses, and caused the recursion to
       refer to them instead of the whole pattern.

       In a larger pattern, keeping track of parenthesis numbers can be
       tricky. This is made easier by the use of relative references. Instead
       of (?1) in the pattern above you can write (?-2) to refer to the second
       most recently opened parentheses preceding the recursion. In other
       words, a negative number counts capturing parentheses leftwards from
       the point at which it is encountered.
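
       The arithmetic behind a relative reference can be sketched in a few
       lines of Python. This is an illustrative model, not PCRE2 code: the
       function name resolve_relative and its parameters are hypothetical, and
       the caller is assumed to know the highest capturing group number that
       has been opened before the reference.

```python
def resolve_relative(groups_opened, offset):
    """Convert a relative group reference to an absolute group number.

    groups_opened: highest capturing group number opened before the reference.
    offset: the signed number in the reference, e.g. -2 for (?-2).
    """
    if offset == 0:
        raise ValueError("a relative reference needs a non-zero offset")
    if offset < 0:
        # Negative offsets count leftwards over groups already opened.
        target = groups_opened + 1 + offset
        if target < 1:
            raise ValueError("reference points before the first group")
        return target
    # Positive offsets count groups that have not been opened yet.
    return groups_opened + offset
```

       In the pattern ( \( ( [^()]++ | (?-2) )* \) ), two groups have been
       opened when the reference is reached, so resolve_relative(2, -2) gives
       group 1, the same group that (?1) names.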

       Be aware, however, that if duplicate subpattern numbers are in use,
       relative references refer to the earliest subpattern with the
       appropriate number. Consider, for example:

         (?|(a)|(b)) (c) (?-2)

       The first two capturing groups (a) and (b) are both numbered 1, and
       group (c) is number 2. When the reference (?-2) is encountered, the
       second most recently opened parentheses has the number 1, but it is the
       first such group (the (a) group) to which the recursion refers. This
       would be the same if an absolute reference (?1) was used. In other
       words, relative references are just a shorthand for computing a group
       number.

       It is also possible to refer to subsequently opened parentheses, by
       writing references such as (?+2). However, these cannot be recursive
       because the reference is not inside the parentheses that are
       referenced. They are always non-recursive subroutine calls, as
       described in the next section.

       An alternative approach is to use named parentheses. The Perl syntax
       for this is (?&name); PCRE1's earlier syntax (?P>name) is also
       supported. We could rewrite the above example as follows:

         (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If there is more than one subpattern with the same name, the earliest
       one is used.

       The example pattern that we have been looking at contains nested
       unlimited repeats, and so the use of a possessive quantifier for
       matching strings of non-parentheses is important when applying the
       pattern to strings that do not match. For example, when this pattern is
       applied to

         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it yields "no match" quickly.
       However, if a possessive quantifier is not used, the match runs for a
       very long time indeed because there are so many different ways the +
       and * repeats can carve up the subject, and all have to be tested
       before failure can be reported.

       At the end of a match, the values of capturing parentheses are those
       from the outermost level. If you want to obtain intermediate values, a
       callout function can be used (see below and the pcre2callout
       documentation). If the pattern above is matched against

         (ab(cd)ef)

       the value for the inner capturing parentheses (numbered 2) is "ef",
       which is the last value taken on at the top level. If a capturing
       subpattern is not matched at the top level, its final captured value is
       unset, even if it was (temporarily) set at a deeper level during the
       matching process.

       If there are more than 15 capturing parentheses in a pattern, PCRE2 has
       to obtain extra memory from the heap to store data during a recursion.
       If no memory can be obtained, the match fails with the
       PCRE2_ERROR_NOMEMORY error.

       Do not confuse the (?R) item with the condition (R), which tests for
       recursion. Consider this pattern, which matches text in angle
       brackets, allowing for arbitrary nesting. Only digits are allowed in
       nested brackets (that is, when recursing), whereas any characters are
       permitted at the outer level.

         < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

       In this pattern, (?(R) is the start of a conditional subpattern, with
       two different alternatives for the recursive and non-recursive cases.
       The (?R) item is the actual recursive call.

   Differences in recursion processing between PCRE2 and Perl

       Recursion processing in PCRE2 differs from Perl in two important ways.
       In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is
       always treated as an atomic group.
       That is, once it has matched some of the subject string, it is never
       re-entered, even if it contains untried alternatives and there is a
       subsequent matching failure. This can be illustrated by the following
       pattern, which purports to match a palindromic string that contains an
       odd number of characters (for example, "a", "aba", "abcba",
       "abcdcba"):

         ^(.|(.)(?1)\2)$

       The idea is that it either matches a single character, or two identical
       characters surrounding a sub-palindrome. In Perl, this pattern works;
       in PCRE2 it does not if the subject string is longer than three
       characters. Consider the subject string "abcba":

       At the top level, the first character is matched, but as it is not at
       the end of the string, the first alternative fails; the second
       alternative is taken and the recursion kicks in. The recursive call to
       subpattern 1 successfully matches the next character ("b"). (Note that
       the beginning and end of line tests are not part of the recursion).

       Back at the top level, the next character ("c") is compared with what
       subpattern 2 matched, which was "a". This fails. Because the recursion
       is treated as an atomic group, there are now no backtracking points,
       and so the entire match fails. (Perl is able, at this point, to
       re-enter the recursion and try the second alternative.) However, if the
       pattern is written with the alternatives in the other order, things are
       different:

         ^((.)(?1)\2|.)$

       This time, the recursing alternative is tried first, and continues to
       recurse until it runs out of characters, at which point the recursion
       fails. But this time we do have another alternative to try at the
       higher level. That is the big difference: in the previous case the
       remaining alternative is at a deeper recursion level, which PCRE2
       cannot use.
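
       The effect of atomic recursion on the two palindrome patterns above can
       be modelled with Python generators. This is a simplified sketch, not
       PCRE2's implementation: the names group1_ends and matches are
       hypothetical, each generator yields the offsets at which group 1 could
       end in the order a backtracking matcher would try them, and atomic
       recursion is modelled by keeping only the first result of a recursive
       call.

```python
from itertools import islice


def group1_ends(subject, start, atomic, recursive_first=False):
    # Yield each offset at which group 1 could end when entered at `start`.
    #   recursive_first=False models (.|(.)(?1)\2)
    #   recursive_first=True  models ((.)(?1)\2|.)

    def single_char():
        # The "." alternative: consume exactly one character.
        if start < len(subject):
            yield start + 1

    def wrapped():
        # (.)(?1)\2 : a character, a recursive call, the same character again.
        if start < len(subject):
            inner = group1_ends(subject, start + 1, atomic, recursive_first)
            if atomic:
                # Atomic recursion: only the first result is ever used.
                inner = islice(inner, 1)
            for end in inner:
                if end < len(subject) and subject[end] == subject[start]:
                    yield end + 1

    order = (wrapped, single_char) if recursive_first else (single_char, wrapped)
    for alternative in order:
        yield from alternative()


def matches(subject, atomic, recursive_first=False):
    # Model the ^ and $ anchors: some way of matching must reach the end.
    return any(end == len(subject)
               for end in group1_ends(subject, 0, atomic, recursive_first))
```

       Under this model, matches("abcba", atomic=True) is false for the first
       pattern, whereas atomic=False (the Perl-like behaviour) or
       recursive_first=True (the reordered pattern) lets the same subject
       match, which mirrors the walk-through above.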

       To change the pattern so that it matches all palindromic strings, not
       just those with an odd number of characters, it is tempting to change
       the pattern to this:

         ^((.)(?1)\2|.?)$

       Again, this works in Perl, but not in PCRE2, and for the same reason.
       When a deeper recursion has matched a single character, it cannot be
       entered again in order to match an empty string. The solution is to
       separate the two cases, and write out the odd and even cases as
       alternatives at the higher level:

         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))

       If you want to match typical palindromic phrases, the pattern has to
       ignore all non-word characters, which can be done like this:

         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$

       If run with the PCRE2_CASELESS option, this pattern matches phrases
       such as "A man, a plan, a canal: Panama!" and it works in both PCRE2
       and Perl. Note the use of the possessive quantifier *+ to avoid
       backtracking into sequences of non-word characters. Without this, PCRE2
       takes a great deal longer (ten times or more) to match typical phrases,
       and Perl takes so long that you think it has gone into a loop.

       WARNING: The palindrome-matching patterns above work only if the
       subject string does not start with a palindrome that is shorter than
       the entire string. For example, although "abcba" is correctly matched,
       if the subject is "ababa", PCRE2 finds the palindrome "aba" at the
       start, then fails at top level because the end of the string does not
       follow. Once again, it cannot jump back into the recursion to try other
       alternatives, so the entire match fails.

       The second way in which PCRE2 and Perl differ in their recursion
       processing is in the handling of captured values.
       In Perl, when a subpattern is called recursively or as a subroutine
       (see the next section), it has no access to any values that were
       captured outside the recursion, whereas in PCRE2 these values can be
       referenced. Consider this pattern:

         ^(.)(\1|a(?2))

       In PCRE2, this pattern matches "bab". The first capturing parentheses
       match "b", then in the second group, when the back reference \1 fails
       to match "b", the second alternative matches "a" and then recurses. In
       the recursion, \1 does now match "b" and so the whole match succeeds.
       In Perl, the pattern fails to match because inside the recursive call
       \1 cannot access the externally set value.


SUBPATTERNS AS SUBROUTINES

       If the syntax for a recursive subpattern call (either by number or by
       name) is used outside the parentheses to which it refers, it operates
       like a subroutine in a programming language. The called subpattern may
       be defined before or after the reference. A numbered reference can be
       absolute or relative, as in these examples:

         (...(absolute)...)...(?2)...
         (...(relative)...)...(?-1)...
         (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If instead the pattern

         (sens|respons)e and (?1)ibility

       is used, it does match "sense and responsibility" as well as the other
       two strings. Another example is given in the discussion of DEFINE
       above.

       All subroutine calls, whether recursive or not, are always treated as
       atomic groups. That is, once a subroutine has matched some of the
       subject string, it is never re-entered, even if it contains untried
       alternatives and there is a subsequent matching failure.
       Any capturing parentheses that are set during the subroutine call
       revert to their previous values afterwards.

       Processing options such as case-independence are fixed when a
       subpattern is defined, so if it is used as a subroutine, such options
       cannot be changed for different calls. For example, consider this
       pattern:

         (abc)(?i:(?-1))

       It matches "abcabc". It does not match "abcABC" because the change of
       processing option does not affect the called subpattern.


ONIGURUMA SUBROUTINE SYNTAX

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes, is
       an alternative syntax for referencing a subpattern as a subroutine,
       possibly recursively. Here are two of the examples used above,
       rewritten using this syntax:

         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
         (sens|respons)e and \g'1'ibility

       PCRE2 supports an extension to Oniguruma: if a number is preceded by a
       plus or a minus sign it is taken as a relative reference. For example:

         (abc)(?i:\g<-1>)

       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
       synonymous. The former is a back reference; the latter is a subroutine
       call.


CALLOUTS

       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
       Perl code to be obeyed in the middle of matching a regular expression.
       This makes it possible, amongst other things, to extract different
       substrings that match the same pair of parentheses when there is a
       repetition.

       PCRE2 provides a similar feature, but of course it cannot obey
       arbitrary Perl code. The feature is called "callout".
       The caller of PCRE2 provides an external function by putting its entry
       point in a match context using the function pcre2_set_callout(), and
       then passing that context to pcre2_match() or pcre2_dfa_match(). If no
       match context is passed, or if the callout entry point is set to NULL,
       callouts are disabled.

       Within a regular expression, (?C<arg>) indicates a point at which the
       external function is to be called. There are two kinds of callout:
       those with a numerical argument and those with a string argument. (?C)
       on its own with no argument is treated as (?C0). A numerical argument
       allows the application to distinguish between different callouts.
       String arguments were added for release 10.20 to make it possible for
       script languages that use PCRE2 to embed short scripts within patterns
       in a similar way to Perl.

       During matching, when PCRE2 reaches a callout point, the external
       function is called. It is provided with the number or string argument
       of the callout, the position in the pattern, and one item of data that
       is also set in the match block. The callout function may cause matching
       to proceed, to backtrack, or to fail.

       By default, PCRE2 implements a number of optimizations at matching
       time, and one side-effect is that sometimes callouts are skipped. If
       you need all possible callouts to happen, you need to set options that
       disable the relevant optimizations. More details, including a complete
       description of the programming interface to the callout function, are
       given in the pcre2callout documentation.

   Callouts with numerical arguments

       If you just want to have a means of identifying different callout
       points, put a number less than 256 after the letter C.
       For example, this pattern has two callout points:

         (?C1)abc(?C2)def

       If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
       callouts are automatically installed before each item in the pattern.
       They are all numbered 255. If there is a conditional group in the
       pattern whose condition is an assertion, an additional callout is
       inserted just before the condition. An explicit callout may also be set
       at this position, as in this example:

         (?(?C9)(?=a)abc|def)

       Note that this applies only to assertion conditions, not to other types
       of condition.

   Callouts with string arguments

       A delimited string may be used instead of a number as a callout
       argument. The starting delimiter must be one of ` ' " ^ % # $ { and the
       ending delimiter is the same as the start, except for {, where the
       ending delimiter is }. If the ending delimiter is needed within the
       string, it must be doubled. For example:

         (?C'ab ''c'' d')xyz(?C{any text})pqr

       The doubling is removed before the string is passed to the callout
       function.


BACKTRACKING CONTROL

       Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
       which are still described in the Perl documentation as "experimental
       and subject to change or removal in a future version of Perl". It goes
       on to say: "Their usage in production code should be noted to avoid
       problems during upgrades." The same remarks apply to the PCRE2 features
       described in this section.

       The new verbs make use of what was previously invalid syntax: an
       opening parenthesis followed by an asterisk. They are generally of the
       form (*VERB) or (*VERB:NAME). Some verbs take either form, possibly
       behaving differently depending on whether or not a name is present.

       By default, for compatibility with Perl, a name is any sequence of
       characters that does not include a closing parenthesis. The name is not
       processed in any way, and it is not possible to include a closing
       parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES option is
       set, normal backslash processing is applied to verb names and only an
       unescaped closing parenthesis terminates the name. A closing
       parenthesis can be included in a name either as \) or between \Q and
       \E. If the PCRE2_EXTENDED option is set, unescaped whitespace in verb
       names is skipped and #-comments are recognized, exactly as in the rest
       of the pattern.

       The maximum length of a name is 255 in the 8-bit library and 65535 in
       the 16-bit and 32-bit libraries. If the name is empty, that is, if the
       closing parenthesis immediately follows the colon, the effect is as if
       the colon were not there. Any number of these verbs may occur in a
       pattern.

       Since these verbs are specifically related to backtracking, most of
       them can be used only when the pattern is to be matched using the
       traditional matching function, because it uses a backtracking
       algorithm. With the exception of (*FAIL), which behaves like a failing
       negative assertion, the backtracking control verbs cause an error if
       encountered by the DFA matching function.

       The behaviour of these verbs in repeated groups, assertions, and in
       subpatterns called as subroutines (whether or not recursively) is
       documented below.

   Optimizations that affect backtracking verbs

       PCRE2 contains some optimizations that are used to speed up matching by
       running some checks at the start of each match attempt. For example, it
       may know the minimum length of matching subject, or that a particular
       character must be present.
       When one of these optimizations bypasses the running of a match, any
       included backtracking verbs will not, of course, be processed. You can
       suppress the start-of-match optimizations by setting the
       PCRE2_NO_START_OPTIMIZE option when calling pcre2_compile(), or by
       starting the pattern with (*NO_START_OPT). There is more discussion of
       this option in the section entitled "Compiling a pattern" in the
       pcre2api documentation.

       Experiments with Perl suggest that it too has similar optimizations,
       sometimes leading to anomalous results.

   Verbs that act immediately

       The following verbs act as soon as they are encountered. They may not
       be followed by a name.

         (*ACCEPT)

       This verb causes the match to end successfully, skipping the remainder
       of the pattern. However, when it is inside a subpattern that is called
       as a subroutine, only that subpattern is ended successfully. Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a
       positive assertion, the assertion succeeds; in a negative assertion,
       the assertion fails.

       If (*ACCEPT) is inside capturing parentheses, the data so far is
       captured. For example:

         A((?:A|B(*ACCEPT)|C)D)

       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is
       captured by the outer parentheses.

         (*FAIL) or (*F)

       This verb causes a matching failure, forcing backtracking to occur. It
       is equivalent to (?!) but easier to read. The Perl documentation notes
       that it is probably useful only when combined with (?{}) or (??{}).
       Those are, of course, Perl features that are not present in PCRE2. The
       nearest equivalent is the callout feature, as for example in this
       pattern:

         a+(?C)(*FAIL)

       A match with the string "aaaa" always fails, but the callout is taken
       before each backtrack happens (in this example, 10 times).
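
       The count of 10 in the a+(?C)(*FAIL) example can be reproduced with a
       small Python model of an unanchored backtracking match. This is a
       sketch under stated assumptions, not PCRE2 code: the function name
       callout_offsets is hypothetical, and the model assumes that no
       optimization skips a starting position or removes a backtracking point.

```python
def callout_offsets(subject):
    """List (start, current) offsets at which the callout would be reached."""
    calls = []
    for start in range(len(subject)):  # advance the starting point each time
        # Measure the longest run of "a" available from this starting point.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ is greedy: try the longest run first, then give back one
        # character per backtrack; the forced failure follows each callout.
        for length in range(run, 0, -1):
            calls.append((start, start + length))
    return calls
```

       For the subject "aaaa" the model records 4 + 3 + 2 + 1 = 10 callouts,
       one for every way a+ can match from each of the four starting
       positions.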

   Recording which path was taken

       There is one verb whose main purpose is to track how a match was
       arrived at, though it also has a secondary use in conjunction with
       advancing the match starting point (see (*SKIP) below).

         (*MARK:NAME) or (*:NAME)

       A name is always required with this verb. There may be as many
       instances of (*MARK) as you like in a pattern, and their names do not
       have to be unique.

       When a match succeeds, the name of the last-encountered (*MARK:NAME),
       (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to
       the caller as described in the section entitled "Other information
       about the match" in the pcre2api documentation. Here is an example of
       pcre2test output, where the "mark" modifier requests the retrieval and
       outputting of (*MARK) data:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XY
          0: XY
         MK: A
         XZ
          0: XZ
         MK: B

       The (*MARK) name is tagged with "MK:" in this output, and in this
       example it indicates which of the two alternatives matched. This is a
       more efficient way of obtaining this information than putting each
       alternative in its own capturing parentheses.

       If a verb with a name is encountered in a positive assertion that is
       true, the name is recorded and passed back if it is the
       last-encountered. This does not happen for negative assertions or
       failing positive assertions.

       After a partial match or a failed match, the last encountered name in
       the entire match process is returned. For example:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XP
         No match, mark = B

       Note that in this unanchored example the mark is retained from the
       match attempt that started at the letter "X" in the subject.
       Subsequent match attempts starting at "P" and then with an empty
       string do not get as far as the (*MARK) item, but nevertheless do not
       reset it.

       If you are interested in (*MARK) values after failed matches, you
       should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
       ensure that the match is always attempted.

   Verbs that act after backtracking

       The following verbs do nothing when they are encountered. Matching
       continues with what follows, but if there is no subsequent match,
       causing a backtrack to the verb, a failure is forced. That is,
       backtracking cannot pass to the left of the verb. However, when one of
       these verbs appears inside an atomic group (which includes any group
       that is called as a subroutine) or in an assertion that is true, its
       effect is confined to that group, because once the group has been
       matched, there is never any backtracking into it. In this situation,
       backtracking has to jump to the left of the entire atomic group or
       assertion.

       These verbs differ in exactly what kind of failure occurs when
       backtracking reaches them. The behaviour described below is what
       happens when the verb is not in a subroutine or an assertion.
       Subsequent sections cover these special cases.

         (*COMMIT)

       This verb, which may not be followed by a name, causes the whole match
       to fail outright if there is a later matching failure that causes
       backtracking to reach it. Even if the pattern is unanchored, no further
       attempts to find a match by advancing the starting point take place. If
       (*COMMIT) is the only backtracking verb that is encountered, once it
       has been passed pcre2_match() is committed to finding a match at the
       current starting point, or not at all. For example:

         a+(*COMMIT)b

       This matches "xxaab" but not "aacaab".
       It can be thought of as a kind of dynamic anchor, or "I've started, so
       I must finish." The name of the most recently passed (*MARK) in the
       path is passed back when (*COMMIT) forces a match failure.

       If there is more than one backtracking verb in a pattern, a different
       one that follows (*COMMIT) may be triggered first, so merely passing
       (*COMMIT) during a match does not always guarantee that a match must be
       at this starting point.

       Note that (*COMMIT) at the start of a pattern is not the same as an
       anchor, unless PCRE2's start-of-match optimizations are turned off, as
       shown in this output from pcre2test:

           re> /(*COMMIT)abc/
         data> xyzabc
          0: abc
         data>
           re> /(*COMMIT)abc/no_start_optimize
         data> xyzabc
         No match

       For the first pattern, PCRE2 knows that any match must start with "a",
       so the optimization skips along the subject to "a" before applying the
       pattern to the first set of data. The match attempt then succeeds. The
       second pattern disables the optimization that skips along to the first
       character. The pattern is now applied starting at "x", and so the
       (*COMMIT) causes the match to fail without trying any other starting
       points.

         (*PRUNE) or (*PRUNE:NAME)

       This verb causes the match to fail at the current starting position in
       the subject if there is a later matching failure that causes
       backtracking to reach it. If the pattern is unanchored, the normal
       "bumpalong" advance to the next starting character then happens.
       Backtracking can occur as usual to the left of (*PRUNE), before it is
       reached, or when matching to the right of (*PRUNE), but if there is no
       match to the right, backtracking cannot cross (*PRUNE).
       In simple cases, the use of (*PRUNE) is just an alternative to an
       atomic group or possessive quantifier, but there are some uses of
       (*PRUNE) that cannot be expressed in any other way. In an anchored
       pattern (*PRUNE) has the same effect as (*COMMIT).

       The behaviour of (*PRUNE:NAME) is not the same as
       (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
       remembered for passing back to the caller. However, (*SKIP:NAME)
       searches only for names set with (*MARK), ignoring those set by
       (*PRUNE) or (*THEN).

         (*SKIP)

       This verb, when given without a name, is like (*PRUNE), except that if
       the pattern is unanchored, the "bumpalong" advance is not to the next
       character, but to the position in the subject where (*SKIP) was
       encountered. (*SKIP) signifies that whatever text was matched leading
       up to it cannot be part of a successful match. Consider:

         a+(*SKIP)b

       If the subject is "aaaac...", after the first match attempt fails
       (starting at the first character in the string), the starting point
       skips on to start the next attempt at "c". Note that a possessive
       quantifier does not have the same effect as this example; although it
       would suppress backtracking during the first match attempt, the second
       attempt would start at the second character instead of skipping on to
       "c".

         (*SKIP:NAME)

       When (*SKIP) has an associated name, its behaviour is modified. When it
       is triggered, the previous path through the pattern is searched for the
       most recent (*MARK) that has the same name. If one is found, the
       "bumpalong" advance is to the subject position that corresponds to that
       (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
       a matching name is found, the (*SKIP) is ignored.

       Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME).
       It ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).

         (*THEN) or (*THEN:NAME)

       This verb causes a skip to the next innermost alternative when
       backtracking reaches it. That is, it cancels any further backtracking
       within the current alternative. Its name comes from the observation
       that it can be used for a pattern-based if-then-else block:

         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If the COND1 pattern matches, FOO is tried (and possibly further
       items after the end of the group if FOO succeeds); on failure, the
       matcher skips to the second alternative and tries COND2, without
       backtracking into COND1. If that succeeds and BAR fails, COND3 is
       tried. If subsequently BAZ fails, there are no more alternatives, so
       there is a backtrack to whatever came before the entire group. If
       (*THEN) is not inside an alternation, it acts like (*PRUNE).

       The behaviour of (*THEN:NAME) is not the same as
       (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is
       remembered for passing back to the caller. However, (*SKIP:NAME)
       searches only for names set with (*MARK), ignoring those set by
       (*PRUNE) and (*THEN).

       A subpattern that does not contain a | character is just a part of
       the enclosing alternative; it is not a nested alternation with only
       one alternative. The effect of (*THEN) extends beyond such a
       subpattern to the enclosing alternative. Consider this pattern, where
       A, B, etc. are complex pattern fragments that do not contain any |
       characters at this level:

         A (B(*THEN)C) | D

       If A and B are matched, but there is a failure in C, matching does
       not backtrack into A; instead it moves to the next alternative, that
       is, D.
       However, if the subpattern containing (*THEN) is given an
       alternative, it behaves differently:

         A (B(*THEN)C | (*FAIL)) | D

       The effect of (*THEN) is now confined to the inner subpattern. After
       a failure in C, matching moves to (*FAIL), which causes the whole
       subpattern to fail because there are no more alternatives to try. In
       this case, matching does now backtrack into A.

       Note that a conditional subpattern is not considered as having two
       alternatives, because only one is ever used. In other words, the |
       character in a conditional subpattern has a different meaning.
       Ignoring white space, consider:

         ^.*? (?(?=a) a | b(*THEN)c )

       If the subject is "ba", this pattern does not match. Because .*? is
       ungreedy, it initially matches zero characters. The condition (?=a)
       then fails, the character "b" is matched, but "c" is not. At this
       point, matching does not backtrack to .*? as might perhaps be
       expected from the presence of the | character. The conditional
       subpattern is part of the single alternative that comprises the whole
       pattern, and so the match fails. (If there was a backtrack into .*?,
       allowing it to match "b", the match would succeed.)

       The verbs just described provide four different "strengths" of
       control when subsequent matching fails. (*THEN) is the weakest,
       carrying on the match at the next alternative. (*PRUNE) comes next,
       failing the match at the current starting position, but allowing an
       advance to the next character (for an unanchored pattern). (*SKIP) is
       similar, except that the advance may be more than one character.
       (*COMMIT) is the strongest, causing the entire match to fail.

   More than one backtracking verb

       If more than one backtracking verb is present in a pattern, the one
       that is backtracked onto first acts. For example, consider this
       pattern, where A, B, etc.
       are complex pattern fragments:

         (A(*COMMIT)B(*THEN)C|ABD)

       If A matches but B fails, the backtrack to (*COMMIT) causes the
       entire match to fail. However, if A and B match, but C fails, the
       backtrack to (*THEN) causes the next alternative (ABD) to be tried.
       This behaviour is consistent, but is not always the same as Perl's.
       It means that if two or more backtracking verbs appear in succession,
       all but the last of them have no effect. Consider this example:

         ...(*COMMIT)(*PRUNE)...

       If there is a matching failure to the right, backtracking onto
       (*PRUNE) causes it to be triggered, and its action is taken. There
       can never be a backtrack onto (*COMMIT).

   Backtracking verbs in repeated groups

       PCRE2 differs from Perl in its handling of backtracking verbs in
       repeated groups. For example, consider:

         /(a(*COMMIT)b)+ac/

       If the subject is "abac", Perl matches, but PCRE2 fails because the
       (*COMMIT) in the second repeat of the group acts.

   Backtracking verbs in assertions

       (*FAIL) in an assertion has its normal effect: it forces an immediate
       backtrack.

       (*ACCEPT) in a positive assertion causes the assertion to succeed
       without any further processing. In a negative assertion, (*ACCEPT)
       causes the assertion to fail without any further processing.

       The other backtracking verbs are not treated specially if they appear
       in a positive assertion. In particular, (*THEN) skips to the next
       alternative in the innermost enclosing group that has alternations,
       whether or not this is within the assertion.

       Negative assertions are, however, different, in order to ensure that
       changing a positive assertion into a negative assertion changes its
       result.
       Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative
       assertion to be true, without considering any further alternative
       branches in the assertion. Backtracking into (*THEN) causes it to
       skip to the next enclosing alternative within the assertion (the
       normal behaviour), but if the assertion does not have such an
       alternative, (*THEN) behaves like (*PRUNE).

   Backtracking verbs in subroutines

       These behaviours occur whether or not the subpattern is called
       recursively. Perl's treatment of subroutines is different in some
       cases.

       (*FAIL) in a subpattern called as a subroutine has its normal effect:
       it forces an immediate backtrack.

       (*ACCEPT) in a subpattern called as a subroutine causes the
       subroutine match to succeed without any further processing. Matching
       then continues after the subroutine call.

       (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a
       subroutine cause the subroutine match to fail.

       (*THEN) skips to the next alternative in the innermost enclosing
       group within the subpattern that has alternatives. If there is no
       such group within the subpattern, (*THEN) causes the subroutine match
       to fail.


SEE ALSO

       pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 20 June 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------


PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 PERFORMANCE

       Two aspects of performance are discussed below: memory usage and
       processing time.
       The way you express your pattern as a regular expression can affect
       both of them.


COMPILED PATTERN MEMORY USAGE

       Patterns are compiled by PCRE2 into a reasonably efficient
       interpretive code, so that most simple patterns do not use much
       memory. However, there is one case where the memory usage of a
       compiled pattern can be unexpectedly large. If a parenthesized
       subpattern has a quantifier with a minimum greater than 1 and/or a
       limited maximum, the whole subpattern is repeated in the compiled
       code. For example, the pattern

         (abc|def){2,4}

       is compiled as if it were

         (abc|def)(abc|def)((abc|def)(abc|def)?)?

       (Technical aside: It is done this way so that backtrack points within
       each of the repetitions can be independently maintained.)

       For regular expressions whose quantifiers use only small numbers,
       this is not usually a problem. However, if the numbers are large, and
       particularly if such repetitions are nested, the memory usage can
       become an embarrassment. For example, the very simple pattern

         ((ab){1,1000}c){1,3}

       uses 51K bytes when compiled using the 8-bit library. When PCRE2 is
       compiled with its default internal pointer size of two bytes, the
       size limit on a compiled pattern is 64K code units in the 8-bit and
       16-bit libraries, and this is reached with the above pattern if the
       outer repetition is increased from 3 to 4. PCRE2 can be compiled to
       use larger internal pointers and thus handle larger compiled
       patterns, but it is better to try to rewrite your pattern to use less
       memory if you can.

       One way of reducing the memory usage for such patterns is to make use
       of PCRE2's "subroutine" facility.
       Re-writing the above pattern as

         ((ab)(?2){0,999}c)(?1){0,2}

       reduces the memory requirements to 18K, and indeed it remains under
       20K even with the outer repetition increased to 100. However, this
       pattern is not exactly equivalent, because the "subroutine" calls are
       treated as atomic groups into which there can be no backtracking if
       there is a subsequent matching failure. Therefore, PCRE2 cannot do
       this kind of rewriting automatically. Furthermore, there is a
       noticeable loss of speed when executing the modified pattern.
       Nevertheless, if the atomic grouping is not a problem and the loss of
       speed is acceptable, this kind of rewriting will allow you to process
       patterns that PCRE2 cannot otherwise handle.


STACK USAGE AT RUN TIME

       When pcre2_match() is used for matching, certain kinds of pattern can
       cause it to use large amounts of the process stack. In some
       environments the default process stack is quite small, and if it runs
       out the result is often SIGSEGV. Rewriting your pattern can often
       help. The pcre2stack documentation discusses this issue in detail.


PROCESSING TIME

       Certain items in regular expression patterns are processed more
       efficiently than others. It is more efficient to use a character
       class like [aeiou] than a set of single-character alternatives such
       as (a|e|i|o|u). In general, the simplest construction that provides
       the required behaviour is usually the most efficient. Jeffrey
       Friedl's book contains a lot of useful general discussion about
       optimizing regular expressions for efficient performance. This
       document contains a few observations about PCRE2.

       Using Unicode character properties (the \p, \P, and \X escapes) is
       slow, because PCRE2 has to use a multi-stage table lookup whenever it
       needs a character's property.
       If you can find an alternative pattern that does not use character
       properties, it will probably be faster.

       By default, the escape sequences \b, \d, \s, and \w, and the POSIX
       character classes such as [:alpha:] do not use Unicode properties,
       partly for backwards compatibility, and partly for performance
       reasons. However, you can set the PCRE2_UCP option or start the
       pattern with (*UCP) if you want Unicode character properties to be
       used. This can double the matching time for items such as \d, when
       matched with pcre2_match(); the performance loss is less with a DFA
       matching function, and in both cases there is not much difference for
       \b.

       When a pattern begins with .* not in atomic parentheses, nor in
       parentheses that are the subject of a backreference, and the
       PCRE2_DOTALL option is set, the pattern is implicitly anchored by
       PCRE2, since it can match only at the start of a subject string. If
       the pattern has multiple top-level branches, they must all be
       anchorable. The optimization can be disabled by the
       PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the
       pattern contains (*PRUNE) or (*SKIP).

       If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
       because the dot metacharacter does not then match a newline, and if
       the subject string contains newlines, the pattern may match from the
       character immediately following one of them instead of from the very
       start. For example, the pattern

         .*second

       matches the subject "first\nand second" (where \n stands for a
       newline character), with the match starting at the seventh character.
       In order to do this, PCRE2 has to retry the match starting after
       every newline in the subject.
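
       The retry positions described above can be pictured with a small
       sketch. This is only an illustration of the idea, not PCRE2's
       internal code: the helper name candidate_starts() and its interface
       are hypothetical, and it simply lists offset 0 plus the offset that
       follows each newline in the subject.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: collect the offsets at which a
   pattern that begins with .* (with PCRE2_DOTALL unset) could start a
   match, namely offset 0 and the offset just after each newline. Returns
   the number of offsets stored, which is at most max_starts. */
static size_t candidate_starts(const char *subject, size_t *starts,
                               size_t max_starts)
{
    size_t count = 0;
    size_t length;
    size_t i;

    assert(subject != NULL && starts != NULL);
    length = strlen(subject);

    /* The very start of the subject is always a candidate. */
    if (count < max_starts)
        starts[count++] = 0;

    /* Every position that immediately follows a newline is another one. */
    for (i = 0; i < length && count < max_starts; i++) {
        if (subject[i] == '\n')
            starts[count++] = i + 1;
    }
    return count;
}
```

       For the subject "first\nand second" this yields the offsets 0 and 6;
       offset 6 is the seventh character, where the match of .*second
       begins.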

       If you are using such a pattern with subject strings that do not
       contain newlines, the best performance is obtained by setting
       PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
       explicit anchoring. That saves PCRE2 from having to scan along the
       subject looking for a newline to restart at.

       Beware of patterns that contain nested indefinite repeats. These can
       take a long time to run when applied to a string that does not match.
       Consider the pattern fragment

         ^(a+)*

       This can match "aaaa" in 16 different ways, and this number increases
       very rapidly as the string gets longer. (The * repeat can match 0, 1,
       2, 3, or 4 times, and for each of those cases other than 0 or 4, the
       + repeats can match different numbers of times.) When the remainder
       of the pattern is such that the entire match is going to fail, PCRE2
       has in principle to try every possible variation, and this can take
       an extremely long time, even for relatively short strings.

       An optimization catches some of the more simple cases such as

         (a+)*b

       where a literal character follows. Before embarking on the standard
       matching procedure, PCRE2 checks that there is a "b" later in the
       subject string, and if there is not, it fails the match immediately.
       However, when there is no following literal this optimization cannot
       be used. You can see the difference by comparing the behaviour of

         (a+)*\d

       with the pattern above. The pattern above, (a+)*b, gives a failure
       almost instantly when applied to a whole line of "a" characters,
       whereas (a+)*\d takes an appreciable time with strings longer than
       about 20 characters.

       In many cases, the solution to this kind of performance issue is to
       use an atomic group or a possessive quantifier.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 January 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2posix.h>

       int regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int regexec(const regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void regfree(regex_t *preg);


DESCRIPTION

       This set of functions provides a POSIX-style API for the PCRE2
       regular expression 8-bit library. See the pcre2api documentation for
       a description of PCRE2's native API, which contains much additional
       functionality. There are no POSIX-style wrappers for PCRE2's 16-bit
       and 32-bit libraries.

       The functions described here are just wrapper functions that
       ultimately call the PCRE2 native API. Their prototypes are defined in
       the pcre2posix.h header file, and on Unix systems the library itself
       is called libpcre2-posix.a, so can be accessed by adding
       -lpcre2-posix to the command for linking an application that uses
       them. Because the POSIX functions call the native ones, it is also
       necessary to add -lpcre2-8.

       Those POSIX option bits that can reasonably be mapped to PCRE2 native
       options have been implemented. In addition, the option REG_EXTENDED
       is defined with the value zero. This has no effect, but since
       programs that are written to the POSIX interface often use it, this
       makes it easier to slot in PCRE2 as a replacement library. Other
       POSIX options are not even defined.

       There are also some options that are not defined by POSIX. These have
       been added at the request of users who want to make use of certain
       PCRE2-specific features via the POSIX calling interface.

       When PCRE2 is called via these functions, it is only the API that is
       POSIX-like in style. The syntax and semantics of the regular
       expressions themselves are still those of Perl, subject to the
       setting of various PCRE2 options, as described below. "POSIX-like in
       style" means that the API approximates to the POSIX definition; it is
       not fully POSIX-compatible, and in multi-unit encoding domains it is
       probably even less compatible.

       The header for these functions is supplied as pcre2posix.h to avoid
       any potential clash with other POSIX libraries. It can, of course, be
       renamed or aliased as regex.h, which is the "correct" name. It
       provides two structure types, regex_t for compiled internal forms,
       and regmatch_t for returning captured substrings. It also defines
       some constants whose names start with "REG_"; these are used for
       setting options and identifying error codes.


COMPILING A PATTERN

       The function regcomp() is called to compile a pattern into an
       internal form. The pattern is a C string terminated by a binary zero,
       and is passed in the argument pattern. The preg argument is a pointer
       to a regex_t structure that is used as a base for storing information
       about the compiled regular expression.

       The argument cflags is either zero, or contains one or more of the
       bits defined by the following macros:

         REG_DOTALL

       The PCRE2_DOTALL option is set when the regular expression is passed
       for compilation to the native function. Note that REG_DOTALL is not
       part of the POSIX standard.

         REG_ICASE

       The PCRE2_CASELESS option is set when the regular expression is
       passed for compilation to the native function.

         REG_NEWLINE

       The PCRE2_MULTILINE option is set when the regular expression is
       passed for compilation to the native function. Note that this does
       not mimic the defined POSIX behaviour for REG_NEWLINE (see the
       following section).

         REG_NOSUB

       When a pattern that is compiled with this flag is passed to regexec()
       for matching, the nmatch and pmatch arguments are ignored, and no
       captured strings are returned. Versions of the PCRE2 library prior to
       10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this
       no longer happens because it disables the use of back references.

         REG_UCP

       The PCRE2_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE2 to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.

         REG_UNGREEDY

       The PCRE2_UNGREEDY option is set when the regular expression is
       passed for compilation to the native function. Note that REG_UNGREEDY
       is not part of the POSIX standard.

         REG_UTF

       The PCRE2_UTF option is set when the regular expression is passed for
       compilation to the native function. This causes the pattern itself
       and all data strings used for matching it to be treated as UTF-8
       strings. Note that REG_UTF is not part of the POSIX standard.

       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE2 default
       semantics. In particular, the way it handles newline characters in
       the subject string is the Perl way, not the POSIX way.
       Note that setting PCRE2_MULTILINE has only some of the effects
       specified for REG_NEWLINE. It does not affect the way newlines are
       matched by the dot metacharacter (they are not) or by a negative
       class such as [^a] (they are).

       The yield of regcomp() is zero on success, and non-zero otherwise.
       The preg structure is filled in on success, and one member of the
       structure is public: re_nsub contains the number of capturing
       subpatterns in the regular expression. Various error codes are
       defined in the header file.

       NOTE: If the yield of regcomp() is non-zero, you must not attempt to
       use the contents of the preg structure. If, for example, you pass it
       to regexec(), the result is undefined and your program is likely to
       crash.


MATCHING NEWLINE CHARACTERS

       This area is not simple, because POSIX and Perl take different views
       of things. It is not possible to get PCRE2 to obey POSIX semantics,
       but then PCRE2 was never intended to be a POSIX engine. The following
       table lists the different possibilities for matching newline
       characters in Perl and PCRE2:

                                 Default   Change with

         . matches newline          no     PCRE2_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE2_MULTILINE
         ^ matches \n in middle     no     PCRE2_MULTILINE

       This is the equivalent table for a POSIX-compatible pattern matcher:

                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE

       This behaviour is not what happens when PCRE2 is called via its POSIX
       API. By default, PCRE2's behaviour is the same as Perl's, except that
       there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl.
       In both PCRE2 and Perl, there is no way to stop newline from matching
       [^a].

       Default POSIX newline handling can be obtained by setting
       PCRE2_DOTALL and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile()
       directly, but there is no way to make PCRE2 behave exactly as for the
       REG_NEWLINE action. When using the POSIX API, passing REG_NEWLINE to
       PCRE2's regcomp() function causes PCRE2_MULTILINE to be passed to
       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way
       to pass PCRE2_DOLLAR_ENDONLY.


MATCHING A PATTERN

       The function regexec() is called to match a compiled pattern preg
       against a given string, which is by default terminated by a zero byte
       (but see REG_STARTEND below), subject to the options in eflags. These
       can be:

         REG_NOTBOL

       The PCRE2_NOTBOL option is set when calling the underlying PCRE2
       matching function.

         REG_NOTEMPTY

       The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
       matching function. Note that REG_NOTEMPTY is not part of the POSIX
       standard. However, setting this option can give more POSIX-like
       behaviour in some situations.

         REG_NOTEOL

       The PCRE2_NOTEOL option is set when calling the underlying PCRE2
       matching function.

         REG_STARTEND

       The string is considered to start at string + pmatch[0].rm_so and to
       have a terminating NUL located at string + pmatch[0].rm_eo (there
       need not actually be a NUL at that location), regardless of the value
       of nmatch. This is a BSD extension, compatible with but not specified
       by IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
       software intended to be portable to other systems. Note that a
       non-zero rm_so does not imply REG_NOTBOL; REG_STARTEND affects only
       the location of the string, not how it is matched.
       Setting REG_STARTEND and passing pmatch as NULL are mutually
       exclusive; the error REG_INVARG is returned.

       If the pattern was compiled with the REG_NOSUB flag, no data about
       any matched strings is returned. The nmatch and pmatch arguments of
       regexec() are ignored (except possibly as input for REG_STARTEND).

       The value of nmatch may be zero, and the value of pmatch may be NULL
       (unless REG_STARTEND is set); in both these cases no data about any
       matched strings is returned.

       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which
       points to an array of nmatch structures of type regmatch_t,
       containing the members rm_so and rm_eo. These contain the byte offset
       to the first character of each substring and the offset to the first
       character after the end of each substring, respectively. The 0th
       element of the vector relates to the entire portion of string that
       was matched; subsequent elements relate to the capturing subpatterns
       of the regular expression. Unused entries in the array have both
       structure members set to -1.

       A successful match yields a zero return; various error codes are
       defined in the header file, of which REG_NOMATCH is the "expected"
       failure code.


ERROR MESSAGES

       The regerror() function maps a non-zero errorcode from either
       regcomp() or regexec() to a printable message. If preg is not NULL,
       the error should have arisen from the use of that structure. A
       message terminated by a binary zero is placed in errbuf. If the
       buffer is too short, only the first errbuf_size - 1 characters of the
       error message are used. The yield of the function is the size of
       buffer needed to hold the whole message, including the terminating
       zero. This value is greater than errbuf_size if the message was
       truncated.
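
       The calls described in the preceding sections fit together roughly as
       in the following sketch. It is a minimal illustration under stated
       assumptions, not code from the PCRE2 distribution: the helper name
       first_capture() and the USE_PCRE2_POSIX macro are hypothetical. When
       that macro is defined, the PCRE2 wrappers are used (link with
       -lpcre2-posix -lpcre2-8); otherwise the system's own regex.h is used,
       which behaves the same for simple patterns like the ones shown.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#ifdef USE_PCRE2_POSIX          /* hypothetical switch, not a PCRE2 macro */
#include <pcre2posix.h>         /* PCRE2's POSIX-style wrapper functions */
#else
#include <regex.h>              /* the system's own POSIX regex functions */
#endif

/* Hypothetical helper: compile pattern, match it against subject, and copy
   the first captured substring into out. Returns 0 on a match, REG_NOMATCH
   when nothing matches, or another non-zero code on error, in which case
   errbuf holds the message produced by regerror(). */
static int first_capture(const char *pattern, const char *subject,
                         char *out, size_t out_size,
                         char *errbuf, size_t errbuf_size)
{
    regex_t preg;
    regmatch_t pmatch[2];       /* [0] = whole match, [1] = first capture */
    int rc;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    /* REG_EXTENDED is zero in pcre2posix.h, so it is harmless there. */
    rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0) {
        /* After a failed regcomp(), preg must not be given to regexec(). */
        regerror(rc, &preg, errbuf, errbuf_size);
        return rc;
    }

    rc = regexec(&preg, subject, 2, pmatch, 0);
    if (rc == 0 && pmatch[1].rm_so != -1) {
        /* rm_so and rm_eo are byte offsets into subject. */
        size_t len = (size_t)(pmatch[1].rm_eo - pmatch[1].rm_so);
        if (len >= out_size)
            len = out_size - 1;
        memcpy(out, subject + pmatch[1].rm_so, len);
        out[len] = '\0';
    } else if (rc != 0 && rc != REG_NOMATCH) {
        regerror(rc, &preg, errbuf, errbuf_size);
    }

    regfree(&preg);             /* release memory associated with preg */
    return rc;
}
```

       A caller might pass the pattern "([a-z]+)@" and the subject
       "mail bob@example"; the return value is then zero and out receives
       the first captured substring.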


MEMORY USAGE

       Compiling a regular expression causes memory to be allocated and
       associated with the preg structure. The function regfree() frees all
       such memory, after which preg may no longer be used as a compiled
       expression.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 31 January 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------


PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 SAMPLE PROGRAM

       A simple, complete demonstration program to get you started with
       using PCRE2 is supplied in the file pcre2demo.c in the src directory
       in the PCRE2 distribution. A listing of this program is given in the
       pcre2demo documentation. If you do not have a copy of the PCRE2
       distribution, you can save this listing to re-create the contents of
       pcre2demo.c.

       The demonstration program compiles the regular expression that is its
       first argument, and matches it against the subject string in its
       second argument. No PCRE2 options are set, and default character
       tables are used. If matching succeeds, the program outputs the
       portion of the subject that matched, together with the contents of
       any captured substrings.

       If the -g option is given on the command line, the program then goes
       on to check for further matches of the same regular expression in the
       same subject string. The logic is a little bit tricky because of the
       possibility of matching an empty string. Comments in the code explain
       what is going on.

       The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
       library.
       It handles strings and characters that are stored in 8-bit code
       units. By default, one character corresponds to one code unit, but if
       the pattern starts with "(*UTF)", both it and the subject are treated
       as UTF-8 strings, where characters may occupy multiple code units.

       If PCRE2 is installed in the standard include and library directories
       for your operating system, you should be able to compile the
       demonstration program using a command like this:

         cc -o pcre2demo pcre2demo.c -lpcre2-8

       If PCRE2 is installed elsewhere, you may need to add additional
       options to the command line. For example, on a Unix-like system that
       has PCRE2 installed in /usr/local, you can compile the demonstration
       program using a command like this:

         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
            -L/usr/local/lib -lpcre2-8

       Once you have built the demonstration program, you can run simple
       tests like this:

         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'

       Note that there is a much more comprehensive test program, called
       pcre2test, which supports many more facilities for testing regular
       expressions using all three PCRE2 libraries (8-bit, 16-bit, and
       32-bit, though not all three need be installed). The pcre2demo
       program is provided as a relatively simple coding example.

       If you try to run pcre2demo when PCRE2 is not installed in the
       standard library directory, you may get an error like this on some
       operating systems (e.g. Solaris):

         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such
         file or directory

       This is caused by the way shared library support works on those
       systems. You need to add

         -R/usr/local/lib

       (for example) to the compile command to get round this problem.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 February 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------
PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);

       If you are running an application that uses a large number of regular
       expression patterns, it may be useful to store them in a precompiled
       form instead of having to compile them every time the application is
       run. However, if you are using the just-in-time optimization feature,
       it is not possible to save and reload the JIT data, because it is
       position-dependent. The host on which the patterns are reloaded must
       be running the same version of PCRE2, with the same code unit width,
       and must also have the same endianness, pointer width and PCRE2_SIZE
       type. For example, patterns compiled on a 32-bit system using PCRE2's
       16-bit library cannot be reloaded on a 64-bit system, nor can they be
       reloaded using the 8-bit library.


SECURITY CONCERNS

       The facility for saving and restoring compiled patterns is intended
       for use within individual applications.
       As such, the data supplied to
       pcre2_serialize_decode() is expected to be trusted data, not data from
       arbitrary external sources. There is only some simple consistency
       checking, not complete validation of what is being re-loaded.


SAVING COMPILED PATTERNS

       Before compiled patterns can be saved they must be serialized, that is,
       converted to a stream of bytes. A single byte stream may contain any
       number of compiled patterns, but they must all use the same character
       tables. A single copy of the tables is included in the byte stream (its
       size is 1088 bytes). For more details of character tables, see the sec-
       tion on locale support in the pcre2api documentation.

       The function pcre2_serialize_encode() creates a serialized byte stream
       from a list of compiled patterns. Its first two arguments specify the
       list, being a pointer to a vector of pointers to compiled patterns, and
       the length of the vector. The third and fourth arguments point to vari-
       ables which are set to point to the created byte stream and its length,
       respectively. The final argument is a pointer to a general context,
       which can be used to specify custom memory management functions. If
       this argument is NULL, malloc() is used to obtain memory for the byte
       stream. The yield of the function is the number of serialized patterns,
       or one of the following negative error codes:

         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
         PCRE2_ERROR_MEMORY       memory allocation failed
         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL

       PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
       rupted, or that a slot in the vector does not point to a compiled pat-
       tern.

       Once a set of patterns has been serialized you can save the data in any
       appropriate manner. Here is sample code that compiles two patterns and
       writes them to a file. It assumes that the variable fd refers to a file
       that is open for output. The error checking that should be present in a
       real application has been omitted for simplicity.

         int errorcode;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];
         /* The casts avoid pointer-signedness warnings in the 8-bit library. */
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         errorcode = pcre2_serialize_encode(list_of_codes, 2, &bytes,
           &bytescount, NULL);
         errorcode = fwrite(bytes, 1, bytescount, fd);

       Note that the serialized data is binary data that may contain any of
       the 256 possible byte values. On systems that make a distinction
       between binary and non-binary data, be sure that the file is opened for
       binary output.

       Serializing a set of patterns leaves the original data untouched, so
       they can still be used for matching. Their memory must eventually be
       freed in the usual way by calling pcre2_code_free(). When you have fin-
       ished with the byte stream, it too must be freed by calling pcre2_seri-
       alize_free().


RE-USING PRECOMPILED PATTERNS

       In order to re-use a set of saved patterns you must first make the
       serialized byte stream available in main memory (for example, by read-
       ing from a file). The management of this memory block is up to the
       application.
       You can use the pcre2_serialize_get_number_of_codes()
       function to find out how many compiled patterns are in the serialized
       data without actually decoding the patterns:

         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);

       The pcre2_serialize_decode() function reads a byte stream and recreates
       the compiled patterns in new memory blocks, setting pointers to them in
       a vector. The first two arguments are a pointer to a suitable vector
       and its length, and the third argument points to a byte stream. The
       final argument is a pointer to a general context, which can be used to
       specify custom memory management functions for the decoded patterns.
       If this argument is NULL, malloc() and free() are used. After deserial-
       ization, the byte stream is no longer needed and can be discarded.

         pcre2_code *list_of_codes[2];
         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes =
           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);

       If the vector is not large enough for all the patterns in the byte
       stream, it is filled with those that fit, and the remainder are
       ignored. The yield of the function is the number of decoded patterns,
       or one of the following negative error codes:

         PCRE2_ERROR_BADDATA            second argument is zero or less
         PCRE2_ERROR_BADMAGIC           mismatch of id bytes in the data
         PCRE2_ERROR_BADMODE            mismatch of code unit size or PCRE2 version
         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
         PCRE2_ERROR_MEMORY             memory allocation failed
         PCRE2_ERROR_NULL               first or third argument is NULL

       PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
       compiled on a system with different endianness.

       Decoded patterns can be used for matching in the usual way, and must be
       freed by calling pcre2_code_free().
       However, be aware that there is a
       potential race issue if you are using multiple patterns that were
       decoded from a single byte stream in a multithreaded application. A
       single copy of the character tables is used by all the decoded patterns
       and a reference count is used to arrange for its memory to be automati-
       cally freed when the last pattern is freed, but there is no locking on
       this reference count. Therefore, if you want to call pcre2_code_free()
       for these patterns in different threads, you must arrange your own
       locking, and ensure that pcre2_code_free() cannot be called by two
       threads at the same time.

       If a pattern was processed by pcre2_jit_compile() before being serial-
       ized, the JIT data is discarded and so is no longer available after a
       save/restore cycle. You can, however, process a restored pattern with
       pcre2_jit_compile() if you wish.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 24 May 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------


PCRE2STACK(3)              Library Functions Manual              PCRE2STACK(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 DISCUSSION OF STACK USAGE

       When you call pcre2_match(), it makes use of an internal function
       called match(). This calls itself recursively at branch points in the
       pattern, in order to remember the state of the match so that it can
       back up and try a different alternative after a failure. As matching
       proceeds deeper and deeper into the tree of possibilities, the recur-
       sion depth increases. The match() function is also called in other cir-
       cumstances, for example, whenever a parenthesized sub-pattern is
       entered, and in certain cases of repetition.
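
       The way that branch points turn into recursion depth can be seen in a
       toy matcher. The sketch below is not PCRE2's match() function, and the
       names toy_match(), toy_repeat(), and toy_depth() are hypothetical. It
       handles only one shape of pattern, a repeated single character followed
       by a literal string (comparable to a repeated parenthesized group such
       as (a)*b), and it records the deepest level of recursion reached.

```c
#include <assert.h>
#include <stddef.h>

/* Deepest recursion level reached by the most recent toy_match() call. */
static int toy_max_depth;

/* Match zero or more copies of 'c' (greedily) followed by the literal
   string 'rest', anchored at 's'. Each repetition is a branch point: the
   function calls itself to try one more 'c', and falls back to matching
   'rest' at this position if the deeper attempt fails. */
static int toy_repeat(const char *s, char c, const char *rest, int depth)
{
  const char *p, *q;

  if (depth > toy_max_depth) toy_max_depth = depth;

  /* Try the longer alternative first; this is where the stack grows. */
  if (*s == c && toy_repeat(s + 1, c, rest, depth + 1)) return 1;

  /* Back up: try to match the rest of the pattern here instead. */
  for (p = s, q = rest; *q != '\0'; p++, q++)
    if (*p != *q) return 0;
  return 1;
}

/* Returns 1 if 'subject' starts with zero or more 'c' followed by 'rest'. */
int toy_match(const char *subject, char c, const char *rest)
{
  assert(subject != NULL && rest != NULL && c != '\0');
  toy_max_depth = 0;
  return toy_repeat(subject, c, rest, 1);
}

/* Recursion depth reached by the last toy_match() call. */
int toy_depth(void)
{
  return toy_max_depth;
}
```

       Calling toy_match("aaaab", 'a', "b") succeeds after five nested calls,
       one per repetition plus one for the final position. The depth grows
       with the length of the run of repeated characters, which is the effect
       that the rest of this document is concerned with.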

       Not all calls of match() increase the recursion depth; for an item such
       as a* it may be called several times at the same level, after matching
       different numbers of a's. Furthermore, in a number of cases where the
       result of the recursive call would immediately be passed back as the
       result of the current call (a "tail recursion"), the function is just
       restarted instead.

       Each time the internal match() function is called recursively, it uses
       memory from the process stack. For certain kinds of pattern and data,
       very large amounts of stack may be needed, despite the recognition of
       "tail recursion". Note that if PCRE2 is compiled with the -fsani-
       tize=address option of the GCC compiler, the stack requirements are
       greatly increased.

       The above comments apply when pcre2_match() is run in its normal inter-
       pretive manner. If the compiled pattern was processed by pcre2_jit_com-
       pile(), and just-in-time compiling was successful, and the options
       passed to pcre2_match() were not incompatible, the matching process
       uses the JIT-compiled code instead of the match() function. In this
       case, the memory requirements are handled entirely differently. See the
       pcre2jit documentation for details.

       The pcre2_dfa_match() function operates in a different way to
       pcre2_match(), and uses recursion only when there is a regular expres-
       sion recursion or subroutine call in the pattern. This includes the
       processing of assertion and "once-only" subpatterns, which are handled
       like subroutine calls. Normally, these are never very deep, and the
       limit on the complexity of pcre2_dfa_match() is controlled by the
       amount of workspace it is given. However, it is possible to write pat-
       terns with runaway infinite recursions; such patterns will cause
       pcre2_dfa_match() to run out of stack. At present, there is no protec-
       tion against this.

       The comments that follow do NOT apply to pcre2_dfa_match(); they are
       relevant only for pcre2_match() without the JIT optimization.

   Reducing pcre2_match()'s stack usage

       You can often reduce the amount of recursion, and therefore the amount
       of stack used, by modifying the pattern that is being matched. Con-
       sider, for example, this pattern:

         ([^<]|<(?!inet))+

       It matches from wherever it starts until it encounters "<inet" or the
       end of the data, and is the kind of pattern that might be used when
       processing an XML file. Each iteration of the outer parentheses matches
       either one character that is not "<" or a "<" that is not followed by
       "inet". However, each time a parenthesis is processed, a recursion
       occurs, so this formulation uses a stack frame for each matched charac-
       ter. For a long string, a lot of stack is required. Consider now this
       rewritten pattern, which matches exactly the same strings:

         ([^<]++|<(?!inet))+

       This uses very much less stack, because runs of characters that do not
       contain "<" are "swallowed" in one item inside the parentheses. Recur-
       sion happens only when a "<" character that is not followed by "inet"
       is encountered (and we assume this is relatively rare). A possessive
       quantifier is used to stop any backtracking into the runs of non-"<"
       characters, but that is not related to stack usage.

       This example shows that one way of avoiding stack problems when match-
       ing long subject strings is to write repeated parenthesized subpatterns
       to match more than one character whenever possible.

   Compiling PCRE2 to use heap instead of stack for pcre2_match()

       In environments where stack memory is constrained, you might want to
       compile PCRE2 to use heap memory instead of stack for remembering back-
       up points when pcre2_match() is running.
       This makes it run more slowly,
       however. Details of how to do this are given in the pcre2build documen-
       tation. When built in this way, instead of using the stack, PCRE2 gets
       memory for remembering backup points from the heap. By default, the
       memory is obtained by calling the system malloc() function, but you can
       arrange to supply your own memory management function. For details, see
       the section entitled "The match context" in the pcre2api documentation.
       Since the block sizes are always the same, it may be possible to imple-
       ment a customized memory handler that is more efficient than the stan-
       dard function. The memory blocks obtained for this purpose are retained
       and re-used if possible while pcre2_match() is running. They are all
       freed just before it exits.

   Limiting pcre2_match()'s stack usage

       You can set limits on the number of times the internal match() function
       is called, both in total and recursively. If a limit is exceeded,
       pcre2_match() returns an error code. Setting suitable limits should
       prevent it from running out of stack. The default values of the limits
       are very large, and unlikely ever to operate. They can be changed when
       PCRE2 is built, and they can also be set when pcre2_match() is called.
       For details of these interfaces, see the pcre2build documentation and
       the section entitled "The match context" in the pcre2api documentation.

       As a very rough rule of thumb, you should reckon on about 500 bytes per
       recursion. Thus, if you want to limit your stack usage to 8MB, you
       should set the limit at 16000 recursions. A 64MB stack, on the other
       hand, can support around 128000 recursions.

       The pcre2test test program has a modifier called "find_limits" which,
       if applied to a subject line, causes it to find the smallest limits
       that allow a pattern to match.
       This is done by calling pcre2_match()
       repeatedly with different limits.

   Changing stack size in Unix-like systems

       In Unix-like environments, there is not often a problem with the stack
       unless very long strings are involved, though the default limit on
       stack size varies from system to system. Values from 8MB to 64MB are
       common. You can find your default limit by running the command:

         ulimit -s

       Unfortunately, the effect of running out of stack is often SIGSEGV,
       though sometimes a more explicit error message is given. You can nor-
       mally increase the limit on stack size by code such as this:

         #include <sys/resource.h>

         struct rlimit rlim;
         getrlimit(RLIMIT_STACK, &rlim);
         rlim.rlim_cur = 100*1024*1024;
         setrlimit(RLIMIT_STACK, &rlim);

       This reads the current limits (soft and hard) using getrlimit(), then
       attempts to increase the soft limit to 100MB using setrlimit(). You
       must do this before calling pcre2_match().

   Changing stack size in Mac OS X

       Using setrlimit(), as described above, should also work on Mac OS X. It
       is also possible to set a stack size when linking a program. There is a
       discussion about stack sizes in Mac OS X at this web site:
       http://developer.apple.com/qa/qa2005/qa1419.html.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 21 November 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are sup-
       ported by PCRE2 are described in the pcre2pattern documentation.
       This
       document contains a quick-reference summary of the syntax.


QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal


ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII printing character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
         \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
         \xhh       character with hex code hh
         \x{hhh..}  character with hex code hhh..

       Note that \0dd is always an octal code. The treatment of backslash fol-
       lowed by a non-zero digit is complicated; for details see the section
       "Non-printing characters" in the pcre2pattern documentation, where
       details of escape processing in EBCDIC environments are also given.

       When \x is not followed by {, from zero to two hexadecimal digits are
       read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
       imal digits to be recognized as a hexadecimal escape; otherwise it
       matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
       lowed by four hexadecimal digits, it matches a literal "u".


CHARACTER TYPES

         .
                    any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       By default, \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching is happening, \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the behav-
       iour of these escape sequences is changed to use Unicode properties and
       they match many more characters.


GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator


PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space char-
       acter set at release 5.18.


SCRIPT NAMES FOR \p AND \P

       Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese,
       Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese,
       Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham,
       Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
       Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic, Geor-
       gian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han,
       Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
       Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
       nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
       Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
       jani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui,
       Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
       Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki,
       Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
       Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
       Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic,
       Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
       Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
       Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai,
       Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.


CHARACTER CLASSES

         [...]       positive character class
         [^...]
                     negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.


QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy


ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject


MATCH POINT RESET

         \K          reset start of match

       \K is honoured in positive assertions, but ignored in negative ones.


ALTERNATION

         expr|expr|expr...


CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                          capturing groups in each alternative


ATOMIC GROUPS

         (?>...)         atomic, non-capturing group


COMMENT

         (?#....)        comment (not nestable)


OPTION SETTING

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended (ignore white space)
         (?-...)         unset option(s)

       The following are recognized only at the very start of a pattern or
       after one of the newline or \R options with similar syntax. More than
       one of them may appear.

         (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
         (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
         (*NOTEMPTY)           set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART)   set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS)    no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR)  no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)             disable JIT optimization
         (*NO_START_OPT)       no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)                set appropriate UTF mode for the library in use
         (*UCP)                set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
       the limits set by the caller of pcre2_match(), not increase them. The
       application can lock out the use of (*UTF) and (*UCP) by setting the
       PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile
       time.


NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence


WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after
       option setting with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence


LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)         positive look ahead
         (?!...)         negative look ahead
         (?<=...)        positive look behind
         (?<!...)        negative look behind

       Each top-level branch of a look behind must be of a fixed length.


BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)


SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subpattern by absolute number
         (?+n)           call subpattern by relative number
         (?-n)           call subpattern by relative number
         (?&name)        call subpattern by name (Perl)
         (?P>name)       call subpattern by name (Python)
         \g<name>        call subpattern by name (Oniguruma)
         \g'name'        call subpattern by name (Oniguruma)
         \g<n>           call subpattern by absolute number (Oniguruma)
         \g'n'           call subpattern by absolute number (Oniguruma)
         \g<+n>          call subpattern by relative number (PCRE2 extension)
         \g'+n'          call subpattern by relative number (PCRE2 extension)
         \g<-n>          call subpattern by relative number (PCRE2 extension)
         \g'-n'          call subpattern by relative number (PCRE2 extension)


CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition
         (?(-n)              relative reference condition
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2)
         (?(R)               overall recursion condition
         (?(Rn)              specific group recursion condition
         (?(R&name)          specific recursion condition
         (?(DEFINE)          define subpattern for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition


BACKTRACKING CONTROL

       The following act immediately they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a back-
       track to reach them. They all force a match failure, but they differ in
       what happens afterwards. Those that advance the start-of-match point do
       so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation
         (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)


CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched with the
       ending delimiter }. To encode the ending delimiter within the string,
       double it.
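
       The delimiter rule for callout strings can be applied mechanically when
       a pattern is assembled by a program. The following is a minimal sketch,
       not part of the PCRE2 API: the function name build_callout() and its
       parameters are hypothetical. It wraps arbitrary text as a (?C...) item,
       doubling every occurrence of the ending delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write a callout item of the form (?C<open>text<close>) for 'text' into
   'out', whose capacity is 'outsize' bytes including the terminating NUL.
   'open' is the starting delimiter: one of ` ' " ^ % # $ or {. The ending
   delimiter is the same character, except that { is closed by }.
   Returns the length written, or 0 if the delimiter is not allowed or the
   buffer might be too small. */
size_t build_callout(char *out, size_t outsize, const char *text, char open)
{
  char close;
  size_t n = 0;
  const char *p;

  assert(out != NULL && text != NULL);

  /* strchr() also matches the terminating NUL, so exclude it explicitly. */
  if (open == '\0' || strchr("`'\"^%#${", open) == NULL) return 0;
  close = (open == '{') ? '}' : open;

  /* Worst case: every character doubled, plus "(?C", the two delimiters,
     ")" and the terminating NUL. */
  if (outsize < 2 * strlen(text) + 7) return 0;

  out[n++] = '(';
  out[n++] = '?';
  out[n++] = 'C';
  out[n++] = open;
  for (p = text; *p != '\0'; p++)
    {
    if (*p == close) out[n++] = close;   /* double the ending delimiter */
    out[n++] = *p;
    }
  out[n++] = close;
  out[n++] = ')';
  out[n] = '\0';
  return n;
}
```

       For example, the text a}b with the starting delimiter { is written as
       (?C{a}}b}), and text that contains a double quote can be wrapped either
       by doubling the quote or by choosing a different delimiter.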


SEE ALSO

       pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 16 October 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

UNICODE AND UTF SUPPORT

       When PCRE2 is built with Unicode support (which is the default), it has
       knowledge of Unicode character properties and can process text strings
       in UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
       However, by default, PCRE2 assumes that one code unit is one character.
       To process a pattern as a UTF string, where a character may require
       more than one code unit, you must call pcre2_compile() with the
       PCRE2_UTF option flag, or the pattern must start with the sequence
       (*UTF). When either of these is the case, both the pattern and any sub-
       ject strings that are matched against it are treated as UTF strings
       instead of strings of individual one-code-unit characters.

       If you do not need Unicode support you can build PCRE2 without it, in
       which case the library will be smaller.


UNICODE PROPERTY SUPPORT

       When PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. The Unicode properties that can be tested
       are limited to the general category properties such as Lu for an upper
       case letter or Nd for a decimal number, the Unicode script names such
       as Arabic or Han, and the derived properties Any and L&. Full lists are
       given in the pcre2pattern and pcre2syntax documentation.
       Only the short
       names for properties are supported. For example, \p{L} matches a let-
       ter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in
       Perl, many properties may optionally be prefixed by "Is", for compati-
       bility with Perl 5.6. PCRE2 does not support this.


WIDE CHARACTERS AND UTF MODES

       Codepoints less than 256 can be specified in patterns by either braced
       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
       Larger values have to use braced sequences. Unbraced octal code points
       up to \777 are also recognized; larger ones can be coded using \o{...}.

       In UTF modes, repeat quantifiers apply to complete UTF characters, not
       to individual code units.

       In UTF modes, the dot metacharacter matches one UTF character instead
       of a single code unit.

       The escape sequence \C can be used to match a single code unit in a UTF
       mode, but its use can lead to some strange effects because it breaks up
       multi-unit characters (see the description of \C in the pcre2pattern
       documentation).

       The use of \C is not supported by the alternative matching function
       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
       ter may consist of more than one code unit. The use of \C in these
       modes provokes a match-time error. Also, the JIT optimization does not
       support \C in these modes. If JIT optimization is requested for a UTF-8
       or UTF-16 pattern that contains \C, it will not succeed, and so when
       pcre2_match() is called, the matching will be carried out by the normal
       interpretive function.

       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
       characters of any code value, but, by default, the characters that
       PCRE2 recognizes as digits, spaces, or word characters remain the same
       set as in non-UTF mode, all with code points less than 256.
       This remains true even when PCRE2 is built to include Unicode support,
       because to do otherwise would slow down matching in many common cases.
       Note that this also applies to \b and \B, because they are defined in
       terms of \w and \W. If you want to test for a wider sense of, say,
       "digit", you can use explicit Unicode property tests such as \p{Nd}.
       Alternatively, if you set the PCRE2_UCP option, the way that the
       character escapes work is changed so that Unicode properties are used
       to determine which characters match. There are more details in the
       section on generic character types in the pcre2pattern documentation.

       Similarly, characters that match the POSIX named character classes are
       all low-valued characters, unless the PCRE2_UCP option is set.

       However, the special horizontal and vertical white space matching
       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode
       characters, whether or not PCRE2_UCP is set.

       Case-insensitive matching in UTF mode makes use of Unicode properties.
       A few Unicode characters such as Greek sigma have more than two code
       points that are case-equivalent, and these are treated as such.


VALIDITY OF UTF STRINGS

       When the PCRE2_UTF option is set, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the relevant
       functions. If an invalid UTF string is passed, a negative error code
       is returned. The code unit offset to the offending character can be
       extracted from the match data block by calling pcre2_get_startchar(),
       which is used for this purpose after a UTF error.

       UTF-16 and UTF-32 strings can indicate their endianness by a special
       code known as a byte-order mark (BOM). The PCRE2 functions do not
       handle this, expecting strings to be in host byte order.

       A UTF string is checked before any other processing takes place.
       In the case of pcre2_match() and pcre2_dfa_match() calls with a
       non-zero starting offset, the check is applied only to that part of the
       subject that could be inspected during matching, and there is a check
       that the starting offset points to the first code unit of a character
       or to the end of the subject. If there are no lookbehind assertions in
       the pattern, the check starts at the starting offset. Otherwise, it
       starts at the length of the longest lookbehind before the starting
       offset, or at the start of the subject if there are not that many
       characters before the starting offset. Note that the sequences \b and
       \B are one-character lookbehinds.

       In addition to checking the format of the string, there is a check to
       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
       the surrogate area. The so-called "non-character" code points are not
       excluded because Unicode corrigendum #9 makes it clear that they should
       not be.

       Characters in the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode code points with values
       greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
       are available independently in the UTF-8 and UTF-32 encodings. (In
       other words, the whole surrogate thing is a fudge for UTF-16 which
       unfortunately messes up UTF-8 and UTF-32.)

       In some situations, you may already know that your strings are valid,
       and therefore want to skip these checks in order to improve
       performance, for example in the case of a long subject string that is
       being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at
       compile time or at match time, PCRE2 assumes that the pattern or
       subject it is given (respectively) contains only valid UTF code unit
       sequences.
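
       One way to "already know" that a subject is valid is to check it once
       in the application before scanning it repeatedly. The following is a
       minimal sketch of such a pre-check for 8-bit code units. The function
       utf8_first_invalid() is a hypothetical helper written for this
       illustration; it is not part of PCRE2 and is not PCRE2's own checking
       code. It applies the RFC 3629 rules: well-formed continuation bytes, no
       overlong encodings, no surrogates, and no code points above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): scan a buffer of 8-bit code
   units and return the offset of the first invalid UTF-8 character, or -1
   if the whole buffer is valid according to RFC 3629. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;       /* number of continuation bytes expected */
        uint32_t cp;        /* code point being assembled */
        uint32_t min;       /* smallest value that needs this many bytes */

        if (c < 0x80) { i++; continue; }                 /* one-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;   /* stray continuation byte, 5- or 6-byte lead
                                  byte, or 0xfe/0xff */

        if (len - i <= extra) return (long)i;            /* truncated character */

        for (size_t k = 1; k <= extra; k++) {
            /* Each continuation byte must have the top two bits 0b10. */
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }

        if (cp < min) return (long)i;                      /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;  /* surrogate */
        if (cp > 0x10FFFF) return (long)i;                 /* out of range */

        i += extra + 1;
    }
    return -1;
}
```

       If such a check returns -1 for a subject, passing PCRE2_NO_UTF_CHECK
       on later pcre2_match() calls for that unchanged subject should be safe.
       This sketch reports only an offset; it makes no attempt to reproduce
       the individual PCRE2 error codes.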

       Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
       for the pattern; it does not also apply to subject strings. If you want
       to disable the check for a subject string you must pass this option to
       pcre2_match() or pcre2_dfa_match().

       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
       result is undefined and your program may crash or loop indefinitely.

   Errors in UTF-8 strings

       The following negative error codes are given for invalid UTF-8 strings:

         PCRE2_ERROR_UTF8_ERR1
         PCRE2_ERROR_UTF8_ERR2
         PCRE2_ERROR_UTF8_ERR3
         PCRE2_ERROR_UTF8_ERR4
         PCRE2_ERROR_UTF8_ERR5

       The string ends with a truncated UTF-8 character; the code specifies
       how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
       characters to be no longer than 4 bytes, the encoding scheme
       (originally defined by RFC 2279) allows for up to 6 bytes, and this is
       checked first; hence the possibility of 4 or 5 missing bytes.

         PCRE2_ERROR_UTF8_ERR6
         PCRE2_ERROR_UTF8_ERR7
         PCRE2_ERROR_UTF8_ERR8
         PCRE2_ERROR_UTF8_ERR9
         PCRE2_ERROR_UTF8_ERR10

       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
       the character do not have the binary value 0b10 (that is, either the
       most significant bit is 0, or the next bit is 1).

         PCRE2_ERROR_UTF8_ERR11
         PCRE2_ERROR_UTF8_ERR12

       A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
       long; these code points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR14

       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.

         PCRE2_ERROR_UTF8_ERR15
         PCRE2_ERROR_UTF8_ERR16
         PCRE2_ERROR_UTF8_ERR17
         PCRE2_ERROR_UTF8_ERR18
         PCRE2_ERROR_UTF8_ERR19

       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
       for a value that can be represented by fewer bytes, which is invalid.
       For example, the two bytes 0xc0, 0xae give the value 0x2e, whose
       correct coding uses just one byte.

         PCRE2_ERROR_UTF8_ERR20

       The two most significant bits of the first byte of a character have the
       binary value 0b10 (that is, the most significant bit is 1 and the
       second is 0). Such a byte can only validly occur as the second or
       subsequent byte of a multi-byte character.

         PCRE2_ERROR_UTF8_ERR21

       The first byte of a character has the value 0xfe or 0xff. These values
       can never occur in a valid UTF-8 string.

   Errors in UTF-16 strings

       The following negative error codes are given for invalid UTF-16
       strings:

         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate


   Errors in UTF-32 strings

       The following negative error codes are given for invalid UTF-32
       strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 03 July 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------

