<html>
<head>
<title>pcre2syntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2syntax man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">REPORTED MATCH POINT SETTING</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE2 are described in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
<P>
This table applies to ASCII and Unicode environments.
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII printing character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
  \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
  \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
  \xhh       character with hex code hh
  \x{hh..}   character with hex code hh..
</pre>
Note that \0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation, where details of escape processing in EBCDIC environments are
also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
supported in EBCDIC environments. Note that \N not followed by an opening
curly bracket has a different meaning (see below).
</P>
<P>
When \x is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one code unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
\C is dangerous because it may leave the current matching point in the middle
of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
with the use of \C permanently disabled.
</P>
<P>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more
characters.
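</P>
<P>
The difference between ASCII-only and Unicode-property matching for \d and \w
can be sketched with Python's standard <b>re</b> module. This is an analogy
only: Python's <b>re</b> is a different engine from PCRE2, its defaults are
the opposite way round (Unicode-aware unless re.ASCII is given), and the
sample strings are made up for illustration.

```python
import re

# Hypothetical sample: ASCII digits, a space, then ARABIC-INDIC DIGIT FOUR and TWO.
sample = "42 \u0664\u0662"

# re.ASCII restricts \d to [0-9], comparable to PCRE2's default behaviour.
ascii_digits = re.findall(r"\d+", sample, flags=re.ASCII)

# Without re.ASCII, \d uses Unicode properties, comparable in spirit to PCRE2_UCP.
unicode_digits = re.findall(r"\d+", sample)

# The same contrast for \w, using a word that contains U+00E9.
ascii_word = re.fullmatch(r"\w+", "caf\u00e9", flags=re.ASCII)
unicode_word = re.fullmatch(r"\w+", "caf\u00e9")

print(ascii_digits)
print(unicode_digits)
print(ascii_word is None, unicode_word is not None)
```

In the ASCII-only case the non-ASCII digits and the accented letter are not
matched; with Unicode properties they are.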
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18.
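</P>
<P>
The definitions of Xan, Xps/Xsp, and Xwd in the table above can be modelled
with Python's standard <b>unicodedata</b> module. This is a sketch of the
definitions as written, not of PCRE2's implementation; the function names are
hypothetical, and Python's Unicode tables may be a different version from the
ones a given PCRE2 build uses.

```python
import unicodedata


def is_xan(ch: str) -> bool:
    """Xan: general category L (letter) or N (number)."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch: str) -> bool:
    """Xps and Xsp: general category Z, or tab, NL, VT, FF, CR."""
    return unicodedata.category(ch)[0] == "Z" or ch in ("\t", "\n", "\v", "\f", "\r")


def is_xwd(ch: str) -> bool:
    """Xwd: anything in Xan, or underscore."""
    return is_xan(ch) or ch == "_"


for ch in ("a", "5", "_", " ", "\v", "\u2028", "-"):
    print(repr(ch), is_xan(ch), is_xps(ch), is_xwd(ch))
```

Underscore is in Xwd but not in Xan, and VT counts as space although its
general category is Cc rather than Z.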
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Adlam,
Ahom,
Anatolian_Hieroglyphs,
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bassa_Vah,
Batak,
Bengali,
Bhaiksuki,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Caucasian_Albanian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Dogra,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Grantha,
Greek,
Gujarati,
Gunjala_Gondi,
Gurmukhi,
Han,
Hangul,
Hanifi_Rohingya,
Hanunoo,
Hatran,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Khojki,
Khudawadi,
Lao,
Latin,
Lepcha,
Limbu,
Linear_A,
Linear_B,
Lisu,
Lycian,
Lydian,
Mahajani,
Makasar,
Malayalam,
Mandaic,
Manichaean,
Marchen,
Masaram_Gondi,
Medefaidrin,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Modi,
Mongolian,
Mro,
Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
Newa,
Nko,
Nushu,
Ogham,
Ol_Chiki,
Old_Hungarian,
Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
Old_Sogdian,
Old_South_Arabian,
Old_Turkic,
Oriya,
Osage,
Osmanya,
Pahawh_Hmong,
Palmyrene,
Pau_Cin_Hau,
Phags_Pa,
Phoenician,
Psalter_Pahlavi,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Siddham,
SignWriting,
Sinhala,
Sogdian,
Sora_Sompeng,
Soyombo,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Tangut,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
Yi,
Zanabazar_Square.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
                also after an internal newline in multiline mode
                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  \A          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \Z          end of subject
                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
<br><a name="SEC11" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
<P>
<pre>
  \K          set reported start of match
</pre>
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?>...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
Changes of these options within a group are automatically cancelled at the end
of the group.
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?n)            no auto capture
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended: ignore white space except in classes
  (?xx)           as (?x) but also ignore space and tab in classes
  (?-...)         unset option(s)
  (?^)            unset imnsx options
</pre>
Unsetting x or xx unsets both. Several options may be set at once, and a
mixture of setting and unsetting such as (?i-x) is allowed, but there may be
only one hyphen. Setting (but not unsetting) is allowed after (?^, for example
(?^in). An option setting may appear at the start of a non-capturing group, for
example (?i:...).
</P>
<P>
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
appear. For the first three, d is a decimal number.
<pre>
  (*LIMIT_DEPTH=d)      set the backtracking limit to d
  (*LIMIT_HEAP=d)       set the heap size limit to d * 1024 bytes
  (*LIMIT_MATCH=d)      set the match limit to d
  (*NOTEMPTY)           set PCRE2_NOTEMPTY when matching
  (*NOTEMPTY_ATSTART)   set PCRE2_NOTEMPTY_ATSTART when matching
  (*NO_AUTO_POSSESS)    no auto-possessification (PCRE2_NO_AUTO_POSSESS)
  (*NO_DOTSTAR_ANCHOR)  no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
  (*NO_JIT)             disable JIT optimization
  (*NO_START_OPT)       no start-match optimization (PCRE2_NO_START_OPTIMIZE)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
the limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>,
not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
application can lock out the use of (*UTF) and (*UCP) by setting the
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
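</P>
<P>
The letter options listed earlier in this section, such as (?i:...), (?s) and
(?x), can be tried out with Python's standard <b>re</b> module, which accepts
the same spelling for these three. This is an illustration under that
assumption only: Python's <b>re</b> is not PCRE2, it does not recognize the
(*...) items above, and recent Python versions accept a bare (?i) or (?x) only
at the very start of a pattern. The patterns and subjects are made up.

```python
import re

# (?i:...) applies caseless matching to the enclosed group only.
scoped = re.compile(r"a(?i:b)c")
scoped_hit = scoped.fullmatch("aBc")    # "B" is inside the caseless group
scoped_miss = scoped.fullmatch("ABc")   # "A" is outside it, so still case-sensitive

# (?s) makes "." match a newline as well (dotall).
dotall_hit = re.fullmatch(r"(?s)a.b", "a\nb")
dotall_miss = re.fullmatch(r"a.b", "a\nb")

# (?x) ignores white space in the pattern, except inside a character class.
extended_hit = re.fullmatch(r"(?x) a b [ ] c", "ab c")
extended_miss = re.fullmatch(r"(?x) a b [ ] c", "abc")

print(bool(scoped_hit), bool(scoped_miss))
print(bool(dotall_hit), bool(dotall_miss))
print(bool(extended_hit), bool(extended_miss))
```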
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
  (*NUL)          the NUL character (binary zero)
</PRE>
</P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
setting with a similar syntax.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
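</P>
<P>
The four assertion forms in the table above are spelled the same way in
Python's standard <b>re</b> module, so they can be sketched there. This is an
illustration, not PCRE2 itself, and the subject string is invented. One known
difference: Python's <b>re</b> requires the whole look behind to have a single
fixed width, which is stricter than the per-branch rule stated above.

```python
import re

# Hypothetical subject string.
text = "price: $42, qty: 7, total: $294"

# Positive look behind: digits immediately preceded by "$".
dollars = re.findall(r"(?<=\$)\d+", text)

# Negative look behind: digits not preceded by "$" or by another digit.
counts = re.findall(r"(?<![$\d])\d+", text)

# Positive look ahead: a word immediately followed by ":".
labels = re.findall(r"\w+(?=:)", text)

# Negative look ahead: digits not followed by another digit or a comma.
last = re.findall(r"\d+(?![\d,])", text)

print(dollars)
print(counts)
print(labels)
print(last)
```

None of the assertions consume characters: the "$" and ":" that they test for
are not part of the reported matches.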
</P>
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g+n            relative reference by number (PCRE2 extension)
  \g-n            relative reference by number
  \g{+n}          relative reference by number (PCRE2 extension)
  \g{-n}          relative reference by number
  \k<name>        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \g<name>        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g<n>           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g<+n>          call subpattern by relative number (PCRE2 extension)
  \g'+n'          call subpattern by relative number (PCRE2 extension)
  \g<-n>          call subpattern by relative number (PCRE2 extension)
  \g'-n'          call subpattern by relative number (PCRE2 extension)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)               absolute reference condition
  (?(+n)              relative reference condition
  (?(-n)              relative reference condition
  (?(<name>)          named reference condition (Perl)
  (?('name')          named reference condition (Perl)
  (?(name)            named reference condition (PCRE2, deprecated)
  (?(R)               overall recursion condition
  (?(Rn)              specific numbered group recursion condition
  (?(R&name)          specific named group recursion condition
  (?(DEFINE)          define subpattern for reference
  (?(VERSION[>]=n.m)  test PCRE2 version
  (?(assert)          assertion condition
</pre>
Note the ambiguity of (?(R) and (?(Rn), which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
</P>
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
if :NAME is present. The others just set a name for passing back to the caller,
but this is not a name that (*SKIP) can see. The following act immediately when
they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
</pre>
The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call.
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)            callout (assumed number 0)
  (?Cn)           callout with numerical data n
  (?C"text")      callout with string data
</pre>
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 02 September 2018
<br>
Copyright © 1997-2018 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>