1<html>
2<head>
3<title>pcre2syntax specification</title>
4</head>
5<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6<h1>pcre2syntax man page</h1>
7<p>
8Return to the <a href="index.html">PCRE2 index page</a>.
9</p>
10<p>
11This page is part of the PCRE2 HTML documentation. It was generated
12automatically from the original man page. If there is any nonsense in it,
13please consult the man page, in case the conversion went wrong.
14<br>
15<ul>
16<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
17<li><a name="TOC2" href="#SEC2">QUOTING</a>
18<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
19<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
20<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
21<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
22<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
23<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
24<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
25<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
26<li><a name="TOC11" href="#SEC11">REPORTED MATCH POINT SETTING</a>
27<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
28<li><a name="TOC13" href="#SEC13">CAPTURING</a>
29<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
30<li><a name="TOC15" href="#SEC15">COMMENT</a>
31<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
32<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
33<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
34<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
35<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
36<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
37<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
38<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
39<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
40<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
41<li><a name="TOC26" href="#SEC26">AUTHOR</a>
42<li><a name="TOC27" href="#SEC27">REVISION</a>
43</ul>
44<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
45<P>
46The full syntax and semantics of the regular expressions that are supported by
47PCRE2 are described in the
48<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
49documentation. This document contains a quick-reference summary of the syntax.
50</P>
51<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
52<P>
53<pre>
54  \x         where x is non-alphanumeric is a literal x
55  \Q...\E    treat enclosed characters as literal
56</PRE>
57</P>
58<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
59<P>
60This table applies to ASCII and Unicode environments.
61<pre>
62  \a         alarm, that is, the BEL character (hex 07)
63  \cx        "control-x", where x is any ASCII printing character
64  \e         escape (hex 1B)
65  \f         form feed (hex 0C)
66  \n         newline (hex 0A)
67  \r         carriage return (hex 0D)
68  \t         tab (hex 09)
69  \0dd       character with octal code 0dd
70  \ddd       character with octal code ddd, or backreference
71  \o{ddd..}  character with octal code ddd..
72  \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
73  \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
74  \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
75  \xhh       character with hex code hh
76  \x{hh..}   character with hex code hh..
77</pre>
78Note that \0dd is always an octal code. The treatment of backslash followed by
79a non-zero digit is complicated; for details see the section
80<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
81in the
82<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
83documentation, where details of escape processing in EBCDIC environments are
84also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
85supported in EBCDIC environments. Note that \N not followed by an opening
86curly bracket has a different meaning (see below).
87</P>
88<P>
89When \x is not followed by {, from zero to two hexadecimal digits are read,
90but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
91be recognized as a hexadecimal escape; otherwise it matches a literal "x".
92Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
93it matches a literal "u".
94</P>
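<P>
As a rough illustration of the hex and octal escapes in the table, the sketch
below uses Python's <b>re</b> module. This is an assumption made for
convenience: <b>re</b> is a different engine from PCRE2, and is used here only
because it reads \xhh, \t, \0dd and three-digit \ddd the same way. The variable
names are illustrative.
</P>

```python
import re

# A raw string is used so that the regex engine, not the Python string
# parser, interprets the escapes.
pattern = re.compile(r"\x41\t\101\012")

# \x41 is "A" (hex 41), \t is tab (hex 09), \101 is "A" (octal 101),
# and \012 is newline (octal 012, hex 0A).
subject = "A\tA\n"
matched = pattern.fullmatch(subject) is not None
print(matched)
```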
95<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
96<P>
97<pre>
98  .          any character except newline;
99               in dotall mode, any character whatsoever
100  \C         one code unit, even in UTF mode (best avoided)
101  \d         a decimal digit
102  \D         a character that is not a decimal digit
103  \h         a horizontal white space character
104  \H         a character that is not a horizontal white space character
105  \N         a character that is not a newline
106  \p{<i>xx</i>}     a character with the <i>xx</i> property
107  \P{<i>xx</i>}     a character without the <i>xx</i> property
108  \R         a newline sequence
109  \s         a white space character
110  \S         a character that is not a white space character
111  \v         a vertical white space character
112  \V         a character that is not a vertical white space character
113  \w         a "word" character
114  \W         a "non-word" character
115  \X         a Unicode extended grapheme cluster
116</pre>
117\C is dangerous because it may leave the current matching point in the middle
118of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
119setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
120with the use of \C permanently disabled.
121</P>
122<P>
123By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
124or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
125happening, \s and \w may also match characters with code points in the range
126128-255. If the PCRE2_UCP option is set, the behaviour of these escape
127sequences is changed to use Unicode properties and they match many more
128characters.
129</P>
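<P>
The difference between ASCII-only and Unicode-aware \d can be sketched with
Python's <b>re</b> module, with one caveat: the defaults are reversed. In
Python, Unicode matching is the default for string patterns and the
<b>re.ASCII</b> flag restricts it, whereas PCRE2 is ASCII-only until PCRE2_UCP
is set. The example is an analogy, not PCRE2 behaviour, and its variable names
are illustrative.
</P>

```python
import re

arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd

# With re.ASCII, \d matches only 0-9 (comparable to the PCRE2 default).
ascii_only = re.fullmatch(r"\d", arabic_indic_three, re.ASCII)

# Without re.ASCII, \d uses Unicode properties (comparable to PCRE2_UCP).
unicode_aware = re.fullmatch(r"\d", arabic_indic_three)

print(ascii_only is None, unicode_aware is not None)
```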
130<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
131<P>
132<pre>
133  C          Other
134  Cc         Control
135  Cf         Format
136  Cn         Unassigned
137  Co         Private use
138  Cs         Surrogate
139
140  L          Letter
141  Ll         Lower case letter
142  Lm         Modifier letter
143  Lo         Other letter
144  Lt         Title case letter
145  Lu         Upper case letter
146  L&         Ll, Lu, or Lt
147
148  M          Mark
149  Mc         Spacing mark
150  Me         Enclosing mark
151  Mn         Non-spacing mark
152
153  N          Number
154  Nd         Decimal number
155  Nl         Letter number
156  No         Other number
157
158  P          Punctuation
159  Pc         Connector punctuation
160  Pd         Dash punctuation
161  Pe         Close punctuation
162  Pf         Final punctuation
163  Pi         Initial punctuation
164  Po         Other punctuation
165  Ps         Open punctuation
166
167  S          Symbol
168  Sc         Currency symbol
169  Sk         Modifier symbol
170  Sm         Mathematical symbol
171  So         Other symbol
172
173  Z          Separator
174  Zl         Line separator
175  Zp         Paragraph separator
176  Zs         Space separator
177</PRE>
178</P>
179<br><a name="SEC6" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
180<P>
181<pre>
182  Xan        Alphanumeric: union of properties L and N
183  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
184  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
186               represented by a Universal Character Name
187  Xwd        Perl word: property Xan or underscore
188</pre>
189Perl and POSIX space are now the same. Perl added VT to its space character set
190at release 5.18.
191</P>
192<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
193<P>
194Adlam,
195Ahom,
196Anatolian_Hieroglyphs,
197Arabic,
198Armenian,
199Avestan,
200Balinese,
201Bamum,
202Bassa_Vah,
203Batak,
204Bengali,
205Bhaiksuki,
206Bopomofo,
207Brahmi,
208Braille,
209Buginese,
210Buhid,
211Canadian_Aboriginal,
212Carian,
213Caucasian_Albanian,
214Chakma,
215Cham,
216Cherokee,
217Common,
218Coptic,
219Cuneiform,
220Cypriot,
221Cyrillic,
222Deseret,
223Devanagari,
224Dogra,
225Duployan,
226Egyptian_Hieroglyphs,
227Elbasan,
228Ethiopic,
229Georgian,
230Glagolitic,
231Gothic,
232Grantha,
233Greek,
234Gujarati,
235Gunjala_Gondi,
236Gurmukhi,
237Han,
238Hangul,
239Hanifi_Rohingya,
240Hanunoo,
241Hatran,
242Hebrew,
243Hiragana,
244Imperial_Aramaic,
245Inherited,
246Inscriptional_Pahlavi,
247Inscriptional_Parthian,
248Javanese,
249Kaithi,
250Kannada,
251Katakana,
252Kayah_Li,
253Kharoshthi,
254Khmer,
255Khojki,
256Khudawadi,
257Lao,
258Latin,
259Lepcha,
260Limbu,
261Linear_A,
262Linear_B,
263Lisu,
264Lycian,
265Lydian,
266Mahajani,
267Makasar,
268Malayalam,
269Mandaic,
270Manichaean,
271Marchen,
272Masaram_Gondi,
273Medefaidrin,
274Meetei_Mayek,
275Mende_Kikakui,
276Meroitic_Cursive,
277Meroitic_Hieroglyphs,
278Miao,
279Modi,
280Mongolian,
281Mro,
282Multani,
283Myanmar,
284Nabataean,
285New_Tai_Lue,
286Newa,
287Nko,
288Nushu,
289Ogham,
290Ol_Chiki,
291Old_Hungarian,
292Old_Italic,
293Old_North_Arabian,
294Old_Permic,
295Old_Persian,
296Old_Sogdian,
297Old_South_Arabian,
298Old_Turkic,
299Oriya,
300Osage,
301Osmanya,
302Pahawh_Hmong,
303Palmyrene,
304Pau_Cin_Hau,
305Phags_Pa,
306Phoenician,
307Psalter_Pahlavi,
308Rejang,
309Runic,
310Samaritan,
311Saurashtra,
312Sharada,
313Shavian,
314Siddham,
315SignWriting,
316Sinhala,
317Sogdian,
318Sora_Sompeng,
319Soyombo,
320Sundanese,
321Syloti_Nagri,
322Syriac,
323Tagalog,
324Tagbanwa,
325Tai_Le,
326Tai_Tham,
327Tai_Viet,
328Takri,
329Tamil,
330Tangut,
331Telugu,
332Thaana,
333Thai,
334Tibetan,
335Tifinagh,
336Tirhuta,
337Ugaritic,
338Vai,
339Warang_Citi,
340Yi,
341Zanabazar_Square.
342</P>
343<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
344<P>
345<pre>
346  [...]       positive character class
347  [^...]      negative character class
348  [x-y]       range (can be used for hex characters)
349  [[:xxx:]]   positive POSIX named set
350  [[:^xxx:]]  negative POSIX named set
351
352  alnum       alphanumeric
353  alpha       alphabetic
354  ascii       0-127
355  blank       space or tab
356  cntrl       control character
357  digit       decimal digit
358  graph       printing, excluding space
359  lower       lower case letter
360  print       printing, including space
361  punct       printing, excluding alphanumeric
362  space       white space
363  upper       upper case letter
364  word        same as \w
365  xdigit      hexadecimal digit
366</pre>
367In PCRE2, POSIX character set names recognize only ASCII characters by default,
368but some of them use Unicode properties if PCRE2_UCP is set. You can use
369\Q...\E inside a character class.
370</P>
371<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
372<P>
373<pre>
374  ?           0 or 1, greedy
375  ?+          0 or 1, possessive
376  ??          0 or 1, lazy
377  *           0 or more, greedy
378  *+          0 or more, possessive
379  *?          0 or more, lazy
380  +           1 or more, greedy
381  ++          1 or more, possessive
382  +?          1 or more, lazy
383  {n}         exactly n
384  {n,m}       at least n, no more than m, greedy
385  {n,m}+      at least n, no more than m, possessive
386  {n,m}?      at least n, no more than m, lazy
387  {n,}        n or more, greedy
388  {n,}+       n or more, possessive
389  {n,}?       n or more, lazy
390</PRE>
391</P>
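<P>
A minimal sketch of greedy versus lazy repetition, written with Python's
<b>re</b> module on the assumption that these quantifiers behave the same way
there as in PCRE2. Possessive quantifiers are left out because older Python
versions do not support them. The subject strings are made up for the example.
</P>

```python
import re

subject = '"a" "b"'

# Greedy: .+ takes as much as possible, then backtracks to the last quote.
greedy = re.search(r'".+"', subject).group()

# Lazy: .+? takes as little as possible, stopping at the first closing quote.
lazy = re.search(r'".+?"', subject).group()

# {n,m}: at least 2 and at most 3 repetitions, so four "a"s cannot match whole.
three_ok = re.fullmatch(r"a{2,3}", "aaa") is not None
four_ok = re.fullmatch(r"a{2,3}", "aaaa") is not None

print(greedy, lazy, three_ok, four_ok)
```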
392<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
393<P>
394<pre>
395  \b          word boundary
396  \B          not a word boundary
397  ^           start of subject
398                also after an internal newline in multiline mode
399                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
400  \A          start of subject
401  $           end of subject
402                also before newline at end of subject
403                also before internal newline in multiline mode
404  \Z          end of subject
405                also before newline at end of subject
406  \z          end of subject
407  \G          first matching position in subject
408</PRE>
409</P>
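<P>
The following sketch shows ^, $, \A and \b using Python's <b>re</b> module as a
stand-in for PCRE2. It deliberately avoids \Z, because Python gives \Z the
meaning that PCRE2 gives \z. The subject strings are illustrative.
</P>

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
first_only = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after an internal newline.
each_line = re.findall(r"^\w+", subject, re.MULTILINE)

# \A matches only at the start of the subject, even in multiline mode.
strict_start = re.findall(r"\A\w+", subject, re.MULTILINE)

# $ also matches just before a newline at the end of the subject.
before_final_newline = re.search(r"two$", subject) is not None

# \b requires a word boundary on both sides, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(first_only, each_line, strict_start, before_final_newline, whole_words)
```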
410<br><a name="SEC11" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
411<P>
412<pre>
413  \K          set reported start of match
414</pre>
415\K is honoured in positive assertions, but ignored in negative ones.
416</P>
417<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
418<P>
419<pre>
420  expr|expr|expr...
421</PRE>
422</P>
423<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
424<P>
425<pre>
426  (...)           capturing group
427  (?&#60;name&#62;...)    named capturing group (Perl)
428  (?'name'...)    named capturing group (Perl)
429  (?P&#60;name&#62;...)   named capturing group (Python)
430  (?:...)         non-capturing group
431  (?|...)         non-capturing group; reset group numbers for
432                   capturing groups in each alternative
433</PRE>
434</P>
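<P>
A short sketch of numbered, named and non-capturing groups. It uses the
Python-style (?P&#60;name&#62;...) form from the table and runs it under
Python's <b>re</b> module, which is assumed here to number groups the same way
as PCRE2 for this pattern. The date string is an arbitrary example.
</P>

```python
import re

# Group 1 is named "year"; the month is in a non-capturing group and so is
# not numbered; the day becomes group 2.
m = re.fullmatch(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2018-09-02")

year = m.group("year")
groups = m.groups()

print(year, groups)
```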
435<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
436<P>
437<pre>
438  (?&#62;...)         atomic, non-capturing group
439</PRE>
440</P>
441<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
442<P>
443<pre>
444  (?#....)        comment (not nestable)
445</PRE>
446</P>
447<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
448<P>
449Changes of these options within a group are automatically cancelled at the end
450of the group.
451<pre>
452  (?i)            caseless
453  (?J)            allow duplicate names
454  (?m)            multiline
455  (?n)            no auto capture
456  (?s)            single line (dotall)
457  (?U)            default ungreedy (lazy)
458  (?x)            extended: ignore white space except in classes
459  (?xx)           as (?x) but also ignore space and tab in classes
460  (?-...)         unset option(s)
461  (?^)            unset imnsx options
462</pre>
Unsetting x or xx unsets both. Several options may be set at once, and a
mixture of setting and unsetting such as (?i-x) is allowed, but there may be
only one hyphen. Setting (but not unsetting) is allowed after (?^, for example
(?^in). An option setting may appear at the start of a non-capturing group, for
example (?i:...).
468</P>
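<P>
The effect of scoping an option to a group can be sketched as follows. The
example uses Python's <b>re</b> module, which accepts (?i), (?x) and (?i:...)
but not every option listed above, for example (?J), (?xx) and (?^) are not
available there. It is an illustration under that assumption rather than a
PCRE2 test.
</P>

```python
import re

# (?i:...) makes only the enclosed part caseless; "c" stays case-sensitive.
scoped = re.compile(r"(?i:ab)c")
scoped_lower_c = scoped.fullmatch("ABc") is not None
scoped_upper_c = scoped.fullmatch("ABC") is not None

# (?i) at the start of the pattern applies to the whole pattern.
whole = re.fullmatch(r"(?i)abc", "ABC") is not None

# (?x) ignores white space in the pattern outside character classes.
extended = re.fullmatch(r"(?x) a b c", "abc") is not None

print(scoped_lower_c, scoped_upper_c, whole, extended)
```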
469<P>
470The following are recognized only at the very start of a pattern or after one
471of the newline or \R options with similar syntax. More than one of them may
472appear. For the first three, d is a decimal number.
473<pre>
474  (*LIMIT_DEPTH=d) set the backtracking limit to d
475  (*LIMIT_HEAP=d)  set the heap size limit to d * 1024 bytes
476  (*LIMIT_MATCH=d) set the match limit to d
477  (*NOTEMPTY)      set PCRE2_NOTEMPTY when matching
478  (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
479  (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
480  (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
481  (*NO_JIT)       disable JIT optimization
482  (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
483  (*UTF)          set appropriate UTF mode for the library in use
484  (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
485</pre>
486Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
487the limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>,
488not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
489application can lock out the use of (*UTF) and (*UCP) by setting the
490PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
491</P>
492<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
493<P>
494These are recognized only at the very start of the pattern or after option
495settings with a similar syntax.
496<pre>
497  (*CR)           carriage return only
498  (*LF)           linefeed only
499  (*CRLF)         carriage return followed by linefeed
500  (*ANYCRLF)      all three of the above
501  (*ANY)          any Unicode newline sequence
502  (*NUL)          the NUL character (binary zero)
503</PRE>
504</P>
505<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
506<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
509<pre>
510  (*BSR_ANYCRLF)  CR, LF, or CRLF
511  (*BSR_UNICODE)  any Unicode newline sequence
512</PRE>
513</P>
514<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
515<P>
516<pre>
517  (?=...)         positive look ahead
518  (?!...)         negative look ahead
519  (?&#60;=...)        positive look behind
520  (?&#60;!...)        negative look behind
521</pre>
522Each top-level branch of a look behind must be of a fixed length.
523</P>
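<P>
The four assertion forms can be sketched with Python's <b>re</b> module, which
shares this syntax and also requires fixed-length lookbehinds. The price
strings are invented for the example.
</P>

```python
import re

prices = "10 USD, 20 EUR, 30 USD"

# Positive lookahead: digits that are followed by " USD".
usd = re.findall(r"\d+(?= USD)", prices)

# Negative lookahead: whole numbers that are not followed by " USD".
# The \b anchors stop the engine from matching just part of "10" or "30".
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)

text = "cost $15 or 20"

# Positive lookbehind: digits preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: numbers not preceded by a dollar sign.
plain = re.findall(r"(?<!\$)\b\d+", text)

print(usd, not_usd, dollars, plain)
```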
524<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
525<P>
526<pre>
527  \n              reference by number (can be ambiguous)
528  \gn             reference by number
529  \g{n}           reference by number
530  \g+n            relative reference by number (PCRE2 extension)
531  \g-n            relative reference by number
532  \g{+n}          relative reference by number (PCRE2 extension)
533  \g{-n}          relative reference by number
534  \k&#60;name&#62;        reference by name (Perl)
535  \k'name'        reference by name (Perl)
536  \g{name}        reference by name (Perl)
537  \k{name}        reference by name (.NET)
538  (?P=name)       reference by name (Python)
539</PRE>
540</P>
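<P>
A minimal sketch of backreferences by number and by name, finding a doubled
word. It uses Python's <b>re</b> module, so only the \n and (?P=name) forms
from the table appear; the \g and \k forms are not supported there. The
subject string is illustrative.
</P>

```python
import re

subject = "this is is a test"

# \1 must match the same text that group 1 captured.
by_number = re.search(r"\b(\w+) \1\b", subject)

# (?P=word) refers back to the group named "word".
by_name = re.search(r"\b(?P<word>\w+) (?P=word)\b", subject)

print(by_number.group(), by_name.group("word"))
```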
541<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
542<P>
543<pre>
544  (?R)            recurse whole pattern
545  (?n)            call subpattern by absolute number
546  (?+n)           call subpattern by relative number
547  (?-n)           call subpattern by relative number
548  (?&name)        call subpattern by name (Perl)
549  (?P&#62;name)       call subpattern by name (Python)
550  \g&#60;name&#62;        call subpattern by name (Oniguruma)
551  \g'name'        call subpattern by name (Oniguruma)
552  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
553  \g'n'           call subpattern by absolute number (Oniguruma)
554  \g&#60;+n&#62;          call subpattern by relative number (PCRE2 extension)
555  \g'+n'          call subpattern by relative number (PCRE2 extension)
556  \g&#60;-n&#62;          call subpattern by relative number (PCRE2 extension)
557  \g'-n'          call subpattern by relative number (PCRE2 extension)
558</PRE>
559</P>
560<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
561<P>
562<pre>
563  (?(condition)yes-pattern)
564  (?(condition)yes-pattern|no-pattern)
565
566  (?(n)               absolute reference condition
567  (?(+n)              relative reference condition
568  (?(-n)              relative reference condition
569  (?(&#60;name&#62;)          named reference condition (Perl)
570  (?('name')          named reference condition (Perl)
571  (?(name)            named reference condition (PCRE2, deprecated)
572  (?(R)               overall recursion condition
573  (?(Rn)              specific numbered group recursion condition
574  (?(R&name)          specific named group recursion condition
575  (?(DEFINE)          define subpattern for reference
576  (?(VERSION[&#62;]=n.m)  test PCRE2 version
577  (?(assert)          assertion condition
578</pre>
579Note the ambiguity of (?(R) and (?(Rn) which might be named reference
580conditions or recursion tests. Such a condition is interpreted as a reference
581condition if the relevant named group exists.
582</P>
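<P>
An absolute reference condition, (?(n)...), can be sketched as below: a closing
parenthesis is required only if an opening one was captured. The example runs
under Python's <b>re</b> module, which supports the (?(n)yes|no) form but not
the recursion, DEFINE, VERSION or assertion conditions listed above.
</P>

```python
import re

# Group 1 optionally captures "("; the condition (?(1)\)) then demands ")"
# only when group 1 took part in the match.
pattern = re.compile(r"(\()?\w+(?(1)\))")

results = {s: pattern.fullmatch(s) is not None for s in ["(a)", "a", "(a", "a)"]}

print(results)
```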
583<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
584<P>
585All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
586name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
587if :NAME is present. The others just set a name for passing back to the caller,
but this is not a name that (*SKIP) can see. The following act as soon as they
are reached:
590<pre>
591  (*ACCEPT)       force successful match
592  (*FAIL)         force backtrack; synonym (*F)
593  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
594</pre>
595The following act only when a subsequent match failure causes a backtrack to
596reach them. They all force a match failure, but they differ in what happens
597afterwards. Those that advance the start-of-match point do so only if the
598pattern is not anchored.
599<pre>
600  (*COMMIT)       overall failure, no advance of starting point
601  (*PRUNE)        advance to next starting character
602  (*SKIP)         advance to current matching position
603  (*SKIP:NAME)    advance to position corresponding to an earlier
604                  (*MARK:NAME); if not found, the (*SKIP) is ignored
605  (*THEN)         local failure, backtrack to next alternation
606</pre>
607The effect of one of these verbs in a group called as a subroutine is confined
608to the subroutine call.
609</P>
610<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
611<P>
612<pre>
613  (?C)            callout (assumed number 0)
614  (?Cn)           callout with numerical data n
615  (?C"text")      callout with string data
616</pre>
617The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
618start and the end), and the starting delimiter { matched with the ending
619delimiter }. To encode the ending delimiter within the string, double it.
620</P>
621<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
622<P>
623<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
624<b>pcre2matching</b>(3), <b>pcre2</b>(3).
625</P>
626<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
627<P>
628Philip Hazel
629<br>
630University Computing Service
631<br>
632Cambridge, England.
633<br>
634</P>
635<br><a name="SEC27" href="#TOC1">REVISION</a><br>
636<P>
637Last updated: 02 September 2018
638<br>
639Copyright &copy; 1997-2018 University of Cambridge.
640<br>
641<p>
642Return to the <a href="index.html">PCRE2 index page</a>.
643</p>
644