3 PCRE2 - Perl-compatible regular expressions (revised API)
8 are described in detail below. There is a quick-reference syntax summary in the
12 page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
14 conflict with the Perl syntax) in order to provide some compatibility with
17 Perl's regular expressions are described in its own documentation, and regular
26 algorithm that is not Perl-compatible. Some of the features discussed below are
36 .SH "SPECIAL START-OF-PATTERN ITEMS"
40 by special items at the start of a pattern. These are not Perl-compatible, but
50 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
51 single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
52 specified for the 32-bit library, in which case it constrains the character
65 restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
95 .SS "Disabling auto-possessification"
108 .SS "Disabling start-up optimizations"
125 apply to patterns whose top-level branches all start with .* (match any number
187 character, the two-character sequence CRLF, any of the three preceding, any
223 matches. By default, this is any Unicode newline sequence, for Perl
298 - indicates character range
318 precede a non-alphanumeric with backslash to specify that it stands for itself.
332 can do so by putting them between \eQ and \eE. This is different from Perl in
334 in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish
339 Pattern PCRE2 matches Perl matches
358 .SS "Non-printing characters"
361 A second use of backslash provides a way of encoding non-printing characters
363 non-printing characters in a pattern, but when a pattern is being prepared by
369 \ecx "control-x", where x is any printable ASCII character
384 is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
394 compile-time error occurs.
398 escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
399 only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
400 ^, _, or ?. Any other character provokes a compile-time error. The sequence
402 characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
411 because 127 is not a control character in EBCDIC, Perl makes it generate the
413 them the APC character has the value 255 (hex FF), but in the one Perl calls
414 POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
425 addition to Perl; it provides a way of specifying character code points as octal
435 and Perl has changed over time, causing PCRE2 also to change.
505 8-bit non-UTF mode no greater than 0xff
506 16-bit non-UTF mode no greater than 0xffff
507 32-bit non-UTF mode no greater than 0xffffffff
511 so-called "surrogate" code points). The check for these can be disabled by the
513 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
514 and UTF-32 modes, because these values are not representable in UTF-16.
533 In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
560 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
567 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
592 \eW any "non-word" character
604 "Non-printing characters"
606 above for details. Perl also uses \eN{name} to specify characters by Unicode
618 vary if locale-specific matching is taking place. For example, in some locales
619 the "non-breaking space" character (\exA0) is recognized as white space, and in
624 low-valued character tables, and may vary if locale-specific matching is taking
634 page). For example, in a French locale such as "fr_FR" in Unix-like systems,
641 for characters in the range 128-255 when locale-specific matching is happening.
663 U+00A0 Non-break space
670 U+2004 Three-per-em space
671 U+2005 Four-per-em space
672 U+2006 Six-per-em space
677 U+202F Narrow no-break space
691 In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
700 Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the
710 This particular group matches either the two-character sequence CR followed by
713 line, U+0085). Because this is an atomic group, the two-character sequence is
732 Note that these special settings, which are not Perl-compatible, are recognized
750 8-bit non-UTF-8 mode, these sequences are of course limited to testing
752 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
757 \eP{\fIxx\fP} a character without the \fIxx\fP property
768 Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
932 a two-letter abbreviation. For compatibility with Perl, negation can be
963 Mn Non-spacing mark
1001 page). Perl does not support the Cs property.
1003 The long synonyms for property names that Perl supports (such as \ep{Letter})
1013 the behaviour of current versions of Perl.
1036 Instead it introduced various emoji-specific properties. PCRE2 uses only the
1051 4. Do not end before extending characters or spacing marks or the "zero-width
1074 and \es to use Unicode properties. PCRE2 uses these non-standard, non-Perl
1080 Xsp Any Perl space character
1081 Xwd Any Perl "word" character
1086 Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
1087 compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
1090 There is another non-standard property, Xuc, which matches any character that
1135 Perl documents that the use of \eK within assertions is "not well defined". In
1155 without consuming any characters from the subject string. The use of
1180 done, it also affects \eb and \eB. Neither PCRE2 nor Perl has a separate "start
1191 argument of \fBpcre2_match()\fP is non-zero, indicating that matching is to
1200 \fIstartoffset\fP is non-zero. By calling \fBpcre2_match()\fP multiple times
1201 with appropriate arguments, you can mimic Perl's /g option, and it is in this
1205 character of the matching process, is subtly different from Perl's, which
1206 defines it as true at the end of the previous match. In Perl, these can be
1218 The circumflex and dollar metacharacters are zero-width assertions. That is,
1219 they test for a particular condition being true without consuming any
1222 only the two-character sequence CRLF is recognized as a newline, isolated CR
1229 \fBpcre2_match()\fP is non-zero, or if PCRE2_NOTBOL is set, circumflex can
1262 for compatibility with Perl. However, this can be changed by setting the
1269 when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
1277 below) recognizes the two-character sequence CRLF as a newline, this is
1298 character; when the two-character sequence CRLF is used, dot does not match CR
1305 PCRE2_DOTALL option is set, a dot matches any one character, without exception.
1306 If the two-character sequence CRLF is present in the subject string, it takes
1321 "Non-printing characters"
1323 above for details. Perl also uses \eN{name} to specify characters by Unicode
1331 whether or not a UTF mode is set. In the 8-bit library, one code unit is one
1332 byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
1333 32-bit unit. Unlike a dot, \eC always matches line-ending characters. The
1334 feature is provided in Perl in order to match individual bytes in UTF-8 mode,
1338 with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
1353 in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
1356 The former gives a match-time error; the latter fails to optimize and so the
1359 In the 32-bit library, however, \eC is always supported (when not explicitly
1360 locked out) because it always matches a single code unit, whether or not UTF-32
1364 it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
1366 could be used with a UTF-8 string (ignore white space and line breaks):
1368 (?| (?=[\ex00-\ex7f])(\eC) |
1369 (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
1370 (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
1371 (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1379 below). The assertions at the start of each branch check the next UTF-8
1419 when matching character classes, whatever line-ending sequence is in use, and
1440 character class. For example, [d-m] matches any letter between d and m,
1444 or immediately after a range. For example, [b-d-z] matches letters in the range
1447 Perl treats a hyphen as a literal if it appears before or after a POSIX class
1449 However, unless the hyphen is the last character in the class, Perl outputs a
1454 range. A pattern such as [W-]46] is interpreted as a class of two characters
1455 ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
1456 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
1457 the end of range, so [W-\e]46] is interpreted as a class containing a range
1463 example [\e000-\e037]. Ranges can include any characters that are valid for the
1464 current mode. In any UTF mode, the so-called "surrogate" characters (those
1467 this check). However, ranges such as [\ex{d7ff}-\ex{e000}], which include the
1472 Perl, EBCDIC code points within the range that are not letters are omitted. For
1473 example, [h-k] matches only four characters, even though the codes for h and k
1475 specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
1479 matches the letters in either case. For example, [W-c] is equivalent to
1480 [][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
1481 tables for a French locale are in use, [\exc8-\excb] matches accented E
1494 introducing a POSIX class name, or for a special compatibility feature - see
1496 escaping other non-alphanumeric characters does no harm.
1502 Perl supports the POSIX notation for character classes. This uses names
1513 ascii character codes 0 - 127
1527 and space (32). If locale-specific matching is taking place, the list of space
1531 The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
1532 5.8. Another Perl extension is negation, which is indicated by a ^ character
1537 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
1543 range 128-255 when locale-specific matching is happening. However, if the
1568 U+2066 - U+2069 Various "isolate"s
1596 not compatible with Perl. It is provided to help migrations from other
1603 above), and in a Perl-style pattern the preceding or following character
1604 normally shows which is wanted, without the need for the assertions that are
1634 and ")". These options are Perl-compatible, and are described in detail in the
1649 example (?-im). The two "extended" options are not independent; unsetting either
1652 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
1659 options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
1660 the circumflex to cause some options to be re-instated, but a hyphen may not
1663 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
1664 the same way as the Perl-compatible options by using the characters J and U
1687 a non-capturing subpattern (see the next section), the option letters may
1695 \fBNote:\fP There are other PCRE2-specific options that can be set by the
1721 matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
1740 There are often times when a grouping subpattern is required without a
1752 a non-capturing subpattern, the option letters may appear between the "?" and
1768 Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
1770 (?| and is itself a non-capturing subpattern. For example, consider this
1782 any branch. The following example is taken from the Perl documentation. The
1785 # before ---------------branch-reset----------- after
1801 A relative reference such as (?-1) is no different: it is just a convenient way
1809 for a subpattern's having matched refers to a non-unique number, the test is
1823 to Perl until release 5.10. Python had the feature earlier, and PCRE1
1825 Perl and the Python syntax.
1828 (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. Names
1830 with a non-digit. References to capturing parentheses from other parts of the
1848 if the names were not present. In both PCRE2 and Perl, capturing subpatterns
1851 name-to-number translation table from a compiled pattern, as well as
1856 Perl allows identically numbered subpatterns to have different names. Consider
1861 Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
1866 compile-time error. However, there is still scope for confusion. Consider this
1884 as a 3-letter abbreviation or as the full name, and in both cases you want to
1900 If you make a backreference to a non-unique named subpattern from elsewhere in
1909 If you make a subroutine call to a non-unique named subpattern, the one that
1973 which is represented by a two-byte sequence in a UTF-8 string. Similarly,
1992 For convenience, the three most common quantifiers have single-character
2004 Earlier versions of Perl and PCRE1 used to give an error at compile time for
2010 possible (up to the maximum number of permitted times), without causing the
2040 If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
2050 to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
2101 re-evaluated to see if a different number of repeats allows the rest of the
2114 that once a subpattern has matched, it is not to be re-evaluated in this way.
2157 The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
2160 package, and PCRE1 copied it from there. It ultimately found its way into Perl
2176 matches an unlimited number of substrings that either consist of non-digits, or
2185 than a single character at the end, because both PCRE2 and Perl have an
2193 sequences of non-digits cannot be broken, and failure happens quickly.
2216 "Non-printing characters"
2234 An unsigned number specifies an absolute reference without the ambiguity that
2238 (abc(def)ghi)\eg{-1}
2240 The sequence \eg{-1} is a reference to the most recently started capturing
2242 Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
2247 of forward reference can be useful in patterns that repeat. Perl does not
2271 subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or
2272 \ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2359 For example, a sequence such as (.)\eg{-1} can be used to check that two
2382 For compatibility with Perl, most assertion subpatterns may be repeated; though
2398 without the assertion, the order depending on the greediness of the quantifier.
2444 have a fixed length. However, if there are several top-level alternatives, they
2455 extension compared with Perl, which requires all branches to match the same
2460 is not permitted, because its single top-level branch can match two different
2461 lengths, but it is acceptable to PCRE2 if rewritten to use two top-level
2471 can be used instead of a lookbehind assertion to get round the fixed-length
2479 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
2490 as the subpattern matches a fixed-length string. However,
2498 Perl does not support backreferences in lookbehinds. PCRE2 does support them,
2509 specify efficient matching of fixed-length strings at the end of subject
2578 (?(condition)yes-pattern)
2579 (?(condition)yes-pattern|no-pattern)
2581 If the condition is satisfied, the yes-pattern is used; otherwise the
2582 no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
2584 subpattern, a compile-time error occurs. Each of the two alternatives may
2594 recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
2612 can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside
2615 zero in any of these forms is not used; it provokes a compile-time error.)
2617 Consider the following pattern, which contains non-significant white space to
2628 the condition is true, and so the yes-pattern is executed and a closing
2629 parenthesis is required. Otherwise, since no-pattern is not present, the
2631 non-parentheses, optionally enclosed in parentheses.
2636 ...other stuff... ( \e( )? [^()]+ (?(-1) \e) ) ...
2644 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
2646 this facility before Perl, the syntax (?(name)...) is also recognized. Note,
2662 "Recursion" in this sense refers to any subroutine-like call from one part of
2722 (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
2729 pattern uses references to the named group to match the four dot-separated
2756 this pattern, again containing non-significant white space, and with the two
2759 (?(?=[^a-z]*[a-z])
2760 \ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} )
2763 sequence of non-letters followed by a letter. In other words, it tests for the
2767 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
2772 assertion, whether it succeeds or fails. (Compare non-conditional assertions,
2816 unlimited nested parentheses. Without the use of recursion, the best that can
2820 For some time, Perl has provided a facility that allows regular expressions to
2821 recurse (amongst other things). It does this by interpolating Perl code in the
2822 expression at run time, and the code can refer to the expression itself. A Perl
2828 The (?p{...}) item interpolates Perl code at run time, and in this case refers
2831 Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
2834 this kind of recursion was subsequently introduced into Perl at release 5.10.
2841 non-recursive subroutine
2852 substrings which can either be a sequence of non-parentheses, or a recursive
2855 to avoid backtracking into sequences of non-parentheses.
2867 pattern above you can write (?-2) to refer to the second most recently opened
2879 (?|(a)|(b)) (c) (?-2)
2882 is number 2. When the reference (?-2) is encountered, the second most recently
2893 non-recursive subroutine
2897 An alternative approach is to use named parentheses. The Perl syntax for this
2908 non-parentheses is important when applying the pattern to strings that do not
2941 different alternatives for the recursive and non-recursive cases. The (?R) item
2946 .SS "Differences in recursion processing between PCRE2 and Perl"
2949 Some former differences between PCRE2 and Perl no longer exist.
2951 Before release 10.30, recursion processing in PCRE2 differed from Perl in that
2953 once it had matched some of the subject string, it was never re-entered, even
2955 failure. (Historical note: PCRE implemented recursion before Perl did.)
2958 as atomic. That is, they can be re-entered to try unused alternatives if there
2960 Perl works. If you want a subroutine call to be atomic, you must explicitly
2972 typical palindromic phrases, the pattern has to ignore all non-word characters,
2979 avoid backtracking into sequences of non-word characters. Without this, PCRE2
2981 Perl takes so long that you think it has gone into a loop.
2983 Another way in which PCRE2 and Perl used to differ in their recursion
2984 processing is in the handling of captured values. Formerly in Perl, when a
2994 "b" and so the whole match succeeds. This match used to fail in Perl, but in
3011 (...(relative)...)...(?-1)...
3031 Processing options such as case-independence are fixed when a subpattern is
3035 (abc)(?i:(?-1))
3057 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
3068 (abc)(?i:\eg<-1>)
3070 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
3077 Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
3082 PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
3095 scripts within patterns in a similar way to Perl.
3104 one side-effect is that sometimes callouts are skipped. If you need all
3154 There are a number of special "Backtracking Control Verbs" (to use Perl's
3159 By default, for compatibility with Perl, a name is any sequence of characters
3163 is no longer Perl-compatible.
3174 skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3178 The maximum length of a name is 255 in the 8-bit library and 65535 in the
3179 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
3215 the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
3229 Experiments with Perl suggest that it too has similar optimizations, and like
3257 abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
3259 (??{}). Those are, of course, Perl features that are not present in PCRE2. The
3283 When a match succeeds, the name of the last-encountered (*MARK:NAME) on the
3318 name is recorded and passed back if it is the last-encountered. This does not
3383 unless PCRE2's start-of-match optimizations are turned off, as shown in this
3398 applied starting at "x", and so the (*COMMIT) causes the match to fail without
3421 This verb, when given without a name, is like (*PRUNE), except that if the
3447 assertions, because they are never re-entered by backtracking. Compare the
3477 pattern-based if-then-else block:
3483 second alternative and tries COND2, without backtracking into COND1. If that
3549 not always the same as Perl's. It means that if two or more backtracking verbs
3564 PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
3569 If the subject is "abac", Perl matches unless its optimizations are disabled,
3584 without any further processing; captured strings and a (*MARK) name (if set)
3586 assertion to fail without any further processing; captured substrings and any
3609 the assertion to be true, without considering any further alternative branches.
3619 succeed without any further processing. Matching then continues after the
3620 subroutine call. Perl documents this behaviour. Perl's treatment of the other
3659 Copyright (c) 1997-2018 University of Cambridge.