Lines Matching full:are

7 The syntax and semantics of the regular expressions that are supported by PCRE2
17 Perl's regular expressions are described in its own documentation, and regular
18 expressions in general are covered in a number of books, some of which have
23 This document discusses the patterns that are supported by PCRE2 when its main
26 algorithm that is not Perl-compatible. Some of the features discussed below are
28 the alternative function, and how it differs from the normal function, are
40 by special items at the start of a pattern. These are not Perl-compatible, but
41 are provided to make these options accessible to pattern writers who are not
151 These facilities are provided to catch runaway matches that are provoked by
172 \fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply
219 The newline convention affects where the circumflex and dollar assertions are
249 the sections below, character code values are ASCII or Unicode; in an EBCDIC
250 environment these characters may have different code values, and there are no
264 caseless matching is specified (the PCRE2_CASELESS option), letters are matched
268 and repetitions in the pattern. These are encoded in the pattern by the use of
269 \fImetacharacters\fP, which do not stand for themselves but instead are
272 There are two different sets of metacharacters: those that are recognized
273 anywhere in the pattern except within square brackets, and those that are
294 a character class the only metacharacters are:
322 backslash. All other characters (in particular, those whose code points are
323 greater than 127) are treated as literals.
327 outside a character class and the next newline, inclusive, are ignored. An
333 that $ and @ are handled as literals in \eQ...\eE sequences in PCRE2, whereas
366 these escapes are as follows:
399 only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
412 APC character. Unfortunately, there are several variants of EBCDIC. In most of
417 After \e0 up to two further octal digits are read. If there are fewer than two
418 digits, just those that are present are used. Thus the sequence \e0\ex\e015
439 if there are at least that many previous capturing left parentheses in the
451 Otherwise, up to three octal digits are read to form a character code.
460 \e40 is the same, provided there are fewer than 40
477 Note that octal values of 100 or greater that are specified using this syntax
479 digits are ever read.
482 digits are read (letters can be in upper or lower case). Any number of
495 the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
502 Characters that are specified using octal or hexadecimal numbers are
510 Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
514 and UTF-32 modes, because these values are not representable in UTF-16.
525 \eB, \eR, and \eX are not special inside a character class. Like other
533 In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
545 can be coded as \eg{name}. Backreferences are discussed
562 syntax for referencing a subpattern as a "subroutine". Details are discussed
567 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
616 The default \es characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
617 space (32), which are defined as white space in the "C" locale. This list may
635 or "french" in Windows, some character codes greater than 127 are used for
636 accented letters, and these are then matched by \ew. The use of locales with
639 By default, characters whose code points are greater than 127 never match \ed,
644 is set, the behaviour is changed so that Unicode properties are used to
654 \eB because they are defined in terms of \ew and \eW. Matching these sequences
659 points, whether or not PCRE2_UCP is set. The horizontal space characters are:
681 The vertical space characters are:
705 This is an example of an "atomic group", details of which are given
716 In other modes, two additional characters whose code points are greater than 255
732 Note that these special settings, which are not Perl-compatible, are recognized
749 sequences that match characters with specific properties are available. In
750 8-bit non-UTF-8 mode, these sequences are of course limited to testing
751 characters whose code points are less than 256, but they do work in this mode.
753 may be encountered. These are all treated as being in the Common script and
754 with an unassigned type. The extra escape sequences are:
760 The property names represented by \fIxx\fP above are limited to the Unicode
768 Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
772 Sets of Unicode characters are defined as belonging to certain scripts. A
779 Those that are not part of an identified script are lumped together as
938 of negation, the curly brackets in the escape sequence are optional; these two
944 The following general category property codes are supported:
995 U+DFFF. Such characters are not valid in Unicode strings and so
1033 define the boundaries of extended grapheme clusters. The rules are defined in
1059 Extend and ZWJ characters are allowed between the characters.
1062 regional indicator (RI) characters if there are an odd number of RI characters
1076 explicitly. These properties are:
1092 languages. These are the characters $, @, ` (grave accent), and all characters
1094 surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
1095 excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH
1161 The backslashed assertions are:
1187 start and end of the subject string, whatever options are set. Thus, they are
1188 independent of multiline mode. These three assertions are not affected by the
1218 The circumflex and dollar metacharacters are zero-width assertions. That is,
1220 characters from the subject string. These two metacharacters are concerned with
1223 and LF characters are treated as ordinary data characters, and are not
1238 alternatives are involved, but it should be the first thing in each alternative
1242 "anchored" pattern. (There are also other constructs that can cause a pattern
1249 character of the pattern if a number of alternatives are involved, but it
1257 The meanings of the circumflex and dollar metacharacters are changed if the
1267 patterns that are anchored in single line mode because all branches start with
1268 ^ are not anchored in multiline mode, and a match for circumflex is possible
1278 preferred, even if the single characters CR and LF are also recognized as
1300 (including isolated CRs and LFs). When any Unicode line endings are being
1381 character's individual bytes are then captured by the appropriate number of
1407 are in the class by enumerating those that are not. A class that starts with a
1418 Characters that might indicate line breaks are never treated in any special way
1434 class; it matches the backspace character. The sequences \eB, \eR, and \eX are
1463 example [\e000-\e037]. Ranges can include any characters that are valid for the
1468 surrogates, are always permitted.
1470 There is a special case in EBCDIC environments for ranges whose end points are
1472 Perl, EBCDIC code points within the range that are not letters are omitted. For
1481 tables for a French locale are in use, [\exc8-\excb] matches accented E
1491 The only metacharacters that are recognized in character classes are backslash,
1526 The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1538 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1539 supported, and an error is given if they are encountered.
1544 PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
1545 changed so that Unicode character properties are used. This is achieved by
1559 classes are handled specially in UCP mode:
1572 This matches the same characters as [:graph:] plus space characters that are
1580 The other POSIX classes are unchanged, and match only characters with code
1594 Only these exact character sequences are recognized. A sequence such as
1604 normally shows which is wanted, without the need for the assertions that are
1611 Vertical bar characters are used to separate alternative patterns. For example,
1619 that succeeds is used. If the alternatives are within a subpattern
1634 and ")". These options are Perl-compatible, and are described in detail in the
1638 documentation. The option letters are:
1649 example (?-im). The two "extended" options are not independent; unsetting either
1665 respectively. However, these are not unset by (?^).
1686 As a convenient shorthand, if any option settings are required at the start of
1695 \fBNote:\fP There are other PCRE2-specific options that can be set by the
1698 set or what has been defaulted. Details are given in the section entitled
1703 above. There are also the (*UTF) and (*UCP) leading sequences that can be used
1704 to set UTF and Unicode property modes; they are equivalent to setting the
1714 Subpatterns are delimited by parentheses (round brackets), which can be nested.
1730 Opening parentheses are counted from left to right (starting from 1) to obtain
1736 the captured substrings are "red king", "red", and "king", and are numbered 1,
1740 There are often times when a grouping subpattern is required without a
1748 the captured substrings are "white queen" and "queen", and are numbered 1 and
1751 As a convenient shorthand, if any option settings are required at the start of
1758 match exactly the same set of strings. Because alternative branches are tried
1759 from left to right, and options are not reset until the end of the subpattern
1775 Because the two alternatives are inside a (?| group, both sets of capturing
1776 parentheses are numbered one. Thus, when the pattern matches, you can look
1779 alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1847 Named capturing parentheses are allocated numbers as well as names, exactly as
1849 are primarily identified by numbers; any names are just aliases for these
1857 this pattern, where there are two capturing subpatterns, both numbered 1:
1893 There are five capturing substrings, but only one is ever set after a match.
1901 the pattern, the subpatterns to which the name refers are checked in the order
1921 recursion, all subpatterns with the same name are tested. If the condition is
1957 no upper limit; if the second number and the comma are both omitted, the
1979 subpatterns that are referenced as
1989 below). Items other than subpatterns that have a {0} quantifier are omitted
2005 such patterns. However, because there are cases where this can be useful, such
2006 patterns are now accepted, but if any repetition of the subpattern does in fact
2009 By default, the quantifiers are "greedy", that is, they match as much as
2041 the quantifiers are not greedy by default, but individual ones can be made
2060 However, there are some cases where the optimization cannot be used. When .*
2061 is inside capturing parentheses that are the subject of a backreference
2086 "tweedledee". However, if there are nested capturing subpatterns, the
2131 Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
2133 everything it can. So, while both \ed+ and \ed+? are prepared to adjust the
2151 Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
2152 option is ignored. They are a convenient notation for the simpler forms of
2206 always taken as a backreference, and causes an error only if there are not
2208 parentheses that are referenced need not be to the left of the reference for
2222 no such problem when named parentheses are used. A backreference to any
2227 signed or unsigned number, optionally enclosed in braces. These examples are
2243 can be helpful in long patterns, and also in patterns that are created by
2270 There are several different ways of writing backreferences to named
2272 \ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2296 following a backslash are taken as part of a potential backreference number.
2341 coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
2347 More complicated assertions are coded as subpatterns. There are two kinds:
2355 Assertion subpatterns are not capturing subpatterns. If an assertion contains
2356 capturing subpatterns within it, these are counted for the purposes of
2360 adjacent characters are the same.
2363 captured are discarded (as happens with any pattern branch that fails to
2365 this means that no captured substrings are ever retained after a successful
2370 branch are retained, and matching continues with the next pattern item after
2377 (see below), captured substrings are retained, because matching continues with
2389 However, it may contain internal capturing parenthesized groups that are called
2424 (?!foo) is always true when the next three characters are "bar". A
2443 a lookbehind assertion are restricted such that all the strings it matches must
2444 have a fixed length. However, if there are several top-level alternatives, they
2476 match. If there are insufficient characters before the current position, the
2482 \eX and \eR escapes, which can match different numbers of code units, are never
2489 calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
2499 but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
2523 covers the entire string, from right to left, so we are no better off. However,
2542 matches "foo" preceded by three digits that are not "999". Notice that each of
2544 string. First there is a check that the previous three characters are all
2545 digits, and then there is a check that the same three characters are not "999".
2547 of which are digits and the last three of which are not "999". For example, it
2553 that the first three are digits, and then the second assertion checks that the
2554 preceding three characters are not "999".
2566 characters that are not "999".
2576 already been matched. The two possible forms of conditional subpattern are:
2583 string (it always matches). If there are more than two alternatives in the
2587 the condition. This pattern fragment is an example where the alternatives are
2593 There are five kinds of condition: references to subpatterns, references to
2625 matches one or more characters that are not parentheses. The third part is a
2702 At "top level", all these recursion test conditions are false.
2740 they are dealing with by using this condition to match a string such as
2767 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
2773 when captures are retained only for positive assertions that succeed.)
2780 There are two ways of including comments in patterns that are processed by
2787 closing parenthesis. Nested parentheses are not permitted. If the
2791 characters are interpreted as newlines is controlled by an option passed to the
2881 The first two capturing groups (a) and (b) are both numbered 1, and group (c)
2885 reference (?1) was used. In other words, relative references are just a
2890 reference is not inside the parentheses that are referenced. They are always
2914 the match runs for a very long time indeed because there are so many different
2918 At the end of a match, the values of capturing parentheses are those from
2935 arbitrary nesting. Only digits are allowed in nested brackets (that is, when
2936 recursing), whereas any characters are permitted at the outer level.
2957 Starting with release 10.30, recursive subroutine calls are no longer treated
2969 palindrome when there are an odd number of characters, or nothing when there
3028 occur. However, any capturing parentheses that are set during the subroutine
3031 Processing options such as case-independence are fixed when a subpattern is
3070 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
3087 entry point is set to NULL, callouts are disabled.
3090 function is to be called. There are two kinds of callout: those with a
3104 one side-effect is that sometimes callouts are skipped. If you need all
3107 programming interface to the callout function, are given in the
3124 callouts are automatically installed before each item in the pattern. They are
3154 There are a number of special "Backtracking Control Verbs" (to use Perl's
3167 only backslash items that are permitted are \eQ, \eE, and sequences such as
3174 skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3183 Since these verbs are specifically related to backtracking, most of them can be
3210 PCRE2 contains some optimizations that are used to speed up matching by running
3236 The following verbs act as soon as they are encountered.
3259 (??{}). Those are, of course, Perl features that are not present in PCRE2. The
3294 assertions and atomic groups. (There are differences in those cases when
3333 If you are interested in (*MARK) values after failed matches, you should
3345 The following verbs do nothing when they are encountered. Matching continues
3383 unless PCRE2's start-of-match optimizations are turned off, as shown in this
3410 possessive quantifier, but there are some uses of (*PRUNE) that cannot be
3446 means that it does not see (*MARK) settings that are inside atomic groups or
3447 assertions, because they are never re-entered by backtracking. Compare the
3470 names that are set by (*COMMIT:NAME), (*PRUNE:NAME) or (*THEN:NAME).
3484 succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
3496 enclosing alternative. Consider this pattern, where A, B, etc. are complex
3501 If A and B are matched, but there is a failure in C, matching does not
3510 because there are no more alternatives to try. In this case, matching does now
3542 etc. are complex pattern fragments:
3569 If the subject is "abac", Perl matches unless its optimizations are disabled,
3587 (*MARK) name are discarded.
3590 a positive assertion and false for a negative one; captured substrings are
3595 because lookaround assertions are atomic. A backtrack that occurs after an
3604 The other backtracking verbs are not treated specially if they appear in a