pcreapi.3 - OpenGrok cross reference for /external/pcre/dist/doc/pcreapi.3

Lines Matching full:the
129 two additional libraries. They can be built as well as, or instead of, the
130 8-bit library. To avoid too much complication, this document describes the
131 8-bit versions of the functions, with only occasional references to the 16-bit
134 The 16-bit and 32-bit functions operate in the same way as their 8-bit
139 by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the
140 16-bit and 32-bit option names define the same bit values.
143 16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data
144 units and UTF-32 when using the 32-bit library, unless specified otherwise.
145 More details of the specific differences for the 16-bit and 32-bit libraries
146 are given in the
161 also some wrapper functions (for the 8-bit library only) that correspond to the
162 POSIX regular expression API, but they do not give access to all the
163 functionality. They are described in the
168 wrapper (again for the 8-bit library only) is also distributed with PCRE. It is
169 documented in the
175 The native API C function prototypes are defined in the header file
176 \fBpcre.h\fP, and on Unix-like systems the (8-bit) library itself is called
177 \fBlibpcre\fP. It can normally be accessed by adding \fB-lpcre\fP to the
178 command for linking an application that uses PCRE. The header file defines the
179 macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release numbers
180 for the library. Applications can use these to include support for different
185 including \fBpcre.h\fP or \fBpcrecpp.h\fP, because otherwise the
191 in a Perl-compatible manner. A sample program that demonstrates the simplest
192 way of using them is provided in the file called \fIpcredemo.c\fP in the PCRE
193 source distribution. A listing of this program is given in the
197 documentation, and the
204 in appropriate hardware environments. It greatly speeds up the matching
207 relevant. More complicated programs might need to make use of the functions
209 \fBpcre_assign_jit_stack()\fP in order to control the JIT code's memory usage.
212 gives improved performance. The JIT-specific functions are discussed in the
219 Perl-compatible, is also provided. This uses a different algorithm for the
220 matching. The alternative algorithm finds all possible matches (at a given
221 point in the subject), and scans the subject just once (unless there are
223 substrings. A description of the two matching algorithms and their advantages
224 and disadvantages is given in the
230 In addition to the main compiling and matching functions, there are convenience
243 provided, to free the memory used for extracted strings.
246 in the current locale for passing to \fBpcre_compile()\fP, \fBpcre_exec()\fP,
252 compiled pattern. The function \fBpcre_version()\fP returns a pointer to a
253 string containing the version of PCRE and its date of release.
256 containing a compiled pattern. This is provided for the benefit of
260 the entry points of the standard \fBmalloc()\fP and \fBfree()\fP functions,
261 respectively. PCRE calls the memory management functions via these variables,
262 so a calling program can replace them if it wishes to intercept the calls. This
267 only when PCRE is compiled to use the heap for remembering data, instead of
268 recursive function calls, when running the \fBpcre_exec()\fP function. See the
273 building PCRE, for use in environments that have limited stacks. Because of the
277 first freed), and always for memory blocks of the same size. There is a
278 discussion about PCRE's stack usage in the
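
The stack-like, same-size calling pattern described above lends itself to a small block cache. The following is a minimal sketch, not PCRE code: the names stack_malloc_cached, stack_free_cached, and CACHE_SLOTS are hypothetical, and the sketch assumes a single thread and that every request really is for the same size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 32                 /* hypothetical cache capacity */

static void  *cache[CACHE_SLOTS];      /* freed blocks kept for reuse */
static int    cached = 0;              /* number of blocks in the cache */
static size_t cached_size = 0;         /* size of the first request seen */

/* Replacement for malloc(): reuse the most recently freed block if any. */
static void *stack_malloc_cached(size_t size)
{
    if (cached_size == 0)
        cached_size = size;
    /* The sketch relies on the same-size guarantee; fail loudly otherwise. */
    assert(size == cached_size);
    if (cached > 0)
        return cache[--cached];
    return malloc(size);
}

/* Replacement for free(): keep the block unless the cache is full. */
static void stack_free_cached(void *block)
{
    if (block == NULL)
        return;
    if (cached < CACHE_SLOTS)
        cache[cached++] = block;
    else
        free(block);
}
```

A program that includes pcre.h could install the pair by assigning them to pcre_stack_malloc and pcre_stack_free before matching. Because those variables are shared by all threads, a multi-threaded program would need locking or per-thread caches, which this sketch omits.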
285 by the caller to a "callout" function, which PCRE will then call at specified
286 points during a matching operation. Details are given in the
293 set by the caller to a function that is called by PCRE whenever it starts
295 uses recursive function calls, which use up the system stack. This function is
297 error if the stack runs out. The function should return zero if all is well, or
307 character, the two-character sequence CRLF, any of the three preceding, or any
308 Unicode newline sequence. The Unicode newline sequences are the three just
309 mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed,
313 Each of the first three conventions is used by at least one operating system as
315 The default default is LF, which is the Unix standard. When PCRE is run, the
319 At compile time, the newline convention can be specified by the \fIoptions\fP
320 argument of \fBpcre_compile()\fP, or it can be specified by special text at the
321 start of the pattern itself; this overrides any other settings. See the
325 page for details of the special character sequences.
327 In the PCRE documentation the word "newline" is used to mean "the character or
328 pair of characters that indicate a line break". The choice of newline
329 convention affects the handling of the dot, circumflex, and dollar
330 metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
331 recognized line ending sequence, the match position advancement for a
332 non-anchored pattern. There is more detail about this in the
339 The choice of newline convention does not affect the interpretation of
347 The PCRE functions can be used in multi-threading applications, with the
348 proviso that the memory management functions pointed to by \fBpcre_malloc\fP,
349 \fBpcre_free\fP, \fBpcre_stack_malloc\fP, and \fBpcre_stack_free\fP, and the
356 If the just-in-time optimization feature is being used, it needs separate
357 memory stack areas for each thread. See the
368 time, possibly by a different program, and even on a host other than the one on
369 which it was compiled. Details are given in the
373 documentation, which includes a description of the
385 discover which optional features have been compiled into the PCRE library. The
392 information is required; the second argument is a pointer to a variable into
393 which the information is placed. The returned value is zero on success, or the
394 negative error code PCRE_ERROR_BADOPTION if the value in the first argument is
395 not recognized. The following information is available:
400 otherwise it is set to zero. This value should normally be given to the 8-bit
401 version of this function, \fBpcre_config()\fP. If it is given to the 16-bit
402 or 32-bit version of this function, the result is PCRE_ERROR_BADOPTION.
407 otherwise it is set to zero. This value should normally be given to the 16-bit
408 version of this function, \fBpcre16_config()\fP. If it is given to the 8-bit
409 or 32-bit version of this function, the result is PCRE_ERROR_BADOPTION.
414 otherwise it is set to zero. This value should normally be given to the 32-bit
415 version of this function, \fBpcre32_config()\fP. If it is given to the 8-bit
416 or 16-bit version of this function, the result is PCRE_ERROR_BADOPTION.
431 support is available, the string contains the name of the architecture for
432 which the JIT compiler is configured, for example "x86 32bit (little endian +
433 unaligned)". If JIT support is not available, the result is NULL.
437 The output is an integer whose value specifies the default character sequence
438 that is recognized as meaning "newline". The values that are supported in
440 ANYCRLF, and -1 for ANY. In EBCDIC environments, CR, ANYCRLF, and ANY yield the
441 same values. However, the value for LF is normally 21, though some EBCDIC
442 environments use 37. The corresponding values for CRLF are 3349 and 3365. The
443 default should normally correspond to the standard sequence for your operating
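
The numeric values quoted above are consistent with a simple packing: a single-character newline is reported as its character code, and a two-character newline as the first code times 256 plus the second (3338 is 13*256 + 10). A minimal sketch of decoding such a value follows; the helper name newline_chars is hypothetical, and the packing rule is inferred from the listed values rather than stated by the documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Split a PCRE_CONFIG_NEWLINE value into its character codes.
   Returns the number of codes written to out (1 or 2), or 0 for the
   negative values that stand for ANYCRLF (-2) and ANY (-1). */
static int newline_chars(int value, int out[2])
{
    assert(out != NULL);
    if (value < 0)
        return 0;                 /* ANY or ANYCRLF: no fixed sequence */
    if (value > 255) {
        out[0] = value >> 8;      /* first character, e.g. CR */
        out[1] = value & 0xff;    /* second character, e.g. LF */
        return 2;
    }
    out[0] = value;               /* single character, e.g. LF */
    return 1;
}
```

The value passed in would normally come from pcre_config() with PCRE_CONFIG_NEWLINE; the same decoding covers the EBCDIC values 3349 and 3365.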
448 The output is an integer whose value indicates what character sequences the \eR
451 or CRLF. The default can be overridden when a pattern is compiled or matched.
455 The output is an integer that contains the number of bytes used for internal
456 linkage in compiled regular expressions. For the 8-bit library, the value can
457 be 2, 3, or 4. For the 16-bit library, the value is either 2 or 4 and is still
458 a number of bytes. For the 32-bit library, the value is either 2 or 4 and is
459 still a number of bytes. The default value of 2 is sufficient for all but the
460 most massive patterns, since it allows the compiled pattern to be up to 64K in
461 size. Larger values allow larger regular expressions to be compiled, at the
466 The output is an integer that contains the threshold above which the POSIX
476 The output is a long integer that gives the maximum depth of nesting of
477 parentheses (of any kind) in a pattern. This limit is imposed to cap the amount
479 built; the default is 250. This limit does not take into account the stack that
480 may already be used by the calling application. For finer control over
486 The output is a long integer that gives the default limit for the number of
492 The output is a long integer that gives the default limit for the depth of
493 recursion when calling the internal matching function in a \fBpcre_exec()\fP
499 \fBpcre_exec()\fP is implemented by recursive function calls that use the stack
500 to remember their state. This is the usual way that PCRE is compiled. The
501 output is zero if PCRE was compiled to use blocks of data on the heap instead
503 \fBpcre_stack_free\fP are called to manage memory blocks on the heap, thus
504 avoiding the use of the stack.
521 Either of the functions \fBpcre_compile()\fP or \fBpcre_compile2()\fP can be
522 called to compile a pattern into an internal form. The only difference between
525 too much repetition, we refer just to \fBpcre_compile()\fP below, but the
528 The pattern is a C string terminated by a binary zero, and is passed in the
530 via \fBpcre_malloc\fP is returned. This contains the compiled code and related
531 data. The \fBpcre\fP type is defined for the returned block; this is a typedef
532 for a structure whose contents are not externally defined. It is up to the
533 caller to free the memory (via \fBpcre_free\fP) when it is no longer required.
535 Although the compiled code of a PCRE regex is relocatable, that is, it does not
536 depend on memory location, the complete \fBpcre\fP data block is not
537 fully relocatable, because it may contain a copy of the \fItableptr\fP
540 The \fIoptions\fP argument contains various bit settings that affect the
541 compilation. It should be zero if no options are required. The available
544 within the pattern (see the detailed description in the
549 the pattern, the contents of the \fIoptions\fP argument specify their
550 settings at the start of compilation and execution. The PCRE_ANCHORED,
552 PCRE_NO_START_OPTIMIZE options can be set at the time of matching as well as at
557 NULL, and sets the variable pointed to by \fIerrptr\fP to point to a textual
558 error message. This is a static string that is part of the library. You must
559 not try to free it. Normally, the offset from the start of the pattern to the
560 data unit that was being processed when the error was discovered is placed in
563 the offset is that of the first data unit of the failing character.
565 Some errors are not detected until the whole pattern has been scanned; in these
566 cases, the offset passed back is the length of the pattern. Note that the
568 point into the middle of a UTF-8 or UTF-16 character.
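
Because the error offset counts data units and may land inside a multi-byte UTF-8 character, a caller that wants to print the pattern up to the error may first move the offset back to a character boundary. A minimal sketch for the 8-bit library follows; utf8_char_start is a hypothetical helper, not part of the PCRE API, and it assumes the pattern is otherwise valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Move offset back to the first byte of the UTF-8 character that
   contains it. Continuation bytes have the bit pattern 10xxxxxx. */
static size_t utf8_char_start(const char *s, size_t offset)
{
    assert(s != NULL);
    while (offset > 0 && ((unsigned char)s[offset] & 0xC0) == 0x80)
        offset--;
    return offset;
}
```

An offset equal to the length of the pattern (the case for errors detected only after the whole pattern has been scanned) is returned unchanged, since the terminating zero is not a continuation byte.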
570 If \fBpcre_compile2()\fP is used instead of \fBpcre_compile()\fP, and the
572 returned via this argument in the event of an error. This is in addition to the
575 If the final argument, \fItableptr\fP, is NULL, PCRE uses a default set of
576 character tables that are built when PCRE is compiled, using the default C
577 locale. Otherwise, \fItableptr\fP must be an address that is the result of a
578 call to \fBpcre_maketables()\fP. This value is stored with the compiled
579 pattern, and used again by \fBpcre_exec()\fP and \fBpcre_dfa_exec()\fP when the
580 pattern is matched. For more discussion, see the section on locale support
589     "^A.*Z",          /* the pattern */
595 The following names for option bits are defined in the \fBpcre.h\fP header
600 If this bit is set, the pattern is forced to be "anchored", that is, it is
601 constrained to match only at the first matching point in the string that is
602 being searched (the "subject string"). This effect can also be achieved by
603 appropriate constructs in the pattern itself, which is the only way to do it in
609 all with number 255, before each pattern item. For discussion of the callout
610 facility, see the
619 These options (which are mutually exclusive) control what the \eR escape
620 sequence matches. The choice is either to match only CR, LF, or CRLF, or to
621 match any Unicode newline sequence. The default is specified when PCRE is
622 built. It can be overridden from within the pattern, or by setting an option
627 If this bit is set, letters in the pattern match both upper and lower case
629 pattern by a (?i) option setting. In UTF-8 mode, PCRE always understands the
631 matching is always possible. For characters with higher values, the concept of
639 If this bit is set, a dollar metacharacter in the pattern matches only at the
640 end of the subject string. Without this option, a dollar also matches
641 immediately before a newline at the end of the string (but not before any other
642 newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
648 If this bit is set, a dot metacharacter in the pattern matches a character of
651 a dot does not match when the current position is at a newline. This option is
654 characters, independent of the setting of this option.
660 only one instance of the named subpattern can ever be matched. There are more
661 details of named subpatterns below; see also the
669 If this bit is set, most white space characters in the pattern are totally
677 White space did not formerly include the VT character (code 11), because Perl
682 class and the next newline, inclusive, to be ignored. PCRE_EXTENDED is
686 Which characters are interpreted as newlines is controlled by the options
687 passed to \fBpcre_compile()\fP or by a special sequence at the start of the
688 pattern, as described in the section entitled
693 in the \fBpcrepattern\fP documentation. Note that the end of this type of
694 comment is a literal newline sequence in the pattern; escape sequences that
700 within the sequence (?( that introduces a conditional subpattern.
710 give an error for this, by running it with the -w option.) There are at present
717 the first newline in the subject string, though the matched text may continue
718 over the newline.
723 compatible with JavaScript rather than Perl. The changes are as follows:
727 character). Thus, the pattern AB]CD becomes illegal when this option is set.
730 string (by default this causes the current matching alternative to fail). A
732 an "a" in the subject), whereas it fails by default, for Perl compatibility.
738 hexadecimal digits, in which case the hexadecimal number defines the code point
740 case the following character).
743 hexadecimal digits, in which case the hexadecimal number defines the code point
750 By default, for the purposes of matching "start of line" and "end of line",
751 PCRE treats the subject string as consisting of a single line of characters,
752 even if it actually contains newlines. The "start of line" metacharacter (^)
753 matches only at the start of the string, and the "end of line" metacharacter
754 ($) matches only at the end of the string, or before a terminating newline
756 PCRE_DOTALL is set, the "any character" metacharacter (.) does not match at a
757 newline. This behaviour (for ^, $, and dot) is the same as Perl.
759 When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs
760 match immediately following or immediately before internal newlines in the
761 subject string, respectively, as well as at the very start and end. This is
768 This option locks out interpretation of the pattern as UTF-8 (or UTF-16 or
769 UTF-32 in the 16-bit and 32-bit libraries). In particular, it prevents the
770 creator of the pattern from switching to UTF interpretation by starting the
772 from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also
781 These options override the default newline definition that was chosen when PCRE
782 was built. Setting the first or the second specifies that a newline is
784 PCRE_NEWLINE_CRLF specifies that a newline is indicated by the two-character
785 CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies that any of the three
789 In an ASCII/Unicode environment, the Unicode newline sequences are the three
790 just mentioned, plus the single characters VT (vertical tab, U+000B), FF (form
792 (paragraph separator, U+2029). For the 8-bit library, the last two are
795 When PCRE is compiled to run in an EBCDIC (mainframe) environment, the code for
796 CR is 0x0d, the same as ASCII. However, the character code for LF is normally
799 less than 256. For more details, see the
805 The newline setting in the options word uses three bits that are treated
807 plus the five values above). This means that if you set more than one newline
808 option, the combination may or may not be sensible. For example,
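
To make the "three bits treated as a number" point concrete, the sketch below extracts the newline field from an options word. The constants repeat the PCRE_NEWLINE_* bit values from the PCRE 8.x pcre.h under local names (NL_*) so that the fragment compiles without the library; NL_MASK and newline_field are hypothetical names introduced only for illustration.

```c
#include <assert.h>

/* Local copies of the newline option bits (see pcre.h for the real names). */
enum {
    NL_CR      = 0x00100000,
    NL_LF      = 0x00200000,
    NL_CRLF    = 0x00300000,
    NL_ANY     = 0x00400000,
    NL_ANYCRLF = 0x00500000,
    NL_MASK    = 0x00700000    /* the three-bit field; hypothetical name */
};

/* Return the newline field as a number: 0 means "use the build default",
   1 to 5 select one of the five conventions, 6 and 7 are unused. */
static int newline_field(int options)
{
    return (options & NL_MASK) >> 20;
}
```

OR-ing NL_CR with NL_LF gives field value 3, the same as NL_CRLF, whereas OR-ing NL_LF with NL_ANYCRLF gives 7, which corresponds to no convention. Real code should include pcre.h and use the PCRE_NEWLINE_* names directly.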
815 indicates a comment that lasts until after the next line break sequence. In
819 The newline option that is set at compile time becomes the default that is used
824 If this option is set, it disables the use of numbered capturing parentheses in
827 they acquire numbers in the usual way). There is no equivalent of this option
836 this option if you want the matching functions to do a full unoptimized search
837 and run all the callouts, but it is mainly provided for testing purposes.
843 it is remembered with the compiled pattern and assumed at matching time. This
844 is necessary if you want to use JIT execution, because the JIT compiler needs
845 to know whether or not this option is set. For details see the discussion of
854 This option changes the way PCRE processes \eB, \eb, \eD, \ed, \eS, \es, \eW,
855 \ew, and some of the POSIX character classes. By default, only ASCII characters
857 classify characters. More details are given in the section on
862 in the
866 page. If you set PCRE_UCP, matching one of the items it affects takes much
867 longer. The option is available only if PCRE has been compiled with Unicode
872 This option inverts the "greediness" of the quantifiers so that they are not
874 with Perl. It can also be set by a (?U) option setting within the pattern.
878 This option causes PCRE to regard both the pattern and the subject as strings
880 only when PCRE is built to include UTF support. If not, the use of this option
881 provokes an error. Details of how this option changes the behaviour of PCRE are
882 given in the
890 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
891 automatically checked. There is a discussion about the
896 in the
902 this check for performance reasons, you can set the PCRE_NO_UTF8_CHECK option.
903 When it is set, the effect of passing an invalid UTF-8 string as a pattern is
906 the validity checking of subject strings only. If the same string is being
907 matched many times, the option can be safely set for the second and subsequent
914 The following table lists the error codes that may be returned by
915 \fBpcre_compile2()\fP, along with the error messages that may be returned by
963   43  two named subpatterns have the same name
987   65  different names for subpatterns of the same number are
1002   78  setting UTF is disabled by the application
1012 be used if the limits were changed when PCRE was built.
1025 more time analyzing it in order to speed up the time taken for matching. The
1027 argument. If studying the pattern produces additional information that will
1029 \fBpcre_extra\fP block, in which the \fIstudy_data\fP field points to the
1030 results of the study.
1034 also contains other fields that can be set by the caller before the block is
1040 in the section on matching a pattern.
1042 If studying the pattern does not produce any useful information,
1043 \fBpcre_study()\fP returns NULL by default. In that circumstance, if the
1044 calling program wants to pass any of the other fields to \fBpcre_exec()\fP or
1046 if \fBpcre_study()\fP is called with the PCRE_STUDY_EXTRA_NEEDED option, it
1058 If any of these are set, and the just-in-time compiler is available, the
1060 the \fBpcre_exec()\fP interpretive matching function. If the just-in-time
1061 compiler is not available, these options are ignored. All undefined bits in the
1065 patterns to be analyzed, and for one-off matches and simple patterns the
1067 Not all patterns can be optimized by the JIT compiler. For those that cannot be
1068 handled, matching automatically falls back to the \fBpcre_exec()\fP
1069 interpreter. For more details, see the
1076 studying succeeds (even if no data is returned), the variable it points to is
1078 static string that is part of the library. You must not try to free it. You
1079 should test the error pointer for NULL after calling \fBpcre_study()\fP, to be
1082 When you are finished with a pattern, you can free the memory used for the
1083 study data by calling \fBpcre_free_study()\fP. This function was added to the
1084 API for release 8.20. For earlier versions, the memory could be freed with
1085 \fBpcre_free()\fP, just like the pattern itself. This will still work in cases
1086 where JIT optimization is not used, but it is advisable to change to the new
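
An application that must build against releases both older and newer than 8.20 can decide which freeing function to use from the release numbers. A minimal sketch follows; has_pcre_free_study is a hypothetical helper, and in real code its arguments would be the PCRE_MAJOR and PCRE_MINOR macros from pcre.h.

```c
#include <assert.h>

/* Return 1 if a release with the given numbers provides
   pcre_free_study(), which was added in 8.20; otherwise 0. */
static int has_pcre_free_study(int major, int minor)
{
    assert(major >= 0 && minor >= 0);
    return major > 8 || (major == 8 && minor >= 20);
}
```

The same comparison can be written as a preprocessor test on PCRE_MAJOR and PCRE_MINOR, so that a call to pcre_free_study() is compiled only where the function exists.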
1106 Studying a pattern does two things: first, a lower bound for the length of
1107 subject string that is needed to match the pattern is computed. This does not
1109 guarantee that no shorter strings match. The value is used to avoid wasting
1110 time by trying to match strings that are shorter than the lower bound. You can
1111 find out the value in a calling program via the \fBpcre_fullinfo()\fP function.
1115 created. This speeds up finding a position in the subject at which to start
1116 matching. (In 16-bit mode, the bitmap is used for 16-bit values less than 256.
1117 In 32-bit mode, the bitmap is used for 32-bit values less than 256.)
1120 \fBpcre_dfa_exec()\fP, and the information is also used by the JIT compiler.
1121 The optimizations can be disabled by setting the PCRE_NO_START_OPTIMIZE option.
1128 execution to work with PCRE_NO_START_OPTIMIZE, the option must be set at
1144 code point. When running in UTF-8 mode, or in the 16- or 32-bit libraries, this
1148 \ep and \eP, or, alternatively, the PCRE_UCP option can be set when a pattern
1150 instead of the built-in tables.
1154 use locales, but not try to mix the two.
1156 PCRE contains an internal set of tables that are used when the final argument
1158 Normally, the internal tables recognize only ASCII characters. However, when
1159 PCRE is built, it is possible to cause the internal tables to be rebuilt in the
1160 default "C" locale of the local system, which may cause them to be different.
1162 The internal tables can always be overridden by tables supplied by the
1164 the default. As more and more applications change to using Unicode, the need
1167 External tables are built by calling the \fBpcre_maketables()\fP function,
1168 which has no arguments, in the relevant locale. The result can then be passed
1170 tables that are appropriate for the French locale (where accented characters
1171 with values greater than 128 are treated as letters), the following code could
1179 are using Windows, the name for the French locale is "french".
1181 When \fBpcre_maketables()\fP runs, the tables are built in memory that is
1182 obtained via \fBpcre_malloc\fP. It is the caller's responsibility to ensure
1183 that the memory containing the tables remains available for as long as it is
1186 The pointer that is passed to \fBpcre_compile()\fP is saved with the compiled
1187 pattern, and the same tables are used via this pointer by \fBpcre_study()\fP
1189 pattern, compilation, studying and matching all happen in the same locale, but
1192 It is possible to pass a table pointer or NULL (indicating the use of the
1193 internal tables) to \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP (see the
1194 discussion below in the section on matching a pattern). This facility is
1197 used at compile time, it must be provided again when the reloaded pattern is
1199 locale from the one in which it was compiled is likely to lead to anomalous
1213 pattern. It replaces the \fBpcre_info()\fP function, which was removed from the
1216 The first argument for \fBpcre_fullinfo()\fP is a pointer to the compiled
1217 pattern. The second argument is the result of \fBpcre_study()\fP, or NULL if
1218 the pattern was not studied. The third argument specifies which piece of
1219 information is required, and the fourth argument is a pointer to a variable
1220 to receive the data. The yield of the function is zero for success, or one of
1223   PCRE_ERROR_NULL           the argument \fIcode\fP was NULL
1224                             the argument \fIwhere\fP was NULL
1225   PCRE_ERROR_BADMAGIC       the "magic number" was not found
1226   PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
1228   PCRE_ERROR_BADOPTION      the value of \fIwhat\fP was invalid
1229   PCRE_ERROR_UNSET          the requested field is not set
1231 The "magic number" is placed at the start of each compiled pattern as a simple
1232 check against passing an arbitrary memory pointer. The endianness error can
1234 a typical call of \fBpcre_fullinfo()\fP, to obtain the length of the compiled
1243     &length);         /* where to put the data */
1245 The possible values for the third argument are defined in \fBpcre.h\fP, and are
1250 Return the number of the highest back reference in the pattern. The fourth
1256 Return the number of capturing subpatterns in the pattern. The fourth argument
1261 Return a pointer to the internal default character tables within PCRE. The
1263 information call is provided for internal use by the \fBpcre_study()\fP
1269 Return information about the first data unit of any matched string, for a
1270 non-anchored pattern. The name of this option refers to the 8-bit library,
1271 where data units are bytes. The fourth argument should point to an \fBint\fP
1273 when the 32-bit library is in non-UTF-32 mode, the full 32-bit range of
1277 If there is a fixed first value, for example, the letter "c" from a pattern
1278 such as (cat|cow|coyote), its value is returned. In the 8-bit library, the
1279 value is always less than 256. In the 16-bit library the value can be up to
1280 0xffff. In the 32-bit library the value can be up to 0x10ffff.
1284 (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
1287 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
1288 (if it were set, the pattern would be anchored),
1290 -1 is returned, indicating that the pattern matches only at the start of a
1291 subject string or after any newline within the string. Otherwise -2 is
1296 Return the value of the first data unit (non-UTF character) of any matched
1297 string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS returns 1;
1298 otherwise return 0. The fourth argument should point to an \fBuint_t\fP
1301 In the 8-bit library, the value is always less than 256. In the 16-bit library
1302 the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value
1307 Return information about the first data unit of any matched string, for a
1308 non-anchored pattern. The fourth argument should point to an \fBint\fP
1311 If there is a fixed first value, for example, the letter "c" from a pattern
1312 such as (cat|cow|coyote), 1 is returned, and the character value can be
1316 (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
1319 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
1320 (if it were set, the pattern would be anchored),
1322 2 is returned, indicating that the pattern matches only at the start of a
1323 subject string or after any newline within the string. Otherwise 0 is
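
Code written against the older single-value interface can be mapped onto the flags-plus-character form described here. The sketch below converts a value of the kind returned for PCRE_INFO_FIRSTBYTE (a non-negative data unit, -1, or -2) into the corresponding flag (1, 2, or 0) and character; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Flag/character pair in the style of PCRE_INFO_FIRSTCHARACTERFLAGS
   and PCRE_INFO_FIRSTCHARACTER. */
struct first_unit {
    int flags;          /* 1: fixed first unit, 2: start or after newline, 0: neither */
    unsigned int ch;    /* meaningful only when flags == 1 */
};

static struct first_unit from_firstbyte(int firstbyte)
{
    struct first_unit r = { 0, 0 };
    assert(firstbyte >= -2);
    if (firstbyte >= 0) {
        r.flags = 1;                        /* fixed first data unit */
        r.ch = (unsigned int)firstbyte;
    } else if (firstbyte == -1) {
        r.flags = 2;                        /* start of subject or after a newline */
    }
    /* -2: nothing is known, flags stays 0 */
    return r;
}
```

The conversion cannot recover values that the older interface was unable to report, which is the reason the documentation recommends the newer pair for the 32-bit library.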
1328 If the pattern was studied, and this resulted in the construction of a 256-bit
1329 table indicating a fixed set of values for the first data unit in any matching
1330 string, a pointer to the table is returned. Otherwise NULL is returned. The
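
A 256-bit table occupies 32 bytes, one bit per possible first data unit below 256. The sketch below tests whether a given value is marked in such a table. It assumes that the bit for value c is bit (c & 7) of byte c / 8; that layout is an assumption to verify against the PCRE source for the release in use, and may_start_with is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Return nonzero if a match could start with data unit c according to
   a 32-byte first-unit bitmap. Values of 256 or more are not covered
   by the table, so they are never ruled out. */
static int may_start_with(const unsigned char *table, unsigned int c)
{
    assert(table != NULL);
    if (c > 255)
        return 1;
    return (table[c / 8] & (1u << (c & 7))) != 0;
}
```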
1335 Return 1 if the pattern contains any explicit matches for CR or LF characters,
1336 otherwise 0. The fourth argument should point to an \fBint\fP variable. An
1341 Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
1342 0. The fourth argument should point to an \fBint\fP variable. (?J) and
1343 (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1347 Return 1 if the pattern was studied with one of the JIT options, and
1348 just-in-time compiling was successful. The fourth argument should point to an
1350 in this version of PCRE, or that the pattern was not studied with a JIT option,
1351 or that the JIT compiler could not handle this particular pattern. See the
1359 If the pattern was successfully studied with a JIT option, return the size of
1360 the JIT compiled code, otherwise return zero. The fourth argument should point
1365 Return the value of the rightmost literal data unit that must exist in any
1366 matched string, other than at its start, if such a value has been recorded. The
1369 only if it follows something of variable length. For example, for the pattern
1370 /^a\ed+z\ed+/ the returned value is "z", but for /^a\edz\ed/ the returned value
1373 Since for the 32-bit library using the non-UTF-32 mode, this function is unable
1374 to return the full 32-bit range of characters, this value is deprecated;
1375 instead the PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR values should
1380 Return 1 if the pattern can match an empty string, otherwise 0. The fourth
1385 If the pattern set a match limit by including an item of the form
1386 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth argument
1387 should point to an unsigned 32-bit integer. If no such value has been set, the
1388 call to \fBpcre_fullinfo()\fP returns the error PCRE_ERROR_UNSET.
1392 Return the number of characters (NB not data units) in the longest lookbehind
1393 assertion in the pattern. This information is useful when doing multi-segment
1394 matching using the partial matching facilities. Note that the simple assertions
1396 one-character lookbehind, though it does not actually inspect the previous
1397 character. This is to ensure that at least one character from the old segment
1399 lookbehinds in the pattern, \eA might match incorrectly at the start of a new
1404 If the pattern was studied and a minimum length for matching subject strings
1405 was computed, its value is returned. Otherwise the returned value is -1. The
1406 value is a number of characters, which in UTF mode may be different from the
1407 number of data units. The fourth argument should point to an \fBint\fP
1408 variable. A non-negative value is a lower bound to the length of any matching
1416 PCRE supports the use of named as well as numbered capturing parentheses. The
1417 names are just an additional way of identifying the parentheses, which still
1420 substrings by name. It is also possible to extract the data directly, by first
1421 converting the name to a number in order to access the correct pointers in the
1422 output vector (described with \fBpcre_exec()\fP below). To do the conversion,
1423 you need to use the name-to-number map, which is described by these three
1427 the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size of each
1428 entry; both of these return an \fBint\fP value. The entry size depends on the
1429 length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first
1430 entry of the table. This is a pointer to \fBchar\fP in the 8-bit library, where
1431 the first two bytes of each entry are the number of the capturing parenthesis,
1432 most significant byte first. In the 16-bit library, the pointer points to
1433 16-bit data units, the first of which contains the parenthesis number. In the
1434 32-bit library, the pointer points to 32-bit data units, the first of which
1435 contains the parenthesis number. The rest of the entry is the corresponding
1439 with the same number, as described in the
1444 in the
1448 page, the groups may be given the same name, but there is only one entry in the
1449 table. Different names for groups of the same number are not permitted.
1451 but only if PCRE_DUPNAMES is set. They appear in the table in the order in
1452 which they were found in the pattern. In the absence of (?| this is the order
1453 of increasing number; when (?| is used this is not necessarily the case because
1456 As a simple example of the name/number table, consider the following pattern
1457 after compilation by the 8-bit library (assume PCRE_EXTENDED is set, so white
1464 There are four named subpatterns, so the table has four entries, and each entry
1465 in the table is eight bytes long. The table is as follows, with non-printing
1473 When writing code to extract data from named subpatterns using the
1474 name-to-number map, remember that the length of the entries is likely to be
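The 8-bit entry layout described above (a two-byte parenthesis number, most significant byte first, followed by the zero-terminated name) can be walked with a few lines of C. The following is only a sketch: lookup_group_number() is a hypothetical helper, not a PCRE function, and it assumes the caller has already obtained the table pointer, the entry count, and the entry size from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): search an 8-bit name table.
   Each entry is entry_size bytes long: a 2-byte group number, most
   significant byte first, then the zero-terminated group name. */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL);
    assert(name_count >= 0 && entry_size >= 3);
    for (i = 0; i < name_count; i++) {
        /* Step by the reported entry size; it varies between patterns. */
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;  /* name not present in the table */
}
```

The returned number n identifies the offset pair ovector[2*n] and ovector[2*n+1]. In real code, \fBpcre_get_stringnumber()\fP does the same job for unique names.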
1479 Return 1 if the pattern can be used for partial matching with
1480 \fBpcre_exec()\fP, otherwise 0. The fourth argument should point to an
1481 \fBint\fP variable. From release 8.00, this always returns 1, because the
1482 restrictions that previously applied to partial matching have been lifted. The
1490 Return a copy of the options with which the pattern was compiled. The fourth
1492 are those specified in the call to \fBpcre_compile()\fP, modified by any
1493 top-level option settings at the start of the pattern itself. In other words,
1494 they are the options that will be in force when matching starts. For example,
1495 if the pattern /(?im)abc(?-i)d/ is compiled with the PCRE_EXTENDED option, the
1499 alternatives begin with one of the following:
1506           references to the subpattern in which .* appears
1508 For such patterns, the PCRE_ANCHORED bit is set in the options returned by
1513 If the pattern set a recursion limit by including an item of the form
1514 (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
1516 set, the call to \fBpcre_fullinfo()\fP returns the error PCRE_ERROR_UNSET.
1520 Return the size of the compiled pattern in bytes (for all three libraries). The
1522 include the size of the \fBpcre\fP structure that is returned by
1523 \fBpcre_compile()\fP. The value that is passed as the argument to
1525 place the compiled data is the value returned by this option plus the size of
1527 does not alter the value returned by this option.
1531 Return the size in bytes (for all three libraries) of the data block pointed to
1532 by the \fIstudy_data\fP field in a \fBpcre_extra\fP block. If \fBpcre_extra\fP
1533 is NULL, or there is no study data, zero is returned. The fourth argument
1534 should point to a \fBsize_t\fP variable. The \fIstudy_data\fP field is set by
1535 \fBpcre_study()\fP to record information that will speed up matching (see the
1541 above). The format of the \fIstudy_data\fP block is private, but its length
1542 is made available via this option so that it can be saved and restored (see the
1551 matched string, other than at its start. The fourth argument should point to
1553 1, the character value itself can be retrieved using PCRE_INFO_REQUIREDCHAR.
1556 something of variable length. For example, for the pattern /^a\ed+z\ed+/ the
1558 /^a\edz\ed/ the returned value is 0.
1562 Return the value of the rightmost literal data unit that must exist in any
1563 matched string, other than at its start, if such a value has been recorded. The
1573 The \fBpcre_refcount()\fP function is used to maintain a reference count in the
1574 data block that contains a compiled pattern. It is provided for the benefit of
1576 of the application may be using the same compiled pattern, but you want to free
1579 When a pattern is compiled, the reference count field is initialized to zero.
1580 It is changed only by calling this function, whose action is to add the
1581 \fIadjust\fP value (which may be positive or negative) to it. The yield of the
1582 function is the new value. However, the value of the count is constrained to
1583 lie between 0 and 65535, inclusive. If the new value is outside these limits,
1584 it is forced to the appropriate limit value.
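The clamping rule for the reference count can be modelled in isolation. This sketch is not PCRE's own code; adjusted_refcount() is a hypothetical function that only mirrors the arithmetic described above (add the adjustment, then force the result into the range 0 to 65535).

```c
#include <assert.h>

/* Hypothetical model of the reference-count rule; not PCRE source. */
int adjusted_refcount(int current, int adjust)
{
    long long value;
    assert(current >= 0 && current <= 65535);
    value = (long long)current + (long long)adjust;
    if (value < 0) value = 0;            /* forced to the lower limit */
    if (value > 65535) value = 65535;    /* forced to the upper limit */
    return (int)value;
}
```

Because the count saturates rather than wrapping, decrementing a count of zero leaves it at zero, so a caller cannot tell an over-release from a legitimate final release by the return value alone.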
1586 Except when it is zero, the reference count is not correctly preserved if a
1591 .SH "MATCHING A PATTERN: THE TRADITIONAL FUNCTION"
1601 compiled pattern, which is passed in the \fIcode\fP argument. If the
1602 pattern was studied, the result of the study should be passed in the
1603 \fIextra\fP argument. You can call \fBpcre_exec()\fP with the same \fIcode\fP
1605 different subject strings with the same pattern.
1607 This function is the main matching facility of the library, and it operates in
1614 in the section about the \fBpcre_dfa_exec()\fP function.
1616 In most applications, the pattern will have been compiled (and optionally
1617 studied) in the same process that calls \fBpcre_exec()\fP. However, it is
1620 about this, see the
1632     NULL,           /* we didn't study the pattern */
1633     "some string",  /* the subject string */
1634     11,             /* the length of the subject string */
1635     0,              /* start at offset 0 in the subject */
1645 If the \fIextra\fP argument is not NULL, it must point to a \fBpcre_extra\fP
1646 data block. The \fBpcre_study()\fP function returns such a block (when it
1648 additional information in it. The \fBpcre_extra\fP block contains the following
1660 In the 16-bit version of this structure, the \fImark\fP field has type
1663 In the 32-bit version of this structure, the \fImark\fP field has type
1666 The \fIflags\fP field is used to specify which of the other fields are set. The
1677 Other flag bits should be set to zero. The \fIstudy_data\fP field and sometimes
1678 the \fIexecutable_jit\fP field are set in the \fBpcre_extra\fP block that is
1679 returned by \fBpcre_study()\fP, together with the appropriate flag bits. You
1680 should not set these yourself, but you may add to the block by setting other
1685 but which have a very large number of possibilities in their search trees. The
1689 calls repeatedly (sometimes recursively). The limit set by \fImatch_limit\fP is
1690 imposed on the number of times this function is called during a match, which
1691 has the effect of limiting the amount of backtracking that can take place. For
1692 patterns that are not anchored, the count restarts from zero for each position
1693 in the subject string.
1696 with a JIT option, the way that the matching is executed is entirely different.
1697 However, there is still the possibility of runaway matching that goes on for a
1698 very long time, and so the \fImatch_limit\fP value is also used in this case
1699 (but in a different way) to limit how long the matching can continue.
1701 The default value for the limit can be set when PCRE is built; the default
1702 default is 10 million, which handles all but the most extreme cases. You can
1703 override the default by supplying \fBpcre_exec()\fP with a \fBpcre_extra\fP
1705 the \fIflags\fP field. If the limit is exceeded, \fBpcre_exec()\fP returns
1708 A value for the match limit may also be supplied by an item at the start of a
1709 pattern of the form
1714 less than the limit set by the caller of \fBpcre_exec()\fP or, if no such limit
1715 is set, less than the default.
1718 instead of limiting the total number of times that \fBmatch()\fP is called, it
1719 limits the depth of recursion. The recursion depth is a smaller number than the
1723 Limiting the recursion depth limits the amount of machine stack that can be
1724 used, or, when PCRE has been compiled to use memory on the heap instead of the
1725 stack, the amount of heap memory that can be used. This limit is not relevant,
1729 built; the default default is the same value as the default for
1730 \fImatch_limit\fP. You can override the default by supplying \fBpcre_exec()\fP
1732 PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the \fIflags\fP field. If the limit
1735 A value for the recursion limit may also be supplied by an item at the start of
1736 a pattern of the form
1741 less than the limit set by the caller of \fBpcre_exec()\fP or, if no such limit
1742 is set, less than the default.
1744 The \fIcallout_data\fP field is used in conjunction with the "callout" feature,
1745 and is described in the
1753 then reloaded, because the tables that were used to compile a pattern are not
1754 saved with it. See the
1762 \fBWarning:\fP The tables that \fBpcre_exec()\fP uses must be the same as those
1763 that were used when the pattern was compiled. If this is not the case, the
1765 compiled and matched in the same process, this field should never be set. In
1766 this (the most common) case, the correct table pointer is automatically passed
1767 with the compiled pattern from \fBpcre_compile()\fP to \fBpcre_exec()\fP.
1769 If PCRE_EXTRA_MARK is set in the \fIflags\fP field, the \fImark\fP field must
1770 be set to point to a suitable variable. If the pattern contains any
1771 backtracking control verbs such as (*MARK:NAME), and the execution ends up with
1772 a name to pass back, a pointer to the name string (zero terminated) is placed
1773 in the variable pointed to by the \fImark\fP field. The names are within the
1775 freeing the memory of a compiled pattern. If there is no name to pass back, the
1776 variable pointed to by the \fImark\fP field is set to NULL. For details of the
1777 backtracking control verbs, see the section entitled
1782 in the
1793 The unused bits of the \fIoptions\fP argument for \fBpcre_exec()\fP must be
1794 zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_\fIxxx\fP,
1799 If the pattern was successfully studied with one of the just-in-time (JIT)
1800 compile options, the only supported options for JIT execution are
1803 unsupported option is used, JIT execution is disabled and the normal
1808 The PCRE_ANCHORED option limits \fBpcre_exec()\fP to matching at the first
1816 These options (which are mutually exclusive) control what the \eR escape
1817 sequence matches. The choice is either to match only CR, LF, or CRLF, or to
1818 match any Unicode newline sequence. These options override the choice that was
1819 made or defaulted when the pattern was compiled.
1827 These options override the newline definition that was chosen or defaulted when
1828 the pattern was compiled. For details, see the description of
1829 \fBpcre_compile()\fP above. During matching, the newline choice affects the
1830 behaviour of the dot, circumflex, and dollar metacharacters. It may also alter
1831 the way the match position is advanced after a match failure for an unanchored
1835 match attempt for an unanchored pattern fails when the current position is at a
1836 CRLF sequence, and the pattern contains no explicit matches for CR or LF
1837 characters, the match position is advanced by two characters instead of one, in
1838 other words, to after the CRLF.
1840 The above rule is a compromise that makes the most common cases work as
1841 expected. For example, if the pattern is .+A (and the PCRE_DOTALL option is not
1842 set), it does not match the string "\er\enA" because, after failing at the
1843 start, it skips both the CR and the LF before retrying. However, the pattern
1845 reference, and so advances only by one character after the first failure.
1848 characters, or one of the \er or \en escape sequences. Implicit matches such as
1849 [^X] do not count, nor does \es (which includes CR and LF in the characters
1852 Notwithstanding the above, anomalous effects may still occur when CRLF is a
1853 valid newline sequence and explicit \er or \en escapes appear in the pattern.
1857 This option specifies that the first character of the subject string is not the
1858 beginning of a line, so the circumflex metacharacter should not match before
1860 never to match. This option affects only the behaviour of the circumflex
1865 This option specifies that the end of the subject string is not the end of a
1866 line, so the dollar metacharacter should not match it nor (except in multiline
1868 compile time) causes dollar never to match. This option affects only the
1869 behaviour of the dollar metacharacter. It does not affect \eZ or \ez.
1874 there are alternatives in the pattern, they are tried. If all the alternatives
1875 match the empty string, the entire match fails. For example, if the pattern
1880 string at the start of the subject. With PCRE_NOTEMPTY set, this match is not
1881 valid, so PCRE searches further into the string for occurrences of "a" or "b".
1886 the start of the subject is permitted. If the pattern is anchored, such a match
1887 can occur only if the pattern contains \eK.
1890 does make a special case of a pattern match of the empty string within its
1891 \fBsplit()\fP function, and when using the /g modifier. It is possible to
1892 emulate Perl's behaviour after matching a null string by first trying the match
1893 again at the same offset with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then
1894 if that fails, by advancing the starting offset (see below) and trying an
1900 sample program. In the most general case, you have to check to see if the
1901 newline convention recognizes CRLF as a newline, and if so, and the current
1902 character is CR followed by LF, advance the starting offset by two characters
1907 There are a number of optimizations that \fBpcre_exec()\fP uses at the start of
1908 a match, in order to speed up the process. For example, if it is known that an
1909 unanchored match must start with a specific character, it searches the subject
1911 actually running the main matching function. This means that a special item
1912 such as (*COMMIT) at the start of a pattern is not considered until after a
1913 suitable starting point for the match has been found. Also, when callouts or
1915 skipped if the pattern is never actually used. The start-up optimizations are
1916 in effect a pre-scan of the subject that takes place before the pattern is run.
1918 The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, possibly
1919 causing performance to suffer, but ensuring that in cases where the result is
1920 "no match", the callouts do occur, and that items such as (*COMMIT) and (*MARK)
1921 are considered at every possible starting position in the subject string. If
1923 time. The use of PCRE_NO_START_OPTIMIZE at matching time (that is, passing it
1927 Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching operation.
1928 Consider the pattern
1932 When this is compiled, PCRE records the fact that a match must start with the
1933 character "A". Suppose the subject string is "DEFABC". The start-up
1934 optimization scans along the subject, finds "A" and runs the first match
1935 attempt from there. The (*COMMIT) item means that the pattern must match the
1936 current starting position, which in this case, it does. However, if the same
1937 match is run with PCRE_NO_START_OPTIMIZE set, the initial scan along the
1938 subject string does not happen. The first match attempt is run starting from
1940 the overall result is "no match". If the pattern is studied, more start-up
1941 optimizations may be used. For example, a minimum length for the subject may be
1942 recorded. Consider the pattern
1946 The minimum length for a match is one character. If the subject is "ABC", there
1948 If the pattern is studied, the final attempt does not take place, because PCRE
1949 knows that the subject is too short, and so the (*MARK) is never encountered.
1950 In this case, studying the pattern does not affect the overall match result,
1951 which is still "no match", but it does affect the auxiliary information that is
1956 When PCRE_UTF8 is set at compile time, the validity of the subject as a UTF-8
1958 The entire string is checked before any other processing takes place. The value
1959 of \fIstartoffset\fP is also checked to ensure that it points to the start of a
1960 UTF-8 character. There is a discussion about the
1965 in the
1969 page. If an invalid sequence of bytes is found, \fBpcre_exec()\fP returns the
1970 error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
1971 truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In both
1972 cases, information about the precise nature of the error may also be returned
1973 (see the descriptions of these errors in the section entitled \fIError return
1979 If \fIstartoffset\fP contains a value that does not point to the start of a
1980 UTF-8 character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is
1984 checks for performance reasons, you can set the PCRE_NO_UTF8_CHECK option when
1985 calling \fBpcre_exec()\fP. You might want to do this for the second and
1987 all the matches in a single subject string. However, you should be sure that
1988 the value of \fIstartoffset\fP points to the start of a character (or the end
1989 of the subject). When PCRE_NO_UTF8_CHECK is set, the effect of passing an
1996 These options turn on the partial matching feature. For backwards
1998 occurs if the end of the subject string is reached successfully, but there are
1999 not enough subject characters to complete the match. If this happens when
2003 PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match,
2012 In both cases, the portion of the string that was inspected when the partial
2013 match was found is set as the first matching string. There is a more detailed
2014 discussion of partial and multi-segment matching, with examples, in the
2021 .SS "The string to be matched by \fBpcre_exec()\fP"
2026 \fIstartoffset\fP. The units for \fIlength\fP and \fIstartoffset\fP are bytes
2027 for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit
2028 data items for the 32-bit library.
2030 If \fIstartoffset\fP is negative or greater than the length of the subject,
2031 \fBpcre_exec()\fP returns PCRE_ERROR_BADOFFSET. When the starting offset is
2032 zero, the search for a match starts at the beginning of the subject, and this
2033 is by far the most common case. In UTF-8 or UTF-16 mode, the offset must point
2034 to the start of a character, or the end of the subject (in UTF-32 mode, one
2035 data unit equals one character, so all offsets are valid). Unlike the pattern
2036 string, the subject may contain binary zeroes.
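When PCRE_NO_UTF8_CHECK is in use, the caller is responsible for making sure that \fIstartoffset\fP lands on a character boundary. A minimal sketch of such a check for the 8-bit library follows. The function name is hypothetical, and the test covers only the offset rule stated above (the end of the subject, or a byte that is not a 10xxxxxx continuation byte); it does not validate the subject as UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if offset is acceptable as a starting
   offset in a UTF-8 subject of the given length in bytes, otherwise 0. */
int is_valid_utf8_start_offset(const char *subject, int length, int offset)
{
    assert(subject != NULL && length >= 0);
    if (offset < 0 || offset > length)
        return 0;   /* pcre_exec() would give PCRE_ERROR_BADOFFSET */
    if (offset == length)
        return 1;   /* the end of the subject is always allowed */
    /* A continuation byte has 0b10 in its two most significant bits. */
    return (((unsigned char)subject[offset]) & 0xc0) != 0x80;
}
```

For the four-byte subject "a\exc3\exa9z" (the letter a, U+00E9, the letter z), offsets 0, 1, 3 and 4 pass the check, and offset 2 fails because it points into the middle of the two-byte character.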
2038 A non-zero starting offset is useful when searching for another match in the
2041 setting PCRE_NOTBOL in the case of a pattern that begins with any kind of
2042 lookbehind. For example, consider the pattern
2046 which finds occurrences of "iss" in the middle of words. (\eB matches only if
2047 the current position in the subject is not a word boundary.) When applied to
2048 the string "Mississippi" the first call to \fBpcre_exec()\fP finds the first
2049 occurrence. If \fBpcre_exec()\fP is called again with just the remainder of the
2050 subject, namely "issippi", it does not match, because \eB is always false at the
2051 start of the subject, which is deemed to be a word boundary. However, if
2052 \fBpcre_exec()\fP is passed the entire string again, but with \fIstartoffset\fP
2053 set to 4, it finds the second occurrence of "iss" because it is able to look
2054 behind the starting point to discover that it is preceded by a letter.
2056 Finding all the matches in a subject is tricky when the pattern can match an
2057 empty string. It is possible to emulate Perl's /g behaviour by first trying the
2058 match again at the same offset, with the PCRE_NOTEMPTY_ATSTART and
2059 PCRE_ANCHORED options, and then if that fails, advancing the starting offset
2061 do this in the
2065 sample program. In the most general case, you have to check to see if the
2066 newline convention recognizes CRLF as a newline, and if so, and the current
2067 character is CR followed by LF, advance the starting offset by two characters
2070 If a non-zero starting offset is passed when the pattern is anchored, one
2071 attempt to match at the given offset is made. This can only succeed if the
2072 pattern does not require the match to be at the start of the subject.
2078 In general, a pattern matches a certain portion of the subject, and in
2079 addition, further substrings from the subject may be picked out by parts of the
2080 pattern. Following the usage in Jeffrey Friedl's book, this is called
2081 "capturing" in what follows, and the phrase "capturing subpattern" is used for
2085 Captured substrings are returned to the caller via a vector of integers whose
2086 address is passed in \fIovector\fP. The number of elements in the vector is
2088 argument is NOT the size of \fIovector\fP in bytes.
2090 The first two-thirds of the vector is used to pass back captured substrings,
2091 each substring using a pair of integers. The remaining third of the vector is
2093 and is not available for passing back information. The number passed in
2098 in pairs of integers, starting at the beginning of \fIovector\fP, and
2099 continuing up to two-thirds of its length at the most. The first element of
2100 each pair is set to the offset of the first character in a substring, and the
2101 second is set to the offset of the first character after the end of a
2103 are byte offsets in the 8-bit library, 16-bit data item offsets in the 16-bit
2104 library, and 32-bit data item offsets in the 32-bit library. \fBNote\fP: they
2107 The first pair of integers, \fIovector[0]\fP and \fIovector[1]\fP, identify the
2108 portion of the subject string matched by the entire pattern. The next pair is
2109 used for the first capturing subpattern, and so on. The value returned by
2110 \fBpcre_exec()\fP is one more than the highest numbered pair that has been set.
2111 For example, if two substrings have been captured, the returned value is 3. If
2112 there are no capturing subpatterns, the return value from a successful match is
2113 1, indicating that just the first pair of offsets has been set.
2115 If a capturing subpattern is matched repeatedly, it is the last portion of the
2118 If the vector is too small to hold all the captured substring offsets, it is
2119 used as far as possible (up to two-thirds of its length), and the function
2120 returns a value of zero. If neither the actual string matched nor any captured
2122 passed as NULL and \fIovecsize\fP as zero. However, if the pattern contains
2123 back references and the \fIovector\fP is not big enough to remember the related
2128 in fact the vector is exactly the right size for the final match. For example,
2129 consider the pattern
2134 with subject string "abd", \fBpcre_exec()\fP will try to set the second
2136 "c" and backing up to try the second alternative. The zero return, however,
2137 does correctly indicate that the maximum number of slots (namely 2) have been
2138 filled. In similar cases where there is temporary overflow, but the final
2139 number of used slots is actually less than the maximum, a non-zero value is
2143 subpatterns there are in a compiled pattern. The smallest size for
2145 the offsets of the substring matched by the whole pattern, is (\fIn\fP+1)*3.
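The sizing rule and the pair layout can be illustrated without calling the library. In this sketch, both functions are hypothetical helpers rather than PCRE API: min_ovecsize() applies the (\fIn\fP+1)*3 formula, and copy_pair() reads one offset pair from a vector that \fBpcre_exec()\fP is assumed to have filled in already, given its positive return value rc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Smallest ovecsize that holds the whole-match pair plus n capture pairs. */
int min_ovecsize(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Hypothetical helper: copy substring n into buf as a C string.
   Returns its length, or -1 if n is out of range, the subpattern is
   unset (both offsets are -1), or buf is too small. */
int copy_pair(const char *subject, const int *ovector, int rc, int n,
              char *buf, int bufsize)
{
    int start, end, len;
    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (n < 0 || n >= rc)
        return -1;
    start = ovector[2 * n];        /* offset of the first character */
    end = ovector[2 * n + 1];      /* offset just past the last character */
    if (start < 0 || end < start)
        return -1;
    len = end - start;
    if (len >= bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For a pattern with three capturing subpatterns, min_ovecsize(3) gives 12. Library code would normally use \fBpcre_copy_substring()\fP instead of a hand-written copy; the helper here only makes the pair arithmetic visible.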
2149 the string "abc" is matched against the pattern (a|(z))(bc) the return from the
2151 happens, both values in the offset pairs corresponding to unused subpatterns
2154 Offset values that correspond to unused subpatterns at the end of the
2155 expression are also set to -1. For example, if the string "abc" is matched
2156 against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The
2157 return from the function is 2, because the highest used capturing subpattern
2158 number is 1, and the offsets for the second and third capturing subpatterns
2159 (assuming the vector is large enough, of course) are set to -1.
2161 \fBNote\fP: Elements in the first two-thirds of \fIovector\fP that do not
2162 correspond to capturing parentheses in the pattern are never changed. That is,
2164 \fIovector[0]\fP to \fIovector[2n+1]\fP are set by \fBpcre_exec()\fP. The other
2165 elements (in the first two-thirds) retain whatever values they previously had.
2167 Some convenience functions are provided for extracting the captured substrings
2175 If \fBpcre_exec()\fP fails, it returns a negative number. The following are
2176 defined in the header file:
2180 The subject string did not match the pattern.
2189 An unrecognized bit was set in the \fIoptions\fP argument.
2193 PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch
2195 compiled in an environment of one endianness is run in an environment with the
2196 other endianness. This is the error that PCRE gives when the magic number is
2201 While running the pattern match, an unknown item was encountered in the
2203 of the compiled pattern.
2207 If a pattern contains back references, but the \fIovector\fP that is passed to
2208 \fBpcre_exec()\fP is not big enough to remember the referenced substrings, PCRE
2209 gets a block of memory at the start of matching to use for this purpose. If the
2210 call via \fBpcre_malloc()\fP fails, this error is given. The memory is
2211 automatically freed at the end of matching.
2219 This error is used by the \fBpcre_copy_substring()\fP,
2225 The backtracking limit, as specified by the \fImatch_limit\fP field in a
2226 \fBpcre_extra\fP structure (or defaulted) was reached. See the description
2232 use by callout functions that want to yield a distinctive error code. See the
2241 and the PCRE_NO_UTF8_CHECK option was not set. If the size of the output vector
2242 (\fIovecsize\fP) is at least 2, the byte offset to the start of the the invalid
2243 UTF-8 character is placed in the first element, and a reason code is placed in
2244 the second element. The reason codes are listed in the
2249 For backward compatibility, if PCRE_PARTIAL_HARD is set and the problem is a
2250 truncated UTF-8 character at the end of the subject (reason codes 1 to 5),
2256 be valid (the PCRE_NO_UTF8_CHECK option was not set), but the value of
2257 \fIstartoffset\fP did not point to the beginning of a UTF-8 character or the
2258 end of the subject.
2262 The subject string did not match, but it did match partially. See the
2270 This code is no longer in use. It was formerly returned when the PCRE_PARTIAL
2278 in PCRE or by overwriting of the compiled pattern.
2282 This error is given if the value of the \fIovecsize\fP argument is negative.
2286 The internal recursion limit, as specified by the \fImatch_limit_recursion\fP
2287 field in a \fBpcre_extra\fP structure (or defaulted) was reached. See the
2296 The value of \fIstartoffset\fP was negative or greater than the length of the
2297 subject, that is, the value in \fIlength\fP.
2301 This error is returned instead of PCRE_ERROR_BADUTF8 when the subject string
2302 ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD option is set.
2303 Information about the failure is returned as for PCRE_ERROR_BADUTF8. It is in
2305 PCRE_PARTIAL_HARD precedes the implementation of returned information; it is
2311 the pattern. Specifically, it means that either the whole pattern or a
2312 subpattern has been called recursively for the second time at the same position
2313 in the subject string. Some simple patterns that might do this are detected and
2321 JIT compile option is being matched, but the memory available for the
2322 just-in-time processing stack is not large enough. See the
2330 This error is given if a pattern that was compiled by the 8-bit library is
2336 host with different endianness. The utility function
2338 so that it runs on the new host.
2343 compile option is being matched, but the matching mode (partial or complete
2344 match) does not correspond to any JIT compilation mode. When the JIT fast path
2345 function is used, this error may also be given for invalid options. See the
2363 This section applies only to the 8-bit library. The corresponding information
2364 for the 16-bit and 32-bit libraries is given in the
2375 PCRE_ERROR_SHORTUTF8, and the size of the output vector (\fIovecsize\fP) is at
2376 least 2, the offset of the start of the invalid UTF-8 character is placed in
2378 in the second element (\fIovector[1]\fP). The reason codes are given names in
2387 The string ends with a truncated UTF-8 character; the code specifies how many
2389 no longer than 4 bytes, the encoding scheme (originally defined by RFC 2279)
2390 allows for up to 6 bytes, and this is checked first; hence the possibility of
2399 The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the
2400 character do not have the binary value 0b10 (that is, either the most
2401 significant bit is 0, or the next bit is 1).
2406 A character that is valid by the RFC 2279 rules is either 5 or 6 bytes long;
2416 A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of
2428 the two bytes 0xc0, 0xae give the value 0x2e, whose correct coding uses just
2433 The two most significant bits of the first byte of a character have the binary
2434 value 0b10 (that is, the most significant bit is 1 and the second is 0). Such a
2435 byte can only validly occur as the second or subsequent byte of a multi-byte
2440 The first byte of a character has the value 0xfe or 0xff. These values can
2445 This error code was formerly used when the presence of a so-called
2467 Captured substrings can be accessed directly by using the offsets returned by
2468 \fBpcre_exec()\fP in \fIovector\fP. For convenience, the functions
2472 by number. The next section describes functions for extracting named
2476 further zero added on the end, but the result is not, of course, a C string.
2477 However, you can process such a string by referring to the length that is
2479 Unfortunately, the interface to \fBpcre_get_substring_list()\fP is not adequate
2480 for handling strings containing binary zeros, because the end of the final
2483 The first three arguments are the same for all three of these functions:
2484 \fIsubject\fP is the subject string that has just been successfully matched,
2485 \fIovector\fP is a pointer to the vector of integer offsets that was passed to
2486 \fBpcre_exec()\fP, and \fIstringcount\fP is the number of substrings that were
2487 captured by the match, including the substring that matched the entire regular
2488 expression. This is the value returned by \fBpcre_exec()\fP if it is greater
2490 space in \fIovector\fP, the value passed as \fIstringcount\fP should be the
2491 number of elements in the vector divided by three.
2495 value of zero extracts the substring that matched the entire pattern, whereas
2496 higher values extract the captured substrings. For \fBpcre_copy_substring()\fP,
2500 \fIstringptr\fP. The yield of the function is the length of the string, not
2501 including the terminating zero, or one of these error codes:
2505 The buffer was too small for \fBpcre_copy_substring()\fP, or the attempt to get
2514 memory that is obtained via \fBpcre_malloc\fP. The address of the memory block
2515 is returned via \fIlistptr\fP, which is also the start of the list of string
2516 pointers. The end of the list is marked by a NULL pointer. The yield of the
2517 function is zero if all went well, or the error code
2521 if the attempt to get the memory block failed.
2524 happen when capturing subpattern number \fIn+1\fP matches some part of the
2527 inspecting the appropriate offset in \fIovector\fP, which is negative for unset
2531 \fBpcre_free_substring_list()\fP can be used to free the memory returned by
2537 \fBpcre_free\fP directly; it is for these cases that the functions are
2564 the number of the subpattern called "xxx" is 2. If the name is known to be
2565 unique (PCRE_DUPNAMES was not set), you can find the number from the name by
2566 calling \fBpcre_get_stringnumber()\fP. The first argument is the compiled
2567 pattern, and the second is the name. The yield of the function is the
2571 Given the number, you can extract the substring directly, or use one of the
2572 functions described in the previous section. For convenience, there are also
2573 two functions that do the whole job.
2575 Most of the arguments of \fBpcre_copy_named_substring()\fP and
2576 \fBpcre_get_named_substring()\fP are the same as those for the similarly named
2577 functions that extract by number. As these are described in the previous
2581 is an extra argument, given at the start, which is a pointer to the compiled
2582 pattern. This is needed in order to gain access to the name-to-number
2588 the behaviour may not be what you want (see the next section).
2590 \fBWarning:\fP If the pattern uses the (?| feature to set up multiple
2591 subpatterns with the same number, as described in the
2596 in the
2600 page, you cannot use names to distinguish the different subpatterns, because
2601 names are not included in the compiled code. The matching process uses only
2602 numbers. For this reason, the use of different names for subpatterns of the
2614 When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns
2616 subpatterns with the same number, created by using the (?| feature. Indeed, if
2617 such subpatterns are named, they are required to use the same names.)
2620 one of the named subpatterns participates. An example is shown in the
2627 \fBpcre_get_named_substring()\fP return the first substring corresponding to
2629 returned; no data is returned. The \fBpcre_get_stringnumber()\fP function
2630 returns one of the numbers that are associated with the name, but it is not
2634 you must use the \fBpcre_get_stringtable_entries()\fP function. The first
2635 argument is the compiled pattern, and the second is the name. The third and
2636 fourth are pointers to variables which are updated by the function. After it
2637 has run, they point to the first and last entries in the name-to-number table
2638 for the given name. The function itself returns the length of each entry, or
2639 PCRE_ERROR_NOSUBSTRING (-7) if there are none. The format of the table is
2640 described above in the section entitled \fIInformation about a pattern\fP
2645 Given all the relevant entries for the name, you can extract each of their
2646 numbers, and hence the captured data, if any.
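Extracting the numbers from the returned entries can be sketched as below. The sketch assumes the 8-bit library's table layout, in which each entry starts with the group number in two bytes, most significant byte first, followed by the zero-terminated name. `first`, `last` and `entry_size` are assumed to come from \fBpcre_get_stringtable_entries()\fP; the helper name `collect_group_numbers` is hypothetical and not part of PCRE.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): walk the name table entries from
 * first to last (inclusive) and store each entry's group number.
 * Assumes the 8-bit library layout: a 2-byte number, most significant byte
 * first, followed by the zero-terminated name.
 * Returns the number of group numbers stored (at most max_numbers). */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *numbers, int max_numbers)
{
    int count = 0;
    const char *entry;

    if (first == NULL || last == NULL || numbers == NULL || entry_size <= 0)
        return 0;

    for (entry = first; entry <= last && count < max_numbers;
         entry += entry_size) {
        const unsigned char *p = (const unsigned char *)entry;
        numbers[count++] = (p[0] << 8) | p[1];
    }
    return count;
}
```

Each collected number can then be checked against \fIovector\fP, or passed to one of the functions that extract by number, to find which of the duplicate-named subpatterns actually captured data.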
2653 when it finds the first match, starting at a given point in the subject. If you
2654 want to find all possible matches, or the longest possible match, consider
2655 using the alternative matching function (see below) instead. If you cannot use
2657 can kludge it up by making use of the callout facility, which is described in
2664 What you have to do is to insert a callout right at the end of the pattern.
2665 When your callout function is called, extract and save the current matched
2676 find it helpful to have an estimate of the amount of stack that is used by
2677 \fBpcre_exec()\fP, to help them set recursion limits, as described in the
2681 documentation. The estimate that is output by \fBpcretest\fP when called with
2687 arguments, it returns instead a negative number whose absolute value is the
2689 clear that no match has happened.) The value is approximate because in some
2691 additional variables on the stack.
2693 If PCRE has been compiled to use the heap instead of the stack for recursion,
2694 the value returned is the size of each block that is obtained from the heap.
2698 .SH "MATCHING A PATTERN: THE ALTERNATIVE FUNCTION"
2709 a compiled pattern, using a matching algorithm that scans the subject string
2710 just once, and does not backtrack. This has different characteristics to the
2711 normal algorithm, and is not compatible with Perl. Some of the features of PCRE
2713 matching can be useful. For a discussion of the two matching algorithms, and a
2714 list of features that \fBpcre_dfa_exec()\fP does not support, see the
2720 The arguments for the \fBpcre_dfa_exec()\fP function are the same as for
2721 \fBpcre_exec()\fP, plus two extras. The \fIovector\fP argument is used in a
2722 different way, and this is described below. The other common arguments are used
2723 in the same way as for \fBpcre_exec()\fP, so their description is not repeated
2726 The two additional arguments provide workspace for the function. The workspace
2728 multiple paths through the pattern tree. More workspace will be needed for
2738     NULL,           /* we didn't study the pattern */
2739     "some string",  /* the subject string */
2740     11,             /* the length of the subject string */
2741     0,              /* start at offset 0 in the subject */
2751 The unused bits of the \fIoptions\fP argument for \fBpcre_dfa_exec()\fP must be
2752 zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_\fIxxx\fP,
2756 All but the last four of these are exactly the same as for \fBpcre_exec()\fP,
2762 These have the same general effect as they do for \fBpcre_exec()\fP, but the
2764 \fBpcre_dfa_exec()\fP, it returns PCRE_ERROR_PARTIAL if the end of the subject
2767 been found. When PCRE_PARTIAL_SOFT is set, the return code PCRE_ERROR_NOMATCH
2768 is converted into PCRE_ERROR_PARTIAL if the end of the subject is reached,
2770 possibility. The portion of the string that was inspected when the longest
2771 partial match was found is set as the first matching string in both cases.
2773 examples, in the
2781 Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to stop as
2782 soon as it has found one match. Because of the way the alternative algorithm
2783 works, this is necessarily the shortest possible match at the first possible
2784 matching point in the subject string.
2789 again, with additional subject characters, and have it continue with the same
2790 match. The PCRE_DFA_RESTART option requests this action; when it is set, the
2791 \fIworkspace\fP and \fIwscount\fP arguments must reference the same vector as
2792 before because data about the match so far is left in them after a partial
2793 match. There is more discussion of this facility in the
2804 substring in the subject. Note, however, that all the matches from one run of
2805 the function start at the same point in the subject. The shorter matches are
2806 all initial substrings of the longer matches. For example, if the pattern
2810 is matched against the string
2820 On success, the yield of the function is a number greater than zero, which is
2821 the number of matched substrings. The substrings themselves are returned in
2822 \fIovector\fP. Each string uses two elements; the first is the offset to the
2823 start, and the second is the offset to the end. In fact, all the strings have
2825 but it was decided to retain some compatibility with the way \fBpcre_exec()\fP
2826 returns data, even though the meaning of the strings is different.)
2828 The strings are returned in reverse order of length; that is, the longest
2830 \fIovector\fP, the yield of the function is zero, and the vector is filled with
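Reading the matches back out of \fIovector\fP can be sketched as follows. The sketch assumes `rc` is the positive value returned by \fBpcre_dfa_exec()\fP and that match `k` occupies the pair `ovector[2*k]` and `ovector[2*k+1]`; the helper name `dfa_match_length` is hypothetical and not part of PCRE.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): length of the k-th match returned
 * by pcre_dfa_exec().
 *   ovector - the offset vector that was passed to pcre_dfa_exec()
 *   rc      - the positive value returned by pcre_dfa_exec()
 *   k       - match index; 0 is the longest match
 * Returns the length in data units, or -1 if k is out of range. */
int dfa_match_length(const int *ovector, int rc, int k)
{
    if (ovector == NULL || k < 0 || k >= rc)
        return -1;
    /* Each match uses two elements: start offset, then end offset. */
    return ovector[2 * k + 1] - ovector[2 * k];
}
```

Since the matches are returned longest first, the lengths obtained for increasing `k` decrease. The helper treats a zero or negative `rc` as "no usable count", so a caller that received zero because \fIovector\fP was too small has to supply the number of pairs it knows fit in the vector instead.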
2835 repeats at the end of a pattern (as well as internally). For example, the
2837 even considering the possibility of backtracking into the repeated digits. For
2840 ("a\ed+?") or set the PCRE_NO_AUTO_POSSESS option when compiling.
2847 Many of the errors are the same as for \fBpcre_exec()\fP, and these are
2853 There are in addition the following errors that are specific to
2858 This return is given if \fBpcre_dfa_exec()\fP encounters an item in the pattern
2859 that it does not support, for instance, the use of \eC or a back reference.
2864 uses a back reference for the condition, or a test for recursion in a specific
2870 block that contains a setting of the \fImatch_limit\fP or
2876 This return is given if \fBpcre_dfa_exec()\fP runs out of space in the
2881 When a recursive subpattern is processed, the matching function calls itself
2883 error is given if the output vector is not large enough. This should be
2888 When \fBpcre_dfa_exec()\fP is called with the \fBPCRE_DFA_RESTART\fP option,
2889 some plausibility checks are made on the contents of the workspace, which
2890 should contain data about the previous partial match. If any of these checks