1README file for PCRE2 (Perl-compatible regular expression library)
2------------------------------------------------------------------
3
4PCRE2 is a re-working of the original PCRE1 library to provide an entirely new
5API. Since its initial release in 2015, there has been further development of
6the code and it now differs from PCRE1 in more than just the API. There are new
7features and the internals have been improved. The latest release of PCRE2 is
8available in three alternative formats from:
9
10https://ftp.pcre.org/pub/pcre/pcre2-10.xx.tar.gz
11https://ftp.pcre.org/pub/pcre/pcre2-10.xx.tar.bz2
12https://ftp.pcre.org/pub/pcre/pcre2-10.xx.tar.zip
13
14There is a mailing list for discussion about the development of PCRE (both the
15original and new APIs) at pcre-dev@exim.org. You can access the archives and
16subscribe or manage your subscription here:
17
18   https://lists.exim.org/mailman/listinfo/pcre-dev
19
20Please read the NEWS file if you are upgrading from a previous release. The
21contents of this README file are:
22
23  The PCRE2 APIs
24  Documentation for PCRE2
26  Building PCRE2 on non-Unix-like systems
27  Building PCRE2 without using autotools
28  Building PCRE2 using autotools
29  Retrieving configuration information
30  Shared libraries
31  Cross-compiling using autotools
32  Making new tarballs
33  Testing PCRE2
34  Character tables
35  File manifest
36
37
38The PCRE2 APIs
39--------------
40
41PCRE2 is written in C, and it has its own API. There are three sets of
42functions, one for the 8-bit library, which processes strings of bytes, one for
43the 16-bit library, which processes strings of 16-bit values, and one for the
4432-bit library, which processes strings of 32-bit values. Unlike PCRE1, there
45are no C++ wrappers.
46
47The distribution does contain a set of C wrapper functions for the 8-bit
48library that are based on the POSIX regular expression API (see the pcre2posix
49man page). These are built into a library called libpcre2-posix. Note that this
50just provides a POSIX calling interface to PCRE2; the regular expressions
51themselves still follow Perl syntax and semantics. The POSIX API is restricted,
52and does not give full access to all of PCRE2's facilities.
53
54The header file for the POSIX-style functions is called pcre2posix.h. The
55official POSIX name is regex.h, but I did not want to risk possible problems
56with existing files of that name by distributing it that way. To use PCRE2 with
57an existing program that uses the POSIX API, pcre2posix.h will have to be
58renamed or pointed at by a link (or the program modified, of course). See the
59pcre2posix documentation for more details.
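
One way to provide such a link without touching any system regex.h is to place
a symbolic link in a private include directory. The following is a minimal
sketch only: the function name link_pcre2posix_as_regex, the directory
./compat, and the install path /usr/local/include are assumptions made for
illustration and are not part of the PCRE2 distribution.

```shell
# Hypothetical helper: expose pcre2posix.h under the POSIX name regex.h in a
# private include directory, leaving any system regex.h untouched.
link_pcre2posix_as_regex() {
  header=$1   # path to the installed pcre2posix.h
  incdir=$2   # private include directory (created if missing)
  mkdir -p "$incdir" && ln -sf "$header" "$incdir/regex.h"
}
```

After calling "link_pcre2posix_as_regex /usr/local/include/pcre2posix.h
./compat", compiling with something like "cc -I./compat prog.c -lpcre2-posix
-lpcre2-8" should make "#include <regex.h>" resolve to the PCRE2 header,
assuming the compiler searches -I directories before the system ones.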
60
61
62Documentation for PCRE2
63-----------------------
64
65If you install PCRE2 in the normal way on a Unix-like system, you will end up
66with a set of man pages whose names all start with "pcre2". The one that is
67just called "pcre2" lists all the others. In addition to these man pages, the
68PCRE2 documentation is supplied in two other forms:
69
70  1. There are files called doc/pcre2.txt, doc/pcre2grep.txt, and
71     doc/pcre2test.txt in the source distribution. The first of these is a
72     concatenation of the text forms of all the section 3 man pages except the
73     listing of pcre2demo.c and those that summarize individual functions. The
74     other two are the text forms of the section 1 man pages for the pcre2grep
75     and pcre2test commands. These text forms are provided for ease of scanning
76     with text editors or similar tools. They are installed in
77     <prefix>/share/doc/pcre2, where <prefix> is the installation prefix
78     (defaulting to /usr/local).
79
80  2. A set of files containing all the documentation in HTML form, hyperlinked
81     in various ways, and rooted in a file called index.html, is distributed in
82     doc/html and installed in <prefix>/share/doc/pcre2/html.
83
84
85Building PCRE2 on non-Unix-like systems
86---------------------------------------
87
88For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
89your system supports the use of "configure" and "make" you may be able to build
90PCRE2 using autotools in the same way as for many Unix-like systems.
91
92PCRE2 can also be configured using CMake, which can be run in various ways
93(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
94NON-AUTOTOOLS-BUILD has information about CMake.
95
96PCRE2 has been compiled on many different operating systems. It should be
97straightforward to build PCRE2 on any system that has a Standard C compiler and
98library, because it uses only Standard C functions.
99
100
101Building PCRE2 without using autotools
102--------------------------------------
103
104The use of autotools (in particular, libtool) is problematic in some
105environments, even some that are Unix or Unix-like. See the NON-AUTOTOOLS-BUILD
106file for ways of building PCRE2 without using autotools.
107
108
109Building PCRE2 using autotools
110------------------------------
111
112The following instructions assume the use of the widely used "configure; make;
113make install" (autotools) process.
114
To build PCRE2 on a system that supports autotools, first run the "configure"
command from the PCRE2 distribution directory, with your current directory set
to the directory where you want the files to be created. This command is a
standard GNU "autoconf" configuration script, for which generic instructions
are supplied in the file INSTALL.
120
121Most commonly, people build PCRE2 within its own distribution directory, and in
122this case, on many systems, just running "./configure" is sufficient. However,
123the usual methods of changing standard defaults are available. For example:
124
125CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local
126
127This command specifies that the C compiler should be run with the flags '-O2
128-Wall' instead of the default, and that "make install" should install PCRE2
129under /opt/local instead of the default /usr/local.
130
131If you want to build in a different directory, just run "configure" with that
132directory as current. For example, suppose you have unpacked the PCRE2 source
133into /source/pcre2/pcre2-xxx, but you want to build it in
134/build/pcre2/pcre2-xxx:
135
136cd /build/pcre2/pcre2-xxx
137/source/pcre2/pcre2-xxx/configure
138
139PCRE2 is written in C and is normally compiled as a C library. However, it is
140possible to build it as a C++ library, though the provided building apparatus
141does not have any features to support this.
142
143There are some optional features that can be included or omitted from the PCRE2
144library. They are also documented in the pcre2build man page.
145
. By default, both shared and static libraries are built. You can change this
  by adding one of these options to the "configure" command:

  --disable-shared
  --disable-static

  (See also "Shared libraries" below.)
153
154. By default, only the 8-bit library is built. If you add --enable-pcre2-16 to
155  the "configure" command, the 16-bit library is also built. If you add
156  --enable-pcre2-32 to the "configure" command, the 32-bit library is also
157  built. If you want only the 16-bit or 32-bit library, use --disable-pcre2-8
158  to disable building the 8-bit library.
159
160. If you want to include support for just-in-time (JIT) compiling, which can
161  give large performance improvements on certain platforms, add --enable-jit to
162  the "configure" command. This support is available only for certain hardware
163  architectures. If you try to enable it on an unsupported architecture, there
164  will be a compile time error. If in doubt, use --enable-jit=auto, which
165  enables JIT only if the current hardware is supported.
166
. If you are enabling JIT in an SELinux environment, you may also want to add
  --enable-jit-sealloc, which enables the use of an executable memory allocator
  that is compatible with SELinux. Warning: this allocator is experimental!
  It does not support the fork() operation and may crash when no disk space is
  available. This option has no effect if JIT is disabled.
172
173. If you do not want to make use of the default support for UTF-8 Unicode
174  character strings in the 8-bit library, UTF-16 Unicode character strings in
175  the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
176  library, you can add --disable-unicode to the "configure" command. This
177  reduces the size of the libraries. It is not possible to configure one
178  library with Unicode support, and another without, in the same configuration.
179  It is also not possible to use --enable-ebcdic (see below) with Unicode
180  support, so if this option is set, you must also use --disable-unicode.
181
  When Unicode support is available, the use of a UTF encoding still has to be
  enabled by setting the PCRE2_UTF option at run time or starting a pattern
  with (*UTF). When PCRE2 is compiled with Unicode support, its input can be
  only ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
186
187  As well as supporting UTF strings, Unicode support includes support for the
188  \P, \p, and \X sequences that recognize Unicode character properties.
189  However, only the basic two-letter properties such as Lu are supported.
190  Escape sequences such as \d and \w in patterns do not by default make use of
191  Unicode properties, but can be made to do so by setting the PCRE2_UCP option
192  or starting a pattern with (*UCP).
193
194. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
195  of the preceding, or any of the Unicode newline sequences, or the NUL (zero)
196  character as indicating the end of a line. Whatever you specify at build time
197  is the default; the caller of PCRE2 can change the selection at run time. The
198  default newline indicator is a single LF character (the Unix standard). You
199  can specify the default newline indicator by adding --enable-newline-is-cr,
200  --enable-newline-is-lf, --enable-newline-is-crlf,
201  --enable-newline-is-anycrlf, --enable-newline-is-any, or
202  --enable-newline-is-nul to the "configure" command, respectively.
203
204. By default, the sequence \R in a pattern matches any Unicode line ending
205  sequence. This is independent of the option specifying what PCRE2 considers
206  to be the end of a line (see above). However, the caller of PCRE2 can
207  restrict \R to match only CR, LF, or CRLF. You can make this the default by
208  adding --enable-bsr-anycrlf to the "configure" command (bsr = "backslash R").
209
210. In a pattern, the escape sequence \C matches a single code unit, even in a
211  UTF mode. This can be dangerous because it breaks up multi-code-unit
212  characters. You can build PCRE2 with the use of \C permanently locked out by
213  adding --enable-never-backslash-C (note the upper case C) to the "configure"
214  command. When \C is allowed by the library, individual applications can lock
215  it out by calling pcre2_compile() with the PCRE2_NEVER_BACKSLASH_C option.
216
217. PCRE2 has a counter that limits the depth of nesting of parentheses in a
218  pattern. This limits the amount of system stack that a pattern uses when it
219  is compiled. The default is 250, but you can change it by setting, for
220  example,
221
222  --with-parens-nest-limit=500
223
224. PCRE2 has a counter that can be set to limit the amount of computing resource
225  it uses when matching a pattern. If the limit is exceeded during a match, the
226  match fails. The default is ten million. You can change the default by
227  setting, for example,
228
229  --with-match-limit=500000
230
231  on the "configure" command. This is just the default; individual calls to
232  pcre2_match() or pcre2_dfa_match() can supply their own value. There is more
233  discussion in the pcre2api man page (search for pcre2_set_match_limit).
234
235. There is a separate counter that limits the depth of nested backtracking
236  (pcre2_match()) or nested function calls (pcre2_dfa_match()) during a
237  matching process, which indirectly limits the amount of heap memory that is
238  used, and in the case of pcre2_dfa_match() the amount of stack as well. This
239  counter also has a default of ten million, which is essentially "unlimited".
240  You can change the default by setting, for example,
241
242  --with-match-limit-depth=5000
243
244  There is more discussion in the pcre2api man page (search for
245  pcre2_set_depth_limit).
246
247. You can also set an explicit limit on the amount of heap memory used by
248  the pcre2_match() and pcre2_dfa_match() interpreters:
249
250  --with-heap-limit=500
251
252  The units are kibibytes (units of 1024 bytes). This limit does not apply when
253  the JIT optimization (which has its own memory control features) is used.
254  There is more discussion on the pcre2api man page (search for
255  pcre2_set_heap_limit).
256
257. In the 8-bit library, the default maximum compiled pattern size is around
258  64 kibibytes. You can increase this by adding --with-link-size=3 to the
259  "configure" command. PCRE2 then uses three bytes instead of two for offsets
260  to different parts of the compiled pattern. In the 16-bit library,
261  --with-link-size=3 is the same as --with-link-size=4, which (in both
262  libraries) uses four-byte offsets. Increasing the internal link size reduces
263  performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
264  link size setting is ignored, as 4-byte offsets are always used.
265
266. For speed, PCRE2 uses four tables for manipulating and identifying characters
267  whose code point values are less than 256. By default, it uses a set of
268  tables for ASCII encoding that is part of the distribution. If you specify
269
270  --enable-rebuild-chartables
271
272  a program called pcre2_dftables is compiled and run in the default C locale
273  when you obey "make". It builds a source file called pcre2_chartables.c. If
274  you do not specify this option, pcre2_chartables.c is created as a copy of
275  pcre2_chartables.c.dist. See "Character tables" below for further
276  information.
277
278. It is possible to compile PCRE2 for use on systems that use EBCDIC as their
279  character code (as opposed to ASCII/Unicode) by specifying
280
281  --enable-ebcdic --disable-unicode
282
283  This automatically implies --enable-rebuild-chartables (see above). However,
284  when PCRE2 is built this way, it always operates in EBCDIC. It cannot support
285  both EBCDIC and UTF-8/16/32. There is a second option, --enable-ebcdic-nl25,
286  which specifies that the code value for the EBCDIC NL character is 0x25
287  instead of the default 0x15.
288
289. If you specify --enable-debug, additional debugging code is included in the
290  build. This option is intended for use by the PCRE2 maintainers.
291
292. In environments where valgrind is installed, if you specify
293
294  --enable-valgrind
295
296  PCRE2 will use valgrind annotations to mark certain memory regions as
297  unaddressable. This allows it to detect invalid memory accesses, and is
298  mostly useful for debugging PCRE2 itself.
299
300. In environments where the gcc compiler is used and lcov is installed, if you
301  specify
302
303  --enable-coverage
304
305  the build process implements a code coverage report for the test suite. The
306  report is generated by running "make coverage". If ccache is installed on
307  your system, it must be disabled when building PCRE2 for coverage reporting.
308  You can do this by setting the environment variable CCACHE_DISABLE=1 before
309  running "make" to build PCRE2. There is more information about coverage
310  reporting in the "pcre2build" documentation.
311
312. When JIT support is enabled, pcre2grep automatically makes use of it, unless
313  you add --disable-pcre2grep-jit to the "configure" command.
314
315. There is support for calling external programs during matching in the
316  pcre2grep command, using PCRE2's callout facility with string arguments. This
317  support can be disabled by adding --disable-pcre2grep-callout to the
318  "configure" command. There are two kinds of callout: one that generates
319  output from inbuilt code, and another that calls an external program. The
320  latter has special support for Windows and VMS; otherwise it assumes the
321  existence of the fork() function. This facility can be disabled by adding
322  --disable-pcre2grep-callout-fork to the "configure" command.
323
324. The pcre2grep program currently supports only 8-bit data files, and so
325  requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
326  libz and/or libbz2, in order to read .gz and .bz2 files (respectively), by
327  specifying one or both of
328
329  --enable-pcre2grep-libz
330  --enable-pcre2grep-libbz2
331
332  Of course, the relevant libraries must be installed on your system.
333
334. The default starting size (in bytes) of the internal buffer used by pcre2grep
335  can be set by, for example:
336
337  --with-pcre2grep-bufsize=51200
338
339  The value must be a plain integer. The default is 20480. The amount of memory
340  used by pcre2grep is actually three times this number, to allow for "before"
341  and "after" lines. If very long lines are encountered, the buffer is
342  automatically enlarged, up to a fixed maximum size.
343
344. The default maximum size of pcre2grep's internal buffer can be set by, for
345  example:
346
347  --with-pcre2grep-max-bufsize=2097152
348
349  The default is either 1048576 or the value of --with-pcre2grep-bufsize,
350  whichever is the larger.
351
352. It is possible to compile pcre2test so that it links with the libreadline
353  or libedit libraries, by specifying, respectively,
354
355  --enable-pcre2test-libreadline or --enable-pcre2test-libedit
356
357  If this is done, when pcre2test's input is from a terminal, it reads it using
358  the readline() function. This provides line-editing and history facilities.
359  Note that libreadline is GPL-licenced, so if you distribute a binary of
360  pcre2test linked in this way, there may be licensing issues. These can be
361  avoided by linking with libedit (which has a BSD licence) instead.
362
  Enabling libreadline causes the -lreadline option to be added to the
  pcre2test build. In many operating environments with a system-installed
  readline library this is sufficient. However, in some environments (e.g. if
  an unmodified distribution version of readline is in use), it may be
  necessary to specify something like LIBS="-lncurses" as well. This is
  because, to quote the readline INSTALL, "Readline uses the termcap functions,
  but does not link with the termcap or curses library itself, allowing
  applications which link with readline the to choose an appropriate library."
  If you get error messages about missing functions tgetstr, tgetent, tputs,
  tgetflag, or tgoto, this is the problem, and linking with the ncurses library
  should fix it.
374
. The C99 standard defines formatting modifiers z and t for size_t and
  ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers in
  environments other than Microsoft Visual Studio when __STDC_VERSION__ is
  defined and has a value greater than or equal to 199901L (indicating C99).
  However, there is at least one environment that claims to be C99 but does not
  support these modifiers. If --disable-percent-zt is specified, no use is made
  of the z or t modifiers. Instead of %td or %zu, %lu is used, with a cast for
  size_t values.
383
. There is a special option called --enable-fuzz-support for use by people who
  want to run fuzzing tests on PCRE2. At present this applies only to the 8-bit
  library. If set, it causes an extra library called libpcre2-fuzzsupport.a to
  be built, but not installed. This contains a single function called
  LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and the
  length of the string. When called, this function tries to compile the string
  as a pattern, and if that succeeds, to match it. This is done both with no
  options and with some random option bits that are generated from the string.
  Setting --enable-fuzz-support also causes a binary called pcre2fuzzcheck to
  be created. This is normally run under valgrind or used when PCRE2 is
  compiled with address sanitizing enabled. It calls the fuzzing function and
  outputs information about what it is doing. The input strings are specified
  by arguments: if an argument starts with "=" the rest of it is a literal
  input string. Otherwise, it is assumed to be a file name, and the contents of
  the file are the test string.
399
400. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
401  which caused pcre2_match() to use individual blocks on the heap for
402  backtracking instead of recursive function calls (which use the stack). This
403  is now obsolete since pcre2_match() was refactored always to use the heap (in
404  a much more efficient way than before). This option is retained for backwards
405  compatibility, but has no effect other than to output a warning.
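
As an illustration of how several of the options above can be combined, the
following sketch assembles a "configure" command line and prints it for review
instead of running it. The function name pcre2_configure_cmd is hypothetical,
and the particular prefix and limit values are arbitrary choices taken from the
examples above, not recommendations.

```shell
# Hypothetical helper: build up a "configure" command line from options
# described in this README. Nothing is executed; the command is only printed.
pcre2_configure_cmd() {
  set -- ./configure --prefix=/opt/local            # example prefix
  set -- "$@" --enable-pcre2-16 --enable-pcre2-32   # build all three widths
  set -- "$@" --enable-jit=auto                     # JIT only where supported
  set -- "$@" --enable-newline-is-anycrlf           # default newline convention
  set -- "$@" --with-match-limit=500000             # lower default match limit
  set -- "$@" --with-heap-limit=500                 # heap limit in kibibytes
  printf '%s\n' "$*"
}
```

Running "pcre2_configure_cmd" prints a single line that can be inspected and
then pasted into a shell in the build directory.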
406
407The "configure" script builds the following files for the basic C library:
408
. Makefile             the makefile that builds the library
. src/config.h         build-time configuration options for the library
. src/pcre2.h          the public PCRE2 header file
. pcre2-config         script that shows the building settings such as CFLAGS
                       that were set for "configure"
. libpcre2-8.pc        )
. libpcre2-16.pc       ) data for the pkg-config command
. libpcre2-32.pc       )
. libpcre2-posix.pc    )
. libtool              script that builds shared and/or static libraries
419
420Versions of config.h and pcre2.h are distributed in the src directory of PCRE2
421tarballs under the names config.h.generic and pcre2.h.generic. These are
422provided for those who have to build PCRE2 without using "configure" or CMake.
423If you use "configure" or CMake, the .generic versions are not used.
424
425The "configure" script also creates config.status, which is an executable
426script that can be run to recreate the configuration, and config.log, which
427contains compiler output from tests that "configure" runs.
428
429Once "configure" has run, you can run "make". This builds whichever of the
430libraries libpcre2-8, libpcre2-16 and libpcre2-32 are configured, and a test
431program called pcre2test. If you enabled JIT support with --enable-jit, another
432test program called pcre2_jit_test is built as well. If the 8-bit library is
433built, libpcre2-posix and the pcre2grep command are also built. Running
434"make" with the -j option may speed up compilation on multiprocessor systems.
435
436The command "make check" runs all the appropriate tests. Details of the PCRE2
437tests are given below in a separate section of this document. The -j option of
438"make" can also be used when running the tests.
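
A job count for the -j option can be chosen automatically. The sketch below is
one possible approach, not something the PCRE2 build provides: the function
name make_jobs is hypothetical, "nproc" is a GNU coreutils command, and
"getconf _NPROCESSORS_ONLN" is widely available but not guaranteed on every
system.

```shell
# Hypothetical helper: choose a job count for "make -j".
make_jobs() {
  if command -v nproc >/dev/null 2>&1; then
    nproc                         # GNU coreutils
  elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
    getconf _NPROCESSORS_ONLN     # common, though not required everywhere
  else
    echo 1                        # fall back to a serial build
  fi
}
```

With this defined, the build and the tests could be run as
"make -j"$(make_jobs)"" followed by "make -j"$(make_jobs)" check".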
439
440You can use "make install" to install PCRE2 into live directories on your
441system. The following are installed (file names are all relative to the
442<prefix> that is set when "configure" is run):
443
444  Commands (bin):
445    pcre2test
446    pcre2grep (if 8-bit support is enabled)
447    pcre2-config
448
449  Libraries (lib):
450    libpcre2-8      (if 8-bit support is enabled)
451    libpcre2-16     (if 16-bit support is enabled)
452    libpcre2-32     (if 32-bit support is enabled)
453    libpcre2-posix  (if 8-bit support is enabled)
454
455  Configuration information (lib/pkgconfig):
456    libpcre2-8.pc
457    libpcre2-16.pc
458    libpcre2-32.pc
459    libpcre2-posix.pc
460
461  Header files (include):
462    pcre2.h
463    pcre2posix.h
464
465  Man pages (share/man/man{1,3}):
466    pcre2grep.1
467    pcre2test.1
468    pcre2-config.1
469    pcre2.3
470    pcre2*.3 (lots more pages, all starting "pcre2")
471
472  HTML documentation (share/doc/pcre2/html):
473    index.html
474    *.html (lots more pages, hyperlinked from index.html)
475
476  Text file documentation (share/doc/pcre2):
477    AUTHORS
478    COPYING
479    ChangeLog
480    LICENCE
481    NEWS
482    README
483    pcre2.txt         (a concatenation of the man(3) pages)
484    pcre2test.txt     the pcre2test man page
485    pcre2grep.txt     the pcre2grep man page
486    pcre2-config.txt  the pcre2-config man page
487
488If you want to remove PCRE2 from your system, you can run "make uninstall".
489This removes all the files that "make install" installed. However, it does not
490remove any directories, because these are often shared with other programs.
491
492
493Retrieving configuration information
494------------------------------------
495
496Running "make install" installs the command pcre2-config, which can be used to
497recall information about the PCRE2 configuration and installation. For example:
498
499  pcre2-config --version
500
501prints the version number, and
502
503  pcre2-config --libs8
504
505outputs information about where the 8-bit library is installed. This command
506can be included in makefiles for programs that use PCRE2, saving the programmer
507from having to remember too many details. Run pcre2-config with no arguments to
508obtain a list of possible arguments.
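
As a sketch of how this might be used in a build script, the function below
prints a compile command for a program that uses the 8-bit library. The
function name pcre2_cc_command and the PCRE2_CONFIG override variable are
assumptions made for illustration; only the --cflags and --libs8 arguments
belong to pcre2-config itself.

```shell
# Hypothetical helper: print the compile command for an 8-bit PCRE2 program.
# $1 is the C source file, $2 is the output file.
pcre2_cc_command() {
  cfg=${PCRE2_CONFIG:-pcre2-config}       # override is illustrative only
  cflags=$("$cfg" --cflags) || return 1   # compiler flags for pcre2.h
  libs=$("$cfg" --libs8) || return 1      # linker flags for libpcre2-8
  printf 'cc %s %s -o %s %s\n' "$cflags" "$1" "$2" "$libs"
}
```

For example, "pcre2_cc_command demo.c demo" prints the command rather than
running it, so the flags can be checked first. The function returns a non-zero
status if pcre2-config cannot be run.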
509
510The pkg-config command is another system for saving and retrieving information
511about installed libraries. Instead of separate commands for each library, a
512single command is used. For example:
513
514  pkg-config --libs libpcre2-16
515
516The data is held in *.pc files that are installed in a directory called
517<prefix>/lib/pkgconfig.
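
The same data can be used to find out which PCRE2 libraries are installed. The
following is a minimal sketch, assuming a pkg-config that supports the
--exists and --modversion options; the function name pcre2_installed_modules
is hypothetical, and PKG_CONFIG is used here only as a conventional way of
naming an alternative pkg-config program.

```shell
# Hypothetical helper: list the PCRE2 pkg-config modules that are installed,
# with the version each one reports.
pcre2_installed_modules() {
  pc=${PKG_CONFIG:-pkg-config}
  for mod in libpcre2-8 libpcre2-16 libpcre2-32 libpcre2-posix; do
    if "$pc" --exists "$mod" 2>/dev/null; then
      printf '%s %s\n' "$mod" "$("$pc" --modversion "$mod")"
    fi
  done
}
```

Modules whose *.pc file cannot be found are silently skipped, so an empty
result means that pkg-config does not know about any PCRE2 library.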
518
519
520Shared libraries
521----------------
522
523The default distribution builds PCRE2 as shared libraries and static libraries,
524as long as the operating system supports shared libraries. Shared library
525support relies on the "libtool" script which is built as part of the
526"configure" process.
527
528The libtool script is used to compile and link both shared and static
529libraries. They are placed in a subdirectory called .libs when they are newly
530built. The programs pcre2test and pcre2grep are built to use these uninstalled
531libraries (by means of wrapper scripts in the case of shared libraries). When
532you use "make install" to install shared libraries, pcre2grep and pcre2test are
533automatically re-built to use the newly installed shared libraries before being
534installed themselves. However, the versions left in the build directory still
535use the uninstalled libraries.
536
537To build PCRE2 using static libraries only you must use --disable-shared when
538configuring it. For example:
539
540./configure --prefix=/usr/gnu --disable-shared
541
542Then run "make" in the usual way. Similarly, you can use --disable-static to
543build only shared libraries.
544
545
546Cross-compiling using autotools
547-------------------------------
548
549You can specify CC and CFLAGS in the normal way to the "configure" command, in
550order to cross-compile PCRE2 for some other host. However, you should NOT
551specify --enable-rebuild-chartables, because if you do, the pcre2_dftables.c
552source file is compiled and run on the local host, in order to generate the
553inbuilt character tables (the pcre2_chartables.c file). This will probably not
554work, because pcre2_dftables.c needs to be compiled with the local compiler,
555not the cross compiler.
556
557When --enable-rebuild-chartables is not specified, pcre2_chartables.c is
558created by making a copy of pcre2_chartables.c.dist, which is a default set of
559tables that assumes ASCII code. Cross-compiling with the default tables should
560not be a problem.
561
562If you need to modify the character tables when cross-compiling, you should
563move pcre2_chartables.c.dist out of the way, then compile pcre2_dftables.c by
564hand and run it on the local host to make a new version of
565pcre2_chartables.c.dist. See the pcre2build section "Creating character tables
566at build time" for more details.
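
The advice above can be sketched as a small helper that prints a
cross-compiling "configure" command and refuses the one option that should not
be used. This is an illustration only: the function name cross_configure_cmd,
the use of the standard autoconf --host option, the "<triplet>-gcc" compiler
naming, and the -O2 flag are all assumptions, not requirements of PCRE2.

```shell
# Hypothetical helper: print a cross-compiling "configure" command.
# $1 is the target triplet; any further arguments are extra configure options.
cross_configure_cmd() {
  host=$1
  shift
  for opt in "$@"; do
    case $opt in
      --enable-rebuild-chartables)
        echo "refusing $opt: the table generator must run on the local host" >&2
        return 1 ;;
    esac
  done
  printf './configure --host=%s CC=%s-gcc CFLAGS=-O2' "$host" "$host"
  if [ $# -gt 0 ]; then
    printf ' %s' "$@"
  fi
  printf '\n'
}
```

For example, "cross_configure_cmd arm-linux-gnueabihf --disable-shared" prints
a command that uses the default character tables, which is the case described
above as unproblematic.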
567
568
569Making new tarballs
570-------------------
571
572The command "make dist" creates three PCRE2 tarballs, in tar.gz, tar.bz2, and
573zip formats. The command "make distcheck" does the same, but then does a trial
574build of the new distribution to ensure that it works.
575
576If you have modified any of the man page sources in the doc directory, you
577should first run the PrepareRelease script before making a distribution. This
578script creates the .txt and HTML forms of the documentation from the man pages.
579
580
581Testing PCRE2
582-------------
583
584To test the basic PCRE2 library on a Unix-like system, run the RunTest script.
585There is another script called RunGrepTest that tests the pcre2grep command.
586When JIT support is enabled, a third test program called pcre2_jit_test is
587built. Both the scripts and all the program tests are run if you obey "make
588check". For other environments, see the instructions in NON-AUTOTOOLS-BUILD.
589
590The RunTest script runs the pcre2test test program (which is documented in its
591own man page) on each of the relevant testinput files in the testdata
592directory, and compares the output with the contents of the corresponding
593testoutput files. RunTest uses a file called testtry to hold the main output
594from pcre2test. Other files whose names begin with "test" are used as working
595files in some tests.
596
597Some tests are relevant only when certain build-time options were selected. For
598example, the tests for UTF-8/16/32 features are run only when Unicode support
599is available. RunTest outputs a comment when it skips a test.
600
601Many (but not all) of the tests that are not skipped are run twice if JIT
602support is available. On the second run, JIT compilation is forced. This
603testing can be suppressed by putting "nojit" on the RunTest command line.
604
605The entire set of tests is run once for each of the 8-bit, 16-bit and 32-bit
606libraries that are enabled. If you want to run just one set of tests, call
607RunTest with either the -8, -16 or -32 option.
608
609If valgrind is installed, you can run the tests under it by putting "valgrind"
610on the RunTest command line. To run pcre2test on just one or more specific test
611files, give their numbers as arguments to RunTest, for example:
612
613  RunTest 2 7 11
614
615You can also specify ranges of tests such as 3-6 or 3- (meaning 3 to the
616end), or a number preceded by ~ to exclude a test. For example:
617
  RunTest 3-15 ~10
619
620This runs tests 3 to 15, excluding test 10, and just ~13 runs all the tests
621except test 13. Whatever order the arguments are in, the tests are always run
622in numerical order.
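
The selection rules just described can be modelled with a short function. This
is a sketch of the rules as stated here, not the code used by RunTest: the
function name select_tests is hypothetical, and the highest test number has to
be passed as the first argument because the sketch has no other way of knowing
it.

```shell
# Hypothetical model of RunTest's selectors: N, N-M, N- (to the end), and ~N.
# $1 is the highest test number; the remaining arguments are selectors.
select_tests() {
  max=$1
  shift
  include=" "; exclude=" "; explicit=0
  for arg in "$@"; do
    case $arg in
      "~"*)
        exclude="$exclude${arg#?} " ;;
      *-*)
        first=${arg%%-*}; last=${arg#*-}
        [ -n "$last" ] || last=$max        # "N-" means N to the end
        n=$first
        while [ "$n" -le "$last" ]; do
          include="$include$n "; n=$((n + 1))
        done
        explicit=1 ;;
      *)
        include="$include$arg "; explicit=1 ;;
    esac
  done
  result=""
  n=0
  # Walk the numbers in ascending order so the output is always sorted
  while [ "$n" -le "$max" ]; do
    keep=1
    if [ "$explicit" -eq 1 ]; then
      case $include in *" $n "*) ;; *) keep=0 ;; esac
    fi
    case $exclude in *" $n "*) keep=0 ;; esac
    if [ "$keep" -eq 1 ]; then result="$result$n "; fi
    n=$((n + 1))
  done
  echo "${result% }"
}
```

For instance, "select_tests 20 3-15 ~10" lists tests 3 to 15 without test 10,
and "select_tests 20 11 2 7" lists 2, 7 and 11 in that order.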
623
624You can also call RunTest with the single argument "list" to cause it to output
625a list of tests.
626
627The test sequence starts with "test 0", which is a special test that has no
628input file, and whose output is not checked. This is because it will be
629different on different hardware and with different configurations. The test
630exists in order to exercise some of pcre2test's code that would not otherwise
631be run.
632
Tests 1 and 2 can always be run, as they expect only plain text strings (not
UTF) and make no use of Unicode properties. The first test file can be fed
directly into the perltest.sh script to check that Perl gives the same results.
The only difference you should see is in the first few lines, where the Perl
version is given instead of the PCRE2 version. The second set of tests checks
auxiliary functions, error detection, and run-time flags that are specific to
PCRE2. It also uses the debugging flags to check some of the internals of
pcre2_compile().
641
642If you build PCRE2 with a locale setting that is not the standard C locale, the
643character tables may be different (see next paragraph). In some cases, this may
644cause failures in the second set of tests. For example, in a locale where the
645isprint() function yields TRUE for characters in the range 128-255, the use of
646[:isascii:] inside a character class defines a different set of characters, and
647this shows up in this test as a difference in the compiled code, which is being
648listed for checking. For example, where the comparison test output contains
649[\x00-\x7f] the test might contain [\x00-\xff], and similarly in some other
650cases. This is not a bug in PCRE2.
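
Whether a locale behaves this way can be checked with a small C function. The
following is a sketch only; count_printable is a hypothetical helper, not part
of PCRE2 or its test suite.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Count the code points in [lo, hi] for which isprint() is true in the named
   locale. Returns -1 if the locale cannot be set. For simplicity, the
   LC_CTYPE category is reset to "C" afterwards rather than restored. */
static int count_printable(const char *locale_name, int lo, int hi)
{
  int count = 0;
  assert(lo >= 0 && hi <= 255 && lo <= hi);
  if (setlocale(LC_CTYPE, locale_name) == NULL) return -1;
  for (int c = lo; c <= hi; c++)
    if (isprint(c)) count++;
  setlocale(LC_CTYPE, "C");
  return count;
}
```

Calling count_printable("", 128, 255) queries the locale selected by the
environment. A non-zero result suggests that tables built in that locale will
differ from the default ones in the way described above.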

Test 3 checks pcre2_maketables(), the facility for building a set of character
tables for a specific locale and using them instead of the default tables. The
script uses the "locale" command to check for the availability of the "fr_FR",
"french", or "fr" locale, and uses the first one that it finds. If the "locale"
command fails, or if its output doesn't include "fr_FR", "french", or "fr" in
the list of available locales, the third test cannot be run, and a comment is
output to say why. If running this test produces an error like this:

  ** Failed to set locale "fr_FR"

it means that the given locale is not available on your system, despite being
listed by "locale". This does not mean that PCRE2 is broken. There are three
alternative output files for the third test, because three different versions
of the French locale have been encountered. The test passes if its output
matches any one of them.

Tests 4 and 5 check UTF and Unicode property support, test 4 being compatible
with the perltest.sh script, and test 5 checking PCRE2-specific things.

Tests 6 and 7 check the pcre2_dfa_match() alternative matching function, in
non-UTF mode and UTF-mode with Unicode property support, respectively.

Test 8 checks some internal offsets and code size features, but it is run only
when Unicode support is enabled. The output is different in 8-bit, 16-bit, and
32-bit modes and for different link sizes, so there are different output files
for each mode and link size.

Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. Each pair covers general cases and Unicode support, respectively.

Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes.

Test 14 contains some special UTF and UCP tests that give different output for
different code unit widths.

Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive
matcher.

Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour.

Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features.

Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively.

Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them.

Tests 21 and 22 test \C support when the use of \C is not locked out, without
and with UTF support, respectively. Test 23 tests \C when it is locked out.

Tests 24 and 25 test the experimental pattern conversion functions, without and
with UTF support, respectively.


Character tables
----------------

For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, a set of tables that is
built into the library is used. The pcre2_maketables() function can be called
by an application to create a new set of tables in the current locale. These
are passed to PCRE2 by calling pcre2_set_character_tables() to put a pointer
into a compile context.

The source file called pcre2_chartables.c contains the default set of tables.
By default, this is created as a copy of pcre2_chartables.c.dist, which
contains tables for ASCII coding. However, if --enable-rebuild-chartables is
specified for ./configure, a new version of pcre2_chartables.c is built by the
program pcre2_dftables (compiled from pcre2_dftables.c), which uses the ANSI C
character handling functions such as isalnum(), isalpha(), isupper(),
islower(), etc. to build the table sources. This means that the default C
locale that is set for your system will control the contents of these default
tables. You can change the default tables by editing pcre2_chartables.c and
then re-building PCRE2. If you do this, you should take care to ensure that the
file does not get automatically re-generated. The best way to do this is to
move pcre2_chartables.c.dist out of the way and replace it with your customized
tables.

When the pcre2_dftables program is run as a result of specifying
--enable-rebuild-chartables, it uses the default C locale that is set on your
system. It does not pay attention to the LC_xxx environment variables. In other
words, it uses the system's default locale rather than whatever the compiling
user happens to have set. If you really do want to build a source set of
character tables in a locale that is specified by the LC_xxx variables, you can
run the pcre2_dftables program by hand with the -L option. For example:

  ./pcre2_dftables -L pcre2_chartables.c.special

The second argument names the file where the source code for the tables is
written. The first two 256-byte tables provide lower casing and case flipping
functions, respectively. The next table consists of a number of 32-byte bit
maps which identify certain character classes such as digits, "word"
characters, white space, etc. These are used when building 32-byte bit maps
that represent character classes for code points less than 256. The final
256-byte table has bits indicating various character types, as follows:

    1   white space character
    2   letter
    4   lower case letter
    8   decimal digit
   16   alphanumeric or '_'
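
As an illustration of the layout just described, the following C sketch fills
a lower casing table, a case flipping table, and a character type table from
the <ctype.h> functions, and shows one way to address a 32-byte bit map. It is
not the pcre2_dftables source: the names are invented for this example, and the
bit order within each bit map byte is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Bit values from the list above; the names are made up for this sketch. */
enum {
  CT_SPACE = 1, CT_LETTER = 2, CT_LOWER = 4, CT_DIGIT = 8, CT_WORD = 16
};

/* Fill three 256-byte tables using the current C locale. */
static void build_tables(uint8_t lower[256], uint8_t flip[256],
                         uint8_t types[256])
{
  for (int c = 0; c < 256; c++) {
    uint8_t t = 0;
    lower[c] = (uint8_t)tolower(c);
    flip[c] = (uint8_t)(islower(c) ? toupper(c) : tolower(c));
    if (isspace(c)) t |= CT_SPACE;
    if (isalpha(c)) t |= CT_LETTER;
    if (islower(c)) t |= CT_LOWER;
    if (isdigit(c)) t |= CT_DIGIT;
    if (isalnum(c) || c == '_') t |= CT_WORD;
    types[c] = t;
  }
}

/* A 32-byte bit map holds one bit for each code point below 256. */
static void bitmap_set(uint8_t map[32], int c)
{
  assert(c >= 0 && c < 256);
  map[c / 8] |= (uint8_t)(1u << (c % 8));
}

static int bitmap_test(const uint8_t map[32], int c)
{
  assert(c >= 0 && c < 256);
  return (map[c / 8] >> (c % 8)) & 1;
}
```

In the standard C locale, the type entry for 'a' built this way has the letter,
lower case, and alphanumeric bits set (2 + 4 + 16), and the entry for '_' has
only the last of these.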

You can also specify -b (with or without -L) when running pcre2_dftables. This
causes the tables to be written in binary instead of as source code. A set of
binary tables can be loaded into memory by an application and passed to
pcre2_compile() in the same way as tables created dynamically by calling
pcre2_maketables(). The tables are just a string of bytes, independent of
hardware characteristics such as endianness. This means they can be bundled
with an application that runs in different environments, to ensure consistent
behaviour.

See also the pcre2build section "Creating character tables at build time".


File manifest
-------------

The distribution should contain the files listed below.

(A) Source files for the PCRE2 library functions and their headers are found in
    the src directory:

  src/pcre2_dftables.c     auxiliary program for building pcre2_chartables.c
                           when --enable-rebuild-chartables is specified

  src/pcre2_chartables.c.dist  a default set of character tables that assume
                           ASCII coding; unless --enable-rebuild-chartables is
                           specified, used by copying to pcre2_chartables.c

  src/pcre2posix.c         )
  src/pcre2_auto_possess.c )
  src/pcre2_compile.c      )
  src/pcre2_config.c       )
  src/pcre2_context.c      )
  src/pcre2_convert.c      )
  src/pcre2_dfa_match.c    )
  src/pcre2_error.c        )
  src/pcre2_extuni.c       )
  src/pcre2_find_bracket.c )
  src/pcre2_jit_compile.c  )
  src/pcre2_jit_match.c    ) sources for the functions in the library,
  src/pcre2_jit_misc.c     )   and some internal functions that they use
  src/pcre2_maketables.c   )
  src/pcre2_match.c        )
  src/pcre2_match_data.c   )
  src/pcre2_newline.c      )
  src/pcre2_ord2utf.c      )
  src/pcre2_pattern_info.c )
  src/pcre2_script_run.c   )
  src/pcre2_serialize.c    )
  src/pcre2_string_utils.c )
  src/pcre2_study.c        )
  src/pcre2_substitute.c   )
  src/pcre2_substring.c    )
  src/pcre2_tables.c       )
  src/pcre2_ucd.c          )
  src/pcre2_valid_utf.c    )
  src/pcre2_xclass.c       )

  src/pcre2_printint.c     debugging function that is used by pcre2test
  src/pcre2_fuzzsupport.c  function for (optional) fuzzing support

  src/config.h.in          template for config.h, when built by "configure"
  src/pcre2.h.in           template for pcre2.h when built by "configure"
  src/pcre2posix.h         header for the external POSIX wrapper API
  src/pcre2_internal.h     header for internal use
  src/pcre2_intmodedep.h   a mode-specific internal header
  src/pcre2_ucp.h          header for Unicode property handling

  sljit/*                  source files for the JIT compiler

(B) Source files for programs that use PCRE2:

  src/pcre2demo.c          simple demonstration of coding calls to PCRE2
  src/pcre2grep.c          source of a grep utility that uses PCRE2
  src/pcre2test.c          comprehensive test program
  src/pcre2_jit_test.c     JIT test program

(C) Auxiliary files:

  132html                  script to turn "man" pages into HTML
  AUTHORS                  information about the author of PCRE2
  ChangeLog                log of changes to the code
  CleanTxt                 script to clean nroff output for txt man pages
  Detrail                  script to remove trailing spaces
  HACKING                  some notes about the internals of PCRE2
  INSTALL                  generic installation instructions
  LICENCE                  conditions for the use of PCRE2
  COPYING                  the same, using GNU's standard name
  Makefile.in              ) template for Unix Makefile, which is built by
                           )   "configure"
  Makefile.am              ) the automake input that was used to create
                           )   Makefile.in
  NEWS                     important changes in this release
  NON-AUTOTOOLS-BUILD      notes on building PCRE2 without using autotools
  PrepareRelease           script to make preparations for "make dist"
  README                   this file
  RunTest                  a Unix shell script for running tests
  RunGrepTest              a Unix shell script for pcre2grep tests
  aclocal.m4               m4 macros (generated by "aclocal")
  config.guess             ) files used by libtool,
  config.sub               )   used only when building a shared library
  configure                a configuring shell script (built by autoconf)
  configure.ac             ) the autoconf input that was used to build
                           )   "configure" and config.h
  depcomp                  ) script to find program dependencies, generated by
                           )   automake
  doc/*.3                  man page sources for PCRE2
  doc/*.1                  man page sources for pcre2grep and pcre2test
  doc/index.html.src       the base HTML page
  doc/html/*               HTML documentation
  doc/pcre2.txt            plain text version of the man pages
  doc/pcre2test.txt        plain text documentation of test program
  install-sh               a shell script for installing files
  libpcre2-8.pc.in         template for libpcre2-8.pc for pkg-config
  libpcre2-16.pc.in        template for libpcre2-16.pc for pkg-config
  libpcre2-32.pc.in        template for libpcre2-32.pc for pkg-config
  libpcre2-posix.pc.in     template for libpcre2-posix.pc for pkg-config
  ltmain.sh                file used to build a libtool script
  missing                  ) common stub for a few missing GNU programs while
                           )   installing, generated by automake
  mkinstalldirs            script for making install directories
  perltest.sh              script for running a Perl test program
  pcre2-config.in          source of script which retains PCRE2 information
  testdata/testinput*      test data for main library tests
  testdata/testoutput*     expected test results
  testdata/grep*           input and output for pcre2grep tests
  testdata/*               other supporting test files

(D) Auxiliary files for cmake support:

  cmake/COPYING-CMAKE-SCRIPTS
  cmake/FindPackageHandleStandardArgs.cmake
  cmake/FindEditline.cmake
  cmake/FindReadline.cmake
  CMakeLists.txt
  config-cmake.h.in

(E) Auxiliary files for building PCRE2 "by hand":

  src/pcre2.h.generic     ) a version of the public PCRE2 header file
                          )   for use in non-"configure" environments
  src/config.h.generic    ) a version of config.h for use in non-"configure"
                          )   environments

Philip Hazel
Email local part: Philip.Hazel
Email domain: gmail.com
Last updated: 04 December 2020
