1<!--===- docs/ParserCombinators.md
2
3   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
4   See https://llvm.org/LICENSE.txt for license information.
5   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
6
7-->
8
9# Parser Combinators
10
11```eval_rst
12.. contents::
13   :local:
14```
15
16This document is a primer on Parser Combinators and their use in Flang.
17
18## Concept
19The Fortran language recognizer here can be classified as an LL recursive
20descent parser.  It is composed from a *parser combinator* library that
21defines a few fundamental parsers and a few ways to compose them into more
22powerful parsers.
23
24For our purposes here, a *parser* is any object that attempts to recognize
25an instance of some syntax from an input stream.  It may succeed or fail.
26On success, it may return some semantic value to its caller.
27
28In C++ terms, a parser is any instance of a class that
291. has a `constexpr` default constructor,
301. defines a type named `resultType`, and
311. provides a function (`const` member or `static`) that accepts a reference to a
32`ParseState` as its argument and returns a `std::optional<resultType>` as a
33result, with the presence or absence of a value in the `std::optional<>`
34signifying success or failure, respectively.
35```
36std::optional<resultType> Parse(ParseState &) const;
37```
38The `resultType` of a parser is typically the class type of some particular
39node type in the parse tree.
40
41`ParseState` is a class that encapsulates a position in the source stream,
42collects messages, and holds a few state flags that determine tokenization
43(e.g., are we in a character literal?).  Instances of `ParseState` are
44independent and complete -- they are cheap to duplicate whenever necessary to
45implement backtracking.
46
47The `constexpr` default constructor of a parser is important.  The functions
48(below) that operate on instances of parsers are themselves all `constexpr`.
49This use of compile-time expressions allows the entirety of a recursive
50descent parser for a language to be constructed at compilation time through
51the use of templates.
52
53### Fundamental Predefined Parsers
54These objects and functions are (or return) the fundamental parsers:
55
56* `ok` is a trivial parser that always succeeds without advancing.
57* `pure(x)` returns a trivial parser that always succeeds without advancing,
58  returning some value `x`.
59* `pure<T>()` is `pure(T{})` but does not require that T be copy-constructible.
60* `fail<T>(msg)` denotes a trivial parser that always fails, emitting the
61  given message as a side effect.  The template parameter is the type of
62  the value that the parser never returns.
63* `nextCh` consumes the next character and returns its location,
64  and fails at EOF.
65* `"xyz"_ch` succeeds if the next character consumed matches any of those
66  in the string and returns its location.  Be advised that the source
67  will have been normalized to lower case (miniscule) letters outside
68  character and Hollerith literals and edit descriptors before parsing.
69
70### Combinators
71These functions and operators combine existing parsers to generate new parsers.
72They are `constexpr`, so they should be viewed as type-safe macros.
73
74* `!p` succeeds if p fails, and fails if p succeeds.
75* `p >> q` fails if p does, otherwise running q and returning its value when
76  it succeeds.
77* `p / q` fails if p does, otherwise running q and returning p's value
78  if q succeeds.
79* `p || q` succeeds if p does, otherwise running q.  The two parsers must
80  have the same type, and the value returned by the first succeeding parser
81  is the value of the combination.
82* `first(p1, p2, ...)` returns the value of the first parser that succeeds.
83  All of the parsers in the list must return the same type.
84  It is essentially the same as `p1 || p2 || ...` but has a slightly
85  faster implementation and may be easier to format in your code.
86* `lookAhead(p)` succeeds if p does, but doesn't modify any state.
87* `attempt(p)` succeeds if p does, safely preserving state on failure.
88* `many(p)` recognizes a greedy sequence of zero or more nonempty successes
89  of p, and returns `std::list<>` of their values.  It always succeeds.
90* `some(p)` recognized a greedy sequence of one or more successes of p.
91  It fails if p immediately fails.
92* `skipMany(p)` is the same as `many(p)`, but it discards the results.
93* `maybe(p)` tries to match p, returning an `std::optional<T>` value.
94  It always succeeds.
95* `defaulted(p)` matches p, and when p fails it returns a
96  default-constructed instance of p's resultType.  It always succeeds.
97* `nonemptySeparated(p, q)` repeatedly matches "p q p q p q ... p",
98  returning a `std::list<>` of only the values of the p's.  It fails if
99  p immediately fails.
100* `extension(p)` parses p if strict standard compliance is disabled,
101   or with a warning if nonstandard usage warnings are enabled.
102* `deprecated(p)` parses p if strict standard compliance is disabled,
103  with a warning if deprecated usage warnings are enabled.
104* `inContext(msg, p)` runs p within an error message context; any
105  message that `p` generates will be tagged with `msg` as its
106  context.  Contexts may nest.
107* `withMessage(msg, p)` succeeds if `p` does, and if it does not,
108  it discards the messages from `p` and fails with the specified message.
109* `recovery(p, q)` is equivalent to `p || q`, except that error messages
110  generated from the first parser are retained, and a flag is set in
111  the ParseState to remember that error recovery was necessary.
112* `localRecovery(msg, p, q)` is equivalent to
113  `recovery(withMessage(msg, p), q >> pure<A>())` where `A` is the
114  result type of 'p'.
115  It is useful for targeted error recovery situations within statements.
116
117Note that
118```
119a >> b >> c / d / e
120```
121matches a sequence of five parsers, but returns only the result that was
122obtained by matching `c`.
123
124### Applicatives
125The following *applicative* combinators combine parsers and modify or
126collect the values that they return.
127
128* `construct<T>(p1, p2, ...)` matches zero or more parsers in succession,
129  collecting their results and then passing them with move semantics to a
130  constructor for the type T if they all succeed.
131  If there is a single parser as the argument and it returns no usable
132  value but only success or failure (_e.g.,_ `"IF"_tok`), the default
133  nullary constructor of the type `T` is called.
134* `sourced(p)` matches p, and fills in its `source` data member with the
135  locations of the cooked character stream that it consumed
136* `applyFunction(f, p1, p2, ...)` matches one or more parsers in succession,
137  collecting their results and passing them as rvalue reference arguments to
138  some function, returning its result.
139* `applyLambda([](&&x){}, p1, p2, ...)` is the same thing, but for lambdas
140  and other function objects.
141* `applyMem(mf, p1, p2, ...)` is the same thing, but invokes a member
142  function of the result of the first parser for updates in place.
143
144### Token Parsers
145Last, we have these basic parsers on which the actual grammar of the Fortran
146is built.  All of the following parsers consume characters acquired from
147`nextCh`.
148
149* `space` always succeeds after consuming any spaces
150* `spaceCheck` always succeeds after consuming any spaces, and can emit
151  a warning if there was no space in free form code before a character
152  that could continue a name or keyword
153* `digit` matches one cooked decimal digit (0-9)
154* `letter` matches one cooked letter (A-Z)
155* `"..."_tok` match the content of the string, skipping spaces before and
156  after.  Internal spaces are optional matches.  The `_tok` suffix is
157  optional when the parser appears before the combinator `>>` or after
158  the combinator `/`.
159* `"..."_sptok` is a string match in which the spaces are required in
160   free form source.
161* `"..."_id` is a string match for a complete identifier (not a prefix of
162   a longer identifier or keyword).
163* `parenthesized(p)` is shorthand for `"(" >> p / ")"`.
164* `bracketed(p)` is shorthand for `"[" >> p / "]"`.
165* `nonEmptyList(p)` matches a comma-separated list of one or more
166  instances of p.
167* `nonEmptyList(errorMessage, p)` is equivalent to
168  `withMessage(errorMessage, nonemptyList(p))`, which allows one to supply
169  a meaningful error message in the event of an empty list.
170* `optionalList(p)` is the same thing, but can be empty, and always succeeds.
171
172### Debugging Parser
173Last, a string literal `"..."_debug` denotes a parser that emits the string to
174`llvm::errs` and succeeds.  It is useful for tracing while debugging a parser but should
175obviously not be committed for production code.
176