1<!--===- docs/Parsing.md 2 3 Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. 4 See https://llvm.org/LICENSE.txt for license information. 5 SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception 6 7--> 8 9# The F18 Parser 10 11```eval_rst 12.. contents:: 13 :local: 14``` 15 16This program source code implements a parser for the Fortran programming 17language. 18 19The draft ISO standard for Fortran 2018 dated July 2017 was used as the 20primary definition of the language. The parser also accepts many features 21from previous versions of the standard that are no longer part of the Fortran 222018 language. 23 24It also accepts many features that have never been part of any version 25of the standard Fortran language but have been supported by previous 26implementations and are known or suspected to remain in use. As a 27general principle, we want to recognize and implement any such feature 28so long as it does not conflict with requirements of the current standard 29for Fortran. 30 31The parser is implemented in standard ISO C++ and requires the 2017 32edition of the language and library. The parser constitutes a reentrant 33library with no mutable or constructed static data. Best modern C++ 34programming practices are observed to ensure that the ownership of 35dynamic memory is clear, that value rather than object semantics are 36defined for the data structures, that most functions are free from 37invisible side effects, and that the strictest available type checking 38is enforced by the C++ compiler when the Fortran parser is built. 39Class inheritance is rare and dynamic polymorphism is avoided in favor 40of modern discriminated unions. To the furthest reasonable extent, the 41parser has been implemented in a declarative fashion that corresponds 42closely to the text of the Fortran language standard. 43 44The several major modules of the Fortran parser are composed into a 45top-level Parsing class, by means of which one may drive the parsing of a 46source file and receive its parse tree and error messages. The interfaces 47of the Parsing class correspond to the two major passes of the parser, 48which are described below. 49 50## Prescanning and Preprocessing 51 52The first pass is performed by an instance of the Prescanner class, 53with help from an instance of Preprocessor. 54 55The prescanner generates the "cooked character stream", implemented 56by a CookedSource class instance, in which: 57* line ends have been normalized to single ASCII LF characters (UNIX newlines) 58* all `INCLUDE` files have been expanded 59* all continued Fortran source lines have been unified 60* all comments and insignificant spaces have been removed 61* fixed form right margins have been clipped 62* extra blank card columns have been inserted into character literals 63 and Hollerith constants 64* preprocessing directives have been implemented 65* preprocessing macro invocations have been expanded 66* legacy `D` lines in fixed form source have been omitted or included 67* except for the payload in character literals, Hollerith constants, 68 and character and Hollerith edit descriptors, all letters have been 69 normalized to lower case 70* all original non-ASCII characters in Hollerith constants have been 71 decoded and re-encoded into UTF-8 72 73Lines in the cooked character stream can be of arbitrary length. 74 75The purpose of the cooked character stream is to enable the implementation 76of a parser whose sole concern is the recognition of the Fortran language 77from productions that closely correspond to the grammar that is presented 78in the Fortran standard, without having to deal with the complexity of 79all of the source-level concerns in the preceding list. 80 81The implementation of the preprocessor interacts with the prescanner by 82means of _token sequences_. These are partitionings of input lines into 83contiguous virtual blocks of characters, and are the only place in this 84Fortran compiler in which we have reified a tokenization of the program 85source; the parser proper does not have a tokenizer. The prescanner 86builds these token sequences out of source lines and supplies them 87to the preprocessor, which interprets directives and expands macro 88invocations. The token sequences returned by the preprocessor are then 89marshaled to constitute the cooked character stream that is the output of 90the prescanner. 91 92The preprocessor and prescanner can both instantiate new temporary 93instances of the Prescanner class to locate, open, and process any 94include files. 95 96The tight interaction and mutual design of the prescanner and preprocessor 97enable a principled implementation of preprocessing for the Fortran 98language that implements a reasonable facsimile of the C language 99preprocessor that is fully aware of Fortran's source forms, line 100continuation mechanisms, case insensitivity, token syntax, &c. 101 102The preprocessor always runs. There's no good reason for it not to. 103 104The content of the cooked character stream is available and useful 105for debugging, being as it is a simple value forwarded from the first major 106pass of the compiler to the second. 107 108## Source Provenance 109 110The prescanner constructs a chronicle of every file that is read by the 111parser, viz. the original source file and all others that it directly 112or indirectly includes. One copy of the content of each of these files 113is mapped or read into the address space of the parser. Memory mapping 114is used initially, but files with DOS line breaks or a missing terminal 115newline are immediately normalized in a buffer when necessary. 116 117The virtual input stream, which marshals every appearance of every file 118and every expansion of every macro invocation, is not materialized as 119an actual stream of bytes. There is, however, a mapping from each byte 120position in this virtual input stream back to whence it came (maintained 121by an instance of the AllSources class). Offsets into this virtual input 122stream constitute values of the Provenance class. Provenance values, 123and contiguous ranges thereof, are used to describe and delimit source 124positions for messaging. 125 126Further, every byte in the cooked character stream supplied by the 127prescanner to the parser can be inexpensively mapped to its provenance. 128Simple `const char *` pointers to characters in the cooked character 129stream, or to contiguous ranges thereof, are used as source position 130indicators within the parser and in the parse tree. 131 132## Messages 133 134Message texts, and snprintf-like formatting strings for constructing 135messages, are instantiated in the various components of the parser with 136C++ user defined character literals tagged with `_err_en_US` and `_en_US` 137(signifying fatality and language, with the default being the dialect of 138English used in the United States) so that they may be easily identified 139for localization. As described above, messages are associated with 140source code positions by means of provenance values. 141 142## The Parse Tree 143 144Each of the ca. 450 numbered requirement productions in the standard 145Fortran language grammar, as well as the productions implied by legacy 146extensions and preserved obsolescent features, maps to a distinct class 147in the parse tree so as to maximize the efficacy of static type checking 148by the C++ compiler. 149 150A transcription of the Fortran grammar appears with production requirement 151numbers in the commentary before these class definitions, so that one 152may easily refer to the standard (or to the parse tree definitions while 153reading that document). 154 155Three paradigms collectively implement most of the parse tree classes: 156* *wrappers*, in which a single data member `v` has been encapsulated 157 in a new type 158* *tuples* (or product types), in which several values of arbitrary 159 types have been encapsulated in a single data member `t` whose type 160 is an instance of `std::tuple<>` 161* *discriminated unions* (or sum types), in which one value whose type is 162 a dynamic selection from a set of distinct types is saved in a data 163 member `u` whose type is an instance of `std::variant<>` 164 165The use of these patterns is a design convenience, and exceptions to them 166are not uncommon wherever it made better sense to write custom definitions. 167 168Parse tree entities should be viewed as values, not objects; their 169addresses should not be abused for purposes of identification. They are 170assembled with C++ move semantics during parse tree construction. 171Their default and copy constructors are deliberately deleted in their 172declarations. 173 174The std::list<> data type is used in the parse tree to reliably store pointers 175to other relevant entries in the tree. Since the tree lists are moved and 176spliced at certain points std::list<> provides the necessary guarantee of the 177stability of pointers into these lists. 178 179There is a general purpose library by means of which parse trees may 180be traversed. 181 182## Parsing 183 184This compiler attempts to recognize the entire cooked character stream 185(see above) as a Fortran program. It records the reductions made during 186a successful recognition as a parse tree value. The recognized grammar 187is that of a whole source file, not just of its possible statements, 188so the parser has no global state that tracks the subprogram hierarchy 189or the structure of their nested block constructs. The parser performs 190no semantic analysis along the way, deferring all of that work to the 191next pass of the compiler. 192 193The resulting parse tree therefore necessarily contains ambiguous parses 194that cannot be resolved without recourse to a symbol table. Most notably, 195leading assignments to array elements can be misrecognized as statement 196function definitions, and array element references can be misrecognized 197as function calls. The semantic analysis phase of the compiler performs 198local rewrites of the parse tree once it can be disambiguated by symbols 199and types. 200 201Formally speaking, this parser is based on recursive descent with 202localized backtracking (specifically, it will not backtrack into a 203successful reduction to try its other alternatives). It is not generated 204as a table or code from a specification of the Fortran grammar; rather, it 205_is_ the grammar, as declaratively respecified in C++ constant expressions 206using a small collection of basic token recognition objects and a library 207of "parser combinator" template functions that compose them to form more 208complicated recognizers and their correspondences to the construction 209of parse tree values. 210 211## Unparsing 212 213Parse trees can be converted back into free form Fortran source code. 214This formatter is not really a classical "pretty printer", but is 215more of a data structure dump whose output is suitable for compilation 216by another compiler. It is also used for testing the parser, since a 217reparse of an unparsed parse tree should be an identity function apart from 218source provenance. 219