1<!--===- docs/DoConcurrent.md 2 3 Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. 4 See https://llvm.org/LICENSE.txt for license information. 5 SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception 6 7--> 8 9# `DO CONCURRENT` isn't necessarily concurrent 10 11```eval_rst 12.. contents:: 13 :local: 14``` 15 16A variant form of Fortran's primary looping construct was 17added to the Fortran 2008 language standard with the apparent 18intent of enabling more effective automatic parallel execution of code 19written in the standard language without the use of 20non-standard directives. 21Spelled `DO CONCURRENT`, the construct takes a rectilinear iteration 22space specification like `FORALL` and allows us to write 23a multidimensional loop nest construct with a single `DO CONCURRENT` 24statement and a single terminating `END DO` statement. 25 26Within the body of a `DO CONCURRENT` loop the program must respect 27a long list of restrictions on its use of Fortran language features. 28Actions that obviously can't be executed in parallel or that 29don't allow all iterations to execute are prohibited. 30These include: 31* Control flow statements that would prevent the loop nest from 32 executing all its iterations: `RETURN`, `EXIT`, and any 33 `GOTO` or `CYCLE` that leaves the construct. 34* Image control statements: `STOP`, `SYNC`, `LOCK`/`UNLOCK`, `EVENT`, 35 and `ALLOCATE`/`DEALLOCATE` of a coarray. 36* Calling a procedure that is not `PURE`. 37* Deallocation of any polymorphic entity, as that could cause 38 an impure FINAL subroutine to be called. 39* Messing with the IEEE floating-point control and status flags. 40* Accepting some restrictions on data flow between iterations 41 (i.e., none) and on liveness of modified objects after the loop. 42 (The details are spelled out later.) 43 44In return for accepting these restrictions, a `DO CONCURRENT` might 45compile into code that exploits the parallel features of the target 46machine to run the iterations of the `DO CONCURRENT` construct. 47One needn't necessarily require OpenACC or OpenMP directives. 48 49But it turns out that these rules, though *necessary* for safe parallel 50execution, are not *sufficient*. 51One may write conforming `DO CONCURRENT` constructs that cannot 52be safely parallelized by a compiler; worse, one may write conforming 53`DO CONCURRENT` constructs whose parallelizability a compiler cannot 54determine even in principle -- forcing a conforming compiler to 55assume the worst and generate sequential code. 56 57## Localization 58 59The Fortran language standard does not actually define `DO CONCURRENT` as a 60concurrent construct, or even as a construct that imposes sufficient 61requirements on the programmer to allow for parallel execution. 62`DO CONCURRENT` is instead defined as executing the iterations 63of the loop in some arbitrary order (see subclause 11.1.7.4.3 paragraph 3). 64 65A `DO CONCURRENT` construct cannot modify an object in one iteration 66and expect to be able to read it in another, or read it in one before it gets 67modified by another -- there's no way to synchronize inter-iteration 68communication with critical sections or atomics. 69 70But a conforming `DO CONCURRENT` construct *can* modify an object in 71multiple iterations of the loop so long as its only reads from that 72object *after* having modified it earler in the *same* iteration. 73(See 11.1.7.5 paragraph 4 for the details.) 74 75For example: 76 77``` 78 DO CONCURRENT (J=1:N) 79 TMP = A(J) + B(J) 80 C(J) = TMP 81 END DO 82 ! And TMP is undefined afterwards 83``` 84 85The scalar variable `TMP` is used in this loop in a way that conforms 86to the standard, as every use of `TMP` follows a definition that appears 87earlier in the same iteration. 88 89The idea, of course, is that a parallelizing compiler isn't required to 90use the same word of memory to hold the value of `TMP`; 91for parallel execution, `TMP` can be _localized_. 92This means that the loop can be internally rewritten as if it had been 93``` 94 DO CONCURRENT (J=1:N) 95 BLOCK 96 REAL :: TMP 97 TMP = A(J) + B(J) 98 C(J) = TMP 99 END BLOCK 100 END DO 101``` 102and thus any risk of data flow between the iterations is removed. 103 104## The identification problem 105 106The automatic localization rules of `DO CONCURRENT` that allow 107usage like `TMP` above are not limited to simple local scalar 108variables. 109They also apply to arbitrary variables, and thus may apply 110in cases that a compiler cannot determine exactly due to 111the presence of indexing, indirection, and interprocedural data flow. 112 113Let's see why this turns out to be a problem. 114 115Examples: 116``` 117 DO CONCURRENT (J=1:N) 118 T(IX(J)) = A(J) + B(J) 119 C(J) = T(IY(J)) 120 END DO 121``` 122This loop conforms to the standard language if, 123whenever `IX(J)` equals `IY(J')` for any distinct pair of iterations 124`J` and `J'`, 125then the load must be reading a value stored earlier in the 126same iteration -- so `IX(J')==IY(J')`, and hence `IX(J)==IX(J')` too, 127in this example. 128Otherwise, a load in one iteration might depend on a store 129in another. 130 131When all values of `IX(J)` are distinct, and the program conforms 132to the restrictions of `DO CONCURRENT`, a compiler can parallelize 133the construct easily without applying localization to `T(...)`. 134And when some values of `IX(J)` are duplicates, a compiler can parallelize 135the loop by forwarding the stored value to the load in those 136iterations. 137But at compilation time, there's _no way to distinguish_ these 138cases in general, and a conservative implementation has to assume 139the worst and run the loop's iterations serially. 140(Or compare `IX(J)` with `IY(J)` at runtime and forward the 141stored value conditionally, which adds overhead and becomes 142quickly impractical in loops with multiple loads and stores.) 143 144In 145``` 146 TYPE :: T 147 REAL, POINTER :: P 148 END TYPE 149 TYPE(T) :: T1(N), T2(N) 150 DO CONCURRENT (J=1:N) 151 T1(J)%P = A(J) + B(J) 152 C(J) = T2(J)%P 153 END DO 154``` 155we have the same kind of ambiguity from the compiler's perspective. 156Are the targets of the pointers used for the stores all distinct 157from the targets of the pointers used for the loads? 158The programmer may know that they are so, but a compiler 159cannot; and there is no syntax by which one can stipulate 160that they are so. 161 162## The global variable localization problem 163 164Here's another case: 165``` 166 MODULE M 167 REAL :: T 168 END MODULE 169 ... 170 USE M 171 INTERFACE 172 PURE REAL FUNCTION F(X) 173 REAL, INTENT(IN) :: X 174 END FUNCTION 175 END INTERFACE 176 DO CONCURRENT (J=1:N) 177 T = A(J) + B(J) 178 D(J) = F(A(J)) + T 179 END DO 180``` 181The variable `T` is obviously meant to be localized. 182However, a compiler can't be sure that the pure function `F` 183doesn't read from `T`; if it does, there wouldn't be a 184practical way to convey the localized copy to it. 185 186In summary, standard Fortran defines `DO CONCURRENT` as a serial 187construct with a sheaf of constraints that we assume are intended 188to enable straightforward parallelization without 189all of the complexity of defining threading models or shared memory semantics, 190with the addition of an automatic localization rule that provides 191convenient temporaries objects without requiring the use of nested 192`BLOCK` or `ASSOCIATE` constructs. 193But the language allows ambiguous cases in which a compiler can neither 1941. prove that automatic localization *is* required for a given 195 object in every iteration, nor 1961. prove that automatic localization *isn't* required in any iteration. 197 198## Locality specifiers 199 200The Fortran 2018 standard added "locality specifiers" to the 201`DO CONCURRENT` statement. 202These allow one to define some variable names as being `LOCAL` or 203`SHARED`, overriding the automatic localization rule so that it 204applies only in the remaining cases of "unspecified" locality. 205 206`LOCAL` variables are those that can be defined by more than one 207iteration but are referenced only after having been defined 208earlier in the same iteration. 209`SHARED` variables are those that, if defined in 210any iteration, are not defined or referenced in any other iteration. 211 212(There is also a `LOCAL_INIT` specifier that is not relevant to the 213problem at hand, and a `DEFAULT(NONE)` specifier that requires a 214locality specifier be present for every variable mentioned in the 215`DO CONCURRENT` construct.) 216 217These locality specifiers can help resolve some otherwise ambiguous 218cases of localization, but they're not a complete solution to the problems 219described above. 220 221First, the specifiers allow explicit localization of objects 222(like the scalar `T` in `MODULE M` above) that are not local variables 223of the subprogram. 224`DO CONCURRENT` still allows a pure procedure called from the loop 225to reference `T`, and so explicit localization just confirms the 226worst-case assumptions about interprocedural data flow 227within an iteration that a compiler must make anyway. 228 229Second, the specifiers allow arbitary variables to be localized, 230not just scalars. 231One may localize a million-element array of derived type 232with allocatable components to be created in each iteration, 233for example. 234(It is not clear whether localized objects are finalized; 235probably not.) 236 237Third, as Fortran uses context to distinguish references to 238pointers from (de)references to their targets, it's not clear 239whether `LOCAL(PTR)` localizes a pointer, its target, or both. 240 241Fourth, the specifiers can be applied only to variable _names_, 242not to any designator with subscripts or component references. 243One may have defined a derived type to hold a representation 244of a sparse matrix, using `ALLOCATABLE` components to store its 245packed data and indexing structures, but a program cannot localize 246some parts of it and share the rest. 247(Perhaps one may wrap `ASSOCIATE` constructs around the 248`DO CONCURRENT` construct; 249the interaction between locality specifiers and construct entities is 250not clearly defined in the language.) 251 252In the example above that defines `T(IX(J))` and reads from `T(IY(J))`, 253the locality specifiers can't be used to share those elements of `T()` 254that are modified at most once and localize the cases where 255`IX(J)` is a duplicate and `IY(J)==IX(J)`. 256 257Last, when a loop both defines and references many shared objects, 258including potential references to globally accessible object 259in called procedures, one may need to name all of them in a `SHARED` 260specifier. 261 262## What to do now 263 264These problems have been presented to the J3 Fortran language 265standard committee. 266Their responses in 267recent [e-mail discussions](https://mailman.j3-fortran.org/pipermail/j3/2020-July/thread.html) 268did not include an intent to address them in future standards or corrigenda. 269The most effective-looking response -- which was essentially "just use 270`DEFAULT(SHARED)` to disable all automatic localization" -- is not an 271viable option, since the language does not include such a specifier! 272 273Programmers writing `DO CONCURRENT` loops that are safely parallelizable 274need an effective means to convey to compilers that those compilers 275do not have to assume only the weaker stipulations required by 276today's `DO CONCURRENT` without having to write verbose and 277error-prone locality specifiers (when those would suffice). 278Specifically, an easy means is required that stipulates that localization 279should apply at most only to the obvious cases of local non-pointer 280non-allocatable scalars. 281 282In the LLVM Fortran compiler project (a/k/a "flang", "f18") we considered 283several solutions to this problem. 2841. Add syntax (e.g., `DO PARALLEL` or `DO CONCURRENT() DEFAULT(PARALLEL)`) 285 by which one can inform the compiler that it should localize only 286 the obvious cases of simple local scalars. 287 Such syntax seems unlikely to ever be standardized, so its usage 288 would be nonportable. 2891. Add a command-line option &/or a source directive to stipulate 290 the stronger guarantees. Obvious non-parallelizable usage in the construct 291 would elicit a stern warning. The `DO CONCURRENT` loops in the source 292 would continue to be portable to other compilers. 2931. Assume that these stronger conditions hold by default, and add a command-line 294 option &/or a source directive to "opt out" back to the weaker 295 requirements of the standard language 296 in the event that the program contains one of those inherently 297 non-parallelizable `DO CONCURRENT` loops that perhaps should never have 298 been possible to write in a conforming program in the first place. 299 Actual parallel `DO CONCURRENT` constructs would produce parallel 300 code for users who would otherwise be surprised to learn about these 301 problems in the language. 302 But this option could lead to non-standard behavior for codes that depend, 303 accidentally or not, on non-parallelizable implicit localization. 3041. Accept the standard as it exists, do the best job of automatic 305 parallelization that can be done, and refer dissatisfied users to J3. 306 This would be avoiding the problem. 307 308None of these options is without a fairly obvious disadvantage. 309The best option seems to be the one that assumes that users who write 310`DO CONCURRENT` constructs are doing so with the intent to write parallel code. 311 312## Other precedents 313 314As of August 2020, we observe that the GNU Fortran compiler (10.1) does not 315yet implement the Fortran 2018 locality clauses, but will parallelize some 316`DO CONCURRENT` constructs without ambiguous data dependences when the automatic 317parallelization option is enabled. 318 319The Intel Fortran compiler supports the new locality clauses and will parallelize 320some `DO CONCURRENT` constructs when automatic parallelization option is enabled. 321When OpenMP is enabled, ifort reports that all `DO CONCURRENT` constructs are 322parallelized, but they seem to execute in a serial fashion when data flow 323hazards are present. 324