==================
Vectorization Plan
==================

.. contents::
   :local:

Abstract
========
The vectorization transformation can be rather complicated, involving several
potential alternatives, especially for outer-loops [1]_ but also possibly for
innermost loops. These alternatives may have significant performance impact,
both positive and negative. A cost model is therefore employed to identify the
best alternative, including the alternative of avoiding any transformation
altogether.

The Vectorization Plan is an explicit model for describing vectorization
candidates. It serves both for optimizing candidates, including estimating
their cost reliably, and for performing their final translation into IR. This
facilitates dealing with multiple vectorization candidates.

High-level Design
=================

Vectorization Workflow
----------------------
VPlan-based vectorization involves three major steps, taking a "scenario-based
approach" to vectorization planning:

1. Legal Step: check if a loop can be legally vectorized; encode constraints
   and artifacts if so.
2. Plan Step:

   a. Build initial VPlans following the constraints and decisions taken by
      Legal Step 1, and compute their cost.
   b. Apply optimizations to the VPlans, possibly forking additional VPlans.
      Prune sub-optimal VPlans having relatively high cost.
3. Execute Step: materialize the best VPlan. Note that this is the only step
   that modifies the IR.
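The sketch below illustrates how these steps might be strung together in a
driver. It is a minimal sketch, not the actual Loop Vectorizer code: the types
and helpers (``LegalityConstraints``, ``VPlanCandidate``, ``canVectorizeLoop``,
``buildInitialVPlans``, ``optimizeAndPrune``, ``pickBestPlan``) are
hypothetical placeholders used only to make the control flow concrete.

.. code-block:: c++

  // Hypothetical driver; names and signatures are illustrative only.
  bool tryToVectorizeLoop(Loop *L) {
    // 1. Legal Step: bail out early if the loop cannot be vectorized legally,
    //    otherwise record the constraints and artifacts found.
    LegalityConstraints Constraints;
    if (!canVectorizeLoop(L, Constraints))
      return false;

    // 2.a. Plan Step: build initial VPlans over a range of candidate VFs and
    //      UFs, following the constraints recorded by the Legal Step.
    std::vector<std::unique_ptr<VPlanCandidate>> Plans =
        buildInitialVPlans(L, Constraints);

    // 2.b. Optimize the VPlans, possibly forking additional ones, and prune
    //      candidates whose estimated cost is relatively high.
    optimizeAndPrune(Plans);

    // If the scalar loop remains the cheapest alternative, stop here; the
    // input IR has not been modified by any of the steps so far.
    VPlanCandidate *Best = pickBestPlan(Plans);
    if (!Best)
      return false;

    // 3. Execute Step: materialize the best VPlan. This is the only step
    //    that modifies the IR.
    Best->execute(L);
    return true;
  }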
Design Guidelines
-----------------
In what follows, the term "input IR" refers to code that is fed into the
vectorizer, whereas the term "output IR" refers to code that is generated by
the vectorizer. The output IR contains code that has been vectorized or
"widened" according to a loop Vectorization Factor (VF), and/or loop
unroll-and-jammed according to an Unroll Factor (UF).

The design of VPlan follows several high-level guidelines:

1. Analysis-like: building and manipulating VPlans must not modify the input
   IR. In particular, if the best option is not to vectorize at all, the
   vectorization process terminates before reaching Step 3, and compilation
   should proceed as if VPlans had not been built.

2. Align Cost & Execute: each VPlan must support both estimating the cost and
   generating the output IR code, such that the cost estimation evaluates the
   to-be-generated code reliably.

3. Support vectorizing additional constructs:

   a. Outer-loop vectorization. In particular, VPlan must be able to model the
      control-flow of the output IR which may include multiple basic-blocks
      and nested loops.
   b. SLP vectorization.
   c. Combinations of the above, including nested vectorization (vectorizing
      both an inner loop and an outer-loop at the same time, each with its own
      VF and UF), mixed vectorization (vectorizing a loop with SLP patterns
      inside [4]_), and (re)vectorizing input IR that already contains vector
      code.
   d. Function vectorization [2]_.

4. Support multiple candidates efficiently. In particular, similar candidates
   related to a range of possible VF's and UF's must be represented
   efficiently. Potential versioning needs to be supported efficiently.

5. Support vectorizing idioms, such as interleaved groups of strided loads or
   stores. This is achieved by modeling a sequence of output instructions
   using a "Recipe", which is responsible for computing its cost and
   generating its code.

6. Encapsulate Single-Entry Single-Exit (SESE) regions. During vectorization
   such regions may need to be, for example, predicated and linearized, or
   replicated VF*UF times to handle scalarized and predicated instructions.
   Inner loops are also modelled as SESE regions.

7. Support instruction-level analysis and transformation, as part of Planning
   Step 2.b: during vectorization, instructions may need to be traversed,
   moved, replaced by other instructions or be created. For example, vector
   idiom detection and formation involves searching for and optimizing
   instruction patterns.

Definitions
===========
The low-level design of VPlan comprises the following classes.

:LoopVectorizationPlanner:
  A LoopVectorizationPlanner is designed to handle the vectorization of a loop
  or a loop nest. It can construct, optimize and discard one or more VPlans,
  each VPlan modelling a distinct way to vectorize the loop or the loop nest.
  Once the best VPlan is determined, including the best VF and UF, this VPlan
  drives the generation of output IR.

:VPlan:
  A model of a vectorized candidate for a given input IR loop or loop nest.
  This candidate is represented using a Hierarchical CFG. VPlan supports
  estimating the cost and driving the generation of the output IR code it
  represents.

:Hierarchical CFG:
  A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
  Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
  cross-Tile edges are lifted to connect Tiles instead of the original
  basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
  Region and Block are used rather than Tile [5]_ to avoid confusion with loop
  tiling.

:VPBlockBase:
  The building block of the Hierarchical CFG. A pure-virtual base-class of
  VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the
  hierarchical control-flow relations with other VPBlocks. Note that in
  contrast to the IR BasicBlock, a VPBlockBase models its control-flow
  successors and predecessors directly, rather than through a Terminator
  branch or through predecessor branches that "use" the VPBlockBase.

:VPBasicBlock:
  VPBasicBlock is a subclass of VPBlockBase, and serves as a leaf of the
  Hierarchical CFG. It represents a sequence of output IR instructions that
  will appear consecutively in an output IR basic-block. The instructions of
  this basic-block originate from one or more VPBasicBlocks. VPBasicBlock
  holds a sequence of zero or more VPRecipes that model the cost and
  generation of the output IR instructions.

:VPRegionBlock:
  VPRegionBlock is a subclass of VPBlockBase. It models a collection of
  VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
  CFG. A VPRegionBlock may indicate that its contents are to be replicated a
  constant number of times when output IR is generated, effectively
  representing a loop with constant trip-count that will be completely
  unrolled. This is used to support scalarized and predicated instructions
  with a single model for multiple candidate VF's and UF's.

:VPRecipeBase:
  A pure-virtual base class modeling a sequence of one or more output IR
  instructions, possibly based on one or more input IR instructions. These
  input IR instructions are referred to as "Ingredients" of the Recipe. A
  Recipe may specify how its ingredients are to be transformed to produce the
  output IR instructions; e.g., cloned once, replicated multiple times or
  widened according to the selected VF.

:VPValue:
  The base of VPlan's def-use relations class hierarchy. When instantiated, it
  models a constant or a live-in Value in VPlan. It has users, which are of
  type VPUser, but no operands.

:VPUser:
  A VPUser represents an entity that uses a number of VPValues as operands.
  VPUser is similar in some aspects to LLVM's User class.

:VPDef:
  A VPDef represents an entity that defines zero, one or multiple VPValues.
  It is used to model the fact that recipes in VPlan can define multiple
  VPValues.

:VPInstruction:
  A VPInstruction is both a VPRecipe and a VPUser. It models a single
  VPlan-level instruction to be generated if the VPlan is executed, including
  its opcode and possibly additional characteristics. It is the basis for
  writing instruction-level analyses and optimizations in VPlan, as creating,
  replacing or moving VPInstructions records both def-use and scheduling
  decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic
  operations that enrich the Vectorizer's semantics.

:VPTransformState:
  Stores information used for generating output IR, passed from
  LoopVectorizationPlanner to its selected VPlan for execution, and used to
  pass additional information down to VPBlocks and VPRecipes.
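To make the relationships between these classes concrete, the following is a
minimal sketch of assembling a Hierarchical CFG for a simple vector loop. It
follows the spirit of ``VPlan.h`` (``VPBasicBlock``, ``VPRegionBlock`` and the
``VPBlockUtils::connectBlocks`` helper), but the constructor signatures shown
here are simplified and may not match a particular LLVM version exactly.

.. code-block:: c++

  // Sketch only; constructor signatures are simplified relative to VPlan.h.
  // Leaf blocks hold the Recipes that will generate consecutive output IR.
  VPBasicBlock *Preheader = new VPBasicBlock("vector.ph");
  VPBasicBlock *Body      = new VPBasicBlock("vector.body");
  VPBasicBlock *Latch     = new VPBasicBlock("vector.latch");

  // Wire up the control flow between the leaves of the to-be region.
  VPBlockUtils::connectBlocks(Body, Latch);

  // Wrap the loop blocks in a single-entry single-exit region modelling the
  // vector loop. A replicating region (IsReplicator = true) would instead be
  // unrolled VF*UF times when the VPlan is executed, e.g. to hold scalarized
  // and predicated instructions.
  auto *VectorLoop =
      new VPRegionBlock(Body, Latch, "vector.loop", /*IsReplicator=*/false);

  // The region is itself a single node in the enclosing Hierarchical CFG.
  VPBlockUtils::connectBlocks(Preheader, VectorLoop);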
The Planning Process and VPlan Roadmap
======================================

Transforming the Loop Vectorizer to use VPlan follows a staged approach.
First, VPlan is used to record the final vectorization decisions, and to
execute them: the Hierarchical CFG models the planned control-flow, and
Recipes capture decisions taken inside basic-blocks. Next, VPlan will also be
used as the basis for taking these decisions, effectively turning them into a
series of VPlan-to-VPlan algorithms. Finally, VPlan will support the planning
process itself, including cost-based analyses for making these decisions, to
fully support compositional and iterative decision making.

Some decisions are local to an instruction in the loop, such as whether to
widen it into a vector instruction or replicate it, keeping the generated
instructions in place. Other decisions, however, involve moving instructions,
replacing them with other instructions, and/or introducing new instructions.
For example, a cast may sink past a later instruction and be widened to handle
first-order recurrence; an interleave group of strided gathers or scatters may
effectively move to one place where they are replaced with shuffles and a
common wide vector load or store; new instructions may be introduced to
compute masks, shuffle the elements of vectors, and pack scalar values into
vectors or vice-versa.

In order for VPlan to support making instruction-level decisions and analyses,
it needs to model the relevant instructions along with their def/use
relations. This too follows a staged approach: first, the new instructions
that compute masks are modeled as VPInstructions, along with their induced
def/use subgraph. This effectively models masks in VPlan, facilitating
VPlan-based predication. Next, the logic embedded within each Recipe for
generating its instructions at VPlan execution time will instead take part in
the planning process, by modeling these instructions as VPInstructions.
Finally, only logic that applies to instructions as a group will remain in
Recipes, such as interleave groups and potentially other idiom groups having
synergistic cost.
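As an illustration of the first stage, VPlan-based predication models the mask
computations as explicit VPInstructions, so that their def/use relations are
visible to subsequent VPlan-to-VPlan transformations. The sketch below conveys
the general idea; it is not a verbatim excerpt of the Loop Vectorizer, and the
``getBlockInMask``/``getEdgeCondition``/``setBlockInMask`` helpers as well as
the exact VPInstruction constructor signature are illustrative assumptions.

.. code-block:: c++

  // Sketch: model the mask of block BB, reached from PredBB over its false
  // edge, as explicit VPInstructions. All operands are VPValues, so the
  // resulting def/use subgraph lives entirely inside the VPlan.
  VPValue *PredMask = getBlockInMask(PredBB);       // mask of the predecessor
  VPValue *EdgeCond = getEdgeCondition(PredBB, BB); // branch condition

  // On the false edge the condition is negated before being combined with
  // the predecessor's mask.
  auto *NotCond  = new VPInstruction(VPInstruction::Not, {EdgeCond});
  auto *EdgeMask = new VPInstruction(Instruction::And, {PredMask, NotCond});

  // Later users of BB's mask simply take EdgeMask as an operand.
  setBlockInMask(BB, EdgeMask);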
Related LLVM components
-----------------------
1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
   tree, where TSLP [3]_ adds Plan Step 2.b.

2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used
   by Polly [7]_.

3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure
   of the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_.

References
----------
.. [1] "Outer-loop vectorization: revisited for short SIMD architectures",
   Dorit Nuzman and Ayal Zaks, PACT 2008.

.. [2] "Proposal for function vectorization and loop vectorization with
   function calls", Xinmin Tian, [`cfe-dev
   <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_],
   March 2, 2016.
   See also `review <https://reviews.llvm.org/D22792>`_.

.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
   Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.

.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
   overhead", Hao Zhou and Jingling Xue, CGO 2016.

.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan
   and Brian Koblenz, PLDI 1991.

.. [6] "Structural analysis: A new approach to flow analysis in optimizing
   compilers", M. Sharir, Journal of Computer Languages, Jan. 1980.

.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
   thesis, 2011.

.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
   European LLVM Developers' Meeting 2017.

.. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop
   Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016.