1==================
2Vectorization Plan
3==================
4
5.. contents::
6   :local:
7
8Abstract
9========
10The vectorization transformation can be rather complicated, involving several
11potential alternatives, especially for outer-loops [1]_ but also possibly for
12innermost loops. These alternatives may have significant performance impact,
13both positive and negative. A cost model is therefore employed to identify the
14best alternative, including the alternative of avoiding any transformation
15altogether.
16
17The Vectorization Plan is an explicit model for describing vectorization
18candidates. It serves for both optimizing candidates including estimating their
19cost reliably, and for performing their final translation into IR. This
20facilitates dealing with multiple vectorization candidates.
21
22High-level Design
23=================
24
25Vectorization Workflow
26----------------------
27VPlan-based vectorization involves three major steps, taking a "scenario-based
28approach" to vectorization planning:
29
301. Legal Step: check if a loop can be legally vectorized; encode constraints and
31   artifacts if so.
322. Plan Step:
33
34   a. Build initial VPlans following the constraints and decisions taken by
35      Legal Step 1, and compute their cost.
36   b. Apply optimizations to the VPlans, possibly forking additional VPlans.
37      Prune sub-optimal VPlans having relatively high cost.
383. Execute Step: materialize the best VPlan. Note that this is the only step
39   that modifies the IR.
40
41Design Guidelines
42-----------------
43In what follows, the term "input IR" refers to code that is fed into the
44vectorizer whereas the term "output IR" refers to code that is generated by the
45vectorizer. The output IR contains code that has been vectorized or "widened"
46according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
47according to an Unroll Factor (UF).
48The design of VPlan follows several high-level guidelines:
49
501. Analysis-like: building and manipulating VPlans must not modify the input IR.
51   In particular, if the best option is not to vectorize at all, the
52   vectorization process terminates before reaching Step 3, and compilation
53   should proceed as if VPlans had not been built.
54
552. Align Cost & Execute: each VPlan must support both estimating the cost and
56   generating the output IR code, such that the cost estimation evaluates the
57   to-be-generated code reliably.
58
593. Support vectorizing additional constructs:
60
61   a. Outer-loop vectorization. In particular, VPlan must be able to model the
62      control-flow of the output IR which may include multiple basic-blocks and
63      nested loops.
64   b. SLP vectorization.
65   c. Combinations of the above, including nested vectorization: vectorizing
66      both an inner loop and an outer-loop at the same time (each with its own
67      VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
68      inside [4]_, (re)vectorizing input IR containing vector code.
69   d. Function vectorization [2]_.
70
714. Support multiple candidates efficiently. In particular, similar candidates
72   related to a range of possible VF's and UF's must be represented efficiently.
73   Potential versioning needs to be supported efficiently.
74
755. Support vectorizing idioms, such as interleaved groups of strided loads or
76   stores. This is achieved by modeling a sequence of output instructions using
77   a "Recipe", which is responsible for computing its cost and generating its
78   code.
79
806. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
81   such regions may need to be, for example, predicated and linearized, or
82   replicated VF*UF times to handle scalarized and predicated instructions.
83   Innerloops are also modelled as SESE regions.
84
857. Support instruction-level analysis and transformation, as part of Planning
86   Step 2.b: During vectorization instructions may need to be traversed, moved,
87   replaced by other instructions or be created. For example, vector idiom
88   detection and formation involves searching for and optimizing instruction
89   patterns.
90
91Definitions
92===========
93The low-level design of VPlan comprises of the following classes.
94
95:LoopVectorizationPlanner:
96  A LoopVectorizationPlanner is designed to handle the vectorization of a loop
97  or a loop nest. It can construct, optimize and discard one or more VPlans,
98  each VPlan modelling a distinct way to vectorize the loop or the loop nest.
99  Once the best VPlan is determined, including the best VF and UF, this VPlan
100  drives the generation of output IR.
101
102:VPlan:
103  A model of a vectorized candidate for a given input IR loop or loop nest. This
104  candidate is represented using a Hierarchical CFG. VPlan supports estimating
105  the cost and driving the generation of the output IR code it represents.
106
107:Hierarchical CFG:
108  A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
109  Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
110  cross-Tile edges are lifted to connect Tiles instead of the original
111  basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
112  Region and Block are used rather than Tile [5]_ to avoid confusion with loop
113  tiling.
114
115:VPBlockBase:
116  The building block of the Hierarchical CFG. A pure-virtual base-class of
117  VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
118  control-flow relations with other VPBlocks. Note that in contrast to the IR
119  BasicBlock, a VPBlockBase models its control-flow successors and predecessors
120  directly, rather than through a Terminator branch or through predecessor
121  branches that "use" the VPBlockBase.
122
123:VPBasicBlock:
124  VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
125  Hierarchical CFG. It represents a sequence of output IR instructions that will
126  appear consecutively in an output IR basic-block. The instructions of this
127  basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
128  sequence of zero or more VPRecipes that model the cost and generation of the
129  output IR instructions.
130
131:VPRegionBlock:
132  VPRegionBlock is a subclass of VPBlockBase. It models a collection of
133  VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
134  CFG. A VPRegionBlock may indicate that its contents are to be replicated a
135  constant number of times when output IR is generated, effectively representing
136  a loop with constant trip-count that will be completely unrolled. This is used
137  to support scalarized and predicated instructions with a single model for
138  multiple candidate VF's and UF's.
139
140:VPRecipeBase:
141  A pure-virtual base class modeling a sequence of one or more output IR
142  instructions, possibly based on one or more input IR instructions. These
143  input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
144  may specify how its ingredients are to be transformed to produce the output IR
145  instructions; e.g., cloned once, replicated multiple times or widened
146  according to selected VF.
147
148:VPValue:
149  The base of VPlan's def-use relations class hierarchy. When instantiated, it
150  models a constant or a live-in Value in VPlan. It has users, which are of type
151  VPUser, but no operands.
152
153:VPUser:
154  A VPValue representing a general vertex in the def-use graph of VPlan. It has
155  operands which are of type VPValue. When instantiated, it represents a
156  live-out Instruction that exists outside VPlan. VPUser is similar in some
157  aspects to LLVM's User class.
158
159:VPInstruction:
160  A VPInstruction is both a VPRecipe and a VPUser. It models a single
161  VPlan-level instruction to be generated if the VPlan is executed, including
162  its opcode and possibly additional characteristics. It is the basis for
163  writing instruction-level analyses and optimizations in VPlan as creating,
164  replacing or moving VPInstructions record both def-use and scheduling
165  decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic
166  operations that enrich the Vectorizer's semantics.
167
168:VPTransformState:
169  Stores information used for generating output IR, passed from
170  LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
171  additional information down to VPBlocks and VPRecipes.
172
173The Planning Process and VPlan Roadmap
174======================================
175
176Transforming the Loop Vectorizer to use VPlan follows a staged approach. First,
177VPlan is used to record the final vectorization decisions, and to execute them:
178the Hierarchical CFG models the planned control-flow, and Recipes capture
179decisions taken inside basic-blocks. Next, VPlan will be used also as the basis
180for taking these decisions, effectively turning them into a series of
181VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process
182itself including cost-based analyses for making these decisions, to fully
183support compositional and iterative decision making.
184
185Some decisions are local to an instruction in the loop, such as whether to widen
186it into a vector instruction or replicate it, keeping the generated instructions
187in place. Other decisions, however, involve moving instructions, replacing them
188with other instructions, and/or introducing new instructions. For example, a
189cast may sink past a later instruction and be widened to handle first-order
190recurrence; an interleave group of strided gathers or scatters may effectively
191move to one place where they are replaced with shuffles and a common wide vector
192load or store; new instructions may be introduced to compute masks, shuffle the
193elements of vectors, and pack scalar values into vectors or vice-versa.
194
195In order for VPlan to support making instruction-level decisions and analyses,
196it needs to model the relevant instructions along with their def/use relations.
197This too follows a staged approach: first, the new instructions that compute
198masks are modeled as VPInstructions, along with their induced def/use subgraph.
199This effectively models masks in VPlan, facilitating VPlan-based predication.
200Next, the logic embedded within each Recipe for generating its instructions at
201VPlan execution time, will instead take part in the planning process by modeling
202them as VPInstructions. Finally, only logic that applies to instructions as a
203group will remain in Recipes, such as interleave groups and potentially other
204idiom groups having synergistic cost.
205
206Related LLVM components
207-----------------------
2081. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
209   tree, where TSLP [3]_ adds Plan Step 2.b.
210
2112. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
212   Polly [7]_.
213
2143. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of
215   the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_.
216
217References
218----------
219.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
220    Nuzman and Ayal Zaks, PACT 2008.
221
222.. [2] "Proposal for function vectorization and loop vectorization with function
223    calls", Xinmin Tian, [`cfe-dev
224    <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
225    March 2, 2016.
226    See also `review <https://reviews.llvm.org/D22792>`_.
227
228.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
229    Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
230
231.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
232    overhead", Hao Zhou and Jingling Xue, CGO 2016.
233
234.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
235    Brian Koblenz, PLDI 1991
236
237.. [6] "Structural analysis: A new approach to flow analysis in optimizing
238    compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
239
240.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
241    thesis, 2011.
242
243.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
244    European LLVM Developers' Meeting 2017.
245
246.. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop
247    Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016.
248