# Chapter 5: Partial Lowering to Lower-Level Dialects for Optimization

[TOC]

At this point, we are eager to generate actual code and see our Toy language
take life. We will use LLVM to generate code, but just showing the LLVM builder
interface here wouldn't be very exciting. Instead, we will show how to perform
progressive lowering through a mix of dialects coexisting in the same function.
To make it more interesting, in this chapter we will consider that we want to
reuse existing optimizations implemented in a dialect optimizing affine
transformations: `Affine`. This dialect is tailored to the computation-heavy
part of the program and is deliberately limited: it doesn't support representing
our `toy.print` builtin, for instance, nor should it! Instead, we can target
`Affine` for the computation-heavy part of Toy, and in the
[next chapter](Ch-6.md) directly target the `LLVM IR` dialect for lowering
`print`. As part of this lowering, we will be lowering from the
[TensorType](../../LangRef.md#tensor-type) that `Toy` operates on to the
[MemRefType](../../LangRef.md#memref-type) that is indexed via an affine
loop-nest. Tensors represent an abstract value-typed sequence of data, meaning
that they don't live in any memory. MemRefs, on the other hand, represent
lower-level buffer access, as they are concrete references to a region of memory.
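
To make the distinction concrete, here is a small illustrative IR fragment (not
part of the tutorial pipeline) contrasting the two levels of abstraction, using
the op syntax that appears later in this chapter:

```mlir
// Value semantics: the result is an immutable SSA value; no storage is implied.
%t = toy.transpose(%x : tensor<2x3xf64>) to tensor<3x2xf64>

// Buffer semantics: storage is allocated explicitly, and elements are accessed
// through loads and stores on a concrete region of memory.
%m = alloc() : memref<3x2xf64>
%e = affine.load %m[0, 1] : memref<3x2xf64>
```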

# Dialect Conversions

MLIR has many different dialects, so it is important to have a unified framework
for [converting](../../../getting_started/Glossary.md#conversion) between them. This is where the
`DialectConversion` framework comes into play. This framework allows for
transforming a set of *illegal* operations to a set of *legal* ones. To use this
framework, we need to provide two things (and an optional third):

*   A [Conversion Target](../../DialectConversion.md#conversion-target)

    -   This is the formal specification of what operations or dialects are
        legal for the conversion. Operations that aren't legal will require
        rewrite patterns to perform
        [legalization](../../../getting_started/Glossary.md#legalization).

*   A set of
    [Rewrite Patterns](../../DialectConversion.md#rewrite-pattern-specification)

    -   This is the set of [patterns](../QuickstartRewrites.md) used to
        convert *illegal* operations into a set of zero or more *legal* ones.

*   Optionally, a [Type Converter](../../DialectConversion.md#type-conversion).

    -   If provided, this is used to convert the types of block arguments. We
        won't be needing this for our conversion.

## Conversion Target

For our purposes, we want to convert the compute-intensive `Toy` operations into
a combination of operations from the `Affine` and `Standard` dialects for
further optimization. To start off the lowering, we first define our conversion
target:
```c++
void ToyToAffineLoweringPass::runOnFunction() {
  // The first thing to define is the conversion target. This will define the
  // final target for this lowering.
  mlir::ConversionTarget target(getContext());

  // We define the specific operations, or dialects, that are legal targets for
  // this lowering. In our case, we are lowering to a combination of the
  // `Affine` and `Standard` dialects.
  target.addLegalDialect<mlir::AffineDialect, mlir::StandardOpsDialect>();

  // We also define the Toy dialect as illegal so that the conversion will fail
  // if any of these operations are *not* converted. Given that we actually want
  // a partial lowering, we explicitly mark the Toy operations that we don't
  // want to lower, `toy.print`, as *legal*.
  target.addIllegalDialect<ToyDialect>();
  target.addLegalOp<PrintOp>();
  ...
}
```

Above, we first set the Toy dialect to illegal, and then the print operation
as legal. We could have done this the other way around.
Individual operations always take precedence over the (more generic) dialect
definitions, so the order doesn't matter. See `ConversionTarget::getOpInfo`
for the details.

## Conversion Patterns

After the conversion target has been defined, we can define how to convert the
*illegal* operations into *legal* ones. Similarly to the canonicalization
framework introduced in [chapter 3](Ch-3.md), the
[`DialectConversion` framework](../../DialectConversion.md) also uses
[RewritePatterns](../QuickstartRewrites.md) to perform the conversion logic.
These patterns may be the `RewritePatterns` seen before, or a new type of
pattern specific to the conversion framework: `ConversionPattern`.
`ConversionPatterns` differ from traditional `RewritePatterns` in that they
accept an additional `operands` parameter containing operands that have been
remapped/replaced. This is used when dealing with type conversions, as the
pattern will want to operate on values of the new type but match against the
old. For our lowering, this invariant will be useful as it translates from the
[TensorType](../../LangRef.md#tensor-type) currently being operated on to the
[MemRefType](../../LangRef.md#memref-type). Let's look at a snippet of lowering
the `toy.transpose` operation:

```c++
/// Lower the `toy.transpose` operation to an affine loop nest.
struct TransposeOpLowering : public mlir::ConversionPattern {
  TransposeOpLowering(mlir::MLIRContext *ctx)
      : mlir::ConversionPattern(TransposeOp::getOperationName(), 1, ctx) {}

  /// Match and rewrite the given `toy.transpose` operation, with the given
  /// operands that have been remapped from `tensor<...>` to `memref<...>`.
  mlir::LogicalResult
  matchAndRewrite(mlir::Operation *op, ArrayRef<mlir::Value> operands,
                  mlir::ConversionPatternRewriter &rewriter) const final {
    auto loc = op->getLoc();

    // Call to a helper function that will lower the current operation to a set
    // of affine loops. We provide a functor that operates on the remapped
    // operands, as well as the loop induction variables for the innermost
    // loop body.
    lowerOpToLoops(
        op, operands, rewriter,
        [loc](mlir::PatternRewriter &rewriter,
              ArrayRef<mlir::Value> memRefOperands,
              ArrayRef<mlir::Value> loopIvs) {
          // Generate an adaptor for the remapped operands of the TransposeOp.
          // This allows for using the nice named accessors that are
          // automatically generated by the ODS framework.
          TransposeOpAdaptor transposeAdaptor(memRefOperands);
          mlir::Value input = transposeAdaptor.input();

          // Transpose the elements by generating a load from the reverse
          // indices.
          SmallVector<mlir::Value, 2> reverseIvs(llvm::reverse(loopIvs));
          return rewriter.create<mlir::AffineLoadOp>(loc, input, reverseIvs);
        });
    return mlir::success();
  }
};
```

Now we can prepare the list of patterns to use during the lowering process:

```c++
void ToyToAffineLoweringPass::runOnFunction() {
  ...

  // Now that the conversion target has been defined, we just need to provide
  // the set of patterns that will lower the Toy operations.
  mlir::OwningRewritePatternList patterns;
  patterns.insert<..., TransposeOpLowering>(&getContext());

  ...
```

## Partial Lowering

Once the patterns have been defined, we can perform the actual lowering. The
`DialectConversion` framework provides several different modes of lowering, but,
for our purposes, we will perform a partial lowering, as we will not convert
`toy.print` at this time.

```c++
void ToyToAffineLoweringPass::runOnFunction() {
  ...

  // With the target and rewrite patterns defined, we can now attempt the
  // conversion. The conversion will signal failure if any of our *illegal*
  // operations were not converted successfully.
  auto function = getFunction();
  if (mlir::failed(mlir::applyPartialConversion(function, target, patterns)))
    signalPassFailure();
}
```

### Design Considerations With Partial Lowering

Before diving into the result of our lowering, this is a good time to discuss
potential design considerations when it comes to partial lowering. In our
lowering, we transform from a value-type, TensorType, to an allocated
(buffer-like) type, MemRefType. However, given that we do not lower the
`toy.print` operation, we need to temporarily bridge these two worlds. There are
many ways to go about this, each with their own tradeoffs:

*   Generate `load` operations from the buffer

    One option is to generate `load` operations from the buffer type to
    materialize an instance of the value type. This allows for the definition of
    the `toy.print` operation to remain unchanged. The downside to this approach
    is that the optimizations on the `affine` dialect are limited, because the
    `load` will actually involve a full copy that is only visible *after* our
    optimizations have been performed.

*   Generate a new version of `toy.print` that operates on the lowered type

    Another option would be to have another, lowered, variant of `toy.print`
    that operates on the lowered type. The benefit of this option is that there
    is no hidden, unnecessary copy to the optimizer. The downside is that
    another operation definition is needed that may duplicate many aspects of
    the first. Defining a base class in [ODS](../../OpDefinitions.md) may
    simplify this, but you still need to treat these operations separately.

*   Update `toy.print` to allow for operating on the lowered type

    A third option is to update the current definition of `toy.print` to allow
    for operating on the lowered type. The benefit of this approach is that it
    is simple, does not introduce an additional hidden copy, and does not
    require another operation definition. The downside to this option is that it
    requires mixing abstraction levels in the `Toy` dialect.

For the sake of simplicity, we will use the third option for this lowering. This
involves updating the type constraints on the PrintOp in the operation
definition file:

```tablegen
def PrintOp : Toy_Op<"print"> {
  ...

  // The print operation takes an input tensor to print.
  // We also allow an F64MemRef to enable interop during partial lowering.
  let arguments = (ins AnyTypeOf<[F64Tensor, F64MemRef]>:$input);
}
```

## Complete Toy Example

Let's take a concrete example:

```mlir
func @main() {
  %0 = toy.constant dense<[[1.000000e+00, 2.000000e+00, 3.000000e+00], [4.000000e+00, 5.000000e+00, 6.000000e+00]]> : tensor<2x3xf64>
  %2 = toy.transpose(%0 : tensor<2x3xf64>) to tensor<3x2xf64>
  %3 = toy.mul %2, %2 : tensor<3x2xf64>
  toy.print %3 : tensor<3x2xf64>
  toy.return
}
```

With affine lowering added to our pipeline, we can now generate:

```mlir
func @main() {
  %cst = constant 1.000000e+00 : f64
  %cst_0 = constant 2.000000e+00 : f64
  %cst_1 = constant 3.000000e+00 : f64
  %cst_2 = constant 4.000000e+00 : f64
  %cst_3 = constant 5.000000e+00 : f64
  %cst_4 = constant 6.000000e+00 : f64

  // Allocating buffers for the inputs and outputs.
  %0 = alloc() : memref<3x2xf64>
  %1 = alloc() : memref<3x2xf64>
  %2 = alloc() : memref<2x3xf64>

  // Initialize the input buffer with the constant values.
  affine.store %cst, %2[0, 0] : memref<2x3xf64>
  affine.store %cst_0, %2[0, 1] : memref<2x3xf64>
  affine.store %cst_1, %2[0, 2] : memref<2x3xf64>
  affine.store %cst_2, %2[1, 0] : memref<2x3xf64>
  affine.store %cst_3, %2[1, 1] : memref<2x3xf64>
  affine.store %cst_4, %2[1, 2] : memref<2x3xf64>

  // Load the transpose value from the input buffer and store it into the
  // next input buffer.
  affine.for %arg0 = 0 to 3 {
    affine.for %arg1 = 0 to 2 {
      %3 = affine.load %2[%arg1, %arg0] : memref<2x3xf64>
      affine.store %3, %1[%arg0, %arg1] : memref<3x2xf64>
    }
  }

  // Multiply and store into the output buffer.
  affine.for %arg0 = 0 to 3 {
    affine.for %arg1 = 0 to 2 {
      %3 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
      %4 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
      %5 = mulf %3, %4 : f64
      affine.store %5, %0[%arg0, %arg1] : memref<3x2xf64>
    }
  }

  // Print the value held by the buffer.
  toy.print %0 : memref<3x2xf64>
  dealloc %2 : memref<2x3xf64>
  dealloc %1 : memref<3x2xf64>
  dealloc %0 : memref<3x2xf64>
  return
}
```

## Taking Advantage of Affine Optimization

Our naive lowering is correct, but it leaves a lot to be desired with regard to
efficiency. For example, the lowering of `toy.mul` has generated some redundant
loads. Let's look at how adding a few existing optimizations to the pipeline can
help clean this up. Adding the `LoopFusion` and `MemRefDataFlowOpt` passes to
the pipeline gives the following result:

```mlir
func @main() {
  %cst = constant 1.000000e+00 : f64
  %cst_0 = constant 2.000000e+00 : f64
  %cst_1 = constant 3.000000e+00 : f64
  %cst_2 = constant 4.000000e+00 : f64
  %cst_3 = constant 5.000000e+00 : f64
  %cst_4 = constant 6.000000e+00 : f64

  // Allocating buffers for the inputs and outputs.
  %0 = alloc() : memref<3x2xf64>
  %1 = alloc() : memref<2x3xf64>

  // Initialize the input buffer with the constant values.
  affine.store %cst, %1[0, 0] : memref<2x3xf64>
  affine.store %cst_0, %1[0, 1] : memref<2x3xf64>
  affine.store %cst_1, %1[0, 2] : memref<2x3xf64>
  affine.store %cst_2, %1[1, 0] : memref<2x3xf64>
  affine.store %cst_3, %1[1, 1] : memref<2x3xf64>
  affine.store %cst_4, %1[1, 2] : memref<2x3xf64>

  affine.for %arg0 = 0 to 3 {
    affine.for %arg1 = 0 to 2 {
      // Load the transpose value from the input buffer.
      %2 = affine.load %1[%arg1, %arg0] : memref<2x3xf64>

      // Multiply and store into the output buffer.
      %3 = mulf %2, %2 : f64
      affine.store %3, %0[%arg0, %arg1] : memref<3x2xf64>
    }
  }

  // Print the value held by the buffer.
  toy.print %0 : memref<3x2xf64>
  dealloc %1 : memref<2x3xf64>
  dealloc %0 : memref<3x2xf64>
  return
}
```

Here, we can see that a redundant allocation was removed, the two loop nests
were fused, and some unnecessary `load`s were removed. You can build `toyc-ch5`
and try it yourself: `toyc-ch5 test/Examples/Toy/Ch5/affine-lowering.mlir
-emit=mlir-affine`. We can also check our optimizations by adding `-opt`.

In this chapter we explored some aspects of partial lowering, with the intent to
optimize. In the [next chapter](Ch-6.md) we will continue the discussion about
dialect conversion by targeting LLVM for code generation.
