//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* lmw/stmw pass a la arm load store optimizer for prolog/epilog
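
A sketch of the idea (register choices are illustrative): a prolog that
currently saves callee-saved GPRs one store at a time, e.g.

	stw r29, -12(r1)
	stw r30, -8(r1)
	stw r31, -4(r1)

could be collapsed into a single store-multiple:

	stmw r29, -12(r1)	; stores r29..r31 as one unit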

===-------------------------------------------------------------------------===

This code:

unsigned add32carry(unsigned sum, unsigned x) {
  unsigned z = sum + x;
  if (sum + x < x)
    z++;
  return z;
}

Should compile to something like:

	addc r3,r3,r4
	addze r3,r3

instead we get:

	add r3, r4, r3
	cmplw cr7, r3, r4
	mfcr r4 ; 1
	rlwinm r4, r4, 29, 31, 31
	add r3, r3, r4

Ick.

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:        ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:        ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

	li r0, 143
	mtctr r0
loop:
	lbzx r2, r2, r11
	stbx r0, r2, r9
	addi r2, r2, 1
	bdz later
	b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.
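
With bdnz, the loop body would look roughly like this (a sketch, reusing
the CBE's register assignments):

	li r0, 143
	mtctr r0
loop:
	lbzx r2, r2, r11
	stbx r0, r2, r9
	addi r2, r2, 1
	bdnz loop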

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.
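
Something like this (a sketch, assuming the four constants are emitted
contiguously at a single .CPI_X label):

_X:
        lis r2, ha16(.CPI_X)
        la r2, lo16(.CPI_X)(r2)
        lfd f0, 0(r2)
        lfd f2, 8(r2)
        fmadd f0, f1, f0, f2
        lfd f1, 16(r2)
        lfd f2, 24(r2)
        fmadd f1, f0, f1, f2
        blr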

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):

double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

It produces a BB like this:

LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).
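
At the source level the transformation would look like this (a sketch; the
struct and its name are illustrative):

static int x;
static int y;
int sum(void) { return x + y; }

effectively becomes

static struct { int x; int y; } __cluster;
int sum(void) { return __cluster.x + __cluster.y; }

so a single address computation (the base of __cluster) serves both accesses.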

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Darwin Stub removal:

We still generate calls to foo$stub, and stubs, on Darwin.  This is not
necessary when building with the Leopard (10.5) or later linker, as stubs are
generated by ld when necessary.  Parameterizing this based on the deployment
target (-mmacosx-version-min) is probably enough.  x86-32 does this right, see
its logic.

Note: Since Darwin support has been removed, this item is no longer valid for
Darwin specifically.

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

     fp = &bar;
     for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10. That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5. The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BITCAST to LLVM, then do an interprocedural
transformation that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so that
this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

which currently compiles to:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

becomes:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer: it computes |a-b| branch-free as (x ^ sign) - sign,
where sign = x >> 31.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

PR5945: This:

define i32 @clamp0g(i32 %a) {
entry:
        %cmp = icmp slt i32 %a, 0
        %sel = select i1 %cmp, i32 0, i32 %a
        ret i32 %sel
}

compiles to this with the PowerPC (32-bit) backend:

_clamp0g:
        cmpwi cr0, r3, 0
        li r2, 0
        blt cr0, LBB1_2
; %bb.1:                                                    ; %entry
        mr r2, r3
LBB1_2:                                                     ; %entry
        mr r3, r2
        blr

This could be reduced to the much simpler:

_clamp0g:
        srawi r2, r3, 31
        andc r3, r3, r2
        blr

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength-reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
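
Since t advances by 1 each iteration, t / X and t % X can also be maintained
incrementally rather than recomputed (a sketch):

int q = 0, r = 0;                     /* invariant: q == t / X, r == t % X */
for (t = 0; t < N; ++t) {
  for (i = 0; i < 4; ++i)
    W[q][i][r] = TK[i][t];
  if (++r == X) { r = 0; ++q; }       /* step the div/rem pair */
}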

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}
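
A branch-free formulation of the same function, as a sketch of what the
generated code could aim for:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  code |= (unsigned)(dx < -dw) << 0;
  code |= (unsigned)(dx >  dw) << 1;
  code |= (unsigned)(dy < -dw) << 2;
  code |= (unsigned)(dy >  dw) << 3;
  code |= (unsigned)(dz < -dw) << 4;
  code |= (unsigned)(dz >  dw) << 5;
  *ret = code;
}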

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp0 = bitcast %struct.B* %b to i32*           ; <i32*> [#uses=1]
        %tmp = load i32, i32* %tmp0                     ; <i32> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <i32*> [#uses=1]
        %tmp4 = load i32, i32* %tmp3                    ; <i32> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <i32*> [#uses=2]
        %tmp9 = load i32, i32* %tmp8                    ; <i32> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, 1                 ; <i32> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <i32> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <i32> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <i32> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <i32> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <i32> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_bar:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_bar:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs. The class of LR would be
marked "unspillable".  When the register allocator came across an unspillable
reg, it would ask "what is the best class to copy this into that I *can* spill?"
If it gets a class back, which it will in this case (the gprs), it grabs a free
register of that class.  If it is then later necessary to spill that reg, so be
it."

===-------------------------------------------------------------------------===

We compile this:

int test(_Bool X) {
  return X ? 524288 : 0;
}

to:

_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:

_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===

We compile:

define i32 @bar(i32 %x) nounwind readnone ssp {
entry:
  %0 = icmp eq i32 %x, 0                          ; <i1> [#uses=1]
  %neg = sext i1 %0 to i32                        ; <i32> [#uses=1]
  ret i32 %neg
}

to:

_bar:
	cntlzw r2, r3
	slwi r2, r2, 26
	srawi r3, r2, 31
	blr

It would be better to produce:

_bar:
        addic r3,r3,-1
        subfe r3,r3,r3
        blr

===-------------------------------------------------------------------------===

We generate horrible PPC code for this:

#define N  2000000
double   a[N],c[N];
void simpleloop() {
   int j;
   for (j=0; j<N; j++)
     c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1                 ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30               ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1
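
A CTR-based loop would remove both problems (a sketch; note that
2000000 = 0x1E8480, which is where the 30/33920 pair above comes from):

        lis r5, 30
        ori r5, r5, 33920              ;; r5 = 2000000
        mtctr r5
LBB1_1:
        lfdx f0, r3, r4
        stfdx f0, r2, r4
        addi r4, r4, 8
        bdnz LBB1_1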

//===---------------------------------------------------------------------===//

This:

        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better), not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr

//===---------------------------------------------------------------------===//

We compile some FP comparisons into an mfcr with two rlwinms and an or.  For
example:

#include <math.h>
int test(double x, double y) { return islessequal(x, y); }
int test2(double x, double y) { return islessgreater(x, y); }
int test3(double x, double y) { return !islessequal(x, y); }

Compiles into (all three are similar, but the bits differ):

_test:
	fcmpu cr7, f1, f2
	mfcr r2
	rlwinm r3, r2, 29, 31, 31
	rlwinm r2, r2, 31, 31, 31
	or r3, r2, r3
	blr

GCC compiles this into:

_test:
	fcmpu cr7,f1,f2
	cror 30,28,30
	mfcr r3
	rlwinm r3,r3,31,1
	blr

which is more efficient and can use mfocrf.  See PR642 for some more context.

//===---------------------------------------------------------------------===//

void foo(float *data, float d) {
   long i;
   for (i = 0; i < 8000; i++)
      data[i] = d;
}
void foo2(float *data, float d) {
   long i;
   data--;
   for (i = 0; i < 8000; i++) {
      data[1] = d;
      data++;
   }
}

These compile to:

_foo:
	li r2, 0
LBB1_1:	; bb
	addi r4, r2, 4
	stfsx f1, r3, r2
	cmplwi cr0, r4, 32000
	mr r2, r4
	bne cr0, LBB1_1	; bb
	blr
_foo2:
	li r2, 0
LBB2_1:	; bb
	addi r4, r2, 4
	stfsx f1, r3, r2
	cmplwi cr0, r4, 32000
	mr r2, r4
	bne cr0, LBB2_1	; bb
	blr

The 'mr' could be eliminated by better folding of the add into the cmp.
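
For instance (a sketch), sinking the increment below the store lets the
incremented register feed the compare directly and removes the copy:

_foo:
	li r2, 0
LBB1_1:	; bb
	stfsx f1, r3, r2
	addi r2, r2, 4
	cmplwi cr0, r2, 32000
	bne cr0, LBB1_1	; bb
	blr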

//===---------------------------------------------------------------------===//

Codegen for the following (low-probability) case deteriorated considerably
when the correctness fixes for unordered comparisons went in (PR642, r58871).
It should be possible to recover the code quality described in the comments.

; RUN: llvm-as < %s | llc -march=ppc32  | grep or | count 3
; This should produce one 'or' or 'cror' instruction per function.

; RUN: llvm-as < %s | llc -march=ppc32  | grep mfcr | count 3
; PR2964

define i32 @test(double %x, double %y) nounwind  {
entry:
	%tmp3 = fcmp ole double %x, %y		; <i1> [#uses=1]
	%tmp345 = zext i1 %tmp3 to i32		; <i32> [#uses=1]
	ret i32 %tmp345
}

define i32 @test2(double %x, double %y) nounwind  {
entry:
	%tmp3 = fcmp one double %x, %y		; <i1> [#uses=1]
	%tmp345 = zext i1 %tmp3 to i32		; <i32> [#uses=1]
	ret i32 %tmp345
}

define i32 @test3(double %x, double %y) nounwind  {
entry:
	%tmp3 = fcmp ugt double %x, %y		; <i1> [#uses=1]
	%tmp34 = zext i1 %tmp3 to i32		; <i32> [#uses=1]
	ret i32 %tmp34
}

//===---------------------------------------------------------------------===//

For the following code:

void foo (float *__restrict__ a, int *__restrict__ b, int n) {
      a[n] = b[n] * 2.321;
}

we load b[n] into a GPR, then move it to a VSX register and convert it to
floating point.  We should use VSX scalar integer load instructions to avoid
the direct moves.
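
A sketch of the desired sequence (register choices are illustrative;
lxsiwax is the ISA 2.07 VSX scalar sign-extending integer word load):

        lxsiwax vs0, r4, r6     ; load b[n] directly into a VSX register
        xscvsxddp vs0, vs0      ; convert the integer to double precision
        ...                     ; multiply by 2.321, round, store a[n]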

//===----------------------------------------------------------------------===//

; RUN: llvm-as < %s | llc -march=ppc32 | not grep fneg

; This could generate FSEL with appropriate flags (FSEL is not IEEE-safe, and
; should not be generated except with -enable-finite-only-fp-math or the like).
; With the correctness fixes for PR642 (r58871) LowerSELECT_CC would need to
; recognize a more elaborate tree than a simple SETxx.

define double @test_FNEG_sel(double %A, double %B, double %C) {
        %D = fsub double -0.000000e+00, %A              ; <double> [#uses=1]
        %Cond = fcmp ugt double %D, -0.000000e+00       ; <i1> [#uses=1]
        %E = select i1 %Cond, double %B, double %C      ; <double> [#uses=1]
        ret double %E
}

//===----------------------------------------------------------------------===//

The save/restore sequence for CR in prolog/epilog is terrible:
- Each CR subreg is saved individually, rather than doing one save as a unit.
- On Darwin, the save is done after the decrement of SP, which means the offset
  from SP of the save slot can be too big for a store instruction, which means
  we need an additional register (currently hacked in r96015+r96020; the
  solution there is correct, but poor).
- On SVR4 the same thing can happen, and I don't think saving before the SP
  decrement is safe on that target, as there is no red zone.  This is currently
  broken AFAIK, although it's not a target I can exercise.

The following demonstrates the problem:

extern void bar(char *p);
void foo() {
  char x[100000];
  bar(x);
  __asm__("" ::: "cr2");
}

//===----------------------------------------------------------------------===//

The naming convention for instruction formats is very haphazard.
We have agreed on a naming scheme as follows:

<INST_form>{_<OP_type><OP_len>}+

Where:
INST_form is the instruction format (X-form, etc.)
OP_type is the operand type - one of OPC (opcode), RD (register destination),
                              RS (register source),
                              RDp (destination register pair),
                              RSp (source register pair), IM (immediate),
                              XO (extended opcode)
OP_len is the length of the operand in bits

VSX register operands would be of length 6 (split across two fields), and
condition register fields of length 3.
We would not need to denote reserved fields in names of instruction formats.
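
For example, under this scheme a D-form instruction with a 5-bit destination
register, a 5-bit source register, and a 16-bit immediate would be named
something like (a hypothetical instance):

  DForm_RD5_RS5_IM16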

//===----------------------------------------------------------------------===//

Instruction fusion was introduced in ISA 2.06 and more opportunities added in
ISA 2.07.  LLVM needs to add infrastructure to recognize fusion opportunities
and force instruction pairs to be scheduled together.

//===----------------------------------------------------------------------===//

More general handling of any_extend and zero_extend:

See https://reviews.llvm.org/D24924#555306