//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* lmw/stmw pass a la arm load store optimizer for prolog/epilog

===-------------------------------------------------------------------------===

This code:

unsigned add32carry(unsigned sum, unsigned x) {
  unsigned z = sum + x;
  if (sum + x < x)
    z++;
  return z;
}

Should compile to something like:

        addc r3,r3,r4
        addze r3,r3

instead we get:

        add r3, r4, r3
        cmplw cr7, r3, r4
        mfcr r4     ; 1
        rlwinm r4, r4, 29, 31, 31
        add r3, r3, r4

Ick.

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:        ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:        ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.
For functions like this (contrived to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):
double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

it produces a BB like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Darwin Stub removal:

We still generate calls to foo$stub, and stubs, on Darwin.  This is not
necessary when building with the Leopard (10.5) or later linker, as stubs are
generated by ld when necessary.
Parameterizing this based on the deployment target (-mmacosx-version-min) is
probably enough.  x86-32 does this right, see its logic.

Note: Since Darwin support has been removed, this item is no longer valid for
Darwin specifically.

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BITCAST to LLVM, then do an i-p transformation that
percolates these things out of functions.
Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

So that

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

Becomes

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

PR5945: This:
define i32 @clamp0g(i32 %a) {
entry:
  %cmp = icmp slt i32 %a, 0
  %sel = select i1 %cmp, i32 0, i32 %a
  ret i32 %sel
}

Is compiled to this with the PowerPC (32-bit) backend:

_clamp0g:
        cmpwi cr0, r3, 0
        li r2, 0
        blt cr0, LBB1_2
; %bb.1:                ; %entry
        mr r2, r3
LBB1_2:                 ; %entry
        mr r3, r2
        blr

This could be reduced to the much simpler:

_clamp0g:
        srawi r2, r3, 31
        andc r3, r3, r2
        blr

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp = bitcast %struct.B* %b to i32*            ; <uint*> [#uses=1]
        %tmp = load i32* %tmp                           ; <uint> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=1]
        %tmp4 = load i32* %tmp3                         ; <uint> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=2]
        %tmp9 = load i32* %tmp8                         ; <uint> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1              ; <uint> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <uint> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <uint> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <uint> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <uint> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <uint> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_foo:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_foo:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs. The class of LR would be
marked "unspillable".
When the register allocator came across an unspillable reg, it would ask "what
is the best class to copy this into that I *can* spill?"  If it gets a class
back, which it will in this case (the gprs), it grabs a free register of that
class.  If it is then later necessary to spill that reg, so be it."

===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===

We compile:

define i32 @bar(i32 %x) nounwind readnone ssp {
entry:
  %0 = icmp eq i32 %x, 0                ; <i1> [#uses=1]
  %neg = sext i1 %0 to i32              ; <i32> [#uses=1]
  ret i32 %neg
}

to:

_bar:
        cntlzw r2, r3
        slwi r2, r2, 26
        srawi r3, r2, 31
        blr

it would be better to produce:

_bar:
        addic r3,r3,-1
        subfe r3,r3,r3
        blr

===-------------------------------------------------------------------------===

We generate horrible ppc code for this:

#define N  2000000
double a[N],c[N];
void simpleloop() {
  int j;
  for (j=0; j<N; j++)
    c[j] = a[j];
}

LBB1_1:         ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1          ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30        ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better) not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr

//===---------------------------------------------------------------------===//

We compile some FP comparisons into an mfcr with two rlwinms and an or.  For
example:
#include <math.h>
int test(double x, double y) { return islessequal(x, y);}
int test2(double x, double y) { return islessgreater(x, y);}
int test3(double x, double y) { return !islessequal(x, y);}

Compiles into (all three are similar, but the bits differ):

_test:
        fcmpu cr7, f1, f2
        mfcr r2
        rlwinm r3, r2, 29, 31, 31
        rlwinm r2, r2, 31, 31, 31
        or r3, r2, r3
        blr

GCC compiles this into:

_test:
        fcmpu cr7,f1,f2
        cror 30,28,30
        mfcr r3
        rlwinm r3,r3,31,1
        blr

which is more efficient and can use mfocr.  See PR642 for some more context.

//===---------------------------------------------------------------------===//

void foo(float *data, float d) {
  long i;
  for (i = 0; i < 8000; i++)
    data[i] = d;
}
void foo2(float *data, float d) {
  long i;
  data--;
  for (i = 0; i < 8000; i++) {
    data[1] = d;
    data++;
  }
}

These compile to:

_foo:
        li r2, 0
LBB1_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB1_1 ; bb
        blr
_foo2:
        li r2, 0
LBB2_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB2_1 ; bb
        blr

The 'mr' could be eliminated by folding the add into the cmp better.

//===---------------------------------------------------------------------===//
Codegen for the following (low-probability) case deteriorated considerably
when the correctness fixes for unordered comparisons went in (PR642, 58871).
It should be possible to recover the code quality described in the comments.

; RUN: llvm-as < %s | llc -march=ppc32 | grep or | count 3
; This should produce one 'or' or 'cror' instruction per function.

; RUN: llvm-as < %s | llc -march=ppc32 | grep mfcr | count 3
; PR2964

define i32 @test(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp ole double %x, %y          ; <i1> [#uses=1]
        %tmp345 = zext i1 %tmp3 to i32          ; <i32> [#uses=1]
        ret i32 %tmp345
}

define i32 @test2(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp one double %x, %y          ; <i1> [#uses=1]
        %tmp345 = zext i1 %tmp3 to i32          ; <i32> [#uses=1]
        ret i32 %tmp345
}

define i32 @test3(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp ugt double %x, %y          ; <i1> [#uses=1]
        %tmp34 = zext i1 %tmp3 to i32           ; <i32> [#uses=1]
        ret i32 %tmp34
}

//===---------------------------------------------------------------------===//
For the following code:

void foo (float *__restrict__ a, int *__restrict__ b, int n) {
  a[n] = b[n] * 2.321;
}

we load b[n] into a GPR, then move it to a VSX register and convert it to
float.  We should use VSX scalar integer load instructions to avoid the direct
moves.

//===----------------------------------------------------------------------===//
; RUN: llvm-as < %s | llc -march=ppc32 | not grep fneg

; This could generate FSEL with appropriate flags (FSEL is not IEEE-safe, and
; should not be generated except with -enable-finite-only-fp-math or the like).
; With the correctness fixes for PR642 (58871) LowerSELECT_CC would need to
; recognize a more elaborate tree than a simple SETxx.

define double @test_FNEG_sel(double %A, double %B, double %C) {
        %D = fsub double -0.000000e+00, %A              ; <double> [#uses=1]
        %Cond = fcmp ugt double %D, -0.000000e+00       ; <i1> [#uses=1]
        %E = select i1 %Cond, double %B, double %C      ; <double> [#uses=1]
        ret double %E
}

//===----------------------------------------------------------------------===//
The save/restore sequence for CR in prolog/epilog is terrible:
- Each CR subreg is saved individually, rather than doing one save as a unit.
- On Darwin, the save is done after the decrement of SP, which means the offset
  from SP of the save slot can be too big for a store instruction, which means
  we need an additional register (currently hacked in 96015+96020; the solution
  there is correct, but poor).
- On SVR4 the same thing can happen, and I don't think saving before the SP
  decrement is safe on that target, as there is no red zone.  This is currently
  broken AFAIK, although it's not a target I can exercise.
The following demonstrates the problem:
extern void bar(char *p);
void foo() {
  char x[100000];
  bar(x);
  __asm__("" ::: "cr2");
}

//===-------------------------------------------------------------------------===
Naming convention for instruction formats is very haphazard.
We have agreed on a naming scheme as follows:

<INST_form>{_<OP_type><OP_len>}+

Where:
INST_form is the instruction format (X-form, etc.)
OP_type is the operand type - one of OPC (opcode),
                                     RD (register destination),
                                     RS (register source),
                                     RDp (destination register pair),
                                     RSp (source register pair),
                                     IM (immediate),
                                     XO (extended opcode)
OP_len is the length of the operand in bits

VSX register operands would be of length 6 (split across two fields),
condition register fields of length 3.
We would not need to denote reserved fields in names of instruction formats.

//===----------------------------------------------------------------------===//

Instruction fusion was introduced in ISA 2.06, and more opportunities were
added in ISA 2.07.  LLVM needs to add infrastructure to recognize fusion
opportunities and force instruction pairs to be scheduled together.

-----------------------------------------------------------------------------

More general handling of any_extend and zero_extend:

See https://reviews.llvm.org/D24924#555306