Name | Date | Size | #Lines | LOC
---|---|---|---|---
AsmParser/ | 22-Nov-2023 | - | 2,036 | 1,706
Disassembler/ | 22-Nov-2023 | - | 452 | 360
InstPrinter/ | 22-Nov-2023 | - | 556 | 434
MCTargetDesc/ | 22-Nov-2023 | - | 2,426 | 1,783
TargetInfo/ | 22-Nov-2023 | - | 55 | 38
CMakeLists.txt | 22-Nov-2023 | 1.4 KiB | 50 | 46
LLVMBuild.txt | 22-Nov-2023 | 1.1 KiB | 36 | 32
PPC.h | 22-Nov-2023 | 3.3 KiB | 105 | 53
PPC.td | 22-Nov-2023 | 23.5 KiB | 452 | 420
PPCAsmPrinter.cpp | 22-Nov-2023 | 54.8 KiB | 1,429 | 1,054
PPCBoolRetToInt.cpp | 22-Nov-2023 | 8.8 KiB | 257 | 168
PPCBranchSelector.cpp | 22-Nov-2023 | 8.4 KiB | 241 | 148
PPCCCState.cpp | 22-Nov-2023 | 1.1 KiB | 36 | 23
PPCCCState.h | 22-Nov-2023 | 1.2 KiB | 43 | 23
PPCCTRLoops.cpp | 22-Nov-2023 | 23.6 KiB | 729 | 552
PPCCallingConv.h | 22-Nov-2023 | 1.1 KiB | 36 | 14
PPCCallingConv.td | 22-Nov-2023 | 12.6 KiB | 287 | 230
PPCEarlyReturn.cpp | 22-Nov-2023 | 7.2 KiB | 214 | 152
PPCFastISel.cpp | 22-Nov-2023 | 81.3 KiB | 2,358 | 1,602
PPCFrameLowering.cpp | 22-Nov-2023 | 69.1 KiB | 1,945 | 1,369
PPCFrameLowering.h | 22-Nov-2023 | 6.4 KiB | 150 | 62
PPCHazardRecognizers.cpp | 22-Nov-2023 | 14.1 KiB | 437 | 281
PPCHazardRecognizers.h | 22-Nov-2023 | 3.8 KiB | 103 | 52
PPCISelDAGToDAG.cpp | 22-Nov-2023 | 161.2 KiB | 4,430 | 3,205
PPCISelLowering.cpp | 22-Nov-2023 | 473.6 KiB | 12,145 | 8,655
PPCISelLowering.h | 22-Nov-2023 | 42.7 KiB | 964 | 453
PPCInstr64Bit.td | 22-Nov-2023 | 60 KiB | 1,300 | 1,164
PPCInstrAltivec.td | 22-Nov-2023 | 67.1 KiB | 1,400 | 1,226
PPCInstrBuilder.h | 22-Nov-2023 | 1.5 KiB | 44 | 14
PPCInstrFormats.td | 22-Nov-2023 | 49.3 KiB | 1,934 | 1,594
PPCInstrHTM.td | 22-Nov-2023 | 5.1 KiB | 173 | 124
PPCInstrInfo.cpp | 22-Nov-2023 | 70 KiB | 1,878 | 1,433
PPCInstrInfo.h | 22-Nov-2023 | 11.4 KiB | 279 | 166
PPCInstrInfo.td | 22-Nov-2023 | 188.6 KiB | 4,247 | 3,776
PPCInstrQPX.td | 22-Nov-2023 | 57.4 KiB | 1,217 | 1,105
PPCInstrSPE.td | 22-Nov-2023 | 26.5 KiB | 448 | 407
PPCInstrVSX.td | 22-Nov-2023 | 103.7 KiB | 2,243 | 2,012
PPCLoopPreIncPrep.cpp | 22-Nov-2023 | 14.9 KiB | 441 | 318
PPCMCInstLower.cpp | 22-Nov-2023 | 6 KiB | 188 | 144
PPCMIPeephole.cpp | 22-Nov-2023 | 7.6 KiB | 233 | 137
PPCMachineFunctionInfo.cpp | 22-Nov-2023 | 1.8 KiB | 47 | 31
PPCMachineFunctionInfo.h | 22-Nov-2023 | 7.7 KiB | 218 | 104
PPCPerfectShuffle.h | 22-Nov-2023 | 397.5 KiB | 6,592 | 6,567
PPCQPXLoadSplat.cpp | 22-Nov-2023 | 5.4 KiB | 167 | 104
PPCRegisterInfo.cpp | 22-Nov-2023 | 39.5 KiB | 1,068 | 723
PPCRegisterInfo.h | 22-Nov-2023 | 5.5 KiB | 146 | 100
PPCRegisterInfo.td | 22-Nov-2023 | 13.1 KiB | 364 | 315
PPCSchedule.td | 22-Nov-2023 | 4.9 KiB | 134 | 130
PPCSchedule440.td | 22-Nov-2023 | 35 KiB | 609 | 594
PPCScheduleA2.td | 22-Nov-2023 | 7.9 KiB | 173 | 162
PPCScheduleE500mc.td | 22-Nov-2023 | 19.2 KiB | 322 | 314
PPCScheduleE5500.td | 22-Nov-2023 | 23.7 KiB | 382 | 372
PPCScheduleG3.td | 22-Nov-2023 | 4.4 KiB | 81 | 78
PPCScheduleG4.td | 22-Nov-2023 | 5.3 KiB | 97 | 94
PPCScheduleG4Plus.td | 22-Nov-2023 | 6.5 KiB | 113 | 110
PPCScheduleG5.td | 22-Nov-2023 | 7.1 KiB | 131 | 123
PPCScheduleP7.td | 22-Nov-2023 | 21.7 KiB | 398 | 381
PPCScheduleP8.td | 22-Nov-2023 | 23.4 KiB | 407 | 390
PPCSubtarget.cpp | 22-Nov-2023 | 7.9 KiB | 252 | 172
PPCSubtarget.h | 22-Nov-2023 | 10.2 KiB | 318 | 215
PPCTLSDynamicCall.cpp | 22-Nov-2023 | 5.7 KiB | 175 | 114
PPCTOCRegDeps.cpp | 22-Nov-2023 | 5.2 KiB | 156 | 71
PPCTargetMachine.cpp | 22-Nov-2023 | 15.8 KiB | 443 | 297
PPCTargetMachine.h | 22-Nov-2023 | 2.7 KiB | 86 | 51
PPCTargetObjectFile.cpp | 22-Nov-2023 | 2.5 KiB | 62 | 31
PPCTargetObjectFile.h | 22-Nov-2023 | 1.2 KiB | 36 | 15
PPCTargetStreamer.h | 22-Nov-2023 | 866 B | 28 | 15
PPCTargetTransformInfo.cpp | 22-Nov-2023 | 14.8 KiB | 434 | 269
PPCTargetTransformInfo.h | 22-Nov-2023 | 3.7 KiB | 100 | 59
PPCVSXCopy.cpp | 22-Nov-2023 | 6.3 KiB | 188 | 131
PPCVSXFMAMutate.cpp | 22-Nov-2023 | 15.2 KiB | 395 | 224
PPCVSXSwapRemoval.cpp | 22-Nov-2023 | 36 KiB | 1,035 | 648
README.txt | 22-Nov-2023 | 18.1 KiB | 661 | 509
README_ALTIVEC.txt | 22-Nov-2023 | 11.7 KiB | 344 | 262
README_P9.txt | 22-Nov-2023 | 22.2 KiB | 606 | 479
p9-instrs.txt | 22-Nov-2023 | 14.1 KiB | 443 | 317
README.txt
//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* lmw/stmw pass a la arm load store optimizer for prolog/epilog

===-------------------------------------------------------------------------===

This code:

unsigned add32carry(unsigned sum, unsigned x) {
  unsigned z = sum + x;
  if (sum + x < x)
    z++;
  return z;
}

Should compile to something like:

        addc r3,r3,r4
        addze r3,r3

instead we get:

        add r3, r4, r3
        cmplw cr7, r3, r4
        mfcr r4 ; 1
        rlwinm r4, r4, 29, 31, 31
        add r3, r3, r4

Ick.

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:        ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:        ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):
double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

it produces a BB like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).  A small sketch appears after the next note.

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Darwin Stub removal:

We still generate calls to foo$stub, and stubs, on Darwin.  This is not
necessary when building with the Leopard (10.5) or later linker, as stubs are
generated by ld when necessary.  Parameterizing this based on the deployment
target (-mmacosx-version-min) is probably enough.  x86-32 does this right, see
its logic.
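
===-------------------------------------------------------------------------===

Relating to the "PIC Code Gen IPO optimization" note above, a minimal
before/after sketch in C (purely illustrative; this is not the output of any
existing pass):

/* Before: three separate globals, each needing its own PIC/GOT address
   formation at every use. */
static int x, y, z;
int sum3(void) { return x + y + z; }

/* After (conceptual result of the IPO): one struct, so a single base address
   can be formed once and CSE'd; members are constant offsets from it. */
static struct { int x, y, z; } g;
int sum3_squished(void) { return g.x + g.y + g.z; }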

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

the Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BITCAST to LLVM, then do an i-p transformation that
percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

So that

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

Becomes

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).
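
===-------------------------------------------------------------------------===

As a side note on the abs example above: the GCC sequence is the standard
branchless absolute-value identity.  A small C sketch of what the
subf/srawi/xor/subf sequence computes (assuming the usual arithmetic right
shift of negative ints, as PowerPC's srawi provides):

int abs_branchless(int d) {
  int mask = d >> 31;        /* srawi: 0 if d >= 0, -1 if d < 0 */
  return (d ^ mask) - mask;  /* xor, then subf */
}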

===-------------------------------------------------------------------------===

PR5945: This:
define i32 @clamp0g(i32 %a) {
entry:
        %cmp = icmp slt i32 %a, 0
        %sel = select i1 %cmp, i32 0, i32 %a
        ret i32 %sel
}

Is compiled to this with the PowerPC (32-bit) backend:

_clamp0g:
        cmpwi cr0, r3, 0
        li r2, 0
        blt cr0, LBB1_2
; BB#1:                         ; %entry
        mr r2, r3
LBB1_2:                         ; %entry
        mr r3, r2
        blr

This could be reduced to the much simpler:

_clamp0g:
        srawi r2, r3, 31
        andc r3, r3, r2
        blr

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw) code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw) code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw) code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
        %tmp = bitcast %struct.B* %b to i32*            ; <uint*> [#uses=1]
        %tmp = load i32* %tmp                           ; <uint> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=1]
        %tmp4 = load i32* %tmp3                         ; <uint> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=2]
        %tmp9 = load i32* %tmp8                         ; <uint> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1              ; <uint> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <uint> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <uint> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <uint> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <uint> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <uint> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

We emit:

_foo:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_foo:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.
Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs.  The class of LR would be
marked "unspillable".  When the register allocator came across an unspillable
reg, it would ask "what is the best class to copy this into that I *can* spill"
If it gets a class back, which it will in this case (the gprs), it grabs a free
register of that class.  If it is then later necessary to spill that reg, so be
it.

===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===

We compile:

define i32 @bar(i32 %x) nounwind readnone ssp {
entry:
  %0 = icmp eq i32 %x, 0                ; <i1> [#uses=1]
  %neg = sext i1 %0 to i32              ; <i32> [#uses=1]
  ret i32 %neg
}

to:

_bar:
        cntlzw r2, r3
        slwi r2, r2, 26
        srawi r3, r2, 31
        blr

it would be better to produce:

_bar:
        addic r3,r3,-1
        subfe r3,r3,r3
        blr

===-------------------------------------------------------------------------===

We generate horrible ppc code for this:

#define N  2000000
double   a[N],c[N];
void simpleloop() {
  int j;
  for (j=0; j<N; j++)
    c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1                 ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30               ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better) not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr

//===---------------------------------------------------------------------===//

We compile some FP comparisons into an mfcr with two rlwinms and an or.
For example:
#include <math.h>
int test(double x, double y) { return islessequal(x, y);}
int test2(double x, double y) { return islessgreater(x, y);}
int test3(double x, double y) { return !islessequal(x, y);}

Compiles into (all three are similar, but the bits differ):

_test:
        fcmpu cr7, f1, f2
        mfcr r2
        rlwinm r3, r2, 29, 31, 31
        rlwinm r2, r2, 31, 31, 31
        or r3, r2, r3
        blr

GCC compiles this into:

_test:
        fcmpu cr7,f1,f2
        cror 30,28,30
        mfcr r3
        rlwinm r3,r3,31,1
        blr

which is more efficient and can use mfocr.  See PR642 for some more context.

//===---------------------------------------------------------------------===//

void foo(float *data, float d) {
  long i;
  for (i = 0; i < 8000; i++)
    data[i] = d;
}
void foo2(float *data, float d) {
  long i;
  data--;
  for (i = 0; i < 8000; i++) {
    data[1] = d;
    data++;
  }
}

These compile to:

_foo:
        li r2, 0
LBB1_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB1_1 ; bb
        blr
_foo2:
        li r2, 0
LBB2_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB2_1 ; bb
        blr

The 'mr' could be eliminated by folding the add into the cmp better.

//===---------------------------------------------------------------------===//
Codegen for the following (low-probability) case deteriorated considerably
when the correctness fixes for unordered comparisons went in (PR 642, 58871).
It should be possible to recover the code quality described in the comments.

; RUN: llvm-as < %s | llc -march=ppc32 | grep or | count 3
; This should produce one 'or' or 'cror' instruction per function.

; RUN: llvm-as < %s | llc -march=ppc32 | grep mfcr | count 3
; PR2964

define i32 @test(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp ole double %x, %y          ; <i1> [#uses=1]
        %tmp345 = zext i1 %tmp3 to i32          ; <i32> [#uses=1]
        ret i32 %tmp345
}

define i32 @test2(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp one double %x, %y          ; <i1> [#uses=1]
        %tmp345 = zext i1 %tmp3 to i32          ; <i32> [#uses=1]
        ret i32 %tmp345
}

define i32 @test3(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp ugt double %x, %y          ; <i1> [#uses=1]
        %tmp34 = zext i1 %tmp3 to i32           ; <i32> [#uses=1]
        ret i32 %tmp34
}

//===---------------------------------------------------------------------===//
for the following code:

void foo (float *__restrict__ a, int *__restrict__ b, int n) {
  a[n] = b[n] * 2.321;
}

we load b[n] into a GPR, then move it to a VSX register and convert it to
float.  We should use VSX scalar integer load instructions to avoid the direct
moves.

//===----------------------------------------------------------------------===//
; RUN: llvm-as < %s | llc -march=ppc32 | not grep fneg

; This could generate FSEL with appropriate flags (FSEL is not IEEE-safe, and
; should not be generated except with -enable-finite-only-fp-math or the like).
; With the correctness fixes for PR642 (58871) LowerSELECT_CC would need to
; recognize a more elaborate tree than a simple SETxx.

define double @test_FNEG_sel(double %A, double %B, double %C) {
        %D = fsub double -0.000000e+00, %A              ; <double> [#uses=1]
        %Cond = fcmp ugt double %D, -0.000000e+00       ; <i1> [#uses=1]
        %E = select i1 %Cond, double %B, double %C      ; <double> [#uses=1]
        ret double %E
}

//===----------------------------------------------------------------------===//
The save/restore sequence for CR in prolog/epilog is terrible:
- Each CR subreg is saved individually, rather than doing one save as a unit.
- On Darwin, the save is done after the decrement of SP, which means the offset
from SP of the save slot can be too big for a store instruction, which means we
need an additional register (currently hacked in 96015+96020; the solution there
is correct, but poor).
- On SVR4 the same thing can happen, and I don't think saving before the SP
decrement is safe on that target, as there is no red zone.  This is currently
broken AFAIK, although it's not a target I can exercise.
The following demonstrates the problem:
extern void bar(char *p);
void foo() {
  char x[100000];
  bar(x);
  __asm__("" ::: "cr2");
}

//===----------------------------------------------------------------------===//
Naming convention for instruction formats is very haphazard.
We have agreed on a naming scheme as follows:

<INST_form>{_<OP_type><OP_len>}+

Where:
INST_form is the instruction format (X-form, etc.)
OP_type is the operand type - one of OPC (opcode),
                                     RD (register destination),
                                     RS (register source),
                                     RDp (destination register pair),
                                     RSp (source register pair),
                                     IM (immediate),
                                     XO (extended opcode)
OP_len is the length of the operand in bits

VSX register operands would be of length 6 (split across two fields),
condition register fields of length 3.
We would not need to denote reserved fields in names of instruction formats.
An illustrative example appears at the end of this file.

//===----------------------------------------------------------------------===//

Instruction fusion was introduced in ISA 2.06 and more opportunities added in
ISA 2.07.  LLVM needs to add infrastructure to recognize fusion opportunities
and force instruction pairs to be scheduled together.

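//===----------------------------------------------------------------------===//

As an illustration of the instruction-format naming scheme above: a
hypothetical X-form format with a 6-bit opcode field, two 5-bit register
operands and a 10-bit extended opcode would be spelled X_OPC6_RD5_RS5_XO10.
The name is purely illustrative and does not refer to an existing definition
in PPCInstrFormats.td.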
README_ALTIVEC.txt
//===- README_ALTIVEC.txt - Notes for improving Altivec code gen ----------===//

Implement PPCInstrInfo::isLoadFromStackSlot/isStoreToStackSlot for vector
registers, to generate better spill code.

//===----------------------------------------------------------------------===//

The first should be a single lvx from the constant pool, the second should be
a xor/stvx:

void foo(void) {
  int x[8] __attribute__((aligned(128))) = { 1, 1, 1, 17, 1, 1, 1, 1 };
  bar (x);
}

#include <string.h>
void foo(void) {
  int x[8] __attribute__((aligned(128)));
  memset (x, 0, sizeof (x));
  bar (x);
}

//===----------------------------------------------------------------------===//

Altivec: Codegen'ing MUL with vector FMADD should add -0.0, not 0.0:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=8763

When -ffast-math is on, we can use 0.0.

//===----------------------------------------------------------------------===//

Consider this:
  v4f32 Vector;
  v4f32 Vector2 = { Vector.X, Vector.X, Vector.X, Vector.X };

Since we know that "Vector" is 16-byte aligned and we know the element offset
of ".X", we should change the load into a lve*x instruction, instead of doing
a load/store/lve*x sequence.

//===----------------------------------------------------------------------===//

For functions that use altivec AND have calls, we are VRSAVE'ing all call
clobbered regs.

//===----------------------------------------------------------------------===//

Implement passing vectors by value into calls and receiving them as arguments.

//===----------------------------------------------------------------------===//

GCC apparently tries to codegen { C1, C2, Variable, C3 } as a constant pool load
of C1/C2/C3, then a load and vperm of Variable.

//===----------------------------------------------------------------------===//

We need a way to teach tblgen that some operands of an intrinsic are required to
be constants.  The verifier should enforce this constraint.

//===----------------------------------------------------------------------===//

We currently codegen SCALAR_TO_VECTOR as a store of the scalar to a 16-byte
aligned stack slot, followed by a load/vperm.  We should probably just store it
to a scalar stack slot, then use lvsl/vperm to load it.  If the value is already
in memory this is a big win.

//===----------------------------------------------------------------------===//

extract_vector_elt of an arbitrary constant vector can be done with the
following instructions:

vTemp = vec_splat(v0,2);    // 2 is the element the src is in.
vec_ste(&destloc,0,vTemp);

We can do an arbitrary non-constant value by using lvsr/perm/ste.

//===----------------------------------------------------------------------===//

If we want to tie instruction selection into the scheduler, we can do some
constant formation with different instructions.  For example, we can generate
"vsplti -1" with "vcmpequw R,R" and 1,1,1,1 with "vsubcuw R,R", and 0,0,0,0 with
"vsplti 0" or "vxor", each of which use different execution units, thus could
help scheduling.

This is probably only reasonable for a post-pass scheduler.

//===----------------------------------------------------------------------===//

For this function:

void test(vector float *A, vector float *B) {
  vector float C = (vector float)vec_cmpeq(*A, *B);
  if (!vec_any_eq(*A, *B))
    *B = (vector float){0,0,0,0};
  *A = C;
}

we get the following basic block:

        ...
        lvx v2, 0, r4
        lvx v3, 0, r3
        vcmpeqfp v4, v3, v2
        vcmpeqfp. v2, v3, v2
        bne cr6, LBB1_2 ; cond_next

The vcmpeqfp/vcmpeqfp. instructions currently cannot be merged when the
vcmpeqfp. result is used by a branch.  This can be improved.

//===----------------------------------------------------------------------===//

The code generated for this is truly awful:

vector float test(float a, float b) {
  return (vector float){ 0.0, a, 0.0, 0.0};
}

LCPI1_0:                ; float
        .space  4
        .text
        .globl  _test
        .align  4
_test:
        mfspr r2, 256
        oris r3, r2, 4096
        mtspr 256, r3
        lis r3, ha16(LCPI1_0)
        addi r4, r1, -32
        stfs f1, -16(r1)
        addi r5, r1, -16
        lfs f0, lo16(LCPI1_0)(r3)
        stfs f0, -32(r1)
        lvx v2, 0, r4
        lvx v3, 0, r5
        vmrghw v3, v3, v2
        vspltw v2, v2, 0
        vmrghw v2, v2, v3
        mtspr 256, r2
        blr

//===----------------------------------------------------------------------===//

int foo(vector float *x, vector float *y) {
  if (vec_all_eq(*x,*y)) return 3245;
  else return 12;
}

A predicate compare being used in a select_cc should have the same peephole
applied to it as a predicate compare used by a br_cc.  There should be no
mfcr here:

_foo:
        mfspr r2, 256
        oris r5, r2, 12288
        mtspr 256, r5
        li r5, 12
        li r6, 3245
        lvx v2, 0, r4
        lvx v3, 0, r3
        vcmpeqfp. v2, v3, v2
        mfcr r3, 2
        rlwinm r3, r3, 25, 31, 31
        cmpwi cr0, r3, 0
        bne cr0, LBB1_2 ; entry
LBB1_1: ; entry
        mr r6, r5
LBB1_2: ; entry
        mr r3, r6
        mtspr 256, r2
        blr

//===----------------------------------------------------------------------===//

CodeGen/PowerPC/vec_constants.ll has an and operation that should be
codegen'd to andc.  The issue is that the 'all ones' build vector is
SelectNodeTo'd a VSPLTISB instruction node before the and/xor is selected
which prevents the vnot pattern from matching.

//===----------------------------------------------------------------------===//

An alternative to the store/store/load approach for illegal insert element
lowering would be:

1. store element to any ol' slot
2. lvx the slot
3. lvsl 0; splat index; vcmpeq to generate a select mask
4. lvsl slot + x; vperm to rotate result into correct slot
5. vsel result together.

//===----------------------------------------------------------------------===//

Should codegen branches on vec_any/vec_all to avoid mfcr.  Two examples:

#include <altivec.h>
int f(vector float a, vector float b)
{
  int aa = 0;
  if (vec_all_ge(a, b))
    aa |= 0x1;
  if (vec_any_ge(a,b))
    aa |= 0x2;
  return aa;
}

vector float f(vector float a, vector float b) {
  if (vec_any_eq(a, b))
    return a;
  else
    return b;
}

//===----------------------------------------------------------------------===//

We should do a little better with eliminating dead stores.
The stores to the stack are dead since %a and %b are not needed:

; Function Attrs: nounwind
define <16 x i8> @test_vpmsumb() #0 {
entry:
  %a = alloca <16 x i8>, align 16
  %b = alloca <16 x i8>, align 16
  store <16 x i8> <i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16>, <16 x i8>* %a, align 16
  store <16 x i8> <i8 113, i8 114, i8 115, i8 116, i8 117, i8 118, i8 119, i8 120, i8 121, i8 122, i8 123, i8 124, i8 125, i8 126, i8 127, i8 112>, <16 x i8>* %b, align 16
  %0 = load <16 x i8>* %a, align 16
  %1 = load <16 x i8>* %b, align 16
  %2 = call <16 x i8> @llvm.ppc.altivec.crypto.vpmsumb(<16 x i8> %0, <16 x i8> %1)
  ret <16 x i8> %2
}

; Function Attrs: nounwind readnone
declare <16 x i8> @llvm.ppc.altivec.crypto.vpmsumb(<16 x i8>, <16 x i8>) #1

Produces the following code with -mtriple=powerpc64-unknown-linux-gnu:
# BB#0:                                 # %entry
        addis 3, 2, .LCPI0_0@toc@ha
        addis 4, 2, .LCPI0_1@toc@ha
        addi 3, 3, .LCPI0_0@toc@l
        addi 4, 4, .LCPI0_1@toc@l
        lxvw4x 0, 0, 3
        addi 3, 1, -16
        lxvw4x 35, 0, 4
        stxvw4x 0, 0, 3
        ori 2, 2, 0
        lxvw4x 34, 0, 3
        addi 3, 1, -32
        stxvw4x 35, 0, 3
        vpmsumb 2, 2, 3
        blr
        .long   0
        .quad   0

The two stxvw4x instructions are not needed.
With -mtriple=powerpc64le-unknown-linux-gnu, the associated permutes
are present too.

//===----------------------------------------------------------------------===//

The following example is found in test/CodeGen/PowerPC/vec_add_sub_doubleword.ll:

define <2 x i64> @increment_by_val(<2 x i64> %x, i64 %val) nounwind {
  %tmpvec = insertelement <2 x i64> <i64 0, i64 0>, i64 %val, i32 0
  %tmpvec2 = insertelement <2 x i64> %tmpvec, i64 %val, i32 1
  %result = add <2 x i64> %x, %tmpvec2
  ret <2 x i64> %result
}

This will generate the following instruction sequence:
        std 5, -8(1)
        std 5, -16(1)
        addi 3, 1, -16
        ori 2, 2, 0
        lxvd2x 35, 0, 3
        vaddudm 2, 2, 3
        blr

This will almost certainly cause a load-hit-store hazard.
Since val is a value parameter, it should not need to be saved onto
the stack, unless it's being done to set up the vector register.  Instead,
it would be better to splat the value into a vector register, and then
remove the (dead) stores to the stack.

//===----------------------------------------------------------------------===//

At the moment we always generate a lxsdx in preference to lfd, or stxsdx in
preference to stfd.  When we have a reg-immediate addressing mode, this is a
poor choice, since we have to load the address into an index register.  This
should be fixed for P7/P8.

//===----------------------------------------------------------------------===//

Right now, ShuffleKind 0 is supported only on BE, and ShuffleKind 2 only on LE.
However, we could actually support both kinds on either endianness, if we check
for the appropriate shufflevector pattern for each case ... this would cause
some additional shufflevectors to be recognized and implemented via the
"swapped" form.

//===----------------------------------------------------------------------===//

There is a utility program called PerfectShuffle that generates a table of the
shortest instruction sequence for implementing a shufflevector operation on
PowerPC.  However, this was designed for big-endian code generation.
We could
modify this program to create a little endian version of the table.  The table
is used in PPCISelLowering.cpp, PPCTargetLowering::LOWERVECTOR_SHUFFLE().

//===----------------------------------------------------------------------===//

Opportunities to use instructions from PPCInstrVSX.td during code gen
  - Conversion instructions (Sections 7.6.1.5 and 7.6.1.6 of ISA 2.07)
  - Scalar comparisons (xscmpodp and xscmpudp)
  - Min and max (xsmaxdp, xsmindp, xvmaxdp, xvmindp, xvmaxsp, xvminsp)

Related to this: we currently do not generate the lxvw4x instruction for either
v4f32 or v4i32, probably because adding a dag pattern to the recognizer requires
a single target type.  This should probably be addressed in the PPCISelDAGToDAG logic.

//===----------------------------------------------------------------------===//

Currently EXTRACT_VECTOR_ELT and INSERT_VECTOR_ELT are type-legal only
for v2f64 with VSX available.  We should create custom lowering
support for the other vector types.  Without this support, we generate
sequences with load-hit-store hazards.

v4f32 can be supported with VSX by shifting the correct element into
big-endian lane 0, using xscvspdpn to produce a double-precision
representation of the single-precision value in big-endian
double-precision lane 0, and reinterpreting lane 0 as an FPR or
vector-scalar register.

v2i64 can be supported with VSX and P8Vector in the same manner as
v2f64, followed by a direct move to a GPR.

v4i32 can be supported with VSX and P8Vector by shifting the correct
element into big-endian lane 1, using a direct move to a GPR, and
sign-extending the 32-bit result to 64 bits.

v8i16 can be supported with VSX and P8Vector by shifting the correct
element into big-endian lane 3, using a direct move to a GPR, and
sign-extending the 16-bit result to 64 bits.

v16i8 can be supported with VSX and P8Vector by shifting the correct
element into big-endian lane 7, using a direct move to a GPR, and
sign-extending the 8-bit result to 64 bits.
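
//===----------------------------------------------------------------------===//

A minimal C example (an illustration added here, not taken from a test case)
of the kind of source that exercises the EXTRACT_VECTOR_ELT path discussed in
the previous note; without custom lowering for v4i32 the element is bounced
through a stack slot, which is exactly the load-hit-store hazard described:

#include <altivec.h>
int extract_lane2(__vector int v) {
  return vec_extract(v, 2);   /* element extract of a non-v2f64 vector type */
}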
README_P9.txt
//===- README_P9.txt - Notes for improving Power9 code gen ----------------===//

TODO: Instructions that need intrinsics implemented or need to be mapped to
LLVM IR

Altivec:
- Vector Compare Not Equal (Zero):
  vcmpneb(.) vcmpneh(.) vcmpnew(.)
  vcmpnezb(.) vcmpnezh(.) vcmpnezw(.)
  . Same as other VCMP*, use VCMP/VCMPo form (support intrinsic)

- Vector Extract Unsigned: vextractub vextractuh vextractuw vextractd
  . Don't use llvm extractelement because they have different semantics
  . Use intrinsics:
    (set v2i64:$vD, (int_ppc_altivec_vextractub v16i8:$vA, imm:$UIMM))
    (set v2i64:$vD, (int_ppc_altivec_vextractuh v8i16:$vA, imm:$UIMM))
    (set v2i64:$vD, (int_ppc_altivec_vextractuw v4i32:$vA, imm:$UIMM))
    (set v2i64:$vD, (int_ppc_altivec_vextractd v2i64:$vA, imm:$UIMM))

- Vector Extract Unsigned Byte Left/Right-Indexed:
  vextublx vextubrx vextuhlx vextuhrx vextuwlx vextuwrx
  . Use intrinsics:
    // Left-Indexed
    (set i64:$rD, (int_ppc_altivec_vextublx i64:$rA, v16i8:$vB))
    (set i64:$rD, (int_ppc_altivec_vextuhlx i64:$rA, v8i16:$vB))
    (set i64:$rD, (int_ppc_altivec_vextuwlx i64:$rA, v4i32:$vB))

    // Right-Indexed
    (set i64:$rD, (int_ppc_altivec_vextubrx i64:$rA, v16i8:$vB))
    (set i64:$rD, (int_ppc_altivec_vextuhrx i64:$rA, v8i16:$vB))
    (set i64:$rD, (int_ppc_altivec_vextuwrx i64:$rA, v4i32:$vB))

- Vector Insert Element Instructions: vinsertb vinsertd vinserth vinsertw
    (set v16i8:$vD, (int_ppc_altivec_vinsertb v16i8:$vA, imm:$UIMM))
    (set v8i16:$vD, (int_ppc_altivec_vinsertd v8i16:$vA, imm:$UIMM))
    (set v4i32:$vD, (int_ppc_altivec_vinserth v4i32:$vA, imm:$UIMM))
    (set v2i64:$vD, (int_ppc_altivec_vinsertw v2i64:$vA, imm:$UIMM))

- Vector Count Leading/Trailing Zero LSB.  Result is placed into GPR[rD]:
  vclzlsbb vctzlsbb
  . Use intrinsic:
    (set i64:$rD, (int_ppc_altivec_vclzlsbb v16i8:$vB))
    (set i64:$rD, (int_ppc_altivec_vctzlsbb v16i8:$vB))

- Vector Count Trailing Zeros: vctzb vctzh vctzw vctzd
  . Map to llvm cttz
    (set v16i8:$vD, (cttz v16i8:$vB))     // vctzb
    (set v8i16:$vD, (cttz v8i16:$vB))     // vctzh
    (set v4i32:$vD, (cttz v4i32:$vB))     // vctzw
    (set v2i64:$vD, (cttz v2i64:$vB))     // vctzd

- Vector Extend Sign: vextsb2w vextsh2w vextsb2d vextsh2d vextsw2d
  . vextsb2w:
    (set v4i32:$vD, (sext v4i8:$vB))

    // PowerISA_V3.0:
    do i = 0 to 3
       VR[VRT].word[i] ← EXTS32(VR[VRB].word[i].byte[3])
    end

  . vextsh2w:
    (set v4i32:$vD, (sext v4i16:$vB))

    // PowerISA_V3.0:
    do i = 0 to 3
       VR[VRT].word[i] ← EXTS32(VR[VRB].word[i].hword[1])
    end

  . vextsb2d
    (set v2i64:$vD, (sext v2i8:$vB))

    // PowerISA_V3.0:
    do i = 0 to 1
       VR[VRT].dword[i] ← EXTS64(VR[VRB].dword[i].byte[7])
    end

  . vextsh2d
    (set v2i64:$vD, (sext v2i16:$vB))

    // PowerISA_V3.0:
    do i = 0 to 1
       VR[VRT].dword[i] ← EXTS64(VR[VRB].dword[i].hword[3])
    end

  . vextsw2d
    (set v2i64:$vD, (sext v2i32:$vB))

    // PowerISA_V3.0:
    do i = 0 to 1
       VR[VRT].dword[i] ← EXTS64(VR[VRB].dword[i].word[1])
    end

- Vector Integer Negate: vnegw vnegd
  . Map to llvm ineg
    (set v4i32:$rT, (ineg v4i32:$rA))       // vnegw
    (set v2i64:$rT, (ineg v2i64:$rA))       // vnegd

- Vector Parity Byte: vprtybw vprtybd vprtybq
  . Use intrinsic:
    (set v4i32:$rD, (int_ppc_altivec_vprtybw v4i32:$vB))
    (set v2i64:$rD, (int_ppc_altivec_vprtybd v2i64:$vB))
    (set v1i128:$rD, (int_ppc_altivec_vprtybq v1i128:$vB))

- Vector (Bit) Permute (Right-indexed):
  . vbpermd: Same as "vbpermq", use VX1_Int_Ty2:
    VX1_Int_Ty2<1484, "vbpermd", int_ppc_altivec_vbpermd, v2i64, v2i64>;

  . vpermr: use VA1a_Int_Ty3
    VA1a_Int_Ty3<59, "vpermr", int_ppc_altivec_vpermr, v16i8, v16i8, v16i8>;

- Vector Rotate Left Mask/Mask-Insert: vrlwnm vrlwmi vrldnm vrldmi
  . Use intrinsic:
    VX1_Int_Ty<389, "vrlwnm", int_ppc_altivec_vrlwnm, v4i32>;
    VX1_Int_Ty<133, "vrlwmi", int_ppc_altivec_vrlwmi, v4i32>;
    VX1_Int_Ty<453, "vrldnm", int_ppc_altivec_vrldnm, v2i64>;
    VX1_Int_Ty<197, "vrldmi", int_ppc_altivec_vrldmi, v2i64>;

- Vector Shift Left/Right: vslv vsrv
  . Use intrinsic, don't map to llvm shl and lshr, because they have different
    semantics, e.g. vslv:

    do i = 0 to 15
       sh ← VR[VRB].byte[i].bit[5:7]
       VR[VRT].byte[i] ← src.byte[i:i+1].bit[sh:sh+7]
    end

    VR[VRT].byte[i] is composed of 2 bytes from src.byte[i:i+1]

  . VX1_Int_Ty<1860, "vslv", int_ppc_altivec_vslv, v16i8>;
    VX1_Int_Ty<1796, "vsrv", int_ppc_altivec_vsrv, v16i8>;

- Vector Multiply-by-10 (& Write Carry) Unsigned Quadword:
  vmul10uq vmul10cuq
  . Use intrinsic:
    VX1_Int_Ty<513, "vmul10uq",  int_ppc_altivec_vmul10uq,  v1i128>;
    VX1_Int_Ty<  1, "vmul10cuq", int_ppc_altivec_vmul10cuq, v1i128>;

- Vector Multiply-by-10 Extended (& Write Carry) Unsigned Quadword:
  vmul10euq vmul10ecuq
  . Use intrinsic:
    VX1_Int_Ty<577, "vmul10euq",  int_ppc_altivec_vmul10euq,  v1i128>;
    VX1_Int_Ty< 65, "vmul10ecuq", int_ppc_altivec_vmul10ecuq, v1i128>;

- Decimal Convert From/to National/Zoned/Signed-QWord:
  bcdcfn. bcdcfz. bcdctn. bcdctz. bcdcfsq. bcdctsq.
  . Use intrinsics:
    (set v1i128:$vD, (int_ppc_altivec_bcdcfno v1i128:$vB, i1:$PS))
    (set v1i128:$vD, (int_ppc_altivec_bcdcfzo v1i128:$vB, i1:$PS))
    (set v1i128:$vD, (int_ppc_altivec_bcdctno v1i128:$vB))
    (set v1i128:$vD, (int_ppc_altivec_bcdctzo v1i128:$vB, i1:$PS))
    (set v1i128:$vD, (int_ppc_altivec_bcdcfsqo v1i128:$vB, i1:$PS))
    (set v1i128:$vD, (int_ppc_altivec_bcdctsqo v1i128:$vB))

- Decimal Copy-Sign/Set-Sign: bcdcpsgn. bcdsetsgn.
  . Use intrinsics:
    (set v1i128:$vD, (int_ppc_altivec_bcdcpsgno v1i128:$vA, v1i128:$vB))
    (set v1i128:$vD, (int_ppc_altivec_bcdsetsgno v1i128:$vB, i1:$PS))

- Decimal Shift/Unsigned-Shift/Shift-and-Round: bcds. bcdus. bcdsr.
  . Use intrinsics:
    (set v1i128:$vD, (int_ppc_altivec_bcdso v1i128:$vA, v1i128:$vB, i1:$PS))
    (set v1i128:$vD, (int_ppc_altivec_bcduso v1i128:$vA, v1i128:$vB))
    (set v1i128:$vD, (int_ppc_altivec_bcdsro v1i128:$vA, v1i128:$vB, i1:$PS))

  . Note! Only 1 byte of VA is accessed, i.e. VA.byte[7]

- Decimal (Unsigned) Truncate: bcdtrunc. bcdutrunc.
  . Use intrinsics:
    (set v1i128:$vD, (int_ppc_altivec_bcdso v1i128:$vA, v1i128:$vB, i1:$PS))
    (set v1i128:$vD, (int_ppc_altivec_bcduso v1i128:$vA, v1i128:$vB))

  . Note! Only 2 bytes of VA are accessed, i.e. VA.hword[3] (VA.bit[48:63])

VSX:
- QP Copy Sign: xscpsgnqp
  . Similar to xscpsgndp
  . (set f128:$vT, (fcopysign f128:$vB, f128:$vA))

- QP Absolute/Negative-Absolute/Negate: xsabsqp xsnabsqp xsnegqp
  . Similar to xsabsdp/xsnabsdp/xsnegdp
  . (set f128:$vT, (fabs f128:$vB))             // xsabsqp
    (set f128:$vT, (fneg (fabs f128:$vB)))      // xsnabsqp
    (set f128:$vT, (fneg f128:$vB))             // xsnegqp

- QP Add/Divide/Multiply/Subtract/Square-Root:
  xsaddqp xsdivqp xsmulqp xssubqp xssqrtqp
  . Similar to xsadddp
  . isCommutable = 1
    (set f128:$vT, (fadd f128:$vA, f128:$vB))   // xsaddqp
    (set f128:$vT, (fmul f128:$vA, f128:$vB))   // xsmulqp

  . isCommutable = 0
    (set f128:$vT, (fdiv f128:$vA, f128:$vB))   // xsdivqp
    (set f128:$vT, (fsub f128:$vA, f128:$vB))   // xssubqp
    (set f128:$vT, (fsqrt f128:$vB))            // xssqrtqp

- Round to Odd of QP Add/Divide/Multiply/Subtract/Square-Root:
  xsaddqpo xsdivqpo xsmulqpo xssubqpo xssqrtqpo
  . Similar to xsrsqrtedp??
      def XSRSQRTEDP : XX2Form<60, 74,
                        (outs vsfrc:$XT), (ins vsfrc:$XB),
                        "xsrsqrtedp $XT, $XB", IIC_VecFP,
                        [(set f64:$XT, (PPCfrsqrte f64:$XB))]>;

  . Define DAG Node in PPCInstrInfo.td:
    def PPCfaddrto:  SDNode<"PPCISD::FADDRTO",  SDTFPBinOp, []>;
    def PPCfdivrto:  SDNode<"PPCISD::FDIVRTO",  SDTFPBinOp, []>;
    def PPCfmulrto:  SDNode<"PPCISD::FMULRTO",  SDTFPBinOp, []>;
    def PPCfsubrto:  SDNode<"PPCISD::FSUBRTO",  SDTFPBinOp, []>;
    def PPCfsqrtrto: SDNode<"PPCISD::FSQRTRTO", SDTFPUnaryOp, []>;

    DAG patterns of each instruction (PPCInstrVSX.td):
    . isCommutable = 1
      (set f128:$vT, (PPCfaddrto f128:$vA, f128:$vB))   // xsaddqpo
      (set f128:$vT, (PPCfmulrto f128:$vA, f128:$vB))   // xsmulqpo

    . isCommutable = 0
      (set f128:$vT, (PPCfdivrto f128:$vA, f128:$vB))   // xsdivqpo
      (set f128:$vT, (PPCfsubrto f128:$vA, f128:$vB))   // xssubqpo
      (set f128:$vT, (PPCfsqrtrto f128:$vB))            // xssqrtqpo

- QP (Negative) Multiply-{Add/Subtract}: xsmaddqp xsmsubqp xsnmaddqp xsnmsubqp
  . Ref: xsmaddadp/xsmsubadp/xsnmaddadp/xsnmsubadp

  . isCommutable = 1
    // xsmaddqp
    [(set f128:$vT, (fma f128:$vA, f128:$vB, f128:$vTi))]>,
    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,
    AltVSXFMARel;

    // xsmsubqp
    [(set f128:$vT, (fma f128:$vA, f128:$vB, (fneg f128:$vTi)))]>,
    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,
    AltVSXFMARel;

    // xsnmaddqp
    [(set f128:$vT, (fneg (fma f128:$vA, f128:$vB, f128:$vTi)))]>,
    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,
    AltVSXFMARel;

    // xsnmsubqp
    [(set f128:$vT, (fneg (fma f128:$vA, f128:$vB, (fneg f128:$vTi))))]>,
    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,
    AltVSXFMARel;

- Round to Odd of QP (Negative) Multiply-{Add/Subtract}:
  xsmaddqpo xsmsubqpo xsnmaddqpo xsnmsubqpo
  . Similar to xsrsqrtedp??

  . Define DAG Node in PPCInstrInfo.td:
    def PPCfmarto: SDNode<"PPCISD::FMARTO", SDTFPTernaryOp, []>;

    It looks like we only need to define "PPCfmarto" for these instructions,
    because according to PowerISA_V3.0, these instructions perform RTO on
    fma's result:
        xsmaddqp(o)
        v      ← bfp_MULTIPLY_ADD(src1, src3, src2)
        rnd    ← bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v)
        result ← bfp_CONVERT_TO_BFP128(rnd)

        xsmsubqp(o)
        v      ← bfp_MULTIPLY_ADD(src1, src3, bfp_NEGATE(src2))
        rnd    ← bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v)
        result ← bfp_CONVERT_TO_BFP128(rnd)

        xsnmaddqp(o)
        v      ← bfp_MULTIPLY_ADD(src1, src3, src2)
        rnd    ← bfp_NEGATE(bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v))
        result ← bfp_CONVERT_TO_BFP128(rnd)

        xsnmsubqp(o)
        v      ← bfp_MULTIPLY_ADD(src1, src3, bfp_NEGATE(src2))
        rnd    ← bfp_NEGATE(bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v))
        result ← bfp_CONVERT_TO_BFP128(rnd)

    DAG patterns of each instruction (PPCInstrVSX.td):
    . isCommutable = 1
      // xsmaddqpo
      [(set f128:$vT, (PPCfmarto f128:$vA, f128:$vB, f128:$vTi))]>,
      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,
      AltVSXFMARel;

      // xsmsubqpo
      [(set f128:$vT, (PPCfmarto f128:$vA, f128:$vB, (fneg f128:$vTi)))]>,
      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,
      AltVSXFMARel;

      // xsnmaddqpo
      [(set f128:$vT, (fneg (PPCfmarto f128:$vA, f128:$vB, f128:$vTi)))]>,
      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,
      AltVSXFMARel;

      // xsnmsubqpo
      [(set f128:$vT, (fneg (PPCfmarto f128:$vA, f128:$vB, (fneg f128:$vTi))))]>,
      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,
      AltVSXFMARel;

- QP Compare Ordered/Unordered: xscmpoqp xscmpuqp
  . ref: XSCMPUDP
      def XSCMPUDP : XX3Form_1<60, 35,
                       (outs crrc:$crD), (ins vsfrc:$XA, vsfrc:$XB),
                       "xscmpudp $crD, $XA, $XB", IIC_FPCompare, []>;

  . No SDAG, intrinsic, builtin are required??
    Or llvm fcmp order/unorder compare??

- DP/QP Compare Exponents: xscmpexpdp xscmpexpqp
  . No SDAG, intrinsic, builtin are required?

- DP Compare ==, >=, >, !=: xscmpeqdp xscmpgedp xscmpgtdp xscmpnedp
  . I checked the existing instruction "XSCMPUDP".  They differ in the target
    register: "XSCMPUDP" writes to a CR field, xscmp*dp writes to a VSX
    register.

  . Use intrinsic:
    (set i128:$XT, (int_ppc_vsx_xscmpeqdp f64:$XA, f64:$XB))
    (set i128:$XT, (int_ppc_vsx_xscmpgedp f64:$XA, f64:$XB))
    (set i128:$XT, (int_ppc_vsx_xscmpgtdp f64:$XA, f64:$XB))
    (set i128:$XT, (int_ppc_vsx_xscmpnedp f64:$XA, f64:$XB))

- Vector Compare Not Equal: xvcmpnedp xvcmpnedp. xvcmpnesp xvcmpnesp.
  . Similar to xvcmpeqdp:
      defm XVCMPEQDP : XX3Form_Rcr<60, 99,
                                 "xvcmpeqdp", "$XT, $XA, $XB", IIC_VecFPCompare,
                                 int_ppc_vsx_xvcmpeqdp, v2i64, v2f64>;

  . So we should use "XX3Form_Rcr" to implement the intrinsic

- Convert DP -> QP: xscvdpqp
  . Similar to XSCVDPSP:
      def XSCVDPSP : XX2Form<60, 265,
                          (outs vsfrc:$XT), (ins vsfrc:$XB),
                          "xscvdpsp $XT, $XB", IIC_VecFP, []>;
  . So, No SDAG, intrinsic, builtin are required??

- Round & Convert QP -> DP (dword[1] is set to zero): xscvqpdp xscvqpdpo
  . Similar to XSCVDPSP
  . No SDAG, intrinsic, builtin are required??

- Truncate & Convert QP -> (Un)Signed (D)Word (dword[1] is set to zero):
  xscvqpsdz xscvqpswz xscvqpudz xscvqpuwz
  . According to PowerISA_V3.0, these are similar to "XSCVDPSXDS", "XSCVDPSXWS",
    "XSCVDPUXDS", "XSCVDPUXWS"

  . DAG patterns:
    (set f128:$XT, (PPCfctidz f128:$XB))    // xscvqpsdz
    (set f128:$XT, (PPCfctiwz f128:$XB))    // xscvqpswz
    (set f128:$XT, (PPCfctiduz f128:$XB))   // xscvqpudz
    (set f128:$XT, (PPCfctiwuz f128:$XB))   // xscvqpuwz

- Convert (Un)Signed DWord -> QP: xscvsdqp xscvudqp
  . Similar to XSCVSXDSP
  . (set f128:$XT, (PPCfcfids f64:$XB))     // xscvsdqp
    (set f128:$XT, (PPCfcfidus f64:$XB))    // xscvudqp

- (Round &) Convert DP <-> HP: xscvdphp xscvhpdp
  . Similar to XSCVDPSP
  . No SDAG, intrinsic, builtin are required??

- Vector HP -> SP: xvcvhpsp xvcvsphp
  . Similar to XVCVDPSP:
      def XVCVDPSP : XX2Form<60, 393,
                          (outs vsrc:$XT), (ins vsrc:$XB),
                          "xvcvdpsp $XT, $XB", IIC_VecFP, []>;
  . No SDAG, intrinsic, builtin are required??

- Round to Quad-Precision Integer: xsrqpi xsrqpix
  . These are combinations of "XSRDPI", "XSRDPIC", "XSRDPIM", ..., because you
    need to assign the rounding mode in the instruction
  . Provide builtin?
    (set f128:$vT, (int_ppc_vsx_xsrqpi f128:$vB))
    (set f128:$vT, (int_ppc_vsx_xsrqpix f128:$vB))

- Round Quad-Precision to Double-Extended Precision (fp80): xsrqpxp
  . Provide builtin?
    (set f128:$vT, (int_ppc_vsx_xsrqpxp f128:$vB))

Fixed Point Facility:

- Exploit cmprb and cmpeqb (perhaps for something like
  isalpha/isdigit/isupper/islower and isspace respectively).  This can
  perhaps be done through a builtin (a small C sketch appears at the end of
  this file).

- Provide testing for cnttz[dw]
- Insert Exponent DP/QP: xsiexpdp xsiexpqp
  . Use intrinsic?
  . xsiexpdp:
    // Note: rA and rB are the unsigned integer value.
    (set f128:$XT, (int_ppc_vsx_xsiexpdp i64:$rA, i64:$rB))

  . xsiexpqp:
    (set f128:$vT, (int_ppc_vsx_xsiexpqp f128:$vA, f64:$vB))

- Extract Exponent/Significand DP/QP: xsxexpdp xsxsigdp xsxexpqp xsxsigqp
  . Use intrinsic?
  . (set i64:$rT, (int_ppc_vsx_xsxexpdp f64:$XB))      // xsxexpdp
    (set i64:$rT, (int_ppc_vsx_xsxsigdp f64:$XB))      // xsxsigdp
    (set f128:$vT, (int_ppc_vsx_xsxexpqp f128:$vB))    // xsxexpqp
    (set f128:$vT, (int_ppc_vsx_xsxsigqp f128:$vB))    // xsxsigqp

- Vector Insert Word: xxinsertw
  - Useful for inserting f32/i32 elements into vectors (the element to be
    inserted needs to be prepared)
  . Note: llvm has insertelement in "Vector Operations"
    ; yields <n x <ty>>
    <result> = insertelement <n x <ty>> <val>, <ty> <elt>, <ty2> <idx>

    But how to map to it??
    [(set v1f128:$XT, (insertelement v1f128:$XTi, f128:$XB, i4:$UIMM))]>,
    RegConstraint<"$XTi = $XT">, NoEncode<"$XTi">,

  . Or use intrinsic?
    (set v1f128:$XT, (int_ppc_vsx_xxinsertw v1f128:$XTi, f128:$XB, i4:$UIMM))

- Vector Extract Unsigned Word: xxextractuw
  - Not useful for extraction of f32 from v4f32 (the current pattern is better -
    shift->convert)
  - It is useful for (uint_to_fp (vector_extract v4i32, N))
  - Unfortunately, it can't be used for (sint_to_fp (vector_extract v4i32, N))
  . Note: llvm has extractelement in "Vector Operations"
    ; yields <ty>
    <result> = extractelement <n x <ty>> <val>, <ty2> <idx>

    How to map to it??
    [(set f128:$XT, (extractelement v1f128:$XB, i4:$UIMM))]

  . Or use intrinsic?
    (set f128:$XT, (int_ppc_vsx_xxextractuw v1f128:$XB, i4:$UIMM))

- Vector Insert Exponent DP/SP: xviexpdp xviexpsp
  . Use intrinsic
    (set v2f64:$XT, (int_ppc_vsx_xviexpdp v2f64:$XA, v2f64:$XB))
    (set v4f32:$XT, (int_ppc_vsx_xviexpsp v4f32:$XA, v4f32:$XB))

- Vector Extract Exponent/Significand DP/SP: xvxexpdp xvxexpsp xvxsigdp xvxsigsp
  . Use intrinsic
    (set v2f64:$XT, (int_ppc_vsx_xvxexpdp v2f64:$XB))
    (set v4f32:$XT, (int_ppc_vsx_xvxexpsp v4f32:$XB))
    (set v2f64:$XT, (int_ppc_vsx_xvxsigdp v2f64:$XB))
    (set v4f32:$XT, (int_ppc_vsx_xvxsigsp v4f32:$XB))

- Test Data Class SP/DP/QP: xststdcsp xststdcdp xststdcqp
  . No SDAG, intrinsic, builtin are required?
    Because it seems that we have no way to map the BF field?

    Instruction Form: [PO T XO B XO BX TX]
    Asm: xststd* BF,XB,DCMX

    BF is an index to a CR register field.

- Vector Test Data Class SP/DP: xvtstdcsp xvtstdcdp
  . Use intrinsic
    (set v4f32:$XT, (int_ppc_vsx_xvtstdcsp v4f32:$XB, i7:$DCMX))
    (set v2f64:$XT, (int_ppc_vsx_xvtstdcdp v2f64:$XB, i7:$DCMX))

- Maximum/Minimum Type-C/Type-J DP: xsmaxcdp xsmaxjdp xsmincdp xsminjdp
  . PowerISA_V3.0:
    "xsmaxcdp can be used to implement the C/C++/Java conditional operation
     (x>y)?x:y for single-precision and double-precision arguments."

    Note! c type and j type have different behavior when:
    1. Either input is NaN
    2. Both inputs are +-Infinity, +-Zero

  . c type maps to llvm fmaxnum/fminnum
    j type uses intrinsics

  . xsmaxcdp xsmincdp
    (set f64:$XT, (fmaxnum f64:$XA, f64:$XB))
    (set f64:$XT, (fminnum f64:$XA, f64:$XB))

  . xsmaxjdp xsminjdp
    (set f64:$XT, (int_ppc_vsx_xsmaxjdp f64:$XA, f64:$XB))
    (set f64:$XT, (int_ppc_vsx_xsminjdp f64:$XA, f64:$XB))

- Vector Byte-Reverse H/W/D/Q Word: xxbrh xxbrw xxbrd xxbrq
  . Use intrinsic
    (set v8i16:$XT, (int_ppc_vsx_xxbrh v8i16:$XB))
    (set v4i32:$XT, (int_ppc_vsx_xxbrw v4i32:$XB))
    (set v2i64:$XT, (int_ppc_vsx_xxbrd v2i64:$XB))
    (set v1i128:$XT, (int_ppc_vsx_xxbrq v1i128:$XB))

- Vector Permute: xxperm xxpermr
  . I have checked "PPCxxswapd" in PPCInstrVSX.td, but they are different
  . Use intrinsic
    (set v16i8:$XT, (int_ppc_vsx_xxperm v16i8:$XA, v16i8:$XB))
    (set v16i8:$XT, (int_ppc_vsx_xxpermr v16i8:$XA, v16i8:$XB))

- Vector Splat Immediate Byte: xxspltib
  . Similar to XXSPLTW:
      def XXSPLTW : XX2Form_2<60, 164,
                           (outs vsrc:$XT), (ins vsrc:$XB, u2imm:$UIM),
                           "xxspltw $XT, $XB, $UIM", IIC_VecPerm, []>;

  . No SDAG, intrinsic, builtin are required?

- Load/Store Vector: lxv stxv
  . Has likely SDAG match:
    (set v?:$XT, (load ix16addr:$src))
    (set v?:$XT, (store ix16addr:$dst))

  . Need to define ix16addr in PPCInstrInfo.td
    ix16addr: 16-byte aligned, see "def memrix16" in PPCInstrInfo.td

- Load/Store Vector Indexed: lxvx stxvx
  . Has likely SDAG match:
    (set v?:$XT, (load xoaddr:$src))
    (set v?:$XT, (store xoaddr:$dst))

- Load/Store DWord: lxsd stxsd
  . Similar to lxsdx/stxsdx:
      def LXSDX : XX1Form<31, 588,
                      (outs vsfrc:$XT), (ins memrr:$src),
                      "lxsdx $XT, $src", IIC_LdStLFD,
                      [(set f64:$XT, (load xoaddr:$src))]>;

  . (set f64:$XT, (load ixaddr:$src))
    (set f64:$XT, (store ixaddr:$dst))

- Load/Store SP, with conversion from/to DP: lxssp stxssp
  . Similar to lxsspx/stxsspx:
      def LXSSPX : XX1Form<31, 524, (outs vssrc:$XT), (ins memrr:$src),
                      "lxsspx $XT, $src", IIC_LdStLFD,
                      [(set f32:$XT, (load xoaddr:$src))]>;

  . (set f32:$XT, (load ixaddr:$src))
    (set f32:$XT, (store ixaddr:$dst))

- Load as Integer Byte/Halfword & Zero Indexed: lxsibzx lxsihzx
  . Similar to lxsiwzx:
      def LXSIWZX : XX1Form<31, 12, (outs vsfrc:$XT), (ins memrr:$src),
                      "lxsiwzx $XT, $src", IIC_LdStLFD,
                      [(set f64:$XT, (PPClfiwzx xoaddr:$src))]>;

  . (set f64:$XT, (PPClfiwzx xoaddr:$src))

- Store as Integer Byte/Halfword Indexed: stxsibx stxsihx
  . Similar to stxsiwx:
      def STXSIWX : XX1Form<31, 140, (outs), (ins vsfrc:$XT, memrr:$dst),
                      "stxsiwx $XT, $dst", IIC_LdStSTFD,
                      [(PPCstfiwx f64:$XT, xoaddr:$dst)]>;

  . (PPCstfiwx f64:$XT, xoaddr:$dst)

- Load Vector Halfword*8/Byte*16 Indexed: lxvh8x lxvb16x
  . Similar to lxvd2x/lxvw4x:
      def LXVD2X : XX1Form<31, 844,
                      (outs vsrc:$XT), (ins memrr:$src),
                      "lxvd2x $XT, $src", IIC_LdStLFD,
                      [(set v2f64:$XT, (int_ppc_vsx_lxvd2x xoaddr:$src))]>;

  . (set v8i16:$XT, (int_ppc_vsx_lxvh8x xoaddr:$src))
    (set v16i8:$XT, (int_ppc_vsx_lxvb16x xoaddr:$src))

- Store Vector Halfword*8/Byte*16 Indexed: stxvh8x stxvb16x
  . Similar to stxvd2x/stxvw4x:
      def STXVD2X : XX1Form<31, 972,
                      (outs), (ins vsrc:$XT, memrr:$dst),
                      "stxvd2x $XT, $dst", IIC_LdStSTFD,
                      [(store v2f64:$XT, xoaddr:$dst)]>;

  . (store v8i16:$XT, xoaddr:$dst)
    (store v16i8:$XT, xoaddr:$dst)

- Load/Store Vector (Left-justified) with Length: lxvl lxvll stxvl stxvll
  . Likely needs an intrinsic
  . (set v?:$XT, (int_ppc_vsx_lxvl xoaddr:$src))
    (set v?:$XT, (int_ppc_vsx_lxvll xoaddr:$src))

  . (int_ppc_vsx_stxvl xoaddr:$dst)
    (int_ppc_vsx_stxvll xoaddr:$dst)

- Load Vector Word & Splat Indexed: lxvwsx
  . Likely needs an intrinsic
  . (set v?:$XT, (int_ppc_vsx_lxvwsx xoaddr:$src))

Atomic operations (l[dw]at, st[dw]at):
- Provide custom lowering for common atomic operations to use these
  instructions with the correct Function Code
- Ensure the operands are in the correct register (i.e. RT+1, RT+2)
- Provide builtins since not all FC's necessarily have an existing LLVM
  atomic operation

Load Doubleword Monitored (ldmx):
- Investigate whether there are any uses for this.  It seems to be related to
  Garbage Collection so it isn't likely to be all that useful for most
  languages we deal with.

Move to CR from XER Extended (mcrxrx):
- Is there a use for this in LLVM?

Fixed Point Facility:

- Copy-Paste Facility: copy copy_first cp_abort paste paste. paste_last
  . Use intrinsics:
    (int_ppc_copy_first i32:$rA, i32:$rB)
    (int_ppc_copy i32:$rA, i32:$rB)

    (int_ppc_paste i32:$rA, i32:$rB)
    (int_ppc_paste_last i32:$rA, i32:$rB)

    (int_cp_abort)

- Message Synchronize: msgsync
- SLB*: slbieg slbsync
- stop
  . No intrinsics
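
As referenced from the cmprb/cmpeqb item in the first Fixed Point Facility
section above, a C sketch of the byte-range test such a builtin could cover
(plain C is shown since no builtin is defined yet; mapping this pattern to a
single cmprb is the suggestion of that item, not something the backend does
today):

int is_ascii_digit(unsigned char c) {
  return c >= '0' && c <= '9';   /* one range compare, 0x30..0x39 */
}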