# XLA Architecture

<div style="width:50%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:50%" src="./images/xlalogo.png">
</div>

## Why did we build XLA?

We had several objectives for XLA to work with TensorFlow:

* *Improve execution speed.* Compile subgraphs to reduce the execution time of
  short-lived Ops and eliminate overhead from the TensorFlow runtime, fuse
  pipelined operations to reduce memory overhead, and specialize to known
  tensor shapes to allow for more aggressive constant propagation.

* *Improve memory usage.* Analyze and schedule memory usage, in principle
  eliminating many intermediate storage buffers.

* *Reduce reliance on custom Ops.* Remove the need for many custom Ops by
  improving the performance of automatically fused low-level Ops to match the
  performance of custom Ops that were fused by hand.

* *Reduce mobile footprint.* Eliminate the TensorFlow runtime by ahead-of-time
  compiling the subgraph and emitting an object/header file pair that can be
  linked directly into another application. The results can reduce the
  footprint for mobile inference by several orders of magnitude.

* *Improve portability.* Make it relatively easy to write a new backend for
  novel hardware, at which point a large fraction of TensorFlow programs will
  run unmodified on that hardware. This is in contrast with the approach of
  specializing individual monolithic Ops for new hardware, which requires
  TensorFlow programs to be rewritten to make use of those Ops.

## How does XLA work?

The input language to XLA is called "HLO IR", or just HLO (High Level
Operations). The semantics of HLO are described on the
[Operation Semantics](./operation_semantics.md) page. It is most convenient to
think of HLO as a
[compiler IR](https://en.wikipedia.org/wiki/Intermediate_representation).

XLA takes graphs ("computations") defined in HLO and compiles them into machine
instructions for various architectures. XLA is modular in the sense that it is
easy to slot in an alternative backend to
[target some novel HW architecture](./developing_new_backend.md). The CPU
backend for x64 and ARM64, as well as the NVIDIA GPU backend, are in the
TensorFlow source tree.

The following diagram shows the compilation process in XLA:

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img src="./images/how-does-xla-work.png">
</div>

XLA comes with several optimizations and analysis passes that are
target-independent, such as
[CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination),
target-independent operation fusion, and buffer analysis for allocating runtime
memory for the computation.

After the target-independent step, XLA sends the HLO computation to a backend.
The backend can perform further HLO-level optimizations, this time with
target-specific information and needs in mind. For example, the XLA GPU backend
may perform operation fusion that is beneficial specifically for the GPU
programming model, and determine how to partition the computation into streams.
At this stage, backends may also pattern-match certain operations or
combinations thereof to optimized library calls.
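One convenient way to observe these stages is through a frontend that lowers
to XLA, such as JAX. The sketch below is illustrative only: the function `f`
and the shapes are arbitrary examples, and the exact textual form of the
printed HLO varies by version. Comparing the computation as handed to XLA with
the compiled result typically shows the element-wise operations fused together.

```python
# Illustrative sketch using JAX (which compiles through XLA); `f` and the
# shapes below are arbitrary examples, not part of XLA itself.
import jax
import jax.numpy as jnp

def f(x, w, b):
    # A matmul followed by element-wise ops: a typical fusion candidate.
    return jax.nn.relu(x @ w + b)

x = jnp.ones((8, 128), dtype=jnp.float32)
w = jnp.ones((128, 64), dtype=jnp.float32)
b = jnp.ones((64,), dtype=jnp.float32)

lowered = jax.jit(f).lower(x, w, b)
print(lowered.as_text())            # The computation as handed to XLA.
print(lowered.compile().as_text())  # The HLO after XLA's optimization passes.
```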
The next step is target-specific code generation. The CPU and GPU backends
included with XLA use [LLVM](http://llvm.org) for low-level IR, optimization,
and code generation. These backends emit the LLVM IR necessary to represent the
XLA HLO computation in an efficient manner, and then invoke LLVM to emit native
code from this LLVM IR.

The GPU backend currently supports NVIDIA GPUs via the LLVM NVPTX backend; the
CPU backend supports multiple CPU ISAs.
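For debugging, XLA can also dump the intermediate artifacts it produces along
this pipeline. As a hedged example (the available dump flags and the files they
produce differ between XLA releases), setting `--xla_dump_to` in the
`XLA_FLAGS` environment variable before the first compilation asks XLA to write
the HLO modules for each compiled computation, before and after optimization,
into the given directory:

```python
# Hedged sketch: --xla_dump_to is a debugging option passed via XLA_FLAGS.
# The set of dump flags and output file names differ between XLA versions.
import os
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/xla_dump"

import jax
import jax.numpy as jnp

# Any jit-compiled computation now leaves its HLO modules in /tmp/xla_dump
# for inspection.
print(jax.jit(lambda x: jnp.sin(x) * 2.0)(jnp.arange(4.0)))
```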