# XLA Architecture

<div style="width:50%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:50%" src="./images/xlalogo.png">
</div>

## Why did we build XLA?

We had several objectives for XLA to work with TensorFlow:

*   *Improve execution speed.* Compile subgraphs to reduce the execution time
    of short-lived Ops, eliminating overhead from the TensorFlow runtime; fuse
    pipelined operations to reduce memory overhead; and specialize to known
    tensor shapes to allow more aggressive constant propagation (see the JIT
    sketch after this list).

*   *Improve memory usage.* Analyze and schedule memory usage, in principle
    eliminating many intermediate storage buffers.

*   *Reduce reliance on custom Ops.* Remove the need for many custom Ops by
    improving the performance of automatically fused low-level Ops to match the
    performance of custom Ops that were fused by hand.

*   *Reduce mobile footprint.* Eliminate the TensorFlow runtime by
    ahead-of-time compiling the subgraph and emitting an object/header file
    pair that can be linked directly into another application. This can reduce
    the footprint for mobile inference by several orders of magnitude.

*   *Improve portability.* Make it relatively easy to write a new backend for
    novel hardware, at which point a large fraction of TensorFlow programs will
    run unmodified on that hardware. This is in contrast with the approach of
    specializing individual monolithic Ops for new hardware, which requires
    TensorFlow programs to be rewritten to make use of those Ops.
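
As a concrete illustration of the first objective, the snippet below is a
minimal sketch of requesting XLA JIT compilation from TensorFlow. It assumes a
TensorFlow build with XLA enabled; the function name `dense_relu` and the
shapes used are purely illustrative.

```python
import tensorflow as tf

# jit_compile=True asks TensorFlow to compile this function's graph with XLA
# rather than executing it op by op through the TensorFlow runtime.
@tf.function(jit_compile=True)
def dense_relu(x, w, b):
  return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])

# The first call triggers XLA compilation for these shapes; later calls with
# the same shapes reuse the compiled executable.
y = dense_relu(x, w, b)
```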

## How does XLA work?

The input language to XLA is called "HLO IR", or just HLO (High Level
Operations). The semantics of HLO are described on the
[Operation Semantics](./operation_semantics.md) page. It is most convenient to
think of HLO as a
[compiler IR](https://en.wikipedia.org/wiki/Intermediate_representation).
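
To make HLO tangible, the sketch below asks TensorFlow for the HLO text that
XLA receives for a small function. It assumes a recent TensorFlow with XLA
support; `axpy` is just an illustrative name.

```python
import tensorflow as tf

@tf.function(jit_compile=True)
def axpy(a, x, y):
  return a * x + y

a = tf.constant(2.0)
x = tf.random.normal([4])
y = tf.random.normal([4])

# experimental_get_compiler_ir returns a callable that renders the computation
# at several stages; "hlo" is the HLO handed to XLA before optimization.
print(axpy.experimental_get_compiler_ir(a, x, y)(stage="hlo"))
```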

XLA takes graphs ("computations") defined in HLO and compiles them into machine
instructions for various architectures. XLA is modular in the sense that it is
easy to slot in an alternative backend to
[target some novel HW architecture](./developing_new_backend.md). The CPU
backend for x64 and ARM64, as well as the NVIDIA GPU backend, are in the
TensorFlow source tree.

The following diagram shows the compilation process in XLA:

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img src="./images/how-does-xla-work.png">
</div>

XLA comes with several optimizations and analysis passes that are
target-independent, such as
[CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination),
target-independent operation fusion, and buffer analysis for allocating runtime
memory for the computation.
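
One way to see these passes at work is to compare the HLO before and after
optimization. The sketch below does this through TensorFlow; the exact output
depends on the XLA version and target, and the fusion shown is typical rather
than guaranteed. `scaled_shift` is an illustrative name.

```python
import tensorflow as tf

@tf.function(jit_compile=True)
def scaled_shift(x):
  # Two elementwise ops that fusion passes commonly merge into a single kernel.
  return (x * 3.0) + 1.0

x = tf.random.normal([1024])

# HLO as handed to XLA: separate multiply and add instructions.
print(scaled_shift.experimental_get_compiler_ir(x)(stage="hlo"))
# After optimization passes: typically a single fusion instruction.
print(scaled_shift.experimental_get_compiler_ir(x)(stage="optimized_hlo"))
```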

After the target-independent step, XLA sends the HLO computation to a backend.
The backend can perform further HLO-level optimizations, this time with
target-specific information and needs in mind. For example, the XLA GPU backend
may perform operation fusion that is beneficial specifically for the GPU
programming model and determine how to partition the computation into streams.
At this stage, backends may also pattern-match certain operations or
combinations thereof to optimized library calls.
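
This library-call pattern matching can also be observed in the optimized HLO.
In the sketch below, on an NVIDIA GPU build the dot operation typically appears
as a custom call into a vendor GEMM library, while on CPU it may remain a plain
dot; this is illustrative behavior, not a guarantee, and `matmul` is just an
example name.

```python
import tensorflow as tf

@tf.function(jit_compile=True)
def matmul(a, b):
  return tf.matmul(a, b)

a = tf.random.normal([256, 256])
b = tf.random.normal([256, 256])

# Inspect the HLO after the backend's target-specific passes have run.
print(matmul.experimental_get_compiler_ir(a, b)(stage="optimized_hlo"))
```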

The next step is target-specific code generation. The CPU and GPU backends
included with XLA use [LLVM](http://llvm.org) for low-level IR, optimization,
and code generation. These backends emit the LLVM IR necessary to represent the
XLA HLO computation in an efficient manner, and then invoke LLVM to emit native
code from this LLVM IR.
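
To inspect what the backends emit, XLA's debug flags can dump compilation
artifacts to disk. The sketch below sets the `--xla_dump_to` flag through the
`XLA_FLAGS` environment variable; the exact set of dumped files (HLO text,
LLVM IR, PTX on NVIDIA GPUs) varies by backend and build, and the dump path is
just an example.

```python
import os

# XLA_FLAGS must be set before XLA is initialized, i.e. before importing
# TensorFlow in this process.
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/xla_dump"

import tensorflow as tf

@tf.function(jit_compile=True)
def square(x):
  return x * x

square(tf.random.normal([16]))
# /tmp/xla_dump now typically contains the HLO modules and, depending on the
# backend, lower-level artifacts such as LLVM IR.
```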

The GPU backend currently supports NVIDIA GPUs via the LLVM NVPTX backend; the
CPU backend supports multiple CPU ISAs.
78