1Design of the Subzero fast code generator
2=========================================
3
4Introduction
5------------
6
7The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
8compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
9PNaCl toolchain to compile their application to architecture-neutral PNaCl
10bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
11possible.  The ``.pexe`` file is downloaded to the user's browser where the
12PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
13`sandboxed
14<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
15native code.  The translator uses architecture-specific optimizations as much as
16practical to generate good native code.
17
18The native code can be cached by the browser to avoid repeating translation on
19future page loads.  However, first-time user experience is hampered by long
20translation times.  The LLVM-based PNaCl translator is pretty slow, even when
21using ``-O0`` to minimize optimizations, so delays are especially noticeable on
22slow browser platforms such as ARM-based Chromebooks.
23
24Translator slowness can be mitigated or hidden in a number of ways.
25
26- Parallel translation.  However, slow machines where this matters most, e.g.
27  ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
28  and are likely to less memory available for multiple translation threads to
29  use.
30
31- Streaming translation, i.e. start translating as soon as the download starts.
32  This doesn't help much when translation speed is 10× slower than download
33  speed, or the ``.pexe`` file is already cached while the translated binary was
34  flushed from the cache.
35
36- Arrange the web page such that translation is done in parallel with
37  downloading large assets.
38
39- Arrange the web page to distract the user with `cat videos
40  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
41  progress.
42
43Or, improve translator performance to something more reasonable.
44
45This document describes Subzero's attempt to improve translation speed by an
46order of magnitude while rivaling LLVM's code quality.  Subzero does this
47through minimal IR layering, lean data structures and passes, and a careful
48selection of fast optimization passes.  It has two optimization recipes: full
49optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
50following (described in more detail below):
51
52+----------------------------------------+-----------------------------+
53| O2 recipe                              | Om1 recipe                  |
54+========================================+=============================+
55| Parse .pexe file                       | Parse .pexe file            |
56+----------------------------------------+-----------------------------+
57| Loop nest analysis                     |                             |
58+----------------------------------------+-----------------------------+
59| Local common subexpression elimination |                             |
60+----------------------------------------+-----------------------------+
61| Address mode inference                 |                             |
62+----------------------------------------+-----------------------------+
63| Read-modify-write (RMW) transform      |                             |
64+----------------------------------------+-----------------------------+
65| Basic liveness analysis                |                             |
66+----------------------------------------+-----------------------------+
67| Load optimization                      |                             |
68+----------------------------------------+-----------------------------+
69|                                        | Phi lowering (simple)       |
70+----------------------------------------+-----------------------------+
71| Target lowering                        | Target lowering             |
72+----------------------------------------+-----------------------------+
73| Full liveness analysis                 |                             |
74+----------------------------------------+-----------------------------+
75| Register allocation                    | Minimal register allocation |
76+----------------------------------------+-----------------------------+
77| Phi lowering (advanced)                |                             |
78+----------------------------------------+-----------------------------+
79| Post-phi register allocation           |                             |
80+----------------------------------------+-----------------------------+
81| Branch optimization                    |                             |
82+----------------------------------------+-----------------------------+
83| Code emission                          | Code emission               |
84+----------------------------------------+-----------------------------+
85
86Goals
87=====
88
89Translation speed
90-----------------
91
92We'd like to be able to translate a ``.pexe`` file as fast as download speed.
93Any faster is in a sense wasted effort.  Download speed varies greatly, but
94we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
95slow machine.  We observe a 3× single-thread performance difference between A15
96and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
97``.pexe`` file could be compressed to 50% on the web server using gzip transport
98compression, so we set the translation speed goal to 6 MB/sec on the high-end
99Xeon workstation.
100
101Currently, at the ``-O0`` level, the LLVM-based PNaCl translation translates at
102⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
103
104In other words, Subzero's goal is to improve over LLVM's translation speed by
10510×.
106
107Code quality
108------------
109
110Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
111code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
112average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.
113
114It's important to note that the quality of Subzero-generated code depends on
115target-neutral optimizations and simplifications being run beforehand in the
116developer environment.  The ``.pexe`` file reflects these optimizations.  For
117example, Subzero assumes that the basic blocks are ordered topologically where
118possible (which makes liveness analysis converge fastest), and Subzero does not
119do any function inlining because it should already have been done.
120
121Translator size
122---------------
123
124The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
125We think 1 MB is a more reasonable size -- especially for such a component that
126is distributed to a billion Chrome users.  Thus we target a 10× reduction in
127binary size.
128
129For development, Subzero can be built for all target architectures, and all
130debugging and diagnostic options enabled.  For a smaller translator, we restrict
131to a single target architecture, and define a ``MINIMAL`` build where
132unnecessary features are compiled out.
133
134Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
135include directories, which have little impact on translator size.  It also uses
136some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
137with little size impact.  In non-``MINIMAL`` builds, the translator size is much
138larger due to including code for parsing text-format bitcode files and forming
139LLVM IR.
140
141Memory footprint
142----------------
143
144The current LLVM-based translator suffers from an issue in which some
145function-specific data has to be retained in memory until all translation
146completes, and therefore the memory footprint grows without bound.  Large
147``.pexe`` files can lead to the translator process holding hundreds of MB of
148memory by the end.  The translator runs in a separate process, so this memory
149growth doesn't *directly* affect other processes, but it does dirty the physical
150memory and contributes to a perception of bloat and sometimes a reality of
151out-of-memory tab killing, especially noticeable on weaker systems.
152
153Subzero should maintain a stable memory footprint throughout translation.  It's
154not really practical to set a specific limit, because there is not really a
155practical limit on a single function's size, but the footprint should be
156"reasonable" and be proportional to the largest input function size, not the
157total ``.pexe`` file size.  Simply put, Subzero should not have memory leaks or
158inexorable memory growth.  (We use ASAN builds to test for leaks.)
159
160Multithreaded translation
161-------------------------
162
163It should be practical to translate different functions concurrently and see
164good scalability.  Some locking may be needed, such as accessing output buffers
165or constant pools, but that should be fairly minimal.  In contrast, LLVM was
166only designed for module-level parallelism, and as such, the PNaCl translator
167internally splits a ``.pexe`` file into several modules for concurrent
168translation.  All output needs to be deterministic regardless of the level of
169multithreading, i.e. functions and data should always be output in the same
170order.
171
172Target architectures
173--------------------
174
175Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
176targets include ARM64 and MIPS64, though these targets lack NaCl support
177including a sandbox model or a validator.
178
179The first implementation is for x86-32, because it was expected to be
180particularly challenging, and thus more likely to draw out any design problems
181early:
182
183- There are a number of special cases, asymmetries, and warts in the x86
184  instruction set.
185
186- Complex addressing modes may be leveraged for better code quality.
187
188- 64-bit integer operations have to be lowered into longer sequences of 32-bit
189  operations.
190
191- Paucity of physical registers may reveal code quality issues early in the
192  design.
193
194Detailed design
195===============
196
197Intermediate representation - ICE
198---------------------------------
199
200Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
201IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
202representation of global variables and initializers, and a set of functions.
203Each function contains a list of basic blocks, and each basic block constains a
204list of instructions.  Instructions that operate on stack and register variables
205do so using static single assignment (SSA) form.
206
207The ``.pexe`` file is translated one function at a time (or in parallel by
208multiple translation threads).  The recipe for optimization passes depends on
209the specific target and optimization level, and is described in detail below.
210Global variables (types and initializers) are simply and directly translated to
211object code, without any meaningful attempts at optimization.
212
213A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
214Its key contents include:
215
216- A list of ``CfgNode`` pointers, generally held in topological order.
217
218- A list of ``Variable`` pointers corresponding to local variables used in the
219  function plus compiler-generated temporaries.
220
221A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
222include:
223
224- A linear list of instructions, in the same style as LLVM.  The last
225  instruction of the list is always a terminator instruction: branch, switch,
226  return, unreachable.
227
228- A list of Phi instructions, also in the same style as LLVM.  They are held as
229  a linear list for convenience, though per Phi semantics, they are executed "in
230  parallel" without dependencies on each other.
231
232- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
233  another list for outgoing edges.
234
235- The node's unique, 0-based index into the CFG's node list.
236
237An instruction is represented by the ``Ice::Inst`` class.  Its key contents
238include:
239
240- A list of source operands.
241
242- Its destination variable, if the instruction produces a result in an
243  ``Ice::Variable``.
244
245- A bitvector indicating which variables' live ranges this instruction ends.
246  This is computed during liveness analysis.
247
248Instructions kinds are divided into high-level ICE instructions and low-level
249ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
250instruction kinds.  Each target architecture implementation extends the
251instruction space with its own set of low-level instructions.  Generally,
252low-level instructions correspond to individual machine instructions.  The
253high-level ICE instruction space includes a few additional instruction kinds
254that are not part of LLVM but are generally useful (e.g., an Assignment
255instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
256instructions for sandboxing).
257
258Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
259ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
260<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
261the following:
262
263- Alloca: allocate data on the stack
264
265- Arithmetic: binary operations of the form ``A = B op C``
266
267- Br: conditional or unconditional branch
268
269- Call: function call
270
271- Cast: unary type-conversion operations
272
273- ExtractElement: extract a scalar element from a vector-type value
274
275- Fcmp: floating-point comparison
276
277- Icmp: integer comparison
278
279- IntrinsicCall: call a known intrinsic
280
281- InsertElement: insert a scalar element into a vector-type value
282
283- Load: load a value from memory
284
285- Phi: implement the SSA phi node
286
287- Ret: return from the function
288
289- Select: essentially the C language operation of the form ``X = C ? Y : Z``
290
291- Store: store a value into memory
292
293- Switch: generalized branch to multiple possible locations
294
295- Unreachable: indicate that this portion of the code is unreachable
296
297The additional high-level ICE instructions are the following:
298
299- Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
300  instructions to non-SSA assignments, before lowering to machine code.
301
302- BundleLock, BundleUnlock.  These are markers used for sandboxing, but are
303  common across all targets and so they are elevated to the high-level
304  instruction set.
305
306- FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
307  liveness analysis, elevated to the high-level because they are used by all
308  targets.  They are described in more detail at the end of this section.
309
310- JumpTable: this represents the result of switch optimization analysis, where
311  some switch instructions may use jump tables instead of cascading
312  compare/branches.
313
314An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
315operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
316include scalar integer constants, scalar floating point constants, Undef (an
317unspecified constant of a particular scalar or vector type), and symbol
318constants (essentially addresses of globals).  Note that the PNaCl ABI does not
319include vector-type constants besides Undef, and as such, Subzero (so far) has
320no reason to represent vector-type constants internally.  A variable represents
321a value allocated on the stack (though not including alloca-derived storage).
322Among other things, a variable holds its unique, 0-based index into the CFG's
323variable list.
324
325Each target can extend the ``Constant`` and ``Variable`` classes for its own
326needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
327x86 ``MemOperand`` that encodes a base register, an index register, an index
328register shift amount, and a constant offset.
329
330Register allocation and liveness analysis are restricted to Variable operands.
331Because of the importance of register allocation to code quality, and the
332translation-time cost of liveness analysis, Variable operands get some special
333treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
334across all the Variables of an instruction.  An instruction holds a list of
335operands, but an operand may contain 0, 1, or more Variables.  As such, the
336``Operand`` class specially holds a list of Variables contained within, for
337quick access.
338
339A Subzero transformation pass may work by deleting an existing instruction and
340replacing it with zero or more new instructions.  Instead of actually deleting
341the existing instruction, we generally mark it as deleted and insert the new
342instructions right after the deleted instruction.  When printing the IR for
343debugging, this is a big help because it makes it much more clear how the
344non-deleted instructions came about.
345
346Subzero has a few special instructions to help with liveness analysis
347consistency.
348
349- The FakeDef instruction gives a fake definition of some variable.  For
350  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
351  but an ICE instruction can represent only one destination variable.  This is
352  similar for multiply instructions, and for function calls that return a 64-bit
353  integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
354  trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
355
356- The FakeUse instruction registers a use of a variable, typically to prevent an
357  earlier assignment to that variable from being dead-code eliminated.  For
358  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
359  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
360  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
361  dead-code eliminate the first instruction.
362
363- The FakeKill instruction is added after a call instruction, and is a quick way
364  of indicating that caller-save registers are invalidated.
365
366Pexe parsing
367------------
368
369Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
370parses the ``.pexe`` file function by function, ultimately constructing an ICE
371CFG for each function.  After a function is parsed, its CFG is handed off to the
372translation phase.  The bitcode parser also parses global initializer data and
373hands it off to be translated to data sections in the object file.
374
375Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
376be used to parse a module into LLVM IR (though very slowly relative to Subzero
377native parsing).  Then we iterate across the LLVM IR and construct high-level
378ICE, handing off each CFG to the translation phase.
379
380Overview of lowering
381--------------------
382
383In general, translation goes like this:
384
385- Parse the next function from the ``.pexe`` file and construct a CFG consisting
386  of high-level ICE.
387
388- Do analysis passes and transformation passes on the high-level ICE, as
389  desired.
390
391- Lower each high-level ICE instruction into a sequence of zero or more
392  low-level ICE instructions.  Each high-level instruction is generally lowered
393  independently, though the target lowering is allowed to look ahead in the
394  CfgNode's instruction list if desired.
395
396- Do more analysis and transformation passes on the low-level ICE, as desired.
397
398- Assemble the low-level CFG into an ELF object file (alternatively, a textual
399  assembly file that is later assembled by some external tool).
400
401- Repeat for all functions, and also produce object code for data such as global
402  initializers and internal constant pools.
403
404Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
405the intention is to apply all available optimizations to get the best code
406quality (though the initial code quality goal is measured against LLVM's ``O0``
407code quality).  For ``Om1``, the intention is to apply as few optimizations as
408possible and produce code as quickly as possible, accepting poor code quality.
409``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
410"sub-zero".
411
412High-level debuggability of generated code is so far not a design requirement.
413Subzero doesn't really do transformations that would obfuscate debugging; the
414main thing might be that register allocation (including stack slot coalescing
415for stack-allocated variables whose live ranges don't overlap) may render a
416variable's value unobtainable after its live range ends.  This would not be an
417issue for ``Om1`` since it doesn't register-allocate program-level variables,
418nor does it coalesce stack slots.  That said, fully supporting debuggability
419would require a few additions:
420
421- DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
422  propagates global symbol names, local variable names, and function-internal
423  label names that are present in the ``.pexe`` file.  This would allow a
424  debugger to map addresses back to symbols in the ``.pexe`` file.
425
426- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
427  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
428  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
429  `instrinsics<http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.
430
431- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
432  the toolchain would need to be modified to preserve it.
433
434Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
435produces code with one third the code quality.  ``Om1`` is good for testing and
436debugging -- during translation, it tends to expose errors in the basic lowering
437that might otherwise have been hidden by the register allocator or other
438optimization passes.  It also helps determine whether a code correctness problem
439is a fundamental problem in the basic lowering, or an error in another
440optimization pass.
441
442The implementation of target lowering also controls the recipe of passes used
443for ``Om1`` and ``O2`` translation.  For example, address mode inference may
444only be relevant for x86.
445
446Lowering strategy
447-----------------
448
449The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
450each high-level instruction down to a sequence of low-level target-specific
451instructions, in a largely context-free setting.  That is, each high-level
452instruction conceptually has a simple template expansion into low-level
453instructions, and lowering can in theory be done in any order.  This may sound
454like a small effort, but quite a large number of templates may be needed because
455of the number of PNaCl types and instruction variants.  Furthermore, there may
456be optimized templates, e.g. to take advantage of operator commutativity (for
457example, ``x=x+1`` might allow a bettern lowering than ``x=1+x``).  This is
458similar to other template-based approaches in fast code generation or
459interpretation, though some decisions are deferred until after some global
460analysis passes, mostly related to register allocation, stack slot assignment,
461and specific choice of instruction variant and addressing mode.
462
463The key idea for a lowering template is to produce valid low-level instructions
464that are guaranteed to meet address mode and other structural requirements of
465the instruction set.  For example, on x86, the source operand of an integer
466store instruction must be an immediate or a physical register; a shift
467instruction's shift amount must be an immediate or in register ``%cl``; a
468function's integer return value is in ``%eax``; most x86 instructions are
469two-operand, in contrast to corresponding three-operand high-level instructions;
470etc.
471
472Because target lowering runs before register allocation, there is no way to know
473whether a given ``Ice::Variable`` operand lives on the stack or in a physical
474register.  When the low-level instruction calls for a physical register operand,
475the target lowering can create an infinite-weight Variable.  This tells the
476register allocator to assign infinite weight when making decisions, effectively
477guaranteeing some physical register.  Variables can also be pre-colored to a
478specific physical register (``cl`` in the shift example above), which also gives
479infinite weight.
480
481To illustrate, consider a high-level arithmetic instruction on 32-bit integer
482operands::
483
484    A = B + C
485
486X86 target lowering might produce the following::
487
488    T.inf = B  // mov instruction
489    T.inf += C // add instruction
490    A = T.inf  // mov instruction
491
492Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
493physical register, the three lowered instructions are all encodable regardless
494of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
495whether ``A`` is a physical register or in memory.
496
497In this example, ``A`` must be a Variable and one may be tempted to simplify the
498lowering sequence by setting ``A`` as infinite-weight and using::
499
500        A = B  // mov instruction
501        A += C // add instruction
502
503This has two problems.  First, if the original instruction was actually ``A =
504B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
505register applies throughout ``A``'s entire live range.  This is probably not
506what is intended, and may ultimately lead to a failure to allocate a register
507for an infinite-weight variable.
508
509This style of lowering leads to many temporaries being generated, so in ``O2``
510mode, we rely on the register allocator to clean things up.  For example, in the
511example above, if ``B`` ends up getting a physical register and its live range
512ends at this instruction, the register allocator is likely to reuse that
513register for ``T.inf``.  This leads to ``T.inf=B`` being a redundant register
514copy, which is removed as an emission-time peephole optimization.
515
516O2 lowering
517-----------
518
519Currently, the ``O2`` lowering recipe is the following:
520
521- Loop nest analysis
522
523- Local common subexpression elimination
524
525- Address mode inference
526
527- Read-modify-write (RMW) transformation
528
529- Basic liveness analysis
530
531- Load optimization
532
533- Target lowering
534
535- Full liveness analysis
536
537- Register allocation
538
539- Phi instruction lowering (advanced)
540
541- Post-phi lowering register allocation
542
543- Branch optimization
544
545These passes are described in more detail below.
546
547Om1 lowering
548------------
549
550Currently, the ``Om1`` lowering recipe is the following:
551
552- Phi instruction lowering (simple)
553
554- Target lowering
555
556- Register allocation (infinite-weight and pre-colored only)
557
558Optimization passes
559-------------------
560
561Liveness analysis
562^^^^^^^^^^^^^^^^^
563
564Liveness analysis is a standard dataflow optimization, implemented as follows.
565For each node (basic block), its live-out set is computed as the union of the
566live-in sets of its successor nodes.  Then the node's instructions are processed
567in reverse order, updating the live set, until the beginning of the node is
568reached, and the node's live-in set is recorded.  If this iteration has changed
569the node's live-in set, the node's predecessors are marked for reprocessing.
570This continues until no more nodes need reprocessing.  If nodes are processed in
571reverse topological order, the number of iterations over the CFG is generally
572equal to the maximum loop nest depth.
573
574To implement this, each node records its live-in and live-out sets, initialized
575to the empty set.  Each instruction records which of its Variables' live ranges
576end in that instruction, initialized to the empty set.  A side effect of
577liveness analysis is dead instruction elimination.  Each instruction can be
578marked as tentatively dead, and after the algorithm converges, the tentatively
579dead instructions are permanently deleted.
580
581Optionally, after this liveness analysis completes, we can do live range
582construction, in which we calculate the live range of each variable in terms of
583instruction numbers.  A live range is represented as a union of segments, where
584the segment endpoints are instruction numbers.  Instruction numbers are required
585to be unique across the CFG, and monotonically increasing within a basic block.
586As a union of segments, live ranges can contain "gaps" and are therefore
587precise.  Because of SSA properties, a variable's live range can start at most
588once in a basic block, and can end at most once in a basic block.  Liveness
589analysis keeps track of which variable/instruction tuples begin live ranges and
590end live ranges, and combined with live-in and live-out sets, we can efficiently
591build up live ranges of all variables across all basic blocks.
592
593A lot of care is taken to try to make liveness analysis fast and efficient.
594Because of the lowering strategy, the number of variables is generally
595proportional to the number of instructions, leading to an O(N^2) complexity
596algorithm if implemented naively.  To improve things based on sparsity, we note
597that most variables are "local" and referenced in at most one basic block (in
598contrast to the "global" variables with multi-block usage), and therefore cannot
599be live across basic blocks.  Therefore, the live-in and live-out sets,
600typically represented as bit vectors, can be limited to the set of global
601variables, and the intra-block liveness bit vector can be compacted to hold the
602global variables plus the local variables for that block.
603
604Register allocation
605^^^^^^^^^^^^^^^^^^^
606
607Subzero implements a simple linear-scan register allocator, based on the
608allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
609Register Allocation in the Context of SSA Form and Register Constraints
610<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  This allocator has
611several nice features:
612
613- Live ranges are represented as unions of segments, as described above, rather
614  than a single start/end tuple.
615
616- It allows pre-coloring of variables with specific physical registers.
617
618- It applies equally well to pre-lowered Phi instructions.
619
620The paper suggests an approach of aggressively coalescing variables across Phi
621instructions (i.e., trying to force Phi source and destination variables to have
622the same register assignment), but we reject that in favor of the more natural
623preference mechanism described below.
624
625We enhance the algorithm in the paper with the capability of automatic inference
626of register preference, and with the capability of allowing overlapping live
627ranges to safely share the same register in certain circumstances.  If we are
628considering register allocation for variable ``A``, and ``A`` has a single
629defining instruction ``A=B+C``, then the preferred register for ``A``, if
630available, would be the register assigned to ``B`` or ``C``, if any, provided
631that ``B`` or ``C``'s live range does not overlap ``A``'s live range.  In this
632way we infer a good register preference for ``A``.
633
634We allow overlapping live ranges to get the same register in certain cases.
635Suppose a high-level instruction like::
636
637    A = unary_op(B)
638
639has been target-lowered like::
640
641    T.inf = B
642    A = unary_op(T.inf)
643
644Further, assume that ``B``'s live range continues beyond this instruction
645sequence, and that ``B`` has already been assigned some register.  Normally, we
646might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
647turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
648different registers.  But ``T.inf`` is just a read-only copy of ``B`` that is
649guaranteed to be in a register, so in theory these overlapping live ranges could
650safely have the same register.  Our implementation allows this overlap as long
651as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
652modified within ``T.inf``'s live range.
653
654Subzero's register allocator can be run in 3 configurations.
655
656- Normal mode.  All Variables are considered for register allocation.  It
657  requires full liveness analysis and live range construction as a prerequisite.
658  This is used by ``O2`` lowering.
659
660- Minimal mode.  Only infinite-weight or pre-colored Variables are considered.
661  All other Variables are stack-allocated.  It does not require liveness
662  analysis; instead, it quickly scans the instructions and records first
663  definitions and last uses of all relevant Variables, using that to construct a
664  single-segment live range.  Although this includes most of the Variables, the
665  live ranges are mostly simple, short, and rarely overlapping, which the
666  register allocator handles efficiently.  This is used by ``Om1`` lowering.
667
668- Post-phi lowering mode.  Advanced phi lowering is done after normal-mode
669  register allocation, and may result in new infinite-weight Variables that need
670  registers.  One would like to just run something like minimal mode to assign
671  registers to the new Variables while respecting existing register allocation
672  decisions.  However, it sometimes happens that there are no free registers.
673  In this case, some register needs to be forcibly spilled to the stack and
674  temporarily reassigned to the new Variable, and reloaded at the end of the new
675  Variable's live range.  The register must be one that has no explicit
676  references during the Variable's live range.  Since Subzero currently doesn't
677  track def/use chains (though it does record the CfgNode where a Variable is
678  defined), we just do a brute-force search across the CfgNode's instruction
679  list for the instruction numbers of interest.  This situation happens very
680  rarely, so there's little point for now in improving its performance.
681
682The basic linear-scan algorithm may, as it proceeds, rescind an early register
683allocation decision, leaving that Variable to be stack-allocated.  Some of these
684times, it turns out that the Variable could have been given a different register
685without conflict, but by this time it's too late.  The literature recognizes
686this situation and describes "second-chance bin-packing", which Subzero can do.
687We can rerun the register allocator in a mode that respects existing register
688allocation decisions, and sometimes it finds new non-conflicting opportunities.
689In fact, we can repeatedly run the register allocator until convergence.
690Unfortunately, in the current implementation, these subsequent register
691allocation passes end up being extremely expensive.  This is because of the
692treatment of the "unhandled pre-colored" Variable set, which is normally very
693small but ends up being quite large on subsequent passes.  Its performance can
694probably be made acceptable with a better choice of data structures, but for now
695this second-chance mechanism is disabled.
696
697Future work is to implement LLVM's `Greedy
698<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
699register allocator as a replacement for the basic linear-scan algorithm, given
700LLVM's experience with its improvement in code quality.  (The blog post claims
701that the Greedy allocator also improved maintainability because a lot of hacks
702could be removed, but Subzero is probably not yet to that level of hacks, and is
703less likely to see that particular benefit.)
704
705Local common subexpression elimination
706^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
707
708The Local CSE implementation goes through each instruction and records a portion
709of each ``Seen`` instruction in a hashset-like container.  That portion consists
710of the entire instruction except for any dest variable. That means ``A = X + Y``
711and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows
712us to detect common subexpressions.
713
714Whenever a repetition is detected, the redundant variables are stored in a
715container mapping the replacee to the replacement. In the case above, it would
716be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``.
717
718At any point if a variable that has an entry in the replacement table is
719encountered, it is replaced with the variable it is mapped to. This ensures that
720the redundant variables will not have any uses in the basic block, allowing
721dead code elimination to clean up the redundant instruction.
722
723With SSA, the information stored is never invalidated. However, non-SSA input is
724supported with the ``-lcse=no-ssa`` option. This has to maintain some extra
725dependency information to ensure proper invalidation on variable assignment.
726This is not rigorously tested because this pass is run at an early stage where
727it is safe to assume variables have a single definition. This is not enabled by
728default because it bumps the compile time overhead from 2% to 6%.
729
730Loop-invariant code motion
731^^^^^^^^^^^^^^^^^^^^^^^^^^
732
733This pass utilizes the loop analysis information to hoist invariant instructions
734to loop pre-headers. A loop must have a single entry node (header) and that node
735must have a single external predecessor for this optimization to work, as it is
736currently implemented.
737
738The pass works by iterating over all instructions in the loop until the set of
739invariant instructions converges. In each iteration, a non-invariant instruction
740involving only constants or a variable known to be invariant is added to the
741result set. The destination variable of that instruction is added to the set of
742variables known to be invariant (which is initialized with the constant args).
743
744Improving the loop-analysis infrastructure is likely to have significant impact
745on this optimization. Inserting an extra node to act as the pre-header when the
746header has multiple incoming edges from outside could also be a good idea.
747Expanding the initial invariant variable set to contain all variables that do
748not have definitions inside the loop does not seem to improve anything.
749
750Short circuit evaluation
751^^^^^^^^^^^^^^^^^^^^^^^^
752
753Short circuit evaluation splits nodes and introduces early jumps when the result
754of a logical operation can be determined early and there are no observable side
755effects of skipping the rest of the computation. The instructions considered
756backwards from the end of the basic blocks. When a definition of a variable
757involved in a conditional jump is found, an extra jump can be inserted in that
758location, moving the rest of the instructions in the node to a newly inserted
759node. Consider this example::
760
761  __N :
762    a = <something>
763    Instruction 1 without side effect
764    ... b = <something> ...
765    Instruction N without side effect
766    t1 = or a b
767    br t1 __X __Y
768
769is transformed to::
770
771  __N :
772    a = <something>
773    br a __X __N_ext
774
775  __N_ext :
776    Instruction 1 without side effect
777    ... b = <something> ...
778    Instruction N without side effect
779    br b __X __Y
780
781The logic for AND is analogous, the only difference is that the early jump is
782facilitated by a ``false`` value instead of ``true``.
783
784Global Variable Splitting
785^^^^^^^^^^^^^^^^^^^^^^^^^
786
787Global variable splitting (``-split-global-vars``) is run after register
788allocation. It works on the variables that did not manage to get registers (but
789are allowed to) and decomposes their live ranges into the individual segments
790(which span a single node at most). New variables are created (but not yet used)
791with these smaller live ranges and the register allocator is run again. This is
792not inefficient as the old variables that already had registers are now
793considered pre-colored.
794
795The new variables that get registers replace their parent variables for their
796portion of its (parent's) live range. A copy from the old variable to the new
797is introduced before the first use and the reverse after the last def in the
798live range.
799
800Basic phi lowering
801^^^^^^^^^^^^^^^^^^
802
803The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
804implements it).  Consider this example::
805
806    L1:
807      ...
808      br L3
809    L2:
810      ...
811      br L3
812    L3:
813      A = phi [B, L1], [C, L2]
814      X = phi [Y, L1], [Z, L2]
815
816For each destination of a phi instruction, we can create a temporary and insert
817the temporary's assignment at the end of the predecessor block::
818
819    L1:
820      ...
821      A' = B
822      X' = Y
823      br L3
824    L2:
825      ...
826      A' = C
827      X' = Z
828      br L3
829    L2:
830      A = A'
831      X = X'
832
833This transformation is very simple and reliable.  It can be done before target
834lowering and register allocation, and it easily avoids the classic lost-copy and
835related problems.  ``Om1`` lowering uses this strategy.
836
837However, it has the disadvantage of initializing temporaries even for branches
838not taken, though that could be mitigated by splitting non-critical edges and
839putting assignments in the edge-split nodes.  Another problem is that without
840extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
841specific ordering even though phi semantics are that the assignments are
842parallel or unordered.  This sometimes imposes false live range overlaps and
843leads to poorer register allocation.
844
845Advanced phi lowering
846^^^^^^^^^^^^^^^^^^^^^
847
848``O2`` lowering defers phi lowering until after register allocation to avoid the
849problem of false live range overlaps.  It works as follows.  We split each
850incoming edge and move the (parallel) phi assignments into the split nodes.  We
851linearize each set of assignments by finding a safe, topological ordering of the
852assignments, respecting register assignments as well.  For example::
853
854    A = B
855    X = Y
856
857Normally these assignments could be executed in either order, but if ``B`` and
858``X`` are assigned the same physical register, we would want to use the above
859ordering.  Dependency cycles are broken by introducing a temporary.  For
860example::
861
862    A = B
863    B = A
864
865Here, a temporary breaks the cycle::
866
867    t = A
868    A = B
869    B = t
870
871Finally, we use the existing target lowering to lower the assignments in this
872basic block, and once that is done for all basic blocks, we run the post-phi
873variant of register allocation on the edge-split basic blocks.
874
875When computing a topological order, we try to first schedule assignments whose
876source has a physical register, and last schedule assignments whose destination
877has a physical register.  This helps reduce register pressure.
878
879X86 address mode inference
880^^^^^^^^^^^^^^^^^^^^^^^^^^
881
882We try to take advantage of the x86 addressing mode that includes a base
883register, an index register, an index register scale amount, and an immediate
884offset.  We do this through simple pattern matching.  Starting with a load or
885store instruction where the address is a variable, we initialize the base
886register to that variable, and look up the instruction where that variable is
887defined.  If that is an add instruction of two variables and the index register
888hasn't been set, we replace the base and index register with those two
889variables.  If instead it is an add instruction of a variable and a constant, we
890replace the base register with the variable and add the constant to the
891immediate offset.
892
893There are several more patterns that can be matched.  This pattern matching
894continues on the load or store instruction until no more matches are found.
895Because a program typically has few load and store instructions (not to be
896confused with instructions that manipulate stack variables), this address mode
897inference pass is fast.
898
899X86 read-modify-write inference
900^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
901
902A reasonably common bitcode pattern is a non-atomic update of a memory
903location::
904
905    x = load addr
906    y = add x, 1
907    store y, addr
908
909On x86, with good register allocation, the Subzero passes described above
910generate code with only this quality::
911
912    mov [%ebx], %eax
913    add $1, %eax
914    mov %eax, [%ebx]
915
916However, x86 allows for this kind of code::
917
918    add $1, [%ebx]
919
920which requires fewer instructions, but perhaps more importantly, requires fewer
921physical registers.
922
923It's also important to note that this transformation only makes sense if the
924store instruction ends ``x``'s live range.
925
926Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
927opportunities via simple pattern matching.  The only problem is that it is run
928before liveness analysis, which is needed to determine whether ``x``'s live
929range ends after the RMW.  Since liveness analysis is one of the most expensive
930passes, it's not attractive to run it an extra time just for RMW analysis.
931Instead, we essentially generate both the RMW and the non-RMW versions, and then
932during lowering, the RMW version deletes itself if it finds x still live.
933
934X86 compare-branch inference
935^^^^^^^^^^^^^^^^^^^^^^^^^^^^
936
937In the LLVM instruction set, the compare/branch pattern works like this::
938
939    cond = icmp eq a, b
940    br cond, target
941
942The result of the icmp instruction is a single bit, and a conditional branch
943tests that bit.  By contrast, most target architectures use this pattern::
944
945    cmp a, b  // implicitly sets various bits of FLAGS register
946    br eq, target  // branch on a particular FLAGS bit
947
948A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
949``cond`` and conditionally branches.  Subzero has a pass that identifies
950boolean-based operations like this and folds them into a single
951compare/branch-like operation.  It is set up for more than just cmp/br though.
952Boolean producers include icmp (integer compare), fcmp (floating-point compare),
953and trunc (integer truncation when the destination has bool type).  Boolean
954consumers include branch, select (the ternary operator from the C language), and
955sign-extend and zero-extend when the source has bool type.
956
957Sandboxing
958^^^^^^^^^^
959
960Native Client's sandbox model uses software fault isolation (SFI) to provide
961safety when running untrusted code in a browser or other environment.  Subzero
962implements Native Client's `sandboxing
963<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
964to enable Subzero-translated executables to be run inside Chrome.  Subzero also
965provides a fairly simple framework for investigating alternative sandbox models
966or other restrictions on the sandbox model.
967
968Sandboxing in Subzero is not actually implemented as a separate pass, but is
969integrated into lowering and assembly.
970
971- Indirect branches, including the ret instruction, are masked to a bundle
972  boundary and bundle-locked.
973
974- Call instructions are aligned to the end of the bundle so that the return
975  address is bundle-aligned.
976
977- Indirect branch targets, including function entry and targets in a switch
978  statement jump table, are bundle-aligned.
979
980- The intrinsic for reading the thread pointer is inlined appropriately.
981
982- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
983  base register.  We reduce the aggressiveness of address mode inference to
984  leave room for the sandbox base register during lowering.  There are no memory
985  sandboxing changes for x86-32.
986
987Code emission
988-------------
989
990Subzero's integrated assembler is derived from Dart's `assembler code
991<https://github.com/dart-lang/sdk/tree/master/runtime/vm>'_.  There is a pass
992that iterates through the low-level ICE instructions and invokes the relevant
993assembler functions.  Placeholders are added for later fixup of branch target
994offsets.  (Backward branches use short offsets if possible; forward branches
995generally use long offsets unless it is an intra-block branch of "known" short
996length.)  The assembler emits into a staging buffer.  Once emission into the
997staging buffer for a function is complete, the data is emitted to the output
998file as an ELF object file, and metadata such as relocations, symbol table, and
999string table, are accumulated for emission at the end.  Global data initializers
1000are emitted similarly.  A key point is that at this point, the staging buffer
1001can be deallocated, and only a minimum of data needs to held until the end.
1002
1003As a debugging alternative, Subzero can emit textual assembly code which can
1004then be run through an external assembler.  This is of course super slow, but
1005quite valuable when bringing up a new target.
1006
1007As another debugging option, the staging buffer can be emitted as textual
1008assembly, primarily in the form of ".byte" lines.  This allows the assembler to
1009be tested separately from the ELF related code.
1010
1011Memory management
1012-----------------
1013
1014Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
1015LLVM's ``BumpPtrAllocator``.  This is an arena-style allocator where objects
1016allocated from the arena are never actually freed; instead, when the CFG
1017translation completes and the CFG is deleted, the entire arena memory is
1018reclaimed at once.  This style of allocation works well in an environment like a
1019compiler where there are distinct phases with only easily-identifiable objects
1020living across phases.  It frees the developer from having to manage object
1021deletion, and it amortizes deletion costs across just a single arena deletion at
1022the end of the phase.  Furthermore, it helps scalability by allocating entirely
1023from thread-local memory pools, and minimizing global locking of the heap.
1024
1025Instructions are probably the most heavily allocated complex class in Subzero.
1026We represent an instruction list as an intrusive doubly linked list, allocate
1027all instructions from the ``CfgLocalAllocator``, and we make sure each
1028instruction subclass is basically `POD
1029<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
1030trivial destructor.  This way, when the CFG is finished, we don't need to
1031individually deallocate every instruction.  We do similar for Variables, which
1032is probably the second most popular complex class.
1033
1034There are some situations where passes need to use some `STL container class
1035<http://en.cppreference.com/w/cpp/container>`_.  Subzero has a way of using the
1036``CfgLocalAllocator`` as the container allocator if this is needed.
1037
1038Multithreaded translation
1039-------------------------
1040
1041Subzero is designed to be able to translate functions in parallel.  With the
1042``-threads=N`` command-line option, there is a 3-stage producer-consumer
1043pipeline:
1044
1045- A single thread parses the ``.pexe`` file and produces a sequence of work
1046  units.  A work unit can be either a fully constructed CFG, or a set of global
1047  initializers.  The work unit includes its sequence number denoting its parse
1048  order.  Each work unit is added to the translation queue.
1049
1050- There are N translation threads that draw work units from the translation
1051  queue and lower them into assembler buffers.  Each assembler buffer is added
1052  to the emitter queue, tagged with its sequence number.  The CFG and its
1053  ``CfgLocalAllocator`` are disposed of at this point.
1054
1055- A single thread draws assembler buffers from the emitter queue and appends to
1056  the output file.  It uses the sequence numbers to reintegrate the assembler
1057  buffers according to the original parse order, such that output order is
1058  always deterministic.
1059
1060This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
1061for a total of ``N+2`` execution threads, taking the parser and emitter threads
1062into account.  For the special case of ``N=0``, execution is entirely sequential
1063-- the same thread parses, translates, and emits, one function at a time.  This
1064is useful for performance measurements.
1065
1066Ideally, we would like to get near-linear scalability as the number of
1067translation threads increases.  We expect that ``-threads=1`` should be slightly
1068faster than ``-threads=0`` as the small amount of time spent parsing and
1069emitting is done largely in parallel with translation.  With perfect
1070scalability, we see ``-threads=N`` translating ``N`` times as fast as
1071``-threads=1``, up until the point where parsing or emitting becomes the
1072bottleneck, or ``N+2`` exceeds the number of CPU cores.  In reality, memory
1073performance would become a bottleneck and efficiency might peak at, say, 75%.
1074
1075Currently, parsing takes about 11% of total sequential time.  If translation
1076scalability ever gets so fast and awesomely scalable that parsing becomes a
1077bottleneck, it should be possible to make parsing multithreaded as well.
1078
1079Internally, all shared, mutable data is held in the GlobalContext object, and
1080access to each field is guarded by a mutex.
1081
1082Security
1083--------
1084
1085Subzero includes a number of security features in the generated code, as well as
1086in the Subzero translator itself, which run on top of the existing Native Client
1087sandbox as well as Chrome's OS-level sandbox.
1088
1089Sandboxed translator
1090^^^^^^^^^^^^^^^^^^^^
1091
1092When running inside the browser, the Subzero translator executes as sandboxed,
1093untrusted code that is initially checked by the validator, just like the
1094LLVM-based ``pnacl-llc`` translator.  As such, the Subzero binary should be no
1095more or less secure than the translator it replaces, from the point of view of
1096the Chrome sandbox.  That said, Subzero is much smaller than ``pnacl-llc`` and
1097was designed from the start with security in mind, so one expects fewer attacker
1098opportunities here.
1099
1100Code diversification
1101^^^^^^^^^^^^^^^^^^^^
1102
1103`Return-oriented programming
1104<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
1105now-common technique for starting with e.g. a known buffer overflow situation
1106and launching it into a deeper exploit.  The attacker scans the executable
1107looking for ROP gadgets, which are short sequences of code that happen to load
1108known values into known registers and then return.  An attacker who manages to
1109overwrite parts of the stack can overwrite it with carefully chosen return
1110addresses such that certain ROP gadgets are effectively chained together to set
1111up the register state as desired, finally returning to some code that manages to
1112do something nasty based on those register values.
1113
1114If there is a popular ``.pexe`` with a large install base, the attacker could
1115run Subzero on it and scan the executable for suitable ROP gadgets to use as
1116part of a potential exploit.  Note that if the trusted validator is working
1117correctly, these ROP gadgets are limited to starting at a bundle boundary and
1118cannot use the trick of finding a gadget that happens to begin inside another
1119instruction.  All the same, gadgets with these constraints still exist and the
1120attacker has access to them.  This is the attack model we focus most on --
1121protecting the user against misuse of a "trusted" developer's application, as
1122opposed to mischief from a malicious ``.pexe`` file.
1123
1124Subzero can mitigate these attacks to some degree through code diversification.
1125Specifically, we can apply some randomness to the code generation that makes ROP
1126gadgets less predictable.  This randomness can have some compile-time cost, and
1127it can affect the code quality; and some diversifications may be more effective
1128than others.  A more detailed treatment of hardening techniques may be found in
1129the Matasano report "`Attacking Clientside JIT Compilers
1130<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".
1131
1132To evaluate diversification effectiveness, we use a third-party ROP gadget
1133finder and limit its results to bundle-aligned addresses.  For a given
1134diversification technique, we run it with a number of different random seeds,
1135find ROP gadgets for each version, and determine how persistent each ROP gadget
1136is across the different versions.  A gadget is persistent if the same gadget is
1137found at the same code address.  The best diversifications are ones with low
1138gadget persistence rates.
1139
1140Subzero implements 7 different diversification techniques.  Below is a
1141discussion of each technique, its effectiveness, and its cost.  The discussions
1142of cost and effectiveness are for a single diversification technique; the
1143translation-time costs for multiple techniques are additive, but the effects of
1144multiple techniques on code quality and effectiveness are not yet known.
1145
1146In Subzero's implementation, each randomization is "repeatable" in a sense.
1147Each pass that includes a randomization option gets its own private instance of
1148a random number generator (RNG).  The RNG is seeded with a combination of a
1149global seed, the pass ID, and the function's sequence number.  The global seed
1150is designed to be different across runs (perhaps based on the current time), but
1151for debugging, the global seed can be set to a specific value and the results
1152will be repeatable.
1153
1154Subzero-generated code is subject to diversification once per translation, and
1155then Chrome caches the diversified binary for subsequent executions.  An
1156attacker may attempt to run the binary multiple times hoping for
1157higher-probability combinations of ROP gadgets.  When the attacker guesses
1158wrong, a likely outcome is an application crash.  Chrome throttles creation of
1159crashy processes which reduces the likelihood of the attacker eventually gaining
1160a foothold.
1161
1162Constant blinding
1163~~~~~~~~~~~~~~~~~
1164
1165Here, we prevent attackers from controlling large immediates in the text
1166(executable) section.  A random cookie is generated for each function, and if
1167the constant exceeds a specified threshold, the constant is obfuscated with the
1168cookie and equivalent code is generated.  For example, instead of this x86
1169instruction::
1170
1171    mov $0x11223344, <%Reg/Mem>
1172
1173the following code might be generated::
1174
1175    mov $(0x11223344+Cookie), %temp
1176    lea -Cookie(%temp), %temp
1177    mov %temp, <%Reg/Mem>
1178
1179The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
1180prevent unintended effects on the flags register.
1181
1182This transformation has almost no effect on translation time, and about 1%
1183impact on code quality, depending on the threshold chosen.  It does little to
1184reduce gadget persistence, but it does remove a lot of potential opportunities
1185to construct intra-instruction ROP gadgets (which an attacker could use only if
1186a validator bug were discovered, since the Native Client sandbox and associated
1187validator force returns and other indirect branches to be to bundle-aligned
1188addresses).
1189
1190Constant pooling
1191~~~~~~~~~~~~~~~~
1192
1193This is similar to constant blinding, in that large immediates are removed from
1194the text section.  In this case, each unique constant above the threshold is
1195stored in a read-only data section and the constant is accessed via a memory
1196load.  For the above example, the following code might be generated::
1197
1198    mov $Label$1, %temp
1199    mov %temp, <%Reg/Mem>
1200
1201This has a similarly small impact on translation time and ROP gadget
1202persistence, and a smaller (better) impact on code quality.  This is because it
1203uses fewer instructions, and in some cases good register allocation leads to no
1204increase in instruction count.  Note that this still gives an attacker some
1205limited amount of control over some text section values, unless we randomize the
1206constant pool layout.
1207
1208Static data reordering
1209~~~~~~~~~~~~~~~~~~~~~~
1210
1211This transformation limits the attacker's ability to control bits in global data
1212address references.  It simply permutes the order in memory of global variables
1213and internal constant pool entries.  For the constant pool, we only permute
1214within a type (i.e., emit a randomized list of ints, followed by a randomized
1215list of floats, etc.) to maintain good packing in the face of alignment
1216constraints.
1217
1218As might be expected, this has no impact on code quality, translation time, or
1219ROP gadget persistence (though as above, it limits opportunities for
1220intra-instruction ROP gadgets with a broken validator).
1221
1222Basic block reordering
1223~~~~~~~~~~~~~~~~~~~~~~
1224
1225Here, we randomize the order of basic blocks within a function, with the
1226constraint that we still want to maintain a topological order as much as
1227possible, to avoid making the code too branchy.
1228
1229This has no impact on code quality, and about 1% impact on translation time, due
1230to a separate pass to recompute layout.  It ends up having a huge effect on ROP
1231gadget persistence, tied for best with nop insertion, reducing ROP gadget
1232persistence to less than 5%.
1233
1234Function reordering
1235~~~~~~~~~~~~~~~~~~~
1236
1237Here, we permute the order that functions are emitted, primarily to shift ROP
1238gadgets around to less predictable locations.  It may also change call address
1239offsets in case the attacker was trying to control that offset in the code.
1240
1241To control latency and memory footprint, we don't arbitrarily permute functions.
1242Instead, for some relatively small value of N, we queue up N assembler buffers,
1243and then emit the N functions in random order, and repeat until all functions
1244are emitted.
1245
1246Function reordering has no impact on translation time or code quality.
1247Measurements indicate that it reduces ROP gadget persistence to about 15%.
1248
1249Nop insertion
1250~~~~~~~~~~~~~
1251
1252This diversification randomly adds a nop instruction after each regular
1253instruction, with some probability.  Nop instructions of different lengths may
1254be selected.  Nop instructions are never added inside a bundle_lock region.
1255Note that when sandboxing is enabled, nop instructions are already being added
1256for bundle alignment, so the diversification nop instructions may simply be
1257taking the place of alignment nop instructions, though distributed differently
1258through the bundle.
1259
1260In Subzero's currently implementation, nop insertion adds 3-5% to the
1261translation time, but this is probably because it is implemented as a separate
1262pass that adds actual nop instructions to the IR.  The overhead would probably
1263be a lot less if it were integrated into the assembler pass.  The code quality
1264is also reduced by 3-5%, making nop insertion the most expensive of the
1265diversification techniques.
1266
1267Nop insertion is very effective in reducing ROP gadget persistence, at the same
1268level as basic block randomization (less than 5%).  But given nop insertion's
1269impact on translation time and code quality, one would most likely prefer to use
1270basic block randomization instead (though the combined effects of the different
1271diversification techniques have not yet been studied).
1272
1273Register allocation randomization
1274~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1275
1276In this diversification, the register allocator tries to make different but
1277mostly functionally equivalent choices, while maintaining stable code quality.
1278
1279A naive approach would be the following.  Whenever the allocator has more than
1280one choice for assigning a register, choose randomly among those options.  And
1281whenever there are no registers available and there is a tie for the
1282lowest-weight variable, randomly select one of the lowest-weight variables to
1283evict.  Because of the one-pass nature of the linear-scan algorithm, this
1284randomization strategy can have a large impact on which variables are ultimately
1285assigned registers, with a corresponding large impact on code quality.
1286
1287Instead, we choose an approach that tries to keep code quality stable regardless
1288of the random seed.  We partition the set of physical registers into equivalence
1289classes.  If a register is pre-colored in the function (i.e., referenced
1290explicitly by name), it forms its own equivalence class.  The remaining
1291registers are partitioned according to their combination of attributes such as
1292integer versus floating-point, 8-bit versus 32-bit, caller-save versus
1293callee-saved, etc.  Each equivalence class is randomly permuted, and the
1294complete permutation is applied to the final register assignments.
1295
1296Register randomization reduces ROP gadget persistence to about 10% on average,
1297though there tends to be fairly high variance across functions and applications.
1298This probably has to do with the set of restrictions in the x86-32 instruction
1299set and ABI, such as few general-purpose registers, ``%eax`` used for return
1300values, ``%edx`` used for division, ``%cl`` used for shifting, etc.  As
1301intended, register randomization has no impact on code quality, and a slight
1302(0.5%) impact on translation time due to an extra scan over the variables to
1303identify pre-colored registers.
1304
1305Fuzzing
1306^^^^^^^
1307
1308We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
1309combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
1310<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling.  The purpose is to
1311find and fix cases where Subzero crashes or otherwise ungracefully fails on
1312unexpected inputs, and to do so automatically over a large range of unexpected
1313inputs.  By fixing bugs that arise from fuzz testing, we reduce the possibility
1314of an attacker exploiting these bugs.
1315
1316Most of the problems found so far are ones most appropriately handled in the
1317parser.  However, there have been a couple that have identified problems in the
1318lowering, or otherwise inappropriately triggered assertion failures and fatal
1319errors.  We continue to dig into this area.
1320
1321Future security work
1322^^^^^^^^^^^^^^^^^^^^
1323
1324Subzero is well-positioned to explore other future security enhancements, e.g.:
1325
1326- Tightening the Native Client sandbox.  ABI changes, such as the previous work
1327  on `hiding the sandbox base address
1328  <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
1329  in x86-64, are easy to experiment with in Subzero.
1330
1331- Making the executable code section read-only.  This would prevent a PNaCl
1332  application from inspecting its own binary and trying to find ROP gadgets even
1333  after code diversification has been performed.  It may still be susceptible to
1334  `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks,
1335  security is still overall improved.
1336
1337- Instruction selection diversification.  It may be possible to lower a given
1338  instruction in several largely equivalent ways, which gives more opportunities
1339  for code randomization.
1340
1341Chrome integration
1342------------------
1343
1344Currently Subzero is available in Chrome for the x86-32 architecture, but under
1345a flag.  When the flag is enabled, Subzero is used when the `manifest file
1346<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
1347linking to the ``.pexe`` file specifies the ``O0`` optimization level.
1348
1349The next step is to remove the flag, i.e. invoke Subzero as the only translator
1350for ``O0``-specified manifest files.
1351
1352Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
1353case Subzero could be used for all PNaCl translation.
1354
1355Command line options
1356--------------------
1357
1358Subzero has a number of command-line options for debugging and diagnostics.
1359Among the more interesting are the following.
1360
1361- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
1362  diagnostic output, with various levels of detail after each pass.  Instruction
1363  numbers can be printed or suppressed.  Deleted instructions can be printed or
1364  suppressed (they are retained in the instruction list, as discussed earlier,
1365  because they can help explain how lower-level instructions originated).
1366  Liveness information can be printed when available.  Details of register
1367  allocation can be printed as register allocator decisions are made.  And more.
1368
1369- Running Subzero with any level of verbosity produces an enormous amount of
1370  output.  When debugging a single function, verbose output can be suppressed
1371  except for a particular function.  The ``-verbose-focus`` flag suppresses
1372  verbose output except for the specified function.
1373
1374- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
1375  at exit.  Timing markers can be placed in the Subzero source code to demarcate
1376  logical operations or passes of interest.  Basic timing information plus
1377  call-stack type timing information is printed at the end.
1378
1379- Along with ``-timing``, the user can instead get a report on the overall
1380  translation time for each function, to help focus on timing outliers.  Also,
1381  ``-timing-focus`` limits the ``-timing`` reporting to a single function,
1382  instead of aggregating pass timing across all functions.
1383
1384- The ``-szstats`` option reports various statistics on each function, such as
1385  stack frame size, static instruction count, etc.  It may be helpful to track
1386  these stats over time as Subzero is improved, as an approximate measure of
1387  code quality.
1388
1389- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
1390  output, annotate the assembly output with register-focused liveness
1391  information.  In particular, each basic block is annotated with which
1392  registers are live-in and live-out, and each instruction is annotated with
1393  which registers' and stack locations' live ranges end at that instruction.
1394  This is really useful when studying the generated code to find opportunities
1395  for code quality improvements.
1396
1397Testing and debugging
1398---------------------
1399
1400LLVM lit tests
1401^^^^^^^^^^^^^^
1402
1403For basic testing, Subzero uses LLVM's `lit
1404<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.  We
1405have a suite of hundreds of small functions where we test for particular
1406assembly code patterns across different target architectures.
1407
1408Cross tests
1409^^^^^^^^^^^
1410
1411Unfortunately, the lit tests don't do a great job of precisely testing the
1412correctness of the output.  Much better are the cross tests, which are execution
1413tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
1414variety of interesting inputs.  Each cross test consists of a set of C, C++,
1415and/or low-level bitcode files.  The C and C++ source files are compiled down to
1416bitcode.  The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
1417Subzero mangles global symbol names with a special prefix to avoid duplicate
1418symbol errors.  A driver program invokes both versions on a large set of
1419interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
1420differ.  Cross tests turn out to be an excellent way of testing the basic
1421lowering patterns, but they are less useful for testing more global things like
1422liveness analysis and register allocation.
1423
1424Bisection debugging
1425^^^^^^^^^^^^^^^^^^^
1426
1427Sometimes with a new application, Subzero will end up producing incorrect code
1428that either crashes at runtime or otherwise produces the wrong results.  When
1429this happens, we need to narrow it down to a single function (or small set of
1430functions) that yield incorrect behavior.  For this, we have a bisection
1431debugging framework.  Here, we initially translate the entire application once
1432with Subzero and once with ``pnacl-llc``.  We then use ``objdump`` to
1433selectively weaken symbols based on a whitelist or blacklist provided on the
1434command line.  The two object files can then be linked together without link
1435errors, with the desired version of each method "winning".  Then the binary is
1436tested, and bisection proceeds based on whether the binary produces correct
1437output.
1438
1439When the bisection completes, we are left with a minimal set of
1440Subzero-translated functions that cause the failure.  Usually it is a single
1441function, though sometimes it might require a combination of several functions
1442to cause a failure; this may be due to an incorrect call ABI, for example.
1443However, Murphy's Law implies that the single failing function is enormous and
1444impractical to debug.  In that case, we can restart the bisection, explicitly
1445blacklisting the enormous function, and try to find another candidate to debug.
1446(Future work is to automate this to find all minimal sets of functions, so that
1447debugging can focus on the simplest example.)
1448
1449Fuzz testing
1450^^^^^^^^^^^^
1451
1452As described above, we try to find internal Subzero bugs using fuzz testing
1453techniques.
1454
1455Sanitizers
1456^^^^^^^^^^
1457
1458Subzero can be built with `AddressSanitizer
1459<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
1460<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support.  This is
1461done using something as simple as ``make ASAN=1`` or ``make TSAN=1``.  So far,
1462multithreading has been simple enough that TSan hasn't found any bugs, but ASan
1463has found at least one memory leak which was subsequently fixed.
1464`UndefinedBehaviorSanitizer
1465<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
1466(UBSan) support is in progress.  `Control flow integrity sanitization
1467<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
1468consideration.
1469
1470Current status
1471==============
1472
1473Target architectures
1474--------------------
1475
1476Subzero is currently more or less complete for the x86-32 target.  It has been
1477refactored and extended to handle x86-64 as well, and that is mostly complete at
1478this point.
1479
1480ARM32 work is in progress.  It currently lacks the testing level of x86, at
1481least in part because Subzero's register allocator needs modifications to handle
1482ARM's aliasing of floating point and vector registers.  Specifically, a 64-bit
1483register is actually a gang of two consecutive and aligned 32-bit registers, and
1484a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
1485ARM64 work has not started; when it does, it will be native-only since the
1486Native Client sandbox model, validator, and other tools have never been defined.
1487
1488An external contributor is adding MIPS support, in most part by following the
1489ARM work.
1490
1491Translator performance
1492----------------------
1493
1494Single-threaded translation speed is currently about 5× the ``pnacl-llc``
1495translation speed.  For a large ``.pexe`` file, the time breaks down as:
1496
1497- 11% for parsing and initial IR building
1498
1499- 4% for emitting to /dev/null
1500
1501- 27% for liveness analysis (two liveness passes plus live range construction)
1502
1503- 15% for linear-scan register allocation
1504
1505- 9% for basic lowering
1506
1507- 10% for advanced phi lowering
1508
1509- ~11% for other minor analysis
1510
1511- ~10% measurement overhead to acquire these numbers
1512
1513Some improvements could undoubtedly be made, but it will be hard to increase the
1514speed to 10× of ``pnacl-llc`` while keeping acceptable code quality.  With
1515``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
1516``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
1517
1518Code quality
1519------------
1520
1521Measured across 16 components of spec2k, Subzero's code quality is uniformly
1522better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
1523between ``pnacl-llc`` ``-O0`` and ``-O2``.
1524
1525Translator size
1526---------------
1527
1528When built in MINIMAL mode, the x86-64 native translator size for the x86-32
1529target is about 700 KB, not including the size of functions referenced in
1530dynamically-linked libraries.  The sandboxed version of Subzero is a bit over 1
1531MB, and it is statically linked and also includes nop padding for bundling as
1532well as indirect branch masking.
1533
1534Translator memory footprint
1535---------------------------
1536
1537It's hard to draw firm conclusions about memory footprint, since the footprint
1538is at least proportional to the input function size, and there is no real limit
1539on the size of functions in the ``.pexe`` file.
1540
1541That said, we looked at the memory footprint over time as Subzero translated
1542``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
1543disposal.  One of LLVM's libraries that Subzero uses can report the current
1544malloc heap usage.  With single-threaded translation, Subzero tends to hover
1545around 15 MB of memory usage.  There are a couple of monstrous functions where
1546Subzero grows to around 100 MB, but then it drops back down after those
1547functions finish translating.  In contrast, ``pnacl-llc`` grows larger and
1548larger throughout translation, reaching several hundred MB by the time it
1549completes.
1550
1551It's a bit more interesting when we enable multithreaded translation.  When
1552there are N translation threads, Subzero implements a policy that limits the
1553size of the translation queue to N entries -- if it is "full" when the parser
1554tries to add a new CFG, the parser blocks until one of the translation threads
1555removes a CFG.  This means the number of in-memory CFGs can (and generally does)
1556reach 2*N+1, and so the memory footprint rises in proportion to the number of
1557threads.  Adding to the pressure is the observation that the monstrous functions
1558also take proportionally longer time to translate, so there's a good chance many
1559of the monstrous functions will be active at the same time with multithreaded
1560translation.  As a result, for N=32, Subzero's memory footprint peaks at about
1561260 MB, but drops back down as the large functions finish translating.
1562
1563If this peak memory size becomes a problem, it might be possible for the parser
1564to resequence the functions to try to spread out the larger functions, or to
1565throttle the translation queue to prevent too many in-flight large functions.
1566It may also be possible to throttle based on memory pressure signaling from
1567Chrome.
1568
1569Translator scalability
1570----------------------
1571
1572Currently scalability is "not very good".  Multiple translation threads lead to
1573faster translation, but not to the degree desired.  We haven't dug in to
1574investigate yet.
1575
1576There are a few areas to investigate.  First, there may be contention on the
1577constant pool, which all threads access, and which requires locked access even
1578for reading.  This could be mitigated by keeping a CFG-local cache of the most
1579common constants.
1580
1581Second, there may be contention on memory allocation.  While almost all CFG
1582objects are allocated from the CFG-local allocator, some passes use temporary
1583STL containers that use the default allocator, which may require global locking.
1584This could be mitigated by switching these to the CFG-local allocator.
1585
1586Third, multithreading may make the default allocator strategy more expensive.
1587In a single-threaded environment, a pass will allocate its containers, run the
1588pass, and deallocate the containers.  This results in stack-like allocation
1589behavior and makes the heap free list easier to manage, with less heap
1590fragmentation.  But when multithreading is added, the allocations and
1591deallocations become much less stack-like, making allocation and deallocation
1592operations individually more expensive.  Again, this could be mitigated by
1593switching these to the CFG-local allocator.
1594