Verification todo
~~~~~~~~~~~~~~~~~
check that illegal insns on all targets don't cause the _toIR.c's to
assert.  [DONE: amd64 x86 ppc32 ppc64 arm s390]

check also with --vex-guest-chase-cond=yes

check that all targets can run their insn set tests with
--vex-guest-max-insns=1.

all targets: run some tests using --profile-flags=... to exercise
function patchProfInc_<arch> [DONE: amd64 x86 ppc32 ppc64 arm s390]

figure out if there is a way to write a test program that checks
that event checks are actually getting triggered


Cleanups
~~~~~~~~
host_arm_isel.c and host_arm_defs.c: get rid of global var arm_hwcaps.

host_x86_defs.c, host_amd64_defs.c: return proper VexInvalRange
records from the patchers, instead of {0,0}, so that transparent
self hosting works properly.

host_ppc_defs.h: is RdWrLR still needed?  If not, delete it.

ditto for ARM: is Ld8S still needed?

Comments that used to be in m_scheduler.c:
   tchaining tests:
   - extensive spinrounds
   - with sched quantum = 1  -- check that handle_noredir_jump
     doesn't return with INNER_COUNTERZERO
   other:
   - out of date comment w.r.t. bit 0 set in libvex_trc_values.h
   - can VG_TRC_BORING still happen?  if not, rm
   - memory leaks in m_transtab (InEdgeArr/OutEdgeArr leaking?)
   - move do_cacheflush out of m_transtab
   - more economical unchaining when nuking an entire sector
   - ditto w.r.t. cache flushes
   - verify case of 2 paths from A to B
   - check -- is IP_AT_SYSCALL still right?


Optimisations
~~~~~~~~~~~~~
ppc: chain_XDirect: generate short-form jumps when possible

ppc64: immediate generation is terrible; we should be able
       to do better.

arm codegen: Generate ORRS for CmpwNEZ32(Or32(x,y))

all targets: when nuking an entire sector, don't bother to undo the
patching for any translations within the sector (nor with their
invalidations).

(somewhat implausible) for jumps to disp_cp_indir, have multiple
copies of disp_cp_indir, one for each of the possible registers that
could have held the target guest address before jumping to the stub.
Then disp_cp_indir wouldn't have to reload it from memory each time.
Might also have the effect of spreading out the indirect mispredict
burden somewhat (across the multiple copies).


Implementation notes
~~~~~~~~~~~~~~~~~~~~
T-chaining changes -- summary

* The code generators (host_blah_isel.c, host_blah_defs.[ch]) interact
  more closely with Valgrind than before.  In particular, the
  instruction selectors must use one of 3 different kinds of
  control-transfer instructions: XDirect, XIndir and XAssisted.
  All architectures must use them in the same way; there are no
  longer any ad-hoc control-transfer instructions.
  (more detail below)


* With T-chaining, translations can jump between each other without
  going through the dispatcher loop every time.  This means that the
  event check (decrement the counter, exit if it goes negative) that
  the dispatcher loop previously performed now needs to be compiled
  into each translation.


* The assembly dispatcher code (dispatch-arch-os.S) is still
  present.  It still provides table-lookup services for
  indirect branches, but it also provides a new feature:
  dispatch points, to which the generated code jumps.  There
  are 5:

  VG_(disp_cp_chain_me_to_slowEP):
  VG_(disp_cp_chain_me_to_fastEP):
    These are chain-me requests, used for Boring conditional and
    unconditional jumps to destinations known at JIT time.  The
    generated code calls these (doesn't jump to them) and the
    stub recovers the return address.  These calls never return;
    instead the call is done so that the stub knows where the
    calling point is.  It needs to know this so it can patch
    the calling point to the requested destination.
  VG_(disp_cp_xindir):
    Old-style table lookup and go; used for indirect jumps.
  VG_(disp_cp_xassisted):
    The most general and slowest kind.  Can transfer to anywhere,
    but first returns to the scheduler to do some other event
    (eg a syscall) before continuing.
  VG_(disp_cp_evcheck_fail):
    Code jumps here when the event check fails.


* New instructions in the backends: XDirect, XIndir and XAssisted.
  XDirect is used for chainable jumps.  It is compiled into a
  call to VG_(disp_cp_chain_me_to_slowEP) or
  VG_(disp_cp_chain_me_to_fastEP).

  XIndir is used for indirect jumps.  It is compiled into a jump
  to VG_(disp_cp_xindir).

  XAssisted is used for "assisted" (do something first, then jump)
  transfers.  It is compiled into a jump to VG_(disp_cp_xassisted).

  All 3 of these may be conditional.

  More complexity: in some circumstances (no-redir translations)
  all transfers must be done with XAssisted.  In such cases the
  instruction selector will be told this.
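
  As a rough illustration of the information each of the three forms
  carries (a sketch only -- the real per-arch definitions live in
  host_<arch>_defs.h, and all names below are invented for this note):

     /* Sketch, in C.  Each field corresponds to something the text
        above says the instruction must know about. */
     typedef unsigned long long Addr64;

     typedef struct {
        Addr64 dstGA;       /* destination guest address, known at JIT time */
        int    offsetOfPC;  /* guest-state offset of the guest PC to write  */
        int    cond;        /* condition, or "always" for unconditional     */
        int    toFastEP;    /* chain to the fast or the slow entry point?   */
     } SketchXDirect;

     typedef struct {
        int    dstReg;      /* register holding the guest address to go to  */
        int    offsetOfPC;
        int    cond;
     } SketchXIndir;

     typedef struct {
        int    dstReg;
        int    offsetOfPC;
        int    cond;
        int    jumpKind;    /* what the scheduler must do first (eg syscall) */
     } SketchXAssisted;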


* Patching: XDirect is compiled basically into
     %r11 = &VG_(disp_cp_chain_me_to_{slow,fast}EP)
     call *%r11
  Backends must provide a function (eg) chainXDirect_AMD64
  which converts it into a jump to a specified destination
     jmp $delta-of-PCs
  or
     %r11 = 64-bit immediate
     jmpq *%r11
  depending on branch distance.

  Backends must provide a function (eg) unchainXDirect_AMD64
  which restores the original call-to-the-stub version.
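
  For illustration, the long-form patch/unpatch on amd64 conceptually
  does the following (a sketch, not the real chainXDirect_AMD64 /
  unchainXDirect_AMD64; the short "jmp $delta-of-PCs" form and the
  icache maintenance implied by the returned range are omitted, and
  the names are invented):

     #include <string.h>

     typedef struct { unsigned long start; unsigned long len; } SketchInvalRange;

     /* Rewrite the 13-byte "movabsq $stub,%r11 ; call *%r11" sequence
        at place_to_chain so that it jumps straight to place_to_jump_to. */
     static SketchInvalRange
     sketch_chainXDirect ( void* place_to_chain, void* place_to_jump_to )
     {
        unsigned char* p   = (unsigned char*)place_to_chain;
        unsigned long  imm = (unsigned long)place_to_jump_to;
        p[0] = 0x49; p[1] = 0xBB;                  /* movabsq $imm64, %r11 */
        memcpy(&p[2], &imm, 8);
        p[10] = 0x41; p[11] = 0xFF; p[12] = 0xE3;  /* jmpq *%r11 */
        SketchInvalRange r = { (unsigned long)p, 13 };
        return r;
     }

     /* Restore the original call-to-the-stub version, so the block goes
        back through VG_(disp_cp_chain_me_to_{slow,fast}EP) next time. */
     static SketchInvalRange
     sketch_unchainXDirect ( void* place_to_unchain, void* disp_cp_chain_me )
     {
        unsigned char* p   = (unsigned char*)place_to_unchain;
        unsigned long  imm = (unsigned long)disp_cp_chain_me;
        p[0] = 0x49; p[1] = 0xBB;                  /* movabsq $stub, %r11  */
        memcpy(&p[2], &imm, 8);
        p[10] = 0x41; p[11] = 0xFF; p[12] = 0xD3;  /* call *%r11 */
        SketchInvalRange r = { (unsigned long)p, 13 };
        return r;
     }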


* Event checks.  Each translation now has two entry points,
  the slow one (slowEP) and fast one (fastEP).  Like this:

     slowEP:
        counter--
        if (counter < 0) goto VG_(disp_cp_evcheck_fail)
     fastEP:
        (rest of the translation)

  slowEP is used for control flow transfers that are or might be
  a back edge in the control flow graph.  Insn selectors are
  given the address of the highest guest byte in the block so
  they can determine which edges are definitely not back edges.
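
  For example (a sketch; the names are invented), the choice can be as
  simple as:

     /* A jump to a guest address beyond the highest guest byte of the
        current block cannot be a back edge, so the destination's fast
        entry point, which skips the event check, can safely be used. */
     static int useFastEP ( unsigned long long jump_target_ga,
                            unsigned long long highest_guest_byte )
     {
        return jump_target_ga > highest_guest_byte;
     }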

  The counter is placed in the first 8 bytes of the guest state,
  and the address of VG_(disp_cp_evcheck_fail) is placed in
  the next 8 bytes.  This allows very compact checks on all
  targets, since no immediates need to be synthesised, eg:

    decq 0(%baseblock-pointer)
    jns  fastEP
    jmpq *8(%baseblock-pointer)
    fastEP:

  On amd64 a non-failing check is therefore 2 insns; all 3 occupy
  just 8 bytes.

  On amd64 the event check is created by a special single
  pseudo-instruction AMD64_EvCheck.
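
  The layout this relies on can be pictured as follows (a sketch;
  the real field names in the guest state structs may differ):

     typedef struct {
        /* offset 0: decremented by each event check */
        unsigned long long host_evcheck_counter;
        /* offset 8: holds the address of VG_(disp_cp_evcheck_fail) */
        unsigned long long host_evcheck_failaddr;
        /* ... the guest registers proper follow ... */
     } SketchGuestState;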


* BB profiling (for --profile-flags=).  The assembly dispatcher code
  (dispatch-arch-os.S) no longer deals with this and so is much
  simplified.  Instead, the profile inc is compiled into each
  translation, as the insn immediately following the event
  check.  Again, on amd64 a pseudo-insn AMD64_ProfInc is used.
  Counters are now 64-bit even on 32-bit hosts, to avoid overflow.

  One complexity is that at JIT time the address of the counter
  is not known.  To solve this, VexTranslateResult now returns
  the offset of the profile inc in the generated code.  When the
  counter address is known, VEX can be called again to patch it
  in.  Backends must supply a function, eg patchProfInc_AMD64,
  to make this happen.
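
  The resulting two-step flow on the Valgrind side is roughly this
  (a sketch; the function and field names here are invented, not the
  exact VEX/Valgrind API):

     typedef unsigned long long ULong;

     /* Stands in for calling back into VEX, which dispatches to the
        backend's patchProfInc_<arch> for the host in use. */
     extern void sketch_patchProfInc ( unsigned char* place_in_code,
                                       const ULong* counter_addr );

     typedef struct {
        unsigned char* hcode;        /* generated host code            */
        int            ofs_profInc;  /* offset of the profile inc in   */
                                     /* hcode, from VexTranslateResult */
     } SketchTransResult;

     static void install_translation ( SketchTransResult tres,
                                       ULong* counter_for_this_bb )
     {
        /* At JIT time the backend emitted the profile inc with a dummy
           counter address; now that the counter's (64-bit) location is
           known, patch it into the generated code. */
        sketch_patchProfInc( tres.hcode + tres.ofs_profInc,
                             counter_for_this_bb );
     }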


* Front end changes (guest_blah_toIR.c)

  The way the guest program counter is handled has changed
  significantly.  Previously, the guest PC was updated (in IR)
  at the start of each instruction, except for the first insn
  in an IRSB.  That was inconsistent and doesn't work with the
  new framework.

  Now, each instruction must update the guest PC as its last
  IR statement -- not its first.  There is no longer a special
  exemption for the first insn in the block.  As before, most of
  these updates are optimised out by ir_opt, so there are no
  efficiency concerns.

  As a logical side effect of this, exits (IRStmt_Exit) and the
  block-end transfer are both considered to write to the guest state
  (the guest PC) and so need to be told the offset of it.

  IR generators (eg disInstr_AMD64) are no longer allowed to set
  IRSB::next to specify the block-end transfer address.  Instead they
  now indicate, to the generic steering logic that drives them (iow,
  guest_generic_bb_to_IR.c), that the block has ended.  This then
  generates effectively "goto GET(PC)" (which, again, is optimised
  away).  What this does mean is that if the IR generator function
  ends the IR of the last instruction in the block with an incorrect
  assignment to the guest PC, execution will transfer to an incorrect
  destination -- making the error obvious quickly.
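
  As a concrete picture of the new per-instruction contract, the IR
  for every guest instruction now finishes with a write of the
  fall-through address to the guest PC (a sketch in the style of the
  front ends' helper functions; the declarations below stand in for
  the real VEX types, and OFFB_PC for the per-arch guest PC offset):

     typedef unsigned long long ULong;
     typedef struct IRExpr IRExpr;
     typedef struct IRStmt IRStmt;

     extern IRExpr* mkU64      ( ULong v );             /* constant        */
     extern IRStmt* IRStmt_Put ( int off, IRExpr* e );  /* write guest st. */
     extern void    stmt       ( IRStmt* st );          /* append to IRSB  */

     /* Placeholder for the architecture's guest PC offset. */
     #define OFFB_PC 0

     static void end_of_instruction ( ULong addr_of_next_insn )
     {
        /* The guest PC update is now the LAST statement of this
           instruction's IR (it used to be the first, except for the
           first insn of the IRSB).  ir_opt removes the redundant
           updates later, so this costs nothing in practice. */
        stmt( IRStmt_Put( OFFB_PC, mkU64( addr_of_next_insn ) ) );
     }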