README
Workload descriptor format
==========================

ctx.engine.duration_us.dependency.wait,...
<uint>.<str>.<uint>[-<uint>]|*.<int <= 0>[/<int <= 0>][...].<0|1>,...
B.<uint>
M.<uint>.<str>[|<str>]...
P|S|X.<uint>.<int>
d|p|s|t|q|a|T.<int>,...
b.<uint>.<str>[|<str>].<str>
f

For duration a range can be given from which a random value will be picked
before every submit. Since this and seqno management require CPU access to
objects, care needs to be taken to ensure the submit queue is deep enough
that these operations do not affect the execution speed unless that is
desired.
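
For example, the following step will pick a random duration between 0.5ms and
1ms before every submit:

 1.RCS.500-1000.0.0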

Additional workload steps are also supported:

 'd' - Adds a delay (in microseconds).
 'p' - Adds a delay relative to the start of the previous loop so that each
 loop starts execution with a given period.
 's' - Synchronises the pipeline to a batch relative to the step.
 't' - Throttle every n batches.
 'q' - Throttle to n max queue depth.
 'f' - Create a sync fence.
 'a' - Advance the previously created sync fence.
 'B' - Turn on context load balancing.
 'b' - Set up engine bonds.
 'M' - Set up engine map.
 'P' - Context priority.
 'S' - Context SSEU configuration.
 'T' - Terminate an infinite batch.
 'X' - Context preemption control.

Engine ids: DEFAULT, RCS, BCS, VCS, VCS1, VCS2, VECS

Example (leading spaces must not be present in the actual file):
----------------------------------------------------------------

 1.VCS1.3000.0.1
 1.RCS.500-1000.-1.0
 1.RCS.3700.0.0
 1.RCS.1000.-2.0
 1.VCS2.2300.-2.0
 1.RCS.4700.-1.0
 1.VCS2.600.-1.1
 p.16000

The above workload, described in plain language, works like this:

 1. A batch is sent to the VCS1 engine which will be executing for 3ms on the
 GPU and userspace will wait until it is finished before proceeding.
 2-4. Now three batches are sent to RCS with durations of 0.5-1ms (random
 duration range), 3.7ms and 1ms respectively. The first batch has a data
 dependency on the preceding VCS1 batch, and the last of the group depends
 on the first from the group.
 5. Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms
 RCS batch.
 6. This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms
 VCS2 batch.
 7. Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the
 same step the tool is told to wait until the batch completes before
 proceeding.
 8. Finally the tool is told to wait long enough to ensure the next iteration
 starts 16ms after the previous one has started.
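
The delay and throttling steps ('d', 't' and 'q') can be combined in a similar
fashion. A sketch, with arbitrary numbers:

 1.RCS.1000.0.0
 d.500
 t.2
 q.3

Each 1ms RCS batch is followed by a 0.5ms delay, while 't' and 'q' keep the
tool from running too far ahead of the GPU, by synchronising every two batches
and capping the queue depth at three respectively.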

When workload descriptors are provided on the command line, commas must be used
instead of new lines.
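
For example, the two step workload

 1.RCS.1000.0.0
 p.16000

would be given on the command line as:

 1.RCS.1000.0.0,p.16000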

Multiple dependencies can be given separated by forward slashes.

Example:

 1.VCS1.3000.0.1
 1.RCS.3700.0.0
 1.VCS2.2300.-1/-2.0

In this case the last step has a data dependency on both the first and second
steps.

Batch durations can also be specified as infinite by using '*' in the duration
field. Such batches must be ended by the terminate command ('T'), otherwise
they will cause a GPU hang to be reported.
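
For instance, assuming the 'T' argument is the usual relative step offset, an
infinite batch could be ended like this:

 1.RCS.*.0.0
 1.VCS1.1000.0.1
 T.-2

The infinite RCS batch keeps running while the VCS1 batch executes and is
waited upon, after which the RCS batch is terminated.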

Sync (fd) fences
----------------

Sync fences are also supported as dependencies.

To use them put an "f<N>" token in the step dependency list. N in this case is
the same relative step offset to the dependee batch, but instead of a data
dependency an output fence will be emitted at the dependee step, and passed in
as a dependency to the current step.
95
96Example:
97
98 1.VCS1.3000.0.0
99 1.RCS.500-1000.-1/f-1.0
100
101In this case the second step will have both a data dependency and a sync fence
102dependency on the previous step.
103
104Example:
105
106 1.RCS.500-1000.0.0
107 1.VCS1.3000.f-1.0
108 1.VCS2.3000.f-2.0
109
110VCS1 and VCS2 batches will have a sync fence dependency on the RCS batch.
111
112Example:
113
114 1.RCS.500-1000.0.0
115 f
116 2.VCS1.3000.f-1.0
117 2.VCS2.3000.f-2.0
118 1.RCS.500-1000.0.1
119 a.-4
120 s.-4
121 s.-4
122
VCS1 and VCS2 batches have an input sync fence dependency on the standalone
fence created at the second step. They are submitted ahead of time while still
not runnable. When the second RCS batch completes, the standalone fence is
signalled, which allows the two VCS batches to be executed. Finally we wait
until both VCS batches have completed before starting the (optional) next
iteration.

Submit fences
-------------

Submit fences are a type of input fence which is signalled when the originating
batch buffer is submitted to the GPU (in contrast to normal sync fences, which
are signalled on completion).

Submit fences use the same syntax as sync fences, with a lower-case 's'
selecting them. Eg:

 1.RCS.500-1000.0.0
 1.VCS1.3000.s-1.0
 1.VCS2.3000.s-2.0

Here the VCS1 and VCS2 batches will only be submitted for execution once the
RCS batch enters the GPU.

Context priority
----------------

 P.1.-1
 1.RCS.1000.0.0
 P.2.1
 2.BCS.1000.-2.0

Context 1 is marked as low priority (-1) and then a batch buffer is submitted
against it. Context 2 is marked as high priority (1) and then a batch buffer
is submitted against it which depends on the batch from context 1.

The context priority command is executed at workload runtime and is valid until
overridden by another priority change on the same context. Actual driver ioctls
are executed only if the priority level has changed for the context.

Context preemption control
--------------------------

 X.1.0
 1.RCS.1000.0.0
 X.1.500
 1.RCS.1000.0.0
169
170Context 1 is marked as non-preemptable batches and a batch is sent against 1.
171The same context is then marked to have batches which can be preempted every
172500us and another batch is submitted.
173
174Same as with context priority, context preemption commands are valid until
175optionally overriden by another preemption control change on the same context.

Engine maps
-----------

Engine maps are a per context feature which changes the way engine selection is
done in the driver.

Example:

 M.1.VCS1|VCS2

This sets up context 1 with an engine map containing the VCS1 and VCS2 engines.
Submissions to this context can now only reference these two engines.

Engine maps can also be defined based on an engine class, like VCS.

Example:

 M.1.VCS

This sets up the engine map to all available VCS class engines.

Context load balancing
----------------------

Context load balancing (aka virtual engine) is an i915 feature where the driver
picks the best (most idle) engine to submit to, given the previously configured
engine map.

Example:

 B.1

This enables load balancing for context number one.

Engine bonds
------------

Engine bonds are an extension to load balanced contexts. They allow expressing
rules of engine selection between two co-operating contexts tied with submit
fences. In other words, the rule expression tells the driver: "If you pick
this engine for context one, then you have to pick that engine for context
two".

Syntax is:
 b.<context>.<engine_list>.<master_engine>

Engine list is a list of one or more sibling engines separated by a pipe
character (eg. "VCS1|VCS2").

There can be multiple bonds tied to the same context.

Example:

 M.1.RCS|VECS
 B.1
 M.2.VCS1|VCS2
 B.2
 b.2.VCS1.RCS
 b.2.VCS2.VECS

This tells the driver that if it picked RCS for context one, it has to pick
VCS1 for context two. And if it picked VECS for context one, it has to pick
VCS2 for context two.

If we extend the above example with more workload directives:

 1.DEFAULT.1000.0.0
 2.DEFAULT.1000.s-1.0

We get to a fully functional example where two batch buffers are submitted in a
load balanced fashion, telling the driver they should run simultaneously and
that valid engine pairs are either RCS + VCS1 (for two contexts respectively),
or VECS + VCS2.
249
250This can also be extended using sync fences to improve chances of the first
251submission not getting on the hardware after the second one. Second block would
252then look like:

 f
 1.DEFAULT.1000.f-1.0
 2.DEFAULT.1000.s-1.0
 a.-3

Context SSEU configuration
--------------------------

 S.1.1
 1.RCS.1000.0.0
 S.2.-1
 2.RCS.1000.0.0

Context 1 is configured to run with one enabled slice (slice mask 1) and a
batch is submitted against it. Context 2 is configured to run with all slices
(this is the default so the command could also be omitted) and a batch is
submitted against it.

This shows the dynamic SSEU reconfiguration cost between two contexts competing
for the render engine.

Slice mask of -1 has a special meaning of "all slices". Otherwise any integer
can be specified as the slice mask, but beware that values other than 1 and -1
can make the workload not portable between different GPUs.
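
For example, on a GPU with at least two slices, a mask of 3 (binary 11) would
enable the first two slices, with the portability caveat above:

 S.1.3
 1.RCS.1000.0.0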