# Perfetto CI design document

This CI is used on top of (not in replacement of) AOSP's TreeHugger.
It gives early testing signals and coverage on other OSes and older Android
devices not supported by TreeHugger.

See the [Testing](/docs/contributing/testing.md) page for more details about the
project testing strategy.

## Architecture diagram

![Architecture diagram](/docs/images/continuous-integration.png)

There are four major components:

1. Frontend: AppEngine.
2. Controller: AppEngine BG service.
3. Workers: Compute Engine + Docker.
4. Database: Firebase Realtime Database.

They are coupled via the Firebase DB. The DB is the source of truth for the
whole CI.

## Controller

The Controller orchestrates the CI. It's the most trusted piece of the system.

It is based on a background AppEngine service. This service is triggered only
by deferred tasks and periodic cron jobs.

The Controller is the only entity that performs authenticated access to Gerrit.
It uses a non-privileged gmail account and has no meaningful voting power.

The controller loop mainly does the following:

- It periodically (every 5s) polls Gerrit for CLs updated in the last 24h.
- It checks the list of CLs against the list of already known CLs in the DB.
- For each new CL it enqueues `N` new jobs in the database, one for each
  configuration defined in [config.py](/infra/ci/config.py) (e.g. `linux-debug`,
  `android-release`, ...).
- It monitors the state of jobs. When all jobs for a CL have completed,
  it posts a comment and adds the vote if the CL is marked as `Presubmit-Ready`.
- It does some other less-relevant bookkeeping.
- AppEngine is highly reliable and self-healing. If a task fails (e.g. because
  of a Gerrit 500) it will be automatically retried with exponential backoff.
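
The "diff against known CLs, then enqueue one job per configuration" step above can be sketched roughly as follows. This is an illustrative sketch only, not the Controller's actual code: `jobs_for_new_cls` and `JOB_TYPES` are hypothetical names, the CL keys are assumed to have the `<change>-<patchset>` shape seen in the data model, and the job id layout mirrors the `/jobs` keys shown there.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the job types defined in /infra/ci/config.py.
JOB_TYPES = ["linux-debug", "android-release"]


def jobs_for_new_cls(gerrit_cls, known_cls, now, job_types=JOB_TYPES):
    """Returns {cl_key: [job_id, ...]} for CLs that are not yet in the DB.

    gerrit_cls: CL keys returned by the Gerrit poll (assumed "<change>-<ps>").
    known_cls:  CL keys already present in the DB.
    """
    stamp = now.strftime("%Y%m%d%H%M%S")
    known = set(known_cls)
    new_jobs = {}
    for cl in gerrit_cls:
        if cl in known:
            continue  # Already tracked, nothing to enqueue.
        new_jobs[cl] = [f"{stamp}--cls-{cl}--{t}" for t in job_types]
    return new_jobs


now = datetime(2019, 7, 8, 15, 32, 42, tzinfo=timezone.utc)
result = jobs_for_new_cls(["1000515-65", "1000515-66"], ["1000515-65"], now)
print(result)
```

Keeping the function pure (no Gerrit or DB access) makes the retry behavior described above harmless: re-running a failed task recomputes the same diff.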

## Frontend

The frontend is an AppEngine service that hosts the CI website at
[ci.perfetto.dev](https://ci.perfetto.dev).
Unlike the Controller, it is exposed to the public via HTTP.

- It's an almost fully static website based on HTML and JavaScript.
- The only backend-side code ([frontend.py](/infra/ci/frontend/frontend.py))
  is used to proxy XHR GET requests to Gerrit, due to the lack of Gerrit
  CORS headers.
- These XHR requests are GET-only and anonymous.
- The frontend Python code also serves as a memcache layer for Gerrit requests
  that return immutable data (e.g. revision logs) to reduce the likelihood of
  hitting Gerrit errors / timeouts.
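
The caching idea in the last bullet can be illustrated with a minimal sketch. It is not taken from `frontend.py`: a plain dict stands in for memcache, and `ImmutableCache`, `fake_fetch` and the request paths are hypothetical names chosen for the example.

```python
class ImmutableCache:
    """Caches responses only for requests flagged as returning immutable data."""

    def __init__(self, fetch):
        self._fetch = fetch  # Callable: url -> response body.
        self._cache = {}     # Stand-in for memcache.

    def get(self, url, immutable=False):
        if immutable and url in self._cache:
            return self._cache[url]
        body = self._fetch(url)
        if immutable:
            self._cache[url] = body
        return body


calls = []


def fake_fetch(url):
    # Hypothetical upstream call; records each request that reaches "Gerrit".
    calls.append(url)
    return f"body of {url}"


cache = ImmutableCache(fake_fetch)
cache.get("/some/revision/log", immutable=True)
cache.get("/some/revision/log", immutable=True)  # Served from the cache.
cache.get("/some/mutable/query")
cache.get("/some/mutable/query")                 # Fetched again.
print(len(calls))
```

Only the immutable request is deduplicated, so mutable data such as CL lists is always fetched fresh.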

## Worker GCE VM

The actual testing jobs run inside Google Compute Engine VMs.
Each GCE instance runs a CrOS-based
[Container-Optimized](https://cloud.google.com/container-optimized-os/docs/) OS.

The whole system image is read-only. The VM itself is stateless. No state is
persisted outside of the DB and Google Cloud Storage (only for UI artifacts).
The SSD is used only as a scratch disk and is cleared on each reboot.

VMs are dynamically spawned using the Google Cloud Autoscaler and use a
Stackdriver Custom Metric pushed by the Controller as the cost function.
This metric is the number of queued + running jobs.

Each VM runs two types of Docker containers: _worker_ and _sandbox_.
They are in a 1:1 relationship: each worker controls at most one associated
sandbox. Workers are always alive (they work in polling mode), while
sandboxes are started and stopped by the worker on demand.

On each GCE instance there are M (currently 10) worker containers running and
hence up to M sandboxes.

### Worker containers

Worker containers are trusted entities. They can impersonate the GCE service
account and have R/W access to the DB. They can also spawn sandbox containers.

Their behavior depends only on code that is manually deployed and doesn't depend
on the checkout under test. Workers are Docker containers NOT for security
reasons but only for reproducibility and maintenance.

Each worker does the following:

- Poll for an available job from the `/jobs_queued` sub-tree of the DB.
- Move the job into `/jobs_running`.
- Start the sandbox container, passing down the job config and the git revision
  via env vars.
- Stream the sandbox stdout to the `/logs` sub-tree of the DB.
- Terminate the sandbox container prematurely in case of timeouts or job
  cancellations requested by the Controller.
- Upload UI artifacts to GCS.
- Update the DB to reflect completion of jobs, removing the entry from
  `/jobs_running` and updating the `/jobs/$jobId/status` fields.
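
The DB bookkeeping in the first and last steps above can be sketched against an in-memory dict standing in for the Firebase DB. This is a simplified illustration, not the logic of `worker.py`: `claim_job` and `complete_job` are hypothetical helpers, picking the oldest job first is an assumption, and races between concurrent workers (which a real shared DB would have to handle) are ignored.

```python
def claim_job(db, worker_name):
    """Moves one job from jobs_queued to jobs_running; returns its id or None."""
    queued = db.get("jobs_queued", {})
    if not queued:
        return None
    # Job ids start with a creation timestamp, so sorting picks the oldest.
    # (Assumption: the real scheduling order is not stated in this document.)
    job_id = sorted(queued)[0]
    del queued[job_id]
    db.setdefault("jobs_running", {})[job_id] = 0
    db["jobs"][job_id].update(status="STARTED", worker=worker_name)
    return job_id


def complete_job(db, job_id, exit_code):
    """Removes the job from jobs_running and records its final status."""
    db["jobs_running"].pop(job_id, None)
    db["jobs"][job_id]["status"] = "COMPLETED" if exit_code == 0 else "FAILED"


db = {
    "jobs": {"20190707123422--cls-1000515-66--android-clang-arm-rel": {"status": "QUEUED"}},
    "jobs_queued": {"20190707123422--cls-1000515-66--android-clang-arm-rel": 0},
    "jobs_running": {},
}
job = claim_job(db, "worker-1")
complete_job(db, job, exit_code=0)
print(db["jobs"][job]["status"])
```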

### Sandbox containers

Sandbox containers are untrusted entities. They can access the internet
(for git pull / install-build-deps) but they cannot impersonate the GCE service
account, write into the DB, or write into GCS buckets.
Docker here is used both as an isolation boundary and for reproducibility /
debugging.

Each sandbox does the following:

- Check out the code at the revision specified in the job config.
- Run one of the [test/ci/](/test/ci/) scripts, which builds and runs the tests.
- Return either a success (0) or failure (!= 0) exit code.

A sandbox container is almost completely stateless, with the only exception of
the semi-ephemeral `/ci/cache` mount-point. This mount-point is tmpfs-based
(hence cleared on reboot) but is shared across all sandboxes. It's used only to
maintain the shared ccache.

# Data model

The whole CI is based on
[Firebase Realtime DB](https://firebase.google.com/docs/database).
It is a high-scale JSON object accessible via a simple REST API.
Clients can GET/PUT/PATCH/DELETE individual sub-nodes without having a full
local copy of the DB.
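
As a rough illustration of sub-node access, the Firebase REST API addresses a node by appending `.json` to its path. The sketch below only builds a request and does not send it; `DB_URL` is a made-up placeholder (the real database URL is not given here), `db_request` is a hypothetical helper, and authentication is left out entirely.

```python
import json
import urllib.request

# Hypothetical database URL, used only for illustration.
DB_URL = "https://example-ci.firebaseio.com"


def db_request(path, method="GET", payload=None):
    """Builds (but does not send) a REST request for one sub-node of the DB."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    url = f"{DB_URL}/{path.strip('/')}.json"
    req = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# PATCH merges the given keys into the node without touching its siblings.
req = db_request("/ci/jobs_queued", method="PATCH", payload={"some-job-id": 0})
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call; that step is omitted because it needs a real database and credentials.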

```bash
/ci
    # For post-submit jobs.
    /branches
        /master-20190626000853
        # ┃     ┗━ Committer-date of the HEAD of the branch.
        # ┗━ Branch name
        {
            author: "primiano@google.com"
            rev: "0552edf491886d2bb6265326a28fef0f73025b6b"
            subject: "Cloud-based CI"
            time_committed: "2019-07-06T02:35:14Z"
            jobs:
            {
                20190708153242--branches-master-20190626000853--android-...: 0
                20190708153242--branches-master-20190626000853--linux-...:  0
                ...
            }
        }
        /master-20190701235742 {...}

    # For pre-submit jobs.
    /cls
        /1000515-65
        {
            change_id:    "platform%2F...~I575be190"
            time_queued:  "2019-07-08T15:32:42Z"
            time_ended:   "2019-07-08T15:33:25Z"
            revision_id:  "18c2e4d0a96..."
            wants_vote:   true
            voted:        true
            jobs: {
                20190708153242--cls-1000515-65--android-clang:  0
                ...
                20190708153242--cls-1000515-65--ui-clang:       0
            }
        }
        /1000515-66 {...}
        ...
        /1011130-3 {...}

    /cls_pending
       # Effectively this is an array of pending CLs that we might need to
       # vote on at the end. Only the keys matter, the values have no
       # semantics and are always 0.
       /1000515-65: 0

    /jobs
        /20190708153242--cls-1000515-65--android-clang-arm-debug:
        #  ┃               ┃             ┗━ Job type.
        #  ┃               ┗━ Path of the CL or branch object.
        #  ┗━ Datetime when the job was created.
        {
            src:          "cls/1000515-66"
            status:       "QUEUED"
                          "STARTED"
                          "COMPLETED"
                          "FAILED"
                          "TIMED_OUT"
                          "CANCELLED"
                          "INTERRUPTED"
            time_ended:   "2019-07-07T12:47:22Z"
            time_queued:  "2019-07-07T12:34:22Z"
            time_started: "2019-07-07T12:34:25Z"
            type:         "android-clang-arm-debug"
            worker:       "zqz2-worker-2"
        }
        /20190707123422--cls-1000515-66--android-clang-arm-rel {...}

    /jobs_queued
        # Effectively this is an array. Only the keys matter, the values
        # have no semantics and are always 0.
        /20190708153242--cls-1000515-65--android-clang-arm-debug: 0

    /jobs_running
        # Effectively this is an array. Only the keys matter, the values
        # have no semantics and are always 0.
        /20190707123422--cls-1000515-66--android-clang-arm-rel: 0

    /logs
        /20190707123422--cls-1000515-66--android-clang-arm-rel
            /00a053-0000: "+ chmod 777 /ci/cache /ci/artifacts"
            # ┃      ┗━ Monotonic counter to establish total order on log lines
            # ┃         retrieved within the same read() batch.
            # ┃
            # ┗━ Hex-encoded timestamp, relative since start of test.
            /00a053-0001: "+ chown perfetto.perfetto /ci/ramdisk"
            ...

```
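
The key formats annotated in the tree above can be decoded with a few lines of Python. This is a sketch based only on the layout shown here: `parse_job_id` and `parse_log_key` are hypothetical helpers, the timestamp unit of log keys is not specified in this document, and the counter is kept as a string because its encoding is not stated either.

```python
from datetime import datetime


def parse_job_id(job_id):
    """Splits '<datetime>--<src>--<type>' into its three parts."""
    created, src, job_type = job_id.split("--")
    return {
        "created": datetime.strptime(created, "%Y%m%d%H%M%S"),
        # 'cls-1000515-65' -> 'cls/1000515-65' (path of the CL or branch node).
        "src": src.replace("-", "/", 1),
        "type": job_type,
    }


def parse_log_key(key):
    """Splits '<hex timestamp>-<counter>' into (int timestamp, counter string)."""
    ts_hex, counter = key.split("-")
    return int(ts_hex, 16), counter


job = parse_job_id("20190708153242--cls-1000515-65--android-clang-arm-debug")
print(job["src"], job["type"])
print(parse_log_key("00a053-0001"))
```

Because both parts of a log key are zero-padded, sorting the keys as plain strings is expected to reproduce the order in which the lines were emitted, assuming the padding width stays fixed.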

# Sequence Diagram

This is what happens, in order, on a worker instance from boot to the test run.

```bash
make -C /infra/ci worker-start
┗━ gcloud start ...

[GCE] # From /infra/ci/worker/gce-startup-script.sh
docker run worker-1 ...
...
docker run worker-N ...

[worker-X] # From /infra/ci/worker/Dockerfile
┗━ /infra/ci/worker/worker.py
  ┗━ docker run sandbox-X ...

[sandbox-X] # From /infra/ci/sandbox/Dockerfile
┗━ /infra/ci/sandbox/init.sh
  ┗━ /infra/ci/sandbox/testrunner.sh
    ┣━ git fetch refs/changes/...
    ┇  ...
    ┇  # This env var is passed by the test definition
    ┇  # specified in /infra/ci/config.py .
    ┗━ $PERFETTO_TEST_SCRIPT
       ┣━ # Which is one of these:
       ┣━ /test/ci/android_tests.sh
       ┣━ /test/ci/fuzzer_tests.sh
       ┣━ /test/ci/linux_tests.sh
       ┗━ /test/ci/ui_tests.sh
          ┣━ ninja ...
          ┗━ out/dist/{unit,integration,...}test
```

### [gce-startup-script.sh](/infra/ci/worker/gce-startup-script.sh)

- It is run once per GCE VM, at (re)boot.
- It prepares the tmpfs mountpoint for the shared ccache.
- It wipes the SSD scratch disk for the build artifacts.
- It pulls the latest {worker, sandbox} container images from
  the Google Cloud Container Registry.
- It sets up Docker and `iptables` (for the sandboxed network).
- It starts `N` worker containers in Docker.

### [worker.py](/infra/ci/worker/worker.py)

- It polls the DB to retrieve a job.
- When a job is retrieved, it starts a sandbox container.
- It streams the container stdout/stderr to the DB.
- It uploads the build artifacts to GCS.

### [testrunner.sh](/infra/ci/sandbox/testrunner.sh)

- It is pinned in the container image. It does NOT depend on the particular
  revision being tested.
- It checks out the repo at the revision specified (by the Controller) in the
  job config pulled from the DB.
- It sets up ccache.
- It deals with caching of buildtools/.
- It runs the test script specified in the job config from the checkout.

### [{android,fuzzer,linux,ui}_tests.sh](/test/ci/linux_tests.sh)

- They are NOT pinned in the container and are run from the checked-out revision.
- They finally build and run the tests.

## Playbook

### Frontend (JS/HTML/CSS) changes

Test locally with `make -C infra/ci/frontend test`

Deploy with `make -C infra/ci/frontend deploy`

### Controller changes

Deploy with `make -C infra/ci/controller deploy`

It is possible to test locally via `make -C infra/ci/controller test`,
but this involves:

- Manually stopping the production AppEngine instance via the Cloud Console
  (stopping via the `gcloud` CLI doesn't seem to work, b/136828660).
- Downloading the testing service credentials `test-credentials.json`
  (they are in the internal Team drive).

### Worker/Sandbox changes

1. Build and push the new Docker containers with:

   `make -C infra/ci build push`

2. Restart the GCE instances, either manually or via

   `make -C infra/ci restart-workers`

## Security considerations

- Both the Firebase DB and the gs://perfetto-artifacts GCS bucket are
  world-readable, and writable by the GAE and GCE service accounts.

- The GAE service account also has the ability to log into Gerrit using a
  dedicated gmail.com account. The GCE service account doesn't.

- Overall, no account in this project has any interesting privilege:
  - The Gerrit account used for commenting on CLs is just a random gmail account
    and has no special voting power.
  - The service accounts of GAE and GCE don't have any special capabilities
    outside of the CI project itself.

- This CI deals only with functional and performance testing and doesn't deal
  with any sort of continuous deployment.

- Presubmit jobs are only triggered if at least one of the following is true:
  - The owner of the CL is a @google.com account.
  - The user that applied the Presubmit-Ready label is a @google.com account.

- Sandboxes are not too hard to escape (Docker is the only boundary) and can
  pollute each other via the shared ccache.

- As such, neither pre-submit nor post-submit build artifacts are considered
  trusted. They are only used for establishing functional correctness and
  performance regression testing.

- Binaries built by the CI are not run on any machines outside of the
  CI project. They are deliberately not downloadable.

- The only build artifacts that are retained (for up to 30 days) and uploaded to
  the GCS bucket are the UI artifacts. This is solely for the sake of getting
  visual previews of the HTML changes.

- UI artifacts are served from a different origin (the GCS per-bucket API) than
  the production UI.
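
The presubmit trigger rule listed above (CL owner or label applier must be a @google.com account) can be expressed as a small predicate. This is a hedged sketch rather than the Controller's real check: `should_trigger_presubmit` is a hypothetical name, and matching on the email suffix is an assumption about how accounts are identified.

```python
def should_trigger_presubmit(cl_owner, label_applier=None):
    """True if the CL owner or the Presubmit-Ready applier is @google.com."""

    def is_google(email):
        # Assumption: accounts are identified by their email address.
        return email is not None and email.lower().endswith("@google.com")

    return is_google(cl_owner) or is_google(label_applier)


print(should_trigger_presubmit("someone@example.com", "reviewer@google.com"))
```

Matching on the full `@google.com` suffix (including the `@`) avoids accepting look-alike domains such as `user@notgoogle.com`.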