Skip to content

Commit 331f9c0

Browse files
authored
Break glass v2 (#4171)
Implement the new break glass behavior for the monorepo world. Fixes flutter/flutter#159898 Fixes flutter/flutter#132811 Below is an updated copy of flutter/flutter#159898 (comment): # Pull request ## Two-state Merge Queue Guard The guard can only be in two states: - **Pending** (yellow): this is the initial state, and the guard stays in this state for as long as the pull request is not allowed to enter the merge queue. - **Success** (green): the guard enters this state when the infra decides that the PR is allowed to enter the merge queue. The state can only go from pending to success. There are no other state transitions. ## Deciding when the guard goes green Going from pending to green is the only transition we need to worry about. This is how we decide it: - **Normal case**: the normal workflow for landing a PR is for all the checks to go green. Once this happens, Cocoon closes the guard (by making it green). This will allow the `autosubmit` label to enqueue the PR, and it will allow the author to press "Merge When Ready". - **Emergency case**: the a PR must be landed despite failing checks (typically on a red tree status). The author adds the `emergency` label. Cocoon immediately unlocks the guard, ignoring any pending or failed checks. In conjunction with the `autosubmit` label, Cocoon will also automatically enqueue the PR after all tests pass even if the tree status is red. With an `emergency` Cocoon will also let the PR to jump the queue. However, if the PR must be landed right away, the author can use the "Merge When Ready" button manually as soon as Cocoon unlocks the merge guard. ## Explainer The above system makes everything simpler. There are fewer states (just pending and success), fewer transitions (just pending => success), and fewer situations to consider (normal and emergency). We don't need to do anything special w.r.t. the guard for the purposes of retrying failed flaky tests. It simply stays pending while the author is doing whatever is necessary to make the PR green enough to land: push fixes, retry check runs, rebase, get approvals, etc. Importantly, it covers all situations we need to handle in PR management. ### Why not have a third "failed" state? The guard's job is not to tell you whether any tests are failing. You can already tell which tests are failing by looking at the status of respective check runs (this may change in the future, but when that time comes we'll find a new visual cue). The guard's job is only to tell you whether your PR is allowed to be enqueued. Permission to proceed never "fails". It's either granted (green), or not yet granted (pending). Once granted the PR is typically enqueued or landed immediately, so performing any other state transitions after green is mostly meaningless. Green is the final state of the guard, that's all. The other reason for keeping the guard in two states is for simpler state management. Once green, the PR can immediately be enqueued. There's no transition from pending to failed, from failed to pending, or from failed to green. We can remove some of the existing Cocoon code around this, and we don't have to add anything new to support normal and emergency PR landing. # Merge group There are two possible outcomes for a merge group: - **Land**: all check runs are green => Cocoon completes the merge guard as success and GitHub lands the merge group. - **Fail**: some check runs failed => Cocoon fails the merge guard and GitHub pulls the merge group (and the corresponding PR) out of the queue.
1 parent f4ab691 commit 331f9c0

22 files changed

+872
-450
lines changed

app_dart/lib/src/model/firestore/ci_staging.dart

Lines changed: 51 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -88,7 +88,8 @@ class CiStaging extends Document {
8888
required String checkRun,
8989
required String conclusion,
9090
}) async {
91-
final logCrumb = 'markConclusion(${slug.owner}_${slug.name}_${sha}_$stage, $checkRun, $conclusion)';
91+
final changeCrumb = '${slug.owner}_${slug.name}_$sha';
92+
final logCrumb = 'markConclusion(${changeCrumb}_$stage, $checkRun, $conclusion)';
9293

9394
// Marking needs to happen while in a transaction to ensure `remaining` is
9495
// updated correctly. For that to happen correctly; we need to perform a
@@ -110,8 +111,10 @@ class CiStaging extends Document {
110111

111112
var remaining = -1;
112113
var failed = -1;
114+
var total = -1;
113115
bool valid = false;
114116
String? checkRunGuard;
117+
String? recordedConclusion;
115118

116119
late final Document doc;
117120

@@ -146,9 +149,15 @@ class CiStaging extends Document {
146149
}
147150
failed = maybeFailed;
148151

152+
final maybeTotal = int.tryParse(fields[kTotalField]?.integerValue ?? '');
153+
if (maybeTotal == null) {
154+
throw '$logCrumb: missing field "$kTotalField" for $transaction / ${doc.fields}';
155+
}
156+
total = maybeTotal;
157+
149158
// We will have check_runs scheduled after the engine was built successfully, so missing the checkRun field
150159
// is an OK response to have. All fields should have been written at creation time.
151-
final recordedConclusion = fields[checkRun]?.stringValue;
160+
recordedConclusion = fields[checkRun]?.stringValue;
152161
if (recordedConclusion == null) {
153162
log.info('$logCrumb: $checkRun not present in doc for $transaction / $doc');
154163
await docRes.rollback(RollbackRequest(transaction: transaction), kDatabase);
@@ -157,6 +166,8 @@ class CiStaging extends Document {
157166
remaining: remaining,
158167
checkRunGuard: null,
159168
failed: failed,
169+
summary: 'Check run "$checkRun" not present in $stage CI stage',
170+
details: 'Change $changeCrumb',
160171
);
161172
}
162173

@@ -207,7 +218,7 @@ class CiStaging extends Document {
207218
fields[checkRun] = Value(stringValue: conclusion);
208219
fields[kRemainingField] = Value(integerValue: '$remaining');
209220
fields[kFailedField] = Value(integerValue: '$failed');
210-
} on DetailedApiRequestError catch (e) {
221+
} on DetailedApiRequestError catch (e, stack) {
211222
if (e.status == 404) {
212223
// An attempt to read a document not in firestore should not be retried.
213224
log.info('$logCrumb: staging document not found for $transaction');
@@ -217,6 +228,15 @@ class CiStaging extends Document {
217228
remaining: -1,
218229
checkRunGuard: null,
219230
failed: failed,
231+
summary: 'Internal server error',
232+
details: '''
233+
Staging document not found for CI stage "$stage" for $changeCrumb. Got 404 from
234+
Firestore.
235+
236+
Error:
237+
${e.toString()}
238+
$stack
239+
''',
220240
);
221241
}
222242
// All other errors should bubble up and be retried.
@@ -239,6 +259,15 @@ class CiStaging extends Document {
239259
remaining: remaining,
240260
checkRunGuard: checkRunGuard,
241261
failed: failed,
262+
summary: valid ? 'All tests passed' : 'Not a valid state transition for $checkRun',
263+
details: valid
264+
? '''
265+
For CI stage $stage:
266+
Total check runs scheduled: $total
267+
Pending: $remaining
268+
Failed: $failed
269+
'''
270+
: 'Attempted to transition the state of check run $checkRun from "$recordedConclusion" to "$conclusion".',
242271
);
243272
}
244273

@@ -343,12 +372,16 @@ class StagingConclusion {
343372
final int remaining;
344373
final String? checkRunGuard;
345374
final int failed;
375+
final String summary;
376+
final String details;
346377

347378
const StagingConclusion({
348379
required this.result,
349380
required this.remaining,
350381
required this.checkRunGuard,
351382
required this.failed,
383+
required this.summary,
384+
required this.details,
352385
});
353386

354387
bool get isOk => result == StagingConclusionResult.ok;
@@ -366,8 +399,21 @@ class StagingConclusion {
366399
other.result == result &&
367400
other.remaining == remaining &&
368401
other.checkRunGuard == checkRunGuard &&
369-
other.failed == failed);
402+
other.failed == failed &&
403+
other.summary == summary &&
404+
other.details == details);
405+
406+
@override
407+
int get hashCode => Object.hashAll([
408+
result,
409+
remaining,
410+
checkRunGuard,
411+
failed,
412+
summary,
413+
details,
414+
]);
370415

371416
@override
372-
int get hashCode => Object.hashAll([result, remaining, checkRunGuard, failed]);
417+
String toString() =>
418+
'StagingConclusion("$result", "$remaining", "$checkRunGuard", "$failed", "$summary", "$details")';
373419
}

app_dart/lib/src/service/config.dart

Lines changed: 27 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -32,11 +32,37 @@ const String kDefaultBranchName = 'master';
3232
class Config {
3333
Config(this._db, this._cache);
3434

35-
/// Labels autosubmit looks for on pull requests
35+
/// When present on a pull request, instructs Cocoon to submit it
36+
/// automatically as soon as all the required checks pass.
3637
///
3738
/// Keep this in sync with the similar `Config` class in `auto_submit`.
3839
static const String kAutosubmitLabel = 'autosubmit';
3940

41+
/// When present on a pull request, allows it to land without passing all the
42+
/// checks, and jumps the merge queue.
43+
///
44+
/// Keep this in sync with the similar `Config` class in `auto_submit`.
45+
static const String kEmergencyLabel = 'emergency';
46+
47+
/// Validates that CI tasks were successfully created from the .ci.yaml file.
48+
///
49+
/// If this check fails, it means Cocoon failed to fully populate the list of
50+
/// CI checks and the PR/commit should be treated as failing.
51+
static const String kCiYamlCheckName = 'ci.yaml validation';
52+
53+
/// A required check that stays in pending state until a sufficient subset of
54+
/// checks pass.
55+
///
56+
/// This check is "required", meaning that it must pass before Github will
57+
/// allow a PR to land in the merge queue, or a merge group to land on the
58+
/// target branch (main or master).
59+
///
60+
/// IMPORTANT: the name of this task - "Merge Queue Guard" - must strictly
61+
/// match the name of the required check configured in the repo settings.
62+
/// Changing the name here or in the settings alone will break the PR
63+
/// workflow.
64+
static const String kMergeQueueLockName = 'Merge Queue Guard';
65+
4066
final DatastoreDB _db;
4167

4268
final CacheService _cache;
@@ -150,7 +176,6 @@ class Config {
150176

151177
// GitHub App properties.
152178
Future<String> get githubPrivateKey => _getSingleValue('githubapp_private_pem');
153-
Future<String> get overrideTreeStatusLabel => _getSingleValue('override_tree_status_label');
154179
Future<String> get githubPublicKey => _getSingleValue('githubapp_public_pem');
155180
Future<String> get githubAppId => _getSingleValue('githubapp_id');
156181
Future<Map<String, dynamic>> get githubAppInstallations async {

0 commit comments

Comments
 (0)