Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1588452: Don't crash TimingDistribution.start() prior to Glean.init #400

Merged
merged 4 commits into from
Oct 24, 2019

Conversation

mdboom
Copy link
Contributor

@mdboom mdboom commented Oct 21, 2019

TimingDistributionMetricType.start() doesn't really need a Glean object, since it doesn't matter whether upload is enabled or not to start the timer. It only matters later when we stop and record the sample, so we can just to that there only.

@mdboom mdboom force-pushed the timing-distribution-pre-initialization branch from 70f13de to 3419085 Compare October 21, 2019 17:44
Copy link
Member

@travis79 travis79 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it this was very similar to the fix for the dynamic labels...

@codecov-io
Copy link

codecov-io commented Oct 21, 2019

Codecov Report

Merging #400 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #400      +/-   ##
============================================
+ Coverage     76.25%   76.28%   +0.02%     
  Complexity      308      308              
============================================
  Files            95       95              
  Lines          5374     5371       -3     
  Branches        632      631       -1     
============================================
- Hits           4098     4097       -1     
+ Misses          807      806       -1     
+ Partials        469      468       -1
Impacted Files Coverage Δ Complexity Δ
...n/java/mozilla/telemetry/glean/rust/LibGleanFFI.kt 66.66% <ø> (ø) 0 <0> (ø) ⬇️
glean-core/src/metrics/timing_distribution.rs 61.53% <100%> (+0.84%) 0 <0> (ø) ⬇️
...e/ios/Glean/Metrics/TimingDistributionMetric.swift 98.57% <100%> (ø) 0 <0> (ø) ⬇️
...etry/glean/private/TimingDistributionMetricType.kt 85.71% <100%> (-0.26%) 14 <0> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ac00bf...c393948. Read the comment docs.

Copy link
Member

@badboy badboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has slight semantic changes though:
Right now if upload is disabled and you try to use a timing distribution, no timer is started. When you enable upload and then stop, no value is recorded (as nothing was started).

With this change, when upload is disabled and you start, a timer is started.
When Glean upload is enabled and then stop is called, you get the whole timespan from before Glean upload was enabled up until the stop.

@mdboom
Copy link
Contributor Author

mdboom commented Oct 22, 2019

Indeed, there are semantic changes, but I think for the better. I think upload enabled should control only the uploading, not the (admittedly needless) handling of timers, otherwise the semantics lead to non-error errors.

Before this change:

(upload enabled)
1) start
2) upload disabled
3) stop -- stops timer, but doesn't record
4) start
5) upload enabled
6) stop -- error! There's no timer

With this change:

(upload enabled)
1) start
2) upload disabled
3) stop -- stops timer, but doesn't record
4) start
5) upload enabled
6) stop -- records time since 4

And in the scenario you suggest:

With this change, when upload is disabled and you start, a timer is started. When Glean upload is enabled and then stop is called, you get the whole timespan from before Glean upload was enabled up until the stop.

This seems correct to me. Whatever was being timed is correct. Before this change, it would record an error. I guess you're thinking that times that start in the opt-out window but end in the opt-in window shouldn't be recorded? I would have thought only timings that start and end in the opt-out window shouldn't be recorded, which is true both before and after this change.

@badboy
Copy link
Member

badboy commented Oct 22, 2019

I guess you're thinking that times that start in the opt-out window but end in the opt-in window shouldn't be recorded? I would have thought only timings that start and end in the opt-out window shouldn't be recorded, which is true both before and after this change.

And I guess this is the point which I'm unsure about, but I also have no opinion in either way.

I'm calling in a @chutten

@chutten
Copy link
Contributor

chutten commented Oct 22, 2019

Y'see, this is another one of those situations where separating canRecord from canUpload would be handy. Right now the Glean SDK assumes if !canUpload it means !canRecord which causes problems because canUpload can change for reasons other than an app telling us to stop recording (most notably between app start and Glean SDK init). We could assume canRecord is implied and always true, but then we have to change the semantics of all the metric types to properly record and report values even when !canUpload.

Though if we split them we'd run into problems where, until init happens, canRecord would be assumed true, but we'd then have to dump any pre-init data if init tells us canRecord should be false. But I digress.

To the case in question, I think mdboom's approach is the better one for developer ergonomics. Some component with a timing distribution doesn't know if the app has set !canUpload on them at any time. It seems unfair to log an error when something outside of the component's control causes what would be a perfectly valid pair of operations (start then stop) to be invalid.

@badboy
Copy link
Member

badboy commented Oct 22, 2019

most notably between app start and Glean SDK init

That's ... not the entire truth. We defer actual recording (and the isUploadEnabled check) to after the init, we do store the value of time-related functionality before the init (e.g. in timespan or timing distribution), in memory only though.

@badboy
Copy link
Member

badboy commented Oct 22, 2019

To the case in question, I think mdboom's approach is the better one for developer ergonomics. Some component with a timing distribution doesn't know if the app has set !canUpload on them at any time.

Fair enough, we should just make sure we properly document this in at least the changelog I assume?
Need to reread the docs on timing distribution to see what we promised.

@mdboom mdboom requested a review from badboy October 23, 2019 15:48
@mdboom
Copy link
Contributor Author

mdboom commented Oct 23, 2019

I've updated the timing distribution docs on the exact semantics wrt upload enabled. Hopefully that clears up the confusion...

@mdboom mdboom force-pushed the timing-distribution-pre-initialization branch from 38433ef to 1ab9c16 Compare October 23, 2019 15:56
Copy link
Member

@badboy badboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now nothing is holding back my r+ anymore.
I suggest a slight rewording in the docs though.

And now that we have a release (and future releases!) we should get into the habit of documenting important stuff in the CHANGELOG.md

@@ -7,6 +7,10 @@ To measure the distribution of single timespans, see [Timespans](timespan.md). T
Timing distributions are recorded in a histogram where the buckets have an exponential distribution, specifically with 8 buckets for every power of 2.
This makes them suitable for measuring timings on a number of time scales without any configuration.

The effect of Glean's "upload enabled" flag occurs when a timing is stopped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would simplify that a bit. Calling it "flag" and such seems too wordy.

What about:

Timings always span the full length between `start` and `stopAndAccumulate`.
If the Glean upload is disabled when calling `start`, the timer is still started.
If the Glean upload is disabled at the time `stopAndAccumulate` is called, nothing is recorded.

Maybe the sentence about start could even be left out.

@mdboom mdboom merged commit 345288b into mozilla:master Oct 24, 2019
mdboom added a commit to mdboom/glean.rs that referenced this pull request Oct 24, 2019
I missed this as part of mozilla#400.  It is now fully covered by the tests added
there.

Also \o/ last `@Ignore`d test!
mdboom added a commit that referenced this pull request Oct 24, 2019
* CLEANUP: Remove test for preinit timing distribution recording

I missed this as part of #400.  It is now fully covered by the tests added
there.

Also \o/ last `@Ignore`d test!

* Remove unused imports
@mdboom mdboom deleted the timing-distribution-pre-initialization branch April 14, 2020 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants