
Rename etl_test_count and update it for unsuccessful scamper parsing #1053

Merged
merged 13 commits into from
Feb 23, 2022

Conversation

cristinaleonr
Contributor

@cristinaleonr cristinaleonr commented Feb 8, 2022

See #1052 for more details.


@coveralls
Collaborator

coveralls commented Feb 8, 2022

Pull Request Test Coverage Report for Build 7188

  • 17 of 48 (35.42%) changed or added relevant lines in 15 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.07%) to 64.255%

Changes Missing Coverage:

File | Covered Lines | Changed/Added Lines | %
parser/hopannotation1.go | 1 | 2 | 50.0%
parser/ndt5_result.go | 1 | 2 | 50.0%
parser/ndt7_result.go | 1 | 2 | 50.0%
parser/parser.go | 0 | 1 | 0.0%
row/row.go | 0 | 1 | 0.0%
parser/ndt_meta.go | 1 | 3 | 33.33%
parser/tcpinfo.go | 2 | 4 | 50.0%
parser/disco.go | 1 | 4 | 25.0%
parser/ss.go | 1 | 6 | 16.67%
parser/ndt.go | 1 | 8 | 12.5%

Files with Coverage Reduction:

File | New Missed Lines | %
active/active.go | 2 | 90.63%

Totals:

Change from base Build 7186: +0.07%
Covered Lines: 3854
Relevant Lines: 5998

💛 - Coveralls

Contributor

@stephen-soltesz stephen-soltesz left a comment


Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)


parser/scamper1.go, line 118 at r1 (raw file):

	if err != nil {
		date := fileMetadata["date"].(civil.Date)
		if legacyScamperEnd.Before(date) {

Does this change how TaskTotal (etl_task_total) records the error?

If we know that this case is "invalid" we want to exclude it from the total task accounting. The basic form of the SLI equation is "good/valid events" -- our alerts are looking at "bad/valid events" but that's equivalent if we exclude invalid events. - https://www.coursera.org/lecture/site-reliability-engineering-slos/the-sli-equation-qU8h2

There may need to be an update to the worker.go also to treat some errors differently, e.g. define a base error like ErrIsInvalid then make the TaskTotal accounting conditional on errors.Is(ErrIsInvalid). There may be other ways too.

Contributor Author

@cristinaleonr cristinaleonr left a comment


Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @stephen-soltesz)


parser/scamper1.go, line 118 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Does this change how TaskTotal (etl_task_total) records the error?

If we know that this case is "invalid" we want to exclude it from the total task accounting. The basic form of the SLI equation is "good/valid events" -- our alerts are looking at "bad/valid events" but that's equivalent if we exclude invalid events. - https://www.coursera.org/lecture/site-reliability-engineering-slos/the-sli-equation-qU8h2

There may need to be an update to the worker.go also to treat some errors differently, e.g. define a base error like ErrIsInvalid then make the TaskTotal accounting conditional on errors.Is(ErrIsInvalid). There may be other ways too.

Oh, that's a good point. I've excluded it from the TaskTotal metric.


task/task.go, line 160 at r2 (raw file):

		// Shouldn't have any of these, as they should be handled in ParseAndInsert.
		if loopErr != nil {
			metrics.TaskTotal.WithLabelValues(

Remove double counting of task errors. These are caught by the caller (worker.go).

Contributor

@stephen-soltesz stephen-soltesz left a comment


Reviewed 1 of 3 files at r1, 3 of 5 files at r2.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)


parser/scamper1_test.go, line 94 at r2 (raw file):

			err = n.ParseAndInsert(meta, file, data)
			if err.Error() != tt.want.Error() {

When possible, prefer using the errors.Is pattern by wrapping errors. Because etl has a long legacy this may not be easy or possible. So feel free to read this note as a nice-to-have for the future. In general comparing error strings is a little more brittle.


worker/worker.go, line 229 at r2 (raw file):

	if err != nil {
		if !errors.Is(err, parser.ErrIsInvalid) {
			metrics.TaskTotal.WithLabelValues(path.DataType, "TaskError").Inc()

Hmm, I'm a little confused. Does this mean that if any file inside a task archive fails to parse the entire task is recorded as an error?

And, because failfast == true in tsk.ProcessAllTests(true), does any bad file stop the archive from processing the rest?

Gardener has a Job (date) which produces many parser Tasks (archives), each of which contains many files. Before this moment I was thinking TaskTotal was counting files as "ok" or failed... like what percentage of valid files are parsing successfully... This makes better sense of how the pre-2019 error rate was ~20%. I'm wondering if ProcessAllTests should be returning two numbers: total valid files, and either files that succeeded or errored?

I'm also wondering if the SLIs based on etl_task_total are what I thought they were.

Contributor Author

@cristinaleonr cristinaleonr left a comment


Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @stephen-soltesz)


parser/scamper1_test.go, line 94 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

When possible, prefer using the errors.Is pattern by wrapping errors. Because etl has a long legacy this may not be easy or possible. So feel free to read this note as a nice-to-have for the future. In general comparing error strings is a little more brittle.

This has been removed.


worker/worker.go, line 229 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Hmm, I'm a little confused. Does this mean that if any file inside a task archive fails to parse the entire task is recorded as an error?

And, because failfast == true in tsk.ProcessAllTests(true), does any bad file stop the archive from processing the rest?

Gardener has a Job (date) which produces many parser Tasks (archives), each of which contains many files. Before this moment I was thinking TaskTotal was counting files as "ok" or failed... like what percentage of valid files are parsing successfully... This makes better sense of how the pre-2019 error rate was ~20%. I'm wondering if ProcessAllTests should be returning two numbers: total valid files, and either files that succeeded or errored?

I'm also wondering if the SLIs based on etl_task_total are what I thought they were.

Right, TaskTotal is counting the number of tasks, not individual files, and it marks the entire task as failed if any of the files cannot be parsed.

As for the failfast option, it seems to come from this change #906 and it does stop the processing when failfast == true.

If we want to base the failure rate on the number of individual files, we can use the TestCount metric for the SLI instead. I have renamed it to TestTotal, matching TaskTotal. It also makes the proportion of errors (even pre-scamper1) small enough that we don't have to skip them: https://grafana.mlab-sandbox.measurementlab.net/d/q4MrNzh7k/pipeline-slis?orgId=1&viewPanel=4&editPanel=4.

@cristinaleonr cristinaleonr changed the title Filter out legacy errors from scamper1 parsing errors Rename etl_test_count and update it for unsuccessful scamper parsing Feb 22, 2022
Contributor

@stephen-soltesz stephen-soltesz left a comment


Reviewed 19 of 22 files at r4.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr and @gfr10598)


parser/scamper1.go, line 111 at r4 (raw file):

	scamperOutput, err := parser.ParseTraceroute(rawContent)
	if err != nil {
		metrics.TestTotal.WithLabelValues(p.TableName(), scamper1, err.Error()).Inc()

What is the content of this error? If there are variable strings (e.g. filenames, pids, etc.) this may be a high-cardinality value, which is not suitable for a Prometheus metric label. Every unique set of metric label values creates a unique timeseries that Prometheus must index. An explosion of these values can create higher-than-desired overhead for the Prometheus server and queries to the server.

Also, I tried to look in github.com/m-lab/traceroute-caller/parser but could not find this function. Where is it?


worker/worker.go, line 229 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

Right, TaskTotal is counting the number of tasks, not individual files, and it marks the entire task as failed if any of the files cannot be parsed.

As for the failfast option, it seems to come from this change #906 and it does stop the processing when failfast == true.

If we want to base the failure rate on the number of individual files, we can use the TestCount metric instead for the SLI. I have renamed it to TestTotal like we did before. It also makes the proportion of errors (even pre-scamper1) small enough that we don't have to skip them https://grafana.mlab-sandbox.measurementlab.net/d/q4MrNzh7k/pipeline-slis?orgId=1&viewPanel=4&editPanel=4.

Okay - I have no objection to the name change.

Okay - I'm still not sure we want failfast behavior in the v2 pipeline. Especially if early scamper1 data triggers it. If only a single "bad" file is being counted per archive then the ratio may indeed be low, but it may not be representative.

@gfr10598 do you recall why you added failfast=true for the v2 pipeline?

@cristinaleonr - proposal - revert the scamper1 changes from this PR so that we can commit the name change, then resume the consideration of how to address scamper1 specifically. It doesn't make sense to me that we're not considering the known invalid date periods any more. I'm considering these things quickly so if I'm missing something please help me see.

Contributor Author

@cristinaleonr cristinaleonr left a comment


Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @gfr10598 and @stephen-soltesz)


parser/scamper1.go, line 111 at r4 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

What is the content of this error? If there are variable strings (e.g. filenames, pids, etc.) this may be a high-cardinality value, which is not suitable for a Prometheus metric label. Every unique set of metric label values creates a unique timeseries that Prometheus must index. An explosion of these values can create higher-than-desired overhead for the Prometheus server and queries to the server.

Also, I tried to look in github.com/m-lab/traceroute-caller/parser but could not find this function. Where is it?

These errors come from https://github.com/m-lab/traceroute-caller/blob/874dfadcc9fa5c2b9225f6a60e031f30eacf5229/parser/parser.go#L10.

The TRC code was refactored and the ParseTraceroute() function was renamed to ParseRawData(). We will have to update our code when we pull the latest version of TRC.

As discussed, we will not change the metric in Prometheus at this time.


worker/worker.go, line 229 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Okay - I have no objection to the name change.

Okay - I'm still not sure we want failfast behavior in the v2 pipeline. Especially if early scamper1 data triggers it. If only a single "bad" file is being counted per archive then the ratio may indeed be low, but it may not be representative.

@gfr10598 do you recall why you added failfast=true for the v2 pipeline?

@cristinaleonr - proposal - revert the scamper1 changes from this PR so that we can commit the name change, then resume the consideration of how to address scamper1 specifically. It doesn't make sense to me that we're not considering the known invalid date periods any more. I'm considering these things quickly so if I'm missing something please help me see.

As discussed offline, let's just rename the metric for now (and make sure to also update it for failed scamper1 tests).

Specifically, the plan is to:

  • Change the metric name (this PR).
  • Export the errTracerouteFile error in TRC.
  • Update etl to use the latest version of TRC and filter out errTracerouteFile errors for legacy scamper parsing.
  • Investigate the failfast behavior.

Contributor

@stephen-soltesz stephen-soltesz left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @stephen-soltesz)


worker/worker.go, line 229 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

As discussed offline, let's just rename the metric for now (and make sure to also update it for failed scamper1 tests).

Specifically, the plan is to:

  • Change the metric name (this PR).
  • Export the errTracerouteFile error in TRC.
  • Update etl to use the latest version of TRC and filter out errTracerouteFile errors for legacy scamper parsing.
  • Investigate the failfast behavior.

Thank you!

@cristinaleonr cristinaleonr merged commit 56a2669 into master Feb 23, 2022
@cristinaleonr cristinaleonr deleted the sandbox-cristinaleon-filter-out-legacy-errors branch February 23, 2022 00:10