
Rename etl_test_count and update it for unsuccessful scamper parsing #1053

Merged
merged 13 commits into from
Feb 23, 2022

Conversation

cristinaleonr
Contributor

@cristinaleonr cristinaleonr commented Feb 8, 2022

See #1052 for more details.


@coveralls
Collaborator

coveralls commented Feb 8, 2022

Pull Request Test Coverage Report for Build 7188

  • 17 of 48 (35.42%) changed or added relevant lines in 15 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.07%) to 64.255%

Changes Missing Coverage:

File | Covered Lines | Changed/Added Lines | %
parser/hopannotation1.go | 1 | 2 | 50.0%
parser/ndt5_result.go | 1 | 2 | 50.0%
parser/ndt7_result.go | 1 | 2 | 50.0%
parser/parser.go | 0 | 1 | 0.0%
row/row.go | 0 | 1 | 0.0%
parser/ndt_meta.go | 1 | 3 | 33.33%
parser/tcpinfo.go | 2 | 4 | 50.0%
parser/disco.go | 1 | 4 | 25.0%
parser/ss.go | 1 | 6 | 16.67%
parser/ndt.go | 1 | 8 | 12.5%

Files with Coverage Reduction:

File | New Missed Lines | %
active/active.go | 2 | 90.63%

Totals:

Change from base Build 7186: +0.07%
Covered Lines: 3854
Relevant Lines: 5998

💛 - Coveralls

Contributor

@stephen-soltesz stephen-soltesz left a comment


Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)


parser/scamper1.go, line 118 at r1 (raw file):

	if err != nil {
		date := fileMetadata["date"].(civil.Date)
		if legacyScamperEnd.Before(date) {

Does this change how TaskTotal (etl_task_total) records the error?

If we know that this case is "invalid" we want to exclude it from the total task accounting. The basic form of the SLI equation is "good/valid events" -- our alerts are looking at "bad/valid events" but that's equivalent if we exclude invalid events. - https://www.coursera.org/lecture/site-reliability-engineering-slos/the-sli-equation-qU8h2

There may need to be an update to the worker.go also to treat some errors differently, e.g. define a base error like ErrIsInvalid then make the TaskTotal accounting conditional on errors.Is(ErrIsInvalid). There may be other ways too.

Contributor Author

@cristinaleonr cristinaleonr left a comment


Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @stephen-soltesz)


parser/scamper1.go, line 118 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Does this change how TaskTotal (etl_task_total) records the error?

If we know that this case is "invalid" we want to exclude it from the total task accounting. The basic form of the SLI equation is "good/valid events" -- our alerts are looking at "bad/valid events" but that's equivalent if we exclude invalid events. - https://www.coursera.org/lecture/site-reliability-engineering-slos/the-sli-equation-qU8h2

There may need to be an update to the worker.go also to treat some errors differently, e.g. define a base error like ErrIsInvalid then make the TaskTotal accounting conditional on errors.Is(ErrIsInvalid). There may be other ways too.

Oh, that's a good point. I've excluded it from the TaskTotal metric.


task/task.go, line 160 at r2 (raw file):

		// Shouldn't have any of these, as they should be handled in ParseAndInsert.
		if loopErr != nil {
			metrics.TaskTotal.WithLabelValues(

Remove double counting of task errors. These are caught by the caller (worker.go).

Contributor

@stephen-soltesz stephen-soltesz left a comment


Reviewed 1 of 3 files at r1, 3 of 5 files at r2.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr)


parser/scamper1_test.go, line 94 at r2 (raw file):

			err = n.ParseAndInsert(meta, file, data)
			if err.Error() != tt.want.Error() {

When possible, prefer using the errors.Is pattern by wrapping errors. Because etl has a long legacy this may not be easy or possible. So feel free to read this note as a nice-to-have for the future. In general comparing error strings is a little more brittle.


worker/worker.go, line 229 at r2 (raw file):

	if err != nil {
		if !errors.Is(err, parser.ErrIsInvalid) {
			metrics.TaskTotal.WithLabelValues(path.DataType, "TaskError").Inc()

Hmm, I'm a little confused. Does this mean that if any file inside a task archive fails to parse the entire task is recorded as an error?

And, because failfast == true in tsk.ProcessAllTests(true), does any bad file stop the archive from processing the rest?

Gardener has a Job (date) which produces many parser Tasks (archives), each of which contains many files. Before this moment I was thinking TaskTotal was counting files as "ok" or failed... like what percentage of valid files are parsing successfully... This makes better sense of how the pre-2019 error rate was ~20%. I'm wondering if ProcessAllTests should be returning two numbers: total valid files, and either files that succeeded or errored?

I'm also wondering if the SLIs based on etl_task_total are what I thought they were.

Contributor Author

@cristinaleonr cristinaleonr left a comment


Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @stephen-soltesz)


parser/scamper1_test.go, line 94 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

When possible, prefer using the errors.Is pattern by wrapping errors. Because etl has a long legacy this may not be easy or possible. So feel free to read this note as a nice-to-have for the future. In general comparing error strings is a little more brittle.

This has been removed.


worker/worker.go, line 229 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Hmm, I'm a little confused. Does this mean that if any file inside a task archive fails to parse the entire task is recorded as an error?

And, because failfast == true in tsk.ProcessAllTests(true), does any bad file stop the archive from processing the rest?

Gardener has a Job (date) which produces many parser Tasks (archives), each of which contains many files. Before this moment I was thinking TaskTotal was counting files as "ok" or failed... like what percentage of valid files are parsing successfully... This makes better sense of how the pre-2019 error rate was ~20%. I'm wondering if ProcessAllTests should be returning two numbers: total valid files, and either files that succeeded or errored?

I'm also wondering if the SLIs based on etl_task_total are what I thought they were.

Right, TaskTotal is counting the number of tasks, not individual files, and it marks the entire task as failed if any of the files cannot be parsed.

As for the failfast option, it seems to come from this change #906 and it does stop the processing when failfast == true.

If we want to base the failure rate on the number of individual files, we can use the TestCount metric for the SLI instead. I have renamed it to TestTotal, matching TaskTotal. It also makes the proportion of errors (even pre-scamper1) small enough that we don't have to skip them: https://grafana.mlab-sandbox.measurementlab.net/d/q4MrNzh7k/pipeline-slis?orgId=1&viewPanel=4&editPanel=4.

@cristinaleonr cristinaleonr changed the title Filter out legacy errors from scamper1 parsing errors Rename etl_test_count and update it for unsuccessful scamper parsing Feb 22, 2022
Contributor

@stephen-soltesz stephen-soltesz left a comment


Reviewed 19 of 22 files at r4.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @cristinaleonr and @gfr10598)


parser/scamper1.go, line 111 at r4 (raw file):

	scamperOutput, err := parser.ParseTraceroute(rawContent)
	if err != nil {
		metrics.TestTotal.WithLabelValues(p.TableName(), scamper1, err.Error()).Inc()

What is the content of this error? If there are variable strings (e.g. filenames, pids, etc.) this may be a high-cardinality value, which is not suitable for a Prometheus metric label. Every unique set of metric label values creates a unique timeseries that Prometheus must index. An explosion of these values can create higher-than-desired overhead for the Prometheus server and queries to the server.

Also, I tried to look in github.com/m-lab/traceroute-caller/parser but could not find this function. Where is it?


worker/worker.go, line 229 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

Right, TaskTotal is counting the number of tasks, not individual files, and it marks the entire task as failed if any of the files cannot be parsed.

As for the failfast option, it seems to come from this change #906 and it does stop the processing when failfast == true.

If we want to base the failure rate on the number of individual files, we can use the TestCount metric instead for the SLI. I have renamed it to TestTotal like we did before. It also makes the proportion of errors (even pre-scamper1) small enough that we don't have to skip them https://grafana.mlab-sandbox.measurementlab.net/d/q4MrNzh7k/pipeline-slis?orgId=1&viewPanel=4&editPanel=4.

Okay - I have no objection to the name change.

Okay - I'm still not sure we want failfast behavior in the v2 pipeline. Especially if early scamper1 data triggers it. If only a single "bad" file is being counted per archive then the ratio may indeed be low, but it may not be representative.

@gfr10598 do you recall why you added failfast=true for the v2 pipeline?

@cristinaleonr - proposal - revert the scamper1 changes from this PR so that we can commit the name change, then resume the consideration of how to address scamper1 specifically. It doesn't make sense to me that we're not considering the known invalid date periods any more. I'm considering these things quickly so if I'm missing something please help me see.

Contributor Author

@cristinaleonr cristinaleonr left a comment


Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @gfr10598 and @stephen-soltesz)


parser/scamper1.go, line 111 at r4 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

What is the content of this error? If there are variable strings (e.g. filenames, pids, etc.) this may be a high-cardinality value, which is not suitable for a Prometheus metric label. Every unique set of metric label values creates a unique timeseries that Prometheus must index. An explosion of these values can create higher-than-desired overhead for the Prometheus server and queries to the server.

Also, I tried to look in github.com/m-lab/traceroute-caller/parser but could not find this function. Where is it?

These errors come from https://github.com/m-lab/traceroute-caller/blob/874dfadcc9fa5c2b9225f6a60e031f30eacf5229/parser/parser.go#L10.

The TRC code was refactored and the ParseTraceroute() function was renamed to ParseRawData(). We will have to update our code when we pull the latest version of TRC.

As discussed, we will not change the metric in Prometheus at this time.


worker/worker.go, line 229 at r2 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Okay - I have no objection to the name change.

Okay - I'm still not sure we want failfast behavior in the v2 pipeline. Especially if early scamper1 data triggers it. If only a single "bad" file is being counted per archive then the ratio may indeed be low, but it may not be representative.

@gfr10598 do you recall why you added failfast=true for the v2 pipeline?

@cristinaleonr - proposal - revert the scamper1 changes from this PR so that we can commit the name change, then resume the consideration of how to address scamper1 specifically. It doesn't make sense to me that we're not considering the known invalid date periods any more. I'm considering these things quickly so if I'm missing something please help me see.

As discussed offline, let's just rename the metric for now (and make sure to also update it for failed scamper1 tests).

Specifically, the plan is to:

  • Change the metric name (this PR).
  • Export the errTracerouteFile error in TRC.
  • Update etl to use the latest version of TRC and filter out errTracerouteFile errors for legacy scamper parsing.
  • Investigate the failfast behavior.

Contributor

@stephen-soltesz stephen-soltesz left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @stephen-soltesz)


worker/worker.go, line 229 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

As discussed offline, let's just rename the metric for now (and make sure to also update it for failed scamper1 tests).

Specifically, the plan is to:

  • Change the metric name (this PR).
  • Export the errTracerouteFile error in TRC.
  • Update etl to use the latest version of TRC and filter out errTracerouteFile errors for legacy scamper parsing.
  • Investigate the failfast behavior.

Thank you!

@cristinaleonr cristinaleonr merged commit 56a2669 into master Feb 23, 2022
@cristinaleonr cristinaleonr deleted the sandbox-cristinaleon-filter-out-legacy-errors branch February 23, 2022 00:10