-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add errors data to MethodMeta message #10
base: main
Are you sure you want to change the base?
Conversation
… timeouts, as well as any agent related errors we faced while creating a witness
proto/api_spec/method.proto
Outdated
// Represents the case where the client timed out waiting for a response from the server. | ||
CLIENT_TIMEOUT = 2; | ||
|
||
// Represents the case where the server timed out before it could send the response back. | ||
SERVER_TIMEOUT = 3; | ||
|
||
// Represents when one of the parties decides to abruptly terminate the connection. | ||
// A reset is usually indicative of a problem, similar to a timeout, but the underlying causes can be different. | ||
CONNECTION_RESET = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these three can be combined into two, or split into four, but they're a bit ambiguous right now.
We don't really know that the cause of a early close is, it might be a timeout or it might be a crash. So I think we could do CLIENT_CLOSED and SERVER_CLOSED and that covers all the cases, without the need for CONNECTION_RESET.
At the TCP protocol level we could distinguish between a normal FIN shutdown and an actual RST, but RST should be rare and I don't think it's necessarily worth calling them out separately. (RST tends to indicate that the TCP state machines are not in sync, for example if a machine rebooted and came back.) But the alternative is CLIENT_FIN, SERVER_FIN, CLIENT_RST, SERVER_RST to cover the full spectrum of what we could observe.
} | ||
|
||
errorType type = 1; | ||
string message = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to think a bit about whether we want to store this unstructured field.
Alternatives are:
- Put it in the witness metadata (the JSON wrapper we use for upload) or client telemetry.
- Use only structured data in witnesses.
The danger is that if we actually start relying on it then it becomes a fixed format we have to respect, and if we don't rely on it then what is the point of paying for it?
Do you have any thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean when you say witness metadata. Do you mean somewhere in the witness with info struct?
type witnessWithInfo struct {
// The name of the interface on which this witness was captured.
netInterface string
srcIP net.IP // The HTTP client's IP address.
srcPort uint16 // The HTTP client's port number.
dstIP net.IP // The HTTP server's IP address.
dstPort uint16 // The HTTP server's port number.
observationTime time.Time
id akid.WitnessID
requestEnd time.Time
responseStart time.Time
witness *pb.Witness
}
I was also skeptical about adding this field. I kept it to pair it up with the OTHER
errorType case where we don't know what exactly went wrong, but we do have an error message that can be used to display somewhere (to us or to the user). I am open for the option of omitting it completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the structure I was thinking of. We do not have clear design guidelines about what is "info" and what belongs in the IR itself. I think I erred on my very first project by putting processing latency in the IR, when it would have been fine in witnessWithInfo instead.
I do think the error should be part of the witness, so that the witness clearly records its state, but I'm not 100% sure I'm not making the same error again. :)
We can leave the message
field as-is for now, but leave a note suggesting additional fields if any structured data needs to be transmitted.
…rver closed types
JIRA: https://postmanlabs.atlassian.net/browse/POA-155
Confluence: https://postmanlabs.atlassian.net/l/cp/EtJHwV3u
Why
Right now if we fail to capture a valid response, the request becomes a “partial witness” and is not included in the backend timeline. There are several possible causes:
This means that the error rates we are giving to the user have a bias: they only show HTTP requests that successfully got a 4xx or 5xx response as errors, not those that failed because of another problem, or because the agent failed to interpret the response.
What
This PR adds support for describing these kinds of errors in the Method message.
Updated the MethodMeta message to have a new repeated
errors
field. Each error message contains two fieldsThe errors field is kept as repeated since there can be more than one reason for a failure to generate the complete witness.