Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misc. cleanups in hubble and monitor packages #14103

Merged
merged 3 commits into from
Nov 25, 2020

Conversation

tklauser
Copy link
Member

@tklauser tklauser commented Nov 20, 2020

Small godoc comment fixes and preparatory factoring for successive PRs. See individual commits for details.

The last commit might need closer inspection as I'm not sure whether this would break API:

    monitor/api: format agent start timestamp in RFC3339Nano format
    
    time.Time.String() may include a monotonic clock reading, e.g. when t is
    time.Now() which is e.g. the case for the agent start timestamp. The
    godoc for time.Time.String [1] states:
    
      If the time has a monotonic clock reading, the returned string
      includes a final field "m=±<value>", where value is the monotonic
      clock reading formatted as a decimal number of seconds.
    
      [1] https://golang.org/pkg/time/#Time.String
    
    The format including the monotonic clock reading is hard to decode
    because there is no predefined format string in the stdlib time package.
    Also, the monotonic clock reading isn't really useful for the agent
    start timestamp, the walltime clock should be enough. Thus, format the
    timestamp string in RFC3339Nano format which can easily be decoded using
    time.Parse(time.RFC3339Nano, t), e.g in the hubble API parser.

It might contain a monitorAPI.AgentNotifyMessage as emitted by the
*Message constructor funcs.

Signed-off-by: Tobias Klauser <[email protected]>
@tklauser tklauser added area/monitor Impacts monitoring, access logging, flow logging, visibility of datapath traffic. release-note/misc This PR makes changes that have no direct user impact. sig/hubble Impacts hubble server or relay labels Nov 20, 2020
@tklauser tklauser requested a review from a team as a code owner November 20, 2020 10:20
@tklauser tklauser requested a review from a team November 20, 2020 10:20
@tklauser tklauser requested review from ti-mo and glibsm November 20, 2020 10:20
@tklauser tklauser force-pushed the pr/tklauser/hubble-monitor-misc-cleanups branch from 2e3135e to 63a64bd Compare November 20, 2020 11:01
@tklauser
Copy link
Member Author

test-me-please

Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change with regard to newCompatFlow is not strictly equivalent for l7.Decode.
Previously, we had this (added comment for clarity):

flow := &pb.Flow{}
if err := p.l7.Decode(&logrecord, flow); err != nil { // l7.Decode sets `flow.Time` from logrecord.Timestamp
	return nil, err
}
flow.Time = ts // flow.Time is overwritten here with monitorEvent.Timestamp
flow.NodeName = monitorEvent.NodeName

With your change:

flow := newCompatFlow(ts, monitorEvent.NodeName) // flow.Time is set from monitorEvent.Timestamp
if err := p.l7.Decode(&logrecord, flow); err != nil { // l7.Decode overwrites `flow.Time` from logrecord.Timestamp
	return nil, err
}

Ultimately, I think these are equivalent. Did you actually verify?

@tklauser
Copy link
Member Author

tklauser commented Nov 20, 2020

The change with regard to newCompatFlow is not strictly equivalent for l7.Decode.

Oh, you're right. I overlooked that l7.Decode will set flow.Timestamp. If I read the code correctly, the timestamps are not equivalent. monitorEvent.Timestamp is set to the time the MonitorEvent is sent:

func (c *consumer) NotifyAgentEvent(typ int, message interface{}) {
c.sendEvent(&observerTypes.MonitorEvent{
Timestamp: time.Now(),
NodeName: nodeTypes.GetName(),
Payload: &observerTypes.AgentEvent{
Type: typ,
Message: message,
},
})
}

while logrecord.Timestamp is set when the pkg/proxy/logger.LogRecord is created:

lr := LogRecord{
LogRecord: accesslog.LogRecord{
Type: t,
ObservationPoint: observationPoint,
IPVersion: accesslog.VersionIPv4,
TransportProtocol: 6,
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
NodeAddressInfo: accesslog.NodeAddressInfo{},
},
endpointInfoRegistry: endpointInfoRegistry,
localEndpointInfo: getEndpointInfo(localEndpointInfoSource),
}

I'll amend the commit accordingly. In fact, let me just drop that commit. It's probably not worth deduplicating if we then need to overwrite flow.Time again for MessageTypeAccessLog.

Correct godoc comments for type AgentNotifyMessage and func StartMessage
to state the proper name.

Signed-off-by: Tobias Klauser <[email protected]>
@tklauser tklauser force-pushed the pr/tklauser/hubble-monitor-misc-cleanups branch from 63a64bd to 960e70e Compare November 20, 2020 14:01
Copy link
Contributor

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how you discovered this bug. (I mean how it only popped up now)

From godoc on (t Time) String():

// The returned string is meant for debugging; for a stable serialized
// representation, use t.MarshalText, t.MarshalBinary, or t.Format
// with an explicit format string.

☝️ Referencing this fact in the commit message might be useful for future travellers. :)

I had a hunch this was not an isolated case, but a quick glance at the codebase doesn't reveal any other such occurrences. 😰

pkg/hubble/parser/parser.go Outdated Show resolved Hide resolved
pkg/hubble/parser/parser.go Outdated Show resolved Hide resolved
pkg/monitor/api/types.go Show resolved Hide resolved
@tklauser
Copy link
Member Author

I'm curious how you discovered this bug. (I mean how it only popped up now)

I discovered it in a test I wrote for the successive PR, where I'm decoding the AgentNotifyMessage produced by func StartMessage and was wondering why I always got an empty time.Time when parsing TimeNotification.Time.

From godoc on (t Time) String():

// The returned string is meant for debugging; for a stable serialized
// representation, use t.MarshalText, t.MarshalBinary, or t.Format
// with an explicit format string.

point_up Referencing this fact in the commit message might be useful for future travellers. :)

The commit message mentions https://golang.org/pkg/time/#Time.String already but I agree that copying that part will make things easier to understand. Will amend.

I had a hunch this was not an isolated case, but a quick glance at the codebase doesn't reveal any other such occurrences. cold_sweat

Yeah, I was worried too but a quick grep showed nothing on my side either. Though, it might also be very non-obvious, e.g. when using something like

ts := time.Now()
fmt.Sprintf("%s", ts)

which will also invoke time.Time.String() 😞

time.Time.String() may include a monotonic clock reading, e.g. when t is
time.Now() which is e.g. the case for the agent start timestamp. The
godoc for time.Time.String [1] states:

  If the time has a monotonic clock reading, the returned string
  includes a final field "m=±<value>", where value is the monotonic
  clock reading formatted as a decimal number of seconds.

  [1] https://golang.org/pkg/time/#Time.String

The format including the monotonic clock reading is hard to decode
because there is no predefined format string in the stdlib time package.
Also, the monotonic clock reading isn't really useful for the agent
start timestamp, the walltime clock should be enough. Thus, format the
timestamp string in RFC3339Nano format which can easily be decoded using
time.Parse(time.RFC3339Nano, t), e.g in the hubble API parser.

Signed-off-by: Tobias Klauser <[email protected]>
@tklauser tklauser force-pushed the pr/tklauser/hubble-monitor-misc-cleanups branch from 960e70e to ff6cd2b Compare November 20, 2020 14:13
@tklauser tklauser requested a review from rolinh November 20, 2020 14:13
@tklauser tklauser requested a review from ti-mo November 20, 2020 14:13
@ti-mo
Copy link
Contributor

ti-mo commented Nov 20, 2020

which will also invoke time.Time.String() 😞

Yup, that's true. Counterpoint: fmt.Sprintf("%s", ts) should be quite rare. It's likely to be used to embed timestamps into a larger string, but it's highly unlikely for that string to then be fed back into a timestamp parser. :) We should be good. 👍

@rolinh
Copy link
Member

rolinh commented Nov 20, 2020

I'm curious how you discovered this bug. (I mean how it only popped up now) (...)

Is it me or is this a breaking change that was overlooked by the Go team? AFAIK, monotonic clock support was added in Go 1.9 which means that Go code compiled with Go < 1.9 would not suffer from this problem whereas with Go >= 1.9, the behavior of time.String() changed.

@tklauser About flow.Time, I actually don't know if it is meant to always be set from monitorEvent.Timestamp or if setting from logrecord.Timestamp would actually be the right thing to do as I'm lacking some context. @gandro or @glibsm would probably know.

@tklauser
Copy link
Member Author

I'm curious how you discovered this bug. (I mean how it only popped up now) (...)

Is it me or is this a breaking change that was overlooked by the Go team? AFAIK, monotonic clock support was added in Go 1.9 which means that Go code compiled with Go < 1.9 would not suffer from this problem whereas with Go >= 1.9, the behavior of time.String() changed.

AFAIK the Go team doesn't consider the String() method output for standard library type stable API. Thus, they change them for such cases. I seem to remember some other case in the os package, for example.

@tklauser About flow.Time, I actually don't know if it is meant to always be set from monitorEvent.Timestamp or if setting from logrecord.Timestamp would actually be the right thing to do as I'm lacking some context. @gandro or @glibsm would probably know.

Indeed, let's wait for their feedback and in case we decide that we want to set from logrecord.Timestamp I could resurrect said commit in a follow-up PR.

@tklauser
Copy link
Member Author

tklauser commented Nov 20, 2020

test-me-please

GKE cluster provisioning failure: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/3344

@tklauser
Copy link
Member Author

retest-gke

@gandro
Copy link
Member

gandro commented Nov 23, 2020

@tklauser About flow.Time, I actually don't know if it is meant to always be set from monitorEvent.Timestamp or if setting from logrecord.Timestamp would actually be the right thing to do as I'm lacking some context. @gandro or @glibsm would probably know.

Indeed, let's wait for their feedback and in case we decide that we want to set from logrecord.Timestamp I could resurrect said commit in a follow-up PR.

For some historical context: GetFlows originally returned a stream Flow and not a stream GetFlowsResponse. We only added the latter once we started experimenting with also returning control messages in GetFlows (such as the predecessor of NoteStatusEvent). The node_name and time were added later to GetFlowsResponse, because we wanted to avoid having to duplicate them in each event type.

The flowpb.Flow.Time field needs to be set because it is still exposed to the user in many places. The -o json output in the CLI being the most prominent one (which, again for historical reasons, displays flowpb.Flow, not observerpb.GetFlowsResponse objects. That's why we have -o jsonpb in the CLI).


Now for the question if GetFlowsResponse.Time is supposed to always be set from monitorEvent.Timestamp: Yes, I believe it kind of is. Ideally, it should be set from a monotonic clock.

The GetFlowsResponse.Time field is what the --since/--until ring buffer filter acts on. The current implementation of that filter assumes the timestamps in the ring buffer are sorted/monotonic, because it stops reading once it hits events older than since. monitorEvent.Timestamp.

Unfortunately, the current implementation is probably buggy, because the ptypes.Timestamp that we store in GetFlowsResponse.Time is not guaranteed to be monotonic. Unfortunately, rewriting the filters to not assume monotonic timestamps is also not trivial (each request with a time-based filter would have to read the full ring buffer which is inherentrly racy, as the end of the ring buffer is constantly overwritten - c.f. #13799).

So given these constraints, it it seems intentional that we do not populate GetFlowsResponse.Time from e.g. StartAgent.Time or AccessLog.Time, because these are more or less arbitrary timestamps, while monitorEvent.Timestamp is at least monotonic most of the time (in the absence of e.g. NTP jumps).


As for why Flow.Time is overwritten with the value from GetFlowsResponse.Time: This is probably done to not confuse the user in -o json, because the displayed timestamp there is Flow.Time, but the filter acted on GetFlowsResponse.Time.

Ultimately, I think having two different timestamps (Flow.Time/StartAgent.Time="this is when the event actually happened" vs. GetFlowsResponse.Time="this is when the event was observed by Hubble") is probably more correct, we just have to communicate it clearly to the user that the filters act on the latter timestamp.

@tklauser
Copy link
Member Author

tklauser commented Nov 23, 2020

retest-gke

previous failure: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/3358/ GKE provisioning failed

@nebril
Copy link
Member

nebril commented Nov 24, 2020

retest-gke

@tklauser
Copy link
Member Author

tklauser commented Nov 24, 2020

retest-gke

GKE still failed to provisioning: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/3377/

@tklauser
Copy link
Member Author

retest-gke

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 25, 2020
@errordeveloper errordeveloper merged commit 76e0cfe into master Nov 25, 2020
@errordeveloper errordeveloper deleted the pr/tklauser/hubble-monitor-misc-cleanups branch November 25, 2020 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/monitor Impacts monitoring, access logging, flow logging, visibility of datapath traffic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/hubble Impacts hubble server or relay
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants