=cluster update SWIM to 0.3.0, includes metrics #780
Conversation
Cool, now with deserialization metrics configured via props (debug mode). Doing serialization is harder and we'd need some baggage for it, to be honest...

TODO:
    var instrumentation: ActorInstrumentation!

    @usableFromInline
    let metrics: ActiveActorMetrics
it ends up being one dictionary pointer and one integer...
Slimming down the runtime is not something I'm focusing on right now, since the entire thing will change so much with concurrency.
    self.namingContext = ActorNamingContext()

    self.metrics = ActiveActorMetrics(system: system, address: address, props: props.metrics)
so props just selects which metrics we want to allocate
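A minimal sketch of that idea, assuming the swift-metrics package and using illustrative stand-in types (ActorMetricsProps and ActorMetricsSketch are not the library's real names): the props only decide which metric objects get allocated, so an actor that opts into nothing carries no metric instances at all.

```swift
import Metrics

// Illustrative stand-ins for the real MetricsProps / ActiveActorMetrics types.
enum ActorMetricKind: Hashable {
    case mailboxCount
    case messageProcessingTime
}

struct ActorMetricsProps {
    var group: String?
    var enabled: Set<ActorMetricKind> = []
}

struct ActorMetricsSketch {
    let mailboxCount: Gauge?
    let messageProcessingTime: Timer?

    init(props: ActorMetricsProps) {
        // The group, if set, becomes a dimension on every allocated metric.
        let dimensions: [(String, String)] = props.group.map { [("actor.group", $0)] } ?? []
        // Only allocate what the props asked for; everything else stays nil.
        self.mailboxCount = props.enabled.contains(.mailboxCount)
            ? Gauge(label: "actor.mailbox.count", dimensions: dimensions)
            : nil
        self.messageProcessingTime = props.enabled.contains(.messageProcessingTime)
            ? Timer(label: "actor.message.processing.time", dimensions: dimensions)
            : nil
    }
}
```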
    self.userMessages.enqueue(envelope)

    self.shell?._system?.metrics.recordMailboxMessageCount(Int(self.status.messageCount))
    self.shell?.metrics[gauge: .mailboxCount]?.record(oldStatus.messageCount + 1)
would want something more type-safe here, but it's annoying to write... so for now relying on "I know it's a gauge"
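For illustration, a rough sketch of what that "I know it's a gauge" convention amounts to (illustrative types, not the library's actual API): the key enum is shared across metric kinds, and the labeled subscript is the only place the expected metric type is documented.

```swift
import Metrics

enum ActorMetricID: Hashable {
    case mailboxCount           // backed by a Gauge
    case messageProcessingTime  // backed by a Timer
}

struct PerActorMetrics {
    var gauges: [ActorMetricID: Gauge] = [:]
    var timers: [ActorMetricID: Timer] = [:]

    // The argument label ("gauge:" / "timer:") encodes the caller's expectation;
    // asking for the wrong kind simply yields nil rather than a type error.
    subscript(gauge id: ActorMetricID) -> Gauge? { self.gauges[id] }
    subscript(timer id: ActorMetricID) -> Timer? { self.timers[id] }
}

// Usage, mirroring the quoted call site:
//   metrics[gauge: .mailboxCount]?.record(messageCount)
```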
    case mailboxCount

    case messageProcessingTime
}
actual keys for the metrics that are "per actor (group)"
        props.metrics = metricsProps
        return props
    }
}
this is how users can configure metrics for an actor;
the group becomes part of the label, so users may give many actors the same group config and they'd end up in the same backend metric.
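To illustrate the "same group, same backend metric" point with plain swift-metrics (a standalone sketch, not the library's configuration API): two gauges created with the same label and the same group dimension are, from a typical backend's point of view, one time series.

```swift
import Metrics

// Two actors configured with the same metrics group...
let groupDimensions: [(String, String)] = [("actor.group", "workers")]

let mailboxGaugeA = Gauge(label: "actor.mailbox.count", dimensions: groupDimensions)
let mailboxGaugeB = Gauge(label: "actor.mailbox.count", dimensions: groupDimensions)

// ...both record into the same label + dimensions, so a typical backend
// (e.g. Prometheus) sees a single "workers" mailbox-count series.
mailboxGaugeA.record(10)
mailboxGaugeB.record(3)
```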
        .serialization,
        .deserialization,
    ]
}
the current options.
Serialization is actually hard without baggage / context propagation... What does "serialization metrics for THIS actor" even mean -- it means all outgoing messages from this actor -- so that's a bit hard to do and not done yet.
    @usableFromInline
    let ref: _ReceivesSystemMessages

    public let _props: Props?
TODO: remove, not needed after all
        self.refType = .local
        self._props = cell.actor?.props
    default:
        // TODO: this is good enough for now... it gets harder with delegates and adapters... need to revisit
TODO: remove, not needed after all
    default:
        return ActiveActorMetrics.noop
    }
}
this is how serialization etc. reaches into metrics to report timing
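A minimal sketch of that shape, assuming swift-metrics; only the noop fallback mirrors the ActiveActorMetrics.noop default quoted above, everything else (the names, the timing helper) is invented for illustration.

```swift
import Dispatch
import Metrics

// Illustrative per-actor metrics holder with a no-op fallback, mirroring the
// "return ActiveActorMetrics.noop" default in the diff above.
struct PerActorMetricsSketch {
    static let noop = PerActorMetricsSketch(deserializationTime: nil)
    let deserializationTime: Timer?
}

// Hypothetical helper: time a piece of deserialization work and report it,
// if (and only if) the actor opted into that metric.
func withDeserializationTiming<T>(_ metrics: PerActorMetricsSketch, _ body: () throws -> T) rethrows -> T {
    let start = DispatchTime.now().uptimeNanoseconds
    defer {
        let elapsed = DispatchTime.now().uptimeNanoseconds - start
        metrics.deserializationTime?.recordNanoseconds(Int64(elapsed))
    }
    return try body()
}
```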
    // If we passed the maximum capacity of the user queue, we can't enqueue more
    // items and have to decrement the activations count again. This is not racy,
    // because we only process messages if the queue actually contains them (does
    // not return NULL), so even if messages get processed concurrently, it's safe
@drexin could I ask you to have a look at the mailbox count metrics?
I think this is fairly good, but maybe you have a better idea or take on it?
The semantics rely on (sketched below):
- report the message count whenever we +1 it with an enqueue
- report whenever we start a run
- report whenever we complete a run with "done"/closed etc. -- we should have ended up at zero and must record this
- or we scheduled again, so the count was set by another actor and may currently read "too large" -- if we reported here we would suddenly under-count the value; so instead we keep the value high until the next scheduled run, when we get a more correct value...
WDYT?
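A condensed, deliberately single-threaded sketch of those reporting points (the real mailbox works on an atomic status word; the names below are illustrative):

```swift
import Metrics

// A simplified, non-concurrent model of the reporting points above.
final class MailboxCountSketch {
    let mailboxCountGauge = Gauge(label: "actor.mailbox.count")
    private var messageCount = 0

    // (1) report whenever we +1 on enqueue
    func didEnqueue() {
        self.messageCount += 1
        self.mailboxCountGauge.record(self.messageCount)
    }

    // (2) report whenever we start a run (covers the "rescheduled" case)
    func willStartRun() {
        self.mailboxCountGauge.record(self.messageCount)
    }

    // (3) on run completion: if we drained to zero (or closed), record that;
    //     if we rescheduled, leave the possibly-too-large value until the next run.
    func didCompleteRun(processed: Int, rescheduled: Bool) {
        self.messageCount -= processed
        if !rescheduled {
            self.mailboxCountGauge.record(self.messageCount) // expected to be 0 here
        }
    }
}
```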
Do we actually have to report when we start a run? The reporting on enqueue should be enough IMHO.
Thinking about it, I think we should report on reschedule as well. Depending on how much is going on, it might not be re-scheduled immediately.
So reporting on just the enqueue gives us the +1s, but we also need to report smaller counts -- enqueued messages need to be accounted for as processed at some point.
So the reporting at the beginning of a run basically handles the "on reschedule" case... but a bit naively...
I was thinking this is why it's not good to report at the end of a run:
- mailbox @ 120
- we run 100
- we decrement the status and know we were at 120 and processed 100
- we'd report 20
- in between the setting of that status, enqueues happen -- so e.g. 21 was already reported
- it's not so good if we then report 20...
But then again... if messages are sent to us after we issue the schedule, it'll be 22 again and it's fine... The value will wobble around a bit, and tbh that's fine... This is not a super precise counter, just metrics.
I wonder on which edge we should report though -- do you think the end of a run is clearer?
Or is the current "on beginning if rescheduling, and on end when we completed and it's zero" a bit meh?
drexin left a comment:
LGTM
Arghs, actually just looked at what you linked me to :D. I'll have a deeper look.
    self.swim = SWIM.Settings()
    self.swim.unreachability = .enabled
    if node.systemName != "" {
nit: isEmpty?
thanks :)
Ah actually, systemName is optional, so I use a sneaky comparison here -- nil != "" as well as "some" != "", without having to spell it out.
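For reference, a standalone snippet (not the settings code itself) showing how that optional comparison behaves:

```swift
let noName: String? = nil
let someName: String? = "System"
let emptyName: String? = ""

// Comparing an Optional<String> against "" lifts the literal to String?,
// so nil and any non-empty name both compare as "not empty" in one expression.
print(noName != "")     // true  -- nil != ""
print(someName != "")   // true  -- "System" != ""
print(emptyName != "")  // false -- only the genuinely empty name fails the check
```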
    // not return NULL), so even if messages get processed concurrently, it's safe
    // to decrement here.
    _ = self.decrementMessageCount()
    self.shell?.metrics[gauge: .mailboxCount]?.record(oldStatus.messageCount)
Q: how come we record oldStatus.messageCount and not the result after self.decrementMessageCount()?
Oh, good point actually. I was not thinking about concurrency properly here...
If this were single-threaded, the decrement would be equivalent to "don't do that +1", but since there are many senders it may be a very different number actually...
Actually!
Perhaps we should NOT report at all if we did not manage to enqueue (!), because the count was updated and reported by whichever thread did perform the enqueue or mailbox status update... WDYT @drexin ?
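A small sketch of that "don't report on a failed enqueue" idea, with hypothetical names (the real mailbox derives success or failure from its status word): only the thread that actually changed the count reports it.

```swift
import Metrics

// Hypothetical enqueue result; the real mailbox derives this from its status word.
enum EnqueueResult {
    case enqueued(newCount: Int)
    case mailboxFull
    case mailboxClosed
}

func reportAfterEnqueue(_ result: EnqueueResult, gauge: Gauge) {
    switch result {
    case .enqueued(let newCount):
        // We performed the +1, so we own reporting the new count.
        gauge.record(newCount)
    case .mailboxFull, .mailboxClosed:
        // We did not change the count; the thread that did (or will) change it
        // reports instead, so we stay silent and avoid under-counting.
        break
    }
}
```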
Okay, I think we're good here then... added more tests as well. The shell is now sadly 536 bytes, up by 16 because of the dict pointer and the active set.
TODO: need to emit the Shell-specific Timer metrics.
Once this lands we should release 0.6.1