
Conversation

@ktoso (Member) commented Oct 2, 2020

TODO: need to emit the Shell-specific Timer metrics.

Once this lands, we should release 0.6.1.

@ktoso added this to the 0.6.1 milestone Oct 2, 2020

@ktoso (Member, Author) commented Oct 7, 2020

Cool, now with deserialization metrics configured via props:

recording 150.0 in first.swim.shell.deserialization.size[]
recording 150.0 in second.swim.shell.deserialization.size[]
recording 936007 in first.swim.shell.deserialization.time[]
recording 861968 in second.swim.shell.deserialization.time[]

(debug mode)

Doing serialization is harder and, to be honest, we'd need some baggage for it:

  • if I enable it via props.metrics(measure: [serialization]), what does that mean -- measure the serialization of all messages this actor sends? Then we need baggage to associate the "sender" with a message, and we don't have this today.

@ktoso (Member, Author) commented Oct 7, 2020

TODO:

  • pingRequest round trip time

var instrumentation: ActorInstrumentation!

@usableFromInline
let metrics: ActiveActorMetrics

@ktoso (Author):

it ends up being one dictionary pointer and one integer...

Slimming down the runtime is not something I'm focusing on right now, since the entire thing will change so much with concurrency anyway.


self.namingContext = ActorNamingContext()

self.metrics = ActiveActorMetrics(system: system, address: address, props: props.metrics)

@ktoso (Author):

so props just selects which metrics we want to allocate
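
(For illustration only, a minimal standalone sketch of that idea; the type name, key names, and initializer below are assumptions, not the library's actual ActiveActorMetrics API:)

```swift
import Metrics

// Standalone sketch (assumed names): the props list which metrics are enabled,
// and only those get allocated; a missing dictionary entry means "disabled".
struct SketchedActorMetrics {
    enum Key: Hashable {
        case mailboxCount
        case messageProcessingTime
    }

    // One dictionary; recording into a disabled metric is just a failed lookup.
    private let gauges: [Key: Gauge]

    init(group: String, enabled: Set<Key>) {
        var gauges: [Key: Gauge] = [:]
        if enabled.contains(.mailboxCount) {
            gauges[.mailboxCount] = Gauge(label: "\(group).mailbox.count")
        }
        self.gauges = gauges
    }

    subscript(gauge key: Key) -> Gauge? {
        self.gauges[key]
    }
}

// Usage: only the mailbox gauge is allocated; the processing-time metric stays absent.
let sketchedMetrics = SketchedActorMetrics(group: "first.swim.shell", enabled: [.mailboxCount])
sketchedMetrics[gauge: .mailboxCount]?.record(1)
```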

self.userMessages.enqueue(envelope)

self.shell?._system?.metrics.recordMailboxMessageCount(Int(self.status.messageCount))
self.shell?.metrics[gauge: .mailboxCount]?.record(oldStatus.messageCount + 1)

@ktoso (Author):

would want a more typesafe thing here, but it's annoying to write... so for now relying on "I know it's a gauge"
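
(For what it's worth, one possible shape of that "more typesafe thing" -- a sketch only, with assumed names; the key carries the metric type, so the call site cannot mix up a gauge and a timer:)

```swift
import Metrics

// Sketch of a typed-key accessor (assumed names): the phantom type parameter on the
// key documents and enforces which metric type lives under it.
struct MetricKey<M> {
    let name: String
}

struct TypedActorMetrics {
    // Assumed key definitions; the real library keys are the enum cases above.
    static let mailboxCount = MetricKey<Gauge>(name: "mailbox.count")
    static let messageProcessingTime = MetricKey<Timer>(name: "message.processing.time")

    private var storage: [String: Any] = [:]

    init() {}

    mutating func set<M>(_ metric: M, for key: MetricKey<M>) {
        self.storage[key.name] = metric
    }

    subscript<M>(key: MetricKey<M>) -> M? {
        self.storage[key.name] as? M
    }
}

// Usage: asking for the mailbox count can only ever hand back a Gauge.
var typed = TypedActorMetrics()
typed.set(Gauge(label: "example.mailbox.count"), for: TypedActorMetrics.mailboxCount)
typed[TypedActorMetrics.mailboxCount]?.record(42)
```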

case mailboxCount

case messageProcessingTime
}

@ktoso (Author):

actual keys for the metrics that are "per actor (group)"

props.metrics = metricsProps
return props
}
}

@ktoso (Author):

this is how users can configure metrics for an actor.

The group becomes part of the label, so users may give many actors the same group config and they'd all end up in the same backend metric (see the sketch below).
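
(A hypothetical sketch of the configuration shape -- the type and parameter names are assumptions, not the real props API; it only illustrates how a shared group funnels many actors into one backend metric:)

```swift
// Hypothetical sketch (names assumed): the group becomes part of the metric label,
// so every actor configured with the same group reports into the same backend metric.
struct SketchedMetricsProps {
    enum Measure: Hashable {
        case mailboxCount
        case messageProcessingTime
        case serialization
        case deserialization
    }

    var group: String?
    var measure: Set<Measure> = []
}

var workerMetrics = SketchedMetricsProps()
workerMetrics.group = "workers"                                  // shared label for all workers
workerMetrics.measure = [.mailboxCount, .messageProcessingTime]  // which metrics to allocate
// then assign this onto the actor's props (as in `props.metrics = metricsProps` above)
// and spawn each worker with those props
```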

.serialization,
.deserialization,
]
}

@ktoso (Author):

the current options.

Serialization is actually hard without baggage / context propagation... "Metrics for serialization for THIS actor" means all outgoing messages from this actor, so that's a bit hard to do and not done yet.

@usableFromInline
let ref: _ReceivesSystemMessages

public let _props: Props?

@ktoso (Author):

TODO: remove, not needed after all

self.refType = .local
self._props = cell.actor?.props
default:
// TODO: this is good enough for now... it gets harder with delegates and adapters... need to revisit

@ktoso (Author):

TODO: remove, not needed after all

default:
return ActiveActorMetrics.noop
}
}

@ktoso (Author):

this is how serialization etc. reaches into metrics to report timing
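
(A standalone sketch of what such a reporting site can look like, matching the ".deserialization.size" / ".deserialization.time" debug output earlier in this thread; the helper name is an assumption:)

```swift
import Dispatch
import Metrics

// Standalone sketch (assumed helper name): wrap the actual deserialization work,
// then record the payload size and the elapsed nanoseconds.
let deserializationSize = Recorder(label: "first.swim.shell.deserialization.size")
let deserializationTime = Timer(label: "first.swim.shell.deserialization.time")

func deserializeAndRecord<T>(_ bytes: [UInt8], deserialize: ([UInt8]) throws -> T) rethrows -> T {
    let start = DispatchTime.now().uptimeNanoseconds
    let message = try deserialize(bytes)
    let elapsedNanos = DispatchTime.now().uptimeNanoseconds - start

    deserializationSize.record(Double(bytes.count))            // payload size in bytes
    deserializationTime.recordNanoseconds(Int64(elapsedNanos)) // wall-clock duration
    return message
}

// Usage: the actual decoding happens inside the closure.
let text = deserializeAndRecord([104, 105]) { bytes in
    String(decoding: bytes, as: UTF8.self)
}
print(text)
```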

// If we passed the maximum capacity of the user queue, we can't enqueue more
// items and have to decrement the activations count again. This is not racy,
// because we only process messages if the queue actually contains them (does
// not return NULL), so even if messages get processed concurrently, it's safe

@ktoso (Author):

@drexin could I ask you to have a look at the mailbox count metrics?

I think this is fairly good, but maybe you have a better idea or take on it?

The semantics rely on (a standalone sketch of these reporting points follows below):

  • report the message count whenever we +1 it with an enqueue
  • report whenever we start a run
  • report whenever we complete a run with "done"/closed etc. -- we should have ended up at zero and must record this
    • or we were scheduled again, so the count was bumped by another sender and may be reported "too large" -- if we reported here we would suddenly under-count the value; so instead we keep the value high until the next scheduled run, when we get a more correct value...

WDYT?
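
(For reference, a minimal single-threaded sketch of those three reporting points -- assumed names, no atomics, so it only illustrates the semantics, not the real concurrent mailbox:)

```swift
import Metrics

// Single-threaded sketch (assumed names); the real mailbox uses an atomic status count.
final class SketchedMailbox {
    let mailboxCount = Gauge(label: "example.actor.mailbox.count")
    private var queue: [String] = []

    func enqueue(_ message: String) {
        queue.append(message)
        // 1) report on every enqueue: the count right after the +1
        mailboxCount.record(queue.count)
    }

    func run(processingUpTo limit: Int) {
        // 2) report when a run starts (this also covers the "rescheduled" case)
        mailboxCount.record(queue.count)

        var processed = 0
        while processed < limit, !queue.isEmpty {
            _ = queue.removeFirst()
            processed += 1
        }

        if queue.isEmpty {
            // 3) the run drained the mailbox: we ended at zero and must record it
            mailboxCount.record(0)
        }
        // else: we'd reschedule; don't report here -- the next run start reports a
        // fresher value (avoiding the under-counting discussed above)
    }
}

let mailbox = SketchedMailbox()
mailbox.enqueue("ping")         // gauge: 1
mailbox.run(processingUpTo: 10) // gauge: 1 at run start, then 0 once drained
```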

Member:

Do we actually have to report when we start a run? The reporting on enqueue should be enough IMHO.

Member:

Thinking about it I think we should report on reschedule as well. Depending on how much is going on it might not be re-scheduled immediately.

@ktoso (Author):

So reporting only on enqueue gives us the +1s, but we also need to report the smaller counts -- enqueued messages eventually get processed, and the gauge has to come back down.

So the reporting at the beginning of a run basically handles the "on reschedule" case... though a bit naively...

I was thinking this is why it's not good to report at the end of a run:

  • mailbox @ 120
  • we run 100
  • we decrement the status and know we were at 120 and processed 100
    • we'd report 20
  • in between the setting of that status, enqueues happen -- so 21 was reported, for example
  • it's not so good if we then report 20...

but then again... if messages are sent to us after we issue the schedule, it'll be 22 again and it's fine... The value will wobble around a bit, and tbh that's fine... This is not a super precise counter, just metrics.

I wonder at which edge we should report though -- do you think the end of a run is clearer?
Or is the current "at the beginning if rescheduling, and at the end when we completed and it's zero" a bit meh?

@ktoso (Member, Author) commented Oct 7, 2020

(This was green, but I had to wrap up for now... I left a fatal error in there on purpose as a thing I want to follow up on -- just a minor question about what label something is reporting under, and then we're done here I hope).

Would love some review here if you have some time, @drexin @yim-lee. Thanks! 🙇‍♂️

@drexin previously approved these changes Oct 7, 2020

@drexin (Member) left a comment:

LGTM

@drexin dismissed their stale review October 7, 2020 16:45

Arghs, actually just looked at what you linked me to :D. I'll have a deeper look.


self.swim = SWIM.Settings()
self.swim.unreachability = .enabled
if node.systemName != "" {

Member:

nit: isEmpty?

@ktoso (Author):

thanks :)

@ktoso (Author):

Ah actually, systemName is optional, so I use a sneaky comparison here -- nil != "" as well as "some" != "" -- without having to spell it out.
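
(The "sneaky comparison" relies on Swift lifting the literal "" to Optional(""), so both the nil case and the non-empty case compare as not equal to "":)

```swift
let none: String? = nil
let some: String? = "some"
let empty: String? = ""

print(none != "")   // true  -- nil is not the empty string
print(some != "")   // true  -- a non-empty system name
print(empty != "")  // false -- an explicitly empty name behaves like "no name"
```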

// not return NULL), so even if messages get processed concurrently, it's safe
// to decrement here.
_ = self.decrementMessageCount()
self.shell?.metrics[gauge: .mailboxCount]?.record(oldStatus.messageCount)

Member:

Q: how come we record oldStatus.messageCount and not the result after self.decrementMessageCount()?

@ktoso (Author):

Oh, good point actually. I was not thinking about concurrency properly here...

If single-threaded, the decrement would be equivalent to "don't do that +1", but since there are many senders it may actually be a very different number...

Actually!

Perhaps we should NOT report at all if we did not manage to enqueue (!), because the count was already updated and reported by whichever thread did perform the enqueue or mailbox status update... WDYT @drexin?
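
(For illustration, a standalone sketch of that rule -- assumed names, with a bounded array standing in for the real lock-free mailbox: only the path that actually changed the count records the gauge:)

```swift
import Metrics

// Standalone sketch (assumed names): a failed enqueue records nothing, because
// whichever sender did perform the last status update already reported the count.
struct SketchedBoundedMailbox {
    let mailboxCount = Gauge(label: "example.actor.mailbox.count")
    let capacity: Int
    var messages: [String] = []

    mutating func sendMessage(_ envelope: String) -> Bool {
        guard messages.count < capacity else {
            return false // full: we did NOT change the count, so we record nothing here
        }
        messages.append(envelope)
        mailboxCount.record(messages.count) // we performed the +1, so we report it
        return true
    }
}

var boundedMailbox = SketchedBoundedMailbox(capacity: 2)
_ = boundedMailbox.sendMessage("a") // recorded: 1
_ = boundedMailbox.sendMessage("b") // recorded: 2
_ = boundedMailbox.sendMessage("c") // full -> dropped, nothing recorded
```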

@ktoso (Member, Author) commented Oct 8, 2020

Okay, I think we're good here then...

Added more tests as well.

The shell is now sadly 536 bytes, up by 16 because of the dict pointer and the active set.

@ktoso (Member, Author) commented Oct 8, 2020

I want to allow @budde to take this for a spin, so merging and releasing 0.6.1 -- more insights on how we report the metrics are very welcome though, @drexin; happy to follow up on these and discuss during the week perhaps?

@ktoso merged commit d42b708 into apple:main Oct 8, 2020
@ktoso deleted the wip-swim-3 branch October 8, 2020 07:20