=cluster update SWIM to 0.3.0, includes metrics #780
Conversation
Cool, now with deserialization metrics configured via props (debug mode). Doing serialization is harder and we'd need some baggage for it, to be honest...

TODO:
    var instrumentation: ActorInstrumentation!

    @usableFromInline
    let metrics: ActiveActorMetrics
it ends up being one dictionary pointer and one integer...
Slimming down the runtime is not something I'm focusing on right now, since the entire thing will change so much with concurrency.
    self.namingContext = ActorNamingContext()

    self.metrics = ActiveActorMetrics(system: system, address: address, props: props.metrics)
so props just selects which metrics we want to allocate
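A minimal sketch of that idea, assuming the swift-metrics package and using illustrative stand-in types (ActorMetricsProps and ActorMetricsSketch are not the library's real names): the props only decide which metric objects get allocated, so an actor that opts into nothing carries no metric instances at all.

```swift
import Metrics

// Illustrative stand-ins for the real MetricsProps / ActiveActorMetrics types.
enum ActorMetricKind: Hashable {
    case mailboxCount
    case messageProcessingTime
}

struct ActorMetricsProps {
    var group: String?
    var enabled: Set<ActorMetricKind> = []
}

struct ActorMetricsSketch {
    let mailboxCount: Gauge?
    let messageProcessingTime: Timer?

    init(props: ActorMetricsProps) {
        // The group, if set, becomes a dimension on every allocated metric.
        let dimensions: [(String, String)] = props.group.map { [("actor.group", $0)] } ?? []
        // Only allocate what the props asked for; everything else stays nil.
        self.mailboxCount = props.enabled.contains(.mailboxCount)
            ? Gauge(label: "actor.mailbox.count", dimensions: dimensions)
            : nil
        self.messageProcessingTime = props.enabled.contains(.messageProcessingTime)
            ? Timer(label: "actor.message.processing.time", dimensions: dimensions)
            : nil
    }
}
```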
    self.userMessages.enqueue(envelope)

    self.shell?._system?.metrics.recordMailboxMessageCount(Int(self.status.messageCount))
    self.shell?.metrics[gauge: .mailboxCount]?.record(oldStatus.messageCount + 1)
would want something more type-safe here, but it's annoying to write... so for now relying on "I know it's a gauge"
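For illustration, a rough sketch of what that "I know it's a gauge" convention amounts to (illustrative types, not the library's actual API): the key enum is shared across metric kinds, and the labeled subscript is the only place the expected metric type is documented.

```swift
import Metrics

enum ActorMetricID: Hashable {
    case mailboxCount           // backed by a Gauge
    case messageProcessingTime  // backed by a Timer
}

struct PerActorMetrics {
    var gauges: [ActorMetricID: Gauge] = [:]
    var timers: [ActorMetricID: Timer] = [:]

    // The argument label ("gauge:" / "timer:") encodes the caller's expectation;
    // asking for the wrong kind simply yields nil rather than a type error.
    subscript(gauge id: ActorMetricID) -> Gauge? { self.gauges[id] }
    subscript(timer id: ActorMetricID) -> Timer? { self.timers[id] }
}

// Usage, mirroring the quoted call site:
//   metrics[gauge: .mailboxCount]?.record(messageCount)
```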
    case mailboxCount

    case messageProcessingTime
}
actual keys for the metrics that are "per actor (group)"
        props.metrics = metricsProps
        return props
    }
}
this is how users can configure metrics for an actor;
the group becomes part of the label, so users may give many actors the same group config and they'd end up in the same backend metric.
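To illustrate the "same group, same backend metric" point with plain swift-metrics (a standalone sketch, not the library's configuration API): two gauges created with the same label and the same group dimension are, from a typical backend's point of view, one time series.

```swift
import Metrics

// Two actors configured with the same metrics group...
let groupDimensions: [(String, String)] = [("actor.group", "workers")]

let mailboxGaugeA = Gauge(label: "actor.mailbox.count", dimensions: groupDimensions)
let mailboxGaugeB = Gauge(label: "actor.mailbox.count", dimensions: groupDimensions)

// ...both record into the same label + dimensions, so a typical backend
// (e.g. Prometheus) sees a single "workers" mailbox-count series.
mailboxGaugeA.record(10)
mailboxGaugeB.record(3)
```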
        .serialization,
        .deserialization,
    ]
}
the current options.
Serialization is actually hard without baggage / context propagation... What does "serialization metrics for THIS actor" even mean -- it means all outgoing messages from this actor -- so that's a bit hard to do and not done yet.
    @usableFromInline
    let ref: _ReceivesSystemMessages

    public let _props: Props?
TODO: remove, not needed after all
        self.refType = .local
        self._props = cell.actor?.props
    default:
        // TODO: this is good enough for now... it gets harder with delegates and adapters... need to revisit
TODO: remove, not needed after all
    default:
        return ActiveActorMetrics.noop
    }
}
this is how serialization etc. reaches into metrics to report timing
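A minimal sketch of that shape, assuming swift-metrics; only the noop fallback mirrors the ActiveActorMetrics.noop default quoted above, everything else (the names, the timing helper) is invented for illustration.

```swift
import Dispatch
import Metrics

// Illustrative per-actor metrics holder with a no-op fallback, mirroring the
// "return ActiveActorMetrics.noop" default in the diff above.
struct PerActorMetricsSketch {
    static let noop = PerActorMetricsSketch(deserializationTime: nil)
    let deserializationTime: Timer?
}

// Hypothetical helper: time a piece of deserialization work and report it,
// if (and only if) the actor opted into that metric.
func withDeserializationTiming<T>(_ metrics: PerActorMetricsSketch, _ body: () throws -> T) rethrows -> T {
    let start = DispatchTime.now().uptimeNanoseconds
    defer {
        let elapsed = DispatchTime.now().uptimeNanoseconds - start
        metrics.deserializationTime?.recordNanoseconds(Int64(elapsed))
    }
    return try body()
}
```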
    // If we passed the maximum capacity of the user queue, we can't enqueue more
    // items and have to decrement the activations count again. This is not racy,
    // because we only process messages if the queue actually contains them (does
    // not return NULL), so even if messages get processed concurrently, it's safe
@drexin could I ask you to have a look at the mailbox count metrics?
I think this is fairly good, but maybe you have a better idea or take on it?
The semantics rely on (sketched below):
- report the message count whenever we +1 it with an enqueue
- report whenever we start a run
- report whenever we complete a run with "done"/closed etc. -- we should have ended up at zero and must record this
- or we scheduled again, so the count was set by another actor and may currently read "too large" -- if we reported here we would suddenly under-count the value; so instead we keep the value high until the next scheduled run, when we get a more correct value...
WDYT?
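A condensed, deliberately single-threaded sketch of those reporting points (the real mailbox works on an atomic status word; the names below are illustrative):

```swift
import Metrics

// A simplified, non-concurrent model of the reporting points above.
final class MailboxCountSketch {
    let mailboxCountGauge = Gauge(label: "actor.mailbox.count")
    private var messageCount = 0

    // (1) report whenever we +1 on enqueue
    func didEnqueue() {
        self.messageCount += 1
        self.mailboxCountGauge.record(self.messageCount)
    }

    // (2) report whenever we start a run (covers the "rescheduled" case)
    func willStartRun() {
        self.mailboxCountGauge.record(self.messageCount)
    }

    // (3) on run completion: if we drained to zero (or closed), record that;
    //     if we rescheduled, leave the possibly-too-large value until the next run.
    func didCompleteRun(processed: Int, rescheduled: Bool) {
        self.messageCount -= processed
        if !rescheduled {
            self.mailboxCountGauge.record(self.messageCount) // expected to be 0 here
        }
    }
}
```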
Do we actually have to report when we start a run? The reporting on enqueue should be enough IMHO.
Thinking about it, I think we should report on reschedule as well. Depending on how much is going on, it might not be re-scheduled immediately.
So reporting on just the enqueue gives us the +1s, but we also need to report smaller counts -- enqueued messages need to be accounted for as processed at some point.
So the reporting at the beginning of a run basically handles the "on reschedule" case... but a bit naively...
I was thinking this is why it's not good to report at the end of a run:
- mailbox @ 120
- we run 100
- we decrement the status and know we were at 120 and processed 100
- we'd report 20
- in between the setting of that status, enqueues happen -- so e.g. 21 was already reported
- it's not so good if we then report 20...
But then again... if messages are sent to us after we issue the schedule, it'll be 22 again and it's fine... The value will wobble around a bit, and tbh that's fine... This is not a super precise counter, just metrics.
I wonder on which edge we should report though -- do you think the end of a run is clearer?
Or is the current "on beginning if rescheduling, and on end when we completed and it's zero" a bit meh?
drexin left a comment:
LGTM
Arghs, actually just looked at what you linked me to :D. I'll have a deeper look.
    self.swim = SWIM.Settings()
    self.swim.unreachability = .enabled
    if node.systemName != "" {
nit: isEmpty?
thanks :)
Ah actually, systemName is optional, so I use a sneaky comparison here -- nil != "" as well as "some" != "", without having to spell it out.
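For reference, a standalone snippet (not the settings code itself) showing how that optional comparison behaves:

```swift
let noName: String? = nil
let someName: String? = "System"
let emptyName: String? = ""

// Comparing an Optional<String> against "" lifts the literal to String?,
// so nil and any non-empty name both compare as "not empty" in one expression.
print(noName != "")     // true  -- nil != ""
print(someName != "")   // true  -- "System" != ""
print(emptyName != "")  // false -- only the genuinely empty name fails the check
```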
    // not return NULL), so even if messages get processed concurrently, it's safe
    // to decrement here.
    _ = self.decrementMessageCount()
    self.shell?.metrics[gauge: .mailboxCount]?.record(oldStatus.messageCount)
Q: how come we record oldStatus.messageCount and not the result after self.decrementMessageCount()?
Oh, good point actually. I was not thinking about concurrency properly here...
If this were single-threaded, the decrement would be equivalent to "don't do that +1", but since there are many senders it may be a very different number actually...
Actually!
Perhaps we should NOT report at all if we did not manage to enqueue (!), because the count was updated and reported by whichever thread did perform the enqueue or mailbox status update... WDYT @drexin ?
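A small sketch of that "don't report on a failed enqueue" idea, with hypothetical names (the real mailbox derives success or failure from its status word): only the thread that actually changed the count reports it.

```swift
import Metrics

// Hypothetical enqueue result; the real mailbox derives this from its status word.
enum EnqueueResult {
    case enqueued(newCount: Int)
    case mailboxFull
    case mailboxClosed
}

func reportAfterEnqueue(_ result: EnqueueResult, gauge: Gauge) {
    switch result {
    case .enqueued(let newCount):
        // We performed the +1, so we own reporting the new count.
        gauge.record(newCount)
    case .mailboxFull, .mailboxClosed:
        // We did not change the count; the thread that did (or will) change it
        // reports instead, so we stay silent and avoid under-counting.
        break
    }
}
```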
Okay, I think we're good here then... added more tests as well. The shell is now sadly 536 bytes, up by 16 because of the dict pointer and the active set.
TODO: need to emit the Shell-specific Timer metrics.
Once this lands we should release 0.6.1