
panic: runtime error: invalid memory address or nil pointer dereference #24

Closed

owenhaynes opened this issue Mar 3, 2021 · 26 comments

@owenhaynes
Contributor

owenhaynes commented Mar 3, 2021

Client version v0.6.9
Kafka Version: kafka 2.6
Connection Auth: MTLS connection
Auto commit disabled

Our Kafka dev cluster can be under high load, so I have no idea if it's related.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7d381f]

goroutine 10119 [running]:
github.com/twmb/franz-go/pkg/kgo.(*topicPartitions).load(...)
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/topics_and_partitions.go:73
github.com/twmb/franz-go/pkg/kgo.(*consumer).assignPartitions(0xc0007326a0, 0xc0030022d0, 0xc00084d500)
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/consumer.go:357 +0x47f
github.com/twmb/franz-go/pkg/kgo.(*groupConsumer).fetchOffsets(0xc00084d4a0, 0xfc34a8, 0xc002bcce40, 0xc0024914a0, 0x0, 0x0)
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/consumer_group.go:1201 +0x4d7
github.com/twmb/franz-go/pkg/kgo.(*groupConsumer).setupAssignedAndHeartbeat.func3(0xc002bd7080, 0xc002554060, 0xc00084d4a0, 0xc0024914a0, 0xfc34a8, 0xc002bcce40)
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/consumer_group.go:750 +0x145
created by github.com/twmb/franz-go/pkg/kgo.(*groupConsumer).setupAssignedAndHeartbeat
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/consumer_group.go:746 +0x387
@owenhaynes
Contributor Author

Looks to be a sync error between getting topic metadata and assigning heartbeats.

@twmb
Owner

twmb commented Mar 3, 2021

Thanks for the report, and apologies for the panic. I’ll be able to look into this later in the day my time.

I checked a few things quickly: the code causing this panic has a comment saying the value must be non-nil at that point, so that comment is clearly based on outdated information. My offhand guess is that something got lost / moved / added during refactorings.

How reliably are you able to trigger this panic?

@owenhaynes
Contributor Author

owenhaynes commented Mar 3, 2021

At the moment it's happening on every app run, consuming from 200+ topics individually. Our Kafka server looks to be at 80%+ load.

I was in the middle of some debugging to add some extra info:

but it's like the topic metadata is not fully populated before topics_and_partitions.go:73 runs. I only see one topic listed in the metadata list, and not the topic we should be subscribed to for heartbeats.

Code we're using:

	client, err := kgo.NewClient(
		kgo.SeedBrokers(c.brokers...),
		kgo.Dialer(tlsDialer.DialContext),
	)
	if err != nil {
		return err
	}
	defer client.Close()

	client.AssignGroup(c.id, kgo.GroupTopics(c.topic), kgo.DisableAutoCommit())

We create 200 clients in the app, one for each topic, at the moment, as we're currently testing this library as a drop-in replacement; this would be optimised later. I don't see this being the problem, as the library does not contain any shared state from what I can tell.

@twmb
Owner

twmb commented Mar 3, 2021

Cool, thanks for the update; it doesn't sound like an isolated reproduction is possible.

I was planning to fully audit the consumer group code soon, and I’ll push up that timeline based off this issue (this weekend or next week).

I do see something that I suspect and can validate later when I’m on a laptop: the code does not currently handle the case where a client is assigned a topic it actually did not express interest in. I’ll also look for a case where it is assigned something but the internal topic partitions struct has not been created for some reason.
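
For illustration, one defensive way to handle that first case would be roughly the following (a standalone sketch with hypothetical names, not the library's actual internals): skip and log any assigned topic the client never expressed interest in, instead of dereferencing a missing per-topic partitions struct.

	// filterAssignment drops any assigned topic the client never expressed
	// interest in, logging a warning instead of panicking on a missing
	// per-topic partitions struct. All names here are hypothetical.
	func filterAssignment(assigned map[string][]int32, interested map[string]bool,
		warn func(format string, args ...interface{})) map[string][]int32 {
		kept := make(map[string][]int32, len(assigned))
		for topic, partitions := range assigned {
			if !interested[topic] {
				warn("assigned topic %q that we did not express interest in; skipping", topic)
				continue
			}
			kept[topic] = partitions
		}
		return kept
	}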

What group balancer are you using here?

@owenhaynes
Contributor Author

I don't override the default that this client provides.

@twmb
Owner

twmb commented Mar 3, 2021

Thanks! And lastly, are all consumers in the group using this client?

@owenhaynes
Contributor Author

owenhaynes commented Mar 3, 2021 via email

@twmb
Owner

twmb commented Mar 3, 2021

Interesting, so this is happening in the cooperative case. If it's easy, can you try using the StickyBalancer? And if that still causes the issue, can you try the RangeBalancer? These can help give clues as to which areas of my code to look into.

@twmb
Owner

twmb commented Mar 4, 2021

Here's what I'm currently looking at:

  • A consumer group member indicates only topics it is interested in when joining a group. In the client, these interests are either static (if you use GroupTopics) or dynamic (if you use GroupTopicsRegex). The topics added to the JoinGroup request are only added if they have been discovered from a Metadata response, and they are stored in the client after discovery before tracking that they are usable.
  • The sticky balancer has at least a few tests that ensure topics/partitions are assigned correctly, even when some consumers are interested in more topics than others.
  • Given that, a client should only be assigned topics that it has already stored internally.

I can still fix the panic itself by fixing up fetchOffsets, but given that you said all consumers are using this client, there's a bug somewhere else.

Can you also confirm that each consumer is using its own dedicated client? That is, no consumer is sharing a client?

I may add some info level logging around what a client expresses interest in when joining a group, and then what that client is assigned.
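
For reference, the two ways a client can express that interest look roughly like this (a sketch; the group ID and topic names are made up):

	// Static interest: join the group expressing interest in exactly these topics.
	client.AssignGroup("example-group", kgo.GroupTopics("orders", "payments"))

	// Dynamic interest: the listed topics are treated as regular expressions, and
	// concrete topics are only joined on once discovered via metadata responses.
	client.AssignGroup("example-group", kgo.GroupTopics("^events[.].*"), kgo.GroupTopicsRegex())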

@owenhaynes
Contributor Author

We do get some connection timeouts on this cluster, if that's any help; this happens a lot more frequently in other Go libs, not sure about the Python ones we use.

[unable to open connection to broker addr *:17189 id 51 err dial tcp *:17189: i/o timeout]

Using different balancers

Sticky

client.AssignGroup(c.id, kgo.GroupTopics(c.topic), kgo.DisableAutoCommit(), kgo.Balancers(kgo.StickyBalancer()))

Not crashing, but a lot of topics are not getting any data, like they are stuck. There is a lot of lag on these topics. Only 3 topics look to have received data according to our metrics.
Ran for 10 min.

Range

client.AssignGroup(c.id, kgo.GroupTopics(c.topic), kgo.DisableAutoCommit(), kgo.Balancers(kgo.RangeBalancer()))

Crashing

panic: runtime error: index out of range [1] with length 1

goroutine 1895 [running]:
github.com/twmb/franz-go/pkg/kgo.(*rangeBalancer).balance(0x152eab8, 0xc00262c000, 0xf4, 0xf4, 0xc00342e3c0, 0xf4)
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/group_balancer.go:306 +0x766
github.com/twmb/franz-go/pkg/kgo.(*groupConsumer).balanceGroup(0xc00062edc0, 0xc003227530, 0x5, 0xc0004f7000, 0xf4, 0xf4, 0xa, 0xc0016b4ed8, 0x7f6f401ae8a0)
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/group_balancer.go:122 +0x2ab
github.com/twmb/franz-go/pkg/kgo.(*groupConsumer).handleJoinResp(0xc00062edc0, 0xc0012a0180, 0x0, 0x0, 0x2, 0xc0000b6001, 0x1, 0x0)
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/consumer_group.go:1064 +0x73b
github.com/twmb/franz-go/pkg/kgo.(*groupConsumer).joinAndSync(0xc00062edc0, 0xc0006eff03, 0xebfc80)
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/consumer_group.go:973 +0x3d3
github.com/twmb/franz-go/pkg/kgo.(*groupConsumer).manage(0xc00062edc0)
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/consumer_group.go:385 +0xc5
created by github.com/twmb/franz-go/pkg/kgo.(*groupConsumer).findNewAssignments
        /go/pkg/mod/github.com/twmb/[email protected]/pkg/kgo/consumer_group.go:1316 +0x70a

Could it be related to the number of topics we have on our Kafka instance? We are only consuming from a subset.

Total topics is around ~350.
We want to consume ~200.

Yes, no consumer is sharing a client, as we create a whole new client per topic.

Something like this is what happens internally:

topics := []string{} // would contain around 200 topics

wg := &sync.WaitGroup{}

for _, topic := range topics {
	topic := topic
	wg.Add(1)
	go func() {
		defer wg.Done()

		client, err := kgo.NewClient(
			kgo.SeedBrokers(c.brokers...),
			kgo.Dialer(tlsDialer.DialContext),
		)
		if err != nil {
			return // log the error; it cannot be returned from a goroutine
		}
		defer client.Close()

		client.AssignGroup("consumerID", kgo.GroupTopics(topic), kgo.DisableAutoCommit())

		for {
			fetches := client.PollFetches(ctx)

			for _, fetchErr := range fetches.Errors() {
				_ = fetchErr // handle errors
			}

			iter := fetches.RecordIter()
			for !iter.Done() {
				msg := iter.Next()
				_ = msg // do something with the message
			}
		}
	}()
}

wg.Wait()

@twmb
Owner

twmb commented Mar 4, 2021

Thanks for testing. If it's not too much, could you try the RoundRobin balancer too?

I think the dialing is unrelated, that's at a lower level than this library.

@twmb
Owner

twmb commented Mar 4, 2021

Also, I see that you're disabling autocommitting -- can you try with autocommitting enabled?

Also, if you're open to joining the Discord channel, I can ask some higher bandwidth questions, and I really appreciate your assisted debugging so far. The surface area of where this could be occurring is so far quite large, so I'm trying to narrow down at least a few things about it.

@owenhaynes
Contributor Author

OK, round robin: no crash, just stuck.

The reason auto committing is disabled is that we only want to commit once we know the message has been processed correctly, but I'm happy to enable it for a quick test.

Yeah, I should be able to join Discord, but it will most likely be tomorrow.
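
For context, the commit-after-processing pattern described above looks roughly like this (a sketch using CommitRecords from the current franz-go API; the exact commit call available in v0.6.x may differ, and process is a stand-in for our own handling):

	for {
		fetches := client.PollFetches(ctx)
		for _, fetchErr := range fetches.Errors() {
			_ = fetchErr // handle fetch errors
		}

		iter := fetches.RecordIter()
		for !iter.Done() {
			rec := iter.Next()
			if err := process(rec); err != nil {
				continue // do not commit records we failed to process
			}
			if err := client.CommitRecords(ctx, rec); err != nil {
				// handle the commit error (log, retry, etc.)
			}
		}
	}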

@twmb
Owner

twmb commented Mar 4, 2021

Thanks!

Here's my current plan:

  • the panic in the range balancer looks to be from an obvious mistake; I can fix this and add test cases for it
  • the originally reported panic is still an unknown -- I can prevent it from occurring, but it should only occur on a bad assignment
  • I'm going to add some verbose logs to the consumer that ends up as group leader; these will print what topics every consumer has expressed interest in and the resulting plan, and for each consumer I will add logs printing its assignment.

When I get those fixes / logs in, if possible I'd like you to re-run the cooperative sticky test (original balancer), the sticky test, and the range test. My hope is that the balancing logs there can give some more concrete things to look at. I can have these small commits pushed today.

@owenhaynes
Contributor Author

owenhaynes commented Mar 4, 2021

Auto commit: same results. Some topics are working with messages. At this point I am just assuming it's pot luck which topics work; the topics that do work are different each time.

@owenhaynes
Contributor Author

When I get those fixes / logs in, if possible I'd like you to re-run the cooperative sticky test (original balancer), the sticky test, and the range test.

No problem

twmb added a commit that referenced this issue Mar 5, 2021
Issue #24 showed that the range balancer was broken. At first glance,
the issue appears to be that consumerIdx wraps past the end of
potentialConsumers, and the obvious fix is to just increment and mod.
That would be an incorrect fix.

The real problem lies above, in determining div and rem. The whole point
is that div*len(potentialConsumers) + rem == numPartitions; thus, if we
drain div partitions (plus one from rem) every loop, we will never
increment past the end of potentialConsumers, because the partitions
will be completely drained on the last consumer.

The prior logic accidentally made div smaller and rem larger, thus
giving too few partitions to each consumer, and then in some scenarios,
would cause us to exhaust consumers while still draining partitions.

I will be adding unit tests later.
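
To illustrate the arithmetic (a standalone sketch, not the balancer's actual code): with div = partitions / consumers and rem = partitions % consumers, handing the first rem consumers div+1 partitions each and the rest div each drains the partitions exactly, so the consumer index can never walk past the last consumer.

	// assignRange illustrates range-style assignment: div*consumers + rem ==
	// partitions, so giving div+1 partitions to the first rem consumers and
	// div to the rest consumes every partition without overrunning the list.
	func assignRange(partitions, consumers int) [][]int {
		div, rem := partitions/consumers, partitions%consumers
		plan := make([][]int, consumers)
		next := 0
		for i := range plan {
			n := div
			if i < rem {
				n++
			}
			for j := 0; j < n; j++ {
				plan[i] = append(plan[i], next)
				next++
			}
		}
		return plan
	}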
@twmb
Owner

twmb commented Mar 5, 2021

I've pushed two commits, can you try those tests with go get github.com/twmb/franz-go@8178f2cc5883304d409f653ff7e5d42163cf3e8e? Specifically, the logging output for all consumers across a rebalance is what I'm looking for.

@owenhaynes
Contributor Author

OK, here is the log output for

client.AssignGroup(c.id, kgo.GroupTopics(c.topic), kgo.DisableAutoCommit())

group-balcner-logs.log

@owenhaynes
Contributor Author

client.AssignGroup(c.id, kgo.GroupTopics(c.topic), kgo.DisableAutoCommit(), kgo.Balancers(kgo.StickyBalancer()))

sticky-balancer.log

@owenhaynes
Contributor Author

client.AssignGroup(c.id, kgo.GroupTopics(c.topic), kgo.DisableAutoCommit(), kgo.Balancers(kgo.RangeBalancer()))

range-balencer.log

@twmb
Owner

twmb commented Mar 5, 2021

@owenhaynes I added the new logs at the Debug level, and I think the default is Info -- is it possible to up the level to Debug? I see some logs that are at the info level but none of the new ones I added.

Also, how many members are in the group?
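
For reference, bumping the client to debug output looks something like this (a sketch using kgo's BasicLogger helper; if that helper isn't available in the version you're running, any Logger implementation set to LogLevelDebug works):

	// needs "os" imported; the other client options stay as before
	client, err := kgo.NewClient(
		kgo.SeedBrokers(c.brokers...),
		kgo.Dialer(tlsDialer.DialContext),
		kgo.WithLogger(kgo.BasicLogger(os.Stderr, kgo.LogLevelDebug, nil)),
	)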

@owenhaynes
Contributor Author

I thought I had set logging to debug; will double check.

Just one member, as we only have 1 pod running.

@owenhaynes
Contributor Author

Here are the fixed debug logs
defualt_balancer.log
range_balancer.log
sticky_balancer.log

@owenhaynes
Contributor Author

FYI, we upgraded to Kafka 2.7 today; the issue still exists, so it's not related to using Kafka 2.6.

@twmb
Owner

twmb commented Mar 10, 2021

As an update: I have an idea of why the topic assignment after the group balance isn't working for you, and I am in the middle of fixing that. I'm not sure yet what caused the original issue you reported, though.

twmb added a commit that referenced this issue Mar 20, 2021
This is a companion commit to 89146f6, which realistically _needs_ this
commit because:

  - I did not properly split the work across the two commits
  - this commit fixes some lock ordering problems I noticed while
    writing that commit

This should fix the panic in #24 by at least logging when it would be
detected and continuing; however, the bug itself is still a mystery. The
debug logs about the balance results should help, though, if this crops
up again.

There are a few lock ordering fixes in here which are now documented
extensively. Notably, PollFetches needs the consumer mu, and there is a
huge reason as to why.

The prerevoke and revoke logic, and how we ensure things are done before
returning sometimes, is all more extensively documented.

Lastly, all instances of assignPartitions are now properly guarded by the
consumer mutex. Previously, some instances were not.
twmb added a commit that referenced this issue Mar 21, 2021
ATOMIC STORES
=============

This commit switches the consumer type to be stored in an atomic value,
rather than a uint8 type that specifies which pointer to use guarded by
a mutex.

This switch fundamentally arises from trying to unblock metadata updates
while a group consumer is leaving the group. Previously, we had to grab
the consumer lock to check the consumer type to check if we were
consuming with regex. We avoid that now.

This actually makes a bunch of other areas simpler as well -- many
places needed access to the group consumer so they could run some logic
on it directly. Previously, we had to grab the consumer lock, and for
simplicity we held it through the function. Holding it was unnecessary,
and now we avoid grabbing the lock at all.

Anything that sets the consumer value grabs a new dedicated assignMu.
The dedicated assignMu allows us to unblock a clean group leave, which
may (in a shortly incoming commit) grab the consumer mu to assign
partitions on revoke.

We do not have to worry about TOCTOU: the guarantee is things work in
order. If a person concurrently modifies something, they may change the
outcome of stuff that was set into sequence by original events, but the
outcome is still sound according to our client. Particularly, a later
metadata update will trigger the right sequence for the new assignment.

Same type of logic with offset setting, but people should not be doing
that concurrently with assigning and whatnot.

UPDATES & LOCK ORDERING FIXES
=============================

This is the bulk of this commit that mostly fixes some lock orderings
and missing locks.

This should fix the panic in #24 by at least logging when it would be
detected and continuing; however, the bug itself is still a mystery. The
debug logs about the balance results should help, though, if this crops
up again.

There are a few lock ordering fixes in here which are now documented
extensively. Notably, PollFetches needs the consumer mu, and there is a
huge reason as to why.

The prerevoke and revoke logic, and how we ensure things are done before
returning sometimes, is all more extensively documented.

Lastly, all instances of assignPartitions are now properly guarded by the
consumer mutex. Previously, some instances were not.
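
As a rough illustration of the atomic-store switch described at the top of this message (simplified, not the client's actual types): readers load the active consumer atomically instead of taking the consumer mutex just to inspect which kind of consumer is set, and only swapping the value takes the dedicated assign mutex.

	// Uses sync and sync/atomic. The value is wrapped in a small struct
	// because atomic.Value requires every Store to use the same concrete type.
	type storedConsumer struct{ c interface{} } // e.g. a group or direct consumer

	type consumerState struct {
		assignMu sync.Mutex   // serializes assignments
		v        atomic.Value // holds storedConsumer
	}

	func (s *consumerState) load() interface{} {
		sc, _ := s.v.Load().(storedConsumer)
		return sc.c
	}

	func (s *consumerState) store(c interface{}) {
		s.assignMu.Lock()
		defer s.assignMu.Unlock()
		s.v.Store(storedConsumer{c})
	}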
@twmb
Owner

twmb commented Mar 29, 2021

For posterity, this was moved to Discord and fixed in the following commits:

  • e038916 fixes range balancer
  • e168855 fixes an edge case in the sticky balancer (what caused this panic originally)
  • a670bc7 allows the leader to balance topics that the leader itself does not know of (the stuck problem from this issue)
  • 938651e ensures that the panic will not happen even if a different leader is bugged

These commits were released in v0.6.10.

@twmb twmb closed this as completed Mar 29, 2021