This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Panic after calling Send (after x hours of functioning properly) #149

Closed
brocaar opened this issue Oct 30, 2019 · 15 comments · Fixed by #158

@brocaar

brocaar commented Oct 30, 2019

I'm not sure if this is a bug in the SDK (it looks similar to #141) or an implementation issue on my side that I'm not aware of, but after hours of functioning properly and publishing messages to the Azure Service Bus, my application panicked with the following trace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xb2110d]

goroutine 1072417 [running]:
github.com/Azure/azure-service-bus-go.(*Sender).Send(0xc00019f260, 0x1620840, 0xc0002b43c0, 0xc0008ab3b0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/pkg/mod/github.com/!azure/[email protected]/sender.go:151 +0x32d

github.com/Azure/azure-service-bus-go.(*Topic).Send(0xc0003978e0, 0x1620840, 0xc0002b43c0, 0xc0008ab3b0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/pkg/mod/github.com/!azure/[email protected]/topic.go:105 +0x1c1

github.com/brocaar/lora-app-server/internal/integration/azureservicebus.(*Integration).publish(0xc0002b4300, 0x1620900, 0xc000a9a900, 0x135d2cd, 0x2, 0x18, 0x270375d39373433, 0x131a280, 0xc0007c59a0, 0x0, ...)
/lora-app-server/internal/integration/azureservicebus/azure_service_bus.go:193 +0x6b9

github.com/brocaar/lora-app-server/internal/integration/azureservicebus.(*Integration).SendDataUp(0xc0002b4300, 0x1620900, 0xc000a9a900, 0x18, 0xc00098f6e0, 0x13, 0xc000534aa0, 0x10, 0x270375d39373433, 0xc000855a70, ...)
/lora-app-server/internal/integration/azureservicebus/azure_service_bus.go:78 +0xd7

As you can see in the trace, this is where the SDK gets called:
https://github.com/brocaar/lora-app-server/blob/v3.4.0/internal/integration/azureservicebus/azure_service_bus.go#L193

Before the panic, Send did publish a message to the Service Bus without error (i.e. there was no indication of a connection issue). This was in the log files just before the panic:

integration/azureservicebus: event published" ctx_id=xxxxxxxx dev_eui=xxxxxxx event=up

(https://github.com/brocaar/lora-app-server/blob/v3.4.0/internal/integration/azureservicebus/azure_service_bus.go#L199)

@devigned
Member

devigned commented Oct 30, 2019

Ok, so s.session must be nil.

msg.SessionID = &s.session.SessionID

That should only happen if the sender is in an unexpected internal state when Send is called.

I think what might be happening is that the sender is recovering or closing the connection at the same time Send is called. Close sets session = nil, and Send does not respect the Close / Recover mutex, which leaves the possibility of session being nil.

Seems like Send should try to acquire the sender mutex.

Thoughts?
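
To make the suspected race concrete, here is a minimal sketch with hypothetical, simplified types (not the SDK's actual code): Close/Recover mutate the session under clientMu, while Send dereferences it without taking the lock.

package main

import (
	"sync"
	"time"
)

// Hypothetical, simplified types for illustration; the SDK's real
// sender and session carry much more state.
type message struct{ SessionID *string }

type session struct{ SessionID string }

type sender struct {
	clientMu sync.RWMutex
	session  *session // set to nil by Close / during Recover
}

// Close mutates s.session under the write lock.
func (s *sender) Close() {
	s.clientMu.Lock()
	defer s.clientMu.Unlock()
	s.session = nil
}

// Send reads s.session without taking any lock, so a concurrent
// Close/Recover can nil it out and the dereference below panics,
// matching the trace at sender.go:151.
func (s *sender) Send(msg *message) {
	msg.SessionID = &s.session.SessionID
}

func main() {
	s := &sender{session: &session{SessionID: "sess-1"}}
	go s.Close()                 // simulate Close/Recover racing with an in-flight Send
	time.Sleep(time.Millisecond) // give Close a chance to run first
	s.Send(&message{})           // panics: nil pointer dereference
}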

@brocaar
Author

brocaar commented Oct 30, 2019

I haven't dived fully into the SDK internals, but I was thinking the same :)

@brocaar
Author

brocaar commented Oct 31, 2019

Seems like Send should try to acquire the sender mutex.

@devigned I think that adding

s.clientMu.RLock()
defer s.clientMu.RUnlock()

to the Send function might cause a deadlock (Send calls trySend, which might call s.Recover, which acquires the lock too).

I'm not sure if putting the lock only around msg.SessionID = &s.session.SessionID would be safe / sufficient, as SessionID is a pointer to a value which might change when recovery is triggered in another goroutine.

Acquiring the lock in Send and releasing it after s.sender.Send also doesn't look like a clean solution, as it puts the lock and unlock in two different places. Or maybe part of the code within Send could be moved into trySend to keep the lock / unlock in one place?

Is this something you could investigate further?
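
For what it's worth, a minimal hypothetical sketch of that deadlock (illustrative names, not the SDK's code): with a sync.RWMutex, a goroutine that takes the read lock and then requests the write lock blocks on itself.

package main

import "sync"

type sender struct {
	clientMu sync.RWMutex
}

// Send holds the read lock for its whole body.
func (s *sender) Send() {
	s.clientMu.RLock()
	defer s.clientMu.RUnlock()
	s.trySend()
}

// trySend pretends the send failed and triggers a recovery.
func (s *sender) trySend() {
	s.Recover()
}

// Recover needs the write lock, but the same goroutine still holds the
// read lock taken in Send, so Lock never returns.
func (s *sender) Recover() {
	s.clientMu.Lock()
	defer s.clientMu.Unlock()
	// rebuild connection, session, link ...
}

func main() {
	s := &sender{}
	s.Send() // fatal error: all goroutines are asleep - deadlock!
}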

@asanchezdelc

Any timeline on when this will be fixed?

@brocaar
Author

brocaar commented Nov 11, 2019

@devigned could you give an update on the status of this issue? Several Azure customers are currently affected by it.

@devigned
Member

I'm sorry for the delay. I definitely will take care of it if no one else is able to attend to the issue. However, I've been trying to transition this project to other folks as my role at MSFT has changed.

@jhendrixMSFT do you know who would be the right person to handle this issue?

@jhendrixMSFT
Member

@catalinaperalta and I can take a look. To set expectations: we're both new to this space and there's a lot to unpack here, so it might take a while to roll out a fix.

@devigned
Member

devigned commented Nov 11, 2019

Happy to help if there are any questions, @jhendrixMSFT and @catalinaperalta. Basically, Send has a shared resource and locks access to it when recovering from failure, but some of that shared state is accessed outside the lock scope. Correcting that access, either by moving it within the scope of the lock or by refactoring out the shared state (the session ID), would fix the issue.

@jhendrixMSFT
Member

What shared state is clientMu guarding? The name implies the client field, but from this issue it sounds like it also guards the session field. Any others?

@devigned
Member

It's really any action on the client during recovery or the initial establishment of a connection; that's why it's clientMu. The AMQP client consists of a connection, a session, and a link.

@catalinaperalta
Member

Just wanted to give an update: our suggested fix for this is to put a read lock around:

msg.SessionID = &s.session.SessionID
next := s.session.getNext()
msg.GroupSequence = &next

However, we found a performance regression due to a recent change and are looking into it before rolling this out.
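
For illustration, here is a sketch of what that could look like, using hypothetical, simplified types (the stampSession helper, the atomic counter, and the field names are illustrative, not the actual patch): the read lock is held across all session accesses so a concurrent Close/Recover, which takes the write lock, cannot nil out or replace the session mid-read.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type message struct {
	SessionID     *string
	GroupSequence *uint32
}

type session struct {
	SessionID string
	counter   uint32
}

// getNext increments atomically so it stays safe under a shared read lock.
func (s *session) getNext() uint32 {
	return atomic.AddUint32(&s.counter, 1)
}

type sender struct {
	clientMu sync.RWMutex
	session  *session
}

// stampSession holds the read lock across the session accesses quoted above.
func (s *sender) stampSession(msg *message) {
	s.clientMu.RLock()
	defer s.clientMu.RUnlock()
	msg.SessionID = &s.session.SessionID
	next := s.session.getNext()
	msg.GroupSequence = &next
}

func main() {
	s := &sender{session: &session{SessionID: "sess-1"}}
	msg := &message{}
	s.stampSession(msg)
	fmt.Println(*msg.SessionID, *msg.GroupSequence) // sess-1 1
}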

@mcgradycchen
Contributor

Hi @devigned

After upgrading from v0.7.0 to v0.10.0, we also hit the same error.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8ee64f]

goroutine 11 [running]:
main/vendor/github.com/Azure/azure-service-bus-go.(*Sender).Send(0xc4201c47e0, 0xbfa7e0, 0xc4200220a0, 0xc4211ec780, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/sender.go:151 +0x34f
main/vendor/github.com/Azure/azure-service-bus-go.(*Queue).Send(0xc420370600, 0xbfa7e0, 0xc4200220a0, 0xc4211ec780, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/queue.go:167 +0x1a9
main/stream/asb.(*ASBStream).ProduceMessage(0xc420086820, 0xc42000ed48, 0x7f8f81a02601, 0xc420ca2120)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/stream/asb/asbs.go:185 +0xa7
main/stream/asb.(*ASBStream).ProduceMessageWithReconnect(0xc420086820, 0xc42000ed48, 0xb70522, 0x4)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/stream/asb/asbs.go:202 +0x4d
main/flow.(*T0ResultHandler).handle(0xc4200f9260, 0xbfe7e0, 0xc420d7e000, 0xc420b93260, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/result.go:101 +0x1f5
main/flow.(*VirtualAnalyzerResultHandler).handle(0xc4200f9280, 0xbfe7e0, 0xc420d7e000, 0xc420b93260, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/result.go:77 +0x22f
main/flow.(*ActionResultHandler).handle(0xc4200f92a0, 0xbfe7e0, 0xc420d7e000, 0xc420b93260, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/result.go:43 +0xcb
main/flow.WrapScanEventKafkaHandler.func1(0xc42006c5a0, 0xb9ed78, 0xbf4500, 0xc4200f92a0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/scan.go:117 +0x82c
created by main/flow.WrapScanEventKafkaHandler
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/scan.go:94 +0x17d

In addition, our consumer still panics when it reconnects:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x92f67a]

goroutine 95542 [running]:
main/vendor/github.com/Azure/go-amqp.(*Receiver).Receive(0x0, 0xeb4ec0, 0xc420406b40, 0xe14776, 0x1c, 0xeb4ec0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/go-amqp/client.go:1864 +0x3a
main/vendor/github.com/Azure/azure-service-bus-go.(*Receiver).listenForMessage(0xc4200cf600, 0xeb4ec0, 0xc420406b40, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/receiver.go:338 +0x107
main/vendor/github.com/Azure/azure-service-bus-go.(*Receiver).listenForMessages(0xc4200cf600, 0xeb4ec0, 0xc420406b40, 0xc421831c80)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/receiver.go:290 +0xef
created by main/vendor/github.com/Azure/azure-service-bus-go.(*Receiver).Listen
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/receiver.go:212 +0x139

Is there any plan to fix this issue?

Thanks

@devigned
Member

The current maintainers here are @jhendrixMSFT and @catalinaperalta. I'll defer to them. If needed, I'm happy to advise or to put the PR together.

@johejo
Contributor

johejo commented Feb 22, 2020

Hi.
It has also happened in my production environment.
Any update on this?

johejo added a commit to johejo/azure-service-bus-go that referenced this issue Feb 26, 2020
johejo added a commit to johejo/azure-service-bus-go that referenced this issue Feb 26, 2020
johejo added a commit to johejo/azure-service-bus-go that referenced this issue Mar 4, 2020
johejo added a commit to johejo/azure-service-bus-go that referenced this issue Mar 9, 2020
jhendrixMSFT added a commit that referenced this issue Apr 28, 2020
* fix #149 add mutex RLock/RUnlock

* Recover should close and rebuild link atomically.

Return ErrConnectionClosed when attempting to send a message on a closed
connection.

* Take read lock when accessing s.sender

Consolidate code for creating ErrConnectionClosed

Co-authored-by: Joel Hendrix <[email protected]>
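
Roughly, the shape of that fix can be sketched as follows (hypothetical, simplified types rather than the actual patch): Send checks state under the read lock and returns ErrConnectionClosed instead of dereferencing nil, while Recover tears down and rebuilds its state entirely under the write lock.

package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrConnectionClosed = errors.New("the connection has been closed")

type message struct{ SessionID *string }

type session struct{ SessionID string }

type sender struct {
	clientMu sync.RWMutex
	session  *session
}

// Send fails fast on a closed sender instead of panicking.
func (s *sender) Send(msg *message) error {
	s.clientMu.RLock()
	defer s.clientMu.RUnlock()
	if s.session == nil {
		return ErrConnectionClosed
	}
	msg.SessionID = &s.session.SessionID
	return nil
}

// Recover closes and rebuilds atomically: readers either see the old
// session or the new one, never a half-recovered sender.
func (s *sender) Recover() {
	s.clientMu.Lock()
	defer s.clientMu.Unlock()
	// tear down the old link/session here, then rebuild before
	// releasing the lock so no reader observes a nil session
	s.session = &session{SessionID: "recovered"}
}

func main() {
	s := &sender{} // closed / never-opened sender
	fmt.Println(s.Send(&message{})) // the connection has been closed
}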
@jhendrixMSFT
Member

Fixed in v0.10.1.
