This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Panic after calling Send (after x hours of functioning properly) #149

Closed
brocaar opened this issue Oct 30, 2019 · 15 comments · Fixed by #158

@brocaar

brocaar commented Oct 30, 2019

I'm not sure if this is a bug in the SDK (it looks similar to #141) or an implementation issue on my side that I'm not aware of, but after hours of functioning properly and publishing messages to the Azure Service Bus, my application panicked with the following trace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xb2110d]

goroutine 1072417 [running]:
github.com/Azure/azure-service-bus-go.(*Sender).Send(0xc00019f260, 0x1620840, 0xc0002b43c0, 0xc0008ab3b0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/pkg/mod/github.com/!azure/[email protected]/sender.go:151 +0x32d

github.com/Azure/azure-service-bus-go.(*Topic).Send(0xc0003978e0, 0x1620840, 0xc0002b43c0, 0xc0008ab3b0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/pkg/mod/github.com/!azure/[email protected]/topic.go:105 +0x1c1

github.com/brocaar/lora-app-server/internal/integration/azureservicebus.(*Integration).publish(0xc0002b4300, 0x1620900, 0xc000a9a900, 0x135d2cd, 0x2, 0x18, 0x270375d39373433, 0x131a280, 0xc0007c59a0, 0x0, ...)
/lora-app-server/internal/integration/azureservicebus/azure_service_bus.go:193 +0x6b9

github.com/brocaar/lora-app-server/internal/integration/azureservicebus.(*Integration).SendDataUp(0xc0002b4300, 0x1620900, 0xc000a9a900, 0x18, 0xc00098f6e0, 0x13, 0xc000534aa0, 0x10, 0x270375d39373433, 0xc000855a70, ...)
/lora-app-server/internal/integration/azureservicebus/azure_service_bus.go:78 +0xd7

As you can see in the trace, this is where the SDK gets called:
https://github.com/brocaar/lora-app-server/blob/v3.4.0/internal/integration/azureservicebus/azure_service_bus.go#L193

Before the panic, Send did publish a message to the Service Bus without error (i.e. there was no indication of a connection issue). This was in the log files just before the panic:

integration/azureservicebus: event published" ctx_id=xxxxxxxx dev_eui=xxxxxxx event=up

(https://github.com/brocaar/lora-app-server/blob/v3.4.0/internal/integration/azureservicebus/azure_service_bus.go#L199)

@devigned
Member

devigned commented Oct 30, 2019

Ok, so s.session must be nil.

msg.SessionID = &s.session.SessionID

That should only happen if the sender is in an unexpected internal state when Send is called.

I think what might be happening is that the sender is recovering or closing the connection at the same time Send is called. Close sets session = nil, and Send does not respect the Close / Recover mutex, which leaves the possibility of session being nil.

Seems like Send should try to acquire the sender mutex.

Thoughts?
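
To make the suspected race concrete, here is a minimal sketch with hypothetical, simplified types (not the SDK's actual code): Close/Recover mutate the session under clientMu, while Send dereferences it without taking the lock.

package main

import (
	"sync"
	"time"
)

// Hypothetical, simplified types for illustration; the SDK's real
// sender and session carry much more state.
type message struct{ SessionID *string }

type session struct{ SessionID string }

type sender struct {
	clientMu sync.RWMutex
	session  *session // set to nil by Close / during Recover
}

// Close mutates s.session under the write lock.
func (s *sender) Close() {
	s.clientMu.Lock()
	defer s.clientMu.Unlock()
	s.session = nil
}

// Send reads s.session without taking any lock, so a concurrent
// Close/Recover can nil it out and the dereference below panics,
// matching the trace at sender.go:151.
func (s *sender) Send(msg *message) {
	msg.SessionID = &s.session.SessionID
}

func main() {
	s := &sender{session: &session{SessionID: "sess-1"}}
	go s.Close()                 // simulate Close/Recover racing with an in-flight Send
	time.Sleep(time.Millisecond) // give Close a chance to run first
	s.Send(&message{})           // panics: nil pointer dereference
}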

@brocaar
Author

brocaar commented Oct 30, 2019

I haven't dived fully into the SDK internals, but I was thinking the same :)

@brocaar
Author

brocaar commented Oct 31, 2019

Seems like Send should try to acquire the sender mutex.

@devigned I think that adding

s.clientMu.RLock()
defer s.clientMu.RUnlock()

to the Send function might cause a deadlock (Send calls trySend, which might call s.Recover, which acquires the lock too).

I'm not sure if putting the lock only around msg.SessionID = &s.session.SessionID would be safe / sufficient, as SessionID is a pointer to a value which might change when recovery is triggered in another goroutine.

Acquiring the lock in Send and releasing it after s.sender.Send also doesn't look like a clean solution, as it puts the lock and unlock in two different places. Or maybe part of the code within Send could be moved into trySend to keep the lock / unlock in one place?

Is this something you could investigate further?
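
For what it's worth, a minimal hypothetical sketch of that deadlock (illustrative names, not the SDK's code): with a sync.RWMutex, a goroutine that takes the read lock and then requests the write lock blocks on itself.

package main

import "sync"

type sender struct {
	clientMu sync.RWMutex
}

// Send holds the read lock for its whole body.
func (s *sender) Send() {
	s.clientMu.RLock()
	defer s.clientMu.RUnlock()
	s.trySend()
}

// trySend pretends the send failed and triggers a recovery.
func (s *sender) trySend() {
	s.Recover()
}

// Recover needs the write lock, but the same goroutine still holds the
// read lock taken in Send, so Lock never returns.
func (s *sender) Recover() {
	s.clientMu.Lock()
	defer s.clientMu.Unlock()
	// rebuild connection, session, link ...
}

func main() {
	s := &sender{}
	s.Send() // fatal error: all goroutines are asleep - deadlock!
}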

@asanchezdelc

Any timeline on when this will be fixed?

@brocaar
Author

brocaar commented Nov 11, 2019

@devigned could you give an update on the status of this issue? Several Azure customers are currently affected by it.

@devigned
Member

I'm sorry for the delay. I definitely will take care of it if no one else is able to attend to the issue. However, I've been trying to transition this project to other folks as my role at MSFT has changed.

@jhendrixMSFT do you know who would be the right person to handle this issue?

@jhendrixMSFT
Member

@catalinaperalta and I can take a look. To set expectations: we're both new to this space and there's a lot to unpack here, so it might take a while to roll out a fix.

@devigned
Member

devigned commented Nov 11, 2019

Happy to help if there are any questions, @jhendrixMSFT and @catalinaperalta. Basically, Send has a shared resource and locks access to it when recovering from failure, but some of that shared state is accessed outside the lock scope. Correcting that access, either by moving it within the scope of the lock or by refactoring out the shared state (the session ID), would fix the issue.

@jhendrixMSFT
Member

What shared state is clientMu guarding? The name implies the client field, but from this issue it sounds like it also guards the session field. Any others?

@devigned
Member

It's really any action on the client during recovery or the initial establishment of a connection; that's why it's clientMu. The AMQP client consists of a connection, a session, and a link.

@catalinaperalta
Member

Just wanted to give an update: our suggested fix for this is to put a read lock around:

msg.SessionID = &s.session.SessionID
next := s.session.getNext()
msg.GroupSequence = &next

However, we found a performance regression due to a recent change and are looking into it before rolling this out.
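
For illustration, here is a sketch of what that could look like, using hypothetical, simplified types (the stampSession helper, the atomic counter, and the field names are illustrative, not the actual patch): the read lock is held across all session accesses so a concurrent Close/Recover, which takes the write lock, cannot nil out or replace the session mid-read.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type message struct {
	SessionID     *string
	GroupSequence *uint32
}

type session struct {
	SessionID string
	counter   uint32
}

// getNext increments atomically so it stays safe under a shared read lock.
func (s *session) getNext() uint32 {
	return atomic.AddUint32(&s.counter, 1)
}

type sender struct {
	clientMu sync.RWMutex
	session  *session
}

// stampSession holds the read lock across the session accesses quoted above.
func (s *sender) stampSession(msg *message) {
	s.clientMu.RLock()
	defer s.clientMu.RUnlock()
	msg.SessionID = &s.session.SessionID
	next := s.session.getNext()
	msg.GroupSequence = &next
}

func main() {
	s := &sender{session: &session{SessionID: "sess-1"}}
	msg := &message{}
	s.stampSession(msg)
	fmt.Println(*msg.SessionID, *msg.GroupSequence) // sess-1 1
}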

@mcgradycchen
Contributor

Hi @devigned

After upgrading from v0.7.0 to v0.10.0, we also hit the same error.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8ee64f]

goroutine 11 [running]:
main/vendor/github.com/Azure/azure-service-bus-go.(*Sender).Send(0xc4201c47e0, 0xbfa7e0, 0xc4200220a0, 0xc4211ec780, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/sender.go:151 +0x34f
main/vendor/github.com/Azure/azure-service-bus-go.(*Queue).Send(0xc420370600, 0xbfa7e0, 0xc4200220a0, 0xc4211ec780, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/queue.go:167 +0x1a9
main/stream/asb.(*ASBStream).ProduceMessage(0xc420086820, 0xc42000ed48, 0x7f8f81a02601, 0xc420ca2120)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/stream/asb/asbs.go:185 +0xa7
main/stream/asb.(*ASBStream).ProduceMessageWithReconnect(0xc420086820, 0xc42000ed48, 0xb70522, 0x4)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/stream/asb/asbs.go:202 +0x4d
main/flow.(*T0ResultHandler).handle(0xc4200f9260, 0xbfe7e0, 0xc420d7e000, 0xc420b93260, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/result.go:101 +0x1f5
main/flow.(*VirtualAnalyzerResultHandler).handle(0xc4200f9280, 0xbfe7e0, 0xc420d7e000, 0xc420b93260, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/result.go:77 +0x22f
main/flow.(*ActionResultHandler).handle(0xc4200f92a0, 0xbfe7e0, 0xc420d7e000, 0xc420b93260, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/result.go:43 +0xcb
main/flow.WrapScanEventKafkaHandler.func1(0xc42006c5a0, 0xb9ed78, 0xbf4500, 0xc4200f92a0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/scan.go:117 +0x82c
created by main/flow.WrapScanEventKafkaHandler
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/flow/scan.go:94 +0x17d

In addition, our consumer still panics when it reconnects:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x92f67a]

goroutine 95542 [running]:
main/vendor/github.com/Azure/go-amqp.(*Receiver).Receive(0x0, 0xeb4ec0, 0xc420406b40, 0xe14776, 0x1c, 0xeb4ec0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/go-amqp/client.go:1864 +0x3a
main/vendor/github.com/Azure/azure-service-bus-go.(*Receiver).listenForMessage(0xc4200cf600, 0xeb4ec0, 0xc420406b40, 0x0, 0x0, 0x0)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/receiver.go:338 +0x107
main/vendor/github.com/Azure/azure-service-bus-go.(*Receiver).listenForMessages(0xc4200cf600, 0xeb4ec0, 0xc420406b40, 0xc421831c80)
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/receiver.go:290 +0xef
created by main/vendor/github.com/Azure/azure-service-bus-go.(*Receiver).Listen
        /home/jenkinsbuild/ci-jenkins/workspace/TMCAS/cas-selfci-prod/prod/cas-gmail/go/src/main/vendor/github.com/Azure/azure-service-bus-go/receiver.go:212 +0x139

Is there any plan to fix this issue?

Thanks

@devigned
Member

The current maintainers here are @jhendrixMSFT and @catalinaperalta. I'll defer to them. If needed, I'm happy to advise or to put the PR together.

@johejo
Contributor

johejo commented Feb 22, 2020

Hi.
It has also happened in my production environment.
Any update on this?

johejo added a commit to johejo/azure-service-bus-go that referenced this issue Feb 26, 2020
johejo added a commit to johejo/azure-service-bus-go that referenced this issue Feb 26, 2020
johejo added a commit to johejo/azure-service-bus-go that referenced this issue Mar 4, 2020
johejo added a commit to johejo/azure-service-bus-go that referenced this issue Mar 9, 2020
jhendrixMSFT added a commit that referenced this issue Apr 28, 2020
* fix #149 add mutex RLock/RUnlock

* Recover should close and rebuild link atomically.

Return ErrConnectionClosed when attempting to send a message on a closed
connection.

* Take read lock when accessing s.sender

Consolidate code for creating ErrConnectionClosed

Co-authored-by: Joel Hendrix <[email protected]>
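
Roughly, the shape of that fix can be sketched as follows (hypothetical, simplified types rather than the actual patch): Send checks state under the read lock and returns ErrConnectionClosed instead of dereferencing nil, while Recover tears down and rebuilds its state entirely under the write lock.

package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrConnectionClosed = errors.New("the connection has been closed")

type message struct{ SessionID *string }

type session struct{ SessionID string }

type sender struct {
	clientMu sync.RWMutex
	session  *session
}

// Send fails fast on a closed sender instead of panicking.
func (s *sender) Send(msg *message) error {
	s.clientMu.RLock()
	defer s.clientMu.RUnlock()
	if s.session == nil {
		return ErrConnectionClosed
	}
	msg.SessionID = &s.session.SessionID
	return nil
}

// Recover closes and rebuilds atomically: readers either see the old
// session or the new one, never a half-recovered sender.
func (s *sender) Recover() {
	s.clientMu.Lock()
	defer s.clientMu.Unlock()
	// tear down the old link/session here, then rebuild before
	// releasing the lock so no reader observes a nil session
	s.session = &session{SessionID: "recovered"}
}

func main() {
	s := &sender{} // closed / never-opened sender
	fmt.Println(s.Send(&message{})) // the connection has been closed
}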
@jhendrixMSFT
Member

Fixed in v0.10.1.
