Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

producer: panic in popTransaction #171

Open
chzyer opened this issue Mar 25, 2016 · 10 comments
Open

producer: panic in popTransaction #171

chzyer opened this issue Mar 25, 2016 · 10 comments

Comments

@chzyer
Copy link
Contributor

chzyer commented Mar 25, 2016

Hi, guys. I have meet a unexpected panic in producer, but I couldn't figure out how it happens.

Here is the panic log, and the version of go-nsq is newest.

panic: runtime error: index out of range

goroutine [running]:
runtime.panic(0x7b5a60, 0xa4457c)
    /usr/local/go/src/pkg/runtime/panic.c:279 +0xf5
github.com/nsqio/go-nsq.(*Producer).popTransaction()
    /data/www/go_proj/src/github.com/nsqio/go-nsq/producer.go:336 +0x199
github.com/nsqio/go-nsq.(*Producer).router()
    /data/www/go_proj/src/github.com/nsqio/go-nsq/producer.go:319 +0x3dd
created by github.com/nsqio/go-nsq.(*Producer).connect
    /data/www/go_proj/src/github.com/nsqio/go-nsq/producer.go:290 +0x651

I check (*Producer) popTransaction(), t := w.transactions[0] cause the crash, so I check all the functions which manipulate w.transactions. Unfortunately, all the functions only use in (*Producer).router(), which means there is not possible occurs the data races things.

So, Is there any other possible ? Thanks.

@jehiah jehiah added the bug label Mar 25, 2016
@jehiah
Copy link
Member

jehiah commented Mar 25, 2016

definitely sounds like a bug of some sort. @chzyer can you confirm what version go-nsq you are using, and w/ what version of Go.

Also, to clarify was this a panic you saw once, or is it recurring for you?

@jehiah
Copy link
Member

jehiah commented Mar 25, 2016

The only thing i can think of is if a connection gets an unexpected FrameTypeError that doesn't match a response, or if there is a case where both a response and error are possible. thoughts @mreiferson ?

@mreiferson mreiferson changed the title crash in (*Producer).popTransaction producer: panic in popTransaction Mar 25, 2016
@mreiferson
Copy link
Member

Strange, never seen this before. @chzyer do you have more of the logs for the process that crashed so we can see the event in context?

@RobitYadda
Copy link

RobitYadda commented May 6, 2016

I meet this too. It throw error massage "not connected" before panic.

@jehiah
Copy link
Member

jehiah commented May 6, 2016

@RobitYadda can you share logs that include the context of your error?

@RobitYadda
Copy link

RobitYadda commented May 9, 2016

My code like this:

for{
//other process
    go func() {
        err = writer.Publish("events", msgBytes)
        if err != nil {
            glog.Errorf("%v", err)
        }
    }()
}

Logs(some are logged nsq-client,others are logged by glog above):

2016/05/06 12:43:30 ERR    1 (127.0.0.1:4150) IO error - write tcp 127.0.0.1:40915->127.0.0.1:4150: write: broken pipe
2016/05/06 12:43:35 ERR    1 (127.0.0.1:4150) IO error - write tcp 127.0.0.1:40915->127.0.0.1:4150: write: broken pipe
2016/05/06 12:43:37 INF    1 (127.0.0.1:4150) beginning close
2016/05/06 12:43:39 INF    1 (127.0.0.1:4150) readLoop exiting
2016/05/06 12:43:43 INF    1 (127.0.0.1:4150) breaking out of writeLoop
2016/05/06 12:43:43 INF    1 (127.0.0.1:4150) writeLoop exiting
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T12:44:00Z",message="not connected" 1462509847222037115
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T12:44:18Z",message="not connected" 1462509858100540707
2016/05/06 12:44:19 INF    1 (127.0.0.1:4150) finished draining, cleanup exiting
2016/05/06 12:44:19 INF    1 (127.0.0.1:4150) clean close complete
libra,program=arsenal.linux,type=ERROR value=1,debug="alertFilterEtcd.go:574 on 2016-05-06T12:44:22Z",message="update sensor config fail - client: etcd cluster is unavailable or misconfigured" 1462509862192037144
2016/05/06 12:44:22 INF    1 exiting router
libra,program=arsenal.linux,type=ERROR value=1,debug="alertFilterEtcd.go:574 on 2016-05-06T12:44:41Z",message="update sensor config fail - client: etcd cluster is unavailable or misconfigured" 1462509881807959037
2016/05/06 12:45:16 INF    1 (127.0.0.1:4150) connecting to nsqd
2016/05/06 12:45:18 ERR    1 (127.0.0.1:4150) error connecting to nsqd - dial tcp 127.0.0.1:4150: i/o timeout
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T12:45:18Z",message="dial tcp 127.0.0.1:4150: i/o timeout" 1462509919885942504
2016/05/06 12:45:23 INF    1 (127.0.0.1:4150) connecting to nsqd
2016/05/06 12:45:32 ERR    1 (127.0.0.1:4150) error connecting to nsqd - dial tcp 127.0.0.1:4150: i/o timeout
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T12:45:32Z",message="dial tcp 127.0.0.1:4150: i/o timeout" 1462509933682324968
2016/05/06 12:49:00 INF    1 (127.0.0.1:4150) connecting to nsqd
2016/05/06 12:50:30 ERR    1 (127.0.0.1:4150) IO error - write tcp 127.0.0.1:42109->127.0.0.1:4150: write: broken pipe
2016/05/06 12:51:14 ERR    1 (127.0.0.1:4150) error connecting to nsqd - failed to IDENTIFY - write tcp 127.0.0.1:42109->127.0.0.1:4150: write: broken pipe
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T12:51:14Z",message="failed to IDENTIFY - write tcp 127.0.0.1:42109->127.0.0.1:4150: write: broken pipe" 1462510274685366850
2016/05/06 12:51:19 INF    1 (127.0.0.1:4150) connecting to nsqd
2016/05/06 12:54:17 ERR    1 (127.0.0.1:4150) IO error - write tcp 127.0.0.1:42150->127.0.0.1:4150: write: broken pipe
2016/05/06 12:54:17 ERR    1 (127.0.0.1:4150) error connecting to nsqd - failed to IDENTIFY - write tcp 127.0.0.1:42150->127.0.0.1:4150: write: broken pipe
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T12:54:17Z",message="failed to IDENTIFY - write tcp 127.0.0.1:42150->127.0.0.1:4150: write: broken pipe" 1462510457374874116
2016/05/06 12:54:17 INF    1 (127.0.0.1:4150) connecting to nsqd
2016/05/06 12:54:36 ERR    1 (127.0.0.1:4150) error connecting to nsqd - dial tcp 127.0.0.1:4150: i/o timeout
2016/05/06 12:54:36 INF    1 (127.0.0.1:4150) connecting to nsqd
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T12:54:36Z",message="dial tcp 127.0.0.1:4150: i/o timeout" 1462510476560542319
2016/05/06 12:57:13 ERR    1 (127.0.0.1:4150) IO error - write tcp 127.0.0.1:42199->127.0.0.1:4150: write: broken pipe
2016/05/06 12:57:13 ERR    1 (127.0.0.1:4150) error connecting to nsqd - failed to IDENTIFY - write tcp 127.0.0.1:42199->127.0.0.1:4150: write: broken pipe
2016/05/06 12:57:20 INF    1 (127.0.0.1:4150) connecting to nsqd
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T12:57:13Z",message="failed to IDENTIFY - write tcp 127.0.0.1:42199->127.0.0.1:4150: write: broken pipe" 1462510730900303035
2016/05/06 13:02:10 ERR    1 (127.0.0.1:4150) IO error - EOF
2016/05/06 13:02:18 INF    1 (127.0.0.1:4150) beginning close
2016/05/06 13:02:18 INF    1 (127.0.0.1:4150) readLoop exiting
2016/05/06 13:02:18 INF    1 (127.0.0.1:4150) breaking out of writeLoop
2016/05/06 13:02:18 INF    1 (127.0.0.1:4150) writeLoop exiting
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:15Z",message="not connected" 1462510938989011182
2016/05/06 13:02:19 ERR    1 (127.0.0.1:4150) IO error - write tcp 127.0.0.1:42238->127.0.0.1:4150: write: broken pipe
2016/05/06 13:02:19 ERR    1 (127.0.0.1:4150) sending command - write tcp 127.0.0.1:42238->127.0.0.1:4150: write: broken pipe
2016/05/06 13:02:19 ERR    1 (127.0.0.1:4150) IO error - write tcp 127.0.0.1:42238->127.0.0.1:4150: write: broken pipe
2016/05/06 13:02:19 ERR    1 (127.0.0.1:4150) sending command - write tcp 127.0.0.1:42238->127.0.0.1:4150: write: broken pipe
2016/05/06 13:02:21 ERR    1 (127.0.0.1:4150) IO error - write tcp 127.0.0.1:42238->127.0.0.1:4150: write: broken pipe
2016/05/06 13:02:21 ERR    1 (127.0.0.1:4150) sending command - write tcp 127.0.0.1:42238->127.0.0.1:4150: write: broken pipe
2016/05/06 13:02:21 ERR    1 (127.0.0.1:4150) IO error - write tcp 127.0.0.1:42238->127.0.0.1:4150: write: broken pipe
2016/05/06 13:02:22 INF    1 (127.0.0.1:4150) finished draining, cleanup exiting
2016/05/06 13:02:22 INF    1 (127.0.0.1:4150) clean close complete
2016/05/06 13:02:24 ERR    1 (127.0.0.1:4150) sending command - write tcp 127.0.0.1:42238->127.0.0.1:4150: write: broken pipe
2016/05/06 13:02:24 INF    1 exiting router
2016/05/06 13:02:46 INF    1 (127.0.0.1:4150) connecting to nsqd
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:21Z",message="not connected" 1462510941901917675
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:24Z",message="not connected" 1462510944927696160
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:30Z",message="not connected" 1462510950462640975
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:47Z",message="not connected" 1462510967318804398
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:47Z",message="not connected" 1462510967626758648
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:47Z",message="not connected" 1462510967626774878
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:47Z",message="not connected" 1462510967626801894
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:47Z",message="not connected" 1462510967935213569
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:03:00Z",message="not connected" 1462510980931576223
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:03:01Z",message="not connected" 1462510981763260898
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:35Z",message="not connected" 1462510984484744221
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:36Z",message="not connected" 1462510986899016306
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:03:00Z",message="not connected" 1462510986899025539
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:03:00Z",message="not connected" 1462510987012900523
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:03:07Z",message="not connected" 1462510987012914554
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:03:07Z",message="not connected" 1462510987012935646
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:03:07Z",message="not connected" 1462510987012942224
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:45Z",message="not connected" 1462510989908608763
libra,program=arsenal.linux,type=ERROR value=1,debug="tonsq.go:113 on 2016-05-06T13:02:47Z",message="not connected" 1462510998289152495
panic: runtime error: index out of range

goroutine 190052618 [running]:
github.com/bitly/go-nsq.(*Producer).router(0xc82016e000)
        /Users/yanzizhen/go/src/github.com/bitly/go-nsq/producer.go:319 +0x759
created by github.com/bitly/go-nsq.(*Producer).connect
        /Users/yanzizhen/go/src/github.com/bitly/go-nsq/producer.go:290 +0x7cc
......

@judwhite
Copy link
Contributor

judwhite commented May 9, 2016

I looked at producer.go and conn.go and wasn't able to see how the transactions slice is modified outside of that function (while inside that function), or understand why it'd be 0. The locking around close, connect, and router all seem correct, but obviously something is going on.

@RobitYadda @chzyer I wrote a small program attempting to reproduce this issue with no luck.

Are you able to repro with the above code?

Are there any modifications you can make to make it repro?

My logs:

2016/05/09 02:47:24 INF    1 (127.0.0.1:4150) connecting to nsqd
2016/05/09 02:47:25 210000 successful messages
2016/05/09 02:47:25 Service "nsqd" is RUNNING
2016/05/09 02:47:30 Service "nsqd" is STOP_PENDING
2016/05/09 02:47:30 ERR    1 (127.0.0.1:4150) IO error - read tcp 127.0.0.1:64022->127.0.0.1:4150: wsarecv: An existing connection was forcibly closed by the remote host.
2016/05/09 02:47:30 INF    1 (127.0.0.1:4150) beginning close
2016/05/09 02:47:30 INF    1 (127.0.0.1:4150) readLoop exiting
2016/05/09 02:47:30 INF    1 (127.0.0.1:4150) breaking out of writeLoop
2016/05/09 02:47:30 INF    1 (127.0.0.1:4150) writeLoop exiting
2016/05/09 02:47:30 INF    1 (127.0.0.1:4150) finished draining, cleanup exiting
2016/05/09 02:47:30 INF    1 (127.0.0.1:4150) clean close complete
2016/05/09 02:47:30 INF    1 exiting router
2016/05/09 02:47:31 Service "nsqd" is STOPPED
2016/05/09 02:47:33 INF    1 (127.0.0.1:4150) connecting to nsqd
2016/05/09 02:47:33 Service "nsqd" is START_PENDING
2016/05/09 02:47:34 Service "nsqd" is RUNNING
2016/05/09 02:47:34 220000 successful messages
2016/05/09 02:47:39 Exiting...
2016/05/09 02:47:39 INF    1 stopping
2016/05/09 02:47:39 INF    1 exiting router
2016/05/09 02:47:42 Service "nsqd" is STOP_PENDING
2016/05/09 02:47:42 ERR    1 (127.0.0.1:4150) IO error - read tcp 127.0.0.1:64032->127.0.0.1:4150: wsarecv: An existing connection was forcibly closed by the remote host.
2016/05/09 02:47:42 INF    1 (127.0.0.1:4150) beginning close
2016/05/09 02:47:42 INF    1 (127.0.0.1:4150) readLoop exiting
2016/05/09 02:47:42 INF    1 (127.0.0.1:4150) breaking out of writeLoop
2016/05/09 02:47:42 INF    1 (127.0.0.1:4150) writeLoop exiting
2016/05/09 02:47:42 INF    1 (127.0.0.1:4150) finished draining, cleanup exiting
2016/05/09 02:47:42 INF    1 (127.0.0.1:4150) clean close complete
2016/05/09 02:47:43 Service "nsqd" is STOPPED
2016/05/09 02:47:43 Done

@RobitYadda
Copy link

The link return 404

@judwhite
Copy link
Contributor

judwhite commented May 9, 2016

@RobitYadda I accidentally linked to the edit page.. try now.

@judwhite
Copy link
Contributor

For a minute I thought this might be a race with _heartbeat_ but I don't see it... putting it out there in case anyone wants to take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants