Panic when moving to neoq_dead_jobs #98

Closed
BillBuilt opened this issue Oct 13, 2023 · 16 comments · Fixed by #100
Labels: bug (Something isn't working), postgres backend (Postgres backend issues)

Comments

@BillBuilt

Thank you for the patch for #85; however, it gets me to the next set of problems.

First, when a failed job exceeds its max retries, it should be moved to the neoq_dead_jobs table; however, a panic is generated there:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a53181]

goroutine 118 [running]:
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).moveToDeadQueue(0xc000774150?, {0x20bd150, 0xc000774150}, {0x20c4300, 0xc0000126a8}, 0xc0001f5680, {0x0, 0x0})
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:442 +0xe1
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).updateJob(0xc0003f4000, {0x20bd150, 0xc000774150}, {0x0, 0x0})
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:484 +0x2c5
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).handleJob(0xc0003f4000, {0x20bd188, 0xc00039c370}, {0x28a0a38, 0x1}, {0xc000412150, 0x8, 0x6fc23ac00, 0x0, {0x1dc5b27, ...}})
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:723 +0x4ce
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).start.func2()
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:539 +0x1db
created by github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).start in goroutine 5
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:532 +0x353

I believe this happens because jobErr is not populated, which causes the nil pointer dereference. However, the error message is also stored on the job struct, so perhaps the fix is to drop the jobErr parameter and set the error column from j.Error?

In /backends/postgres/postgres_backend.go, moveToDeadQueue() would change from:

_, err = tx.Exec(ctx, `INSERT INTO neoq_dead_jobs(id, queue, fingerprint, payload, retries, max_retries, error, deadline)
	VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
	j.ID, j.Queue, j.Fingerprint, j.Payload, j.Retries, j.MaxRetries, jobErr.Error(), j.Deadline)

To:

_, err = tx.Exec(ctx, `INSERT INTO neoq_dead_jobs(id, queue, fingerprint, payload, retries, max_retries, error, deadline)
	VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
	j.ID, j.Queue, j.Fingerprint, j.Payload, j.Retries, j.MaxRetries, j.Error, j.Deadline)
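
An even more defensive variant (just a sketch on my part, assuming j.Error is a plain string on the job struct) would only call Error() when jobErr is non-nil and otherwise fall back to the message already stored on the job:

errMsg := j.Error // assumption: j.Error is a string holding the last error message
if jobErr != nil {
	errMsg = jobErr.Error()
}
_, err = tx.Exec(ctx, `INSERT INTO neoq_dead_jobs(id, queue, fingerprint, payload, retries, max_retries, error, deadline)
	VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
	j.ID, j.Queue, j.Fingerprint, j.Payload, j.Retries, j.MaxRetries, errMsg, j.Deadline)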

Secondly, if I patch the above as described to get past the panic, the next time the failed job runs it is handled by a different handler (queue) than the one it was assigned to. Here is a snippet of my test log showing that the first time it is handled by the correct handler (nqTestHandlerWithErr()), but the second time it is not (nqTestHandler()). I do not know whether my change above is causing this, so I did not open a separate issue:

=== RUN   TestNeoq/TestAddJobs
2023/10/13 11:10:22 nqTestHandler()
2023/10/13 11:10:22 got job id: 1 queue: test_now messsage: [test_now] this is an instant job
2023/10/13 11:10:22 waiting 10 seconds for delayed job to run...
2023/10/13 11:10:27 nqTestHandler()
2023/10/13 11:10:27 got job id: 2 queue: test_delayed messsage: [test_delayed] this is a delayed job
2023/10/13 11:10:32 waiting 3 seconds for failed job to run...
2023/10/13 11:10:32 nqTestHandlerWithErr()
2023/10/13 11:10:32 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors
time=2023-10-13T11:10:32.495-04:00 level=ERROR msg="job failed" job_error="job failed to process: this is a test error"
2023/10/13 11:10:35 waiting 30 seconds for failed job to get retried...
2023/10/13 11:10:48 nqTestHandler() <<<<< should be `nqTestHandlerWithErr()`
2023/10/13 11:10:48 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors

And here are the test handlers:

var (
	ErrTest = errors.New("this is a test error")
)

func nqTestHandler(db *database.DB) handler.Func {
	return func(ctx context.Context) (err error) {
		var j *jobs.Job
		j, err = jobs.FromContext(ctx)
		if err != nil {
			return
		}
		log.Println("nqTestHandler()")
		log.Println("got job id:", j.ID, "queue:", j.Queue, "messsage:", j.Payload["message"])
		return
	}
}

func nqTestHandlerWithErr() handler.Func {
	return func(ctx context.Context) (err error) {
		j, err := jobs.FromContext(ctx)
		if err != nil {
			log.Printf("ERROR in nqTestHandlerWithErr: %v", err)
		}
		log.Println("nqTestHandlerWithErr()")
		log.Println("got job id:", j.ID, "queue:", j.Queue, "messsage:", j.Payload["message"])
		return ErrTest
	}
}

Thank you!

@acaloiaro
Owner

Thanks for the report @BillBuilt. I'll look into this soon.

@elliotcourant added the bug (Something isn't working) and postgres backend (Postgres backend issues) labels on Oct 13, 2023
@acaloiaro
Owner

@BillBuilt this is fixed in v0.41.0. Thanks again.

@BillBuilt
Author

BillBuilt commented Oct 16, 2023 via email

@acaloiaro
Owner

acaloiaro commented Oct 16, 2023 via email

@acaloiaro
Owner

acaloiaro commented Oct 16, 2023

Can you confirm whether you see an error-level log with the following form?

time=2023-10-16T09:54:30.678-06:00 level=ERROR msg="job failed" job_error="job failed to process: panic [/home/adriano/git/neoq/backends/postgres/postgres_backend_test.go:522]: no good"

If you see job_error="" instead of an actual error message, it means your error stringifies to the empty string.
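
For illustration only (this is not from your code), an error type whose Error() method returns an empty string would produce exactly that kind of log line:

package main

import "fmt"

// emptyErr is a hypothetical error whose message is the empty string.
type emptyErr struct{}

func (emptyErr) Error() string { return "" }

func main() {
	var err error = emptyErr{}
	// This is what would end up in the job_error attribute of the log line above.
	fmt.Printf("job_error=%q\n", err.Error()) // prints: job_error=""
}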

@acaloiaro reopened this on Oct 16, 2023
@BillBuilt
Author

BillBuilt commented Oct 16, 2023 via email

@acaloiaro
Owner

acaloiaro commented Oct 17, 2023 via email

@acaloiaro
Owner

acaloiaro commented Oct 20, 2023

Hi @BillBuilt, I'm not sure that I can reproduce this one without any code. Can you supply example code that reproduces what you're reporting?

I've tried to reproduce your result, but jobs are ending up in the dead queue as expected for me.

Could you also call neoq.New with neoq.WithLogLevel(logging.LogLevelDebug) and share the debug logs as well?
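
Roughly along these lines (a minimal sketch; everything other than WithLogLevel is a placeholder for whatever options you already pass, so adjust to your existing setup):

package main

import (
	"context"
	"log"

	"github.com/acaloiaro/neoq"
	"github.com/acaloiaro/neoq/logging"
)

func main() {
	ctx := context.Background()
	// Only WithLogLevel matters here; keep your existing backend and
	// connection options in place of the comment below.
	nq, err := neoq.New(ctx,
		neoq.WithLogLevel(logging.LogLevelDebug),
		// ... your existing backend/connection options ...
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nq.Shutdown(ctx)
}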

With reproduction code, I should be able to get any problems fixed up quickly now that I'm in one place.

@acaloiaro
Owner

Update: I've reproduced it and will get started on a solution soon.

@acaloiaro
Owner

@BillBuilt A fix is ready for this, but first I want to get a review from @elliotcourant before merging.

@BillBuilt
Author

> @BillBuilt A fix is ready for this, but first I want to get a review from @elliotcourant before merging.

Thank you - and sorry about not getting you some example code - but I see you were able to reproduce. Let me know if you need anything else from me.

@acaloiaro
Owner

> @BillBuilt A fix is ready for this, but first I want to get a review from @elliotcourant before merging.
>
> Thank you - and sorry about not getting you some example code - but I see you were able to reproduce. Let me know if you need anything else from me.

No worries -- if Elliot has the time to review by tomorrow, I'll wait it out. If not, I'll go ahead and merge tomorrow.

@BillBuilt
Author

No rush - thank you!

acaloiaro added a commit that referenced this issue Oct 21, 2023
Fixes a bug that can allow retries to end up on
the wrong queue in settings where there are
multiple handlers.
@acaloiaro
Owner

@BillBuilt Release v0.43.0 has your fix. Cheers.

@BillBuilt
Author

@acaloiaro That fixed our issues! Thank you and sorry for the late response.

@acaloiaro
Owner

Good to hear, Bill. Cheers!
