Panic when moving to neoq_dead_jobs #98

Closed
BillBuilt opened this issue Oct 13, 2023 · 16 comments · Fixed by #100
Labels: bug (Something isn't working), postgres backend (Postgres backend issues)

Comments

@BillBuilt

Thank you for the patch for #85; however, it gets me to the next set of problems.

First, when a failed job exceeds its max retries, it should be moved to the neoq_dead_jobs table; however, a panic is generated there:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a53181]

goroutine 118 [running]:
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).moveToDeadQueue(0xc000774150?, {0x20bd150, 0xc000774150}, {0x20c4300, 0xc0000126a8}, 0xc0001f5680, {0x0, 0x0})
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:442 +0xe1
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).updateJob(0xc0003f4000, {0x20bd150, 0xc000774150}, {0x0, 0x0})
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:484 +0x2c5
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).handleJob(0xc0003f4000, {0x20bd188, 0xc00039c370}, {0x28a0a38, 0x1}, {0xc000412150, 0x8, 0x6fc23ac00, 0x0, {0x1dc5b27, ...}})
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:723 +0x4ce
github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).start.func2()
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:539 +0x1db
created by github.com/acaloiaro/neoq/backends/postgres.(*PgBackend).start in goroutine 5
	/Users/billmatlock/go/pkg/mod/github.com/acaloiaro/[email protected]/backends/postgres/postgres_backend.go:532 +0x353

I believe this happens because jobErr is not populated, which causes the nil pointer dereference. However, the error message is also stored on the job struct, so perhaps the fix is to drop the jobErr parameter and set the error column from j.Error?

In /backends/postgres/postgres_backend.go, moveToDeadQueue() would change from:

_, err = tx.Exec(ctx, `INSERT INTO neoq_dead_jobs(id, queue, fingerprint, payload, retries, max_retries, error, deadline)
	VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
	j.ID, j.Queue, j.Fingerprint, j.Payload, j.Retries, j.MaxRetries, jobErr.Error(), j.Deadline)

To:

_, err = tx.Exec(ctx, `INSERT INTO neoq_dead_jobs(id, queue, fingerprint, payload, retries, max_retries, error, deadline)
	VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
	j.ID, j.Queue, j.Fingerprint, j.Payload, j.Retries, j.MaxRetries, j.Error, j.Deadline)
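
An even more defensive variant (just a sketch on my part, assuming j.Error is a plain string on the job struct) would only call Error() when jobErr is non-nil and otherwise fall back to the message already stored on the job:

errMsg := j.Error // assumption: j.Error is a string holding the last error message
if jobErr != nil {
	errMsg = jobErr.Error()
}
_, err = tx.Exec(ctx, `INSERT INTO neoq_dead_jobs(id, queue, fingerprint, payload, retries, max_retries, error, deadline)
	VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
	j.ID, j.Queue, j.Fingerprint, j.Payload, j.Retries, j.MaxRetries, errMsg, j.Deadline)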

Secondly, if I patch the above as described to get past the panic, the next time the failed job runs it is handled by a different handler (queue) than the one it was assigned to. Here is a snippet of my test log showing that the first time it is handled by the correct handler (nqTestHandlerWithErr()), but the second time it is not (nqTestHandler()). I do not know whether my change above is causing this, so I did not open a separate issue:

=== RUN   TestNeoq/TestAddJobs
2023/10/13 11:10:22 nqTestHandler()
2023/10/13 11:10:22 got job id: 1 queue: test_now messsage: [test_now] this is an instant job
2023/10/13 11:10:22 waiting 10 seconds for delayed job to run...
2023/10/13 11:10:27 nqTestHandler()
2023/10/13 11:10:27 got job id: 2 queue: test_delayed messsage: [test_delayed] this is a delayed job
2023/10/13 11:10:32 waiting 3 seconds for failed job to run...
2023/10/13 11:10:32 nqTestHandlerWithErr()
2023/10/13 11:10:32 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors
time=2023-10-13T11:10:32.495-04:00 level=ERROR msg="job failed" job_error="job failed to process: this is a test error"
2023/10/13 11:10:35 waiting 30 seconds for failed job to get retried...
2023/10/13 11:10:48 nqTestHandler() <<<<< should be `nqTestHandlerWithErr()`
2023/10/13 11:10:48 got job id: 3 queue: test_error messsage: [test_error] this is a job with errors

And here are the test handlers:

var (
	ErrTest = errors.New("this is a test error")
)

func nqTestHandler(db *database.DB) handler.Func {
	return func(ctx context.Context) (err error) {
		var j *jobs.Job
		j, err = jobs.FromContext(ctx)
		if err != nil {
			return
		}
		log.Println("nqTestHandler()")
		log.Println("got job id:", j.ID, "queue:", j.Queue, "messsage:", j.Payload["message"])
		return
	}
}

func nqTestHandlerWithErr() handler.Func {
	return func(ctx context.Context) (err error) {
		j, err := jobs.FromContext(ctx)
		if err != nil {
			log.Printf("ERROR in nqTestHandlerWithErr: %v", err)
		}
		log.Println("nqTestHandlerWithErr()")
		log.Println("got job id:", j.ID, "queue:", j.Queue, "messsage:", j.Payload["message"])
		return ErrTest
	}
}

Thank you!

@acaloiaro
Owner

Thanks for the report @BillBuilt. I'll look into this soon.

@elliotcourant added the bug (Something isn't working) and postgres backend (Postgres backend issues) labels on Oct 13, 2023
@acaloiaro
Owner

@BillBuilt this is fixed in v0.41.0. Thanks again.

@BillBuilt
Author

BillBuilt commented Oct 16, 2023 via email

@acaloiaro
Owner

acaloiaro commented Oct 16, 2023 via email

@acaloiaro
Owner

acaloiaro commented Oct 16, 2023

Can you confirm whether you see an error-level log with the following form?

time=2023-10-16T09:54:30.678-06:00 level=ERROR msg="job failed" job_error="job failed to process: panic [/home/adriano/git/neoq/backends/postgres/postgres_backend_test.go:522]: no good"

If you see job_error="" instead of an actual error message, it means your error stringifies to the empty string.
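
For illustration only (this is not from your code), an error type whose Error() method returns an empty string would produce exactly that kind of log line:

package main

import "fmt"

// emptyErr is a hypothetical error whose message is the empty string.
type emptyErr struct{}

func (emptyErr) Error() string { return "" }

func main() {
	var err error = emptyErr{}
	// This is what would end up in the job_error attribute of the log line above.
	fmt.Printf("job_error=%q\n", err.Error()) // prints: job_error=""
}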

@acaloiaro reopened this on Oct 16, 2023
@BillBuilt
Author

BillBuilt commented Oct 16, 2023 via email

@acaloiaro
Owner

acaloiaro commented Oct 17, 2023 via email

@acaloiaro
Owner

acaloiaro commented Oct 20, 2023

Hi @BillBuilt, I'm not sure that I can reproduce this one without any code. Can you supply example code that reproduces what you're reporting?

I've tried to reproduce your result, but jobs are ending up in the dead queue as expected for me.

Could you also call neoq.New with neoq.WithLogLevel(logging.LogLevelDebug) and share the debug logs as well?
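
Roughly along these lines (a minimal sketch; everything other than WithLogLevel is a placeholder for whatever options you already pass, so adjust to your existing setup):

package main

import (
	"context"
	"log"

	"github.com/acaloiaro/neoq"
	"github.com/acaloiaro/neoq/logging"
)

func main() {
	ctx := context.Background()
	// Only WithLogLevel matters here; keep your existing backend and
	// connection options in place of the comment below.
	nq, err := neoq.New(ctx,
		neoq.WithLogLevel(logging.LogLevelDebug),
		// ... your existing backend/connection options ...
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nq.Shutdown(ctx)
}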

With reproduction code, I should be able to get any problems fixed up quickly now that I'm in one place.

@acaloiaro
Owner

Update: I've reproduced it and will get started on a solution soon.

@acaloiaro
Owner

@BillBuilt A fix is ready for this, but first I want to get a review from @elliotcourant before merging.

@BillBuilt
Author

> @BillBuilt A fix is ready for this, but first I want to get a review from @elliotcourant before merging.

Thank you - and sorry about not getting you some example code - but I see you were able to reproduce. Let me know if you need anything else from me.

@acaloiaro
Owner

> @BillBuilt A fix is ready for this, but first I want to get a review from @elliotcourant before merging.
>
> Thank you - and sorry about not getting you some example code - but I see you were able to reproduce. Let me know if you need anything else from me.

No worries -- if Elliot has the time to review by tomorrow, I'll wait it out. If not, I'll go ahead and merge tomorrow.

@BillBuilt
Author

No rush - thank you!

acaloiaro added a commit that referenced this issue Oct 21, 2023
Fixes a bug that can allow retries to end up on
the wrong queue in settings where there are
multiple handlers.
@acaloiaro
Owner

@BillBuilt Release v0.43.0 has your fix. Cheers.

@BillBuilt
Author

@acaloiaro That fixed our issues! Thank you and sorry for the late response.

@acaloiaro
Owner

Good to hear, Bill. Cheers!
