LG-8440: Ingest in-person enrollment status updates from SQS#8403
LG-8440: Ingest in-person enrollment status updates from SQS#8403
Conversation
There was a problem hiding this comment.
Is "uses" an abbreviation, or is it like the verb like "this uses that"?
There was a problem hiding this comment.
Uses is a verb here.
There was a problem hiding this comment.
ok! I feel like that's not a common naming convention we have in this repo? Is there a particular reason to start that now?
There was a problem hiding this comment.
The use of this convention is pretty isolated, so I don't think it's a big concern. If you want to have the job refactored to use delegation instead of mixins though then I'll rename accordingly.
There was a problem hiding this comment.
If I'm reading this right, it looks like we are only including this mixin in one place. Do we have plans for future SQS topics? (my understanding is no, we do not) so I think that it would make sense to just inline this for now
There was a problem hiding this comment.
This mixin is used by the job (included by both the job directly and the batch processor), but I'm intentionally separating this out to reduce the complexity of working on this section of the code. I did consider renaming this to SqsBatchWrapper.
There was a problem hiding this comment.
What about using delegation instead of inheritance via mixin? To have an SQS client/consumer class that we can instantiate instead of mixing in? delegation is much easier to follow than included methods from mixins when tracing code
There was a problem hiding this comment.
I was considering that exact refactor before posting this PR. If you think it's worth making the change, then I'll do it.
There was a problem hiding this comment.
yes definitely, let's go for it
There was a problem hiding this comment.
@zachmargolis I refactored to use DI instead of mixins. There's some lack of clarity about where to put the factories, so for now they live in the job and have their own tests.
There was a problem hiding this comment.
rescue inside of an ensure is very surprising to me. Could we move the body of this ensure into its own method? that way it can have its own rescue and stuff
There was a problem hiding this comment.
The rescue here is very intentional, but I think it's reasonable to move this into another method.
There was a problem hiding this comment.
Moved into the new private method process_deletions.
There was a problem hiding this comment.
I believe unhandled errors in background jobs already go to NewRelic, do we need to notify directly like this?
There was a problem hiding this comment.
We do need additional direct error logging here in order to better capture the context of the error; this is most likely to occur after we have captured the enrollment code. However that part could be isolated to CloudWatch if you think that's particularly important.
There was a problem hiding this comment.
here we filter on status, later we update ready_for_status_check, are they the same thing?
There was a problem hiding this comment.
No, they currently track different things. status tracks the overall progress of the enrollment creation, checking, and success/failure. ready_for_status_check only covers whether we received a message saying that the enrollment is ready to have its status checked (which is not tracked by status).
The check with USPS may result in a status update if USPS indicates that the enrollment is not present, has expired, or that the user has visited the post office already.
There was a problem hiding this comment.
More context here - we are currently checking enrollments that have a pending status. When this and related stories are complete, we will be able to deprioritize (but still eventually check) enrollments that have ready_for_status_check remain as false.
There was a problem hiding this comment.
I'm not getting why the append is happening outside of the above if statement? Based on the comment it feels like the sqs_message should only be appended if process_message returns true?
There was a problem hiding this comment.
If process_message returns false, then the enrollment cannot be processed (e.g. invalid JSON, wrong fields, etc.). Keeping it in the queue would unnecessarily slow down the ingestion of other email notifications.
There was a problem hiding this comment.
Related to Jack’s question-how does a message meet the condition described in the comment on line 38?
There was a problem hiding this comment.
@svalexander The message will meet the condition if an uncaught error occurs while processing the message. One example would be if the database becomes unavailable before we can update the enrollment status.
There was a problem hiding this comment.
Is this inclusion necessary, as BatchProcessor already includes this module?
There was a problem hiding this comment.
Strictly speaking no, but I consider it important since this class directly uses the poll method.
There was a problem hiding this comment.
Thank you for including these comments, very useful!
There was a problem hiding this comment.
Are these items that encountered errors when we attempted to process them?
There was a problem hiding this comment.
We're deleting items in the following cases:
- Processing succeeds
- The item is malformed (e.g. invalid JSON or has wrong fields)
- The item does not correspond to a pending enrollment record
If we don't delete the last two, then workers will continue attempting (and failing) to process them.
There was a problem hiding this comment.
Personally I like factory object. Matt showed me once we have a a lot of analytics events piling together.
config/application.yml.default
Outdated
There was a problem hiding this comment.
Just wondering how we decided on the time limit for this and the next line?
There was a problem hiding this comment.
It's normal with SQS to set wait_time_seconds high in order to reduce the number of calls and increase the likelihood that we'll receive items from the SQS on each call. Behind the scenes, SQS is also a distributed queue, which means order is only an approximate and not a strict guarantee (unless we use FIFO) & AWS could potentially be fetching the items from multiple machines.
I'm leaving visibility_timeout_seconds at the default, but it's a parameter that may be worth adjusting if the processing time is extended at some point. This controls when items will become visible to other workers (i.e. essentially for the case of an unreported processing failure).
23cb016 to
35d6b26
Compare
There was a problem hiding this comment.
tap returns the original value, not the block, I think we want a different approach here
1.tap { |x| x + 1 }
=> 1There was a problem hiding this comment.
Good catch, updated to use then instead. I also updated the test so that it will catch this issue.
There was a problem hiding this comment.
Aren't the deleted_items and processed_items values in the splatted analytics_hash? Why explicitly list them here? To be explicit?
There was a problem hiding this comment.
The default analytics_stats hash has all of the items as zero. The overrides here are primarily intended to represent changes in the items. I have set some of the zero redundantly because it matters for understanding the individual test's behavior.
There was a problem hiding this comment.
Do we want to grab this from the config so if it changes there the test will be aware?
There was a problem hiding this comment.
This is a unit test; testing the specific configuration is out-of-scope. This is meant to test that the module functions correctly as a unit.
JackRyan1989
left a comment
There was a problem hiding this comment.
LGMT!
Minor non-blocking question: We set a limit of ten messages to be pulled from SQS. The batch processor will handle any length of messages returned from SQS, but I'm wondering if it's worth thinking about setting a limit?
@JackRyan1989 I think that the control on this type of behavior should be on the job runtime, if possible, rather than the number of items. This may be a good place to implement or check for existing monitoring on this type of problem. |
There was a problem hiding this comment.
My vote is to remove this direct NewRelic notification, we tend to react to errors in NR as if they are unhandled 500s so seeing errors like this that were caught might make something seem more urgent than it is
There was a problem hiding this comment.
This kind of error signals that USPS is sending us garbage data or that AWS has become severely misconfigured. At the current scale it may not be consequential, but it could be extremely consequential at the scales we were preparing for earlier this year.
There was a problem hiding this comment.
IOW - if we are getting this at all, then it is very likely to be a problem that needs addressed soon. It doesn't mean we have to stop processing the other queue items, though.
There was a problem hiding this comment.
I'm fine keeping the NR notification, but we should reconsider if it turns out not to be useful
app/jobs/in_person/enrollments_ready_for_status_check/sqs_batch_wrapper.rb
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
This is just style, I really dislike unless, especially with a compound statement, I think its always clearer as an if
| return true unless IdentityConfig.store.in_person_proofing_enabled && | |
| IdentityConfig.store.in_person_enrollments_ready_job_enabled | |
| if !IdentityConfig.store.in_person_proofing_enabled || | |
| !IdentityConfig.store.in_person_enrollments_ready_job_enabled | |
| return true | |
| end |
There was a problem hiding this comment.
I disagree; unless provides a larger visual indication of the conditional behavior than the oft-missed exclamation mark.
There was a problem hiding this comment.
maybe the conditions could be pulled into well-named methods, ie:
def in_person_proofing_not_enabled
!IdentityConfig.store.in_person_proofing_enabled
endthen it could look like this, which imo is a little easier to parse/reason about
if in_person_proofing_not_enabled || in_person_enrollements_ready_job_not_enabled
return true
endThere was a problem hiding this comment.
blank? will return true if a variable is nil or false
return true if IdentityConfig.store.in_person_proofing_enabled.blank? ||
IdentityConfig.store.in_person_enrollments_ready_job_enabled.blank?There was a problem hiding this comment.
@tomas-nava That sounds like a reasonable compromise.
@Sgtpluck I think locality is preferable here.
There was a problem hiding this comment.
I like using dependency injection, but I think that it seems redundant to pass in the class_name explicitly like this, could the BatchProcessor class have a default implementation that passes its own self.class.name in to the error reporter? Since analytics here is basically just a default/empty instance with no user, I feel like we don't need to worry about passing in the same one necessarily
Ditto for below with EnrollmentPipeline
There was a problem hiding this comment.
Default implementations for DI have been a colossal headache for me in the past, so I really prefer to avoid default implementations.
There was a problem hiding this comment.
Since sqs_message is doubling for a plain struct, could we just use the real value type instead?
| before(:each) do | |
| allow(sqs_message).to receive(:message_id). | |
| and_return(sqs_message_id) | |
| end | |
| let(:sqs_message) { Aws::SQS::Types::Message.new(message_id: sqs_message_id) } |
There was a problem hiding this comment.
This builds into the test some assumptions about how the struct is created, so I would prefer to avoid this change.
There was a problem hiding this comment.
Is the worry that the class constructor changes and breaks the test, or something else?
There was a problem hiding this comment.
@mitchellhenke Yes, roughly speaking. i.e. constructor or factory doing some kind of interpretation on the data passed in rather than setting it to exactly match the original parameters.
There was a problem hiding this comment.
There is an additional concern about coupling and misrepresenting parts of the payload that we don't actually use.
There was a problem hiding this comment.
ditto, I think we could simplify the tests by using the actual value type here for sqs_message
There was a problem hiding this comment.
Same as above:
This builds into the test some assumptions about how the struct is created, so I would prefer to avoid this change.
spec/jobs/in_person/enrollments_ready_for_status_check/enrollment_pipeline_spec.rb
Outdated
Show resolved
Hide resolved
…status updates from SQS
75255e1 to
853013c
Compare
…ent_pipeline_spec.rb Co-authored-by: Zach Margolis <zachmargolis@users.noreply.github.com>
…h_wrapper.rb Co-authored-by: Zach Margolis <zachmargolis@users.noreply.github.com>
🎫 Ticket
LG-8440
🛠 Summary of changes
Create
InPerson::EnrollmentsReadyForStatusCheckJobjob to ingest in-person enrollment status updates from USPS.The following pseudocode approximates the job's behavior, minus error handling.
This adds the
aws-sdk-sqswhich was not previously present; this is the official AWS package. Themailgem is (and previously was) a direct dependency, so it has been added to theGemfile; this does not constitute a substantial change, but reduces the risk of regression.📜 Testing Plan
ready_for_status_checkflag is set on the enrollment recordLogs from local testing
The following logs cover a local test where 7 emails were sent. The batch size was adjusted to 5 in order to ensure the job fetched multiple times.
Expected success and deletion:
Expected failure and deletion:
Another error occurred prior to the test where I was able to confirm that the job will log the completion (as expected) even when an error is thrown.