LG-8058 | Populate identities consented field #7374
Conversation
[skip changelog] - for now
@@ -0,0 +1,5 @@
class AddConsentedToAgencyIdentities < ActiveRecord::Migration[7.0]
  def change
    add_column :agency_identities, :consented, :boolean
had a chat with @mitchellhenke and we realized that we probably don't need this. Agency UUIDs don't go away when a user unlinks all the apps, so we don't need to "hide" AgencyIdentity rows like this, which will help us simplify
app/jobs/identities_backfill_job.rb
Outdated
UPDATE identities
SET last_consented_at = created_at
WHERE id > #{start_id}
  AND id <= #{start_id + slice_size}
question: did the query planner do smarter things with an `id <= max` range vs a `LIMIT`?
For reasons that escape me, I got a syntax error when I used `LIMIT`. I assumed the error was actually somewhere else, rewrote the query like this to narrow down the real problem, and the error went away.
I think something silly is wrong with my query and I'm just not seeing it. See my other comment down in the test; this query is populating `last_consented_at` for rows that have `deleted_at` set, despite us specifying not to do that.
For background, I know MySQL much better than Postgres. The former isn't known for being a rigid adherent to ANSI-standard SQL, so am I maybe doing something MySQL-specific without realizing it?
ah! I think that the LIMIT has to be in a subquery: https://dba.stackexchange.com/questions/69471/postgres-update-limit-1
I left this as an id range. While it wouldn't have been too hard to use a subquery here, it just felt really weird.
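For reference, a subquery-based version along the lines of the linked answer might look like this. This is a sketch only (the `WHERE` conditions are inferred from the PR description, and `start_id`/`slice_size` stand in for the job's variables), not the code that shipped:

```ruby
start_id = 0
slice_size = 1000

# Postgres rejects LIMIT directly on an UPDATE, so the row cap has to
# live in a subquery over the primary key instead.
sql = <<~SQL
  UPDATE identities
  SET last_consented_at = created_at
  WHERE id IN (
    SELECT id FROM identities
    WHERE id > #{start_id}
      AND last_consented_at IS NULL
      AND deleted_at IS NULL
    ORDER BY id
    LIMIT #{slice_size}
  )
SQL
```

Keeping the shipped version as a plain id range avoids the extra subquery while still bounding the work each slice does.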
Works around a weird behavior in tests
app/jobs/identities_backfill_job.rb
Outdated
# This is a short-term solution to backfill data requiring a table scan.
# This job can be deleted once it's done.

queue_as :low
Was this appropriate? My thinking is that this isn't too time-sensitive.
it's probably fine, I think other backfill jobs have been `:long_running` (which also won't page if they run for too long)
def redis_get(key, default)
  (REDIS_POOL.with { |redis| redis.get(key) } || default).to_i
end
I pulled this out into its own method because I found a bug in how I was doing this initially and figured I'd move it all to one place and fix it.
I was doing `redis.get(key).to_i || default` and realized that `redis.get(DOES_NOT_EXIST)` returns nil, so for a missing key this would always return 0 instead of the default.
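A minimal illustration of the ordering bug, with plain Ruby values standing in for the Redis calls:

```ruby
# Stand-in for redis.get on a key that doesn't exist:
missing_value = nil
default = 500

# Buggy order: nil.to_i is 0, and 0 is truthy in Ruby,
# so `|| default` never kicks in.
buggy = missing_value.to_i || default

# Fixed order: apply the default first, then coerce.
fixed = (missing_value || default).to_i

buggy # => 0
fixed # => 500
```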
trait :consented do
  last_consented_at { Time.zone.now - 5.minutes }
end
I actually think we probably want to make :active set last_consented_at, but maybe after this story.
deleted.reload

expect(deleted.deleted_at).not_to be_nil
expect(deleted.last_consented_at.to_i).to eq(consent_time)
Discussed in Slack, but the `to_i` is necessary here because otherwise we run into precision issues: the DB read and the in-memory representation have different sub-second resolutions, which caused the test to fail over whether or not nanoseconds are shown.
The factory sets `deleted_at` to "5 minutes ago", so there will be a distinct difference if the record inadvertently gets updated.
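The precision issue is easy to reproduce in plain Ruby: a `Time` kept in memory can carry nanoseconds that a database column with coarser resolution truncates on the round trip.

```ruby
# In-memory timestamp with nanosecond precision:
in_memory = Time.at(1_700_000_000, 123_456_789, :nsec)

# The "same" instant after a round trip through a microsecond-resolution
# column -- the trailing nanoseconds are gone:
from_db = Time.at(1_700_000_000, 123_456, :usec)

in_memory == from_db           # false: sub-second parts differ
in_memory.to_i == from_db.to_i # true: whole seconds match
```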
  end
end

describe '#batch_size' do
Arguably this and `.position` (etc.) shouldn't even be public methods, but because this is a short-lived script and I had a bug in an early draft, I just left them public and added a test.
app/jobs/identities_backfill_job.rb
Outdated
# If we made it here without error, increment the counter for next time:
REDIS_POOL.with { |redis| redis.set(CACHE_KEY, position + batch_size) }
ok, so there's a potential bug here with the last batch: suppose we have 555 records and the last batch starts at 550 and runs for 10; the counter would "end" at 560, which accounts for 5 records that don't exist.
In our case, it's the old (lower-id) rows that are missing `last_consented_at`, so by the time we get to this case we're probably fine.
I think the correct thing to do would be to add a `RETURNING id` or something, and grab the max of that.
However, it may not be worth the effort.
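A sketch of what the `RETURNING id` fix could look like. `next_position` is a hypothetical helper (not in the PR), and the numbers reuse the 555-record example above:

```ruby
# Compute the next cursor from the ids the UPDATE actually touched
# (collected via RETURNING id), falling back to the arithmetic advance
# when the slice updated nothing.
def next_position(returned_ids, position, batch_size)
  returned_ids.max || position + batch_size
end

# Last real batch: advance only to the highest id that exists.
next_position([551, 552, 553, 554, 555], 550, 10) # => 555

# Past the end of the table: nothing returned, so fall back.
next_position([], 550, 10) # => 560
```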
app/jobs/identities_backfill_job.rb
Outdated
last_id = result.to_a&.first ? result.to_a.first['id'] : start_id + slice_size

logger.info "Processed rows #{start_id} through #{last_id}"
- logger.info "Processed rows #{start_id} through #{last_id}"
+ logger.info "Processed rows #{start_id} through #{last_id} (count=#{result.ntuples})"
I think we should log how many rows updated as well
👍
I called this "updated" instead of "count" to make it clearer, because this output was confusing otherwise:
`Processed rows 0 through 716 (count=1)`
elapsed_time = Time.zone.now - start_time
logger.info(
  "Finished a full batch (#{batch_size} rows to id=#{last_id}) in #{elapsed_time} seconds",
)
can we combine this log message with the one above? that way we only need one log line per job run to check.
We discussed this in chat, but for the logs here: the previous log entry is inside the per-slice loop, while this is at the end of the whole batch. We concluded we'd keep both.
🎫 Ticket
https://cm-jira.usa.gov/browse/LG-8058
🛠 Summary of changes
Database change
This needs a bit of explanation, since setting `last_consented_at` where it was previously nil sounds bad. Also, note that the table name is `identities` but the model is `ServiceProviderIdentity`.

Historically, a ServiceProviderIdentity was created when a user consented to data sharing as part of authentication with a service provider, generating a UUID uniquely identifying that user to the SP. At some point in the past, a `last_consented_at` timestamp was added, but it was only populated for new records going forward. So today, having a ServiceProviderIdentity indicates that a user has completed the process and linked their account with a service provider; a nil timestamp just means the record predates us storing the date it happened.

As part of the work in LG-8058, we are going to begin creating ServiceProviderIdentity rows earlier in the process, which means the mere presence of a row will no longer automatically mean the user has consented to sharing data with the service provider. We are going to rely on the `last_consented_at` column for this, so we want `last_consented_at = nil` to mean the user has not yet approved sharing data with the SP. To do this, we first need to backfill `last_consented_at`.

In other words, this PR isn't actually changing what anyone has consented to. We're just putting in a date where it was previously blank.
The database change
Conceptually, this is simple. We're setting `last_consented_at = created_at` where `last_consented_at` is null. We also don't want to populate this on rows that have `deleted_at` set, since arguably that counts as revoking consent.

But this is a busy table and there is no index on `last_consented_at`, nor is it really reasonable for there to be one. The table is also much larger than I would have expected. So, to avoid an update requiring a table scan across a huge table, we're scheduling a background job that works over a smallish range of ids in each run.

The batch and slice sizes are estimates, and the backfill should complete in roughly a day. I've allowed tuning them through Redis in case the job is too hard on the database (or, more positively, if it runs quicker than we thought). Likewise, if this ends up having an unexpectedly hard impact on the database, we can set the position counter to an arbitrarily high number to stop it from running any more updates.
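To make the tuning story concrete, here is a toy version of the knobs described above, using a plain Hash in place of Redis. The key names and default values are made up for illustration; they are not the real ones from the job:

```ruby
# Hypothetical default; the real value is an estimate tuned to the table.
DEFAULT_BATCH_SIZE = 10_000

store = {} # stands in for Redis

# Same shape as the job's redis_get: default applies only when unset.
get_with_default = ->(key, default) { (store[key] || default).to_i }

# With nothing set, the job uses its default:
batch_size = get_with_default.call('backfill:batch_size', DEFAULT_BATCH_SIZE)

# An operator can shrink the batch if the DB is struggling...
store['backfill:batch_size'] = 1_000
tuned = get_with_default.call('backfill:batch_size', DEFAULT_BATCH_SIZE)

# ...or halt the job entirely by pushing the cursor past the max id:
store['backfill:position'] = 2**31
```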
I figure we can fully stop the job by just deleting this; it's only useful as a one-off.
Deploy notes
Once this is deployed, it will begin running every 5 minutes. We should probably wait until Monday when everyone's back in the office before merging this, on account of the holiday.