LG-8058 | Populate identities consented field #7374

Merged
n1zyy merged 12 commits into main from mattw/LG-8058_initial_migration on Nov 30, 2022
Conversation

@n1zyy (Contributor) commented Nov 18, 2022

🎫 Ticket

https://cm-jira.usa.gov/browse/LG-8058

🛠 Summary of changes

Database change

This needs a bit of explanation since setting last_consented_at where it was previously nil sounds bad.

Also, note that the table name is identities but the model is ServiceProviderIdentity.

Historically, a ServiceProviderIdentity was created when a user consented to data sharing as part of authentication with a service provider, generating a UUID that uniquely identifies that user to the SP. At some point in the past, a last_consented_at timestamp was added, but it was only populated for new records going forward. So today, having a ServiceProviderIdentity indicates that a user has completed the process and linked their account with a service provider. Some rows don't have a timestamp logged, but today, a nil timestamp just means the record predates our storing the date it happened.

As part of the work in LG-8058, we are going to begin creating ServiceProviderIdentity rows earlier in the process, which means the mere presence of a row will no longer automatically mean the user has consented to sharing data with the service provider. We are going to rely on the last_consented_at column for this—so we want last_consented_at = nil to mean the user has not yet approved sharing data with the SP. To get there, we first need to backfill last_consented_at.

In other words, this PR isn't actually changing what anyone has consented to. We're just putting in a date where it was previously blank.

The database change

Conceptually, this is simple. We're setting last_consented_at = created_at where last_consented_at is null. We also don't want to populate this on rows that have deleted_at set, since arguably that counts as revoking consent.
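As a sketch, the conceptual (unbatched) form of the backfill described above would be:

```sql
-- Backfill consent timestamps, skipping soft-deleted rows:
UPDATE identities
SET last_consented_at = created_at
WHERE last_consented_at IS NULL
  AND deleted_at IS NULL;
```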

But, this is a busy table and there is no index on last_consented_at—nor is it really reasonable for there to be one. The table is also much larger than I would have expected. So, to avoid an update requiring a table scan across a huge table, we're scheduling a background job that will run over a smallish range of ids in each run.

The batch and slice sizes are estimates, and should complete in roughly a day. I've allowed tuning them through Redis if we find the job is too hard on the database (or, more positively, if it's running quicker than we thought). Likewise, if this ends up having an unexpectedly hard impact on the database, we can set the position counter to an arbitrarily high number in the database to cause it to not run any more updates.

I figure we can fully stop the job by just deleting this; it's only useful as a one-off.
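To make the batch/slice arithmetic concrete, here is a minimal sketch of how one run's batch of ids splits into slices (method and parameter names here are hypothetical, not the job's actual API; the real job persists the position counter in Redis between runs):

```ruby
# Each run covers one batch of ids starting at `position`, split into
# half-open (start, start + slice_size] ranges, matching the job's
# `id > start_id AND id <= start_id + slice_size` query.
def slice_ranges(position, batch_size, slice_size)
  (position...position + batch_size).step(slice_size).map do |start_id|
    (start_id + 1)..(start_id + slice_size)
  end
end

# A batch of 1000 ids from position 0, in slices of 250:
slice_ranges(0, 1000, 250)
# => [1..250, 251..500, 501..750, 751..1000]
```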

Deploy notes

Once this is deployed, it will begin running every 5 minutes. We should probably wait until Monday when everyone's back in the office before merging this, on account of the holiday.

@@ -0,0 +1,5 @@
class AddConsentedToAgencyIdentities < ActiveRecord::Migration[7.0]
def change
add_column :agency_identities, :consented, :boolean
Contributor:

Had a chat with @mitchellhenke and we realized that we probably don't need this. Agency UUIDs don't go away when a user unlinks all the apps, so we don't need to "hide" AgencyIdentity rows like this, which will help us simplify things.

UPDATE identities
SET last_consented_at = created_at
WHERE id > #{start_id}
AND id <= #{start_id + slice_size}
Contributor:

question: did the query planner do smarter things with a <= max vs a LIMIT ?

n1zyy (Author):

For reasons that escape me, I got a syntax error when using LIMIT. I assumed the error was actually somewhere else, tried expressing it like this to isolate the problem, and the error went away.

I think something silly is wrong with my query and I'm just not seeing it. See my other comment down in the test; this query is populating last_consented_at for rows that have deleted_at set despite us specifying not to do that.

For background, I know MySQL much better than Postgres. The former isn't known for being a rigid adherent to ANSI standard SQL, so am I maybe doing something MySQL-specific and not realizing it?

Contributor:

ah! I think that the LIMIT has to be in a subquery: https://dba.stackexchange.com/questions/69471/postgres-update-limit-1
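As a hedged sketch of the subquery form being suggested (table and column names from this PR; the slice size and ORDER BY are illustrative):

```sql
-- Postgres does not accept LIMIT directly on UPDATE; the limit goes in a
-- subquery that selects which ids to touch:
UPDATE identities
SET last_consented_at = created_at
WHERE id IN (
  SELECT id
  FROM identities
  WHERE last_consented_at IS NULL
    AND deleted_at IS NULL
  ORDER BY id
  LIMIT 1000
);
```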

n1zyy (Author):

I left this as an id range. While it wouldn't have been too hard to use a subquery here, it just felt really weird.


@n1zyy n1zyy marked this pull request as ready for review November 23, 2022 22:14
# This is a short-term solution to backfill data requiring a table scan.
# This job can be deleted once it's done.

queue_as :low
n1zyy (Author):

Was this appropriate? My thinking is that this isn't too time-sensitive.

Contributor:

It's probably fine; I think other backfill jobs have been :long_running (which also won't page if they run for too long).


def redis_get(key, default)
(REDIS_POOL.with { |redis| redis.get(key) } || default).to_i
end
n1zyy (Author):

I pulled this out into its own method because I found a bug in how I was doing this initially and figured I'd move it all to one place and fix it.

I was doing redis.get(key).to_i || default and realized that redis.get(DOES_NOT_EXIST) returns nil, so this would always return 0.
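The bug described above is easy to reproduce in plain Ruby (a Hash stands in for Redis here, since a missing hash key also returns nil):

```ruby
# Simulate a Redis store with no value set for the key.
store = {}

# Buggy order: nil.to_i is 0, and 0 is truthy in Ruby, so || never
# falls through to the default.
buggy = store['missing_key'].to_i || 500
# => 0

# Fixed order: apply the default before converting to an integer.
fixed = (store['missing_key'] || 500).to_i
# => 500
```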


trait :consented do
last_consented_at { Time.zone.now - 5.minutes }
end
n1zyy (Author):

I actually think we probably want to make :active set last_consented_at, but maybe after this story.

deleted.reload

expect(deleted.deleted_at).not_to be_nil
expect(deleted.last_consented_at.to_i).to eq(consent_time)
n1zyy (Author):

Discussed in Slack, but the to_i is necessary here because otherwise we run into weird precision issues: the DB read and the in-memory representation have different resolutions, and the test failed over whether or not nanoseconds were shown.

The factory sets deleted_at to "5 minutes ago" so there will be a distinct difference if the record inadvertently gets updated.
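The precision mismatch can be shown with plain Ruby Time (no Rails or database needed); truncating to whole seconds stands in for a lower-resolution round trip:

```ruby
# A timestamp with sub-second (microsecond) precision.
in_memory = Time.at(1_700_000_000, 123_456)

# What a store that keeps only whole seconds would hand back.
from_db = Time.at(in_memory.to_i)

in_memory == from_db            # false: 123456 usec vs 0
in_memory.to_i == from_db.to_i  # true: equal at second resolution
```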

end
end

describe '#batch_size' do
n1zyy (Author):

Arguably this and .position (etc.) shouldn't even be public methods, but because this is a short-lived script and I had a bug in an early draft of this, I just left them public and added a test.

@n1zyy n1zyy changed the title [WIP] LG-8058 | Populate identities consented field LG-8058 | Populate identities consented field Nov 23, 2022
Comment on lines +39 to +40
# If we made it here without error, increment the counter for next time:
REDIS_POOL.with { |redis| redis.set(CACHE_KEY, position + batch_size) }
Contributor:

OK, so there's a potential bug here with the last batch: suppose we have 555 records and the last batch starts at 550 and runs for 10; it would "end" at 560, which accounts for 5 records that don't exist.

In our case, it's the old (lower id) rows that are missing last_consented_at so by the time we get to this case, we're probably fine.

I think the correct thing to do would be to add a RETURNING id or something, and grab the max of that.

However, may not be worth the effort
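A hedged sketch of the RETURNING idea (id bounds taken from the example above; the extra WHERE conditions mirror the backfill's intent): the update reports which ids it actually touched, so the caller could advance the cursor to the maximum returned id instead of assuming start_id + batch_size rows existed.

```sql
-- Returns one row per id actually updated; the caller takes max(id)
-- as the next cursor position.
UPDATE identities
SET last_consented_at = created_at
WHERE id > 550 AND id <= 560
  AND last_consented_at IS NULL
  AND deleted_at IS NULL
RETURNING id;
```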

@zachmargolis (Contributor) left a comment:

LGTM


last_id = result.to_a&.first ? result.to_a.first['id'] : start_id + slice_size

logger.info "Processed rows #{start_id} through #{last_id}"
@zachmargolis (Contributor) commented Nov 29, 2022:

Suggested change
logger.info "Processed rows #{start_id} through #{last_id}"
logger.info "Processed rows #{start_id} through #{last_id} (count=#{result.ntuples})"

I think we should log how many rows updated as well

n1zyy (Author):

👍

I called this "updated" instead of "count" to make it more clear, because this output was confusing otherwise:

Processed rows 0 through 716 (count=1)

Comment on lines +48 to +51
elapsed_time = Time.zone.now - start_time
logger.info(
"Finished a full batch (#{batch_size} rows to id=#{last_id}) in #{elapsed_time} seconds",
)
Contributor:

can we combine this log message with the one above? that way we only need one log line per job run to check.

n1zyy (Author):

We discussed this in chat, but for the logs here: the previous log entry is inside the per-slice loop, while this is at the end of the whole batch. We concluded we'd keep both.

@n1zyy n1zyy merged commit f5450aa into main Nov 30, 2022
@n1zyy n1zyy deleted the mattw/LG-8058_initial_migration branch November 30, 2022 15:37