LG-8058 | Populate identities consented field #7374

Merged
n1zyy merged 12 commits into main from mattw/LG-8058_initial_migration on Nov 30, 2022
Conversation

@n1zyy (Contributor) commented Nov 18, 2022

🎫 Ticket

https://cm-jira.usa.gov/browse/LG-8058

🛠 Summary of changes

Database change

This needs a bit of explanation since setting last_consented_at where it was previously nil sounds bad.

Also, note that the table name is identities but the model is ServiceProviderIdentity.

Historically, a ServiceProviderIdentity was created when a user consented to data sharing as part of authentication with a service provider, generating a UUID that uniquely identifies that user to the SP. At some point in the past, a last_consented_at timestamp was added, but it was only populated for new records going forward. So today, having a ServiceProviderIdentity indicates that a user has completed the process and linked their account with a service provider. Some rows don't have a timestamp logged, but today, a nil timestamp just means the record predates our storing the date it happened.

As part of the work in LG-8058, we are going to begin creating ServiceProviderIdentity rows earlier in the process, which means the mere presence of a row will no longer automatically mean the user has consented to sharing data with the service provider. We are going to rely on the last_consented_at column for this—so we want last_consented_at = nil to mean the user has not yet approved sharing data with the SP. To get there, we first need to backfill last_consented_at.

In other words, this PR isn't actually changing what anyone has consented to. We're just putting in a date where it was previously blank.

The database change

Conceptually, this is simple. We're setting last_consented_at = created_at where last_consented_at is null. We also don't want to populate this on rows that have deleted_at set, since arguably that counts as revoking consent.
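As a sketch, the conceptual (unbatched) form of the backfill described above would be:

```sql
-- Backfill consent timestamps, skipping soft-deleted rows:
UPDATE identities
SET last_consented_at = created_at
WHERE last_consented_at IS NULL
  AND deleted_at IS NULL;
```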

But, this is a busy table and there is no index on last_consented_at—nor is it really reasonable for there to be one. The table is also much larger than I would have expected. So, to avoid an update requiring a table scan across a huge table, we're scheduling a background job that will run over a smallish range of ids in each run.

The batch and slice sizes are estimates, and should complete in roughly a day. I've allowed tuning them through Redis if we find the job is too hard on the database (or, more positively, if it's running quicker than we thought). Likewise, if this ends up having an unexpectedly hard impact on the database, we can set the position counter to an arbitrarily high number in the database to cause it to not run any more updates.

I figure we can fully stop the job by just deleting this; it's only useful as a one-off.
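To make the batch/slice arithmetic concrete, here is a minimal sketch of how one run's batch of ids splits into slices (method and parameter names here are hypothetical, not the job's actual API; the real job persists the position counter in Redis between runs):

```ruby
# Each run covers one batch of ids starting at `position`, split into
# half-open (start, start + slice_size] ranges, matching the job's
# `id > start_id AND id <= start_id + slice_size` query.
def slice_ranges(position, batch_size, slice_size)
  (position...position + batch_size).step(slice_size).map do |start_id|
    (start_id + 1)..(start_id + slice_size)
  end
end

# A batch of 1000 ids from position 0, in slices of 250:
slice_ranges(0, 1000, 250)
# => [1..250, 251..500, 501..750, 751..1000]
```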

Deploy notes

Once this is deployed, it will begin running every 5 minutes. We should probably wait until Monday when everyone's back in the office before merging this, on account of the holiday.

@@ -0,0 +1,5 @@
class AddConsentedToAgencyIdentities < ActiveRecord::Migration[7.0]
def change
add_column :agency_identities, :consented, :boolean
Contributor:

Had a chat with @mitchellhenke and we realized that we probably don't need this. Agency UUIDs don't go away when a user unlinks all the apps, so we don't need to "hide" AgencyIdentity rows like this, which will help us simplify things.

UPDATE identities
SET last_consented_at = created_at
WHERE id > #{start_id}
AND id <= #{start_id + slice_size}
Contributor:

question: did the query planner do smarter things with a <= max vs a LIMIT ?

n1zyy (Author):

For reasons that escape me, I got a syntax error when using LIMIT. I assumed the error was actually somewhere else, tried expressing it like this to isolate the problem, and the error went away.

I think something silly is wrong with my query and I'm just not seeing it. See my other comment down in the test; this query is populating last_consented_at for rows that have deleted_at set despite us specifying not to do that.

For background, I know MySQL much better than Postgres. The former isn't known for being a rigid adherent to ANSI standard SQL, so am I maybe doing something MySQL-specific and not realizing it?

Contributor:

ah! I think that the LIMIT has to be in a subquery: https://dba.stackexchange.com/questions/69471/postgres-update-limit-1
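As a hedged sketch of the subquery form being suggested (table and column names from this PR; the slice size and ORDER BY are illustrative):

```sql
-- Postgres does not accept LIMIT directly on UPDATE; the limit goes in a
-- subquery that selects which ids to touch:
UPDATE identities
SET last_consented_at = created_at
WHERE id IN (
  SELECT id
  FROM identities
  WHERE last_consented_at IS NULL
    AND deleted_at IS NULL
  ORDER BY id
  LIMIT 1000
);
```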

n1zyy (Author):

I left this as an id range. While it wouldn't have been too hard to use a subquery here, it just felt really weird.


@n1zyy n1zyy marked this pull request as ready for review November 23, 2022 22:14
# This is a short-term solution to backfill data requiring a table scan.
# This job can be deleted once it's done.

queue_as :low
n1zyy (Author):

Was this appropriate? My thinking is that this isn't too time-sensitive.

Contributor:

It's probably fine; I think other backfill jobs have been :long_running (which also won't page if they run for too long).


def redis_get(key, default)
(REDIS_POOL.with { |redis| redis.get(key) } || default).to_i
end
n1zyy (Author):

I pulled this out into its own method because I found a bug in how I was doing this initially and figured I'd move it all to one place and fix it.

I was doing redis.get(key).to_i || default and realized that redis.get(DOES_NOT_EXIST) returns nil, so this would always return 0.
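The bug described above is easy to reproduce in plain Ruby (a Hash stands in for Redis here, since a missing hash key also returns nil):

```ruby
# Simulate a Redis store with no value set for the key.
store = {}

# Buggy order: nil.to_i is 0, and 0 is truthy in Ruby, so || never
# falls through to the default.
buggy = store['missing_key'].to_i || 500
# => 0

# Fixed order: apply the default before converting to an integer.
fixed = (store['missing_key'] || 500).to_i
# => 500
```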


trait :consented do
last_consented_at { Time.zone.now - 5.minutes }
end
n1zyy (Author):

I actually think we probably want to make :active set last_consented_at, but maybe after this story.

deleted.reload

expect(deleted.deleted_at).not_to be_nil
expect(deleted.last_consented_at.to_i).to eq(consent_time)
n1zyy (Author):

Discussed in Slack, but the to_i is necessary here because otherwise we run into weird precision issues: the DB read and the in-memory representation have different resolutions, and the test failed over whether or not nanoseconds were shown.

The factory sets deleted_at to "5 minutes ago" so there will be a distinct difference if the record inadvertently gets updated.
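The precision mismatch can be shown with plain Ruby Time (no Rails or database needed); truncating to whole seconds stands in for a lower-resolution round trip:

```ruby
# A timestamp with sub-second (microsecond) precision.
in_memory = Time.at(1_700_000_000, 123_456)

# What a store that keeps only whole seconds would hand back.
from_db = Time.at(in_memory.to_i)

in_memory == from_db            # false: 123456 usec vs 0
in_memory.to_i == from_db.to_i  # true: equal at second resolution
```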

end
end

describe '#batch_size' do
n1zyy (Author):

Arguably this and .position (etc.) shouldn't even be public methods, but because this is a short-lived script and I had a bug in an early draft of this, I just left them public and added a test.

@n1zyy n1zyy changed the title [WIP] LG-8058 | Populate identities consented field LG-8058 | Populate identities consented field Nov 23, 2022
Comment on lines +39 to +40
# If we made it here without error, increment the counter for next time:
REDIS_POOL.with { |redis| redis.set(CACHE_KEY, position + batch_size) }
Contributor:

OK, so there's a potential bug here with the last batch: suppose we have 555 records and the last batch starts at 550 and runs for 10; it would "end" at 560, which accounts for 5 records that don't exist.

In our case, it's the old (lower id) rows that are missing last_consented_at so by the time we get to this case, we're probably fine.

I think the correct thing to do would be to add a RETURNING id or something, and grab the max of that.

However, may not be worth the effort
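A hedged sketch of the RETURNING idea (id bounds taken from the example above; the extra WHERE conditions mirror the backfill's intent): the update reports which ids it actually touched, so the caller could advance the cursor to the maximum returned id instead of assuming start_id + batch_size rows existed.

```sql
-- Returns one row per id actually updated; the caller takes max(id)
-- as the next cursor position.
UPDATE identities
SET last_consented_at = created_at
WHERE id > 550 AND id <= 560
  AND last_consented_at IS NULL
  AND deleted_at IS NULL
RETURNING id;
```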

@zachmargolis (Contributor) left a comment:

LGTM


last_id = result.to_a&.first ? result.to_a.first['id'] : start_id + slice_size

logger.info "Processed rows #{start_id} through #{last_id}"
@zachmargolis (Contributor) commented Nov 29, 2022:

Suggested change
logger.info "Processed rows #{start_id} through #{last_id}"
logger.info "Processed rows #{start_id} through #{last_id} (count=#{result.ntuples})"

I think we should log how many rows updated as well

n1zyy (Author):

👍

I called this "updated" instead of "count" to make it more clear, because this output was confusing otherwise:

Processed rows 0 through 716 (count=1)

Comment on lines +48 to +51
elapsed_time = Time.zone.now - start_time
logger.info(
"Finished a full batch (#{batch_size} rows to id=#{last_id}) in #{elapsed_time} seconds",
)
Contributor:

can we combine this log message with the one above? that way we only need one log line per job run to check.

n1zyy (Author):

We discussed this in chat, but for the logs here: the previous log entry is inside the per-slice loop, while this is at the end of the whole batch. We concluded we'd keep both.

@n1zyy n1zyy merged commit f5450aa into main Nov 30, 2022
@n1zyy n1zyy deleted the mattw/LG-8058_initial_migration branch November 30, 2022 15:37