Skip to content

Throw errors in tests if PII is logged (LG-5163)#5447

Merged
zachmargolis merged 17 commits intomainfrom
margolis-check-for-pii-in-track-event
Sep 29, 2021
Merged

Throw errors in tests if PII is logged (LG-5163)#5447
zachmargolis merged 17 commits intomainfrom
margolis-check-for-pii-in-track-event

Conversation

@zachmargolis
Copy link
Contributor

@zachmargolis zachmargolis commented Sep 27, 2021

If we pass hashes that

  • contain the string pii anywhere (such as decrypted_pii: "json string")
  • have any known pii attributes as keys
    the tests will raise

Implementation is shared across Analytics and FakeAnalytics which should cover most of our tests where we're not otherwise stubbinb

Copy link
Contributor

@aduth aduth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 👍

analytics.track_event(
Analytics::IDV_PHONE_CONFIRMATION_VENDOR,
form_result.to_h.merge(
pii_like_keypaths: [
Copy link
Contributor

@aduth aduth Sep 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With how widespread it seems we have innocent logging by attribute name, I wonder if it's worth considering one of the other avenues considered for detecting PII, like common patterns for phone/SSN/etc, or "known" PII values like our mock profile details?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My sense is that approach trades convenience for completeness? Many tests used that PII but not all of them do? Do we think it would be close enough to check for those attribute values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really feel too strongly one way or the other. For me, the ticket seemed to be framed as something of a safety net for peace of mind, and not necessarily the primary safeguard against logging PII. I could also wonder if using the mock details could catch logging in a situation where it's not cleanly structured data with attribute names, or with slight variations to the exact attribute name.

The current implementation isn't really problematic to me, though I would wonder if it could be confusing that the pii_like_keypaths is not actually part of the "extra" payload being logged.

We could even do both if we wanted 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 77a2aad

Example error:

FakeAnalytics::PiiDetected: track_event example PII first_name (FAKEY) detected in attributes
event: Trackable Event ()
full event: {:some_benign_key=>"FAKEY MCFAKERSON"}"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option, is we could update FormResponse and the classes similar to it to have a new attribute (next to extra), which get included in the .to_h? like this:

FormResponse.new(
  success: false,
  errors: { ssn: 'is not formatted right' },
  extra: { num_attempts: 14 },
  pii_like_key_paths: [[:errors, :ssn], [:error_details, :ssn]],
)

@zachmargolis
Copy link
Contributor Author

This finally passes! 🎉 🎉 🎉 🎉 🎉 🎉

@zachmargolis zachmargolis merged commit 708e710 into main Sep 29, 2021
@zachmargolis zachmargolis deleted the margolis-check-for-pii-in-track-event branch September 29, 2021 16:11
@aduth
Copy link
Contributor

aduth commented Oct 7, 2021

Had my first encounter with a flakey test with the new checking:

https://app.circleci.com/pipelines/github/18F/identity-idp/54969/workflows/a136bb59-b802-42e5-aad3-a745e8f5d2f0/jobs/85331/parallel-runs/4

I wonder if it could make sense to do a regex match with word boundary?

e.g.

[1] pry(main)> { zip: '59010' }.to_json.include?('59010')
=> true
[2] pry(main)> { user_id: 'c5adb05a-21f3-4e37-a2f9-e590107a9895' }.to_json.include?('59010')
=> true
[3] pry(main)> { zip: '59010' }.to_json.match?(/\b59010\b/)
=> true
[4] pry(main)> { user_id: 'c5adb05a-21f3-4e37-a2f9-e590107a9895' }.to_json.match?(/\b59010\b/)
=> false

@zachmargolis
Copy link
Contributor Author

I wonder if it could make sense to do a regex match with word boundary?

Yup! Makes sense, can give it a shot. Or, we could decide zipcode on its own is not sensitive enough to warrant a fail, so we could skip that attribute (like we skip state)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants