Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Deid samples and resource #1400

Merged
merged 5 commits into from
Mar 15, 2018
Merged

Conversation

averikitsch
Copy link
Contributor

No description provided.

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers
  • Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot.
  • The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
  • The email used to register you as an authorized contributor must also be attached to your GitHub account.

@googlebot googlebot added the cla: no This human has *not* signed the Contributor License Agreement. label Mar 13, 2018
dlp/deid.py Outdated
# See the License for the specific language governing permissions and
# limitations under the License.

"""Sample app that uses the Data Loss Prevention API for deidentifying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you think of a wording that shortens this to one line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

from __future__ import print_function

import argparse

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this blank line, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid.py Outdated
masking_character: The character to mask matching sensitive data with.
number_to_mask: The maximum number of sensitive characters to mask in
a match. If omitted the request or set to 0, the API will mask any
mathcing characters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If omitted or set to zero, the API will default to no maximum."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid.py Outdated
import google.cloud.dlp

# Instantiate a client
google.cloud.dlp.DlpServiceClient.SERVICE_ADDRESS = 'autopush-dlp.sandbox.googleapis.com' # DO NOT SUBMIT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go ahead and remove this for the commit. (Actually, I think we don't even need it anymore due to changes on the API side since we started).

Also apply this to the other functions, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid.py Outdated
google.cloud.dlp.DlpServiceClient.SERVICE_ADDRESS = 'autopush-dlp.sandbox.googleapis.com' # DO NOT SUBMIT
dlp = google.cloud.dlp.DlpServiceClient()

# Add parent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the inspect_content.py samples, use this comment:

# Convert the project id into a full resource id.

Also apply this to the other functions, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid_test.py Outdated
import deid

harmful_string = 'My SSN is 372819127'
parent = os.environ['GCLOUD_PROJECT']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this 'project' instead of 'parent' to avoid confusing it with the fully qualified resource name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid_test.py Outdated
key_name = os.environ['DLP_DEID_KEY_NAME']
surrogate_type = 'SSN_TOKEN'
csv_file = 'resources/dates.csv'
output_csv_file = 'resources/temp.results.csv'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check out the inspect_config_test.py file to see how to create a temporary directory, instead of hardcoding a temp file here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid_test.py Outdated
assert 'My SSN is *******27' in out


def test_deidentify_with_mask_handles_masking_number_error(capsys):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can go ahead and remove this test; it's not a problem for the sample if the API rejects this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid_test.py Outdated
out, _ = capsys.readouterr()

assert 'Successful' in out
# read in csv??
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to read in the CSV at this time, but go ahead and remove this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid_test.py Outdated
date_fields=date_fields,
context_field_id=csv_context_field,
key_name=key_name)
# [END Deidentify with date shift]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove all the START and END tags in the test file; they're only needed in the real sample file. They're used to mark regions that are auto-included in our documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@andrewsg
Copy link
Member

Great work on this!

wrapped_key = base64.b64decode(wrapped_key)

# Construct Deidentify Config
reidentify_config = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit wild. @andrewsg wdyt about using the real proto types here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, they're all really heavily nested. What's the advantage of using the real proto types vs. just breaking up the dictionaries into separate steps of construction and adding them together iteratively?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's less about breaking it up into steps and more about knowing the actual structure of the item in question. When using the message types directly you get a better sense for how things are constructed and where you can look to find out additional parameters. With just a big o' nested dictionary it's sort of a wild west - the user is left asking "well, how did they know how to construct this? How do I know how to construct this?".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this diagnosis of the problem for sure, but I'm not sure about the solution. If we use the real protos here people will just run into the proto help() crash bug. Node doesn't use proto objects; does it have a better user experience for discovering this stuff?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Node is also a vastly different community with different expectations. Remember, Python is strongly moving towards strongly typed. Seeing giant dictionary with no details on the structure or type constraints isn't wonderful.

Proto help() crash bug has been fixed and should hopefully be released soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, okay. I'll buy that. I think that's a bigger change to our focus than we can reasonably do in this PR, though; can we accept this part as-is and do another pass this or next week to see what it looks like with protos instead?

dlp/deid.py Outdated
item = {'value': string}

# Call the API
response = dlp.reidentify_content(parent,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please avoid hanging indents throughout, prefer to line break at the opening ( and then indent the continuation by one level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid.py Outdated
# Read and parse the CSV file
import csv
from datetime import datetime
file = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use file, it's a built-in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dlp/deid.py Outdated
'-m', '--masking_character',
help='The character to mask matching sensitive data with.')
mask_parser.add_argument(
'-p', '--project', default=os.environ['GCLOUD_PROJECT'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please just don't do project ID detection at all. Have them pass it in.

@averikitsch
Copy link
Contributor Author

I signed it!

@googlebot
Copy link

CLAs look good, thanks!

@googlebot googlebot added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels Mar 14, 2018
@theacodes
Copy link
Contributor

theacodes commented Mar 14, 2018 via email

@averikitsch averikitsch merged commit 1decee1 into GoogleCloudPlatform:dlp Mar 15, 2018
Copy link
Contributor

@lesv lesv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - No changes since this was merged (and Jon and Andrew have reviewed). (And it really looks good)

Your comments improve things in almost all cases, the one place i'd quibble is "Print out the results" which appears to be redundant to the code. (ie if the line below and the comment are the same, then don't bother)

In general, unless the language style for comments differ, it's good to prefer terseness w/o sacrificing clarity. If you haven't already read the Google style guide comments it's worth looking at. (Note - it looks like you have)

I was going to suggest some additional vertical white space, but that appears to be counter the community's view. The one place that appears to be recommended is before the "Returns:"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants