Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script to manually import annotations from analyst CSVs #487

Merged
merged 10 commits into from
Nov 5, 2019

Conversation

mrotondo
Copy link
Contributor

@mrotondo mrotondo commented Oct 3, 2019

This is a precursor to any automated importing, and does not include any de-duping or merging, so don't run it on the same csv multiple times!

Known issues (to either be addressed in this PR or subsequent ones, your call):

  • doesn't set the annotation author correctly, instead uses the API call initator
  • doesn't include any metadata about sheet "schema"
  • doesn't access sheets via the google drive API, just parses CSVs
  • as noted, no merging or deduping

@mrotondo mrotondo marked this pull request as ready for review October 3, 2019 19:36
@mrotondo
Copy link
Contributor Author

mrotondo commented Oct 3, 2019

I'm putting this up for review so that we can iron out any initial issues and then decide on paths for the merging/deduping and author-setting issues

Copy link
Member

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a fantastic start! All the comments I made inline are mostly small technical issues — overall, this seems like a good approach.

Would you move this to the scripts/ folder?

doesn't set the annotation author correctly, instead uses the API call initator

Might be good to include that as just another field in the annotation for now, until we figure out the best solution for this.

doesn't include any metadata about sheet "schema"

Let’s just add a field for this with the name annotation_schema and the value edgi_analyst_v2.

doesn't access sheets via the google drive API, just parses CSVs

I think that’s an OK limitation for now. Can address in a separate PR.

as noted, no merging or deduping

I think the plan we landed on for the API side of this is that multiple posts by the same user for the same change would just update a single annotation, so that should mitigate this. Of course, I need to check whether that’s what we’re actually doing in practice. :P

web_monitoring/annotations_import.py Outdated Show resolved Hide resolved
web_monitoring/annotations_import.py Outdated Show resolved Hide resolved
web_monitoring/annotations_import.py Outdated Show resolved Hide resolved
web_monitoring/annotations_import.py Outdated Show resolved Hide resolved
web_monitoring/annotations_import.py Outdated Show resolved Hide resolved
web_monitoring/annotations_import.py Outdated Show resolved Hide resolved
web_monitoring/annotations_import.py Outdated Show resolved Hide resolved
web_monitoring/annotations_import.py Outdated Show resolved Hide resolved
web_monitoring/annotations_import.py Outdated Show resolved Hide resolved
@mrotondo
Copy link
Contributor Author

mrotondo commented Oct 4, 2019

Thanks for all the comments! I'll get around to most of this tomorrow, but:

I think the plan we landed on for the API side of this is that multiple posts by the same user for the same change would just update a single annotation, so that should mitigate this. Of course, I need to check whether that’s what we’re actually doing in practice. :P

I think https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/42bb8c7020e4d139fa278a95e0947cf9c8329659/app/models/change.rb#L82 looks like it is indeed finding any extant annotation for that change by that author, but then overwriting its data.

@Mr0grog
Copy link
Member

Mr0grog commented Oct 4, 2019

but then overwriting its data.

Ahhhhh, so it is. I guess you’d need to find it and read it here first, though it might be simpler if posting an annotation took a parameter for update vs. replace. Hmmm.

Update: see edgi-govdata-archiving/web-monitoring-db#375 (comment) for discussion.

# change to the API

bool_keys = [
'Language alteration',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good to me, but the one thing I’m wondering is whether we should attempt to map these names to something a little less prone to change, e.g. “Content change/addition/removal” → “content_changed”.

I’m not sure how often the titles of these columns has been altered without changing their meaning in the past, but I’m fairly sure it’s happened a few times (fixing typos, slight rewordings, etc.). Even though doing a mapping like this means any display needs to also know what text to show, it seems like it would make a more reliable system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, please take a look to see if the structure looks alright, and if you'd like to change any of the new json keys

…o correspond to multiple csv column names which might change over time.
…oceeding with a malformed sheet

- Minor cleanup of attribute info initializatino
Copy link
Member

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good now. Noted a couple minor nits inline, but neither of them are major problems.

scripts/annotations_import.py Outdated Show resolved Hide resolved
scripts/annotations_import.py Outdated Show resolved Hide resolved
mrotondo and others added 2 commits October 30, 2019 19:59
All our scripts don't have extensions and are marked as executable, and this does the same with `annotations_import`. Also makes some minor whitespace and import tweaks.
Copy link
Member

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍 👍

@Mr0grog
Copy link
Member

Mr0grog commented Nov 4, 2019

Is this good to merge now, or was there more you wanted to do?

@mrotondo
Copy link
Contributor Author

mrotondo commented Nov 5, 2019

This is good to go, thanks!

@Mr0grog
Copy link
Member

Mr0grog commented Nov 5, 2019

THEN IN IT GOES

@Mr0grog Mr0grog changed the title A script to manually import annotations from analyst CSVs (in the current sheet format) Add script to manually import annotations from analyst CSVs Nov 5, 2019
@Mr0grog Mr0grog merged commit 2af5cc3 into edgi-govdata-archiving:master Nov 5, 2019
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants