Improve the behavior of the person scrape #900

Closed
hancush opened this issue Oct 5, 2022 · 7 comments

@hancush
Collaborator

hancush commented Oct 5, 2022

Specifically, changes to names, membership dates, and similar fields often misbehave, and it'd be great if they didn't.

@hancush
Collaborator Author

hancush commented Nov 1, 2022

There is a 5-year-old issue with a rich history on how we might approach handling deleted data. @antidipyramid, can you have a look and leave a comment in that issue about the approach that makes most sense to you?

@hancush
Collaborator Author

hancush commented Nov 2, 2022

Whoops, forgot to add the link: opencivicdata/pupa#295

@antidipyramid antidipyramid self-assigned this Nov 2, 2022
@antidipyramid
Collaborator

Both the pupa and python-opencivicdata repositories have been updated (here and here, respectively) to record when objects were last seen in a scrape. The scrapers should now be recording that information.

Next steps are to observe the new behavior and present the Metro team with our findings.
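
As a rough illustration of what "recording when objects were last seen" could look like, here is a minimal sketch. It is not pupa's actual importer: the `Record` class and the `import_scraped` function are hypothetical names invented for this example. The assumption is that every object encountered in a scrape gets its last_seen stamp refreshed, even when its data is unchanged, while updated_at only moves when the data differs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Record:
    """Hypothetical stand-in for a stored object such as a membership."""
    data: dict
    created_at: datetime
    updated_at: datetime
    last_seen: datetime


def import_scraped(store, scraped, now):
    """Upsert scraped items into `store`, refreshing last_seen for each one.

    store:   dict mapping an identifier to a Record
    scraped: dict mapping an identifier to the freshly scraped data
    now:     timestamp of the current scrape
    """
    for key, data in scraped.items():
        record = store.get(key)
        if record is None:
            # First time we see this object: all three stamps start at `now`.
            store[key] = Record(data, created_at=now, updated_at=now, last_seen=now)
            continue
        if record.data != data:
            # Only a real change moves updated_at.
            record.data = data
            record.updated_at = now
        # Seen in this scrape, changed or not.
        record.last_seen = now
    return store


# Two scrapes a week apart (illustrative dates); "b" is missing from the second.
first_scrape = datetime(2023, 1, 1, tzinfo=timezone.utc)
second_scrape = datetime(2023, 1, 8, tzinfo=timezone.utc)
example_store = {}
import_scraped(example_store, {"a": {"role": "Chair"}, "b": {"role": "Member"}}, first_scrape)
import_scraped(example_store, {"a": {"role": "Chair"}}, second_scrape)
```

After the second run, object "a" carries the later last_seen while its updated_at is untouched, and object "b" keeps the stamp from the first run. That gap is what makes it possible to spot data that has disappeared from the source.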

@hancush
Collaborator Author

hancush commented Mar 1, 2023

A few weeks ago, I observed that the last_seen timestamp was not behaving as expected.

It turns out our scrapers were pinned to an earlier version of pupa. As of today, they are running the most recent version. I'll circle back to this next week and evaluate whether the last_seen date makes more sense.

@antidipyramid has also drafted a pupa command to remove data that has not been seen in a certain window: opencivicdata/pupa#344

We'll pilot this once we confirm the date stamps are behaving as expected.
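
For readers unfamiliar with the idea, selecting data "not seen in a certain window" can be sketched roughly as below. This is not the code from the pupa pull request: `find_stale` is a hypothetical helper, and the records are plain dicts with made-up names and dates, assuming only that each one carries a timezone-aware last_seen value.

```python
from datetime import datetime, timedelta, timezone


def find_stale(records, window, now=None):
    """Return the records whose last_seen is older than `now - window`."""
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - window
    return [record for record in records if record["last_seen"] < cutoff]


# Illustrative data only; names and dates are invented for this sketch.
sample_records = [
    {"name": "membership-a", "last_seen": datetime(2023, 3, 6, tzinfo=timezone.utc)},
    {"name": "membership-b", "last_seen": datetime(2022, 12, 20, tzinfo=timezone.utc)},
]
stale = find_stale(
    sample_records,
    window=timedelta(days=7),
    now=datetime(2023, 3, 7, tzinfo=timezone.utc),
)
```

With a seven-day window, only "membership-b" falls before the cutoff. A real cleanup would presumably review or report such candidates before deleting anything, which is why piloting it after confirming the stamps makes sense.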

@hancush hancush moved this from In Progress to Backlog in boardagendas.metro.net - Monthly priorities Mar 7, 2023
@hancush
Collaborator Author

hancush commented Mar 7, 2023

Date stamps look to be working! Here are the memberships and events we haven't seen in the past week:

          created_at           |          updated_at           |           last_seen           |    person_name    |    role
-------------------------------+-------------------------------+-------------------------------+-------------------+------------
 2022-12-20 20:33:34.550342+00 | 2022-12-20 20:33:34.550365+00 | 2022-12-20 20:33:34.550373+00 | Belinda Faustinos | Chair
 2022-12-20 20:33:35.182821+00 | 2022-12-20 20:33:35.182842+00 | 2022-12-20 20:33:35.182851+00 | Emina Darakjy     | Vice Chair
 2022-12-20 20:33:36.092624+00 | 2022-12-20 20:33:36.09264+00  | 2022-12-20 20:33:36.092646+00 | Louis Moret       | Member
(3 rows)

          created_at           |          updated_at           |           last_seen           |                slug                |         name          |        start_date
-------------------------------+-------------------------------+-------------------------------+------------------------------------+-----------------------+---------------------------
 2023-01-17 17:02:06.412322+00 | 2023-01-17 17:02:06.412348+00 | 2023-01-17 17:02:06.412357+00 | regular-board-meeting-7c4e45198c2d | Regular Board Meeting | 2023-01-17T17:00:00+00:00
 2023-01-03 17:02:33.072897+00 | 2023-01-03 17:16:49.772612+00 | 2023-01-03 17:16:49.772618+00 | regular-board-meeting-9875c3b95fb4 | Regular Board Meeting | 2023-01-03T17:00:00+00:00
 2023-01-03 18:27:21.891474+00 | 2023-01-03 18:27:21.891491+00 | 2023-01-03 18:27:21.891496+00 | regular-board-meeting-5b1c2e7b2588 | Regular Board Meeting | 2023-01-03T18:20:00+00:00
(3 rows)

Looks like all of these have been removed from Legistar.

@hancush hancush moved this from Backlog to In Progress in boardagendas.metro.net - Monthly priorities Mar 7, 2023
@hancush hancush moved this from 📝 In Progress to 📤 Review/QA in boardagendas.metro.net - Monthly priorities Mar 16, 2023
@hancush
Collaborator Author

hancush commented Mar 21, 2023

Hannah will show Monkruman how to cut releases of OCD and pupa next week, then we'll add the DAG to flush data we haven't seen in a week.

@hancush
Collaborator Author

hancush commented Jun 9, 2023

Done!

@hancush hancush closed this as completed Jun 9, 2023