Roadmap for the rest of 2020 #158

Closed
11 of 17 tasks
Mr0grog opened this issue Nov 11, 2020 · 6 comments

Comments

@Mr0grog
Member

Mr0grog commented Nov 11, 2020

Near the start of this year, I wrote about the issues with maintaining the Web Monitoring project, and have slowly worked to ramp down some of the development. Ultimately, however, EDGI’s Web Monitoring team is continuing to use this software, and that means they need some path forward where this code requires less (ideally none) of my (@Mr0grog) day-to-day involvement, and where there’s some reasonable hope that another person can maintain it. I’m hoping to focus my efforts between now and the end of the year on that, and the road-map here will serve to help keep things as on-target as possible.

Major Goals

  1. Extract/abstract and publish packages for parts of the code that are more narrowly-focused and useful for other people. It’s at least possible for other people to maintain these, and creates some more broadly useful artifacts whether or not Web Monitoring continues meaningfully into the future.

  2. Make it possible for someone other than @Mr0grog to maintain the project. The biggest blocker here is the variety of languages and frameworks used; web-monitoring-db, in particular, is the major odd one out.

  3. Increase stability/decrease necessary technical oversight. The platform currently needs a lot of TLC to keep it humming along, and that’s a real problem. The most obvious issue here is generating weekly task sheets, which are no longer automated.

  4. Clean up the data/follow-up on some basic get-our-ducks-in-a-row issues. I know this is a weird and vague goal. It’s partially in service to the above points (if some of the basics are more cleaned up, they’re easier for others to grok and maintain), and partially about ensuring that, in a possible future where we do ramp down this system, the data is still accessible and useful.

It’s worth noting some of these are in tension with each other. Extracting more small packages makes more to manage, but it also makes it more possible for other people to take on that management. Making big changes to frameworks or languages will necessarily introduce bugs and reduce stability, but will also make it more possible for a single human to maintain everything.

Specific Actions

So how do we accomplish the above?

  • Extract diffing tools and server into their own package. (Point 1)

  • Extract wayback API into its own package. (Point 1)

  • Remove deprecated/vestigial fields in DB. (Point 4)

  • Support multiple URLs for pages in -db. (Allow pages to have multiple URLs web-monitoring-db#492) (Point 4)
    Effort: 1 week

  • Refactor the Version model. (Refactor Version Model web-monitoring-db#776) (Point 4)
    Effort: 3 days

  • Improve the Python API to web-monitoring-db. (DB API should have built-in retry functionality web-monitoring-processing#659, DB API needs methods for iterating over paginated results web-monitoring-processing#660, Most DB API methods should support arbitrary keyword args web-monitoring-processing#661) (Point 2) It’s questionable how worthwhile this is if we do a major rewrite of -db, BUT:

    • These are relatively easy and quick to do.
    • If the -db rewrite gets stuck, these will still apply and be big improvements on point 2.
    • If the -db rewrite still keeps some form of API around (likely, since we don’t want to cause huge unnecessary changes to -ui), it may still be useful.

    Effort: 1 day

  • Optimize the Wayback import script so it runs more quickly and reliably. (Point 3)

  • Don’t set page titles from versions that are errors. (Page titles should not be updated by versions that are errors web-monitoring-db#751) (Point 4) TBH, this is more about fixing a piece of the data that is currently super confusing than really getting at the major goals. But I want to prioritize it.
    Effort: 3 days

  • Investigate feasibility of setting up all our pages in Klaxon, moving the team over to it, and letting much of this project that we haven’t extracted out (see point 1) go to seed. I haven’t taken a good look at Klaxon since we first got started, but it’s more narrowly focused and appears to be actively maintained (and now I know at least one person at the Marshall Project, so I can ask them about it).

    When we started this project, we looked at Klaxon and felt:

    • We were already beyond it in scale,
    • We wanted a much richer feature set (enough so that it didn’t seem like the right place to start), and
    • It was critical that we preserve and integrate data from Versionista

    Except for scale, a lot of those may be less important going forward. I should at least look into it.

    Effort: 1 week

    Update: implemented this, but didn’t really get useful traction from analyst team. Will not work on more or maintain.

  • Rewrite web-monitoring-db in Python (sort of). See Rewrite this project in Python web-monitoring-db#119 (comment) — I think a rewritten version should have less lofty goals and not try and be a generalized API. We can also consolidate much of what’s left in web-monitoring-processing into it. (Point 2)
    Effort: 5-6 weeks

    THIS ONE SCARES ME A LOT. Even in this re-imagined form, I still don’t want to do it. BUT we have proven that it’s really hard to find someone who can work comfortably across Ruby, Python, and JS, especially with all the esoteric bits we wind up hitting at our breadth and scale around HTTP requests, the Wayback Machine, string encoding issues, etc. We have more Python than Ruby in our stack, and Python is also more favored in general in the [web] archiving community. I’ve kicked the can down the road on this for too long. If there’s any viable future for this project as a whole, this rewrite needs to happen, even though it’s big, problematic, and ripe for errors.

    Update: this has been canceled for 2020 — it’s not feasible in the current timeframe.

  • Automate web-monitoring-task-sheets. (Dockerize this so it can run as a scheduled AWS ECS web-monitoring-task-sheets#6) (Point 3) This absolutely must no longer be done by hand. It should probably be a docker image and run on AWS Batch. It’s too big and too rarely used to make sense as a Kubernetes job. (Generating task sheets needs 3 GB of RAM at a minimum and typically takes a couple hours to run.)
    Effort: 1.5 weeks

  • Update our URL list with the Internet Archive team. (Point 3) It’s been a while since we’ve done this, but I want to fix up the multiple URLs issue first. (We knew we had some page duplication, and generating seeds for End-of-Term archive showed me it’s a significant issue.)
    Effort: 1 day (Blocked by multiple URLs per page)

  • Do a thorough re-investigation of the database setup. (Point 3) A lot of things we do today are pretty slow and impactful. This may be solved by the rewrite of web-monitoring-db (since we have a lot of room to change our access patterns when we remove/rethink the focus on a generalized API), but not necessarily. It deserves a dedicated task.
    Effort: 2-3 weeks (Best done in conjunction with -db rewrite)

  • Extract Ruby SURT module from -db into its own package. (Extract SURT into a separate gem web-monitoring-db#767) (Point 1) This probably isn’t a huge priority, especially if we kill off all our Ruby code (see web-monitoring-db rewrite above), although it’d be nice to salvage for others to use.
    Effort: 2 weeks (Should also see if we can back-port improvements to Python SURT)

  • Automated exports of DB data to flat files (probably in S3). (Set up automated process to export pages/versions web-monitoring-db#45) (Point 3) The primary use here is that people can grab this for doing large bulk analyses without needing to hit the database. It could also be used instead of having an API at all (even though it’s not too convenient for a lot of operations). I’m worried I’m inventing something that’s not effectively serving any real use-cases, though, and that should be avoided — especially right now. (Update: after discussing with @danielballan, it seems clear that this is not a good thing to pursue right now; we are inventing a need that probably does not exist.)


These are not the only important things between now and the end of 2020. In particular, needed fixes and enhancements in the packages we’ve extracted (wayback and web-monitoring-diff) must continue to be worked on. All else being equal, though, stuff here will take priority.
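
As an illustration of the “Improve the Python API to web-monitoring-db” item above (built-in retries, iterating over paginated results, arbitrary keyword args), here is a minimal sketch of how those pieces could fit together. Everything in it is an assumption made for illustration: `TransientError`, `with_retries`, `iterate_pages`, and the `fetch_chunk` callable are hypothetical names, and the `{"data": [...], "links": {"next": ...}}` response shape is assumed rather than taken from the real API.

```python
import time
from typing import Any, Callable, Dict, Iterator, Optional


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (timeout, HTTP 5xx, etc.)."""


def with_retries(call: Callable[[], Any], retries: int = 3,
                 backoff: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> Any:
    """Run `call`, retrying on TransientError with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == retries:
                raise  # Out of retries: let the caller see the error.
            sleep(backoff * 2 ** attempt)


def iterate_pages(fetch_chunk: Callable[[Optional[str], Dict[str, Any]], Dict[str, Any]],
                  *, sleep: Callable[[float], None] = time.sleep,
                  **params: Any) -> Iterator[Any]:
    """Yield every record from a paginated endpoint, one chunk at a time.

    `fetch_chunk(url, params)` is assumed to return a dict shaped like
    {"data": [...], "links": {"next": <url or None>}}. Arbitrary keyword
    args are passed through untouched as query parameters.
    """
    url: Optional[str] = None  # None means "first chunk".
    while True:
        chunk = with_retries(lambda: fetch_chunk(url, params), sleep=sleep)
        yield from chunk["data"]
        url = chunk.get("links", {}).get("next")
        if not url:
            break
```

A caller would then write something like `for version in iterate_pages(fetch_versions, capture_time="2020-11-01.."): ...` (with `fetch_versions` being whatever function actually performs the HTTP request) and never deal with chunk boundaries or transient failures directly.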

Update 2020-11-13: Changed effort estimates, removed edgi-govdata-archiving/web-monitoring-processing#661 from roadmap.

Update 2020-12-10: Canceled “Rewrite web-monitoring-db in Python (sort of).” See comments below for more.

@Mr0grog
Member Author

Mr0grog commented Nov 11, 2020

Had a good discussion with @danielballan on this. The list mostly feels good, but we will not do the automated export stuff.

We both think the -db rewrite is dicey, but is probably important if the project is to have a maintainable and more stable future. Narrowing its focus to serving the -ui project (instead of having a generic API) is critical to making it feasible.

@Mr0grog
Member Author

Mr0grog commented Nov 13, 2020

@gretchengehrke

Thanks for laying this out so clearly @Mr0grog! This looks like a ton of work.

I'm way behind on working on getting volunteer capacity to help you. I have on my to-do list this week to create some tech-specific volunteer recruitment flyers and other postings (for listservs, or maybe Idealist? Or advertisement on Code 4 America?). I figure it'll take time to find people, but I'm guessing you wouldn't want any volunteers coming on board until most of the work you've listed out here is completed. Is that true?

One comment on what you've listed here: For generating task sheets, I wonder if that could be done monthly instead of weekly. I understand that making it automated would be much, much more sustainable, but in the meantime, I think it would be okay to be less frequent, since it takes us so long to actually write reports etc, it wouldn't be the end of the world to see a change a few weeks later than we otherwise would. Would that help things?

@Mr0grog
Member Author

Mr0grog commented Nov 23, 2020

I'm guessing you wouldn't want any volunteers coming on board until most of the work you've listed out here is completed. Is that true?

Yeah, that's accurate. It would be extremely hard for someone totally new to the project to contribute productively to these kinds of tasks.

could [generating task sheets] be done monthly instead of weekly… would that help things?

I’m happy to change the schedule in whatever way is useful, but I don’t think this makes an especially huge difference on the technical end:
+ It’ll free up a little bit of my time. (The critical trade-off here is that it takes longer to analyze a larger timeframe, but I’d be doing it less often—overall a net savings).
- It’s still a fundamental problem that this requires me to be around and do it manually, and that this is another disjoint, poorly integrated part of the system that needs remediation.

@Mr0grog
Member Author

Mr0grog commented Nov 25, 2020

Test Klaxon instance is up and running at https://edgi-klaxon-test-v2.herokuapp.com/

I’ve put a few pages in, and should do some more plus invite other users — at least @gretchengehrke and possibly other analysts.

@Mr0grog Mr0grog pinned this issue Dec 10, 2020
@Mr0grog
Member Author

Mr0grog commented Dec 10, 2020

Update: at the current pace, and after talking with @danielballan, I don’t think the “Rewrite web-monitoring-db in Python (sort of)” item is feasible for this roadmap anymore. I’m going to focus more on trying to improve documentation and organization around it instead.

Ultimately, that one was always a high-effort, risky task, and while I think it would have gone a long way towards making it more feasible for other people to maintain things, it’s just not going to happen on a short time frame, and doesn’t make sense to sprint on right now. Longer term, it’s also not the only viable direction: Dan is interested in focusing on improving the tools for people who want to do more general (and technical) data analysis (i.e. -diff and wayback) that don’t depend on a live, running service like -db; seeing if a tool like Klaxon would like to adopt some of these pieces might be a good path, too.

@Mr0grog Mr0grog closed this as completed Jan 17, 2023
@Mr0grog Mr0grog unpinned this issue Jan 17, 2023