Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep a record of what pages/versions an import touched and how #191

Closed
danielballan opened this issue Jan 5, 2018 · 10 comments
Closed

Keep a record of what pages/versions an import touched and how #191

danielballan opened this issue Jan 5, 2018 · 10 comments
Assignees
Milestone

Comments

@danielballan
Copy link
Contributor

I have seen a (yet unresolved) bug where an errored import job causes the UI /pages endpoint to become un-renderable. For issues like this, it would be useful to be able to identify and potentially remove records generated by a specific Import.

@Mr0grog
Copy link
Member

Mr0grog commented Jan 5, 2018

How is it unrenderable??? Regardless of whatever craziness is happening in the importer, it shouldn’t be possible for /api/v0/pages to somehow choke on anything it’s pulling out of the DB. Is it raising an exception?

I guess I can see some value in linking a version/page to an import, but it doesn’t seem (to me, at least) like a great solution to the problem you’re describing.

@danielballan
Copy link
Contributor Author

I guess I can see some value in linking a version/page to an import, but it doesn’t seem (to me, at least) like a great solution to the problem you’re describing.

So -- I definitely agree that we need to understand and fix the UI problem on its own. Nothing we do with imports should ever be able to break the /pages view. It's a separate issue. It just highlighted for me that an import-batch-gone-wrong is something we're likely to see again.

@danielballan
Copy link
Contributor Author

See edgi-govdata-archiving/web-monitoring-ui#174 for the UI bug.

@Mr0grog
Copy link
Member

Mr0grog commented Jan 5, 2018

See #192 for the DB bug about the specific issue that precipitated this.

@Mr0grog Mr0grog changed the title Label each Version and Page generated by an Import with the import_id Keep a record of what pages/versions an import touched and how Mar 19, 2018
@Mr0grog
Copy link
Member

Mr0grog commented Mar 19, 2018

Re-titled this to get at how @danielballan and I have talked about this issue. It doesn’t make sense to store information about what import created a page/version because multiple imports could modify a page or version (if you use ?update=merge or ?update=replace). There’s also a lot of overhead here.

Instead, we probably want to log what each import did. Some possible approaches here:

  1. Store a structured output log alongside the stored import payload, where each row in the import payload has a corresponding row that is something like:

    {
      "page_uuid": "<uuid>",
      "page_operation": "<created|updated|null>",
      "version_uuid": "<uuid>",
      "version_operation": "<created|updated|null>",
      "at": "2018-03-01T00:01:52Z"
    }
  2. Store a structured log of operations alongside the stored import payload, where rows don’t necessarily correspond to the payload, but are just a list of operations, like:

    {"object": "page",    "operation": "created", "uuid": "<uuid>", "at": "2018-03-01T00:01:52Z"}
    {"object": "version", "operation": "created", "uuid": "<uuid>", "at": "2018-03-01T00:01:52Z"}
    {"object": "page",    "operation": "updated", "uuid": "<uuid>", "at": "2018-03-01T00:01:52Z"}
    {"object": "version", "operation": "created", "uuid": "<uuid>", "at": "2018-03-01T00:01:52Z"}
    {"object": "version", "operation": "updated", "uuid": "<uuid>", "at": "2018-03-01T00:01:53Z"}
    // etc.
  3. Just a simple text log:

    [2018-03-01T00:01:52Z] Created Page "<uuid>"
    [2018-03-01T00:01:52Z] Created Version "<uuid>"
    [2018-03-01T00:01:52Z] Updated Page "<uuid>"
    [2018-03-01T00:01:52Z] Created Version "<uuid>"
    [2018-03-01T00:01:52Z] Created Version "<uuid>"
    [2018-03-01T00:01:53Z] Updated Version "<uuid>"
    
  4. A combination of (1) and (2), where each line is an array of operations and corresponds to a line in the import payload.

  5. Any of the above but just an append-log that is not a separate file for each import job.

  6. Other ideas???

There are plenty of ways to approach this one; almost any of them would probably be good, though I’d shoot for something more structured here. I think we are also aiming to not create database records for this because of extremely high overhead.

@Mr0grog
Copy link
Member

Mr0grog commented Apr 13, 2018

Prompted to revisit this because of #277, since we’re really going to want this info when a line is skipped because a page doesn’t exist and ?create_page=false. For that PR, I might just stick with normal logging. But it’s a point that we should come back around and get this done.

I’m going to throw my lot in for (2) above with an extra property about which row the log is associated with, e.g:

{"object": "page",    "operation": "created", "uuid": "<uuid>", "at": "2018-03-01T00:01:52Z", "row": 0}
{"object": "version", "operation": "created", "uuid": "<uuid>", "at": "2018-03-01T00:01:52Z", "row": 0}
{"object": "page",    "operation": "updated", "uuid": "<uuid>", "at": "2018-03-01T00:01:52Z", "row": 1}
{"object": "version", "operation": "created", "uuid": "<uuid>", "at": "2018-03-01T00:01:52Z", "row": 1}
{"object": "version", "operation": "updated", "uuid": "<uuid>", "at": "2018-03-01T00:01:53Z", "row": 2}
// etc.

I think that’s a little easier to manage than (4), where log lines for a row have to be collected together, but I’m feeling like either one of those is the best of what I’d proposed above. Any thoughts, @danielballan?

@danielballan
Copy link
Contributor Author

I’m going to throw my lot in for (2) above with an extra property about which row the log is associated with

This strikes the correct balance between ease/efficiency of writing and completeness. Since we can re-create other forms from this one, I think we should go with this.

@Mr0grog
Copy link
Member

Mr0grog commented Apr 13, 2018

Note from #278: we should make sure these get gzipped when they are saved.

@stale
Copy link

stale bot commented Jan 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@stale stale bot added the stale label Jan 23, 2019
@Mr0grog Mr0grog added the idea label Jan 24, 2019
@stale stale bot removed the stale label Jan 24, 2019
@Mr0grog Mr0grog added this to the Backlog milestone Jan 24, 2019
@Mr0grog Mr0grog added never-stale and removed idea labels Apr 24, 2019
@Mr0grog Mr0grog assigned Mr0grog and bensheldon and unassigned Mr0grog Jun 27, 2019
@Mr0grog
Copy link
Member

Mr0grog commented Oct 31, 2019

Update: this was closed by the work in #551, which does not create a special log file for each import. Instead, we looked back at the problems that precipitated this issue and at our current needs, and decided that just logging to the same place we do all our normal logs (STDOUT/STDERR → AWS CloudWatch in production and ./log/development.log in development) was the right answer that meaningfully reduced complexity here.

If you want to see just these logs, you can use grep (or whatever tooling you have in whatever service your logs might arrive in):

# In development:
> tail -f log/development.log | grep '\[import'
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36] Started Import 36
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=0] Created Page 9b302e5a-fe72-466f-800a-ec26f1d09674
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=0] Created Version 21d3494c-e92c-4e4c-accf-9cd9f1dcf1f8
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=1] Found Page 9b302e5a-fe72-466f-800a-ec26f1d09674
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=1] Created Version af617815-e120-4d64-9a15-944694a688bd
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=2] Found Page 9b302e5a-fe72-466f-800a-ec26f1d09674
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=2] Created Version 987083e2-5ca4-4d07-8828-8d82d2a8189d
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=3] Found Page 9b302e5a-fe72-466f-800a-ec26f1d09674
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=3] Created Version 2a9a5a1b-c5c4-43ef-9be3-eb2782eae98d
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=4] Found Page 9b302e5a-fe72-466f-800a-ec26f1d09674
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=4] Created Version 333fea36-2549-4718-b12b-fdc1e634bac8
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=5] Found Page 9b302e5a-fe72-466f-800a-ec26f1d09674
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36][row=5] Created Version 6e9ce120-2352-4219-aaef-0b6c0a730dd8
[ActiveJob] [ImportVersionsJob] [04b425c1-e80d-40a7-9bb3-7d82ba92092f] [import=36] Finished Import 36

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants