Consider replacing kingfisher-process with ocds-merge-rs #437

Open
jpmckinney opened this issue Jun 12, 2023 · 0 comments
Labels
S: kingfisher Relating to the Kingfisher servers

Comments

jpmckinney (Member) commented Jun 12, 2023

Edit: Moving comment from #402 (comment)

I've been working on a Rust version of OCDS Merge. If it performs well, we could change our analysis process to:

  • Run OCDS Merge directly on a Scrapy data directory. If an analyst wants to use data before the crawl is complete, that's fine: OCDS Merge can run on the files that are already closed (checked with lsof). Note that merging should be done before upgrading.
  • We don't need to upgrade many collections. If one is too large to upgrade quickly with ocdskit, I can write a Rust version.
  • Add a SQL loader command (remember to replace control codes), so that Kingfisher Summarize can still work. We can probably simplify Summarize, as I'm not sure how frequently we analyze release/record collections.
    • We might prefer analysts to load data into separate tables (e.g. into a schema under their own name). This makes it very easy to clean up old data. That'll require changes to Summarize (mostly release_*.sql and JOIN data).
  • libcoveocds is too slow to run in sequence on an entire dataset. We can instead run a fast JSON Schema validator on the whole dataset, and run libcoveocds' other checks only on a sample.
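
To make the sampling idea in the last bullet concrete, here is a rough Python sketch, not a proposed implementation. It assumes the data can be read as a stream of items (e.g. one release per line), and it uses two hypothetical callables, `fast_check` and `slow_check`, as stand-ins for a fast JSON Schema validator and for libcoveocds' other checks. Neither name is a real API. The sample is drawn with reservoir sampling so that the dataset never has to be held in memory.

```python
import random
from typing import Callable, Iterable


def reservoir_sample(items: Iterable[dict], k: int, seed: int = 0) -> list:
    """Return up to k items chosen uniformly at random from a stream."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = item
    return sample


def check_dataset(
    items: Iterable[dict],
    fast_check: Callable[[dict], bool],  # hypothetical: fast schema validation
    slow_check: Callable[[dict], bool],  # hypothetical: slower additional checks
    sample_size: int,
    seed: int = 0,
):
    """Run fast_check on every item and slow_check on a random sample only."""
    fast_failures = []

    def checked():
        # Apply the fast check while the stream is consumed by the sampler.
        for position, item in enumerate(items):
            if not fast_check(item):
                fast_failures.append(position)
            yield item

    sample = reservoir_sample(checked(), sample_size, seed)
    slow_failures = [item for item in sample if not slow_check(item)]
    return fast_failures, slow_failures
```

The fast check still sees every item in a single pass, while the cost of the slow check is bounded by `sample_size` regardless of dataset size. The fixed `seed` is only there to make a run repeatable.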

For all of the above, the instructions in the documentation for data support managers should redirect output to files, to make warnings easier to review. The instructions could be organized into a Makefile.
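
As an illustration of the redirection, here is a minimal Python sketch. It assumes each step is a command that writes its data to standard output and its warnings to standard error; the helper name `run_logged` and the file names in the usage note are hypothetical.

```python
import subprocess
from pathlib import Path


def run_logged(command: list, output_path: Path, warnings_path: Path) -> int:
    """Run a command, sending stdout to output_path and stderr to warnings_path."""
    with output_path.open("wb") as out, warnings_path.open("wb") as err:
        # check=False: a non-zero exit code is returned to the caller, not raised.
        completed = subprocess.run(command, stdout=out, stderr=err, check=False)
    return completed.returncode
```

A step would then look like `run_logged(command, Path("merged.jsonl"), Path("merge.log"))`, where `command` is whatever the documented step is, so that the warnings for each step can be reviewed in their own file afterwards.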

If the above changes don't go ahead, then work on https://github.com/open-contracting/kingfisher-process/milestone/7

Note: There is a similar issue for the registry at open-contracting/data-registry#292
