Data Review Tool rewrite #223

Open
3 of 14 tasks
jpmckinney opened this issue Oct 21, 2024 · 3 comments

Comments

@jpmckinney
Member

jpmckinney commented Oct 21, 2024

  • Create new repositories for non-CoVE version, e.g. plover and plover_web (still Django)

Web frontend (templates and text)

Web backend

  • Use Celery (with Redis queue) for background tasks
  • Implement async upload and async processing
  • Don't use the filesystem as the cache
    • Once we cache validation results, we no longer need to write/cache the metatab, conversion warnings, extended schema, cell_source_map, heading_source_map.
    • See Unflatten in Kingfisher Collect for how to get unflatten results from a temporary directory.
    • We do want to write the original file and the converted file to the media directory, as it is helpful to users and analysts to download the file (especially if they did not upload it, lost track of it, or if the data at the URL has changed).
  • Use Django page cache (with Redis, since it's already installed)
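A minimal settings sketch for the Redis-backed pieces of the list above, assuming Django 4.0+ (which ships a built-in Redis cache backend) and a local Redis instance. The URL, database numbers, timeout and key prefix are placeholders, not decisions made in this issue, and `CELERY_BROKER_URL` is only picked up if the Celery app is configured from Django settings with the `CELERY` namespace.

```python
# settings.py (fragment): hypothetical values throughout.
REDIS_URL = "redis://127.0.0.1:6379"  # assumption: local Redis

# Django's built-in Redis cache backend, so the filesystem is not the cache.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": f"{REDIS_URL}/1",  # assumption: database 1 for the cache
    }
}

# Per-site page cache settings.
CACHE_MIDDLEWARE_ALIAS = "default"
CACHE_MIDDLEWARE_SECONDS = 600  # placeholder timeout
CACHE_MIDDLEWARE_KEY_PREFIX = "drt"  # placeholder prefix

# UpdateCacheMiddleware must come first and FetchFromCacheMiddleware last;
# other middleware goes in between.
MIDDLEWARE = [
    "django.middleware.cache.UpdateCacheMiddleware",
    "django.middleware.common.CommonMiddleware",
    "django.middleware.cache.FetchFromCacheMiddleware",
]

# Assumption: read by Celery via
# app.config_from_object("django.conf:settings", namespace="CELERY").
CELERY_BROKER_URL = f"{REDIS_URL}/0"
```

Keeping the broker and the cache in separate Redis databases is one way to clear the cache without touching queued tasks.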

Library (libcoveocds)

  • Extract relevant logic from lib-cove (only common.py is very relevant)
  • Simplify and refactor the extracted code
  • Stop writing to validation_errors-3.json and remove corresponding logic from web backend
  • Remove keys from output/context that are unused (check which other projects read this library's output)
  • Remove JSON serializing of errors (originates in lib-cove) [This is needed to aggregate similar errors]
  • Try switching to jsonschema-rs (see "Try substituting jsonschema-rs for jsonschema", lib-cove-ocds#123)
  • Ask ODS about dropping AGPL in our code

Learning

  • While everything is fresh, read the latest JSON Schema specification to see if anything can be simplified by adopting new versions

@jpmckinney
Member Author

jpmckinney commented Oct 22, 2024

Ideas for new checks, from internal discussions, that better serve what data support managers find useful in the current Key Field Information:

  • “who bought what from whom, for how much, when and how”

Unlike other checks, it is useful to report on the details of these checks even when they pass.

It is also useful to report (similar to Pelican):

  • number of contracting processes
  • stages covered
  • date ranges: at least release date and tender period, but ideally also awards' date and contracts' period

Plus, if possible (Slack):

  • tag counts
  • party role counts
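A rough sketch of how these counts could be computed for a release package, using only the standard library. The function name `summarize_sample`, the simplified stage list and the output keys are all hypothetical; it also assumes release dates are ISO 8601 strings in a uniform format, so that sorting them as strings gives the date range.

```python
from collections import Counter

# Assumption: a simplified list of top-level stage fields in a release.
STAGES = ("planning", "tender", "awards", "contracts")


def summarize_sample(package):
    """Summarize a release package (a dict with a "releases" list)."""
    releases = package.get("releases", [])

    # One contracting process per distinct ocid.
    ocids = {r["ocid"] for r in releases if "ocid" in r}
    # A stage counts as covered if any release has a non-empty value for it.
    stages_covered = [s for s in STAGES if any(r.get(s) for r in releases)]
    dates = sorted(r["date"] for r in releases if isinstance(r.get("date"), str))
    tag_counts = Counter(tag for r in releases for tag in r.get("tag", []))
    role_counts = Counter(
        role
        for r in releases
        for party in r.get("parties", [])
        for role in party.get("roles", [])
    )

    return {
        "contracting_processes": len(ocids),
        "stages_covered": stages_covered,
        "release_date_range": (dates[0], dates[-1]) if dates else None,
        "tag_counts": dict(tag_counts),
        "party_role_counts": dict(role_counts),
        # Messages refer to "the sample", never to the whole dataset.
        "messages": [
            f"The sample doesn't contain any {s} data."
            for s in STAGES
            if s not in stages_covered
        ],
    }


package = {
    "releases": [
        {
            "ocid": "ocds-213czf-1",
            "date": "2024-01-02T00:00:00Z",
            "tag": ["tender"],
            "tender": {"id": "1"},
            "parties": [{"roles": ["buyer", "procuringEntity"]}],
        },
        {
            "ocid": "ocds-213czf-1",
            "date": "2024-02-01T00:00:00Z",
            "tag": ["tenderUpdate"],
            "tender": {"id": "1"},
            "parties": [{"roles": ["buyer"]}],
        },
    ]
}
summary = summarize_sample(package)
```

The summary is returned whether or not anything is missing, in line with reporting details even when checks pass.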

An important design caveat is that users are not uploading full datasets or representative samples. The checks need to make sense even for a sample, so we'll need to word any messages carefully, e.g. "the sample doesn't contain awards", not "your dataset doesn't contain awards".

In general, along the lines of the earlier user research, we need the DRT to be useful, interpretable and actionable for OCDS implementers. If something is needed for data support, we might prefer to implement it as a notebook. (Of course, it is more convenient for team members to not load another tab.)

@jpmckinney
Member Author

See also open-contracting/ocds-extensions#128 about ensuring that oneOf reports subschema errors correctly, for more than just the oneOf used for embedded vs linked releases.

@jpmckinney
Member Author

I am using the jsonschema library wherever possible, instead of writing separate checks. Storing old code for checks here, in case useful in future:

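import re

# An ocid must start with "ocds-" followed by six lowercase alphanumerics.
OCID_PREFIX_RE = re.compile(r"^ocds-[a-z0-9]{6}")


def ocid_prefix_format(data_paths):
    """Report string ocid values that don't start with a well-formed prefix.

    data_paths maps a path without array indices, e.g. ("releases", "ocid"),
    to a dict of {full path with indices: value}.
    """
    values = [
        (value, "/".join(map(str, full_path)))
        for path in (
            ("releases", "ocid"),
            ("records", "ocid"),
            ("records", "releases", "ocid"),
            ("records", "compiledRelease", "ocid"),
        )
        if (full_paths := data_paths.get(path))
        for full_path, value in full_paths.items()
        # Non-string values are left to the schema's type check.
        if isinstance(value, str) and not OCID_PREFIX_RE.match(value)
    ]

    if values:
        return {"conformance_errors": {"ocds_prefixes_bad_format": values}}
    return {}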
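
For reference, a small usage sketch of the stored check. The shape of `data_paths` is inferred from how the function reads it, and the sample ocid values are made up; the function is repeated here only so the snippet runs on its own.

```python
import re

OCID_PREFIX_RE = re.compile(r"^ocds-[a-z0-9]{6}")


def ocid_prefix_format(data_paths):
    values = [
        (value, "/".join(map(str, full_path)))
        for path in (
            ("releases", "ocid"),
            ("records", "ocid"),
            ("records", "releases", "ocid"),
            ("records", "compiledRelease", "ocid"),
        )
        if (full_paths := data_paths.get(path))
        for full_path, value in full_paths.items()
        if isinstance(value, str) and not OCID_PREFIX_RE.match(value)
    ]

    if values:
        return {"conformance_errors": {"ocds_prefixes_bad_format": values}}
    return {}


# Hypothetical input: keys are paths without array indices, values map the
# full path (with indices) to the value found there.
data_paths = {
    ("releases", "ocid"): {
        ("releases", 0, "ocid"): "ocds-213czf-000-00001",  # well-formed prefix
        ("releases", 1, "ocid"): "OCDS-213CZF-000-00002",  # uppercase: reported
        ("releases", 2, "ocid"): 12345,  # not a string: skipped by this check
    },
}

result = ocid_prefix_format(data_paths)
print(result)
```

Only the uppercase ocid is reported, together with its path as a slash-separated string.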
