Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(harvard_merger): add harvard_id field to OpinionCluster #4622

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

cweider
Copy link
Collaborator

@cweider cweider commented Oct 25, 2024

Harvard's Caselaw Access Project has been sunset. For projects which have existing references to CAP cases, there's a need to identify a CAP case's corresponding CL opinion cluster.

An indexed harvard_id column is added to OpinionCluster. The field is also added to the fields of OpinionClusterFilter.

For migration, this patch builds on work done in #4284 and #4442 and extends import_harvard_pdfs to populate the harvard_id column using CAP crosswalk file.

Fixes: #4313

`import_harvard_pdfs` has a `process_entry` method that is similar,
but different from the entry processing in `process_crosswalk_file`
and abandoned. Since a specialized method for processing an entry
is useful, this patch reconciles the two bits of functionality.

Finally, it adds a `--job` parameter to the command to make adding
other jobs relating to CAP migration easier.
Harvard's Caselaw Access Project has been sunset. For projects
which have existing references to CAP cases, there's a need to
identify a CAP case's corresponding CL opinion cluster.

An indexed `harvard_id` column is added to `OpinionCluster`. The
field is also added to the `fields` of `OpinionClusterFilter`.

For migration, this patch builds on work done in freelawproject#4284 and freelawproject#4442
and extends `import_harvard_pdfs` to populate the `harvard_id`
column using CAP crosswalk file.

Fixes: freelawproject#4313
cl/search/models.py Outdated Show resolved Hide resolved
self.assertTrue(
os.path.exists(crosswalk_dir),
f"Crosswalk directory does not exist: {crosswalk_dir}",
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything above is copy-paste from test_import_harvard_pdfs above. I don’t love it, but “good enough”? Nudge me if it needs to be better.

choices=["import_pdf"],
default="import_pdf",
help="",
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see other things running through the crosswalk file to do work (#4614). I’ve no sound opinion for what the “correct” approach is. jobs here is driven by my interest in not copy/pasting (and then owning) the moderate complexity surrounding crosswalk processing. @jtmst, I’m happy to take any direction that you can give.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a quick drive-by and didn't review your code in the management command, nor the tests, but here are a few bits of feedback. Thanks for doing this.

cl/search/models.py Outdated Show resolved Hide resolved
cl/search/models.py Outdated Show resolved Hide resolved
cl/search/models.py Outdated Show resolved Hide resolved
- Use better indexing approach
- Use IntegerField
- Make nullable
self.processed_pdfs.add(pdf_path)

except OpinionCluster.DoesNotExist:
logger.info(f"Cluster not found for id: {cl_cluster_id}")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tempted to rely on the caller’s exception handler to do this logging ¯_(ツ)_/¯

@quevon24
Copy link
Member

@cweider The import_harvard_pdfs command has been updated, you may have conflicts in the PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 📋 Backlog
Development

Successfully merging this pull request may close these issues.

Feature request: associate CAP IDs with clusters and allow querying clusters on that basis
3 participants