-
-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(harvard_merger): add harvard_id
field to OpinionCluster
#4622
base: main
Are you sure you want to change the base?
Conversation
`import_harvard_pdfs` has a `process_entry` method that is similar, but different from the entry processing in `process_crosswalk_file` and abandoned. Since a specialized method for processing an entry is useful, this patch reconciles the two bits of functionality. Finally, it adds a `--job` parameter to the command to make adding other jobs relating to CAP migration easier.
Harvard's Caselaw Access Project has been sunset. For projects which have existing references to CAP cases, there's a need to identify a CAP case's corresponding CL opinion cluster. An indexed `harvard_id` column is added to `OpinionCluster`. The field is also added to the `fields` of `OpinionClusterFilter`. For migration, this patch builds on work done in freelawproject#4284 and freelawproject#4442 and extends `import_harvard_pdfs` to populate the `harvard_id` column using CAP crosswalk file. Fixes: freelawproject#4313
self.assertTrue( | ||
os.path.exists(crosswalk_dir), | ||
f"Crosswalk directory does not exist: {crosswalk_dir}", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything above is copy-paste from test_import_harvard_pdfs
above. I don’t love it, but “good enough”? Nudge me if it needs to be better.
choices=["import_pdf"], | ||
default="import_pdf", | ||
help="", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see other things running through the crosswalk file to do work (#4614). I’ve no sound opinion for what the “correct” approach is. jobs
here is driven by my interest in not copy/pasting (and then owning) the moderate complexity surrounding crosswalk processing. @jtmst, I’m happy to take any direction that you can give.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did a quick drive-by and didn't review your code in the management command, nor the tests, but here are a few bits of feedback. Thanks for doing this.
- Use better indexing approach - Use IntegerField - Make nullable
self.processed_pdfs.add(pdf_path) | ||
|
||
except OpinionCluster.DoesNotExist: | ||
logger.info(f"Cluster not found for id: {cl_cluster_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tempted to rely on the caller’s exception handler to do this logging ¯_(ツ)_/¯
@cweider The import_harvard_pdfs command has been updated, you may have conflicts in the PR |
Harvard's Caselaw Access Project has been sunset. For projects which have existing references to CAP cases, there's a need to identify a CAP case's corresponding CL opinion cluster.
An indexed
harvard_id
column is added toOpinionCluster
. The field is also added to thefields
ofOpinionClusterFilter
.For migration, this patch builds on work done in #4284 and #4442 and extends
import_harvard_pdfs
to populate theharvard_id
column using CAP crosswalk file.Fixes: #4313