-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bibupload: Also use CFG_OAI_ID_FIELD for deduping #2816
bibupload: Also use CFG_OAI_ID_FIELD for deduping #2816
Conversation
In case of replicating datasets between instances by means of oai harvesting it is plausible to check for external OAI-IDs not only in the field specified by CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG but also in CFG_OAI_ID_FIELD. Enh: find_records_from_extoaiid() to call find_record_from_oaiid() in case no records were found in CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG. Enh/Fix: find_record_from_oaiid() always returned the first record with a given OAI-ID assuming that it is unique. However, due to manual intervention one may have accidentially produced dupes here. Thus check, if more than one record with an OAI-ID exists. If so, check if all records except one are deleted. In this case return this surviving record, as cataloguers have resolved dupes manually. In case we still have dupes resort to the old behaviour for sake of compatibility, but at least throw a warning. This is somewhat of a TODO. Probably one should refuse merge here and wait for manual curation (to be discussed). addresses inveniosoftware#2812
for r in res: | ||
lst.append('recid:"%s"' % str(r[0])) | ||
candidates = search_pattern(p=' or '.join(lst)) | ||
deleted = search_pattern(p='980__c:"DELETED"') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the defintion of deleted in Invenio is a bit larger: any subfield of 980 could in principle contain the word DELETED
(at least that is what is assumed as convention in bibupload
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it shold be 980%%:"DELETED"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's perfect 980:"DELETED"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted = search_pattern(p='980__c:"DELETED"')
It's perfect 980:"DELETED"
This and a minor fix for handling the return of None is in the updated PR. (BTW: shouln't the intbitset.add(None) just work?)
Kind regards,
Alexander Wagner
Deutsches Elektronen-Synchrotron DESY
Library and Documentation
Building 01d Room OG1.444
Notkestr. 85
22607 Hamburg
phone: +49-40-8998-1758
fax: +49-40-8994-1758
e-mail: [email protected]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(BTW: shouln't the
intbitset.add(None)
just work?)
Nope, intbitset.add
expect always a positive integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kaplun yes it is, and I missed it. It was more a suggestion if it wouldn't be good to allow intibset.add(None)
as a "do nothing". (Looks logical at first glance.)
Fix: treat all 980:"DELETED" entries as deleted records.
Fix: in case no record is found, find_record_from_oaiid() returns `None`. intbitset.add(None) seems not to be implemented, so it fails. Check if the result is None before adding.
@aw-bib @kaplun I'm not sure whether checking all 980 subfields is the right way... We have "advertised" 980 $c to be the canonical place to store information about deteled records, e.g. this is what the record merges does, e.g. we have many places in the code that do:
We'd probably have to amend this more globally... |
# in the locally used CFG_OAI_ID_FIELD that matches. This can happen | ||
# if oai harvesting is used to duplicate records. | ||
recid = find_record_from_oaiid(extoaiid) | ||
if recid != None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if recid is not None
is better style, see:
@aw-bib Any progress with this PR regarding my comments on 980 DELETED and on recid None? I'd like to release 1.1.7 in the next few days... Can you please have a look or shall we aim at 1.1.8? |
Use the cleaner "is not None" instead of `!= None` Addresses inveniosoftware#2812
WRT WRT |
980__c would be safer, 980 would be more liberal, at a certain risk. Well, chances are people aren't misusing the word DELETED in fields like 98012, 980zz, so the change should be basically OK... |
* FIX In case of replicating datasets between instances by means of OAI harvesting it is plausible to check for external OAI-IDs not only in the field specified by CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG but also in CFG_OAI_ID_FIELD. (closes inveniosoftware#2812) (PR inveniosoftware#2816) * Note: find_record_from_oaiid() always returned the first record with a given OAI-ID assuming that it is unique. However, due to manual intervention one may have accidentially produced dupes here. Thus check, if more than one record with an OAI-ID exists. If so, check if all records except one are deleted. In this case return this surviving record, as cataloguers have resolved dupes manually. In case we still have dupes resort to the old behaviour for sake of compatibility, but at least throw a warning. This is somewhat of a TODO. Probably one should refuse merge here and wait for manual curation (to be discussed).
@tiborsimko how to proceed? Should I do anything? Do you want to adopt 980__c? |
@tiborsimko, IMHO, although bibedit will write in 980__c, I believe we can safely check upon reading in all 980 subfields. No reasonable admin would call a collection tag This would also follow the https://en.wikipedia.org/wiki/Robustness_principle |
Nope, I squashed your commits and made a new PR #3721 so that the release notes would be nicely generated. There is nothing more to do on your side 😄 |
Closed by #3721. |
In case of replicating datasets between instances by means of oai harvesting it
is plausible to check for external OAI-IDs not only in the field specified by
CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG but also in CFG_OAI_ID_FIELD.
Enh: find_records_from_extoaiid() to call find_record_from_oaiid() in case no
records were found in CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG.
Enh/Fix: find_record_from_oaiid() always returned the first record with
a given OAI-ID assuming that it is unique. However, due to manual intervention
one may have accidentially produced dupes here. Thus check, if more than one
record with an OAI-ID exists. If so, check if all records except one are
deleted. In this case return this surviving record, as cataloguers have resolved
dupes manually. In case we still have dupes resort to the old behaviour for sake of
compatibility, but at least throw a warning. This is somewhat of a
TODO. Probably one should refuse merge here and wait for manual
curation (to be discussed).
addresses #2812