Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bibupload: Also use CFG_OAI_ID_FIELD for deduping #2816

Closed

Conversation

aw-bib
Copy link
Contributor

@aw-bib aw-bib commented Feb 24, 2015

In case of replicating datasets between instances by means of oai harvesting it
is plausible to check for external OAI-IDs not only in the field specified by
CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG but also in CFG_OAI_ID_FIELD.

Enh: find_records_from_extoaiid() to call find_record_from_oaiid() in case no
records were found in CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG.

Enh/Fix: find_record_from_oaiid() always returned the first record with
a given OAI-ID assuming that it is unique. However, due to manual intervention
one may have accidentially produced dupes here. Thus check, if more than one
record with an OAI-ID exists. If so, check if all records except one are
deleted. In this case return this surviving record, as cataloguers have resolved
dupes manually. In case we still have dupes resort to the old behaviour for sake of
compatibility, but at least throw a warning. This is somewhat of a
TODO. Probably one should refuse merge here and wait for manual
curation (to be discussed).

addresses #2812

In case of replicating datasets between instances by means of oai harvesting it
is plausible to check for external OAI-IDs not only in the field specified by
CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG but also in CFG_OAI_ID_FIELD.

Enh: find_records_from_extoaiid() to call find_record_from_oaiid() in case no
records were found in CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG.

Enh/Fix: find_record_from_oaiid() always returned the first record with
a given OAI-ID assuming that it is unique. However, due to manual intervention
one may have accidentially produced dupes here. Thus check, if more than one
record with an OAI-ID exists. If so, check if all records except one are
deleted. In this case return this surviving record, as cataloguers have resolved
dupes manually. In case we still have dupes resort to the old behaviour for sake of
compatibility, but at least throw a warning. This is somewhat of a
TODO. Probably one should refuse merge here and wait for manual
curation (to be discussed).

addresses inveniosoftware#2812
for r in res:
lst.append('recid:"%s"' % str(r[0]))
candidates = search_pattern(p=' or '.join(lst))
deleted = search_pattern(p='980__c:"DELETED"')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the defintion of deleted in Invenio is a bit larger: any subfield of 980 could in principle contain the word DELETED (at least that is what is assumed as convention in bibupload)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it shold be 980%%:"DELETED"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's perfect 980:"DELETED"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  •        deleted    = search_pattern(p='980__c:"DELETED"')
    

It's perfect 980:"DELETED"

This and a minor fix for handling the return of None is in the updated PR. (BTW: shouln't the intbitset.add(None) just work?)

Kind regards,

Alexander Wagner

Deutsches Elektronen-Synchrotron DESY
Library and Documentation

Building 01d Room OG1.444
Notkestr. 85
22607 Hamburg

phone: +49-40-8998-1758
fax: +49-40-8994-1758
e-mail: [email protected]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(BTW: shouln't the intbitset.add(None) just work?)

Nope, intbitset.add expect always a positive integers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaplun yes it is, and I missed it. It was more a suggestion if it wouldn't be good to allow intibset.add(None) as a "do nothing". (Looks logical at first glance.)

Fix: treat all 980:"DELETED" entries as deleted records.
Fix: in case no record is found, find_record_from_oaiid() returns `None`.
intbitset.add(None) seems not to be implemented, so it fails. Check if the
result is None before adding.
@jirikuncar jirikuncar added this to the v1.1.x milestone Feb 26, 2015
@tiborsimko tiborsimko modified the milestones: v1.1.6, v1.1.7 May 20, 2015
@tiborsimko
Copy link
Member

@aw-bib @kaplun I'm not sure whether checking all 980 subfields is the right way... We have "advertised" 980 $c to be the canonical place to store information about deteled records, e.g. this is what the record merges does, e.g. we have many places in the code that do:

if "DELETED" in self.get_field(recID, "980__c"):

We'd probably have to amend this more globally...

# in the locally used CFG_OAI_ID_FIELD that matches. This can happen
# if oai harvesting is used to duplicate records.
recid = find_record_from_oaiid(extoaiid)
if recid != None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiborsimko tiborsimko assigned aw-bib and unassigned tiborsimko Sep 9, 2015
@jirikuncar jirikuncar assigned aw-bib and unassigned aw-bib Jul 25, 2016
@tiborsimko
Copy link
Member

@aw-bib Any progress with this PR regarding my comments on 980 DELETED and on recid None? I'd like to release 1.1.7 in the next few days... Can you please have a look or shall we aim at 1.1.8?

@tiborsimko tiborsimko modified the milestones: v1.1.8, v1.1.7 Nov 14, 2016
Use the cleaner "is not None" instead of `!= None`

Addresses inveniosoftware#2812
@aw-bib
Copy link
Contributor Author

aw-bib commented Nov 14, 2016

WRT 980__c/980: the first version of the PR used only 980__c the later all 980. Probably, it's up to a decision what to use.

WRT is not None hope I found all of those :)

@tiborsimko
Copy link
Member

Probably, it's up to a decision what to use.

980__c would be safer, 980 would be more liberal, at a certain risk. Well, chances are people aren't misusing the word DELETED in fields like 98012, 980zz, so the change should be basically OK...

@tiborsimko tiborsimko modified the milestones: v1.1.7, v1.1.8 Nov 18, 2016
tiborsimko pushed a commit to tiborsimko/invenio that referenced this pull request Nov 18, 2016
* FIX In case of replicating datasets between instances by means of OAI
  harvesting it is plausible to check for external OAI-IDs not only in the field
  specified by CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG but also in CFG_OAI_ID_FIELD.
  (closes inveniosoftware#2812) (PR inveniosoftware#2816)

* Note: find_record_from_oaiid() always returned the first record with a given
  OAI-ID assuming that it is unique. However, due to manual intervention one may
  have accidentially produced dupes here. Thus check, if more than one record
  with an OAI-ID exists. If so, check if all records except one are deleted. In
  this case return this surviving record, as cataloguers have resolved dupes
  manually. In case we still have dupes resort to the old behaviour for sake of
  compatibility, but at least throw a warning. This is somewhat of a TODO.
  Probably one should refuse merge here and wait for manual curation (to be
  discussed).
@aw-bib
Copy link
Contributor Author

aw-bib commented Nov 18, 2016

@tiborsimko how to proceed? Should I do anything? Do you want to adopt 980__c?

@kaplun
Copy link
Member

kaplun commented Nov 18, 2016

@tiborsimko, IMHO, although bibedit will write in 980__c, I believe we can safely check upon reading in all 980 subfields. No reasonable admin would call a collection tag DELETED.

This would also follow the https://en.wikipedia.org/wiki/Robustness_principle

@tiborsimko
Copy link
Member

how to proceed? Should I do anything?

Nope, I squashed your commits and made a new PR #3721 so that the release notes would be nicely generated. There is nothing more to do on your side 😄

@tiborsimko
Copy link
Member

Closed by #3721.

@tiborsimko tiborsimko closed this Nov 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants