Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework journal abbreviation caching #6304

Merged
merged 13 commits into from
May 1, 2020
Merged

Rework journal abbreviation caching #6304

merged 13 commits into from
May 1, 2020

Conversation

tobiasdiez
Copy link
Member

@tobiasdiez tobiasdiez commented Apr 17, 2020

Rework the caching of the journal abbreviation list in order to reduce the memory footprint.
Instead of the plain in-memory cache, this PR now uses the h2 MVStore which is a light-weight file-based "database". I couldn't notice any performance degradation but a huge improvement in the memory department.

From:
image
To:
image
Thus about 30% reduction. Which leads to a very reasonable memory usage: 400mb with a db of 1k entries. For larger databases the auto completion still leads to problems (see below, which was before the changes of this PR) - but this will be a new PR.
image
(The first two entries belong to the classloader and thus represent all libraries that we use).

Further changes:

Further remarks:

@jlaehne
Copy link
Contributor

jlaehne commented Apr 18, 2020

  • There is an IEEE list in the abbreviation repository, so probably the "official IEEE" list from JabRef would have to be backported and merged with that one, making sure that changes listed in Refresh journal lists periodically #5749 are also included. Then this list could be included in the automatic merge script, which would probably make more sense than providing a separate list in JabRef
  • Instead JabRef should probably provide a list with dots and one without dots to choose from, currently they are created from two separate subsets of the repisitory and each has about 25k entries (the one without dots could be enriched by a merge with the other list after removing the dots).
  • The 100k+ entries stem from including the webofscience lists from the repository. However, I would advise against including these lists in the official list, as detailed in Refresh journal lists periodically #5749 (comment) and Refresh journal lists periodically #5749 (comment)
  • The file merge should be automatically called from https://github.com/JabRef/jabref/pull/5749/files#diff-fde893913708a95fc89371a85acfcaf1 , which I guess is run during build and then the conversion to the MVStore should be called right afterwards (in the same script?)

@tobiasdiez tobiasdiez requested a review from koppor April 18, 2020 23:24
@koppor
Copy link
Member

koppor commented Apr 23, 2020

  • I would propose to remove the "official IEEE abbreviation list" which changes e.g. "IEEE Transactions" to the bibtex string #ieeetrans#.

On the one hand, this feature makes JabRef special. On the other hand, this is a hidden gem.

This strongly relates to our improved typed @STRING handling: https://docs.jabref.org/fields/strings

In case users demand for this feature, they can "just" use the "IEEE abbreviation list", can't they?

  • The current implementation uses a simple dictionary between full and abbreviated journal name for the built-in abbreviations. Thus, the "shortest abbreviation" or the "medline abbreviation" are ignored (there were like 2 journals that had this information...).

Please do not drop this feature. I know that currently no one committed updated these abbreviations YET. Nevertheless, I believe in use cases where we really need that.

It should not be too hard to offer three different kind of journal name variants.

The abbreviation list needs to be converted from the csv format to the MVStore format. This is done in JournalAbbreviationLoader#writeDefaultDatabase. I guess the best way would be to call this method as part of the build process or upon changing the abbreviation list. Any suggestions on how and when to do this?

+1 for build process.

I plan to keep https://abbrv.jabref.org/ and enhance the abbreviation service offered there. The tooling to handle MVStore format is not as common as tooling for handling CSV files (looking at Python, Java, LibreOffice, GitHub browsing).

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just small comments. Have to open the code in the IDE for a detailed feedback and possible code suggestions.

Added ADR-0010 nevertheless. Maybe, we have other reasons for H2?

@koppor
Copy link
Member

koppor commented Apr 23, 2020

Maybe, I should schedule a mob programming session at https://github.com/remotemobprogramming/upcoming-sessions to a) work on with-dot and dot-less and b) the three-state journal abbreviation?

@tobiasdiez
Copy link
Member Author

Thanks @jlaehne and @koppor for the input. I've now updated the code and the PR description accordingly.

I ignored the doted version as well as the shortest abbreviation for now. In theory, this can be added but needs more work (need to worry about serialization in MvStore). Thus, this can be done at a later point if users really requests it (which I doubt to be honest).

@tobiasdiez tobiasdiez marked this pull request as ready for review April 24, 2020 16:32
@tobiasdiez tobiasdiez added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Apr 24, 2020
@tobiasdiez tobiasdiez added this to the v5.1 milestone Apr 24, 2020
apply plugin: 'java'

repositories {
mavenLocal()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maven local can be removed (only useful if you have gradle + maven builds in your system), will speed up the build.

return false;
}
AbbreviationViewModel that = (AbbreviationViewModel) o;
return getName().equals(that.getName()) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Objects.equals(...) to make it more consistent with other implementations

URL url = Objects.requireNonNull(JournalAbbreviationRepository.class.getResource(Objects.requireNonNull(resourceFileName)));
readJournalList(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
InputStream stream = JournalAbbreviationRepository.class.getResourceAsStream(resourceFileName);
readJournalList(new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible resource leak? I don't see where the reader is closed.

return abbreviations.stream().anyMatch(abbreviation -> isMatchedAbbreviated(journalName.trim(), abbreviation));
String journal = journalName.trim();

boolean isAbbreviated = customAbbreviations.stream().anyMatch(abbreviation -> isMatchedAbbreviated(journal, abbreviation));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine both returns with an OR?

@tobiasdiez
Copy link
Member Author

Thanks for the review @Siedlerchr. Feedback is now included.

@Siedlerchr
Copy link
Member

Codewise looks good, we just have to test the changes now in the jlink/jpackage variant to make sure the classpath resources etc are correctly found

@koppor
Copy link
Member

koppor commented Apr 29, 2020

@Siedlerchr That means, the only thing left to test is to download the binary from https://builds.jabref.org/pull/6304/merge/ and to execute it?

@koppor
Copy link
Member

koppor commented Apr 29, 2020

Does not work with jabref-authors.bib.

  1. Open jabref-authors.bib

  2. Abbreviate Journal names --> "MEDLINE"

  3. 9 changes

    One wrong: grafik

  4. Save

  5. changes in bib file are there

  6. "Unnabbreviate journal names"

  7. 11 changes (according to pop up) (or 3 changes)

  8. Save

  9. No changes in the bib file are shown

Is "MEDLINE" now forbidden?

@koppor
Copy link
Member

koppor commented Apr 29, 2020

In short: When "unabbreviating", no changes are written. Just load jabref-authors.bib and execute "Unnabreviate" and save. No changes in .bib file.

@tobiasdiez
Copy link
Member Author

Good observation, there was indeed a small bug in the code. Unnabbreviation of names with dots should work now. As I said above, the built-in list does no longer contain names without dots, so medline unabbreviation does indeed no longer work (something for a new PR in my opinion).

Copy link
Member

@calixtus calixtus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally had some time to walk through this PR.
Got two nitpicks on codestyle, one about injection and one about a comment.

private final ExporterViewModel exporter;
@FXML private Button browseButton;
@FXML private TextField name;
@FXML private TextField fileName;
@FXML private TextField extension;
@FXML private ButtonType saveExporter;
@Inject private DialogService dialogService;
@Inject private PreferencesService preferences;
@Inject private final DialogService dialogService;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Inject does not seem right, as DialogService and PreferencesService are injected by the constructor in this class too...

CreateModifyExporterDialogView dialog = new CreateModifyExporterDialogView(null, dialogService, preferences,
loader);
CreateModifyExporterDialogView dialog = new CreateModifyExporterDialogView(null, dialogService, preferences
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental line break?

public LayoutFormatterPreferences getLayoutFormatterPreferences(JournalAbbreviationRepository repository) {
return new LayoutFormatterPreferences(getNameFormatterPreferences(),
getFileLinkPreferences(),
repository);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation seems off

// Abbreviation equality is tested on name only, so we might have to remove an old abbreviation
abbreviations.remove(abbreviation);
abbreviations.add(abbreviation);
// We do not want to keep duplicates, thus remove the old abbreviation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following two lines are a bit irritating for Java beginners, so I would suggest to keep the old comment on them.

@tobiasdiez
Copy link
Member Author

Thanks @calixtus for the feedback. Your suggestions are now included.

I'll merge this now. @koppor feel free to add additional comments if you think there is something that needs further work.

@tobiasdiez tobiasdiez merged commit a123eb1 into master May 1, 2020
@tobiasdiez tobiasdiez deleted the journalAbbFile branch May 1, 2020 12:15
@Siedlerchr
Copy link
Member

The files are generated every time on build time? So we probably need to add the -mv file to the gitignore file. Just noticed that

@tobiasdiez
Copy link
Member Author

At the moment yes, but I would propose to only add the mv file and remove the cls one: #5749 (comment)

Siedlerchr added a commit that referenced this pull request May 2, 2020
* upstream/master: (166 commits)
  New Crowdin translations (#6382)
  Update code-howtos.md (#6393)
  Fix jstyle was invalid with default section at the start (#6386)
  Correcting file name for groups.uml (#6373)
  Fix underscore character being omitted from file name in Recent Libraries list (#6389)
  Rework journal abbreviation caching (#6304)
  Fix selecting custom export for copy to clipboard with uppercase file ext (#6290)
  New Crowdin translations (#6375)
  Squashed 'src/main/resources/csl-styles/' changes from 143464e..906cd6d
  Fixes #6357: File directory (#6377)
  Disable the generate button if the ID field is empty (#6371)
  Fix Preferences style value too long (#6372)
  Fix various Dark theme issues (#6368)
  Correct label name in dependabot
  Bump java-diff-utils from 4.5 to 4.7 (#6365)
  Try with info.plist.template also (#6366)
  Fix wrong button order (Apply and Cancel) in ManageProtectedTermsDialog. (#6358)
  Bump flexmark-ext-gfm-strikethrough from 0.61.6 to 0.61.20 (#6361)
  Bump checkstyle from 8.31 to 8.32 (#6360)
  Bump flexmark-ext-gfm-tasklist from 0.61.16 to 0.61.20 (#6364)
  ...
koppor pushed a commit that referenced this pull request Jan 1, 2023
43566f2 Update molecular-oncology.csl (#6354)
38f2b5f Update san-francisco-estuary-and-watershed-science.csl (#6350)
724cb12 Update australasian-journal-of-philosophy.csl (#6344)
df6af86 Fix webpage in-text citation for council-of-science-editors-author-date.csl (#6318)
6900b58 Add month to magazine for Bluebook
e0a8148 Update cambridge-university-press-author-date-cambridge-a.csl (#6345)
0823448 Add space & comma before pp (#6343)
ff38cd2 Update american-journal-of-respiratory-and-critical-care-medicine.csl (#6341)
8727dfb Update early-music-history.csl (#6323)
1154354 Update boletin-de-pediatria.csl (#6310)
f25438e Update pravnik.csl (#6309)
db7a4ae Update zoological-journal-of-the-linnean-society.csl (#6304)
6c043a7 Create polygraphia.csl (#6307)
255e00c Create intellect-newgen-books.csl (#6308)
323629f Update historical-materialism.csl (#6300)
f62b70d Create european-review-of-international-studies.csl (#6301)
dff2698 Update ucl-university-college-harvard.csl (#6298)
0ce09c9 Update sciences-po-ecole-doctorale-note-french.csl (#6290)
bdd53ec Update sciences-po-ecole-doctorale-author-date.csl (#6291)
efde4d4 Create journal-of-law-medicine-ethics.csl (#6296)
7539b2c Create theses-de-sorbonne-universite.csl (#6295)
905f25a Update biochemical-society-transactions.csl (#6292)
a76a3f5 Update smithsonian-institution-scholarly-press author-date and note (#6294)
e6b6c6c Create exploration-of-targeted-anti-tumor-therapy.csl (#6276)
024c9c8 Create nys-nydanske-studier.csl (#6331)
d9ac8e1 Update vox-sanguinis.csl (#6327)
9a98e92 Remove DOI from Genetics & Molecular Biology

git-subtree-dir: buildres/csl/csl-styles
git-subtree-split: 43566f2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants