-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Dublin Core #3710
[WIP] Dublin Core #3710
Conversation
PdfXmpImporterTest:testImportEntries() PdfXmpImporterTest:testIsRecognizedFormat() Removed to much code when refactoring the XMPUtil. Non XMP metadata are also relevent, when retrieving org.apache.pdfbox.pdmodel.PDDocumentInformation
Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migritate from jempbox to xmpbox. See pull JabRef#1096. Next step: Writing test cases for XMPUtil (DublinCore).
build.gradle
Outdated
compile 'org.apache.pdfbox:pdfbox:1.8.13' | ||
compile 'org.apache.pdfbox:fontbox:1.8.13' | ||
compile 'org.apache.pdfbox:jempbox:1.8.13' | ||
// update possible because custom metadata format was dropped, https://github.com/JabRef/jabref/pull/3710 - https://github.com/JabRef/jabref/pull/1096 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just remove that line
Travis build failed because of an exceeded NPath complexity of a single method. Method was refactored and cleaned up.
|
||
BibEntry entry = new BibEntry(); | ||
|
||
// Contributor -> Editor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not comment these functions as the implementation might change... The comments could co at setEditor
etc. as method comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will change this with the next commit.
if (entries.isEmpty()) { | ||
System.err.println("Could not find BibEntry in " + args[0]); | ||
} else { | ||
XMPUtil.writeXMP(new File(args[1]), entries, result.getDatabase(), false, xmpPreferences); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this also be either a Path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The parameter is a file or sometimes also the filePath as a string is possible.
} | ||
|
||
if (args[0].endsWith(".bib") && args[1].endsWith(".pdf")) { | ||
ParserResult result = new BibtexParser(importFormatPreferences, Globals.getFileUpdateMonitor()).parse(new FileReader(args[0])); | ||
try (FileReader reader = new FileReader(args[0])) { | ||
ParserResult result = new BibtexParser(importFormatPreferences, Globals.getFileUpdateMonitor()).parse(reader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, but does the parser accepts an inputStream? Then you could use Files.newInputStream...
would be actually preferable over a fileReader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No overloaded parse method, which would accept an InputStream.
The tests cover the most important use cases, which include reading and writing metadata from pdf files. Both formats, DublinCore and PDMetadata (which are no XMP metadata) are tested.
This pull request is ready for the final review and for merging into master :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work. Finally, something is moving concerning pdf support 🥇 .
The code looks pretty good and I only have a bunch of minor remarks.
throw new EncryptedPdfsNotSupportedException(); | ||
} | ||
} | ||
public static PDDocument loadWithAutomaticDecryption(File file) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use Path
as long as possible (i.e. until we use an interface that we cannot control and which expects a traditional File
). This remark applies to this method and a few other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
BibDatabase database, XMPPreferences xmpPreferences) throws IOException, TransformerException { | ||
XMPUtil.writeXMP(new File(fileName), entry, database, xmpPreferences); | ||
BibDatabase database, XMPPreferences xmpPreferences) throws IOException, TransformerException { | ||
XMPUtil.writeXMP(Paths.get(fileName), entry, database, xmpPreferences); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are already touching most of the code in this class, it would be nice if you could rework the whole class from a collection of static helper methods to a "normal" class with instance methods. It appears that the file is passed to every single method so this is a natural candidate for a constructor argument. I'm not sure how much code is shared between writing and reading of xmp, but maybe it makes sense to split the class in an XmpReader
and XmpWriter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to separete the logic in two utility classes and a shared utitlity class.
Currently I don't see the benefit of getting "normal" reader and writer. I introduced two Extractors for the DocumentInformation and the DublinCore format. Maybe that's what you have intended.
I pushed a WIP state for another review. Maybe you have further suggestions, how to structure the package.
|
||
if (entry.isPresent()) { | ||
if (entry.get().getType() == null) { | ||
entry.get().setType(BibEntry.DEFAULT_TYPE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please move the part, where the type is set to a default, to the getBibtexEntry
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
public void testReadEmtpyMetadata() throws IOException, URISyntaxException { | ||
List<BibEntry> entries = XMPUtil.readXMP(XMPUtil.class.getResource("/org/jabref/logic/xmp/empty_metadata.pdf").toURI().getPath(), xmpPreferences); | ||
Assert.assertEquals(0, entries.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertEquals(Collections.emptyList, entries)
to get a better error message in case the test fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
"Patterson, David and Arvind and Asanov\\'\\i{}c, Krste and Chiou, Derek and Hoe, James and Kozyrakis, Christos and Lu, S{hih-Lien} and Oskin, Mark and Rabaey, Jan and Wawrzynek, John"); | ||
// read a bib entry from the tests before | ||
List<BibEntry> entries = XMPUtil.readXMP(XMPUtil.class.getResource("/org/jabref/logic/xmp/PD_metadata.pdf").toURI().getPath(), xmpPreferences); | ||
BibEntry entry = entries.get(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is easier if you just create the entry by hand (and then write it, read and compare).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no further reason (besides simplicity), I would stay with this.
* Make sure that the privacy filter works. | ||
* The month attribute in DublinCore is the complete name of the month, e.g. March. | ||
* In JabRef, the format is #mar# instead. To get a working unit test, the JabRef's | ||
* bib-entry is altered from #mar# to {March}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why the month shouldn't be post-processed to get a proper bibtex value? We even have the Month
class which handles parsing and converting to the correct output.
compile 'org.apache.pdfbox:fontbox:1.8.13' | ||
compile 'org.apache.pdfbox:jempbox:1.8.13' | ||
compile 'org.apache.pdfbox:pdfbox:2.0.8' | ||
compile 'org.apache.pdfbox:fontbox:2.0.8' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further below (starting at line 218), we added exceptions for the update dependencies task. These are now invalid and should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
|
||
private void extractSubject() { | ||
String s = documentInformation.getSubject(); | ||
if (s != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We once agreed on avoiding single char variables. Please use a meaningful name here and for the others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
private void extractOtherFields() { | ||
COSDictionary dict = documentInformation.getCOSObject(); | ||
for (Map.Entry<COSName, COSBase> o : dict.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could encapsulate part of this as a stream. at least the filtering and mapping to the key.
https://www.mkyong.com/java8/java-8-filter-a-map-examples/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would stay with the implemented version. Currently it works and I'm not a stream fanatic 😋
*/ | ||
private void extractAuthor() { | ||
List<String> creators = dcSchema.getCreators(); | ||
if ((creators != null) && !creators.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a method in our own StringUtil class which combines both checks: IsNullOREmpty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with 366aeed. Please check the conditions again. It's not that natural to negate the isNullOrEmpty... Maybe another function would help in the StringUtils.
For my general understanding and my newbie state:
For me, it makes no sense to write more than one BibEntry to the metadata of a PDF file. Currently the implementation in XMPUtilWriter is also limited to a single element, but implemented as a list. (See line 260 and 110). Can I drop the list? |
List<String> dates = dcSchema.getUnqualifiedSequenceValueList("date"); | ||
if ((dates != null) && !dates.isEmpty()) { | ||
String date = dates.get(0).trim(); | ||
Calendar c = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please stick to the java 8 new date and time api: Use a DateTime or just a DateFormatter
DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");
LocalDateTime dateTime = LocalDateTime.parse(str, formatter);
*/ | ||
private void extractBibTexFields() { | ||
List<String> relationships = dcSchema.getRelations(); | ||
if (relationships != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use a stream with filter and map
*/ | ||
private void extractType() { | ||
List<String> l = dcSchema.getTypes(); | ||
if ((l != null) && !l.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid single char vairbales
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with 366aeed
continue; | ||
} | ||
|
||
if (FieldName.EDITOR.equals(field.getKey())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a switch/case here, but that's just my style. The others don't like them that much ;) So you can change it or leave it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I leave it :)
try (PDDocument document = loadWithAutomaticDecryption(path)) { | ||
Optional<XMPMetadata> meta = XMPUtilReader.getXMPMetadata(document); | ||
|
||
if (meta.isPresent()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use chained maps on the optionals for each step:
This should work
meta.map(m->m.getDublinCore().map(m->new DublinCore...).map(entry->extractBibEntry).ifPresent(result::add)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the current state more readable than the chained map operations... Do you have another suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The chanining has the improvement that you don't have to care about nulls or empty optionals inside. The code at the end is only executed if none of the values is null or empty.
Did not see your question. Theoretically it could be that someone has CrossRef linked bibentries, (e.g. a bib entry with a book and another one with a chapter referring the book).
but that's really and edge case. I think for most uses cases one entry is enough. And if the code doesn't support it, then drop it. let's see what @koppor has to say
LOGGER.info("Encryption not supported by XMPUtil"); | ||
return false; | ||
} catch (IOException e) { | ||
// happens if no metadata is found, no reason to log the exception |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please always log exceptions, at least add a debug. Just to make sure that not any other underlying exception is swallowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks for the comment 👍
*/ | ||
public static void writeXMP(Path file, BibEntry entry, | ||
BibDatabase database, XMPPreferences xmpPreferences) throws IOException, TransformerException { | ||
List<BibEntry> l = new LinkedList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason for a linked list? In most cases ArrayList is sufficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a question, I posted above. This is code from a previous author and I want to drop the List at all. Please comment my question above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general the code looks good, just some code style improvements.
I have no real idea concerning your question about the list. I would say you could safely remove it, but I'm not sure. What happens if multiple entries have the same file linked and write their metadata into it (e.g. the pdf is a book and I have a bunch of |
XMPUtilWriter, line 147,148 Before new metadata is written to the PDF, all DublinCoreSchemas are removed. So I think, that there is currently no option to write more than one metaschema. |
Well, does the Dublin core schema specify. Multiple entries? |
Specifications are available at. http://dublincore.org/specifications/. The abstract model is there: http://dublincore.org/documents/abstract-model/ In case, I interpret that correctly, one record contains one description, which may contain multiple record sets. This is the edge case, where one wants to write XMP to a proceedings. Example:
|
XMPUtilWriter supports mutliple metadata entries in dublinCore and a single entry in the PDDocumentInformation.
" urldate = {2017-05-31},\r\n" + | ||
"}"; | ||
|
||
private static final String vapnik2000 = "@Book{Vapnik2000,\r\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create the BibEntries by hand (using new BibEntry(), setField
) and not based on the string representation. The XMP test should be as autonomous as possible, especially they shouldn't fail if the BibParser is changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
Reading mulitple BibEntries in DublinCore format now also works. If you want to test the reading of multiple entries, the PDF file JabRef_multipleMetaEntries.pdf contains three metadata entries in DublinCore for testing locally.
@@ -30,7 +35,7 @@ private XMPUtilReader() { | |||
* @param path The path to read the XMPMetadata from. | |||
* @return The XMPMetadata object found in the file | |||
*/ | |||
public static Optional<XMPMetadata> readRawXMP(Path path) throws IOException { | |||
public static Optional<List<XMPMetadata>> readRawXMP(Path path) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A List
should always be non-null and thus it does not makes sense to wrap an Optional
around it. The not-present case corresponds to an empty list, which you can check using isEmpty()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this comment. Done 👍
Imports all metadata entries of a PDF file.
I considered all comments to the source code and refactored my code. The last commit alters the behavior of the XMP import: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Micro comments at the test cases.
@Test | ||
public void testReadArticleDublinCoreReadXMP() throws IOException, URISyntaxException, ParseException { | ||
|
||
Path path = Paths.get(XMPUtilShared.class.getResource("/org/jabref/logic/xmp/article_dublinCore.pdf").toURI()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does one really need the absolute path here? I thought, just the filename is enough, because it looks up in the current directory (where the resources are mirrored to). If this is the package org.jabref.logic.xmp, it should "just work".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I do not need the absolute path 👍
*/ | ||
@Test | ||
public void testReadArticleDublinCoreReadXMP() throws IOException, URISyntaxException, ParseException { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove empty line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* The month attribute in DublinCore is the complete name of the month, e.g. March. | ||
* In JabRef, the format is #mar# instead. To get a working unit test, the JabRef's | ||
* bib-entry is altered from #mar# to {March}. | ||
* <p/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this and the next line. The next line should be included in the method name. Maybe, the method has to renamed to testReadArticleDublinCoreReadRawXMP
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course 👍
* <p/> | ||
* Tests the readRawXMP - method | ||
* | ||
* @throws IOException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove empty @throws
comments. It is already clear by the method signature which exceptions are thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -0,0 +1,137 @@ | |||
package org.jabref.logic.xmp; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one empty line (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -29,7 +29,7 @@ | |||
import org.jabref.logic.l10n.Localization; | |||
import org.jabref.logic.util.UpdateField; | |||
import org.jabref.logic.util.io.FileUtil; | |||
import org.jabref.logic.xmp.XMPUtil; | |||
import org.jabref.logic.xmp.XMPUtilShared; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename the class to XmpUtilShared
to follow Google's Casing rules? (Similar for method names, ...)
Renamed (hopefully) all occurences of XMP to Xmp to stay with the Google's Casing rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks a lot for attacking such a huge task right as one of your first projects. Looking forward to see more PRs from you in the future.
* upstream/master: (94 commits) Add missing localization for Any file Refactor dublin core utility (#3756) Add Localization Update Architecture Tests to catch static imports (#3766) Added <any file type> to the Import File Filter Dialog. Don't trim when migrating review field (#3761) Reorder again Rename confirmation into "Merge fields" Fix logic Reorder checklist in PR template and add "good commit message" Replace x11 by unity7 Include desktop, desktop-legacy, wayland in snapcraft.yaml Improve Dublin Core (#3710) Incorporate suggestions by @Siedlerchr Update JUnit from 5.1.0-M2 -> 5.1.0 Update Mockito from 2.13.0 -> 2.15.0 Update wiremock from 2.14.0 -> 2.15.0 Fix exceptions for jacoco update gradle Add link to contribute.jabref.org (#3748) ... # Conflicts: # src/main/java/org/jabref/gui/JabRefFrame.java
Fixes #938 : Only supporting Dublin Core
Removed JabRef namespace for XMP 'xmlns:bibtex'.
Removed some unused methods in gui/importer/EntryFromPDFCreator.java (discussed with @koppor and @stefan-kolb)
Deleted tests for XMPUtil. (New ones are in progress).