ReadPDFs incorrectly extracts DOI with line break #95

Closed
rmcd1024 opened this issue Sep 9, 2022 · 2 comments

rmcd1024 commented Sep 9, 2022

I have attached the title page of an article. The DOI has a line break coinciding with a dash. It is extracted as the first form below rather than the second:

10.1146/annurev-financial-010421085556 ## missing a dash

10.1146/annurev-financial-010421-085556 ## Correct, returns a bibtex entry on CrossRef.

Here is the error:

> RefManageR::ReadPDFs('page1.pdf')
Getting Metadata for 1 pdfs...
## Ignore the following line, this is an artifact created by extracting the first page using pdftk
Command Line Error: Wrong page range given: the first page (2) can not be after the last page (1). 
Getting 1 BibTeX entries from CrossRef...
Server error [404] for doi “10.1146/annurev-financial-010421085556”, you may want to try again, or BibTeX 
unavailable for this doi

pdfinfo is version 3.03 and poppler-utils is version 0.86.1-0ubuntu1

sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] RefManageR_1.3.0

loaded via a namespace (and not attached):
[1] httr_1.4.4 compiler_4.2.1 magrittr_2.0.3 plyr_1.8.7
[5] R6_2.5.1 generics_0.1.3 tools_4.2.1 curl_4.3.2
[9] Rcpp_1.0.9 lubridate_1.8.0 xml2_1.3.3 stringi_1.7.8
[13] stringr_1.4.1 jsonlite_1.8.0 bibtex_0.4.2.3

@mwmclean
Collaborator

Hi, thanks for the report. Unfortunately, this issue occurs in poppler, which you can verify by running pdftotext from the command line. It appears to remove a trailing hyphen when it is the last character on a line. I've pushed some fixes so that your example now runs without error and no longer prints the Command Line Error: Wrong page range given message, if you install the latest version of the package from GitHub. I don't see any way to grab the correct DOI in this case without a change to poppler, sorry.
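
To make the failure mode concrete, here is a minimal R sketch of how dropping a line-ending hyphen changes the DOI. It is only an illustration under stated assumptions: `dehyphenate()` and `extract_doi()` are hypothetical helpers, the DOI pattern is a simplified one and not the regex RefManageR uses, and the two input lines are a guess at how the split DOI is laid out on the page.

```r
# Assumed layout of the DOI on the title page: split after a hyphen.
page_lines <- c("doi: 10.1146/annurev-financial-010421-",
                "085556")

# Hypothetical stand-in for a dehyphenation step: joins a line ending in "-"
# with the following line and drops the hyphen.
dehyphenate <- function(lines) {
  text <- paste(lines, collapse = "\n")
  gsub("-\n", "", text, fixed = TRUE)
}

# Simplified DOI pattern for illustration only.
extract_doi <- function(text) {
  m <- regmatches(text, regexpr("10\\.[0-9]{4,9}/[^[:space:]]+", text))
  if (length(m) == 0) NA_character_ else m
}

# Hyphen dropped at the line break: the DOI CrossRef rejects with a 404.
extract_doi(dehyphenate(page_lines))
#> [1] "10.1146/annurev-financial-010421085556"

# Hyphen kept when joining the lines: the DOI that resolves.
extract_doi(paste(page_lines, collapse = ""))
#> [1] "10.1146/annurev-financial-010421-085556"

# To inspect the real extracted text (requires poppler-utils on the PATH):
# system2("pdftotext", c("-f", "1", "-l", "1", "page1.pdf", "-"), stdout = TRUE)
```

Once the hyphen is gone from the extracted text, nothing distinguishes this case from a DOI that never contained one, which is why the fix would have to come from the text extraction step.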

rmcd1024 commented Oct 2, 2022

Thanks. The new error message is definitely an improvement. Because the function still writes a Bibtex entry (which I think is the correct decision), would it make sense for the function to note in this case that the entry may be wrong? Something like:

Writing 1 (possibly incorrect) Bibtex entries

In any case, ReadPDFs is a fantastic function, thank you.
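
A rough sketch of what the suggested message could look like in R. `entry_message()` and its `n_unverified` argument are hypothetical and not part of RefManageR; the sketch assumes the caller already knows how many DOI lookups failed.

```r
# Hypothetical helper, not RefManageR's API: builds the status message and
# adds a caveat when at least one DOI lookup failed.
entry_message <- function(n_entries, n_unverified = 0) {
  note <- if (n_unverified > 0) " (possibly incorrect)" else ""
  sprintf("Writing %d%s Bibtex entries", n_entries, note)
}

message(entry_message(1, n_unverified = 1))
#> Writing 1 (possibly incorrect) Bibtex entries
```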
