Encoding of the bib file vs encoding of the output (console and Rmd) #75

Open
mbojan opened this issue May 9, 2020 · 4 comments

@mbojan

mbojan commented May 9, 2020

This is perhaps related to #61.

I'm having a tough time taming the encoding of the document vs. the encoding of the Bib file (I think). The Bib file is in UTF-8. I have an Rmd document (a xaringan-based slide deck) that uses RefManageR to show citations and a reference list produced with PrintBibliography(). The Rmd is also in UTF-8. Non-ASCII letters in the BibEntries print incorrectly both in (1) the R console and in (2) the rendered document.

  1. My understanding is that the bad printing in the console happens because the package prints the strings as is, i.e. as UTF-8 bytes in a non-UTF-8 locale. Does RefManageR have any functionality to convert parsed BibEntries to the native encoding on the fly?
  2. I'm not sure why they are not correct in the rendered HTML, as it is also in UTF-8 (and declared as such in the HTML metadata). The output looks just like in #61 ("Problems inserting bib-entries encoded in UTF-8 to PDF files").
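
For reference, a minimal sketch of what "converting to the native encoding" means in base R. It uses only `enc2native()` and `iconv()`; the sample string and the choice of CP1250 as an explicit target are assumptions made for illustration, and nothing here is RefManageR code.

```r
# A UTF-8 string written with \u escapes, so the example does not depend
# on the encoding of the file it is pasted into ("Żółć").
x <- "\u017b\u00f3\u0142\u0107"

# Convert to whatever the current locale uses. In a UTF-8 locale this
# leaves the string unchanged.
native <- enc2native(x)

# Convert explicitly to the Windows code page from the report above
# (assumes the platform's iconv knows the name "CP1250").
cp1250 <- iconv(x, from = "UTF-8", to = "CP1250")

# Round trip back to UTF-8 to check that nothing was lost.
back <- iconv(cp1250, from = "CP1250", to = "UTF-8")
```

Both functions rely on the input being correctly marked or described; a UTF-8 string whose encoding is "unknown" would be treated as native text by `enc2native()` and pass through untouched.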

I'll be happy to attach files as an MWE, unless I'm missing something obvious. Thanks!

I'm on R 4.0.0, RStudio 1.2.5042 on Windows 10 with:

> Sys.getlocale()
[1] "LC_COLLATE=Polish_Poland.1250;LC_CTYPE=Polish_Poland.1250;LC_MONETARY=Polish_Poland.1250;LC_NUMERIC=C;LC_TIME=Polish_Poland.1250"
> l10n_info()
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] FALSE

$codepage
[1] 1250

$system.codepage
[1] 1250
@mbojan

mbojan commented May 13, 2020

It seems that the problem goes away if ReadBib() declares the encoding of the strings with Encoding().
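
A small base R illustration of what declaring an encoding does, independent of RefManageR. The string is built from raw bytes (an assumption made so that it starts out unmarked, as strings read from a file typically do).

```r
# UTF-8 bytes for "Żółć", assembled by hand so that the resulting string
# carries no encoding mark.
x <- rawToChar(as.raw(c(0xC5, 0xBB, 0xC3, 0xB3, 0xC5, 0x82, 0xC4, 0x87)))

before <- Encoding(x)    # "unknown": R will treat the bytes as native text

# Declare the encoding. This changes only the mark, not the bytes.
Encoding(x) <- "UTF-8"

after <- Encoding(x)
n_bytes <- nchar(x, type = "bytes")
n_chars <- nchar(x, type = "chars")
```

Once the mark is set, R can translate the string for printing in a non-UTF-8 locale instead of sending the raw UTF-8 bytes to the console.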

mbojan added a commit to mbojan/RefManageR that referenced this issue May 13, 2020
@mbojan

mbojan commented May 13, 2020

See the commit referenced above. I think that if the ReadBib(Encoding=) argument is "UTF-8" or "latin1", then that encoding could be declared using Encoding(). Note that I omitted "latin1" in the commit above. I'm not yet sure whether it might have some negative consequences or side effects.
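
Roughly, the idea could look like the sketch below. `declare_encoding()` is a hypothetical helper written for illustration, not the code from the commit and not part of RefManageR. It marks strings only when the requested encoding is one of the two that `Encoding()` can declare for text.

```r
# Hypothetical helper: mark `x` with `encoding` when R supports that mark,
# otherwise return `x` unchanged.
declare_encoding <- function(x, encoding) {
  if (encoding %in% c("UTF-8", "latin1")) {
    Encoding(x) <- encoding
  }
  x
}

# A single latin1 byte for "é", initially unmarked.
raw_e <- rawToChar(as.raw(0xE9))

marked   <- declare_encoding(raw_e, "latin1")  # gets the "latin1" mark
unmarked <- declare_encoding(raw_e, "CP1250")  # left as "unknown"
```

Any other encoding (CP1250, for instance) cannot be declared this way and would need an actual conversion with `iconv()`.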

I'll be happy to make a PR if you agree on the solution @mwmclean .

@mwmclean

Thanks a lot for investigating this! I've never had much luck fixing up encoding issues on Windows. It's a source of endless annoyance. It would be great if you could create a PR for me to review.

@mbojan

mbojan commented May 14, 2020

Sure. Nothing like a COVID-induced necessity to prepare a lot of teaching materials in my native language and having to do that on a friendly Windows machine... 🙄

Please help me understand the mechanics of the package a little and where the encodings fit in. I have to say I have a hard time figuring out what's what in the source code. It seems somewhat Frankensteinian (no pun intended; been there, done that...). For example, it seems that if there are LaTeX commands in the bib file, ReadBib() converts them to the corresponding UTF-8 characters. (BTW, I found cleanupLatex(), but surprisingly it is not used, while tools::latexToUtf8() is...) Given that this conversion takes place, I guess the package should ensure that all the strings read from the bib file are indeed UTF-8. Does bibtex::do_read_bib() do any conversion? It does have an encoding argument, but my layman's look at the C code suggests it never does anything with it. As a consequence, I'm not sure what the role is of the various encoding, Enc, etc. arguments in various places. Do they set the encoding on the connection from which the bib file is read? Is it the target encoding for storage? Or is it the output encoding?
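
For context, this is how the LaTeX-to-UTF-8 conversion in the `tools` package can be called on its own. The wrapper name `latex_to_utf8()` is made up for this sketch, and it is not a claim about how ReadBib() invokes these functions internally.

```r
# Hypothetical wrapper around the tools functions: parse each string as
# LaTeX, replace known macros with UTF-8 characters, and deparse back to
# a plain character string.
latex_to_utf8 <- function(x) {
  vapply(
    x,
    function(s) tools::deparseLatex(tools::latexToUtf8(tools::parseLatex(s))),
    character(1),
    USE.NAMES = FALSE
  )
}

# The \c{c} macro (c with cedilla) is the example used in the tools docs.
out <- latex_to_utf8("fa\\c{c}ile")
```

Because the converted characters are UTF-8, any other strings in the same entry that are left unmarked or in a different encoding would end up mixed with them, which is the consistency concern raised above.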

I am not 100% sure at this moment, but I guess that, to behave consistently:

  1. The user should be able to specify the encoding on the connection to the bib file (default UTF-8).
  2. The strings should be converted to UTF-8 internally if a different encoding is used in (1).
  3. The output encoding should be controlled by the user, defaulting to the native encoding.
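
The three points above can be sketched in base R as follows. The function names `read_bib_utf8()` and `to_output()` are hypothetical, and the sketch converts with `iconv()` after reading the bytes rather than setting an encoding on the connection, which is an assumption of this example and not existing package behaviour.

```r
# Steps 1 and 2: read the file as is, then convert from the user-declared
# encoding to UTF-8 for internal storage.
read_bib_utf8 <- function(path, encoding = "UTF-8") {
  lines <- readLines(path, warn = FALSE)
  iconv(lines, from = encoding, to = "UTF-8")
}

# Step 3: convert on output. In iconv(), to = "" means the encoding of
# the current locale, so that is the default here.
to_output <- function(x, encoding = "") {
  iconv(x, from = "UTF-8", to = encoding)
}

# Demo: a one-line file encoded in CP1250, where byte 0xB3 is "ł".
tmp <- tempfile(fileext = ".bib")
writeBin(c(charToRaw("title = {"), as.raw(0xB3), charToRaw("}\n")), tmp)

entry <- read_bib_utf8(tmp, encoding = "CP1250")
shown <- to_output(entry)
```

Keeping everything in UTF-8 between the two functions means the input and output encodings can be chosen independently of each other.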

Nevertheless, I will submit a PR with a small change that uses Encoding() on the strings in BibEntry(). This will only affect objects that are created from BibTeX files. It seems to fix my problems on Windows and does not break anything when I do the same on my Linux box. What say you, @mwmclean?

mbojan added a commit to mbojan/RefManageR that referenced this issue Apr 18, 2021