Encoding problems in pandoc_citeproc_convert() with Windows #2195

Closed
5 tasks done
mitchelloharawild opened this issue Jul 27, 2021 · 7 comments · Fixed by #2202
Labels
bug an unexpected problem or unintended behavior next to consider for next release

Comments

@mitchelloharawild

A couple of issues have been raised in {vitae} about encoding problems with bibliographies on Windows (mitchelloharawild/vitae#167, mitchelloharawild/vitae#158). So far I've narrowed it down to rmarkdown::pandoc_citeproc_convert(), and I'm raising an issue here in the hope that you have more experience with pandoc and Windows encoding problems.

MRE:

bib <- c(
  "@article{conc2021,",
  "  title={História da Habitação},",
  "  author={Conceição, Sérgio},", 
  "  journal={Portuguese History},",
  "  number={1},", 
  "  year={2021}", 
  "}"
)

writeLines(enc2utf8(bib), "test.bib", useBytes = TRUE)
rmarkdown::pandoc_citeproc_convert("test.bib")
#> [[1]]
#> [[1]]$author
#> [[1]]$author[[1]]
#> [[1]]$author[[1]]$family
#> [1] "ConceiÃ§Ã£o"
#> 
#> [[1]]$author[[1]]$given
#> [1] "SÃ©rgio"
#> 
#> 
#> 
#> [[1]]$`container-title`
#> [1] "Portuguese History"
#> 
#> [[1]]$id
#> [1] "conc2021"
#> 
#> [[1]]$issue
#> [1] "1"
#> 
#> [[1]]$issued
#> [[1]]$issued$`date-parts`
#> [[1]]$issued$`date-parts`[[1]]
#> [[1]]$issued$`date-parts`[[1]][[1]]
#> [1] 2021
#> 
#> 
#> 
#> 
#> [[1]]$title
#> [1] "HistÃ³ria da habitaÃ§Ã£o"
#> 
#> [[1]]$type
#> [1] "article-journal"

Created on 2021-07-26 by the reprex package (v2.0.0)

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/Los_Angeles         
#>  date     2021-07-26                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source                            
#>  cli           3.0.1   2021-07-17 [1] CRAN (R 4.0.5)                    
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)                    
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.2)                    
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)                    
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)                    
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.0.5)                    
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.5)                    
#>  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.4)                    
#>  knitr         1.33    2021-04-24 [1] CRAN (R 4.0.5)                    
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.4)                    
#>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.0.5)                    
#>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.0.5)                    
#>  rmarkdown     2.9.5   2021-07-27 [1] Github (rstudio/rmarkdown@bc936f7)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.5)                    
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)                    
#>  stringi       1.7.3   2021-07-16 [1] CRAN (R 4.0.2)                    
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)                    
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.0.5)                    
#>  xfun          0.24    2021-06-15 [1] CRAN (R 4.0.5)                    
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)                    
#> 
#> [1] C:/Users/Admin/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.2/library

Checklist

When filing a bug report, please check the boxes below to confirm that you have provided us with the information we need. Have you:

  • formatted your issue so it is easier for us to read?

  • included a minimal, self-contained, and reproducible example?

  • pasted the output from xfun::session_info('rmarkdown') in your issue?

  • upgraded all your packages to their latest versions (including your versions of R, the RStudio IDE, and relevant R packages)?

  • installed and tested your bug with the development version of the rmarkdown package using remotes::install_github("rstudio/rmarkdown")?


xfun::session_info('rmarkdown')
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363), RStudio 1.4.1103

Locale:
  LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
  LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
  LC_TIME=English_United States.1252    

Package version:
  base64enc_0.1.3   digest_0.6.27     evaluate_0.14     glue_1.4.2        graphics_4.0.2   
  grDevices_4.0.2   highr_0.9         htmltools_0.5.1.1 jsonlite_1.7.2    knitr_1.33       
  magrittr_2.0.1    markdown_1.1      methods_4.0.2     mime_0.11         rlang_0.4.11     
  rmarkdown_2.9     stats_4.0.2       stringi_1.7.3     stringr_1.4.0     tinytex_0.32     
  tools_4.0.2       utils_4.0.2       xfun_0.24         yaml_2.2.1       

Pandoc version: 2.11.2
@cderv
Collaborator

cderv commented Jul 28, 2021

Hi @mitchelloharawild !

Thanks for opening this issue. Here are some notes on my investigation.

Does it come from Pandoc?

The first thing I did was check whether this comes from Pandoc.

Writing test.bib from R

bib <- c(
  "@article{conc2021,",
  "  title={História da Habitação},",
  "  author={Conceição, Sérgio},", 
  "  journal={Portuguese History},",
  "  number={1},", 
  "  year={2021}", 
  "}"
)
xfun::write_utf8(bib, "test.bib")

Using pandoc from command line in terminal directly.

pandoc -t markdown -s -o test.md .\test.bib
pandoc -t csljson -s -o test.json .\test.bib

Trying to read the file from R as UTF-8

xfun::read_utf8("test.md")
#>  [1] "---"                                   "nocite: \"[@*]\""                     
#>  [3] "references:"                           "- author:"                            
#>  [5] "  - family: Conceição"                 "    given: Sérgio"                    
#>  [7] "  container-title: Portuguese History" "  id: conc2021"                       
#>  [9] "  issue: 1"                            "  issued: 2021"                       
#> [11] "  title: História da habitação"        "  type: article-journal"              
#> [13] "---"                                   ""
xfun::read_utf8("test.json")
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"Conceição\","              
#>  [6] "        \"given\": \"Sérgio\""                   
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"História da habitação\","       
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"

This works ok.

What happens with R then?

In the R function, we don't write to a file. We capture the output from a call to system().

rmarkdown/R/pandoc.R

Lines 150 to 153 in 0af6b35

# run the conversion
with_pandoc_safe_environment({
result <- system(command, intern = TRUE)
})

I think the cause is here: on Windows, UTF-8 is not the default encoding, and the captured string is not marked as UTF-8, so R treats Pandoc's UTF-8 bytes as native-encoded text and prints them incorrectly.

rmarkdown:::with_pandoc_safe_environment({
  result <- system("pandoc -t csljson -s test.bib", intern = TRUE)
})
# incorrect result
result
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"ConceiÃ§Ã£o\","            
#>  [6] "        \"given\": \"SÃ©rgio\""                  
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"HistÃ³ria da habitaÃ§Ã£o\","    
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"
# the output from pandoc is not marked as UTF-8, I believe
Encoding(result)
#>  [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
#>  [9] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
#> [17] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
# Mark it as UTF-8
Encoding(result) <- 'UTF-8'
# it works ok ! 
result
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"Conceição\","              
#>  [6] "        \"given\": \"Sérgio\""                   
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"História da habitação\","       
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"

Because the string carries the wrong encoding, jsonlite::fromJSON() cannot produce correctly encoded strings in the resulting list.
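
To see the marking step in isolation, here is a minimal base-R sketch that needs neither Pandoc nor Windows. It assumes the captured output behaves like a string built with rawToChar(), which carries UTF-8 bytes but no encoding mark; the variable names are illustrative only.

```r
# UTF-8 bytes for "Conceição", built by hand so the string starts out with no
# encoding mark (an assumption standing in for system(intern = TRUE) output).
bytes <- as.raw(c(0x43, 0x6f, 0x6e, 0x63, 0x65, 0x69, 0xc3, 0xa7, 0xc3, 0xa3, 0x6f))
captured <- rawToChar(bytes)
before <- Encoding(captured)      # no declared encoding yet

# Declaring the encoding changes only the mark, not the bytes.
marked <- captured
Encoding(marked) <- "UTF-8"
same_bytes <- identical(charToRaw(marked), bytes)

# Pure-ASCII elements never take a mark, so only lines with accented
# characters change when a whole vector of JSON lines is marked.
lines <- c("[", captured)
Encoding(lines) <- "UTF-8"
marks <- Encoding(lines)
```

Because only the mark changes, this is cheap, but it is only correct when the bytes really are UTF-8, which is the assumption made about Pandoc's output in this thread.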

Workaround for you?

Basically, the current workaround would be to convert to JSON, mark the result as UTF-8, and convert it to a list yourself.

res <- rmarkdown::pandoc_citeproc_convert("test.bib", type = "json")
res
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"ConceiÃ§Ã£o\","            
#>  [6] "        \"given\": \"SÃ©rgio\""                  
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"HistÃ³ria da habitaÃ§Ã£o\","    
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"
Encoding(res) <- "UTF-8"
res
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"Conceição\","              
#>  [6] "        \"given\": \"Sérgio\""                   
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"História da habitação\","       
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"
jsonlite::fromJSON(res, simplifyVector = FALSE)
#> [[1]]
#> [[1]]$author
#> [[1]]$author[[1]]
#> [[1]]$author[[1]]$family
#> [1] "Conceição"
#> 
#> [[1]]$author[[1]]$given
#> [1] "Sérgio"
#> 
#> 
#> 
#> [[1]]$`container-title`
#> [1] "Portuguese History"
#> 
#> [[1]]$id
#> [1] "conc2021"
#> 
#> [[1]]$issue
#> [1] "1"
#> 
#> [[1]]$issued
#> [[1]]$issued$`date-parts`
#> [[1]]$issued$`date-parts`[[1]]
#> [[1]]$issued$`date-parts`[[1]][[1]]
#> [1] 2021
#> 
#> 
#> 
#> 
#> [[1]]$title
#> [1] "História da habitação"
#> 
#> [[1]]$type
#> [1] "article-journal"

This is useful until we ship a fix, or if you don't want to bump your dependency to a later rmarkdown version.

What do we need to do in rmarkdown?

We need to mark the result with the correct encoding. I believe Pandoc always uses UTF-8 for input and output. Since Pandoc 2.11, pandoc-citeproc is no longer used, and pandoc is used directly for the conversion (as in the example above). However, I believe it would be the same for pandoc-citeproc.

I see two solutions:

  1. Mark the output as I did using Encoding()
  2. Write output of command to a temp file and read this file as UTF-8 using xfun::read_utf8()

I wonder whether the latter is safer, since it avoids any direct handling of encoding by R while capturing the output of the system call.
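
The second option can be sketched without calling Pandoc at all. The block below assumes a file of UTF-8 bytes stands in for what `pandoc -o` would write, and it uses base R's readLines(encoding = "UTF-8") as a stand-in for xfun::read_utf8(); the file contents are made up for illustration.

```r
# Stand-in for a file produced by `pandoc -o`: UTF-8 bytes written verbatim.
out <- tempfile(fileext = ".json")
writeLines(enc2utf8('{"family": "Concei\u00e7\u00e3o"}'), out, useBytes = TRUE)

# Read it back while declaring the encoding, so the non-ASCII line comes
# back already marked as UTF-8 and no later Encoding<- call is needed.
json <- readLines(out, encoding = "UTF-8", warn = FALSE)
mark <- Encoding(json)

# Clean up the temporary file.
unlink(out)
```

The trade-off is an extra file on disk in exchange for never holding an unmarked copy of the output in the R session.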

@yihui, do you have a preference in this matter?

Thanks for opening this issue @mitchelloharawild, I was not aware of this problem!

@cderv cderv added bug an unexpected problem or unintended behavior next to consider for next release labels Jul 28, 2021
@mitchelloharawild
Author

Thanks for figuring this out, great description and investigation. I'll look into the most appropriate fix for {vitae} with this in mind.

@yihui
Member

yihui commented Jul 29, 2021

  • Mark the output as I did using Encoding()

@cderv Do you mean this?

diff --git a/R/pandoc.R b/R/pandoc.R
index bacb4a18..3bca22e1 100644
--- a/R/pandoc.R
+++ b/R/pandoc.R
@@ -161,6 +161,7 @@ pandoc_citeproc_convert <- function(file, type = c("list", "json", "yaml")) {
   if (type == "list") {
     jsonlite::fromJSON(result, simplifyVector = FALSE)
   } else {
+    Encoding(result) <- "UTF-8"
     result
   }
 }

That sounds simple and safe to me since you have tested it. If we want to be conservative, we can certainly use the second solution (i.e. write to a file and read it back).

@cderv
Collaborator

cderv commented Jul 29, 2021

Not exactly, it would need to be the output resulting from the call to system()

diff --git a/R/pandoc.R b/R/pandoc.R
index bacb4a18..36b86336 100644
--- a/R/pandoc.R
+++ b/R/pandoc.R
@@ -150,6 +150,7 @@ pandoc_citeproc_convert <- function(file, type = c("list", "json", "yaml")) {
   # run the conversion
   with_pandoc_safe_environment({
     result <- system(command, intern = TRUE)
+    Encoding(result) <- "UTF-8"
   })
   status <- attr(result, "status")
   if (!is.null(status)) {

or maybe this

diff --git a/R/pandoc.R b/R/pandoc.R
index bacb4a18..bb715b91 100644
--- a/R/pandoc.R
+++ b/R/pandoc.R
@@ -150,20 +150,22 @@ pandoc_citeproc_convert <- function(file, type = c("list", "json", "yaml")) {
   # run the conversion
   with_pandoc_safe_environment({
     result <- system(command, intern = TRUE)
   })
   status <- attr(result, "status")
   if (!is.null(status)) {
     cat(result, sep = "\n")
     stop("Error ", status, " occurred building shared library.")
   }

+  Encoding(result) <- "UTF-8"
+
   # convert the output if requested
   if (type == "list") {
     jsonlite::fromJSON(result, simplifyVector = FALSE)

This is because fromJSON() needs to be called on input strings whose encoding is already marked.
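
A small sketch of that ordering, assuming jsonlite is installed and using hand-built unmarked strings in place of real system() output. The helper name mark_then_parse() is hypothetical and not part of rmarkdown.

```r
# Hypothetical helper: declare the captured lines as UTF-8, then parse them.
mark_then_parse <- function(lines) {
  Encoding(lines) <- "UTF-8"      # must happen before fromJSON()
  jsonlite::fromJSON(lines, simplifyVector = FALSE)
}

# Stand-in for system(command, intern = TRUE): UTF-8 bytes, no encoding mark.
utf8_json <- enc2utf8(c("[", "  {\"family\": \"Concei\u00e7\u00e3o\"}", "]"))
captured <- vapply(
  utf8_json,
  function(s) rawToChar(charToRaw(s)),   # same bytes, mark dropped
  character(1),
  USE.NAMES = FALSE
)

parsed <- mark_then_parse(captured)
family <- parsed[[1]]$family
```

Marking only in the non-list branch, as in the first proposed diff, would leave the list output parsed from unmarked strings.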

@yihui
Member

yihui commented Jul 29, 2021

Okay. Either way seems to be fine to me.

cderv added a commit that referenced this issue Aug 18, 2021
Pandoc will output UTF-8 content but on non default UTF-8 (like Windows), system() will return the result string in native encoding. We need to mark it before further processing. fixes #2195

Another option would be to convert to a file and read it back into R.
@cderv
Collaborator

cderv commented Aug 18, 2021

@mitchelloharawild I pushed the fix in the dev version of rmarkdown.

This should solve your issue in vitae. Thanks for the report.

@github-actions

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2022