Ref. for codepoints #4

Open
MichaelChirico opened this issue Nov 17, 2021 · 6 comments

Comments

@MichaelChirico
Contributor

Tried looking into this open question in the README:

Here are some handy ways to find the Unicode code points for an existing string:

    Copy and paste into the Unicode character inspector.
    Do we have other suggestions?

From this answer here, it looks like "no", unless that logic were to be put into a common package we could reference:

https://stackoverflow.com/a/6240184/3576984

@gaborcsardi
Owner

Base R works with code points, so this works currently:

x <- "\U0001f477\u200d\u2642\ufe0f"
# https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%91%B7%E2%80%8D%E2%99%82%EF%B8%8F
x
#> [1] "👷‍♂️"

utf8::utf8_print(strsplit(x, "")[[1]], utf8 = FALSE)
#> [1] "\U0001f477" "\u200d"     "\u2642"     "\ufe0f"
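For a version without the utf8 package, here is a minimal base R sketch of the same idea. It assumes the input is a single valid UTF-8 string; the `U+XXXX` labels are just one illustrative way to display the result.

```r
# The same string as above, written with escapes so the source stays ASCII.
x <- "\U0001f477\u200d\u2642\ufe0f"

# utf8ToInt() returns one integer per code point (not per grapheme).
cps <- utf8ToInt(x)

# Format each code point in the usual U+XXXX notation (at least 4 hex digits).
labels <- sprintf("U+%04X", cps)
print(labels)
```

Like `strsplit(x, "")`, this splits on code points, so the emoji and its modifiers are listed separately.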

Of course it would be more correct to work with graphemes, so if base R ever switches to that, this might stop working.

Btw, cli now also has some handy functions for UTF-8 strings, e.g. it handles graphemes properly:

cli::utf8_nchar(x)
#> [1] 1

nchar(x)
#> [1] 4
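To make the difference between the counts concrete, here is a small base R sketch (no cli needed) comparing bytes and code points for the same string. The variable names are illustrative only.

```r
x <- "\U0001f477\u200d\u2642\ufe0f"

# Number of code points: this is what nchar() counts by default.
n_chars <- nchar(x, type = "chars")

# Number of bytes in the UTF-8 encoding: 4 + 3 + 3 + 3.
n_bytes <- nchar(x, type = "bytes")

# Counting the integers from utf8ToInt() is another way to count code points.
n_codepoints <- length(utf8ToInt(x))

print(c(chars = n_chars, bytes = n_bytes, codepoints = n_codepoints))
```

None of these base R counts gives the grapheme count of 1; that is the part cli adds.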

@MichaelChirico
Contributor Author

MichaelChirico commented Nov 17, 2021

I think utf8::utf8_print() is what I was after with "putting that logic into a package"; let's add a reference to it in the doc there.

It's a good tool for a case like this:

OK, I've copy-pasted "👍" into a string in my package for users. Now it's time to submit to CRAN or otherwise run R CMD check, and I'm getting dinged for the non-ASCII characters -- how do I convert it to a \U string?
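For that use case, a rough base R sketch might look like the following. `to_escape()` is a hypothetical helper written for this comment, not a function from any package; it assumes a single valid UTF-8 string and leaves ASCII characters untouched (so quotes and backslashes are not escaped).

```r
# Hypothetical helper: rewrite non-ASCII code points as \u / \U escapes.
to_escape <- function(s) {
  stopifnot(is.character(s), length(s) == 1L)
  cps <- utf8ToInt(enc2utf8(s))
  esc <- ifelse(
    cps < 128L,
    intToUtf8(cps, multiple = TRUE),   # keep ASCII characters as they are
    ifelse(
      cps > 0xFFFF,
      sprintf("\\U%08x", cps),         # outside the BMP: 8 hex digits
      sprintf("\\u%04x", cps)          # inside the BMP: 4 hex digits
    )
  )
  paste(esc, collapse = "")
}

# The thumbs-up emoji is U+1F44D; cat() shows the text to paste into source.
res <- to_escape("I \U0001f44d R")
cat(res, "\n")
```

The printed text can then be pasted between quotes in the package source in place of the literal emoji.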

@gaborcsardi
Owner

Here is a base R solution:

Sys.setlocale("LC_ALL", "C")
#> [1] "C/C/C/C/C/en_US.UTF-8"

x
#> [1] "<U+0001F477><U+200D><U+2642><U+FE0F>"

It will mess up the current session of course...

@MichaelChirico
Contributor Author

Right... still useful to mention. For the use case mentioned, we can just open up a new process & run it there quickly. Nice!

@gaborcsardi
Owner

gaborcsardi commented Nov 17, 2021

Yeah, maybe there is a way to restore the locale, but withr::with_locale() refuses to change LC_ALL, and there might be a reason for that:

❯ withr::with_locale(c(LC_ALL = "C"), TRUE)
Error: Setting LC_ALL category not implemented.

callr can run it in another session:

❯ callr::r(function() { Sys.setlocale("LC_ALL", "C"); format("👷‍♂️") })
[1] "<U+0001F477><U+200D><U+2642><U+FE0F>"

Maybe it would be enough to change another category.

@gaborcsardi
Owner

Oh, yeah, here it is:

withr::with_locale(c(LC_CTYPE = "C"), format("👷‍♂️"))
#> [1] "<U+0001F477><U+200D><U+2642><U+FE0F>"
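If withr is not available, the same save-and-restore pattern can be sketched in base R. `format_in_c_locale()` is a hypothetical name used only for illustration, and the exact text returned by `format()` may depend on platform and R version, so no output is shown here.

```r
# Hypothetical helper: temporarily switch LC_CTYPE to "C", then restore it.
format_in_c_locale <- function(s) {
  old <- Sys.getlocale("LC_CTYPE")
  # on.exit() restores the previous setting even if format() fails.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", "C")
  format(s)
}

before <- Sys.getlocale("LC_CTYPE")
out <- format_in_c_locale("\U0001f477\u200d\u2642\ufe0f")
after <- Sys.getlocale("LC_CTYPE")
print(out)
```

Only LC_CTYPE is touched, so the rest of the session's locale settings stay as they were.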

2 participants