-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R-package] add support for non-ASCII feature names (fixes #2983) #3647
[R-package] add support for non-ASCII feature names (fixes #2983) #3647
Conversation
ok it looks like this is still failing on Windows 😕 I'll change this to work in progress and work on it tonight. |
Hi James :) I deserve no credits here, this came in the v3.0 release, I think it was this PR from @henry0312 https://github.com/microsoft/LightGBM/pull/2976/files. |
Ok I think this is ready for review! I learned a lot from "UTF-8 Support on Windows" and "Windows/UTF-8 Build of R and CRAN Packages" by @kalibera.
I think that on Windows, we cannot assume that non-ASCII feature names will survive the round trip between C++ and R ( specifically with Before this PR, the test on non-ASCII feature names was skipped completely on all operating systems. Now, that test runs on all operating systems, but on Windows we expect the feature names in I re-submitted on R Hub and can confirm these tests pass on 32-bit Windows and Solaris. |
# UTF-8 strings are not well-supported on Windows | ||
# * https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/ | ||
# * https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html | ||
if (!ON_WINDOWS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/gha run r-valgrind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valgrind is happy
==1913== LEAK SUMMARY:
==1913== definitely lost: 0 bytes in 0 blocks
==1913== indirectly lost: 0 bytes in 0 blocks
==1913== possibly lost: 336 bytes in 1 blocks
==1913== still reachable: 309,428,667 bytes in 55,367 blocks
==1913== of which reachable via heuristic:
==1913== newarray : 4,264 bytes in 1 blocks
==1913== suppressed: 0 bytes in 0 blocks
==1913== Reachable blocks (those to which a pointer was found) are not shown.
==1913== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==1913==
==1913== For lists of detected and suppressed errors, rerun with: -s
==1913== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)
…ghtGBM into fix/non-ascii-feature-names
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
I started working on adding support for non-ASCII features in the R package tonight, and I think this might have already been fixed! I'm not sure how, but my bet is @AlbertoEAF 's work in #3405 , since that touched model reading / writing code.
I tested with R Hub and this is even now working on Solaris!
Notes for Reviewers
The change in this PR is just to stop skipping a test that is no longer broken. However, I gave this the label
feature
because I think it makes sense for "you can now use non-ASCII features names" to show up in the "New Features" part of the changelog for the next release, even if this PR really didn't do the work to add them and @AlbertoEAF actually did all the hard work 😀