Fix parsing of non-finite values #3942

Merged on Mar 16, 2021 (20 commits)

Dummy commit to trigger CI
mjmckp authored Feb 12, 2021
commit 108c6de58494d312acf7fc05741e9cc6d1dba2d3
2 changes: 1 addition & 1 deletion include/LightGBM/utils/common.h
@@ -1084,7 +1084,7 @@ struct __StringToTHelper<T, true> {

// Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc.
if (!fast_parse_succeeded) {
std::string strlower(str);
std::string strlower(str);
std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); });

@AlbertoEAF (Contributor), Feb 13, 2021

Great clean code @mjmckp ;)

Instead of allocating a string, and since you already have a lambda, how about defining a case-insensitive comparison lambda and using std::equal to check for the "inf" and "nan" values below?

Although this is the rare branch, strings longer than inf or nan may also end up being parsed here, which would slow down our parsing needlessly.
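
For readers following the thread, the suggestion above could look roughly like the sketch below. It is not code from the PR: `EqualsIgnoreCase` and `CheckExamples` are hypothetical names, and the helper assumes the literal it compares against is already lowercase ASCII.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical helper (not part of the PR): compares `str` against a
// lowercase ASCII literal without allocating a lowered copy of `str`.
inline bool EqualsIgnoreCase(const std::string& str, const char* lower_literal) {
  const std::size_t len = std::strlen(lower_literal);
  if (str.size() != len) return false;  // different lengths can never match
  return std::equal(str.begin(), str.end(), lower_literal,
                    [](char a, char b) {
                      // Cast to unsigned char first: std::tolower has undefined
                      // behavior for negative values other than EOF.
                      return std::tolower(static_cast<unsigned char>(a)) == b;
                    });
}

// Small usage demonstration.
inline void CheckExamples() {
  assert(EqualsIgnoreCase("INF", "inf"));
  assert(EqualsIgnoreCase("NaN", "nan"));
  assert(!EqualsIgnoreCase("infinity", "inf"));
}
```

The length check up front means a long input is rejected before any character is compared, which is the cost the comment above is concerned about.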

Contributor Author

Again, this hardly seems worth it. This branch is rarely invoked, while a colossal number of strings are allocated when splitting and parsing the input file, so these few extra allocations are a drop in the ocean.

I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...

Contributor

> I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...

Agreed. Will you add such tests?
I actually run such tests, but on an external lgbm provider, and my model had no nan or inf values. I would prefer to see them in lgbm's CI so that any breakage is detected immediately.
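
As an illustration of what such a round-trip test might check, here is a minimal sketch. It does not use LightGBM's own routines: `FormatDouble` and `ParseDouble` are hypothetical stand-ins (built on `std::snprintf` and `std::strtod`) for whatever functions write and read model values, and `RoundTrips` and `AllRoundTrip` are likewise made-up names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <limits>
#include <string>
#include <vector>

// Hypothetical stand-in for the routine that writes a value as text.
inline std::string FormatDouble(double v) {
  if (std::isnan(v)) return "nan";
  if (std::isinf(v)) return v < 0 ? "-inf" : "inf";
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%.17g", v);  // 17 significant digits round-trip a double
  return buf;
}

// Hypothetical stand-in for the routine that reads a value back.
// std::strtod accepts "inf" and "nan" in any letter case.
inline double ParseDouble(const std::string& s) {
  return std::strtod(s.c_str(), nullptr);
}

// value -> text -> value must give back the same value.
// NaN never compares equal to itself, so it is checked with std::isnan.
inline bool RoundTrips(double v) {
  const double back = ParseDouble(FormatDouble(v));
  if (std::isnan(v)) return std::isnan(back);
  return back == v;
}

// Checks a mix of ordinary and non-finite values.
inline bool AllRoundTrip() {
  const double inf = std::numeric_limits<double>::infinity();
  const double nan = std::numeric_limits<double>::quiet_NaN();
  const std::vector<double> values = {0.0, -1.5, 0.1, 1e-300, 1e300, inf, -inf, nan};
  for (double v : values) {
    if (!RoundTrips(v)) return false;
  }
  return true;
}
```

In a real test the two stand-ins would be swapped for the functions under test, so that a regression in non-finite parsing fails the check in CI.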

Contributor Author

I could assist in adding the tests. Any idea where these should go and how best to implement them?

Collaborator

@mjmckp #3555 has just been merged. We would really appreciate new GTest-compatible tests in this or in a follow-up PR.

Contributor Author

Sure, would you mind pointing me towards a similar kind of test that I can use as a starting point please?

Collaborator

Unfortunately, we don't have any tests yet. This is something that we should concentrate on in the near future. For now, I think you can take a look at tests from @AlbertoEAF in #3997.
https://github.com/microsoft/LightGBM/pull/3997/files#diff-c363eba6eda99d9e560f8341a1fc8fe02e885d2256db2482e1c543430a25666d

Collaborator

@mjmckp I've merged this PR so as not to delay the upcoming release. Please feel free to add tests in a new PR. We'll be very grateful! And thanks a lot for the bug fix!

Contributor Author

@StrikerRUS Thanks a lot, I'll get up to speed on how the new tests work and add some tests for this in a new PR soon.

if (strlower == std::string("inf"))
tmp = std::numeric_limits<double>::infinity();