Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] dont format string values with precision #1465

Merged
merged 4 commits into from
Jun 26, 2018

Conversation

JasonTam
Copy link
Contributor

@JasonTam JasonTam commented Jun 21, 2018

I was running into this:
ValueError: Unknown format code 'f' for object of type 'str'
for string values when setting precision in create_tree_digraph

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Jun 21, 2018

@JasonTam Hi! Thanks for creating PR.

Can you please provide the example in which you got described error?

UPD:
... or even better enhance the test here, so that it could be passed only with your modification.

@StrikerRUS StrikerRUS self-requested a review June 21, 2018 09:40
@JasonTam
Copy link
Contributor Author

JasonTam commented Jun 21, 2018

Sure, here is a minimal example:

import pandas as pd
import numpy as np
import lightgbm as lgb
np.random.seed(0)

x = pd.concat([
    pd.Series(np.random.choice(['a', 'b'], size=(100)), dtype='category'),
    pd.Series(np.random.randint(0, 2, size=100)),
], axis=1)
y = np.random.randint(0, 2, size=100)
ds = lgb.Dataset(x, y)

params = {'min_child_samples': 2}
gbm = lgb.train(params, ds)
lgb.create_tree_digraph(gbm, precision=2)

Not sure of the best way to modify the test:, the cheapest thing I could do is to convert all 30 columns to categorical. This guarantees that the test only passes with the modification. But I feel like it's not good for the test overall.

gbm.fit(self.X_train, self.y_train, categorical_feature=list(range(30)), verbose=False)

If I only convert a couple of columns to categorical, the model will just try to ignore those and split on the other columns.

@StrikerRUS
Copy link
Collaborator

Thanks for the example!

The error is caused by the scientific notation, e.g. 1.0000000180025095e-35. However, we are sill able to convert this value to float with specified precision instead of silent ignoring precision argument.

I think that float2str could be rewritten as follows:

    def float2str(value, precision=None):
        if precision is None:
            return str(value)
        else:
            try:
                return "{0:.{1}f}".format(float(value), precision)
            except TypeError:
                warnings.warn('Some notification that precison cannot be set to some values')
                return str(value)

Now it works correctly with scientific notation and notifies user that precision cannot be set in case of strange values.

@JasonTam
Copy link
Contributor Author

I thought the error is caused by categorical values? The split threshold would be on a set of categories ie: "0||1||2" which is a string. In my fix, I ignore precision if the value is of type string. Not sure if scientific notation is related to this.

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Jun 21, 2018

LightGBM cannot handle strings at all (see #789).

@JasonTam
Copy link
Contributor Author

Im not talking about the categorical values or codes of the dataset itself. I'm talking about the threshold that will be shown in the plot. It's a string if the split is based on set membership.

@StrikerRUS
Copy link
Collaborator

Threshold cannot be string as well.

@JasonTam
Copy link
Contributor Author

Idk man, when I use categorical data, value becomes a string when plotting. You can try my example with a debugger and take a look at value and see what I'm talking about

@StrikerRUS
Copy link
Collaborator

Just run your example completely without float2str function:

tree

@StrikerRUS
Copy link
Collaborator

Please provide the example and the resulted graph with strings you're talking about.

@JasonTam
Copy link
Contributor Author

Not sure exactly what you're asking for.
With my example on master branch, I can't create the graph -- it's going to error out with the following:

ValueError: Unknown format code 'f' for object of type 'str'
> /Users/jtam/tools/LightGBM/python-package/lightgbm/plotting.py(271)float2str()
    269 
    270     def float2str(value, precision=None):
--> 271         return "{0:.{1}f}".format(value, precision) if precision is not None else str(value)
    272 
    273     def add(root, parent=None, decision=None):

ipdb> value
'1'

If I apply my PR, I will get the following:
image

The 1 in the top node is actually a string (note how it's been skipped in the precision formatting , where all other values actually have precision=2).

Could it perhaps be my python or pandas version? -- not sure
(Python 3.6.0 pandas 0.22.0)

@StrikerRUS
Copy link
Collaborator

Oh, I see. I thought you're having the threshold value like this one "0||1||2", but you're talking about the one element from this set "1".

@JasonTam
Copy link
Contributor Author

JasonTam commented Jun 21, 2018

I actually do have thresholds like "0|1|2" , but for a private dataset. If I have time, I'll try to write a small example with it too.
screen shot 2018-06-21 at 5 14 57 pm
(^ graph rendered after applying PR)

@@ -268,7 +268,8 @@ def _to_graphviz(tree_info, show_info, feature_names, precision=None,
raise ImportError('You must install graphviz to plot tree.')

def float2str(value, precision=None):
return "{0:.{1}f}".format(value, precision) if precision is not None else str(value)
return "{0:.{1}f}".format(value, precision) \
if (precision is not None) and (not isinstance(value, str)) else str(value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the brackets and use string_type from compat.py module instead of str in isinstance().

@StrikerRUS
Copy link
Collaborator

If I have time, I'll try to write a small example with it too.

OK, thank you! No need to do this.

@StrikerRUS
Copy link
Collaborator

Please remove these brackets too and I'll merge.

(precision is not None)

@JasonTam
Copy link
Contributor Author

I think the travis error is unrelated?

@StrikerRUS
Copy link
Collaborator

Thank you very much for your contribution and patience!

I think the travis error is unrelated?

Yeah, I've restart the test.

@StrikerRUS StrikerRUS merged commit 6488f31 into microsoft:master Jun 26, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants