Efficient native support of pandas.DataFrame with mixed dense and sparse columns #4153
Comments
@staftermath good to see you here! Thanks for your question. LightGBM does work with sparse arrays in its core library, so I can at least tell you for sure that this should be possible.
To be honest, this is the first I've heard of a pandas SparseArray, though. I personally would have to research more about https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html. Maybe @StrikerRUS knows more. If you convert your data to a scipy sparse matrix, do you still experience memory issues?
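For reference, a minimal sketch of that conversion (the frame and column names here are made up; this assumes the dense columns can also be cast to a pandas sparse dtype, since the DataFrame-level `.sparse.to_coo()` accessor requires every column to be sparse):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Illustrative frame: one ordinary dense column next to a mostly-zero sparse one.
df = pd.DataFrame({
    "dense_feature": np.random.rand(1_000),
    "sparse_feature": pd.arrays.SparseArray(
        np.random.binomial(1, 0.01, 1_000).astype(np.float64), fill_value=0.0
    ),
})

# DataFrame.sparse.to_coo() requires every column to be sparse, so cast the
# dense columns to a sparse dtype first, then export to a SciPy matrix.
all_sparse = df.astype(pd.SparseDtype("float64", fill_value=0.0))
X = sparse.csr_matrix(all_sparse.sparse.to_coo())
print(X.shape, X.nnz)
```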
Thanks James! Let me do some testing and report back. I suspect that when a pandas DataFrame is fed to training, LightGBM only tries to get the underlying np.arrays from the data. Perhaps csc_matrix would resolve the issue. Stay tuned :)
Here is the corresponding line of the source code: `LightGBM/python-package/lightgbm/basic.py`, line 162 at commit dc1bc23.
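Roughly speaking (a simplified sketch, not the exact code at that line), the pandas code path ends up pulling a plain NumPy array out of the frame, which materializes every SparseArray column as dense values:

```python
import numpy as np
import pandas as pd

def to_dense_float(df: pd.DataFrame) -> np.ndarray:
    # Converting a DataFrame to a NumPy array densifies any SparseArray
    # columns, so a frame that is tiny in sparse form can explode in memory.
    return df.to_numpy(dtype=np.float64)
```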
So, LightGBM supports input data to be … However, everything above is true only for … I believe this issue should be treated as a sub-issue of the following feature request (#2302). @staftermath Please comment if you think this is another issue.
Oh, wait! You are using an int-typed SparseArray as a feature?
Closed in favor of #2302; we decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.
Thanks for the comments @StrikerRUS. The int type is only used in the example; in my real case it is float64. But yes, the SparseArray is used as a feature in pandas. Also, an update for @jameslamb: you are right, directly feeding a csc_matrix to Dataset and then to lgb.train doesn't incur huge memory spikes. This effectively solves my problem. Thanks a lot!
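A sketch of that workaround, with made-up feature names, data, and parameters (assuming all feature columns are pandas sparse so the frame can be exported via `.sparse.to_coo()`):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from scipy import sparse

# Illustrative all-sparse feature frame plus a binary label.
n_rows, n_cols = 10_000, 50
X_df = pd.DataFrame({
    f"f{i}": pd.arrays.SparseArray(
        np.random.binomial(1, 0.01, n_rows).astype(np.float64), fill_value=0.0
    )
    for i in range(n_cols)
})
y = np.random.binomial(1, 0.3, n_rows)

# Hand LightGBM a SciPy CSC matrix instead of the DataFrame itself,
# so nothing gets densified on the way into Dataset / train.
X = sparse.csc_matrix(X_df.sparse.to_coo())

train_set = lgb.Dataset(X, label=y)
booster = lgb.train({"objective": "binary", "verbosity": -1}, train_set, num_boost_round=10)
```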
It looks like lightgbm will attempt to convert sparse arrays into dense numpy arrays internally. When the (converted) dense data frame is huge, this may cause memory issues.
A smaller sample pandas DataFrame containing sparse arrays is:
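(The original snippet is not reproduced here; the frame below is only an illustrative stand-in with made-up column names, using int-typed sparse columns as in the example discussed above.)

```python
import numpy as np
import pandas as pd

# One ordinary dense column next to two mostly-zero pandas sparse columns.
df = pd.DataFrame({
    "dense_feature": np.random.rand(1_000),
    "sparse_feature_1": pd.arrays.SparseArray(
        np.random.binomial(1, 0.01, 1_000), fill_value=0
    ),
    "sparse_feature_2": pd.arrays.SparseArray(
        np.random.binomial(1, 0.01, 1_000), fill_value=0
    ),
})
print(df.dtypes)                   # sparse columns report a Sparse[int, 0] dtype
print(df.memory_usage(deep=True))  # sparse columns take far less memory than dense ones
```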
In my real case, the pandas DataFrame consists of about 15k sparse-array columns and 1 million rows. The total memory for this DataFrame is < 1 GB. However, when fed to LightGBM training, it raises a memory error.
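For scale, a back-of-the-envelope estimate of what that frame would occupy once densified (assuming float64 values, as mentioned above):

```python
rows, cols = 1_000_000, 15_000
dense_bytes = rows * cols * 8    # 8 bytes per float64 value
print(dense_bytes / 1024**3)     # roughly 112 GiB once dense, versus < 1 GB in sparse form
```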
My question is: is it, in theory, impossible to use pandas sparse arrays in training without internally converting them to dense arrays?
A noob thought: if LightGBM is using bins, say max_bin=16, can it use sparse arrays efficiently?