Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames #113
Comments
Yes, I agree 100%
I'd suggest using 2 arrays ("matrices") instead of 1. The reason is that you may want to have separate NumPy
Really looking forward to this!
Perhaps it might be best to use
pandas comes with quite a bit of overhead. sklearn doesn't use pandas; I don't think it's necessary for TPOT to use it either.
This change will be in the 0.5 release.
Since we don't maintain the column names any more, it seems that we could replace the pandas DataFrames in our pipeline structure with numpy matrices. We're always changing the data into numpy matrices anyway when passing them to the sklearn operations, so I'm not seeing the point of using pandas DataFrames any more.
This might make TPOT more memory efficient, as we won't introduce DataFrame overhead either.
To make this happen, we would need to:
- Use `self` variables (in place of having a `group` column)
- Assume the `class` column is always the last entry in the matrix (in place of having a `class` column)
- Assume the `guess` column is always the second-to-last entry in the matrix (in place of having a `guess` column)

I believe that this would also make #29 much easier to implement.
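
A rough sketch of what that positional convention might look like with plain NumPy arrays. The toy data and the names `training_indices`, `testing_indices`, and `split_columns` are hypothetical and only for illustration; none of this is taken from TPOT's code.

```python
import numpy as np

# Toy data: 6 samples, 3 feature columns (made up for illustration).
rng = np.random.default_rng(0)
features = rng.random((6, 3))
classes = np.array([0, 1, 0, 1, 1, 0], dtype=float)
guesses = np.zeros(6)  # placeholder predictions

# Positional convention from the list above:
# features first, 'guess' second-to-last, 'class' last.
data = np.column_stack([features, guesses, classes])

# Stand-ins for the `self` variables that would replace the 'group' column.
training_indices = np.array([0, 1, 2, 3])
testing_indices = np.array([4, 5])


def split_columns(matrix):
    """Split a matrix into (features, guess, class) purely by position."""
    return matrix[:, :-2], matrix[:, -2], matrix[:, -1]


X, guess, y = split_columns(data)
X_train, y_train = X[training_indices], y[training_indices]
X_test, y_test = X[testing_indices], y[testing_indices]
```

One thing this makes visible: a single 2-D array has one dtype, so the class labels end up stored as floats alongside the features. Keeping the features and the labels in two separate arrays, as suggested in the comments, would avoid that.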
Any downsides to this change that we can think of?