This is a centralized issue for specifying how a dataframe (pandas, R's dataframe) can be converted into a DMatrix. Dataframes are a useful data source. Such a specification would allow direct data ingestion from a dataframe, avoid extra memory copies, and possibly ease external memory integration.
Currently this is straightforward for continuous features. It is less obvious for categorical features and sparse input.
Goal
Let us not aim to do complicated things, such as automatically indexing all the factors (categorical features) or accepting string input types.
Instead, have a _minimum_ specification of how to represent sparse input and categorical features, so that they can be quickly converted to a sparse matrix type. Let the dataframe solutions do jobs such as feature engineering.
Example Proposal 1
All categorical columns must already be mapped to unique integers, with a disjoint range per column. Column C1 will be in [0, n) and column C2 will be in [n, n+m), where n is the number of unique categories in C1 and m is the number of unique categories in C2.
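To make Proposal 1 concrete, here is a minimal sketch in plain Python of how rows of globally indexed codes could be turned into the three CSR arrays of a one-hot sparse matrix. The function name `global_codes_to_csr` and the sample data are hypothetical and only for illustration; this is not the DMatrix API.

```python
def global_codes_to_csr(rows, num_features):
    """Build CSR arrays (indptr, indices, data) for a one-hot matrix.

    Assumes each entry in a row is already a global feature index,
    e.g. C1 in [0, n) and C2 in [n, n + m), so num_features = n + m.
    """
    indptr = [0]
    indices = []
    data = []
    for row in rows:
        for code in row:
            if not 0 <= code < num_features:
                raise ValueError(f"code {code} outside [0, {num_features})")
            indices.append(code)  # column index of the non-zero entry
            data.append(1.0)      # one-hot value
        indptr.append(len(indices))  # end of this row's entries
    return indptr, indices, data


# Hypothetical data: C1 has n = 3 categories, C2 has m = 2 categories.
# C1 codes lie in [0, 3), C2 codes lie in [3, 5).
rows = [(0, 3), (2, 4), (1, 3)]
indptr, indices, data = global_codes_to_csr(rows, num_features=5)
```

Because the dataframe side has already assigned disjoint ranges, the conversion needs no extra metadata beyond the total number of features.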
Example Proposal 2
Map each existing categorical column to unique integers independently: C1 will be in [0, n) and C2 will be in [0, m). When constructing the DMatrix, also pass the size of each column, [n, m], to the constructor.
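As a rough sketch of Proposal 2, the constructor could derive a per-column offset from the passed sizes with a prefix sum and shift each local code by it. The function name `local_codes_to_csr` and the sample data are hypothetical, and the output is a plain CSR triple rather than an actual DMatrix.

```python
from itertools import accumulate


def local_codes_to_csr(rows, sizes):
    """Build CSR arrays from per-column local codes plus column sizes.

    rows:  sequences of codes, where column j holds values in [0, sizes[j]).
    sizes: number of categories per column, e.g. [n, m].
    Returns (indptr, indices, data, num_features).
    """
    # Exclusive prefix sum: sizes [n, m] give offsets [0, n].
    offsets = [0] + list(accumulate(sizes))[:-1]
    indptr = [0]
    indices = []
    data = []
    for row in rows:
        for code, size, offset in zip(row, sizes, offsets):
            if not 0 <= code < size:
                raise ValueError(f"code {code} outside [0, {size})")
            indices.append(offset + code)  # shift into the global range
            data.append(1.0)               # one-hot value
        indptr.append(len(indices))        # end of this row's entries
    return indptr, indices, data, sum(sizes)


# Hypothetical data: C1 in [0, 3), C2 in [0, 2), sizes passed as [n, m].
local_rows = [(0, 0), (2, 1), (1, 0)]
indptr2, indices2, data2, num_features2 = local_codes_to_csr(local_rows, [3, 2])
```

The resulting global indices match the disjoint ranges described in Proposal 1; the difference is only that the offset bookkeeping moves from the dataframe side into the constructor.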
Could Feather/Arrow be of any use here? https://blog.rstudio.org/2016/03/29/feather/
It's supposed to be a light, language-agnostic, fast, and data frame-friendly format.