categorical and ordinal feature specification and encoding #121

Closed
3 tasks done
hjwilli opened this issue Jan 8, 2019 · 5 comments

@hjwilli
Collaborator

hjwilli commented Jan 8, 2019

Handle datasets that have categorical or ordinal features, and preprocess the data appropriately for the algorithm being run (see label vs. one-hot encoding).

  • @hjwilli Update the dataset upload API '/api/v1/datasets' to allow the specification of categorical or ordinal features
  • Allow projects.json to specify the strategy that should be used when preprocessing categorical data for a particular algorithm
  • Update machine to preprocess ordinal and categorical features according to the strategy and the dataset field specification. Assume that machine can pull a version of the data from the lab API that has already been label encoded
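
To make the label vs. one-hot distinction concrete, here is a minimal pure-Python sketch. The helper names and the sample column are illustrative assumptions, not PennAI code. Label encoding yields one integer column (which implies an order), while one-hot encoding yields one indicator column per distinct value.

```python
def label_encode(values):
    """Map each distinct value to an integer code (codes follow sorted order)."""
    codes = {value: index for index, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values], codes


def one_hot_encode(values):
    """Expand each value into a 0/1 indicator vector, one slot per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


# Hypothetical categorical column
colors = ["red", "green", "blue", "green"]

labels, codes = label_encode(colors)
onehot, categories = one_hot_encode(colors)
```

Which of the two forms suits a given algorithm is exactly what the per-algorithm strategy in projects.json would decide.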

Needed for some features of #119

hjwilli added this to the Open Source PennAI milestone Jan 11, 2019
@hjwilli
Collaborator Author

hjwilli commented Jan 18, 2019

Should the dataset profiles work off label-encoded data or one-hot-encoded data?

@weixuanfu
Contributor

weixuanfu commented Jan 25, 2019

  • Use a pipeline with ColumnTransformer
  • Two ColumnTransformer steps, one for categorical features and one for ordinal features
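
A minimal sketch of that layout with scikit-learn might look like the following. The column names, the sample data, the choice of OneHotEncoder for the categorical step, and the DecisionTreeClassifier placeholder are all assumptions for illustration; the thread does not say which encoders or estimators the actual commit uses.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset with one categorical, one ordinal and one numeric column
df = pd.DataFrame({
    "cat_feat_1": ["a", "b", "a", "c"],
    "ord_feat_2": ["FIRST", "THIRD", "SECOND", "FIRST"],
    "num_feat": [0.1, 0.4, 0.35, 0.8],
    "class": [0, 1, 0, 1],
})
categorical_features = ["cat_feat_1"]
ordinal_features = {"ord_feat_2": ["FIRST", "SECOND", "THIRD"]}

preprocess = ColumnTransformer(
    transformers=[
        # Step 1: one indicator column per category value
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        # Step 2: integer codes that follow the declared value order
        ("ordinal",
         OrdinalEncoder(categories=list(ordinal_features.values())),
         list(ordinal_features)),
    ],
    remainder="passthrough",  # numeric columns pass through unchanged
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", DecisionTreeClassifier(random_state=0)),  # placeholder estimator
])

X = df.drop(columns="class")
y = df["class"]
model.fit(X, y)

# Inspect the matrix the estimator actually sees
encoded = model.named_steps["preprocess"].transform(X)
```

Keeping the encoders inside the pipeline means the same preprocessing is applied at fit time and at prediction time.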

@weixuanfu
Contributor

weixuanfu commented Feb 1, 2019

  • Add an ordinal map to the files API and use it in OrdinalEncoder
  • Add "categorical_encoding_strategy" to projects.json for each ML algorithm
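
As a rough sketch of how these two pieces could fit together: the strategy names ("OneHotEncoder", "OrdinalEncoder"), the shape of the projects.json entry, and both helper functions below are hypothetical, since the thread does not show the actual schema. Only the scikit-learn classes and the `categories` parameter of OrdinalEncoder are real API.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical excerpt of one algorithm entry in projects.json
algorithm_entry = {
    "name": "DecisionTreeClassifier",
    "categorical_encoding_strategy": "OrdinalEncoder",
}


def make_categorical_encoder(entry, default="OneHotEncoder"):
    """Pick the encoder for categorical columns from the algorithm's strategy."""
    strategy = entry.get("categorical_encoding_strategy", default)
    if strategy == "OneHotEncoder":
        return OneHotEncoder(handle_unknown="ignore")
    if strategy == "OrdinalEncoder":
        return OrdinalEncoder()
    raise ValueError(f"unknown categorical_encoding_strategy: {strategy!r}")


def make_ordinal_encoder(ordinal_map):
    """Build an OrdinalEncoder whose codes follow the order given in the map."""
    columns = list(ordinal_map)
    encoder = OrdinalEncoder(categories=[ordinal_map[col] for col in columns])
    return columns, encoder


# Hypothetical ordinal map, as it might come back from the files API
ordinal_map = {"ord_feat_2": ["FIRST", "SECOND", "THIRD"]}
columns, ordinal_encoder = make_ordinal_encoder(ordinal_map)
codes = ordinal_encoder.fit_transform([["THIRD"], ["FIRST"], ["SECOND"]])
```

Passing the map through `categories` matters because OrdinalEncoder otherwise derives the order from the sorted unique values, which need not match the intended ranking.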

@hjwilli
Collaborator Author

hjwilli commented Feb 13, 2019

Hi @weixuanfu, this looks great. The only update I see is that, in the API response for getting a dataset, the ordinal and categorical features will be in a slightly different format.

Once the code has been updated for this API spec and there is a unit test that mocks the API response with ordinal features, this should be ready for a pull request or a direct merge into master.
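
One way such a unit test could be sketched, using only the standard library: `fetch_ordinal_features` is a hypothetical client helper (not a function from the PennAI code base), and the payload is a trimmed version of the response format described in this comment.

```python
import json
import urllib.request
from unittest.mock import patch


def fetch_ordinal_features(dataset_id, base_url="http://lab:5080"):
    """Hypothetical helper: GET the dataset and return its ordinal feature map."""
    url = f"{base_url}/api/v1/datasets/{dataset_id}"
    with urllib.request.urlopen(url) as resp:
        dataset = json.loads(resp.read().decode("utf-8"))
    return dataset["files"][0].get("ordinal_features", {})


def test_fetch_ordinal_features_with_mocked_api():
    payload = {
        "files": [{
            "filename": "appendicitis.csv",
            "ordinal_features": {"ord_feat_2": ["FIRST", "SECOND", "THIRD"]},
        }]
    }
    # Replace the network call so the test never contacts the lab server
    with patch("urllib.request.urlopen") as mock_urlopen:
        mock_resp = mock_urlopen.return_value.__enter__.return_value
        mock_resp.read.return_value = json.dumps(payload).encode("utf-8")
        result = fetch_ordinal_features("5bf4841c9fc83c002cdbf5e6")

    mock_urlopen.assert_called_once_with(
        "http://lab:5080/api/v1/datasets/5bf4841c9fc83c002cdbf5e6"
    )
    assert result == {"ord_feat_2": ["FIRST", "SECOND", "THIRD"]}
```

The test function follows the pytest naming convention but can also be called directly.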

Example response from a GET to http://lab:5080/api/v1/datasets/$datasetId:

{
    "_id": "5bf4841c9fc83c002cdbf5e6",
    "name": "appendicitis",
    "username": "testuser",
    "files": [
        {
            "_id": "5bf4841c9fc83c002cdbf5e8",
            "_raw_id": "5bf4841c9fc83c00adfadfe",
            "filename": "appendicitis.csv",
            "mimetype": "text/csv",
            "dependent_col": "class",
            "categorical_features": ["cat_feat_1", "cat_feat_2"],
            "ordinal_features": {"ord_feat_1": ["MALE", "FEMALE"], "ord_feat_2": ["FIRST", "SECOND", "THIRD"]},
            "timestamp": 1542751267786
        }
    ],
    "metafeatures": {
       ....
    }
}
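
For illustration, a consumer of this response could pull out the feature specification roughly as follows. This is a sketch under assumptions: `extract_feature_spec` is a hypothetical helper, the sample is trimmed to the fields it needs, and treating missing keys as "no such features" is a guess about how datasets without categorical or ordinal columns are reported.

```python
import json

# Trimmed copy of the example response (metafeatures and ids omitted)
response_text = """
{
    "name": "appendicitis",
    "files": [
        {
            "filename": "appendicitis.csv",
            "dependent_col": "class",
            "categorical_features": ["cat_feat_1", "cat_feat_2"],
            "ordinal_features": {"ord_feat_1": ["MALE", "FEMALE"],
                                 "ord_feat_2": ["FIRST", "SECOND", "THIRD"]}
        }
    ]
}
"""


def extract_feature_spec(dataset, file_index=0):
    """Return the target column and the categorical/ordinal specs for one file."""
    file_info = dataset["files"][file_index]
    return {
        "target": file_info["dependent_col"],
        # Assumption: absent keys mean the file has no such features
        "categorical": list(file_info.get("categorical_features", [])),
        "ordinal": dict(file_info.get("ordinal_features", {})),
    }


spec = extract_feature_spec(json.loads(response_text))
```

The lists under "ordinal" keep the order given in the response, so they can be handed to an encoder as the ranking of values.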

@weixuanfu
Contributor

@hjwilli OK, I will push a commit to support this API format.

weixuanfu added a commit that referenced this issue Feb 19, 2019
weixuanfu added a commit that referenced this issue Feb 19, 2019
weixuanfu added a commit that referenced this issue Feb 19, 2019
weixuanfu added a commit that referenced this issue Feb 19, 2019
weixuanfu added a commit that referenced this issue Feb 19, 2019
hjwilli added a commit that referenced this issue Feb 26, 2019
hjwilli added a commit that referenced this issue Feb 26, 2019
cat/ordinal api tests

References #121
hjwilli added a commit that referenced this issue Feb 26, 2019
Need to update tests/validation to fail for strings in fields that have
not been marked as cat/ord, and for ord cols with values that were not
explicitly provided

References #121
hjwilli added a commit that referenced this issue Feb 27, 2019
cat/ord datasets can be uploaded via api

References #121
hjwilli added a commit that referenced this issue Feb 27, 2019
validation for datasets that have string data in cols not defined as
categorical, and for ordinal cols that contain values not explicitly
defined

References #121
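
The two validation rules from the commit message above can be sketched in plain Python. This is not the committed implementation: the function names, the row format (dicts of strings, as `csv.DictReader` would produce), and the error wording are assumptions.

```python
def _is_number(value):
    """True if the value parses as a float."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


def validate_rows(rows, categorical_features, ordinal_features):
    """Collect validation errors for a dataset given as a list of row dicts.

    Rule 1: columns not marked categorical or ordinal must be numeric.
    Rule 2: ordinal columns may only hold values listed in their definition.
    """
    errors = []
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            if column in ordinal_features:
                if value not in ordinal_features[column]:
                    errors.append(
                        f"row {row_number}: '{value}' is not a declared value "
                        f"of ordinal column '{column}'"
                    )
            elif column not in categorical_features and not _is_number(value):
                errors.append(
                    f"row {row_number}: non-numeric value '{value}' in column "
                    f"'{column}', which is not marked categorical or ordinal"
                )
    return errors


# Hypothetical rows: the second one breaks both rules
rows = [
    {"age": "34", "cat_feat_1": "a", "ord_feat_2": "FIRST"},
    {"age": "old", "cat_feat_1": "b", "ord_feat_2": "FOURTH"},
]
errors = validate_rows(
    rows,
    categorical_features=["cat_feat_1"],
    ordinal_features={"ord_feat_2": ["FIRST", "SECOND", "THIRD"]},
)
```

Returning a list of messages rather than raising on the first problem lets an upload endpoint report every offending cell at once.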