Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

Validate input dataset of ml.clustering

Why are the changes needed?

when input dataset contains invalid values, fail fast and output relevant error message

Does this PR introduce any user-facing change?

No

How was this patch tested?

added testsuites

@github-actions github-actions bot added the ML label Mar 28, 2022
@zhengruifeng
Copy link
Contributor Author

cc @srowen @huaxingao

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically the same change as last time? that's fine if this is the remainder.

@zhengruifeng
Copy link
Contributor Author

@srowen ALS is also added here, now all classification/regression/clustering/recommendation support dataset validation. I think there will be another PR to cleanup unused internal functions.

@zhengruifeng zhengruifeng changed the title [SPARK-38669][ML] Validate input dataset of ml.clustering [SPARK-38669][ML] Validate input dataset of ml.clustering and ml.recommendation Mar 29, 2022
@srowen
Copy link
Member

srowen commented Mar 29, 2022

The approach seems OK; looks like some related text failures though

@zhengruifeng
Copy link
Contributor Author

retest this please

@github-actions github-actions bot added the BUILD label Mar 30, 2022
@srowen
Copy link
Member

srowen commented Mar 30, 2022

I'm still seeing ...

annotations failed mypy checks:
python/pyspark/pandas/ml.py:62: error: Incompatible types in assignment (expression has type "Index", variable has type "MultiIndex")  [assignment]
python/pyspark/pandas/frame.py:6133: error: unused "type: ignore" comment
python/pyspark/pandas/frame.py:6292: error: Incompatible types in assignment (expression has type "Index", variable has type "MultiIndex")  [assignment]
Found 3 errors in 2 files (checked 3[25](https://github.com/zhengruifeng/spark/runs/5751276900?check_suite_focus=true#step:15:25) source files)

But I don't think that's related? I wonder if we just need to try tomorrow

@zhengruifeng
Copy link
Contributor Author

I also think it should be irrelevant, let me rebase the PR.

@zhengruifeng zhengruifeng force-pushed the clustering_validate_training_dataset branch from 0ebde3b to 38be19f Compare March 31, 2022 03:27
@zhengruifeng zhengruifeng force-pushed the clustering_validate_training_dataset branch from 38be19f to 03e61cc Compare March 31, 2022 07:43
@srowen
Copy link
Member

srowen commented Apr 2, 2022

Merged to master

@srowen srowen closed this in af942d7 Apr 2, 2022
@zhengruifeng zhengruifeng deleted the clustering_validate_training_dataset branch April 3, 2022 02:24
@zhengruifeng
Copy link
Contributor Author

thanks for reviewing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants