Fix DL estimators for getting the output df schema #2611

Merged
merged 12 commits into from
Feb 4, 2021

@irasit irasit commented Jan 21, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • [ Y] Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

#2373 introduced a significant delay when generating the output schema. Instead, we can derive the output schema directly from the input DataFrame schema and the label columns.

Fixes #2536.
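
To illustrate the idea in the description, here is a minimal sketch of deriving an output schema from the input schema and the label columns without running any job. It is not the code from this PR: the name `sketch_output_schema` is hypothetical, the schema is modeled as a plain list of `(column name, type name)` pairs instead of a Spark `StructType`, and it assumes that each output column takes the type of its matching label column.

```python
from typing import List, Tuple

# Hypothetical stand-in for a Spark schema: an ordered list of
# (column name, type name) pairs. A real implementation would work
# with Spark's schema types instead.
Schema = List[Tuple[str, str]]


def sketch_output_schema(input_schema: Schema,
                         label_cols: List[str],
                         output_cols: List[str]) -> Schema:
    """Return the input schema extended with one output column per label column."""
    # Assumption: label and output columns correspond 1:1, in order.
    if len(label_cols) != len(output_cols):
        raise ValueError(
            "label_cols and output_cols must have the same length, "
            f"got {len(label_cols)} and {len(output_cols)}"
        )

    types = dict(input_schema)
    missing = [col for col in label_cols if col not in types]
    if missing:
        raise ValueError(f"label columns not found in input schema: {missing}")

    # Copy the input schema so the caller's list is not mutated.
    output_schema = list(input_schema)
    for label_col, output_col in zip(label_cols, output_cols):
        # Assumption: a prediction column has the same type as its label column.
        output_schema.append((output_col, types[label_col]))
    return output_schema
```

Because the result is computed purely from metadata, its cost depends only on the number of columns, not on the size of the data.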

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.



def get_spark_df_output_schema(input_df_schema, label_cols, output_cols):
    if len(label_cols) != len(output_cols):
Collaborator Author

@tgaddair Please check here. Is it OK to always assume that label_cols and output_cols match 1:1?

@github-actions

This comment has been minimized.

irasit and others added 11 commits February 2, 2021 00:41
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Yana Shchyokotova <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>
Signed-off-by: Peng Zhang <[email protected]>

@tgaddair tgaddair left a comment

LGTM!

@tgaddair tgaddair merged commit ea692ad into horovod:master Feb 4, 2021

github-actions bot commented Feb 5, 2021

Unit Test Results

   691 files  + 18     691 suites  + 18   4h 43m 32s ⏱️ + 6m 47s
   539 tests  +  1     510 ✔️  +  1      29 💤  ±  0   0 ❌ ± 0
14 190 runs   +318  10 730 ✔️  +196   3 460 💤  +122   0 ❌ ± 0

Results for commit ea692ad. ± Comparison against base commit 2a775b2.

@irasit irasit deleted the df_schema branch February 5, 2021 03:48

Successfully merging this pull request may close these issues.

Performance degredation with Spark Estimator during schema inference
5 participants