Generate partition scope DLO strategy and persist to DLO table #284

Merged

Conversation

@jiang95-dev (Collaborator) commented on Feb 4, 2025

Summary

Generate partition-scope DLO strategies and persist them to the partition-level DLO table. The new DLO table contains two new columns, partitionId and partitionColumns. The latter is intended only for analysis purposes and won't be used as a filter in the execution app. At this stage, we generate both table-level and partition-level strategies for each table.
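
For analysis, partition-level strategies can be queried straight from the new table. Below is a minimal sketch of such a query, assuming the column names shown in the test output further down (fqtn, partition_id, partition_columns, estimated_compute_cost, estimated_file_count_reduction, file_size_entropy):

scala> // Sketch only: rank partitions by estimated file-count reduction.
scala> spark.sql("select fqtn, partition_id, partition_columns, estimated_compute_cost, estimated_file_count_reduction from u_openhouse.dlo_partition_strategies order by estimated_file_count_reduction desc").show(false)

Because partition_columns is persisted for analysis only, queries like this are purely observational and do not affect how the execution app filters strategies.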

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Testing Done

  • Manually tested on local docker setup. Please include the commands run and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Tested on the test cluster. The dlo_partition_strategies table was created and has one row for each partition. The estimated file count reduction for each partition is 1.0, and partition_id and partition_columns were set correctly.

scala> spark.sql("use openhouse")

scala> spark.sql("create table u_openhouse.dlo_run (ts timestamp, id int, data string) partitioned by (days(ts), id)").show()

scala> spark.sql("insert into u_openhouse.dlo_run values (current_timestamp(), 1, 'data')").show()
scala> spark.sql("insert into u_openhouse.dlo_run values (current_timestamp(), 1, 'data')").show()

scala> spark.sql("insert into u_openhouse.dlo_run values (date_add(current_timestamp(), 1), 1, 'data')").show()
scala> spark.sql("insert into u_openhouse.dlo_run values (date_add(current_timestamp(), 1), 1, 'data')").show()


scala> spark.sql("show tables in u_openhouse").show(100, false)
+-----------+------------------------+-----------+
|namespace  |tableName               |isTemporary|
+-----------+------------------------+-----------+
|u_openhouse|dlo_run                 |false      |
|u_openhouse|dlo_strategies          |false      |
|u_openhouse|dlo_partition_strategies|false      |
+-----------+------------------------+-----------+


scala> spark.sql("select * from u_openhouse.dlo_partition_strategies").show(false)
+-------------------+-------------+-----------------+----------------------+----------------------+------------------------------+----------------------+
|fqtn               |partition_id |partition_columns|timestamp             |estimated_compute_cost|estimated_file_count_reduction|file_size_entropy     |
+-------------------+-------------+-----------------+----------------------+----------------------+------------------------------+----------------------+
|u_openhouse.dlo_run|2025-02-13, 1|ts_day, id       |2025-02-12 10:05:02.08|0.5                   |1.0                           |2.77080845024704256E17|
|u_openhouse.dlo_run|2025-02-12, 1|ts_day, id       |2025-02-12 10:05:02.08|0.5                   |1.0                           |2.77080845024704256E17|
+-------------------+-------------+-----------------+----------------------+----------------------+------------------------------+----------------------+


scala> spark.sql("select * from u_openhouse.dlo_strategies").show(false)
+-------------------+-----------------------+----------------------+------------------------------+----------------------+
|fqtn               |timestamp              |estimated_compute_cost|estimated_file_count_reduction|file_size_entropy     |
+-------------------+-----------------------+----------------------+------------------------------+----------------------+
|u_openhouse.dlo_run|2025-02-12 10:04:07.489|0.5                   |3.0                           |2.77080845024704256E17|
+-------------------+-----------------------+----------------------+------------------------------+----------------------+
kubectl create job --from=cronjob/jobs-cron-data-layout-strategy-generation dlo-adhoc-test

2025-02-12 10:01:38 INFO  JobsScheduler:155 - Submitting and running 1 jobs based on the job type: DATA_LAYOUT_STRATEGY_GENERATION
2025-02-12 10:01:38 INFO  OperationTask:67 - Launching job for TableMetadata(super=Metadata(creator=openhouse), dbName=u_openhouse, tableName=dlo_run, creationTimeMs=1739347541314, isPrimary=true, isTimePartitioned=true, isClustered=true, jobExecutionProperties={}, retentionConfig=null, historyConfig=null, replicationConfig=null)
2025-02-12 10:01:38 INFO  OperationTask:93 - Launched a job with id DATA_LAYOUT_STRATEGY_GENERATION_u_openhouse_dlo_run_a88fffd3-07e7-4a45-bd28-a05e6c17638a for TableMetadata(super=Metadata(creator=openhouse), dbName=u_openhouse, tableName=dlo_run, creationTimeMs=1739347541314, isPrimary=true, isTimePartitioned=true, isClustered=true, jobExecutionProperties={}, retentionConfig=null, historyConfig=null, replicationConfig=null)

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@sumedhsakdeo (Collaborator) left a comment

Please update the description with the testing done on the local docker setup.

@jiang95-dev force-pushed the lejiang/add-dlo-partition-table branch from 619844f to 9012af8 on February 4, 2025 at 23:28
@teamurko (Collaborator) left a comment

Thank you @jiang95-dev, overall this looks great; I have a few minor comments.

@jiang95-dev force-pushed the lejiang/add-dlo-partition-table branch from 4ef52f2 to 2ea8a2e on February 12, 2025 at 10:15
@sumedhsakdeo (Collaborator) left a comment

Great work @jiang95-dev, looking forward to running analysis on the collected data.

@teamurko (Collaborator) left a comment

Ok, minor suggestion to reduce code duplication

@jiang95-dev (Collaborator, Author) replied to teamurko's suggestion:

Will do the refactoring in a future PR.

@jiang95-dev merged commit 81ca32e into linkedin:main on Feb 12, 2025
1 check passed