andygrove commented Dec 18, 2024

Which issue does this PR close?

Part of #1123

Rationale for this change

With the metrics that were added in #1175, we can now see how much time is spent in different areas of shuffle write. The next step is to try to optimize this code, which will be easier if we have microbenchmarks for each area:

  • Evaluating partitioning expressions (there is an opportunity for a small saving with a fast path for simple column references)
  • Hashing and calculating partition ids
  • Repartitioning the input batches
  • Encoding and compressing
  • Spilling
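
To make the "hashing and calculating partition ids" and "repartitioning" steps above concrete, here is a minimal sketch of hash partitioning over a single integer key column. It is an illustration only, not the code in this PR: `partition_ids` and `rows_per_partition` are hypothetical names, and the standard library's `DefaultHasher` stands in for whatever hash function the shuffle writer actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: map each key to a partition id in `0..num_partitions`.
/// `DefaultHasher` is a stand-in, not the hash function used by the shuffle writer.
fn partition_ids(keys: &[i64], num_partitions: usize) -> Vec<usize> {
    assert!(num_partitions > 0, "need at least one partition");
    keys.iter()
        .map(|key| {
            let mut hasher = DefaultHasher::new();
            key.hash(&mut hasher);
            // Reduce the 64-bit hash to a partition id.
            (hasher.finish() % num_partitions as u64) as usize
        })
        .collect()
}

/// Hypothetical helper: group row indices by partition id, which is the
/// information needed to repartition an input batch.
fn rows_per_partition(ids: &[usize], num_partitions: usize) -> Vec<Vec<usize>> {
    let mut out = vec![Vec::new(); num_partitions];
    for (row, &pid) in ids.iter().enumerate() {
        out[pid].push(row);
    }
    out
}
```

Keeping the two steps as separate functions mirrors the motivation above: each stage can then be timed in isolation.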

What changes are included in this PR?

  • Refactor shuffle write code to allow for microbenchmarking
  • Add new benchmarks
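
As a rough sketch of what a microbenchmark for one isolated stage looks like, the following uses only the Rust standard library. It is not the benchmark code added in this PR; `time_per_iter` and `sum_of_squares` are hypothetical names, and the workload is a stand-in for a real shuffle stage. A real benchmark would normally use a dedicated harness that handles warm-up and statistical noise.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical harness: run `f` `iters` times and return the mean
/// wall-clock time per call.
fn time_per_iter<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the compiler from optimizing the call away.
        black_box(f());
    }
    start.elapsed() / iters
}

/// Stand-in workload; a real benchmark would call one shuffle stage here.
fn sum_of_squares(values: &[u64]) -> u64 {
    values
        .iter()
        .map(|v| v.wrapping_mul(*v))
        .fold(0u64, |acc, x| acc.wrapping_add(x))
}
```

A call such as `time_per_iter(1000, || sum_of_squares(&data))` then gives a per-call estimate for that stage alone, which is only possible once the stage is callable on its own.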

How are these changes tested?

No functional changes. Rely on existing tests.
