Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] pivot #2183

Merged
merged 7 commits into from
May 1, 2024
Merged

[FEAT] pivot #2183

merged 7 commits into from
May 1, 2024

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Apr 26, 2024

Pivot with single groupby (aka index column), pivot column, and values column.

Todo:

  • Enhance test coverage
  • Allow option to specify values to pivot on
  • Allow for multiple groupby, pivot, and values (Separate PR)
  • Docs

@github-actions github-actions bot added enhancement New feature or request documentation Improvements or additions to documentation labels Apr 26, 2024
@colin-ho colin-ho marked this pull request as ready for review April 29, 2024 21:07
Copy link

codecov bot commented Apr 29, 2024

Codecov Report

Attention: Patch coverage is 97.14286% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 85.56%. Comparing base (a0fd6ec) to head (0f9ae96).
Report is 11 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2183      +/-   ##
==========================================
+ Coverage   85.10%   85.56%   +0.45%     
==========================================
  Files          68       71       +3     
  Lines        7391     7549     +158     
==========================================
+ Hits         6290     6459     +169     
+ Misses       1101     1090      -11     
Files Coverage Δ
daft/dataframe/dataframe.py 90.16% <100.00%> (+1.07%) ⬆️
daft/execution/execution_step.py 93.58% <100.00%> (+0.41%) ⬆️
daft/execution/rust_physical_plan_shim.py 96.00% <100.00%> (+0.16%) ⬆️
daft/logical/builder.py 90.62% <100.00%> (+0.22%) ⬆️
daft/table/micropartition.py 90.54% <100.00%> (+0.21%) ⬆️
daft/table/table.py 57.05% <50.00%> (+0.92%) ⬆️

... and 12 files with indirect coverage changes


Example:
>>> df = daft.from_pydict(
{"group": ["A", "A", "B", "B"], "pivot": [1, 1, 1, 2], "value": [1, 2, 3, 4]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get a better motivating example? Something simple maybe like:

"version": [...],
"benchmark_name": [...],
"walltime": [...],
df.pivot("version", "benchmark_name", "walltime", "min")

None,
))
}
_ => unreachable!(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a message here, otherwise when we panic we won't be able to figure out what happened without a backtrace.

@@ -48,22 +48,6 @@ impl Aggregate {
})
}

pub(crate) fn schema(&self) -> SchemaRef {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema for an agg op is already created in the constructor , so this function is redundant. I believe it is leftover code from a previous change, and I'm removing this as a drive by. (I found this cuz I was modeling the pivot logical op after this)

import pytest


@pytest.mark.parametrize("repartition_nparts", [1, 2, 4])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try [1, 2, 5] because the 5 will also test empty partitions

};
let pivot_names_as_series = pivot_keys_series.to_str_values()?;
let pivot_names = pivot_names_as_series.utf8()?.as_arrow();
if pivot_names.len() > names.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need a strong guarantee here that Set(pivot_names).in(Set(names))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily, for example in an empty partition, pivot_names will be empty, but we still need to create empty columns for these names otherwise the schema won't match with the other non-empty partitions.

)));
}

let mut pivot_name_to_pivot_key_idx = Vec::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably make this a HashMap or something

@@ -744,6 +700,139 @@ pub(super) fn translate_single_logical_node(
}
}

fn populate_aggregation_stages(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add function docstrings to explain what it does and remove mut

@colin-ho colin-ho requested a review from jaychia May 1, 2024 03:54
let pivoted = empty_table.pivot(group_by, pivot_col, values_col, names)?;
Ok(MicroPartition::new_loaded(
pivoted.schema.clone(),
vec![pivoted].into(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just vec![] as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smart

@colin-ho colin-ho merged commit 3e5da66 into main May 1, 2024
29 checks passed
@colin-ho colin-ho deleted the colin/pivot branch May 1, 2024 23:33
@colin-ho colin-ho mentioned this pull request May 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants