feat: Add narwhals.struct top level function
#3261
base: main
Conversation
So far this is what this PR does, I'll attempt polars/arrow next:

```python
df_native_pd = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pd = nw.from_native(df_native_pd)
df_struct_pd = df_pd.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
```

What I have not yet figured out is where to place the imports, nor where to add unit tests apart from the doctests.
narwhals/_pandas_like/namespace.py
Outdated
```python
import pandas as pd  # TODO: where pd.ArrowDtype should come from?
import pyarrow.compute as pc  # TODO: where to put this import?
```
Where should these imports go? Is `ArrowDtype` available through `self`?
As a reference, something like the following would be the preferred way:
narwhals/narwhals/_pandas_like/utils.py
Lines 553 to 560 in 01aab21
```python
if isinstance_or_issubclass(dtype, dtypes.Date):
    try:
        import pyarrow as pa  # ignore-banned-import
    except ModuleNotFoundError as exc:
        # BUG: Never re-raised?
        msg = "'pyarrow>=13.0.0' is required for `Date` dtype."
        raise ModuleNotFoundError(msg) from exc
    return "date32[pyarrow]"
```
At this point, we also get these results for polars df and arrow tables:

Polars:

```python
df_native_pl = pl.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pl = nw.from_native(df_native_pl)
df_struct_pl = df_pl.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
```

Arrow:

```python
table_native_pa = pa.table({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pa = nw.from_native(table_native_pa)
df_struct_pa = df_pa.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
```
@msalvany I think some wires may have been crossed 😅 This feature is narwhals.struct (#3247).
Hi @dangotbanned. I see that the original issue is narwhals.struct (#3247).
I have started with the tests. I see that there are more backends than pandas, polars and arrow. Should we also implement the missing ones?
Hey @msalvany - thanks for the contribution 🚀 As a little side note/to expand a bit more on Dan's comment - we try to mirror the polars API, therefore we will aim to have `nw.struct` to mirror `pl.struct`. In a similar way, `nw.concat_str` mirrors `pl.concat_str`. However:
Regarding other backends:
For now you can start by xfailing them in the tests. I can see you are already xfailing certain polars versions, so you can do something along the following lines:

```diff
 def test_dryrun(constructor: Constructor, *, request: pytest.FixtureRequest) -> None:
     if "polars" in str(constructor) and POLARS_VERSION < (1, 0, 0):
         # nth only available after 1.0
         request.applymarker(pytest.mark.xfail)
+    if any(x in str(constructor) for x in ("dask", "duckdb", "ibis", "pyspark", "sqlframe")):
+        reason = "Not supported/not implemented"
+        request.applymarker(pytest.mark.xfail(reason=reason))
```

and in those backend namespaces you can add a placeholder that marks `struct` as not implemented. I hope it helps! Let's get pandas, polars and pyarrow in first, and then we can iterate for the others 🤞🏼
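As a purely illustrative example of such a placeholder in one of those backend namespaces (the class name and message are a sketch, not the actual narwhals code or convention):

```python
class DuckDBNamespace:  # illustrative stand-in for a backend namespace
    def struct(self, *exprs):
        # Placeholder until a real implementation lands for this backend;
        # the xfail marks above keep the common tests green in the meantime.
        msg = "`struct` is not yet implemented for the DuckDB backend."
        raise NotImplementedError(msg)
```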
Hi, thanks for the clarification @FBruzzesi, I totally get it now! I have changed all `concat_struct` references to `struct`.
Hey @msalvany first and foremost, thanks for updating the PR - it looks close to the finish line 🙏🏼 I have a few comments, especially regarding tests:
Changed the title to: feat: Add narwhals.struct top level function
thanks all! just a comment on the other backends -

we should at least verify that this operation is feasible for spark/duckdb. fortunately, in this case, it looks like it's easily done with `struct_pack`:

```python
In [35]: rel = duckdb.sql("select * from values (1,4,0),(1,5,1),(2,6,2) df(a,b,i)")

In [36]: rel
Out[36]:
┌───────┬───────┬───────┐
│   a   │   b   │   i   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     1 │     4 │     0 │
│     1 │     5 │     1 │
│     2 │     6 │     2 │
└───────┴───────┴───────┘

In [37]: rel.select('a', 'b', 'i', duckdb.FunctionExpression('struct_pack', 'a', 'b'))
Out[37]:
┌───────┬───────┬───────┬──────────────────────────────┐
│   a   │   b   │   i   │      struct_pack(a, b)       │
│ int32 │ int32 │ int32 │ struct(a integer, b integer) │
├───────┼───────┼───────┼──────────────────────────────┤
│     1 │     4 │     0 │ {'a': 1, 'b': 4}             │
│     1 │     5 │     1 │ {'a': 1, 'b': 5}             │
│     2 │     6 │     2 │ {'a': 2, 'b': 6}             │
└───────┴───────┴───────┴──────────────────────────────┘
```

in pyspark it looks like it's just `struct`
Hello @MarcoGorelli, I'm going to use your example here to ask if the output we expect after `nw.struct()` is a new column containing the struct inside the original dataframe (as you showed here), or rather a new independent df with a single column containing the struct. If I understand this right, what `polars.struct()` generates is the second option, but I might be mistaken. So far, this is what I was mimicking, just let me know if it should be changed. Thanks!
this depends on whether you use `select` or `with_columns`
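For illustration, a minimal polars sketch of that distinction (narwhals aims to mirror these `select`/`with_columns` semantics):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# select keeps only what you ask for: a single-column frame holding the struct.
only_struct = df.select(pl.struct("a", "b").alias("t"))
print(only_struct.columns)  # ['t']

# with_columns appends the struct as an extra column of the original frame.
with_struct = df.with_columns(pl.struct("a", "b").alias("t"))
print(with_struct.columns)  # ['a', 'b', 't']
```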
I simply tested the pyspark `struct` function:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

data = [(1, 4, 0), (1, 5, 1), (2, 6, 2)]
columns = ["a", "b", "i"]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, columns)
df_with_struct = df.select("a", "b", "i", struct("a", "b").alias("struct_col"))
df_with_struct.show(truncate=False)
```
```python
values = df[col].tolist()
non_null_values = [v for v in values if not pd.isna(v)]
```
quick note that tolist and iterating over values in Python isn't allowed here, as it's very inefficient - you'll need to look for a way to do this using the pandas api
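A rough sketch of the kind of vectorised alternative being suggested - the frame and column name here are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0]})
col = "a"

# Stay within the pandas API instead of materialising a Python list:
# dropna() removes the nulls without an element-by-element loop,
non_null_values = df[col].dropna()

# and notna() gives the equivalent boolean mask in one vectorised call.
mask = df[col].notna()
```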
What type of PR is this? (check all applicable)
Related issues
narwhals.struct #3247
Checklist
If you have comments or can explain your changes, please do so below
TODO: