Conversation

@msalvany msalvany commented Oct 31, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

TODO:

  • pandas-like
    • doctest
    • docstring
    • code
  • polars
    • doctest
    • docstring
    • code
  • Arrow
    • doctest
    • docstring
    • code

@msalvany
Author

msalvany commented Oct 31, 2025

So far this is what this PR does; I'll attempt polars/arrow next:

import narwhals as nw
import pandas as pd

df_native_pd = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pd = nw.from_native(df_native_pd)
df_struct_pd = df_pd.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
┌─────────────────────────────────┐
|       Narwhals DataFrame        |
|---------------------------------|
|                                t|
|0   {'a': 1, 'b': 'x', 'c': True}|
|1  {'a': 2, 'b': 'y', 'c': False}|
|2   {'a': 3, 'b': 'z', 'c': True}|
└─────────────────────────────────┘

What I have not yet figured out is where to place the imports, nor where to add unit tests apart from the doctests.

Comment on lines 339 to 340
import pandas as pd # TODO: where pd.ArrowDtype should come from?
import pyarrow.compute as pc # TODO: where to put this import?
Author

Where should these imports go? Is ArrowDtype available through self?

Member

As a reference, something like the following would be the preferred way:

if isinstance_or_issubclass(dtype, dtypes.Date):
    try:
        import pyarrow as pa  # ignore-banned-import
    except ModuleNotFoundError as exc:
        # BUG: Never re-raised?
        msg = "'pyarrow>=13.0.0' is required for `Date` dtype."
        raise ModuleNotFoundError(msg) from exc
    return "date32[pyarrow]"

@msalvany
Author

msalvany commented Oct 31, 2025

At this point, we also get these results for polars DataFrames and arrow tables:

Polars:

import narwhals as nw
import polars as pl

df_native_pl = pl.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pl = nw.from_native(df_native_pl)
df_struct_pl = df_pl.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|  shape: (3, 1)   |
|  ┌───────────┐   |
|  │ t         │   |
|  │ ---       │   |
|  │ struct[2] │   |
|  ╞═══════════╡   |
|  │ {1,"x"}   │   |
|  │ {2,"y"}   │   |
|  │ {3,"z"}   │   |
|  └───────────┘   |
└──────────────────┘

Arrow:

import narwhals as nw
import pyarrow as pa

table_native_pa = pa.table({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [True, False, True],
})
df_pa = nw.from_native(table_native_pa)
df_struct_pa = df_pa.select(nw.concat_struct([nw.col("a"), nw.col("b"), nw.col("c")]).alias("t"))

┌──────────────────────────────┐
|      Narwhals DataFrame      |
|------------------------------|
|pyarrow.Table                 |
|t: struct<a: int64, b: string>|
|  child 0, a: int64           |
|  child 1, b: string          |
|----                          |
|t: [                          |
|  -- is_valid: all not null   |
|  -- child 0 type: int64      |
|[1,2,3]                       |
|  -- child 1 type: string     |
|["x","y","z"]]                |
└──────────────────────────────┘

@dangotbanned
Member

@msalvany I think some wires may have been crossed 😅

This feature is narwhals.struct, which gets its name from polars.
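
For reference, a minimal polars sketch of the function being mirrored (an illustration added for context, assuming current polars behaviour):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# polars.struct packs the listed columns into a single Struct column
df.select(pl.struct("a", "b").alias("t"))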

@msalvany
Author

msalvany commented Oct 31, 2025

@msalvany I think some wires may have been crossed 😅

Hi @dangotbanned. I see that the original issue is narwhals.struct. But in the Discord conversation with @MarcoGorelli we talked about concat_{str, list} (even though concat_list is not there yet). I thought that, in the same manner, concat_tuple would work, would it not? That's why I went for concat_struct. But whatever people find more consistent works for me.

@msalvany
Author

msalvany commented Oct 31, 2025

I have started with the tests. I see that there are more backends than pandas, polars and arrow.

(narwhals) ➜  narwhals git:(issue_3247) ✗ pytest tests/expr_and_series/concat_struct_test.py -v -k dryrun --tb=no
============================================================ test session starts ============================================================
platform darwin -- Python 3.12.12, pytest-8.4.2, pluggy-1.6.0 -- /Users/maria/Documents/OpenSource/Narwhals/narwhals/.venv/bin/python3
cachedir: .pytest_cache
Using --randomly-seed=1430920357
hypothesis profile 'default'
rootdir: /Users/maria/Documents/OpenSource/Narwhals/narwhals
configfile: pyproject.toml
plugins: xdist-3.8.0, randomly-4.0.1, hypothesis-6.142.4, env-1.2.0, cov-7.0.0
collected 7 items                                                                                                                           

tests/expr_and_series/concat_struct_test.py::test_dryrun[pandas] PASSED                                                               [ 14%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[sqlframe] FAILED                                                             [ 28%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[pyarrow] PASSED                                                              [ 42%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[pandas[pyarrow]] PASSED                                                      [ 57%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[polars[eager]] PASSED                                                        [ 71%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[ibis] FAILED                                                                 [ 85%]
tests/expr_and_series/concat_struct_test.py::test_dryrun[duckdb] FAILED                                                               [100%]

========================================================== short test summary info ==========================================================
FAILED tests/expr_and_series/concat_struct_test.py::test_dryrun[sqlframe] - AttributeError: 'SparkLikeNamespace' object has no attribute 'concat_struct'. Did you mean: 'concat_str'?
FAILED tests/expr_and_series/concat_struct_test.py::test_dryrun[ibis] - AttributeError: 'IbisNamespace' object has no attribute 'concat_struct'. Did you mean: 'concat_str'?
FAILED tests/expr_and_series/concat_struct_test.py::test_dryrun[duckdb] - AttributeError: 'DuckDBNamespace' object has no attribute 'concat_struct'. Did you mean: 'concat_str'?
======================================================== 3 failed, 4 passed in 0.53s ========================================================

Should we also implement the missing ones?

@FBruzzesi
Member

FBruzzesi commented Oct 31, 2025

Hey @msalvany - thanks for the contribution 🚀

As a little side note, and to expand a bit more on Dan's comment: we try to mirror the polars API, so we aim to have narwhals.struct, as mentioned in the original issue, behaving the same as the polars.struct function across all the backends.

In a similar way, narwhals.concat_list will mirror polars.concat_list.

However:

I thought that in the same manner, concat_tuple would work

concat_tuple is not a polars function, therefore we won't have it either. There are a few exceptions to this rule, but this is not one of them.


Regarding other backends:

I have started with the tests. I see that there are more backends than pandas, polars and arrow.

For now you can start by xfailing them in the tests. I can see you are already xfailing certain polars versions, so you can do something along the following lines:

def test_dryrun(constructor: Constructor, *, request: pytest.FixtureRequest) -> None:
    if "polars" in str(constructor) and POLARS_VERSION < (1, 0, 0):
        # nth only available after 1.0
        request.applymarker(pytest.mark.xfail)

+    if any(x in str(constructor) for x in ("dask", "duckdb", "ibis", "pyspark", "sqlframe")):
+        reason = "Not supported/not implemented"
+        request.applymarker(pytest.mark.xfail(reason=reason))

and in those backend namespaces you can add struct = not_implemented() instead of defining the method.
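
For illustration, a minimal sketch of that pattern; the descriptor below is a simplified stand-in for narwhals' internal not_implemented helper, not the real implementation:

class not_implemented:
    # Simplified stand-in, for illustration only: raise a clear error on access.
    def __set_name__(self, owner, name):
        self._name = f"{owner.__name__}.{name}"

    def __get__(self, instance, owner=None):
        msg = f"{self._name} is not implemented for this backend"
        raise NotImplementedError(msg)


class DuckDBNamespaceSketch:
    # Illustrative stand-in for a backend namespace class.
    struct = not_implemented()  # declared instead of defining the method

# Accessing DuckDBNamespaceSketch().struct raises NotImplementedError.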

I hope it helps! Let's get pandas, polars and pyarrow in first, and then we can iterate for the others 🤞🏼

@FBruzzesi FBruzzesi added the enhancement New feature or request label Nov 1, 2025
@msalvany
Author

msalvany commented Nov 2, 2025

Hi,

Thanks for the clarification @FBruzzesi, I totally get it now! I have changed all concat_struct references to struct.

@msalvany msalvany changed the title DRAFT: ADD concat_struct DRAFT: ADD struct Nov 2, 2025
@FBruzzesi
Member

FBruzzesi commented Nov 3, 2025

Hey @msalvany first and foremost, thanks for updating the PR - it looks close to the finish line 🙏🏼

I have a few comments, especially regarding tests:

  • In the test, you are running the function, but then it would be good to add a comparison with an expected output. Something along the lines of:
     result = ...
     expected = ...  # <- this is a dictionary that matches the result dataframe content as key: list of values mapping
     assert_data_equal(result, expected)
  • Locally make sure to run pytest narwhals --doctest-modules as well. I think there is some formatting misalignment in the docstring example
  • I just noticed that in the contributing guide the part on pre-commit is not very clear. I would suggest running:
    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files
    
  • I will update the PR title and convert it to draft - you are always free to change it back whenever you think it's ready

@FBruzzesi FBruzzesi marked this pull request as draft November 3, 2025 00:29
@FBruzzesi FBruzzesi changed the title DRAFT: ADD struct feat: Add narwhals.struct top level function Nov 3, 2025
@MarcoGorelli
Member

thanks all! just a comment on

I hope it helps! Let's get pandas, polars and pyarrow in first, and then we can iterate for the others 🤞🏼

we should at least verify that this operation is feasible for spark/duckdb. fortunately, in this case, it looks like it's easily done with struct_pack, e.g.

In [35]: rel = duckdb.sql("select * from values (1,4,0),(1,5,1),(2,6,2) df(a,b,i)")

In [36]: rel
Out[36]:
┌───────┬───────┬───────┐
│   a   │   b   │   i   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     1 │     4 │     0 │
│     1 │     5 │     1 │
│     2 │     6 │     2 │
└───────┴───────┴───────┘

In [37]: rel.select('a', 'b', 'i', duckdb.FunctionExpression('struct_pack', 'a', 'b'))
Out[37]:
┌───────┬───────┬───────┬──────────────────────────────┐
│   a   │   b   │   i   │      struct_pack(a, b)       │
│ int32 │ int32 │ int32 │ struct(a integer, b integer) │
├───────┼───────┼───────┼──────────────────────────────┤
│     1 │     4 │     0 │ {'a': 1, 'b': 4}             │
│     1 │     5 │     1 │ {'a': 1, 'b': 5}             │
│     2 │     6 │     2 │ {'a': 2, 'b': 6}             │
└───────┴───────┴───────┴──────────────────────────────┘

in pyspark it looks like it's just struct

@msalvany
Author

msalvany commented Nov 3, 2025

In [35]: rel = duckdb.sql("select * from values (1,4,0),(1,5,1),(2,6,2) df(a,b,i)")

In [36]: rel
Out[36]:
┌───────┬───────┬───────┐
│   a   │   b   │   i   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     1 │     4 │     0 │
│     1 │     5 │     1 │
│     2 │     6 │     2 │
└───────┴───────┴───────┘

In [37]: rel.select('a', 'b', 'i', duckdb.FunctionExpression('struct_pack', 'a', 'b'))
Out[37]:
┌───────┬───────┬───────┬──────────────────────────────┐
│   a   │   b   │   i   │      struct_pack(a, b)       │
│ int32 │ int32 │ int32 │ struct(a integer, b integer) │
├───────┼───────┼───────┼──────────────────────────────┤
│     1 │     4 │     0 │ {'a': 1, 'b': 4}             │
│     1 │     5 │     1 │ {'a': 1, 'b': 5}             │
│     2 │     6 │     2 │ {'a': 2, 'b': 6}             │
└───────┴───────┴───────┴──────────────────────────────┘

Hello @MarcoGorelli, I'm going to use your example here to ask whether the output we expect from nw.struct() is a new column containing the struct inside the original dataframe (as you showed here), or rather a new, independent dataframe with a single column containing the struct.

If I understand this right, what polars.struct() generates is the second option, but I might be mistaken.

So far, this is what I was mimicking; just let me know if it should be changed. Thanks!

@MarcoGorelli
Member

a new column containing the struct inside the original dataframe (as you showed here), or rather a new independent df with a single column containing the struct.

this depends on whether you use with_columns or select
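
For illustration, a small polars sketch of that difference (assuming current polars behaviour; nw.struct should behave the same way once it lands):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# select keeps only the listed expressions: a single struct column remains
df.select(pl.struct("a", "b").alias("t")).columns        # ['t']

# with_columns appends the struct column next to the existing ones
df.with_columns(pl.struct("a", "b").alias("t")).columns  # ['a', 'b', 't']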

@msalvany
Author

msalvany commented Nov 4, 2025

in pyspark it looks like it's just struct

I simply tested struct from pyspark to be sure we get the same result, and it looks fine too:

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

data = [(1, 4, 0), (1, 5, 1), (2, 6, 2)]
columns = ["a", "b", "i"]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, columns)
df_with_struct = df.select("a", "b", "i", struct("a", "b").alias("struct_col"))
df_with_struct.show(truncate=False)
+---+---+---+----------+
|a  |b  |i  |struct_col|
+---+---+---+----------+
|1  |4  |0  |{1, 4}    |
|1  |5  |1  |{1, 5}    |
|2  |6  |2  |{2, 6}    |
+---+---+---+----------+

@msalvany msalvany marked this pull request as ready for review November 4, 2025 13:43
Comment on lines +354 to +355
values = df[col].tolist()
non_null_values = [v for v in values if not pd.isna(v)]
Member

quick note that tolist and iterating over values in Python isn't allowed here, as it's very inefficient - you'll need to look for a way to do this using the pandas api
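
For example, a vectorised sketch of that step, staying within the pandas API instead of iterating in Python (the exact approach used in the PR may differ):

import pandas as pd

s = pd.Series([1.0, None, 3.0])

# dropna() filters out nulls without materialising a Python list
non_null_values = s.dropna()

# equivalent boolean-mask form, also vectorised
non_null_values = s[s.notna()]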


Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

[Enh]: Implement narwhals.struct

4 participants