
test: Simplify read_scan_test, spark session#3024

Merged
dangotbanned merged 14 commits into main from test-simp-read-scan
Aug 23, 2025

Conversation

Member

@dangotbanned dangotbanned commented Aug 22, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

I was looking to deduplicate the spark-like session stuff in prep for #3023.

I thought I'd use this as a full-module example for #2959 on how we can write less repetitive tests 🙂

@dangotbanned dangotbanned marked this pull request as ready for review August 22, 2025 22:30
Member

@FBruzzesi FBruzzesi left a comment

Thanks @dangotbanned - I have one comment that can help us even more 🎉

Comment on lines 79 to 103

def sqlframe_session() -> DuckDBSession:
    from sqlframe.duckdb import DuckDBSession

    # NOTE: `__new__` override inferred by `pyright` only
    # https://github.com/eakmanrq/sqlframe/blob/772b3a6bfe5a1ffd569b7749d84bea2f3a314510/sqlframe/base/session.py#L181-L184
    return cast("DuckDBSession", DuckDBSession())  # type: ignore[redundant-cast]


def pyspark_session() -> SparkSession:  # pragma: no cover
    if is_spark_connect := os.environ.get("SPARK_CONNECT", None):
        from pyspark.sql.connect.session import SparkSession
    else:
        from pyspark.sql import SparkSession
    builder = cast("SparkSession.Builder", SparkSession.builder).appName("unit-tests")
    builder = (
        builder.remote(f"sc://localhost:{os.environ.get('SPARK_PORT', '15002')}")
        if is_spark_connect
        else builder.master("local[1]").config("spark.ui.enabled", "false")
    )
    return (
        builder.config("spark.default.parallelism", "1")
        .config("spark.sql.shuffle.partitions", "2")
        .config("spark.sql.session.timeZone", "UTC")
        .getOrCreate()
    )
Member

Should we consider moving these into tests/utils.py? They can be re-used both in tests/conftest.py and in #3032

Member Author

I do want to eventually, but I was thinking as session-scoped fixtures?

We'd need to restructure some of the existing tests though - e.g. so the same filtering that happens in --constructors also applies to these heavy things 🤔

We don't have to do that, but that was the idea I was working on when I got distracted and did this PR instead 😂
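One way to get that per-run sharing without eagerly building every backend is a cached factory (a hypothetical sketch; the backend names and `get_session` helper are assumptions for illustration, and the real conftest.py might instead use pytest fixtures with `scope="session"`):

```python
from functools import cache


@cache
def get_session(backend: str) -> object:
    # Build the (potentially heavy) session only on first request per
    # backend; later calls return the cached instance, much like
    # SparkSession.builder.getOrCreate(). The bodies are stand-ins,
    # not the actual DuckDBSession / SparkSession construction.
    if backend == "sqlframe":
        return object()  # stand-in for DuckDBSession()
    if backend == "pyspark":
        return object()  # stand-in for the builder chain + getOrCreate()
    raise ValueError(f"unknown backend: {backend}")


assert get_session("pyspark") is get_session("pyspark")  # created once
assert get_session("pyspark") is not get_session("sqlframe")
```

A session-scoped fixture would give the same once-per-run behaviour, plus the ability to skip construction entirely when a backend is filtered out of the run.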

Member Author

On second thought, @FBruzzesi, yeah just move them, merge, and use in #3032 if they're helpful 😅

I did explicitly mention that this was prep for #3023 anyway 🤦

Member

Regarding

I do want to eventually, but I was thinking as session-scoped fixtures?

and

heavy things

The pyspark session is a singleton (the getOrCreate part should be key): we create it once and use it in the pyspark constructor

narwhals/tests/conftest.py

Lines 203 to 213 in 68d762a

def _constructor(obj: Data) -> PySparkDataFrame:
    _obj = deepcopy(obj)
    index_col_name = generate_temporary_column_name(n_bytes=8, columns=list(_obj))
    _obj[index_col_name] = list(range(len(_obj[next(iter(_obj))])))
    return (
        session.createDataFrame([*zip(*_obj.values())], schema=[*_obj.keys()])
        .repartition(2)
        .orderBy(index_col_name)
        .drop(index_col_name)
    )
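For intuition, getOrCreate behaves like a memoised constructor: repeated calls hand back the same session object. A minimal pure-Python sketch of that semantics (a toy class, not PySpark's actual implementation):

```python
class FakeSession:
    """Toy stand-in illustrating getOrCreate()-style singleton reuse."""

    _instance = None

    @classmethod
    def get_or_create(cls) -> "FakeSession":
        # Create the session on the first call; every later call returns
        # the same object, so the expensive setup happens only once.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


a = FakeSession.get_or_create()
b = FakeSession.get_or_create()
assert a is b  # one session per process, reused across constructors
```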

For SQLFrame it should be quite lightweight

so the same filtering that happens in --constructors

What do you mean by this exactly? I am not following 🙈

Anyway, we can move it as a follow up, no worries

Member Author

My brain has melted for the day dude 😭

Anyway, we can move it as a follow up, no worries

I'm happy for you to do it now if you're keen to use it in the other PR?

        "spark.sql.session.timeZone", "UTC"
    ).getOrCreate()
    session = pyspark_session()

Member Author

Christ, great call @FBruzzesi!
I had no idea we had this logic in so many places 😂

Member

@FBruzzesi FBruzzesi left a comment

Self approving my edits 😂 But please take a look at 267e7b7

You might have already done it, but I might want to wait for tomorrow because of

My brain has melted for the day dude 😭


@dangotbanned
Member Author

Self approving my edits 😂 But please take a look at 267e7b7

All good thanks @FBruzzesi 😍

You might have already done it, but I might want to wait for tomorrow because of

My brain has melted for the day dude 😭

I'm still good to watch all the lovely code fly by and appreciate it 😄

@dangotbanned changed the title from "test: Simplify read_scan_test" to "test: Simplify read_scan_test, spark session" on Aug 23, 2025
@dangotbanned merged commit 3c58b0e into main on Aug 23, 2025
31 of 32 checks passed
@dangotbanned deleted the test-simp-read-scan branch on August 23, 2025 21:55
2 participants