test: Simplify `read_scan_test`, spark session #3024
Conversation
`parametrize` already defines the contents, that'll do
FBruzzesi
left a comment
Thanks @dangotbanned - I have one comment that can help us even more 🎉
tests/read_scan_test.py
```python
def sqlframe_session() -> DuckDBSession:
    from sqlframe.duckdb import DuckDBSession

    # NOTE: `__new__` override inferred by `pyright` only
    # https://github.com/eakmanrq/sqlframe/blob/772b3a6bfe5a1ffd569b7749d84bea2f3a314510/sqlframe/base/session.py#L181-L184
    return cast("DuckDBSession", DuckDBSession())  # type: ignore[redundant-cast]


def pyspark_session() -> SparkSession:  # pragma: no cover
    if is_spark_connect := os.environ.get("SPARK_CONNECT", None):
        from pyspark.sql.connect.session import SparkSession
    else:
        from pyspark.sql import SparkSession
    builder = cast("SparkSession.Builder", SparkSession.builder).appName("unit-tests")
    builder = (
        builder.remote(f"sc://localhost:{os.environ.get('SPARK_PORT', '15002')}")
        if is_spark_connect
        else builder.master("local[1]").config("spark.ui.enabled", "false")
    )
    return (
        builder.config("spark.default.parallelism", "1")
        .config("spark.sql.shuffle.partitions", "2")
        .config("spark.sql.session.timeZone", "UTC")
        .getOrCreate()
    )
```
Should we consider moving these into `tests/utils.py`? They could be reused in both `tests/conftest.py` and in #3032
I do want to eventually, but I was thinking of them as session-scoped fixtures?
We'd need to restructure some of the existing tests though - e.g. so that the same filtering that happens in `--constructors` also applies to these heavy things 🤔
We don't have to do that, but that was the idea I was working on when I got distracted and did this PR instead 😂
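For context, a session-scoped fixture version could look roughly like this (a sketch of the idea being discussed, not the PR's actual code; the fixture name and lazy import are illustrative):

```python
import pytest


@pytest.fixture(scope="session")
def sqlframe_session():
    """Hypothetical session-scoped fixture: the body runs at most once per
    pytest run, so every test shares the same DuckDBSession instance."""
    from sqlframe.duckdb import DuckDBSession  # imported lazily; sqlframe is optional

    return DuckDBSession()
```

With `scope="session"`, pytest caches the return value for the whole run instead of rebuilding it per test.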
Ah, on second thought @FBruzzesi, yeah just move them, merge, and use them in #3032 if they're helpful 😅
I did explicitly mention that this was prep for #3023 anyway 🤦
Regarding

> I do want to eventually, but I was thinking as session-scoped fixtures?

and

> heavy things

The pyspark session is a singleton (the `getOrCreate` part should be key) - we create it once and use it in the pyspark constructor
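In other words, repeated calls hand back the same object, so the cost is paid once per process. The get-or-create pattern boils down to something like this (a plain-Python sketch of the idea, not Spark's actual implementation):

```python
import functools


@functools.lru_cache(maxsize=1)
def get_or_create_session() -> object:
    """Hypothetical stand-in for SparkSession.builder...getOrCreate():
    the body executes once; every later call returns the cached session."""
    return object()  # imagine an expensive SparkSession build here


a = get_or_create_session()
b = get_or_create_session()
assert a is b  # one session per process, however many call sites
```

This is why calling the helper from several constructors isn't as heavy as it looks.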
Lines 203 to 213 in 68d762a
For SQLFrame it should be quite lightweight
> so the same filtering that happens in `--constructors`

What do you mean by this exactly? I am not following 🙈
Anyway, we can move it as a follow up, no worries
My brain has melted for the day dude 😭
> Anyway, we can move it as a follow up, no worries

I'm happy for you to do it now if you're keen to use it in the other PR?
```python
        "spark.sql.session.timeZone", "UTC"
    ).getOrCreate()
    session = pyspark_session()
```
Christ, great call @FBruzzesi!
I had no idea we had this logic in so many places 😂
All good thanks @FBruzzesi 😍
I'm still good to watch all the lovely code fly by and appreciate it 😄

What type of PR is this? (check all applicable)

Related issues

- `DataFrame().lazy({'pyspark','sqlframe','pyspark[connect]'})` #3023

Checklist
If you have comments or can explain your changes, please do so below
I was looking to deduplicate the spark-like session stuff in prep for #3023.
I thought I'd use this as full-module example for #2959 on how we can write less repetitive tests 🙂
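The `parametrize` approach mentioned above means declaring the inputs once in the decorator instead of copy-pasting near-identical test bodies; a minimal hypothetical example (the names and formats are illustrative, not from this PR):

```python
import pytest


@pytest.mark.parametrize("fmt", ["csv", "parquet"])
def test_scan(fmt: str) -> None:
    """Hypothetical: one body covers every format; pytest generates a
    separate test case per parameter."""
    path = f"data.{fmt}"
    assert path.endswith(fmt)
```

pytest will collect this as one test per parameter (`test_scan[csv]`, `test_scan[parquet]`), which is what makes the module much less repetitive.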