SparkSQLCompare and spark submodule #310
Conversation
Single machine (16 CPUs, 64 GB RAM):
- 100 Million rows x 10 cols:
- 500 Million rows x 10 cols:

Distributed (20 executors, 8 cores, 32 GB RAM):
- 100 Million rows x 10 cols:
- 500 Million rows x 10 cols:
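For context, a minimal sketch of how a run like the setups above could be wired up, assuming the SparkSQLCompare(spark_session, df1, df2, join_columns=...) API added in this PR and a top-level datacompy export; the column names, row counts, and perturbation are illustrative, not the actual benchmark harness:

```python
# Illustrative benchmark sketch, not the exact harness used for the numbers above.
from pyspark.sql import SparkSession, functions as F

from datacompy import SparkSQLCompare  # assumes the top-level export added in this PR

spark = SparkSession.builder.appName("datacompy-benchmark").getOrCreate()

n_rows = 100_000_000  # 100 Million rows x 10 cols; scale down for a local run
base = spark.range(n_rows).withColumnRenamed("id", "key")
df1 = base.select("key", *[(F.col("key") * i).alias(f"col_{i}") for i in range(1, 10)])
# Perturb one column so the compare has differences to report.
df2 = df1.withColumn("col_5", F.col("col_5") + (F.col("key") % 1000 == 0).cast("int"))

compare = SparkSQLCompare(spark, df1, df2, join_columns="key")
print(compare.matches())
print(compare.report())
```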
Looks good Faisal,
One thing to mention: I do seem to get several of the following warnings when performing a Spark SQL compare: "No Partition Defined for Window operation!" - looks like you'd need to add a partition to the Window specs in the dataframe merge to silence these.
One other consideration: when re-running a compare, I get a few "WARN CacheManager: Asked to cache already cached data." Not a meaningful warning, since it's just trying to re-cache the dataframes from the first compare, but it may be worth forcefully silencing this one.
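A blunt way to quiet these on the driver is to raise the Spark log level; a minimal sketch of one possible workaround, not necessarily what this PR does:

```python
# Raising the log level silences all WARN-level messages, including the
# CacheManager re-cache warning, so treat this as a heavy-handed workaround.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  # WARN messages are no longer emitted by the driver
```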
I think this is coming from the following lines:

    df1 = df1.withColumn(
        "__index",
        row_number().over(Window.orderBy(monotonically_increasing_id())) - 1,
    )
    df2 = df2.withColumn(
        "__index",
        row_number().over(Window.orderBy(monotonically_increasing_id())) - 1,
    )

I need to create an index column, but the increasing number needs to happen across the whole dataframe, so there isn't really anything to partition on. This only happens when there are dupes. I think I can just use monotonically_increasing_id() directly instead.
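For context, a self-contained sketch of the two variants under discussion (toy DataFrame; the __index name comes from the snippet above):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # toy stand-in for df1/df2

# Original: a gapless 0..n-1 index, but the unpartitioned Window pulls every row
# into a single partition, which triggers "No Partition Defined for Window operation!".
df_window = df.withColumn(
    "__index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1,
)

# Proposed: monotonically_increasing_id() alone stays fully distributed. The ids are
# increasing and unique but not consecutive, which is enough to disambiguate dupes.
df_monotonic = df.withColumn("__index", monotonically_increasing_id())
```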
Re-testing locally with a small dataframe, it seems what used to be a ~6-7 second run is now ~25 seconds due to this specific change. I'm not sure how this scales with larger dataframes and running locally vs a cluster, but it may be worth checking this doesn't have a meaningful effect on performance.
@rhaffar Any chance you have some sample code I can try to recreate locally? On a 10K row dataset I get:
Sure, I can't share ipynb files so I'll just copy-paste the code block. Frankly this was tested with dataframes that are only a few rows, and it was run locally, so I suppose it's not super meaningful for performance testing. If you're getting improved performance with the updated code on a larger dataset, then I think that's a much better indication regarding performance changes.
@rhaffar thanks for flagging. Running some benchmarks with dupes in there. Not sure if this might be an issue only when running on very small data, since it all gets moved to one partition vs multiple. With larger datasets, moving everything over to one partition would cause OOM errors. Let's see with 10M or 100M rows and distributed.
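A rough sketch of how such a dupes benchmark could be timed; the row count and dupe construction here are illustrative, not the actual harness:

```python
import time

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import lit, monotonically_increasing_id, row_number

spark = SparkSession.builder.getOrCreate()

n_rows = 10_000_000
half = spark.range(n_rows // 2).withColumn("value", lit("x"))
df = half.union(half)  # force duplicate rows so the __index path is exercised

start = time.perf_counter()
df.withColumn(
    "__index", row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
).count()
print(f"row_number over global Window: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
df.withColumn("__index", monotonically_increasing_id()).count()
print(f"monotonically_increasing_id(): {time.perf_counter() - start:.1f}s")
```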
LGTM
Thanks! I looked at this on distributed for 10M, 50M and 100M rows, comparing the two variants:

10M:
- withColumn("__index", monotonically_increasing_id())

50M:
- row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
- withColumn("__index", monotonically_increasing_id())

100M:
- row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
- withColumn("__index", monotonically_increasing_id())

Hypothesis seems to make sense. Large data would benefit from the monotonically_increasing_id() approach.
@fdosani, I tried this pull request using a remote Databricks Spark cluster via the databricks-connect library. With this library I get a pyspark.sql.connect.dataframe.DataFrame, and when I pass it to SparkSQLCompare() I get the following error: TypeError: df1 must be a pyspark.sql.DataFrame. pyspark.sql.connect.dataframe.DataFrame worked earlier with the legacy SparkCompare. I am checking how to convert pyspark.sql.connect.dataframe.DataFrame to pyspark.sql.DataFrame.
So I don't know much about Spark Connect, but it seems to allow remote clients to run DataFrame commands against a Spark cluster. For your natural workflow are you using Spark Connect, or was this just to test things out?
We always use Databricks Spark via Spark Connect from servers outside Databricks, so I need to find a workaround to convert this. I am looking at the PR apache/spark#46129 to see if such a conversion is possible.
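For reference, a minimal sketch of one way a type check could accept both classic and Spark Connect DataFrames; the actual fix in the branch linked below may be implemented differently:

```python
from pyspark.sql import DataFrame

try:
    # Only present in pyspark >= 3.4 with the Spark Connect extras installed.
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame

    SPARK_DATAFRAME_TYPES = (DataFrame, ConnectDataFrame)
except ImportError:
    SPARK_DATAFRAME_TYPES = (DataFrame,)


def validate_dataframe(df, name: str = "df1") -> None:
    """Raise the TypeError reported above unless df is a supported Spark DataFrame."""
    if not isinstance(df, SPARK_DATAFRAME_TYPES):
        raise TypeError(f"{name} must be a pyspark.sql.DataFrame")
```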
@satniks can you try this branch quickly? https://github.com/capitalone/datacompy/tree/spark-connect-test |
Wow. Thanks @fdosani, it worked and showed the expected differences. It also fixed the performance issues I reported earlier with version 0.12 on Databricks Spark. Now waiting for version 0.13 :-)
Overall LGTM, left some non-blocking comments. Will come back when you get the unit tests passing. Sorry for the late review.
# Clean up temp columns for duplicate row matching
if self._any_dupes:
    outer_join = outer_join.drop(
        *[
I think the unpacking here may be unnecessary as we're creating a list just to unpack it. Could be simplified to .drop(col1, col2)
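A tiny illustration of the suggestion; the temp column names here are hypothetical stand-ins, not the PR's actual ones:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical stand-in for the outer_join DataFrame carrying temp index columns.
outer_join = spark.createDataFrame([(1, 0, 0)], ["value", "__index_df1", "__index_df2"])

# Building a list only to unpack it ...
cleaned = outer_join.drop(*["__index_df1", "__index_df2"])
# ... is equivalent to passing the column names straight to drop():
cleaned = outer_join.drop("__index_df1", "__index_df2")
```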
ignore_extra_columns : bool
    Ignores any columns in one dataframe and not in the other.
"""
if not ignore_extra_columns and not self.all_columns_match():
For the future, but I think this can be simplified to a single return using ignore_extra_columns with or/and ...
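Sketching what that single return could look like, assuming the surrounding checks follow datacompy's usual matches() pattern (all_columns_match / all_rows_overlap / intersect_rows_match); an illustration, not this PR's exact code:

```python
class SparkSQLCompareSketch:
    """Only the relevant method is sketched; the helper methods are assumed to exist."""

    def matches(self, ignore_extra_columns: bool = False) -> bool:
        # Equivalent to the early-return chain: extra columns are tolerated only
        # when ignore_extra_columns is True, and rows/values must still line up.
        return (
            (ignore_extra_columns or self.all_columns_match())
            and self.all_rows_overlap()
            and self.intersect_rows_match()
        )
```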
Tuple(str, str)
    Tuple of base and compare datatype
"""
base_dtype = [d[1] for d in dataframe.dtypes if d[0] == col_1][0]
Not blocking: I think this can be simplified to next(d[1] for d in dataframe.dtypes if d[0] == col_1)
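Both forms side by side on a throwaway DataFrame; the dataframe and col_1 names mirror the snippet above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataframe = spark.createDataFrame([(1, "a")], ["id", "name"])
col_1 = "name"

# Current: materialize the whole filtered list, then index its first element.
base_dtype = [d[1] for d in dataframe.dtypes if d[0] == col_1][0]

# Suggested: a generator with next() stops at the first matching column name.
base_dtype = next(d[1] for d in dataframe.dtypes if d[0] == col_1)
```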
* [WIP] vanilla spark
* [WIP] fixing tests and logic
* [WIP] __index cleanup
* updating pyspark.sql logic and fixing tests
* restructuring spark logic into submodule and typing
* remove pandas 2 restriction for spark sql
* fix for sql call
* updating docs
* updating benchmarks with pyspark dataframe
* relative imports and linting
* relative imports and linting
* feedback from review, switch to monotonic and simplify checks
* allow pyspark.sql.connect.dataframe.DataFrame
* checking version for spark connect
* typo fix
* adding import
* adding connect extras
* adding connect extras
In the following PR I've created:
- SparkSQLCompare (new)
- SparkCompare -> SparkPandasCompare (renamed)
- both now live under a spark submodule
- breaking changes, targeted for 0.13.0

Some benchmark numbers to follow.