Add Fugue support (Phase 1) #201
Conversation
@goodwanghan Thank you for the PR. I'll try and take a look at this shortly. I might have some questions and a discussion to get your opinion on a few things once I've dived in a bit. @ak-gupta @NikhilJArora just in case you are interested in checking this out too. :)
Sounds good @fdosani. If you want to test Dask and Ray, please use fugue 0.8.4.dev2; if you want to test other backends, just use the official releases.
Just poking around with the code base. If I do something like the following — basically comparing two dataframes which are just sorted backwards — it returns different results, which makes sense from reading the code.
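The snippet from the original comment wasn't captured in this extract; a minimal pandas sketch of the scenario being described (comparing a dataframe against a reverse-sorted copy of itself), with illustrative data:

```python
import pandas as pd

# Two dataframes with identical content, one sorted in reverse order
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
df2 = df1.iloc[::-1].reset_index(drop=True)

# A row-order-sensitive check reports a difference...
print(df1.equals(df2))  # False

# ...but sorting both on the join column first shows the data is the same
sorted1 = df1.sort_values("a").reset_index(drop=True)
sorted2 = df2.sort_values("a").reset_index(drop=True)
print(sorted1.equals(sorted2))  # True
```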
Fully admit, being new to Fugue I might not totally understand the implementation, so please excuse my ignorance or dumb questions. So the splitting up into groups happens here. Let's say I have 3 "buckets": those 3 may or may not have the corresponding row from the other dataframe. Based on what @kvnkho mentioned at PyData, these buckets are what is run in isolation on, say, Spark — correct?
Ah, this is because they don't match on schema: df1.b is integer, df2.b is float. Fugue has a stricter schema comparison. If you add
you will see the compare succeeds.
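The exact snippet referenced in that comment wasn't captured here; a minimal sketch of one way to align the schemas, assuming a cast of df2.b to integer (the original fix may have differed):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [4, 5]})      # b is int64
df2 = pd.DataFrame({"a": [1, 2], "b": [4.0, 5.0]})  # b is float64

# Under a strict schema comparison the dtypes must match exactly
print(df1["b"].dtype == df2["b"].dtype)  # False

# Casting df2.b to int64 aligns the schemas
df2["b"] = df2["b"].astype("int64")
print(df1["b"].dtype == df2["b"].dtype)  # True
```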
That is great. OK, so maybe my understanding was off here. The _comp which happens does happen on the entire dataset then, correct? I thought maybe pieces were being segmented off, and that if things were not ordered properly there would be a chance the join wouldn't take place.
For example, suppose df1 and df2 each have columns a and b, with values x, y, z in column a. If join_columns is a, then the data could be partitioned as group 1 (containing x and y) and group 2 (containing z).
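A hypothetical sketch of the partitioning described above — the bucket assignment and values are illustrative, not Fugue's actual internals:

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["x", "y", "z"], "b": [1, 2, 3]})
df2 = pd.DataFrame({"a": ["x", "y", "z"], "b": [1, 2, 3]})

# Assign each join-key value to a bucket; here x and y land in group 1
# and z in group 2, mirroring the example above.
buckets = {"x": 1, "y": 1, "z": 2}
groups1 = {k: g for k, g in df1.groupby(df1["a"].map(buckets))}
groups2 = {k: g for k, g in df2.groupby(df2["a"].map(buckets))}

# Each bucket can now be compared independently, e.g. on separate workers
for k in sorted(groups1):
    print(k, len(groups1[k]), len(groups2[k]))
```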
In this case, you asked about join, but we didn't directly use a join here; instead we use map -> union -> groupmap, which is similar to a join.
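A rough pandas sketch of the map -> union -> groupmap idea — an illustration of the concept, not Fugue's actual implementation:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
df2 = pd.DataFrame({"a": [3, 1, 2], "b": [30, 10, 20]})

# map: tag each row with its source dataframe
t1 = df1.assign(_source="df1")
t2 = df2.assign(_source="df2")

# union: stack the tagged rows into one dataset
unioned = pd.concat([t1, t2], ignore_index=True)

# groupmap: within each join-key group, compare the rows from each side
def compare_group(g: pd.DataFrame) -> bool:
    left = g[g["_source"] == "df1"].drop(columns="_source").reset_index(drop=True)
    right = g[g["_source"] == "df2"].drop(columns="_source").reset_index(drop=True)
    return left.equals(right)

all_match = all(compare_group(g) for _, g in unioned.groupby("a"))
print(all_match)  # True
```

Because the comparison happens per join-key group, the original row order of the inputs never matters.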
Got it! Yup, this is making sense now. Thanks for the explanation. Are you OK if I push a couple of small tweaks? Nothing logic-wise. I guess maybe the other thing we should discuss is the strict vs. loose comparison we discussed above. In core.py it happens here in
Please feel free to make changes. Yeah, we can also implement
Actually never mind. I missed this line. 🤦
I did a test on 100GB of data. It can finish in 7 min with 256 CPUs. The test was that I loaded two Spark dataframes from the same file.
* Add Fugue support
* add polars and duckdb
* fixing docstrings and cleanup
* adding in strict_schema and check for hash_cols in both dfs

Co-authored-by: fdosani <[email protected]>
Fugue is an abstraction for distributed and local computing frameworks such as Spark, Dask, Ray, DuckDB, and Polars. For datacompy, Fugue can elegantly scale the core Compare class to different distributed backends.
In Phase 1, we only implemented the `is_match` function. In Phase 2, we will enable `report`. Notice, `is_match` will compare unordered data, meaning that `df1` and `df2` without the same order can still match. In distributed systems, the concept of order doesn't natively exist, so that is why `is_match` doesn't require order. Here are a few examples using `is_match`.