Improve Left/Right/Inner Join #223
base: main
Conversation
Thanks @flcong - looks like a lot of work! I might need some time to check the content. Make sure to ping me if I take too long... In the meantime, maybe @rajagurunath can have a first look (although this is also new code for you, sorry!)
Codecov Report
```
@@           Coverage Diff            @@
##              main      #223   +/- ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           64        64
  Lines         2589      2594     +5
  Branches       362       363     +1
=========================================
+ Hits          2589      2594     +5
```
Continue to review full report at Codecov.
No problem! This is my first time contributing to a Python package.
Hi @flcong, thanks for fixing this issue and welcome to dask-sql and open source :)
tests/integration/test_join.py
Outdated
# Left Join
querypnl = """
select a.*, b.startdate, b.enddate, b.lk_nullint, b.lk_int, b.lk_str,
I have tested this query using PostgreSQL, and it mostly matches the correct dataset specified here except for some dtype mismatches. One suggestion: could you please add this test in `test_postgres.py` as well, where you can make use of `assert_query_gives_same_result` to compare your result directly with the Postgres result (similar to the one you specified in the issue)? Let me know if you need any help.
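For reference, a hedged sketch of what such a test in `test_postgres.py` might look like; the fixture's exact signature is only inferred from snippets later in this thread, and the table and column names below are placeholders, not the PR's actual test data:

```python
def test_left_join_inequality(assert_query_gives_same_result):
    # run the query on both dask-sql and Postgres and compare the results;
    # the list names the columns to sort on so that row order does not matter
    assert_query_gives_same_result(
        """
        SELECT a.*, b.startdate, b.enddate, b.lk_int
        FROM user_table_a AS a
        LEFT JOIN user_table_lk AS b
        ON a.id = b.id AND a.dates >= b.startdate
        """,
        ["dates", "startdate", "enddate"],
    )
```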
All right, let me check. To run the Postgres tests, I need to install Docker, right?
Yes :) If not, a quick hack for testing purposes is to hardcode any available PostgreSQL IP address in the `engine` fixture of `test_postgres.py` - whichever is easier.
OK. I've added the tests in `test_postgres.py`. Also, I don't know why, but I cannot connect to the postgres container using the original `test_postgres.py`. So I exposed port 5432 in `client.containers.run` and changed `address` to `"localhost"`. Then the tests work fine.
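For context, a minimal sketch of that workaround using docker-py's `ports` argument; the image name and environment are illustrative only, and the actual `engine` fixture in `test_postgres.py` differs:

```python
import docker

client = docker.from_env()

# publish the container's port 5432 on the host so "localhost" works
postgres = client.containers.run(
    "postgres:latest",
    detach=True,
    ports={"5432/tcp": 5432},
    environment={"POSTGRES_HOST_AUTH_METHOD": "trust"},
)
address = "localhost"  # connect via the published port instead of the container IP
```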
how="left", | ||
) | ||
# Assign pd.NA | ||
for v in df.columns[df.columns.str.startswith(other_varpre)]: |
I just tried running the tests and the same code in a Jupyter notebook, and got the following error: `TypeError: float() argument must be a string or a number, not 'NAType'`. If you are not getting this error, then it is probably due to an environment mismatch, I guess. Or do you have any other guess regarding this error? Maybe it is due to setting `pd.NA`? What do you think @flcong?
Which test triggers the error? It seems to be the conversion `float(pd.NA)`?
For me, the first query (left join) was failing in `test_join_lricomplex`. I will try to provide more context by tomorrow :)
OK. BTW, I use Python 3.8.10 and Pandas 1.3.2 to run the tests.
That's a good catch @flcong. The above error was raised with pandas version 1.2.4 and works fine with `pandas==1.3.2`.
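A tiny reproduction of the error being discussed, for reference (how downstream casts behave can differ between pandas versions, as noted above):

```python
import pandas as pd

# pd.NA cannot be coerced to a plain Python float, which is exactly the
# TypeError quoted in the comment above
float(pd.NA)
# TypeError: float() argument must be a string or a number, not 'NAType'
```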
* Add new data.
* Allow for type conversion.
* Allow for specification of whether or not to check dtypes.
* Add the new tests from `test_join.py` in `test_postgres.py` as well.
* Expose port 5432 for the postgres container.
* Change address to "localhost".
Maybe once these changes are done, I would request @nils-braun for further review, since I am also no expert in the `dask-sql/physical/rel/logical` part.
tests/integration/test_postgres.py
Outdated
@@ -32,6 +33,7 @@ def engine():
# get the address and create the connection
postgres.reload()
address = postgres.attrs["NetworkSettings"]["Networks"]["dask-sql"]["IPAddress"]
address = "localhost"
Hi @flcong, apologies for not letting you know about this earlier. Once you have tested with a custom PostgreSQL address, please replace that address with the original docker container host address; otherwise the GitHub workflow will fail.
Ah. I see.
""", | ||
["dates", "startdate", "enddate"], | ||
force_dtype="dask", | ||
check_dtype=True, |
Just curious here: was specifying `check_dtype=False` not working here? Is there any other reason for introducing this new argument (`force_dtype`)?
I'm not sure how `assert_frame_equal` works to determine whether, for example, `2.0` and `2` are identical when `check_dtype=False`. So here I try to make it more explicit that a type cast of the dask dataframe makes the two dataframes identical even with `check_dtype=True`. I guess it conveys more information for developers? (Maybe just my OCD.)
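For reference, a small plain-pandas illustration of the behaviour in question (not the dask-sql test code itself):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df_int = pd.DataFrame({"x": [1, 2]})        # int64 column
df_float = pd.DataFrame({"x": [1.0, 2.0]})  # float64 column

# values compare equal even though the dtypes differ
assert_frame_equal(df_int, df_float, check_dtype=False)

# with check_dtype=True the same comparison raises an AssertionError
# assert_frame_equal(df_int, df_float, check_dtype=True)
```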
Ui @flcong, that was probably quite difficult to find out. Nice job. It took me quite some time to dig through the code, so I think it would be very cool if you could add a bit more documentation. Maybe you can use a small example dataframe in the comments to show what is actually wrong and how each step helps in solving it (e.g. the three cases you have, and then which rows the different merges later actually add). If this is too much to ask for, I can also take care of it.
I had some smaller comments in the code, but there is also a more general issue I would like to discuss here (and maybe it is just me being too naive), so let me recap the problem. When we do a LEFT or RIGHT join with a filter condition, it happens that we get rid of all NaN rows in the other dataframe (and therefore the JOIN turns into an inner join) - I did understand this part. In my naive understanding (please correct me), the solution should be rather easy: if a certain line from the left (or right) dataframe is missing from the resulting dataframe and we are doing a LEFT (or RIGHT) join, we still need to keep it. Shouldn't a very simple `join` with the initial left (or right) dataframe from before the merge, on one of your newly created index columns, be enough here? Your solution might also work (I did not think through all corner cases, but it looks good so far), but it is quite complex and involves multiple merges (which are shuffles in the Dask world). Am I being too naive?
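A rough sketch of the simpler approach being suggested here, written in plain pandas for brevity (the real code would operate on Dask dataframes, and all names below are illustrative, not the PR's actual variables):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "a": [10, 20, 30]})
right = pd.DataFrame({"id": [1, 2, 2], "b": [5, 15, 25]})

# tag every left row so we can find the ones the filtered merge drops
left = left.assign(_left_row=range(len(left)))

# equality join plus the extra (inequality) filter condition, e.g. a < b
merged = left.merge(right, on="id", how="inner")
merged = merged[merged["a"] < merged["b"]]

# re-attach the left rows that disappeared, so the result stays a LEFT join
missing = left[~left["_left_row"].isin(merged["_left_row"])]
result = pd.concat([merged, missing], ignore_index=True).drop(columns="_left_row")
# the rows with id 1 and 3 come back, with missing values in the right-hand columns
```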
# could be duplicates. (Yeah. It may be better to inform users that
# index will break. After all, it is expected to be broken since the
# number of rows changes.
df = df.assign(uniqid=1)
I think I did not understand why this is needed, as you are grouping and joining later anyway. Probably I am just missing something, but maybe it is worth adding more documentation on why you do it.
(Also, I am not 100% sure whether this triggers a computation, as Dask needs to know about the divisions. But I did not check.)
@@ -92,12 +95,32 @@ def convert(
# 4. dask can only merge on the same column names.
# We therefore create new columns on purpose, which have a distinct name.
assert len(lhs_on) == len(rhs_on)
# Add two columns (1,2,...) to keep track of observations in left and
These changes, as well as the index reset further down, are only used when there is a filter condition. I do not expect this to be the default. Therefore I think it would be nice if we did not touch the "normal" use case and did not introduce another performance drawback for the "normal" user. I think a simple `if filter_condition` and some comment should be enough, or what do you think?
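A minimal sketch of that suggestion (the variable names are illustrative, not the PR's exact code):

```python
# only pay for the extra bookkeeping when an inequality filter is present,
# so the plain equi-join path keeps its original behaviour and performance
if filter_condition is not None:
    # number the rows so unmatched ones can be identified after the merge
    df_lhs_renamed["left_idx"] = 1
    df_lhs_renamed["left_idx"] = df_lhs_renamed["left_idx"].cumsum()
    df_rhs_renamed["right_idx"] = 1
    df_rhs_renamed["right_idx"] = df_rhs_renamed["right_idx"].cumsum()
```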
@@ -177,8 +218,91 @@ def merge_single_partitions(lhs_partition, rhs_partition):
for rex in filter_condition
],
)
logger.debug(f"Additionally applying filter {filter_condition}")
df = filter_or_scalar(df, filter_condition)
# Three cases to deal with inequality conditions (left join as an example):
Can you add more documentation here? From the PR context I know that we are dealing with complex join conditions, which consist of an equality and an inequality join (but there could be more than just inequalities), but someone just reading the code will not know that. Just describe the setting you are dealing with and the actual problem with the naive implementation before going into the details.
your tests still succeed
@@ -86,6 +86,88 @@ def datetime_table():
)


@pytest.fixture
def user_table_lk():
I love those tests, they are super cool because they seem to come from a real use case, which is absolutely brilliant.
However, can we also have a very simple one with just 3-4 rows and two columns (e.g. the one I used in my comments)? This makes debugging much easier than skimming through multiple lines which (because the columns are so wide) span a lot of space in the editor. I can also take care of this if you want!
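For illustration, a hedged sketch of the kind of tiny fixture meant here (the fixture name, columns, and values are placeholders, not what was actually added to the PR):

```python
import dask.dataframe as dd
import pandas as pd
import pytest


@pytest.fixture
def small_join_table():
    # a small two-column table: easy to reason about when a join goes wrong
    df = pd.DataFrame({"id": [1, 1, 2, 3], "b": [10, 20, 30, 40]})
    return dd.from_pandas(df, npartitions=1)
```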
Yeah. I realize that. I simplified the new tests.
* Roll back to previous `join.py` but add lines to merge unmatched columns.
* Fix a bug in `merge_single_partitions` where the returned dataframe has an extra column `"common"` that triggers metadata mismatch in the added lines of merging.
* Since int columns will be converted to float if there are unmatched rows, I use `check_dtype=False` but still use nullable int in the assumed correct table.
That's brilliant! I didn't think of that. I rolled back to the previous `join.py` and added lines to assign `pd.NA` to observations not matched by the equality or inequality condition from the right (left) table. One caveat is that in this case, if any input column has type int and there are unmatched rows, it will be converted to float. By the way, I also found a bug in `merge_single_partitions` where the returned dataframe has an extra column `"common"` that triggers a metadata mismatch in the added lines of merging.
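For reference, a small plain-pandas illustration of that dtype caveat: unmatched rows introduce missing values, which upcast a plain int column to float, while a nullable `Int64` column keeps its integer dtype in recent pandas versions (values and names below are made up for the example):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2]})
right_plain = pd.DataFrame({"id": [1], "val": [7]})                              # int64
right_nullable = pd.DataFrame({"id": [1], "val": pd.array([7], dtype="Int64")})  # nullable

print(left.merge(right_plain, on="id", how="left")["val"].dtype)     # float64, NaN for id=2
print(left.merge(right_nullable, on="id", how="left")["val"].dtype)  # Int64, <NA> for id=2
```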