
Improve Left/Right/Inner Join #223

Open
flcong wants to merge 8 commits into main

Conversation


@flcong flcong commented Aug 23, 2021

@nils-braun
Collaborator

Thanks @flcong - looks like a lot of work! I might need some time checking the content. Make sure to ping me if I take too long... In the meantime, maybe @rajagurunath can have a first look (although this is also new code for you, sorry!)

@codecov-commenter

codecov-commenter commented Aug 23, 2021

Codecov Report

Merging #223 (39980fd) into main (4dab949) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #223   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           64        64           
  Lines         2589      2594    +5     
  Branches       362       363    +1     
=========================================
+ Hits          2589      2594    +5     
Impacted Files Coverage Δ
dask_sql/physical/rel/logical/join.py 100.00% <100.00%> (ø)
dask_sql/context.py 100.00% <0.00%> (ø)
dask_sql/input_utils/hive.py 100.00% <0.00%> (ø)
dask_sql/input_utils/intake.py 100.00% <0.00%> (ø)
dask_sql/input_utils/convert.py 100.00% <0.00%> (ø)
dask_sql/input_utils/location.py 100.00% <0.00%> (ø)
dask_sql/input_utils/pandaslike.py 100.00% <0.00%> (ø)
dask_sql/physical/rel/custom/create_table.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4dab949...39980fd.

@flcong
Author

flcong commented Aug 23, 2021

Thanks @flcong - looks like a lot of work! I might need some time checking the content. Make sure to ping me if I take too long... In the meantime, maybe @rajagurunath can have a first look (although this is also new code for you, sorry!)

No problem! First time contributing to a Python package.

@rajagurunath
Collaborator

Hi @flcong,

Thanks for fixing this issue and welcome to dask-sql and open source :)


# Left Join
querypnl = """
select a.*, b.startdate, b.enddate, b.lk_nullint, b.lk_int, b.lk_str,
Collaborator

I have tested this query using PostgreSQL, and it largely matches the correct dataset specified here, except for some dtype mismatches. One suggestion: could you please add this test to test_postgres.py as well? There you can make use of assert_query_gives_same_result to compare your result directly with the Postgres result (similar to the one you specified in the issue). Let me know if you need any help.
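As an illustration only, a minimal sketch of what such a Postgres comparison test could look like, assuming assert_query_gives_same_result takes the query plus a list of sort columns as in the existing tests (the test name and table names below are placeholders, not the PR's actual fixtures):

def test_join_inequality_left(assert_query_gives_same_result):
    # Run the same query through dask-sql and Postgres and compare the results.
    assert_query_gives_same_result(
        """
        SELECT a.*, b.startdate, b.enddate, b.lk_nullint, b.lk_int, b.lk_str
        FROM user_table_ts a
        LEFT JOIN user_table_lk b
        ON a.id = b.id AND a.dates >= b.startdate AND a.dates <= b.enddate
        """,
        ["dates", "startdate", "enddate"],  # sort columns for a stable comparison
    )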

Author

All right. Let me check. To run the Postgres tests, I need to install Docker, right?

Collaborator

Yes :) If not, a quick hack for testing purposes is to hardcode any available PostgreSQL IP address in the engine fixture of test_postgres.py, whichever is easier.

Author

OK. I've added the tests in test_postgres.py. Also, I don't know why, but I cannot connect to the postgres container using the original test_postgres.py. So I exposed port 5432 in client.containers.run and changed the address to "localhost". Then the tests work fine.
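For reference, a rough sketch of that workaround with the Docker Python SDK (the image tag and environment values here are illustrative; only the ports mapping and the "localhost" address reflect the change described above):

import docker

client = docker.from_env()
postgres = client.containers.run(
    "postgres:13",                                # illustrative image tag
    environment={"POSTGRES_HOST_AUTH_METHOD": "trust"},
    ports={"5432/tcp": 5432},                     # publish port 5432 on the host
    detach=True,
)
# connect via the published port instead of the container's network address
address = "localhost"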

how="left",
)
# Assign pd.NA
for v in df.columns[df.columns.str.startswith(other_varpre)]:
Collaborator

Just tried running the tests and the same code in a Jupyter notebook, and got the following error: TypeError: float() argument must be a string or a number, not 'NAType'. If you are not getting this error, then it is probably due to an environment mismatch, I guess. Or do you have any other guess regarding this error?
Maybe it is due to setting pd.NA? What do you think @flcong?

Author

Which test triggers the error? It seems to come from the conversion float(pd.NA)?

Collaborator

For me, the first query (left join) was failing in test_join_lricomplex.
I will try to provide more context by tomorrow :)

Author

OK. BTW, I use Python 3.8.10 and Pandas 1.3.2 to run the tests.

Collaborator

That's a good catch @flcong. The above error was raised on pandas version 1.2.4 and works fine with pandas==1.3.2.
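For anyone hitting this later, a minimal sketch of the pattern under discussion (the column name is illustrative, and whether this exact assignment raises depends on the pandas version, as noted above):

import pandas as pd

df = pd.DataFrame({"lk_nullint": [1.0, 2.0]})
# Assigning pd.NA into a float64 column is the kind of assignment that,
# according to this thread, raised "TypeError: float() argument must be a
# string or a number, not 'NAType'" on pandas 1.2.4 but works on pandas 1.3.x.
df.loc[df.index == 0, "lk_nullint"] = pd.NA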

* Add new data
* Allow for type conversion
* Allow for specification of whether or not to check dtypes.
* Add new tests in `test_join.py` and in `test_postgres.py` as well.
* Expose port 5432 for postgres container.
* Change address to "localhost"
Collaborator

@rajagurunath rajagurunath left a comment

Maybe once these changes are done, I would request @nils-braun for further review, since I am also not an expert in the dask_sql/physical/rel/logical part.

@@ -32,6 +33,7 @@ def engine():
# get the address and create the connection
postgres.reload()
address = postgres.attrs["NetworkSettings"]["Networks"]["dask-sql"]["IPAddress"]
address = "localhost"
Collaborator

Hi @flcong, apologies for not letting you know about this earlier. Once you have tested with a custom PostgreSQL address, please replace it with the original Docker container host address; otherwise the GitHub workflow will fail.

Author

Ah. I see.

""",
["dates", "startdate", "enddate"],
force_dtype="dask",
check_dtype=True,
Collaborator

Just curious here: was specifying check_dtype=False not working here? Is there any other reason for introducing this new argument (force_dtype)?

Author

I'm not sure how assert_frame_equal determines whether, for example, 2.0 and 2 are identical when check_dtype=False. So here I try to make it more explicit that a type cast applied to the dask dataframe makes the two dataframes identical even with check_dtype=True. I guess it conveys more information for developers? (Maybe it's just my OCD.)
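To make the question concrete, a small standalone illustration of pandas.testing.assert_frame_equal's check_dtype behaviour (plain pandas, outside of the dask-sql test helpers):

import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"a": [2, 3]})       # int64
right = pd.DataFrame({"a": [2.0, 3.0]})  # float64

# Passes: with check_dtype=False only the values are compared, so 2 == 2.0.
assert_frame_equal(left, right, check_dtype=False)

# Would raise AssertionError: the dtypes (int64 vs float64) differ.
# assert_frame_equal(left, right, check_dtype=True)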

Collaborator

@nils-braun nils-braun left a comment

Wow @flcong, that was probably quite difficult to find out. Nice job. It took me quite some time to dig through the code, so I think it would be very cool if you could add a bit more documentation. Maybe you can use a small example dataframe in the comments to show what is actually wrong and how each step helps in solving it (e.g. the three cases you have, and then which rows the different merges later actually add). If this is too much to ask for, I can also take care of it.

I had some smaller comments in the code, but there is also a more general issue I would like to discuss here (and maybe it is just me being too naive). Let me recap the problem: when we do a LEFT or RIGHT join with a filter condition, it happens that we get rid of all NaN rows in the other dataframe (and therefore the JOIN turns into an inner join). I did understand this part. In my naive understanding (please correct me), the solution should be rather easy: if a certain row from the left (or right) dataframe is missing from the resulting dataframe and we are doing a LEFT (or RIGHT) join, we still need to keep it. Shouldn't a very simple join with the initial left (or right) dataframe from before the merge, on one of your newly created index columns, be enough here? Your solution might also work (I did not think through all corner cases, but it looks good so far), but it is quite complex and involves multiple merges (which are shuffles in the Dask world). Am I too simple?

# could be duplicates. (Yeah. It may be better to inform users that
# index will break. After all, it is expected to be broken since the
# number of rows changes.
df = df.assign(uniqid=1)
Collaborator

I think I did not understand why this is needed, as you are grouping and joining later anyway. Probably I am just too stupid to see it, but maybe it is wise to add more documentation on why you do it.
(Also, I am not 100% sure whether this triggers a computation, as Dask needs to know about the divisions. But I did not check.)

@@ -92,12 +95,32 @@ def convert(
# 4. dask can only merge on the same column names.
# We therefore create new columns on purpose, which have a distinct name.
assert len(lhs_on) == len(rhs_on)
# Add two columns (1,2,...) to keep track of observations in left and
Collaborator

These changes as well as the index-reset further down are only used when there is a filter condition. I do not expect this to be the default. Therefore I think it would be nice if we do not touch the "normal" use case and do not introduce another performance drawback for the "normal" user. I think a simple if filter_condition and some comment should be enough, or what do you think?
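A sketch of the suggested structure (the column-creation details are schematic rather than the PR's exact code; left_idx/right_idx are the helper columns mentioned elsewhere in this review):

if filter_condition:
    # Only build the bookkeeping columns (running 1, 2, ... indices) when an
    # inequality filter exists, so the plain equality-join path stays untouched.
    df_lhs_renamed = df_lhs_renamed.assign(left_idx=1)
    df_lhs_renamed["left_idx"] = df_lhs_renamed["left_idx"].cumsum()
    df_rhs_renamed = df_rhs_renamed.assign(right_idx=1)
    df_rhs_renamed["right_idx"] = df_rhs_renamed["right_idx"].cumsum()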

@@ -177,8 +218,91 @@ def merge_single_partitions(lhs_partition, rhs_partition):
for rex in filter_condition
],
)
logger.debug(f"Additionally applying filter {filter_condition}")
df = filter_or_scalar(df, filter_condition)
# Three cases to deal with inequality conditions (left join as an example):
Collaborator

Can you add more documentation here? From the PR context I know that we are dealing with complex join conditions, which consist of an equality and an inequality join (but there could be more than just inequalities), but someone just reading the code will not know that. Just describe the setting you are dealing with and the actual problem with the naive implementation before going into the details.

@nils-braun
Collaborator

Am I too simple?
Just to make my point clear: if you replace your changes in join.py inside the filter_condition check just with

logger.debug(f"Additionally applying filter {filter_condition}")
df = filter_or_scalar(df, filter_condition)

# make sure we recover any lost rows in case of left, right or outer joins
if join_type in ["left", "outer"]:
    df = df.merge(df_lhs_renamed, on=list(df_lhs_renamed.columns), how="right")
elif join_type in ["right", "outer"]:
    df = df.merge(df_rhs_renamed, on=list(df_rhs_renamed.columns), how="right")

df = df.drop(columns=["left_idx", "right_idx"])

your tests still succeed

@@ -86,6 +86,88 @@ def datetime_table():
)


@pytest.fixture
def user_table_lk():
Collaborator

I love those tests; they are super cool because they seem to come from a real use case, which is absolutely brilliant.

However, can we also have a very simple one with just 3-4 rows and two columns (e.g. the one I used in my comments)? This makes debugging much easier than skimming through multiple lines which (because the columns are so wide) even span a lot of space in the editor. I can also take care of this if you want!
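As a rough illustration of the kind of minimal fixture being asked for (fixture name, columns, and values are hypothetical, not taken from the PR):

import pandas as pd
import pytest


@pytest.fixture
def user_table_small():
    # Three rows, two columns: small enough to reason about which rows a
    # LEFT JOIN with an inequality condition must keep.
    return pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})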

Author

Yeah. I realize that. I simplified the new tests.

* Roll back to previous `join.py` but add lines to merge back unmatched rows.
* Fix a bug in `merge_single_partitions` where the returned dataframe has an extra column `"common"` that triggers a metadata mismatch in the added merge lines.
* Since int columns will be converted to float if there are unmatched rows, I use `check_dtype=False` but still use nullable int in the assumed correct table.
@flcong
Author

flcong commented Aug 30, 2021

Am I too simple?
Just to make my point clear: if you replace your changes in join.py inside the filter_condition check just with

logger.debug(f"Additionally applying filter {filter_condition}")
df = filter_or_scalar(df, filter_condition)

# make sure we recover any lost rows in case of left, right or outer joins
if join_type in ["left", "outer"]:
    df = df.merge(df_lhs_renamed, on=list(df_lhs_renamed.columns), how="right")
elif join_type in ["right", "outer"]:
    df = df.merge(df_rhs_renamed, on=list(df_rhs_renamed.columns), how="right")

df = df.drop(columns=["left_idx", "right_idx"])

your tests still succeed

That's brilliant! I didn't think of that. I rolled back join.py to the previous version and added the lines you mentioned.

One caveat is that in this case, if any input column has type int and there are unmatched rows, np.nan will be added and the column type will be cast to float, instead of nullable int. Since we don't know whether or not there are unmatched rows without triggering compute() (I guess), we have to either leave it as is (users may want to type-cast float into nullable int later) or we cast all int columns to nullable int (which may not be good either). What do you think?
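
For illustration, the dtype behaviour described here in plain pandas (dask follows the same semantics):

import pandas as pd

left = pd.DataFrame({"key": [1, 2], "x": [10, 20]})   # x is int64
right = pd.DataFrame({"key": [1], "y": [100]})        # y is int64

out = left.merge(right, on="key", how="left")
# The unmatched row receives NaN in "y", so "y" is silently upcast to float64.
print(out["y"].dtype)   # float64

# If nullable integers are wanted, one can cast back afterwards:
out["y"] = out["y"].astype("Int64")
print(out["y"].dtype)   # Int64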

By the way, I also found a bug in the merge_single_partitions function in join.py. (I don't know how you quote code in the conversation XD.) In the returned dataframe, there is an extra column "common" which causes a metadata mismatch for left joins when there are only inequality conditions (e.g. my time-series table tests), so I dropped the column.
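
A sketch of the fix being described; the cross-join-via-constant-column pattern is inferred from the discussion, so the real merge_single_partitions in join.py may differ in detail:

def merge_single_partitions(lhs_partition, rhs_partition):
    # Cross-join two partitions by merging on a temporary constant column,
    # then drop that helper column so the output matches the declared metadata.
    lhs_partition = lhs_partition.assign(common=1)
    rhs_partition = rhs_partition.assign(common=1)
    return lhs_partition.merge(rhs_partition, on="common").drop(columns="common")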


Successfully merging this pull request may close these issues.

Left Join becomes Inner Join for inequality conditions
4 participants