added combine first function #1950

Closed

Conversation

@AishwaryaKalloli commented on Dec 3, 2020:

#1929
This is an initial commit; I want to know if I am going in the right direction.
Please let me know if I need to correct or improve anything.

set(self._internal.column_labels).intersection(set(other._internal.column_labels))
)

update_sdf = self.join(
@AishwaryaKalloli (Author) replied:

Sure, makes sense. Will update the code and let you know.

@AishwaryaKalloli (Author):

I have attached the results of the cases provided in the pandas docs.
The first and second dataframes in each image are df1 and df2, and the third is df1.combine_first(df2).
The results are correct, although I am having a hard time replacing df._internal.spark_column_names. Is it necessary to reset them? If so, can you point me to the function I could use to reset them?

[two screenshots attached]

@itholic (Contributor) left a comment:

I'll review in more detail after we discuss the comments below.

Thanks for the work on this!! :)

Comment on lines 7252 to 7253
if isinstance(other, ks.Series):
    other = other.to_frame()
@itholic (Contributor) commented on Dec 7, 2020:

Maybe we don't need to consider ks.Series, since pandas does not seem to support passing a Series for this parameter?

>>> pdf
     A    B
0  NaN  NaN
1  0.0  4.0

>>> pser
1    3
2    3
Name: B, dtype: int64

>>> pdf.combine_first(pser)
Traceback (most recent call last):
...
ValueError: Must specify axis=0 or 1

It is also specified in their docs:

    Parameters
    ----------
    other : DataFrame
        Provided DataFrame to use to fill null values.

@AishwaryaKalloli (Author) replied:

Right, I removed it.

databricks/koalas/frame.py (resolved)
set(self._internal.column_labels).intersection(set(other._internal.column_labels))
)

update_sdf = combine_frames(self, other)._internal.resolved_copy.spark_frame
Contributor:

I'd say we don't need resolved_copy here, because the DataFrame combined via combine_frames is newly created, so there is nothing to resolve.

@itholic (Contributor) commented on Dec 7, 2020:

How about creating combined_df first and getting update_sdf after that, so that we can also use other parts of combined_df, not only spark_frame?

like

        combined_df = combine_frames(self, other)
        update_sdf = combined_df._internal.resolved_copy.spark_frame
        ...
        spark_columns = combined_df._internal.spark_columns
        update_sdf.select(spark_columns, ...)


update_sdf = combine_frames(self, other)._internal.resolved_copy.spark_frame

for column_labels in update_columns:
@itholic (Contributor) commented on Dec 7, 2020:

nit: How about just column_label rather than column_labels, to avoid confusion, since it always refers to a single column label?

Comment on lines 7267 to 7270
update_sdf = update_sdf.withColumn(
    "__this_" + column_name, F.when(old_col.isNull(), new_col).otherwise(old_col)
)
update_sdf = update_sdf.drop("__that_" + column_labels[0])
@itholic (Contributor) commented on Dec 7, 2020:

Maybe we can simply select the necessary columns rather than using withColumn and drop?

like

spark_columns = combined_df._internal.spark_columns
# Add some code to exclude `column_labels[0]` here rather than use `drop`
spark_columns = ...
cond = F.when(old_col.isNull(), new_col).otherwise(old_col).alias("__this_" + column_name)

update_sdf = update_sdf.select(*spark_columns, cond)
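
A hypothetical completion of the suggestion above, as a sketch only: the `keep` list is illustrative, and `update_sdf`, `old_col`, `new_col`, and `column_name` are the variables from the surrounding diff.

# Exclude both temporary columns for this label, then re-add the coalesced one.
keep = [
    c for c in update_sdf.columns
    if c not in ("__this_" + column_name, "__that_" + column_name)
]
cond = F.when(old_col.isNull(), new_col).otherwise(old_col).alias("__this_" + column_name)
update_sdf = update_sdf.select(*keep, cond)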

Comment on lines 7272 to 7281
all_column_labels = []
for column in update_sdf.columns:
    if column.startswith("__this_") or column.startswith("__that_"):
        all_column_labels.append((column[7:],))

internal = InternalFrame(
    spark_frame=update_sdf,
    index_spark_column_names=list(self._internal.index_spark_column_names),
    column_labels=list(all_column_labels),
)
Contributor:

I think we can simply get the parameters for creating InternalFrame from combined_df, if we create it beforehand, with just a little modification.

@AishwaryaKalloli (Author) commented on Dec 7, 2020:

Based on what I understood from the comments, what I have tried to do is:

  1. Get all the common columns between self and other.
  2. Combine both frames and create an sdf from that.
  3. Create a list called final_spark_columns, which collects all the columns that have to be present in the final df.
  4. Loop through all the columns in the combined frame and, in final_spark_columns:
    • collect common column names only once, renamed as __newc_~ (~ is the original column name; newc for new column)
    • collect the other column names as they are.
  5. Finally, loop through all the column names in final_spark_columns and rename them (a rough sketch of these steps appears after the warning below).

I am not sure it is very efficient, though; I got the warning below. Let me know if it makes sense and what I can do to improve the code.
20/12/07 18:50:01 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
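
A rough sketch of steps 1–5, under the naming in the comment above; `final_spark_columns` and the `__newc_` prefix follow the description, and the coalescing condition is assumed to mirror pandas' null-filling semantics rather than taken from the PR:

common = set(self._internal.column_labels) & set(other._internal.column_labels)  # step 1
combined_df = combine_frames(self, other)                                        # step 2
sdf = combined_df._internal.spark_frame
final_spark_columns = []                                                         # step 3
for name in sdf.columns:                                                         # step 4
    if name.startswith("__this_") and (name[7:],) in common:
        # Common column: coalesce "this" over "that", collected once as __newc_~.
        this_col, that_col = sdf[name], sdf["__that_" + name[7:]]
        final_spark_columns.append(
            F.when(this_col.isNull(), that_col).otherwise(this_col).alias("__newc_" + name[7:])
        )
    elif name.startswith("__that_") and (name[7:],) in common:
        continue  # already handled through its "__this_" counterpart
    else:
        final_spark_columns.append(sdf[name])  # index and non-common columns kept as-is
sdf = sdf.select(*final_spark_columns)  # step 5: select and rename in one pass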

@itholic (Contributor) commented on Dec 8, 2020:

Thanks, @AishwaryaKalloli . Let me take a look at the changes soon! :)

@AishwaryaKalloli (Author):

Sure, thanks :)

@itholic (Contributor) left a comment:

I think we can refer to the implementation concept of Series.combine_first().

def combine_first(self, other) -> "Series":
    """
    Combine Series values, choosing the calling Series's values first.

    Parameters
    ----------
    other : Series
        The value(s) to be combined with the `Series`.

    Returns
    -------
    Series
        The result of combining the Series with the other object.

    See Also
    --------
    Series.combine : Perform elementwise operation on two Series
        using a given function.

    Notes
    -----
    Result index will be the union of the two indexes.

    Examples
    --------
    >>> s1 = ks.Series([1, np.nan])
    >>> s2 = ks.Series([3, 4])
    >>> with ks.option_context("compute.ops_on_diff_frames", True):
    ...     s1.combine_first(s2)
    0    1.0
    1    4.0
    dtype: float64
    """
    if not isinstance(other, ks.Series):
        raise ValueError("`combine_first` only allows `Series` for parameter `other`")
    if same_anchor(self, other):
        this = self.spark.column
        that = other.spark.column
        combined = self._kdf
    else:
        combined = combine_frames(self._kdf, other._kdf)
        this = combined["this"]._internal.spark_column_for(self._column_label)
        that = combined["that"]._internal.spark_column_for(other._column_label)
    # If `self` has missing value, use value of `other`
    cond = F.when(this.isNull(), that).otherwise(this)
    # If `self` and `other` come from same frame, the anchor should be kept
    if same_anchor(self, other):
        return self._with_new_scol(cond)
    index_scols = combined._internal.index_spark_columns
    sdf = combined._internal.spark_frame.select(
        *index_scols, cond.alias(self._internal.data_spark_column_names[0])
    ).distinct()
    internal = self._internal.with_new_sdf(sdf)
    return first_series(DataFrame(internal))

databricks/koalas/frame.py (resolved)

combined_df = combine_frames(self, other)
column_labels = combined_df._internal.column_labels
updated_sdf = combined_df._internal.resolved_copy.spark_frame
Contributor:

I think maybe we don't need resolved_copy here.

spark_columns = combined_df._internal.spark_columns

for column_label in column_labels:
    if (column_label[1],) in update_columns:
Contributor:

This doesn't seem valid for MultiIndex columns?

Could you try with the self and other below?

self = pd.DataFrame({'A': [None, 0], 'B': [None, 4]}, columns=pd.MultiIndex.from_tuples([('A', 'hello'), ('B', 'hi')]))
other = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, columns=pd.MultiIndex.from_tuples([('B', 'hi'), ('C', 'okay')]))
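
One hedged way to actually run this case against the branch (`self` and `other` are the pandas frames defined just above, converted to Koalas first):

kdf1, kdf2 = ks.from_pandas(self), ks.from_pandas(other)
with ks.option_context("compute.ops_on_diff_frames", True):
    print(kdf1.combine_first(kdf2).sort_index())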

@AishwaryaKalloli (Author) commented on Dec 18, 2020:

Right, would it make sense to use something like the following?

for column_label in column_labels:
    if column_label[1:] in update_columns:
        if column_label[0] == "this":
            column_name = self._internal.spark_column_name_for(column_label[1:])

@AishwaryaKalloli (Author):

Thanks, I'll take a look at the implementation in Series and update accordingly.

@shril (Contributor) commented on Dec 19, 2020:

You are facing a pycodestyle test failure.
Please try running the following command before committing the code:

./dev/pytest -k test_dataframe.py

The failing tests are mostly style tests; you can rectify them locally.

midx_df2 = ks.from_pandas(midx_pdf2)
with option_context("compute.ops_on_diff_frames", True):
    midx_res = midx_df1.combine_first(midx_df2)
self.assert_eq(midx_res, midx_pdf1.combine_first(midx_pdf2))
Contributor:

Can we have more examples for exception cases?
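
For example, a minimal exception-case test might look like the sketch below; the ValueError message is an assumption patterned on Series.combine_first, not the PR's actual wording:

kdf = ks.from_pandas(midx_pdf1)
with self.assertRaisesRegex(ValueError, "`combine_first` only allows `DataFrame`"):
    kdf.combine_first(ks.Series([1, 2]))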

@xinrong-meng (Contributor):

Hi @AishwaryaKalloli, since Koalas has been ported to Spark as the pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket: https://issues.apache.org/jira/browse/SPARK-36399. Otherwise, I can do that for you next week.

@xinrong-meng (Contributor):

I am porting this now.

ueshin pushed a commit to apache/spark that referenced this pull request Sep 1, 2021
### What changes were proposed in this pull request?
Implement `DataFrame.combine_first`.

The PR is based on databricks/koalas#1950. Thanks AishwaryaKalloli for the prototype.

### Why are the changes needed?
Updating null elements with the value in the same location in another DataFrame is a common use case.
It is supported in pandas; we should support it as well.

### Does this PR introduce _any_ user-facing change?
Yes. `DataFrame.combine_first` can be used.

```py
>>> ps.set_option("compute.ops_on_diff_frames", True)
>>> df1 = ps.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = ps.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2).sort_index()
     A    B
0  1.0  3.0
1  0.0  4.0

# Null values still persist if the location of that null value does not exist in other

>>> df1 = ps.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = ps.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2).sort_index()
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
>>> ps.reset_option("compute.ops_on_diff_frames")
```

### How was this patch tested?
Unit tests.

Closes #33714 from xinrong-databricks/df_combine_first.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
@xinrong-meng (Contributor):

Hi @AishwaryaKalloli, I would like to close this PR since it has been ported to Spark.
