9 changes: 2 additions & 7 deletions python/pyspark/sql/connect/dataframe.py
@@ -140,13 +140,8 @@ def columns(self) -> List[str]:
"""Returns the list of columns of the current data frame."""
if self._plan is None:
return []
if "columns" not in self._cache and self._plan is not None:
pdd = self.limit(0).toPandas()
Contributor:
maybe we don't need to cache columns, just return self.schema.names?

Contributor Author (@amaliujia, Nov 9, 2022):
I would prefer not to depend on the underlying API when doing caching...
E.g. what if someday the cache on the schema is gone but this API is not aware of it, etc.
Basically, do not make assumptions :)

Member:
Hmmm... why do we need _cache? I think we can just remove it.

Contributor Author (@amaliujia, Nov 10, 2022):
Hmm, if users call this API multiple times, we only need one gRPC call. This should be useful, right?

Something like:

df.columns()
# ... other work ...
df.columns()
# ... other work ...
df.columns()
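To make the trade-off concrete, here is a small runnable toy of the memoization being described in this thread. It is not this PR's code; ColumnsCacheSketch and _fetch_schema_names are made-up names, and the latter merely stands in for the real gRPC schema request:

from typing import List, Optional


class ColumnsCacheSketch:
    """Toy model of the caching under discussion: `.columns` talks to the
    server only on the first access and reuses the result afterwards."""

    def __init__(self, plan: Optional[object]) -> None:
        self._plan = plan
        self._cache: dict = {}
        self.rpc_calls = 0  # counter exists only to make the saving visible

    def _fetch_schema_names(self) -> List[str]:
        # Stand-in for the real gRPC schema round trip.
        self.rpc_calls += 1
        return ["id", "name"]

    @property
    def columns(self) -> List[str]:
        if self._plan is None:
            return []
        if "columns" not in self._cache:  # first call pays the RPC
            self._cache["columns"] = self._fetch_schema_names()
        return self._cache["columns"]  # later calls are free


df = ColumnsCacheSketch(plan=object())
_ = (df.columns, df.columns, df.columns)
assert df.rpc_calls == 1  # three accesses, one simulated RPC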

Contributor Author (@amaliujia, Nov 10, 2022):
If you think this is a bit of over-engineering, I can remove it.

Member:
For that case, we should probably have a proper cache layer instead of doing this alone in names.

Member:
We can do that for all the metadata-ish cases, e.g. schema or even collected results.

Contributor Author:
I removed the caching in this PR. Once we support enough of the API, we can go back and build a cache layer for all the metadata-like APIs to save RPC calls.
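Nothing like this is in the PR, but a minimal sketch of the shared cache layer being proposed could look as follows; MetadataCache, get_or_fetch, and _fetch_schema are invented names used purely for illustration:

from typing import Any, Callable, Dict


class MetadataCache:
    """One place to memoize metadata-ish results (schema, column names,
    perhaps collected rows) instead of ad-hoc caching in each property."""

    def __init__(self) -> None:
        self._entries: Dict[str, Any] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        # Only the first lookup for a key performs the fetch (i.e. the RPC);
        # later lookups return the memoized value.
        if key not in self._entries:
            self._entries[key] = fetch()
        return self._entries[key]

    def invalidate(self) -> None:
        # A real implementation would call this whenever the plan changes,
        # so that stale metadata is never served.
        self._entries.clear()


# Hypothetical wiring inside a connect DataFrame:
#   self._metadata = MetadataCache()
#   schema = self._metadata.get_or_fetch("schema", self._fetch_schema)
#   columns = schema.names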

-            if pdd is None:
-                raise Exception("Empty result")
-            # Translate to standard pytho array
-            self._cache["columns"] = pdd.columns.values
-        return self._cache["columns"]
+
+        return self.schema().names
 
     def count(self) -> int:
         """Returns the number of rows in the data frame"""
5 changes: 5 additions & 0 deletions python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -106,6 +106,11 @@ def test_simple_read(self):
         # Check that the limit is applied
         self.assertEqual(len(data.index), 10)
 
+    def test_columns(self):
+        # SPARK-41036: test `columns` API for python client.
+        columns = self.connect.read.table(self.tbl_name).columns
+        self.assertEqual(["id", "name"], columns)
+
     def test_collect(self):
         df = self.connect.read.table(self.tbl_name)
         data = df.limit(10).collect()