Improve string representation of datafusion classes (dataframe, context, expression, ...) #158

Closed
simicd opened this issue Jan 29, 2023 · 0 comments · Fixed by #159
Labels
enhancement New feature or request

simicd commented Jan 29, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When using the Python language bindings, the string representations of datafusion's objects could be more useful. Currently they show Python's default string representation, e.g.:

>>> df
<datafusion.DataFrame object at 0x0000018AA0220ED0>

>>> literal(3.14)
<datafusion.Expression object at 0x0000018AA026A3F0>

>>> f.ceil(column("age"))
<datafusion.Expression object at 0x0000018AA026A4E0>

>>> accum(column("a"))
<datafusion.Expression object at 0x000001FB6C450C60>

>>> Config()
<datafusion.Config object at 0x000001C73AD87230>

>>> ctx = SessionContext()
>>> ctx
<datafusion.SessionContext object at 0x0000020C86742170>

>>> ctx.catalog()
<datafusion.Catalog object at 0x0000020C867A65A0>

>>> ctx.catalog().database()
<datafusion.Database object at 0x0000020C867A66F0>

>>> ctx.catalog().database().table("t")
<datafusion.Table object at 0x0000020C867A67E0>

Other packages such as pandas or polars provide more specific outputs:

>>> pandas_df = pd.DataFrame(data={"a": [1, 2, 3], "b": ["Hello", "World", "!"]})
>>> pandas_df
   a      b
0  1  Hello
1  2  World
2  3      !


>>> polars_df = pl.DataFrame(data={"a": [1, 2, 3], "b": ["Hello", "World", "!"]})
>>> polars_df
shape: (3, 2)
┌─────┬───────┐
│ a   ┆ b     │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ Hello │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ World │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ !     │
└─────┴───────┘



Describe the solution you'd like
Ideally, customize the information displayed in the debugger/console. I raised PR #159, which implements Python's __repr__ methods for the datafusion classes listed below. This method gets called to populate the debugger in VS Code, among others.

Debugging in VS Code - Before
Debugging in VS Code - After
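
As a minimal illustration of the mechanism (plain Python with hypothetical class names, not the datafusion bindings themselves), a class without `__repr__` falls back to the default `<... object at 0x...>` form, while a class that defines it controls what the console and debugger display:

```python
class PlainSession:
    """Hypothetical class without a custom __repr__."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id


class DescribedSession(PlainSession):
    """Same hypothetical class, but with a custom __repr__."""

    def __repr__(self) -> str:
        # Show the class name plus the one property that identifies the object
        return f"{type(self).__name__}(session_id={self.session_id})"


plain = PlainSession("abc-123")
described = DescribedSession("abc-123")

print(repr(plain))      # default form, ends in "object at 0x...>"
print(repr(described))  # DescribedSession(session_id=abc-123)
```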

Below is an overview of the proposed outputs. I'm very curious to hear your thoughts and feedback:

Dataframe

Print up to ten rows of the dataframe

>>> df
DataFrame()
+---+-------+
| a | b     |
+---+-------+
| 1 | Hello |
| 2 | World |
| 3 | !     |
+---+-------+
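
The layout above could be produced roughly as follows. This is only a sketch in pure Python: `TinyFrame` and `MAX_ROWS` are hypothetical names, and the actual bindings are free to render the table differently.

```python
from typing import Dict, List

MAX_ROWS = 10  # the proposal prints up to ten rows


class TinyFrame:
    """Hypothetical stand-in for a dataframe that stores columns in a dict."""

    def __init__(self, data: Dict[str, List[object]]) -> None:
        self.data = data

    def __repr__(self) -> str:
        names = list(self.data)
        # Stringify at most MAX_ROWS rows
        rows = [
            [str(value) for value in row]
            for row in list(zip(*self.data.values()))[:MAX_ROWS]
        ]
        # Each column is as wide as its longest cell or its header
        widths = [
            max([len(name)] + [len(row[i]) for row in rows])
            for i, name in enumerate(names)
        ]
        border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

        def fmt(cells: List[str]) -> str:
            padded = (cell.ljust(w) for cell, w in zip(cells, widths))
            return "| " + " | ".join(padded) + " |"

        lines = ["DataFrame()", border, fmt(names), border]
        lines += [fmt(row) for row in rows]
        lines.append(border)
        return "\n".join(lines)


df = TinyFrame({"a": [1, 2, 3], "b": ["Hello", "World", "!"]})
print(repr(df))
```

Capping the output at a fixed number of rows keeps the representation cheap to compute and readable in a debugger pane even for large frames.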

Expressions

>>> literal(3.14)
Expr(Float64(3.14))

>>> f.ceil(column("age"))
Expr(ceil(age))

>>> accum(column("a"))
Expr(MissingMethods(a))

Context

Use available identifiers or properties. It would be nice if Catalog, Database and Table each had a unique identifier or name, but I didn't find such properties (in the example below, the Table object doesn't seem to know it's labeled "t"; only the Database object seems to store that information).

>>> ctx = SessionContext()
>>> ctx
SessionContext(session_id=7f2a7ddc-aa43-4900-a0e5-d22493c947e6)

>>> ctx.catalog()
Catalog(schema_names=[public])   # Ideally `Catalog(name=datafusion, schema_names=[public])`

>>> ctx.catalog().database()
Database(table_names=[t])        # Ideally `Database(name=public, table_names=[t])`

>>> ctx.catalog().database().table("t")
Table(kind=physical)             # Ideally `Table(name=t, kind=physical)`

Configuration

>>> config = Config()
>>> config
Config({'datafusion.catalog.create_default_catalog_and_schema': 'true', 'datafusion.catalog.default_catalog': 'datafusion', 'datafusion.catalog.default_schema': 'public', 'datafusion.catalog.information_schema': 'false', 'datafusion.catalog.location': None, 'datafusion.catalog.format': None, 'datafusion.catalog.has_header': 'false', 'datafusion.execution.batch_size': '8192', 'datafusion.execution.coalesce_batches': 'true', 'datafusion.execution.collect_statistics': 'false', 'datafusion.execution.target_partitions': '20', 'datafusion.execution.time_zone': '+00:00', 'datafusion.execution.parquet.enable_page_index': 'false', 'datafusion.execution.parquet.pruning': 'true', 'datafusion.execution.parquet.skip_metadata': 'true', 'datafusion.execution.parquet.metadata_size_hint': None, 'datafusion.execution.parquet.pushdown_filters': 'false', 'datafusion.execution.parquet.reorder_filters': 'false', 'datafusion.optimizer.enable_round_robin_repartition': 'true', 'datafusion.optimizer.filter_null_join_keys': 'false', 'datafusion.optimizer.repartition_aggregations': 'true', 'datafusion.optimizer.repartition_joins': 'true', 'datafusion.optimizer.repartition_windows': 'true', 'datafusion.optimizer.skip_failed_rules': 'true', 'datafusion.optimizer.max_passes': '3', 'datafusion.optimizer.top_down_join_key_reordering': 'true', 'datafusion.optimizer.prefer_hash_join': 'true', 'datafusion.optimizer.hash_join_single_partition_threshold': '1048576', 'datafusion.explain.logical_plan_only': 'false', 'datafusion.explain.physical_plan_only': 'false'})

Describe alternatives you've considered
n/a

Additional context
n/a
