Add ANALYZE utility script#69

Merged
mattgara merged 4 commits into main from
mattgara/analyze-tables
Oct 2, 2025

Conversation

@mattgara
Contributor

@mattgara mattgara commented Oct 1, 2025

This PR is intended to address #34.

Contributor

@paul-aiyedun paul-aiyedun left a comment

I had some minor comments, but changes look good to me.

Comment on lines +39 to +41
--host Presto coordinator hostname (default: localhost).
--port Presto coordinator port (default: 8080).
--user Presto user (default: test_user).
Contributor

Nit: Update options to match

-h, --hostname Hostname of the Presto coordinator.
-p, --port Port number of the Presto coordinator.
-u, --user User who queries will be executed as.
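
For illustration, here is a minimal sketch of how the suggested short and long options could be declared if the Python script parses its arguments with `argparse` (an assumption; the review does not show the parser). The defaults are taken from the help text under review; `build_parser` is a hypothetical name. Note that `argparse` reserves `-h` for help by default, so reusing it for `--hostname` requires `add_help=False`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # add_help=False frees "-h" for --hostname; argparse otherwise reserves
    # "-h" for help and rejects the conflicting option string.
    parser = argparse.ArgumentParser(
        description="Run ANALYZE on tables in a schema (illustrative sketch).",
        add_help=False,
    )
    # Re-register help under the long option only.
    parser.add_argument("--help", action="help", help="Show this message and exit.")
    parser.add_argument("-h", "--hostname", default="localhost",
                        help="Hostname of the Presto coordinator.")
    parser.add_argument("-p", "--port", type=int, default=8080,
                        help="Port number of the Presto coordinator.")
    parser.add_argument("-u", "--user", default="test_user",
                        help="User who queries will be executed as.")
    return parser


# Example: override host and port, keep the default user.
args = build_parser().parse_args(["-h", "coordinator.example", "-p", "9090"])
print(args.hostname, args.port, args.user)
```

Keeping the option names identical between the shell wrapper's usage text and the Python parser means the wrapper can forward its arguments unchanged.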

ANALYZE_TABLES_SCRIPT_PATH=../testing/integration_tests/analyze_tables.py
REQUIREMENTS_PATH=../testing/requirements.txt

../../scripts/run_py_script.sh -p "$ANALYZE_TABLES_SCRIPT_PATH" -r "$REQUIREMENTS_PATH" "${SCRIPT_ARGS[@]}"
No newline at end of file
Contributor

Nit: Add new line to end of file.
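
A small helper along these lines could detect and fix the missing trailing newline flagged here. This is only a sketch: the function names are hypothetical and are not part of the PR.

```python
import tempfile
from pathlib import Path


def ends_with_newline(path: Path) -> bool:
    """Return True if the file is empty or its last byte is a newline."""
    data = path.read_bytes()
    return not data or data.endswith(b"\n")


def ensure_trailing_newline(path: Path) -> bool:
    """Append a newline if one is missing; return True if the file changed."""
    if ends_with_newline(path):
        return False
    with path.open("ab") as f:
        f.write(b"\n")
    return True


# Demonstration on a temporary file that lacks a final newline.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "example.sh"
    script.write_bytes(b'echo "done"')
    print(ensure_trailing_newline(script))  # file was modified
    print(ends_with_newline(script))        # now ends with a newline
```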

        analyze_tables(cursor, args.schema_name, verbose=args.verbose)
    finally:
        cursor.close()
        conn.close()
No newline at end of file
Contributor

Nit: Add new line to end of file.

print(f"Warning: Failed to analyze table '{table_name}': {e}")
failure_count += 1

if verbose:
Contributor

Nit: Does `if verbose or failure_count > 0:` also work here instead of the `elif`?
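
Whether the two forms are interchangeable depends on the branch bodies, which the snippet under review does not show. The sketch below assumes both branches emit the same summary line (the message text and function names are hypothetical); under that assumption the combined condition behaves identically. If the `elif` branch did something different from the `if` branch, the merge would not be equivalent.

```python
from itertools import product


def summary_line(success_count: int, failure_count: int) -> str:
    # Hypothetical message; the script's real wording is not shown in the review.
    return f"Analyzed {success_count} table(s), {failure_count} failure(s)."


def report_with_elif(success_count: int, failure_count: int, verbose: bool) -> list:
    # Assumed original shape: two branches with the same body.
    if verbose:
        return [summary_line(success_count, failure_count)]
    elif failure_count > 0:
        return [summary_line(success_count, failure_count)]
    return []


def report_combined(success_count: int, failure_count: int, verbose: bool) -> list:
    # Suggested shape: one branch guarded by the combined condition.
    if verbose or failure_count > 0:
        return [summary_line(success_count, failure_count)]
    return []


# Check every combination of verbosity and failure presence.
for verbose, failures in product([False, True], [0, 2]):
    assert report_with_elif(5, failures, verbose) == report_combined(5, failures, verbose)
```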

@mattgara mattgara merged commit 3925aa6 into main Oct 2, 2025
@mattgara mattgara deleted the mattgara/analyze-tables branch October 2, 2025 23:52
@karthikeyann
Contributor

karthikeyann commented Oct 3, 2025

This script should be called as part of registering the tables. Is there a script for registering existing TPCH data?
The script should identify the data types from parquet files and use the right types in SQL to register the tables.

@mattgara
Contributor Author

mattgara commented Oct 3, 2025

> Is there a script for registering existing TPCH data?

Yes, the script that handles this is setup_benchmark_tables.sh. It was added as part of @paul-aiyedun's PR #56, which implemented Presto benchmarks.

> The script should identify the data types from parquet files and use the right types in SQL to register the tables.

Could you please clarify what you mean by this?

I think there may be some confusion here. This PR is about adding analyze_tables.sh, which runs ANALYZE ... commands on already-registered tables. This is separate from table registration.
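
To make the distinction concrete, here is a rough sketch of what running `ANALYZE` over already-registered tables can look like with a DB-API style cursor. It is not the PR's implementation: only the call signature `analyze_tables(cursor, schema_name, verbose=...)`, the warning message, and the `failure_count` counter appear in the reviewed snippets. The `SHOW TABLES` lookup, the returned count, and the `FakeCursor` stand-in are assumptions made so the example runs without a Presto server. Identifier quoting is deliberately omitted.

```python
def analyze_tables(cursor, schema_name, verbose=False):
    """Run ANALYZE on every table in schema_name; return the failure count."""
    cursor.execute(f"SHOW TABLES FROM {schema_name}")
    table_names = [row[0] for row in cursor.fetchall()]

    failure_count = 0
    for table_name in table_names:
        try:
            cursor.execute(f"ANALYZE {schema_name}.{table_name}")
            cursor.fetchall()  # drain the result so the statement completes
            if verbose:
                print(f"Analyzed table '{table_name}'")
        except Exception as e:
            # A failure on one table should not stop the remaining tables.
            print(f"Warning: Failed to analyze table '{table_name}': {e}")
            failure_count += 1
    return failure_count


class FakeCursor:
    """Hypothetical stand-in for a real Presto cursor, for demonstration only."""

    def __init__(self, tables, fail_on=()):
        self.tables = list(tables)
        self.fail_on = set(fail_on)
        self.executed = []
        self._rows = []

    def execute(self, sql):
        self.executed.append(sql)
        if sql.startswith("SHOW TABLES"):
            self._rows = [(t,) for t in self.tables]
        elif sql.rsplit(".", 1)[-1] in self.fail_on:
            raise RuntimeError("simulated failure")
        else:
            self._rows = []

    def fetchall(self):
        return self._rows


cursor = FakeCursor(["lineitem", "orders"], fail_on={"orders"})
failures = analyze_tables(cursor, "tpch", verbose=True)
print(f"{failures} table(s) failed")
```

The tables must already exist in the schema for this to do anything, which is why table registration is a separate step.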

The question you're asking seems to be about setup_benchmark_tables.sh from PR #56. IIUC, that script currently uses DuckDB's built-in TPC-H schema definition rather than reading the parquet file metadata directly. If you have concerns about how that script handles data types, could you open a separate issue or comment on PR #56? That would be the appropriate place to discuss changes to the table registration logic.

@karthikeyann
Contributor

Some columns have different data types in the Velox TPC-H data that we have: some are BIGINT instead of INTEGER. The table registration does not check the types inside the parquet files, so the mismatch was only discovered when we ran the TPC-H queries. When registering existing data from parquet files, checking the data types, or using the data types from the parquet files directly, would be a nice-to-have feature.
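
A check of this kind could be as simple as comparing the declared column types against the types found in the files before registering a table. The sketch below shows only the comparison step; the column names and types are illustrative placeholders, not the real schema, and `find_type_mismatches` is a hypothetical helper. In practice the actual types might be obtained from the parquet metadata (for example with `pyarrow.parquet.read_schema`) and mapped to SQL type names, which is assumed here rather than shown.

```python
def find_type_mismatches(expected, actual):
    """Compare two {column: SQL type} mappings.

    Returns a sorted list of (column, expected_type, actual_type) tuples for
    columns whose types differ; a column missing on one side is reported
    with None on that side.
    """
    mismatches = []
    for column in sorted(set(expected) | set(actual)):
        expected_type = expected.get(column)
        actual_type = actual.get(column)
        if expected_type != actual_type:
            mismatches.append((column, expected_type, actual_type))
    return mismatches


# Illustrative placeholders: types declared in the registration SQL versus
# types assumed to have been read from a parquet file.
declared = {"col_key": "INTEGER", "col_count": "INTEGER", "col_name": "VARCHAR"}
from_parquet = {"col_key": "BIGINT", "col_count": "INTEGER", "col_name": "VARCHAR"}

for column, want, got in find_type_mismatches(declared, from_parquet):
    print(f"Column '{column}': declared {want}, file has {got}")
```

Running such a check at registration time would surface a BIGINT versus INTEGER mismatch before any query is executed.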

@mattgara
Contributor Author

mattgara commented Oct 4, 2025

Okay, thanks for the great feedback, @karthikeyann. I've opened an issue to track this in #72.

Avinash-Raj pushed a commit that referenced this pull request Oct 16, 2025
This PR is intended to address #34.
3 participants