
Generate dynamic Presto configs with pbench#48

Merged
simoneves merged 43 commits into main from seves/HERC-82_use_pbench_for_presto_memory_configs
Oct 22, 2025
Conversation

@simoneves
Contributor

@simoneves simoneves commented Sep 17, 2025

This PR adds the ability to generate Presto config values dynamically for a given machine spec (CPU count and CPU RAM size).

The config files are expanded based on more recent manual testing (@tmostak and @patdevinwilson) and converted to templates for population by pbench genconfig.

Pending a better solution (Go is not installed on the lab machines), I have committed pbench binaries for Linux AMD64 and ARM64, along with a script to generate the config files. These files MUST exist before Presto will run, as the Docker recipe has been adapted to map them in place of the previous hardwired versions.

Finally, there is a simple script to generate the dynamic files, which are the ones the Presto containers will use.

This process first generates the config.json file required by pbench in the generated subdirectory, populating the CPU and RAM entries based on the current host (possibly too simplistic... discuss!). It then runs pbench genconfig which copies all files found in the template folder to the generated folder substituting computed values where required, based on the input values in config.json.
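A minimal sketch of that flow, under assumptions: the JSON key names other than vcpu_per_worker are illustrative, not the exact pbench schema, and the RAM probe here is one possible approach, not necessarily the one the script uses.

```shell
#!/bin/sh
# Sketch of the generate flow: probe the host, write the pbench input
# JSON, then expand the templates. Key names other than vcpu_per_worker
# are illustrative; the real schema is whatever pbench genconfig expects.
NPROC=$(nproc)
RAM_GB=$(awk '/^MemTotal:/ {printf "%d", $2/1024/1024}' /proc/meminfo)

mkdir -p generated
cat > generated/config.json <<EOF
{
  "vcpu_per_worker": ${NPROC},
  "cpu_ram_gb": ${RAM_GB}
}
EOF

# Then expand every file in template/ into generated/, substituting
# computed values (the script in this PR passes the input file via -p;
# the exact filename may differ):
# ../../pbench/pbench genconfig -p generated/config.json -t template generated
```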

The auto-generated values are NOT applied to the Java config files. These remain at the basic default settings.

Note the variant-specific behavior: vcpu_per_worker is set to 2 for GPU mode, and to the CPU count for CPU and Java modes. This logic lives in the generate script rather than in separate templates, although separate templates may still be required in the future.

"native_query_mem_percent_of_sys_mem": 0.95,
"join_max_bcast_size_percent_of_container_mem": 0.01,
"memory_push_back_start_below_limit_gb": 5
}
Contributor Author

This file comes with pbench. I'm sure the heuristics could be tweaked if needed.

hive.metastore.catalog.dir=file:/var/lib/presto/data/hive/metastore
hive.allow-drop-table=true

hive.file-splittable=false
Contributor Author

IIUC, this is basically always required. Discuss.

Contributor

@devavret asked me to add it. This is required.

-server
-Xmx24G
-Xmx{{ .HeapSizeGb }}G
-Xms{{ .HeapSizeGb }}G
Contributor Author

Templated

-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.nio.maxCachedBufferSize=2000000
Contributor Author

From @tmostak

query.max-total-memory={{ mul .JavaQueryMaxTotalMemPerNodeGb .NumberOfWorkers }}GB
query.max-memory-per-node={{ .JavaQueryMaxMemPerNodeGb }}GB
query.max-memory={{ mul .JavaQueryMaxMemPerNodeGb .NumberOfWorkers }}GB
memory.heap-headroom-per-node={{ .HeadroomGb }}GB
Contributor Author

@simoneves Sep 17, 2025

THIS CONFIG WILL PROBABLY NOT WORK!

Contributor

If it does not work, should it be removed?

Contributor Author

Can't remove it. We need a config for Java mode. I just need to fix it.

Contributor Author

I removed the explicit memory configs for Java mode. It still works for the simple queries I tried. We can revisit this if we need to do comparisons with Java mode again.

query.max-total-memory={{ mul .JavaQueryMaxTotalMemPerNodeGb .NumberOfWorkers }}GB
query.max-memory-per-node={{ .JavaQueryMaxMemPerNodeGb }}GB
query.max-memory={{ mul .JavaQueryMaxMemPerNodeGb .NumberOfWorkers }}GB
memory.heap-headroom-per-node={{ .HeadroomGb }}GB
Contributor Author

@simoneves Sep 17, 2025

THIS CONFIG WILL PROBABLY NOT WORK!

Contributor Author

(ditto)

fi
fi

$EXE "$@"
Contributor Author

Standard pbench wrapper script to detect OS and ARCH and not much else.

Contributor Author

I did not add a license header, as it's copied directly from the pbench repo.

Contributor Author

Not sure why this annotation isn't now obsolete, as this file is no longer new in this PR but added by a merge-out from main after it was landed in a different PR.

presto-base-volumes:
volumes:
- ./config/etc_common:/opt/presto-server/etc
- ./config/generated/etc_common:/opt/presto-server/etc
Contributor Author

File structure is expected to be the same as before, just moved down into generated.

native-execution-enabled=true
optimizer.optimize-hash-generation=false
regex-library=RE2J
use-alternative-function-signatures=true
Contributor Author

This file was changed a lot, based on @tmostak's adaptation of the IBM version.

runtime-metrics-collection-enabled=true
system-mem-pushback-enabled=true
system-mem-limit-gb={{ sub .ContainerMemoryGb .GeneratorParameters.MemoryPushBackStartBelowLimitGb }}
system-mem-shrink-gb=20
Contributor Author

This file also changed based on @tmostak's adaptation of the IBM version.

@simoneves simoneves changed the title [HERC-82] Generate dynamic Presto configs Generate dynamic Presto configs with pbench Sep 17, 2025
pushd ../docker/config > /dev/null

# always move back even on failure
trap "popd > /dev/null" EXIT
Contributor Author

Not actually sure this is necessary. I think pops happen automatically on exit.

# get host values
NPROC=`nproc`
# lsmem will report in SI. Make sure we get values in GB.
RAM_GB=$(( $(lsmem -b | grep "Total online memory" | awk '{print $4}') / (1024*1024*1024) ))
Contributor Author

This can't be right!

Contributor Author

OK, it IS right, but would be simpler if the divide was just in the awk expression.

Also, something is generating a docker/config/1 file containing a pbench usage message.

Also, no files are generated!

Contributor Author

This was due to the typo in the stderr redirection (see below). Fixed.

I moved the divide inside the awk expression to avoid the need for the extra layer of bash math.
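A sketch of the simplified form, demonstrated against sample lsmem -b output (the byte count below is a made-up 64 GiB example, not taken from a real lab machine):

```shell
#!/bin/sh
# Do the bytes-to-GB divide inside awk instead of wrapping the pipeline
# in an extra $(( ... )) layer. lsmem -b reports a byte count.
to_gb='/Total online memory/ {printf "%d", $4/(1024*1024*1024)}'

# On a real host: RAM_GB=$(lsmem -b | awk "$to_gb")
# Demo with sample output for a 64 GiB machine:
RAM_GB=$(printf 'Total online memory:      68719476736\n' | awk "$to_gb")
echo "$RAM_GB"    # 64
```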


# run pbench to generate the config files
# hide default pbench logging which goes to stderr so we only see any errors
../../pbench/pbench genconfig -p params.json -t template generated 2&>1 grep -v '\{\"level'
Contributor Author

OK, this is bogus and that's my bad for not testing it properly

Contributor Author

It should be 2>&1 | not 2&>1. Fixed.

Fix pbench output redirection
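A small demonstration of the difference, using a stand-in sh -c command in place of pbench (my guess at the failure mode is in the comments; it would also explain the stray docker/config/1 file reported above):

```shell
#!/bin/bash
# bash parses `cmd 2&>1 grep -v pat` as: run cmd with the extra words
# 2, grep, -v, and pat as arguments, redirecting stdout+stderr (`&>`)
# to a file named 1 -- likely why a file called 1 appeared containing a
# pbench usage message. The intended form merges stderr into the pipe
# so grep can filter pbench's JSON log lines:
out=$(sh -c 'echo kept; echo "{\"level\":\"info\"}" >&2' 2>&1 | grep -v '^{"level')
echo "$out"    # kept
```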

# get host values
NPROC=`nproc`
RAM_GB=`lsmem | awk '/Total online/ { print $4 }'`
Contributor

RAM_GB=$(cat /sys/devices/system/node/node0/meminfo | awk '/MemTotal/ {printf $4/1024/1024}')

Contributor

This seems to be more accurate :)

Contributor Author

@patdevinwilson my comment was out of date anyway. @misiugodfrey came up with a better one. The issue with yours is that it only considers one node of a NUMA system such as a dual GH200. It's still pretty fragile, though. I can't believe that getting a definitive "how many total GB of CPU RAM does this thing have" is so difficult.
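One hedged option that avoids the per-node pitfall: /proc/meminfo's MemTotal is system-wide, so it already sums all NUMA nodes. Caveat: it is in kB and excludes memory reserved by the kernel, so it reads slightly under the nominal size.

```shell
#!/bin/sh
# MemTotal in /proc/meminfo is the system-wide figure in kB, covering
# every NUMA node, unlike /sys/devices/system/node/node0/meminfo which
# covers only node0 (wrong on e.g. a dual GH200).
RAM_GB=$(awk '/^MemTotal:/ {printf "%d", $2/1024/1024}' /proc/meminfo)
echo "$RAM_GB"
```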

@simoneves
Contributor Author

@paul-aiyedun this is still blocked by your earlier request-for-changes. Do you still have any concerns?

@paul-aiyedun
Contributor

@paul-aiyedun this is still blocked by your earlier request-for-changes. Do you still have any concerns?

The config_native_cpu.properties and config_native_gpu.properties duplication is still a concern. Also, the Java Presto configurations appear to be incomplete.

Comment on lines +8 to +14
# Memory auto-configuration is not wired for Java engine in this template.
# Uncomment and set values if using Java workers with memory governance.
#query.max-total-memory-per-node={{ .JavaQueryMaxTotalMemPerNodeGb }}GB
#query.max-total-memory={{ mul .JavaQueryMaxTotalMemPerNodeGb .NumberOfWorkers }}GB
#query.max-memory-per-node={{ .JavaQueryMaxMemPerNodeGb }}GB
#query.max-memory={{ mul .JavaQueryMaxMemPerNodeGb .NumberOfWorkers }}GB
#memory.heap-headroom-per-node={{ .HeadroomGb }}GB
Contributor

Delete commented out configurations.

Same comment applies to other configuration files.

Contributor Author

Removed two commented-out optimizer configs from shared native Coordinator config. Also removed all the commented-out Java memory configs as part of reinstating the original Java configs (see below).

Comment on lines +19 to +25
# Memory auto-configuration is not wired for Java engine in this template.
# Uncomment and set values if using Java workers/coordinator end-to-end.
#query.max-total-memory-per-node={{ .JavaQueryMaxTotalMemPerNodeGb }}GB
#query.max-total-memory={{ mul .JavaQueryMaxTotalMemPerNodeGb .NumberOfWorkers }}GB
#query.max-memory-per-node={{ .JavaQueryMaxMemPerNodeGb }}GB
#query.max-memory={{ mul .JavaQueryMaxMemPerNodeGb .NumberOfWorkers }}GB
#memory.heap-headroom-per-node={{ .HeadroomGb }}GB
Contributor

Are some of these configurations not required for Presto Java?

Contributor Author

The automatic process generates an invalid config for Java, and the server will not start. I asked the group about this some time ago and nobody had an opinion. I guess I can set them to some nominal fixed but valid values.

Contributor Author

Reinstated the original fixed Java configs (Coordinator and Worker) from main.

@simoneves
Contributor Author

@paul-aiyedun I recombined the CPU and GPU Coordinator config (although I left them individually mapped to the same file in the Dockerfiles)

@paul-aiyedun
Contributor

@paul-aiyedun I recombined the CPU and GPU Coordinator config

I believe this also applies to the worker config files (see #48 (comment)).

(although I left them individually mapped to the same file in the Dockerfiles)

What is the reason for this?

@simoneves
Contributor Author

@paul-aiyedun I recombined the CPU and GPU Coordinator config

I believe this also applies to the worker config files (see #48 (comment)).

I don't want to do it that way. That would require two passes of pbench with separate config files. Sorry.

(although I left them individually mapped to the same file in the Dockerfiles)

What is the reason for this?

Fine.

@simoneves
Contributor Author

(although I left them individually mapped to the same file in the Dockerfiles)

What is the reason for this?

Fine.

Actually there is no change here. The previous CPU and GPU Dockerfiles already mapped the file in the leaf native services.

@paul-aiyedun
Contributor

@paul-aiyedun I recombined the CPU and GPU Coordinator config

I believe this also applies to the worker config files (see #48 (comment)).

I don't want to do it that way. That would require two passes of pbench with separate config files. Sorry.

Unless I am missing something, this would require setting vcpu_per_worker to 2 if VARIANT_TYPE is gpu and to NPROC otherwise ("vcpu_per_worker": ${NPROC},). I don't think separate config files or multiple passes are needed.
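 
A sketch of that suggestion, with the variant check factored into a small helper (the helper name is mine, not from the PR):

```shell
#!/bin/sh
# Pick vcpu_per_worker from the variant before writing the pbench input,
# so a single template serves both modes. Helper name is illustrative.
vcpus_for_variant() {
  # $1 = VARIANT_TYPE (gpu/cpu/java), $2 = host CPU count (nproc)
  if [ "$1" = "gpu" ]; then
    echo 2
  else
    echo "$2"
  fi
}

echo "$(vcpus_for_variant gpu 64)"   # 2
echo "$(vcpus_for_variant cpu 64)"   # 64
```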

@simoneves
Contributor Author

Unless I am missing something, this would require setting vcpu_per_worker to 2 if VARIANT_TYPE is gpu and to NPROC otherwise. I don't think separate config files or multiple passes are needed.

You are not wrong. I get it now. Will do.

@simoneves
Contributor Author

@paul-aiyedun worker config unsplit done

@simoneves simoneves merged commit 76c8162 into main Oct 22, 2025
@simoneves simoneves deleted the seves/HERC-82_use_pbench_for_presto_memory_configs branch October 22, 2025 18:51
simoneves added a commit that referenced this pull request Oct 29, 2025
This reinstates the two query optimizer flags which were removed from the config during the dev of #48.

These have been reconfirmed to be needed for optimum performance on GPU only.
6 participants