Generate dynamic Presto configs with pbench #48
Conversation
| "native_query_mem_percent_of_sys_mem": 0.95, | ||
| "join_max_bcast_size_percent_of_container_mem": 0.01, | ||
| "memory_push_back_start_below_limit_gb": 5 | ||
| } |
There was a problem hiding this comment.
This file comes with pbench. I'm sure the heuristics could be tweaked if needed.
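As a rough illustration of how a percent-of-memory heuristic like `native_query_mem_percent_of_sys_mem` turns into a concrete value (the variable names below are mine, not pbench internals):

```shell
# Hypothetical example: apply the 0.95 heuristic to a 128 GB host.
# SYS_MEM_GB and NATIVE_QUERY_MEM_GB are illustrative names, not pbench's.
SYS_MEM_GB=128
NATIVE_QUERY_MEM_GB=$(awk -v m="$SYS_MEM_GB" -v p=0.95 'BEGIN { printf "%d", m * p }')
echo "$NATIVE_QUERY_MEM_GB"   # → 121
```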
    hive.metastore.catalog.dir=file:/var/lib/presto/data/hive/metastore
    hive.allow-drop-table=true
    hive.file-splittable=false

IIUC, this is basically always required. Discuss.
    -server
    -Xmx24G
    -Xmx{{ .HeapSizeGb }}G
    -Xms{{ .HeapSizeGb }}G
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:+ExitOnOutOfMemoryError
    -Djdk.nio.maxCachedBufferSize=2000000
    query.max-total-memory={{ mul .JavaQueryMaxTotalMemPerNodeGb .NumberOfWorkers }}GB
    query.max-memory-per-node={{ .JavaQueryMaxMemPerNodeGb }}GB
    query.max-memory={{ mul .JavaQueryMaxMemPerNodeGb .NumberOfWorkers }}GB
    memory.heap-headroom-per-node={{ .HeadroomGb }}GB

THIS CONFIG WILL PROBABLY NOT WORK!

If it does not work, should it be removed?

Can't remove it. We need a config for Java mode; I just need to fix it.

I removed the explicit memory configs for Java mode. It still works for the simple queries I tried. We can revisit this if we need to do comparisons with Java mode again.
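For reference, the `mul` calls above just scale a per-node limit by the worker count. A minimal shell sketch of the same arithmetic, with hypothetical values:

```shell
# Hypothetical values: query.max-memory = per-node limit × number of workers,
# mirroring {{ mul .JavaQueryMaxMemPerNodeGb .NumberOfWorkers }}.
JAVA_QUERY_MAX_MEM_PER_NODE_GB=10
NUMBER_OF_WORKERS=4

QUERY_MAX_MEMORY_GB=$(( JAVA_QUERY_MAX_MEM_PER_NODE_GB * NUMBER_OF_WORKERS ))
echo "query.max-memory=${QUERY_MAX_MEMORY_GB}GB"   # → query.max-memory=40GB
```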
    fi
    fi

    $EXE "$@"

Standard pbench wrapper script to detect OS and ARCH, and not much else.

I did not add a license header, as it's copied directly from the pbench repo.

Not sure why this annotation isn't obsolete by now; this file is no longer new in this PR, but was added by a merge-out from main after it landed in a different PR.
    presto-base-volumes:
      volumes:
        - ./config/etc_common:/opt/presto-server/etc
        - ./config/generated/etc_common:/opt/presto-server/etc

The file structure is expected to be the same as before, just moved down into generated.
    native-execution-enabled=true
    optimizer.optimize-hash-generation=false
    regex-library=RE2J
    use-alternative-function-signatures=true

This file was changed a lot, based on @tmostak's adaptation of the IBM version.
    runtime-metrics-collection-enabled=true
    system-mem-pushback-enabled=true
    system-mem-limit-gb={{ sub .ContainerMemoryGb .GeneratorParameters.MemoryPushBackStartBelowLimitGb }}
    system-mem-shrink-gb=20

This file also changed, based on @tmostak's adaptation of the IBM version.
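The `sub` expression above derives the system memory limit from the container memory and the pushback threshold in params.json. A minimal sketch of the same arithmetic, with hypothetical values:

```shell
# Hypothetical values: system-mem-limit-gb = container memory minus the
# pushback threshold, mirroring
# {{ sub .ContainerMemoryGb .GeneratorParameters.MemoryPushBackStartBelowLimitGb }}.
CONTAINER_MEM_GB=100
PUSHBACK_START_BELOW_LIMIT_GB=5   # memory_push_back_start_below_limit_gb in params.json

SYSTEM_MEM_LIMIT_GB=$(( CONTAINER_MEM_GB - PUSHBACK_START_BELOW_LIMIT_GB ))
echo "system-mem-limit-gb=${SYSTEM_MEM_LIMIT_GB}"   # → system-mem-limit-gb=95
```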
    pushd ../docker/config > /dev/null

    # always move back even on failure
    trap "popd > /dev/null" EXIT

Not actually sure this is necessary; I think the directory change is undone automatically when the script exits.
    # get host values
    NPROC=`nproc`
    # lsmem will report in SI. Make sure we get values in GB.
    RAM_GB=$(( $(lsmem -b | grep "Total online memory" | awk '{print $4}') / (1024*1024*1024) ))

This can't be right!

OK, it IS right, but it would be simpler if the divide were just in the awk expression.

Also, something is generating a docker/config/1 file containing a pbench usage message.

Also, no files are generated!

This was due to the typo in the stderr redirection (see below). Fixed.

I moved the divide inside the awk expression to avoid the extra layer of bash math.
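The "divide inside awk" version might look like this; the byte count is a hypothetical stand-in for real `lsmem -b` output, so the snippet is self-contained:

```shell
# Stand-in for `lsmem -b` output on a 128 GiB host (137438953472 bytes);
# on a real host this line would come from `lsmem -b`.
lsmem_line="Total online memory:      137438953472"

# Divide inside the awk expression, avoiding the extra layer of bash arithmetic.
RAM_GB=$(printf '%s\n' "$lsmem_line" \
  | awk '/Total online memory/ { printf "%d", $4 / (1024*1024*1024) }')
echo "$RAM_GB"   # → 128
```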
    # run pbench to generate the config files
    # hide default pbench logging, which goes to stderr, so we only see any errors
    ../../pbench/pbench genconfig -p params.json -t template generated 2&>1 grep -v '\{\"level'

OK, this is bogus, and that's my bad for not testing it properly.

It should be `2>&1 |`, not `2&>1`. Fixed.

Fix pbench output redirection
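To see why `2&>1` misbehaves, here is a self-contained sketch with a hypothetical stand-in for `pbench` (the real binary is not needed):

```shell
# Hypothetical stand-in for pbench: JSON logging to stderr, result to stdout.
fake_pbench() {
  echo '{"level":"info","msg":"generating configs"}' >&2
  echo 'wrote generated/config.properties'
}

# Broken form: `cmd 2&>1 grep ...` tokenizes as `cmd 2 grep ... &> 1`, i.e.
# "2", "grep", "-v", etc. become ARGUMENTS to the command, and both streams
# are redirected to a file literally named "1" -- which explains the stray
# docker/config/1 file containing a usage message.
# Fixed form: `2>&1 |` merges stderr into stdout, then pipes through grep.
out=$(fake_pbench 2>&1 | grep -v '"level"')
echo "$out"   # → wrote generated/config.properties
```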
    # get host values
    NPROC=`nproc`
    RAM_GB=`lsmem | awk '/Total online/ { print $4 }'`

Suggestion:

    RAM_GB=$(cat /sys/devices/system/node/node0/meminfo | awk '/MemTotal/ {printf $4/1024/1024}')

This seems to be more accurate :)

@patdevinwilson my comment was out of date anyway; @misiugodfrey came up with a better one. The issue with yours is that it only considers one node of a NUMA system, such as a dual GH200. It's still pretty fragile, though. I can't believe that getting a definitive "how many total GB of CPU RAM does this thing have" is so difficult.
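A NUMA-aware variant would sum `MemTotal` across every node rather than reading only `node0`. A self-contained sketch with sample per-node data (on a real host you would read `/sys/devices/system/node/node*/meminfo` instead):

```shell
# Sample per-node meminfo lines for a hypothetical dual-socket, 2 × 64 GiB box;
# on a real host these come from /sys/devices/system/node/node*/meminfo.
node0='Node 0 MemTotal:       67108864 kB'
node1='Node 1 MemTotal:       67108864 kB'

# Sum MemTotal across all nodes and convert kB -> GiB inside awk.
RAM_GB=$(printf '%s\n%s\n' "$node0" "$node1" \
  | awk '/MemTotal/ { kb += $4 } END { printf "%d", kb / (1024*1024) }')
echo "$RAM_GB"   # → 128
```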
@paul-aiyedun this is still blocked by your earlier request-for-changes. Do you still have any concerns?
    # Memory auto-configuration is not wired for Java engine in this template.
    # Uncomment and set values if using Java workers with memory governance.
    #query.max-total-memory-per-node={{ .JavaQueryMaxTotalMemPerNodeGb }}GB
    #query.max-total-memory={{ mul .JavaQueryMaxTotalMemPerNodeGb .NumberOfWorkers }}GB
    #query.max-memory-per-node={{ .JavaQueryMaxMemPerNodeGb }}GB
    #query.max-memory={{ mul .JavaQueryMaxMemPerNodeGb .NumberOfWorkers }}GB
    #memory.heap-headroom-per-node={{ .HeadroomGb }}GB

Delete the commented-out configurations. The same comment applies to other configuration files.

Removed two commented-out optimizer configs from the shared native Coordinator config. Also removed all the commented-out Java memory configs as part of reinstating the original Java configs (see below).
    # Memory auto-configuration is not wired for Java engine in this template.
    # Uncomment and set values if using Java workers/coordinator end-to-end.
    #query.max-total-memory-per-node={{ .JavaQueryMaxTotalMemPerNodeGb }}GB
    #query.max-total-memory={{ mul .JavaQueryMaxTotalMemPerNodeGb .NumberOfWorkers }}GB
    #query.max-memory-per-node={{ .JavaQueryMaxMemPerNodeGb }}GB
    #query.max-memory={{ mul .JavaQueryMaxMemPerNodeGb .NumberOfWorkers }}GB
    #memory.heap-headroom-per-node={{ .HeadroomGb }}GB

Are some of these configurations not required for Presto Java?

The automatic process generates an invalid config for Java, and the server will not start. I asked the group about this some time ago, and nobody had an opinion. I guess I can set them to some nominal fixed but valid values.

Reinstated the original fixed Java configs (Coordinator and Worker) from main.
@paul-aiyedun I recombined the CPU and GPU Coordinator config (although I left them individually mapped to the same file in the Dockerfiles).

I believe this also applies to the worker config files (see #48 (comment)). What is the reason for this?

I don't want to do it that way. That would require two passes of

Fine.

Actually, there is no change here. The previous CPU and GPU Dockerfiles already mapped the file in the leaf native services.

Unless I am missing something, this would require setting

You are not wrong. I get it now. Will do.

@paul-aiyedun worker config unsplit done.
This reinstates the two query optimizer flags that were removed from the config during the development of #48. They have been reconfirmed as needed for optimum performance on GPU only.
This PR adds the ability to generate Presto config values dynamically for a given machine spec (CPU count and CPU RAM size).
The config files are expanded based on more recent manual testing (@tmostak and @patdevinwilson) and converted to templates for population by `pbench genconfig`.

Pending a better solution, due to `go` not being installed on lab machines, I have committed `pbench` binaries for Linux AMD64 and ARM64, along with a script to generate the config files. These files MUST exist before Presto will run, as the Docker recipe has been adapted to map them in place of the previous hardwired versions.

Finally, there is a simple script to generate the dynamic files, which are the ones the Presto containers will use.
This process first generates the `config.json` file required by `pbench` in the `generated` subdirectory, populating the CPU and RAM entries based on the current host (possibly too simplistic... discuss!). It then runs `pbench genconfig`, which copies all files found in the `template` folder to the `generated` folder, substituting computed values where required, based on the input values in `config.json`.

The auto-generated values are NOT applied to the Java config files. These remain at the basic default settings.
Note that there is variant-specific behavior: `vcpu_per_worker` is set to 2 for GPU mode, and to the CPU count for CPU and Java modes. This logic is in the `generate` script rather than in separate templates, although separate templates may still be required in the future.
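The variant logic described above can be sketched as a simple case statement (variable names are assumptions based on this description, not the actual `generate` script):

```shell
# vcpu_per_worker: fixed at 2 for GPU mode, full CPU count for CPU/Java modes.
VARIANT="gpu"   # one of: gpu, cpu, java
NPROC=32        # on a real host: $(nproc)

case "$VARIANT" in
  gpu)      VCPU_PER_WORKER=2 ;;
  cpu|java) VCPU_PER_WORKER=$NPROC ;;
  *)        echo "unknown variant: $VARIANT" >&2; exit 1 ;;
esac

echo "$VCPU_PER_WORKER"   # → 2
```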