
Excluding certain parameter values from future trials by abandoning them does not work #471

Closed
mkhan037 opened this issue Jan 12, 2021 · 9 comments
Labels: bug (Something isn't working), fixready (Fix has landed on master)

mkhan037 commented Jan 12, 2021

Hi, I am trying to minimize the execution cost of some distributed applications through optimal resource allocation. Currently, I am exploring the optimization of a small search space, which will be expanded after we figure out the necessary implementation details. The search space right now only consists of a range parameter that indicates the number of worker nodes needed for execution.

When the total memory available to one of these applications falls below an application-specific threshold, the application either fails to execute or its execution cost is sub-optimal because it takes a long time. We want to avoid executing these configurations, as doing so wastes both time and money.

We can estimate the amount of memory these applications need to at least execute successfully. For our toy search space, we can express this as a linear constraint on the number of nodes (workers). However, once we expand the search space to contain different node types, that constraint would no longer suffice, as different node types provide different amounts of memory; the total cluster memory is calculated as follows.
total_cluster_memory = get_memory_amount_for_node_type(node_type) * worker_count
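As an illustration, the calculation and the resulting feasibility check might look like the sketch below. The node type and its memory size are hypothetical placeholders, not part of the original setup:

```python
# Hypothetical per-node-type memory table (GB); real values would come from
# the cloud provider's instance specifications.
NODE_MEMORY_GB = {"inv": 48}

def get_memory_amount_for_node_type(node_type):
    return NODE_MEMORY_GB[node_type]

def total_cluster_memory(node_type, worker_count):
    # Total memory available across all workers of a homogeneous cluster.
    return get_memory_amount_for_node_type(node_type) * worker_count

def has_enough_memory(node_type, worker_count, required_gb):
    # A configuration is feasible if the cluster's total memory meets the need.
    return total_cluster_memory(node_type, worker_count) >= required_gb
```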

To avoid actually executing the clearly non-optimal and costly cluster configurations, we considered returning artificially high values to deter Ax from suggesting them in future iterations. However, this type of solution was discouraged in this comment.

Another possible solution we pursued was abandoning the suggested cluster configurations that do not have enough memory. According to this comment in #372, the Ax client would not retry abandoned points. However, we see that even after abandoning a point, the Ax client suggests it again.

ax_client = AxClient()
ax_client.create_experiment(name='test_horizontal_scaling',
                            parameters=[
                                {
                                    "name": "worker_count",
                                    "type": "range",
                                    "bounds": [1, 16],
                                    "value_type": "int"
                                },
                                {
                                    "name": "node_type",
                                    "type": "fixed",
                                    "value": "inv",
                                },
                                {
                                    "name": "executors_per_node",
                                    "type": "fixed",
                                    "value": 1,
                                },
                            ],
                            objective_name='cost',
                            minimize=True)
for i in range(16):
    parameters, trial_index = ax_client.get_next_trial()
    # Run the trial unless the configuration has too little memory.
    if not skip_sample_based_on_memory_amount(params=parameters, memory_limit=4 * 48):
        data = local_cluster_eval_function_for_service_api(parameters)
        ax_client.complete_trial(trial_index=trial_index, raw_data=data)
    else:
        # Mark infeasible configurations as abandoned instead of running them.
        ax_client.abandon_trial(trial_index=trial_index)

We ran the code for an application that needs at least 4*48 GB of memory to execute; since each node has 48 GB of memory, we need at least 4 worker nodes. We detect whether a suggested configuration should be skipped using the function skip_sample_based_on_memory_amount. We re-ran this several times, and in many cases Ax kept suggesting configurations that we had marked abandoned. In the example below, Ax repeatedly suggested the configuration with a worker_count of 2, which had been marked abandoned.

[INFO 01-05 00:04:53] ax.service.ax_client: Starting optimization with verbose logging. To disable logging, set the `verbose_logging` argument to `False`. Note that float values in the logs are rounded to 2 decimal points.
[INFO 01-05 00:04:53] ax.modelbridge.dispatch_utils: Using Bayesian Optimization generation strategy: GenerationStrategy(name='Sobol+GPEI', steps=[Sobol for 5 trials, GPEI for subsequent trials]). Iterations after 5 will take longer to generate due to  model-fitting.
[INFO 01-05 00:04:53] ax.service.ax_client: Generated new trial 0 with parameters {'worker_count': 10, 'node_type': 'inv', 'executors_per_node': 1}.
{'worker_count': 10, 'node_type': 'inv', 'executors_per_node': 1}, cores 16, memory 48G
Execution Time : 96.55689090490341, Cost: 1062.1257999539375
[INFO 01-05 00:04:53] ax.service.ax_client: Completed trial 0 with data: {'cost': (1062.13, 0.0)}.
[INFO 01-05 00:04:54] ax.service.ax_client: Generated new trial 1 with parameters {'worker_count': 3, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:04:54] ax.service.ax_client: Generated new trial 2 with parameters {'worker_count': 4, 'node_type': 'inv', 'executors_per_node': 1}.
{'worker_count': 4, 'node_type': 'inv', 'executors_per_node': 1}, cores 16, memory 48G
Execution Time : 190.18439613250084, Cost: 950.9219806625042
[INFO 01-05 00:04:54] ax.service.ax_client: Completed trial 2 with data: {'cost': (950.92, 0.0)}.
{'worker_count': 6, 'node_type': 'inv', 'executors_per_node': 1}, cores 16, memory 48G
Execution Time : 173.06547846039757, Cost: 1211.458349222783
[INFO 01-05 00:04:54] ax.service.ax_client: Generated new trial 3 with parameters {'worker_count': 6, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:04:54] ax.service.ax_client: Completed trial 3 with data: {'cost': (1211.46, 0.0)}.
[INFO 01-05 00:04:54] ax.service.ax_client: Generated new trial 4 with parameters {'worker_count': 1, 'node_type': 'inv', 'executors_per_node': 1}.
{'worker_count': 12, 'node_type': 'inv', 'executors_per_node': 1}, cores 16, memory 48G
Execution Time : 95.29943734430708, Cost: 1238.892685475992
[INFO 01-05 00:04:54] ax.service.ax_client: Generated new trial 5 with parameters {'worker_count': 12, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:04:54] ax.service.ax_client: Completed trial 5 with data: {'cost': (1238.89, 0.0)}.
{'worker_count': 5, 'node_type': 'inv', 'executors_per_node': 1}, cores 16, memory 48G
Execution Time : 171.18135719924854, Cost: 1027.0881431954913
[INFO 01-05 00:04:54] ax.service.ax_client: Generated new trial 6 with parameters {'worker_count': 5, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:04:54] ax.service.ax_client: Completed trial 6 with data: {'cost': (1027.09, 0.0)}.
[INFO 01-05 00:04:55] ax.service.ax_client: Generated new trial 7 with parameters {'worker_count': 2, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:04:56] ax.service.ax_client: Generated new trial 8 with parameters {'worker_count': 2, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:04:56] ax.service.ax_client: Generated new trial 9 with parameters {'worker_count': 2, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:04:57] ax.service.ax_client: Generated new trial 10 with parameters {'worker_count': 2, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:04:58] ax.service.ax_client: Generated new trial 11 with parameters {'worker_count': 2, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:04:58] ax.service.ax_client: Generated new trial 12 with parameters {'worker_count': 2, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:05:00] ax.service.ax_client: Generated new trial 13 with parameters {'worker_count': 2, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:05:01] ax.service.ax_client: Generated new trial 14 with parameters {'worker_count': 2, 'node_type': 'inv', 'executors_per_node': 1}.
[INFO 01-05 00:05:01] ax.service.ax_client: Generated new trial 15 with parameters {'worker_count': 2, 'node_type': 'inv', 'executors_per_node': 1}.
  arm_name metric_name         mean  sem  trial_index
0      0_0        cost  1062.125800  0.0            0
1      2_0        cost   950.921981  0.0            2
2      3_0        cost  1211.458349  0.0            3
3      5_0        cost  1238.892685  0.0            5
4      6_0        cost  1027.088143  0.0            6

How can we tackle this issue? An alternative would be to use the total number of cores and the total amount of memory as search-space parameters. However, that would make many suggested configurations invalid, since the granularity of cores and memory depends on the worker node type.
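If one did go that route, a possible workaround for the granularity problem is to snap each suggested (cores, memory) pair to the nearest valid homogeneous cluster. The sketch below is only an illustration under made-up node specs; the catalogue, the rounding rule, and the error metric are all assumptions:

```python
# Hypothetical catalogue of valid (cores, memory GB) specs per node type.
NODE_SPECS = {"inv": (16, 48), "mem": (32, 128)}

def snap_to_valid_config(target_cores, target_memory_gb):
    """Map a continuous (cores, memory) suggestion to the nearest valid
    homogeneous cluster: choose the node type and count whose totals are
    closest (in relative terms) to the targets."""
    best = None
    for node_type, (cores, mem) in NODE_SPECS.items():
        count = max(1, round(target_memory_gb / mem))
        err = (abs(count * cores - target_cores) / target_cores
               + abs(count * mem - target_memory_gb) / target_memory_gb)
        if best is None or err < best[0]:
            best = (err, node_type, count)
    return best[1], best[2]
```

Snapping keeps every evaluated configuration valid, though two distinct suggestions may snap to the same cluster, which the optimizer would need to tolerate.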

@lena-kashtelyan (Contributor)

Hi, @mkhan037, thank you for reporting this! This looks like it might be a bug –– we'll investigate and get back to you.

@lena-kashtelyan lena-kashtelyan self-assigned this Jan 15, 2021
@lena-kashtelyan lena-kashtelyan added the bug Something isn't working label Jan 15, 2021
mkhan037 commented Feb 3, 2021

Hi @lena-kashtelyan, just following up to see if there is any update on this. I was also considering the following strategy to tackle the issue in case abandoning trials does not work. Please let me know whether it seems feasible.

  1. For the initial points, generate points with Sobol one by one and skip any point that is infeasible.
  2. For the BO steps, customize the acquisition function (e.g., NEI) to set the expected improvement to 0 at points I want to avoid.

My concern is whether setting the expected improvement to 0 for specific points can lead to model fitting issues. Please let me know your opinion. Thank you for your time.
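Step 1 above could be sketched, independently of Ax, as a simple rejection filter over candidate points. The feasibility predicate here is a hypothetical stand-in for the memory check:

```python
def filter_feasible(candidates, is_feasible, n_needed):
    """Collect up to n_needed feasible points, skipping infeasible ones."""
    feasible = []
    for point in candidates:
        if is_feasible(point):
            feasible.append(point)
            if len(feasible) == n_needed:
                break
    return feasible

# Example: keep worker counts whose 48 GB/node cluster has >= 192 GB total.
points = filter_feasible(range(1, 17), lambda w: 48 * w >= 192, n_needed=5)
```

With a quasi-random sequence like Sobol as the candidate source, rejection filtering preserves the sequence's spread over the feasible region, at the cost of discarding some draws.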

@lena-kashtelyan (Contributor)

Hi, @mkhan037, I think customizing the expected improvement function might not be a great idea from the model-fit standpoint, cc @Balandat for his opinion. We'll prioritize debugging this!

Balandat commented Feb 6, 2021

Just to make sure I understand: do you know the memory requirements (so that runs don't OOM), or do you need to estimate them as well?

If these limits are unknown, then you'd have to estimate this binary response (feasible/not feasible). This probably wouldn't work well with our default models. There are specialized strategies for this (see e.g. https://arxiv.org/pdf/1907.10383.pdf), but we currently do not have those methods implemented.

If you do know the mapping from nodes to memory then things get easier since you don't need to estimate this, but we'd have to encode those as parameter constraints. It seems that even if you have different node types you could still use a linear constraint. If you parameterized your search space with a parameter for the number of nodes for each node type (say x_1, ..., x_k), then the parameter constraint would be x_1 * mem(x_1) + ... + x_k * mem(x_k) <= total_cluster_mem, where mem(x_j) = get_memory_amount_for_node_type(node_type(x_j)). Or am I missing something about the setup?
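Spelled out numerically, the constraint is just a weighted sum over per-type node counts x_1, ..., x_k. The sketch below uses the minimum-memory direction (>=) that matches the OOM-avoidance use case; the node types and memory sizes are made up for illustration:

```python
# Hypothetical memory per node type, in GB.
MEM_GB = {"small": 16, "medium": 48, "large": 96}

def satisfies_memory_constraint(node_counts, required_gb):
    """node_counts maps node type -> number of nodes of that type (x_1..x_k).
    Checks: sum_j x_j * mem(type_j) >= required_gb."""
    total = sum(count * MEM_GB[node_type]
                for node_type, count in node_counts.items())
    return total >= required_gb
```

Because the left-hand side is linear in the x_j, a check of this shape can be expressed as a linear parameter constraint once each node type has its own count parameter.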

mkhan037 commented Feb 8, 2021

Hi @Balandat, thanks for the response. The total cluster memory needed to avoid OOM can be calculated (it is a formula involving some execution statistics).

Thanks for the suggestion of using x_1 * mem(x_1) + ... + x_k * mem(x_k) >= total_cluster_mem_required as a parameter constraint. However, the number of node types can become large: AWS, for example, has general-purpose, compute-optimized, and memory-optimized instance families, and each family contains multiple instance sizes (large, xlarge, 2xlarge, 4xlarge, etc.). I want to avoid introducing too many parameters, to keep the search space small. Also, placing multiple instance families in the search space side by side can lead to combinations of types, and hence heterogeneous clusters, which I also want to avoid.

I hope that clarifies a bit more about the problem scenario. Thank you for your time.

@lena-kashtelyan (Contributor)

Update on this: abandoned trials not being excluded was in fact a bug, and the fix should be on master today. Thank you for pointing it out, @mkhan037!

@lena-kashtelyan lena-kashtelyan added the fixready Fix has landed on master. label Feb 16, 2021
@lena-kashtelyan (Contributor)

The fix for this is now on master as of f6457e6; we will be releasing a new stable version imminently as well.

@mkhan037 (Author)

Hi @lena-kashtelyan, thanks for taking the time to fix the bug! I will check out the master branch and test it later. In the meantime, feel free to close this issue.

@lena-kashtelyan (Contributor)

This should now be fixed in the latest stable release, 0.1.20.
