
Accelerate sample request in benchmark script #1264

Merged
merged 2 commits into InternLM:main from accelerate_sample_request on Mar 11, 2024

Conversation

@ispobock (Contributor) commented Mar 8, 2024

Motivation

The request sampling process in the benchmark script is time consuming.

Modification

Referring to the vLLM script, we can pre-sample the dataset before the filter step to avoid going through the whole dataset.
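
A minimal sketch of the idea (not the actual patch; the is_valid_length predicate, the function name, and the over-sampling factor are illustrative assumptions):

import json
import random

def is_valid_length(entry):
    # Placeholder for the real filter (e.g. tokenize the prompt and
    # completion, then check their lengths against min/max bounds).
    return True

def sample_requests(dataset_path, num_requests, presample_factor=1.2):
    with open(dataset_path) as f:
        dataset = json.load(f)
    # Pre-sample slightly more entries than needed, then run the expensive
    # filter only on this small candidate set instead of the whole dataset.
    presample_size = min(len(dataset), int(num_requests * presample_factor))
    candidates = random.sample(dataset, presample_size)
    filtered = [entry for entry in candidates if is_valid_length(entry)]
    return filtered[:num_requests]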

@ispobock (Contributor, Author) commented Mar 8, 2024

@lvhan028 @grimoire Could you help review?

@ispobock force-pushed the accelerate_sample_request branch from e7e0978 to 6e87932 on March 8, 2024 09:52
@grimoire (Collaborator) commented Mar 8, 2024

Can we make sure that we still have enough data after the length filter?

@ispobock (Contributor, Author) commented Mar 8, 2024

@grimoire
For the ShareGPT V3 dataset:

  • total size: 92886
  • size after filter: 81761
  • valid percentage: 81761/92886 ≈ 88.02%

Ideally, pre-sampling 1.2*num_requests entries is enough (the expected valid count is 1.2*0.8802 ≈ 1.0562 times num_requests) because the sampling is random. But for small num_requests, for example num_requests=1, we need to pre-sample more to make sure we get enough data.
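
To quantify the small-num_requests case (a quick check with scipy's hypergeometric CDF, using the ShareGPT V3 numbers above): with num_requests=1 and a pre-sample of only 2 entries, the chance that both are invalid is about 1.4%, far above a 0.1% error target.

from scipy.stats import hypergeom

# P(zero valid entries among 2 draws) from a population of 92886
# entries of which 81761 are valid (ShareGPT V3 numbers above).
p_fail = hypergeom.cdf(0, 92886, 81761, 2)
print(p_fail)  # ≈ 0.0143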

@zhyncs (Collaborator) commented Mar 8, 2024

LGTM

@ispobock (Contributor, Author) commented Mar 8, 2024

For this problem, maybe we can model it as a hypergeometric distribution and use the following code to find the minimal acceptable pre-sample size that guarantees a very low error rate.

from scipy.stats import hypergeom

def find_minimal_presample_size(dataset_size, valid_size, num_requests, error_rate):
    # Binary-search the smallest pre-sample size whose probability of
    # yielding fewer than num_requests valid entries is at most error_rate.
    left, right = num_requests, dataset_size

    while left < right:
        mid = left + (right - left) // 2
        # P(fewer than num_requests valid entries among `mid` draws) for a
        # hypergeometric distribution with `valid_size` valid entries in a
        # population of `dataset_size`.
        cdf = hypergeom.cdf(num_requests - 1, dataset_size, valid_size, mid)
        if cdf <= error_rate:
            right = mid
        else:
            left = mid + 1

    return left

print(find_minimal_presample_size(92886, 81761, 1, 0.001))     # 4
print(find_minimal_presample_size(92886, 81761, 1000, 0.001))  # 1176

For example, if we pre-sample 4 entries, the error rate is math.comb(11125, 4) / math.comb(92886, 4) ≈ 0.0002 < 0.001. That means there is only a 0.0002 probability that all 4 samples are invalid, i.e. a 0.9998 probability of getting at least 1 valid sample.
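
The figure can be reproduced directly (11125 = 92886 - 81761 is the number of filtered-out entries):

import math

# Probability that all 4 pre-sampled entries fall among the
# 11125 (= 92886 - 81761) invalid entries.
p_fail = math.comb(11125, 4) / math.comb(92886, 4)
print(p_fail)  # ≈ 0.000206 < 0.001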

@ispobock (Contributor, Author) commented Mar 8, 2024

A simpler solution: we can also set a threshold, e.g. presample_size = max(num_requests * 1.2, 1000); then both the time and the error rate are acceptable.
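
Plugging this threshold into the hypergeometric check above (a quick sketch with the same ShareGPT V3 numbers) shows the failure probability stays far below 0.001 at both extremes:

from scipy.stats import hypergeom

# Failure probability of presample_size = max(num_requests * 1.2, 1000)
# with 92886 total entries and 81761 valid entries.
for num_requests in (1, 1000):
    presample_size = max(int(num_requests * 1.2), 1000)
    p_fail = hypergeom.cdf(num_requests - 1, 92886, 81761, presample_size)
    print(num_requests, presample_size, p_fail)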

@lvhan028 merged commit 5097476 into InternLM:main on Mar 11, 2024
3 checks passed