
Accelerate sample request in benchmark script #1264

Merged
merged 2 commits into InternLM:main from accelerate_sample_request on Mar 11, 2024

Conversation

@ispobock (Contributor) commented Mar 8, 2024

Motivation

The request sampling process in the benchmark script is time consuming.

Modification

Referring to the vLLM script, we can pre-sample the dataset before the filter step to avoid going through the whole dataset.
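
A minimal sketch of the idea (not the actual patch; the is_valid_length predicate, the function name, and the over-sampling factor are illustrative assumptions):

import json
import random

def is_valid_length(entry):
    # Placeholder for the real filter (e.g. tokenize the prompt and
    # completion, then check their lengths against min/max bounds).
    return True

def sample_requests(dataset_path, num_requests, presample_factor=1.2):
    with open(dataset_path) as f:
        dataset = json.load(f)
    # Pre-sample slightly more entries than needed, then run the expensive
    # filter only on this small candidate set instead of the whole dataset.
    presample_size = min(len(dataset), int(num_requests * presample_factor))
    candidates = random.sample(dataset, presample_size)
    filtered = [entry for entry in candidates if is_valid_length(entry)]
    return filtered[:num_requests]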

@ispobock (Contributor, Author) commented Mar 8, 2024

@lvhan028 @grimoire Could you help review?

@ispobock force-pushed the accelerate_sample_request branch from e7e0978 to 6e87932 on March 8, 2024 09:52
@grimoire (Collaborator) commented Mar 8, 2024

Can we make sure that we still have enough data after the length filter?

@ispobock (Contributor, Author) commented Mar 8, 2024

@grimoire
For the ShareGPT V3 dataset:

  • total size: 92886
  • size after filter: 81761
  • valid percentage: 81761/92886 ≈ 88.02%

Ideally, pre-sampling 1.2*num_requests entries is enough (the expected valid count is 1.2*0.8802 ≈ 1.0562 times num_requests) because the sampling is random. But for small num_requests, for example num_requests=1, we need to pre-sample more to make sure we get enough data.
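
To quantify the small-num_requests case (a quick check with scipy's hypergeometric CDF, using the ShareGPT V3 numbers above): with num_requests=1 and a pre-sample of only 2 entries, the chance that both are invalid is about 1.4%, far above a 0.1% error target.

from scipy.stats import hypergeom

# P(zero valid entries among 2 draws) from a population of 92886
# entries of which 81761 are valid (ShareGPT V3 numbers above).
p_fail = hypergeom.cdf(0, 92886, 81761, 2)
print(p_fail)  # ≈ 0.0143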

@zhyncs (Collaborator) commented Mar 8, 2024

LGTM

@ispobock (Contributor, Author) commented Mar 8, 2024

For this problem, maybe we can model it as a hypergeometric distribution and use the following code to find the minimal acceptable pre-sample size that guarantees a very low error rate.

from scipy.stats import hypergeom

def find_minimal_presample_size(dataset_size, valid_size, num_requests, error_rate):
    # Binary-search the smallest pre-sample size whose probability of
    # yielding fewer than num_requests valid entries is at most error_rate.
    left, right = num_requests, dataset_size

    while left < right:
        mid = left + (right - left) // 2
        # P(fewer than num_requests valid entries among `mid` draws) for a
        # hypergeometric distribution with `valid_size` valid entries in a
        # population of `dataset_size`.
        cdf = hypergeom.cdf(num_requests - 1, dataset_size, valid_size, mid)
        if cdf <= error_rate:
            right = mid
        else:
            left = mid + 1

    return left

print(find_minimal_presample_size(92886, 81761, 1, 0.001))     # 4
print(find_minimal_presample_size(92886, 81761, 1000, 0.001))  # 1176

For example, if we pre-sample 4 entries, the error rate is math.comb(11125, 4) / math.comb(92886, 4) ≈ 0.0002 < 0.001. That means there is only a 0.0002 probability that all 4 samples are invalid, i.e. a 0.9998 probability of getting at least 1 valid sample.
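
The figure can be reproduced directly (11125 = 92886 - 81761 is the number of filtered-out entries):

import math

# Probability that all 4 pre-sampled entries fall among the
# 11125 (= 92886 - 81761) invalid entries.
p_fail = math.comb(11125, 4) / math.comb(92886, 4)
print(p_fail)  # ≈ 0.000206 < 0.001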

@ispobock (Contributor, Author) commented Mar 8, 2024

A simpler solution: we can also set a threshold, e.g. presample_size = max(num_requests * 1.2, 1000); then both the time and the error rate are acceptable.
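
Plugging this threshold into the hypergeometric check above (a quick sketch with the same ShareGPT V3 numbers) shows the failure probability stays far below 0.001 at both extremes:

from scipy.stats import hypergeom

# Failure probability of presample_size = max(num_requests * 1.2, 1000)
# with 92886 total entries and 81761 valid entries.
for num_requests in (1, 1000):
    presample_size = max(int(num_requests * 1.2), 1000)
    p_fail = hypergeom.cdf(num_requests - 1, 92886, 81761, presample_size)
    print(num_requests, presample_size, p_fail)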

@lvhan028 merged commit 5097476 into InternLM:main on Mar 11, 2024
3 checks passed