
HumanEval doesn't actually run tests #1102

Open
cfytrok opened this issue Oct 19, 2024 · 2 comments

Comments

@cfytrok

cfytrok commented Oct 19, 2024

HumanEval.evaluate only checks whether the generated sample is valid Python.
It doesn't actually run the tests the way the original project does.

exec(function)
exec(golden.expected_output)
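
For comparison, a minimal sketch of what actually running the tests could look like (the shared namespace dict and the entry_point variable holding the tested function's name are assumptions for illustration, not the library's code):

namespace = {}
exec(function, namespace)                   # defines the candidate, e.g. has_close_elements
exec(golden.expected_output, namespace)     # defines check(candidate) but never calls it
namespace["check"](namespace[entry_point])  # only this call actually runs the assertions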

@Artifizer

Bumping this: the problem is still here.

Here is an example of a generated 'function':

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    
    for i in range(len(numbers) - 1):
        current_distance = numbers[i] - numbers[i + 1]
        if current_distance < threshold:
            return True
    
    return False

And here is an example of golden.expected_output:


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}

def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False

Exec'ing this code obviously won't run any of the assertions, since check() is only defined, never called. Moreover, the current implementation:

c = 0
for function in functions:
    try:
        exec(function)
        exec(golden.expected_output)
        c += 1
    except AssertionError as e:
        pass
self.c[task.value] = c
self.functions[task.value] = functions

  1. Is unsafe, and can harm your machine if the LLM generates malicious code
  2. Doesn't handle Python exceptions when the LLM generates malformed Python code (see the sketch below for a safer shape of the loop)
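
Something along these lines would at least run check() in a separate process with a timeout and catch arbitrary exceptions. This is only a rough sketch assuming the same functions list, the test source in golden.expected_output, and a hypothetical entry_point name; a child process with a timeout is still not a real sandbox:

import multiprocessing

def _run_check(function_src, test_src, entry_point, queue):
    # Runs in a child process: define the candidate and the tests, then call check().
    try:
        namespace = {}
        exec(function_src, namespace)
        exec(test_src, namespace)
        namespace["check"](namespace[entry_point])  # actually executes the assertions
        queue.put(True)
    except BaseException:  # SyntaxError, AssertionError, anything the candidate raises
        queue.put(False)

def count_passing(functions, test_src, entry_point, timeout=5.0):
    passed = 0
    for function_src in functions:
        queue = multiprocessing.Queue()
        proc = multiprocessing.Process(
            target=_run_check, args=(function_src, test_src, entry_point, queue)
        )
        proc.start()
        proc.join(timeout)
        if proc.is_alive():  # candidate hung; kill it and count as a failure
            proc.terminate()
            proc.join()
            continue
        if not queue.empty() and queue.get():
            passed += 1
    return passed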

@ezlandau

Stumbled upon the same issue; HumanEval isn't, in fact, actually supported.
