
HumanEval doesn't actually run tests #1102

Open
cfytrok opened this issue Oct 19, 2024 · 2 comments

Comments

@cfytrok

cfytrok commented Oct 19, 2024

HumanEval.evaluate only checks whether the generated sample is valid Python.
It doesn't actually run the tests the way the original project does.

exec(function)
exec(golden.expected_output)
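
For comparison, a minimal sketch of what actually running the tests could look like (the shared namespace dict and the entry_point variable holding the tested function's name are assumptions for illustration, not the library's code):

namespace = {}
exec(function, namespace)                   # defines the candidate, e.g. has_close_elements
exec(golden.expected_output, namespace)     # defines check(candidate) but never calls it
namespace["check"](namespace[entry_point])  # only this call actually runs the assertions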

@Artifizer

Bumping this: the problem is still here.

Here is an example of a generated 'function':

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    
    for i in range(len(numbers) - 1):
        current_distance = numbers[i] - numbers[i + 1]
        if current_distance < threshold:
            return True
    
    return False

And here is an example of golden.expected_output:


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}

def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False

Exec'ing this code obviously won't run any of the assertions, since check() is only defined, never called. Moreover, the current implementation:

c = 0
for function in functions:
    try:
        exec(function)
        exec(golden.expected_output)
        c += 1
    except AssertionError as e:
        pass
self.c[task.value] = c
self.functions[task.value] = functions

  1. Is unsafe, and can harm your machine if the LLM generates malicious code
  2. Doesn't handle Python exceptions when the LLM generates malformed Python code (see the sketch below for a safer shape of the loop)
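
Something along these lines would at least run check() in a separate process with a timeout and catch arbitrary exceptions. This is only a rough sketch assuming the same functions list, the test source in golden.expected_output, and a hypothetical entry_point name; a child process with a timeout is still not a real sandbox:

import multiprocessing

def _run_check(function_src, test_src, entry_point, queue):
    # Runs in a child process: define the candidate and the tests, then call check().
    try:
        namespace = {}
        exec(function_src, namespace)
        exec(test_src, namespace)
        namespace["check"](namespace[entry_point])  # actually executes the assertions
        queue.put(True)
    except BaseException:  # SyntaxError, AssertionError, anything the candidate raises
        queue.put(False)

def count_passing(functions, test_src, entry_point, timeout=5.0):
    passed = 0
    for function_src in functions:
        queue = multiprocessing.Queue()
        proc = multiprocessing.Process(
            target=_run_check, args=(function_src, test_src, entry_point, queue)
        )
        proc.start()
        proc.join(timeout)
        if proc.is_alive():  # candidate hung; kill it and count as a failure
            proc.terminate()
            proc.join()
            continue
        if not queue.empty() and queue.get():
            passed += 1
    return passed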

@ezlandau

Stumbled upon the same issue; HumanEval isn't, in fact, actually supported.
