Integrate APPS benchmarking #1051
Conversation
Benchmark against three simple problems
Iterate over predefined set of problems

TODOs left:
- try using command line arguments
- integrate self-healing
- Add parse_diffs timeout
- Temporary except diffs related issues (Maybe some `try` statements are not necessary at this point)
- Handle problems with starter_code
- Improve results
Super excited about this, wanted to use it today. I'll add some comments to suggest simplifications. Apart from that, are we ready to merge, or is there anything you want to fix first (apart from just resolving pyproject.toml and running ...)?
I think @azrv wanted to do some cleanup. Currently, this PR comes with 2 new agent implementations, the benchmark_agent and the self_healing_agent. For the "base" PR of APPS, probably neither of these should be included; it should be sufficient to run the default_agent, even though performance may be deplorable. It is then the fun of coming PRs to add smart self-healing etc., which improves performance.
Thanks for the PR @azrv! Some good things here. I'll be direct here though: we need higher standards to merge any PR, specifically:
Finally, I don't think we need to add inputs: you can just set command = None, and then create a dict comprehension of assertions that each loop over all the inputs and call the command. Makes sense?
Ah, turning the inputs and assertions into a list was my invention, but yes, I believe the design where the inputs are stored in the assertions is nicer. I would vouch for that.
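For illustration, a minimal, self-contained sketch of that design, assuming each assertion receives an Assertable whose env exposes popen with piped stdout; all names below (make_assertion, main.py, the sample inputs, the timeout) are hypothetical, not the PR's actual identifiers:

```python
# Sketch of the suggested design: Task.command stays None and the inputs live
# inside the assertions themselves.
from typing import Callable, Dict

# Hypothetical APPS-style test data for a single problem.
inputs = ["1 2", "10 20"]
expected_outputs = ["3", "30"]


def make_assertion(command: str, expected: str) -> Callable:
    """Build an assertion that runs `command` in the task's execution
    environment and checks the expected output against stdout."""

    def assertion(assertable) -> bool:
        # Assumes the environment's popen pipes stdout; the timeout is arbitrary here.
        process = assertable.env.popen(command)
        stdout, _ = process.communicate(timeout=2)
        return expected.strip() in stdout.decode("utf-8")

    return assertion


# One named assertion per input/output pair; Task.command stays None.
assertions: Dict[str, Callable] = {
    f"input_{i}": make_assertion(f'python main.py "{inp}"', out)
    for i, (inp, out) in enumerate(zip(inputs, expected_outputs))
}
```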
Drop self_healing_agent and benchmark_agent
Store commands within assertions
Store commands within assertions
Hey!
Here are a few things I found that could be handy for gpt-engineer benchmarks. Please share your thoughts on whether this is something we would consider doing later:
Not sure this is achievable with simple dictionaries, because you need to store ...
Yes, other benchmarks are still broken. Will fix after we agree on a single solution :)
The code is getting more general, which is great! I think we can still scale back a lot on the changes to the general run.py file by implementing the system that @AntonOsika suggested. It is not necessary to extend the assertion class.

One way of doing this (which I think is quite nice) is to use the AppsAssertion class that I implemented. The idea goes: for each input/output pair, create an AppsAssertion object. Pass the code, (timeout), command + input, expected output and an execution environment to the constructor of the AppsAssertion object. In the evaluate method, run the code through the execution command and check whether the assertion holds. Timeout and runtime failures will make the assertion fail in the same way as a wrong output (but we could log it differently in the future).

A drawback with this method may be that it complicates parallelization, which could, however, still be added somehow from run.py. Another idea would be to parallelize on the problem level. Input on this @AntonOsika? Hope this helps.
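To make the idea concrete, here is one possible reading of that design as a sketch; the constructor arguments, the evaluate signature, the timeout default and the blanket exception handling are assumptions based on the comments and diff context in this thread, not the exact code in the PR:

```python
class AppsAssertion:
    """One assertion per APPS input/output pair: run the generated program on a
    single input and compare its stdout against the expected output."""

    def __init__(self, expected_output: str, command: str, timeout: int = 2):
        self.expected_output = expected_output
        self.command = command    # e.g. 'python main.py "1 2"', the input is baked in
        self.timeout = timeout    # illustrative default, not necessarily the PR's value

    def evaluate(self, assertable) -> bool:
        try:
            # Run the code through the execution command ...
            process = assertable.env.popen(self.command)  # assumes popen pipes stdout
            stdout, _ = process.communicate(timeout=self.timeout)
            # ... and check whether the assertion holds.
            return self.expected_output.strip() in stdout.decode("utf-8")
        except Exception:
            # Timeout and runtime failures fail the assertion in the same way
            # as a wrong output (could be logged differently in the future).
            return False
```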
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1051   +/-   ##
=======================================
  Coverage   84.29%   84.29%
=======================================
  Files          27       27
  Lines        1566     1566
=======================================
  Hits         1320     1320
  Misses        246      246

☔ View full report in Codecov by Sentry.
Great work and this is more or less working. Just requesting a bit of housekeeping now really.
    [result for result in self.assertion_results if list(result.values())[0]]
)
return succeeded / len(self.assertion_results)
I get: "ZeroDivisionError: division by zero" here when I run the first 5 problems.
Has this been addressed? I think it happened because I ran only a few problems, just the first 5...
Or might the list have been empty due to the catch clause?
I don't manage to reproduce it, but I added a check at the beginning of the method:
if not self.assertion_results:
return 0.0
Can you still catch it?
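Putting the guard together with the snippet above, the list-based version under discussion would look roughly like this; the success_rate name and the surrounding dataclass are assumptions pieced together from the diff context in this thread, not the exact PR code:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaskResult:
    task_name: str
    assertion_results: List[Dict[str, bool]] = field(default_factory=list)

    def success_rate(self) -> float:
        # Guard against an empty result list (e.g. a run that failed early),
        # which would otherwise raise the ZeroDivisionError reported above.
        if not self.assertion_results:
            return 0.0

        succeeded = len(
            [result for result in self.assertion_results if list(result.values())[0]]
        )
        return succeeded / len(self.assertion_results)
```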
gpt_engineer/benchmark/run.py (Outdated)
files_dict = agent.improve(task.initial_code, task.prompt, task.command)
try:
    files_dict = agent.improve(task.initial_code, task.prompt)
except DiffError:  # Temporary catch errors related to git diffs
Is this still a problem? I see that this was necessary during development, but I think we should aim to fix this on the diff side, rather than having a specific catch in the benchmarks.
+ "\nThe program, including its inputs, should be run from the command " | ||
"line like 'python main \"input1 input2 etc \"', with all inputs inside " | ||
"the quotation marks. The program should not read inputs from stdin.", | ||
assertions=[ |
We can remove the list wrapping again and just use a dictionary comprehension, like in the original design.
gpt_engineer/benchmark/run.py (Outdated)
    assertion_name: assertion(exec_result)
    for assertion_name, assertion in task.assertions.items()
},
assertion_results=[
Same here, we should be able to revert, now that we are using the AppsAssertion design. Each apps assertion will be an entry in the dictionary.
gpt_engineer/benchmark/types.py (Outdated)
@@ -57,7 +57,7 @@ class Task:
     initial_code: Optional[FilesDict]
     command: Optional[str]
     prompt: str
-    assertions: Optional[Dict[str, Assertion]]
+    assertions: Optional[List[OrderedDict[str, Assertion]]]
Again, revert this
gpt_engineer/benchmark/types.py (Outdated)
@@ -72,5 +72,14 @@ class Benchmark:
 @dataclass
 class TaskResult:
     task_name: str
-    assertion_results: dict[str, bool]
+    assertion_results: List[dict[str, bool]]
again, revert
gpt_engineer/core/chat_to_files.py (Outdated)
@@ -134,21 +140,25 @@ def parse_diffs(diff_string: str) -> dict:
     - dict: A dictionary of Diff objects keyed by filename.
     """
     # Regex to match individual diff blocks
-    diff_block_pattern = re.compile(
+    diff_block_pattern = regex.compile(
It is great that you are improving the diff-pipeline! Formally, this should be a separate PR. Would you be able to make a separate PR with the changes to chat_to_files?
I can merge that quickly
Will do
pyproject.toml (Outdated)
@@ -32,6 +32,8 @@ langchain_openai = "*"
 toml = ">=0.10.2"
 pyperclip = "^1.8.2"
 langchain-anthropic = "^0.1.1"
+datasets = "^2.17.1"
+regex = "^2023.12.25"
It would be nice if you could comment on what functionality makes the new dependencies necessary.
self.command = command

def evaluate(self, assertable: Assertable) -> bool:
    pro = assertable.env.popen(self.command)
This may sound a little counterintuitive to the architecture, but I think it may be preferable to create a fresh DiskExecutionEnvironment, upload the code and run from there, instead of using the global one. The reason is that executing the code may have side effects like creating and removing files and folders. If this is the case, multiple executions of the same program with different inputs will not be independent.
I see, will do
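A rough sketch of what that change could look like, reusing the AppsAssertion shape from earlier; the DiskExecutionEnv import path, the assertable.files attribute, the timeout value and the upload/popen calls are assumptions about the existing interfaces rather than the merged code:

```python
# Assumed import path; the execution-environment class location/name may differ.
from gpt_engineer.core.default.disk_execution_env import DiskExecutionEnv


class AppsAssertion:
    def __init__(self, expected_output: str, command: str, timeout: int = 2):
        self.expected_output = expected_output
        self.command = command
        self.timeout = timeout  # illustrative value

    def evaluate(self, assertable) -> bool:
        # Instead of the shared assertable.env, spin up a fresh environment per
        # evaluation so file-system side effects of one run cannot leak into the next.
        env = DiskExecutionEnv()
        env.upload(assertable.files)        # copy the generated code into the new env
        process = env.popen(self.command)   # assumes popen pipes stdout
        stdout, _ = process.communicate(timeout=self.timeout)
        return self.expected_output.strip() in stdout.decode("utf-8")
```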
gpt_engineer/benchmark/run.py (Outdated)
files_dict = agent.improve(task.initial_code, task.prompt, task.command)
try:
    files_dict = agent.improve(task.initial_code, task.prompt)
except RecursionError:  # Temporary catch errors related to git diffs
@ATheorell either the diff errors were indeed fixed in latest master, or I added them prematurely in the first place, but now the only thing left is the validate_and_correct function getting into recursion.
I suggest I temporarily catch it here and file an issue for it.
I actually suggest you remove this catch. We merge without it and let it fail. Then we can use it to repair whatever is broken in the diffs.
Done
@ATheorell @AntonOsika All fixed
Solves #819
Integrate APPS dataset for benchmarking.

Changes:
- `DiffError`: I ignore such errors and continue running. I guess that's a temporary change until we polish the git diff feature.
- `regex` dependency added in order to timeout endless regex matching inside the `parse_diffs` function. (done in Add timeout while searching for git diffs in LLMs response #1067)
- `run.py` had to be adjusted to work with multiple commands/inputs, as APPS might provide multiple inputs and outputs per problem.

How to run?
python3 gpt_engineer/benchmark gpt_engineer/core/default/simple_agent.py apps

I tried to run it on the first 50 problems, here is the output (total time is not right, because generated code is cached from previous runs)