Integrate APPS benchmarking #1051
Conversation
Benchmark against three simple problems
Iterate over predefined set of problems

TODOs left:
- try using command line arguments
- integrate self-healing
- Add parse_diffs timeout
- Temporary except diffs related issues (Maybe some `try` statements are not necessary at this point)
- Handle problems with starter_code
- Improve results
Super excited about this, wanted to use it today. I'll add some comments to suggest simplifications. Apart from that, are we ready to merge, or is there anything you want to fix first (apart from just resolving pyproject.toml and running ...)?
I think @azrv wanted to do some cleanup. Currently, this PR comes with 2 new agent implementations, the benchmark_agent and the self_healing_agent. For the "base" PR of APPS, probably neither of these should be included; it should be sufficient to run the default_agent, even though performance may be deplorable. It is then the fun of coming PRs to add smart self-healing etc., which improves performance.
Thanks for the PR @azrv! Some good things here. I'll be direct here though: we need higher standards to merge any PR, specifically:
Finally, I don't think we need to add inputs: you can just set command = None, and then create a dict comprehension of assertions that each loop over all the inputs and call the command. Makes sense?
Ah, turning the inputs and assertions into a list was my invention, but yes, I believe the design where the inputs are stored in the assertions is nicer. I would vouch for that.
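For illustration, a minimal, self-contained sketch of that design, assuming each assertion receives an Assertable whose env exposes popen with piped stdout; all names below (make_assertion, main.py, the sample inputs, the timeout) are hypothetical, not the PR's actual identifiers:

```python
# Sketch of the suggested design: Task.command stays None and the inputs live
# inside the assertions themselves.
from typing import Callable, Dict

# Hypothetical APPS-style test data for a single problem.
inputs = ["1 2", "10 20"]
expected_outputs = ["3", "30"]


def make_assertion(command: str, expected: str) -> Callable:
    """Build an assertion that runs `command` in the task's execution
    environment and checks the expected output against stdout."""

    def assertion(assertable) -> bool:
        # Assumes the environment's popen pipes stdout; the timeout is arbitrary here.
        process = assertable.env.popen(command)
        stdout, _ = process.communicate(timeout=2)
        return expected.strip() in stdout.decode("utf-8")

    return assertion


# One named assertion per input/output pair; Task.command stays None.
assertions: Dict[str, Callable] = {
    f"input_{i}": make_assertion(f'python main.py "{inp}"', out)
    for i, (inp, out) in enumerate(zip(inputs, expected_outputs))
}
```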
Drop self_healing_agent and benchmark_agent
Store commands within assertions
Store commands within assertions
Hey!
Here are a few things I found that could be handy for gpt-engineer benchmarks. Please share your thoughts on whether this is something we would consider doing later:
Not sure this is achievable with simple dictionaries, because you need to store ...
Yes, other benchmarks are still broken. Will fix after we agree on a single solution :)
The code is getting more general, which is great! I think we can still scale back a lot on the changes to the general run.py file by implementing the system that @AntonOsika suggested. It is not necessary to extend the assertion class.

One way of doing this (which I think is quite nice) is to use the AppsAssertion class that I implemented. The idea goes: for each input/output pair, create an AppsAssertion object. Pass the code, (timeout), command + input, expected output and an execution environment to the constructor of the AppsAssertion object. In the evaluate method, run the code through the execution command and check whether the assertion holds. Timeout and runtime failures will make the assertion fail in the same way as a wrong output (but we could log it differently in the future).

A drawback with this method may be that it complicates parallelization, which could, however, still be added somehow from run.py. Another idea would be to parallelize on the problem level. Input on this @AntonOsika? Hope this helps.
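To make the idea concrete, here is one possible reading of that design as a sketch; the constructor arguments, the evaluate signature, the timeout default and the blanket exception handling are assumptions based on the comments and diff context in this thread, not the exact code in the PR:

```python
class AppsAssertion:
    """One assertion per APPS input/output pair: run the generated program on a
    single input and compare its stdout against the expected output."""

    def __init__(self, expected_output: str, command: str, timeout: int = 2):
        self.expected_output = expected_output
        self.command = command    # e.g. 'python main.py "1 2"', the input is baked in
        self.timeout = timeout    # illustrative default, not necessarily the PR's value

    def evaluate(self, assertable) -> bool:
        try:
            # Run the code through the execution command ...
            process = assertable.env.popen(self.command)  # assumes popen pipes stdout
            stdout, _ = process.communicate(timeout=self.timeout)
            # ... and check whether the assertion holds.
            return self.expected_output.strip() in stdout.decode("utf-8")
        except Exception:
            # Timeout and runtime failures fail the assertion in the same way
            # as a wrong output (could be logged differently in the future).
            return False
```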
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1051   +/-   ##
=======================================
  Coverage   84.29%   84.29%
=======================================
  Files          27       27
  Lines        1566     1566
=======================================
  Hits         1320     1320
  Misses        246      246

☔ View full report in Codecov by Sentry.
Great work and this is more or less working. Just requesting a bit of housekeeping now really.
    [result for result in self.assertion_results if list(result.values())[0]]
)
return succeeded / len(self.assertion_results)
I get: "ZeroDivisionError: division by zero" here when I run the first 5 problems.
Has this been addressed? I think it happened because I ran only a few problems, just the first 5...
Or might the list have been empty due to the catch clause?
I don't manage to reproduce it, but I added a check at the beginning of the method:
if not self.assertion_results:
return 0.0
Can you still catch it?
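Putting the guard together with the snippet above, the list-based version under discussion would look roughly like this; the success_rate name and the surrounding dataclass are assumptions pieced together from the diff context in this thread, not the exact PR code:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaskResult:
    task_name: str
    assertion_results: List[Dict[str, bool]] = field(default_factory=list)

    def success_rate(self) -> float:
        # Guard against an empty result list (e.g. a run that failed early),
        # which would otherwise raise the ZeroDivisionError reported above.
        if not self.assertion_results:
            return 0.0

        succeeded = len(
            [result for result in self.assertion_results if list(result.values())[0]]
        )
        return succeeded / len(self.assertion_results)
```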
gpt_engineer/benchmark/run.py (Outdated)
files_dict = agent.improve(task.initial_code, task.prompt, task.command)
try:
    files_dict = agent.improve(task.initial_code, task.prompt)
except DiffError:  # Temporary catch errors related to git diffs
Is this still a problem? I see that this was necessary during development, but I think we should aim to fix this on the diff side, rather than having a specific catch in the benchmarks.
+ "\nThe program, including its inputs, should be run from the command " | ||
"line like 'python main \"input1 input2 etc \"', with all inputs inside " | ||
"the quotation marks. The program should not read inputs from stdin.", | ||
assertions=[ |
We can remove the list wrapping again and just use a dictionary comprehension, like in the original design.
gpt_engineer/benchmark/run.py (Outdated)
    assertion_name: assertion(exec_result)
    for assertion_name, assertion in task.assertions.items()
},
assertion_results=[
Same here, we should be able to revert, now that we are using the AppsAssertion design. Each apps assertion will be an entry in the dictionary.
gpt_engineer/benchmark/types.py (Outdated)
@@ -57,7 +57,7 @@ class Task:
     initial_code: Optional[FilesDict]
     command: Optional[str]
     prompt: str
-    assertions: Optional[Dict[str, Assertion]]
+    assertions: Optional[List[OrderedDict[str, Assertion]]]
Again, revert this
gpt_engineer/benchmark/types.py (Outdated)
@@ -72,5 +72,14 @@ class Benchmark:
 @dataclass
 class TaskResult:
     task_name: str
-    assertion_results: dict[str, bool]
+    assertion_results: List[dict[str, bool]]
again, revert
gpt_engineer/core/chat_to_files.py (Outdated)
@@ -134,21 +140,25 @@ def parse_diffs(diff_string: str) -> dict:
     - dict: A dictionary of Diff objects keyed by filename.
     """
     # Regex to match individual diff blocks
-    diff_block_pattern = re.compile(
+    diff_block_pattern = regex.compile(
It is great that you are improving the diff-pipeline! Formally, this should be a separate PR. Would you be able to make a separate PR with the changes to chat_to_files?
I can merge that quickly
Will do
pyproject.toml (Outdated)
@@ -32,6 +32,8 @@ langchain_openai = "*"
 toml = ">=0.10.2"
 pyperclip = "^1.8.2"
 langchain-anthropic = "^0.1.1"
+datasets = "^2.17.1"
+regex = "^2023.12.25"
It would be nice if you could comment on what functionality makes the new dependencies necessary.
self.command = command

def evaluate(self, assertable: Assertable) -> bool:
    pro = assertable.env.popen(self.command)
This may sound a little counterintuitive to the architecture, but I think it may be preferable to create a fresh DiskExecutionEnvironment, upload the code and run from there, instead of using the global one. The reason is that executing the code may have side effects like creating and removing files and folders. If this is the case, multiple executions of the same program with different inputs will not be independent.
I see, will do
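A rough sketch of what that change could look like, reusing the AppsAssertion shape from earlier; the DiskExecutionEnv import path, the assertable.files attribute, the timeout value and the upload/popen calls are assumptions about the existing interfaces rather than the merged code:

```python
# Assumed import path; the execution-environment class location/name may differ.
from gpt_engineer.core.default.disk_execution_env import DiskExecutionEnv


class AppsAssertion:
    def __init__(self, expected_output: str, command: str, timeout: int = 2):
        self.expected_output = expected_output
        self.command = command
        self.timeout = timeout  # illustrative value

    def evaluate(self, assertable) -> bool:
        # Instead of the shared assertable.env, spin up a fresh environment per
        # evaluation so file-system side effects of one run cannot leak into the next.
        env = DiskExecutionEnv()
        env.upload(assertable.files)        # copy the generated code into the new env
        process = env.popen(self.command)   # assumes popen pipes stdout
        stdout, _ = process.communicate(timeout=self.timeout)
        return self.expected_output.strip() in stdout.decode("utf-8")
```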
gpt_engineer/benchmark/run.py (Outdated)
files_dict = agent.improve(task.initial_code, task.prompt, task.command)
try:
    files_dict = agent.improve(task.initial_code, task.prompt)
except RecursionError:  # Temporary catch errors related to git diffs
@ATheorell either the diff errors were indeed fixed in latest master, or I added them prematurely in the first place, but now the only thing left is the validate_and_correct function getting into recursion.
I suggest I temporarily catch it here and file an issue for it.
I actually suggest you remove this catch. We merge without it and let it fail. Then we can use it to repair whatever is broken in the diffs.
Done
@ATheorell @AntonOsika All fixed
Solves #819
Integrate APPS dataset for benchmarking.

Changes:
- `DiffError`: I ignore such errors and continue running. I guess that's a temporary change until we polish the git diff feature.
- `regex` dependency added in order to timeout endless regex matching inside the `parse_diffs` function. (done in Add timeout while searching for git diffs in LLMs response #1067)
- `run.py` had to be adjusted to work with multiple commands/inputs, as APPS might provide multiple inputs and outputs per problem.

How to run?
python3 gpt_engineer/benchmark gpt_engineer/core/default/simple_agent.py apps

I tried to run it on the first 50 problems, here is the output (total time is not right, because generated code is cached from previous runs)