Improve website summary quality via browse prompt change #3551
Conversation
Codecov Report
Patch coverage has no change; project coverage changes by -1.01%.

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #3551      +/-   ##
==========================================
- Coverage   60.99%   59.98%    -1.01%
==========================================
  Files          73       69        -4
  Lines        3310     3099      -211
  Branches      542      513       -29
==========================================
- Hits         2019     1859      -160
+ Misses       1152     1109       -43
+ Partials      139      131        -8

☔ View full report in Codecov by Sentry.
How can I make the linting check pass? Help would be appreciated.
You need to run the linter locally and fix what it reports.
I’d also love to see you add a regression test to prevent this from happening in the future. You can copy-paste tests/integration/goal_oriented/test_browse_website.py and change the agent and the assert statement to fit your needs!
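A rough sketch of what such a regression test could look like (the import path and the exact call are assumptions for illustration, not the contents of the actual test file):

```python
import pytest

# Hypothetical import path; the real browse command lives elsewhere in the codebase.
from autogpt.commands.web_requests import browse_website


def test_browse_website_retains_detail() -> None:
    # Use a fixed, pre-defined page so the check is repeatable.
    summary = browse_website(
        url="https://example.com/cherry-cultivars",
        question="List of commercially available cherry cultivars",
    )
    # Assert on a concrete detail rather than exact wording, since the model's
    # phrasing varies between runs.
    assert "Montmorency" in summary
```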
Hey @ntindle, I've fixed the linting issues, but I'm still struggling with the regression test for my changes. I'm not too experienced in this area, but I'd like to learn. Could you give me some guidance or a specific example of how to create a test for my improvements to the browsing functionality? Sorry if it's a bother, but any help would be awesome. Thanks!
There is only one regression test in "test_browse_website.py" for the moment. It tests whether AutoGPT can correctly find the price of an item in a pre-defined website text. This is something that is repeatable and can be automated.

I am not sure that it is possible to make a regression test for longer, more comprehensive summaries. ChatGPT will often give different summarizations (and we cannot control its random seed via the API). A qualitative benchmark is possible, but that cannot be automated: you basically look at a website and judge whether the summary is good or not. Maybe do that multiple times for the same website, and possibly for a few more websites.

This is a well-known issue with testing LLMs, so I wouldn't worry too much, as long as a couple of qualitative tests show that this prompt indeed produces longer summaries with more relevant detail retained. I remember that one of my issues with browsing was also that it made very small summaries that lost a lot of useful detail.
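As a minimal illustration of that manual spot-check loop (summarize_website here is a placeholder for whatever browse/summarize entry point is being tested, not a real AutoGPT function):

```python
from typing import Callable


def spot_check(
    summarize_website: Callable[[str, str], str],
    url: str,
    question: str,
    runs: int = 3,
) -> None:
    """Run the same summarization several times and print the results for manual review.

    The API gives no seed control, so a few samples are needed; judging whether the
    summaries actually answer the question remains a human call.
    """
    for i in range(runs):
        print(f"--- run {i + 1} ---")
        print(summarize_website(url, question))
```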
Thank you; I can't stress how much that cleared things up for me. I suppose now we'll wait for someone higher up to benchmark this.
Okay, so here is my attempt at a qualitative benchmark. We can discuss if this is the right approach or if you have something else in mind.

Background
As a bit of background, remember that if a website is considered too long (i.e. it would use up too many tokens just to send the contents to ChatGPT), then it is split into sentences using spaCy, and then into chunks that max out the number of tokens allowed (by default we use GPT-3.5-Turbo and BROWSE_CHUNK_MAX_LENGTH=3000 tokens, leaving about 1000 tokens for the answer). Each chunk is summarized, and then the summaries are concatenated by AutoGPT and summarized again.

Method
I have chosen a website that is too large and will be split into 3 chunks, which are then summarized into a final summary. To save space I will not include the intermediate summaries here, only the final, overall summary of the website as returned by the browse command. I will repeat the summary creation 3 times on the same website with the same question, because ChatGPT is probabilistic, so we need more than 1 sample. I will do this once for the current AutoGPT implementation and once for the new prompt proposed by this PR. I will then copy the summaries into the next paragraphs as they are, without modification. The results below were produced at Temperature 0.5 with the same AutoGPT browse command for both versions.
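For context, a minimal sketch of the chunk-and-summarize flow described in the Background above (spaCy sentence splitting, token-bounded chunks, per-chunk summaries, then a summary of the summaries). The helper names and the summarize callback are assumptions, not AutoGPT's actual code:

```python
import spacy
import tiktoken

MAX_CHUNK_TOKENS = 3000  # mirrors BROWSE_CHUNK_MAX_LENGTH, leaving ~1000 tokens for the answer

nlp = spacy.load("en_core_web_sm")
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")


def split_into_chunks(text: str, max_tokens: int = MAX_CHUNK_TOKENS) -> list[str]:
    """Split the page text into sentences, then pack sentences into token-bounded chunks."""
    sentences = [sent.text for sent in nlp(text).sents]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(encoding.encode(candidate)) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_website(text: str, query: str, summarize) -> str:
    """Summarize each chunk, then summarize the concatenation of the chunk summaries."""
    chunk_summaries = [summarize(chunk, query) for chunk in split_into_chunks(text)]
    return summarize("\n\n".join(chunk_summaries), query)
```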
RESULTS for original AutoGPT version

Run 1: The text provides a list of commercially available cherry cultivars for Ontario, including recommended sweet and tart cherry cultivars, harvest dates, pollination information, and cherry cultivar descriptions. The cultivars are listed in order of maturity and are grouped into general planting, limited planting, and trial planting categories. The text also includes information on cherry rootstocks and pollen incompatibility groups for sweet cherry cultivars.

Run 2: The text provides a list of commercially available cherry cultivars for Ontario, including recommended sweet and tart cherry cultivars, their harvest dates, pollination requirements, and brief descriptions of each cultivar's characteristics and performance. The cultivars are grouped into general planting, limited planting, and trial planting categories, and recommendations for planting cultivars and adapted areas within the province have been determined by various organizations and consultations with industry stakeholders. The list includes cultivars such as Sunburst, Sweetheart, Tehranivee, Ulster, Valera, Van, and Montmorency, as well as information about different cherry rootstocks. The information is provided by the Ministry of Agriculture, Food and Rural Affairs in Ontario, Canada.

Run 3: The text provides a list of commercially available cherry cultivars, including both sweet and tart varieties, as well as information on cherry rootstocks and species collections. The list is organized by maturity and includes recommended cultivars for Ontario. The text also mentions the importance of maintaining cherry collections for evaluation and breeding programs.

RESULTS for the prompt from this Pull Request

Run 1:
Query: "List of commercially available cherry cultivars"
Data block 1: The article provides a list of commercially available cherry cultivars for Ontario, categorized by recommended general planting, limited planting, and trial planting. The recommended sweet cherry cultivars are Viva, Vista, Hartland, Valera, Vega, Cavalier, Viscount, Venus, Cristalina, Bing, Vic, Kristin, Vogue, Newstar, Vandalay, Stella, Tehranivee, Sonata, and Hedelfingen. The recommended tart cherry cultivars are Montmorency, Northstar, Balaton, Galaxy, and Meteor. The article also includes information on cherry harvest dates, pollination for sweet and tart cherries, cherry cultivar descriptions, and cherry rootstocks.
Data block 2: The data block contains a list of commercially available cherry cultivars, including both sweet and tart varieties. Some of the sweet cherry cultivars mentioned are Colt, Gisela®5, Gisela®6, and Mazzard. The tart cherry cultivars mentioned include Balaton, English Morello, Meteor, and Montmorency. The list also includes descriptions of each cultivar's characteristics, such as fruit size, ripening time, and resistance to certain diseases. There is no specific statistic related to this query, as the data block provides qualitative descriptions rather than quantitative data.
Data block 3: The data block contains a list of commercially available cherry cultivars, including their names and some additional information such as their ripening time, color, and origin. Some of the named cultivars include FrancisIII, Grosse GermersdorferIII, Harlemer Doppelte-Hartland, LambertIII, Lapins, NapoleonIII, RainierIX, and Stella. The block also includes a list of cherry rootstocks and species collections, as well as some miscellaneous information about cherry varieties.
Overall, the block provides a comprehensive overview of the different types of cherries that are available commercially.
Explanation: All three data blocks provide a list of commercially available cherry cultivars, along with additional information such as ripening time, disease resistance, and origin. The first data block provides recommended sweet and tart cherry cultivars for Ontario, while the second data block provides a list of sweet and tart cherry cultivars along with their characteristics. The third data block provides a list of named cherry cultivars along with their ripening time, color, and origin, as well as information on cherry rootstocks and species collections. No specific statistics are provided for this query, as the data blocks provide qualitative descriptions of the cherry cultivars.

Run 2: The data block provides a list of commercially available cherry cultivars for Ontario, categorized by sweet and tart cherries, with recommended cultivars listed in order of maturity. The list includes general planting, limited planting, and trial planting categories. The block also provides information on cherry harvest dates, pollination requirements for sweet and tart cherries, and brief descriptions of major sweet cherry cultivars. The data block does not provide specific statistics on the number of cultivars available or their market share. The cultivars listed include Sunburst, Sweetheart, Tehranivee, Ulster, Valera, Van, Vandalay, Vega, Venus, Vic, Viscount, Vista, Viva, and Vogue for sweet cherries, and Balaton, English Morello, Meteor, and Montmorency for tart cherries. The list also includes information on cherry rootstocks, including Colt, Gisela clones, Mahaleb, Mazzard, and MxM clones. There are no specific statistics provided on the number of commercially available cherry cultivars or their market share.

Run 3: The data block provides a list of commercially available cherry cultivars for Ontario, categorized by sweet and tart cherries, and grouped by recommended cultivars for general planting, limited planting, and trial planting. The list includes recommended cherry cultivars, harvest dates, pollination requirements, cultivar descriptions, and recommended rootstocks. Some sweet cherry cultivars that are recommended for Ontario include Sunburst, Sweetheart, Tehranivee, Ulster, Valera, Van, Vandalay™, Vega, Venus, Vic, Viscount, Viva, and Vogue. Some tart cherry cultivars that are recommended for Ontario include Balaton, English Morello, Meteor, and Montmorency. The data block also provides information on the importance of selecting appropriate cultivars for specific climatic zones, and the need for cross-pollination in sweet cherry cultivars. The list includes average first harvest dates for sweet and tart cherries, and pollen incompatibility groups for sweet cherry cultivars. The data block does not provide specific statistics on the number of commercially available cherry cultivars in Ontario.

DISCUSSION

Original version: Only 1 summary out of 3 actually contains the answer to the question. The other two allude to the fact that the website contains the answer, but do not actually list the requested cherry tree cultivars in the summary.

This PR version: All three summaries actually have the answer to the question "list of commercially available cherry cultivars", and a bit more text and detail. This should help give a better Ada embedding down the line, when AutoGPT is looking at its memories.
CONCLUSION

I think that the summaries are better than before, so I would recommend merging this PR.
@bszollosinagy This is some really professional work. Overall, I think this benchmark effectively demonstrates this PR's improvements. Well done; I'll learn from you.
A test is failing with what seems to be an "Incorrect API key provided" error. Is this an issue with the PR? It happened on the most recent merge from master.
Apparently this is caused by a test that caches its results into something called a cassette; see CONTRIBUTING.md. All you have to do is run the tests locally with your own API key so the missing cassette gets recorded.

If I understand correctly, this is done so that running the tests on GitHub does not need an API key, as long as it finds a "cassette" which acts as a cache. In your case it did not find the cassette, so it tried to use the OpenAI API, hence the error (because nobody is going to give their own API key just so that 100s of unit tests can run on GitHub for each Pull Request). I hope that clears it up.
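A generic illustration of the cassette mechanism with pytest and VCR (this is not the project's actual test or conftest, just how recording and replay typically work):

```python
import pytest
import requests


@pytest.mark.vcr
def test_fetch_page():
    # On the first run (with real network access and credentials) the HTTP exchange is
    # recorded into a YAML "cassette". Subsequent runs replay the cassette, so CI never
    # needs a real OpenAI key, unless the cassette is missing, as happened here.
    response = requests.get("https://example.com/")
    assert response.status_code == 200
```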
This is a mass message from the AutoGPT core team. For more details (and for info on joining our Discord), please refer to:
This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
@bszollosinagy If it matters, the test failed; the logs are below. I've pushed the updated YAML. Thanks for the patience. I'd appreciate it if you'd let me know whether I did anything wrong. I'll probably look into what else I can do in terms of troubleshooting tomorrow, but right now I'm fairly exhausted.
@bszollosinagy we're currently building challenges and I really liked your work on this PR. Can I talk to you in voice? Please join us on Discord through this link https://discord.gg/autogpt (if you haven't already) and DM me on the Auto-GPT Discord server (my Discord is merwanehamadi).
@onekum If you are running the test on your computer, then somehow the .env file does not have your API key filled in.
This should probably use the same argument structure that the other commands are using, i.e.
PS: please consider leaving your feedback about the idea of supporting "prompt profiles" (directories with different prompt configs) for these types of changes, as per: #1874 (comment) |
This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
@bszollosinagy How can I switch it from sk-dummy? I made sure that the .env file had my API key, and I even tried updating the env.template file as well, which resulted in the same error. What file is it getting the API key from, if not .env? It keeps trying to use sk-dummy.
This pull request has conflicts with the base branch; please resolve those so we can evaluate the pull request.
@merwanehamadi This seems to be a bug with VCR. If I comment out the @pytest.mark.vcr line, the test passes; if it is present, the dummy key is used even if a correct API key is specified in the .env file. If you check out this PR and try to run the browse test, it will also fail for you. Note: the test was executed from the command line to avoid conflicts with an IDE.
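For what it's worth, a generic sketch of the kind of VCR configuration that produces this behaviour (not AutoGPT's actual conftest): cassettes are commonly recorded with the real Authorization header replaced by a placeholder, and then replayed as-is.

```python
import pytest


@pytest.fixture(scope="module")
def vcr_config():
    # Passed to VCR by pytest-recording; values here are an illustrative assumption.
    return {
        # Scrub the real key from recordings and store a placeholder instead;
        # on replay, requests are matched without needing the real key.
        "filter_headers": [("authorization", "Bearer sk-dummy")],
        # Only hit the network when no cassette exists yet; otherwise replay.
        "record_mode": "once",
    }
```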
This directly conflicts with changes in #4208 (which has priority). Can you test that fork, see if it still needs improvement, and make a PR to that fork, or to master after it is merged?
Hey, I've marked this as don't-merge until the Memory Fixes are in. Sorry to keep it on hold longer; there's just no real way to test/validate the functionality until the fixes are in.
Closing this, as the changes in #4208 seem to fix the previous behavior adequately in my opinion. Additionally, due to a lack of free time and technical knowledge, my motivation to continue work on this PR is waning. If issues with the browse function arise again in the future, or if it's thought that it could still be improved further, someone may attempt an updated implementation of this PR.
Background
I feel that "question" and "summary" are bad words to use when prompting an AI to extract data from a webpage; they're too general. So I reworked the browse prompt to reflect that, which ended up significantly improving GPT-3's performance in summarizing website content. By telling it to use maximum detail, the browse outputs won't become (as) progressively terse and oversimplified, which would otherwise lead to a bad final summary. The caveat is potentially using more tokens, because the summaries will be longer, but at least the AI won't be neutered by bad browsing functionality.
This also makes it so you won't go homeless from needing to use GPT-4 to get decent website summaries.
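A hypothetical sketch of the kind of prompt change being described; both templates are illustrative reconstructions, not the exact strings changed by this PR.

```python
def old_browse_prompt(text: str, question: str) -> str:
    # Generic wording: "question" and "summarize" leave a lot of room for terse answers.
    return (
        f'"""{text}"""\n'
        f'Using the above text, answer the following question: "{question}". '
        "If the question cannot be answered using the text, summarize the text."
    )


def new_browse_prompt(text: str, query: str) -> str:
    # Query-oriented wording that asks for maximum retained detail from the data block.
    return (
        f'"""{text}"""\n'
        f'Using the above data block, extract all information relevant to the query "{query}". '
        "Retain maximum detail: keep names, numbers, and statistics, and do not shorten "
        "the answer at the cost of losing information."
    )
```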
Changes
Changed the browse prompt to be more effective.
Documentation
Nothing is added, only changed. No documentation necessary (I don't think).
Test Plan
I've tested this over the course of many different prompts on my own repo and I can say confidently that this alternate browse prompt significantly increases the quality of the summary returned. You'll see that the summaries are much, much more information dense and character-efficient. Example outputs are below; compare with current performance:
Output 1:
Output 2:
Output 3:
PR Quality Checklist
Things to consider
Perhaps change "question" to "query" in the initial prompt so that the AI treats it more as a query for information and thus makes its input more objective. The fact that I can't find where the initial prompt is anymore prevents me from making this change myself. If anyone wants to make this change or inform me how to do so, it'll happen, but I think this would be a fine PR merge even without that.