Add headless browser to the WebSurferAgent #1534

Closed
wants to merge 1 commit into from

vijaykramesh
Contributor

@vijaykramesh vijaykramesh commented Feb 5, 2024

Why are these changes needed?

This adds a simple Selenium-driven headless browser to the WebSurferAgent. It does not yet make the headless browser agent multi-modal (e.g., it can't do anything with images in the browser), but it should work much better for JavaScript-powered websites than the current SimpleTextBrowser implementation.

This also refactors the existing SimpleTextBrowser implementation a bit so that both browsers share a common base class, and adds unit tests for both the existing implementation and the new HeadlessChromeBrowser implementation.

https://app.codecov.io/github/microsoft/autogen/pull/1534 shows the coverage additions there

Related issue number

Closes #1481

Checks

@codecov-commenter

codecov-commenter commented Feb 5, 2024

Codecov Report

Attention: Patch coverage is 88.11189% with 17 lines in your changes missing coverage. Please review.

Project coverage is 48.00%. Comparing base (26daa18) to head (b0ab6c1).
Report is 734 commits behind head on 0.2.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| autogen/browser_utils/headless_chrome_browser.py | 90.42% | 7 Missing and 2 partials ⚠️ |
| autogen/browser_utils/abstract_browser.py | 73.33% | 8 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##              0.2    #1534       +/-   ##
===========================================
+ Coverage   34.26%   48.00%   +13.73%     
===========================================
  Files          42       45        +3     
  Lines        5099     5225      +126     
  Branches     1165     1261       +96     
===========================================
+ Hits         1747     2508      +761     
+ Misses       3209     2513      -696     
- Partials      143      204       +61     
| Flag | Coverage Δ |
|---|---|
| unittests | 47.94% <88.11%> (+13.68%) ⬆️ |


@vijaykramesh vijaykramesh changed the title Add headless browser to the WebSurferAgent, closes #1481 Draft: Add headless browser to the WebSurferAgent Feb 5, 2024
@victordibia victordibia requested a review from afourney February 5, 2024 05:53
@afourney
Member

afourney commented Feb 5, 2024

First of all, this looks fantastic. Thanks so much for working on this.

I will dig in and try this out as soon as possible today. We've discussed some possibilities on Discord already, but those are arguably future improvements. I think driving the browser and using innerHTML are perfect first steps, and are completely sufficient for a first PR.

One thing to test will be PDFs. In many benchmark scenarios, PDFs are the final document sought by browsing, but they don't have innerHTML. As a first step, simply downloading them to the Downloads folder might be sufficient.
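
A rough sketch of the "just download the PDF" idea, assuming Chrome honors the commonly used `download.default_directory`, `download.prompt_for_download`, and `plugins.always_open_pdf_externally` preferences. The helper name `pdf_download_prefs` is hypothetical and not part of this PR; the Selenium wiring is shown only as a comment.

```python
from pathlib import Path


def pdf_download_prefs(download_dir: str) -> dict:
    """Build Chrome preferences that save PDFs to disk instead of rendering them.

    Hypothetical helper for illustration only; it is not part of this PR.
    """
    target = Path(download_dir).expanduser().resolve()
    return {
        # Where Chrome should put downloaded files.
        "download.default_directory": str(target),
        # Do not open a "Save as" dialog (there is no UI in headless mode).
        "download.prompt_for_download": False,
        # Download PDFs rather than opening them in the built-in viewer.
        "plugins.always_open_pdf_externally": True,
    }


# Possible wiring with Selenium (not executed here; needs selenium and Chrome):
#   from selenium import webdriver
#   options = webdriver.ChromeOptions()
#   options.add_argument("--headless=new")
#   options.add_experimental_option("prefs", pdf_download_prefs("~/Downloads"))
#   driver = webdriver.Chrome(options=options)
```

Whether headless Chrome respects these preferences in every version is something to verify during testing.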

@afourney
Member

afourney commented Feb 7, 2024

Just adding some notes while I continue to test this:

To install chrome in Docker or WSL:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb

Resolve dependencies with:
sudo apt -f install

Then try this again:
sudo dpkg -i google-chrome-stable_current_amd64.deb

script.extract()

# Convert to text
text = soup.get_text()
Member

At least for now this should be:

    text = markdownify.MarkdownConverter().convert_soup(soup)

GPT models reason about markdown well, and we want to preserve links etc.

if uri_or_path.startswith("bing:"):
    self._bing_search(uri_or_path[len("bing:") :].strip())
else:
    self.driver.get(uri_or_path)
Member

In the SimpleTextBrowser setting this would also process the content. I recommend we maintain this behavior (again, at least for now)


def _split_pages(self):
    # Split only regular pages
    if not self.address.startswith("http:") and not self.address.startswith("https:"):
Member

One intention here with SimpleTextBrowser is to not break up web search results across multiple viewports. In SimpleTextBrowser, that is indicated by "bing:" URLs, but this won't work here. I would recommend some other mechanism to ensure search results are kept whole to maintain the behavior.
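
One possible mechanism, sketched under assumptions: the browser records an explicit "keep whole" flag when the current page came from a search, and the splitter checks that flag instead of inspecting the URL prefix. `split_viewports` and `keep_whole` are hypothetical names, not code from this PR.

```python
from typing import List, Tuple


def split_viewports(text: str, viewport_size: int, keep_whole: bool = False) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets, one pair per viewport.

    keep_whole is a hypothetical flag the browser would set for search
    results so that they are never spread across several viewports.
    """
    if keep_whole or len(text) <= viewport_size:
        return [(0, len(text))]

    pages = []
    start = 0
    while start < len(text):
        end = min(start + viewport_size, len(text))
        # Extend to the next whitespace so a word is never cut in half.
        while end < len(text) and not text[end - 1].isspace():
            end += 1
        pages.append((start, end))
        start = end
    return pages
```

With this shape, the search path would call `split_viewports(content, size, keep_whole=True)` and regular navigation would leave the flag at its default.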

def _bing_search(self, query):
    self.driver.get("https://www.bing.com")

    search_bar = self.driver.find_element(By.NAME, "q")
Member

Clever, but maybe we can just navigate directly to "https://www.bing.com/search?q=" with the query provided in the url itself?

Also, an advantage of using the API is that we can render results to Markdown cleanly however we want. That may not happen here. I recommend using the API if we have a key available, otherwise falling back to this method.
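
A minimal sketch of both suggestions, assuming the query is passed as a URL-encoded `q` parameter. `bing_search_url` and `choose_search_backend` are hypothetical helpers, and the actual API call is left out.

```python
from typing import Optional
from urllib.parse import quote_plus


def bing_search_url(query: str) -> str:
    """Build a results URL directly, so no search box has to be located."""
    return "https://www.bing.com/search?q=" + quote_plus(query)


def choose_search_backend(bing_api_key: Optional[str]) -> str:
    """Prefer the API when a key is configured, else drive the browser."""
    return "api" if bing_api_key else "browser"
```

In the browser path, the driver would then simply call `self.driver.get(bing_search_url(query))`.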

Contributor

We may want to use implicit waits with driver.implicitly_wait(10).

Given the dynamic nature of the page, calling line 126 right after line 124 can raise NoSuchElementException.

"Note that as soon as the element is located, the driver will return the element reference and the code will continue executing, so a larger implicit wait value won’t necessarily increase the duration of the session." So here, implicitly_wait(10) will not always wait the full 10 seconds; it stops waiting as soon as the element is located.
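
A small sketch of that suggestion. `find_search_bar` is a hypothetical helper; `driver` is assumed to be a Selenium WebDriver, and the literal `"name"` is the string value of `By.NAME`, used so the snippet needs no import.

```python
def find_search_bar(driver, timeout_seconds: float = 10):
    """Locate the search box named "q", polling for up to timeout_seconds."""
    # implicitly_wait only sets an upper bound: find_element returns as soon
    # as the element exists, so the full timeout is spent only on a miss.
    driver.implicitly_wait(timeout_seconds)
    # "name" is the value of selenium's By.NAME constant.
    return driver.find_element("name", "q")
```

Note that an implicit wait stays in effect for the rest of the driver session, so it would normally be set once when the driver is created rather than before every lookup.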

gitguardian bot commented Jul 20, 2024

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.

@ekzhu ekzhu changed the base branch from main to 0.2 October 2, 2024 18:30
@jackgerrits jackgerrits added the 0.2 Issues which are related to the pre 0.4 codebase label Oct 4, 2024
@rysweet
Collaborator

rysweet commented Oct 10, 2024

This PR is against AutoGen 0.2. AutoGen 0.2 has been moved to the 0.2 branch. Please rebase your PR on the 0.2 branch or update it to work with the new AutoGen 0.4 that is now in main.

@rysweet rysweet added the awaiting-op-response Issue or pr has been triaged or responded to and is now awaiting a reply from the original poster label Oct 10, 2024
@rysweet
Collaborator

rysweet commented Oct 11, 2024

@vijaykramesh closing as stale, please update addressing reviews if you would like to reopen

@rysweet rysweet closed this Oct 11, 2024
Successfully merging this pull request may close these issues.

Develop a version of WebSurferAgent that uses a proper headless browser (perhaps with multimodal LLM support)
6 participants