Improve web-crawling system #4
Comments
This is currently majorly limiting Auto-GPT's capabilities. |
can you provide more context, please? |
Current Implementation
Web browsing is currently handled in the following way:

    def browse_website(url):
        summary = get_text_summary(url)
        links = get_hyperlinks(url)
        # Limit links to 5
        if len(links) > 5:
            links = links[:5]
        result = f"""Website Content Summary: {summary}\n\nLinks: {links}"""
        return result

    def get_text_summary(url):
        text = browse.scrape_text(url)
        summary = browse.summarize_text(text)
        return """ "Result" : """ + summary

    def get_hyperlinks(url):
        link_list = browse.scrape_links(url)
        return link_list

The Problem
This leads to instances where Auto-GPT is looking for news on CNN and instead receives a summary of what CNN is.

An Illustrated Example
Summary:
|
okay, thanks for the context. I'll try to see what I can do. |
I tried to get it to work, but I don't have access to the GPT-4 API. With gpt-3.5-turbo it works for some and not for others, even with strict prompting. GPT-4 should be able to do it with strict prompting. |
It might be easier to add third-party support, for example: https://www.algolia.com/pricing/ |
|
@Torantulino I see that you're using BeautifulSoup for processing the content of the site. This won't handle data that has to be injected into a site via say JavaScript. I'm not sure exactly how, but some of the RESTful / GraphQL / etc calls could be helpful for summarizing a page. We could also consider pulling the metadata from the page and using that to determine how to prompt the summarization. To be fair, I haven't looked at the prompting code yet and don't know if you're already doing this. |
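A minimal sketch of the metadata idea from the comment above, assuming the page HTML has already been fetched. It uses the standard library's `html.parser` rather than BeautifulSoup so it runs without extra dependencies. `extract_metadata`, `build_summary_prompt`, and the prompt wording are hypothetical names chosen for illustration, not part of Auto-GPT.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and <meta name/property ... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Standard tags use name=..., Open Graph tags use property=...
            key = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if key and content:
                self.meta[key.lower()] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return a dict with the page title plus any meta tags found."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


def build_summary_prompt(metadata, goal):
    """Hypothetical prompt builder: gives the summarizer page context and the goal."""
    description = metadata.get("description", "(none)")
    return (
        f"Page title: {metadata.get('title', '(none)')}\n"
        f"Page description: {description}\n"
        f"Summarize the page text with this goal in mind: {goal}"
    )
```

The resulting prompt could then be passed to whatever summarization call is in use; how that call is prompted today is not covered in this thread.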
What about bs4 (BeautifulSoup4)?
|
This is the single biggest issue I'm facing with GPT-3.5-turbo and Auto-GPT. It is not reliably able to go to a website and pull out and summarize key information. My use case is going to a job posting and pulling out a summary to compare to my resume. This would be a game changer. |
I also modified the above code to handle JavaScript using Selenium.
|
Looking for feedback on #507. @Torantulino, it's my first time making an "official" PR, but the project is beyond compelling and I had to get this out to the community. It's magical what happens when it's able to access information from as recent as today (April 8th, 2023) in its analysis, reasoning, and logic. Edit: I made a 26-minute video showing what's possible with this PR. This isn't a minor incremental bug fix, it's a MAJOR unlock! |
A few tips for scraping here: use Selenium with OpenCV, and make sure to force a scroll to the bottom of the page to load everything. Actually, current ChatGPT sometimes gives correct scraping results instead of code to scrape, so it's able to do it, just jailed for some reason. |
Here's a pull request for the fix. It uses Pyppeteer to navigate JavaScript sites. Lots of adaptability and flexibility. |
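The "force scroll to bottom" tip can be sketched roughly as below, assuming a Selenium WebDriver-like object that exposes `execute_script`. The function name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are illustrative choices, not anything from the project.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom so lazy-loaded content is requested.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, assume we reached the end
        last_height = new_height
    return rounds
```

The `max_rounds` cap is there because infinite-scroll pages may never stop growing.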
Pix2Struct to capture structure, plus plain OCR, could give acceptable results if everything else fails. |
Where are we at for logging into websites during web-crawling? |
Selenium, Puppeteer, or similar for web UI logins would be optimal, I guess, IMHO. |
Mechanize would work well, although not functional with JS. |
+1 for a Selenium browser: it supports JS injection and might be better than the Google console if you are running it from a residential IP. Just add jitter and wait time before reading, and you can easily save page_source as HTML or parse it later with bs4. OpenCV is overkill for this and resource-intensive; there's no point in using it except for images. I just found out about this project and might look further into it over the weekend. |
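A rough sketch of the "jitter and wait, then save page_source" suggestion, assuming a Selenium WebDriver-like object with `get` and `page_source`. The helper name `fetch_page_source` and its `base_wait`, `jitter`, and `save_to` parameters are hypothetical, and the default wait values are arbitrary.

```python
import random
import time
from pathlib import Path


def fetch_page_source(driver, url, base_wait=2.0, jitter=1.5, save_to=None):
    """Load url, wait a randomized interval, then return the rendered HTML.

    If save_to is given, the HTML is also written to that path so it can be
    parsed later (for example with bs4).
    """
    driver.get(url)
    # Randomized delay: lets JavaScript finish and avoids a fixed request rhythm.
    time.sleep(base_wait + random.uniform(0, jitter))
    html = driver.page_source
    if save_to is not None:
        Path(save_to).write_text(html, encoding="utf-8")
    return html


# Example with a real browser (requires Selenium and a browser driver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   html = fetch_page_source(driver, "https://example.com", save_to="page.html")
#   driver.quit()
```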
Selenium + Chromium + Firefox are now in |
how much flexibility do we have to configure default code to direct the AI where we want it to go and grab what we want along the way? |
Auto-GPT should be able to see links on webpages it visits, as well as the existing GPT-3 powered summary, so that it can browse further than 1 page away from google.
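One way to make the links more useful to the agent is to return each link's anchor text alongside an absolute URL, so it can choose where to browse next. The following is a sketch under that assumption, using only the standard library. `extract_links` and `format_browse_result` are hypothetical helpers, not the project's `browse.scrape_links`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects (anchor text, absolute URL) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((text, self._href))
            self._href = None


def extract_links(html, base_url, limit=5):
    """Return up to `limit` (text, url) pairs in document order."""
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links[:limit]


def format_browse_result(summary, links):
    """Combine the summary and the labeled links into one result string."""
    lines = [f"- {text or '(no text)'}: {url}" for text, url in links]
    return "Website Content Summary: " + summary + "\n\nLinks:\n" + "\n".join(lines)
```

Taking the first `limit` links mirrors the existing five-link cap; ranking links by relevance to the current goal would be a separate design decision.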