langchain/document_loaders/web_base.py #8505
Comments
Answer generated by a 🤖

Answer

I understand the error you're encountering. The solution you've proposed looks reasonable. Here's how you can implement it:

    async def _fetch(self, url: str) -> str:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                if self.raise_for_status:
                    response.raise_for_status()
                body = await response.read()
                return body.decode('utf-8', errors='ignore')

This modification should resolve the issue you're facing. However, please note that this is one potential solution and there might be other ways to address this problem.

If you find this solution to be effective and believe that it could benefit other users who might encounter the same issue, I encourage you to make a pull request with this change. This way, the LangChain community can review and potentially incorporate your solution into the framework.

If this solution doesn't work as expected, or if you continue to encounter issues, I recommend checking if you're using the latest version of LangChain. Upgrading to the latest version might resolve your issue if it's been addressed in a more recent update.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
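One thing worth knowing before adopting `errors='ignore'`: it silently drops every byte that is not valid UTF-8, so characters can vanish from the page text. The sketch below compares it with `errors='replace'` on a made-up sample string (the sample is an assumption, chosen only because the micro sign is the single byte 0xb5 in Latin-1; the actual page content is not shown in this issue).

```python
# Hypothetical sample: "10 µm" encoded as Latin-1, so "µ" becomes the byte 0xb5.
raw = "10 µm".encode("latin-1")

# errors="ignore" drops the undecodable byte entirely.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD, so the loss stays visible in the text.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Which handler is preferable depends on the use case: for embeddings, a dropped character may be harmless, but `replace` makes it easier to spot pages that were decoded with the wrong encoding.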
Hi, @tabee! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is related to a UnicodeDecodeError.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!
System Info
langchain/document_loaders/web_base.py works for me only when I change:
with:
otherwise:
Unfortunately, the code produces an error:
/home/codespace/.python/current/bin/python3 /workspaces/b3rn_zero_ai/notebooks/ignite_vectorstore.py
Fetching pages: 13%|###8 | 33/256 [00:03<00:19, 11.18it/s]
Traceback (most recent call last):
  File "/workspaces/b3rn_zero_ai/notebooks/ignite_vectorstore.py", line 68, in <module>
    documents = loader.load()
  File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/sitemap.py", line 142, in load
    results = self.scrape_all([el["loc"].strip() for el in els if "loc" in el])
  File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 168, in scrape_all
    results = asyncio.run(self.fetch_all(urls))
  File "/home/codespace/.local/lib/python3.10/site-packages/nest_asyncio.py", line 35, in run
    return loop.run_until_complete(task)
  File "/home/codespace/.local/lib/python3.10/site-packages/nest_asyncio.py", line 90, in run_until_complete
    return f.result()
  File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 148, in fetch_all
    return await tqdm_asyncio.gather(
  File "/home/codespace/.python/current/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in gather
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
  File "/home/codespace/.python/current/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in <listcomp>
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
  File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 234, in __step
    result = coro.throw(exc)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
  File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 285, in __await__
    yield self  # This tells Task to wait for completion.
  File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 136, in _fetch_with_rate_limit
    return await self._fetch(url)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 120, in _fetch
    return await response.text()
  File "/home/codespace/.python/current/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1086, in text
    return self._body.decode(  # type: ignore[no-any-return,union-attr]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
Fetching pages: 15%|####4 | 38/256 [00:04<00:23, 9.25it/s]
Who can help?
No response
Information
Related Components
Reproduction
I tried to create embeddings from a French-language website.
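The same kind of error can be reproduced without any network access. This is only a sketch under an assumption: that the failing page was served in a single-byte encoding such as Latin-1 or Windows-1252, where 0xb5 is the micro sign. The issue does not state the real encoding, and the sample string and the helper name `try_decode_utf8` are made up for illustration.

```python
from typing import Optional


def try_decode_utf8(raw: bytes) -> Optional[UnicodeDecodeError]:
    """Return the UnicodeDecodeError raised by a strict UTF-8 decode, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


# Hypothetical page fragment: in Latin-1 the "µ" at index 11 is the byte 0xb5,
# which is not a valid first byte of any UTF-8 sequence.
raw = "Longueur 5 µm".encode("latin-1")

error = try_decode_utf8(raw)
print(error)
```

Pure ASCII input decodes fine, which may explain why only some of the 256 pages fail.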
Expected behavior
We need a solution for when this error occurs: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
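One possible shape for such a solution is a decode step that tries several encodings before giving up. The helper below is a hypothetical sketch, not LangChain's or aiohttp's API: the name `decode_with_fallback`, its parameters, and the choice of `cp1252` as a fallback are all assumptions.

```python
from typing import Optional, Sequence


def decode_with_fallback(
    body: bytes,
    declared: Optional[str] = None,
    fallbacks: Sequence[str] = ("utf-8", "cp1252"),
) -> str:
    """Decode body with the declared charset first, then each fallback in order."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers charset names Python does not recognize.
            continue
    # Last resort: keep the text and mark undecodable bytes with U+FFFD.
    return body.decode("utf-8", errors="replace")


# Hypothetical sample: Latin-1 bytes that are not valid UTF-8.
text = decode_with_fallback("Longueur 5 µm".encode("latin-1"))
print(text)
```

In a loader, the raw bytes would come from `await response.read()`, as in the change proposed in this thread, and `declared` could be the charset from the response headers if the server sends one.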