Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for arxiv identifier in ArxivAPIWrapper() #9318

Merged
merged 26 commits into from
Sep 28, 2023
Merged

feat: add support for arxiv identifier in ArxivAPIWrapper() #9318

merged 26 commits into from
Sep 28, 2023

Conversation

LMC117
Copy link
Contributor

@LMC117 LMC117 commented Aug 16, 2023

  • Description: this PR adds the support for arxiv identifier of the ArxivAPIWrapper. I modified the run() and load() functions in arxiv.py, using regex to recognize if the query is in the form of arxiv identifier (see https://info.arxiv.org/help/find/index.html). If so, it will directly search the paper corresponding to the arxiv identifier. I also modified and added tests in test_arxiv.py.
  • Issue: ArxivLoader support searching by arxiv id_list #9047
  • Dependencies: N/A
  • Tag maintainer: N/A

@vercel
Copy link

vercel bot commented Aug 16, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Sep 28, 2023 0:04am

@dosubot dosubot bot added the 🤖:improvement Medium size change to existing code to handle new use-cases label Aug 16, 2023
@baskaryan
Copy link
Collaborator

lgtm, cc @leo-gan

@leo-gan
Copy link
Collaborator

leo-gan commented Aug 16, 2023

I've tried
loader = ArxivLoader(query="1605.08386 2308.07912 2308.07910", load_max_docs=3)
and it returns all 3 papers.
Disclaimer: if IDs are separated by , it didn't work! Only the space separator works.

I don't think we need special treatment for multiple paper IDs because it works right now.

@LMC117
Copy link
Contributor Author

LMC117 commented Aug 17, 2023

Hi,

If the version number is specified in the query (e.g. 2212.00794v2), no results will be returned. So I think there is a need to handle the arxiv identifier separately.

@leo-gan
Copy link
Collaborator

leo-gan commented Aug 17, 2023

If the version number is specified in the query (e.g. 2212.00794v2), no results will be returned. So I think there is a need to handle the arxiv identifier separately.

OK. Then, please add unit tests to work with Ids.

@LMC117
Copy link
Contributor Author

LMC117 commented Aug 18, 2023

Hi @leo-gan, I've committed new unit tests. You can check that

Copy link
Collaborator

@leo-gan leo-gan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!
LGTM

@@ -54,6 +58,14 @@ class ArxivAPIWrapper(BaseModel):
load_all_available_meta: bool = False
doc_content_chars_max: Optional[int] = 4000

def is_arxiv_identifier(self, query: str) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a few simple unit tests for this method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I will add it soon

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LMC117 Looks great! LGTM

@baskaryan
Copy link
Collaborator

thanks @LMC117!

@baskaryan baskaryan added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Aug 24, 2023
@leo-gan
Copy link
Collaborator

leo-gan commented Sep 21, 2023

@LMC117 Hi , could you, please, resolve the merging issues? After that ping me and I push this PR for the review. Thanks!

@LMC117
Copy link
Contributor Author

LMC117 commented Sep 22, 2023

@leo-gan hi I resolved that the merging issue

).results()
if self.is_arxiv_identifier(query):
results = self.arxiv_search(
id_list=query[: self.ARXIV_MAX_QUERY_LENGTH].split(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we dont want to do self.ARXIV_MAX_QUERY_LENGTH right?

@hwchase17 hwchase17 merged commit 05b75f3 into langchain-ai:master Sep 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:improvement Medium size change to existing code to handle new use-cases lgtm PR looks good. Use to confirm that a PR is ready for merging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants