Integrates Google Scholar's citation count functionality, a websocket client for JabRef and other extensions/fixes #131

Merged on Jan 30, 2021 (68 commits)

systemoperator
Contributor

@systemoperator systemoperator commented Feb 6, 2020

done

@systemoperator
Contributor Author

systemoperator commented Feb 6, 2020

I am looking for a very easy way to wait, in the method zsc.processItems() in the file zsc_misc_post.js, until all asynchronously requested citation counts have been fetched and updated. The asynchronous call happens in zsc.retrieveCitationData() -> XMLHttpRequest. Suggestions are welcome.
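
One common way to wait for several asynchronous requests is to wrap each one in a Promise and combine them with Promise.all. The following is a minimal sketch under stated assumptions: fetchCountAsPromise, fetchAllCounts and the callback signature (error, count) are hypothetical names for illustration and are not taken from zsc_misc_post.js.

```javascript
// Hypothetical wrapper: turns a callback-style fetch (such as one built on
// XMLHttpRequest) into a Promise that settles when the request finishes.
function fetchCountAsPromise(item, fetchWithCallback) {
  return new Promise((resolve, reject) => {
    fetchWithCallback(item, (error, count) => {
      if (error) {
        reject(error);
      } else {
        resolve({ item, count });
      }
    });
  });
}

// Resolves only after every item has been fetched. Results keep input order.
// All requests are started at once; nothing here throttles them.
function fetchAllCounts(items, fetchWithCallback) {
  return Promise.all(
    items.map(item => fetchCountAsPromise(item, fetchWithCallback))
  );
}
```

A caller would continue in fetchAllCounts(items, fetcher).then(results => ...). Promise.all rejects as soon as one request fails; Promise.allSettled may be a better fit if partial results are acceptable. Note that starting all requests at once might trigger captchas sooner than fetching one by one.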

… currently, the citation data is not fetched asynchronously (which probably reduces the amount of required captchas from Google Scholar to show that one is not a robot)
@systemoperator
Contributor Author

systemoperator commented Feb 6, 2020

Fetching and transferring citation counts to JabRef works now. :) Currently, the citation data is not fetched asynchronously (which probably has the benefit of reducing the number of captchas Google Scholar requires to show that one is not a robot).
We could implement this in the reverse direction as well: JabRef sends an item (or a list of items, e.g. those selected in the main table) to the JabRef-Browser-Extension, which sends back the corresponding citation count(s).

@systemoperator changed the title from "[WIP] Integrates google scholar's citation count functionality" to "Integrates google scholar's citation count functionality" on Feb 6, 2020
@tobiasdiez
Member

Thanks a lot for your PR. I really like the feature and think it's a valuable addition to the JabRef ecosystem. However, I'm not convinced that this should be baked into the browser extension. Instead, I would prefer it to be implemented directly in JabRef. That feels like the more flexible approach, and it also covers other use cases (e.g. articles added by hand instead of via the browser). But I guess you have thought about this too. Why did you choose to implement it as an addition to the browser extension?

@systemoperator
Contributor Author

systemoperator commented Feb 8, 2020

@tobiasdiez

My main reasoning was as follows:

  1. Most important aspect: browsers simply work for Google Scholar
    Fetching Google Scholar data from a real browser simply works; fetching it from anywhere else rarely does. Google apparently does a good job of ensuring that. JabRef already offers searching for articles on Google Scholar, but it is hardly usable because it fails most of the time. Honestly, I have already toyed with the idea of migrating JabRef's existing "Web Search" for Google Scholar to the JabRef-Browser-Extension as well (that is, letting the browser extension query Google Scholar and return the results to JabRef), because it would then be much more reliable. Google Scholar has measures in place to prevent automated queries, for example captchas, which must be solved manually before queries succeed. So user interaction is needed. In my experience, a correctly solved captcha is somehow tied to the browser in which it was solved, which is another relevant issue. Once one has been solved, quite a lot of queries are possible, as long as they are not too many or too frequent. The implementation within the JabRef-Browser-Extension meets all functional requirements: even if a captcha has to be solved, the user gets an alert, solves it, and the browser is then able to fetch the required information again.

  2. Reusing code
    Most of the current implementation has already been used in another web extension, where it showed really reliable results. The implementation process was easier and faster, and updates to that extension can be merged directly into the JabRef-Browser-Extension, if wanted.
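
The captcha handling described in point 1 (fetch, get interrupted by a captcha, let the user solve it, continue) could be sketched roughly as follows. This is an illustration only: fetchSequentially, fetchOne, waitForUser and the result shapes are hypothetical and do not reflect the extension's actual code or how Google Scholar signals a captcha.

```javascript
// Small helper: resolves after the given number of milliseconds.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Fetches counts one item at a time. `fetchOne` and `waitForUser` are
// hypothetical callbacks: `fetchOne(item)` resolves to { captcha: true }
// when a "not a robot" page comes back, or to { count: <number> } otherwise;
// `waitForUser()` resolves once the user has solved the captcha.
async function fetchSequentially(items, fetchOne, waitForUser, delayMs = 2000) {
  const counts = [];
  for (let i = 0; i < items.length; i += 1) {
    let result = await fetchOne(items[i]);
    while (result.captcha) {
      await waitForUser();               // e.g. alert the user and wait
      result = await fetchOne(items[i]); // retry the same item afterwards
    }
    counts.push(result.count);
    if (i < items.length - 1) {
      await sleep(delayMs);              // space requests out
    }
  }
  return counts;
}
```

The loop retries an item for as long as a captcha is reported, so a real implementation would probably want an upper bound on retries. The default delay of 2000 ms is an arbitrary placeholder, not a known safe rate.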

I have also thought about the use case where citation counts of references within JabRef should be fetched or updated (see: JabRef/jabref#5849), which covers references imported via the browser extension as well as, e.g., manually imported ones. Actually, I see this use case (triggering the fetching of citation counts from JabRef) as the primary one. For this case, my idea was to send a request (with one or several references) from JabRef to the JabRef-Browser-Extension, which then fetches the required data and sends it back to JabRef. If the user is required to solve a captcha, the user could be notified within JabRef. Opening a corresponding browser tab and loading a URL might also be possible/sensible.

The formal steps in JabRef would be:

  1. Select the entries in the JabRef library whose citation counts should be fetched/updated (or possibly: if no entries are selected, fetch for the whole library)
  2. Trigger fetching/updating process

Even if a headless web browser were used within JabRef, there would still be the challenge of solving Google Scholar's "not-a-robot" checks. Maybe it is possible to permanently "hold" a browser instance within JabRef for retrieving data from Google Scholar, and to show a browser window whenever user interaction is required to solve a captcha, in the hope that Google Scholar accepts the solution. But from a bird's-eye perspective, this is actually already quite similar to the current approach, right? So I thought the existing JabRef-Browser-Extension would be a good starting point for integrating this. Probably, special cases need special treatment. The cool thing about browser extensions is that they are really customizable.

I know the current design is probably a somewhat clumsy approach, but at least it works really well, and in the end guaranteeing functionality has the highest priority. In my opinion, this was the easiest way to go for now.

@systemoperator
Contributor Author

What do you think is the proper roadmap for this situation?

@tobiasdiez
Member

Those are some really good points that you brought up. Let me think about it for a bit... I'm pretty busy at the moment, sorry. I'll get back to you soon, promised!

@systemoperator
Contributor Author

systemoperator commented Feb 13, 2020

@tobiasdiez
Member

So I finally found some time to think about these matters. Thanks again for the detailed outline above, which was good food for thought.

There are a few things of concern:

  1. Communication in the direction JabRef > JabFox (in the browser) is not possible. (To be more precise, only JabFox can send requests to JabRef and wait for responses. So in principle you could let JabFox constantly poll JabRef, ask whether there is something to process, and then send back the result. But given all the problems we have or had making the simple "JabFox to JabRef" communication work, this would be a lot of work, and I would be surprised if it resulted in a stable solution.) Thus it is not possible to update citation counts from JabRef by using JabFox.
  2. Fetching data from Google is prone to quota problems: Google will block further requests and ask for captcha input. This currently happens a lot with the web fetcher, as is apparent from the huge number of issues on these matters. We tried to talk to Google about this and establish a proper API or another solution, but this has led nowhere so far. Thus, I agree, implementing the citation fetch functionality in JabRef will run into similar problems.

Thus, it appears that there is no good way to fetch citation counts from Google. Using either JabFox or JabRef leads to nasty issues.

Proposal: Don't use Google. There are a few services that provide citation metadata through a freely available API, for example the Microsoft Academic API, OpenCitations, or Semantic Scholar. Personally, I would tend to use Semantic Scholar, as they seem to have the most extensive data coverage. These APIs can simply be consumed directly from JabRef.
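
To illustrate what consuming such an API might look like, here is a small sketch of a citation-count lookup by DOI. It is written in JavaScript for illustration, although the proposal is to call the API from JabRef itself. The URL shape follows the public Semantic Scholar Graph API as an assumption and should be checked against the current documentation; citationCountUrl, parseCitationCount, fetchCitationCount and fetchJson are hypothetical names.

```javascript
// Builds the lookup URL for one DOI. The endpoint shape is an assumption
// based on the public Semantic Scholar Graph API; verify before relying on it.
// encodeURI keeps the "/" inside the DOI intact and escapes spaces.
function citationCountUrl(doi) {
  return 'https://api.semanticscholar.org/graph/v1/paper/DOI:' +
    encodeURI(doi) + '?fields=citationCount';
}

// Extracts the count from a parsed response, or null when it is missing.
function parseCitationCount(body) {
  const count = body && body.citationCount;
  return Number.isInteger(count) ? count : null;
}

// `fetchJson` is a hypothetical function (url) => Promise<object>, injected
// so the sketch does not depend on a particular HTTP client.
async function fetchCitationCount(doi, fetchJson) {
  const body = await fetchJson(citationCountUrl(doi));
  return parseCitationCount(body);
}
```

In a browser context, fetchJson could be as simple as url => fetch(url).then(response => response.json()). Rate limits and error responses are not handled here.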

The advantages of this approach are multiple: you have a stable api to consume, which makes it a reliable and future proof solution, you get more metadata than just the citation count (e.g. it could be extended in the future to actually show a list of citing works providing easy ways to import them etc) and you don't need to worry about cross-communication JabRef <-> JabFox.

What do you think?

@systemoperator
Contributor Author

systemoperator commented Feb 15, 2020

For now, I cannot respond in detail. I need to inspect the APIs in detail as well. Just some fragments to think about:

  • It would be good if an API could be queried with the same criteria as Google Scholar (title, authors, year), since not every reference has a DOI or another identifier. (Furthermore, there are many different identifiers, which could be a problem as well.)
  • It is important to consider how complete other databases are in terms of actually finding entries.
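
The first point above (use an identifier when one exists, otherwise fall back to title, authors and year) could be sketched as a small lookup builder. The field names doi, title, author and year and the returned shape are illustrative assumptions, not taken from JabRef or from any particular API.

```javascript
// Picks a lookup strategy for one entry. Field names (doi, title, author,
// year) are illustrative and not taken from JabRef or any API.
function buildLookup(entry) {
  const doi = String(entry.doi || '').trim();
  if (doi !== '') {
    // Prefer an identifier when the entry has one.
    return { by: 'identifier', value: 'DOI:' + doi };
  }
  // Otherwise fall back to a free-text query from the available metadata.
  const parts = [entry.title, entry.author, entry.year]
    .filter(part => part !== undefined && part !== null)
    .map(part => String(part).trim())
    .filter(part => part !== '');
  return { by: 'metadata', value: parts.join(' ') };
}
```

A free-text query can match the wrong paper, so a real implementation would likely need to compare the returned title and year against the entry before trusting the count.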

Google Scholar

  • I have created a small project and run some tests with Selenium. Basically, it works better than expected at the beginning. But at some point lots of captchas appear, which need to be solved. This depends on how intensively the service is used.
  • Just an idea: What if we start a small websocket server in JabRef and a websocket endpoint in JabFox for a fast, local (localhost), bidirectional connection?

@systemoperator
Contributor Author

systemoperator commented Feb 16, 2020

I am in the process of adding a general structure for fetching various kinds of reference metadata. For now I will add Semantic Scholar for fetching citation counts, since it is very easy. Furthermore, I am also adding a small websocket server in JabRef for bidirectional communication between JabRef and JabFox, which will be much more stable. It will also be used for fetching citation counts from Google Scholar, since Google Scholar has the most accurate and most complete information. (For example, Semantic Scholar reports 3 citations for one reference where Google Scholar reports 33. Furthermore, Semantic Scholar requires an identifier such as a DOI, but not every reference has one. Additionally, Google Scholar finds entries where others don't. I am confident that this approach will work acceptably well if used properly and moderately, and it could be optional.) This websocket server can later be used for other communication purposes as well (e.g. to exchange additional information with JabFox or any other application).
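
A bidirectional channel like this usually needs some agreed message format so that either side can send commands and receive results. The sketch below shows one possible shape on the extension side. The JSON layout with cmd and payload, the ".result" suffix and the name dispatchMessage are assumptions made for illustration; the thread does not specify the actual protocol.

```javascript
// Hypothetical message format: JSON text {"cmd": "...", "payload": ...}.
// `handlers` maps command names to (possibly async) functions that take the
// payload and return the reply payload. Resolves to the reply as JSON text.
async function dispatchMessage(rawText, handlers) {
  let message;
  try {
    message = JSON.parse(rawText);
  } catch (error) {
    return JSON.stringify({ cmd: 'error', payload: 'invalid JSON' });
  }
  const cmd = message && message.cmd;
  const known = typeof cmd === 'string' &&
    Object.prototype.hasOwnProperty.call(handlers, cmd);
  if (!known) {
    return JSON.stringify({ cmd: 'error', payload: 'unknown command' });
  }
  const result = await handlers[cmd](message.payload);
  return JSON.stringify({ cmd: cmd + '.result', payload: result });
}
```

With this shape, adding a new command only means adding an entry to the handlers object, which fits the idea of reusing the same connection for other purposes later.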

@tobiasdiez
Member

Sounds really nice! Thanks for your work @systemoperator.

I agree that the data of semantic scholar is not yet on the same level as google scholar, but I hope they are getting there. As you said, they have a nice API.

Web sockets. The last time I looked at them, it was not possible to communicate from a browser extension via sockets (i.e. the web sockets API is not accessible). According to https://bugzilla.mozilla.org/show_bug.cgi?id=1247628 this is still the case. If I remember correctly for chrome it might work but you need additional permissions. If you find a solution, that would be nice. This would make it possible to make progress on #32 and JabRef/jabref#5719

@systemoperator
Contributor Author

systemoperator commented Feb 18, 2020

Currently, I start the websocket client as a background script (using Firefox). I have made some tests with it and it seems to meet all my requirements. At least, I could already send some test websocket messages to JabRef and receive some as well. :) I hope I will not find any pitfalls and I hope this meets all future requirements as well.
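
For readers unfamiliar with the setup, a websocket client in a background script can be quite small. The following sketch uses the standard browser WebSocket interface and reconnects when the connection closes, for example because JabRef is not running. The function name connect, the option names, the retry delay and the port in the usage note are hypothetical; this is not the code from the PR.

```javascript
// Opens a WebSocket and reconnects after a delay when it closes.
// `SocketCtor` defaults to the browser's WebSocket; it is injectable so the
// sketch can be exercised without a real server. All names are illustrative.
function connect(url, onText, { SocketCtor = WebSocket, retryMs = 5000 } = {}) {
  const socket = new SocketCtor(url);
  // Pass each incoming text frame on, with a function for sending a reply.
  socket.onmessage = event => onText(event.data, text => socket.send(text));
  // A failed or dropped connection ends in a close event; try again later.
  socket.onclose = () => {
    setTimeout(() => connect(url, onText, { SocketCtor, retryMs }), retryMs);
  };
  return socket;
}
```

A background script might call connect('ws://localhost:12345', (data, reply) => reply(data)), where the port is a placeholder. Retrying forever at a fixed interval is the simplest choice; a backoff or a user-visible "disconnected" state may be preferable in practice.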

…and date field and can process various different date formats properly); handlerCmdFetchGoogleScholarCitationCounts() implemented, ...
@systemoperator
Contributor Author

I don't understand how this code fragment creates the BibTeX data:

    this.convertToBibTex = function(items) {
        this.prepareForExport(items);
        browser.runtime.sendMessage({
            "onConvertToBibtex": "convertStarted"
        });
        return browser.tabs.query({
            currentWindow: true,
            active: true
        }).then(tabs => {
            for (let tab of tabs) {
                return browser.tabs.sendMessage(
                    tab.id, {
                        convertToBibTex: items
                    }
                );
            }
        })
    }

Where and how does the conversion process take place?
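
As general background on the WebExtension messaging API: browser.tabs.sendMessage delivers its message to browser.runtime.onMessage listeners registered by scripts running in that tab, so in code of this shape the work would normally happen in such a listener rather than in the fragment itself. The sketch below shows what a receiving listener can look like. It is not the extension's actual converter: makeConvertListener, the placeholder convert function and item.key are hypothetical.

```javascript
// Returns a listener suitable for browser.runtime.onMessage in a content
// script. `convert` is a hypothetical function (items) => BibTeX string; it
// stands in for whatever the extension really uses.
function makeConvertListener(convert) {
  return message => {
    if (message && message.convertToBibTex) {
      // Returning a Promise sends its resolved value back to the sender,
      // i.e. it becomes the result of browser.tabs.sendMessage(...).
      return Promise.resolve(convert(message.convertToBibTex));
    }
    return undefined; // not our message; let other listeners handle it
  };
}

// Register only where the WebExtension API exists (inside the extension).
if (typeof browser !== 'undefined' && browser.runtime) {
  browser.runtime.onMessage.addListener(
    // Placeholder converter for illustration only.
    makeConvertListener(items => items.map(item => '@misc{' + item.key + '}').join('\n'))
  );
}
```

Searching the code base for a listener that checks for the convertToBibTex key would be one way to locate the real conversion step.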

Member

@tobiasdiez tobiasdiez left a comment

Thanks for the follow-up! Codewise this looks good to me.

I'll have a look at your other PR @JabRef and then merge both at the same time.

@systemoperator
Contributor Author

systemoperator commented May 13, 2020

The PR can be merged now; the ws client is disabled until JabRef's counterpart has been integrated. :) Is it possible to keep this branch, so that I can continue using it?

Member

@tobiasdiez tobiasdiez left a comment

Sorry for taking so long to come back to you.

I finally found the time to go through your PR again. I only have two small remarks concerning the code. Could you please take care of these and fix the merge conflicts? Then I will merge and release a new version. Thanks!

Review comments (outdated, resolved) on: bibtexConverter.js, connector.js, data/options.js
@systemoperator
Contributor Author

done :)

Base automatically changed from master to main January 24, 2021 19:03
@tobiasdiez tobiasdiez merged commit 747433c into JabRef:main Jan 30, 2021
@tobiasdiez
Member

Many thanks again!

2 participants