Integrates Google Scholar's citation count functionality, a websocket client for JabRef and other extensions/fixes #131

systemoperator · 2020-02-06T13:35:32Z

done


systemoperator · 2020-02-06T15:38:55Z

I am looking for a simple way to wait until all asynchronously fetched citation counts have arrived and been applied in the method zsc.processItems() in the file zsc_misc_post.js. The asynchronous call happens in zsc.retrieveCitationData(), which uses XMLHttpRequest. Suggestions are welcome.
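
One possible approach, as a minimal sketch: wrap each request in a Promise and wait for all of them. Here `fetchCount`, `promisify`, `processItemsConcurrently` and `processItemsSequentially` are hypothetical names for illustration, not the extension's actual API; `fetchCount` stands in for a Promise-returning version of zsc.retrieveCitationData().

```javascript
// Sketch only: `fetchCount` is a hypothetical stand-in for a Promise-returning
// version of zsc.retrieveCitationData(); it is not the extension's actual API.

// Turn a callback-style fetch, e.g. one that calls back from an
// XMLHttpRequest onload/onerror handler, into a function returning a Promise.
function promisify(callbackStyleFetch) {
  return (item) =>
    new Promise((resolve, reject) => {
      callbackStyleFetch(item, (error, count) => {
        if (error) {
          reject(error);
        } else {
          resolve(count);
        }
      });
    });
}

// Start all requests at once; resolve when every count has been stored.
async function processItemsConcurrently(items, fetchCount) {
  const counts = await Promise.all(items.map((item) => fetchCount(item)));
  items.forEach((item, index) => {
    item.citationCount = counts[index];
  });
  return items;
}

// Fetch one item after the other; slower, but requests never overlap.
async function processItemsSequentially(items, fetchCount) {
  for (const item of items) {
    item.citationCount = await fetchCount(item);
  }
  return items;
}
```

The sequential variant may be the safer choice if overlapping requests turn out to trigger more captchas; that is an assumption, not something measured here.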


systemoperator · 2020-02-06T23:55:24Z

Fetching and transferring citation counts to JabRef works now. :) Currently, the citation data is not fetched asynchronously, which probably has the benefit of reducing the number of captchas Google Scholar requires to show that one is not a robot.
We could implement this in the reverse direction as well: JabRef sends an item (or a list of items, e.g. those selected in the main table) to the JabRef-Browser-Extension, which sends back the corresponding citation count(s).

tobiasdiez · 2020-02-07T16:27:09Z

Thanks a lot for your PR. I really like the feature and think it's a valuable addition to the JabRef ecosystem. However, I'm not convinced that this should be baked into the browser extension. Instead, I would prefer it to be implemented directly in JabRef. That feels like the more flexible approach, which also covers other use cases (e.g. articles added by hand instead of via the browser). But I guess you have thought about this as well. Why did you choose to implement it as an addition to the browser extension?

systemoperator · 2020-02-08T09:58:30Z

@tobiasdiez

My main reasoning was as follows:

Most important aspect: browsers simply work for Google Scholar
Fetching Google Scholar data from a real browser simply works. Fetching it from anywhere else fails or barely works; apparently Google does quite a good job of ensuring that. JabRef already offers searching for articles on Google Scholar, but the feature is hardly usable because it fails most of the time. Honestly, I have already played with the idea of migrating the existing JabRef "Web Search" functionality for Google Scholar to the JabRef-Browser-Extension as well (that is, letting the browser extension query Google Scholar and return the results to JabRef), because it would be much more reliable. Google Scholar has measures in place to prevent automated queries, e.g. captchas that must be solved manually before queries succeed, so user interaction is needed. In my experience, a correctly solved captcha is somehow linked to the browser in which it was solved, which is another quite relevant issue. Once a captcha has been solved, quite a lot of queries are possible, at least if they are not too numerous or too frequent. The implementation within the JabRef-Browser-Extension meets all functional requirements: even if a captcha must be solved, the user gets an alert and can solve it, after which this browser is able to fetch the required information again.
Reusing code
Most of the current implementation has been used within another web extension, where it showed really reliable results. This made the implementation easier and faster, and updates to that extension can be merged directly into the JabRef-Browser-Extension if wanted.

I have also thought about the use case where citation counts of references within JabRef should be fetched or updated (see: JabRef/jabref#5849), which covers references imported via the browser extension as well as e.g. manually imported ones. Actually, I see this use case (triggering the fetching of citation counts from JabRef) as the primary one. For this case, my idea was to send a request (with one or several references) from JabRef to the JabRef-Browser-Extension, which then fetches the required data and sends it back to JabRef. If the user is required to solve a captcha, the user could be notified within JabRef. Opening a corresponding browser tab and loading a URL could probably also be possible/sensible.

The formal steps in JabRef would be:

Select the entries in the JabRef library whose citation counts should be fetched/updated (or possibly: if no entries are selected, fetch for the whole library)
Trigger fetching/updating process

Even if a headless web browser were used within JabRef, there would still be the challenge of solving Google Scholar's "not-a-robot" checks. Maybe it is possible to permanently "hold" a browser instance within JabRef for retrieving data from Google Scholar and to show a browser window whenever user interaction is required to solve a captcha; the user solves it and hopefully Google Scholar accepts it. But from a bird's-eye perspective, this is actually already quite similar to the current approach, right? So I thought the existing JabRef-Browser-Extension is a good starting point for integrating this. Special cases probably need special treatment. The cool thing about browser extensions is that they are really customizable.

I know the current design is probably a somewhat clumsy approach, but at least it works really well, and in the end guaranteeing functionality has the highest priority. In my opinion, this was the easiest way to go for now.

systemoperator · 2020-02-10T11:31:21Z

What do you think is the proper roadmap for this situation?

tobiasdiez · 2020-02-10T12:23:03Z

Those are some really good points that you brought up. Let me think about it for a bit... I'm pretty busy at the moment, sorry. I'll get back to you soon, promised!

systemoperator · 2020-02-13T23:03:56Z

Maybe relevant for decision process:

Report proper error when google fetcher hits limit jabref#4423
Google Scholar web search not opening results box jabref#4027
Google Scholar fetching not working jabref#2173
Exception in google scholar search jabref#5029
Display better error message if Google download limit is reached / Show Catpcha dialog jabref#1887
Google search issue in version 3.6 [Fixed in DevBuilds] jabref#1886
WebDrivers like Selenium are sometimes reported to be identified as automated web scraping processes, so more and more effort is necessary to prevent this.

tobiasdiez · 2020-02-15T10:05:52Z

So finally found some time to think about these matters. Thanks again for the detailed outline above which was good food for thought.

There are a few things of concern:

The communication in the direction JabRef > JabFox in the browser is not possible. (To be more precise, only JabFox can send requests to JabRef and wait for responses. So in principle you could let JabFox constantly poll JabRef, ask whether there is something to process, and then send back the result. But given all the problems we already have or had making the simple "JabFox to JabRef" communication work, this will be a lot of work and I would be surprised if it results in a stable solution.) Thus it is not possible to update citation counts from JabRef by using JabFox.
Fetching data from Google is prone to run into quota problems: Google will block further requests and ask for captcha input. This currently happens a lot with the web fetcher, as is apparent from the huge number of issues on these matters. We tried to speak with Google about this and establish a proper API or another solution, but this has led nowhere so far. Thus, I agree, using the citation fetch functionality in JabRef will run into similar problems.
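
For illustration only, the polling idea from the first point could look roughly like the sketch below. Everything in it is an assumption: `sendToJabRef` is a hypothetical function that sends one request to JabRef and returns a Promise of the reply, and the message shapes (`command`, `task`, `taskId`, `result`) are invented for this example.

```javascript
// Hypothetical sketch of the polling idea: JabFox asks JabRef for pending work.
// `sendToJabRef` and the message fields used here are assumptions for
// illustration; they do not describe the existing JabFox/JabRef protocol.

// Ask JabRef once whether there is a task; process it and report the result.
// Resolves to true if a task was handled, false if there was nothing to do.
async function pollOnce(sendToJabRef, processTask) {
  const reply = await sendToJabRef({ command: "getPendingTask" });
  if (!reply || !reply.task) {
    return false;
  }
  const result = await processTask(reply.task);
  await sendToJabRef({ command: "taskResult", taskId: reply.task.id, result });
  return true;
}

// Poll repeatedly with a pause between rounds; returns a function that stops
// the loop after the current round.
function startPolling(sendToJabRef, processTask, intervalMs) {
  let stopped = false;
  const loop = async () => {
    while (!stopped) {
      try {
        await pollOnce(sendToJabRef, processTask);
      } catch (error) {
        console.error("Polling JabRef failed:", error);
      }
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  };
  loop();
  return () => {
    stopped = true;
  };
}
```

Even in this reduced form, every round costs a request whether or not JabRef has work, which is part of why this route looks unattractive.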

Thus, it appears that there is no good way to fetch citation counts from Google. Using either JabFox or JabRef leads to nasty issues.

Proposal: Don't use Google. There are a few services that provide citation metadata through a freely available API, for example Microsoft Academic API, OpenCitations or Semantic Scholar. Personally, I would tend to use Semantic Scholar, as they seem to have the most extensive data coverage. These APIs can simply be consumed directly from JabRef.

The advantages of this approach are multiple: you have a stable API to consume, which makes it a reliable and future-proof solution; you get more metadata than just the citation count (e.g. it could be extended in the future to show a list of citing works, with easy ways to import them); and you don't need to worry about cross-communication JabRef <-> JabFox.

What do you think?

systemoperator · 2020-02-15T13:55:49Z

For now, I cannot respond in detail. I need to inspect the APIs in detail as well. Just some fragments to think about:

An API should make it possible to query references by the same criteria as Google Scholar (title, authors, year), since not every reference has a DOI or another identifier. (Furthermore, there are many different identifiers, which could be a problem as well.)
It is important to consider how complete other databases are in terms of actually finding entries.
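
To make the first point concrete, here is a minimal sketch of choosing lookup criteria per reference: use the DOI when one exists and is preferred, otherwise fall back to title, authors and year. The function name `buildLookupCriteria`, the field names and the returned object shape are all assumptions for illustration and are not tied to any particular API.

```javascript
// Hypothetical sketch: decide how to look up one reference.
// Field names (doi, title, authors, year) and the returned shape are
// assumptions for illustration only.
function buildLookupCriteria(item, preferDoi = true) {
  const doi = (item.doi || "").trim();
  if (preferDoi && doi) {
    return { type: "doi", doi };
  }

  const title = (item.title || "").trim();
  if (!title) {
    return null; // nothing usable to search with
  }

  const criteria = { type: "metadata", title };
  if (Array.isArray(item.authors) && item.authors.length > 0) {
    criteria.authors = item.authors;
  }
  const year = parseInt(item.year, 10);
  if (!Number.isNaN(year)) {
    criteria.year = year;
  }
  return criteria;
}
```

A service that only accepts identifiers could handle the `"doi"` case, while the `"metadata"` case needs a service that supports title/author/year search.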

Google Scholar

I have created a small project and made some tests with Selenium. Basically, it works better than I expected at the beginning. But at some point lots of captchas appear, which need to be solved. This depends on how intensively the service is used.
Just an idea: What if we start a small websocket server in JabRef and a websocket endpoint in JabFox for a fast, local (localhost), bidirectional connection?

systemoperator · 2020-02-16T10:46:35Z

I am in the process of adding a general structure for fetching various kinds of reference metadata. For now I will add Semantic Scholar for fetching citation counts, since it is very easy. Furthermore, I am also adding a small websocket server to JabRef for bidirectional communication between JabRef and JabFox, which will be much more stable. It will be used for fetching the citation counts from Google Scholar as well, since Google Scholar has the most accurate and most complete information. (For example, Semantic Scholar reports 3 citations for a reference where Google Scholar reports 33. Furthermore, Semantic Scholar requires an identifier such as a DOI, but not every reference has one. Additionally, Google Scholar finds entries where others don't. I am confident that this approach will work acceptably well if used properly and moderately, and it could be optional.) This websocket server can later be used for other communication purposes as well (e.g. to exchange additional information with JabFox or any other application).

tobiasdiez · 2020-02-18T12:00:37Z

Sounds really nice! Thanks for your work @systemoperator.

I agree that the data of Semantic Scholar is not yet on the same level as Google Scholar, but I hope they are getting there. As you said, they have a nice API.

Web sockets. The last time I looked at them, it was not possible to communicate from a browser extension via sockets (i.e. the web sockets API was not accessible). According to https://bugzilla.mozilla.org/show_bug.cgi?id=1247628 this is still the case. If I remember correctly, it might work for Chrome, but you need additional permissions. If you find a solution, that would be nice. This would make it possible to make progress on #32 and JabRef/jabref#5719

systemoperator · 2020-02-18T20:16:18Z

Currently, I start the websocket client as a background script (using Firefox). I have made some tests with it and it seems to meet all my requirements. At least, I could already send some test websocket messages to JabRef and receive some as well. :) I hope I will not find any pitfalls and I hope this meets all future requirements as well.
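
As a rough sketch of what such a background-script client can look like: incoming messages are parsed and routed to a handler by command name, and the reply is sent back over the same socket. The message format (`command`, `payload`), the function names `dispatchMessage` and `connect`, and the handler table are assumptions for illustration; they are not the actual wsClient.js. Only the standard browser `WebSocket` API is used.

```javascript
// Hypothetical sketch of a websocket client for a background script.
// The JSON message format ({ command, payload }) and all names here are
// assumptions for illustration, not the actual wsClient.js implementation.

// Parse one raw message and route it to the matching handler.
// Always returns a plain object that can be serialized and sent back.
function dispatchMessage(rawData, handlers) {
  let message;
  try {
    message = JSON.parse(rawData);
  } catch (error) {
    return { error: "invalid JSON" };
  }
  if (!message || typeof message.command !== "string") {
    return { error: "missing command" };
  }
  const handler = handlers[message.command];
  if (typeof handler !== "function") {
    return { error: "unknown command: " + message.command };
  }
  return { command: message.command, result: handler(message.payload) };
}

// Open the connection and answer every incoming message.
// `url` is supplied by the caller; no particular host or port is assumed.
function connect(url, handlers) {
  const socket = new WebSocket(url);
  socket.onmessage = (event) => {
    socket.send(JSON.stringify(dispatchMessage(event.data, handlers)));
  };
  socket.onclose = () => {
    console.log("Websocket connection closed");
  };
  return socket;
}
```

Keeping the routing in a plain function makes it possible to exercise the command handling without a running server.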


systemoperator · 2020-02-20T17:56:59Z

I don't understand how this code fragment creates the BibTeX data:

JabRef-Browser-Extension/connector.js

Lines 49 to 68 in 7072f37

    
    this.convertToBibTex = function(items) {
        this.prepareForExport(items);
        browser.runtime.sendMessage({
            "onConvertToBibtex": "convertStarted"
        });
        return browser.tabs.query({
            currentWindow: true,
            active: true
        }).then(tabs => {
            for (let tab of tabs) {
                return browser.tabs.sendMessage(
                    tab.id, {
                        convertToBibTex: items
                    }
                );
            }
        })
    }

Where and how does the conversion process take place?

tobiasdiez

Thanks for the follow-up! Codewise this looks good to me.

I'll have a look at your other PR @JabRef and then merge both at the same time.


systemoperator · 2020-05-13T10:42:33Z

The PR can be merged now; the ws client is disabled until JabRef's counterpart has been integrated. :) Is it possible to keep this branch so that I can continue using it?

tobiasdiez

Sorry for taking so long to come back to you.

I finally found the time to go through your PR again. I only have two small remarks concerning the code. Could you please take care of these and fix the merge conflicts? Then I will merge and release a new version. Thanks!


systemoperator · 2021-01-21T10:01:14Z

done :)


tobiasdiez · 2021-01-30T10:53:43Z

Many thanks again!

systemoperator added 6 commits February 5, 2020 18:13

IntelliJ configuration files added

6314bb1

package.json automatically updated

714daf7

zotero-scholar-citations added, web-ext-config.js: ignoreFiles extend…

236a2f1

…ed + startUrl extended, package.json: missing dependencies added

unmodified "item.js" copied from zotero-connectors/src/zotero/chrome/…

a53bdb4

…content/zotero/xpcom/data/item.js

current (reduced) version of item.js

2544514

refactoring, zsc_misc_pre.js and zsc_misc_post.js introduced, README.…

0f90960

…md extended, manifest.json extended, connector.js extended

systemoperator added 2 commits February 6, 2020 19:29

extension

2e6564d

bugfix; fetching and transferring citation count to JabRef works now;…

231b7df

… currently, the citation data is not fetched asynchronously (which probably reduces the amount of required captchas from Google Scholar to show that one is not a robot)

systemoperator changed the title ~~[WIP] Integrates google scholar's citation count functionality~~ Integrates google scholar's citation count functionality Feb 6, 2020

systemoperator requested a review from tobiasdiez February 7, 2020 00:01

systemoperator added 2 commits February 7, 2020 12:40

progress panel extended

3236b4e

refactoring progress panel, small improvements, options.html extended

e3175bb

websocket client skeleton added

0cf5b5a

systemoperator added 3 commits February 19, 2020 22:01

wsClient.js: extension, refactoring

d4f38b4

wsClient websocket: extension

ae0b60a

ZscItem.prototype.getField() improved (it allows processing the year …

7072f37

…and date field and can process various different date formats properly); handlerCmdFetchGoogleScholarCitationCounts() implemented, ...

setting _preferDoiForLookupIfExisting to false

d8c0027

tobiasdiez approved these changes Mar 18, 2020


fetching citation count is more reliable now by picking appropriate s…

58c3811

…ource

tobiasdiez mentioned this pull request May 12, 2020

Extension preferences seem to have no effect #172

Closed

systemoperator added 3 commits May 13, 2020 12:23

Merge remote-tracking branch 'upstream/master' into dev-zsc

42ea1ad

# Conflicts: # package-lock.json # package.json

packages updated

ad9b6c2

don't start websocket client, until JabRef's counterpart is integrated

7c034df

fix for fetching references when citation count is disabled

11b6fd4

tobiasdiez requested changes Jan 20, 2021

bibtexConverter.js

connector.js

data/options.js

tobiasdiez added the status:changes-required label Jan 20, 2021

systemoperator added 8 commits January 20, 2021 22:56

Merge remote-tracking branch 'upstream/master' into dev-zsc

71261c6

# Conflicts: # data/progressPanel.js # package-lock.json # package.json

update package-lock

0bb3816

small cleanup

74674d1

small cleanup

83af655

refactoring connector

f69e460

refactor options.js

e9c0ade

refactoring bibtex/biblatex export mode

331164b

minor cleanup

efe5e38

systemoperator added 3 commits January 21, 2021 11:15

readme updated

ca19144

readme: fix wording

306a875

vs code workspace added

cc6a14b

tobiasdiez reviewed Jan 21, 2021

package.json

systemoperator added 2 commits January 21, 2021 14:12

revert package-lock.json

f842599

minor cleanup

d1ba670

Base automatically changed from master to main January 24, 2021 19:03

Merge branch 'main' into dev-zsc

ef3ebdd

tobiasdiez merged commit 747433c into JabRef:main Jan 30, 2021
