different results for same search, two years later #496
Comments
Hello @sofiatipa, I can only guess, but over two years it seems reasonable that the websites you crawled have changed quite a bit, which would logically produce different results today.
Hi Benjamin,
thanks for your swift response. I did something similar: I used web archives through web.archive.org, because the site I want to crawl can no longer be crawled from Hyphe (I don’t know why).
My guess is that this is the problem: something changed in the webpage I am trying to crawl that made it “un-crawlable” through Hyphe.
The problematic page is https://www.geopolitika.ru. I am trying to find the co-linkages with the page https://www.paulcraigroberts.org - it is a very simple search, even for me.
The search through web.archive.org does not give the original results, although it is from the same year and month as the original search (as you say, probably because archives are not always complete). This is becoming a big problem for my team, as we are about to send our findings for publication, but now we are in trouble. Please help!
I am a basic user of hyphe, so I don’t really know how to activate from an empty corpus (I can only see the pages I selected for the corpus in settings). Maybe I could pass you the project name & password to take a look?
All the best,
Sofia
Dr Sofia Tipaldou
Assistant Professor
Department of Political Science and History
Panteion University of Social and Political Sciences
Hello again, it looks like the Geopolitika.ru website takes quite an aggressive approach towards web crawlers: it refuses most robots through some (quite smart) methods, which apparently also block web.archive.org from archiving it (see for instance https://web.archive.org/web/20200417113623/https://www.geopolitika.ru/). Unfortunately, there is no way to make Hyphe work with this website as of today. You can, however, go back far enough in time to before they put those measures in place: just explore the web archives until you find a functional version and ask Hyphe to crawl at that date. For instance, I got a crawl working with more than 70 pages visited in 2018 by using this URL as the start point: https://web.archive.org/web/20180212120000/https://www.geopolitika.ru
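To make the dated start point concrete, here is a minimal sketch of how such an archive URL is put together. It only assumes the public URL pattern visible in the links above (`https://web.archive.org/web/<timestamp>/<original URL>`, with a 14-digit `YYYYMMDDhhmmss` timestamp); the helper name `wayback_url` is hypothetical and not part of Hyphe.

```python
from datetime import datetime

# Prefix taken from the archive links quoted in this thread.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build an archive URL asking for the snapshot closest to `when`.

    Hypothetical helper for illustration; it only formats a string
    and does not check that a snapshot actually exists.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit timestamp
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Rebuild the start point mentioned above (12 Feb 2018, 12:00:00).
start = wayback_url("https://www.geopolitika.ru", datetime(2018, 2, 12, 12, 0, 0))
print(start)
```

The resulting string can be pasted into Hyphe as the start page; moving the date earlier or later is just a matter of changing the `datetime` value until a functional snapshot turns up.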
Hi Benjamin, I have a new question to ask: the installed version of Hyphe stopped creating web entities out of some sites it previously crawled (in fact the last crawl was in August). I tried the crawl in the online demo version and it works perfectly. Any ideas why this might be happening with the desktop version? Also, is it possible that the number of pages Hyphe crawls may vary from one day to another? Many thanks! Sofia
Hello @sofiatipa, it's hard to tell without more information. But there's a priori no reason your desktop version of Hyphe would behave differently than the online demo. Did you try in a new corpus or in a preexisting one?
Hi Benjamin,
I hope you are receiving my answer from my email. I tried it in a new corpus, twice. Today I am trying again to run the crawl in that new corpus, but it is still unable to define the web entities. Meanwhile, the online version had already run the crawl in a few minutes.
What further information could I send you?
May I take the chance to ask you something else regarding the crawl depth, in light of a publication my team and I are working on: how could I better explain the ‘crawl depth’ to non-expert audiences?
Many thanks!
Sofia
Dr Sofia Tipaldou
Assistant Professor in International Relations
Department of Political Science and History
Panteion University of Social and Political Sciences
On 19 Sep 2024, at 17:17, Benjamin Ooghe-Tabanou wrote:
Hello @sofiatipa, it's hard to tell without more information. But there's a priori no reason your desktop version of Hyphe would behave differently than the online demo. Did you try in a new corpus or in a preexisting one?
I apologize Sofia, but I don't really understand what you mean by "unable to define the web entities". Could you explain precisely the steps you took and where you get stuck? It might be that your whole local Hyphe instance needs to be restarted; have you tried that? Regarding the crawl depth, you can present it as the number of links a user would click from the starting page.
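For readers who prefer a concrete picture of "number of clicks from the starting page", here is a minimal sketch. It is not Hyphe's actual crawler code: the link graph `toy_links` and the function `pages_within_depth` are made up for illustration, and the sketch simply does a breadth-first walk that stops following links once the chosen depth is reached.

```python
from collections import deque


def pages_within_depth(links, start, max_depth):
    """Return {page: clicks_from_start} for pages reachable in at most
    `max_depth` clicks, using a breadth-first traversal.

    `links` maps each page to the list of pages it links to
    (a hypothetical, hand-written graph, not real crawl data).
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(page, []):
            if target not in depth:  # keep the shortest click distance
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Toy website: the home page links to two pages, one of which links further.
toy_links = {
    "home": ["about", "news"],
    "news": ["article-1", "article-2"],
    "article-1": ["external-site"],
}

print(pages_within_depth(toy_links, "home", 1))
# {'home': 0, 'about': 1, 'news': 1}
print(pages_within_depth(toy_links, "home", 2))
```

With depth 1 only the pages linked directly from the start page are visited; raising the depth to 2 also brings in the two articles, but still not the page that is three clicks away.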
Hi,
I repeated a search I did nearly 2 years ago through Hyphe. I am trying to find the co-linkages between two web entities, but the results are quite different. The original search came up with 6 pages that were used by both sites, while the new search shows 3 different pages. Why is that happening? And is there any way to retrieve the original search from your online version?