urlwatch 2.25-1 on Debian Stable 12.5 (navigate fails) #809

Open
jpiszcz opened this issue Mar 21, 2024 · 5 comments

Comments

@jpiszcz

jpiszcz commented Mar 21, 2024

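More and more websites require JavaScript to render their content, so obtaining diffs needs a browser job. I am running urlwatch 2.25-1 on Debian stable 12.5, where the browser job fails with the traceback below.

What is the proper way to fix this, and which option is best for tracking changes in pages that require JavaScript?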

Also logged a bug with Debian:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1067470

$ urlwatch --test-filter 1
Exception while releasing resources for job: <browser navigate='https://support.wyze.com/hc/en-us/articles/360015979872-Service-Status-Known-Issues' name='Wyze Service Status & Known Issues' filter=['html2text', 'striplines']>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urlwatch/command.py", line 139, in test_filter
    raise job_state.exception
  File "/usr/lib/python3/dist-packages/urlwatch/handler.py", line 68, in __enter__
    self.job.main_thread_enter()
  File "/usr/lib/python3/dist-packages/urlwatch/jobs.py", line 406, in main_thread_enter
    from .browser import BrowserContext
  File "/usr/lib/python3/dist-packages/urlwatch/browser.py", line 42, in <module>
    class BrowserLoop(object):
  File "/usr/lib/python3/dist-packages/urlwatch/browser.py", line 49, in BrowserLoop
    @asyncio.coroutine
     ^^^^^^^^^^^^^^^^^
AttributeError: module 'asyncio' has no attribute 'coroutine'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urlwatch/handler.py", line 78, in __exit__
    self.job.main_thread_exit()
  File "/usr/lib/python3/dist-packages/urlwatch/jobs.py", line 410, in main_thread_exit
    self.ctx.close()
    ^^^^^^^^
AttributeError: 'BrowserJob' object has no attribute 'ctx'
Traceback (most recent call last):
  File "/usr/bin/urlwatch", line 33, in <module>
    sys.exit(load_entry_point('urlwatch==2.25', 'console_scripts', 'urlwatch')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/urlwatch/cli.py", line 112, in main
    urlwatch_command.run()
  File "/usr/lib/python3/dist-packages/urlwatch/command.py", line 431, in run
    self.handle_actions()
  File "/usr/lib/python3/dist-packages/urlwatch/command.py", line 231, in handle_actions
    sys.exit(self.test_filter(self.urlwatch_config.test_filter))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/urlwatch/command.py", line 139, in test_filter
    raise job_state.exception
  File "/usr/lib/python3/dist-packages/urlwatch/handler.py", line 68, in __enter__
    self.job.main_thread_enter()
  File "/usr/lib/python3/dist-packages/urlwatch/jobs.py", line 406, in main_thread_enter
    from .browser import BrowserContext
  File "/usr/lib/python3/dist-packages/urlwatch/browser.py", line 42, in <module>
    class BrowserLoop(object):
  File "/usr/lib/python3/dist-packages/urlwatch/browser.py", line 49, in BrowserLoop
    @asyncio.coroutine
     ^^^^^^^^^^^^^^^^^
AttributeError: module 'asyncio' has no attribute 'coroutine'. Did you mean: 'coroutines'?

@mwerlen
Contributor

mwerlen commented Mar 22, 2024

Hi,

Your problem is linked to the Python 3.11 upgrade: the asyncio.coroutine decorator used in urlwatch's browser.py was removed in Python 3.11. This has been fixed in urlwatch 2.27, as explained in the changelog.

You can either:

  • use urlwatch >= 2.27 by upgrading manually (with pip)
  • use Python < 3.11 specifically for urlwatch (it may still be installed; just point to a python3.10 binary)
  • use the latest urlwatch version by pointing to the Debian sid repo for the urlwatch package.
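
For context, the AttributeError in the traceback comes from the generator-based coroutine decorator, which Python 3.11 no longer provides. Here is a minimal sketch of the difference. It is not urlwatch's actual code, and fetch_placeholder is a hypothetical name used only for illustration.

```python
import asyncio
import sys


def legacy_decorator_available() -> bool:
    """Return True if this interpreter still ships asyncio.coroutine."""
    return hasattr(asyncio, "coroutine")


# Modern replacement for a function that used to be written as
#   @asyncio.coroutine
#   def fetch_placeholder(): yield from ...
async def fetch_placeholder() -> str:
    # Hypothetical stand-in for real asynchronous work.
    await asyncio.sleep(0)
    return "ok"


result = asyncio.run(fetch_placeholder())
print(
    f"Python {sys.version_info.major}.{sys.version_info.minor}:",
    "asyncio.coroutine available =",
    legacy_decorator_available(),
)
print("native coroutine returned:", result)
```

Running this with the interpreter that urlwatch uses shows quickly whether the old decorator is still available there.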

@jpiszcz
Author

jpiszcz commented Mar 22, 2024

Thank you! I pulled the latest urlwatch from GitHub and installed Playwright, and it seems to work now. However, for sites protected by Cloudflare or another CDN, is there an option in urlwatch to get past the check?

$ urlwatch
....
Verifying you are human. This may take a few seconds.support.wyze.com needs to review the security of your connection before proceeding.Verification successfulWaiting for support.wyze.com to respond...Enable JavaScript and cookies to continue
...
This may take a few seconds.camelcamelcamel.com needs to review the security of your connection before proceeding.Verification successfulWaiting for camelcamelcamel.com to respond...
...

@Jamstah
Contributor

Jamstah commented Mar 22, 2024

Waiting is something raised in #763 - it would be good to be able to wait for a specific selector.
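
As a rough illustration of what waiting for a selector looks like in plain Playwright (outside urlwatch), here is a minimal sketch. The helper name text_after_selector, the example URL, and the "h1" selector are assumptions for illustration only; this is not an existing urlwatch option.

```python
def text_after_selector(page, url: str, selector: str, timeout_ms: float = 30_000) -> str:
    """Navigate to url, block until selector is visible, then return its text.

    `page` is expected to be a playwright.sync_api.Page (or any object with
    the same goto / wait_for_selector / inner_text methods).
    """
    page.goto(url)
    # Raises Playwright's TimeoutError if the element does not appear in time.
    page.wait_for_selector(selector, timeout=timeout_ms)
    return page.inner_text(selector)


def main() -> None:
    # Imported lazily so the helper above can be used and tested without
    # a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        # Placeholder URL and selector; replace with the page being watched.
        print(text_after_selector(page, "https://example.com/", "h1"))
        browser.close()
```

Waiting on an element that only exists in the real content, rather than on a fixed delay, means an interstitial page would lead to a timeout instead of being diffed as if it were the content.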

@nille02
Contributor

nille02 commented Jan 20, 2025

You could install playwright-stealth and edit jobs.py (the relevant code is at the bottom of that file). It only needs the import from playwright_stealth import stealth_sync, and after page = browser.new_page(user_agent=self.useragent) you add stealth_sync(page).

e.g.

    def retrieve(self, job_state):
        from playwright.sync_api import sync_playwright
        # Added: requires the playwright-stealth package to be installed
        from playwright_stealth import stealth_sync
        with sync_playwright() as playwright:
            browser = playwright[self.browser or "chromium"].launch()
            page = browser.new_page(user_agent=self.useragent)
            # Added: apply the stealth tweaks before the page navigates
            stealth_sync(page)
            # (rest of the method unchanged)

playwright-stealth just sets options so that the headless browser does not look like a headless browser.
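
To see what "looks like a headless browser" can mean in practice, here is a small sketch that reads two properties bot checks are commonly said to inspect. The helper name automation_signals and the choice of these two properties are assumptions for illustration. It does not reflect exactly what playwright-stealth changes or what Cloudflare checks.

```python
def automation_signals(page) -> dict:
    """Collect two browser properties that bot checks commonly inspect.

    `page` is expected to be a playwright.sync_api.Page (or any object with
    a compatible evaluate method).
    """
    return {
        # True when the browser reports that it is driven by automation.
        "webdriver": bool(page.evaluate("() => navigator.webdriver")),
        # Headless Chromium builds may advertise "HeadlessChrome" here.
        "headless_user_agent": "HeadlessChrome"
        in page.evaluate("() => navigator.userAgent"),
    }


def main() -> None:
    # Imported lazily so the helper above works without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        # Call stealth_sync(page) before this line to compare the output.
        print(automation_signals(page))
        browser.close()
```

Comparing the output with and without the stealth call is one way to check whether the patch is actually being applied.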

@jpiszcz
Author

jpiszcz commented Jan 26, 2025

I will look into this, thanks!
