is it possible to use playwright-stealth with the scrapy-playwright integration? #160
Comments
There was a PR about adding built-in support for a playwright-stealth Python port at #109. The PR was closed, but it is possible to use it after #128 as shown in #109 (comment). Perhaps this is enough for your case. The plugin you mention is the upstream JS version of the above port. As you said, it seems to be a replacement for the JS playwright package, which is in turn used by the Python version, but I don't have any plans to base this package on anything other than the official Python version of playwright. AFAICT the playwright-stealth Python port works by adding init scripts to pages; I don't know if the JS one does more. I'm not particularly well-versed in stealth techniques for browsers, and I don't think this package will be going in that direction in the future. Instead, I'd prefer to provide ways to interact with Playwright objects (such as |
@Chryron Were you able to make it work?

@elacuesta We tried the different approaches you provided in multiple issues, but none of them seems to work. We tried a simple JS file with the following (taken from the stealth plugin):

    // headless.js
    // replace Headless references in default useragent
    const current_ua = navigator.userAgent
    Object.defineProperty(Object.getPrototypeOf(navigator), 'userAgent', {
      get: () => opts.navigator_user_agent || current_ua.replace('HeadlessChrome/', 'Chrome/')
    })

With this code (only part of it; the full code is longer, but you get the idea):

    async def init_page(page, request):
        await page.add_init_script(path="./headless.js")  # not working
        # await stealth_async(page)  # not working with the stealth plugin

    class RandomCrawler(CrawlSpider):
        def start_requests(self):
            yield scrapy.Request(
                'https://httpbin.org/headers',
                meta={
                    'playwright': True,
                    'playwright_page_init_callback': init_page,
                },
            )

The user-agent returned is:
|
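One thing that stands out in the excerpt above is that `headless.js` reads `opts.navigator_user_agent`, yet `opts` is not defined anywhere in the file as shown. The following is a minimal sketch, not a confirmed fix, of a standalone variant that drops the `opts` lookup and passes the script inline. `HEADLESS_PATCH` and `FakePage` are hypothetical names introduced only for illustration; `FakePage` stands in for a real Playwright page so the callback can be exercised without a browser.

```python
import asyncio

# Standalone init script: unlike the excerpt above, it does not reference `opts`.
# Assumption: replacing "HeadlessChrome/" is the only change wanted here.
HEADLESS_PATCH = """
(() => {
  const currentUa = navigator.userAgent;
  Object.defineProperty(Object.getPrototypeOf(navigator), 'userAgent', {
    get: () => currentUa.replace('HeadlessChrome/', 'Chrome/'),
  });
})();
"""


async def init_page(page, request):
    """Callback intended for the `playwright_page_init_callback` meta key."""
    # Pass the script source inline instead of a relative file path.
    await page.add_init_script(script=HEADLESS_PATCH)


class FakePage:
    """Hypothetical stand-in for a Playwright page (illustration only)."""

    def __init__(self):
        self.init_scripts = []

    async def add_init_script(self, script=None, path=None):
        # Record what would have been registered on a real page.
        self.init_scripts.append(script)


if __name__ == "__main__":
    page = FakePage()
    asyncio.run(init_page(page, request=None))
    print(f"registered {len(page.init_scripts)} init script(s)")
```

Separately, https://httpbin.org/headers echoes the HTTP request headers, which a `navigator.userAgent` override in page JavaScript does not alter, so that endpoint may keep showing the original value even when the init script runs.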
@kinoute I got it to work with the playwright-stealth plugin when I tried using the init_page method. I wrote a simple spider (included below) to see if it returned different results on a few browser tests while toggling the stealth variable:

    from playwright_stealth import stealth_async
    from scrapy.spiders import CrawlSpider

    async def init_page(page, request):
        await stealth_async(page)

    class PlaywrightTester(CrawlSpider):
        stealth = True
        if stealth:
            screenshot = "enabled"
            meta = {
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_init_callback": init_page,
            }
        else:
            screenshot = "disabled"
            meta = {
                "playwright": True,
                "playwright_include_page": True,
            }
        name = "playwright-tester"
        start_urls = ["https://bot.sannysoft.com/"]
        custom_settings = {
            "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": False},
            "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,
        }

        async def parse(self, response):
            page = response.meta['playwright_page']
            input("Press Enter to continue...")
            await page.screenshot(path=f"sannysoft_{self.screenshot}.png", full_page=True)
            await page.goto("https://abrahamjuliot.github.io/creepjs/")
            await page.wait_for_timeout(20000)
            await page.screenshot(path=f"creepjs_{self.screenshot}.png", full_page=True)
            await page.goto("http://f.vision/")
            await page.wait_for_timeout(20000)
            await page.screenshot(path=f"fvision_{self.screenshot}.png", full_page=True)
            await page.goto("https://pixelscan.net/")
            await page.wait_for_timeout(20000)
            await page.screenshot(path=f"pixelscan_{self.screenshot}.png", full_page=True)
|
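The spider above builds two nearly identical `meta` dicts depending on the `stealth` flag. A minimal sketch of the same toggle as a small helper follows; `build_meta` is a hypothetical name, and the `init_page` body here is only a placeholder for the `await stealth_async(page)` call used above.

```python
def build_meta(stealth, init_callback):
    """Return request meta for scrapy-playwright, optionally with an init callback."""
    meta = {
        "playwright": True,
        "playwright_include_page": True,
    }
    if stealth:
        # Attach the page init callback only when stealth is enabled.
        meta["playwright_page_init_callback"] = init_callback
    return meta


async def init_page(page, request):
    # Placeholder: the spider above calls `await stealth_async(page)` here.
    pass


if __name__ == "__main__":
    print(sorted(build_meta(True, init_page)))
    print(sorted(build_meta(False, init_page)))
```

Inside a spider, the result could be passed as `scrapy.Request(url, meta=build_meta(self.stealth, init_page))` from `start_requests`, a step the snippet above does not show.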
@Chryron Thanks a lot, it works perfectly! |
I'm trying to get past some Cloudflare restrictions on a site with scrapy-playwright, and I was wondering if it is possible to use playwright-extras and the stealth plugin with this integration. The plugin is currently in beta (development here) and, from my limited understanding, serves as a drop-in replacement for regular playwright. I haven't used the original playwright much, and I was wondering if it would be possible to port some of the changes they've made over to the scrapy integration.