2. Login issues
Since some of my content is set to private and can't be set to public, I wanted to log in with my user so I could scrape all of my content. Since I couldn't get the Google Auth to work in Chromium for some reason, I figured it was best to just inject my Firebase user after the page was loaded.
I found some JS code at #11164 that allowed me to achieve what I wanted after some modifications, but IMO this approach is way too hacky, and I'd expect out-of-the-box support for this in Playwright.
I'll probably create a demo repo of my finished project in the very near future, after I've cleaned up my code, but for the time being here are some snippets that allowed me to log in correctly on Chromium.
3. Snippets
3.1 Dump Firebase user to JSON
First, I extracted my Firebase user by copy-pasting the following script into the console of the website I'm trying to scrape:
3.2 Adding the correct user info after first removing the wrong info
This is the JavaScript code that's injected when the page is loaded in my request handler in __main__.py. For the time being it's just a test string in __main__.py, but it will be moved into a separate .js file with the code and a .json file with the login data.
It's an adaptation of the code from the previous comment by OVO-Josh.
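As a rough illustration of the planned split into a .js file and a .json file, here is a minimal Python sketch that combines the two into the string passed to the page. It is not the code from this issue: the template, the `__USER_RECORD__` placeholder, and the helper names `build_add_user` and `load_add_user` are hypothetical, and the IndexedDB database and store names (`firebaseLocalStorageDb`, `firebaseLocalStorage`) are assumptions about where the Firebase web SDK keeps its auth state, so check them in the browser's dev tools before relying on them.

```python
import json
from pathlib import Path

# Hypothetical JS template; "__USER_RECORD__" is filled in from Python.
# The database and store names are assumptions, not taken from the issue.
ADD_USER_TEMPLATE = """
() => new Promise((resolve, reject) => {
  const request = indexedDB.open("firebaseLocalStorageDb");
  request.onerror = () => reject(request.error);
  request.onsuccess = () => {
    try {
      const db = request.result;
      const tx = db.transaction("firebaseLocalStorage", "readwrite");
      const store = tx.objectStore("firebaseLocalStorage");
      store.clear();               // remove the wrong user info
      store.put(__USER_RECORD__);  // add the saved user record
      tx.oncomplete = () => { db.close(); resolve(true); };
      tx.onerror = () => reject(tx.error);
    } catch (error) {
      reject(error);               // e.g. the object store does not exist yet
    }
  };
})
"""


def build_add_user(template: str, user_record: dict) -> str:
    """Return the JS source with the user record embedded as a JSON literal."""
    return template.replace("__USER_RECORD__", json.dumps(user_record))


def load_add_user(js_path: Path, json_path: Path) -> str:
    """Read the .js template and the .json login data, then combine them."""
    template = js_path.read_text(encoding="utf-8")
    user_record = json.loads(json_path.read_text(encoding="utf-8"))
    return build_add_user(template, user_record)
```

The returned string can be handed to `page.evaluate(...)` in the same way as the test string. Embedding the record with `json.dumps` keeps the login data out of the JS file itself.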
3.3 Load page from Python
Here's my Python request handler in __main__.py, where I actually inject the JS after first waiting until the DOM has loaded:
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    context.log.info(f"Processing {context.request.url} ...")
    page = context.page
    # Wait until the content I'm interested in is loaded
    await page.wait_for_selector(selector)
    # Update the users in the Firebase DB with the correct value
    # add_user is the above JS code that's injected
    await page.evaluate(add_user)
_____
Example
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    context.log.info(f"Processing {context.request.url} ...")
    page = context.page
    # Wait until the content I'm interested in is loaded
    await page.wait_for_selector(selector)
    # Proposed API: update the user in the Firebase DB with the correct value
    await page.authenticate(firebase_user)
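Until something like this exists, the call shape of the example can be approximated with a small helper around `page.evaluate`, which accepts an optional argument that is passed to the JS function. The sketch below is only an assumption about how such a wrapper might look: `authenticate`, `SupportsEvaluate`, and the `script` parameter are hypothetical names, and `script` is expected to be a JS function that takes the user record as its argument, writes it to the Firebase storage, and returns a truthy value.

```python
import asyncio
from typing import Any, Protocol


class SupportsEvaluate(Protocol):
    """The single Playwright Page method this sketch relies on."""

    async def evaluate(self, expression: str, arg: Any = None) -> Any: ...


async def authenticate(page: SupportsEvaluate, firebase_user: dict, script: str) -> None:
    """Hypothetical stand-in for the proposed page.authenticate(firebase_user).

    `script` is a JS function such as "(record) => ..." that stores the record
    and returns a truthy value on success.
    """
    stored = await page.evaluate(script, firebase_user)
    if not stored:
        raise RuntimeError("The injected script did not confirm that the user was stored")
```

Inside the request handler this would be called as `await authenticate(page, firebase_user, script)` after `wait_for_selector`, so the handler itself stays as short as in the example.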
Motivation
Since I'm sure other users of Playwright struggle with similar issues (see #11164), it makes sense for this behavior to be supported out of the box.
🚀 Feature Request
1. Use case
I'm trying to extract all content I produced @ https://legacy.mage.space/u/johnslegers before it disappears for good in less than 10 days.
Since downloading 14K images with corresponding prompts is quite insane, I figured I'd write a crawler for it instead.