Form submit functionality
multiscrape:
  - resource: 'https://thepagewiththedatathatyouwant.com'
    scan_interval: 30
    form_submit:
      submit_once: True
      resource: 'https://thesitewiththeform.com'
      select: ".unique-css-selector-for-the-form"
      input:
        username: [email protected]
        password: '12345678'
    sensor:
      - select: 'td.mydata:nth-child(1) > a:nth-child(1)'
        name: scraped-value-after-form-submit
Update: It is now also possible to just submit the form without scraping it. You can do this by omitting the form_submit>select. The input fields will then be submitted to the form_submit>resource URL.
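As a minimal sketch of this submit-only setup, the configuration below drops select from form_submit, so the input fields go straight to the form_submit resource. The URLs are the placeholders used earlier on this page; the field names and the username value are assumptions and must match what the target site actually expects.

```yaml
multiscrape:
  - resource: 'https://thepagewiththedatathatyouwant.com'
    scan_interval: 30
    form_submit:
      submit_once: True
      # No 'select' entry: the form page is not scraped, and the input
      # fields below are submitted directly to this resource.
      resource: 'https://thesitewiththeform.com'
      input:
        username: 'user@example.com'   # placeholder value
        password: '12345678'           # placeholder value
    sensor:
      - select: 'td.mydata:nth-child(1) > a:nth-child(1)'
        name: scraped-value-after-form-submit
```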
When the configuration has no form_submit entry, no form is submitted. If the entry exists, the form is always submitted after startup. By default it is also submitted before each scraping action (once per scan interval).
Most forms used for authentication only need to be submitted once. In that case you can set submit_once: True. The site will probably set a cookie in the session, and sessions are reused between all scraping requests, so this saves a lot of requests.
If one sensor gives a scraping error, by default multiscrape will try to resubmit the form on the next try. You can disable this behaviour by setting resubmit_on_error: False.
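A sketch of a login that is submitted once and is not resubmitted after a scraping error could look like the fragment below. Only the two flags matter here; the URL, selector and credentials are placeholders.

```yaml
# Fragment: the form_submit part of one entry in the multiscrape list.
form_submit:
  submit_once: True          # submit after startup only, then reuse the session
  resubmit_on_error: False   # do not resubmit when a sensor fails to scrape
  resource: 'https://thesitewiththeform.com'
  select: ".unique-css-selector-for-the-form"
  input:
    username: 'user@example.com'   # placeholder value
    password: '12345678'           # placeholder value
```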
First the page with the form will be fetched. If a resource is provided in the form_submit part of the config, that URL will be used. Otherwise it is assumed that the form can be found on the resource in the main part of the config. If the form cannot be found on the page, the form submission will be skipped and multiscrape will continue trying to scrape the values it needs.
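When the form lives on the same page as the data, the form_submit resource can be left out and the form is looked up on the main resource. A sketch, with a placeholder selector and hypothetical field names:

```yaml
multiscrape:
  - resource: 'https://thepagewiththedatathatyouwant.com'
    form_submit:
      # No 'resource' entry: the form is searched for on the main resource above.
      select: ".unique-css-selector-for-the-form"
      input:
        username: 'user@example.com'   # placeholder value
        password: '12345678'           # placeholder value
    sensor:
      - select: 'td.mydata:nth-child(1) > a:nth-child(1)'
        name: scraped-value-after-form-submit
```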
All fields from the form are loaded and merged with the input fields given in the configuration. If a field exists in both, the value from the configuration will be assigned. If a field exists only in the config, it will be added and submitted as well.
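To illustrate the merge, the fragment below assumes a login form on the site that carries three fields: username, password and a hidden token. Those three field names, as well as remember_me, are hypothetical and only serve the example.

```yaml
# Fragment: the form_submit part of one entry in the multiscrape list.
form_submit:
  resource: 'https://thesitewiththeform.com'
  select: ".unique-css-selector-for-the-form"
  input:
    # These exist in the form as well: the values given here are assigned.
    username: 'user@example.com'
    password: '12345678'
    # Not present in the form: added to the fields that are submitted.
    remember_me: 'true'
    # 'token' is not listed here, so it is submitted with the value
    # that was loaded from the form itself.
```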
If the site specifies a submit method on the form, that method will be used. Otherwise the form will be POSTed. The url the form will be submitted to is determined as follows:
- If the site specifies a (relative) URL in the action attribute of the form, that URL will be merged with the form resource from the config if given, and otherwise with the main resource.
- If no action is specified, the form will be submitted to the same URL it was fetched from.
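As a worked illustration of these two rules, the sketch below assumes the form page lives at a hypothetical /login path. The targets in the comments assume that the merge follows standard relative URL resolution; the /session action is likewise made up for the example.

```yaml
multiscrape:
  - resource: 'https://thepagewiththedatathatyouwant.com'
    form_submit:
      resource: 'https://thesitewiththeform.com/login'   # hypothetical path
      select: ".unique-css-selector-for-the-form"
      input:
        username: 'user@example.com'   # placeholder value
        password: '12345678'           # placeholder value
      # Form has action="/session":
      #   submitted to https://thesitewiththeform.com/session
      # Form has no action attribute:
      #   submitted to https://thesitewiththeform.com/login
    sensor:
      - select: 'td.mydata:nth-child(1) > a:nth-child(1)'
        name: scraped-value-after-form-submit
```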
After the form has been submitted, the resource from the main config will be fetched and scraped. If the form was already submitted to this resource, it will not be fetched again, but the values will be scraped from the response instead.