Recommended approach to scrape multiple jurisdictions at once? #70
Comments
Any idea of how you'd like to see pupa handle this?
I don't have strong opinions on how the API should work, but one way is to be able to change the "active jurisdiction" so that objects are yielded to the appropriate jurisdiction. Pseudo-code:
However, I can imagine a lot of challenges in changing Pupa to work this way. Maybe there are some Python metaprogramming tricks I can use, to make it seem like there are several thousand modules with common |
The people.py files won't be needed if they're all the same, as multiple jurisdictions can point to the same scrapers. Your proposed solution might work; I'll play with some proof-of-concept code.
Cool - how do you make multiple jurisdictions point to the same scrapers?
There's now an example of this in the https://github.com/opencivicdata/scrapers-us-state repository. There's still one file per jurisdiction (maybe we can improve that, though maybe this is good enough), but they all point to the same scraper (and the jurisdictions in this case are actually auto-generated classes).
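For readers wondering what "auto-generated classes that all point to the same scraper" could look like, here is a minimal sketch in plain Python. It does not use pupa's actual API: `Jurisdiction`, `SharedPersonScraper`, `make_jurisdiction`, and the placeholder division IDs are all hypothetical stand-ins, and the real scrapers-us-state code may be organized differently.

```python
class Jurisdiction:
    """Hypothetical stand-in for a jurisdiction base class (not pupa's API)."""
    division_id = None
    name = None
    scrapers = {}


class SharedPersonScraper:
    """One scraper implementation reused by every generated jurisdiction."""

    def __init__(self, jurisdiction):
        self.jurisdiction = jurisdiction

    def scrape(self):
        # A real scraper would fetch and parse a page here; this only
        # shows that results are tagged with the owning jurisdiction.
        yield {"jurisdiction": self.jurisdiction.name, "role": "councillor"}


def make_jurisdiction(division_id, name):
    """Build a Jurisdiction subclass at runtime with the built-in type()."""
    class_name = "".join(part.capitalize() for part in name.split())
    attrs = {
        "division_id": division_id,
        "name": name,
        "scrapers": {"people": SharedPersonScraper},
    }
    return type(class_name, (Jurisdiction,), attrs)


# Placeholder data; a real list might come from a CSV of divisions.
DIVISIONS = [
    ("ocd-division/country:xx/place:alpha_town", "Alpha Town"),
    ("ocd-division/country:xx/place:beta_ville", "Beta Ville"),
]

JURISDICTIONS = [make_jurisdiction(div_id, name) for div_id, name in DIVISIONS]
```

Each generated class carries its own `division_id` and `name` but shares the single `SharedPersonScraper`, so adding a jurisdiction means adding a row of data rather than a new module.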
Thanks! In Quebec I'll have 1000 auto-generated jurisdictions, mixed in with manual jurisdictions; we scrape the big cities individually (to get email addresses), but we're happy to use a provincial directory for the smaller cities (which has one email for the entire council). It may be confusing to have this mix, so avoiding one file per jurisdiction would still be ideal. How is Pupa 0.4 coming along? How soon can I start upgrading to the PostgreSQL version?
pupa 0.4 is pretty much ready; there are still rough edges, but no more than existed in the mongo version, I believe. I was hoping to update some docs before calling it 0.4 officially, but we're using it in development now and will be releasing it as 0.4 and switching production over soon.

The 1000-jurisdiction issue still requires more work/thinking on the best way to do it. I think a different command like pupa bulkupdate might get around some of the challenges we'd face. Once things settle down here, I'll try to think of a cleaner interface for this.
Pinging for any updates on how to implement common scraper code for 1000s of jurisdictions. In the |
My workaround is to just put all the jurisdictions into one jurisdiction, in an organization hierarchy, which is fine for my needs, but maybe not in the general case. However, as there is no other demand for the general case, I'm closing.
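As a rough illustration of that workaround, the sketch below models a single top-level jurisdiction whose municipal councils are child organizations. The `Organization` dataclass, `build_hierarchy`, and the example names are assumptions made for illustration only; they are not pupa's models, and the actual workaround may have been structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Organization:
    """Hypothetical organization record with an optional parent."""
    name: str
    classification: str
    parent: Optional["Organization"] = None

    def path(self):
        """Return the chain of names from the root down to this node."""
        node, parts = self, []
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " > ".join(reversed(parts))


def build_hierarchy(province_name, municipality_names):
    """Create one root organization and a child council per municipality."""
    root = Organization(province_name, "government")
    councils = [
        Organization(f"{name} council", "legislature", parent=root)
        for name in municipality_names
    ]
    return root, councils


root, councils = build_hierarchy("Example Province", ["Alpha Town", "Beta Ville"])
```

Here one scrape of a province-wide directory can yield every council under the same root, at the cost of the municipalities no longer being separate jurisdictions.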
For example, if a provincial website has information for all its municipalities.