914 spider il finance authority #995
Conversation
…or scraping. Deleted/cleaned spider file code and moved functions around to better support modularity for testing.
Thanks for the PR, and sorry for the delay! This is looking great so far, especially since PDF scrapers tend to be pretty tricky. I left some comments here; let me know if anything is unclear.
```diff
@@ -22,6 +22,7 @@ parts/
 sdist/
 var/
 wheels/
+city_scrapers/get-pip.py
```
Usually we'll want to keep unrelated .gitignore changes out of PRs. If you want to ignore something locally you can add it to `.git/info/exclude` in your repo.
```python
timezone = "America/Chicago"
start_urls = ["https://www.il-fa.com/public-access/board-documents/"]

def __init__(self, *args, **kwargs):
```
Is there a reason for leaving this in? Right now it has no effect, but I'm not sure if it was used earlier.
```python
    address = address_match.group(0)
    name = re.findall(r"(?:in\s*the|at\s*the).*?,", pdf_content)[0]
except Exception:
    address = "Address Not Found"
```
We can just return blank strings if that's the case, but it looks like the location is pretty consistent, so it could be easier to use `_validate_location` and raise an exception if the location isn't a set address, like we do here:

```python
def _validate_location(self, response):
```
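For reference, a minimal sketch of that pattern, assuming a static location (the dict values and the check string are illustrative, based on the address that shows up in the PDFs):

```python
location = {
    "name": "IFA Chicago Office",
    "address": "160 N LaSalle St, Suite S-1000, Chicago, IL 60601",
}

def _validate_location(self, pdf_content):
    # Fail loudly if the meetings ever move so the static
    # location dict above doesn't silently go stale
    if "LaSalle" not in pdf_content:
        raise ValueError("Meeting location has changed")
```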
```python
    clean_text = re.sub(r"_*", "", clean_text)
    return clean_text

except PDFSyntaxError as e:
```
It looks like this can return `None`. Does this happen often? If so, we should allow it to raise the error normally.
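If the failures are rare, a sketch of what letting it raise could look like, assuming pdfminer.six is the parsing library (the method name here is illustrative):

```python
import re
from io import BytesIO

from pdfminer.high_level import extract_text

def _parse_pdf_text(self, response):
    # No try/except: if pdfminer can't read the document we want the
    # error in the crawl stats rather than a silent None return
    text = extract_text(BytesIO(response.body))
    return re.sub(r"\s+", " ", text)
```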
"""Parse or generate all-day status. Defaults to False.""" | ||
return False | ||
|
||
def _parse_links(self, link, title): |
It looks like there are multiple links here, so ideally we would want to pull the notice, minutes, and any other information as links even if we aren't parsing them.
Are you talking about the other PDF links on the website? Could you elaborate more? Is there an example you could provide? Thanks
Sure, I'm seeing up to 4 PDF links in each row for the Agenda, Board Book, Minutes, and Voting Record. Ideally we'll want to include all of those in the links list.
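As a rough sketch, something like this could collect all of them (the selector and signature are illustrative and would need to match the actual markup):

```python
def _parse_links(self, item, response):
    """Collect every PDF in the row: Agenda, Board Book, Minutes,
    Voting Record, etc."""
    links = []
    for link in item.css("a[href$='.pdf']"):
        links.append({
            "href": response.urljoin(link.attrib["href"]),
            "title": " ".join(link.css("::text").getall()).strip(),
        })
    return links
```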
meeting_dict["location"] = location | ||
meeting_dict["time"] = time | ||
|
||
yield scrapy.Request( |
Is this last request necessary? It looks like we're trying to `yield` the meeting after parsing the PDF, but is there other information here that we need to get from `_parse_meeting`?
The last request is necessary for the code to work.
Could you explain more about what you mean? When you create a new request to `response.url`, it looks like you're submitting a second request to the URL you parsed the response from rather than going to a new page.
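A sketch of the alternative being suggested: pass the partial meeting through `meta` so the PDF callback can finish and yield it, with no second request to `response.url` (method bodies only, helper names illustrative):

```python
import scrapy
from city_scrapers_core.items import Meeting

def parse(self, response):
    for item in response.css("tr:nth-child(1n+2)"):
        # Fields available from the table row itself
        meeting_dict = {"title": self._parse_title(item)}
        pdf_url = response.urljoin(item.css("a::attr(href)").get())
        # Hand the partial meeting to the PDF callback instead of
        # re-requesting response.url afterwards
        yield scrapy.Request(
            pdf_url, callback=self._parse_pdf, meta={"meeting": meeting_dict}
        )

def _parse_pdf(self, response):
    meeting_dict = response.meta["meeting"]
    # Fill in the PDF-derived fields, then yield the finished item
    meeting_dict["location"] = self._parse_location(response.text)
    yield Meeting(**meeting_dict)
```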
"""Parse or generate classification from allowed options.""" | ||
if "Comm" in title: | ||
return COMMITTEE | ||
if "Board" in title: |
We can default to `BOARD` instead of Commission since it seems like that's the majority of these.
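That could be as simple as the following sketch, using the constants from city_scrapers_core:

```python
from city_scrapers_core.constants import BOARD, COMMITTEE

def _parse_classification(self, title):
    """Default to BOARD since board meetings are the majority."""
    if "Comm" in title:
        return COMMITTEE
    return BOARD
```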
```python
def test_location():
    location = spider._parse_location(pdf_text)
    assert (
        "in the Authority’s Chicago Office, 160 North LaSalle Street, Suite S-1000, Chicago, Illinois 60601"
```
We'll want to simplify this so that only the street address is included in the `address` value of the location dict. This could be fixed by using a static location checked by `_validate_location`, as mentioned in another comment here.
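With a static location in place, the test could compare against the full dict instead of a substring of the PDF text. A sketch, with illustrative values:

```python
EXPECTED_LOCATION = {
    "name": "IFA Chicago Office",
    "address": "160 N LaSalle St, Suite S-1000, Chicago, IL 60601",
}

def test_location():
    assert parsed_items[0]["location"] == EXPECTED_LOCATION
```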
```python
def _parse_title(self, item):
    """Parse or generate meeting title."""
    try:
```
Small thing, but if possible it would be good to replace "Comm" with "Committee" for committee meetings. To be safe we'd want to replace it only when it's at the end of the title, so something like this could work:

```python
title = re.sub(r"Comm$", "Committee", title)
```
```python
def test_time():
    time = spider._parse_start(pdf_text)
```
This should work, but is there a reason for not testing against a Meeting item? Usually that's the way we've done it to make sure the whole process works
To be honest, I'm not sure how to test the id since it doesn't appear in parsed_items. Perhaps I'm missing the obvious from my novice coding skills, but I can't seem to think of a way to get all the information to recreate/test the id, since the process of yielding a meeting object is very specific in order and 'tricky' for this spider.
Gotcha, we aren't typically using the `meta` attribute on requests, but it seems like the right approach here, and testing it can be tricky. Here's an example of how we're doing that for the `chi_buildings` scraper, where we're using it in a similar way. Let me know if that helps!

city-scrapers/tests/test_chi_buildings.py, lines 22 to 32 in c426204:
```python
class MockRequest(object):
    meta = {}

    def __getitem__(self, key):
        return self.meta["meeting"].get(key)


def mock_request(*args, **kwargs):
    mock = MockRequest()
    mock.meta = {"meeting": {}}
    return mock
```
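One way to wire the mock in, as a sketch (this assumes the spider calls `scrapy.Request(...)` through the module attribute, so patching `scrapy.Request` intercepts it; `test_response` is an illustrative fixture):

```python
from unittest.mock import patch

# Build parsed_items with the mock standing in for scrapy.Request;
# each mocked "request" then exposes the meeting fields via meta
with patch("scrapy.Request", mock_request):
    parsed_items = [item for item in spider.parse(test_response)]
```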
Thanks for the responses! I batched some replies here
"""Parse or generate all-day status. Defaults to False.""" | ||
return False | ||
|
||
def _parse_links(self, link, title): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'm seeing up to 4 PDF links in each row for the Agenda, Board Book, Minutes, and Voting Record. Ideally we'll want to include all of those in the links
list
```python
        super().__init__(*args, **kwargs)

    def parse(self, response):
        for item in response.css("tr:nth-child(1n+2)"):
```
Sure! We want to be careful about spamming sites with a ton of requests at once, and typically when we scrape a site we're only interested in the last few past meetings and the next few upcoming. To reduce the number of requests we make, as well as simplify the output for anyone using the feeds directly, we try to set ranges of time relative to the current date that we're interested in, like everything in the past year in that example.

Scrapy's settings are a way of managing configuration across spiders, like where the output is written or how quickly requests should be made. You can find more info on them in the Scrapy documentation on settings.
Could you explain more about what isn't working? It's hard for me to debug without an example, but in general all the `CITY_SCRAPERS_ARCHIVE` setting is doing is giving us a boolean that can be put inside a conditional, so it doesn't have to work the same way as the example.
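For example, a sketch of the kind of conditional it's meant to feed (the one-year cutoff, the selector, and the `_parse_start` usage are illustrative):

```python
from datetime import datetime, timedelta

def parse(self, response):
    # Unless we're building the full archive, skip meetings more than
    # a year in the past to keep the request count down
    cutoff = datetime.now() - timedelta(days=365)
    for item in response.css("tr:nth-child(1n+2)"):
        start = self._parse_start(item)
        if start < cutoff and not self.settings.getbool("CITY_SCRAPERS_ARCHIVE"):
            continue
        ...
```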
meeting_dict["location"] = location | ||
meeting_dict["time"] = time | ||
|
||
yield scrapy.Request( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain more about what you mean? When you create a new request to response.url
, it looks like you're submitting a second request to the URL you parsed the response from rather than going to a new page
|
||
|
||
def test_time(): | ||
time = spider._parse_start(pdf_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, we aren't typically using the meta
attribute on requests, but it seems like the right approach here and testing it can be tricky.
Here's an example of how we're doing that for the chi_buildings
scraper where we're using it in a similar way. Let me know if that helps!
city-scrapers/tests/test_chi_buildings.py
Lines 22 to 32 in c426204
class MockRequest(object): | |
meta = {} | |
def __getitem__(self, key): | |
return self.meta["meeting"].get(key) | |
def mock_request(*args, **kwargs): | |
mock = MockRequest() | |
mock.meta = {"meeting": {}} | |
return mock |
Summary
Issue: #914
Checklist
All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.
Comments
Glad to say that after months I'm finally done working on this spider, my first ever pull request. Thanks for the opportunity, as I learned a lot. I used scrapy meta arguments for each crawl since the self keyword didn't make each item unique and would repeat values. The code for tests might seem "weird", but scraped information can vary when working with PDFs, so I tried a general approach.