914 spider il finance authority #995
Conversation
…or scraping. Deleted/cleaned spider file code and moved functions around to better support modularity for testing.
Thanks for the PR, and sorry for the delay! This is looking great so far, especially since PDF scrapers tend to be pretty tricky. I left some comments here; let me know if anything is unclear.
```diff
@@ -22,6 +22,7 @@ parts/
 sdist/
 var/
 wheels/
+city_scrapers/get-pip.py
```
Usually we'll want to keep unrelated .gitignore changes out of PRs. If you want to ignore something locally you can add it to `.git/info/exclude` in your repo.
```python
timezone = "America/Chicago"
start_urls = ["https://www.il-fa.com/public-access/board-documents/"]

def __init__(self, *args, **kwargs):
```
Is there a reason for leaving this in? Right now it has no effect, but I'm not sure if it was used earlier.
```python
    address = address_match.group(0)
    name = re.findall(r"(?:in\s*the|at\s*the).*?,", pdf_content)[0]
except Exception:
    address = "Address Not Found"
```
We can just return blank strings if that's the case, but it looks like the location is pretty consistent, so it could be easier to use `_validate_location` and raise an exception if the location isn't a set address, like we do here:

```python
def _validate_location(self, response):
```
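For reference, a minimal sketch of that pattern, assuming a static location (the dict values and the check string are illustrative, based on the address that shows up in the PDFs):

```python
location = {
    "name": "IFA Chicago Office",
    "address": "160 N LaSalle St, Suite S-1000, Chicago, IL 60601",
}

def _validate_location(self, pdf_content):
    # Fail loudly if the meetings ever move so the static
    # location dict above doesn't silently go stale
    if "LaSalle" not in pdf_content:
        raise ValueError("Meeting location has changed")
```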
```python
    clean_text = re.sub(r"_*", "", clean_text)
    return clean_text

except PDFSyntaxError as e:
```
It looks like this can return `None`. Does this happen often? If so, we should allow it to raise the error normally.
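If the failures are rare, a sketch of what letting it raise could look like, assuming pdfminer.six is the parsing library (the method name here is illustrative):

```python
import re
from io import BytesIO

from pdfminer.high_level import extract_text

def _parse_pdf_text(self, response):
    # No try/except: if pdfminer can't read the document we want the
    # error in the crawl stats rather than a silent None return
    text = extract_text(BytesIO(response.body))
    return re.sub(r"\s+", " ", text)
```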
"""Parse or generate all-day status. Defaults to False.""" | ||
return False | ||
|
||
def _parse_links(self, link, title): |
It looks like there are multiple links here, so ideally we would want to pull the notice, minutes, and any other information as links even if we aren't parsing them.
Are you talking about the other PDF links on the website? Could you elaborate more? Is there an example you could provide? Thanks
Sure, I'm seeing up to 4 PDF links in each row for the Agenda, Board Book, Minutes, and Voting Record. Ideally we'll want to include all of those in the links list.
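As a rough sketch, something like this could collect all of them (the selector and signature are illustrative and would need to match the actual markup):

```python
def _parse_links(self, item, response):
    """Collect every PDF in the row: Agenda, Board Book, Minutes,
    Voting Record, etc."""
    links = []
    for link in item.css("a[href$='.pdf']"):
        links.append({
            "href": response.urljoin(link.attrib["href"]),
            "title": " ".join(link.css("::text").getall()).strip(),
        })
    return links
```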
meeting_dict["location"] = location | ||
meeting_dict["time"] = time | ||
|
||
yield scrapy.Request( |
Is this last request necessary? It looks like we're trying to `yield` the meeting after parsing the PDF, but is there other information here that we need to get from `_parse_meeting`?
The last request is necessary for the code to work.
Could you explain more about what you mean? When you create a new request to `response.url`, it looks like you're submitting a second request to the URL you parsed the response from rather than going to a new page.
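A sketch of the alternative being suggested: pass the partial meeting through `meta` so the PDF callback can finish and yield it, with no second request to `response.url` (method bodies only, helper names illustrative):

```python
import scrapy
from city_scrapers_core.items import Meeting

def parse(self, response):
    for item in response.css("tr:nth-child(1n+2)"):
        # Fields available from the table row itself
        meeting_dict = {"title": self._parse_title(item)}
        pdf_url = response.urljoin(item.css("a::attr(href)").get())
        # Hand the partial meeting to the PDF callback instead of
        # re-requesting response.url afterwards
        yield scrapy.Request(
            pdf_url, callback=self._parse_pdf, meta={"meeting": meeting_dict}
        )

def _parse_pdf(self, response):
    meeting_dict = response.meta["meeting"]
    # Fill in the PDF-derived fields, then yield the finished item
    meeting_dict["location"] = self._parse_location(response.text)
    yield Meeting(**meeting_dict)
```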
"""Parse or generate classification from allowed options.""" | ||
if "Comm" in title: | ||
return COMMITTEE | ||
if "Board" in title: |
We can default to `BOARD` instead of Commission since it seems like that's the majority of these.
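That could be as simple as the following sketch, using the constants from city_scrapers_core:

```python
from city_scrapers_core.constants import BOARD, COMMITTEE

def _parse_classification(self, title):
    """Default to BOARD since board meetings are the majority."""
    if "Comm" in title:
        return COMMITTEE
    return BOARD
```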
```python
def test_location():
    location = spider._parse_location(pdf_text)
    assert (
        "in the Authority’s Chicago Office, 160 North LaSalle Street, Suite S-1000, Chicago, Illinois 60601"
```
We'll want to simplify this so that only the street address is included in the `address` value of the location dict. This could be fixed by using a static location checked by `_validate_location`, as mentioned in another comment here.
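With a static location in place, the test could compare against the full dict instead of a substring of the PDF text. A sketch, with illustrative values:

```python
EXPECTED_LOCATION = {
    "name": "IFA Chicago Office",
    "address": "160 N LaSalle St, Suite S-1000, Chicago, IL 60601",
}

def test_location():
    assert parsed_items[0]["location"] == EXPECTED_LOCATION
```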
```python
def _parse_title(self, item):
    """Parse or generate meeting title."""
    try:
```
Small thing, but if possible it would be good to replace "Comm" with "Committee" for committee meetings. To be safe we'd want to replace it only when it's at the end of the title, so something like this could work:

```python
title = re.sub(r"Comm$", "Committee", title)
```
```python
def test_time():
    time = spider._parse_start(pdf_text)
```
This should work, but is there a reason for not testing against a Meeting item? Usually that's the way we've done it to make sure the whole process works
To be honest, I'm not sure how to test the id since it doesn't appear in parsed_items. Perhaps I'm missing the obvious from my novice coding skills, but I can't seem to think of a way to get all the information to recreate/test the id, since the process of yielding a meeting object is very specific in order and 'tricky' for this spider.
Gotcha, we aren't typically using the `meta` attribute on requests, but it seems like the right approach here, and testing it can be tricky. Here's an example of how we're doing that for the `chi_buildings` scraper, where we're using it in a similar way. Let me know if that helps!

city-scrapers/tests/test_chi_buildings.py, lines 22 to 32 in c426204:
```python
class MockRequest(object):
    meta = {}

    def __getitem__(self, key):
        return self.meta["meeting"].get(key)


def mock_request(*args, **kwargs):
    mock = MockRequest()
    mock.meta = {"meeting": {}}
    return mock
```
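One way to wire the mock in, as a sketch (this assumes the spider calls `scrapy.Request(...)` through the module attribute, so patching `scrapy.Request` intercepts it; `test_response` is an illustrative fixture):

```python
from unittest.mock import patch

# Build parsed_items with the mock standing in for scrapy.Request;
# each mocked "request" then exposes the meeting fields via meta
with patch("scrapy.Request", mock_request):
    parsed_items = [item for item in spider.parse(test_response)]
```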
Thanks for the responses! I batched some replies here
"""Parse or generate all-day status. Defaults to False.""" | ||
return False | ||
|
||
def _parse_links(self, link, title): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'm seeing up to 4 PDF links in each row for the Agenda, Board Book, Minutes, and Voting Record. Ideally we'll want to include all of those in the links
list
```python
        super().__init__(*args, **kwargs)

    def parse(self, response):
        for item in response.css("tr:nth-child(1n+2)"):
```
Sure! We want to be careful about spamming sites with a ton of requests at once, and typically when we scrape a site we're only interested in the last few past meetings and the next few upcoming. To reduce the number of requests we make, as well as simplify the output for anyone using the feeds directly, we try to set ranges of time relative to the current date that we're interested in, like everything in the past year in that example.

Scrapy's settings are a way of managing configuration across spiders, like where the output is written or how quickly requests should be made. You can find more info on them in the Scrapy documentation on settings.
Could you explain more about what isn't working? It's hard for me to debug without an example, but in general all the `CITY_SCRAPERS_ARCHIVE` setting is doing is giving us a boolean that can be put inside a conditional, so it doesn't have to work the same way as the example.
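For example, a sketch of the kind of conditional it's meant to feed (the one-year cutoff, the selector, and the `_parse_start` usage are illustrative):

```python
from datetime import datetime, timedelta

def parse(self, response):
    # Unless we're building the full archive, skip meetings more than
    # a year in the past to keep the request count down
    cutoff = datetime.now() - timedelta(days=365)
    for item in response.css("tr:nth-child(1n+2)"):
        start = self._parse_start(item)
        if start < cutoff and not self.settings.getbool("CITY_SCRAPERS_ARCHIVE"):
            continue
        ...
```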
meeting_dict["location"] = location | ||
meeting_dict["time"] = time | ||
|
||
yield scrapy.Request( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain more about what you mean? When you create a new request to response.url
, it looks like you're submitting a second request to the URL you parsed the response from rather than going to a new page
|
||
|
||
def test_time(): | ||
time = spider._parse_start(pdf_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, we aren't typically using the meta
attribute on requests, but it seems like the right approach here and testing it can be tricky.
Here's an example of how we're doing that for the chi_buildings
scraper where we're using it in a similar way. Let me know if that helps!
city-scrapers/tests/test_chi_buildings.py
Lines 22 to 32 in c426204
class MockRequest(object): | |
meta = {} | |
def __getitem__(self, key): | |
return self.meta["meeting"].get(key) | |
def mock_request(*args, **kwargs): | |
mock = MockRequest() | |
mock.meta = {"meeting": {}} | |
return mock |
Summary
Issue: #914
Checklist
All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.
Comments
Glad to say that after months I'm finally done working on this spider, my first ever pull request. Thanks for the opportunity, as I learned a lot. I used scrapy meta arguments for each crawl since the self keyword didn't make each item unique and would repeat values. The code for tests might seem "weird", but scraped information can vary when working with PDFs, so I tried a general approach.