0572 spider chi ssa 38 #962

Open

jaspsingh wants to merge 3 commits into City-Bureau:main from jaspsingh:0572-spider-chi_ssa_38

jaspsingh commented Jun 24, 2020

Summary

Issue: #572

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

Tests are implemented
All tests are passing
Style checks run (see documentation for more details)
Style checks are passing
Code comments from template removed

Questions

Include any questions you have about what you're working on.

jaspsingh added 3 commits

June 23, 2020 22:40


WHAT: created spider, set basic values in spider meeting object, and wrote spider parse_start function. WHY: wrote parse_start to extract unstructured date from page

78f04af

WHAT: Added html and wrote tests for spider. WHY: to test recently developed functionality on a test set

666ba85

WHAT: Linting and style-checking fixes. WHY: to comply with style standards

6b6dc89

pjsier requested changes

Collaborator

pjsier left a comment

Thanks for the PR! This is looking good so far. Let me know if any of my comments aren't clear.

city_scrapers/spiders/chi_ssa_38.py

+                  def _parse_classification(self, item):
+                      """Parse or generate classification from allowed options."""
+                      return NOT_CLASSIFIED

Collaborator

pjsier Jun 24, 2020

This should be COMMISSION for all meetings on this spider

city_scrapers/spiders/chi_ssa_38.py

+                  def _parse_description(self, item):
+                      """Parse or generate meeting description."""
+                      description = ""
+                      return description

Collaborator

pjsier Jun 24, 2020

It's fine to just return "" instead of setting a variable first

city_scrapers/spiders/chi_ssa_38.py

+                  def _parse_title(self, item):
+                      """Parse or generate meeting title."""
+                      title = "Chamber of Commerce"
+                      return title

Collaborator

pjsier Jun 24, 2020

As mentioned on _parse_description, it's fine to just return the string without assigning it to a variable first. It's a bit odd for SSAs, but this title should be "Commission". SSAs are technically separate entities managed by a nonprofit.

city_scrapers/spiders/chi_ssa_38.py

+                          "name": "Northcenter Chamber of Commerce",
+                      }
+                  def _parse_links(self, item):

Collaborator

pjsier Jun 24, 2020

We'll need to parse a mapping of dates to relevant links from the page so that things like meeting minutes can be associated with the meetings listed. Here's an example of that:

city-scrapers/city_scrapers/spiders/chi_il_medical_district.py

Lines 109 to 140 in c6771d5

    def _parse_link_date_map(self, response):
        """Generate a defaultdict mapping of meeting dates and associated links"""
        link_date_map = defaultdict(list)
        for link in response.css(
            ".vc_col-sm-4.column_container:nth-child(1) .mk-text-block.indent16"
        )[:1].css("a"):
            link_str = link.xpath("./text()").extract_first()
            link_start = self._parse_start(link_str)
            if link_start:
                link_date_map[link_start.date()].append(
                    {
                        "title": re.sub(r"\s+", " ", link_str.split(" – ")[-1]).strip(),
                        "href": link.attrib["href"],
                    }
                )
        for section in response.css(
            ".vc_col-sm-4.column_container:nth-child(1) .vc_tta-panel"
        ):
            year_str = section.css(".vc_tta-title-text::text").extract_first().strip()
            for section_link in section.css("p > a"):
                link_str = section_link.xpath("./text()").extract_first()
                link_dt = self._parse_start(link_str, year=year_str)
                if link_dt:
                    link_date_map[link_dt.date()].append(
                        {
                            "title": re.sub(
                                r"\s+", " ", link_str.split(" – ")[-1]
                            ).strip(),
                            "href": section_link.xpath("@href").extract_first(),
                        }
                    )
        return link_date_map
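
Once a map like this exists, looking up a meeting's links can reduce to a dictionary lookup keyed on the meeting date. The sketch below is a minimal, self-contained illustration of that idea, not the spider's actual code: `build_link_date_map`, `links_for`, the `"%B %d, %Y"` date format, the `" - "` separator, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def build_link_date_map(rows):
    """Group links by the date parsed from the start of their text.

    `rows` is a hypothetical list of (link_text, href) pairs standing in for
    the <a> elements a spider would select from the page.
    """
    link_date_map = defaultdict(list)
    for link_text, href in rows:
        date_str, _, title = link_text.partition(" - ")
        try:
            link_date = datetime.strptime(date_str.strip(), "%B %d, %Y").date()
        except ValueError:
            continue  # Skip links whose text does not start with a date
        link_date_map[link_date].append({"title": title.strip(), "href": href})
    return link_date_map


def links_for(link_date_map, start):
    """Return the links recorded for a meeting's start date, or an empty list."""
    # .get avoids inserting an empty entry into the defaultdict on a miss
    return link_date_map.get(start.date(), [])


rows = [
    ("June 4, 2020 - Minutes", "https://example.com/minutes.pdf"),
    ("June 4, 2020 - Agenda", "https://example.com/agenda.pdf"),
    ("Contact us", "https://example.com/contact"),
]
link_date_map = build_link_date_map(rows)
meeting_links = links_for(link_date_map, datetime(2020, 6, 4, 9, 0))
```

Keying on `date()` rather than the full datetime means a link still matches even when the meeting time is missing from the link text.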

city_scrapers/spiders/chi_ssa_38.py

+                      return False
+                  def _parse_location(self, item):
+                      # Meetings seemingly occurred at

Collaborator

pjsier Jun 24, 2020

It looks like an abbreviated version of this address is in the meeting item. Since we don't have examples of another format, we can do something like this to return a default location if "4054" is present and raise an exception otherwise:

if "4054" not in item.extract():
  raise ValueError("Meeting location has changed")
return {
  "address": "4054 N Lincoln Ave, Chicago, IL 60618",
  "name": "Northcenter Chamber of Commerce",
}
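
To show how both branches of that guard could be exercised in tests, here is a rough standalone sketch. It works on a plain string instead of a Scrapy selector, and `parse_location` and `DEFAULT_LOCATION` are hypothetical names chosen for the example.

```python
DEFAULT_LOCATION = {
    "address": "4054 N Lincoln Ave, Chicago, IL 60618",
    "name": "Northcenter Chamber of Commerce",
}


def parse_location(item_text):
    """Return the default location, or fail loudly if the page no longer mentions it."""
    if "4054" not in item_text:
        raise ValueError("Meeting location has changed")
    # Return a copy so callers cannot mutate the shared default
    return dict(DEFAULT_LOCATION)


location = parse_location("Meetings are held at 4054 N Lincoln")

try:
    parse_location("Meetings are held at a new venue")
    location_changed = False
except ValueError:
    location_changed = True
```

Raising instead of silently returning the default means a change on the source page surfaces as a scraper failure rather than as wrong data.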

city_scrapers/spiders/chi_ssa_38.py

+                  def _parse_start(self, item):
+                      """Parse start datetime as a naive datetime object."""
+                      date_str = item.extract()

Collaborator

pjsier Jun 24, 2020

It looks like we might be able to simplify this a bit, and we'll also need to handle situations where minutes may be supplied for the time. Haven't tested this, but something like this snippet could work:

# Assumes `import re` and `from datetime import datetime` at module level
item_str = item.extract()
month_day_str = re.search(r"[A-Z][a-z]{2,9} \d{1,2}", item_str).group()

year_match = re.search(r"\d{4}", item_str)
year_str = year_match.group() if year_match else ""
if not year_str.startswith("20"):
  year_str = str(datetime.today().year)  # Default to current year

# The minutes portion is optional, so "9 a.m." and "9:30 a.m." both match
time_match = re.search(r"\d{1,2}(:\d\d)? [apm.]{2,4}", item_str)
time_str = "12 am"
if time_match:
  time_str = time_match.group().replace(".", "")

time_fmt = "%I %p"
if ":" in time_str:
  time_fmt = "%I:%M %p"

# The format string takes time_fmt, not time_str
return datetime.strptime(f"{month_day_str} {year_str} {time_str}", f"%B %d %Y {time_fmt}")
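
For experimenting with this approach outside the spider, the following is a self-contained sketch that takes a plain string. It is an illustration under stated assumptions, not the final implementation: `parse_start` as a free function, the `default_year` parameter, the stricter `\b20\d{2}\b` year pattern (used so a street number such as "4054" is never read as a year), and the sample strings are all choices made for this example.

```python
import re
from datetime import datetime


def parse_start(item_str, default_year):
    """Parse a naive start datetime from free text, or return None if no date is found."""
    month_day_match = re.search(r"[A-Z][a-z]{2,9} \d{1,2}", item_str)
    if month_day_match is None:
        return None

    # Only accept four-digit years starting with "20"; otherwise use the default
    year_match = re.search(r"\b20\d{2}\b", item_str)
    year_str = year_match.group() if year_match else str(default_year)

    # Minutes are optional: matches "9 a.m." as well as "9:30 a.m."
    time_match = re.search(r"\d{1,2}(:\d\d)? [apm.]{2,4}", item_str)
    time_str = time_match.group().replace(".", "") if time_match else "12 am"
    time_fmt = "%I:%M %p" if ":" in time_str else "%I %p"

    return datetime.strptime(
        f"{month_day_match.group()} {year_str} {time_str}",
        f"%B %d %Y {time_fmt}",
    )


on_the_hour = parse_start("Thursday, June 4, 2020 at 9 a.m.", default_year=2020)
with_minutes = parse_start("June 4, 2020 at 9:30 a.m., 4054 N Lincoln Ave", default_year=2020)
no_year = parse_start("July 16 at 2 p.m.", default_year=2020)
no_date = parse_start("Meeting schedule to be announced", default_year=2020)
```

Note that `%B` expects a full month name, so abbreviated months such as "Jun" would need a different format string or a normalization step first.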
