0572 spider chi ssa 38 #962

Open (3 commits to merge into base: main)

@jaspsingh jaspsingh commented Jun 24, 2020

Summary

Issue: #572

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

Questions

Include any questions you have about what you're working on.

…wrote spider parse_start function. WHY: wrote parse_start to extract unstructured date from page
@pjsier (Collaborator) left a comment

Thanks for the PR! This is looking good so far. Let me know if any of my comments aren't clear.


def _parse_classification(self, item):
    """Parse or generate classification from allowed options."""
    return NOT_CLASSIFIED

This should be COMMISSION for all meetings on this spider

def _parse_description(self, item):
    """Parse or generate meeting description."""
    description = ""
    return description

It's fine to just return "" instead of setting a variable first

def _parse_title(self, item):
    """Parse or generate meeting title."""
    title = "Chamber of Commerce"
    return title

Mentioned in _parse_description, but it's fine to just return the string without assigning to a variable first. It's a bit odd for SSAs, but this one should be "Commission". They're technically separate entities managed by a nonprofit
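
Taken together, the three suggestions so far (title, description, classification) could look roughly like the sketch below. `SpiderSketch` is a hypothetical stand-in for the spider class, and `COMMISSION` is defined locally with an assumed value so the example runs on its own; in the real spider it would be imported from the project's constants rather than redefined.

```python
# Stand-in for the COMMISSION constant the spider would import.
# The string value here is an assumption made for this sketch.
COMMISSION = "Commission"


class SpiderSketch:
    """Hypothetical stand-in for the spider class under review."""

    def _parse_title(self, item):
        """Parse or generate meeting title."""
        # Return the literal directly instead of assigning it first
        return "Commission"

    def _parse_description(self, item):
        """Parse or generate meeting description."""
        return ""

    def _parse_classification(self, item):
        """Parse or generate classification from allowed options."""
        return COMMISSION
```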

"name": "Northcenter Chamber of Commerce",
}

def _parse_links(self, item):

We'll need to parse a mapping of dates to relevant links from the page so that things like meeting minutes can be associated with the meetings listed. Here's an example of that:

def _parse_link_date_map(self, response):
    """Generate a defaultdict mapping of meeting dates and associated links"""
    link_date_map = defaultdict(list)
    for link in response.css(
        ".vc_col-sm-4.column_container:nth-child(1) .mk-text-block.indent16"
    )[:1].css("a"):
        link_str = link.xpath("./text()").extract_first()
        link_start = self._parse_start(link_str)
        if link_start:
            link_date_map[link_start.date()].append(
                {
                    "title": re.sub(r"\s+", " ", link_str.split(" – ")[-1]).strip(),
                    "href": link.attrib["href"],
                }
            )
    for section in response.css(
        ".vc_col-sm-4.column_container:nth-child(1) .vc_tta-panel"
    ):
        year_str = section.css(".vc_tta-title-text::text").extract_first().strip()
        for section_link in section.css("p > a"):
            link_str = section_link.xpath("./text()").extract_first()
            link_dt = self._parse_start(link_str, year=year_str)
            if link_dt:
                link_date_map[link_dt.date()].append(
                    {
                        "title": re.sub(
                            r"\s+", " ", link_str.split(" – ")[-1]
                        ).strip(),
                        "href": section_link.xpath("@href").extract_first(),
                    }
                )
    return link_date_map
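
The selectors in that example belong to a different page, so they would need to be adapted here. Leaving the selectors out, the date-to-links idea can be sketched with the standard library alone. Everything in this sketch is an assumption for illustration: the helper names (`parse_link_date`, `build_link_date_map`, `parse_links`), the `(text, href)` input pairs, and the "Month D, YYYY – Title" link text format.

```python
import re
from collections import defaultdict
from datetime import datetime


def parse_link_date(link_str):
    """Return the date in text like 'January 14, 2020 – Minutes', or None."""
    match = re.search(r"[A-Z][a-z]{2,8} \d{1,2},? \d{4}", link_str or "")
    if not match:
        return None
    try:
        return datetime.strptime(match.group().replace(",", ""), "%B %d %Y").date()
    except ValueError:
        # Matched text was not a real full month name and date
        return None


def build_link_date_map(links):
    """Group (text, href) pairs by the date found in the link text."""
    link_date_map = defaultdict(list)
    for link_str, href in links:
        link_date = parse_link_date(link_str)
        if link_date:
            title = re.sub(r"\s+", " ", link_str.split(" – ")[-1]).strip()
            link_date_map[link_date].append({"title": title, "href": href})
    return link_date_map


def parse_links(link_date_map, start):
    """Look up the links for a meeting by the date of its start datetime."""
    # .get avoids adding empty entries to the defaultdict on lookup
    return link_date_map.get(start.date(), [])


sample_links = [
    ("January 14, 2020 – Minutes", "https://example.com/minutes.pdf"),
    ("Contact us", "https://example.com/contact"),
]
sample_map = build_link_date_map(sample_links)
```

Building the map once per response and passing it to the link lookup keeps the per-meeting work to a single dictionary access.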

    return False

def _parse_location(self, item):
    # Meetings seemingly occurred at

It looks like an abbreviated version of this is in the meeting item. Since we don't have examples of another format, we can do something like this to return a default location if "4054" is present and raise an exception otherwise:

if "4054" not in item.extract():
  raise ValueError("Meeting location has changed")
return {
  "address": "4054 N Lincoln Ave, Chicago, IL 60618",
  "name": "Northcenter Chamber of Commerce",
}
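
To see both branches of that check, here is a standalone sketch that takes the already-extracted text instead of a selector. The function name `parse_location` and the sample strings are made up for illustration; only the address, the name, and the error message come from the snippet above.

```python
def parse_location(item_str):
    """Return the default location, or fail loudly if the address is missing."""
    if "4054" not in item_str:
        # Raising makes a changed venue visible instead of silently wrong
        raise ValueError("Meeting location has changed")
    return {
        "address": "4054 N Lincoln Ave, Chicago, IL 60618",
        "name": "Northcenter Chamber of Commerce",
    }


# Hypothetical meeting text containing the abbreviated address
known = parse_location("Meetings are held at 4054 N. Lincoln")

try:
    parse_location("Meetings are held at 123 Main St")
    location_changed = False
except ValueError:
    location_changed = True
```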


def _parse_start(self, item):
    """Parse start datetime as a naive datetime object."""
    date_str = item.extract()

It looks like we might be able to simplify this a bit, and we'll also need to handle situations where minutes may be supplied for the time. Haven't tested this, but something like this snippet could work:

item_str = item.extract()
month_day_str = re.search(r"[A-Z][a-z]{2,9} \d{1,2}", item_str).group()

year_match = re.search(r"\d{4}", item_str)
year_str = year_match.group() if year_match else ""
if not year_str.startswith("20"):
  year_str = str(datetime.today().year)  # Default to current year

# The minutes portion is optional, so both "6 p.m." and "6:30 p.m." match
time_match = re.search(r"\d{1,2}(:\d\d)? [apm.]{2,4}", item_str)
time_str = "12 am"
if time_match:
  time_str = time_match.group().replace(".", "")

time_fmt = "%I %p"
if ":" in time_str:
  time_fmt = "%I:%M %p"

return datetime.strptime(f"{month_day_str} {year_str} {time_str}", f"%B %d %Y {time_fmt}")
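
The minutes handling is easy to check on its own. The sketch below isolates the time portion, with the minutes group made optional so that both "6 p.m." and "6:30 pm" are picked up. The helper name `parse_time` and the sample strings are assumptions for illustration, not text taken from the page being scraped.

```python
import re
from datetime import datetime, time

# Hour, optional ":MM", a space, then am/pm with or without periods
TIME_PATTERN = re.compile(r"\d{1,2}(:\d\d)? [apm.]{2,4}")


def parse_time(item_str):
    """Return the meeting time found in the text, defaulting to midnight."""
    time_match = TIME_PATTERN.search(item_str)
    if not time_match:
        return time(0, 0)
    time_str = time_match.group().replace(".", "")
    # Pick the format that matches whether minutes were supplied
    time_fmt = "%I:%M %p" if ":" in time_str else "%I %p"
    return datetime.strptime(time_str, time_fmt).time()


hour_only = parse_time("January 14, 2020 at 6 p.m.")
with_minutes = parse_time("March 3, 2020, 6:30 pm")
no_time = parse_time("April 7, 2020")
```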
