
Scraper doesn't run due to Census data unavailability #157

Open
sydeaka opened this issue Oct 15, 2020 · 1 comment
sydeaka commented Oct 15, 2020

@nkrishnaswami

When I tried to run the scraper this evening, I got an error indicating that this 2018 US Census Excel file is no longer available. The file also fails to load when I paste the URL directly into a web browser.

https://www2.census.gov/programs-surveys/popest/geographies/2018/all-geocodes-v2018.xlsx

Unfortunately, this means we must halt daily scraper runs until this is resolved.

Do we have a local copy saved? Alternatively, could we modify the scraper so that it continues pulling the other data while skipping the unavailable Census file?
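One way to do the latter, sketched below under assumed names (`get_geocodes_content` and the `data/` path are hypothetical, not the repo's actual helpers), is to fall back to a repo-local copy of the file whenever the remote fetch fails:

```python
import os
from io import BytesIO

import requests

CODES_URL = ('https://www2.census.gov/programs-surveys/popest/'
             'geographies/2018/all-geocodes-v2018.xlsx')
# Hypothetical path for a repo-local fallback copy of the geocodes table.
LOCAL_COPY = 'data/all-geocodes-v2018.xlsx'


def get_geocodes_content(url=CODES_URL, local_copy=LOCAL_COPY):
    """Fetch the Census geocodes file, falling back to a cached copy.

    Returns a BytesIO suitable for pd.read_excel(). If the remote fetch
    fails (e.g. a 500 during Census maintenance) and a local copy exists,
    load that instead of aborting the whole run.
    """
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return BytesIO(r.content)
    except requests.exceptions.RequestException:
        if os.path.exists(local_copy):
            with open(local_copy, 'rb') as f:
                return BytesIO(f.read())
        raise  # no fallback available; surface the original error
```

FipsLookup could then call something like this instead of `get_content_as_file(self.CODES_URL)`, so a transient Census outage degrades to the cached table rather than killing the run.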

The error message is provided below.

2020-10-14 21:35:36,120 INFO covid19_scrapers.web_cache:  Connecting web cache to DB: work/web_cache.db
Traceback (most recent call last):
  File "run_scrapers.py", line 189, in <module>
    main()
  File "run_scrapers.py", line 165, in main
    registry_args=dict(enable_beta_scrapers=opts.enable_beta_scrapers),
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/__init__.py", line 61, in make_scraper_registry
    census_api = CensusApi(census_api_key)
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/census/census_api.py", line 31, in __init__
    self.fips = FipsLookup()
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/census/fips_lookup.py", line 22, in __init__
    df = pd.read_excel(get_content_as_file(self.CODES_URL), skiprows=4)
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/utils/http.py", line 100, in get_content_as_file
    return BytesIO(get_content(url, **kwargs))
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/utils/http.py", line 94, in get_content
    r = get_cached_url(url, **kwargs)
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/utils/http.py", line 59, in get_cached_url
    return UTILS_WEB_CACHE.fetch(url, **kwargs)
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/web_cache.py", line 263, in fetch
    response.raise_for_status()
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/covid19_data_test_003/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://www2.census.gov/programs-surveys/popest/geographies/2018/all-geocodes-v2018.xlsx
@sydeaka sydeaka added the bug Something isn't working label Oct 15, 2020
sydeaka commented Oct 15, 2020

Update: Shortly after I created the issue, refreshing the site revealed a message saying the system was down for maintenance, which likely explains why the file was unavailable. A short while later, the file was back online and the scraper run completed without incident.

I will leave this issue open so that we can work toward a solution that caches the 2018 data table and stores it in the repo for later reference.
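A minimal sketch of that snapshot step (the destination path and function name here are assumptions, not existing code in the repo):

```python
import os

import requests

CODES_URL = ('https://www2.census.gov/programs-surveys/popest/'
             'geographies/2018/all-geocodes-v2018.xlsx')
# Hypothetical destination; the repo's actual data layout may differ.
LOCAL_COPY = 'workflow/python/data/all-geocodes-v2018.xlsx'


def snapshot_geocodes(url=CODES_URL, dest=LOCAL_COPY):
    """Download the geocodes table and save it for later fallback use."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    resp = requests.get(url, timeout=30)
    # Don't overwrite a good snapshot with an error page.
    resp.raise_for_status()
    with open(dest, 'wb') as f:
        f.write(resp.content)
    return dest
```

Committing the resulting file would give every scraper run a stable fallback even when www2.census.gov is down for maintenance.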
