Add Excel parsing (have broken work-in-progress branch) #564

jzohrab · 2020-08-12T20:36:12Z

Description.

Some places, such as us-dc, report their data in Excel spreadsheets. Add crawl and parse support for that.

Work-in-progress branch `add-excel-parsing` on master repo

There are some libraries that parse xlsx. It seemed simple to add, but at the moment it breaks on crawl -- the crawled file is not a valid xlsx file. This branch contains a test (marked with `only`) that demonstrates the problem. All the test does is crawl a local "fake source" .xlsx file, and then check that the file in the crawler cache can be parsed in the same way as the "fake source":

$ git fetch upstream
$ git checkout -b add-excel-parsing upstream/add-excel-parsing
$ npm run test

... etc
  crawled Excel file has same parseable content as source

    sanity check of src sheets
    Sandbox Found Architect project manifest, starting up
    Created test cache /Users/jeff/Documents/Projects/li/zz-testing-fake-cache
    Created test report dir /Users/jeff/Documents/Projects/li/zz-reports-dir
    Wrote to local cache: /Users/jeff/Documents/Projects/li/zz-testing-fake-cache/excel-source/2020-08-12/2020-08-12t20_27_54.266z-default-59988.xlsx.gz
...

    x Error: End of data reached (data length = 10043, asked index = 347979759). Corrupted zip ? (fail at: undefined)

If we can get this crawl method to work, we can get Excel crawls and scrapes in general to work.
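Before involving an Excel library at all, it can help to check whether the cached bytes still match the source bytes. The following is a minimal sketch, assuming the cache file is a plain gzip of the downloaded file (the `.xlsx.gz` name in the log above suggests this). `compareWithCache` is a hypothetical helper, not something in the repo.

```javascript
const crypto = require('crypto')
const zlib = require('zlib')

// Hypothetical helper (not part of the repo): compare the source file's bytes
// with the gunzipped bytes of the cached copy.
function compareWithCache (sourceBytes, cachedGzBytes) {
  const cached = zlib.gunzipSync(cachedGzBytes)
  const sha256 = buf => crypto.createHash('sha256').update(buf).digest('hex')

  // Find the first offset where the two buffers differ. Reading past the end
  // of a Buffer yields undefined, so a length mismatch also counts as a diff.
  const limit = Math.max(sourceBytes.length, cached.length)
  let firstDiff = -1
  for (let i = 0; i < limit; i++) {
    if (sourceBytes[i] !== cached[i]) {
      firstDiff = i
      break
    }
  }

  return {
    sourceLength: sourceBytes.length,
    cachedLength: cached.length,
    identical: sha256(sourceBytes) === sha256(cached),
    firstDiff
  }
}
```

Calling it with `fs.readFileSync` on the fake source and on the cache file would show whether the crawl changed the file's length or contents, and roughly where the first difference is.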

Things tried to get crawl to work

The "crawl" method (src/events/crawler/crawler) actually calls src/http/get-get-normal/index.js to get the file. I've tried:

setting the Content-Type in get-get-normal
setting the content type in events/crawler/crawler/index.js (the got call)
a few other hacks!

Some other people ran into this trouble as well -- e.g. see SheetJS/sheetjs#337.
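One failure mode that can produce a "Corrupted zip" error like the one above is binary data passing through a text encoding somewhere between the HTTP function and the cache. Whether that is what happens here is an assumption, not something confirmed in this issue. The sketch below uses made-up stand-in bytes to show the effect: a round trip through a UTF-8 string changes the bytes, while a round trip through base64 does not.

```javascript
// Stand-in bytes (not a real xlsx): the ZIP signature "PK\x03\x04" followed
// by a few bytes that are not valid UTF-8.
const original = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0xff, 0xfe, 0x80, 0x00])

// Round trip through a UTF-8 string: invalid sequences are replaced with
// U+FFFD, so the resulting bytes no longer match the original.
const viaUtf8 = Buffer.from(original.toString('utf8'), 'utf8')

// Round trip through base64: every byte survives.
const viaBase64 = Buffer.from(original.toString('base64'), 'base64')

console.log('utf8 round trip intact:  ', original.equals(viaUtf8))
console.log('base64 round trip intact:', original.equals(viaBase64))
```

If the file is returned from an HTTP function as a string body, base64-encoding it on the way out and decoding it in the crawler is one option to try. API Gateway-style Lambda responses carry an `isBase64Encoded` flag for this purpose.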

A minimal repo

... demonstrating this is at https://github.com/covidatlas/arc-excel-downloading-trouble.

jzohrab · 2020-08-13T15:50:23Z

Done and merged!

jzohrab added the "enhancement" and "help wanted" labels Aug 12, 2020

jzohrab closed this as completed Aug 13, 2020
