Specify unique id #23

magick93 · 2021-06-20T05:40:33Z

Is your feature request related to a problem? Please describe.

I would like to be able to specify a column as holding a unique value, which would make the downloaded data more convenient to use, including across subsequent runs.

Describe the solution you'd like
When persisting to disk, currently each scrape run creates a new folder, and each item (row in a table) has a randomly generated, unique ID.

If I were able to specify which column in the table had a unique ID, then:

- rather than creating a new folder, the same folder could be used
- rather than creating a new file for each table row, the same file could be used (e.g., similar to an upsert)
- when a new item is added, a new file is created (e.g., similar to an insert)
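A minimal sketch of the behaviour described above, using only the Python standard library. It is not spatula's implementation; `persist_item` and the `id_field` parameter are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path


def persist_item(item: dict, output_dir: Path, id_field: str) -> Path:
    """Write one item to <output_dir>/<unique id>.json.

    Hypothetical helper: an item whose ID was seen before overwrites its
    earlier file (upsert), and an item with a new ID gets a new file (insert).
    """
    # Reuse the same folder on every run instead of creating a fresh one.
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{item[id_field]}.json"
    # sort_keys keeps the serialized form stable from run to run.
    path.write_text(json.dumps(item, indent=2, sort_keys=True))
    return path
```

This assumes the value in the unique column is safe to use as a file name; real data may need sanitizing first.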

Describe alternatives you've considered

Alternatives include using a database, which would involve a lot more development overhead. This solution is, IMO, more lightweight, as it basically uses the local file system as the database. It does depend on the data provider having some kind of unique ID, and on the scraper developer being able to identify and use this ID.

Additional context

Additionally, it would be good if, on subsequent runs/scrapes, spatula could read in the already persisted JSON, compare it to what is being scraped, and only persist if there are differences.

Then, for example, if scrape results are committed to git, only those that have changed will be committed.
Also, it would be easy to see, using the filesystem modified timestamp, which items have been recently updated.
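The compare-before-write idea could look roughly like the sketch below. It is an illustration under stated assumptions, not something spatula provides: `save_if_changed` is a hypothetical name, and the item is assumed to be a plain dict that survives a JSON round trip unchanged.

```python
import json
from pathlib import Path


def save_if_changed(item: dict, path: Path) -> bool:
    """Write item to path only when it differs from the JSON already there.

    Hypothetical helper. Returns True if the file was written and False if it
    was left untouched, in which case its modified time is preserved.
    """
    if path.exists():
        try:
            existing = json.loads(path.read_text())
        except json.JSONDecodeError:
            existing = None  # unreadable file: treat as changed and rewrite it
        if existing == item:
            return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(item, indent=2, sort_keys=True))
    return True
```

Comparing parsed data rather than raw text means a change in key order or indentation alone does not count as a difference.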


jamesturk · 2021-06-21T19:01:43Z

thanks for this, I like this idea & think it is definitely worth exploring

jamesturk · 2021-06-22T17:39:02Z

I'm thinking of tackling this in 3 pieces:

1. allow specification of "primary key" & use primary key to name files
2. allow using same folder for each scrape
3. modify save code to only save if there are changes (so that modified date doesn't update)

I'm realizing that 1 and 2 are actually already nearly possible, but the UX here could be better (or at least better documented).

- If you define get_filename on your output type, that will be used instead of the UUID.
- If you pass --output-dir to spatula scrape, that directory will be used instead of a randomly generated one. (Right now, however, the scrape will not run with --output-dir if the directory isn't empty.)
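As a rough illustration of the first point, an output type with get_filename might look like the sketch below. The `Person` class and its fields are hypothetical, and the exact contract (for example, whether the returned name should include an extension) is an assumption here, so check the spatula documentation before relying on it.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical output type; person_id stands in for whatever unique ID
    # the data provider exposes.
    person_id: str
    name: str

    def get_filename(self) -> str:
        # Assumption: the returned string is used as the base file name
        # in place of the randomly generated UUID.
        return f"person-{self.person_id}"
```

Because the name is derived from the provider's ID rather than a random UUID, the same item maps to the same file on every run.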

I'm wondering if you have any thoughts about the UX here, is get_filename a suitable option instead of defining a specific primary key? Perhaps get_primary_key should be available instead?

I'm thinking at least the first two pieces of this can fit into 0.9. I'm undecided on the modified date piece since it seems like it'll come with a whole new set of edge cases, and the mentioned use case of checking results into git already works as long as the output itself doesn't change.

magick93 · 2021-06-22T20:15:47Z

Thanks @jamesturk

> I'm wondering if you have any thoughts about the UX here, is get_filename a suitable option instead of defining a specific primary key? Perhaps get_primary_key should be available instead?

Yes, I think get_filename is a suitable option.

Another option, though it may not always be available, is using the column name for the primary key.

magick93 added the enhancement label Jun 20, 2021
