Specify unique id #23

Open
magick93 opened this issue Jun 20, 2021 · 3 comments

Comments

@magick93

magick93 commented Jun 20, 2021

Is your feature request related to a problem? Please describe.

I would like to be able to mark a column as holding a unique value, which would make the downloaded data more convenient to use, including on subsequent runs.

Describe the solution you'd like
When persisting to disk, currently each scrape run creates a new folder, and each item (row in a table) has a randomly generated, unique ID.

If I were able to specify which column in the table had a unique ID, then:

  • rather than creating a new folder, the same folder could be used
  • rather than creating a new file for each table row, the same file could be used (eg, similar to upsert)
  • when a new item is added, then a new file is created (eg, similar to an insert)
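The three behaviours above can be sketched in a few lines of Python. This is only an illustration of the idea, not spatula's code: `upsert_item`, `key_field`, and the `<key>.json` naming are hypothetical names chosen for the example, and it assumes each scraped row is a plain dict with a unique, filename-safe key.

```python
import json
import tempfile
from pathlib import Path


def upsert_item(item: dict, key_field: str, output_dir: Path) -> Path:
    """Write `item` to <output_dir>/<key>.json, replacing any earlier copy.

    Hypothetical helper for illustration; not part of spatula.
    """
    key = str(item[key_field])
    # Refuse keys that are empty or could point outside the output directory.
    if not key or "/" in key or "\\" in key or key in {".", ".."}:
        raise ValueError(f"unsafe key for a filename: {key!r}")
    # Reuse the same folder on every run instead of creating a new one.
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{key}.json"
    # Same key -> same file (update); new key -> new file (insert).
    path.write_text(json.dumps(item, indent=2, sort_keys=True))
    return path
```

Calling `upsert_item` twice with the same `id` leaves one file holding the latest data, while a row with a new `id` produces a new file alongside it.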

Describe alternatives you've considered

Alternatives include using a database, which would involve a lot more development overhead. This solution is, IMO, more lightweight, as it basically uses the local file system as the database. It does depend on the data provider having some kind of unique ID, and on the scraper developer being able to identify and use this ID.

Additional context

Additionally, it would be good if, on subsequent runs/scrapes, spatula could read in the already persisted JSON, compare it to what is being scraped, and only persist if there are differences.

  • Then, for example, if scrape results are committed to git, only those that have been changed will be committed.
  • Also, it would be easy to see, using the filesystem modified field, which has been recently updated.
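A minimal sketch of the "only persist if there are differences" idea, assuming items are JSON-serializable dicts stored one per file. `save_if_changed` is a hypothetical name for this example and does not reflect how spatula saves output.

```python
import json
import tempfile
from pathlib import Path


def save_if_changed(item: dict, path: Path) -> bool:
    """Write `item` as JSON only if it differs from the file on disk.

    Returns True if the file was written, False if it was left untouched.
    Hypothetical helper for illustration; not part of spatula.
    """
    # Round-trip through JSON so tuples, int keys, etc. compare the same
    # way they will look once they have been read back from disk.
    normalized = json.loads(json.dumps(item))
    if path.exists():
        try:
            existing = json.loads(path.read_text())
        except json.JSONDecodeError:
            existing = None  # unreadable file: treat it as changed
        if existing == normalized:
            return False  # no write, so the modified time is preserved
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(normalized, indent=2, sort_keys=True))
    return True
```

Because unchanged files are never rewritten, their filesystem modified time stays put, and only the files that were actually written carry a fresh timestamp.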
@jamesturk
Owner

thanks for this, I like this idea & think it is definitely worth exploring

@jamesturk
Owner

I'm thinking of tackling this in 3 pieces:

  • allow specification of "primary key" & use primary key to name files
  • allow using same folder for each scrape
  • modify save code to only save if there are changes (so that modified date doesn't update)

I'm realizing that 1 and 2 are actually already nearly possible, but the UX here could be better (or at least better documented).

  • If you define get_filename on your output type, that will be used instead of the UUID.
  • If you pass --output-dir to spatula scrape, that directory will be used instead of a randomly generated one. (Right now, however, the scrape will not run if the directory isn't empty.)
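As a rough illustration of the get_filename hook described above, an output type could derive its filename from its own unique field. Everything here other than the method name `get_filename` is an assumption: the `Person` class, its fields, and the `filename_for` helper are hypothetical, and `filename_for` only mimics the described "use get_filename, otherwise fall back to a UUID" behaviour rather than reproducing spatula's code. Whether the returned name should include an extension is also an assumption to check against the spatula documentation.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical output type with a unique ID from the data provider."""

    person_id: str
    name: str

    def get_filename(self) -> str:
        # Name the file after the unique ID so reruns reuse the same file.
        return f"person-{self.person_id}"


def filename_for(item) -> str:
    """Mimic the described behaviour: prefer get_filename, else a random UUID.

    Hypothetical stand-in for illustration; not spatula's implementation.
    """
    get_filename = getattr(item, "get_filename", None)
    if callable(get_filename):
        return get_filename()
    return str(uuid.uuid4())
```

With this shape, `filename_for(Person("42", "Ann"))` yields a stable name across runs, while an item without `get_filename` gets a fresh random name each time.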

I'm wondering if you have any thoughts about the UX here, is get_filename a suitable option instead of defining a specific primary key? Perhaps get_primary_key should be available instead?

I'm thinking at least the first two pieces of this can fit into 0.9. I'm undecided on the modified date piece since it seems like it'll come with a whole new set of edge cases, and the mentioned use case of checking results into git already works as long as the output itself doesn't change.

@magick93
Author

Thanks @jamesturk

I'm wondering if you have any thoughts about the UX here, is get_filename a suitable option instead of defining a specific primary key? Perhaps get_primary_key should be available instead?

Yes, I think get_filename is a suitable option.

Another option, though it may not always be available, is using the column name for the primary key.

2 participants