[RFC] New event-based history storage #50

Open
berekuk opened this issue Apr 7, 2022 · 1 comment

Comments

@berekuk
Collaborator

berekuk commented Apr 7, 2022

This is something I've had in mind for the last few days, but it's still incomplete.

Right now the history table is populated with question snapshots. This seems suboptimal:

  • We'll probably want to normalize the questions data eventually, e.g. extract options into a separate table (for performance and to unlock more complex SQL queries), and it's unclear how to adapt the current history schema for that
  • The history table is denormalized and includes a lot of duplicate data; it also grows in proportion to how often we fetch the sources, so it doesn't play well with the plans from "Independent update schedules for different platforms" (#35)
  • Performance will probably suffer too; this might affect "Figure out how to display forecast history" (#28), though I'm not sure by how much

Alternative: implement an event-based storage which tracks only the changes in fields.

E.g., list of fields for the new table:

  • pk (serial id)
  • question_id
  • field (can be title, description, stars)
  • value (new value)
  • timestamp

Index by question_id + field (it can only be unique if timestamp is included, since the same field can change more than once).

This table would be populated only if the field value has changed. If the field hasn't changed from the previous fetch then there's no need to save it again.
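
A minimal sketch of how this could work, assuming SQLite as a stand-in for the real database. The table name `history_events`, the helper names, and the choice to store every value as text are all hypothetical and not part of the proposal. The index here is non-unique and includes the timestamp, because each (question_id, field) pair gets a new row on every change.

```python
import sqlite3

# Hypothetical schema mirroring the field list above.
SCHEMA = """
CREATE TABLE history_events (
    pk INTEGER PRIMARY KEY AUTOINCREMENT,
    question_id TEXT NOT NULL,
    field TEXT NOT NULL,
    value TEXT,
    timestamp INTEGER NOT NULL
);
CREATE INDEX history_events_question_field
    ON history_events (question_id, field, timestamp);
"""

TRACKED_FIELDS = ("title", "description", "stars")


def latest_value(conn, question_id, field):
    """Return (found, value) for the most recent event of this field."""
    row = conn.execute(
        "SELECT value FROM history_events "
        "WHERE question_id = ? AND field = ? "
        "ORDER BY timestamp DESC, pk DESC LIMIT 1",
        (question_id, field),
    ).fetchone()
    return (row is not None, row[0] if row else None)


def record_changes(conn, question, fetched_at):
    """Insert one event per tracked field whose value differs from the last stored one."""
    changed = []
    for field in TRACKED_FIELDS:
        # Values are stored as text so that all fields share one column (an assumption).
        new_value = None if question.get(field) is None else str(question[field])
        found, old_value = latest_value(conn, question["id"], field)
        if found and old_value == new_value:
            continue  # unchanged since the previous fetch: store nothing
        conn.execute(
            "INSERT INTO history_events (question_id, field, value, timestamp) "
            "VALUES (?, ?, ?, ?)",
            (question["id"], field, new_value, fetched_at),
        )
        changed.append(field)
    return changed


def snapshot_at(conn, question_id, as_of):
    """Rebuild the tracked fields as they were at time `as_of`."""
    rows = conn.execute(
        "SELECT field, value FROM history_events "
        "WHERE question_id = ? AND timestamp <= ? "
        "ORDER BY timestamp, pk",
        (question_id, as_of),
    ).fetchall()
    return dict(rows)  # later rows overwrite earlier ones
```

Fetching the same unchanged question twice adds rows only on the first fetch, so the table grows with the number of actual changes and not with the fetch frequency. `snapshot_at` shows that a full snapshot can still be rebuilt by replaying the events in order.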

This proposal is incomplete:

  • it doesn't explain how to track "deep" properties, e.g. if a question had a change in one of the option titles, it's unclear what to put in field
  • I'm still confused about "we just fetched the new question data with its entire forecast history from the platform" (because the platform provides the historical data) vs "we fetched the new question data and stored its snapshot in our history table". These are two different scenarios; ideally we need to handle both and abstract the difference away from the end users
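
For the "deep" properties question, one possible convention (an assumption, not something settled here) is to flatten nested data into dotted paths and use the path as the field value, e.g. `options.0.name`. The sketch below illustrates that idea; the `options` and `name` keys and both function names are hypothetical.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into {path: leaf} using dotted paths."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def diff_fields(old, new):
    """Return {path: new_leaf} for changed paths; removed paths map to None."""
    old_flat, new_flat = flatten(old), flatten(new)
    changes = {
        path: leaf
        for path, leaf in new_flat.items()
        if path not in old_flat or old_flat[path] != leaf
    }
    changes.update({path: None for path in old_flat if path not in new_flat})
    return changes


old = {"title": "Q", "options": [{"name": "Yes"}, {"name": "No"}]}
new = {"title": "Q", "options": [{"name": "Yes!"}, {"name": "No"}]}
print(diff_fields(old, new))  # only the renamed option shows up as a change
```

Index-based paths are fragile if a platform reorders its options between fetches, so a stable per-option key would be a safer path component where one exists.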

I'll think about this some more before making any code changes, and I'll wait until I'm more familiar with the specifics of the different platforms that we support. Just throwing this idea out there to gestate for now.

@berekuk
Collaborator Author

berekuk commented Apr 7, 2022

Also, this is a better approach if we ever get more platforms with realtime capabilities (e.g., with webhooks for every event that happens on the platform), or if we implement pseudo-realtime capabilities ourselves (e.g., "fetch the Metaculus frontpage ordered by activity, and refetch only the questions which we haven't seen yet").
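
A rough sketch of the pseudo-realtime idea, under stated assumptions: the listing is a list of (question_id, last_activity) pairs sorted with the newest activity first, and we keep a map of the last activity we already processed for each question. The function name and both data shapes are hypothetical; no real platform API is implied.

```python
def questions_to_refetch(frontpage, last_seen):
    """Return ids from an activity-ordered listing that have activity we haven't processed.

    frontpage: list of (question_id, last_activity) pairs, newest activity first.
    last_seen: dict mapping question_id to the last_activity already processed.
    """
    stale = []
    for question_id, last_activity in frontpage:
        if question_id in last_seen and last_seen[question_id] >= last_activity:
            # Assumes earlier passes already covered everything with older activity.
            break
        stale.append(question_id)
    return stale
```

Only the returned questions would need a full refetch, and each refetch would then produce events only for the fields that actually changed.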
