Incremental fetchers #91
base: master
Note on deletions. Several possible solutions to consider:
I lean towards (1), though I don't like that it'll require a significant amount of new code.
@NunoSempere I'd appreciate any feedback you have on this. I might be missing some corner cases, since I still haven't read the code for all the platforms carefully.
Ok, looking at this, I don't understand what type of pattern the following is:
async fetcher({ robot, storage }) {
  ...
}
Should this be something like:
No comments for now while I understand what the code is doing.
It's a shorthand;
is the same as
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Method_definitions
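A minimal sketch of the equivalence described on that MDN page. The `FetcherArgs` type and the method bodies are hypothetical and only there to make the example self-contained; they are not the PR's actual fetcher.

```typescript
// Hypothetical argument type, for illustration only.
type FetcherArgs = { robot: string; storage: string };

// Method-definition shorthand inside an object literal.
const shorthand = {
  async fetcher({ robot, storage }: FetcherArgs): Promise<string> {
    return `${robot}:${storage}`;
  },
};

// The same thing written as a property holding an async function expression.
const longhand = {
  fetcher: async function ({ robot, storage }: FetcherArgs): Promise<string> {
    return `${robot}:${storage}`;
  },
};
```

Both objects expose a `fetcher` method that takes one destructured argument and returns a promise, so callers can't tell them apart in ordinary use.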
Ok, so looking through this, I think I would have tended to do something much hackier, like saving the page for APIs that implement pagination. Overall I'm not sure how to judge this, though; the approach is a bit more complicated and, as you mention, it will take some tweaks to make the robot conform to the different APIs of all the platforms.
On question deletion, note that we do want to keep questions after they resolve, even if we don't show them on the frontpage.
That would help with the interruptible metaculus fetcher, but the main reason for this PR is the future near-real-time capabilities, which are impossible to get with the current "once in 24 hours" approach.
Right. Scenarios when deletion is necessary I can think of are:
|
Makes sense |
This is a draft for #35 and #36, and it's not ready yet, but the changes are significant and I want to braindump my thoughts on it.
So, currently all platform modules fetch all questions and then store a huge array in the DB (and then on Algolia).
As I mentioned in #36, I'd like to change that.
Sidenote: I spent several hours today fighting the new metaculus fetcher, which failed for one reason or another (mostly because of excessive validation, but also once because one question was on the frontpage and ON DELETE was set to restrict instead of cascade). Every time I had to wait until it got past the last point of failure, only to have it fail again further down the road.
I really don't like having such a long feedback loop to get some initial results; also, the current architecture gets in the way when I want to get some questions into my dev DB. Though I've recently implemented the npm run cli metaculus -- --id=12345 command, what I really want is to say "fetch some stuff for this platform" without waiting several hours for the script to finish.
Of course, there are also other benefits to doing this: getting us closer to real-time capabilities, etc.
The basic idea is: we crawl the graph of urls; there are some leaf nodes (question page urls or graphql endpoints with questions data or whatever) and some intermediate nodes which allow us to discover leaf nodes, e.g. /api2/questions/ on metaculus, which doesn't give us full data but gives us urls for other api pages with full data.
To store the progress we can use a table (Robot) with jobs as rows; each job includes a url, a json context, and some metadata for when the job was created and whether it was completed. Then we can encapsulate the common pattern of "keep fetching stuff while there's still stuff to process" behind a common API.
Here's a draft which uses this approach:
Notes on this example:
- [...] storage.save instead
- [...] maxAge values; e.g., it's easy to schedule a metaculus frontpage with a small maxAge and crawl urls from it more frequently
- storage.save will also update history and algolia synchronously, no need to do it in a separate step

In the future, we could also:
Stuff I'm still figuring out:
- [...] maxAge and repeatAfter is the right approach, still experimenting with this
- [...] DELETE FROM "Robot" WHERE platform = 'myplatform', not sure if we need anything more
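
For illustration, the job-table idea described above can be sketched roughly as follows. This is not the draft from this PR: an in-memory array stands in for the Robot table, and `InMemoryRobot`, `schedule`, `nextJob`, `complete`, `crawl`, and `Page` are all hypothetical names. It also ignores maxAge and repeatAfter entirely and never re-schedules a url it has already seen.

```typescript
// Hypothetical sketch: an in-memory stand-in for the "Robot" jobs table.
type Job = {
  url: string;
  context: Record<string, unknown>; // the "json context" of the job
  createdAt: number;                // ms timestamp
  completedAt: number | null;       // null while the job is pending
};

type Question = { id: string; title: string };

// What a fetched url yields: links to discover, and optionally a question (leaf node).
type Page = { links: string[]; question?: Question };

class InMemoryRobot {
  private jobs: Job[] = [];

  // Add a job unless this url was already scheduled (no maxAge handling here).
  schedule(url: string, context: Record<string, unknown> = {}): void {
    if (!this.jobs.some((j) => j.url === url)) {
      this.jobs.push({ url, context, createdAt: Date.now(), completedAt: null });
    }
  }

  // Oldest pending job, or undefined when nothing is left to process.
  nextJob(): Job | undefined {
    return this.jobs.find((j) => j.completedAt === null);
  }

  complete(job: Job): void {
    job.completedAt = Date.now();
  }
}

// Process pending jobs until the queue is drained or maxJobs is reached.
async function crawl(
  robot: InMemoryRobot,
  fetchPage: (url: string) => Promise<Page>,
  save: (q: Question) => void,
  maxJobs: number = Infinity
): Promise<number> {
  let processed = 0;
  while (processed < maxJobs) {
    const job = robot.nextJob();
    if (job === undefined) break;
    const page = await fetchPage(job.url);
    page.links.forEach((link) => robot.schedule(link)); // intermediate node: discover more urls
    if (page.question) save(page.question);             // leaf node: store the question
    robot.complete(job);
    processed++;
  }
  return processed;
}
```

Because progress lives in the job list rather than in the running script, a run capped with `maxJobs` can stop early and a later run picks up where it left off, which is the "fetch some stuff for this platform" workflow mentioned earlier.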