Skip articles that haven't changed between dumps #9

Open
newsch opened this issue Jun 26, 2023 · 2 comments

Comments

@newsch
Collaborator

newsch commented Jun 26, 2023

The dump schema includes a date_modified timestamp and other revision metadata.

To reduce disk I/O, we could store some metadata alongside each article, compare it against the metadata in the new dump when processing, and skip articles that haven't changed.

One way to do this would be to store the date_modified timestamp as the last-modified time (mtime) of the article file.
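
A rough sketch of that idea, in Python for brevity and not taken from the project's code. It assumes date_modified is an ISO 8601 string such as `2023-06-01T12:00:00Z`; the helper names (`parse_timestamp`, `is_unchanged`, `write_if_changed`) are hypothetical and only for illustration.

```python
import os
from datetime import datetime
from pathlib import Path


def parse_timestamp(date_modified: str) -> float:
    """Convert an ISO 8601 timestamp (assumed format) to Unix seconds."""
    # Older Python versions do not accept a trailing 'Z', so normalize it.
    normalized = date_modified.replace("Z", "+00:00")
    return datetime.fromisoformat(normalized).timestamp()


def is_unchanged(path: Path, date_modified: str) -> bool:
    """True if the file exists and its mtime matches the dump's timestamp."""
    try:
        stored = path.stat().st_mtime
    except FileNotFoundError:
        return False
    # Compare whole seconds, since some filesystems store coarse mtimes.
    return int(stored) == int(parse_timestamp(date_modified))


def write_if_changed(path: Path, html: str, date_modified: str) -> bool:
    """Write the article unless it is unchanged; return True if written."""
    if is_unchanged(path, date_modified):
        return False
    path.write_text(html, encoding="utf-8")
    # Record the dump's date_modified as the file's access and modified times.
    ts = parse_timestamp(date_modified)
    os.utime(path, (ts, ts))
    return True
```

With this approach the check costs one `stat` call per article instead of a rewrite, but it only helps if nothing else touches the files' mtimes between runs.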

@newsch newsch mentioned this issue Jun 26, 2023
5 tasks
@biodranik
Member

An interesting optimization, but it may not be worth it. We need to prove its benefits first. Let's leave it at very low priority for now.

@newsch
Collaborator Author

newsch commented Jun 26, 2023

Understood. I've been thinking about it since you mentioned it here; we'll see what the profiling shows for the workflow.
