URL availability optimization #6

Open · jpmckinney opened this issue May 5, 2021 · 1 comment

Labels: dataset checks (Relating to dataset-level checks), performance

Comments

@jpmckinney
Member

(Migrated from GitLab)

The URL availability dataset check should be considered for further improvement. Its current behavior is to randomly pick 100 URLs from the whole dataset and then visit the websites sequentially. The problem with this approach is that the timeout is currently set to 30 seconds, so in the worst case this check alone can take up to 50 minutes (100 URLs × 30 s). Hitting the worst case is surprisingly common when the same slow or unresponsive website is linked many times in a given dataset (e.g. https://dqt.datlab.eu/dataset/108/detail/misc.url_availability).
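
For reference, a minimal sketch of the current behavior as described above (not the actual implementation; the function and constant names are illustrative only):

```python
import random
import requests

TIMEOUT = 30       # current per-request timeout in seconds
SAMPLE_SIZE = 100  # number of URLs randomly picked from the dataset


def check_url_availability(dataset_urls):
    """Sequentially check a random sample of URLs (100 x 30 s = up to 50 minutes)."""
    sample = random.sample(dataset_urls, min(SAMPLE_SIZE, len(dataset_urls)))
    results = {}
    for url in sample:
        try:
            response = requests.head(url, timeout=TIMEOUT, allow_redirects=True)
            results[url] = response.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results
```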

The suggested solutions are:

  • lowering the timeout period (might lead to false negatives due to insufficient wait time)
  • checking multiple URLs in parallel (might lead to false negatives due to server overload; see the sketch after this list)
  • checking a smaller number of URLs
  • checking just the base URLs
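
A hedged sketch of what the first two options combined might look like, assuming a plain requests/ThreadPoolExecutor approach; the timeout and worker-pool values are illustrative assumptions, not the project's actual code:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

TIMEOUT = 10   # lower timeout (risk: false negatives on slow servers)
WORKERS = 5    # modest parallelism (risk: overloading a repeatedly linked server)


def check_one(url):
    """Return (url, is_available) for a single URL."""
    try:
        response = requests.head(url, timeout=TIMEOUT, allow_redirects=True)
        return url, response.status_code < 400
    except requests.RequestException:
        return url, False


def check_urls(urls):
    """Check URLs concurrently with a small worker pool."""
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        return dict(pool.map(check_one, urls))
```

With these example values, 100 URLs would take at most roughly 100 / 5 × 10 s ≈ 3–4 minutes instead of 50, at the cost of the false-negative risks noted above.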
jpmckinney added the "dataset checks" label on May 5, 2021
@jpmckinney
Member Author

See also #85
