The URL availability dataset check should be considered for further improvement. Its current behavior is that 100 URLs are randomly sampled from the whole dataset and then visited sequentially, with a per-request timeout of 30 seconds. In the worst case, this single check can therefore take up to 50 minutes (100 × 30 s). This is surprisingly not uncommon when the same slow or unresponsive website is linked many times in a given dataset (e.g. https://dqt.datlab.eu/dataset/108/detail/misc.url_availability).
The suggested solutions are:
- lowering the timeout period (might lead to false negatives due to insufficient wait time)
- checking multiple URLs in parallel (might lead to false negatives due to server overload)
- checking a smaller number of URLs
- checking just the base URLs
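A minimal sketch combining these ideas: a lower timeout, modest parallelism, and deduplication to base URLs before checking. All names (`base_url`, `check_urls`) and the specific values (5 s timeout, 10 workers) are illustrative assumptions, not the project's actual implementation.

```python
import concurrent.futures
import random
import urllib.parse
import urllib.request

TIMEOUT = 5        # seconds; lowered from 30 (assumption: may cause false negatives)
SAMPLE_SIZE = 100  # same sample size as the current check
MAX_WORKERS = 10   # modest parallelism to avoid overloading servers

def base_url(url):
    """Reduce a URL to its scheme://host root, so each site is checked once."""
    parts = urllib.parse.urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

def check(url):
    """Return True if the URL responds with a non-error status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            return resp.status < 400
    except Exception:
        return False

def check_urls(urls, sample_size=SAMPLE_SIZE):
    """Sample URLs, deduplicate to base URLs, and check them in parallel."""
    sample = random.sample(urls, min(sample_size, len(urls)))
    targets = sorted({base_url(u) for u in sample})  # dedupe repeated hosts
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return dict(zip(targets, pool.map(check, targets)))
```

With deduplication, a dataset that links the same website many times produces only one request per host, so the pathological 50-minute case collapses to a handful of checks.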
(Migrated from GitLab)