Adding progress bar for crawl #193
Replies: 2 comments
-
Thanks @dsottimano, I like the idea, and we can have a simple script that provides the latest status. The progress bar would mainly work if we have a known number of URLs (crawling in list mode). A potential solution for the current crawl status is based on the fact that the output file gets appended to whenever a batch of pages is crawled. So, at any point while crawling, you can open and analyze the available file. In a separate notebook/session, you can open the file and see how many pages have been crawled, or get a more informative crawl status as you mention, with status codes for example:

```python
import pandas as pd

df = pd.read_json("output_file.jl", lines=True)
df["status"].value_counts()
```

```
200    159
404      2
503      1
Name: status, dtype: int64
```
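The check above can be wrapped in a small polling loop so you don't have to re-run it manually. A minimal sketch, assuming the crawl writes to `output_file.jl` as in the example; the poll interval and count are placeholders:

```python
import time

import pandas as pd


def crawl_status(path="output_file.jl", interval=30, polls=3):
    """Periodically re-read the crawl output file and print status counts.

    `interval` (seconds between reads) and `polls` (number of reads) are
    arbitrary defaults; adjust for the size and speed of your crawl.
    """
    for _ in range(polls):
        # The .jl file is JSON Lines; re-reading picks up newly appended pages.
        df = pd.read_json(path, lines=True)
        counts = df["status"].value_counts()
        print(f"{len(df)} pages crawled; status codes:\n{counts}\n")
        time.sleep(interval)
```

Run this in a separate notebook/session while the crawl is writing to the file.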
-
Hello,
This will update the total. We can also update other counts, etc.
-
When logging to a file, there is no visible feedback in a notebook on how many pages have been crawled. It might be an idea to incorporate https://tqdm.github.io/ for crawl progress, and potentially also use the progress bar to report on status codes (# of 2xx, 3xx, 4xx, 5xx).