check_for_errors: check scrapy log file
duncandewhurst committed Apr 19, 2022
1 parent 5e6259a commit 231df67
Showing 1 changed file with 39 additions and 3 deletions.
42 changes: 39 additions & 3 deletions check_for_errors.ipynb
@@ -4,8 +4,7 @@
   "metadata": {
     "colab": {
       "name": "data_collection_and_processing_errors",
-      "provenance": [],
-      "authorship_tag": "ABX9TyPffcgelqfpE7r9y+mDKzan"
+      "provenance": []
     },
     "kernelspec": {
       "name": "python3",
@@ -22,6 +21,43 @@
         "## Check for data collection and processing errors"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Kingfisher Collect Log"
+      ],
+      "metadata": {
+        "id": "DWcRuKnZt--_"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Print the crawler statistics from the log file specified in the setup section. If `downloader/response_status_count/{code}` is non-zero and `{code}` is an HTTP error code (400-599), then the collection may be incomplete. Where possible, you should check the total number of releases and/or contracting processes against the front-end of the data source."
+      ],
+      "metadata": {
+        "id": "YoxNFk17uFZe"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "if log_url != '':\n",
+        "\n",
+        "    response = requests.get(log_url, auth=('scrape', scrapy_password))\n",
+        "\n",
+        "    with open('log_file', 'wb') as f:\n",
+        "        f.write(response.content)\n",
+        "    \n",
+        "    log = ScrapyLogFile('log_file').logparser\n",
+        "    pprint(dict(log['crawler_stats']))"
+      ],
+      "metadata": {
+        "id": "kfzwh_ExuEVX"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -409,4 +445,4 @@
       "outputs": []
     }
   ]
-}
+}
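
The markdown cell added in this commit says a crawl may be incomplete when any `downloader/response_status_count/{code}` stat with an HTTP error code (400-599) is non-zero. As a rough illustration of that check — a minimal sketch, not part of the commit, where `find_http_errors` is a hypothetical helper and `crawler_stats` is assumed to be the dict printed by the new code cell (`dict(log['crawler_stats'])`) — one might flag error responses like this:

```python
import re

def find_http_errors(crawler_stats):
    """Return {status_code: count} for non-zero 4xx/5xx response counts.

    Hypothetical helper sketching the check described in the commit's new
    markdown cell; `crawler_stats` is assumed to be the dict printed by
    the notebook cell above, i.e. dict(log['crawler_stats']).
    """
    errors = {}
    for key, count in crawler_stats.items():
        # Scrapy records one stat per observed status code under this prefix.
        match = re.fullmatch(r'downloader/response_status_count/(\d{3})', key)
        if match and count:
            code = int(match.group(1))
            if 400 <= code <= 599:
                errors[code] = count
    return errors

# Example with made-up stats: a crawl that saw 12 '503 Service Unavailable'
# responses would be flagged as potentially incomplete.
stats = {
    'downloader/response_status_count/200': 1500,
    'downloader/response_status_count/503': 12,
}
print(find_http_errors(stats))  # -> {503: 12}
```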
