反爬虫通知 | Anti-Crawler Notification #63
Thanks to the author; fully supported.
IP addresses 117.131.40.211, 116.228.231.98, and 58.35.10.68 appear to be looping requests against this server. Requests from all three IPs carry the same Referer (a page on a non-80/443 port inside a LAN) and the same request headers. Before the response frequency was limited to 2 per second, they accounted for nearly 80% of all requests; they have been blacklisted.
Update, February 21: these IPs keep saturating peak speed even as the request-rate limit is adjusted, requesting continuously …
Update, March 12: the seven IPs above have been looping requests for the time-series data of every province for more than 24 hours …
Update, July 7: …
Update, July 11: …
If your IP was banned by mistake, or was inadvertently disabled and needs to be restored, please fix your request frequency and contact me below or by email.
Thanks to the author!
Sorry, I'm just curious to know what is …
You can press F12 and then check the "Network" tab.
Cool. I am assuming you're using Chrome? What information do you think is most useful to a programmer? That's really cool. I see a lot of new things. Thanks for inspiring me to explore the TCP world.
Thanks to the author! A project of mine uses the /nCoV/api/area endpoint. When fetching historical data earlier, I did not rate-limit my requests, which probably exceeded the limit and got me banned. Could you help restore access? Under normal operation I request the latest data once an hour, and I have already fixed the request frequency for historical data. Public IPs: 115.231.99.171, 183.131.19.110, 183.136.237.171.
Hello, apart from the three IPs mentioned here, I have not banned any other IP from requesting the server. You only need to adjust your request frequency, and you will be able to fetch data normally after one minute.
OK, thanks.
If you would like to get a functional crawler, you might need to use the information in the … As Ding Xiang Yuan does not detect whether you are a "real person" or not, you can just check out my script to see how a basic crawler works and what information we need to crawl data from other websites. But I do not think the information in …
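To make the point concrete, here is a minimal Python sketch of the kind of header-based request described above. The URL and header values are placeholders standing in for whatever the browser's Network tab shows; this is not the author's actual script.

```python
import requests

# Illustrative only: the target URL and header values are placeholders,
# meant to be replaced with what the browser's Network tab shows.
url = "https://example.com/some/page"  # hypothetical target
headers = {
    # Mimic a normal browser so the server treats us like a regular visitor.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/",
}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail loudly on 4xx/5xx
print(response.text[:500])   # inspect the first part of the payload
```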
If it still returns 503, there could be many reasons. For example, after your requests call reaches the server, the server's response time may exceed the default maximum that requests allows, so requests aborts that request and automatically issues the next one (driven by its …). You can try setting …
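For reference, a hedged Python sketch of the fix suggested above: set an explicit timeout and pause before retrying when the server returns 503, instead of letting the client hammer the server with automatic retries. The host below is a placeholder; only the /nCoV/api/area path appears in this thread, and the retry count and sleep durations are arbitrary choices.

```python
import time
import requests

# Placeholder host; only the /nCoV/api/area path comes from this thread.
URL = "https://example.org/nCoV/api/area"

def fetch_with_backoff(url, retries=3, timeout=30):
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            time.sleep(5)      # give the server room before trying again
            continue
        if resp.status_code == 503:
            time.sleep(5)      # rate-limited: wait, then retry
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("still failing after %d attempts" % retries)
```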
…p time increase sleep time from 3 s to 5 s, according to BlankerL/DXY-COVID-19-Crawler#63 (comment)
Hello, my project uses the /nCoV/api/area endpoint. Recently I have seen the endpoint's response time occasionally reach 30 seconds. Is this sporadic, or is there some other cause?
Many thanks to the author! I would like to appeal to everyone (especially companies and commercial users): the historical data can be crawled once and then cached in your own database. The historical data is large by itself, and this API runs on a personal project's server; a project meant for the public good should not run aground because of your laziness. Please exercise self-restraint and do not abuse network resources. Thank you!
Sorry, perhaps it is network congestion. As the epidemic abroad grows more serious, the API call volume has started to rise again...
Thank you very much for the support! Under the current limits the response speed is still acceptable, but I may consider disabling the feature that returns the full time-series data and hand that traffic over to the data warehouse.
Agreed! The historical data should be exported as CSV or JSON and kept in the data warehouse, with the API serving only real-time data. With the epidemic abroad now in an explosive upswing, API requests are likely to increase further.
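As a concrete illustration of this suggestion, here is a hedged Python sketch that fetches the historical time series once and stores it locally, so the API never has to serve it again. The host is a placeholder, and the latest=0 parameter is an assumption about how the endpoint distinguishes historical from real-time data.

```python
import json
import requests

# Placeholder host; latest=0 is an *assumed* parameter for requesting
# the full history rather than only the latest snapshot.
resp = requests.get("https://example.org/nCoV/api/area",
                    params={"latest": 0}, timeout=60)
resp.raise_for_status()

# Archive the payload locally; future consumers read this file instead
# of hitting the API again.
with open("area_history.json", "w", encoding="utf-8") as f:
    json.dump(resp.json(), f, ensure_ascii=False)
```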
Yes, since early March the daily traffic has been saturating the bandwidth. Of the 1000 GB monthly quota, 350 GB has already been used as of today... I am a bit worried it will not last the month. Also, according to Google Analytics, overseas visits in the past week have exceeded half of all visits.
Indeed, at this rate the monthly quota really is at risk... If you Google or Bing "Covid 19 data API", this API is basically the first result.
Hello, could you help restore access for the banned IP 47.93.98.134? It used to fetch nationwide provincial/municipal and overseas epidemic data once an hour; it has now been adjusted to fetch twice a day, morning and evening.
Thanks for the support! Access has been restored. If convenient, please leave an interval of about 30 seconds between consecutive requests: extracting time-series data takes a lot of bandwidth, and the backend has only 4 threads. If all 4 threads are busy sending data to you, no other user can fetch anything.
There is currently a 10-second interval; I will adjust it further. Thanks.
Getting "504 Gateway Time-out" when requesting the API; could you please look into it?
Thanks. It seems there was a problem with data entry; I am looking into the cause and have shut down Nginx, so the API is temporarily inaccessible.
It has been restored; I have replied at BlankerL/DXY-COVID-19-Data#47.
The endpoint cannot be reached again; could you please take another look?
Restored. The server is under very heavy load right now...
I am willing to sponsor 100 Mbps of bandwidth; details have been sent to [email protected].
Thank you very much; I have replied to the email!
Sorry to trouble you, but the API still doesn't work. Please check it. The browser returns "504 Gateway Time-out".
Fixed. |
Thank you very much for your support.
The original intention of this API is to let more people look back over the data and see the course of the epidemic at a glance. It aims to support non-commercial development and scientific research, not to serve a single enterprise.
I clearly stated in the first version of the API documentation that this API is mainly open for non-commercial use. However, I recently received an email asking me to increase the server bandwidth to accommodate higher-frequency requests. The sender's address used a corporate domain, and the signature showed it to be a listed company. Moreover, the company's official website states that a project it recently developed provides data guidance to the government.
I therefore analyzed the server's request logs over the past two days. A large number of requests came from the same Class C network, at roughly 5 per second, seriously crowding out other users' API calls and data retrieval.
The server currently has to handle more than 20 requests per second, and with only 10 Mbps of bandwidth it cannot serve such a heavy commercial load, so I have had to deploy anti-crawler measures.
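For illustration, a minimal Python sketch of that kind of log analysis, counting requests per /24 ("Class C") network. It assumes the common Nginx/Apache access-log layout in which the client IP is the first field; the log filename is a placeholder.

```python
import re
from collections import Counter

# Count requests per /24 prefix from an access log.
# "access.log" is a placeholder filename.
prefixes = Counter()
with open("access.log") as f:
    for line in f:
        m = re.match(r"(\d+\.\d+\.\d+)\.\d+\s", line)
        if m:
            prefixes[m.group(1) + ".0/24"] += 1

# The heaviest networks surface immediately.
for prefix, count in prefixes.most_common(10):
    print(prefix, count)
```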
The current measure is simple: a single IP is limited to at most 5 requests per second; any request beyond that frequency receives a 503 error.
Updated on February 21: considering that there are currently 4 API endpoints, that occasional disconnections happen, and that the API response speed has improved significantly since the anti-crawler measures were put in place, the maximum number of requests per second has been set to 5.
Updated on March 18: Starting from March 16, the maximum number of requests per second has been changed to 2.
If you need high-frequency access, set up a local cache and periodically ask the server to refresh it, rather than forwarding every request to this server; otherwise the data transmission efficiency of both parties will suffer badly. A sketch of this pattern follows.
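A minimal Python sketch of that local-cache pattern: poll the API at a modest interval, keep the latest payload in memory, and serve all local consumers from the cache. The host below is a placeholder; only the /nCoV/api/area path appears in this thread, and the one-hour refresh interval is an arbitrary choice.

```python
import time
import requests

# Placeholder host; only the /nCoV/api/area path comes from this thread.
URL = "https://example.org/nCoV/api/area"
REFRESH_SECONDS = 3600          # refresh once per hour, not per consumer

_cache = {"data": None, "fetched_at": 0.0}

def get_area_data():
    """Return cached data, refreshing from the API only when stale."""
    now = time.time()
    if _cache["data"] is None or now - _cache["fetched_at"] > REFRESH_SECONDS:
        resp = requests.get(URL, timeout=30)
        resp.raise_for_status()
        _cache["data"] = resp.json()
        _cache["fetched_at"] = now
    return _cache["data"]
```

With this in place, any number of local consumers can call get_area_data() as often as they like while the upstream server sees at most one request per hour.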
Thanks again.