Request Retry param for get_*** #82

Closed
FinBird opened this issue Jan 13, 2023 · 8 comments · Fixed by #86
Labels
enhancement New feature or request

Comments

FinBird commented Jan 13, 2023

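What feature I need

The three functions get_threads, get_posts and get_comments hit 429 errors when threads are crawled asynchronously at too high a rate.
Could you provide a retry parameter so that the behavior can be set to re-crawl, instead of raising, logging, and then ignoring the error? Thanks.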
https://github.com/Starry-OvO/aiotieba/blob/4fba4b58c4b1e98c11e198f832f62827dab5b539/aiotieba/client/__init__.py#L678
https://github.com/Starry-OvO/aiotieba/blob/4fba4b58c4b1e98c11e198f832f62827dab5b539/aiotieba/client/__init__.py#L718
https://github.com/Starry-OvO/aiotieba/blob/4fba4b58c4b1e98c11e198f832f62827dab5b539/aiotieba/client/__init__.py#L773

What scenario I want to use this feature in
Quickly crawling the threads of a single forum (吧)

...

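@lumina37
Owner

Retry by checking whether the result is empty? A retry parameter is not elegant; exception hooks are a better solution. Designing such a thing would probably take at least a week.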

@lumina37 lumina37 added the enhancement New feature or request label Jan 13, 2023
n0099 commented Jan 13, 2023

If an endpoint under c.tieba.baidu.com exceeds the rps limit, it returns either HTTP 429 or an HTML page with the following content:

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
        <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, user-scalable=no">
        <title>贴吧404</title>
    <link href="//tb1.bdstatic.com/tb/common-main-static/css/error.dc4bf960.css" rel="stylesheet"></head>
    <body style="margin: 0;">
        <!--[if lte IE 8]>
        <style>
            li {
                list-style: none;
                margin: 0;
                padding: 0;
            }

            .content {
                width: 490px;
                margin-left: -245px;
                margin-top: -245px;
                position: absolute;
                top: 45%;
                bottom: 0;
                left: 50%;
                right: 0;
            }
            .emotion {
                display: block;
                margin: 0 auto 42px;
                width: 144px;
                height: 144px;
                padding-top: 10%;
                text-align: center;
            }
            p {
                margin-bottom: 2px;
                text-align: center;
                color: #8F8E94;
                font-size: 14px;
                line-height: 14px;
            }
            .hr {
                margin: 36px 0;
                border-top: 1px solid #EEEDF0;
            }
            .bold {
                font-weight: 500;
                color: #141414;
            }
        </style>
        <div class="content">
            <img class="emotion" src="https://tieba-fe.gz.bcebos.com/hybrid_offline/assets/thread-not-found.44ef1fb5.png">
            <p>为了保护您的账号安全和最佳的浏览体验,当前业务已经不支持IE8以下浏览器</p>
            <p>版本访问,我们邀请您百度搜索下载以下几款浏览器,获得最佳网上冲浪体验~ </p>
            <div class="hr"></div>
            <p>以下四款官方正版浏览器任您选择</p>
            <p class="bold">谷歌浏览器、QQ浏览器、搜狗浏览器、火狐浏览器</p>
        </div>
        <![endif]-->
        <div id="app">
        </div>
    <script src="//tb1.bdstatic.com/tb/common-main-static/js/vendors.56dfadad.js"></script><script src="//tb1.bdstatic.com/tb/common-main-static/js/utils.f0fa7046.js"></script><script src="//tb1.bdstatic.com/tb/common-main-static/js/common.ec24fe0d.js"></script><script src="//tb1.bdstatic.com/tb/common-main-static/js/error.ddf32bc1.js"></script></body>
</html>

I have never seen HTTP 403 used for rate limiting.
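The two throttling forms described above (an HTTP 429 status, or an HTML page where protobuf/JSON was expected) could be detected with a check like the following. This is only a sketch: the function name looks_rate_limited is hypothetical, and it assumes the throttling page always carries the same <title> as the page quoted above, which the thread does not confirm.

```python
# Marker taken from the <title> of the HTML page quoted above.
# Assumption: the throttling page always carries this title.
ERROR_PAGE_TITLE = "<title>贴吧404</title>"


def looks_rate_limited(status: int, body: bytes) -> bool:
    """Return True if a response looks like one of the two throttling forms."""
    # First form: a plain HTTP 429 status.
    if status == 429:
        return True
    # Second form: an HTML page returned where protobuf/JSON was expected.
    # Only the head of the body is inspected; a multi-byte character cut
    # by the slice is dropped by errors="ignore".
    text = body[:2048].decode("utf-8", errors="ignore")
    is_html = text.lstrip().lower().startswith("<!doctype html")
    return is_html and ERROR_PAGE_TITLE in text
```

A caller would run this on the raw response before attempting to parse it, and treat a True result as a signal to slow down and retry.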

n0099 commented Jan 13, 2023

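lumina37 wrote: Retry by checking whether the result is empty? A retry parameter is not elegant; exception hooks are a better solution. Designing such a thing would probably take at least a week.

That depends on how much you want the abstraction of these three functions to isolate. In other words, for errors that originate on the Tieba side (your network cable is unplugged, a non-HTTP-200 response, malformed protobuf/JSON, and so on), should your abstraction layer isolate them and perform a user-specified retry action (such as a callback, or simply retrying indefinitely)?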

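The common design is still a set of parameters such as max retry times, wait for timeout, and error callback. Blindly throwing exceptions instead leaves users at a loss, because they assume these errors have already been isolated by the abstraction.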
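A minimal sketch of that parameter-based design, written with plain asyncio. Nothing here is aiotieba API: call_with_retry and all of its parameter names are hypothetical, chosen only to mirror the three knobs mentioned above.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_retry(
    func: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    wait: float = 1.0,
    timeout: Optional[float] = None,
    on_error: Optional[Callable[[Exception, int], None]] = None,
) -> T:
    """Call func(), retrying up to max_retries times after a failure."""
    attempt = 0
    while True:
        try:
            # timeout=None means "no per-attempt time limit".
            return await asyncio.wait_for(func(), timeout)
        except Exception as err:
            attempt += 1
            # Report every failure to the caller's callback, if one was given.
            if on_error is not None:
                on_error(err, attempt)
            # Retries exhausted: let the last error propagate.
            if attempt > max_retries:
                raise
            await asyncio.sleep(wait)
```

With this shape the caller decides how persistent to be, and an error surfaces only after the retry budget is spent.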

My design also blindly throws exceptions:
https://github.com/n0099/TiebaMonitor/blob/290c43ccf9054481d23b7bbc7ab6e6db54d6a38a/crawler/src/Tieba/ClientRequester.cs#L44
https://github.com/n0099/TiebaMonitor/blob/290c43ccf9054481d23b7bbc7ab6e6db54d6a38a/crawler/src/Tieba/ClientRequester.cs#L60
https://github.com/n0099/TiebaMonitor/blob/290c43ccf9054481d23b7bbc7ab6e6db54d6a38a/crawler/src/Tieba/TiebaException.cs#L17
But they are intercepted by the try here:
https://github.com/n0099/TiebaMonitor/blob/290c43ccf9054481d23b7bbc7ab6e6db54d6a38a/crawler/src/Tieba/Crawl/Facade/BaseCrawlFacade.cs#L154
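So users who crawl through the highly abstracted BaseCrawlFacade will not encounter any exceptions. This try catches all of them: even a serious error such as being unable to write to the database (crawled but not saved) is only logged, and the program does not exit immediately because of an unhandled exception. The exception is when users work with the concrete subclasses of BaseCrawler, BaseParser and BaseSaver that BaseCrawlFacade composes and calls internally, or even send requests with ClientRequester directly.

The failed crawl is then put into a global retry queue and retried after a while: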
https://github.com/n0099/TiebaMonitor/blob/290c43ccf9054481d23b7bbc7ab6e6db54d6a38a/crawler/src/Worker/RetryCrawlWorker.cs#L26
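The catch, log, and requeue pattern described above can be sketched in Python with an asyncio.Queue. This is not a port of the linked C# code: run_guarded, retry_worker and the fixed delay are hypothetical names and simplifications used only for illustration.

```python
import asyncio
from typing import Awaitable, Callable

Job = Callable[[], Awaitable[None]]


async def run_guarded(job: Job, retry_queue: "asyncio.Queue[Job]") -> None:
    """Run a job; on any error, log it and park the job for a later retry."""
    try:
        await job()
    except Exception as err:
        # The caller never sees the exception; it is logged and requeued.
        print(f"job failed, queued for retry: {err!r}")
        await retry_queue.put(job)


async def retry_worker(retry_queue: "asyncio.Queue[Job]", delay: float) -> None:
    """Re-run parked jobs forever, waiting `delay` seconds before each one."""
    while True:
        job = await retry_queue.get()
        await asyncio.sleep(delay)
        # A job that fails again is simply parked again by run_guarded.
        await run_guarded(job, retry_queue)
        retry_queue.task_done()
```

The worker would be started once as a background task, and every crawl would go through run_guarded so that failures end up in the shared queue instead of reaching the caller.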

n0099 commented Jan 13, 2023

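Also, the rps limit of c.tieba.baidu.com varies with the CDN IP you send requests to. Baidu's CDN IPs in mainland China have a higher limit, around 20~30 rps, while the HK node allows only 10 rps. (DNS resolves every non-mainland IP to that single HK node. You can of course edit hosts manually to use another node's IP, but the overseas routes to Baidu's mainland CDN nodes are terrible.)

For this I wrote a component that probes the rps limit dynamically and keeps the outgoing rps hovering around the probed ceiling, so that requests go out at as high an rps as possible without constantly hitting the limit. It can also discover the new ceiling if the Tieba CDN operators suddenly change the rps limit: https://github.com/n0099/TiebaMonitor/blob/290c43ccf9054481d23b7bbc7ab6e6db54d6a38a/crawler/src/Tieba/ClientRequesterTcs.cs#L14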
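One common way to get this kind of hovering behavior is additive increase with multiplicative decrease. The sketch below uses that technique as a stand-in; it is not taken from the linked ClientRequesterTcs code, and the class name and every default value are placeholder assumptions.

```python
class AdaptiveRate:
    """Additive-increase / multiplicative-decrease estimate of a usable rps."""

    def __init__(
        self,
        start: float = 5.0,      # placeholder starting rate
        floor: float = 1.0,      # never go below this rate
        ceiling: float = 50.0,   # never probe above this rate
        step: float = 0.5,       # rps added after each success
        backoff: float = 0.8,    # factor applied after each throttled request
    ) -> None:
        self.rps = start
        self.floor = floor
        self.ceiling = ceiling
        self.step = step
        self.backoff = backoff

    def on_success(self) -> None:
        # Probe upward slowly while requests keep succeeding.
        self.rps = min(self.ceiling, self.rps + self.step)

    def on_rate_limited(self) -> None:
        # Back off quickly once the server starts throttling.
        self.rps = max(self.floor, self.rps * self.backoff)

    @property
    def interval(self) -> float:
        """Seconds to wait between consecutive requests at the current rate."""
        return 1.0 / self.rps
```

A crawler would sleep for interval seconds between requests, call on_success after a normal response and on_rate_limited after a throttled one, so the rate keeps oscillating just below whatever limit the node currently enforces.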

Author

FinBird commented Jan 13, 2023

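n0099 wrote: Also, the rps limit of c.tieba.baidu.com varies with the CDN IP you send requests to. Baidu's CDN IPs in mainland China have a higher limit, around 20~30 rps, while the HK node (DNS resolves every non-mainland IP to that single HK node) allows only 10.

Thanks for taking the trouble to reply in such detail. I am on a mainland China IP.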

Author

FinBird commented Jan 13, 2023

n0099 wrote: If an endpoint under c.tieba.baidu.com exceeds the rps limit, it returns either HTTP 429 or an HTML page (the 贴吧404 page quoted in full in n0099's comment above). I have never seen HTTP 403 used for rate limiting.

For now I have solved the rate-limit retry problem by hacking the library internals directly. I may have misremembered the error code.

Owner

lumina37 commented Jan 20, 2023

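The solution implemented so far

Use Client.exc_handlers[method like client.get_fid] = your handler to set a handler function for all exceptions raised in a specific method.

The handler can raise an exception of a different type, such as a custom NeedRetry. Catching NeedRetry outside the call then tells you when the request should be retried.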

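import asyncio

import aiotieba as tb


class NeedRetry(RuntimeError):
    """Raised by the handler to tell the caller that the request should be retried."""


def exce_handler(err: Exception):
    # The handler receives every exception raised in the method, so filter by type
    if isinstance(err, tb.HTTPStatusError):
        # 429 Too Many Requests: convert it into a retry signal
        if err.code == 429:
            raise NeedRetry("need retry")


async def main():
    async with tb.Client('default') as client:
        # Register the handler for get_fid only
        tb.client.exc_handlers[client.get_fid] = exce_handler

        # Try at most 3 times, retrying immediately after each NeedRetry
        for i in range(1, 4):
            try:
                await client.get_fid('天堂鸡汤')
            except NeedRetry:
                tb.LOG().debug(f"retry for the {i} time")
                continue
            else:
                break


asyncio.run(main())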

Log output

<2023-01-20 20:53:38.946> [WARN] [get_fid] (429, 'Too Many Requests')
<2023-01-20 20:53:38.946> [DEBUG] [main] retry for the 1 time
<2023-01-20 20:53:38.946> [WARN] [get_fid] (429, 'Too Many Requests')
<2023-01-20 20:53:38.946> [DEBUG] [main] retry for the 2 time
<2023-01-20 20:53:38.946> [WARN] [get_fid] (429, 'Too Many Requests')
<2023-01-20 20:53:38.947> [DEBUG] [main] retry for the 3 time

n0099 commented Jan 20, 2023

The classic global error handler, a favorite of C programmers.
