Passive-scan traffic needs deduplication #1

Open
aipengjie opened this issue Aug 16, 2017 · 3 comments

Comments

@aipengjie

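The traffic-parsing code is borrowed directly from 猪猪侠, but it does no deduplication. Every repeated request gets scanned again, so this will burn a lot of resources later on.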

@jjf012
Owner

jjf012 commented Aug 16, 2017

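@aipengjie (⊙﹏⊙) Deduplication is a big problem too, but I can't think of a good way to deal with the resource cost that deduplication itself brings, for example if I simply use set(hash).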
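On the memory worry with set(hash): one option, offered here only as a sketch and not something this project is known to use, is a Bloom filter. It keeps a fixed-size bit array instead of every hash, at the price of occasionally reporting an unseen URL as seen (so a few requests could be skipped). The class name `BloomFilter`, its default sizes, and the example URL are all assumptions for illustration; the code is written for Python 3.

```python
import hashlib


class BloomFilter(object):
    """Fixed-size probabilistic set: no false negatives, rare false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        # Hypothetical defaults: about 128 KiB of memory regardless of URL count.
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, key):
        """Insert key; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


seen_urls = BloomFilter()
first = seen_urls.add("http://example.com/item?id=***")   # new key
second = seen_urls.add("http://example.com/item?id=***")  # repeated key
```

Memory stays constant however many keys are added, but the false-positive rate climbs as the array fills, so the size would need tuning to the expected traffic.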

@aipengjie
Author

aipengjie commented Aug 16, 2017

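    @controller.handler
    def response(self, f):
        # Parse the intercepted flow into a dict (it provides 'url' and 'data').
        infos = getinfo.parse(f)
        host = urlparse.urlparse(infos['url']).netloc
        # Skip hosts outside the configured target ('*' means any host).
        if self.host != host and self.host != '*':
            return
        result = filterer.filterer(infos)
        # filterer() returns {} on error, so read the key defensively.
        unique = result.get('unique')
        # Queue the request only the first time its dedup key is seen.
        if unique and unique not in url_lists:
            url_lists.add(unique)
            q.put(infos)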


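import hashlib
import re
import traceback

tlag = "***"
re_rule = r"(?P<param>[^&?]+=(?P<value>[^&$]*))"
suffixs = [
    ".jpg", ".gif", ".png", ".js", ".xlsx", ".apk", ".flv", ".css",
    ".swf", ".xml", ".avi", ".zip", ".pdf", ".rar", ".doc", ".xls"
]


def filterer(infos):
    result = {}
    try:
        url = infos['url']
        data = infos['data']
        # Append the request body as a query string; otherwise drop the fragment.
        new_url = url + '?' + data if data else url.split('#')[0]
        if any(url.endswith(x) for x in suffixs):
            # Static resource: an empty key tells the caller to skip it.
            u = ""
        elif '?' in new_url:
            # Mask every parameter so URLs differing only in values collapse.
            u = re.sub(re_rule, tlag, new_url)
        else:
            # No parameters: key on the MD5 of the URL
            # (on Python 3, encode new_url to bytes first).
            u = hashlib.md5(new_url).hexdigest()
        result['url'] = url
        result['unique'] = u
    except Exception:
        traceback.print_exc()
    return result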

This is deduplication code I wrote a long time ago. Feel free to use it as a reference.
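One thing to watch in the snippet above: if the substitution pattern is `re_rule`, each whole `name=value` pair is replaced by `***`, so two URLs with different parameter names but the same parameter count end up with the same key. A possible variant, sketched here under assumptions (the helper name `dedup_key` and the `PLACEHOLDER` constant are hypothetical, the body is assumed to be form-encoded, and the code targets Python 3), keeps the names and masks only the values:

```python
from urllib.parse import parse_qsl, urlsplit

PLACEHOLDER = "***"  # stands in for every parameter value


def dedup_key(url, data=""):
    """Build a key from scheme, host, path and the sorted parameter names."""
    parts = urlsplit(url)  # the fragment is split off and ignored
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Treat a form-encoded request body like extra query parameters.
    pairs += parse_qsl(data, keep_blank_values=True)
    names = sorted(set(name for name, _ in pairs))
    query = "&".join("{}={}".format(n, PLACEHOLDER) for n in names)
    base = "{}://{}{}".format(parts.scheme, parts.netloc, parts.path)
    return base + "?" + query if query else base


key_a = dedup_key("http://example.com/item?id=1&page=2")
key_b = dedup_key("http://example.com/item?page=9&id=7#top")
key_c = dedup_key("http://example.com/item?name=1")
```

Here `key_a` and `key_b` are equal because only the values and the parameter order differ, while `key_c` stays distinct because its parameter name differs. Sorting the names is a design choice: it treats `?a=1&b=2` and `?b=2&a=1` as the same endpoint.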

@jjf012
Owner

jjf012 commented Aug 16, 2017

Updated, updated @aipengjie
