Passive scan traffic needs deduplication #1
Comments
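@aipengjie (⊙﹏⊙) Deduplication really is a big problem, but I can't think of a good way to deal with the resource cost that deduplication itself introduces, for example if I simply use set(hash).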
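One common way to cap the memory cost of a plain set of hashes is a Bloom filter, which stores only a fixed-size bit array. The sketch below is a minimal illustration and not code from this project: the class name `BloomFilter`, its parameters, and the choice of SHA-256 are all assumptions, and it is written for Python 3. The trade-off is a small chance of false positives, meaning a new URL is occasionally treated as already seen and skipped.

```python
import hashlib


class BloomFilter(object):
    """Fixed-size probabilistic set: no false negatives, rare false positives.

    Hypothetical helper for illustration; not part of the original project.
    """

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        # Memory use is fixed at about size_bits / 8 bytes (128 KiB by default)
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from two 64-bit slices of one digest
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, key):
        """Insert key; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(key):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen
```

In a handler, `if not bloom.add(key): q.put(infos)` would then stand in for the membership test against the set, assuming `bloom` is a shared `BloomFilter` instance.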
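import urlparse  # Python 2 module; on Python 3 this is urllib.parse

# url_lists (a set of dedup keys) and q (the scan queue) are defined elsewhere
@controller.handler
def response(self, f):
    infos = getinfo.parse(f)
    host = urlparse.urlparse(infos['url']).netloc
    # Skip traffic for hosts other than the configured target ('*' matches all)
    if self.host != host and self.host != '*':
        return
    result = filterer.filterer(infos)
    # filterer() returns an empty dict on error, so read the key with .get()
    unique = result.get('unique')
    if unique and unique not in url_lists:
        url_lists.add(unique)
        q.put(infos)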
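import hashlib
import re
import traceback

tlag = "***"
re_rule = r"(?P<param>[^&?]+\=(?P<value>[^&$]*))"
suffixs = [
    ".jpg", ".gif", ".png", ".js", ".xlsx", ".apk", ".flv", ".css",
    ".swf", '.xml', '.avi', '.zip', '.pdf', '.rar', '.doc', '.xls'
]

def filterer(infos):
    result = {}
    try:
        url = infos['url']
        data = infos['data']
        # Append the request body as a query string, or else drop the fragment
        new_url = url + '?' + data if data else url.split('#')[0]
        # any() is used because a filter object is always truthy on Python 3
        if any(url.endswith(x) for x in suffixs):
            u = ""  # static resource: empty key, so it is never queued
        elif '?' in new_url:
            # Mask each param=value pair (assumes re_rule is the intended pattern)
            u = re.sub(re_rule, tlag, new_url)
        else:
            # Python 2 str; on Python 3, encode new_url to bytes first
            u = hashlib.md5(new_url).hexdigest()
        result['url'] = url
        result['unique'] = u
    except Exception:
        traceback.print_exc()
    return result

This is deduplication code I wrote a long time ago; feel free to use it as a reference.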
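For comparison, here is a hedged Python 3 sketch of the same idea that builds the key from the sorted parameter names rather than masking whole `param=value` pairs. It is an illustration only: `dedup_key` and `STATIC_SUFFIXES` are hypothetical names, the suffix list is abbreviated, and unlike the code above it tests the suffix against the URL path rather than the full URL.

```python
import hashlib
from urllib.parse import parse_qsl, urlsplit

# Abbreviated, hypothetical suffix list for illustration
STATIC_SUFFIXES = (".jpg", ".gif", ".png", ".js", ".css")


def dedup_key(url, data=""):
    """Return a key shared by requests that differ only in parameter values.

    Returns "" for static resources, which the caller is expected to skip.
    """
    parts = urlsplit(url)  # the fragment is separated out and not used below
    if parts.path.lower().endswith(STATIC_SUFFIXES):
        return ""
    # Treat the request body like extra query parameters
    query = "&".join(q for q in (parts.query, data) if q)
    names = sorted({name for name, _ in parse_qsl(query, keep_blank_values=True)})
    canonical = "%s://%s%s?%s" % (
        parts.scheme, parts.netloc, parts.path, "&".join(names))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

With this key, `x.php?id=1&b=2` and `x.php?b=9&id=3` collapse to one entry, while `x.php?c=1` stays distinct because its parameter names differ.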
Updated, updated! @aipengjie
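The traffic-parsing code is borrowed directly from 猪猪侠, but it does no deduplication, so it will consume a lot of resources later on.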