Adds detection for various bots #7656

Merged 8 commits on Apr 12, 2024

112 changes: 110 additions & 2 deletions Tests/fixtures/bots.yml
@@ -3139,10 +3139,10 @@
   bot:
     name: Uptimebot
     category: Site Monitor
-    url: https://uptime.com/uptimebot
+    url: https://uptime.com/uptime-bot
     producer:
       name: Uptime
-      url: https://uptime.com
+      url: https://uptime.com/
 -
   user_agent: Mozilla/5.0 (compatible; vkShare; +http://vk.com/dev/Share)
   bot:
@@ -7368,3 +7368,111 @@
     producer:
       name: 'Probely - Soluções de Cibersegurança, S.A.'
       url: https://probely.com/
+-
+  user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Uptime/1.0 (https://uptime.com)
+  bot:
+    name: Uptimebot
+    category: Site Monitor
+    url: https://uptime.com/uptime-bot
+    producer:
+      name: Uptime
+      url: https://uptime.com/
+-
+  user_agent: Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.114 Mobile Safari/537.36 Uptime/1.0 (https://uptime.com)
+  bot:
+    name: Uptimebot
+    category: Site Monitor
+    url: https://uptime.com/uptime-bot
+    producer:
+      name: Uptime
+      url: https://uptime.com/
+-
+  user_agent: Mozilla/5.0 (compatible; Uptime/1.0; http://uptime.com)
+  bot:
+    name: Uptimebot
+    category: Site Monitor
+    url: https://uptime.com/uptime-bot
+    producer:
+      name: Uptime
+      url: https://uptime.com/
+-
+  user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5002.111 Safari/537.36 Uptimia/1.0 (https://www.uptimia.com)
+  bot:
+    name: Uptimia
+    category: Site Monitor
+    url: https://www.uptimia.com/
+    producer:
+      name: JJ Online GmbH
+      url: https://www.uptimia.com/
+-
+  user_agent: Mozilla/5.0 (compatible; Uptimia; www.uptimia.com)
+  bot:
+    name: Uptimia
+    category: Site Monitor
+    url: https://www.uptimia.com/
+    producer:
+      name: JJ Online GmbH
+      url: https://www.uptimia.com/
+-
+  user_agent: Mozilla/5.0 (compatible; 2GDPR/1.2; https://2gdpr.com)
+  bot:
+    name: 2GDPR
+    category: Service Agent
+    url: https://2gdpr.com/tos
+    producer:
+      name: 2GDPR
+      url: https://2gdpr.com/
+-
+  user_agent: abuse.xmco.fr
+  bot:
+    name: Serenety
+    category: Security Checker
+    url: https://abuse.xmco.fr/
+    producer:
+      name: XMCO, SASU
+      url: https://www.xmco.fr/
+-
+  user_agent: CheckHost (https://check-host.net/)
+  bot:
+    name: CheckHost
+    category: Site Monitor
+    url: https://check-host.net/
+    producer:
+      name: CheckHost
+      url: https://check-host.net/
+-
+  user_agent: Mozilla/5.0 (compatible; LAC_IAHarvester/3.3.0; +https://library-archives.canada.ca/eng/services/government-canada/web-social-media-preservation-program/Pages/web-archive.aspx)
+  bot:
+    name: LAC IA Harvester
+    category: Crawler
+    url: https://library-archives.canada.ca/eng/services/government-canada/web-social-media-preservation-program/Pages/web-archive.aspx
+    producer:
+      name: Library and Archives Canada
+      url: https://library-archives.canada.ca/
+-
+  user_agent: GoogleSites
+  bot:
+    name: Googlebot
+    category: Search bot
+    url: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
+    producer:
+      name: Google Inc.
+      url: https://www.google.com/
+-
+  user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google-Sites-Thumbnails) Chrome/120.0.6099.71 Safari/537.36
+  bot:
+    name: Googlebot
+    category: Search bot
+    url: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
+    producer:
+      name: Google Inc.
+      url: https://www.google.com/
+-
+  user_agent: Google-adstxt
+  bot:
+    name: Googlebot
+    category: Search bot
+    url: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
+    producer:
+      name: Google Inc.
+      url: https://www.google.com/
64 changes: 52 additions & 12 deletions regexes/bots.yml
@@ -781,7 +781,7 @@
     name: 'Google Inc.'
     url: 'https://www.google.com/'

-- regex: 'Adwords-(?:DisplayAds|Express|Instant)|Google Web Preview|Google[ -]Publisher[ -]Plugin|Google-(?:Ads-Conversions|Ads-Qualify|Adwords|AMPHTML|Assess|Extended|HotelAdsVerifier|InspectionTool|Lens|PageRenderer|Read-Aloud|Safety|Shopping-Quality|Site-Verification|speakr|Stale-Content-Probe|Test|Youtube-Links)|(?:AdsBot|APIs|DuplexWeb|Feedfetcher|Mediapartners)-Google(?:-Mobile)?|Google(?:AdSenseInfeed|AssociationService|bot|Other|Prober|Producer)|Google.*/\+/web/snippet'
+- regex: 'Adwords-(?:DisplayAds|Express|Instant)|Google Web Preview|Google[ -]Publisher[ -]Plugin|Google-(?:adstxt|Ads-Conversions|Ads-Qualify|Adwords|AMPHTML|Assess|Extended|HotelAdsVerifier|InspectionTool|Lens|PageRenderer|Read-Aloud|Safety|Shopping-Quality|Site-Verification|Sites-Thumbnails|speakr|Stale-Content-Probe|Test|Youtube-Links)|(?:AdsBot|APIs|DuplexWeb|Feedfetcher|Mediapartners)-Google(?:-Mobile)?|Google(?:AdSenseInfeed|AssociationService|bot|Other|Prober|Producer|Sites)|Google.*/\+/web/snippet'
   name: 'Googlebot'
   category: 'Search bot'
   url: 'https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers'
@@ -1681,13 +1681,13 @@
     name: 'UkrNet Ltd'
     url: 'https://www.ukr.net/'

-- regex: 'Uptimebot'
+- regex: 'Uptime(?:bot)?/[\d.]+'
   name: 'Uptimebot'
   category: 'Site Monitor'
-  url: 'https://uptime.com/uptimebot'
+  url: 'https://uptime.com/uptime-bot'
   producer:
     name: 'Uptime'
-    url: 'https://uptime.com'
+    url: 'https://uptime.com/'

 - regex: 'UptimeRobot'
   name: 'UptimeRobot'
@@ -4326,18 +4326,58 @@
   category: 'Crawler'
   url: 'https://who.is/'

-# Generic bots
-- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherweb|kirkland-signature|LinkChain|survey-security-dot-txt|^xenu|^ZmEu|^(?:chrome|firefox|Zeus)$'
-  name: 'Generic Bot'
-
-# Generic detections
-- regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|indexer|inspector|monitor|project(?!or)|(?<!Google Wap )proxy|research|resolver|robots|scanner|scraper|script|searcher|(?<!-)security|spider|study|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
-  name: 'Generic Bot'
-
 - regex: 'Probely'
   name: 'Probely'
   category: 'Security Checker'
   url: 'https://probely.com/sos/'
   producer:
     name: 'Probely - Soluções de Cibersegurança, S.A.'
     url: 'https://probely.com/'
+
+- regex: 'Uptimia(?:/[\d.]+)?'
+  name: 'Uptimia'
+  category: 'Site Monitor'
+  url: 'https://www.uptimia.com/'
+  producer:
+    name: 'JJ Online GmbH'
+    url: 'https://www.uptimia.com/'
+
+- regex: '2GDPR/[\d.]+'
+  name: '2GDPR'
+  category: 'Service Agent'
+  url: 'https://2gdpr.com/tos'
+  producer:
+    name: '2GDPR'
+    url: 'https://2gdpr.com/'
+
+- regex: 'abuse\.xmco\.fr'
+  name: 'Serenety'
+  category: 'Security Checker'
+  url: 'https://abuse.xmco.fr/'
+  producer:
+    name: 'XMCO, SASU'
+    url: 'https://www.xmco.fr/'
+
+- regex: 'CheckHost'
+  name: 'CheckHost'
+  category: 'Site Monitor'
+  url: 'https://check-host.net/'
+  producer:
+    name: 'CheckHost'
+    url: 'https://check-host.net/'
+
+- regex: 'LAC_IAHarvester/[\d.]+'
+  name: 'LAC IA Harvester'
+  category: 'Crawler'
+  url: 'https://library-archives.canada.ca/eng/services/government-canada/web-social-media-preservation-program/Pages/web-archive.aspx'
+  producer:
+    name: 'Library and Archives Canada'
+    url: 'https://library-archives.canada.ca/'
+
+# Generic bots
+- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherweb|kirkland-signature|LinkChain|survey-security-dot-txt|^xenu|^ZmEu|^(?:chrome|firefox|Zeus)$'
+  name: 'Generic Bot'
+
+# Generic detections
+- regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver?|checker|collector|crawl|crawler|fetcher|indexer|inspector|monitor|project(?!or)|(?<!Google Wap )proxy|research|resolver|robots|scanner|scraper|script|searcher|(?<!-)security|spider|study|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
+  name: 'Generic Bot'
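
For reviewers who want a quick sanity check outside the project's own test suite, the sketch below runs a few of the new patterns from regexes/bots.yml against user agents from the added fixtures. It is a simplification, not the library's matcher: it assumes entries are tried in file order with a plain case-insensitive search and that the first hit wins. The names `BOT_PATTERNS`, `FIXTURES`, and `match_bot` are hypothetical and exist only for this illustration. The Google and generic entries are left out; the generic pattern in particular uses variable-width lookbehind, which Python's `re` does not accept.

```python
import re

# (pattern, name) pairs copied from the regexes/bots.yml changes.
# Assumption: in-order, case-insensitive search where the first match wins.
BOT_PATTERNS = [
    (r'Uptime(?:bot)?/[\d.]+', 'Uptimebot'),
    (r'Uptimia(?:/[\d.]+)?', 'Uptimia'),
    (r'2GDPR/[\d.]+', '2GDPR'),
    (r'abuse\.xmco\.fr', 'Serenety'),
    (r'CheckHost', 'CheckHost'),
    (r'LAC_IAHarvester/[\d.]+', 'LAC IA Harvester'),
]


def match_bot(user_agent):
    """Return the name of the first pattern that matches, or None."""
    for pattern, name in BOT_PATTERNS:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return name
    return None


# User agents taken from the Tests/fixtures/bots.yml additions.
FIXTURES = {
    'Mozilla/5.0 (compatible; Uptime/1.0; http://uptime.com)': 'Uptimebot',
    'Mozilla/5.0 (compatible; Uptimia; www.uptimia.com)': 'Uptimia',
    'Mozilla/5.0 (compatible; 2GDPR/1.2; https://2gdpr.com)': '2GDPR',
    'abuse.xmco.fr': 'Serenety',
    'CheckHost (https://check-host.net/)': 'CheckHost',
}

for ua, expected in FIXTURES.items():
    assert match_bot(ua) == expected, (ua, match_bot(ua))

# The previous pattern 'Uptimebot' does not cover the "Uptime/1.0" token,
# which is why the regex was widened to 'Uptime(?:bot)?/[\d.]+'.
old_hit = re.search(
    'Uptimebot',
    'Mozilla/5.0 (compatible; Uptime/1.0; http://uptime.com)',
    re.IGNORECASE,
)
assert old_hit is None
```

Because the widened Uptime pattern now requires a version suffix (`/[\d.]+`), it does not match the "Uptimia" user agents, so the two site monitors stay distinct in this sketch.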