Commit

feat: Document bot categories (#72)
Co-authored-by: David Mytton <[email protected]>
blaine-arcjet and davidmytton authored Sep 16, 2024
1 parent cbd44fb commit 405d445
Showing 49 changed files with 688 additions and 293 deletions.
249 changes: 126 additions & 123 deletions package-lock.json

Large diffs are not rendered by default.

24 changes: 12 additions & 12 deletions package.json
@@ -10,17 +10,17 @@
"astro": "astro"
},
"dependencies": {
"@arcjet/body": "1.0.0-alpha.24",
"@arcjet/bun": "1.0.0-alpha.24",
"@arcjet/decorate": "1.0.0-alpha.24",
"@arcjet/env": "1.0.0-alpha.24",
"@arcjet/eslint-config": "1.0.0-alpha.24",
"@arcjet/next": "1.0.0-alpha.24",
"@arcjet/node": "1.0.0-alpha.24",
"@arcjet/protocol": "1.0.0-alpha.24",
"@arcjet/redact": "1.0.0-alpha.24",
"@arcjet/sveltekit": "1.0.0-alpha.24",
"@arcjet/tsconfig": "1.0.0-alpha.24",
"@arcjet/body": "1.0.0-alpha.26",
"@arcjet/bun": "1.0.0-alpha.26",
"@arcjet/decorate": "1.0.0-alpha.26",
"@arcjet/env": "1.0.0-alpha.26",
"@arcjet/eslint-config": "1.0.0-alpha.26",
"@arcjet/next": "1.0.0-alpha.26",
"@arcjet/node": "1.0.0-alpha.26",
"@arcjet/protocol": "1.0.0-alpha.26",
"@arcjet/redact": "1.0.0-alpha.26",
"@arcjet/sveltekit": "1.0.0-alpha.26",
"@arcjet/tsconfig": "1.0.0-alpha.26",
"@astrojs/check": "0.9.3",
"@astrojs/react": "3.6.2",
"@astrojs/starlight": "0.27.1",
@@ -36,7 +36,7 @@
"@langchain/community": "0.2.33",
"@sveltejs/kit": "2.5.26",
"ai": "3.3.29",
"arcjet": "1.0.0-alpha.24",
"arcjet": "1.0.0-alpha.26",
"astro": "4.15.4",
"astro-embed": "0.7.2",
"astro-robots-txt": "1.0.0",
40 changes: 40 additions & 0 deletions src/content/docs/bot-protection/identifying-bots.mdx
@@ -44,4 +44,44 @@ updates will be included in the next SDK release. Since bot detection is handled
within the Arcjet WebAssembly module bundled with the SDK, new patterns must be
compiled into the module as part of the release process.

## Bot categories

In addition to identifying individual bots, we also group bots into various
categories. You can leverage these categories for easier configuration of your
allow or deny lists.

Currently, we provide the following categories. You can see which bots are in
each category from the [bot list](https://arcjet.com/bot-list):

- `CATEGORY:ACADEMIC`: Scrape data for research purposes
- `CATEGORY:ADVERTISING`: Scrape data for advertising and marketing purposes
- `CATEGORY:AI`: Scrape data for AI and LLM purposes
- `CATEGORY:AMAZON`: Scrape data for Amazon products and services
- `CATEGORY:ARCHIVE`: Scrape data for archival purposes
- `CATEGORY:FEEDFETCHER`: Request data for RSS and other feeds
- `CATEGORY:GOOGLE`: Scrape data for Google products and services
- `CATEGORY:META`: Scrape data for Meta/Facebook products and services
- `CATEGORY:MICROSOFT`: Scrape data for Microsoft products and services
- `CATEGORY:MONITOR`: Interact for monitoring purposes
- `CATEGORY:OPTIMIZER`: Interact for optimization purposes
- `CATEGORY:PREVIEW`: Request data for image and URL previews
- `CATEGORY:PROGRAMMATIC`: Interact via programming language libraries
- `CATEGORY:SEARCH_ENGINE`: Index data for search engines
- `CATEGORY:SLACK`: Scrape data for Slack products and services
- `CATEGORY:SOCIAL`: Scrape data for social media products and services
- `CATEGORY:TOOL`: Interact via command line and GUI tools
- `CATEGORY:UNKNOWN`: Undetermined purposes
- `CATEGORY:VERCEL`: Scrape data for Vercel products and services
- `CATEGORY:YAHOO`: Scrape data for Yahoo products and services

We continuously evaluate bots to decide whether any should be reclassified.
If we determine that enough bots exist to justify a new category, we'll
consider adding it. Please open an issue on our
[arcjet/well-known-bots](https://github.com/arcjet/well-known-bots) repository
if you need a specific category.

For performance reasons, only configured categories are checked. Each detected
bot must be compared against each configured category, so the worst-case cost
is `count(detectedBot) * count(configuredCategories)`.
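
To make the worst-case bound concrete, here is a minimal TypeScript sketch of the nested comparison. It is only an illustration: the `matchCategories` helper, the `membership` map, and the `EXAMPLE_*` bot names are hypothetical and do not reflect how the Arcjet WebAssembly module is actually implemented.

```ts
// Hypothetical category membership, for illustration only.
// Real membership is defined by the Arcjet bot list.
const membership = new Map<string, ReadonlySet<string>>([
  ["CATEGORY:SEARCH_ENGINE", new Set(["EXAMPLE_SEARCH_BOT"])],
  ["CATEGORY:MONITOR", new Set(["EXAMPLE_UPTIME_BOT"])],
  ["CATEGORY:AI", new Set(["EXAMPLE_AI_BOT"])],
]);

// Compare every detected bot against every configured category and
// count the comparisons so the bound is visible.
function matchCategories(
  detectedBots: string[],
  configuredCategories: string[],
): { matches: Array<[string, string]>; comparisons: number } {
  const matches: Array<[string, string]> = [];
  let comparisons = 0;
  for (const bot of detectedBots) {
    for (const category of configuredCategories) {
      comparisons += 1;
      if (membership.get(category)?.has(bot)) {
        matches.push([bot, category]);
      }
    }
  }
  return { matches, comparisons };
}

// "CATEGORY:AI" is not configured, so it is never checked.
const result = matchCategories(
  ["EXAMPLE_SEARCH_BOT", "EXAMPLE_AI_BOT"],
  ["CATEGORY:SEARCH_ENGINE", "CATEGORY:MONITOR"],
);
console.log(result.comparisons); // 2 bots * 2 categories = 4
```

Because unconfigured categories are skipped entirely, keeping the configured list short keeps the comparison count low.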

<Comments />
36 changes: 28 additions & 8 deletions src/content/docs/bot-protection/reference/bun.mdx
@@ -14,6 +14,8 @@ import IdentifiedBotsTS from "/src/snippets/bot-protection/reference/bun/Identif
import IdentifiedBotsJS from "/src/snippets/bot-protection/reference/bun/IdentifiedBots.js?raw";
import ErrorsTS from "/src/snippets/bot-protection/reference/bun/Errors.ts?raw";
import ErrorsJS from "/src/snippets/bot-protection/reference/bun/Errors.js?raw";
import FilteringTS from "/src/snippets/bot-protection/reference/bun/Filtering.ts?raw";
import FilteringJS from "/src/snippets/bot-protection/reference/bun/Filtering.js?raw";
import Comments from "/src/components/Comments.astro";

Arcjet bot detection allows you to manage traffic by automated clients and bots.
@@ -31,14 +33,14 @@ You can use only one of the following configuration definitions:
```ts
type BotOptionsAllow = {
mode?: "LIVE" | "DRY_RUN";
allow: ArcjetWellKnownBot[];
allow: Array<ArcjetWellKnownBot | ArcjetBotCategory>;
};
```

```ts
type BotOptionsDeny = {
mode?: "LIVE" | "DRY_RUN";
deny: ArcjetWellKnownBot[];
deny: Array<ArcjetWellKnownBot | ArcjetBotCategory>;
};
```
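
As a rough sketch of what the widened union permits, the block below redeclares the two option types locally and builds one value of each. The `ArcjetWellKnownBot` and `ArcjetBotCategory` types here are narrowed stand-ins written for this example (the real types come from the SDK and cover far more values), and the specific bot names are assumptions for illustration.

```ts
// Local stand-ins for the SDK types, narrowed to a few values for illustration.
type ArcjetWellKnownBot = "CURL" | "GOOGLE_CRAWLER";
type ArcjetBotCategory =
  | "CATEGORY:SEARCH_ENGINE"
  | "CATEGORY:MONITOR"
  | "CATEGORY:AI";

type BotOptionsAllow = {
  mode?: "LIVE" | "DRY_RUN";
  allow: Array<ArcjetWellKnownBot | ArcjetBotCategory>;
};

type BotOptionsDeny = {
  mode?: "LIVE" | "DRY_RUN";
  deny: Array<ArcjetWellKnownBot | ArcjetBotCategory>;
};

// An allow list can mix a whole category with an individual bot.
const allowSearchAndCurl: BotOptionsAllow = {
  mode: "LIVE",
  allow: ["CATEGORY:SEARCH_ENGINE", "CURL"],
};

// A deny list can name a category on its own.
const denyAi: BotOptionsDeny = {
  mode: "DRY_RUN",
  deny: ["CATEGORY:AI"],
};

console.log(allowSearchAndCurl.allow, denyAi.deny);
```

Keeping `allow` and `deny` in separate types means a single rule cannot carry both lists, which matches the "only one of the following" wording.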

@@ -58,7 +60,8 @@ some bots to access your system, such as bots for search indexing or API
access from the command line.

This behavior is configured with an `allow` list from our [full list of
bots](https://arcjet.com/bot-list).
bots](https://arcjet.com/bot-list) and/or [bot
categories](/bot-protection/identifying-bots#bot-categories).

<Tabs>
<TabItem label="TS">
@@ -75,8 +78,9 @@ Some applications may only want to block a small subset of bots, while allowing
the majority continued access. This may be due to many reasons, such as
misconfigured or high-traffic bots.

This behavior is configured with an `deny` list from our [full list of
bots](https://arcjet.com/bot-list).
This behavior is configured with a `deny` list from our [full list of
bots](https://arcjet.com/bot-list) and/or [bot
categories](/bot-protection/identifying-bots#bot-categories).

<Tabs>
<TabItem label="TS">
@@ -140,9 +144,10 @@ This example will log the full result as well as the bot protection rule:
### Identified bots

The decision also contains all of the [identified
bots](/bot-protection/identifying-bots) detected from the request. A request may
be identified as zero, one, or more bots—all of which will be available on the
`decision.allowed` and `decision.denied` properties.
bots and matched categories](/bot-protection/identifying-bots) detected from the
request. A request may be identified as zero, one, or more bots/categories—all
of which will be available on the `decision.allowed` and `decision.denied`
properties.

<Tabs>
<TabItem label="TS">
@@ -181,6 +186,21 @@ allow or deny the request. Our recommendation is to block requests without
</TabItem>
</Tabs>

## Filtering categories

All categories are also provided as enumerations, which allows for programmatic
access. For example, you may want to allow most of `CATEGORY:GOOGLE` except
their "advertising quality" bot.

<Tabs>
<TabItem label="TS">
<Code code={FilteringTS} lang="ts" />
</TabItem>
<TabItem label="JS">
<Code code={FilteringJS} lang="js" />
</TabItem>
</Tabs>
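
The referenced snippet files are not shown in this diff, so the following is only a sketch of the filtering idea under stated assumptions. The `exampleCategories` map is a hypothetical stand-in for the category enumeration the SDK provides, and the bot identifiers in it are illustrative rather than the authoritative contents of `CATEGORY:GOOGLE`.

```ts
// Hypothetical stand-in for the SDK's category enumeration.
const exampleCategories = {
  "CATEGORY:GOOGLE": ["GOOGLE_CRAWLER", "GOOGLE_ADSBOT", "GOOGLE_FEEDFETCHER"],
} as const;

type GoogleBot = (typeof exampleCategories)["CATEGORY:GOOGLE"][number];

// Bots to leave out of the allow list even though the category includes them.
const excluded: ReadonlySet<string> = new Set(["GOOGLE_ADSBOT"]);

// Expand the category into individual bots, then drop the excluded ones.
const allowedGoogleBots: GoogleBot[] = exampleCategories[
  "CATEGORY:GOOGLE"
].filter((bot) => !excluded.has(bot));

// The filtered list can then be used wherever an allow list is expected.
const botRuleOptions = { mode: "LIVE" as const, allow: allowedGoogleBots };

console.log(botRuleOptions.allow); // ["GOOGLE_CRAWLER", "GOOGLE_FEEDFETCHER"]
```

Listing the remaining bots individually, rather than allowing the whole category, is what makes the exclusion effective.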

## Testing

Arcjet runs the same in any environment, including locally and in CI. You can
44 changes: 36 additions & 8 deletions src/content/docs/bot-protection/reference/nextjs.mdx
@@ -33,6 +33,10 @@ import ErrorsAppTS from "/src/snippets/bot-protection/reference/nextjs/ErrorsApp
import ErrorsPagesTS from "/src/snippets/bot-protection/reference/nextjs/ErrorsPages.ts?raw";
import ErrorsAppJS from "/src/snippets/bot-protection/reference/nextjs/ErrorsApp.js?raw";
import ErrorsPagesJS from "/src/snippets/bot-protection/reference/nextjs/ErrorsPages.js?raw";
import FilteringAppTS from "/src/snippets/bot-protection/reference/nextjs/FilteringApp.ts?raw";
import FilteringPagesTS from "/src/snippets/bot-protection/reference/nextjs/FilteringPages.ts?raw";
import FilteringAppJS from "/src/snippets/bot-protection/reference/nextjs/FilteringApp.js?raw";
import FilteringPagesJS from "/src/snippets/bot-protection/reference/nextjs/FilteringPages.js?raw";
import ProtectPageMiddlewareTS from "/src/snippets/bot-protection/reference/nextjs/ProtectPageMiddleware.ts?raw";
import ProtectPagePagesTS from "/src/snippets/bot-protection/reference/nextjs/ProtectPagePages.tsx?raw";
import ProtectPageMiddlewareJS from "/src/snippets/bot-protection/reference/nextjs/ProtectPageMiddleware.js?raw";
@@ -64,14 +68,14 @@ You can use only one of the following configuration definitions:
```ts
type BotOptionsAllow = {
mode?: "LIVE" | "DRY_RUN";
allow: ArcjetWellKnownBot[];
allow: Array<ArcjetWellKnownBot | ArcjetBotCategory>;
};
```

```ts
type BotOptionsDeny = {
mode?: "LIVE" | "DRY_RUN";
deny: ArcjetWellKnownBot[];
deny: Array<ArcjetWellKnownBot | ArcjetBotCategory>;
};
```

@@ -91,7 +95,8 @@ some bots to access your system, such as bots for search indexing or API
access from the command line.

This behavior is configured with an `allow` list from our [full list of
bots](https://arcjet.com/bot-list).
bots](https://arcjet.com/bot-list) and/or [bot
categories](/bot-protection/identifying-bots#bot-categories).

{/* prettier-ignore */}
<Tabs>
@@ -131,8 +136,9 @@ Some applications may only want to block a small subset of bots, while allowing
the majority continued access. This may be due to many reasons, such as
misconfigured or high-traffic bots.

This behavior is configured with an `deny` list from our [full list of
bots](https://arcjet.com/bot-list).
This behavior is configured with a `deny` list from our [full list of
bots](https://arcjet.com/bot-list) and/or [bot
categories](/bot-protection/identifying-bots#bot-categories).

{/* prettier-ignore */}
<Tabs>
@@ -340,9 +346,10 @@ Create a new API route at `/pages/api/arcjet.js`:
### Identified bots

The decision also contains all of the [identified
bots](/bot-protection/identifying-bots) detected from the request. A request may
be identified as zero, one, or more bots—all of which will be available on the
`decision.allowed` and `decision.denied` properties.
bots and matched categories](/bot-protection/identifying-bots) detected from the
request. A request may be identified as zero, one, or more bots/categories—all
of which will be available on the `decision.allowed` and `decision.denied`
properties.

<Tabs>
<TabItem label="TS (App)">
@@ -413,6 +420,27 @@ allow or deny the request. Our recommendation is to block requests without
</TabItem>
</Tabs>

## Filtering categories

All categories are also provided as enumerations, which allows for programmatic
access. For example, you may want to allow most of `CATEGORY:GOOGLE` except
their "advertising quality" bot.

<Tabs>
<TabItem label="TS (App)">
<Code code={FilteringAppTS} title="/app/api/hello/route.ts" lang="ts" />
</TabItem>
<TabItem label="TS (Pages)">
<Code code={FilteringPagesTS} title="/pages/api/hello.ts" lang="ts" />
</TabItem>
<TabItem label="JS (App)">
<Code code={FilteringAppJS} title="/app/api/hello/route.js" lang="js" />
</TabItem>
<TabItem label="JS (Pages)">
<Code code={FilteringPagesJS} title="/pages/api/hello.js" lang="js" />
</TabItem>
</Tabs>

## Testing

Arcjet runs the same in any environment, including locally and in CI. You can
36 changes: 28 additions & 8 deletions src/content/docs/bot-protection/reference/nodejs.mdx
@@ -14,6 +14,8 @@ import IdentifiedBotsTS from "/src/snippets/bot-protection/reference/nodejs/Iden
import IdentifiedBotsJS from "/src/snippets/bot-protection/reference/nodejs/IdentifiedBots.js?raw";
import ErrorsTS from "/src/snippets/bot-protection/reference/nodejs/Errors.ts?raw";
import ErrorsJS from "/src/snippets/bot-protection/reference/nodejs/Errors.js?raw";
import FilteringTS from "/src/snippets/bot-protection/reference/nodejs/Filtering.ts?raw";
import FilteringJS from "/src/snippets/bot-protection/reference/nodejs/Filtering.js?raw";
import Comments from "/src/components/Comments.astro";

Arcjet bot detection allows you to manage traffic by automated clients and bots.
@@ -31,14 +33,14 @@ You can use only one of the following configuration definitions:
```ts
type BotOptionsAllow = {
mode?: "LIVE" | "DRY_RUN";
allow: ArcjetWellKnownBot[];
allow: Array<ArcjetWellKnownBot | ArcjetBotCategory>;
};
```

```ts
type BotOptionsDeny = {
mode?: "LIVE" | "DRY_RUN";
deny: ArcjetWellKnownBot[];
deny: Array<ArcjetWellKnownBot | ArcjetBotCategory>;
};
```

@@ -58,7 +60,8 @@ some bots to access your system, such as bots for search indexing or API
access from the command line.

This behavior is configured with an `allow` list from our [full list of
bots](https://arcjet.com/bot-list).
bots](https://arcjet.com/bot-list) and/or [bot
categories](/bot-protection/identifying-bots#bot-categories).

<Tabs>
<TabItem label="TS">
@@ -75,8 +78,9 @@ Some applications may only want to block a small subset of bots, while allowing
the majority continued access. This may be due to many reasons, such as
misconfigured or high-traffic bots.

This behavior is configured with an `deny` list from our [full list of
bots](https://arcjet.com/bot-list).
This behavior is configured with a `deny` list from our [full list of
bots](https://arcjet.com/bot-list) and/or [bot
categories](/bot-protection/identifying-bots#bot-categories).

<Tabs>
<TabItem label="TS">
@@ -140,9 +144,10 @@ This example will log the full result as well as the bot protection rule:
### Identified bots

The decision also contains all of the [identified
bots](/bot-protection/identifying-bots) detected from the request. A request may
be identified as zero, one, or more bots—all of which will be available on the
`decision.allowed` and `decision.denied` properties.
bots and matched categories](/bot-protection/identifying-bots) detected from the
request. A request may be identified as zero, one, or more bots/categories—all
of which will be available on the `decision.allowed` and `decision.denied`
properties.

<Tabs>
<TabItem label="TS">
@@ -181,6 +186,21 @@ allow or deny the request. Our recommendation is to block requests without
</TabItem>
</Tabs>

## Filtering categories

All categories are also provided as enumerations, which allows for programmatic
access. For example, you may want to allow most of `CATEGORY:GOOGLE` except
their "advertising quality" bot.

<Tabs>
<TabItem label="TS">
<Code code={FilteringTS} lang="ts" />
</TabItem>
<TabItem label="JS">
<Code code={FilteringJS} lang="js" />
</TabItem>
</Tabs>

## Testing

Arcjet runs the same in any environment, including locally and in CI. You can