# Add an explainer for traffic advice. #10

**Merged** Apr 13, 2021 (4 commits)
**README.md** (6 changes: 3 additions & 3 deletions)
Users can opt-out of the feature at any time. Furthermore, users can temporarily opt-out of the feature by using their browser’s private browsing mode.

#### Publisher opt-out
Publishers can opt out by disallowing connections in their [traffic advice](traffic-advice.md). The proxy fetches and caches this advice, so publishers need only add a single resource to their origin at a well-known path.

Another option for origin-wide opt-out is to leverage the publisher's DNS record:
* Publishers specify in their DNS entry that they are opting out of proxied prefetching (completely or with some TBD granularity if necessary).
* The DNS check would be done by the proxy for privacy reasons; issuing a DNS request from the browser before navigation would share prefetch information with the DNS resolver and potentially the target host.


Ideally, the browser would fetch the opt-out signal *before* making a connection to the proxy. While there are proposals to enable anonymous fetching of both DNS records ([Oblivious DNS](https://tools.ietf.org/html/draft-pauly-dprive-oblivious-doh-00)) and HTTP resources ([Oblivious HTTP](https://tools.ietf.org/html/draft-thomson-http-oblivious-00)), neither is well-supported yet. If either of those proposals gains traction, we may want to revisit the publisher opt-out design to take advantage of Oblivious fetching.

In addition, publishers can opt out of individual requests, for example, when dealing with temporary traffic spikes or other issues. For these, publishers should look for the `Purpose: prefetch` request header and reject requests accordingly (see [Geolocation](https://github.com/buettner/private-prefetch-proxy#geolocation) for an example use case).
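The per-request opt-out above could be implemented server-side as a small request filter. Here is a minimal sketch as WSGI middleware; the `Purpose: prefetch` header name comes from the proposal, while the load-shedding toggle and all names are illustrative assumptions:

```python
# Illustrative toggle an operator might flip during a traffic spike
# (in practice this could come from a config store or load metric).
SHED_PREFETCH = True

def reject_prefetch(app):
    """Wrap a WSGI app so proxied prefetch requests get a 503 while shedding."""
    def middleware(environ, start_response):
        # WSGI exposes the "Purpose" request header as HTTP_PURPOSE.
        if SHED_PREFETCH and environ.get("HTTP_PURPOSE") == "prefetch":
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain")])
            return [b"Prefetch traffic temporarily rejected\n"]
        return app(environ, start_response)
    return middleware
```

Rejecting with an error status is safe here precisely because prefetches affect performance, not correctness: the browser simply falls back to a normal fetch at navigation time.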
**traffic-advice.md** (new file, 47 additions)
# Traffic Advice

Publishers may wish not to accept traffic from [private prefetch proxies](README.md) and other sources besides direct user traffic, for instance to reduce server load due to speculative prefetch activity.

We propose a well-known "traffic advice" resource, analogous to `/robots.txt` (for web crawlers), which allows an HTTP server to declare that implementing agents should stop sending traffic to it for some time.

## Proposal

HTTP request activity can broadly be divided into:
* activity on behalf of a user interaction (e.g., a web browser loading a web page requested by the user), or which for another reason cannot easily be discarded
* activity for which there is an existing specialized mechanism for throttling traffic (e.g. web crawlers respecting `robots.txt`)
* activity which can easily be discarded (e.g., because it corresponds to a prefetch which improves loading performance but not correctness) at the server's request (e.g., because it is under load or the operator otherwise does not wish to serve non-essential traffic)

Applications in the third category should consider acting as *agents which respect traffic advice*, honoring the server operator's wishes with minimal resource impact.

Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice`. If it returns a response with an [ok status](https://fetch.spec.whatwg.org/#ok-status) and an `application/trafficadvice+json` MIME type, the response body should contain valid UTF-8 encoded JSON like the following:

```json
[
{"user_agent": "prefetch-proxy", "disallow": true}
]
```

Each agent has a series of identifiers it recognizes, in order of specificity:
* its own agent name (e.g. `"ExamplePrivatePrefetchProxy"`)
* decreasingly specific generic categories that describe it, like `"prefetch-proxy"`
* `"*"` (which applies to every implementing agent)

> **Review discussion**
>
> **Contributor:** I think "agent" needs much more definition. Is Chrome a user agent? Is curl? Is googlebot? It sounds like the intent is to cover "agents" which don't send "direct user traffic". Those don't sound like user_agents to me...
>
> **Author:** I mean the term in the HTTP sense; the application which is constructing HTTP requests and sending them to a server, i.e. the application which would ordinarily identify itself with the `User-Agent` request header. I don't love the term in this context, but I think it's consistent with RFC 7231 and the robots.txt format, both of which would consider curl and googlebot to be user agents.

The agent finds the most specific matching element of the response and applies the corresponding advice (currently only a boolean which advises disallowing all traffic) to its behavior. The agent should respect the cache-related response headers to minimize the frequency of such requests and to revalidate the resource when it is stale.

Currently the only advice is the `"disallow"` key: a boolean which, if present and `true`, advises the agent not to establish connections to the origin. Other advice may be added in the future.

If, on the other hand, the response has a `404 Not Found` (or similar) status, the agent should apply its default behavior.
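The selection rules above can be sketched as follows, assuming the agent supplies its identifiers from most to least specific (ending with `"*"`). The function and variable names are illustrative, not part of the proposal:

```python
import json

def select_advice(body, identifiers):
    """Return the advice entry matching the most specific identifier, or None."""
    entries = json.loads(body)
    # Index entries by their "user_agent" identifier.
    by_agent = {e["user_agent"]: e for e in entries if isinstance(e, dict)}
    for ident in identifiers:  # ordered most- to least-specific
        if ident in by_agent:
            return by_agent[ident]
    return None  # no match: apply the agent's default behavior

# Using the sample response body from above:
advice = select_advice(
    '[{"user_agent": "prefetch-proxy", "disallow": true}]',
    ["ExamplePrivatePrefetchProxy", "prefetch-proxy", "*"],
)
disallowed = bool(advice and advice.get("disallow"))
```

A non-ok status, wrong MIME type, or unparseable body would be handled the same way as no match: the agent falls back to its default behavior.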

## Why not robots.txt?

`robots.txt` is designed for crawlers, especially search engine crawlers, and so site owners have likely already established robots rules because they wish to limit traffic from crawlers, even though they have no such concern about prefetch proxy traffic. The `robots.txt` format is also designed to limit traffic by path, which isn't appropriate for agents which do not know the path of the requests they are responsible for throttling (as with a CONNECT proxy carrying TLS traffic).
> **Review discussion**
>
> **Contributor:** This kind of ties into my above feedback about agents. I think you want to divide the universe up into three categories:
>
> * Whatever agents this applies to
> * Whatever agents robots.txt applies to
> * Agents which won't care about either (I think normal navigations from browsers fall into this category)
>
> **Author:** Added a short explanation of who should consider being an "agent which respects traffic advice". (I do consider it a politeness thing more than anything else; HTTP clients can and will send traffic anyway if they think it's more important than the server's advice.)

A textual format more similar to `robots.txt` would be possible, but the format for parsing `robots.txt` is not consistently specified and implemented. By contrast, JSON implementations are widely available on a wide variety of platforms used by site owners and authors.

## Application to private prefetch proxies

For example, suppose a private prefetch proxy, `ExamplePrivatePrefetchProxy`, would like to respect traffic advice in order to allow site owners to limit inbound traffic from the proxy.

When a client of the proxy service (e.g., a web browser) requests a connection to `https://www.example.com`, the proxy server issues an HTTP request for `https://www.example.com/.well-known/traffic-advice`. It receives the sample response body from above. It recognizes `"prefetch-proxy"` as the most specific advice to apply to itself.

It caches this result (traffic is presently disallowed) at the proxy server (or even across multiple proxy server instances run by the same operator), and refuses client connections to `https://www.example.com` until an updated `/.well-known/traffic-advice` resource no longer disallows traffic. Even if a large number of proxy clients request connections to `https://www.example.com`, the site operator and its CDN do not receive traffic from the proxy except for infrequent requests to revalidate the traffic advice (which may be, for example, once per hour).
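The caching behavior described above can be sketched as a small per-origin cache on the proxy side. This is a sketch under stated assumptions: the cache lifetime stands in for whatever `Cache-Control` max-age the proxy's HTTP client surfaces, `fetch_advice` stands in for the real advice request, and all names are hypothetical:

```python
import time

class AdviceCache:
    """Per-origin traffic-advice cache gating proxy connections."""

    def __init__(self, fetch_advice):
        # fetch_advice(origin) -> (disallow: bool, max_age_seconds: int)
        self.fetch_advice = fetch_advice
        self.entries = {}  # origin -> (disallow, expires_at)

    def may_connect(self, origin, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(origin)
        if entry is None or now >= entry[1]:
            # Missing or stale: revalidate the advice resource.
            disallow, max_age = self.fetch_advice(origin)
            entry = (disallow, now + max_age)
            self.entries[origin] = entry
        return not entry[0]
```

With this shape, any number of client connection requests within the cache lifetime cost the origin nothing: the proxy answers them from the cached advice and only revalidates (e.g., once per hour) when the entry expires.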