From 1d4403d2cc4aa950e3187e8cc8996a0bbb38554a Mon Sep 17 00:00:00 2001
From: Jeremy Roman
Date: Thu, 8 Apr 2021 18:16:26 -0400
Subject: [PATCH 1/4] Add an explainer for traffic advice.

---
 README.md         |  6 +++---
 traffic-advice.md | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 3 deletions(-)
 create mode 100644 traffic-advice.md

diff --git a/README.md b/README.md
index ad7766a..c961acf 100644
--- a/README.md
+++ b/README.md
@@ -68,12 +68,12 @@ Where:
 Users can opt-out of the feature at any time. Furthermore, users can temporarily opt-out of the feature by using their browser’s private browsing mode.
 
 #### Publisher opt-out
-One option for origin-wide opt-out is to leverage the publisher's DNS record:
+Publishers can opt out by disallowing connections in their [traffic advice](traffic-advice.md). This advice would be fetched and cached by the proxy, and can be used by publishers by adding a single resource to their origin at a well-known path.
+
+Another option for origin-wide opt-out is to leverage the publisher's DNS record:
 * Publishers specify in their DNS entry that they are opting out of proxied prefetching (completely or with some TBD granularity if necessary).
 * The DNS check would be done by the proxy for privacy reasons; issuing a DNS request from the browser before navigation would share prefetch information with the DNS resolver and potentially the target host.
 
-Alternatively (or in addition), we could define a [/.well-known URL](https://tools.ietf.org/html/rfc5785) that can be used for publisher opt-out, and this URL would be fetched and cached by the proxy. This has the advantage that it is easier for developers to add a new resource than to modify their DNS record.
-
 Ideally, the browser would fetch the opt-out signal *before* making a connection to the proxy. While there are proposals to enable anonymous fetching of both DNS records ([Oblivious DNS](https://tools.ietf.org/html/draft-pauly-dprive-oblivious-doh-00)) and HTTP resources ([Oblivious HTTP](https://tools.ietf.org/html/draft-thomson-http-oblivious-00)), neither is well-supported yet. If either of those proposals gains traction, we may want to revisit the publisher opt-out design to take advantage of Oblivous fetching.
 
 In addition, publishers can opt-out for individual requests, for example, when dealing with temporary traffic spikes or other issues. For these, publishers should look for the `Purpose: prefetch` request header and reject requests accordingly (see [Geolocation](https://github.com/buettner/private-prefetch-proxy#geolocation) for an example use case).
diff --git a/traffic-advice.md b/traffic-advice.md
new file mode 100644
index 0000000..8f845cb
--- /dev/null
+++ b/traffic-advice.md
@@ -0,0 +1,32 @@
+# Traffic Advice
+
+Publishers may wish not to accept traffic from [private prefetch proxies](README.md) and other sources other than direct user traffic, for instance to reduce server load due to speculative prefetch activity.
+
+We propose a well-known "traffic advice" resource, analogous to `/robots.txt` (for web crawlers), which allows an HTTP server to declare that implementing agents should stop sending traffic to it for some time.
+
+## Proposal
+
+Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice.json`. If it returns a `200 OK` response with the `application/json` MIME type, the response body should contain valid JSON like the following:
+
+```json
+[
+  {"user_agent": "prefetch-proxy", "disallow": true}
+]
+```
+
+Each agent has a series of identifiers it recognizes, in order of specificity:
+* its own agent name
+* decreasingly specific generic categories that describe it, like `"prefetch-proxy"`
+* `"*"` (which applies to every implementing agent)
+
+It finds the most specific element of the response, and applies the corresponding advice (currently only a boolean which advises disallowing all traffic) to its behavior. The agent should respect the cache-related response headers to minimize the frequency of such requests and to revalidate the resource when it is stale.
+
+Currently the only advice is the key `"disallow"`, which specifies a boolean which, if present and `true`, advises the agent not to establish connections to the origin. In the future other advice may be added.
+
+If the response has a `404 Not Found` status, on the other hand, the agent should apply its default behavior.
+
+## Why not robots.txt?
+
+`robots.txt` is designed for crawlers, especially search engine crawlers, and so site owners have likely already established robots rules because they wish to limit traffic from crawlers -- even though they have no such concern about prefetch proxy traffic. The `robots.txt` format is also designed to limit traffic by path, which isn't appropriate for agents which do not know the path of the requests they are responsible for throttling (as with a CONNECT proxy carrying TLS traffic).
+
+A more similar textual format would be possible, but the format for parsing `robots.txt` is not consistently specified and implemented. By contrast, JSON implementations are widely available on a wide variety of platforms used by site owners and authors.
\ No newline at end of file

From 9433c521aecba2acba8ced0239791eadddadfa7c Mon Sep 17 00:00:00 2001
From: Jeremy Roman
Date: Mon, 12 Apr 2021 12:00:19 -0400
Subject: [PATCH 2/4] explain agents a little more

---
 traffic-advice.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/traffic-advice.md b/traffic-advice.md
index 8f845cb..bd78339 100644
--- a/traffic-advice.md
+++ b/traffic-advice.md
@@ -6,6 +6,13 @@ We propose a well-known "traffic advice" resource, analogous to `/robots.txt` (f
 
 ## Proposal
 
+HTTP request activity can broadly be divided into:
+* activity on behalf of a user interaction (e.g., a web browser loading a web page requested by the user), or which for another reason cannot easily be discarded
+* activity for which there is an existing specialized mechanism for throttling traffic (e.g. web crawlers respecting `robots.txt`)
+* activity which can easily be discarded (e.g., because it corresponds to a prefetch which improves loading performance but not correctness) at the server's request (e.g., because it is under load or the operator otherwise does not wish to serve non-essential traffic)
+
+Applications in the third category should consider acting as *agents which respect traffic advice*, so as to respect the server operator's wishes with a minimum resource impact.
+
 Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice.json`. If it returns a `200 OK` response with the `application/json` MIME type, the response body should contain valid JSON like the following:
 
 ```json

From 8798e18b88bed712d0662264fb3f128bd5becfa5 Mon Sep 17 00:00:00 2001
From: Jeremy Roman
Date: Mon, 12 Apr 2021 14:37:50 -0400
Subject: [PATCH 3/4] no file extension, change mime type, utf8, example

---
 traffic-advice.md | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/traffic-advice.md b/traffic-advice.md
index bd78339..58c9557 100644
--- a/traffic-advice.md
+++ b/traffic-advice.md
@@ -13,7 +13,7 @@ HTTP request activity can broadly be divided into:
 
 Applications in the third category should consider acting as *agents which respect traffic advice*, so as to respect the server operator's wishes with a minimum resource impact.
 
-Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice.json`. If it returns a `200 OK` response with the `application/json` MIME type, the response body should contain valid JSON like the following:
+Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice`. If it returns a response with an [ok status](https://fetch.spec.whatwg.org/#ok-status) and an `application/trafficadvice+json` MIME type, the response body should contain valid UTF-8 encoded JSON like the following:
 
 ```json
 [
@@ -30,10 +30,18 @@ It finds the most specific element of the response, and applies the correspondin
 
 Currently the only advice is the key `"disallow"`, which specifies a boolean which, if present and `true`, advises the agent not to establish connections to the origin. In the future other advice may be added.
 
-If the response has a `404 Not Found` status, on the other hand, the agent should apply its default behavior.
+If the response has a `404 Not Found` status (or a similar status), on the other hand, the agent should apply its default behavior.
 
 ## Why not robots.txt?
 
 `robots.txt` is designed for crawlers, especially search engine crawlers, and so site owners have likely already established robots rules because they wish to limit traffic from crawlers -- even though they have no such concern about prefetch proxy traffic. The `robots.txt` format is also designed to limit traffic by path, which isn't appropriate for agents which do not know the path of the requests they are responsible for throttling (as with a CONNECT proxy carrying TLS traffic).
 
-A more similar textual format would be possible, but the format for parsing `robots.txt` is not consistently specified and implemented. By contrast, JSON implementations are widely available on a wide variety of platforms used by site owners and authors.
\ No newline at end of file
+A more similar textual format would be possible, but the format for parsing `robots.txt` is not consistently specified and implemented. By contrast, JSON implementations are widely available on a wide variety of platforms used by site owners and authors.
+
+## Application to private prefetch proxies
+
+For example, suppose a private prefetch proxy, `ExamplePrivatePrefetchProxy`, would like to respect traffic advice in order to allow site owners to limit inbound traffic from the proxy.
+
+When a client of the proxy service requests a connection to `https://www.example.com`, the proxy server issues an HTTP request for `https://www.example.com/.well-known/traffic-advice`. It receives the sample response body from above. It recognizes `"prefetch-proxy"` as the most specific advice to apply to itself.
+
+It caches this result (traffic is presently disallowed) at the proxy server (or even across multiple proxy server instances run by the same operator), and refuses client connections to `https://www.example.com` until an updated `/.well-known/traffic-advice` resource no longer disallows traffic. Even if a large number of proxy clients request connections to `https://www.example.com`, the site operator and its CDN do not receive traffic from the proxy except for infrequent requests to revalidate the traffic advice (which may be, for example, once per hour).
\ No newline at end of file

From 95bc3fdd5f10ce41db6d22d3319156fff6be166f Mon Sep 17 00:00:00 2001
From: Jeremy Roman
Date: Tue, 13 Apr 2021 13:28:59 -0400
Subject: [PATCH 4/4] more examples

---
 traffic-advice.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/traffic-advice.md b/traffic-advice.md
index 58c9557..2555f4b 100644
--- a/traffic-advice.md
+++ b/traffic-advice.md
@@ -22,7 +22,7 @@ Agents which respect traffic advice should fetch the well-known path `/.well-kno
 ```
 
 Each agent has a series of identifiers it recognizes, in order of specificity:
-* its own agent name
+* its own agent name (e.g. `"ExamplePrivatePrefetchProxy"`)
 * decreasingly specific generic categories that describe it, like `"prefetch-proxy"`
 * `"*"` (which applies to every implementing agent)
 
@@ -42,6 +42,6 @@ A more similar textual format would be possible, but the format for parsing `rob
 
 For example, suppose a private prefetch proxy, `ExamplePrivatePrefetchProxy`, would like to respect traffic advice in order to allow site owners to limit inbound traffic from the proxy.
 
-When a client of the proxy service requests a connection to `https://www.example.com`, the proxy server issues an HTTP request for `https://www.example.com/.well-known/traffic-advice`. It receives the sample response body from above. It recognizes `"prefetch-proxy"` as the most specific advice to apply to itself.
+When a client of the proxy service (e.g., a web browser) requests a connection to `https://www.example.com`, the proxy server issues an HTTP request for `https://www.example.com/.well-known/traffic-advice`. It receives the sample response body from above. It recognizes `"prefetch-proxy"` as the most specific advice to apply to itself.
 
 It caches this result (traffic is presently disallowed) at the proxy server (or even across multiple proxy server instances run by the same operator), and refuses client connections to `https://www.example.com` until an updated `/.well-known/traffic-advice` resource no longer disallows traffic. Even if a large number of proxy clients request connections to `https://www.example.com`, the site operator and its CDN do not receive traffic from the proxy except for infrequent requests to revalidate the traffic advice (which may be, for example, once per hour).
\ No newline at end of file
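
Note on the matching rule in this series: the explainer says an agent walks its identifiers from most to least specific and applies the first matching element of the response. The following is only a minimal Python sketch of that selection step, not part of the patches. The names `AGENT_IDENTIFIERS`, `select_advice` and `is_disallowed` are hypothetical, and treating a non-array body as "no advice" is an assumption, since the explainer does not say how malformed responses should be handled. Fetching, MIME-type checks and caching are left out.

```python
import json

# Hypothetical identifier list for the example agent, most specific first,
# mirroring the order described in the explainer.
AGENT_IDENTIFIERS = ["ExamplePrivatePrefetchProxy", "prefetch-proxy", "*"]


def select_advice(body, identifiers=AGENT_IDENTIFIERS):
    """Return the entry for the most specific matching identifier, or None."""
    entries = json.loads(body)  # raises ValueError on invalid JSON
    if not isinstance(entries, list):
        return None  # assumption: a non-array body carries no advice
    for identifier in identifiers:
        for entry in entries:
            if isinstance(entry, dict) and entry.get("user_agent") == identifier:
                return entry
    return None


def is_disallowed(body, identifiers=AGENT_IDENTIFIERS):
    """True only if the selected entry has "disallow" present and true."""
    advice = select_advice(body, identifiers)
    return advice is not None and advice.get("disallow") is True


# The sample response body from the explainer.
sample = '[{"user_agent": "prefetch-proxy", "disallow": true}]'
print(is_disallowed(sample))  # prints True
```

With this ordering, an entry for `"prefetch-proxy"` takes precedence over a `"*"` entry in the same response, which matches the "most specific element" wording in the proposal.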