Enable handling of data access delays using HTTP 301/Retry-After #274
headers:
  Retry-After:
    description: >
      Delay in seconds. The client should follow the redirect after waiting for this duration.
Since Retry-After is also mentioned in the HTTP RFCs, it might make sense to qualify that only the integral relative time in seconds may be returned by the server and that the absolute string date (that's overly complicated to parse) is explicitly disallowed.
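To make the suggestion above concrete, here is a minimal client-side sketch of a parser that accepts only the integral delay-seconds form and rejects the HTTP-date form. The function name `parse_retry_after_seconds` is hypothetical and is not part of the DRS specification.

```python
import re

# Only ASCII digits are accepted; the HTTP-date form is rejected outright.
_DELAY_SECONDS = re.compile(r"[0-9]+")


def parse_retry_after_seconds(value: str) -> int:
    """Return the Retry-After delay in seconds (hypothetical helper)."""
    text = value.strip()
    if not _DELAY_SECONDS.fullmatch(text):
        raise ValueError(
            f"Retry-After must be a non-negative integer of seconds, got {value!r}"
        )
    return int(text)
```

A client using this helper would treat a date-valued header as a protocol error rather than attempting to parse it.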
This is an interesting use case - a good example of where one would not want the getObject operation to try to return a live URL, but instead use an access_id. A few questions
The pattern I have seen and implemented is to return 202 (ACCEPTED) with a Location header that says where to ask for status and a header that provides a "when to retry" hint. This seems more accurate to me, because the REST request has implicitly requested a long-running operation. It does mean a DRS service that returns 202 needs to implement a polling endpoint. Standardizing the polling endpoint would make life easier for clients.
A few comments:
- It seems the object request is what will take time, but I think it makes sense to apply this to any endpoint we think may take time, just in case.
- Is 202 better than 301? The description of 202 seems to fit better, as Dan said: https://restfulapi.net/http-status-codes/ . Is it just a drop-in replacement: replace 301 with 202 in this PR but use the same Retry-After header and documentation in the spec?
- This PR will need to be updated now that objects and bundles are unified.
{
  "access_id": "ABC123",
  "href": "/objects/<object_id>/access/ABC123"
}
...but this needs to be in the specification and API response so there is no ambiguity about what the client needs to do next.
@susheel, are you suggesting that
@dglazer You're right. I assumed the choice according to Rishi's email was between I can see how adding both would simplify things, but you would still need to provide the full Location response in the header. So essentially the flow will be:
GET /objects/XYZ/access/ABC HTTP/1.1
Host: example.com

HTTP/1.1 202 Accepted
Location: http://example.com/objects/XYZ/access/ABC
Retry-After: 30
You would still need to provide the Location, as the client may not have the same state, especially if it is part of a polling queue.
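As an illustration of the flow above, a client-side polling loop might look like the sketch below. It assumes the server answers 202 with `Location` and an integer `Retry-After` until the data is ready. The `fetch` callable is a hypothetical stand-in for a real HTTP client, headers are assumed to be keyed exactly as `Location` and `Retry-After`, and the 10-second fallback is an assumption of this sketch.

```python
import time
from typing import Callable, Mapping, Tuple

# (status, headers, body); a stand-in for whatever HTTP client is in use.
Response = Tuple[int, Mapping[str, str], bytes]


def poll_until_ready(
    url: str,
    fetch: Callable[[str], Response],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 20,
) -> Response:
    """Re-request while the server answers 202, honoring Retry-After."""
    for _ in range(max_attempts):
        status, headers, body = fetch(url)
        if status != 202:
            # Anything other than 202 is final: success or an error.
            return status, headers, body
        # Poll wherever the server points; fall back to the same URL.
        url = headers.get("Location", url)
        # Integer seconds only; a date-valued header would raise ValueError.
        sleep(int(headers.get("Retry-After", "10")))
    raise TimeoutError(f"still delayed after {max_attempts} attempts: {url}")
```

Passing `fetch` and `sleep` in as parameters keeps the loop independent of any particular HTTP library and makes it testable without a network.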
I think we're mostly agreeing -- I'm picturing a 202 response code with a
Not sure why a
@dglazer your summary looks good to me. I'll ping Brian H and Michael to take a look and weigh in.
Since this is on a branch in Michael's repo we need to get him to update to 202 (if he agrees). @mikebaumann can you do this ASAP? We're trying to get this PR voted on by 6/15 at the latest so we can merge. Also, there are merge conflicts now, can you resolve?
There seems to be some consensus on returning a Otherwise this looks fine to me.
Enable DRS schema support for data repository services that may incur delays, such as retrieval of data from cold storage with substantial latency. When an operation is delayed, a response is provided with HTTP code 301 and a Retry-After header indicating the duration the client should wait before following the redirect. Resolves #238
Changed the response in the case of a delay from status code 301 (Moved Permanently) to 202 (Accepted). This is a better choice as it is more consistent with the IETF specifications for HTTP and the DRS API overall.
I changed the response in the case of a delay from 301 (Moved Permanently) to 202 (Accepted), consistent with the consensus above and today's (6/10/19) workstream meeting.
LGTM -- thanks for the quick changes @mikebaumann
+1 from Terra Data Repo
Since Brian H is on vacation and Michael B is proposing this I think it's safe to vote +1 for HCA.
lgtm
Motivation
There are cases in which storage systems incur data access delays longer than a reasonable HTTP request timeout, and the DRS specification must include provision for handling this.
The causes of these delays may include access to data in cold storage, system-specific internal data transfers, system load, and occasional aberrant delays in the system or underlying cloud platform.
For DRS to be a robust and broadly applicable standard, it must provide a way for the client and server to negotiate through data access delays.
Because DRS is intended for use with genomic data, because genomic datasets tend to be large and long-lived, and because storage costs must be minimized and managed, the ability to support cold storage is essential.
Although some major cold storage systems provide low latency access to data in cold storage, others do not.
To focus on a concrete example, consider the use of DRS for a genomic data storage system utilizing AWS Glacier for cold storage. The AWS product page for Glacier defines Glacier's data access latency as follows:
Moreover, consider the case of the storage system service running on AWS infrastructure utilizing the AWS API Gateway, which has a fixed HTTP request timeout value of 30 seconds. Because data access requiring minutes (or hours) cannot be completed within the 30-second timeout, there must be an interactive collaboration between the client and server to complete the data access operation.
DRS must be applicable to genomic data repository services that support a major cloud vendor using the vendor's current services for cost-effective storage of large data sets. Doing so requires that the client and server be able to negotiate through potentially long data access delays.
Design
The proposed design is based on standard HTTP semantics defined in RFC 7231. In short, if a data access operation is going to take more time than can be reasonably accommodated by a single HTTP request, the operation may return a response with HTTP code 301 (Moved Permanently) with a Retry-After header.
HTTP Code 301 (Moved Permanently):
If/when a data repository service incurs a delay longer than a reasonable HTTP timeout, it should respond with HTTP code 301 and include the HTTP header Retry-After. After waiting for the specified Retry-After duration, the client should follow the redirect to the URL provided in the Location header.
HTTP Header Retry-After
The Retry-After value SHOULD represent the minimum duration the client should wait before attempting the operation again with a reasonable expectation of success.
However, it may not be feasible for a DRS service to accurately identify the expected duration, depending on the operation, underlying cloud storage service, etc. Therefore, it is permissible to return a relatively short, fixed duration (e.g. 10 seconds) and repeat this interaction as needed until such time as the data access operation is complete.
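On the server side, the choice between an estimated delay and a short fixed delay could be sketched as follows. The function `delayed_response` and the constant name are hypothetical. The status code is left as a required parameter because this description specifies 301 while the review discussion moved to 202.

```python
import math
from typing import Dict, Optional, Tuple

# The "e.g. 10 seconds" fixed fallback mentioned in the text.
DEFAULT_RETRY_AFTER_SECONDS = 10


def delayed_response(
    location: str,
    *,
    status: int,
    estimated_seconds: Optional[float] = None,
) -> Tuple[int, Dict[str, str]]:
    """Build the status and headers for a delayed data access operation."""
    if estimated_seconds is None:
        # No usable estimate: ask the client to check back after a short delay.
        delay = DEFAULT_RETRY_AFTER_SECONDS
    else:
        # Round up so the client never retries before the estimate elapses.
        delay = max(1, math.ceil(estimated_seconds))
    return status, {"Location": location, "Retry-After": str(delay)}
```

A service with no latency estimate would call this repeatedly with `estimated_seconds=None`, giving the repeated short-delay interaction described above.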
This optional 301 response with a Retry-After header is defined for all DRS operations involving stored data access. Although some DRS operations (e.g. /objects/{object_id}/access/{access_id}) may be more likely to encounter a data access delay than others (e.g. /bundles/{bundle_id}), it is not possible to identify which data access operations may or may not encounter delays in all current and future storage systems. It seems better to define a simple mechanism broadly across all data access operations than to add it for a subset of operations now and incrementally add support for more as needed over time, requiring additional and otherwise unnecessary DRS specification revisions.
Implementation and Use Considerations
The impact of data access delays on workflow execution services and other large-scale analyses warrants special consideration. Data access delays occurring during execution can be very costly and time-consuming, especially when they are multiplied across thousands of files while a large-scale execution infrastructure is already spun up.
Any viable data repository service providing data for analysis purposes must provide a way to mitigate/minimize data access delays during workflow execution.
One effective technique is for the service to use a long-lived data cache. For example, although the delay of retrieving a given file from cold storage may be unavoidable, the service may place the file in a long-lived cache so that subsequent access to the file is fast.
The presence of a long-lived cache enables workflow orchestration services to perform an initial DRS data access operation for all input data before spinning up the execution infrastructure, thus incurring the delays less expensively and ensuring that data access during the actual workflow execution is fast.
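The warm-up step described above could be sketched as a loop that touches every input before any execution infrastructure is started. The `probe` callable is a hypothetical stand-in for one DRS access request. It is assumed to report whether the object is ready and, if not, the server's Retry-After hint in seconds.

```python
import time
from typing import Callable, Iterable, Tuple

# Hypothetical probe: returns (ready, retry_after_seconds) for one access URL.
Probe = Callable[[str], Tuple[bool, int]]


def warm_inputs(
    urls: Iterable[str],
    probe: Probe,
    sleep: Callable[[float], None] = time.sleep,
    max_rounds: int = 100,
) -> None:
    """Request every input once per round until none reports a delay."""
    pending = list(urls)
    for _ in range(max_rounds):
        waits = []
        still_pending = []
        for url in pending:
            ready, retry_after = probe(url)
            if not ready:
                still_pending.append(url)
                waits.append(retry_after)
        if not still_pending:
            return
        pending = still_pending
        # One sleep per round, long enough to satisfy the largest hint.
        sleep(max(waits))
    raise TimeoutError(f"{len(pending)} inputs still delayed after {max_rounds} rounds")
```

An orchestrator would call `warm_inputs` first and start workers only after it returns, so that the cold storage latency is paid while no compute is running.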
Notes
This PR is submitted on behalf of the Human Cell Atlas (HCA) driver project.
The HCA data-store is designed to store large volumes of genomic sequencing, imaging, and other data for researchers worldwide. It replicates the data across major cloud platforms (currently AWS and GCP) to allow researchers to use the cloud platform of their choice for workflow execution and analysis.
The use of SHOULD is as defined in RFC 2119.
Resolves #238