
Enable handling of data access delays using HTTP 301/Retry-After #274

Merged
merged 2 commits into from
Jun 17, 2019

Conversation

mikebaumann
Contributor

Motivation

There are cases in which storage systems incur data access delays longer than a reasonable HTTP request timeout, and the DRS specification must include provision for handling this.
The causes of these delays may include access to data in cold storage, system-specific internal data transfers, system load, and occasional aberrant delays in the system or underlying cloud platform.

For DRS to be a robust and broadly applicable standard, it must provide a way for the client and server to negotiate through data access delays.
Because DRS is intended for use with genomic data, and genomic datasets tend to be large and long-lived with storage costs that must be minimized and managed, the ability to support cold storage is essential.
Although some major cold storage systems provide low latency access to data in cold storage, others do not.

To focus on a concrete example, consider the use of DRS for a genomic data storage system utilizing AWS Glacier for cold storage. The AWS product page for Glacier defines Glacier's data access latency as follows:

Expedited retrievals typically return data in 1-5 minutes, and are great for Active Archive use cases. Standard retrievals typically complete between 3-5 hours, and work well for less time-sensitive needs like backup data, media editing, or long-term analytics. Bulk retrievals are the lowest-cost retrieval option, returning large amounts of data within 5-12 hours.

Moreover, consider the case of the storage system service running on AWS infrastructure utilizing the AWS API Gateway, which has a fixed HTTP request timeout value of 30 seconds. Because data access requiring minutes (or hours) cannot be completed within the 30-second timeout, there must be an interactive collaboration between the client and server to complete the data access operation.

DRS must be applicable to genomic data repository services that support a major cloud vendor using the vendor's current services for cost-effective storage of large data sets. Doing so requires that the client and server be able to negotiate through potentially long data access delays.

Design

The proposed design is based on standard HTTP semantics defined in RFC 7231. In short, if a data access operation will take more time than can reasonably be accommodated by a single HTTP request, the operation may return a response with HTTP code 301 (Moved Permanently) and a Retry-After header.

  • HTTP Code 301 (Moved Permanently):
    If/when a data repository service incurs a delay longer than a reasonable HTTP timeout, it should respond with HTTP code 301 and include the HTTP header Retry-After. After waiting for the specified Retry-After duration, the client should redirect to the URL provided in the Location header.

  • HTTP Header Retry-After
    The Retry-After value SHOULD represent the minimum duration the client should wait before attempting the operation again with a reasonable expectation of success.
    However, it may not be feasible for a DRS service to accurately estimate the expected duration, depending on the operation, the underlying cloud storage service, etc. Therefore, it is permissible to return a relatively short, fixed duration (e.g. 10 seconds) and repeat this interaction as needed until the data access operation is complete.

This optional 301 response with a Retry-After header is defined for all DRS operations involving stored data access. Although some DRS operations (e.g. /objects/{object_id}/access/{access_id}) may be more likely to encounter a data access delay than others (e.g. /bundles/{bundle_id}), it is not possible to identify which data access operations may or may not encounter delays in all current and future storage systems. It seems better to define a simple mechanism broadly across all data access operations than to add it for a subset of operations now and incrementally add support for more as needed over time, requiring additional and otherwise unnecessary DRS specification revisions.
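As an illustrative sketch of the proposed interaction (hypothetical client code, not part of this PR), a client could handle the 301/Retry-After loop as follows, assuming a `fetch` callable that performs the HTTP GET and returns a response exposing `status_code` and `headers`:

```python
import time

def resolve_with_retry(fetch, url, max_attempts=10):
    """Follow the proposed DRS delay protocol: on a 301 response with a
    Retry-After header, wait the indicated number of seconds, then follow
    the Location header (or retry the same URL if none is provided)."""
    for _ in range(max_attempts):
        response = fetch(url)
        if response.status_code == 301 and "Retry-After" in response.headers:
            time.sleep(int(response.headers["Retry-After"]))
            # Redirect to the Location if given; otherwise retry in place.
            url = response.headers.get("Location", url)
            continue
        return response  # data access complete (or a genuine error)
    raise TimeoutError("data access did not complete within the retry budget")
```

The loop bounds the number of retries so a misbehaving server cannot stall the client indefinitely; `max_attempts` is an assumed client-side policy, not part of the proposed specification.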

Implementation and Use Considerations

The impact of data access delays on workflow execution services and other large-scale analyses warrants special consideration. Data access delays during execution can be very costly and time-consuming, especially when they are multiplied across thousands of files while a large-scale execution infrastructure is already spun up.

Any viable data repository service providing data for analysis purposes must provide a way to mitigate/minimize data access delays during workflow execution.
One effective technique is for the service to use a long-lived data cache. For example, although the delay of retrieving a given file from cold storage may be unavoidable, the service may place the file in a long-lived cache so that subsequent access to the file is fast.

The presence of a long-lived cache enables workflow orchestration services to perform an initial DRS data access operation for all input data before spinning up the execution infrastructure, thus incurring the delays less expensively and ensuring that data access during the actual workflow execution is fast.

Notes

This PR is submitted on behalf of the Human Cell Atlas (HCA) driver project.

The HCA data-store is designed to store large volumes of genomic sequencing, imaging, and other data for researchers worldwide. It replicates the data across major cloud platforms (currently AWS and GCP) to allow researchers to use the cloud platform of their choice for workflow execution and analysis.

The use of SHOULD is as defined in RFC 2119.

Resolves #238

@mikebaumann mikebaumann mentioned this pull request May 22, 2019
@mikebaumann mikebaumann marked this pull request as ready for review May 23, 2019 17:59
headers:
  Retry-After:
    description: >
      Delay in seconds. The client should follow the redirect after waiting for this duration.


Since Retry-After is also mentioned in the HTTP RFCs, it might make sense to qualify that only the integral relative time in seconds may be returned by the server and that the absolute string date (that's overly complicated to parse) is explicitly disallowed.
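To illustrate that restriction (a hypothetical helper, not from the spec or this PR), a validator could accept only the integral delay-seconds form of Retry-After and reject the HTTP-date form:

```python
def parse_retry_after(value: str) -> int:
    """Accept only the delay-seconds form of Retry-After; reject the
    HTTP-date form, which the comment above suggests disallowing."""
    stripped = value.strip()
    if not stripped.isdigit():
        raise ValueError("Retry-After must be delay-seconds, not an HTTP-date")
    return int(stripped)
```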

@ddietterich
Contributor

This is an interesting use case - a good example of where one would not want the getObject operation to try to return a live URL, but instead use an access_id.

A few questions

  • Do we need the async method at all?
    An alternative for this use case is to respond with a protocol that gets the slow object. In this case, I could imagine returning the "s3 glacier" protocol. The client would operate S3 Glacier using its REST interface. I'm not a fan of this; just making the observation that there is another approach.

  • Do we always need the async method?
    Do we anticipate very slow DRS implementations such that we need async on every endpoint? I suppose it is possible that an object lookup would take more than 30 seconds. If not, then perhaps we only need to allow for async on the object/access method.

  • Should we use the 301 return as proposed?
    I am not a fan of the 301 approach. 301 is saying, "this URL no longer works for that object". What we want to say is, "this will be slow, don't wait up." Some frameworks automatically handle 301 in a way that is not compatible with the proposed semantics.

The pattern I have seen and implemented is to return 202 (ACCEPTED) with a Location header that says where to ask for status and a header that provides a "when to retry" hint. This seems more accurate to me, because the REST request has implicitly requested a long-running operation. It does mean a DRS service that returns 202 needs to implement a polling endpoint. Standardizing the polling endpoint would make life easier for clients.

Contributor

@briandoconnor briandoconnor left a comment


A few comments:

  • Seems like the object request is what will take time but I think it makes sense to apply to any endpoint we think may take time just in case
  • is 202 better than 301? As Dan said, the 202 description seems to fit better: https://restfulapi.net/http-status-codes/ . Is it just a drop-in replacement -- swap 202 for 301 in this PR but keep the same Retry-After header and documentation in the spec?
  • this PR will need to be updated now that objects and bundles are unified

@dglazer
Member

dglazer commented Jun 3, 2019

  1. I agree with @ddietterich and @briandoconnor that 202 fits this use case much better than 301. From the HTTP/1.1 RFC:

The request has been accepted for processing, but the processing has not been completed. ... The entity returned with this response SHOULD include ... some estimate of when the user can expect the request to be fulfilled.

  2. I slightly prefer only documenting this behavior for the /objects/{object_id}/access/{access_id} endpoint, since as @ddietterich says I can't see any good reason for /objects/{object_id} to ever need it. But I don't feel strongly.

  3. As @briandoconnor says, the PR needs (trivial) changes to be mergeable -- it should just be deleting the reference to the GetBundle endpoint.

@susheel
Member

susheel commented Jun 9, 2019

202 ACCEPTED would be preferred for ELIXIR, with the additional caveat that the DRS spec should support the returned access_id. It could be as simple as:

{
  "access_id": "ABC123",
  "href": "/objects/<object_id>/access/ABC123"
}

...but this needs to be in the specification and API response so there is no ambiguity about what the client needs to do next.

@dglazer
Member

dglazer commented Jun 9, 2019

@susheel , are you suggesting that access_id and href be part of the 202 ACCEPTED response, in addition to Retry-After? If so I think that makes sense in theory, but is overkill in practice -- the 202 would be returned by a call that's already using that href to fetch that access_id, so we can make the client's life easier by just saying "if you're told to Retry-After, please retry the same call after the specified delay". (And the same is true if we decide to support 202 on /objects/{object_id}, which I still don't think is needed, but also don't object to.)

@susheel
Member

susheel commented Jun 9, 2019

@dglazer You're right. I assumed the choice according to Rishi's email was between 301 Retry-After and 202 Accepted, not both.

I can see how adding both would simplify things, but you would still need to provide the full Location response in the header. So essentially the flow will be:

GET /objects/XYZ/access/ABC HTTP/1.1
Host: example.com

HTTP/1.1 202 Accepted
Location: http://example.com/objects/XYZ/access/ABC
Retry-After: 30

You would still need to provide the Location, as the client may not have the same state, esp. if it is part of a polling queue.

@dglazer
Member

dglazer commented Jun 9, 2019

I think we're mostly agreeing -- I'm picturing a 202 response code with a Retry-After header:

  1. Client sends a request to the server:
GET /objects/XYZ/access/ABC HTTP/1.1
  2. Server responds:
HTTP/1.1 202 Accepted
Retry-After: 30
  3. Client waits 30 seconds and resends the exact same request to the exact same address. (And potentially repeats if it gets another 202 back.)

Not sure why a Location would be needed? Even if the request is sent from some queue, the queue has to know what address it's sending to.
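A minimal sketch of this flow (hypothetical client code, not part of the PR; `fetch` is any callable returning a response exposing `status_code` and `headers`) -- the client simply resends the same request until it stops getting 202:

```python
import time

def poll_until_ready(fetch, url, max_attempts=20):
    """Resend the exact same request after each 202/Retry-After
    response; no Location header is needed."""
    for _ in range(max_attempts):
        response = fetch(url)
        if response.status_code != 202:
            return response
        # Assume a short fixed fallback delay if the server omits Retry-After.
        time.sleep(int(response.headers.get("Retry-After", "10")))
    raise TimeoutError("server kept responding 202")
```

The `max_attempts` bound and the 10-second fallback are assumed client-side policies, not spec requirements.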

@briandoconnor
Contributor

@dglazer your summary looks good to me. I'll ping Brian H and Michael to take a look and weigh in.

@briandoconnor
Contributor

Since this is on a branch in Michael's repo we need to get him to update to 202 (if he agrees). @mikebaumann can you do this ASAP? We're trying to get this PR voted on by 6/15 at the latest so we can merge.

Also, there are merge conflicts now, can you resolve?

@xbrianh

xbrianh commented Jun 10, 2019

There seems to be some consensus on returning a 202 status code with a Retry-After header. However, it would be useful for the HCA to support 301 as well, since that is what we currently use.

Otherwise this looks fine to me.

Michael Baumann added 2 commits June 10, 2019 19:46
Enable DRS schema support for data repository services that
may incur delays, such as retrieval of data from cold storage
with substantial latency.

When an operation is delayed, a response is provided with
HTTP code 301 and a Retry-After header indicating the duration
the client should wait before following the redirect.

Resolves #238
Changed the response in the case of a delay from
status code 301 (Moved Permanently) to 202 (Accepted).
This is a better choice as it is more consistent with
the IETF specifications for HTTP and the DRS API overall.
@mikebaumann
Contributor Author

I changed the response in the case of a delay from 301 (Moved Permanently) to 202 (Accepted), consistent with the consensus above and today's (6/10/19) workstream meeting.
Note that the HCA data store uses the 301 redirect URL to convey state information, and this is not possible with 202 (as there is no redirect URL). Instead, the HCA data store will need to track this state internally. This seems feasible to me, and having DRS support both 301 and 202 seems unnecessarily complex for clients.
All the same, I defer to @xbrianh to cast the HCA vote for this PR.

Member

@dglazer dglazer left a comment


LGTM -- thanks for the quick changes @mikebaumann

@ddietterich
Contributor

+1 from Terra Data Repo

@briandoconnor
Contributor

Since Brian H is on vacation and Michael B is proposing this I think it's safe to vote +1 for HCA.

@briandoconnor briandoconnor self-requested a review June 17, 2019 15:27
Contributor

@briandoconnor briandoconnor left a comment


lgtm

@dglazer dglazer merged commit e841670 into ga4gh:develop Jun 17, 2019
@mikebaumann mikebaumann deleted the feature/issue-238-retry-in-drs branch June 25, 2019 22:46