LG-9449: Prevent Location Search slowdown and errors from ArcGIS auth#8579
dawei-nava wants to merge 34 commits into main from
Conversation
changelog: Internal, ArcGis token management, job to manage/refresh token
@@ -0,0 +1,22 @@
```ruby
module ArcgisApi
  # Stateless factory object that creates a new Faraday::Connection object
  class ConnectionFactory
```
What is the purpose of this class? Is there a reason we couldn't reuse geocoder.faraday?
To easily inject a mock that stubs requests on the connection for testing purposes, and to customize Faraday middleware behavior.
FWIW there is test support that allows you to simulate HTTP interactions, making mocks of HTTP libraries like Faraday unnecessary (see the ArcgisApiHelper). That said I like the separation.
@allthesignals, good point. This was to mock at the Faraday level; WebMock covers much more. It was adapted to the mock geocoder used in the feature tests. Let me see whether I can handle it better.
I'm ambivalent to the existence of this class. There are two effective usages of this (excluding suggest which is not currently used).
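To illustrate the testability argument above, here is a minimal, framework-free sketch (class shape and names are assumptions, not the PR's actual code): a stateless factory whose connection builder can be injected, so tests swap in a stub instead of mocking HTTP internals.

```ruby
# Hypothetical sketch of a stateless connection factory. The default builder
# stands in for "configure a Faraday::Connection with middleware"; tests can
# pass their own builder lambda.
class ConnectionFactory
  DEFAULT_BUILDER = ->(url) { { url: url, middleware: [:retry, :response_validation] } }

  def initialize(builder: DEFAULT_BUILDER)
    @builder = builder
  end

  # Builds a fresh connection on every call; the factory holds no state
  # beyond the injected builder.
  def connection(url)
    @builder.call(url)
  end
end
```

In a spec, `ConnectionFactory.new(builder: ->(_url) { stubbed_connection })` avoids touching Faraday at all, which is the separation the reviewers are weighing against WebMock-style request stubbing.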
```ruby
def handle_expired_token(cache_value)
  cache_value = wrap_raw_token(cache_value) unless cache_value.is_a?(Hash)
  expires_at = cache_value.fetch(:expires_at, nil)
  sliding_expires_at = cache_value&.fetch(:sliding_expires_at, nil)
  # start fetch new token before actual expires time
  if IdentityConfig.store.arcgis_token_sliding_expiration_enabled
    now = Time.zone.now.to_f
    if sliding_expires_at && now >= sliding_expires_at
      # if passed sliding expiration time (exp - 3*prefetch_ttl)
      # but not 2*prefetch_ttl from hard expiration time
      if now > (sliding_expires_at + prefetch_ttl)
        cache_value = retrieve_token
      else
        # extend sliding expiration time
        cache_value[:sliding_expires_at] = sliding_expires_at + prefetch_ttl
        save_token(cache_value, expires_at)
      end
    end
  elsif expires_at && expires_at <= now
    cache_value = retrieve_token
  end
  cache_value
end
```
The particular complexity and conditionals used in this method suggest that we should apply a strategy pattern. Also, it's likely that we will want to remove the previous caching strategy after the new one has proven effective - the way we implement this change should support safe removal of the previous strategy.
@NavaTim, this is part of dealing with an expiring token on the web application side, and it's disabled by default since we assume the job will refresh the token as needed.
It acts as an escape route that hopefully is never needed.
At the same time, let me see what refactoring/abstraction can be done.
@NavaTim I played with it a little bit and ran into a kind of circular dependency with the strategy pattern; trying subclassing now.
Act as an escape route hopefully never needed?

I think it's acceptable (though undesirable) to use this as a backup to the job, but that does increase the risk of excessive authentication calls. It may be better to turn it off entirely, but for now we will need both.
@NavaTim played with it a little bit, got kind of circular dependency with strategy pattern, trying subclassing now.
I think the sliding window approach used here may be more complicated than necessary. We can use a specific amount of time before expiration instead, and maybe that will make this logic simpler.
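The simpler alternative suggested here can be sketched as a single staleness check: drop the sliding window entirely and treat the token as stale a fixed interval before its hard expiry. `TokenInfo` and `prefetch_ttl` are illustrative names, not the PR's API.

```ruby
# Hypothetical fixed-window staleness check, as a contrast to the sliding
# window above: refresh whenever we are within prefetch_ttl seconds of expiry.
TokenInfo = Struct.new(:token, :expires_at, keyword_init: true)

def stale?(token_info, prefetch_ttl:, now: Time.now.to_f)
  token_info.nil? || now >= (token_info.expires_at - prefetch_ttl)
end
```

With a scheduled job doing the refreshing, this one comparison replaces the entire `sliding_expires_at` bookkeeping, which is the simplification being argued for.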
```ruby
raise Faraday::ServerError.new(
  RuntimeError.new(error_message),
  {
    status: error_code,
    body: { details: response_body.dig('error', 'details')&.join(', ') },
  },
)
```
Is this only going to be reached on 5xx errors?
This is to deal with a 200 response that actually errored, with no token available. The Faraday raise_error middleware does not handle this situation.
Since this isn't a 5xx, I don't think we should use Faraday::ServerError. We can use a custom error type instead and then if needed update the exceptions configured with faraday-retry.
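The suggested custom error type might look like the following sketch (class name, attributes, and constructor are assumptions, not the PR's code). It could then be added to faraday-retry's `exceptions:` option so these responses are still retried.

```ruby
module ArcgisApi
  # Hypothetical dedicated error for the "HTTP 200 but error in body" case,
  # avoiding the semantic mismatch of raising Faraday::ServerError for a
  # non-5xx response.
  class InvalidResponseError < StandardError
    attr_reader :status, :details

    def initialize(message, status:, details: nil)
      super(message)
      @status = status
      @details = details
    end
  end
end
```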
@@ -0,0 +1,28 @@
```ruby
module ArcgisApi
  # Faraday middleware to raise an exception when the response status is 200
  # but the body contains an error
  class ResponseValidation < Faraday::Middleware
```
Is there a way to keep the response logic closer to where the request is made?
@mitchellhenke when a new connection is created on demand, this middleware is registered with the connection for response processing.
right, sorry, I understand how it works, but if this is specific to one API, could we pull it into the class that's making those requests rather than separating it across files?
@mitchellhenke, we use faraday-retry for automatic retrying. If we handle this after the response is returned to the client, I think it's too late to trigger a retry with faraday-retry.
I think it's possible to have similar logic for retries when making the request, and I think it'd be better to move it there instead of having a separate class.
@mitchellhenke not exactly: for an invalid response with a 200 HTTP status code, once the response is returned to the caller it's out of the Faraday middleware chain, and we can't use the faraday-retry gem, which works as a Faraday middleware. We would have to write our own retry mechanism (bookkeeping, counting, making new requests, the whole nine yards) that's already implemented by faraday-retry.
@mitchellhenke I think that using this as a separate middleware class is acceptable given the way ESRI itself handles this in its arcgis-rest-js library. That said, I think it could have somewhat better documentation.
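As a documentation aid for the point above, the check the middleware performs can be shown as a standalone function (the real class subclasses `Faraday::Middleware`; this framework-free version only illustrates the body inspection, and the error message format is an assumption): ArcGIS can return HTTP 200 with an error object embedded in the body, and raising inside the middleware chain is what lets faraday-retry observe the failure.

```ruby
require 'json'

# Hypothetical sketch of the 200-with-error-body validation step.
def validate_arcgis_body!(status, raw_body)
  return unless status == 200
  body = JSON.parse(raw_body) rescue nil
  error = body && body['error']
  return unless error
  # Raising here (inside the middleware chain in the real code) is what makes
  # the failure visible to a retry middleware registered earlier in the stack.
  raise "ArcGIS error #{error['code']}: #{error['message']}"
end
```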
@@ -0,0 +1,35 @@
```ruby
class ArcgisTokenJob < ApplicationJob
```
Would it be more accurate to call this ArcgisRefreshTokenJob?
@svalexander, well, in effect it refreshes the token, but it does not check whether the token needs to be refreshed; it replaces it with a new one regardless.
```ruby
# do not support entry expiration.
# @member token the token string
# @member expires_at hard expiration timestamp in epoch seconds
# @member sliding_expires_at optional the token keeper to maintain for
```
What is sliding expiration?
Oh, I see there is an explanation below, but I don't fully understand it. Is there another way to explain this?
@svalexander there's some description here: https://brokul.dev/authentication-cookie-lifetime-and-sliding-expiration. It's not exactly our setup (that's .NET), but a similar concept: basically avoiding a lot of clients fetching a token at a time very close to its expiration.
The sliding expiration doesn't seem to fit the use case as well since we have a regularly scheduled job to refresh the token and don't have to worry about user actions to trigger a refresh?
As I mentioned elsewhere, I think sliding expiration is a bit more complicated than we require for this use case. Mostly we need to renew the token before it expires and before the IDP attempts to renew it. I'd be OK removing the frontend IDP renewal altogether if we can have a relatively high frequency and concurrency controls on the job.
```ruby
"#{IdentityConfig.store.arcgis_api_token_cache_key_prefix}:#{API_TOKEN_HOST}"
API_PREFETCH_TTL_SECONDS = IdentityConfig.store.arcgis_api_token_prefetch_ttl

RETRY_EXCEPTION = [404, 408, 409, 421, 429, 500, 502, 503, 504, 509]
```
This is a little complicated to dig up, but based on my prior research we'll be dealing with errors here that are similar to those mentioned in the ArcGIS REST JS documentation. You can view the underlying logic that ESRI uses to handle the errors in esri/arcgis-rest-js.
Mentioning this because I noticed a few additional errors here beyond what we're likely to see while using the ArcGIS API.
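The reviewer's point could be addressed by narrowing the list to statuses where a retry can plausibly help. The set below is purely illustrative: the right list depends on the ArcGIS error research referenced above, not on this sketch.

```ruby
# Hypothetical narrowed retry list: timeouts, rate limiting, and transient
# server errors. Statuses like 404 or 409 from the original list are unlikely
# to succeed on retry.
RETRYABLE_STATUSES = [408, 429, 500, 502, 503, 504].freeze

def retryable_status?(status)
  RETRYABLE_STATUSES.include?(status)
end
```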
sheldon-b left a comment

I didn't have time to finish reviewing, so I left some questions/suggestions for now.
```yaml
arcgis_get_token_retry_interval: 1
arcgis_get_token_retry_backoff_factor: 2
arcgis_token_sliding_expiration_enabled: true
arcgis_token_sync_request_enabled: false
```
What does this do? When would it be set to true vs. false? It's hard to tell from reading the code
@sheldon-b, it's to enable/disable ArcGIS token requests from the web tier. We are creating a new job to refresh the token, so we don't have to get it from the web tier.
I don't understand what you mean, can you explain it more?
@sheldon-b The flag determines whether the user-facing servers will attempt to set the auth token if the auth token is missing. Disabling this flag will reduce the likelihood of a flood of USPS auth requests during high traffic to in-person proofing.
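The flag's effect described above can be sketched as a guard (method name, arguments, and the injected fetcher are assumptions for illustration): with the flag off, a web request that finds no cached token gives up rather than calling the ArcGIS auth endpoint itself.

```ruby
# Hypothetical illustration of arcgis_token_sync_request_enabled: the fetcher
# (a synchronous ArcGIS auth call in the real code) is only invoked when the
# flag allows the web tier to fetch tokens itself.
def token_for_request(cached_token, sync_enabled:, fetcher:)
  return cached_token if cached_token
  sync_enabled ? fetcher.call : nil
end
```

This is what keeps a traffic spike with a missing token from becoming a flood of USPS auth requests: every request short-circuits to `nil` instead of fanning out auth calls.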
```ruby
expires_at = cache_value&.expires_at
sliding_expires_at = cache_value&.sliding_expires_at
# start fetch new token before actual expires time
if IdentityConfig.store.arcgis_token_sliding_expiration_enabled
```
When would we have this disabled vs. enabled? I don't see anything about this in the ticket or PR description so I don't entirely understand why it's added or how it's planned to be used
```ruby
current_sliding_expires_at = cache_value.sliding_expires_at
expiration_strategy.
  extend_sliding_expires_at(token_info: cache_value, prefetch_ttl: prefetch_ttl)
if current_sliding_expires_at != cache_value.sliding_expires_at
  save_token(
    cache_value,
    cache_value.expires_at,
  )
end

# now retrieve new token
update_value = retrieve_token
expires_at = update_value&.expires_at
if sliding_expiration_enabled
  update_value.sliding_expires_at = prefetch_ttl >= 0 ?
    expires_at - TIMES_TTL * prefetch_ttl : expires_at
end
save_token(
  update_value,
  update_value.expires_at,
)
return update_value
```
@dawei-nava Per our discussion, let's refactor this logic so that the sliding window approach is extremely clear and easy to read. This approach does look like it will reduce the likelihood of duplicate USPS auth calls and cache updates.
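One possible shape for the requested refactor (all names hypothetical): pull the sliding-window arithmetic out into a pure function that returns an explicit action, so the caller reads like a decision table instead of nested conditionals.

```ruby
# Hypothetical decision function for the sliding-window logic above: given the
# current time and the window boundary, return which of the three actions the
# caller should take.
def sliding_window_action(now:, sliding_expires_at:, prefetch_ttl:)
  return :use_cached if sliding_expires_at.nil? || now < sliding_expires_at
  if now > sliding_expires_at + prefetch_ttl
    :refresh_token   # window fully elapsed: fetch a brand-new token
  else
    :extend_window   # inside the grace period: push the window out by one TTL
  end
end
```

Being a pure function of its arguments, it can be unit-tested exhaustively across the three regions of the timeline without touching the cache or the ArcGIS API.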
```ruby
  save_token(cache_entry, expires_at)
  cache_entry
end

def remove_token
  Rails.cache.delete(cache_key)
end

private

def handle_expired_token(cache_value)
  cache_value = wrap_raw_token(cache_value) unless cache_value.is_a?(Hash)
  expires_at = cache_value.fetch(:expires_at, nil)
  sliding_expires_at = cache_value&.fetch(:sliding_expires_at, nil)
  # start fetch new token before actual expires time
  if IdentityConfig.store.arcgis_token_sliding_expiration_enabled
    now = Time.zone.now.to_f
    if sliding_expires_at && now >= sliding_expires_at
      # if passed sliding expiration time (exp - 3*prefetch_ttl)
      # but not 2*prefetch_ttl from hard expiration time
      if now > (sliding_expires_at + prefetch_ttl)
        cache_value = retrieve_token
      else
        # extend sliding expiration time
        cache_value[:sliding_expires_at] = sliding_expires_at + prefetch_ttl
        save_token(cache_value, expires_at)
```
@dawei-nava Per our discussion, this code still needs to be updated to permit both the previous and new servers to correctly read the token during the rollout. The rollout will not include planned downtime.
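One way to satisfy the no-downtime rollout constraint is sketched below (helper name and default TTL are assumptions): accept both the old cache format (a raw token string) and the new one (a hash with expiry metadata), so old and new servers can read the same cache entry during the deploy.

```ruby
# Hypothetical normalizer for mixed-format cache reads during rollout.
def normalize_cache_value(cache_value, now: Time.now.to_f, default_ttl: 3600)
  case cache_value
  when Hash
    cache_value
  when String
    # Old format carried no metadata; assume a conservative expiry so the
    # token gets refreshed promptly.
    { token: cache_value, expires_at: now + default_ttl }
  end
end
```

The inverse direction matters too: while old servers are still running, writes would need to stay readable by code that expects the raw string, so the removal of the old format would come in a follow-up deploy.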
…/LG-9449-arcgis-token-refresh
Closed in favor of #8662.
🎫 LG-9449: Prevent Location Search slowdown and errors from ArcGIS auth
🛠 Summary of changes
Create a scheduled job to refresh the auth token out of band. The application will assume the token is always available and valid by default.
The web tier depends on the job functioning correctly by default; alternatively, it can be configured to fetch the token synchronously as needed.
(Details need to be updated when the approach is finalized.)
📜 Testing Plan
Provide a checklist of steps to confirm the changes. (Needs to be updated.)