Skip to content

LG-9449: Prevent Location Search slowdown and errors from ArcGIS auth#8579

Closed
dawei-nava wants to merge 34 commits intomainfrom
dwang/LG-9449-arcgis-token-refresh
Closed

LG-9449: Prevent Location Search slowdown and errors from ArcGIS auth#8579
dawei-nava wants to merge 34 commits intomainfrom
dwang/LG-9449-arcgis-token-refresh

Conversation

@dawei-nava
Copy link
Contributor

@dawei-nava dawei-nava commented Jun 12, 2023

🎫 LG-9449: Prevent Location Search slowdown and errors from ArcGIS auth

🛠 Summary of changes

Create a scheduled job to refresh the auth token out of band. Application will assume the token is always available and valid by default.

On the web tier side, it depends on the job functioning correctly by default. On the other hand can be configured to fetch token as needed synchronously.

(Details need to be updated when it's finalized)

📜 Testing Plan

Provide a checklist of steps to confirm the changes.(Need to be updated)

  • Check job scheduled
  • Validate token refreshed as scheduled
  • Validate PO search returns successfully

@dawei-nava dawei-nava requested review from a team and tomas-nava June 12, 2023 16:39
@@ -0,0 +1,22 @@
module ArcgisApi
# Stateless factory object that create a new Farady::COnnection object
class ConnectionFactory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this class? If there a reason we couldn't reuse geocoder.faraday?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To easily inject a mock one that stub request on connection for testing purpose, also customize faraday middleware behaviours.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW there is test support that allows you to simulate HTTP interactions, making mocks of HTTP libraries like Faraday unnecessary (see the ArcgisApiHelper). That said I like the separation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@allthesignals , good point, this was to mock on faraday level, webmock covers much more. It was adapted to the mock geocoder used in feature test. Let me c whether I can handle it better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ambivalent to the existence of this class. There are two effective usages of this (excluding suggest which is not currently used).

Comment on lines +105 to +127
def handle_expired_token(cache_value)
cache_value = wrap_raw_token(cache_value) unless cache_value.is_a?(Hash)
expires_at = cache_value.fetch(:expires_at, nil)
sliding_expires_at = cache_value&.fetch(:sliding_expires_at, nil)
# start fetch new token before actual expires time
if IdentityConfig.store.arcgis_token_sliding_expiration_enabled
now = Time.zone.now.to_f
if sliding_expires_at && now >= sliding_expires_at
# if passed sliding expiration time(exp-3*prefetch_ttl)
# but not 2*prefetch_ttl from hard expiration time
if now > (sliding_expires_at + prefetch_ttl)
cache_value = retrieve_token
else
# extend sliding expiration time
cache_value[:sliding_expires_at] = sliding_expires_at + prefetch_ttl
save_token(cache_value, expires_at)
end
end
elsif expires_at && expires_at <= now
cache_value = retrieve_token
end
cache_value
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The particular complexity and conditionals used in this method suggest that we should apply a strategy pattern. Also, it's likely that we will want to remove the previous caching strategy after the new one has proven effective - the way we implement this change should support safe removal of the previous strategy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NavaTim , this is part of dealing with expiring token on web application side, and disabled by default since we assume the job will refresh the token as needed.
Act as an escape route hopefully never needed?

At the same time let me c what refactor/abstraction can be done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NavaTim played with it a little bit, got kind of circular dependency with strategy pattern, trying subclassing now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Act as an escape route hopefully never needed?

I think it's acceptable (though undesirable) to use this as backup to the job, but that does increase the risk of excessive authentication calls. It may be better to turn it off entirely, but for now we will need both.

@NavaTim played with it a little bit, got kind of circular dependency with strategy pattern, trying subclassing now.

I think the sliding window approach used here may be more complicated than necessary. We can use a specific amount of time before expiration instead, and maybe that will make this logic simpler.

Comment on lines +19 to +25
raise Faraday::ServerError.new(
RuntimeError.new(error_message),
{
status: error_code,
body: { details: response_body.dig('error', 'details')&.join(', ') },
},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only going to be reached on 5xx errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to deal with 200 response but actually erred without token available. The faraday raise_error middleware does not handle this situation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this isn't a 5xx, I don't think we should use Faraday::ServerError. We can use a custom error type instead and then if needed update the exceptions configured with faraday-retry.

@@ -0,0 +1,28 @@
module ArcgisApi
# Faraday middleware to raise exception when response status is 200 but with error in body
class ResponseValidation < Faraday::Middleware
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to keep the response logic closer to where the request is made?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mitchellhenke when a new connection is created on demand, this middleware is registered with the connection for response processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, sorry, I understand how it works, but if this is specific to one API, could we pull it into the class that's making those requests rather than separating it across files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mitchellhenke , we use farady-retry for automatic retrying, if we handle this after response returned to client, I think it's too late to trigger retry with farady-retry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible to have similar logic for retries when making the request, and I think it'd be better to move it there instead of having a separate class.

Copy link
Contributor Author

@dawei-nava dawei-nava Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mitchellhenke not exactly, for invalid response with 200 http status code, once the response is returned to caller, it's out of faraday middleware chain, and wont be able to utilize faraday-retry gem which works as a faraday middleware and we have to write our own retry mechanisms(bookkeeping, counting, making new requests the whole nine yard that's implemented by farady-retry).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mitchellhenke I think that using this as a separate middleware class is acceptable given the way ESRI itself handles this in its arcgis-rest-js library. That said, I think it could have somewhat better documentation.

@tomas-nava tomas-nava changed the title Dwang/lg 9449 arcgis token refresh LG-9449: Prevent Location Search slowdown and errors from ArcGIS auth Jun 13, 2023
@@ -0,0 +1,35 @@
class ArcgisTokenJob < ApplicationJob
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be more accurate to call this ArcgisRefreshTokenJob?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svalexander, will in effect it's refresh the token, but it does not check whether the token need to be refreshed and replace with a new one anyway.

# do not support entry expiration.
# @member token the token string
# @member expires_at hard expiration timestamp in epoch seconds
# @member sliding_expires_at optional the token keeper to maintain for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is sliding expiration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see there is an explanation below but I don't fully understand it, is there another way to explain this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svalexander some description here https://brokul.dev/authentication-cookie-lifetime-and-sliding-expiration, not exactly with dotnet, but similar concept, basically avoiding a lot of clients getting token at the time very close to it's expiration time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sliding expiration doesn't seem to fit the use case as well since we have a regularly scheduled job to refresh the token and don't have to worry about user actions to trigger a refresh?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned elsewhere, I think sliding expiration is a bit more complicated we require for this use case. Mostly we need to renew the token before it expires & before the IDP attempts to renew it. I'd be ok removing the frontend IDP renewal altogether if we can have a relatively high frequency & concurrency controls on the job.

"#{IdentityConfig.store.arcgis_api_token_cache_key_prefix}:#{API_TOKEN_HOST}"
API_PREFETCH_TTL_SECONDS = IdentityConfig.store.arcgis_api_token_prefetch_ttl

RETRY_EXCEPTION = [404, 408, 409, 421, 429, 500, 502, 503, 504, 509]
Copy link
Contributor

@NavaTim NavaTim Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little complicated to dig up, but based on my prior research we'll be dealing with errors here that are similar to those mentioned in the ArcGIS REST JS documentation. You can view the underlying logic that ESRI uses to handle the errors in esri/arcgis-rest-js.

Mentioning this because I noticed a few additional errors here beyond what we're likely to see while using the ArcGIS API.

Copy link
Contributor

@sheldon-b sheldon-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't have time to finish reviewing so left some questions/suggestions for now

arcgis_get_token_retry_interval: 1
arcgis_get_token_retry_backoff_factor: 2
arcgis_token_sliding_expiration_enabled: true
arcgis_token_sync_request_enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do? When would it be set to true vs. false? It's hard to tell from reading the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sheldon-b , it's to enable/disable arcgis token request from the web tier. We are creating new job to refresh the token, so we dont have to get it from web tier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you mean, can you explain it more?

Copy link
Contributor

@NavaTim NavaTim Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sheldon-b The flag determines whether the user-facing servers will attempt to set the auth token if the auth token is missing. Disabling this flag will reduce the likelihood of a flood of USPS auth requests during high traffic to in-person proofing.

expires_at = cache_value&.expires_at
sliding_expires_at = cache_value&.sliding_expires_at
# start fetch new token before actual expires time
if IdentityConfig.store.arcgis_token_sliding_expiration_enabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would we have this disabled vs. enabled? I don't see anything about this in the ticket or PR description so I don't entirely understand why it's added or how it's planned to be used

@@ -0,0 +1,22 @@
module ArcgisApi
# Stateless factory object that create a new Farady::COnnection object
class ConnectionFactory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ambivalent to the existence of this class. There are two effective usages of this (excluding suggest which is not currently used).

@@ -0,0 +1,28 @@
module ArcgisApi
# Faraday middleware to raise exception when response status is 200 but with error in body
class ResponseValidation < Faraday::Middleware
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mitchellhenke I think that using this as a separate middleware class is acceptable given the way ESRI itself handles this in its arcgis-rest-js library. That said, I think it could have somewhat better documentation.

Comment on lines +191 to +212
current_sliding_expires_at = cache_value.sliding_expires_at
expiration_strategy.
extend_sliding_expires_at(token_info: cache_value, prefetch_ttl: prefetch_ttl)
if current_sliding_expires_at != cache_value.sliding_expires_at
save_token(
cache_value,
cache_value.expires_at,
)
end

# now retrieve new token
update_value = retrieve_token
expires_at = update_value&.expires_at
if sliding_expiration_enabled
update_value.sliding_expires_at = prefetch_ttl >= 0 ?
expires_at - TIMES_TTL * prefetch_ttl : expires_at
end
save_token(
update_value,
update_value.expires_at,
)
return update_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dawei-nava Per our discussion, let's refactor this logic so that the the sliding window approach is extremely clear and easy to read. This approach does look like it will reduce the likelihood of duplicate USPS auth calls and cache updates.

arcgis_get_token_retry_interval: 1
arcgis_get_token_retry_backoff_factor: 2
arcgis_token_sliding_expiration_enabled: true
arcgis_token_sync_request_enabled: false
Copy link
Contributor

@NavaTim NavaTim Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sheldon-b The flag determines whether the user-facing servers will attempt to set the auth token if the auth token is missing. Disabling this flag will reduce the likelihood of a flood of USPS auth requests during high traffic to in-person proofing.

Comment on lines +95 to +120
save_token(cache_entry, expires_at)
cache_entry
end

def remove_token
Rails.cache.delete(cache_key)
end

private

def handle_expired_token(cache_value)
cache_value = wrap_raw_token(cache_value) unless cache_value.is_a?(Hash)
expires_at = cache_value.fetch(:expires_at, nil)
sliding_expires_at = cache_value&.fetch(:sliding_expires_at, nil)
# start fetch new token before actual expires time
if IdentityConfig.store.arcgis_token_sliding_expiration_enabled
now = Time.zone.now.to_f
if sliding_expires_at && now >= sliding_expires_at
# if passed sliding expiration time(exp-3*prefetch_ttl)
# but not 2*prefetch_ttl from hard expiration time
if now > (sliding_expires_at + prefetch_ttl)
cache_value = retrieve_token
else
# extend sliding expiration time
cache_value[:sliding_expires_at] = sliding_expires_at + prefetch_ttl
save_token(cache_value, expires_at)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dawei-nava Per our discussion, this code still needs to be updated to permit both the previous and new servers to correctly read the token during the rollout. The rollout will not include planned downtime.

@NavaTim
Copy link
Contributor

NavaTim commented Jun 26, 2023

Closed in favor of #8662.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants