
Replace caffeine cache with reference to full map #2214

Merged
merged 45 commits into master from break_up_caffeine_caching on Aug 6, 2021

Conversation

@rosalind210 (Contributor) commented Jun 24, 2021:

Originally we had the CaffeineCache set up in the RequestResource, keyed by user and call parameters, but the return values couldn't be cached across users because of differing authorization and settings, and we still saw 504s when SingularityClient's getRequests got hit hard again. We then split the cache to cover the heaviest ZK calls in the getRequest method: RequestManager's fetchRequests and DeployManager's fetchDeployStatesByRequestIds, which had no caching around client calls. That in turn degraded getRequest performance because of cache expiry.

Now we are replacing the Caffeine cache altogether with an AtomicReference to the full map of requests and deploys, updated every five seconds by a background thread. We originally refreshed every second, but that overloaded the CuratorFramework with calls for request and deploy data.

cc: @jschlather
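The reference-swap approach described above can be sketched roughly as follows. This is a simplified illustration, not the actual Singularity code; the class and method names (ApiCacheSketch, refresh) are hypothetical, and the Supplier stands in for the real ZK fetch:

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch: readers always see a complete, consistent map,
// and only the background reloader talks to ZooKeeper.
class ApiCacheSketch<K, V> {
  private final AtomicReference<Map<K, V>> cache = new AtomicReference<>(
    Collections.emptyMap()
  );
  private final Supplier<Map<K, V>> loader;
  private final ScheduledExecutorService executor;

  ApiCacheSketch(Supplier<Map<K, V>> loader, ScheduledExecutorService executor) {
    this.loader = loader;
    this.executor = executor;
  }

  // Swap in a freshly fetched full map; readers are never blocked.
  void refresh() {
    cache.set(loader.get());
  }

  // Reload every five seconds, per the PR description.
  void startReloader() {
    executor.scheduleAtFixedRate(this::refresh, 5, 5, TimeUnit.SECONDS);
  }

  V get(K key) {
    return cache.get().get(key);
  }

  Map<K, V> getAll() {
    return cache.get();
  }
}
```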

return values;
}

public void put(K key, V value) {


I think we want to use a loader method. If we do it this way, then we won't get any debouncing from the cache.

Contributor Author:

Both caches are now LoadingCaches.
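For context on the debouncing point above: with a loader-based cache, concurrent callers for a missing key share a single load instead of each hitting ZK. Caffeine's LoadingCache provides this out of the box; the stdlib sketch below (a hypothetical DebouncingLoader, with no expiry) only illustrates the shared-load behavior:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration of loader-style debouncing: the expensive
// load runs once per key, no matter how many callers ask for it.
class DebouncingLoader<K, V> {
  private final ConcurrentHashMap<K, CompletableFuture<V>> loads = new ConcurrentHashMap<>();
  private final Function<K, V> loader;

  DebouncingLoader(Function<K, V> loader) {
    this.loader = loader;
  }

  V get(K key) {
    // computeIfAbsent guarantees a single future per key, so
    // concurrent callers join the same in-flight load.
    return loads
      .computeIfAbsent(key, k -> CompletableFuture.supplyAsync(() -> loader.apply(k)))
      .join();
  }
}
```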

LOG.trace("Grabbed mapped values for {} from cache", keys);
}

return values.isEmpty() ? null : values;
Contributor:

I think it's more idiomatic to return an empty map instead of null
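The suggested change would look something like the sketch below (getValues is a hypothetical stand-in for the method in the diff above):

```java
import java.util.Collections;
import java.util.Map;

class EmptyMapExample {
  // Returning an empty map instead of null means callers can iterate
  // or call size() without a null check.
  static <K, V> Map<K, V> getValues(Map<K, V> values) {
    return values.isEmpty() ? Collections.emptyMap() : values;
  }
}
```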

Collections.singletonList(getRequestDeployStatePath(requestId)),
requestDeployStateTranscoder
)
.get(0)
Contributor:

What happens if this throws an IndexOutOfBoundsException?

Contributor Author:

I added a check on the returned list before grabbing the first element.
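A sketch of that guard, assuming the underlying fetch returns a possibly-empty list (firstOf is a hypothetical helper, not the actual code):

```java
import java.util.List;
import java.util.Optional;

class ListGuardExample {
  // Checking the list before calling get(0) avoids the
  // IndexOutOfBoundsException the reviewer asked about.
  static <T> Optional<T> firstOf(List<T> results) {
    return results.isEmpty() ? Optional.empty() : Optional.of(results.get(0));
  }
}
```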

}

public void setCaffeineCacheTtl(int caffeineCacheTtl) {
  this.caffeineCacheTtl = caffeineCacheTtl;
}

public void setDeployCaffeineCacheTtl(int deployCaffeineCacheTtl) {
@pschoenfelder (Contributor) commented Jun 25, 2021:

Nit: can we remove caffeine from the config terminology here? Users don't need to be aware of the cache choice/implementation, and we could change caches in the future without any awkwardness.

Contributor Author:

I've also updated useCaffeineCache to useZKFastCache.

}

public int getRequestCaffeineCacheTtl() {
return requestCaffeineCacheTtl;
Contributor:

One last nit: can we append the time unit to this name so users don't have to search the code for it?

@jschlather:
LGTM

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ManagerCache<K, V> {
Member:

Check out the ZkCache class. It's basically the same thing as this and might be better suited (at least for the deployManager part of things)

Contributor Author:

The ZkCache just uses a Guava cache, and we were hoping to get the debouncing benefit of the LoadingCache created by CaffeineCache, since we were also getting timeouts on the DeployManager's requests. Do you still think I should update the DeployManager's usage?

Member:

Ah ok, thought the Zk one was more similar. Can keep the separate one then

Comment on lines 643 to 648
if (requestsCache.isEnabled()) {
List<SingularityRequestWithState> requests = requestsCache.get("all");
if (requests != null) {
return requests;
}
}
Member:

I wonder if we should just not use the web cache here if this is the case. Seems overkill to have the leaderCache -> web cache -> zk cache all here. Lots of duplicate memory usage to store it all 3 times

Contributor Author:

As discussed offline, to make a revert easy, the removal of the web cache will be in a separate PR.

configuration.useApiCacheInDeployManager(),
configuration.getDeployCacheTtl(),
this::fetchAllDeployStates,
executorServiceFactory.get("deploy-api-cache-reloader")
Member:

I think there is a version of this that returns a single threaded executor, which is all we should need for this case

@rosalind210 (Contributor Author) commented Jul 28, 2021:

We don't have an option for a single-threaded ScheduledExecutorService in SingularityManagedScheduledExecutorServiceFactory or SingularityManagedThreadPoolFactory, but Executors has newSingleThreadScheduledExecutor, which I can add.
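Executors.newSingleThreadScheduledExecutor is part of the JDK. A minimal usage sketch (the wrapper class and method name are illustrative only):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

class SingleThreadReloaderExample {
  // One thread is plenty for a periodic cache reload task.
  static ScheduledExecutorService newReloaderExecutor() {
    return Executors.newSingleThreadScheduledExecutor();
  }
}
```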

@@ -90,6 +93,7 @@
);

private final Map<Class<? extends SingularityExpiringRequestActionParent<? extends SingularityExpiringRequestParent>>, Transcoder<? extends SingularityExpiringRequestActionParent<? extends SingularityExpiringRequestParent>>> expiringTranscoderMap;
private final ApiCache<String, SingularityRequestWithState> requestsCache;
Member:

Should the getRequests(List<String> requestIds) and the singular getRequest(String requestId) also use the cache? Doesn't look like they do at the moment.

Contributor Author:

I didn't update elsewhere yet because I wasn't sure if that's what we wanted, since we were most concerned with the endpoint that gets all requests.

Member:

Since it all pulls from the same place, and we are constantly updating everything, I think it'd be worth updating. I believe the individual request endpoint was also pretty high up in usage. It can always be a follow-up PR if we want to check how effective this is first.

@rosalind210 (Contributor Author) commented Jul 30, 2021:

I added the cache to getRequests(List<String> requestIds) and the singular getRequest(String requestId), and the singular request now has a skip-cache flag for Orion usage.

this.executor = executor;
}

public void startReloader() {
Member:

Should this also synchronously perform the first fetch, I wonder? That would avoid the case where getAll() was incorrectly empty or get(K key) was null.
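One way to implement that suggestion, sketched with hypothetical names (the Supplier stands in for the real ZK fetch):

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class EagerReloaderSketch<K, V> {
  private final AtomicReference<Map<K, V>> cache = new AtomicReference<>(
    Collections.emptyMap()
  );
  private final Supplier<Map<K, V>> loader;
  private final ScheduledExecutorService executor =
    Executors.newSingleThreadScheduledExecutor();

  EagerReloaderSketch(Supplier<Map<K, V>> loader) {
    this.loader = loader;
  }

  void startReloader(long periodSeconds) {
    // Synchronous first fetch: getAll() is populated as soon as
    // startReloader returns, instead of waiting for the first tick.
    cache.set(loader.get());
    executor.scheduleAtFixedRate(
      () -> cache.set(loader.get()),
      periodSeconds,
      periodSeconds,
      TimeUnit.SECONDS
    );
  }

  Map<K, V> getAll() {
    return cache.get();
  }

  void stop() {
    executor.shutdownNow();
  }
}
```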

if (requestsCache.isEnabled() && !skipApiCache) {
SingularityRequestWithState request = requestsCache.get(requestId);
if (request != null) {
return Optional.of(request);
Contributor Author:

I didn't make this Optional.ofNullable(...) in case the null was a false null, i.e. the request came in between five-second reloads, but this means we will go to ZK for true nulls (where the request was removed).

@@ -544,6 +567,15 @@ public SingularityDeleteResult markDeleted(
}
}

if (requestsCache.isEnabled()) {
List<SingularityRequestWithState> requests = new ArrayList<>(
(requestsCache.getAll()).values()
Member:

nit, extra parens here?

@ssalinas (Member) left a comment:

One overall comment for ZK efficiency: do we want the reloader to short circuit if the current instance is the leader (i.e. it will have the leader cache instead)? I don't think there are any cases where we would hit this over the leader cache, so we would likely want to limit the calls flooding the leader's CuratorFramework.
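That short circuit could be as simple as a guard at the top of the reload task. A hypothetical sketch, with isLeader standing in for however leadership is actually checked (e.g. a Curator LeaderLatch):

```java
import java.util.function.BooleanSupplier;

class LeaderShortCircuitExample {
  // The leader already serves requests from its in-memory leader
  // cache, so its reloader can skip the periodic ZK fetch.
  static boolean shouldReload(BooleanSupplier isLeader) {
    return !isLeader.getAsBoolean();
  }
}
```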

return getSingleThreaded(name, false);
}

public synchronized ScheduledExecutorService getSingleThreaded(
Member:

looks like this method is unused?

Contributor Author:

I was following the pattern in SingularityManagedThreadPoolFactory for getSingleThreaded, but I'll update it to not overload that method.

Member:

🤦 Just realized the one above calls it. Never mind, I didn't even see that.

Comment on lines +413 to +416
@Deprecated
public void setSkipApiCache(boolean skipApiCache) {
this.skipApiCache = skipApiCache;
}
Member:

For structure/usability of the client: is there a reason you have this at the class level instead of as an arg on the relevant method(s)?

Contributor Author:

I looked through the request call usages in Orion and talked to Suruu, and we decided it would be easier on Deploy's side to have a class-level argument rather than updating all usages. It also makes cleanup easier for when we've solved the underlying CuratorFramework issue.

if (allValues.isEmpty()) {
LOG.debug("ApiCache getAll returned empty");
} else {
LOG.debug("getAll returned {} values", allValues.size());
Member:

For this and the debug statement above, maybe these are more like TRACE-level lines? It would get pretty noisy given that we could call these multiple times a second.

Contributor Author:

I made all of the get calls' logs trace level.

@ssalinas (Member) commented Aug 6, 2021:

🚢

@rosalind210 rosalind210 merged commit 916716d into master Aug 6, 2021
@rosalind210 rosalind210 deleted the break_up_caffeine_caching branch August 6, 2021 14:19
@ssalinas ssalinas added this to the 1.5.0 milestone May 4, 2022