Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CASSSIDECAR-173: Endpoint to retrieve C* gossip status #159

Open
wants to merge 3 commits into
base: trunk
Choose a base branch
from

Conversation

skoppu22
Copy link

@skoppu22 skoppu22 commented Dec 10, 2024

https://issues.apache.org/jira/browse/CASSSIDECAR-173
Developing a new Sidecar endpoint to retrieve gossip running status of Cassandra, so the end user can use this endpoint without having to invoke C* JMX call. This endpoint will have authorization checks enabled once authorization feature is added to the Sidecar.

Circle CI run: https://app.circleci.com/pipelines/github/skoppu22/cassandra-sidecar?branch=gossips2

return gossipRunning;
}

private final boolean gossipRunning;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define this at the beginning of the class

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of Sidecar request/response classes following this style. I am also used to the style of defining at the beginning of the class, but just trying to be inline with existing classes. May be we can start changing the style now, as easy to find all variables right at the beginning of the class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of Sidecar request/response classes following this style

This is not true, I'm not sure which class you are looking at, but all classes under client-common/src/main/java/org/apache/cassandra/sidecar/common/response define fields at the beginning of the class. And we should follow the Cassandra code structure https://cassandra.apache.org/_/development/code_style.html under Code Structure section > Class Layout.

Copy link
Author

@skoppu22 skoppu22 Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we should fix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be enforced with code style? We want to layout the class in this order.

// variables 
public static 
protected static
private static
public 
protected
private 
// methods 
// same ordering goes again
// visible for testing methods
// then, inner class

@@ -380,6 +382,9 @@ public Router vertxRouter(Vertx vertx,
router.get(ApiEndpointsV1.GOSSIP_INFO_ROUTE)
.handler(gossipInfoHandler);

router.get(ApiEndpointsV1.GOSSIP_STATUS_ROUTE)
Copy link

@bbotella bbotella Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is really unfortunate having to add an entire new API when we already have the GOSSIP_INFO_ROUTE endpoint? I even see that the GossipInfoResponse already has a deprecated STATUS field.

I understand that we are trying to make this call more efficient by only checking the gossip status. I wonder, could we add a filtering parameter that helps us only executing the part of the handler that we want?

I'm thinking of a filtering pattern like:
/api/v1/cassandra/gossip?filter=['status'] or something along those lines?

That way, and finding out why/where it was deprecated, maybe we can hijack it to use this storageOperations::isGossipRunning can save us from having to add this new API

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?filter=<> doesn't sound right in this case, as filter is usually used to chose some/partial fields of the output which was returned without using filter. In this case /gossip endpoint doesn't return 'status', so we can't filter status from it. Let me see what are alternatives and will get back to you.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed offline, we cannot infer gossip running status from STATUS (or STATUS_WITH_PORT) fields of gossip info. Hence we cannot use gossip/filter=status for gossip running status, as that filter can be mistaken for STATUS field of gossip info. Will continue with gossip/status endpoint (Other option could be gossip/runningStatus, but sounds too verbose)

Copy link
Contributor

@frankgh frankgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to the PR

return gossipRunning;
}

private final boolean gossipRunning;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of Sidecar request/response classes following this style

This is not true, I'm not sure which class you are looking at, but all classes under client-common/src/main/java/org/apache/cassandra/sidecar/common/response define fields at the beginning of the class. And we should follow the Cassandra code structure https://cassandra.apache.org/_/development/code_style.html under Code Structure section > Class Layout.

@JsonProperty("gossipRunning")
public boolean gossipRunning()
{
return gossipRunning;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returning a boolean becomes inflexible if you need to support a use case where you have more than one value. I suggest we change this to a payload that looks like

{
   "status": "RUNNING"
}

and

{
   "status": "NOT_RUNNING"
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Client need to rely on deterministic values to parse the response and take actions accordingly. String "RUNNING" or "NOT_RUNNING" can break the client even with any minor changes or extra spaces etc in the output. Instead, if it is possible to return an enum as rest API response, we can use an enum. Otherwise boolean value true/false would be deterministic for the caller to parse the output

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, both enum or boolean flags are technically strings that need to be parsed/interpreted by the client as well. I do like the idea @frankgh is proposing of having the enum with two possible states, that allows us to easily expand the amount of information we can provide.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why boolean is treated as a string? Client can prepare GossipStatusResponse from the json received in the response and can get boolean using GossipStatusResponse.gossipRunning()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that a response payload is a bunch of bytes that need to be parsed anyway, even if it is a json payload, it needs to be parsed. The client is in charge of parsing it and interpreting fields as booleans, strings, or anything else.

That said, I do agree with @frankgh in which the flexibility added by returning that enum is preferable in this case. On the client side, it is client's responsibility to parse the response correctly. In our current implementation, we control de java client for instance. We are also adding OpenAPI documentation that will help clients parse the given responses as part of this PR.

Long story short, I don't think returning a "RUNNING" response is making things harder for the client than returning a true, and it gives us flexibility to grow the API in the future.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to String

*/
public CompletableFuture<GossipStatusResponse> gossipStatus()
{
return executor.executeRequestAsync(requestBuilder().gossipStatusRequest().build());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this request should be instance-specific

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing this

/**
* @return returns gossip status
*/
GossipStatusResponse isGossipRunning();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better aligned with the action we are trying to perform

Suggested change
GossipStatusResponse isGossipRunning();
GossipStatusResponse gossipStatus();

Comment on lines 336 to 327
protected StorageOperations getStorageOperations(String host)
{
CassandraAdapterDelegate delegate = this.metadataFetcher.delegate(host);
StorageOperations storageOperations = delegate == null ? null : delegate.storageOperations();
if (storageOperations == null)
{
throw cassandraServiceUnavailable();
}

return storageOperations;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have a method that does something similar in this class, we should just generalize that method. I suggest renaming ifMetricsOpsAvailable , and the code would look like this instead

    protected <P> void ifAvailableFromDelegate(RoutingContext context,
                                               String host,
                                               Function<CassandraAdapterDelegate, P> mapper,
                                               Consumer<P> ifAvailable)
    {
        CassandraAdapterDelegate delegate = metadataFetcher.delegate(host);
        if (delegate == null)
        {
            context.fail(cassandraServiceUnavailable());
            return;
        }
        P applied = mapper.apply(delegate);
        if (applied == null)
        {
            context.fail(cassandraServiceUnavailable());
            return;
        }
        ifAvailable.accept(applied);
    }

The existing callsite would change to this:

ifAvailableFromDelegate(context, host, CassandraAdapterDelegate::metricsOperations, operations -> {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is no longer needed as you moved null checks into getters

Void request)
{
StorageOperations storageOperations = getStorageOperations(host);
logger.debug("Retrieving gossip status, remoteAddress={}, instance={}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this log entry is redundant, we already log

logger.debug("{} received request={}, remoteAddress={}, instance={}",
                         this.getClass().getSimpleName(), requestParams, remoteAddress, host);

for all requests in handlers of the AbstractHandler type.

private static final String testRoute = "/api/v1/cassandra/gossip/status";

@CassandraIntegrationTest()
void testGossipDisabled(CassandraTestContext context, VertxTestContext testContext)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we combine both test cases into a single test case? these are expensive to spin up, so whenever possible we should reuse the in-jvm dtest apparatus available .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

*/
class JmxCommonTest
{
static final Logger LOGGER = LoggerFactory.getLogger(JmxCommonTest.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we already have an in-jvm dtest this becomes less useful IMO. There is no need to do mocking if we have the actual server running

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We do more test scenarios using mock, especially related to failure cases as mock makes it easy to throw exceptions or return invalid values
  • Can we get code coverage measured with in-jvm dtests? For example, someone using open source Sidecar and hooking to their build systems, can they get code coverage numbers with in-jvm dtests?

Copy link
Contributor

@frankgh frankgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, I have a few additional comments.

* Constructs a response with gossip status
* @param gossipRunningStatus running status of gossip
*/
public GossipStatusResponse(@JsonProperty("gossipRunningStatus") String gossipRunningStatus)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think status should be sufficient. We don't have a precedent for qualifying the field name like this.

Copy link
Author

@skoppu22 skoppu22 Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just calling it 'status' can be mistaken with STATUS field of gossip info, as we discussed before regarding this field. Gossip running status just tells is gossip thread running in C*. Where as STATUS field has different meaning. So we need to be clear that this is gossip thread running status

}

@JsonProperty("gossipRunningStatus")
public String gossipRunningStatus()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a utility method here isRunning that returns "RUNNING".equals(gossipRunningStatus). Similar to what you had in an earlier iteration of this PR.

@@ -102,4 +103,9 @@ default void outOfRangeDataCleanup(@NotNull String keyspace, @NotNull String tab
{
outOfRangeDataCleanup(keyspace, table, 1);
}

/**
* @return returns gossip status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @return returns gossip status
* @return the gossip status



/**
* Handler for to retrieve gossip status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Handler for to retrieve gossip status
* Handles gossip status retrieval

}

@CassandraIntegrationTest()
void testGossipStatus(CassandraTestContext context, VertxTestContext testContext)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

private static final String testRoute = "/api/v1/cassandra/gossip/status";

@Test
void testGossipRunning(VertxTestContext context)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is redundant, not sure what we gain here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gain depends on answer to https://github.com/apache/cassandra-sidecar/pull/159/files#r1886797448

Do we get code coverage with in-jvm tests? If yes, we can remove this test

}

@Test
void testGossipNotRunning(VertxTestContext context)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also redundant not sure what exactly we are testing here.

}

@Test
void testWithInstanceId(VertxTestContext context)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, this test doesn't make a lot of sense to me. My problem with the test is that in the setup JmxCommonTest you share the delegate and storageOperations among 3 instances. This setup is problematic because in the real execution of sidecar each instance will have its own delegate and all the operations associated with a single delegate. I would drop this test, we already cover accessing different instances with the query parameter, so I would consider testing this outside of the scope of this test.

}));
}

private void verifyValidResponse(VertxTestContext context, HttpResponse<Buffer> response, String expectedValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this verifyAndComplete

@Singleton
public InstancesMetadata instanceConfig()
{
int instanceId1 = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably model this correctly if we are meaning to use this in other tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove multi instance stuff from this. But I see almost all tests in this directory are having module extending AbstractModule and using mocks, I believe we can move common code to a common test framework file and reuse in most of tests

Copy link
Contributor

@yifan-c yifan-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution.

First of all, please add an entry to CHANGES.txt.

Second, the patch is to add an API to check whether gossip is enabled or disabled in the companion Cassandra node. There are existing health APIs that check the status of corresponding components. See the following code block for example.

GET /api/v1/cassandra/native/__health
{"status" : "OK | NOT_OK"}

GET /api/v1/cassandra/jmx/__health
{"status" : "OK | NOT_OK"}

I think we should be consistent with the existing pattern. In other words, the API should be this.

GET /api/v1/cassandra/gossip/__health
{"status" : "OK | NOT_OK"}

public GossipStatusResponse gossipStatus()
{
return new GossipStatusResponse(jmxClient.proxy(StorageJmxOperations.class, STORAGE_SERVICE_OBJ_NAME)
.isGossipRunning() ? "RUNNING" : "NOT_RUNNING");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving "RUNNING" and "NOT_RUNNING" to an enum?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually agree with @yifan-c suggestion. So, what about having the enum for the "OK" and "NOT_OK"?

@frankgh
Copy link
Contributor

frankgh commented Jan 9, 2025

I think we should be consistent with the existing pattern. In other words, the API should be this.

I agree with Yifan, we should try to stick closer to the existing patterns. +1 on this:

GET /api/v1/cassandra/gossip/__health
{"status" : "OK | NOT_OK"}

@skoppu22
Copy link
Author

Thanks for the contribution.

First of all, please add an entry to CHANGES.txt.

Second, the patch is to add an API to check whether gossip is enabled or disabled in the companion Cassandra node. There are existing health APIs that check the status of corresponding components. See the following code block for example.

GET /api/v1/cassandra/native/__health
{"status" : "OK | NOT_OK"}

GET /api/v1/cassandra/jmx/__health
{"status" : "OK | NOT_OK"}

I think we should be consistent with the existing pattern. In other words, the API should be this.

GET /api/v1/cassandra/gossip/__health
{"status" : "OK | NOT_OK"}

This API is returning C* gossip thread running status (just tells is gossip thread running or not). Having thread running doesn't mean gossip health is OK. So I wouldn't prefer to call it 'gossip health'. Gossip thread may be running, but gossip state might be not healthy at C* side

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants