The HealthService is an application component that runs alongside other components, such as CatalogService and BackgroundProcessor, on the compute cluster. It provides a REST API that Azure Front Door calls to determine the health of a stamp (region). Unlike the basic liveness probes that are present on every API, the HealthService is a more complex component that reflects the state of its dependencies in addition to its own.
If the cluster itself is down, the health service won't respond at all. When the service is up and running, it performs periodic checks against various components of the solution:
- It attempts to do a simple query against Cosmos DB
- It attempts to send a message to Event Hub (the message will be filtered out by the background worker)
- It looks up a state file on the storage account. This file can be used to turn off a region, even while the other checks are still passing.
All health check results are cached in memory for a configurable number of seconds (10 by default) so that not every call to the API results in backend calls. While this adds a small potential delay in detecting outages, it also reduces the additional cluster load generated by health checks.
Refer to CatalogService configuration for details of the implementation.
Apart from the configuration settings which are common between components, such as Cosmos DB connection settings, the following settings are used exclusively by the HealthService:
- HealthServiceCacheDurationSeconds: Controls the expiration time of memory cache, in seconds.
- HealthServiceStorageConnectionString: Connection string for the Storage Account where the status file should be present.
- HealthServiceBlobContainerName: Storage Container where the status file should be present.
- HealthServiceBlobName: Name of the status file - health check will look for this.
- HealthServiceOverallTimeoutSeconds: Timeout for the whole check - defaults to 3 seconds. If the check doesn't finish in this interval, the service reports unhealthy.
All checks run asynchronously and in parallel. If any one of them fails, the whole stamp is considered unavailable.
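The fan-out described above, combined with the overall timeout, could be sketched roughly as follows. This is a minimal illustration and not the HealthService code: the `StampHealth` class, its method names, and the delegate shape for a check are hypothetical, and `Task.WaitAsync(TimeSpan)` assumes .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the actual HealthService.
public static class StampHealth
{
    // Starts every check concurrently. Returns true only if all checks
    // report healthy before the overall timeout elapses.
    public static async Task<bool> IsHealthyAsync(
        IEnumerable<Func<CancellationToken, Task<bool>>> checks,
        TimeSpan overallTimeout)
    {
        using var cts = new CancellationTokenSource();

        // ToArray() forces all checks to start before any of them is awaited.
        Task<bool>[] running = checks
            .Select(check => RunSafelyAsync(check, cts.Token))
            .ToArray();

        try
        {
            bool[] results = await Task.WhenAll(running).WaitAsync(overallTimeout);
            return results.All(healthy => healthy);
        }
        catch (TimeoutException)
        {
            cts.Cancel(); // ask checks that are still running to stop
            return false; // an unfinished check counts as unhealthy
        }
    }

    // Treats any exception thrown by a check as an unhealthy result.
    private static async Task<bool> RunSafelyAsync(
        Func<CancellationToken, Task<bool>> check, CancellationToken token)
    {
        try
        {
            return await check(token);
        }
        catch (Exception)
        {
            return false;
        }
    }
}
```

In this sketch a caller would pass one delegate per dependency (Cosmos DB, Event Hub, Blob Storage) and the value of HealthServiceOverallTimeoutSeconds as the timeout. The checks are assumed to be genuinely asynchronous; a check that blocks synchronously would delay the start of the others.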
Check results are cached in memory, using the standard, non-distributed ASP.NET Core MemoryCache. Cache expiration is controlled by SysConfig.HealthServiceCacheDurationSeconds and is set to 10 seconds by default.
This reduces the additional load generated by health checks, as not every request will result in a downstream call to the dependent services.
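To illustrate the effect of the cache, here is a minimal sketch of time-based result caching. It deliberately does not use MemoryCache; it is a simplified stand-in built only on the base class library so that it stays self-contained. The `CachedHealthCheck` class and the injected clock are hypothetical names introduced for this example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the MemoryCache-based caching in the service.
public sealed class CachedHealthCheck
{
    private readonly Func<Task<bool>> _check;
    private readonly TimeSpan _cacheDuration;
    private readonly Func<DateTimeOffset> _now;
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private bool _cachedResult;
    private DateTimeOffset _expiresAt = DateTimeOffset.MinValue;

    // 'now' is an injectable clock, e.g. () => DateTimeOffset.UtcNow.
    public CachedHealthCheck(
        Func<Task<bool>> check, TimeSpan cacheDuration, Func<DateTimeOffset> now)
    {
        _check = check;
        _cacheDuration = cacheDuration;
        _now = now;
    }

    // Returns the cached result while it is fresh; otherwise runs the
    // underlying check once and caches its result for the configured duration.
    public async Task<bool> GetAsync()
    {
        await _gate.WaitAsync(); // only one caller refreshes at a time
        try
        {
            if (_now() >= _expiresAt)
            {
                _cachedResult = await _check();
                _expiresAt = _now() + _cacheDuration;
            }
            return _cachedResult;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

With a 10-second duration, any number of Front Door probes arriving within the same window trigger at most one round of downstream calls. The trade-off is that an outage may go unreported for up to the cache duration.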
The blob check currently serves two purposes:
- Test if it's possible to reach Blob Storage. This storage account is also used by other components in the stamp and hence considered a critical resource.
- Manually "turn off" a region by manipulating (i.e. deleting) the state file.
We decided that this check should only look for the presence of a state file in the specified Blob Container and should not process its content in any way. It would also be possible to set up a more sophisticated system that reads the content of the file and returns a different status based on it (such as "HEALTHY", "UNHEALTHY", "MAINTENANCE" etc.).
Remove the state file to disable a stamp.
Make sure the file is present after deploying the application. Otherwise the health service will always respond with UNHEALTHY and Front Door will not recognize the backend as available. Terraform creates this file, so it should be present after the infrastructure deployment.
Event Hub health reporting is handled by the EventHubProducerService. This service reports healthy if it's able to send a new message to Event Hub. For filtering, this message has an identifying property added to it:
HEALTHCHECK=TRUE
This message is ignored on the receiving end (AlwaysOn.BackgroundProcessor.EventHubProcessorService.ProcessEventHanderAsync()), which checks for the HEALTHCHECK property.
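The receiving-side filter could look roughly like the sketch below. It operates on a plain property dictionary, which is the shape Event Hubs application properties take, but the `HealthCheckMessageFilter` class is hypothetical. Comparing the value against "TRUE" is also an assumption; the actual processor may only test for the presence of the property.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not the actual EventHubProcessorService code.
public static class HealthCheckMessageFilter
{
    public const string PropertyName = "HEALTHCHECK";

    // Returns true when the message properties mark it as a health check
    // probe, so the processor can skip it instead of handling it as real work.
    public static bool IsHealthCheckMessage(IDictionary<string, object> properties)
    {
        return properties.TryGetValue(PropertyName, out var value)
            && string.Equals(
                Convert.ToString(value), "TRUE", StringComparison.OrdinalIgnoreCase);
    }
}
```

A processor would call `IsHealthCheckMessage` on each incoming event's properties and return early for probe messages, so that health checks never reach the business logic.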
Cosmos DB health reporting is handled by the CosmosDbService, which reports healthy if it is:
- Able to connect to Cosmos DB database and perform a simple query.
- Able to write a test document to the database (the test document has a very short Time-to-Live set, so Cosmos DB automatically removes it).
The HealthService performs two separate probes because Cosmos DB could be in a state in which reads still work but writes do not.
The read-only probe uses the following query, which doesn't fetch any data and has little impact on overall load:
SELECT GetCurrentDateTime()
The write query creates a dummy ItemRating with minimum content:
var testRating = new ItemRating()
{
    Id = Guid.NewGuid(),
    CatalogItemId = Guid.NewGuid(), // Create some random (=non-existing) item id
    CreationDate = DateTime.UtcNow,
    Rating = 1,
    TimeToLive = 10 // will be auto-deleted after 10sec
};
await AddNewRatingAsync(testRating);