Improve Snapshot Logging#137470
Conversation
Clean up the reported logging start/end milliseconds into human-readable string dates Relates to: ES-12794
|
@DiannaHohensee I am assigning you as the reviewer since you were the one to create the ticket. The acceptance criteria was a bit vague - is this the only instances of snapshot logging you wanted improved regarding human readable dates? I searched through the As a note, I will push separate PRs to address the other two issues from the ticket:
|
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
burqen
left a comment
There was a problem hiding this comment.
It's great to have more human friendly logs! And I do have some comments.
| "[{}] Writing [{}] of size [{}b] to [{}] took [{}ms]", | ||
| "[{}] Writing [{}] of size [{}b] to [{}]. Started at [{}], ended at [{}] and took [{}ms]", |
There was a problem hiding this comment.
When calculating the human readable form for start and end time we risk those values being different from the timestamps provided by the logger, since that might be configured with a different ZoneId (log4j can be configured by the user). This will be confusing for user. A safer way would be to only print the elapsed time, but change it to human readable form.
This could be done using using a TimeValue
Example here: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/ClusterStateObserver.java#L143
I tried to figure out if there is a way to get the zone id from log4j but it doesn't seem so, at least according to ChatGPT.
There was a problem hiding this comment.
Good catch! Coming back to this PR, I don't think ES-12794 was intended to modify this log message (it was entirely intended to be within the SnapshotShutdownProgressTracker). However, there is definite value in using the TimeValue class to improve the logging format, and so I will incorporate that in the next revision
| Node shutdown cluster state update received at [{} millis]. \ | ||
| Finished signalling shard snapshots to pause at [{} millis]. \ | ||
| Node shutdown cluster state update received at [{}] [{} millis]. \ | ||
| Finished signalling shard snapshots to pause at [{}] [{} millis]. \ |
There was a problem hiding this comment.
The same comment about converting to clock time outside of the logger applies here.
In this case the value of wall clock time seems higher than in the write blob case, since we are reporting on an historic event rather than on something that just happened. Maybe it's worth it in this case?
| public static String convertMillisToDateTime(long millis) { | ||
| DateTimeFormatter formatter = DateTimeFormatter.ISO_LOCAL_DATE_TIME.withZone(ZoneId.of("UTC")); | ||
| return formatter.format(Instant.ofEpochMilli(millis)); | ||
| } |
There was a problem hiding this comment.
If we need to do this conversion to wall clock time (see my other comments), I think we should try to use the formatters in org.elasticsearch.common.time rather than using Java's DateTimeFormatter directly. Especially since we get millis from a org.elasticsearch.common.time.TimeProvider (ThreadPool).
There is a DateUtils class already that I think would be a good fit, rather than creating a new one, see org.elasticsearch.common.time.DateUtils.
There was a problem hiding this comment.
Good idea - have removed the unnecessary util method and inlined a DateFormatter
| import java.time.ZoneId; | ||
| import java.time.format.DateTimeFormatter; | ||
|
|
||
| public class DateUtilsTests extends ESTestCase { |
There was a problem hiding this comment.
... and it's gone 🤣
…shua-adams-1/elasticsearch into snapshot-logging-improvements
| blobStoreSnapshotMetrics.incrementCountersForPartUpload(partBytes, uploadTimeInMillis); | ||
| logger.trace( | ||
| "[{}] Writing [{}] of size [{}b] to [{}] took [{}ms]", | ||
| "[{}] Writing [{}] of size [{}b] to [{}] took [{}/{}ms]", |
There was a problem hiding this comment.
Improving the log. Example: ...took [1.4s/1400ms]
burqen
left a comment
There was a problem hiding this comment.
This is a lot cleaner I think. Also, probably a reasonable middle way to use UTC for the date formatting. Approving 👍
| ); | ||
|
|
||
| private static final Logger logger = LogManager.getLogger(SnapshotShutdownProgressTracker.class); | ||
| private static final DateFormatter DATE_TIME_FORMATTER = DateFormatter.forPattern("strict_date_optional_time"); |
There was a problem hiding this comment.
I haven't delved into the possibilities, but looking at the superficial outcome a log message from the testing you added looks like
[2025-11-11T02:25:35,945][INFO ][o.e.s.SnapshotShutdownProgressTracker][testTrackerPauseTimestamp] Current active shard snapshot stats on data node [local-node-id-for-test]. Node shutdown cluster state update received at [1970-01-01T00:00:00.000Z]. Finished signalling shard snapshots to pause at [1970-01-01T00:00:00.000Z]. Time between the node shutdown cluster state update and signalling shard snapshots to pause is [0 millis]Number shard snapshots running [0]. Number shard snapshots waiting for master node reply to status update request [0] Shard snapshot completion stats since shutdown began: Done [0]; Failed [0]; Aborted [0]; Paused [0]
Specifically, [1970-01-01T00:00:00.000Z]. Is there any way to get a UTC label on this so we know (without looking at the code) what timezone is being reported?
There was a problem hiding this comment.
N.B I used DateFormatter.formatMillis() here which uses UTC as a default
N.B The Z at the end is actually the result of ZoneOffset.UTC.toString(). However, this isn't very user friendly so I have changed the log message to read:
... Node shutdown cluster state update received at [1970-01-01T00:00:00.000Z UTC] ...
There was a problem hiding this comment.
Works for me, thanks 👍
DiannaHohensee
left a comment
There was a problem hiding this comment.
Looks good to me, just a nit and wondering if we can tack on UTC without too much trouble.
Thanks, Anton, for doing the review heavy lifting.
| Finished signalling shard snapshots to pause at [{} millis]. \ | ||
| Node shutdown cluster state update received at [{}]. \ | ||
| Finished signalling shard snapshots to pause at [{}]. \ | ||
| Time between the node shutdown cluster state update and signalling shard snapshots to pause is [{} millis]\ |
There was a problem hiding this comment.
Missing a period and a space, . , at the end of this line, looks like this
... Time between the node shutdown cluster state update and signalling shard snapshots to pause is [0 millis]Number shard snapshots running [0]. ...
| ); | ||
|
|
||
| private static final Logger logger = LogManager.getLogger(SnapshotShutdownProgressTracker.class); | ||
| private static final DateFormatter DATE_TIME_FORMATTER = DateFormatter.forPattern("strict_date_optional_time"); |
There was a problem hiding this comment.
I haven't delved into the possibilities, but looking at the superficial outcome a log message from the testing you added looks like
[2025-11-11T02:25:35,945][INFO ][o.e.s.SnapshotShutdownProgressTracker][testTrackerPauseTimestamp] Current active shard snapshot stats on data node [local-node-id-for-test]. Node shutdown cluster state update received at [1970-01-01T00:00:00.000Z]. Finished signalling shard snapshots to pause at [1970-01-01T00:00:00.000Z]. Time between the node shutdown cluster state update and signalling shard snapshots to pause is [0 millis]Number shard snapshots running [0]. Number shard snapshots waiting for master node reply to status update request [0] Shard snapshot completion stats since shutdown began: Done [0]; Failed [0]; Aborted [0]; Paused [0]
Specifically, [1970-01-01T00:00:00.000Z]. Is there any way to get a UTC label on this so we know (without looking at the code) what timezone is being reported?
…-json * upstream/main: (158 commits) Cleanup files from repo root folder (elastic#138030) Implement OpenShift AI integration for chat completion, embeddings, and reranking (elastic#136624) Optimize AsyncSearchErrorTraceIT to avoid failures (elastic#137716) Removes support for null TransportService in RemoteClusterService (elastic#137939) Mute org.elasticsearch.index.mapper.DateFieldMapperTests testSortShortcuts elastic#138018 rest-api-spec: fix type of enums (elastic#137521) Update Gradle wrapper to 9.2.0 (elastic#136155) Add RCS Strong Verification Documentation (elastic#137822) Use docvalue skippers on dimension fields (elastic#137029) Introduce INDEX_SHARD_COUNT_FORMAT (elastic#137210) Mute org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT testCreatesChatCompletion_AndThenCreatesTextEmbedding elastic#138012 Fix ES|QL search context creation to use correct results type (elastic#137994) Improve Snapshot Logging (elastic#137470) Support extra output field in TOP function (elastic#135434) Remove NumericDoubleValues class (elastic#137884) [ML] Fix ML calendar event update scalability issues (elastic#136886) Task may be unregistered outside of the trace context in exceptional cases. (elastic#137865) Refine workaround for S3 repo analysis known issue (elastic#138000) Additional DEBUG logging on authc failures (elastic#137941) Cleanup index resolution (elastic#137867) ...
Clean up the reported logging start/end milliseconds into human-readable
string dates
Relates to: ES-12794