Skip to content

Improve Snapshot Logging#137470

Merged
joshua-adams-1 merged 17 commits intoelastic:mainfrom
joshua-adams-1:snapshot-logging-improvements
Nov 13, 2025
Merged

Improve Snapshot Logging#137470
joshua-adams-1 merged 17 commits intoelastic:mainfrom
joshua-adams-1:snapshot-logging-improvements

Conversation

@joshua-adams-1
Copy link
Contributor

Clean up the reported logging start/end milliseconds into human-readable
string dates

Relates to: ES-12794

Clean up the reported logging start/end milliseconds into human-readable
 string dates

 Relates to: ES-12794
@joshua-adams-1 joshua-adams-1 self-assigned this Oct 31, 2025
@joshua-adams-1 joshua-adams-1 added >non-issue :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Oct 31, 2025
@joshua-adams-1
Copy link
Contributor Author

@DiannaHohensee I am assigning you as the reviewer since you were the one to create the ticket. The acceptance criteria was a bit vague - is this the only instances of snapshot logging you wanted improved regarding human readable dates? I searched through the snapshots/ repo and found these two files, but it's highly likely I've missed some!

As a note, I will push separate PRs to address the other two issues from the ticket:

  • Stop logging when there are no snapshots left
  • Add a log message indicating that all snapshots have finished or been paused

@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review November 3, 2025 12:16
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Nov 3, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Contributor

@burqen burqen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great to have more human friendly logs! And I do have some comments.

Comment on lines +4226 to +4228
"[{}] Writing [{}] of size [{}b] to [{}] took [{}ms]",
"[{}] Writing [{}] of size [{}b] to [{}]. Started at [{}], ended at [{}] and took [{}ms]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When calculating the human readable form for start and end time we risk those values being different from the timestamps provided by the logger, since that might be configured with a different ZoneId (log4j can be configured by the user). This will be confusing for user. A safer way would be to only print the elapsed time, but change it to human readable form.

This could be done using using a TimeValue
Example here: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/ClusterStateObserver.java#L143

I tried to figure out if there is a way to get the zone id from log4j but it doesn't seem so, at least according to ChatGPT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Coming back to this PR, I don't think ES-12794 was intended to modify this log message (it was entirely intended to be within the SnapshotShutdownProgressTracker). However, there is definite value in using the TimeValue class to improve the logging format, and so I will incorporate that in the next revision

Comment on lines +139 to +141
Node shutdown cluster state update received at [{} millis]. \
Finished signalling shard snapshots to pause at [{} millis]. \
Node shutdown cluster state update received at [{}] [{} millis]. \
Finished signalling shard snapshots to pause at [{}] [{} millis]. \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment about converting to clock time outside of the logger applies here.

In this case the value of wall clock time seems higher than in the write blob case, since we are reporting on an historic event rather than on something that just happened. Maybe it's worth it in this case?

Comment on lines +26 to +29
public static String convertMillisToDateTime(long millis) {
DateTimeFormatter formatter = DateTimeFormatter.ISO_LOCAL_DATE_TIME.withZone(ZoneId.of("UTC"));
return formatter.format(Instant.ofEpochMilli(millis));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to do this conversion to wall clock time (see my other comments), I think we should try to use the formatters in org.elasticsearch.common.time rather than using Java's DateTimeFormatter directly. Especially since we get millis from a org.elasticsearch.common.time.TimeProvider (ThreadPool).

There is a DateUtils class already that I think would be a good fit, rather than creating a new one, see org.elasticsearch.common.time.DateUtils.

Copy link
Contributor Author

@joshua-adams-1 joshua-adams-1 Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea - have removed the unnecessary util method and inlined a DateFormatter

import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class DateUtilsTests extends ESTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test coverage 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and it's gone 🤣

blobStoreSnapshotMetrics.incrementCountersForPartUpload(partBytes, uploadTimeInMillis);
logger.trace(
"[{}] Writing [{}] of size [{}b] to [{}] took [{}ms]",
"[{}] Writing [{}] of size [{}b] to [{}] took [{}/{}ms]",
Copy link
Contributor Author

@joshua-adams-1 joshua-adams-1 Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improving the log. Example: ...took [1.4s/1400ms]

@joshua-adams-1 joshua-adams-1 requested a review from burqen November 7, 2025 16:13
Copy link
Contributor

@burqen burqen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot cleaner I think. Also, probably a reasonable middle way to use UTC for the date formatting. Approving 👍

);

private static final Logger logger = LogManager.getLogger(SnapshotShutdownProgressTracker.class);
private static final DateFormatter DATE_TIME_FORMATTER = DateFormatter.forPattern("strict_date_optional_time");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the same approach as here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't delved into the possibilities, but looking at the superficial outcome a log message from the testing you added looks like

[2025-11-11T02:25:35,945][INFO ][o.e.s.SnapshotShutdownProgressTracker][testTrackerPauseTimestamp] Current active shard snapshot stats on data node [local-node-id-for-test]. Node shutdown cluster state update received at [1970-01-01T00:00:00.000Z]. Finished signalling shard snapshots to pause at [1970-01-01T00:00:00.000Z]. Time between the node shutdown cluster state update and signalling shard snapshots to pause is [0 millis]Number shard snapshots running [0]. Number shard snapshots waiting for master node reply to status update request [0] Shard snapshot completion stats since shutdown began: Done [0]; Failed [0]; Aborted [0]; Paused [0]

Specifically, [1970-01-01T00:00:00.000Z]. Is there any way to get a UTC label on this so we know (without looking at the code) what timezone is being reported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B I used DateFormatter.formatMillis() here which uses UTC as a default

N.B The Z at the end is actually the result of ZoneOffset.UTC.toString(). However, this isn't very user friendly so I have changed the log message to read:

... Node shutdown cluster state update received at [1970-01-01T00:00:00.000Z UTC] ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me, thanks 👍

Copy link
Contributor

@DiannaHohensee DiannaHohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just a nit and wondering if we can tack on UTC without too much trouble.

Thanks, Anton, for doing the review heavy lifting.

Finished signalling shard snapshots to pause at [{} millis]. \
Node shutdown cluster state update received at [{}]. \
Finished signalling shard snapshots to pause at [{}]. \
Time between the node shutdown cluster state update and signalling shard snapshots to pause is [{} millis]\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a period and a space, . , at the end of this line, looks like this

... Time between the node shutdown cluster state update and signalling shard snapshots to pause is [0 millis]Number shard snapshots running [0]. ...

);

private static final Logger logger = LogManager.getLogger(SnapshotShutdownProgressTracker.class);
private static final DateFormatter DATE_TIME_FORMATTER = DateFormatter.forPattern("strict_date_optional_time");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't delved into the possibilities, but looking at the superficial outcome a log message from the testing you added looks like

[2025-11-11T02:25:35,945][INFO ][o.e.s.SnapshotShutdownProgressTracker][testTrackerPauseTimestamp] Current active shard snapshot stats on data node [local-node-id-for-test]. Node shutdown cluster state update received at [1970-01-01T00:00:00.000Z]. Finished signalling shard snapshots to pause at [1970-01-01T00:00:00.000Z]. Time between the node shutdown cluster state update and signalling shard snapshots to pause is [0 millis]Number shard snapshots running [0]. Number shard snapshots waiting for master node reply to status update request [0] Shard snapshot completion stats since shutdown began: Done [0]; Failed [0]; Aborted [0]; Paused [0]

Specifically, [1970-01-01T00:00:00.000Z]. Is there any way to get a UTC label on this so we know (without looking at the code) what timezone is being reported?

@joshua-adams-1 joshua-adams-1 merged commit 1ce5c7c into elastic:main Nov 13, 2025
34 checks passed
@joshua-adams-1 joshua-adams-1 deleted the snapshot-logging-improvements branch November 13, 2025 11:37
szybia added a commit to szybia/elasticsearch that referenced this pull request Nov 13, 2025
…-json

* upstream/main: (158 commits)
  Cleanup files from repo root folder (elastic#138030)
  Implement OpenShift AI integration for chat completion, embeddings, and reranking (elastic#136624)
  Optimize AsyncSearchErrorTraceIT to avoid failures (elastic#137716)
  Removes support for null TransportService in RemoteClusterService (elastic#137939)
  Mute org.elasticsearch.index.mapper.DateFieldMapperTests testSortShortcuts elastic#138018
  rest-api-spec: fix type of enums (elastic#137521)
  Update Gradle wrapper to 9.2.0 (elastic#136155)
  Add RCS Strong Verification Documentation (elastic#137822)
  Use docvalue skippers on dimension fields (elastic#137029)
  Introduce INDEX_SHARD_COUNT_FORMAT (elastic#137210)
  Mute org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT testCreatesChatCompletion_AndThenCreatesTextEmbedding elastic#138012
  Fix ES|QL search context creation to use correct results type (elastic#137994)
  Improve Snapshot Logging (elastic#137470)
  Support extra output field in TOP function (elastic#135434)
  Remove NumericDoubleValues class (elastic#137884)
  [ML] Fix ML calendar event update scalability issues (elastic#136886)
  Task may be unregistered outside of the trace context in exceptional cases. (elastic#137865)
  Refine workaround for S3 repo analysis known issue (elastic#138000)
  Additional DEBUG logging on authc failures (elastic#137941)
  Cleanup index resolution (elastic#137867)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants