Skip to content

Conversation

@simplynaveen20
Copy link
Member

@simplynaveen20 simplynaveen20 commented Dec 11, 2020

Currently CosmosException toString() and getMessage() method return a mix of non-json text and json diagnostic information.
This is causing problems during debugging the issue when the customer reports the exception.
This PR will convert both toString() and getMessage() into json format. Added test coverage to avoid future contract break.

Examples
toString()-

{
"ClassName": "NotFoundException",
"userAgent": "azsdk-java-cosmos/4.10.0 Windows10/10.0 JRE/11.0.6",
"statusCode": 404,
"resourceAddress": "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer20/partitions/a4cb4960-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
"error": "{"Errors":["Resource Not Found. Learn more: https://aka.ms/cosmosdb-tsg-not-found"]}",
"innerErrorMessage": "["Resource Not Found. Learn more: https://aka.ms/cosmosdb-tsg-not-found"]",
"causeInfo": null,
"responseHeaders": "{x-ms-current-replica-set-size=1, x-ms-last-state-change-utc=Tue, 15 Dec 2020 19:26:43.217 GMT, x-ms-session-token=0:-1#1632, lsn=1632, x-ms-request-charge=1.0, x-ms-schemaversion=1.10, x-ms-transport-request-id=2, x-ms-number-of-read-regions=0, x-ms-current-write-quorum=1, x-ms-cosmos-quorum-acked-llsn=1632, x-ms-quorum-acked-lsn=1632, x-ms-activity-id=42c69559-3f0c-11eb-9f3c-59b9e4f788a5, x-ms-xp-role=0, x-ms-global-Committed-lsn=-1, x-ms-cosmos-llsn=1632, x-ms-serviceversion= version=2.11.0.0}",
"cosmosDiagnostics": {
"userAgent": "azsdk-java-cosmos/4.10.0 Windows10/10.0 JRE/11.0.6",
"requestLatencyInMs": 392,
"requestStartTimeUTC": "2020-12-15T19:32:25.730376600Z",
"requestEndTimeUTC": "2020-12-15T19:32:26.122406500Z",
"connectionMode": "DIRECT",
"responseStatisticsList": [
{
"storeResult": {
"storePhysicalAddress": null,
"lsn": 1632,
"globalCommittedLsn": -1,
"partitionKeyRangeId": "0",
"isValid": true,
"statusCode": 404,
"subStatusCode": 0,
"isGone": false,
"isNotFound": true,
"isInvalidPartition": false,
"requestCharge": 1,
"itemLSN": -1,
"sessionToken": "-1#1632",
"exception": "["Resource Not Found. Learn more: https://aka.ms/cosmosdb-tsg-not-found"]",
"transportRequestTimeline": [
{
"eventName": "created",
"startTimeUTC": "2020-12-15T19:32:26.025410700Z",
"durationInMicroSec": 1993
},
{
"eventName": "queued",
"startTimeUTC": "2020-12-15T19:32:26.027404500Z",
"durationInMicroSec": 0
},
{
"eventName": "channelAcquisitionStarted",
"startTimeUTC": "2020-12-15T19:32:26.027404500Z",
"durationInMicroSec": 36003
},
{
"eventName": "pipelined",
"startTimeUTC": "2020-12-15T19:32:26.063407800Z",
"durationInMicroSec": 13000
},
{
"eventName": "transitTime",
"startTimeUTC": "2020-12-15T19:32:26.076407900Z",
"durationInMicroSec": 2998
},
{
"eventName": "received",
"startTimeUTC": "2020-12-15T19:32:26.079406600Z",
"durationInMicroSec": 32000
},
{
"eventName": "completed",
"startTimeUTC": "2020-12-15T19:32:26.111407100Z",
"durationInMicroSec": 8000
}
],
"rntbdRequestLengthInBytes": 463,
"rntbdResponseLengthInBytes": 283,
"requestPayloadLengthInBytes": 0,
"responsePayloadLengthInBytes": null,
"channelTaskQueueSize": 1,
"pendingRequestsCount": 1,
"serviceEndpointStatistics": {
"availableChannels": 0,
"acquiredChannels": 0,
"executorTaskQueueSize": 0,
"inflightRequests": 1,
"lastSuccessfulRequestTime": "2020-12-15T19:32:26.026Z",
"lastRequestTime": "2020-12-15T19:32:26.026Z",
"createdTime": "2020-12-15T19:32:26.026406900Z",
"isClosed": false
}
},
"requestResponseTimeUTC": "2020-12-15T19:32:26.122406500Z",
"requestResourceType": "Document",
"requestOperationType": "Read"
}
],
"supplementalResponseStatisticsList": [],
"addressResolutionStatistics": {
"42ded84d-3f0c-11eb-9f3c-59b9e4f788a5": {
"startTimeUTC": "2020-12-15T19:32:25.900406600Z",
"endTimeUTC": "2020-12-15T19:32:25.917410400Z",
"targetEndpoint": "https://127.0.0.1:8081/addresses/?$resolveFor=dbs%2FRuR-AA%3D%3D%2Fcolls%2FRuR-AKEqke8%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=0",
"errorMessage": null,
"inflightRequest": false
}
},
"regionsContacted": [
"https://127.0.0.1:8081/"
],
"retryContext": {
"retryCount": 0,
"statusAndSubStatusCodes": null,
"retryLatency": 0
},
"metadataDiagnosticsContext": {
"metadataDiagnosticList": [
{
"metaDataName": "CONTAINER_LOOK_UP",
"startTimeUTC": "2020-12-15T19:32:25.750382200Z",
"endTimeUTC": "2020-12-15T19:32:25.789377600Z",
"durationinMS": 38
},
{
"metaDataName": "PARTITION_KEY_RANGE_LOOK_UP",
"startTimeUTC": "2020-12-15T19:32:25.857394800Z",
"endTimeUTC": "2020-12-15T19:32:25.894406900Z",
"durationinMS": 37
},
{
"metaDataName": "SERVER_ADDRESS_LOOKUP",
"startTimeUTC": "2020-12-15T19:32:25.900406600Z",
"endTimeUTC": "2020-12-15T19:32:25.917410400Z",
"durationinMS": 17
}
]
},
"serializationDiagnosticsContext": {
"serializationDiagnosticsList": null
},
"gatewayStatistics": null,
"systemInformation": {
"usedMemory": "79083 KB",
"availableMemory": "4074261 KB",
"systemCpuLoad": "(2020-12-15T19:32:22.653618900Z 33.7%), (2020-12-15T19:32:22.990406700Z 33.7%)"
},
"clientCfgs": {
"id": -2,
"numberOfClients": 4,
"connCfg": {
"rntbd": "(cto:PT5S, rto:PT5S, icto:PT0S, ieto:PT1H, mcpe:130, mrpc:30, cer:false)",
"gw": "(cps:1000, rto:PT5S, icto:null, p:false)",
"other": "(ed: true, cs: false)"
},
"consistencyCfg": "(consistency: null, mm: true, prgns: [])"
}
}
}

getMessage()-
{
"innerErrorMessage": "["Resource Not Found. Learn more: https://aka.ms/cosmosdb-tsg-not-found"]",
"cosmosDiagnostics": {
"userAgent": "azsdk-java-cosmos/4.10.0 Windows10/10.0 JRE/11.0.6",
"requestLatencyInMs": 392,
"requestStartTimeUTC": "2020-12-15T19:32:25.730376600Z",
"requestEndTimeUTC": "2020-12-15T19:32:26.122406500Z",
"connectionMode": "DIRECT",
"responseStatisticsList": [
{
"storeResult": {
"storePhysicalAddress": null,
"lsn": 1632,
"globalCommittedLsn": -1,
"partitionKeyRangeId": "0",
"isValid": true,
"statusCode": 404,
"subStatusCode": 0,
"isGone": false,
"isNotFound": true,
"isInvalidPartition": false,
"requestCharge": 1,
"itemLSN": -1,
"sessionToken": "-1#1632",
"exception": "["Resource Not Found. Learn more: https://aka.ms/cosmosdb-tsg-not-found"]",
"transportRequestTimeline": [
{
"eventName": "created",
"startTimeUTC": "2020-12-15T19:32:26.025410700Z",
"durationInMicroSec": 1993
},
{
"eventName": "queued",
"startTimeUTC": "2020-12-15T19:32:26.027404500Z",
"durationInMicroSec": 0
},
{
"eventName": "channelAcquisitionStarted",
"startTimeUTC": "2020-12-15T19:32:26.027404500Z",
"durationInMicroSec": 36003
},
{
"eventName": "pipelined",
"startTimeUTC": "2020-12-15T19:32:26.063407800Z",
"durationInMicroSec": 13000
},
{
"eventName": "transitTime",
"startTimeUTC": "2020-12-15T19:32:26.076407900Z",
"durationInMicroSec": 2998
},
{
"eventName": "received",
"startTimeUTC": "2020-12-15T19:32:26.079406600Z",
"durationInMicroSec": 32000
},
{
"eventName": "completed",
"startTimeUTC": "2020-12-15T19:32:26.111407100Z",
"durationInMicroSec": 8000
}
],
"rntbdRequestLengthInBytes": 463,
"rntbdResponseLengthInBytes": 283,
"requestPayloadLengthInBytes": 0,
"responsePayloadLengthInBytes": null,
"channelTaskQueueSize": 1,
"pendingRequestsCount": 1,
"serviceEndpointStatistics": {
"availableChannels": 0,
"acquiredChannels": 0,
"executorTaskQueueSize": 0,
"inflightRequests": 1,
"lastSuccessfulRequestTime": "2020-12-15T19:32:26.026Z",
"lastRequestTime": "2020-12-15T19:32:26.026Z",
"createdTime": "2020-12-15T19:32:26.026406900Z",
"isClosed": false
}
},
"requestResponseTimeUTC": "2020-12-15T19:32:26.122406500Z",
"requestResourceType": "Document",
"requestOperationType": "Read"
}
],
"supplementalResponseStatisticsList": [],
"addressResolutionStatistics": {
"42ded84d-3f0c-11eb-9f3c-59b9e4f788a5": {
"startTimeUTC": "2020-12-15T19:32:25.900406600Z",
"endTimeUTC": "2020-12-15T19:32:25.917410400Z",
"targetEndpoint": "https://127.0.0.1:8081/addresses/?$resolveFor=dbs%2FRuR-AA%3D%3D%2Fcolls%2FRuR-AKEqke8%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=0",
"errorMessage": null,
"inflightRequest": false
}
},
"regionsContacted": [
"https://127.0.0.1:8081/"
],
"retryContext": {
"retryCount": 0,
"statusAndSubStatusCodes": null,
"retryLatency": 0
},
"metadataDiagnosticsContext": {
"metadataDiagnosticList": [
{
"metaDataName": "CONTAINER_LOOK_UP",
"startTimeUTC": "2020-12-15T19:32:25.750382200Z",
"endTimeUTC": "2020-12-15T19:32:25.789377600Z",
"durationinMS": 38
},
{
"metaDataName": "PARTITION_KEY_RANGE_LOOK_UP",
"startTimeUTC": "2020-12-15T19:32:25.857394800Z",
"endTimeUTC": "2020-12-15T19:32:25.894406900Z",
"durationinMS": 37
},
{
"metaDataName": "SERVER_ADDRESS_LOOKUP",
"startTimeUTC": "2020-12-15T19:32:25.900406600Z",
"endTimeUTC": "2020-12-15T19:32:25.917410400Z",
"durationinMS": 17
}
]
},
"serializationDiagnosticsContext": {
"serializationDiagnosticsList": null
},
"gatewayStatistics": null,
"systemInformation": {
"usedMemory": "79083 KB",
"availableMemory": "4074261 KB",
"systemCpuLoad": "(2020-12-15T19:32:22.653618900Z 33.7%), (2020-12-15T19:32:22.990406700Z 33.7%)"
},
"clientCfgs": {
"id": -2,
"numberOfClients": 4,
"connCfg": {
"rntbd": "(cto:PT5S, rto:PT5S, icto:PT0S, ieto:PT1H, mcpe:130, mrpc:30, cer:false)",
"gw": "(cps:1000, rto:PT5S, icto:null, p:false)",
"other": "(ed: true, cs: false)"
},
"consistencyCfg": "(consistency: null, mm: true, prgns: [])"
}
}
}

@simplynaveen20
Copy link
Member Author

/azp run java - cosmos - ci

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@simplynaveen20
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@moderakh moderakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Naveen for the PR. This is helpful.
@simplynaveen20 Could you please add a sample exception stacktrace to the PR description?

also:
Serializing diagnostics is expensive and will have perf impact.

Now that we are doing that in the exception stacktrace for each exception we should exclude the business logic errors:

  • NotFound (with no substatus code, i.e., for READ_SESSION_NOT_AVAILABLE we need diagnostics),
  • Conflict,
  • PreconditionFailed.

These are business logic failures we don't need to the diagnostics for this.

NotFound with READ_SESSION_NOT_AVAILABLE substatus code is a consistency failure not business logic we should have diagnostics for that.

@simplynaveen20
Copy link
Member Author

Thank you Naveen for the PR. This is helpful.
@simplynaveen20 Could you please add a sample exception stacktrace to the PR description?

also:
Serializing diagnostics is expensive and will have perf impact.

Now that we are doing that in the exception stacktrace for each exception we should exclude the business logic errors:

  • NotFound (with no substatus code, i.e., for READ_SESSION_NOT_AVAILABLE we need diagnostics),
  • Conflict,
  • PreconditionFailed.

These are business logic failures we don't need to the diagnostics for this.

NotFound with READ_SESSION_NOT_AVAILABLE substatus code is a consistency failure not business logic we should have diagnostics for that.

Before this PR we were still serializing diagnostics and putting in string on exception message. In this PR we putting diagnostics object in exception parent ObjectNode and then serializing in json at the last. So for computation on diagnostics serialization still remain the same.
However I see your point in general to avoid diagnostics serialization on regular customer business scenario like 404 not found. I would like to keep it as sperate work item, however there is down side to put custom logic based on status code, we need to maintain extra logic for future scenarios and secondly its customer who should decide whether to fetch the whole message when there is expected failure and can be handle based on expected status code.

@simplynaveen20
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@simplynaveen20
Copy link
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@moderakh moderakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Feel free to capture the following work in a separate PR (work item)

materializing diagnostics in every exception stacktrace for every exception is expensive and has perf impact, we should exclude the business logic errors:

NotFound (with no substatus code, i.e., for READ_SESSION_NOT_AVAILABLE we need diagnostics),
Conflict,
PreconditionFailed.
These are business logic failures we don't need to the diagnostics for this.

NotFound with READ_SESSION_NOT_AVAILABLE substatus code is a consistency failure not business logic we should have diagnostics for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants