[cDAC] Multi data-descriptor proposal #118126

max-charlamb · 2025-07-28T13:45:31Z

Why

As we move to fully implement the APIs used by SOS in the cDAC, we need to access data structures and global values defined in the GC. These are special because the GC can either be built into coreclr.dll or live in an external clrgc.dll. Because of this, it is impossible to completely define GC data structures at coreclr.dll compile time and store them in the data-descriptor as we do for VM structures.

Proposed Solution

Add support for a data-descriptor to have sub-descriptors. Data-descriptors will have a new optional sub-descriptor section which can contain pointers to sub-descriptors. Each sub-descriptor has the same spec as the main (existing) descriptor. The pointers act the same as the well-known symbol name for the main descriptor.

Memory Layout

The sub-descriptors will be an additional (optional) section of the data-descriptor, identical in spec to the existing globals section except that each value's type must be a pointer.

Parsing

When parsing a data-descriptor, each sub-descriptor pointer is checked for null. If it is non-null, the sub-descriptor is parsed in the same way as the main descriptor.

The header of a sub-descriptor must match that of the parent descriptor. If it does not match (different pointer size or endianness), this is an error and behavior is undefined.
The sub-descriptor and parent descriptor must not define conflicting type, global, or contract entries. If any conflict exists, this is an error and behavior is undefined.

Types, globals, and contract versions are merged between the sub-descriptor and the parent descriptor.
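The parsing rules above can be sketched as a small recursive reader. This is only an illustration under stated assumptions, not the implementation from #118050: `Descriptor`, `Target`, `MergeSection`, and `ReadDescriptor` are hypothetical names, target memory is modeled as a map from address to an already-decoded descriptor, any duplicate name is treated as a conflict, and errors are reported by throwing (the proposal itself only says that behavior is undefined).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical, simplified model of a parsed data descriptor.
// None of these names come from the cDAC code base.
struct Descriptor {
    bool littleEndian = true;                        // header: endianness
    int pointerSize = 8;                             // header: pointer size in bytes
    std::map<std::string, uint32_t> types;           // type name -> size
    std::map<std::string, uint64_t> globals;         // global name -> value
    std::map<std::string, int> contracts;            // contract name -> version
    std::map<std::string, uint64_t> subDescriptors;  // slot name -> address (0 = unused)
};

// Stand-in for target memory: address -> descriptor found at that address.
using Target = std::map<uint64_t, Descriptor>;

// Copy every entry of `from` into `into`; a duplicate name counts as a conflict.
template <typename V>
void MergeSection(std::map<std::string, V>& into,
                  const std::map<std::string, V>& from,
                  const char* section) {
    for (const auto& entry : from) {
        if (!into.insert(entry).second) {
            throw std::runtime_error(std::string("conflicting ") + section + ": " + entry.first);
        }
    }
}

// Read the descriptor at `address`, then recursively merge each non-null sub-descriptor.
// Assumes the sub-descriptor graph has no cycles.
Descriptor ReadDescriptor(const Target& target, uint64_t address) {
    const Descriptor& parent = target.at(address);
    Descriptor merged = parent;
    for (const auto& slot : parent.subDescriptors) {
        if (slot.second == 0) continue;  // null pointer: slot is not in use
        Descriptor sub = ReadDescriptor(target, slot.second);
        if (sub.littleEndian != parent.littleEndian || sub.pointerSize != parent.pointerSize) {
            throw std::runtime_error("header mismatch in sub-descriptor: " + slot.first);
        }
        MergeSection(merged.types, sub.types, "type");
        MergeSection(merged.globals, sub.globals, "global");
        MergeSection(merged.contracts, sub.contracts, "contract");
    }
    return merged;
}
```

With this shape, a consumer sees a single merged set of types, globals, and contract versions, and a null slot simply contributes nothing.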

Implemented in: #118050

Copilot

Pull Request Overview

This PR introduces support for multi data-descriptor functionality to the Contract Data Access (cDAC) system. The primary purpose is to enable deferred data definition resolution through sub-descriptors when certain data definitions are not known at compile time but may be provided by external components.

Key changes:

* Adds support for string-type global values alongside existing primitive integer constants and pointers
* Introduces sub-descriptor pointers as a new optional component of data descriptors
* Updates the JSON format to include a "sub-descriptors" field with pointer-based references to external data descriptors

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| datacontracts_design.md | Updates global values to support strings and adds sub-descriptor concept documentation |
| data_descriptor.md | Adds comprehensive specification for sub-descriptor descriptors, JSON format updates, and example usage |
| contract-descriptor.md | Adds sub-descriptors field to example JSON and clarifies contract symbol discovery language |

Comments suppressed due to low confidence (1)

docs/design/datacontracts/data_descriptor.md:300

The link reference uses 'contract-descriptor.md'. This appears to be a correction from 'contract_descriptor.md' to match the actual hyphenated filename, so it is likely fixing an existing inconsistency.

descriptor](./contract-descriptor.md#Contract_descriptor).

docs/design/datacontracts/datacontracts_design.md

Co-authored-by: Copilot <[email protected]>

jkotas · 2025-07-28T15:42:30Z

docs/design/datacontracts/contract-descriptor.md

  },
+  "sub-descriptors":
+  {
+    "GCDescriptor": [ 1 ]


I do not think we want to have the sub-descriptors in the json. We do not necessarily know how many of them we are going to have and what their names are going to be at build time. (We happen to know for GC that motivated this change, but it would be nice to allow for optional dynamically loaded components.)

I thought having a separate section of pointers to sub-descriptors would be the cleanest way to implement them on the parser side. It would allow the parser to read the complete set of datadescriptors without outside information.

The name here isn't strictly required, but I left it in to help with debugging and to match the global spec. The parser machinery looks at the listed sub-descriptor pointers and if the values are non-null would recursively read in and merge the sub-descriptor. This would allow us to have sub-descriptor 'slots' that are not always used.

The alternative design I considered was to have the sub-descriptors be standard global values which are well-known to the relevant contracts. These contracts would use a new API on the Target to fetch this additional data. This would require a name and add more complexity to the Target, as its data stores would be mutable after creation.

The drawback is that the sub-descriptors couldn't be dynamically loaded (as you mention). I'm trying to understand whether that would be an issue. Given the cDAC operates on a paused target, the memory between data descriptor initialization and contract use should not change (except for writes initiated by the cDAC). If there is a data descriptor JSON that can be loaded (i.e., no conflicts), would there be a benefit to pre-reading it?

> The parser machinery looks at the listed sub-descriptor pointers and if the values are non-null would recursively read in and merge the sub-descriptor. This would allow us to have sub-descriptor 'slots' that are not always used.

It requires us to know all types of sub-descriptors that we may possibly reference upfront (when we are generating the json at build time). After giving it more thought, it should not be a problem in practice. It is very unlikely that we will allow extending the runtime in unknown ways. Consider this feedback resolved.

> The drawback is that the sub-descriptors couldn't be dynamically loaded (as you mention).

My concern was about dynamic loading at runtime. The difference is whether the runtime can load arbitrary unknown components dynamically, or whether the runtime can only load a known set of components dynamically. As I have said, I think it is fine to limit the runtime to known components.

> Given the cDAC operates on a paused target, the memory between data descriptor initialization and contract use should not change

Yes, this should not be a problem with what we have now. (My gut feel is that we may need to evolve the cDAC architecture to cache more and be less eager with pre-computing once we get to scenarios like single stepping, but that is a problem for the future.)

Formalizing sub-descriptors seems like unnecessary complexity and constraint to me. The existing cDAC code seems amenable enough to dynamically loading new descriptors.

For example a runtime contract can declare any arbitrary field of a data structure to be a contract descriptor pointer:

CDAC_TYPE_FIELD(GCDacVars, /*pointer*/, StandaloneGcContractDesc, ...)

And cDAC code can load that contract descriptor on the fly if it needs to use it:

    class GCContract : IGCContract
    {
        IGCContract _standaloneGC;

        GCContract(Target t)
        {
            // do normal cDac stuff to read the value of a field
            ulong standaloneContractDesc = GetDescriptorFromGCDacVars(t);
            if (standaloneContractDesc != 0)
            {
                ContractDescriptorTarget.TryCreate(
                    standaloneContractDesc,
                    GetReadDelegate(t),
                    GetGetThreadContextDelegate(t),
                    out ContractDescriptorTarget standaloneGcTarget);
                _standaloneGC = standaloneGcTarget.ContractRegistry.GCContract;
            }
        }

        public void EnumerateHeap()
        {
            if (_standaloneGC != null)
            {
                _standaloneGC.EnumerateHeap();
                return;
            }
            // do the normal built-in GC enumerate heap algorithm
        }
    }

We could simplify this further; it just shows the basic idea without diverging too far from how the code is currently structured. Is there any significant issue with not treating ContractDescriptorTarget as a singleton?

Currently, the contracts don't know about ContractDescriptorTarget or the read/write delegates. They interact with the target through the abstract Target class.

This change would be possible but would require adding some complexity to the managed side: either having multiple targets (and dealing with properly flushing and using the correct one) or merging the globals/types together.

The current plan is to always load a sub-descriptor for the GC contract, even when we use the default GC.

> The current plan is to always load a sub-descriptor for the GC contract, even when we use the default GC.

I think we'd be better off changing that plan and making light changes to the managed code instead. It's harder to evolve interfaces between components (the contract descriptor format) than it is to change their internal implementation details, so if complexity needs to be added somewhere, I think we should bias towards putting it in the managed cDAC implementation. Sub-descriptors also add constraints that we might find awkward later:

* They prevent dynamically loaded DLLs from defining new contracts. For example, if one day we thought the GC contract would be better factored as two separate contracts, it would be a breaking change for standalone GC that prevents it from being used on downlevel runtimes.

* They prevent dynamically loaded DLLs from having patterns other than singletons. For example, imagine we have a JIT plugin interface and we'd like the built-in JIT to compile some methods while the plugin JIT compiles others. We might want to access two instances of some JIT contract at the same time, not have one replace the other.

jkotas · 2025-07-28T19:45:14Z

docs/design/datacontracts/data_descriptor.md

 The data descriptor consists of:
 * a collection of type structure descriptors
 * a collection of global value descriptors
+* an optional collection of pointers to sub-contracts


> Optional

Should global value descriptors on the previous line be tagged as optional as well? If the (sub-)descriptor does not need any global values, I would expect that they can be missing, similar to how the list of sub-contracts can be missing.

Yes, ideally both the type structure descriptors and global value descriptors should be optional. The current infrastructure doesn't support that, but I can modify both the spec and infrastructure.

jkotas · 2025-07-28T19:52:02Z

docs/design/datacontracts/data_descriptor.md

+  },
+  "sub-descriptors": 
+  {
+    "GCDescriptor": [ 1 ] // indirect from aux data offset 1


Suggested change

"GCDescriptor": [ 1 ] // indirect from aux data offset 1

"GC": [ 1 ] // indirect from aux data offset 1

Nit: I do not think we need to repeat "descriptor" in the name. It is clear what it points to by being under sub-descriptors.

jkotas · 2025-07-28T20:23:04Z

docs/design/datacontracts/contract-descriptor.md

  },
+  "sub-descriptors":
+  {
+    "GCDescriptor": [ 1 ]


> The parser machinery looks at the listed sub-descriptor pointers and if the values are non-null would recursively read in and merge the sub-descriptor. This would allow us to have sub-descriptor 'slots' that are not always used.

It requires us to know all types of sub-descriptors that we may possibly reference upfront (when we are generating the json at build time). After giving it more thought, it should not be a problem in practice. It is very unlikely that we will allow extending the runtime in unknown ways. Consider this feedback resolved.

> The drawback is that the sub-descriptors couldn't be dynamically loaded (as you mention).

My concern was about dynamic loading at runtime. The difference is whether the runtime can load arbitrary unknown components dynamically, or whether the runtime can only load a known set of components dynamically. As I have said, I think it is fine to limit the runtime to known components.

> Given the cDAC operates on a paused target, the memory between data descriptor initialization and contract use should not change

Yes, this should not be a problem with what we have now. (My gut feel is that we may need to evolve the cDAC architecture to cache more and be less eager with pre-computing once we get to scenarios like single stepping, but that is a problem for the future.)

jkotas · 2025-07-28T20:31:21Z

docs/design/datacontracts/data_descriptor.md

+* a name
+* a pointer value
+
+If these values are non-null, the pointer represents another JSON data descriptor with the specification described in this document.


Suggested change

If these values are non-null, the pointer represents another JSON data descriptor with the specification described in this document.

If the value is non-null, the pointer points to another [contract descriptor](contract-descriptor.md#contract-descriptor-1).

I assume it will point to the DotNetRuntimeContractDescriptor that is specified in the other doc, not directly to JSON.

Also, we use DotNetRuntimeContractDescriptor for both the main contract export and the data structure that it points to. We may want to rename the data structure (e.g. to just ContractDescriptor) to avoid confusion now that there will be multiple instances of it in the system.

Yes, the plan was for it to point to the same type of structure (currently named DotNetRuntimeContractDescriptor) which the symbol points to.

jkotas · 2025-07-28T20:33:14Z

docs/design/datacontracts/data_descriptor.md

+
+If these values are non-null, the pointer represents another JSON data descriptor with the specification described in this document.
+
+When parsing a data descriptor with sub-descriptors each sub-descriptor should be parsed then its type, global, and contract values should be merged in. If any conflicts arise when merging in sub-descriptor data, this is an error and behavior is undefined.


> If any conflicts arise when merging in sub-descriptor data, this is an error and behavior is undefined.

This design means that the components involved need to be aware of each other to avoid conflicts. Just pointing it out.

dotnet-policy-service · 2025-08-26T14:03:12Z

Tagging subscribers to this area: @steveisok, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

max-charlamb · 2025-09-18T14:53:07Z

/ba-g docs only change

init docs for multi-descriptor setup

e80dd70

Copilot AI review requested due to automatic review settings July 28, 2025 13:45

max-charlamb marked this pull request as draft July 28, 2025 13:45

github-actions bot added the needs-area-label label Jul 28, 2025

dotnet-policy-service bot assigned max-charlamb Jul 28, 2025

Copilot AI reviewed Jul 28, 2025

Update docs/design/datacontracts/datacontracts_design.md

0b2811e

Co-authored-by: Copilot <[email protected]>

jkotas reviewed Jul 28, 2025

max-charlamb mentioned this pull request Aug 7, 2025

[cDAC] GC Contract #118050

Merged

1 task

max-charlamb added the documentation and area-Diagnostics-coreclr labels Aug 26, 2025

max-charlamb marked this pull request as ready for review August 26, 2025 14:02

update docs

98a4684

max-charlamb removed the needs-area-label label Aug 26, 2025

noahfalk approved these changes Sep 16, 2025

Merge branch 'main' into cdac-contract-proposal

14c8118