-
Notifications
You must be signed in to change notification settings - Fork 5.2k
[cDAC] Multi data-descriptor proposal #118126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces support for multi data-descriptor functionality to the Contract Data Access (cDAC) system. The primary purpose is to enable deferred data definition resolution through sub-descriptors when certain data definitions are not known at compile time but may be provided by external components.
Key changes:
- Adds support for string-type global values alongside existing primitive integer constants and pointers
- Introduces sub-descriptor pointers as a new optional component of data descriptors
- Updates the JSON format to include a "sub-descriptors" field with pointer-based references to external data descriptors
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
datacontracts_design.md | Updates global values to support strings and adds sub-descriptor concept documentation |
data_descriptor.md | Adds comprehensive specification for sub-descriptor descriptors, JSON format updates, and example usage |
contract-descriptor.md | Adds sub-descriptors field to example JSON and clarifies contract symbol discovery language |
Comments suppressed due to low confidence (1)
docs/design/datacontracts/data_descriptor.md:300
- The link reference uses 'contract-descriptor.md' which is inconsistent with the actual filename 'contract-descriptor.md' shown in the diff. However, this appears to be a correction from 'contract_descriptor.md' to match the actual hyphenated filename, so this is likely fixing an existing inconsistency.
descriptor](./contract-descriptor.md#Contract_descriptor).
Co-authored-by: Copilot <[email protected]>
}, | ||
"sub-descriptors": | ||
{ | ||
"GCDescriptor": [ 1 ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we want to have the sub-descriptors in the json. We do not necessarily know how many of them we are going to have and what their names are going to be at build time. (We happen to know for GC that motivated this change, but it would be nice to allow for optional dynamically loaded components.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought having a separate section of pointers to sub-descriptors would be the cleanest way to implement them on the parser side. It would allow the parser to read the complete set of datadescriptors without outside information.
The name here isn't strictly required but I left it in for help debugging and to match the global spec. The parser machinery looks at the listed sub-descriptor pointers and if the values are non-null would recursively read in and merge the sub-descriptor. This would allow us to have sub-descriptor 'slots' that are not always used.
The alternative design I considered was to have the sub-descriptors be standard global values which are well-known to the relevant contracts. These contracts would use a new API on the Target
to fetch this addition data. This would require a name and add more complexity to the Target
as it's datastores would be mutable after creation.
The drawback is that the sub-descriptors couldn't be dynamically loaded (as you mention). I'm trying to understand if that would be an issue. Given the cDAC operates on a paused target, the memory between data descriptor initialization and contract use should not change (except for writes initiated by the cDAC), if there is a data descriptor JSON that can be loaded (ie no conflicts) would there be a benefit of pre-reading it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser machinery looks at the listed sub-descriptor pointers and if the values are non-null would recursively read in and merge the sub-descriptor. This would allow us to have sub-descriptor 'slots' that are not always used.
It requires us to know all types of sub-descriptors that we may possibly reference upfront (when we are generating the json at build time). After giving it more thought, it should not be a problem in practice. It is very unlikely that we will allow extending the runtime in unknown ways. Consider this feedback resolved.
The drawback is that the sub-descriptors couldn't be dynamically loaded (as you mention).
My concern was about dynamic loading at runtime. The difference is whether the runtime can load arbitrary unknown components dynamically, or whether the runtime can only load a known set of components dynamically. As I have said, I think it is fine to limit the runtime to known components.
Given the cDAC operates on a paused target, the memory between data descriptor initialization and contract use should not change
Yes, this should not be a problem with what we have now. (My gut feel is that we may need it to evolve the cDAC architecture to cache more and be less eager with pre-computing once we get to scenarios like single stepping, but that is a problem for future.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formalizing sub-descriptors seems like unnecessary complexity and constraint to me. The existing cDAC code seems amenable enough to dynamically loading new descriptors.
For example a runtime contract can declare any arbitrary field of a data structure to be a contract descriptor pointer:
CDAC_TYPE_FIELD(GCDacVars, /*pointer*/, StandaloneGcContractDesc, ...)
And cDAC code can load that contract descriptor on the fly if it needs to use it:
class GCContract : IGCContract
{
IGCContract _standaloneGC;
GCContract(Target t)
{
// do normal cDac stuff to read the value of a field
ulong standaloneContractDesc = GetDescriptorFromGCDacVars(t);
if(standaloneContractDesc != 0)
{
ContractDescriptorTarget.TryCreate(
standaloneContractDesc,
GetReadDelegate(target),
GetGetThreadContextDelegate(target),
out ContractDescriptorTarget standaloneGcTarget);
standaloneGC = standaloneGcTarget.ContractRegistry.GCContract;
}
}
public void EnumerateHeap()
{
if(_standaloneGC != null)
{
_standaloneGC.EnumerateHeap();
return;
}
// do the normal built-in GC enumerate heap algorithm
}
}
We could further simplify this a bit, it just shows the basic idea without diverging too far from how the code is currently structured. Is there any significant issue not treating ContractDescriptorTarget as a singleton?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the contracts don't know about ContractDescriptorTarget
or the read/write delegates. They interact with the target through the abstract Target
class.
This change would be possible but require adding some complexity to the managed side. Either having multiple targets (and dealing with properly flushing and using the correct one) or merging the globals/types together.
The current plan is to always load a sub-descriptor for the GC contract, even when we use the default GC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current plan is to always load a sub-descriptor for the GC contract, even when we use the default GC.
I think we'd be better off changing that plan and doing light changes to the managed code instead. Its harder to evolve interfaces between components (the contract descriptor format) than it is to change their internal implementation details so if complexity needs to be added somewhere I think we should bias towards putting it in the managed cDac implementation. Sub-descriptors also add constraints that we might find awkward later:
- They prevent dynamic loaded dlls from defining new contracts. For example if one day we thought the GC contract would be better factored as two separate contracts it would be a breaking change for standalone GC that prevents it from being used on downlevel runtimes.
- They prevent dynamic loaded dlls from having patterns other than singletons. For example imagine we have JIT plugin interface and we'd like to use the built-in JIT to compile some methods and the plugin JIT compiles other methods. We might want to access two instances of some JIT contract at the same time, not have one replace the other.
The data descriptor consists of: | ||
* a collection of type structure descriptors | ||
* a collection of global value descriptors | ||
* an optional collection of pointers to sub-contracts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional
Should global value descriptors on the previous line be tagged as optional as well? If the (sub-)descriptor does not need any global values, I would expect that it can be missing - similar to the list of sub-contracts can be missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ideally both the type structure descriptors and global value descriptors should be optional. The current infrastructure doesn't support that, but I can modify both the spec and infrastructure.
}, | ||
"sub-descriptors": | ||
{ | ||
"GCDescriptor": [ 1 ] // indirect from aux data offset 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"GCDescriptor": [ 1 ] // indirect from aux data offset 1 | |
"GC": [ 1 ] // indirect from aux data offset 1 |
Nit: I do not think we need to repeat "descriptor" in the name. It is clear what it points to by being under sub-descriptors
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
}, | ||
"sub-descriptors": | ||
{ | ||
"GCDescriptor": [ 1 ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser machinery looks at the listed sub-descriptor pointers and if the values are non-null would recursively read in and merge the sub-descriptor. This would allow us to have sub-descriptor 'slots' that are not always used.
It requires us to know all types of sub-descriptors that we may possibly reference upfront (when we are generating the json at build time). After giving it more thought, it should not be a problem in practice. It is very unlikely that we will allow extending the runtime in unknown ways. Consider this feedback resolved.
The drawback is that the sub-descriptors couldn't be dynamically loaded (as you mention).
My concern was about dynamic loading at runtime. The difference is whether the runtime can load arbitrary unknown components dynamically, or whether the runtime can only load a known set of components dynamically. As I have said, I think it is fine to limit the runtime to known components.
Given the cDAC operates on a paused target, the memory between data descriptor initialization and contract use should not change
Yes, this should not be a problem with what we have now. (My gut feel is that we may need it to evolve the cDAC architecture to cache more and be less eager with pre-computing once we get to scenarios like single stepping, but that is a problem for future.)
* a name | ||
* a pointer value | ||
|
||
If these values are non-null, the pointer represents another JSON data descriptor with the specification described in this document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these values are non-null, the pointer represents another JSON data descriptor with the specification described in this document. | |
If the value is non-null, the pointer points to another [contract descriptor](contract-descriptor.md#contract-descriptor-1). |
I assume it will point to DotNetRuntimeContractDescriptor
that speced in the other doc, not directly to JSON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we use DotNetRuntimeContractDescriptor
for both the main contract export and the data structure that it points to. We may want to rename the data structure (e.g. to just ContractDescriptor
) to avoid confusion now that there will be multiple instances of it in the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the plan was for it to point to the same type of structure (currently named DotNetRuntimeContractDescriptor
) which the symbol points to.
|
||
If these values are non-null, the pointer represents another JSON data descriptor with the specification described in this document. | ||
|
||
When parsing a data descriptor with sub-descriptors each sub-descriptor should be parsed then its type, global, and contract values should be merged in. If any conflicts arise when merging in sub-descriptor data, this is an error and behavior is undefined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any conflicts arise when merging in sub-descriptor data, this is an error and behavior is undefined.
This design means that the components involved need to be aware of each other to avoid conflicts. Just pointing it out.
Tagging subscribers to this area: @steveisok, @dotnet/dotnet-diag |
/ba-g docs only change |
Why
As we move to completely implement the API's used by SOS in the cDAC, we need to access data structures and global values defined in the GC. These are special because the GC can either be internal to
coreclr.dll
or an externalclrgc.dll
. Given this option, it is impossible to completely define GC data structures atcoreclr.dll
compile time and store them in the data-descriptor like we do for VM structures.Proposed Solution
Add support for a data-descriptor to have
sub-descriptors
. Data-descriptors will have a new optionalsub-descriptor
section which can contain pointers to sub-descriptors. Each sub-descriptor has the same spec as the main (existing) descriptor. The pointers act the same as the well-known symbol name for the main descriptor.Memory Layout
The
sub-descriptors
will be an additional (optional) section of the data-descriptor identical in spec to the existingglobals
with the exception that the value's type must be apointer
.Parsing
When parsing a data-descriptor, each sub-descriptor pointer is checked if it is
non-null
. If it isnon-null
, then the sub-descriptor should be parsed in the same way as the main descriptor.Types, globals, and contract versions are merged between the sub-descriptor and the parent descriptor.
Implemented in: #118050