From d392600cad3a0934e4b09592fab5f140703fe655 Mon Sep 17 00:00:00 2001
From: Aleksey Kliger
Date: Mon, 25 Mar 2024 16:21:08 -0400
Subject: [PATCH 01/14] data descriptor spec

---
 docs/design/datacontracts/data_descriptor.md  | 310 ++++++++++++++
 .../datacontracts/datacontracts_design.md     | 186 +++------
 docs/design/datacontracts/sample.blob.c       | 395 ++++++++++++++++++
 docs/design/datacontracts/sample.data.h       |  18 +
 4 files changed, 774 insertions(+), 135 deletions(-)
 create mode 100644 docs/design/datacontracts/data_descriptor.md
 create mode 100644 docs/design/datacontracts/sample.blob.c
 create mode 100644 docs/design/datacontracts/sample.data.h

diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md
new file mode 100644
index 00000000000000..c6d416a304e794
--- /dev/null
+++ b/docs/design/datacontracts/data_descriptor.md
@@ -0,0 +1,310 @@
+# Data Descriptors
+
+The [data contract](datacontracts_design.md) specification for .NET depends on each target .NET
+runtime describing a subset of its platform- and build-specific data structures to diagnostic
+tooling. The information is given meaning by algorithmic contracts that describe how the low-level
+layout of the memory of a .NET process corresponds to high-level abstract data structures that
+represent the conceptual state of a .NET process.
+
+In this document we give a logical description of a data descriptor together with two physical
+manifestations.
+
+The first physical format is used to publish well-known data descriptors in the `dotnet/runtime`
+repository. It is supposed to be machine- and human-readable. This format is not meant to be
+particularly concise and may be used for visualization, diagnostics, etc. Typically data
+descriptors in this form may be written by hand or with the aid of tooling.
+
+The second physical format is used to embed a data descriptor blob within a particular instance of
+a target runtime. It is meant to be machine-readable while minimizing the total space needed to
+store it. It is primarily meant to be read and written by tooling.
+
+## Logical descriptor
+
+Each logical descriptor exists within an implied /target architecture/ consisting of:
+* target architecture endianness (little endian or big endian)
+* target architecture pointer size (4 bytes or 8 bytes)
+
+The following /primitive types/ are assumed: int8, uint8, int16, uint16, int32, uint32, int64,
+uint64, nint, nuint, pointer, (UTF-8) string. The multi-byte types are in the target architecture
+endianness. The types `nint`, `nuint` and `pointer` have target architecture pointer size.
+
+The data descriptor consists of:
+* the data descriptor specification version
+* a collection of type structure descriptors
+* a collection of global value descriptors.
+
+## Data descriptor specification version
+
+This is the version of the physical data descriptor.
+
+## Types
+
+The types (both primitive types and structures described by structure descriptors) are classified as
+having either determinate or indeterminate size. Determinate sizes may be used for pointer
+arithmetic. Types with indeterminate size may not be. Note that some sizes may be determinate, but
+/target specific/. For example pointer types have a fixed size that varies by architecture.
+
+The following primitive types have indeterminate size: `string`.
+
+## Structure descriptors
+
+Each structure descriptor consists of:
+* a name
+* an optional size in bytes
+* a collection of field descriptors
+
+If the size is not given, the type has indeterminate size. The size may also be given explicitly as
+"indeterminate" to emphasize that the type has indeterminate size.
+
+The collection of field descriptors may be empty. In that case the type is opaque. The primitive
+types may be thought of as opaque (for example: on ARM64 `nuint` is an opaque 8 byte type and `int64`
+is another opaque 8 byte type; `string` is an opaque type of indeterminate size).
+
+Type names must be globally unique within a single logical descriptor.
+
+### Field descriptors
+
+Each field descriptor consists of:
+* a name
+* an offset in bytes from the beginning of the struct
+* a type or the special type `variable`
+
+The name of a field descriptor must be unique within the definition of a structure.
+
+The offset may be negative.
+
+Two or more fields may have the same offset, or offsets that imply that the underlying fields
+overlap. The field offsets need not be aligned using any sort of target-specific alignment rules.
+
+Each field's type may refer to one of the primitive types or to any other type defined in the logical descriptor.
+
+If the field's type is `variable` it represents a byte-array of indeterminate size starting at the given offset.
+
+If a structure descriptor contains at least one field of indeterminate size, the whole structure
+must have indeterminate size. Tooling is not required to, but may, signal a warning if a descriptor
+has a determinate size and contains indeterminate size fields.
+
+It is expected that tooling will signal a warning if a field specifies a type that does not appear
+in the logical descriptor.
+
+## Global value descriptors
+
+Each global value descriptor consists of:
+* a name
+* a type
+* a value
+
+The name of each global value must be unique within the logical descriptor.
+
+The type must be one of the determinate-size primitive types.
+
+The value must be an integral constant within the range of its type. Signed values are two's
+complement. Pointer values need not be aligned and need not point to addressable target memory.
+
+
+## Physical descriptors
+
+The physical descriptors are meant to describe /subsets/ of a logical descriptor and to compose.
+Each physical descriptor can name an ordered sequence of zero or more "baseline" descriptors, each of
+which is then considered to comprise a piece of the overall logical descriptor.
+
+Starting from a single physical descriptor, the "baseline" relationship forms a directed graph. It
+is an error for the graph to contain a cycle. The baseline relationship may form a DAG (that is: two
+or more nodes may refer to the same baseline).
+
+When constructing the logical descriptor, the DAG is traversed in a post-order traversal with each
+node visited at most once with baselines of a particular node visited from first to last.
+
+To form the logical descriptor the types are added in traversal order with later appearances
+augmenting earlier ones (fields are added or modified, sizes and offsets are overwritten). The
+global values are added in traversal order with later appearances overwriting previous ones.
+
+Rationale: if a baseline is included more than once, only the first inclusion counts. If a type
+appears in multiple physical descriptors, the later appearances may add more fields or change the
+offsets or determinate/indeterminate sizes of prior definitions. If a value appears multiple times,
+later definitions take precedence.
+
+**FIXME** do we really want a DAG? Are we ok with a linked list?
+
+## Physical JSON descriptor
+
+### Version
+
+This is version 0 of the physical descriptor.
+
+### Summary
+
+A data descriptor may be stored in the "JSON with comments" format.
+
+The toplevel dictionary will contain:
+
+* `"version": 0`
+* `"baseline": "FREEFORM STRING"` or `"baseline": ["FREEFORM STRING", ...]`
+* `"types": TYPE_ARRAY` see below
+* `"globals": VALUE_ARRAY` see below
+
+### Types
+
+The types will be in an array, with each type described by a dictionary containing keys:
+
+* `"name": "type name"` the name of each type
+* optional `"size": int | "indeterminate"` if omitted the size is indeterminate
+* optional `"fields": FIELD_ARRAY` if omitted same as a field array of length zero
+
+Each `FIELD_ARRAY` is an array of dictionaries each containing keys:
+
+* `"name": "field name"` the name of each field
+* `"type": "type name"` the name of a primitive type or another type defined in the same /logical/ descriptor
+* optional `"offset": int | "unknown"` the offset of the field or "unknown". If omitted, same as "unknown".
+
+Note that the logical descriptor does not contain "unknown" offsets.
+
+Rationale: "unknown" offsets may be used to document in the physical JSON descriptor that another
+physical descriptor in the "baseline" graph is expected to provide the offset of the field.
+
+### Global values
+
+The global values will be in an array, with each value described by a dictionary containing keys:
+
+* `"name": "global value name"` the name of the global value
+* `"type": "type name"` the type of the global value
+* optional `"value": VALUE | "unknown"` the value of the global value or "unknown". If omitted, same as "unknown".
+
+Note that the logical descriptor does not contain "unknown" values.
+
+The `VALUE` may be a JSON numeric constant integer or a string containing a signed or unsigned
+decimal or hex (with prefix `0x` or `0X`) integer constant. The constant must be within the range
+of the type of the global value.
+
+For pointer and nuint globals, the value may be assumed to fit in a 64-bit unsigned integer. For
+nint globals, the value may be assumed to fit in a 64-bit signed integer.
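A minimal C sketch of how a reader might parse the string form of a `VALUE` described above (signed or unsigned decimal, or hex with a `0x`/`0X` prefix). The name `cdac_parse_value` is hypothetical and not part of this patch. The sketch returns the raw 64-bit two's-complement payload and leaves the range check against the global's declared type to the caller, which is an assumption about how a reader would be layered.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse the string form of a global VALUE.
 * On success stores the 64-bit two's-complement payload in *out. */
static bool cdac_parse_value(const char *text, uint64_t *out)
{
    if (text == NULL || out == NULL)
        return false;

    /* Optional sign. */
    const char *digits = text;
    bool negative = false;
    if (*digits == '+' || *digits == '-') {
        negative = (*digits == '-');
        digits++;
    }

    /* Must start with a digit: rejects "", "unknown", whitespace, double signs. */
    if (!isdigit((unsigned char)digits[0]))
        return false;

    /* Only decimal and 0x/0X hex are allowed, so octal is never inferred. */
    int base = 10;
    if (digits[0] == '0' && (digits[1] == 'x' || digits[1] == 'X'))
        base = 16;

    errno = 0;
    char *end = NULL;
    unsigned long long magnitude = strtoull(digits, &end, base);
    if (errno == ERANGE || end == digits || *end != '\0')
        return false;

    if (negative) {
        /* Must fit in a signed 64-bit integer. */
        if (magnitude > (uint64_t)INT64_MAX + 1u)
            return false;
        *out = (uint64_t)0 - (uint64_t)magnitude; /* two's complement */
    } else {
        *out = (uint64_t)magnitude;
    }
    return true;
}
```

A reader would treat the literal string "unknown" separately before calling a helper like this, since it is not a numeric constant.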
+
+If the value is specified as "unknown" another physical descriptor in the "baseline" graph is
+expected to provide the value.
+
+## Physical binary blob descriptor
+
+### Version
+
+This is version 0 of the physical binary blob format.
+
+### Summary
+
+The binary blob format for a physical descriptor is expected to be stored in the memory space of a
+target .NET runtime and is thus likely to be the final descriptor in the "baseline" graph.
+
+It is likely that the physical descriptor will be a part of a build-time constant in the disk image
+of a .NET runtime, and the format is designed to be compact and specifiable as a (likely
+machine-generated) compile-time constant in a suitable source language.
+
+The data descriptor forms one part of an overall physical data contract descriptor in a target .NET
+runtime and as such this format does not specify a "magic number", or "well known symbol", or
+another means of identifying the blob within a target process. Additionally the version of the
+binary blob data descriptor is expected to be stored within the larger enclosing data contract
+descriptor and is not included here.
+
+### Blob
+
+Multi-byte values are in the target platform endianness.
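Because multi-byte values use the target's endianness rather than the reader's, a tool that reads the blob on a different host has to assemble integers byte by byte. A minimal sketch, assuming the reader already knows the target endianness from elsewhere; `cdac_read_u16` and `cdac_read_u32` are hypothetical names, not part of this patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: read a 16-bit value stored in target byte order. */
static uint16_t cdac_read_u16(const uint8_t *bytes, bool target_is_little_endian)
{
    if (target_is_little_endian)
        return (uint16_t)(bytes[0] | (bytes[1] << 8));
    return (uint16_t)((bytes[0] << 8) | bytes[1]);
}

/* Hypothetical helper: read a 32-bit value stored in target byte order. */
static uint32_t cdac_read_u32(const uint8_t *bytes, bool target_is_little_endian)
{
    if (target_is_little_endian)
        return (uint32_t)bytes[0]
             | ((uint32_t)bytes[1] << 8)
             | ((uint32_t)bytes[2] << 16)
             | ((uint32_t)bytes[3] << 24);
    return ((uint32_t)bytes[0] << 24)
         | ((uint32_t)bytes[1] << 16)
         | ((uint32_t)bytes[2] << 8)
         | (uint32_t)bytes[3];
}
```

Reading through byte arithmetic instead of casting the buffer to `uint32_t *` also sidesteps alignment concerns when the blob is read from an arbitrary file offset.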
+
+The format is:
+
+```c
+struct BinaryBlobDataDescriptor
+{
+    struct Directory {
+        uint32_t TypesStart;
+        uint32_t FieldPoolStart;
+
+        uint32_t GlobalValuesStart;
+        uint32_t NamesStart;
+
+        uint32_t TypeCount;
+        uint32_t FieldPoolCount;
+
+        uint32_t NamesPoolCount;
+        uint32_t Reserved0;
+
+        uint16_t TypeSpecSize;
+        uint16_t FieldSpecSize;
+
+        uint16_t GlobalSpecSize;
+        uint16_t Reserved1;
+    } Directory;
+    uint32_t BaselineName;
+    TypeSpec[TypeCount] Types;
+    FieldSpec[FieldPoolCount] FieldPool;
+    GlobalSpec[GlobalsCount] GlobalValues;
+    uint8_t[NamesPoolCount] NamesPool;
+};
+
+struct TypeSpec
+{
+    uint32_t Name;
+    uint32_t Fields;
+    uint16_t Size;
+};
+
+struct FieldSpec
+{
+    uint32_t Name;
+    uint16_t FieldOffset;
+    uint16_t Reserved;
+};
+
+struct GlobalSpec
+{
+    uint32_t Name;
+    uint32_t Reserved;
+    uint64_t Value;
+};
+```
+
+Rationale: the blob should be producible by including a specially formatted C header file multiple
+times with redefinitions of some macro names appearing in its content. Additionally the blob avoids
+pointers so that it may be read off from the on-disk representation of an object file.
+
+The blob begins with a directory that gives the relative offsets of the `Types`, `FieldPool`,
+`GlobalValues` and `NamesPool` fields of the blob. The number of elements of each of the arrays is
+next. This is followed by the sizes of the `TypeSpec`, `FieldSpec` and `GlobalSpec` structs.
+
+Rationale: If a `BinaryBlobDataDescriptor` is created via C macros, we want to embed the `offsetof`
+and `sizeof` of the components of the blob into the blob itself without having to account for any
+padding that the C compiler may introduce to enforce alignment.
+
+The baseline is specified as an offset into the names pool.
+
+The types are given as an array of `TypeSpec` elements. Each one contains an offset into the
+`NamesPool` giving the name of the type, an offset into the fields pool indicating the first
+specified field of the type, and the size of the type in bytes or 0 if it is indeterminate.
+
+The fields pool is given as a sequence of `FieldSpec` elements. The fields for each type are given
+in a contiguous subsequence and are terminated by a marker `FieldSpec` with a `Name` offset of 0.
+(Thus if a type has an empty sequence of fields it just points to a marker field spec directly.)
+For each field there is a name that gives an offset in the name pool and an offset indicating the
+field's offset. The field type is not given.
+
+Rationale: it is expected that the types of the fields were provided by a "baseline" data descriptor.
+
+The globals are given as a sequence of `GlobalSpec` elements. Each global has a name and a value.
+The types of the globals are not given.
+
+Rationale: it is expected that the types of the global values were provided by a "baseline" data descriptor.
+
+The `NamesPool` is a single sequence of UTF-8 bytes comprising the concatenation of all the type,
+field and global names including a terminating nul byte for each name. The same name may occur
+multiple times. A name may also be referenced by multiple types or multiple fields. (That is, a
+clever blob emitter may pool strings). The first name in the name pool is the empty string (with
+its nul byte).
+
+Rationale: we want to reserve the offset 0 as a marker.
+
+Names are referenced by giving their offset from the beginning of the `NamesPool`. Each name
+extends until the first nul byte encountered at or past the beginning of the name.
+
+
+## Example
+
+An example C header describing some data types is given in [sample.data.h](./sample.data.h). An
+example series of C macro preprocessor definitions that produces a constant blob `Blob` is given in
+[sample.blob.c](./sample.blob.c).
diff --git a/docs/design/datacontracts/datacontracts_design.md b/docs/design/datacontracts/datacontracts_design.md
index 8a52243fcdcb20..6cc424c356eced 100644
--- a/docs/design/datacontracts/datacontracts_design.md
+++ b/docs/design/datacontracts/datacontracts_design.md
@@ -16,15 +16,54 @@ The physical layout of this data is not defined in this document, but its practi
 
 The Data Contract Descriptor has a set of records of the following forms.
 
-### Global Values
+### Data descriptor
+
+The data descriptor is a logical entity that defines the layout of certain types relevant to one or
+more algorithmic contracts, as well as global values known to the target runtime that may be
+relevant to one or more algorithmic contracts.
+
+More details are provided in the [data descriptor spec](./data_descriptor.md). We highlight some important aspects below:
+
+#### Baseline data descriptor identifier
+
+An optional string identifying a well-known record of global values and data structure layouts. The
+identifier is an arbitrary string that could be used, for example, to tag a collection of globals
+and data structure layouts present in a particular release of a .NET runtime for a certain
+architecture (for example `net9.0-rc1/Release/linux-arm64`). Global values and data structure
+layouts present in the data contract descriptor take precedence over the baseline contract. This
+way variant builds can be specified as a delta over a baseline. For example, debug builds of
+CoreCLR that include additional fields in a `MethodTable` data structure could be based on the
+Release data descriptor augmented with new `MethodTable` and other structure descriptors.
+
+It is not a requirement that the baseline is chosen so that the additional "delta" is the smallest
+possible size, although for practical purposes that may be desired.
+ +Data descriptors are registered as "well known" by checking them into the main branch of +`dotnet/runtime` in the `docs/design/datacontracts/data/` directory in a format to be specified +later. The relative path name (with `/` as the path separator, if any) of the descriptor without +any extension is the identifier. (for example: +`/docs/design/datacontracts/data/net9.0-rc1/Release/linux-arm64.json` is the filename for the data +descriptor with identifier `net9.0-rc1/Release/linux-arm64`) + +#### Global Values Global values which can be of types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, pointer, nint, nuint, string) All global values have a string describing their name, and a value of one of the above types. +For instance, we will likely have a `TargetPointerSize` global value represented in a data descriptor. + +#### Data Structure Layout +Each data structure layout has a name for the type, followed by a list of fields. These fields can be of primitive types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, nint, nuint, pointer) or of another named data structure type. Each field descriptor provides the offset of the field, the name of the field, and the type of the field. + +Data structures may have a determinate size, specified in the descriptor, or an indeterminate size. +Determinate sizes are used by contracts for pointer arithmetic such as for iterating over arrays. +The determinate size of a structure may be larger than the sum of the sizes of the fields specified +in the data descriptor (that is, the data descriptor does not include every field and may not +include padding bytes). + ### Compatible Contract + Each compatible contract is described by a string naming the contract, and a uint32 version. It is an ERROR if multiple versions of a contract are specified in the contract descriptor. -### Data Structure Layout -Each data structure layout has a name for the type, followed by a list of fields. 
These fields can be of primitive types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, nint, nuint, pointer) or of another named data structure type. Each field descriptor provides the offset of the field, the name of the field, and the type of the field. ## Versioning of contracts Contracts are described an integer version number. A higher version number is not more recent, it just means different. In order to avoid conflicts, all contracts should be documented in the main branch of the dotnet repository with a version number which does not conflict with any other. It is expected that every version of every contract describes the same functionality/data layout/set of global values. @@ -34,20 +73,11 @@ Logically a contract may refer to another contract. If it does so, it will typic ## Types of contracts -There are 3 different types of contracts each representing a different phase of execution of the data contract system. +There are 2 different types of contracts each representing a different phase of execution of the data contract system. ### Composition contracts These contracts indicate the version numbers of other contracts. This is done to reduce the size of contract list needed in the Data Contract Descriptor. In general it is intended that as a runtime nears shipping, the product team can gather up all of the current versions of the contracts into a single magic value, which can be used to initialize most of the contract versions of the data contract system. A specific version number in the Data Contract Descriptor for a given contract will override any composition contracts specified in the Data Contract Descriptor. If there are multiple composition contracts in a Data Contract Descriptor which specify the same contract to have a different version, the first composition contract linearly in the Data Contract Descriptor wins. 
This is intended to allow for a composite contract for the architecture/os indepedent work, and a separate composite contract for the non independent work. If a contract is specified explicitly in the Data Contract Descriptor and a different version is specified via the composition contract mechanism, the explicitly specified contract takes precedence. -### Fixed value contracts -These contracts represent data which is entirely determined by the contract version + contract name. There are 2 subtypes of this form of contract. - -#### Global Value Contract -A global value contract specifies numbers which can be referred to by other contracts. If a global value is specified directly in the Data Contract Descriptor, then the global value defintion in the Data Contract Descriptor takes precedence. The intention is that these global variable contracts represent magic numbers and values which are useful for the operation of algorithmic contracts. For instance, we will likely have a `TargetPointerSize` global value represented via a contract, and things like `FEATURE_SUPPORTS_COM` can also be a global value contract, with a value of 1. - -#### Data Structure Definition Contract -A data structure definition contract defines a single type's physical layout. It MUST be named "MyDataStructureType_layout". If a data structure layout is specified directly in the Data Contract Descriptor, then the data structure defintion in the Data Contract Descriptor takes precedence. These contracts are responsible for declaring the field layout of individual fields. While not all versions of a data structure are required to have the same fields/type of fields, algorithms may be built targetting the union of the set of field types defined in the version of a given data structure definition contract. Access to a field which isn't defined on the current runtime will produce an error. 
- ### Algorithmic contracts Algorithmic contracts define how to process a given set of data structures to produce useful results. These are effectively code snippets which utilize the abstracted data structures provided by Data Structure Definition Contracts and Global Value Contract to produce useful output about a given program. Descriptions of these contracts may refer to functionality provided by other contracts to do their work. The algorithms provided in these contracts are designed to operate given the ability to read various primitive types and defined data structures from the process memory space, as well as perform general purpose computation. @@ -59,131 +89,17 @@ For working with data from the target process/other contracts, the following C# Best practice is to either write the algorithm in C# like psuedocode working on top of the [C# style api](contract_csharp_api_design.cs) or by reference to specifications which are not co-developed with the runtime, such as OS/architecture specifications. Within the contract algorithm specification, the intention is that all interesting api work is done by using an instance of the `Target` class. -## Arrangement of contract specifications in the repo - -Specs shall be stored in the repo in a set of directories. `docs/design/datacontracts` Each one of them shall be a seperate markdown file named with the name of contract. `docs/design/datacontracts/datalayout/.md` Every version of each contract shall be located in the same file to facilitate understanding how variations between different contracts work. - -### Global Value Contracts -The format of each contract spec shall be - - -``` -# Contract - -Insert description of contract, and what its for here. 
- -## Version - -Insert description (if possible) about what is interesting about this particular version of the contract - -### Values -| Global Name | Type | Value | -| --- | --- | --- | -| SomeGlobal | Int32 | 1 | -| SomeOtherGlobal | Int8 | 0 | - -## Version - -Insert description (if possible) about what is interesting about this particular version of the contract - -### Values -| Global Name | Type | Value | -| --- | --- | --- | -| SomeGlobal | Int32 | 1 | -| SomeOtherGlobal | Int8 | 1 | -``` - -Which should format like: -# Contract - -Insert description of contract, and what its for here. - -## Version - -Insert description (if possible) about what is interesting about this particular version of the contract - -### Values -| Global Name | Type | Value | -| --- | --- | --- | -| SomeGlobal | Int32 | 1 | -| SomeOtherGlobal | Int8 | 0 | - -## Version - -Insert description (if possible) about what is interesting about this particular version of the contract - -### Values -| Global Name | Type | Value | -| --- | --- | --- | -| SomeGlobal | Int32 | 1 | -| SomeOtherGlobal | Int8 | 1 | - - -### Data Structure Contracts -Data structure contracts describe the field layout of individual types in the that are referred to by algorithmic contracts. If one of the versions is marked as DEFAULT then that version exists if no specific version is specified in the Data Contract Descriptor. - -``` -# Contract _layout - -Insert description of type, and what its for here. - -## Version , DEFAULT - -Insert description (if possible) about what is interesting about this particular version of the contract - -### Structure Size -8 bytes +Algorithmic contracts may include specifications for numbers which can be referred to in the contract or by other contracts. The intention is that these global values represent magic numbers and values which are useful for the operation of algorithmic contracts. 
-### Fields -| Field Name | Type | Offset | -| --- | --- | --- | -| FirstField | Int32 | 0 | -| SecondField | Int64 | 4 | +While not all versions of a data structure are required to have the same fields/type of fields, +algorithms may be built targetting the union of the set of field types defined in the data structure +descriptors of possible target runtimes. Access to a field which isn't defined on the current +runtime will produce an error. -## Version -Insert description (if possible) about what is interesting about this particular version of the contract - -### Structure Size -16 bytes - -### Fields -| Field Name | Type | Offset | -| --- | --- | --- | -| FirstField | Int32 | 0 | -| SecondField | Int64 | 8 | -``` - -Which should format like: -# Contract _layout - -Insert description of type, and what its for here. - -## Version , DEFAULT - -Insert description (if possible) about what is interesting about this particular version of the contract - -### Structure Size -8 bytes - -### Fields -| Field Name | Type | Offset | -| --- | --- | --- | -| FirstField | Int32 | 0 | -| SecondField | Int64 | 4 | - -## Version - -Insert description (if possible) about what is interesting about this particular version of the contract - -### Structure Size -16 bytes +## Arrangement of contract specifications in the repo -### Fields -| Field Name | Type | Offset | -| --- | --- | --- | -| FirstField | Int32 | 0 | -| SecondField | Int64 | 8 | +Specs shall be stored in the repo in a set of directories. `docs/design/datacontracts` Each one of them shall be a seperate markdown file named with the name of contract. `docs/design/datacontracts/datalayout/.md` Every version of each contract shall be located in the same file to facilitate understanding how variations between different contracts work. 
### Algorthmic Contract @@ -326,4 +242,4 @@ int ComputeInterestingValue2(SomeStructUsedAsPartOfContractApi struct) else return struct.Value1; } -``` \ No newline at end of file +``` diff --git a/docs/design/datacontracts/sample.blob.c b/docs/design/datacontracts/sample.blob.c new file mode 100644 index 00000000000000..7919c9e93aa185 --- /dev/null +++ b/docs/design/datacontracts/sample.blob.c @@ -0,0 +1,395 @@ +#include +#include + +// example structures + +typedef struct ManagedThread ManagedThread; + +struct ManagedThread { + uint32_t GCHandle; + ManagedThread *next; +}; + +typedef struct ManagedThreadStore { + ManagedThread *threads; +} ManagedThreadStore; + +static ManagedThreadStore g_managedThreadStore; + +static const int FEATURE_COM = 1; + +// end example structures + +// begin blob definition + +struct TypeSpec +{ + uint32_t Name; + uint32_t Fields; + uint16_t Size; +}; + +struct FieldSpec +{ + uint32_t Name; + uint16_t FieldOffset; + uint16_t Reserved; +}; + +struct GlobalSpec +{ + uint32_t Name; + uint32_t Reserved; + uint64_t Value; +}; + +#define CONCAT(token1,token2) token1 ## token2 +#define CONCAT4(token1, token2, token3, token4) token1 ## token2 ## token3 ## token4 + +#define MAKE_TYPELEN_NAME(tyname) CONCAT(cdac_string_pool_typename__, tyname) +#define MAKE_FIELDLEN_NAME(tyname,membername) CONCAT4(cdac_string_pool_membername__, tyname, __, membername) +#define MAKE_GLOBALLEN_NAME(globalname) CONCAT(cdac_string_pool_globalname__, globalname) + +// define a struct where the size of each field is the length of some string. we will use offsetof to get +// the offset of each struct element, which will be equal to the offset of the beginning of that string in the +// string pool. 
+struct CDacStringPoolSizes +{ + char cdac_string_pool_nil; // make the first real string start at offset 1 + // include 1 + for the nul +#define DECL_LEN(membername,len) char membername[1 + (len)]; +#define CDAC_BASELINE(name) DECL_LEN(cdac_string_pool_baseline_, (sizeof(name))) +#define CDAC_TYPES_BEGIN() +#define CDAC_TYPE_BEGIN(name) DECL_LEN(MAKE_TYPELEN_NAME(name), sizeof(#name)) +#define CDAC_TYPE_INDETERMINATE(name) +#define CDAC_TYPE_SIZE(size) +#define CDAC_TYPE_FIELD(tyname,membername,offset) DECL_LEN(MAKE_FIELDLEN_NAME(tyname,membername), sizeof(#membername)) +#define CDAC_TYPE_END(name) +#define CDAC_TYPES_END() +#define CDAC_GLOBALS_BEGIN() +#define CDAC_GLOBAL(name,value) DECL_LEN(MAKE_GLOBALLEN_NAME(name), sizeof(#name)) +#define CDAC_GLOBALS_END() +#include "sample.data.h" +#undef CDAC_BASELINE +#undef CDAC_TYPES_BEGIN +#undef CDAC_TYPES_END +#undef CDAC_TYPE_BEGIN +#undef CDAC_TYPE_INDETERMINATE +#undef CDAC_TYPE_SIZE +#undef CDAC_TYPE_FIELD +#undef CDAC_TYPE_END +#undef DECL_LEN +#undef CDAC_GLOBALS_BEGIN +#undef CDAC_GLOBAL +#undef CDAC_GLOBALS_END +}; + +#define GET_TYPE_NAME(name) offsetof(struct CDacStringPoolSizes, MAKE_TYPELEN_NAME(name)) +#define GET_FIELD_NAME(tyname,membername) offsetof(struct CDacStringPoolSizes, MAKE_FIELDLEN_NAME(tyname,membername)) +#define GET_GLOBAL_NAME(globalname) offsetof(struct CDacStringPoolSizes, MAKE_GLOBALLEN_NAME(globalname)) + +// count the types +enum +{ + CDacBlobTypesCount = +#define CDAC_BASELINE(name) 0 +#define CDAC_TYPES_BEGIN() +#define CDAC_TYPE_BEGIN(name) + 1 +#define CDAC_TYPE_INDETERMINATE(name) +#define CDAC_TYPE_SIZE(size) +#define CDAC_TYPE_FIELD(tyname,membername,offset) +#define CDAC_TYPE_END(name) +#define CDAC_TYPES_END() +#define CDAC_GLOBALS_BEGIN() +#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBALS_END() +#include "sample.data.h" +#undef CDAC_BASELINE +#undef CDAC_TYPES_BEGIN +#undef CDAC_TYPES_END +#undef CDAC_TYPE_BEGIN +#undef CDAC_TYPE_INDETERMINATE +#undef CDAC_TYPE_SIZE 
+#undef CDAC_TYPE_FIELD +#undef CDAC_TYPE_END +#undef DECL_LEN +#undef CDAC_GLOBALS_BEGIN +#undef CDAC_GLOBAL +#undef CDAC_GLOBALS_END + , +}; + +// count the field pool size. +// there's 1 placeholder element at the start, and 1 endmarker after each type +enum +{ + CDacBlobFieldPoolCount = +#define CDAC_BASELINE(name) 1 +#define CDAC_TYPES_BEGIN() +#define CDAC_TYPE_BEGIN(name) +#define CDAC_TYPE_INDETERMINATE(name) +#define CDAC_TYPE_SIZE(size) +#define CDAC_TYPE_FIELD(tyname,membername,offset) + 1 +#define CDAC_TYPE_END(name) + 1 +#define CDAC_TYPES_END() +#define CDAC_GLOBALS_BEGIN() +#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBALS_END() +#include "sample.data.h" +#undef CDAC_BASELINE +#undef CDAC_TYPES_BEGIN +#undef CDAC_TYPES_END +#undef CDAC_TYPE_BEGIN +#undef CDAC_TYPE_INDETERMINATE +#undef CDAC_TYPE_SIZE +#undef CDAC_TYPE_FIELD +#undef CDAC_TYPE_END +#undef DECL_LEN +#undef CDAC_GLOBALS_BEGIN +#undef CDAC_GLOBAL +#undef CDAC_GLOBALS_END + , +}; + +// count the globals +enum +{ + CDacBlobGlobalsCount = +#define CDAC_BASELINE(name) 0 +#define CDAC_TYPES_BEGIN() +#define CDAC_TYPE_BEGIN(name) +#define CDAC_TYPE_INDETERMINATE(name) +#define CDAC_TYPE_SIZE(size) +#define CDAC_TYPE_FIELD(tyname,membername,offset) +#define CDAC_TYPE_END(name) +#define CDAC_TYPES_END() +#define CDAC_GLOBALS_BEGIN() +#define CDAC_GLOBAL(name,value) + 1 +#define CDAC_GLOBALS_END() +#include "sample.data.h" +#undef CDAC_BASELINE +#undef CDAC_TYPES_BEGIN +#undef CDAC_TYPES_END +#undef CDAC_TYPE_BEGIN +#undef CDAC_TYPE_INDETERMINATE +#undef CDAC_TYPE_SIZE +#undef CDAC_TYPE_FIELD +#undef CDAC_TYPE_END +#undef DECL_LEN +#undef CDAC_GLOBALS_BEGIN +#undef CDAC_GLOBAL +#undef CDAC_GLOBALS_END + , +}; + +#define MAKE_TYPEFIELDS_TYNAME(tyname) CONCAT(CDacFieldPoolTypeStart__, tyname) + +// offsets of each run of fields +// this looks like +// +// struct CDacFieldPoolSizes { +// char empty_field_spec[sizeof(struct FieldSpec)]; +// struct CDacFieldPoolTypeStart__MethodTable { +// char 
cdac_field_pool_member__MethodTable__GCHandle[sizeof(struct FieldSpec)]; +// char cdac_field_pool_member__MethodTable_endmarker[sizeof(struct FieldSpec)]; +// } CDacFieldPoolTypeStart__MethodTable; +// ... +// }; +// +// so that offsetof(struct CDacFieldPoolSizes, CDacFieldPoolTypeStart__MethodTable) will give the offset of the +// method table field descriptors in the run of fields +struct CDacFieldPoolSizes +{ + char empty_field_spec[sizeof(struct FieldSpec)]; // make all valid field specs non-zero +#define DECL_LEN(membername) char membername[sizeof(struct FieldSpec)]; +#define CDAC_BASELINE(name) +#define CDAC_TYPES_BEGIN() +#define CDAC_TYPE_BEGIN(name) struct MAKE_TYPEFIELDS_TYNAME(name) { +#define CDAC_TYPE_INDETERMINATE(name) +#define CDAC_TYPE_SIZE(size) +#define CDAC_TYPE_FIELD(tyname,membername,offset) DECL_LEN(CONCAT4(cdac_field_pool_member__, tyname, __, membername)) +#define CDAC_TYPE_END(name) DECL_LEN(CONCAT4(cdac_field_pool_member__, tyname, _, endmarker)) \ + } MAKE_TYPEFIELDS_TYNAME(name); +#define CDAC_TYPES_END() +#define CDAC_GLOBALS_BEGIN() +#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBALS_END() +#include "sample.data.h" +#undef CDAC_BASELINE +#undef CDAC_TYPES_BEGIN +#undef CDAC_TYPES_END +#undef CDAC_TYPE_BEGIN +#undef CDAC_TYPE_INDETERMINATE +#undef CDAC_TYPE_SIZE +#undef CDAC_TYPE_FIELD +#undef CDAC_TYPE_END +#undef DECL_LEN +#undef CDAC_GLOBALS_BEGIN +#undef CDAC_GLOBAL +#undef CDAC_GLOBALS_END +}; + +#define GET_TYPE_FIELDS(tyname) offsetof(struct CDacFieldPoolSizes, MAKE_TYPEFIELDS_TYNAME(tyname)) + +struct BinaryBlobDataDescriptor +{ + struct Directory { + uint32_t TypesStart; + uint32_t FieldPoolStart; + + uint32_t GlobalValuesStart; + uint32_t NamesStart; + + uint32_t TypeCount; + uint32_t FieldPoolCount; + + uint32_t NamesPoolCount; + uint32_t Reserved0; + + uint16_t TypeSpecSize; + uint16_t FieldSpecSize; + + uint16_t GlobalSpecSize; + uint16_t Reserved1; + } Directory; + uint32_t BaselineName; + struct TypeSpec 
Types[CDacBlobTypesCount]; + struct FieldSpec FieldPool[CDacBlobFieldPoolCount]; + struct GlobalSpec GlobalValues[CDacBlobGlobalsCount]; + uint8_t NamesPool[sizeof(struct CDacStringPoolSizes)]; +}; + +struct MagicAndBlob { + char magic[4]; + struct BinaryBlobDataDescriptor Blob; +}; + +const struct MagicAndBlob Blob = { + .magic = "DAC", + .Blob = { + .Directory = { + .TypesStart = offsetof(struct BinaryBlobDataDescriptor, Types), + .FieldPoolStart = offsetof(struct BinaryBlobDataDescriptor, FieldPool), + .GlobalValuesStart = offsetof(struct BinaryBlobDataDescriptor, GlobalValues), + .TypeCount = CDacBlobTypesCount, + .FieldPoolCount = CDacBlobFieldPoolCount, + .NamesPoolCount = sizeof(struct CDacStringPoolSizes), + .TypeSpecSize = sizeof(struct TypeSpec), + .FieldSpecSize = sizeof(struct FieldSpec), + .GlobalSpecSize = sizeof(struct GlobalSpec), + }, + .BaselineName = offsetof(struct CDacStringPoolSizes, cdac_string_pool_baseline_), + .NamesPool = ("\0" // starts with a nul +#define CDAC_BASELINE(name) name "\0" +#define CDAC_TYPES_BEGIN() +#define CDAC_TYPE_BEGIN(name) #name "\0" +#define CDAC_TYPE_INDETERMINATE(name) +#define CDAC_TYPE_SIZE(size) +#define CDAC_TYPE_FIELD(tyname,membername,offset) #membername "\0" +#define CDAC_TYPE_END(name) +#define CDAC_TYPES_END() +#define CDAC_GLOBALS_BEGIN() +#define CDAC_GLOBAL(name,value) #name "\0" +#define CDAC_GLOBALS_END() +#include "sample.data.h" +#undef CDAC_BASELINE +#undef CDAC_TYPES_BEGIN +#undef CDAC_TYPES_END +#undef CDAC_TYPE_BEGIN +#undef CDAC_TYPE_INDETERMINATE +#undef CDAC_TYPE_SIZE +#undef CDAC_TYPE_FIELD +#undef CDAC_TYPE_END +#undef DECL_LEN +#undef CDAC_GLOBALS_BEGIN +#undef CDAC_GLOBAL +#undef CDAC_GLOBALS_END + ), + .FieldPool = { +#define CDAC_BASELINE(name) {0,}, +#define CDAC_TYPES_BEGIN() +#define CDAC_TYPE_BEGIN(name) +#define CDAC_TYPE_INDETERMINATE(name) +#define CDAC_TYPE_SIZE(size) +#define CDAC_TYPE_FIELD(tyname,membername,offset) { \ + .Name = GET_FIELD_NAME(tyname,membername), \ + 
.FieldOffset = offset, \ +}, +#define CDAC_TYPE_END(name) { 0, }, +#define CDAC_TYPES_END() +#define CDAC_GLOBALS_BEGIN() +#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBALS_END() +#include "sample.data.h" +#undef CDAC_BASELINE +#undef CDAC_TYPES_BEGIN +#undef CDAC_TYPES_END +#undef CDAC_TYPE_BEGIN +#undef CDAC_TYPE_INDETERMINATE +#undef CDAC_TYPE_SIZE +#undef CDAC_TYPE_FIELD +#undef CDAC_TYPE_END +#undef DECL_LEN +#undef CDAC_GLOBALS_BEGIN +#undef CDAC_GLOBAL +#undef CDAC_GLOBALS_END + }, + .Types = { +#define CDAC_BASELINE(name) +#define CDAC_TYPES_BEGIN() +#define CDAC_TYPE_BEGIN(name) { \ + .Name = GET_TYPE_NAME(name), \ + .Fields = GET_TYPE_FIELDS(name), +#define CDAC_TYPE_INDETERMINATE(name) .Size = 0, +#define CDAC_TYPE_SIZE(size) .Size = size, +#define CDAC_TYPE_FIELD(tyname,membername,offset) +#define CDAC_TYPE_END(name) }, +#define CDAC_TYPES_END() +#define CDAC_GLOBALS_BEGIN() +#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBALS_END() +#include "sample.data.h" +#undef CDAC_BASELINE +#undef CDAC_TYPES_BEGIN +#undef CDAC_TYPES_END +#undef CDAC_TYPE_BEGIN +#undef CDAC_TYPE_INDETERMINATE +#undef CDAC_TYPE_SIZE +#undef CDAC_TYPE_FIELD +#undef CDAC_TYPE_END +#undef DECL_LEN +#undef CDAC_GLOBALS_BEGIN +#undef CDAC_GLOBAL +#undef CDAC_GLOBALS_END + }, + .GlobalValues = { +#define CDAC_BASELINE(name) +#define CDAC_TYPES_BEGIN() +#define CDAC_TYPE_BEGIN(name) +#define CDAC_TYPE_INDETERMINATE(name) +#define CDAC_TYPE_SIZE(size) +#define CDAC_TYPE_FIELD(tyname,membername,offset) +#define CDAC_TYPE_END(name) +#define CDAC_TYPES_END() +#define CDAC_GLOBALS_BEGIN() +#define CDAC_GLOBAL(name,value) { .Name = GET_GLOBAL_NAME(name), .Value = value }, +#define CDAC_GLOBALS_END() +#include "sample.data.h" +#undef CDAC_BASELINE +#undef CDAC_TYPES_BEGIN +#undef CDAC_TYPES_END +#undef CDAC_TYPE_BEGIN +#undef CDAC_TYPE_INDETERMINATE +#undef CDAC_TYPE_SIZE +#undef CDAC_TYPE_FIELD +#undef CDAC_TYPE_END +#undef DECL_LEN +#undef CDAC_GLOBALS_BEGIN +#undef CDAC_GLOBAL +#undef 
CDAC_GLOBALS_END + }, + + } +}; + +// end blob definition diff --git a/docs/design/datacontracts/sample.data.h b/docs/design/datacontracts/sample.data.h new file mode 100644 index 00000000000000..18f1a454035264 --- /dev/null +++ b/docs/design/datacontracts/sample.data.h @@ -0,0 +1,18 @@ +CDAC_BASELINE("net9.0-rc1/Release/osx-arm64") +CDAC_TYPES_BEGIN() + +CDAC_TYPE_BEGIN(ManagedThread) +CDAC_TYPE_INDETERMINATE(ManagedThread) +CDAC_TYPE_FIELD(ManagedThread, GCHandle, offsetof(ManagedThread,GCHandle)) +CDAC_TYPE_END(ManagedThread) + +CDAC_TYPE_BEGIN(GCHandle) +CDAC_TYPE_SIZE(sizeof(intptr_t)) +CDAC_TYPE_END(GCHandle) + +CDAC_TYPES_END() + +CDAC_GLOBALS_BEGIN() +CDAC_GLOBAL(ManagedThreadStore, (uint64_t)(uintptr_t)&g_managedThreadStore) +CDAC_GLOBAL(FeatureCOMFlag, FEATURE_COM) +CDAC_GLOBALS_END() From 980144aba4ecf4fe47adbf736ae4f9d3c6b3c2b1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleksey=20Kliger=20=28=CE=BBgeek=29?= Date: Tue, 26 Mar 2024 11:15:57 -0400 Subject: [PATCH 02/14] fix typos Co-authored-by: Aaron Robinson Co-authored-by: Jan Kotas Co-authored-by: Noah Falk --- docs/design/datacontracts/data_descriptor.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index c6d416a304e794..c4c92fc5109823 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -11,7 +11,7 @@ manifestations. The first physical format is used to publish well-known data descriptors in the `dotnet/runtime` repository. It is supposed to be machine- and human-readable. This format is not meant to be -particularly consise and may be used for visualization, diagnostics, etc. Typically data +particularly concise and may be used for visualization, diagnostics, etc. Typically data descriptors in this form may be written by hand or with the aid of tooling. 
The second physical format is used to embed a data descriptor blob within a particularly instance of @@ -35,12 +35,12 @@ The data descriptor consists of: ## Data descriptor specification version -This is the version of the phsyical data descriptor. +This is the version of the physical data descriptor. ## Types The types (both primitive types and structures described by structure descriptors) are classified as -having either determinate or indeterminate type. Determinate sizes may be used for pointer +having either determinate or indeterminate size. Determinate sizes may be used for pointer arithmetic. Types with indeterminate size may not be. Note that some sizes may be determinate, but /target specific/. For example pointer types have a fixed size that varies by architecture. @@ -73,7 +73,7 @@ The name of a field descriptor must be unique within the definition of a structu The offset may be negative. -Two or more fields may have the same offets or imply that the underlying fields overlap. The field +Two or more fields may have the same offsets or imply that the underlying fields overlap. The field offsets need not be aligned using any sort of target-specific alignment rules. Each field's type may refer to one of the primitive types or to any other type defined in the logical descriptor. @@ -99,12 +99,12 @@ The name of each global value must be unique within the logical descriptor. The type must be one of the determinate-size primitive types. The value must be a integral constant within the range of its type. Signed values are two's -complement. Pointer values need not be aligned and need not point to adressable target memory. +complement. Pointer values need not be aligned and need not point to addressable target memory. ## Physical descriptors -The phsyical descriptors are meant to describe /subsets/ of a logical descriptor and to compose. +The physical descriptors are meant to describe /subsets/ of a logical descriptor and to compose. 
Each physical descriptor can name an ordered sequence of zero or more "baseline" descriptor which is then considered to comprise a piece of the overall logical descriptor. From 67eee3f5f42f2ff1370474314c1d1d545a8e4547 Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Tue, 26 Mar 2024 14:43:42 -0400 Subject: [PATCH 03/14] Add field and global types to physical blob descriptor --- docs/design/datacontracts/data_descriptor.md | 59 ++++++++++------- docs/design/datacontracts/sample.blob.c | 67 ++++++++++---------- docs/design/datacontracts/sample.data.h | 11 +++- 3 files changed, 79 insertions(+), 58 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index c4c92fc5109823..8f13e14edb414e 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -25,7 +25,7 @@ Each logical descriptor exists within an implied /target architecture/ consisiti * target architecture pointer size (4 bytes or 8 bytes) The following /primitive types/ are assumed: int8, uint8, int16, uint16, int32, uint32, int64, -uint64, nint, nuint, pointer, (UTF-8) string. The multi-byte types are in the target architecture +uint64, nint, nuint, pointer. The multi-byte types are in the target architecture endianness. The types `nint`, `nuint` and `pointer` have target architecture pointer size. The data descriptor consists of: @@ -44,8 +44,6 @@ having either determinate or indeterminate size. Determinate sizes may be used arithmetic. Types with indeterminate size may not be. Note that some sizes may be determinate, but /target specific/. For example pointer types have a fixed size that varies by architecture. -The following primitive types have indeterminate size: `string`. - ## Structure descriptors Each structure descriptor consists of: @@ -67,19 +65,14 @@ Type names must be globally unique within a single logical descriptor. 
Each field descriptor consists of: * a name * an offset in bytes from the beginning of the struct -* a type or the special type `variable` The name of a field descriptor must be unique within the definition of a structure. -The offset may be negative. - Two or more fields may have the same offsets or imply that the underlying fields overlap. The field offsets need not be aligned using any sort of target-specific alignment rules. Each field's type may refer to one of the primitive types or to any other type defined in the logical descriptor. -If the field's type is `variable` it represents a byte-array of indeterminate size starting at the given offset. - If a structure descriptor contains at least one field of indeterminate size, the whole structure must have indeterminate size. Tooling is not required to, but may, signal a warning if a descriptor has a determinate size and contains indeterminate size fields. @@ -98,8 +91,9 @@ The name of each global value must be unique within the logical descriptor. The type must be one of the determinate-size primitive types. -The value must be a integral constant within the range of its type. Signed values are two's -complement. Pointer values need not be aligned and need not point to addressable target memory. +The value must be an integral constant within the range of its type. Signed values use the target's +natural encoding. Pointer values need not be aligned and need not point to addressable target +memory. ## Physical descriptors @@ -188,6 +182,29 @@ expected to provide the value. This is version 0 of the physical binary blob format. +### Design requirements + +The design of the physical binary blob descriptor is constrained by the following requirements: +* The binary blob should be easy to process by examining an object file on disk - even if the object + file is for a foreign architecture/OS. It should be possible to read the binary blob purely by + looking at the bytes. 
Tooling should be able to analyze the blob without having to understand + relocation entries, dwarf debug info, symbols etc. +* It should be possible to produce the blob using the native C/C++/NativeAOT compiler for a given + target/architecture. In particular for a runtime written in C, the binary blob should be + constructible using C idioms. If the C compiler needs to pad or align the data, the blob format + should provide a way to iterate the blob contents without having to know anything abotu the target + platform ABI or C compiler conventions. + +This leads to the following overall strategy for the design: +* The physical blob is "self-contained": using pointers would mean that the encoding of the blob + would have relocations applied to it, which would preclude reading the blob out of of an object + file without understanding the object file format. +* The physical blob must be "self-describing": If the C compiler adds padding or alignment, the blob + descriptor must contain information for how to skip the pading/alignment data. +* The physical blob must be constructible using "lowest common denominator" target toolchain + tooling - the C preprocessor. That doesn't mean that tooling _must_ use the C preprocessor to + generate the blob, but the format must not exceed the capabilities of the C preprocessor. 
+ ### Summary The binary blob format for a physical descriptor is expected to be stored in the memory space of a @@ -223,13 +240,11 @@ struct BinaryBlobDataDescriptor uint32_t FieldPoolCount; uint32_t NamesPoolCount; - uint32_t Reserved0; - - uint16_t TypeSpecSize; - uint16_t FieldSpecSize; - uint16_t GlobalSpecSize; - uint16_t Reserved1; + uint8_t TypeSpecSize; + uint8_t FieldSpecSize; + uint8_t GlobalSpecSize; + uint8_t Reserved0; } uint32_t BaselineName; TypeSpec[TypeCount] Types; @@ -248,29 +263,27 @@ struct TypeSpec struct FieldSpec { uint32_t Name; + uint32_t TypeName; uint16_t FieldOffset; - uint16_t Reserved; }; struct GlobalSpec { uint32_t Name; - uint32_t Reserved; + uint32_t TypeName; uint64_t Value; }; ``` -Rationale: the blob should be producable by including a specially formatted C header file multiple -times with redefinitions of some macro names appearing in its content. Additionally the blob avoids -pointers so that it may be read off from the on-disk representation of an object file. - The blob begins with a directory that gives the relative offsets of the `Types`, `FieldPool`, `GlobalValues` and `Names` fields of the blob. The number of elements of each of the arrays is next. This is followed by the sizes of the `TypeSpec`, `FieldSpec` and `GlobalSpec` structs. Rationale: If a `BinaryBlobDataDescriptor` is created via C macros, we want to embed the `offsetof` and `sizeof` of the components of the blob into the blob itself without having to account for any -padding that the C compiler may introduce to enfore alignment. +padding that the C compiler may introduce to enfore alignment. Additionally the `Directory` tries +to follow a common C alignment rule (we don't want padding introduced in the directory itself): +N-byte members are aligned to start on N-byte boundaries. The baseline is specified as an offset into the names pool. 
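Reviewer note on the directory described above: a reader is expected to take element strides from the blob itself (`FieldSpecSize` and friends) rather than from its own `sizeof`, so that compiler padding in the target does not matter. The sketch below is a hypothetical reader and is not code from this patch series; `read_u32`, `count_fields` and `demo_count_fields` are illustrative names. It assumes that `Name` is the first member of `FieldSpec`, that `TypeSpec.Fields` is a byte offset relative to `FieldPoolStart` (as `GET_TYPE_FIELDS` in `sample.blob.c` suggests), and that the reader shares the target's endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a uint32_t at an arbitrary byte offset. memcpy avoids any alignment
 * assumption. Assumption: reader and target have the same endianness. */
static uint32_t read_u32(const uint8_t *blob, size_t offset)
{
    uint32_t value;
    memcpy(&value, blob + offset, sizeof value);
    return value;
}

/* Count the fields in the run that starts run_offset bytes into the field
 * pool. The stride is Directory.FieldSpecSize as stored in the blob, never
 * the reader's own sizeof(struct FieldSpec). A FieldSpec whose Name is 0
 * terminates the run. Assumption: Name is the first member of FieldSpec. */
static size_t count_fields(const uint8_t *blob, size_t blob_size,
                           uint32_t field_pool_start, uint32_t field_spec_size,
                           uint32_t run_offset)
{
    size_t count = 0;
    size_t pos = (size_t)field_pool_start + run_offset;

    while (field_spec_size >= sizeof(uint32_t) &&
           pos + field_spec_size <= blob_size &&
           read_u32(blob, pos) != 0) {
        count++;
        pos += field_spec_size;
    }
    return count;
}

/* Hypothetical demo data: a 12-byte stride (10 bytes of data plus 2 bytes of
 * padding) holding a placeholder, two fields and an end marker. */
static size_t demo_count_fields(void)
{
    enum { POOL_START = 16, STRIDE = 12, SPECS = 4 };
    uint8_t blob[POOL_START + STRIDE * SPECS];
    uint32_t names[SPECS] = { 0, 5, 14, 0 }; /* name pool offsets; 0 = marker */
    int i;

    memset(blob, 0, sizeof blob);
    for (i = 0; i < SPECS; i++)
        memcpy(blob + POOL_START + i * STRIDE, &names[i], sizeof names[i]);

    /* The run for the type starts right after the placeholder element. */
    return count_fields(blob, sizeof blob, POOL_START, STRIDE, STRIDE);
}
```

The demo stride is deliberately larger than the 10 bytes of data in a `FieldSpec`, which mirrors the padding a C compiler may add; the walk still lands on each `Name` because it only trusts the stride recorded in the directory.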
diff --git a/docs/design/datacontracts/sample.blob.c b/docs/design/datacontracts/sample.blob.c index 7919c9e93aa185..50f32badf44481 100644 --- a/docs/design/datacontracts/sample.blob.c +++ b/docs/design/datacontracts/sample.blob.c @@ -6,8 +6,8 @@ typedef struct ManagedThread ManagedThread; struct ManagedThread { - uint32_t GCHandle; - ManagedThread *next; + uint32_t m_gcHandle; + ManagedThread *m_next; }; typedef struct ManagedThreadStore { @@ -16,8 +16,6 @@ typedef struct ManagedThreadStore { static ManagedThreadStore g_managedThreadStore; -static const int FEATURE_COM = 1; - // end example structures // begin blob definition @@ -32,14 +30,14 @@ struct TypeSpec struct FieldSpec { uint32_t Name; + uint32_t TypeName; uint16_t FieldOffset; - uint16_t Reserved; }; struct GlobalSpec { uint32_t Name; - uint32_t Reserved; + uint32_t TypeName; uint64_t Value; }; @@ -48,7 +46,9 @@ struct GlobalSpec #define MAKE_TYPELEN_NAME(tyname) CONCAT(cdac_string_pool_typename__, tyname) #define MAKE_FIELDLEN_NAME(tyname,membername) CONCAT4(cdac_string_pool_membername__, tyname, __, membername) +#define MAKE_FIELDTYPELEN_NAME(tyname,membername) CONCAT4(cdac_string_pool_membertypename__, tyname, __, membername) #define MAKE_GLOBALLEN_NAME(globalname) CONCAT(cdac_string_pool_globalname__, globalname) +#define MAKE_GLOBALTYPELEN_NAME(globalname) CONCAT(cdac_string_pool_globaltypename__, globalname) // define a struct where the size of each field is the length of some string. 
we will use offsetof to get // the offset of each struct element, which will be equal to the offset of the beginning of that string in the @@ -63,11 +63,13 @@ struct CDacStringPoolSizes #define CDAC_TYPE_BEGIN(name) DECL_LEN(MAKE_TYPELEN_NAME(name), sizeof(#name)) #define CDAC_TYPE_INDETERMINATE(name) #define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membername,offset) DECL_LEN(MAKE_FIELDLEN_NAME(tyname,membername), sizeof(#membername)) +#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) DECL_LEN(MAKE_FIELDLEN_NAME(tyname,membername), sizeof(#membername)) \ + DECL_LEN(MAKE_FIELDTYPELEN_NAME(tyname,membername), sizeof(#membertyname)) #define CDAC_TYPE_END(name) #define CDAC_TYPES_END() #define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,value) DECL_LEN(MAKE_GLOBALLEN_NAME(name), sizeof(#name)) +#define CDAC_GLOBAL(name,tyname,value) DECL_LEN(MAKE_GLOBALLEN_NAME(name), sizeof(#name)) \ + DECL_LEN(MAKE_GLOBALTYPELEN_NAME(name), sizeof(#tyname)) #define CDAC_GLOBALS_END() #include "sample.data.h" #undef CDAC_BASELINE @@ -86,7 +88,9 @@ struct CDacStringPoolSizes #define GET_TYPE_NAME(name) offsetof(struct CDacStringPoolSizes, MAKE_TYPELEN_NAME(name)) #define GET_FIELD_NAME(tyname,membername) offsetof(struct CDacStringPoolSizes, MAKE_FIELDLEN_NAME(tyname,membername)) +#define GET_FIELDTYPE_NAME(tyname,membername) offsetof(struct CDacStringPoolSizes, MAKE_FIELDTYPELEN_NAME(tyname,membername)) #define GET_GLOBAL_NAME(globalname) offsetof(struct CDacStringPoolSizes, MAKE_GLOBALLEN_NAME(globalname)) +#define GET_GLOBALTYPE_NAME(globalname) offsetof(struct CDacStringPoolSizes, MAKE_GLOBALTYPELEN_NAME(globalname)) // count the types enum @@ -97,11 +101,11 @@ enum #define CDAC_TYPE_BEGIN(name) + 1 #define CDAC_TYPE_INDETERMINATE(name) #define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membername,offset) +#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) #define CDAC_TYPE_END(name) #define CDAC_TYPES_END() #define 
CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBAL(name,tyname,value) #define CDAC_GLOBALS_END() #include "sample.data.h" #undef CDAC_BASELINE @@ -129,11 +133,11 @@ enum #define CDAC_TYPE_BEGIN(name) #define CDAC_TYPE_INDETERMINATE(name) #define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membername,offset) + 1 +#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) + 1 #define CDAC_TYPE_END(name) + 1 #define CDAC_TYPES_END() #define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBAL(name,tyname,value) #define CDAC_GLOBALS_END() #include "sample.data.h" #undef CDAC_BASELINE @@ -160,11 +164,11 @@ enum #define CDAC_TYPE_BEGIN(name) #define CDAC_TYPE_INDETERMINATE(name) #define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membername,offset) +#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) #define CDAC_TYPE_END(name) #define CDAC_TYPES_END() #define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,value) + 1 +#define CDAC_GLOBAL(name,tyname,value) + 1 #define CDAC_GLOBALS_END() #include "sample.data.h" #undef CDAC_BASELINE @@ -207,12 +211,12 @@ struct CDacFieldPoolSizes #define CDAC_TYPE_BEGIN(name) struct MAKE_TYPEFIELDS_TYNAME(name) { #define CDAC_TYPE_INDETERMINATE(name) #define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membername,offset) DECL_LEN(CONCAT4(cdac_field_pool_member__, tyname, __, membername)) +#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) DECL_LEN(CONCAT4(cdac_field_pool_member__, tyname, __, membername)) #define CDAC_TYPE_END(name) DECL_LEN(CONCAT4(cdac_field_pool_member__, tyname, _, endmarker)) \ } MAKE_TYPEFIELDS_TYNAME(name); #define CDAC_TYPES_END() #define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBAL(name,tyname,value) #define CDAC_GLOBALS_END() #include "sample.data.h" #undef CDAC_BASELINE @@ -244,13 +248,11 @@ struct BinaryBlobDataDescriptor uint32_t FieldPoolCount; uint32_t NamesPoolCount; - uint32_t 
Reserved0; - - uint16_t TypeSpecSize; - uint16_t FieldSpecSize; - uint16_t GlobalSpecSize; - uint16_t Reserved1; + uint8_t TypeSpecSize; + uint8_t FieldSpecSize; + uint8_t GlobalSpecSize; + uint8_t Reserved0; } Directory; uint32_t BaselineName; struct TypeSpec Types[CDacBlobTypesCount]; @@ -260,12 +262,12 @@ struct BinaryBlobDataDescriptor }; struct MagicAndBlob { - char magic[4]; + char magic[8]; struct BinaryBlobDataDescriptor Blob; }; const struct MagicAndBlob Blob = { - .magic = "DAC", + .magic = "DACBLOB", .Blob = { .Directory = { .TypesStart = offsetof(struct BinaryBlobDataDescriptor, Types), @@ -285,11 +287,11 @@ const struct MagicAndBlob Blob = { #define CDAC_TYPE_BEGIN(name) #name "\0" #define CDAC_TYPE_INDETERMINATE(name) #define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membername,offset) #membername "\0" +#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) #membername "\0" #membertyname "\0" #define CDAC_TYPE_END(name) #define CDAC_TYPES_END() #define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,value) #name "\0" +#define CDAC_GLOBAL(name,tyname,value) #name "\0" #tyname "\0" #define CDAC_GLOBALS_END() #include "sample.data.h" #undef CDAC_BASELINE @@ -311,14 +313,15 @@ const struct MagicAndBlob Blob = { #define CDAC_TYPE_BEGIN(name) #define CDAC_TYPE_INDETERMINATE(name) #define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membername,offset) { \ +#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) { \ .Name = GET_FIELD_NAME(tyname,membername), \ + .TypeName = GET_FIELDTYPE_NAME(tyname,membername), \ .FieldOffset = offset, \ }, #define CDAC_TYPE_END(name) { 0, }, #define CDAC_TYPES_END() #define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBAL(name,tyname,value) #define CDAC_GLOBALS_END() #include "sample.data.h" #undef CDAC_BASELINE @@ -342,11 +345,11 @@ const struct MagicAndBlob Blob = { .Fields = GET_TYPE_FIELDS(name), #define CDAC_TYPE_INDETERMINATE(name) .Size = 0, #define CDAC_TYPE_SIZE(size)
.Size = size, -#define CDAC_TYPE_FIELD(tyname,membername,offset) +#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) #define CDAC_TYPE_END(name) }, #define CDAC_TYPES_END() #define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,value) +#define CDAC_GLOBAL(name,tyname,value) #define CDAC_GLOBALS_END() #include "sample.data.h" #undef CDAC_BASELINE @@ -368,11 +371,11 @@ const struct MagicAndBlob Blob = { #define CDAC_TYPE_BEGIN(name) #define CDAC_TYPE_INDETERMINATE(name) #define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membername,offset) +#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) #define CDAC_TYPE_END(name) #define CDAC_TYPES_END() #define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,value) { .Name = GET_GLOBAL_NAME(name), .Value = value }, +#define CDAC_GLOBAL(name,tyname,value) { .Name = GET_GLOBAL_NAME(name), .TypeName = GET_GLOBALTYPE_NAME(name), .Value = value }, #define CDAC_GLOBALS_END() #include "sample.data.h" #undef CDAC_BASELINE diff --git a/docs/design/datacontracts/sample.data.h b/docs/design/datacontracts/sample.data.h index 18f1a454035264..df91c3df601ed6 100644 --- a/docs/design/datacontracts/sample.data.h +++ b/docs/design/datacontracts/sample.data.h @@ -3,7 +3,8 @@ CDAC_TYPES_BEGIN() CDAC_TYPE_BEGIN(ManagedThread) CDAC_TYPE_INDETERMINATE(ManagedThread) -CDAC_TYPE_FIELD(ManagedThread, GCHandle, offsetof(ManagedThread,GCHandle)) +CDAC_TYPE_FIELD(ManagedThread, GCHandle, GCHandle, offsetof(ManagedThread,m_gcHandle)) +CDAC_TYPE_FIELD(ManagedThread, pointer, Next, offsetof(ManagedThread,m_next)) CDAC_TYPE_END(ManagedThread) CDAC_TYPE_BEGIN(GCHandle) @@ -13,6 +14,10 @@ CDAC_TYPE_END(GCHandle) CDAC_TYPES_END() CDAC_GLOBALS_BEGIN() -CDAC_GLOBAL(ManagedThreadStore, (uint64_t)(uintptr_t)&g_managedThreadStore) -CDAC_GLOBAL(FeatureCOMFlag, FEATURE_COM) +CDAC_GLOBAL(ManagedThreadStore, pointer, (uint64_t)(uintptr_t)&g_managedThreadStore) +#if FEATURE_EH_FUNCLETS +CDAC_GLOBAL(FeatureEHFunclets, uint8, 1) +#else 
+CDAC_GLOBAL(FeatureEHFunclets, uint8, 0) +#endif CDAC_GLOBALS_END() From 019bc63be6867f267323f14262e4024b34a71a04 Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Tue, 26 Mar 2024 14:52:11 -0400 Subject: [PATCH 04/14] spellcheck --- docs/design/datacontracts/data_descriptor.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index 8f13e14edb414e..5b21c80848d90a 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -20,7 +20,7 @@ store it. It is primarily meant to be read and written by tooling. ## Logical descriptor -Each logical descriptor exists within an implied /target architecture/ consisiting of: +Each logical descriptor exists within an implied /target architecture/ consisting of: * target architecture endianness (little endian or big endian) * target architecture pointer size (4 bytes or 8 bytes) @@ -115,7 +115,7 @@ global values are added in traversal order with later appearances overwriting pr Rationale: if a baseline is included more than once, only the first inclusion counts. If a type appears in multiple physical descriptors, the later appearances may add more fields or change the -offsets or definite/indefinite sizes of prior definitions. If a value appears multipel times, later +offsets or definite/indefinite sizes of prior definitions. If a value appears multiple times, later definitions take precedence. **FIXME** do we really want a DAG? Are we ok with a linked list? 
@@ -139,7 +139,7 @@ The toplevel dictionary will contain: ### Types -The types will be in an array, with each type described by a dictionary containining keys: +The types will be in an array, with each type described by a dictionary containing keys: * `"name": "type name"` the name of each type * optional `"size": int | "indeterminate"` if omitted the size is indeterminate @@ -192,7 +192,7 @@ The design of the physical binary blob descriptor is constrained by the followin * It should be possible to produce the blob using the native C/C++/NativeAOT compiler for a given target/architecture. In particular for a runtime written in C, the binary blob should be constructible using C idioms. If the C compiler needs to pad or align the data, the blob format - should provide a way to iterate the blob contents without having to know anything abotu the target + should provide a way to iterate the blob contents without having to know anything about the target platform ABI or C compiler conventions. This leads to the following overall strategy for the design: @@ -200,7 +200,7 @@ This leads to the following overall strategy for the design: would have relocations applied to it, which would preclude reading the blob out of of an object file without understanding the object file format. * The physical blob must be "self-describing": If the C compiler adds padding or alignment, the blob - descriptor must contain information for how to skip the pading/alignment data. + descriptor must contain information for how to skip the padding/alignment data. * The physical blob must be constructible using "lowest common denominator" target toolchain tooling - the C preprocessor. That doesn't mean that tooling _must_ use the C preprocessor to generate the blob, but the format must not exceed the capabilities of the C preprocessor. 
@@ -214,7 +214,7 @@ It is likely that the physical descriptor will be a part of a build-time constan of a .NET runtime, and the format is designed to be compact and specifiable as a (likely machine-generated) compile-time constant in a suitable source language. -The data descriptor forms one part of an overall physical data contract descriptor in a targret .NET +The data descriptor forms one part of an overall physical data contract descriptor in a target .NET runtime and as such this format does not specify a "magic number", or "well known symbol", or another means of identifying the blob within a target process. Additionally the version of the binary blob data descriptor is expected to be stored within the larger enclosing data contract @@ -281,7 +281,7 @@ next. This is followed by the sizes of the `TypeSpec`, `FieldSpec` and `GlobalSp Rationale: If a `BinaryBlobDataDescriptor` is created via C macros, we want to embed the `offsetof` and `sizeof` of the components of the blob into the blob itself without having to account for any -padding that the C compiler may introduce to enfore alignment. Additionally the `Directory` tries +padding that the C compiler may introduce to enforce alignment. Additionally the `Directory` tries to follow a common C alignment rule (we don't want padding introduced in the directory itself): N-byte members are aligned to start on N-byte boundaries. @@ -292,7 +292,7 @@ The types are given as an array of `TypeSpec` elements. Each one contains an of specified field of the type, and the size of the type in bytes or 0 if it is indeterminate. The fields pool is given as a sequence of `FieldSpec` elements. The fields for each type are given -in a contiguious subsequence and are terminated by a marker `FieldSpec` with a `Name` offset of 0. +in a contiguous subsequence and are terminated by a marker `FieldSpec` with a `Name` offset of 0. (Thus if a type has an empty sequence of fields it just points to a marker field spec directly.) 
For each field there is a name that gives an offset in the name pool and an offset indicating the field's offset. The field type is not given. From 452c9906c1837a2d99a431f5a90f011ccc5986ef Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Tue, 26 Mar 2024 15:32:17 -0400 Subject: [PATCH 05/14] simplify --- docs/design/datacontracts/data_descriptor.md | 65 ++++++++----------- .../datacontracts/datacontracts_design.md | 28 +++----- docs/design/datacontracts/sample.data.h | 2 +- 3 files changed, 38 insertions(+), 57 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index 5b21c80848d90a..66d37906ed9755 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -20,29 +20,25 @@ store it. It is primarily meant to be read and written by tooling. ## Logical descriptor -Each logical descriptor exists within an implied /target architecture/ consisting of: +Each logical descriptor exists within an implied *target architecture* consisting of: * target architecture endianness (little endian or big endian) * target architecture pointer size (4 bytes or 8 bytes) -The following /primitive types/ are assumed: int8, uint8, int16, uint16, int32, uint32, int64, +The following *primitive types* are assumed: int8, uint8, int16, uint16, int32, uint32, int64, uint64, nint, nuint, pointer. The multi-byte types are in the target architecture endianness. The types `nint`, `nuint` and `pointer` have target architecture pointer size. The data descriptor consists of: -* the data descriptor specification version * a collection of type structure descriptors -* a collection of global value descriptors. - -## Data descriptor specification version - -This is the version of the physical data descriptor. 
+* a collection of global value descriptors ## Types The types (both primitive types and structures described by structure descriptors) are classified as -having either determinate or indeterminate size. Determinate sizes may be used for pointer -arithmetic. Types with indeterminate size may not be. Note that some sizes may be determinate, but -/target specific/. For example pointer types have a fixed size that varies by architecture. +having either determinate or indeterminate size. Types with a determinate size may be used for +pointer arithmetic, whereas types with an indeterminate size may not be. Note that some sizes may +be determinate, but *target specific*. For example pointer types have a fixed size that varies by +architecture. ## Structure descriptors @@ -64,6 +60,7 @@ Type names must be globally unique within a single logical descriptor. Each field descriptor consists of: * a name +* a type * an offset in bytes from the beginning of the struct The name of a field descriptor must be unique within the definition of a structure. @@ -98,27 +95,21 @@ memory. ## Physical descriptors -The physical descriptors are meant to describe /subsets/ of a logical descriptor and to compose. -Each physical descriptor can name an ordered sequence of zero or more "baseline" descriptor which is then -considered to comprise a piece of the overall logical descriptor. +The physical descriptors are meant to describe *subsets* of a logical descriptor and to compose. -Starting from a single physical descriptor, the "baseline" relationship forms a directed graph. It -is an error for the graph to contain a cycle. The baseline relationship may form a DAG (that is: two -or more nodes may refer to the same baseline). 
+In typical usage we expect to have two physical descriptors that are combined to form the logical descriptor for a target runtime: +* a "baseline" physical descriptor with a well-known name, +* a "binary blob" physical descriptor that is part of the target runtime process' memory -When constructing the logical descriptor, the DAG is traversed in a post-order traversal with each -node visited at most once with baselines of a particular node visited from first to last. +When constructing the logical descriptor, first the baseline physical desctriptor is consumed: the +types and values from the baseline are added to the logical descriptor. Then the types of the +binary blob are used to augment the baseline: fields are added or modified, sizes and offsets are +overwritten. The global values of the binary blob are used to augment the baseline: new globals are +added, existing globals are modified by overwriting their types or values. -To form the logical descriptor the types are added in traversal order with later appearances -augmenting earlier ones (fields are added or modified, sizes and offsets are overwritten). The -global values are added in traversal order with later appearances overwriting previous ones. - -Rationale: if a baseline is included more than once, only the first inclusion counts. If a type -appears in multiple physical descriptors, the later appearances may add more fields or change the -offsets or definite/indefinite sizes of prior definitions. If a value appears multiple times, later -definitions take precedence. - -**FIXME** do we really want a DAG? Are we ok with a linked list? +Rationale: If a type appears in multiple physical descriptors, the later appearances may add more +fields or change the offsets or definite/indefinite sizes of prior definitions. If a value appears +multiple times, later definitions take precedence. ## Physical JSON descriptor @@ -133,7 +124,6 @@ A data descriptor may be stored in the "JSON with comments" format. 
The toplevel dictionary will contain: * `"version": 0` -* `"baseline": "FREEFORM STRING"` or `baseline: ["FREEFORM STRING"", ...]` * `"types": TYPE_ARRAY` see below * `"globals": VALUE_ARRAY` see below @@ -148,13 +138,14 @@ The types will be in an array, with each type described by a dictionary containi Each `FIELD_ARRAY` is an array of dictionaries each containing keys: * `"name": "field name"` the name of each field -* `"type": "type name"` the name of a primitive type or another type defined in the same /logical/ descriptor +* `"type": "type name"` the name of a primitive type or another type defined in the logical descriptor * optional `"offset": int | "unknown"` the offset of the field or "unknown". If omitted, same as "unknown". -Note that the logical descriptor does not contain "unknown" offsets. +Note that the logical descriptor does not contain "unknown" offsets: it is expected that the binary +blob will augment the baseline with a known offset for all fields in the baseline. -Rationale: "unknown" offsets may be used to document in the physical JSON descriptor that another -physical descriptor in the "baseline" graph is expected to provide the offset of the field. +Rationale: "unknown" offsets may be used to document in the physical JSON descriptor that the binary +blob descriptor is expected to provide the offset of the field. ### Global values @@ -164,7 +155,8 @@ The global values will be in an array, with each value described by a dictionary * `"type": "type name"` the type of the global value * optional `"value": VALUE | "unknown"` the value of the global value or "unknown". If omitted, same as "unknown". -Note that the logical descriptor does not contain "unknown" values. +Note that the logical descriptor does not contain "unknown" values: it is expected that the binary +blob will augment the baseline with a known offset for all fields in the baseline. 
The `VALUE` may be a JSON numeric constant integer or a string containing a signed or unsigned decimal or hex (with prefix `0x` or `0X`) integer constant. The constant must be within the range @@ -173,9 +165,6 @@ of the type of the global value. For pointer and nuint globals, the value may be assumed to fit in a 64-bit unsigned integer. For nint globals, the value may be assumed to fit in a 64-bit signed integer. -If the value is specified as "unknown" another physical descriptor in the "baseline" graph is -expected to provide the value. - ## Physical binary blob descriptor ### Version diff --git a/docs/design/datacontracts/datacontracts_design.md b/docs/design/datacontracts/datacontracts_design.md index 6cc424c356eced..d4add0b910d051 100644 --- a/docs/design/datacontracts/datacontracts_design.md +++ b/docs/design/datacontracts/datacontracts_design.md @@ -39,17 +39,15 @@ It is not a requirement that the baseline is chosen so that additional "delta" i possible size, although for practical purposes that may be desired. Data descriptors are registered as "well known" by checking them into the main branch of -`dotnet/runtime` in the `docs/design/datacontracts/data/` directory in a format to be specified -later. The relative path name (with `/` as the path separator, if any) of the descriptor without +`dotnet/runtime` in the `docs/design/datacontracts/data/` directory in the JSON format specified +in the [data descriptor spec](./data_descriptor.md#Physical_JSON_Descriptor). The relative path name (with `/` as the path separator, if any) of the descriptor without any extension is the identifier. 
(for example: -`/docs/design/datacontracts/data/net9.0-rc1/Release/linux-arm64.json` is the filename for the data -descriptor with identifier `net9.0-rc1/Release/linux-arm64`) +`/docs/design/datacontracts/data/net9.0/coreclr/linux-arm64.json` is the filename for the data +descriptor with identifier `net9.0/coreclr/linux-arm64`) #### Global Values -Global values which can be of types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, pointer, nint, nuint, string) -All global values have a string describing their name, and a value of one of the above types. - -For instance, we will likely have a `TargetPointerSize` global value represented in a data descriptor. +Global values which can be of types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, pointer, nint, nuint) +All global values have a string describing their name, a type, and a value of one of the above types. #### Data Structure Layout Each data structure layout has a name for the type, followed by a list of fields. These fields can be of primitive types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, nint, nuint, pointer) or of another named data structure type. Each field descriptor provides the offset of the field, the name of the field, and the type of the field. @@ -71,15 +69,9 @@ Contracts are described an integer version number. A higher version number is no ## Contract data model Logically a contract may refer to another contract. If it does so, it will typically refer to other contracts by names which do not include the contract version. This is to allow for version flexibility. Logically once the Data Contract Descriptor is fully processed, there is a single list of contracts that represents the set of contracts useable with whatever runtime instance is being processed. -## Types of contracts - -There are 2 different types of contracts each representing a different phase of execution of the data contract system. 
- -### Composition contracts -These contracts indicate the version numbers of other contracts. This is done to reduce the size of contract list needed in the Data Contract Descriptor. In general it is intended that as a runtime nears shipping, the product team can gather up all of the current versions of the contracts into a single magic value, which can be used to initialize most of the contract versions of the data contract system. A specific version number in the Data Contract Descriptor for a given contract will override any composition contracts specified in the Data Contract Descriptor. If there are multiple composition contracts in a Data Contract Descriptor which specify the same contract to have a different version, the first composition contract linearly in the Data Contract Descriptor wins. This is intended to allow for a composite contract for the architecture/os indepedent work, and a separate composite contract for the non independent work. If a contract is specified explicitly in the Data Contract Descriptor and a different version is specified via the composition contract mechanism, the explicitly specified contract takes precedence. +## Algorithmic contracts -### Algorithmic contracts -Algorithmic contracts define how to process a given set of data structures to produce useful results. These are effectively code snippets which utilize the abstracted data structures provided by Data Structure Definition Contracts and Global Value Contract to produce useful output about a given program. Descriptions of these contracts may refer to functionality provided by other contracts to do their work. The algorithms provided in these contracts are designed to operate given the ability to read various primitive types and defined data structures from the process memory space, as well as perform general purpose computation. +Algorithmic contracts define how to process a given set of data structures to produce useful results. 
These are effectively code snippets which utilize the abstracted data structures and global values provided by data descriptor to produce useful output about a given program. Descriptions of these contracts may refer to functionality provided by other contracts to do their work. The algorithms provided in these contracts are designed to operate given the ability to read various primitive types and defined data structures from the process memory space, as well as perform general purpose computation. It is entirely reasonable for an algorithmic contract to have multiple entrypoints which take different inputs. For example imagine a contract which provides information about a `MethodTable`. It may provide the an api to get the `BaseSize` of a `MethodTable`, and an api to get the `DynamicTypeID` of a `MethodTable`. However, while the set of contracts which describe an older version of .NET may provide a means by which the `DynamicTypeID` may be acquired for a `MethodTable`, a newer runtime may not have that concept. In such a case, it is very reasonable to define that the `GetDynamicTypeID` api portion of that contract is defined to simply `throw new NotSupportedException();` @@ -99,11 +91,11 @@ runtime will produce an error. ## Arrangement of contract specifications in the repo -Specs shall be stored in the repo in a set of directories. `docs/design/datacontracts` Each one of them shall be a seperate markdown file named with the name of contract. `docs/design/datacontracts/datalayout/.md` Every version of each contract shall be located in the same file to facilitate understanding how variations between different contracts work. +Specs shall be stored in the repo in a set of directories. `docs/design/datacontracts` Each one of them shall be a seperate markdown file named with the name of contract. `docs/design/datacontracts/.md` Every version of each contract shall be located in the same file to facilitate understanding how variations between different contracts work. 
### Algorthmic Contract -Algorithmic contracts these describe how an algorithm that processes over data layouts work. Unlike all other contract forms, every version of an algorithmic contract presents a consistent api to consumers of the contract. +Algorithmic contracts describe how an algorithm that processes over data layouts work. Every version of an algorithmic contract presents a consistent api to consumers of the contract. There are several sections: 1. The header, where a description of what the contract can do is placed. diff --git a/docs/design/datacontracts/sample.data.h b/docs/design/datacontracts/sample.data.h index df91c3df601ed6..3c9b4bb9a585ce 100644 --- a/docs/design/datacontracts/sample.data.h +++ b/docs/design/datacontracts/sample.data.h @@ -1,4 +1,4 @@ -CDAC_BASELINE("net9.0-rc1/Release/osx-arm64") +CDAC_BASELINE("net9.0/coreclr/osx-arm64") CDAC_TYPES_BEGIN() CDAC_TYPE_BEGIN(ManagedThread) From 82240506e7155b0e9e9db54bc9099252ebc6fc66 Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Tue, 26 Mar 2024 15:41:12 -0400 Subject: [PATCH 06/14] composition requirement for binary blobs --- docs/design/datacontracts/data_descriptor.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index 66d37906ed9755..f4508ec8e71928 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -101,7 +101,7 @@ In typical usage we expect to have two physical descriptors that are combined to * a "baseline" physical descriptor with a well-known name, * a "binary blob" physical descriptor that is part of the target runtime process' memory -When constructing the logical descriptor, first the baseline physical desctriptor is consumed: the +When constructing the logical descriptor, first the baseline physical descriptor is consumed: the types and values from the baseline are added to the logical 
descriptor. Then the types of the binary blob are used to augment the baseline: fields are added or modified, sizes and offsets are overwritten. The global values of the binary blob are used to augment the baseline: new globals are @@ -183,9 +183,15 @@ The design of the physical binary blob descriptor is constrained by the followin constructible using C idioms. If the C compiler needs to pad or align the data, the blob format should provide a way to iterate the blob contents without having to know anything about the target platform ABI or C compiler conventions. +* It should be possible to create separate subsets of the physical descriptor (in the target runtime + object format) using separate toolchains (for example: in NativeAOT some of the struct layouts may + be described by the NativeAOT compiler, while some might be described by the C/C++ toolchain) and + to run a build host (not target architecture) tool to read and compose them into a single physical + binary blob before embedding it into the final NativeAOT runtime binary. This leads to the following overall strategy for the design: -* The physical blob is "self-contained": using pointers would mean that the encoding of the blob +* The physical blob is "self-contained": indirections are encoded as offsets from the beginning of + the blob (or other base offsets), whereas using pointers would mean that the encoding of the blob would have relocations applied to it, which would preclude reading the blob out of of an object file without understanding the object file format. * The physical blob must be "self-describing": If the C compiler adds padding or alignment, the blob @@ -193,6 +199,9 @@ This leads to the following overall strategy for the design: * The physical blob must be constructible using "lowest common denominator" target toolchain tooling - the C preprocessor. 
That doesn't mean that tooling _must_ use the C preprocessor to generate the blob, but the format must not exceed the capabilities of the C preprocessor. +* The physical blob must be round-trippable: it should be possible to extract the blob from an + object file and write it back out as C source code that compiles back to a logically equivalent + blob. ### Summary From 11e988c647f3da0504b50711894ce206b6bd061e Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Tue, 26 Mar 2024 16:10:37 -0400 Subject: [PATCH 07/14] lint --- docs/design/datacontracts/datacontracts_design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/datacontracts/datacontracts_design.md b/docs/design/datacontracts/datacontracts_design.md index d4add0b910d051..52c9a84a886561 100644 --- a/docs/design/datacontracts/datacontracts_design.md +++ b/docs/design/datacontracts/datacontracts_design.md @@ -81,7 +81,7 @@ For working with data from the target process/other contracts, the following C# Best practice is to either write the algorithm in C# like psuedocode working on top of the [C# style api](contract_csharp_api_design.cs) or by reference to specifications which are not co-developed with the runtime, such as OS/architecture specifications. Within the contract algorithm specification, the intention is that all interesting api work is done by using an instance of the `Target` class. -Algorithmic contracts may include specifications for numbers which can be referred to in the contract or by other contracts. The intention is that these global values represent magic numbers and values which are useful for the operation of algorithmic contracts. +Algorithmic contracts may include specifications for numbers which can be referred to in the contract or by other contracts. The intention is that these global values represent magic numbers and values which are useful for the operation of algorithmic contracts. 
While not all versions of a data structure are required to have the same fields/type of fields, algorithms may be built targetting the union of the set of field types defined in the data structure From db02e42351fdbe17f00172b0eb1c393a953e1aef Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Wed, 27 Mar 2024 10:51:13 -0400 Subject: [PATCH 08/14] remove binary blob format it's a build tooling implementation detail --- docs/design/datacontracts/data_descriptor.md | 179 +-------- docs/design/datacontracts/sample.blob.c | 398 ------------------- docs/design/datacontracts/sample.data.h | 23 -- 3 files changed, 18 insertions(+), 582 deletions(-) delete mode 100644 docs/design/datacontracts/sample.blob.c delete mode 100644 docs/design/datacontracts/sample.data.h diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index f4508ec8e71928..a8574c5a5155b5 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -6,17 +6,17 @@ tooling. The information is given meaning by algorithmic contracts that describ layout of the memory of a .NET process corresponds to high-level abstract data structures that represent the conceptual state of a .NET process. -In this document we give a logical description of a data descriptor together with two physical -manifestations. +In this document we give a logical description of a data descriptor together with a physical +manifestation. -The first physical format is used to publish well-known data descriptors in the `dotnet/runtime` -repository. It is supposed to be machine- and human-readable. This format is not meant to be -particularly concise and may be used for visualization, diagnostics, etc. Typically data -descriptors in this form may be written by hand or with the aid of tooling. 
+The physical format is used for two purposes: -The second physical format is used to embed a data descriptor blob within a particularly instance of -a target runtime. It is meant to be machine-readable while minimizing the total space needed to -store it. It is primarily meant to be read and written by tooling. +1. To publish well-known data descriptors in the `dotnet/runtime` repository in a machine- and +human-readable form. This data may be used for visualization, diagnostics, etc. These data +descriptors may be written by hand or with the aid of tooling. + +2. To embed a data descriptor blob within a particular instance of a target runtime. The data +descriptor blob will be discovered by diagnostic tooling from the memory of a target process. ## Logical descriptor @@ -153,7 +153,7 @@ The global values will be in an array, with each value described by a dictionary * `"name": "global value name"` the name of the global value * `"type": "type name"` the type of the global value -* optional `"value": VALUE | "unknown"` the value of the global value or "unknown". If omitted, same as "unknown". +* optional `"value": VALUE | { "indirect": int } | "unknown"` the value of the global value, or an offset in an auxiliary array containing the value or "unknown". Note that the logical descriptor does not contain "unknown" values: it is expected that the binary blob will augment the baseline with a known offset for all fields in the baseline. @@ -165,157 +165,14 @@ of the type of the global value. For pointer and nuint globals, the value may be assumed to fit in a 64-bit unsigned integer. For nint globals, the value may be assumed to fit in a 64-bit signed integer. -## Physical binary blob descriptor - -### Version +If the value is given as `{"indirect": int}` then the value is stored in an auxiliary array that is +part of the data contract descriptor. Only in-memory data descriptors may have indirect values; baseline data descriptors may not have indirect values.
-This is version 0 of the physical binary blob format. - -### Design requirements - -The design of the physical binary blob descriptor is constrained by the following requirements: -* The binary blob should be easy to process by examining an object file on disk - even if the object - file is for a foreign architecture/OS. It should be possible to read the binary blob purely by - looking at the bytes. Tooling should be able to analyze the blob without having to understand - relocation entries, dwarf debug info, symbols etc. -* It should be possible to produce the blob using the native C/C++/NativeAOT compiler for a given - target/architecture. In particular for a runtime written in C, the binary blob should be - constructible using C idioms. If the C compiler needs to pad or align the data, the blob format - should provide a way to iterate the blob contents without having to know anything about the target - platform ABI or C compiler conventions. -* It should be possible to create separate subsets of the physical descriptor (in the target runtime - object format) using separate toolchains (for example: in NativeAOT some of the struct layouts may - be described by the NativeAOT compiler, while some might be described by the C/C++ toolchain) and - to run a build host (not target architecture) tool to read and compose them into a single physical - binary blob before embedding it into the final NativeAOT runtime binary. - -This leads to the following overall strategy for the design: -* The physical blob is "self-contained": indirections are encoded as offsets from the beginning of - the blob (or other base offsets), whereas using pointers would mean that the encoding of the blob - would have relocations applied to it, which would preclude reading the blob out of of an object - file without understanding the object file format. 
-* The physical blob must be "self-describing": If the C compiler adds padding or alignment, the blob - descriptor must contain information for how to skip the padding/alignment data. -* The physical blob must be constructible using "lowest common denominator" target toolchain - tooling - the C preprocessor. That doesn't mean that tooling _must_ use the C preprocessor to - generate the blob, but the format must not exceed the capabilities of the C preprocessor. -* The physical blob must be round-trippable: it should be possible to extract the blob from an - object file and write it back out as C source code that compiles back to a logically equivalent - blob. +Rationale: This allows tooling to generate the in-memory data descriptor as a single constant +string. For pointers, the address can be stored at a known offset in an in-proc +array of pointers and the offset written into the constant json string. -### Summary +The indirection array is not part of the data descriptor spec. It is expected that the data +contract descriptor will include it. (The data contract descriptor must contain: the data +descriptor, the set of compatible algorithmic contracts, the aux array of globals). -The binary blob format for a physical descriptor is expected to be stored in the memory space of a -target .NET runtime and is thus likely to be the final descriptor in the "baseline" graph. - -It is likely that the physical descriptor will be a part of a build-time constant in the disk image -of a .NET runtime, and the format is designed to be compact and specifiable as a (likely -machine-generated) compile-time constant in a suitable source language. - -The data descriptor forms one part of an overall physical data contract descriptor in a target .NET -runtime and as such this format does not specify a "magic number", or "well known symbol", or -another means of identifying the blob within a target process. 
Additionally the version of the -binary blob data descriptor is expected to be stored within the larger enclosing data contract -descriptor and is not included here. - -### Blob - -Multi-byte values are in the target platform endianness. - -The format is: - -```c -struct BinaryBlobDataDescriptor -{ - struct Directory { - uint32_t TypesStart; - uint32_t FieldPoolStart; - - uint32_t GlobalValuesStart; - uint32_t NamesStart; - - uint32_t TypeCount; - uint32_t FieldPoolCount; - - uint32_t NamesPoolCount; - - uint8_t TypeSpecSize; - uint8_t FieldSpecSize; - uint8_t GlobalSpecSize; - uint8_t Reserved0; - } - uint32_t BaselineName; - TypeSpec[TypeCount] Types; - FieldSpec[FieldPoolCount] FieldPool; - GlobalSpec[GlobalsCount] GlobalValues; - uint8_t[NamesPoolCount] NamesPool; -}; - -struct TypeSpec -{ - uint32_t Name; - uint32_t Fields; - uint16_t Size; -}; - -struct FieldSpec -{ - uint32_t Name; - uint32_t TypeName; - uint16_t FieldOffset; -}; - -struct GlobalSpec -{ - uint32_t Name; - uint32_t TypeName; - uint64_t Value; -}; -``` - -The blob begins with a directory that gives the relative offsets of the `Types`, `FieldPool`, -`GlobalValues` and `Names` fields of the blob. The number of elements of each of the arrays is -next. This is followed by the sizes of the `TypeSpec`, `FieldSpec` and `GlobalSpec` structs. - -Rationale: If a `BinaryBlobDataDescriptor` is created via C macros, we want to embed the `offsetof` -and `sizeof` of the components of the blob into the blob itself without having to account for any -padding that the C compiler may introduce to enforce alignment. Additionally the `Directory` tries -to follow a common C alignment rule (we don't want padding introduced in the directory itself): -N-byte members are aligned to start on N-byte boundaries. - -The baseline is specified as an offset into the names pool. - -The types are given as an array of `TypeSpec` elements. 
Each one contains an offset into the -`NamesPool` giving the name of the type, An offset into the fields pool indicating the first -specified field of the type, and the size of the type in bytes or 0 if it is indeterminate. - -The fields pool is given as a sequence of `FieldSpec` elements. The fields for each type are given -in a contiguous subsequence and are terminated by a marker `FieldSpec` with a `Name` offset of 0. -(Thus if a type has an empty sequence of fields it just points to a marker field spec directly.) -For each field there is a name that gives an offset in the name pool and an offset indicating the -field's offset. The field type is not given. - -Rationale: it is expected that the types of the fields were provided by a "baseline" data descriptor. - -The globals are gives as a sequence of `GlobalSpec` elements. Each global has a name and a value. -The types of the globals are not given. - -Rationale: it is expected that the types of the global values were provided by a "baseline" data descriptor. - -The `NamesPool` is a single sequence of utf-8 bytes comprising the concatenation of all the type -field and global names including a terminating nul byte for each name. The same name may occur -multiple times. The names will be referenced by multiple type or multiple fields. (That is, a -clever blob emitter may pool strings). The first name in the name pool is the empty string (with -its nul byte). - -Rationale: we want to reserve the offset 0 as a marker. - -Names are referenced by giving their offset from the beginning of the `NamesPool`. Each name -extends until the first nul byte encountered at or past the beginning of the name. - - -## Example - -And example C header describing some data types is given in [sample.data.h](./sample.data.h). 
And -example series of C macro preprocessor definitions that produces a constant blob `Blob` is given in -[sample.blob.c](./sample.blob.c) diff --git a/docs/design/datacontracts/sample.blob.c b/docs/design/datacontracts/sample.blob.c deleted file mode 100644 index 50f32badf44481..00000000000000 --- a/docs/design/datacontracts/sample.blob.c +++ /dev/null @@ -1,398 +0,0 @@ -#include -#include - -// example structures - -typedef struct ManagedThread ManagedThread; - -struct ManagedThread { - uint32_t m_gcHandle; - ManagedThread *m_next; -}; - -typedef struct ManagedThreadStore { - ManagedThread *threads; -} ManagedThreadStore; - -static ManagedThreadStore g_managedThreadStore; - -// end example structures - -// begin blob definition - -struct TypeSpec -{ - uint32_t Name; - uint32_t Fields; - uint16_t Size; -}; - -struct FieldSpec -{ - uint32_t Name; - uint32_t TypeName; - uint16_t FieldOffset; -}; - -struct GlobalSpec -{ - uint32_t Name; - uint32_t TypeName; - uint64_t Value; -}; - -#define CONCAT(token1,token2) token1 ## token2 -#define CONCAT4(token1, token2, token3, token4) token1 ## token2 ## token3 ## token4 - -#define MAKE_TYPELEN_NAME(tyname) CONCAT(cdac_string_pool_typename__, tyname) -#define MAKE_FIELDLEN_NAME(tyname,membername) CONCAT4(cdac_string_pool_membername__, tyname, __, membername) -#define MAKE_FIELDTYPELEN_NAME(tyname,membername) CONCAT4(cdac_string_pool_membertypename__, tyname, __, membername) -#define MAKE_GLOBALLEN_NAME(globalname) CONCAT(cdac_string_pool_globalname__, globalname) -#define MAKE_GLOBALTYPELEN_NAME(globalname) CONCAT(cdac_string_pool_globaltypename__, globalname) - -// define a struct where the size of each field is the length of some string. we will use offsetof to get -// the offset of each struct element, which will be equal to the offset of the beginning of that string in the -// string pool. 
-struct CDacStringPoolSizes -{ - char cdac_string_pool_nil; // make the first real string start at offset 1 - // include 1 + for the nul -#define DECL_LEN(membername,len) char membername[1 + (len)]; -#define CDAC_BASELINE(name) DECL_LEN(cdac_string_pool_baseline_, (sizeof(name))) -#define CDAC_TYPES_BEGIN() -#define CDAC_TYPE_BEGIN(name) DECL_LEN(MAKE_TYPELEN_NAME(name), sizeof(#name)) -#define CDAC_TYPE_INDETERMINATE(name) -#define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) DECL_LEN(MAKE_FIELDLEN_NAME(tyname,membername), sizeof(#membername)) \ - DECL_LEN(MAKE_FIELDTYPELEN_NAME(tyname,membername), sizeof(#membertyname)) -#define CDAC_TYPE_END(name) -#define CDAC_TYPES_END() -#define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,tyname,value) DECL_LEN(MAKE_GLOBALLEN_NAME(name), sizeof(#name)) \ - DECL_LEN(MAKE_GLOBALTYPELEN_NAME(name), sizeof(#tyname)) -#define CDAC_GLOBALS_END() -#include "sample.data.h" -#undef CDAC_BASELINE -#undef CDAC_TYPES_BEGIN -#undef CDAC_TYPES_END -#undef CDAC_TYPE_BEGIN -#undef CDAC_TYPE_INDETERMINATE -#undef CDAC_TYPE_SIZE -#undef CDAC_TYPE_FIELD -#undef CDAC_TYPE_END -#undef DECL_LEN -#undef CDAC_GLOBALS_BEGIN -#undef CDAC_GLOBAL -#undef CDAC_GLOBALS_END -}; - -#define GET_TYPE_NAME(name) offsetof(struct CDacStringPoolSizes, MAKE_TYPELEN_NAME(name)) -#define GET_FIELD_NAME(tyname,membername) offsetof(struct CDacStringPoolSizes, MAKE_FIELDLEN_NAME(tyname,membername)) -#define GET_FIELDTYPE_NAME(tyname,membername) offsetof(struct CDacStringPoolSizes, MAKE_FIELDTYPELEN_NAME(tyname,membername)) -#define GET_GLOBAL_NAME(globalname) offsetof(struct CDacStringPoolSizes, MAKE_GLOBALLEN_NAME(globalname)) -#define GET_GLOBALTYPE_NAME(globalname) offsetof(struct CDacStringPoolSizes, MAKE_GLOBALTYPELEN_NAME(globalname)) - -// count the types -enum -{ - CDacBlobTypesCount = -#define CDAC_BASELINE(name) 0 -#define CDAC_TYPES_BEGIN() -#define CDAC_TYPE_BEGIN(name) + 1 -#define CDAC_TYPE_INDETERMINATE(name) 
-#define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) -#define CDAC_TYPE_END(name) -#define CDAC_TYPES_END() -#define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,tyname,value) -#define CDAC_GLOBALS_END() -#include "sample.data.h" -#undef CDAC_BASELINE -#undef CDAC_TYPES_BEGIN -#undef CDAC_TYPES_END -#undef CDAC_TYPE_BEGIN -#undef CDAC_TYPE_INDETERMINATE -#undef CDAC_TYPE_SIZE -#undef CDAC_TYPE_FIELD -#undef CDAC_TYPE_END -#undef DECL_LEN -#undef CDAC_GLOBALS_BEGIN -#undef CDAC_GLOBAL -#undef CDAC_GLOBALS_END - , -}; - -// count the field pool size. -// there's 1 placeholder element at the start, and 1 endmarker after each type -enum -{ - CDacBlobFieldPoolCount = -#define CDAC_BASELINE(name) 1 -#define CDAC_TYPES_BEGIN() -#define CDAC_TYPE_BEGIN(name) -#define CDAC_TYPE_INDETERMINATE(name) -#define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) + 1 -#define CDAC_TYPE_END(name) + 1 -#define CDAC_TYPES_END() -#define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,tyname,value) -#define CDAC_GLOBALS_END() -#include "sample.data.h" -#undef CDAC_BASELINE -#undef CDAC_TYPES_BEGIN -#undef CDAC_TYPES_END -#undef CDAC_TYPE_BEGIN -#undef CDAC_TYPE_INDETERMINATE -#undef CDAC_TYPE_SIZE -#undef CDAC_TYPE_FIELD -#undef CDAC_TYPE_END -#undef DECL_LEN -#undef CDAC_GLOBALS_BEGIN -#undef CDAC_GLOBAL -#undef CDAC_GLOBALS_END - , -}; - -// count the globals -enum -{ - CDacBlobGlobalsCount = -#define CDAC_BASELINE(name) 0 -#define CDAC_TYPES_BEGIN() -#define CDAC_TYPE_BEGIN(name) -#define CDAC_TYPE_INDETERMINATE(name) -#define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) -#define CDAC_TYPE_END(name) -#define CDAC_TYPES_END() -#define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,tyname,value) + 1 -#define CDAC_GLOBALS_END() -#include "sample.data.h" -#undef CDAC_BASELINE -#undef CDAC_TYPES_BEGIN -#undef CDAC_TYPES_END -#undef CDAC_TYPE_BEGIN -#undef 
CDAC_TYPE_INDETERMINATE -#undef CDAC_TYPE_SIZE -#undef CDAC_TYPE_FIELD -#undef CDAC_TYPE_END -#undef DECL_LEN -#undef CDAC_GLOBALS_BEGIN -#undef CDAC_GLOBAL -#undef CDAC_GLOBALS_END - , -}; - -#define MAKE_TYPEFIELDS_TYNAME(tyname) CONCAT(CDacFieldPoolTypeStart__, tyname) - -// offsets of each run of fields -// this looks like -// -// struct CDacFieldPoolSizes { -// char empty_field_spec[sizeof(struct FieldSpec)]; -// struct CDacFieldPoolTypeStart__MethodTable { -// char cdac_field_pool_member__MethodTable__GCHandle[sizeof(struct FieldSpec)]; -// char cdac_field_pool_member__MethodTable_endmarker[sizeof(struct FieldSpec)]; -// } CDacFieldPoolTypeStart__MethodTable; -// ... -// }; -// -// so that offsetof(struct CDacFieldPoolSizes, CDacFieldPoolTypeStart__MethodTable) will give the offset of the -// method table field descriptors in the run of fields -struct CDacFieldPoolSizes -{ - char empty_field_spec[sizeof(struct FieldSpec)]; // make all valid field specs non-zero -#define DECL_LEN(membername) char membername[sizeof(struct FieldSpec)]; -#define CDAC_BASELINE(name) -#define CDAC_TYPES_BEGIN() -#define CDAC_TYPE_BEGIN(name) struct MAKE_TYPEFIELDS_TYNAME(name) { -#define CDAC_TYPE_INDETERMINATE(name) -#define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) DECL_LEN(CONCAT4(cdac_field_pool_member__, tyname, __, membername)) -#define CDAC_TYPE_END(name) DECL_LEN(CONCAT4(cdac_field_pool_member__, tyname, _, endmarker)) \ - } MAKE_TYPEFIELDS_TYNAME(name); -#define CDAC_TYPES_END() -#define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,tyname,value) -#define CDAC_GLOBALS_END() -#include "sample.data.h" -#undef CDAC_BASELINE -#undef CDAC_TYPES_BEGIN -#undef CDAC_TYPES_END -#undef CDAC_TYPE_BEGIN -#undef CDAC_TYPE_INDETERMINATE -#undef CDAC_TYPE_SIZE -#undef CDAC_TYPE_FIELD -#undef CDAC_TYPE_END -#undef DECL_LEN -#undef CDAC_GLOBALS_BEGIN -#undef CDAC_GLOBAL -#undef CDAC_GLOBALS_END -}; - -#define GET_TYPE_FIELDS(tyname) 
offsetof(struct CDacFieldPoolSizes, MAKE_TYPEFIELDS_TYNAME(tyname)) - -struct BinaryBlobDataDescriptor -{ - struct Directory { - uint32_t TypesStart; - uint32_t FieldPoolStart; - - uint32_t GlobalValuesStart; - uint32_t NamesStart; - - uint32_t TypeCount; - uint32_t FieldPoolCount; - - uint32_t NamesPoolCount; - - uint8_t TypeSpecSize; - uint8_t FieldSpecSize; - uint8_t GlobalSpecSize; - uint8_t Reserved0; - } Directory; - uint32_t BaselineName; - struct TypeSpec Types[CDacBlobTypesCount]; - struct FieldSpec FieldPool[CDacBlobFieldPoolCount]; - struct GlobalSpec GlobalValues[CDacBlobGlobalsCount]; - uint8_t NamesPool[sizeof(struct CDacStringPoolSizes)]; -}; - -struct MagicAndBlob { - char magic[8]; - struct BinaryBlobDataDescriptor Blob; -}; - -const struct MagicAndBlob Blob = { - .magic = "DACBLOB", - .Blob = { - .Directory = { - .TypesStart = offsetof(struct BinaryBlobDataDescriptor, Types), - .FieldPoolStart = offsetof(struct BinaryBlobDataDescriptor, FieldPool), - .GlobalValuesStart = offsetof(struct BinaryBlobDataDescriptor, GlobalValues), - .TypeCount = CDacBlobTypesCount, - .FieldPoolCount = CDacBlobFieldPoolCount, - .NamesPoolCount = sizeof(struct CDacStringPoolSizes), - .TypeSpecSize = sizeof(struct TypeSpec), - .FieldSpecSize = sizeof(struct FieldSpec), - .GlobalSpecSize = sizeof(struct GlobalSpec), - }, - .BaselineName = offsetof(struct CDacStringPoolSizes, cdac_string_pool_baseline_), - .NamesPool = ("\0" // starts with a nul -#define CDAC_BASELINE(name) name "\0" -#define CDAC_TYPES_BEGIN() -#define CDAC_TYPE_BEGIN(name) #name "\0" -#define CDAC_TYPE_INDETERMINATE(name) -#define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) #membername "\0" #membertyname "\0" -#define CDAC_TYPE_END(name) -#define CDAC_TYPES_END() -#define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,tyname,value) #name "\0" -#define CDAC_GLOBALS_END() -#include "sample.data.h" -#undef CDAC_BASELINE -#undef CDAC_TYPES_BEGIN -#undef CDAC_TYPES_END 
-#undef CDAC_TYPE_BEGIN -#undef CDAC_TYPE_INDETERMINATE -#undef CDAC_TYPE_SIZE -#undef CDAC_TYPE_FIELD -#undef CDAC_TYPE_END -#undef DECL_LEN -#undef CDAC_GLOBALS_BEGIN -#undef CDAC_GLOBAL -#undef CDAC_GLOBALS_END - ), - .FieldPool = { -#define CDAC_BASELINE(name) {0,}, -#define CDAC_TYPES_BEGIN() -#define CDAC_TYPE_BEGIN(name) -#define CDAC_TYPE_INDETERMINATE(name) -#define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) { \ - .Name = GET_FIELD_NAME(tyname,membername), \ - .TypeName = GET_FIELDTYPE_NAME(tyname,membername), \ - .FieldOffset = offset, \ -}, -#define CDAC_TYPE_END(name) { 0, }, -#define CDAC_TYPES_END() -#define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,tyname,value) -#define CDAC_GLOBALS_END() -#include "sample.data.h" -#undef CDAC_BASELINE -#undef CDAC_TYPES_BEGIN -#undef CDAC_TYPES_END -#undef CDAC_TYPE_BEGIN -#undef CDAC_TYPE_INDETERMINATE -#undef CDAC_TYPE_SIZE -#undef CDAC_TYPE_FIELD -#undef CDAC_TYPE_END -#undef DECL_LEN -#undef CDAC_GLOBALS_BEGIN -#undef CDAC_GLOBAL -#undef CDAC_GLOBALS_END - }, - .Types = { -#define CDAC_BASELINE(name) -#define CDAC_TYPES_BEGIN() -#define CDAC_TYPE_BEGIN(name) { \ - .Name = GET_TYPE_NAME(name), \ - .Fields = GET_TYPE_FIELDS(name), -#define CDAC_TYPE_INDETERMINATE(name) .Size = 0, -#define CDAC_TYPE_SIZE(size) .Size = size, -#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) -#define CDAC_TYPE_END(name) }, -#define CDAC_TYPES_END() -#define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,tyname,value) -#define CDAC_GLOBALS_END() -#include "sample.data.h" -#undef CDAC_BASELINE -#undef CDAC_TYPES_BEGIN -#undef CDAC_TYPES_END -#undef CDAC_TYPE_BEGIN -#undef CDAC_TYPE_INDETERMINATE -#undef CDAC_TYPE_SIZE -#undef CDAC_TYPE_FIELD -#undef CDAC_TYPE_END -#undef DECL_LEN -#undef CDAC_GLOBALS_BEGIN -#undef CDAC_GLOBAL -#undef CDAC_GLOBALS_END - }, - .GlobalValues = { -#define CDAC_BASELINE(name) -#define CDAC_TYPES_BEGIN() -#define CDAC_TYPE_BEGIN(name) -#define 
CDAC_TYPE_INDETERMINATE(name) -#define CDAC_TYPE_SIZE(size) -#define CDAC_TYPE_FIELD(tyname,membertyname,membername,offset) -#define CDAC_TYPE_END(name) -#define CDAC_TYPES_END() -#define CDAC_GLOBALS_BEGIN() -#define CDAC_GLOBAL(name,tyname,value) { .Name = GET_GLOBAL_NAME(name), .TypeName = GET_GLOBALTYPE_NAME(name), .Value = value }, -#define CDAC_GLOBALS_END() -#include "sample.data.h" -#undef CDAC_BASELINE -#undef CDAC_TYPES_BEGIN -#undef CDAC_TYPES_END -#undef CDAC_TYPE_BEGIN -#undef CDAC_TYPE_INDETERMINATE -#undef CDAC_TYPE_SIZE -#undef CDAC_TYPE_FIELD -#undef CDAC_TYPE_END -#undef DECL_LEN -#undef CDAC_GLOBALS_BEGIN -#undef CDAC_GLOBAL -#undef CDAC_GLOBALS_END - }, - - } -}; - -// end blob definition diff --git a/docs/design/datacontracts/sample.data.h b/docs/design/datacontracts/sample.data.h deleted file mode 100644 index 3c9b4bb9a585ce..00000000000000 --- a/docs/design/datacontracts/sample.data.h +++ /dev/null @@ -1,23 +0,0 @@ -CDAC_BASELINE("net9.0/coreclr/osx-arm64") -CDAC_TYPES_BEGIN() - -CDAC_TYPE_BEGIN(ManagedThread) -CDAC_TYPE_INDETERMINATE(ManagedThread) -CDAC_TYPE_FIELD(ManagedThread, GCHandle, GCHandle, offsetof(ManagedThread,m_gcHandle)) -CDAC_TYPE_FIELD(ManagedThread, pointer, Next, offsetof(ManagedThread,m_next)) -CDAC_TYPE_END(ManagedThread) - -CDAC_TYPE_BEGIN(GCHandle) -CDAC_TYPE_SIZE(sizeof(intptr_t)) -CDAC_TYPE_END(GCHandle) - -CDAC_TYPES_END() - -CDAC_GLOBALS_BEGIN() -CDAC_GLOBAL(ManagedThreadStore, pointer, (uint64_t)(uintptr_t)&g_managedThreadStore) -#if FEATURE_EH_FUNCLETS -CDAC_GLOBAL(FeatureEHFunclets, uint8, 1) -#else -CDAC_GLOBAL(FeatureEHFunclets, uint8, 0) -#endif -CDAC_GLOBALS_END() From b40fa38093c6c171c4cb3b561203d133672956ce Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Wed, 27 Mar 2024 12:24:10 -0400 Subject: [PATCH 09/14] add example --- docs/design/datacontracts/data_descriptor.md | 86 ++++++++++++++++++++ 1 file changed, 86 insertions(+) diff --git a/docs/design/datacontracts/data_descriptor.md 
b/docs/design/datacontracts/data_descriptor.md index a8574c5a5155b5..31dab2d76f27f3 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -176,3 +176,89 @@ The indirection array is not part of the data descriptor spec. It is expected t contract descriptor will include it. (The data contract descriptor must contain: the data descriptor, the set of compatible algorithmic contracts, the aux array of globals). +## Example + +This is an example of a baseline descriptor for a 64-bit architecture. Suppose it has the name `"example-64"` + +```jsonc +{ + "version": 0, + "types": [ + { + "name": "GCHandle", + "size": 8, + "fields": [ + { "name": "Value", "type": "pointer", "offset": 0 } + ] + }, + { + "name": "Thread", + "size": "indeterminate", + "fields": [ + { "name": "ThreadId", "type": "uint32", "offset": "unknown" }, + { "name": "Next", "type": "pointer" }, // offset "unknown" is implied + { "name": "ThreadState", "type": "uint32" } + ] + }, + { + "name": "ThreadStore", + "fields": [ + { "name": "ThreadCount", "type": "int32" }, + { "name": "ThreadList", "type": "pointer" } + ] + } + ], + "globals": [ + { "name": "FEATURE_EH_FUNCLETS", "type": "uint8", "value": "0" }, // baseline defaults value to 0 + { "name": "s_pThreadStore", "type": "pointer" } // no baseline value + ] +} +``` + +The following is an example of an in-memory descriptor that references the above baseline: + +```jsonc +{ + "version": "0", + "baseline": "example-64", + "types": [ + { + "name": "Thread", + "fields": [ + { "name": "ThreadId", "offset": 32 }, + { "name": "ThreadState", "offset": 0 }, + { "name": "Next", "offset": 128 } + ] + }, + { + "name": "ThreadStore", + "fields": [ + { "name": "ThreadCount", "offset": 32 } + { "name": "ThreadList", "offset": 8 } + ] + } + ], + "globals": [ + { "name": "s_pThreadStore", "value": { "indirect": 0 } } + ] +} +``` + +If the indirect values table has the values `0x0100ffe0` in offset 0, then a 
possible logical descriptor with the above physical descriptors will have the following types: + +| Type | Size | Field Name | Field Type | Field Offset | +| ----------- | ------------- | ----------- | ---------- | ------------ | +| GCHandle | 8 | Value | pointer | 0 | +| Thread | indeterminate | ThreadState | uint32 | 0 | +| | | ThreadId | uint32 | 32 | +| | | Next | pointer | 128 | +| ThreadStore | indeterminate | ThreadList | pointer | 8 | +| | | ThreadCount | int32 | 32 | + + +And the globals will be: + +| Name | Type | Value | +| ------------------- | ------- | ---------- | +| FEATURE_EH_FUNCLETS | uint8 | 0 | +| s_pThreadStore | pointer | 0x0100ffe0 | From a46b9e207028b84de12df4422446588e706db46b Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Thu, 28 Mar 2024 11:34:13 -0400 Subject: [PATCH 10/14] Spell check --- docs/design/datacontracts/data_descriptor.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index 31dab2d76f27f3..8f209054571fed 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -12,7 +12,7 @@ manifestation. The physical format is used for two purposes: 1. To publish well-known data descriptors in the `dotnet/runtime` repository in a machine- and -human-readable form. This datamay be used for visualization, diagnostics, etc. These data +human-readable form. This data may be used for visualization, diagnostics, etc. These data descriptors may be written by hand or with the aid of tooling. 2. To embed a data descriptor blob within a particular instance of a target runtime. The data @@ -166,11 +166,11 @@ For pointer and nuint globals, the value may be assumed to fit in a 64-bit unsig nint globals, the value may be assumed to fit in a 64-bit signed integer. 
If the value is given as `{"indirect": int}` then the value is stored in an auxiliary array that is -part of the data contrat descriptor. Only in-memory data descriptors may have indirect values; baseline data descriptors may not have indirect values. +part of the data contract descriptor. Only in-memory data descriptors may have indirect values; baseline data descriptors may not have indirect values. Rationale: This allows tooling to generate the in-memory data descriptor as a single constant string. For pointers, the address can be stored at a known offset in an in-proc -array of pointers and the offset written into the constant json string. +array of pointers and the offset written into the constant JSON string. The indirection array is not part of the data descriptor spec. It is expected that the data contract descriptor will include it. (The data contract descriptor must contain: the data From 7896c0a23307e942e476e3fb2d63673d37ba26ae Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Thu, 28 Mar 2024 11:47:10 -0400 Subject: [PATCH 11/14] replace "binary blob" by "in-memory data descriptor" --- docs/design/datacontracts/data_descriptor.md | 26 +++++++++++--------- 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index 8f209054571fed..5698318ffc010b 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -97,15 +97,15 @@ memory. The physical descriptors are meant to describe *subsets* of a logical descriptor and to compose. 
-In typical usage we expect to have two physical descriptors that are combined to form the logical descriptor for a target runtime: -* a "baseline" physical descriptor with a well-known name, -* a "binary blob" physical descriptor that is part of the target runtime process' memory +In the .NET runtime there are two physical descriptors: +* a "baseline" physical data descriptor with a well-known name, +* an in-memory physical data descriptor that resides in the target process' memory When constructing the logical descriptor, first the baseline physical descriptor is consumed: the types and values from the baseline are added to the logical descriptor. Then the types of the -binary blob are used to augment the baseline: fields are added or modified, sizes and offsets are -overwritten. The global values of the binary blob are used to augment the baseline: new globals are -added, existing globals are modified by overwriting their types or values. +in-memory data descriptor are used to augment the baseline: fields are added or modified, sizes and +offsets are overwritten. The global values of the in-memory data descriptor are used to augment the +baseline: new globals are added, existing globals are modified by overwriting their types or values. Rationale: If a type appears in multiple physical descriptors, the later appearances may add more fields or change the offsets or definite/indefinite sizes of prior definitions. If a value appears @@ -141,11 +141,12 @@ Each `FIELD_ARRAY` is an array of dictionaries each containing keys: * `"type": "type name"` the name of a primitive type or another type defined in the logical descriptor * optional `"offset": int | "unknown"` the offset of the field or "unknown". If omitted, same as "unknown". -Note that the logical descriptor does not contain "unknown" offsets: it is expected that the binary -blob will augment the baseline with a known offset for all fields in the baseline. 
+Note that the logical descriptor does not contain "unknown" offsets: it is expected that the +in-memory data descriptor will augment the baseline with a known offset for all fields in the +baseline. -Rationale: "unknown" offsets may be used to document in the physical JSON descriptor that the binary -blob descriptor is expected to provide the offset of the field. +Rationale: "unknown" offsets may be used to document in the physical JSON descriptor that the +in-memory descriptor is expected to provide the offset of the field. ### Global values @@ -155,8 +156,9 @@ The global values will be in an array, with each value described by a dictionary * `"type": "type name"` the type of the global value * optional `"value": VALUE | { "indirect": int } | "unknown"` the value of the global value, or an offset in an auxiliary array containing the value or "unknown". -Note that the logical descriptor does not contain "unknown" values: it is expected that the binary -blob will augment the baseline with a known offset for all fields in the baseline. +Note that the logical descriptor does not contain "unknown" values: it is expected that the +in-memory data descriptor will augment the baseline with a known offset for all fields in the +baseline. The `VALUE` may be a JSON numeric constant integer or a string containing a signed or unsigned decimal or hex (with prefix `0x` or `0X`) integer constant. The constant must be within the range From 27f98924f5fb4d1181939bd26dced06f51852292 Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Thu, 28 Mar 2024 11:47:32 -0400 Subject: [PATCH 12/14] clarify globals example. 
use `[int]` for indirect data --- docs/design/datacontracts/data_descriptor.md | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index 5698318ffc010b..26021673a292c4 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -154,7 +154,7 @@ The global values will be in an array, with each value described by a dictionary * `"name": "global value name"` the name of the global value * `"type": "type name"` the type of the global value -* optional `"value": VALUE | { "indirect": int } | "unknown"` the value of the global value, or an offset in an auxiliary array containing the value or "unknown". +* optional `"value": VALUE | [ int ] | "unknown"` the value of the global value, or an offset in an auxiliary array containing the value or "unknown". Note that the logical descriptor does not contain "unknown" values: it is expected that the in-memory data descriptor will augment the baseline with a known offset for all fields in the @@ -167,8 +167,9 @@ of the type of the global value. For pointer and nuint globals, the value may be assumed to fit in a 64-bit unsigned integer. For nint globals, the value may be assumed to fit in a 64-bit signed integer. -If the value is given as `{"indirect": int}` then the value is stored in an auxiliary array that is -part of the data contract descriptor. Only in-memory data descriptors may have indirect values; baseline data descriptors may not have indirect values. +If the value is given as a single-element array `[ int ]` then the value is stored in an auxiliary +array that is part of the data contract descriptor. Only in-memory data descriptors may have +indirect values; baseline data descriptors may not have indirect values. Rationale: This allows tooling to generate the in-memory data descriptor as a single constant string. 
For pointers, the address can be stored at a known offset in an in-proc @@ -212,6 +213,7 @@ This is an example of a baseline descriptor for a 64-bit architecture. Suppose i ], "globals": [ { "name": "FEATURE_EH_FUNCLETS", "type": "uint8", "value": "0" }, // baseline defaults value to 0 + { "name": "FEATURE_COMINTEROP", "type": "uint8", "value": "1"}, { "name": "s_pThreadStore", "type": "pointer" } // no baseline value ] } @@ -241,7 +243,8 @@ The following is an example of an in-memory descriptor that references the above } ], "globals": [ - { "name": "s_pThreadStore", "value": { "indirect": 0 } } + { "name": "FEATURE_COMINTEROP", "value": "0"}, + { "name": "s_pThreadStore", "value": [ 0 ] } // indirect from aux data offset 0 ] } ``` @@ -262,5 +265,12 @@ And the globals will be: | Name | Type | Value | | ------------------- | ------- | ---------- | +| FEATURE_COMINTEROP | uint8 | 0 | | FEATURE_EH_FUNCLETS | uint8 | 0 | | s_pThreadStore | pointer | 0x0100ffe0 | + +The `FEATURE_EH_FUNCLETS` global's value comes from the baseline - not the in-memory data +descriptor. By contrast, `FEATUER_COMINTEROP` comes from the in-memory data descriptor - with the +value embedded directly in the json since it is known at build time and does not vary. Finally the +value of the pointer `s_pThreadStore` comes from the auxiliary vector's offset 0 since it is an +execution-time value that is only known to the running process. 
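Note (illustration only, not part of the patch): the resolution order that this patch describes for globals can be sketched in Python. The sketch assumes both descriptors have already been parsed from JSON into dictionaries in the regular format, and that the number inside a single-element array is an element index into the auxiliary array. The names `parse_value`, `resolve_globals` and `aux` are hypothetical and do not come from the spec or from any runtime API.

```python
def parse_value(raw):
    """Convert a VALUE (JSON integer, or decimal/hex string) to an int."""
    if isinstance(raw, int):
        return raw
    text = raw.strip()
    digits = text.lstrip("+-")
    # Hex constants carry a 0x / 0X prefix; everything else is decimal.
    base = 16 if digits[:2].lower() == "0x" else 10
    return int(text, base)


def resolve_globals(baseline, in_memory, aux):
    """Merge regular-format globals; the in-memory descriptor wins.

    Hypothetical helper: `aux` models the auxiliary array that the data
    contract descriptor is expected to carry next to the data descriptor.
    """
    merged = {}
    # The baseline is consumed first and may not contain indirect values.
    for source, allow_indirect in ((baseline, False), (in_memory, True)):
        for entry in source.get("globals", []):
            slot = merged.setdefault(entry["name"], {"type": None, "value": None})
            if "type" in entry:
                slot["type"] = entry["type"]
            raw = entry.get("value", "unknown")
            if isinstance(raw, list):
                # Single-element array: the value lives in the aux array.
                if not allow_indirect:
                    raise ValueError("baseline may not use indirect values")
                slot["value"] = aux[raw[0]]
            elif raw != "unknown":
                slot["value"] = parse_value(raw)
            # "unknown" (or omitted) keeps whatever was set earlier.
    missing = [name for name, slot in merged.items() if slot["value"] is None]
    if missing:
        # The logical descriptor must not contain unknown values.
        raise ValueError(f"globals left without a value: {missing}")
    return merged


# Data taken from the example in this patch.
baseline = {"globals": [
    {"name": "FEATURE_EH_FUNCLETS", "type": "uint8", "value": "0"},
    {"name": "FEATURE_COMINTEROP", "type": "uint8", "value": "1"},
    {"name": "s_pThreadStore", "type": "pointer"},
]}
in_memory = {"globals": [
    {"name": "FEATURE_COMINTEROP", "value": "0"},
    {"name": "s_pThreadStore", "value": [0]},
]}
aux = [0x0100FFE0]

resolved = resolve_globals(baseline, in_memory, aux)
```

With the example data, `resolved` should reproduce the globals table above: `FEATURE_EH_FUNCLETS` keeps its baseline value, `FEATURE_COMINTEROP` is overwritten by the in-memory descriptor, and `s_pThreadStore` is read from the auxiliary array. A real reader would also need to check each value against the range of its type, which this sketch leaves out.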
From b2dba2c28b9ba9dbc51593f4bdf2d3607de32e33 Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Thu, 28 Mar 2024 12:12:59 -0400 Subject: [PATCH 13/14] Add a "compact" JSON variant --- docs/design/datacontracts/data_descriptor.md | 98 +++++++++++++------- 1 file changed, 67 insertions(+), 31 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index 26021673a292c4..05fe95dee305ee 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -115,19 +115,23 @@ multiple times, later definitions take precedence. ### Version -This is version 0 of the physical descriptor +This is version 0 of the physical descriptor. ### Summary -A data descriptor may be stored in the "JSON with comments" format. +A data descriptor may be stored in the "JSON with comments" format. There are two formats: a +"regular" format and a "compact" format. The baseline data descriptor may be either regular or +compact. The in-memory descriptor will typically be compact. The toplevel dictionary will contain: * `"version": 0` -* `"types": TYPE_ARRAY` see below -* `"globals": VALUE_ARRAY` see below +* `"types": TYPES_DESCRIPTOR` see below +* `"globals": GLOBALS_DESCRIPTOR` see below -### Types +### Types descriptor + +**Regular format**: The types will be in an array, with each type described by a dictionary containing keys: @@ -141,6 +145,29 @@ Each `FIELD_ARRAY` is an array of dictionaries each containing keys: * `"type": "type name"` the name of a primitive type or another type defined in the logical descriptor * optional `"offset": int | "unknown"` the offset of the field or "unknown". If omitted, same as "unknown". +**Compact format**: + +The types will be in a dictionary, with each type name being the key and a `FIELD_DICT` dictionary as a value. + +The `FIELD_DICT` will have a field name as a key, or the special name `"!"` as a key. 
+ +If a key is `!` the value is an `int` giving the total size of the struct. The key must be omitted +if the size is indeterminate. + +If the key is any other string, the value may be one of: + +* `[int, "type name"]` giving the type and offset of the field +* `int` giving just the offset of the field with the type left unspecified + +Unknown offsets are not supported in the compact format. + +Rationale: the compact format is expected ot be used for the in-memory data descriptor. In the +common case the field type is known from the baseline descriptor. As a result, a field descriptor +like `"field_name": 36` is the minimum necessary information to be conveyed. If the field is not +present in the baseline, then `"field_name": [12, "uint16"]` may be used. + +**Both formats**: + Note that the logical descriptor does not contain "unknown" offsets: it is expected that the in-memory data descriptor will augment the baseline with a known offset for all fields in the baseline. @@ -150,23 +177,39 @@ in-memory descriptor is expected to provide the offset of the field. ### Global values +**Regular format**: + The global values will be in an array, with each value described by a dictionary containing keys: * `"name": "global value name"` the name of the global value * `"type": "type name"` the type of the global value * optional `"value": VALUE | [ int ] | "unknown"` the value of the global value, or an offset in an auxiliary array containing the value or "unknown". -Note that the logical descriptor does not contain "unknown" values: it is expected that the -in-memory data descriptor will augment the baseline with a known offset for all fields in the -baseline. - The `VALUE` may be a JSON numeric constant integer or a string containing a signed or unsigned decimal or hex (with prefix `0x` or `0X`) integer constant. The constant must be within the range of the type of the global value. 
+**Compact format**: + +The global values will be in a dictionary, with each key being the name of a global and the values being one of: + +* `[VALUE | [int], "type name"]` the type and value of a global +* `VALUE | [int]` just the value of a global + +As in the regular format, `VALUE` is a numeric constant or a string containing an integer constant. + +Note that a two element array is unambiguously "type and value", whereas a one-element array is +unambiguosly "indirect value". + +**Both formats** + For pointer and nuint globals, the value may be assumed to fit in a 64-bit unsigned integer. For nint globals, the value may be assumed to fit in a 64-bit signed integer. +Note that the logical descriptor does not contain "unknown" values: it is expected that the +in-memory data descriptor will augment the baseline with a known offset for all fields in the +baseline. + If the value is given as a single-element array `[ int ]` then the value is stored in an auxiliary array that is part of the data contract descriptor. Only in-memory data descriptors may have indirect values; baseline data descriptors may not have indirect values. @@ -179,10 +222,14 @@ The indirection array is not part of the data descriptor spec. It is expected t contract descriptor will include it. (The data contract descriptor must contain: the data descriptor, the set of compatible algorithmic contracts, the aux array of globals). + + ## Example This is an example of a baseline descriptor for a 64-bit architecture. Suppose it has the name `"example-64"` +The baseline is given in the "regular" format. + ```jsonc { "version": 0, @@ -219,33 +266,22 @@ This is an example of a baseline descriptor for a 64-bit architecture. Suppose i } ``` -The following is an example of an in-memory descriptor that references the above baseline: +The following is an example of an in-memory descriptor that references the above baseline. 
The in-memory descriptor is in the "compact" format: ```jsonc { "version": "0", "baseline": "example-64", - "types": [ - { - "name": "Thread", - "fields": [ - { "name": "ThreadId", "offset": 32 }, - { "name": "ThreadState", "offset": 0 }, - { "name": "Next", "offset": 128 } - ] - }, - { - "name": "ThreadStore", - "fields": [ - { "name": "ThreadCount", "offset": 32 } - { "name": "ThreadList", "offset": 8 } - ] - } - ], - "globals": [ - { "name": "FEATURE_COMINTEROP", "value": "0"}, - { "name": "s_pThreadStore", "value": [ 0 ] } // indirect from aux data offset 0 - ] + "types": + { + "Thread": { "ThreadId": 32, "ThreadState": 0, "Next": 128 }, + "ThreadStore": { "ThreadCount": 32, "ThreadList": 8 } + }, + "globals": + { + "FEATURE_COMINTEROP": 0, + "s_pThreadStore": [ 0 ] // indirect from aux data offset 0 + } } ``` From ddb0a4b750764ee2a54ca2c242a9be176eaad2ab Mon Sep 17 00:00:00 2001 From: Aleksey Kliger Date: Fri, 29 Mar 2024 09:16:06 -0400 Subject: [PATCH 14/14] move baseline discussion to data spec; spell check --- docs/design/datacontracts/data_descriptor.md | 33 +++++++++++++++-- .../datacontracts/datacontracts_design.md | 37 +++++-------------- 2 files changed, 39 insertions(+), 31 deletions(-) diff --git a/docs/design/datacontracts/data_descriptor.md b/docs/design/datacontracts/data_descriptor.md index 05fe95dee305ee..cd0d5ce92e82c5 100644 --- a/docs/design/datacontracts/data_descriptor.md +++ b/docs/design/datacontracts/data_descriptor.md @@ -126,9 +126,34 @@ compact. The in-memory descriptor will typically be compact. The toplevel dictionary will contain: * `"version": 0` +* optional `"baseline": "BASELINE_ID"` see below * `"types": TYPES_DESCRIPTOR` see below * `"globals": GLOBALS_DESCRIPTOR` see below +### Baseline data descriptor identifier + +The in-memory descriptor may contain an optional string identifying a well-known baseline +descriptor. 
The identifier is an arbitrary string that could be used, for example, to tag a +collection of globals and data structure layouts present in a particular release of a .NET runtime +for a certain architecture (for example `net9.0/coreclr/linux-arm64`). Global values and data structure +layouts present in the data contract descriptor take precedence over the baseline contract. This +way, variant builds can be specified as a delta over a baseline. For example, debug builds of +CoreCLR that include additional fields in a `MethodTable` data structure could be based on the same +baseline as Release builds, but with the in-memory data descriptor augmented with new `MethodTable` +fields and additional structure descriptors. + +It is not a requirement that the baseline is chosen so that the additional "delta" is the smallest +possible size, although for practical purposes that may be desired. + +Data descriptors are registered as "well known" by checking them into the main branch of +`dotnet/runtime` in the `docs/design/datacontracts/data/` directory in the JSON format specified +in the [data descriptor spec](./data_descriptor.md#Physical_JSON_Descriptor). The relative path name (with `/` as the path separator, if any) of the descriptor without +any extension is the identifier. (For example, +`/docs/design/datacontracts/data/net9.0/coreclr/linux-arm64.json` is the filename for the data +descriptor with identifier `net9.0/coreclr/linux-arm64`.) + +The baseline descriptors themselves must not have a baseline. + ### Types descriptor **Regular format**: @@ -161,10 +186,10 @@ If the key is any other string, the value may be one of: Unknown offsets are not supported in the compact format. -Rationale: the compact format is expected ot be used for the in-memory data descriptor. In the +Rationale: the compact format is expected to be used for the in-memory data descriptor. In the common case the field type is known from the baseline descriptor. 
As a result, a field descriptor like `"field_name": 36` is the minimum necessary information to be conveyed. If the field is not -present in the baseline, then `"field_name": [12, "uint16"]` may be used. +present in the baseline, then `"field_name": [12, "uint16"]` must be used. **Both formats**: @@ -199,7 +224,7 @@ The global values will be in a dictionary, with each key being the name of a glo As in the regular format, `VALUE` is a numeric constant or a string containing an integer constant. Note that a two element array is unambiguously "type and value", whereas a one-element array is -unambiguosly "indirect value". +unambiguously "indirect value". **Both formats** @@ -306,7 +331,7 @@ And the globals will be: | s_pThreadStore | pointer | 0x0100ffe0 | The `FEATURE_EH_FUNCLETS` global's value comes from the baseline - not the in-memory data -descriptor. By contrast, `FEATUER_COMINTEROP` comes from the in-memory data descriptor - with the +descriptor. By contrast, `FEATURE_COMINTEROP` comes from the in-memory data descriptor - with the value embedded directly in the json since it is known at build time and does not vary. Finally the value of the pointer `s_pThreadStore` comes from the auxiliary vector's offset 0 since it is an execution-time value that is only known to the running process. diff --git a/docs/design/datacontracts/datacontracts_design.md b/docs/design/datacontracts/datacontracts_design.md index 52c9a84a886561..f88e0abfd06e5a 100644 --- a/docs/design/datacontracts/datacontracts_design.md +++ b/docs/design/datacontracts/datacontracts_design.md @@ -24,33 +24,16 @@ relevant to one or more algorithmic contracts. More details are provided in the [data descriptor spec](./data_descriptor.md). We highlight some important aspects below: -#### Baseline data descriptor identifier - -An optional string identifying a well-known record of global values and data structure layouts. 
The -identifier is an arbitrary string, that could be used, for example to tag a collection of globals -and data structure layouts present in a particular release of a .NET runtime for a certain -architecture (for example `net9.0-rc1/Release/linux-arm64`). Global values and data structure -layouts present in the data contract descriptor take precedence over the baseline contract. This -way variant builds can be specified as a delta over a baseline. For example, debug builds of -CoreCLR that include additional fields in a `MethodTable` data structure could be based on the -Release data descriptor augmented with new `MethodTable` and other structure descriptors. - -It is not a requirement that the baseline is chosen so that additional "delta" is the smallest -possible size, although for practical purposes that may be desired. - -Data descriptors are registered as "well known" by checking them into the main branch of -`dotnet/runtime` in the `docs/design/datacontracts/data/` directory in the JSON format specified -in the [data descriptor spec](./data_descriptor.md#Physical_JSON_Descriptor). The relative path name (with `/` as the path separator, if any) of the descriptor without -any extension is the identifier. (for example: -`/docs/design/datacontracts/data/net9.0/coreclr/linux-arm64.json` is the filename for the data -descriptor with identifier `net9.0/coreclr/linux-arm64`) - #### Global Values -Global values which can be of types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, pointer, nint, nuint) + +Global values which can be either primitive integer constants or pointers. All global values have a string describing their name, a type, and a value of one of the above types. #### Data Structure Layout -Each data structure layout has a name for the type, followed by a list of fields. These fields can be of primitive types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, nint, nuint, pointer) or of another named data structure type. 
Each field descriptor provides the offset of the field, the name of the field, and the type of the field. + +Each data structure layout has a name for the type, followed by a list of fields. These fields can +be primitive integer types or pointers or another named data structure type. Each field descriptor +provides the offset of the field, the name of the field, and the type of the field. Data structures may have a determinate size, specified in the descriptor, or an indeterminate size. Determinate sizes are used by contracts for pointer arithmetic such as for iterating over arrays. @@ -75,7 +58,7 @@ Algorithmic contracts define how to process a given set of data structures to pr It is entirely reasonable for an algorithmic contract to have multiple entrypoints which take different inputs. For example imagine a contract which provides information about a `MethodTable`. It may provide the an api to get the `BaseSize` of a `MethodTable`, and an api to get the `DynamicTypeID` of a `MethodTable`. However, while the set of contracts which describe an older version of .NET may provide a means by which the `DynamicTypeID` may be acquired for a `MethodTable`, a newer runtime may not have that concept. In such a case, it is very reasonable to define that the `GetDynamicTypeID` api portion of that contract is defined to simply `throw new NotSupportedException();` -For simplicity, as it can be expected that all developers who work on the .NET runtime understand C# to a fair degree, it is preferred that the algorithms be defined in C#, or at least psuedocode that looks like C#. It is also condsidered entirely permissable to refer to other specifications if the algorithm is a general purpose one which is well defined by the OS or some other body. (For example, it is expected that the unwinding algorithms will be defined by references into either the DWARF spec, or various Windows Unwind specifications.) 
+For simplicity, as it can be expected that all developers who work on the .NET runtime understand C# to a fair degree, it is preferred that the algorithms be defined in C#, or at least pseudocode that looks like C#. It is also considered entirely permissible to refer to other specifications if the algorithm is a general purpose one which is well defined by the OS or some other body. (For example, it is expected that the unwinding algorithms will be defined by references into either the DWARF spec, or various Windows Unwind specifications.) For working with data from the target process/other contracts, the following C# interface is intended to be used within the algorithmic descriptions: @@ -84,16 +67,16 @@ Best practice is to either write the algorithm in C# like psuedocode working on Algorithmic contracts may include specifications for numbers which can be referred to in the contract or by other contracts. The intention is that these global values represent magic numbers and values which are useful for the operation of algorithmic contracts. While not all versions of a data structure are required to have the same fields/type of fields, -algorithms may be built targetting the union of the set of field types defined in the data structure +algorithms may be built targeting the union of the set of field types defined in the data structure descriptors of possible target runtimes. Access to a field which isn't defined on the current runtime will produce an error. ## Arrangement of contract specifications in the repo -Specs shall be stored in the repo in a set of directories. 
`docs/design/datacontracts` Each one of them shall be a separate markdown file named with the name of contract. `docs/design/datacontracts/.md` Every version of each contract shall be located in the same file to facilitate understanding how variations between different contracts work. -### Algorthmic Contract +### Algorithmic Contract Algorithmic contracts describe how an algorithm that processes over data layouts work. Every version of an algorithmic contract presents a consistent api to consumers of the contract.