Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add C source information to Hs and SHs ASTs #316

Closed
TravisCardwell opened this issue Nov 28, 2024 · 18 comments
Closed

Add C source information to Hs and SHs ASTs #316

TravisCardwell opened this issue Nov 28, 2024 · 18 comments

Comments

@TravisCardwell
Copy link
Collaborator

We would like to track how various parts of our generated ASTs are created, referencing the C source.

One motivation for this is test generation (#22), which requires generating both C and Haskell code for testing. Generating the C test code from a C Header is not a good option because we would need to reimplement a lot of the logic for translating from C to Haskell. If the Haskell AST includes C source information, we could traverse the Haskell AST and determine exactly what test functions are required, perhaps referencing the C Header to get C details.

For example, for a given data declaration (called Struct in Hs and Record in SHs), it is useful to know the name of the corresponding C type, including the C namespace. When generating C code, the namespace determines how an identifier is written.

  • Ordinary namespace: foo
  • struct tag namespace: struct foo
  • union tag namespace: union foo
  • enum namespace: enum foo

C source information can also be used to improve the generated Haddock documentation (#26). For example, we could output corresponding C names to help users understand/confirm which Haskell maps to which C.

We should include source locations, which may optionally be output in LINE pragmas (#74). We could even consider including source location information in generated Haddock documentation.

Related to #23 (which is for the high-level API)

@TravisCardwell
Copy link
Collaborator Author

TravisCardwell commented Nov 28, 2024

One idea is to change CName to be a phantom type with a Namespace parameter, like HsName.

data Namespace =
    NsOrdinary
  | NsStructTag
  | NsUnionTag
  | NsEnumTag

newtype CName (ns :: Namespace) = CName { getCName :: Text }
  ...

This type does not include member namespaces, which are separate per structure/union. Discussing with @edsko yesterday, he suggested that we may want to model member namespaces as well, perhaps using a GADT.


C identifier namespace reference:

  • C99 6.2.3
  • C11 6.2.3
  • C17 6.2.3
  • C23 6.2.3
    • Added standard attributes and attribute prefixes
    • Added trailing identifier in an attribute prefixed token

@phadej
Copy link
Collaborator

phadej commented Nov 28, 2024

struct tag namespace: struct foo

it doesn't work in general. Counter example

struct foo {
  struct { int x; int y } bar;
  int z;
}

we will create CFooBar which doesn't correspond to any type we could reference in C.

More generally, what you propose is essentially delaying name mangling.

@TravisCardwell
Copy link
Collaborator Author

Thanks for the example!

I think that we should track how parts of our generated ASTs are created, so in this case we could include information for CFooBar that records that it is created for an anonymous structure. When generating tests, this information lets us know that we cannot create some tests for that type. For example, the PokePeekXSameSemanticsX test can be implemented because it does not require C, while the HsSizeOfXEqCSizeOfX cannot be implemented because there is no way to reference the anonymous structure in C.

I do not mean to suggest that name mangling should be delayed. A Struct should continue to contain structName :: HsName NsTypeConstr. The suggestion is to add information about how/why the particular Struct is created.

Regarding documentation, we could generate documentation like the following:

This type corresponds to the anonymous @struct@ defined for field @bar@ of
@struct foo@.  Source: @foobar.h@ line 121

@phadej
Copy link
Collaborator

phadej commented Nov 28, 2024

The suggestion is to add information about how/why the particular Struct is created.

Regarding documentation, we could generate documentation like the following:

This type corresponds to the anonymous @struct@ defined for field @bar@ of
@struct foo@.  Source: @foobar.h@ line 121

which is essentially preserving exactly the information (to be) passed to the name mangling machinery. (And source location, but that is purely informative bit).

@TravisCardwell
Copy link
Collaborator Author

Indeed. Perhaps another way to put it is that name mangling is not invertible.

When generating tests, name CFooBar does not provide enough information for us to determine which tests need to be created for it. Additional information is required. If such information is included in the AST, it can be used to make the necessary decisions. With the current ASTs, generating tests for CFooBar requires that we process the C Header (again) to determine how/why CFooBar is created.

Documentation like the above example could help users understand/confirm which Haskell maps to which C. When a user sees name CFooBar, they may want to confirm that it is for an anonymous struct and not a C type named foo_bar. I imagine that it is not necessary in many cases, but it would likely be appreciated when the C API uses similar names that may be confusing, especially after translation to Haskell.

@phadej
Copy link
Collaborator

phadej commented Nov 29, 2024

I think that the easiest solution to go forward is to add a field with clang_getTypeSpelling result. We only have to filter out invalid spellings (like struct (unnamed struct at ex.h:2:2)), but AFAICT these are easy to spot (they end with invalid character )).

CXString clang_getTypeSpelling (CXType CT)
Pretty-print the underlying type using the rules of the language of the translation unit from which it came.

That string is not CName, but I don't think we need to parse it in any way, simply use it as is.

@TravisCardwell
Copy link
Collaborator Author

Thank you very much for the suggestion. I will give that a try.

TravisCardwell added a commit that referenced this issue Dec 4, 2024
This commit adds a type spelling field to the following ASTs:

* `C` phase: `Struct`, `Enu`, `Typedef`
    * Type: `Text`
* `Hs` phase: `Struct`, `Newtype`
    * Type: `Maybe Text`
* `SHs` phase: `Record`, `Newtype`
    * Type: `Maybe Text`

As suggested in #316, the string is not parsed.  This commit simply uses
`Text`, but we could implement a `newtype` wrapper if desired.

The goal is to make it easier to generate tests for structures,
enumerations, and `typedef`s of structures/enumerations.  Note that this
is also required for unions, but those are not implemented yet.  This is
the minimal change required to do this; the `Maybe` is needed because it
does *not* track the type spelling for macros.

(Cherry-picked from `source-info` for experimentation)
@edsko
Copy link
Collaborator

edsko commented Dec 4, 2024

After discussing this with @TravisCardwell , the idea to annotate the Haskell tree with some kind of Reason or Origin that gives us information such as

This type corresponds to the anonymous @struct@ defined for field @bar@ of
@struct foo@.  Source: @foobar.h@ line 121

is useful for both test generation and documentation generation; @phadej 's objection that the proposal as originally stated doesn't quite work ("which doesn't correspond to any type we could reference in C.") are especially important cases to consider, both for tests and for documentation generation; we agreed that Travis will submit a draft PR with an initial attempt at this so that we have something concrete to discuss and refine.

@TravisCardwell
Copy link
Collaborator Author

While adding source locations to the C AST, I ran into an issue with forward declarations. When translating example forward_declaration.h, the source location for S1 is the forward declaration. It should instead be the actual declaration.

I am not sure how to resolve this. When the lookup in processTypeDecl returns Just (TypeDecl t _), perhaps we should check if the current declaration is the "primary" declaration and update the source location if so?

@phadej
Copy link
Collaborator

phadej commented Dec 5, 2024

@TravisCardwell processTypeDecl doesn't return jUst (TypeDecl T _) I'm not sure what you are looking at. have you already upgraded / rebased to the post #296 main.

EDIT: I guess you did update. processTypeDecl doesn't return declarations, it returns a type reference. Declarations are added to DeclState as side-effect of (semantic) type traversal. Look at call-sites of addDecl in Fold.Type.

@TravisCardwell
Copy link
Collaborator Author

@phadej Thank you very much for the assistance.

I indeed rebased on the current main. My wording was just confusing; I should have written OMap.lookup instead of just "lookup." Sorry about that!

By "the lookup in processTypeDecl returns Just (TypeDecl t _)", I meant to refer to the OMap.lookup in the following code, modified to include traces. This is from my source-info branch, where sloc is a SingleLoc source location for the current cursor.

processTypeDecl unit ty sloc = do
    dtraceIO "processTypeDecl" ty
    s <- get
    case OMap.lookup ty (typeDeclarations s) of
        Nothing                      -> processTypeDecl' PathTop unit ty sloc
        Just (TypeDecl t _)          -> dtraceIO "processTypeDecl*" () >> return t
        Just (TypeDeclAlias t)       -> return t
        Just (TypeDeclProcessing t') -> liftIO $ fail $ "Incomplete type declaration: " ++ show t'

My reasoning is that when we process the "primary" declaration we have already processed it via the forward declaration and that OMap.lookup returns Just (TypeDecl t _), in which case we simply return the already-created type. Those traces output the following, confirming that this branch of OMap.lookup is indeed executed when processTypeDecl is called for the "primary" declarations that have already been processed by forward declarations.

processTypeDecl: CXType_Record#d06a0928_5a720000_10100128_5a720000
processTypeDecl: CXType_Typedef#806e0928_5a720000_10100128_5a720000
processTypeDecl: CXType_Record#d06a0928_5a720000_10100128_5a720000
processTypeDecl*: ()
processTypeDecl: CXType_Record#206d0928_5a720000_10100128_5a720000
processTypeDecl: CXType_Record#206d0928_5a720000_10100128_5a720000
processTypeDecl*: ()

Instead of just return t, perhaps we could check for this case and update the state with the "primary" source location.

@phadej
Copy link
Collaborator

phadej commented Dec 6, 2024

Isn't "primary" source location the one returned by clang_getTypeDeclaration, which is used in all branches of processTypeDecl' which create Decls?

@phadej
Copy link
Collaborator

phadej commented Dec 6, 2024

I didn't rename processTypeDecl (but probably should) to just processType. It processes types (yet, originating from declarations, c.f. processTypeDeclRec is from recursive calls, should probably been processTypeRecur).

But it's the processTypeDecl' which creates the (our C AST) declarations, and IMHO it's its responsibility to annotate them with provenance information at creation time.

@TravisCardwell
Copy link
Collaborator Author

TravisCardwell commented Dec 9, 2024

Using the cursor returned from clang_getTypeDeclaration does indeed resolve the issue. Thank you!

With the code refactored, I ran into issues with typedefs that are included from other header files. Example fixedwidth.h provides a small example where a typedef from an included header is used, and we are generating newtype definitions for those types, as can be seen in fixedwidth.pp.hs. The source location uses an absolute path in this case. This is problematic in test fixtures, where different developers have different home directories, etc.

From a user perspective, I would like source locations to be relative to the package directory, while sources outside of the package directory should be absolute. This would resolve the issue for the fixtures that we currently have; we would get paths like hs-bindgen/musl-include/bits/alltypes.h. We would still have issues if we test using system headers that differ in location on different systems, but perhaps such tests simply should not compare treediffs.

Thoughts? Would it be acceptable to transform (these) location paths so that they are relative to the package directory?

EDIT: I implemented this in commit "Make typedef source locations relative" of the source-info branch. The implementation in HsBindgen.Clang.HighLevel.SourceLoc is pure, while the usage in HsBindgen.C.Fold.Type uses IO to get the root directory.

@TravisCardwell
Copy link
Collaborator Author

After a fair amount of experimentation, I realized that there is a simple solution that may be sufficient.

C phase: The C AST must contain the C names and source locations. It generally represents the actual structure of the parsed C declarations, with some differences:

  • We process macros specially.
  • A anonymous structure(/union) used as the type of an unnamed field is collapsed within the C AST. (We do not see the distinction in the Haskell translation.)
  • A top-level structure(/union) that does not have a name does not translate to anything. It is omitted from the C AST. (We do not see it in the Haskell translation.)

Any transformation that we need to distinguish in documentation or test generation must be recorded in the AST. Currently, this is simply done by using different data structures.

Haskell phase: The Haskell AST should record why each declaration is created, using Origin sum types that reference the C AST when applicable. For example, we currently create newtype declarations for three reasons, so the NewtypeOrigin type records this.

data NewtypeOrigin =
      NewtypeOriginEnum C.Enu
    | NewtypeOriginTypedef C.Typedef
    | NewtypeOriginMacro C.Macro
  deriving Show

Some other parts of the AST track the origin in the same way, to support documentation and test generation. Specifically, we currently have a FieldOrigin that can references C fields as well as a PatSynOrigin that can reference C enumeration values.

Simplified Haskell phase: The simplified Haskell AST records origin information in the same way as the Haskell AST. At this point, there are no differences, so the Haskell AST Origin types are used as-is. If we ever introduce transformations, we should define simplified Haskell AST origin types that record the transformations made.

@TravisCardwell
Copy link
Collaborator Author

As mentioned in #316 (comment), we have to know the namespace of C names in order to generate C code. For example:

  • Type struct foo {...} must be referenced as struct foo.
  • Type typedef struct {..} foo must be referenced as foo.

The idea mentioned in #316 (comment) requires many changes across the codebase, so for now I implemented a simple/minimal solution: adding a DeclName type that is used in DeclPath. Perhaps this is sufficient.

-- | Declaration name
data DeclName
    = -- No name specified (anonymous)
      DeclNameNone
    | -- Structure/union tag specified
      DeclNameTag CName
    | -- Typedef name specified
      DeclNameTypedef CName
  deriving stock (Eq, Generic, Show)
  deriving anyclass (PrettyVal)

@TravisCardwell
Copy link
Collaborator Author

I ported the current test generation code to work with Haskell AST that includes origin information. IMHO, it is much better than the previous implementation that used the C type spelling to "join" generated Haskell AST with the C AST because there are far fewer error cases.

The test generation is an initial implementation with very broad strokes. Many things need to be improved/implemented. For example, the current implementation only supports primitive types. It does not yet support nested structures, pointers, typedef types, etc. It currently uses error when there is a type that is not yet supported, which also needs to change.

The test generation AST is very high-level, as (I think that) we will always generate pretty-printed code. Using TH is not an option with C source generation.

@edsko
Copy link
Collaborator

edsko commented Dec 20, 2024

Done in #338.

@edsko edsko closed this as completed Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants