docs: design notes includes rationale and philosophy
fix #154, #615
alandefreitas committed Jun 11, 2024
1 parent a3fd0ce commit 608a925
Showing 1 changed file with 88 additions and 48 deletions.
136 changes: 88 additions & 48 deletions docs/modules/ROOT/pages/design-notes.adoc
@@ -1,50 +1,90 @@
= Design Notes

== AST Traversal

During the AST traversal stage, the complete AST (generated by the clang frontend)
is walked beginning with the root `TranslationUnitDecl` node. It is during this
stage that USRs (universal symbol references) are generated and hashed with SHA1
to form the 160-bit `SymbolID` for an entity. With the exception of built-in types,
*all* entities referenced in the corpus will be traversed and assigned a `SymbolID`,
including those from the standard library. This is necessary to generate the
full interface for user-defined types.

== Bitcode

AST traversal is performed in parallel on a per-translation-unit basis.
To maximize the size of the code base MrDocs is capable of processing, `Info`
types generated during traversal are serialized to a compressed bitcode representation.
Once AST traversal is complete for all translation units, the bitcode is deserialized
back into `Info` types, and then merged to form the corpus. The merging step is necessary
as there may be multiple identical definitions of the same entity (e.g. for class types,
templates, inline functions, etc.), as well as functions declared in one translation
unit and defined in another.
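
The merging step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `ToyInfo`, `mergeInto`, and `buildCorpus` are hypothetical names rather than MrDocs APIs, the ID is a plain string instead of a 160-bit `SymbolID`, and the bitcode round trip is left out entirely.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a MrDocs `Info`; not the real type.
struct ToyInfo {
    std::string id;                       // stand-in for the 160-bit SymbolID
    std::string name;
    std::optional<std::string> definedIn; // file holding the definition, if seen
    std::vector<std::string> declaredIn;  // files holding declarations
};

// Merge `other` into `base`; both are assumed to describe the same entity.
inline void mergeInto(ToyInfo& base, ToyInfo const& other) {
    // Keep the first definition seen; identical definitions collapse into one.
    if (!base.definedIn && other.definedIn)
        base.definedIn = other.definedIn;
    // Union the declaration locations without duplicates.
    for (auto const& file : other.declaredIn) {
        bool known = false;
        for (auto const& seen : base.declaredIn)
            known = known || seen == file;
        if (!known)
            base.declaredIn.push_back(file);
    }
}

// Combine the per-translation-unit results into one map keyed by ID.
inline std::map<std::string, ToyInfo>
buildCorpus(std::vector<std::vector<ToyInfo>> const& perTU) {
    std::map<std::string, ToyInfo> corpus;
    for (auto const& tu : perTU) {
        for (auto const& info : tu) {
            auto [it, inserted] = corpus.try_emplace(info.id, info);
            if (!inserted)
                mergeInto(it->second, info);
        }
    }
    return corpus;
}
```

In this sketch, a function declared in one translation unit and defined in another ends up as a single entry that carries both the declaration locations and the definition.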

== The Corpus

After AST traversal and `Info` merging, the result is stored as a map of `Info`s
indexed by their respective `SymbolID`s. Documentation generators may traverse
this structure by calling `Corpus::traverse` with a `Corpus::Visitor` derived
visitor and the `SymbolID` of the entity to visit (e.g. the global namespace).
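
The shape of such a traversal can be sketched as follows. This is not the MrDocs interface: the notes above do not show the signature of `Corpus::traverse` or `Corpus::Visitor`, so `ToyCorpus`, `ToyVisitor`, `ToyEntity`, and `NameCollector` are hypothetical names, and the sketch assumes the membership graph has no cycles.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; the real `Corpus`, `Info`, and `SymbolID` differ.
using ToySymbolID = std::string;

struct ToyEntity {
    std::string name;
    std::vector<ToySymbolID> members; // IDs of directly nested entities
};

struct ToyVisitor {
    virtual ~ToyVisitor() = default;
    // Return false to stop the traversal early.
    virtual bool visit(ToyEntity const& entity) = 0;
};

class ToyCorpus {
public:
    void insert(ToySymbolID id, ToyEntity entity) {
        entities_.insert_or_assign(std::move(id), std::move(entity));
    }

    // Depth-first walk starting at `id`; unknown IDs are skipped.
    bool traverse(ToyVisitor& visitor, ToySymbolID const& id) const {
        auto it = entities_.find(id);
        if (it == entities_.end())
            return true;
        if (!visitor.visit(it->second))
            return false;
        for (auto const& member : it->second.members)
            if (!traverse(visitor, member))
                return false;
        return true;
    }

private:
    std::map<ToySymbolID, ToyEntity> entities_;
};

// Example visitor that records entity names in visiting order.
struct NameCollector : ToyVisitor {
    std::vector<std::string> names;
    bool visit(ToyEntity const& entity) override {
        names.push_back(entity.name);
        return true;
    }
};
```

A generator in this style would start at the ID of the global namespace and emit output from inside `visit`.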

== Namespaces

Namespaces do not have a source location.
This is because a namespace can be reopened any number of times, so no single location identifies it.
We probably don't want to store any javadocs for namespaces either.

== Paths

The AST visitor and metadata all use forward slashes to represent file
pathnames, even on Windows. This is so the generated reference documentation
does not vary based on the platform.
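
A minimal sketch of this kind of normalization is shown below. `toForwardSlashes` is a hypothetical helper and not the MrDocs implementation. It also assumes that every backslash is a separator, which holds for Windows paths but not for POSIX file names that happen to contain a backslash.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper, not the MrDocs implementation: rewrite every
// backslash as a forward slash so paths look the same on all platforms.
inline std::string toForwardSlashes(std::string path) {
    std::replace(path.begin(), path.end(), '\\', '/');
    return path;
}
```

Paths that already use forward slashes pass through unchanged, so the helper can be applied unconditionally.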

== Exceptions

Errors thrown by the program should always have type `Exception`. Objects
of this type are capable of transporting an `Error` object. This is important
for scripting to work: exceptions are used to propagate errors from
library code to scripts and back to the invoking code. In exceptional cases,
these exceptions should be left uncaught. The tool installs an uncaught exception
handler that prints a stack trace and exits the process immediately.
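
The convention of a single exception type that carries an error object can be sketched as follows. `ToyError`, `ToyException`, and `parsePositive` are hypothetical names used only for illustration. The real `Error` and `Exception` types are not shown in these notes and may look quite different, and the uncaught exception handler is omitted.

```cpp
#include <exception>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real `Error` and `Exception` types differ.
class ToyError {
public:
    explicit ToyError(std::string message) : message_(std::move(message)) {}
    std::string const& message() const noexcept { return message_; }

private:
    std::string message_;
};

// The only type that is ever thrown; it transports a ToyError.
class ToyException : public std::exception {
public:
    explicit ToyException(ToyError error) : error_(std::move(error)) {}
    ToyError const& error() const noexcept { return error_; }
    char const* what() const noexcept override {
        return error_.message().c_str();
    }

private:
    ToyError error_;
};

// Library code reports failure by throwing the single exception type.
inline int parsePositive(int value) {
    if (value <= 0)
        throw ToyException(ToyError("value must be positive"));
    return value;
}
```

Because every failure arrives as the same type, calling code needs only one `catch` clause to recover the transported error.
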
== Why automate documentation?

{cpp} API design is challenging.
When you write a function signature, or declare a class, it is at that moment when you are likely to have as much information as you will ever have about what it is supposed to do.
The more time passes before you write the documentation, the less likely you are to remember all the details of what motivated the API in the first place.

In other words, because both the best and the worst engineers are naturally lazy, the *temporal adjacency* of the {cpp} declaration to the documentation improves outcomes.
For this reason, having the documentation be as close as possible to the declaration is ideal.
That is, the *spatial adjacency* of the {cpp} declaration to the documentation improves outcomes because it facilitates temporal adjacency.

In summary, the automated extraction of reference documentation from {cpp} declarations containing attached documentation comments is ideal because:

* Temporally adjacent documentation is more comprehensive
* Spatially adjacent documentation requires less effort to maintain
* And causally connected documentation is more accurate

From the perspective of engineers, however, the biggest advantage of automated documentation is that it implies a single source of truth for the API at a low cost.
Unfortunately, {cpp} codebases are notoriously difficult to document automatically because of constructs where the code needs to diverge from the intended API that represents the contract with the user.

Tools like Doxygen typically require workarounds and manual intervention via preprocessor macros to get the documentation right.
These workarounds are problematic because they effectively mean that there are two versions of the code: the well-formed and the ill-formed versions.
This subverts the single source of truth for the code.
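
The following sketch shows what such a workaround commonly looks like. The names are assumptions for illustration only: `DOCS_BUILD` stands for a macro that only the documentation tool would define, `__implementation_defined__` is a placeholder token, and `make_range` and `detail::range_impl` are hypothetical.

```cpp
namespace detail {
// Implementation detail that users are not meant to name.
struct range_impl {
    int first;
    int last;
};
} // namespace detail

// `DOCS_BUILD` is a hypothetical macro that only the documentation tool defines.
#ifdef DOCS_BUILD
// Ill-formed version: it exists only so the reference shows the intended API.
/** Return a range over [first, last). */
__implementation_defined__ make_range(int first, int last);
#else
// Well-formed version: this is what the compiler actually sees.
/** Return a range over [first, last). */
inline detail::range_impl make_range(int first, int last) {
    return detail::range_impl{first, last};
}
#endif
```

The two branches must be kept in sync by hand, which is exactly the duplication that a single source of truth is meant to avoid.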

|===
| | MrDocs | Automatic | Manual | No Reference

| No workarounds | ✅ | ❌ | ✅ | ❌
| Nice for users | ✅ | ✅ | ✅ | ❌
| Single Source of Truth | ✅ | ❌ | ❌ | ❌
| Less Work for Developers | ✅ | ❌ | ❌ | ✅
| Always up-to-date | ✅ | ✅ | ❌ | ❌
|===

* When no reference is provided, users are forced to read header files to understand the API.
This is a problem because header files are not always the best source of truth for the API.
Developers familiar with https://cppreference.com[cppreference.com,window=_blank] will know that the documentation there is often more informative than the header files.
* A common alternative is to provide a manual reference to the API.
Developers duplicate the signatures, which requires extra work.
This strategy tends to work well for small libraries and allows the developer to directly express the contract with the user.
However, as the single source of truth is lost, it becomes unmanageable as the codebase grows.
When a declaration changes, developers forget to edit the docs, and the reference becomes out of date.
* Given these problems, the best solution appears to be automating the documentation.
The causal connection between the declaration and the generated documentation improves outcomes.
However, this is where developers face difficulties with existing tools like Doxygen, which require workarounds and manual intervention to get the documentation right.
The workarounds mean that there are two versions of the code: the well-formed and the ill-formed versions.
* The philosophy behind MrDocs is to provide solutions to these workarounds so that the single source of truth can be maintained with minimal effort by developers.
With MrDocs, the documented {cpp} code should be the same as the code that is compiled.

== Customization

Once the documentation is extracted from the code, it is necessary to format it in a way that is useful to the user.
This usually involves generating output files that match the content of the documentation to the user's needs.

A limitation of existing tools like Doxygen is that their output formats are not very customizable.
Doxygen supports minor customization of the output style but, for custom content formats, requires much more complex workflows, like generating XML files and writing secondary applications to process them.

MrDocs attempts to support multiple output formats and customization options.
In practice, MrDocs attempts to provide three levels of customization:

* At the first level, the user can use an existing generator and customize its templates and helper functions.
* At the second level, the user can write a MrDocs plugin that defines a new generator.
* At the third level, the user can use a generator for a structured file format and consume the output in a simpler secondary application or script.

These multiple levels of complexity mean developers can worry only about the documentation content.
In practice, developers rarely need new generators and are usually only interested in changing how an existing generator formats the output.
Removing the necessity of writing and maintaining a secondary application only to customize the output via XML files can save these developers an immense amount of time.

== {cpp} Constructs

To deal with the complexity of {cpp} constructs, MrDocs uses Clang's LibTooling API.
That means MrDocs understands all {cpp}: if clang can compile it, MrDocs knows about it.

Here are a few constructs MrDocs attempts to understand and document automatically from the metadata:

* Overload sets
* Private APIs
* Unspecified Return Types
* Concepts
* Typedef / Aliases
* Constants
* Automatic Related Types
* Macros
* SFINAE
* Hidden Base Classes
* Hidden EBO
* Niebloids
* Coroutines
