Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[api-documenter] Fix ae-ambiguous-ids meta issue #1308

Open
Tracked by #290
octogonz opened this issue Jun 2, 2019 · 17 comments
Open
Tracked by #290

[api-documenter] Fix ae-ambiguous-ids meta issue #1308

octogonz opened this issue Jun 2, 2019 · 17 comments
Labels
ae-ambiguous-ids This issue belongs to a set of bugs tracked by Pete bug Something isn't working as intended

Comments

@octogonz
Copy link
Collaborator

octogonz commented Jun 2, 2019

This issue tracks the fix for a set of related bugs tagged by ae-ambiguous-ids. These issues all involve IDs that are generated using a surjective mapping that results in naming collisions (e.g. a file being ovewritten) or instability (e.g. a hyperlink being broken after a SemVer MINOR change).

The problem was summarized in #1252 (comment):

we need a naming pattern that ideally would satisfy several requirements:

  1. be unambiguous, i.e. provide a one-to-one mapping for all possible inputs
  2. be stable, e.g. adding a new function overload should not cause the UID to change for a preexisting function overload (since that could lead to broken hyperlinks)
  3. look "nice" to humans (particularly for the name field)

(Elsewhere I pointed out that the TSDoc declaration reference notation meets 1 and 2, but perhaps not 3.)

Examples of problematic TypeScript constructs:

  • a static member with the same name as an instance member
  • a function with multiple overloads
  • a merged declarations (e.g. an enum X { ... } alongside namespace X { ... })
  • an indexer (defined using symbols without any name) (e.g. [key: string] : string; as a member of an interface)

Certain IDs are case-insensitive, which means that e.g. noTallPeople and NotAllPeople may collide.

And here's a summary of the specific kinds of IDs that we need to solve this for:

  • Markdown URLs: case insensitive, no special characters
  • DocFX YAML URLs: case insensitive, no special characters
  • DocFX YAML IDs: case sensitive, allows some special characters
  • Titles that appear in docs: needs to look nice to a human
@octogonz
Copy link
Collaborator Author

octogonz commented Jun 2, 2019

Here's a pathological input that illustrates most of the sources of ambiguity, along with the corresponding TSDoc declaration reference expressions that would uniquely identify these members (without relying on @label tags):

// TSDoc: '(A:class)'
export class A {
  // TSDoc: '(A:class).(:constructor,0)'
  public constructor(c: number, d: number);
  // TSDoc: '(A:class).(:constructor,1)'
  public constructor(c: string);
  // not visible API
  public constructor(c: number|string, d?: number) {
  }

  // TSDoc: '(A:class).(constructor:instance)'
  public "constructor": number; // TypeScript keyword confusingly used as member name

  // TSDoc: '(A:class).("(constructor:instance)":instance)'
  public "(constructor:instance)": number; // TSDoc expression confusingly used as member name

  // TSDoc: '(A:class).(memberA:instance,0)'
  public memberA(e: number, f: number): void;
  // TSDoc: '(A:class).(memberA:instance,1)'
  public memberA(e: string): void;
  // not visible API
  public memberA(e: number|string, f?: number): void {
  }

  // TSDoc: '(A:class).(memberA:static,0)'
  public static memberA(g: boolean): void { // static member conflicting with instance member
  }

  // TSDoc: '(A:class).(MemberA:static,0)'
  public static MemberA(): void {  // differs by case only
  }
}

// TSDoc: '(B:namespace)'
export namespace B {
  // TSDoc: '(B:namespace).simpleSymbol'
  export const simpleSymbol: unique symbol = Symbol("B.simpleSymbol");
}

// TSDoc: '(B:enum)'
export enum B { // merged declaration
  // TSDoc: '(B:enum).One'
  One
}

// TSDoc: '(C:namespace)'
export namespace C {
  // TSDoc: '(C:namespace).(memberC:0)'
  export function memberC(): void {
  }
}

// TSDoc: '(C:interface)'
export interface C { // merged declaration
  // TSDoc: '(C:interface).memberC'
  memberC: number;  // "instance side" conflicts with "static side" from namespace

  // TSDoc: '(C:interface).(:new,0)'
  new(h: number): number;

  // TSDoc: '(C:interface).(:call,0)'
  (i: string): string;

  // TSDoc: '(C:interface).(:index,0)'
  [j: string]: number;

  // TSDoc: '(C:interface).[(B:namespace).simpleSymbol]'
  [B.simpleSymbol](k: boolean, l: string): void;
}

@octogonz
Copy link
Collaborator Author

octogonz commented Jun 2, 2019

  1. be unambiguous, i.e. provide a one-to-one mapping for all possible inputs
  2. be stable, e.g. adding a new function overload should not cause the UID to change for a preexisting function overload (since that could lead to broken hyperlinks)
  3. look "nice" to humans (particularly for the name field)

Thinking about the above code, it's obvious that we can't have both 2 and 3. Maybe we can relax the "stable" requirement like this:

"If a new API member is added that makes an existing ID ambiguous, it's okay to change the ID, but an old ID should never get reassigned to a different target."

For example, if we use ID a.f to refer to a member function, and then we add an overload, it's better to replace the original ID with a.f_1 (breaking hyperlinks pointing to a.f), versus preserving it and having it accidentally point to the wrong overload. This design also ensures that a web server could remember the old URLs and be able to redirect old URLs to the new disambiguated URL.

@rbuckton
Copy link
Member

rbuckton commented Jun 3, 2019

Would it be better to form ids for method signatures based on their parameter types, not their overload order? You see this naming scheme in the .NET documentation for methods with overloads, i.e.:

Adding a net-new overload to an existing overload list would break the "stable" requirement:

declare class Foo {
  method(): void; // (Foo:class).method
  // method(x: string): void;
  method(x: number): void; // (Foo:class).method_1
}

Uncommenting the above line would give that overload the name (Foo:class).method_1, replacing the existing link.

@octogonz
Copy link
Collaborator Author

octogonz commented Jun 3, 2019

Would it be better to form ids for method signatures based on their parameter types, not their overload order? You see this naming scheme in the .NET documentation for methods with overloads

People seemed to like this idea when we discussed it in #881 (comment).

However, by itself it's not a complete solution. It doesn't handle other sources of ambiguity (e.g. static vs instance, character case differences). It doesn't even handle all forms of overloading, for example:

export class Foo {
    /**
     * First overload
     * @param x - a boolean
     */
    public method(x: boolean): void;

    /**
     * Second overload
     * @param x - an array of strings
     */
    public method(x: string[]): void;

    public method(x: boolean | string[]): void {
      // . . .      
    }  
}

Perhaps we can think of ID assignment as a kind of mapping function:

f(item, context, history) --> identifier

The output of f() is an identifier string. If our scenario is a URL, then the identifier could be api/foo.method_1. If our scenario is a doc title, then the identifier could be Foo.method(x) #1. The item is the API declaration that the identifier refers to. The context is the set of other declarations whose mapped identifier must not conflict with item.

We could also consider enhancing API Documenter to remember a history of previous outputs (e.g. saved as a JSON file), which could be consulted by the function to avoid "broken links" (i.e. changing an identifier to something new) and/or "shuffled links" (i.e. reusing an existing identifier to reference a different target). This additional tech might make it easier to provide nice-looking identifiers, while still providing a solution for high traffic websites who are concerned about URLs getting broken or shuffled.

@acchou
Copy link
Contributor

acchou commented Jun 4, 2019

I'd like to register a preference for overloaded functions specifically, to simply merge all of the identifiers that collide into a single file. This is natural (and in my opinion, preferable) for overloads because they are documenting the behavior of a single function that may take different inputs -- duplicating the documentation for each overload is not user friendly.

@octogonz
Copy link
Collaborator Author

octogonz commented Jun 4, 2019

I'd like to register a preference for overloaded functions specifically, to simply merge all of the identifiers that collide into a single file.

@acchou I suspect this depends on the how the overloads are used in your API. We've heard from others that documenting individual overloads was an important need.

FYI we recently started work on an api-documenter.json settings file, so maybe the default handling of overloads could be configurable. (Feel free to open a GitHub issue tracking this, since it's somewhat offtopic here.)

We also proposed to add a new TSDoc tag @partOf that says "this declaration shouldn't get its own page; instead, show it as part of the signature for that declaration over there." (See microsoft/tsdoc#137) Then overloads could be merged like this:

/**
 * Adds some numbers together
 */
declare function add(x: number, y: number): number;

/** {@partOf (add:1)} */
declare function add(x: number, y: number, z: number): number;

@octogonz
Copy link
Collaborator Author

octogonz commented Jun 4, 2019

We could also consider enhancing API Documenter to remember a history of previous outputs (e.g. saved as a JSON file), which could be consulted by the function to avoid "broken links" (i.e. changing an identifier to something new) and/or "shuffled links" (i.e. reusing an existing identifier to reference a different target). This additional tech might make it easier to provide nice-looking identifiers, while still providing a solution for high traffic websites who are concerned about URLs getting broken or shuffled.

It's theoretically possible to provide a "no shuffling" guarantee without resorting to history. But every approach I can come up with leads to awkward, ugly identifiers. It seems that (optionally) tracking history in a JSON file is the right way to go: It's "opt-in" for people who don't care about broken links, and it's relatively easy to incorporate into a mapping algorithm.

Regarding the mapping algorithm, here's some more thoughts about how we can implement f():

  • Each hierarchical declaration should get solved independently. For example, in my code sample, if we're mapping A.MemberA to the URL /api/a.membera.md, the parts A-->a and MemberA-->membera should be calculated independently. I like this because something like /api/a-2.membera-1.md is more understandable than /api/a.membera-6.md.

  • The operation should start with a friendly mapping, e.g. lowering case and discarding bad characters: MemberA --> membera

  • For Unicode, it seems reasonable for the friendly mapping to include any normal writing characters, since this is the year 2019 and most systems now support IRI

  • For non-alphanumeric characters (e.g. public "&^@%#@"(x: string): void {) we should just throw them away rather than trying to encode them

  • Next, we handle any ambiguities introduced by the friendly mapping. This happens by computing f() for all siblings at a given level of the hierarchy, and then checking for collisions.

  • If a collision is encountered, we'll apply a series of disambiguation transformations. Since shuffled links are worse than broken links, when a disambiguation transformation is applied, it should be applied to all conflicting members. For example, if we disambiguate (A:class).(memberA:static,0) by mapping it to /api/a.member-static.md then (A:class).(memberA:instance,0) should also become /api/a.member-instance.md and not simply /api/a.member.md.

The transformations will depend on the particular identifier scenario. Recall that we're tackling 4 scenarios:

  • Markdown URLs
  • DocFX YAML URLs
  • DocFX YAML IDs
  • Titles that appear in docs

Since the URLs come from filenames, they shouldn't use characters that are bad for filenames. And although the f() mappings will be somewhat different, we should aim for the resulting identifiers to be as consistent as possible. Probably I'll create a single PR that implements all 4 mapping functions.

@octogonz
Copy link
Collaborator Author

octogonz commented Jun 4, 2019

What characters should be allowed in the URLs? In #1303 @acchou found that ( ) caused problems for Docusaurus. We also need to be careful with characters that cause problems in filenames.

I found this thread discussing safe filename characters, and it concludes that [0-9a-zA-Z\-._] is the safest set.

Suppose we start with a TSDoc canonical declaration reference like this:

(A:class).(memberA:instance,0)

The ( and ) are mostly redundant (and TSDoc considered making them optional), so let's drop them. We can't use :, but we can replace it with - which is not a valid identifier character. We can also map , to - without introducing ambiguity. Lastly, we'll discard the "index selector" ,0 because f() will reintroduce its own numbering if needed. This converts the above longform expression to a URL like this:

/api/a-class.membera-instance.md

If there are no ambiguities, we can reduce the canonical reference by removing the redundant parts before applying the character substitutions.

Or, if there still ambiguities after replacing characters, let's resolve them by reintroducing numeric suffixes after each dotted part, like this:

/api/a-class-1234.membera-instance-2345.md

Lastly, for the 4 anonymous member kinds (constructor, call, index, new) our friendly name can treat these as regular identifiers (a.constructor) instead of as TSDoc member selectors (a.-constructor).

There's other possible improvements, but this actually seems like a pretty reasonable starting point. BTW I thought about appending function parameter names, but then I realized that it means the URL will break if someone renames a parameter, which maybe isn't worth it. (?)

Here's a fully worked out result of the above algorithm:

// URL '/api/a.md'
export class A {
  // URL '/api/a.constructor-1.md'
  public constructor(c: number, d: number);
  // URL '/api/a.constructor-2.md'
  public constructor(c: string);
  // not visible API
  public constructor(c: number|string, d?: number) {
  }

  // URL '/api/a.constructor-3.md'
  public "constructor": number; // TypeScript keyword confusingly used as member name

  // URL '/api/a.constructorinstance.md'
  public "(constructor:instance)": number; // TSDoc expression confusingly used as member name

  // URL '/api/a.membera-instance-1.md'
  public memberA(e: number, f: number): void;
  // URL '/api/a.membera-instance-2.md'
  public memberA(e: string): void;
  // not visible API
  public memberA(e: number|string, f?: number): void {
  }

  // URL '/api/a.membera-static-1.md'
  public static memberA(g: boolean): void { // static member conflicting with instance member
  }

  // URL '/api/a.membera-static-2.md'
  public static MemberA(): void {  // differs by case only
  }
}

// URL '/api/b-namespace.md'
export namespace B {
  // URL '/api/b-namespace.simplesymbol.md'
  export const simpleSymbol: unique symbol = Symbol("B.simpleSymbol");
}

// URL '/api/b-enum.md'
export enum B { // merged declaration
  // URL '/api/b-enum.one.md'
  One
}

// URL '/api/c-namespace.md'
export namespace C {
  // URL '/api/c-namespace.memberc.md'
  export function memberC(): void {
  }
}

// URL '/api/c-interface.md'
export interface C { // merged declaration
  // URL '/api/c-interface.memberc.md'
  memberC: number;  // "instance side" conflicts with "static side" from namespace

  // URL '/api/c-interface.new.md'
  new(h: number): number;

  // URL '/api/c-interface.call.md'
  (i: string): string;

  // URL '/api/c-interface.index.md'
  [j: string]: number;

  // URL '/api/c-interface.bnamespacesimplesymbol.md'
  [B.simpleSymbol](k: boolean, l: string): void;
}

The handling of ECMAScript symbols may get awkward (e.g. (ClassM2:class).([(WellknownSymbolsM1:namespace).(toNumberPrimitive:variable)]:static) is possible in TSDoc). If we allow ourselves to use [ and ] characters in the file name, that would greatly improve the handling of ECMAScript symbols.

@octogonz
Copy link
Collaborator Author

octogonz commented Jun 4, 2019

Lemme know if you guys like this direction. If so, I can start work on a PR when I get some time.

It's tempting to put the mapping implementation in api-extractor-model so it can be reused by other tools besides API Documenter...

@octogonz
Copy link
Collaborator Author

octogonz commented Jun 4, 2019

Would it be better to form ids for method signatures based on their parameter types, not their overload order?

@rbuckton I just realized I misread your sentence as "parameter names". Parameter types won't work in TypeScript because the parameter types can be arbitrarily large expressions, whereas in C# they are generally a simple dotted name.

Examples from real code:

export function mergeStyles(...args: (IStyle | IStyleBaseArray | false | null | undefined)[]): string;

export function mergeStyleSets<
  TStyleSet1 extends IStyleSet<TStyleSet1>,
  TStyleSet2 extends IStyleSet<TStyleSet2>,
  TStyleSet3 extends IStyleSet<TStyleSet3>,
  TStyleSet4 extends IStyleSet<TStyleSet4>
>(
  styleSet1: TStyleSet1 | false | null | undefined,
  styleSet2: TStyleSet2 | false | null | undefined,
  styleSet3: TStyleSet3 | false | null | undefined,
  styleSet4: TStyleSet4 | false | null | undefined
): IProcessedStyleSet<TStyleSet1 & TStyleSet2 & TStyleSet3 & TStyleSet4>;

export function serializeRuleEntries(ruleEntries: { [key: string]: string | number }): string;

@rbuckton
Copy link
Member

rbuckton commented Jun 5, 2019

The operation should start with a friendly mapping, e.g. lowering case and discarding bad characters: MemberA --> membera

What about cases where this causes a collision, since JavaScript/TypeScript property names are case sensitive?

For non-alphanumeric characters (e.g. public "&^@%#@"(x: string): void {) we should just throw them away rather than trying to encode them.

Not opposed, but this can introduce collisions as well.

For methods, docfx is very specific about how uids for methods should be derived for ManagedReference documents (joining the uids of the parameter types). It seems best to follow suit to avoid the shuffled links concern.

Next, we handle any ambiguities introduced by the friendly mapping. This happens by computing f() for all siblings at a given level of the hierarchy, and then checking for collisions.

I concur with this, as it is something I had to implement myself in a custom YamlDocumenter for one of my projects: https://github.com/esfx/esfx/blob/master/scripts/docs/yamlDocumenter.js#L601

Keep in mind that the canonical UID you generate does not necessarily have to be a valid path/URL, but you could instead pass it through another function to generate a valid URL. Also, URLs already have an encoding strategy for invalid characters.

[…] replace it with - which is not a valid identifier character […]

This worries me because it is perfectly valid in a quoted method/property name. You would need some mechanism for "quoting" a name with invalid characters to avoid collisions. Another option is to double up - characters in a name:

  • a() -> a-method
  • "a-method"() -> a--method-method

[…] BTW I thought about appending function parameter names, […]

I would argue for parameter types rather than names, as that is the DocFX standard for ManagedReference documents.

Another way to deal with the shuffled links concern would be to have something like a @uid foo tag that could be used to explicitly set the uid (possibly for historical purposes).

Parameter types won't work in TypeScript because the parameter types can be arbitrarily large expressions, whereas in C# they are generally a simple dotted name.

C# UIDs can be quite large as well. (IStyle|IStyleBaseArray|false|null|undefined)[] isn't a very large UID compared to System.Collections.Generic.Dictionary{System.String,System.Collections.Generic.List{System.Int32}} (taken from the DocFX documentation).

For some cases, an arbitrarily complex type (such as an anonymous type) is better broken down into an alias and a reference:

  
  - uid: some-package.some-class.some-method
    name: method()
    fullName: method()
    syntax:
      return:
        type:
        - some-package.some-class.some-method.anonymous-0
  ...
references:
  - uid: some-package.some-class.some-method.anonymous-0
    name.typeScript: '{ a: A, b: B, c: C, d: D }`
    fullName.typeScript: '{ a: A, b: B, c: C, d: D }`
    spec.typeScript:
    - name: '{ a: '
      fullName: '{ a: '
    - uid: A
      name: A
      fullName: A
    

@rbuckton
Copy link
Member

rbuckton commented Jun 5, 2019

Regarding the parameter types and complex aliases, I posted a comment on an issue about complex types with additional context here: #867 (comment)

@postspectacular
Copy link

I understand this all is still in flux, but may I ask what is the currently supported syntax to link to an overloaded function? It seems there's some missing information and discrepancies between different parts of the documentation and the actual tooling/implementation. E.g. using the canonical references created by api-extractor in a {@link} tag cause the warnings for ambiguous references to disappear, but then still no link at all is showing up in the generated markdown...

Example:

Generated .api.json snippet:

{
  "kind": "Function",
  "canonicalReference": "@thi.ng/arrays!quickSort:function(1)",
  ...
}

Neither of these work (excluded from api-documenter markdown output):

  • {@link @thi.ng/arrays!quickSort:function(1)}
  • {@link @thi.ng/arrays#quickSort:function(1)}

Any (interim) advice is highly appreciated! Thanks!

@octogonz
Copy link
Collaborator Author

@postspectacular Let's discuss your question with #1240 since it's really about @link syntax, which is not ambiguous (in either of the two versions of the notation heheh).

@kasperisager
Copy link

Is there any news on this or something I can help with to move things along?

@octogonz
Copy link
Collaborator Author

@kasperisager which subproblem were you interested in?

@kasperisager
Copy link

This issue is affecting us quite a bit in our attempts to land Siteimprove/alfa#290, especially in relation to #1830 and #1921 that both seem to be manifestations of this underlying issue. I don't believe there are any patterns in the Alfa codebase that haven't already been outlined in the previous comments, so there's probably nothing new I can add on that front.

Also, I do apologise for this new "Tracked in [...]" notice now appearing fairly prominently on all the Rushstack issues we've referenced from Siteimprove/alfa#290. I don't know why GitHub all of a sudden decided it would be a good idea to broadcast that to the world 🙈

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ae-ambiguous-ids This issue belongs to a set of bugs tracked by Pete bug Something isn't working as intended
Projects
Status: AE/AD
Development

Successfully merging a pull request may close this issue.

5 participants