explicit ID generation for easier IC #15559

Araq · 2020-10-12T20:29:06Z

Changing the ID generation mechanism for an easier IC-implementation.

disruptek · 2020-10-12T23:53:09Z

This is pretty destructive, though, no?

Araq · 2020-10-13T01:08:56Z

If that's what it takes to make it all work out, it's totally acceptable. Arguably it's cleaner this way, dependencies are clearer.

cooldome · 2020-10-13T14:52:55Z

I like it, it will be also very useful for future multithreading handling.
With --gc:arc -d:useMalloc working for compiler I plan to start looking at multithreading in sempass and codegen. Per module threading in sempass and per proc threading in codegen.

disruptek · 2020-10-13T19:30:36Z

You're still planning on adding documentation to the new procs, right? 😉

disruptek · 2020-10-13T20:29:19Z

The more I read this, the less I like. Why not just use a uuid? The code breaks mangling in pretty fundamental ways; it even breaks my refactor to "fix" the outstanding mangling problems.

In fact, it's not identity generation that is the problem, as far as I can tell. Can you frame the purpose of this PR and how it relates to IC?

Araq · 2020-10-13T20:54:24Z

I don't know yet if it simplifies IC all that much, but here is my reasoning:

IC is about tracking inter-module dependencies precisely and also about stable interfaces between modules. Both properties are handled well by changing the ID from a simple auto-increment number to a pair of numbers, (moduleId, itemId).

Frontend

When we serialize a module M to disk we want to store everything that belongs to M in a binary file -- that is a simple for entity: if entity.moduleId == currentModuleId check, previously it was a fragile traversal of the .owner chain. Entities that are referenced that have a different moduleId are "foreign" references which refer to a different module. We need to resolve foreign references lazily in order to be able to "cut" the set modules into the set of modules that are to be recompiled and the set that didn't change.

Backend

The backend works on a set of binary files, the backend is told to produce fresh C code for the modules with IDs x, y, z, it knows exactly which symbols and types are affected and does its job. No need for special logic for generic instantiations etc, it all works because we have a precise serialized graph structure.

Araq · 2020-10-13T21:02:59Z

Also notice how close the idgen: IdGenerator is to the "witness" idea that I proposed and how easy name mangling could be: $symbolname_$modulename_$itemid is stable, even though still less than ideal for readability and debugging support.

disruptek · 2020-10-13T21:12:34Z

What would handle the requirement well would be a value that represents the interface. Then we can test if the interface is met by the new code. This sounds like a hash tree to me.

The problem of serializing the graph is already solved with frosty and the mangling branch already demonstrates finding source modules for generic instantiations. If that can be improved (somehow?) by this code, then that's a win, but so far, it feels like a minor one.

Stable IDs don't solve the mangling problem because the[y] are not sufficient to establish equivalence; witness proc and tuple types. They are "anonymous" and what is important to us about them is their signature in the current code -- another word for interface -- which we currently don't have a solution for, and which this change in particular does not address.

Since IC happens at top-level statements, to me the pain point is cloning the whole graph just to compile one statement. That's the bit I'd like a better solution for. A hash tree would let us skip the graph clone with the realization that no mutation will impact any interface.

I was trying to get @Clyybber interested in adding a generic Witness concept to the compiler but he didn't seem convinced. I think it's a pattern we need to exploit more generally...

disruptek · 2020-10-13T21:13:31Z

If the owner traversal is fragile, shouldn't we

remove owner, or
fix owner

instead?

timotheecour · 2020-10-14T01:10:47Z

a few points regarding IC and mangling:

C++ and D use a mangling based not on incremented id but on type information, however I'm not sure it's simpler/better since we can use a lookup table Table[string, PSym] for current scheme
I'd still prefer IC implemented similarly to nimsuggest, using a compiler as a service, which avoids the need for serialization; see for eg markClientsDirty and transitiveClosure to compute what needs to be marked as dirty and recompiled. One objection i heard (nimsuggest leaks) shouldn't be too hard to address (IIRC I found a leak in nimsuggest, need to search through my notes)
one un-tried idea I've had for backend wrt IC is using weak symbols (https://en.wikipedia.org/wiki/Weak_symbol); it'd be for re-linking after (say) a single symbol was recompiled; a weak symbol can be re-defined without leading to multiple definition error during linking; I haven't tried that yet though
in D, IIRC dmd uses custom code directly manipulating object files in the backend code, and with some simple hacking re-defining an symbol is possible in this case without leading to multiple definition link error nor using weak symbols; one can write some code that recompiles just 1 D module (say after some code change), and relink to a .a library file, and finally re-generate a binary without having to recompile the other files; it leads to fast recompilation

disruptek · 2020-10-14T01:36:42Z

a few points regarding IC and mangling:

C++ and D use a mangling based not on incremented id but on type information, however I'm not sure it's simpler/better since we can use a lookup table Table[string, PSym] for current scheme

Sounds the same as the current scheme to me.

I'd still prefer IC implemented similarly to nimsuggest, using a compiler as a service, which avoids the need for serialization; see for eg markClientsDirty and transitiveClosure to compute what needs to be marked as dirty and recompiled. One objection i heard (nimsuggest leaks) shouldn't be too hard to address (IIRC I found a leak in nimsuggest, need to search through my notes)

Implementing a compiler server is comparatively trivial, though; I'd prefer to do this as a separate project later. Especially since some recent changes have broken my out-of-compiler compiler (dust).

one un-tried idea I've had for backend wrt IC is using weak symbols (https://en.wikipedia.org/wiki/Weak_symbol); it'd be for re-linking after (say) a single symbol was recompiled; a weak symbol can be re-defined without leading to multiple definition error during linking; I haven't tried that yet though

If I'm reading the article correctly, this buys us just one recompile. I would sooner use symbol versioning, but in neither case is the loss of portability acceptable, so I'd say these cannot be considered.

in D, IIRC dmd uses custom code directly manipulating object files in the backend code, and with some simple hacking re-defining an symbol is possible in this case without leading to multiple definition link error nor using weak symbols; one can write some code that recompiles just 1 D module (say after some code change), and relink to a .a library file, and finally re-generate a binary without having to recompile the other files; it leads to fast recompilation

This seems to be less performant than our current approach, which uses a much smaller scope; single statements means fewer dependencies. One thing that's interesting about the D approach is the idea of naming the libraries according to their interfaces so we can fragment the compiled cache arbitrarily and reuse stuff that isn't even in the cache. I don't think @Araq cares much about C compiler artifacts, though.

timotheecour · 2020-10-14T06:03:03Z

Sounds the same as the current scheme to me.

I don't see how; C++ and D use a reversible encoding eg _ZN9wikipedia7article8print_toERSo and _D4test4findFPxaiZPxa can be reversed via c++filt and dedemangle, without a lookup table. AFAIK nim needs ndi files or in memory lookup table, see writeMangledName(m.ndi, s, m.config). Note that in D the mangling is not so good and (at least till recently) had exponential growth wrt template instantiation depth, see https://dlang.org/blog/2017/12/20/ds-newfangled-name-mangling/

all in all, I'm not convinced that nim's current way is inferior though, and it generates smaller mangled symbols (at expense of requiring a lookup table to demangle/debug), and the question is whether IC can handle it (in particular deduping 2 mangled symbols after a edit-recompilation could be done via lookup table)

@Araq IIUC this PR breaks typetraits.getTypeId from new: typetraits.getTypeId #13305, can getTypeId be implemented after this PR? (it really was not clear to me that some hypothetical signatureHash(getTypeImpl(t)) would solve this; getTypeId was solving real problems in a very simple way

disruptek · 2020-10-14T14:13:49Z

The question isn’t whether IC can reverse mangling because once you’ve spent any time reading the new mangler output, you no longer care about such concerns.

…solution here)

Araq · 2020-10-25T07:50:10Z

Unrelated CI failures, merging now as other PRs depend on it.

(cherry picked from commit 1996614dc58c5dd9943986329c3edc1900fd7167)

* refactoring: idents don't need inheritance * refactoring: adding an IdGenerator (part 1) * refactoring: adding an IdGenerator (part 2) * refactoring: adding an IdGenerator (part 3) * refactoring: adding an IdGenerator (part 4) * refactoring: adding an IdGenerator (part 5) * refactoring: adding an IdGenerator (part 5) * IdGenerator must be a ref type; hello world works again * make bootstrapping work again * progress: add back the 'exactReplica' ideas * added back the missing exactReplica hacks * make tcompilerapi work again * make important packages green * attempt to fix the build for 32 bit machines (probably need a better solution here)

Araq added 10 commits October 11, 2020 12:11

refactoring: removed cmdlinehelper.mainCommand callback

0af2be3

refactoring: moved setOutFile to where it belongs

6783e1d

refactoring: idents don't need inheritance

616a830

refactoring: adding an IdGenerator (part 1)

8c4eb5c

refactoring: adding an IdGenerator (part 2)

9b1d376

refactoring: adding an IdGenerator (part 3)

204e607

refactoring: adding an IdGenerator (part 4)

1ead4f9

refactoring: adding an IdGenerator (part 5)

06eafe5

refactoring: adding an IdGenerator (part 5)

4fc6ad9

progress

5ed0e50

Araq added 3 commits October 13, 2020 03:45

progress

2cbcab8

IdGenerator must be a ref type; hello world works again

2cc9667

progress

7c8b78a

Araq added 4 commits October 17, 2020 11:05

Merge branch 'devel' into araq-ic3

13f2d56

Merge branch 'devel' into araq-ic3

1dbbb9a

make bootstrapping work again

ca00ad6

Merge branch 'devel' into araq-ic3

1bdb6a5

Araq added 4 commits October 23, 2020 21:09

progress: add back the 'exactReplica' ideas

d94939c

added back the mising exactReplica hacks

f28827d

make tcompilerapi work again

a51f6ef

make important packages green

e6b352d

Araq changed the title ~~WIP, experiment, do not merge~~ explicit ID generation for easier IC Oct 24, 2020

Araq added 2 commits October 24, 2020 18:03

Merge branch 'devel' into araq-ic3

b78a2a3

attempt to fix the build for 32 bit machines (probably need a better …

e126e68

…solution here)

Araq merged commit 2265955 into devel Oct 25, 2020

Araq deleted the araq-ic3 branch October 25, 2020 07:50

disruptek added a commit to disruptek/Nim that referenced this pull request Oct 25, 2020

adapt to nim-lang#15559

5c15f74

(cherry picked from commit 1996614dc58c5dd9943986329c3edc1900fd7167)

disruptek added a commit to disruptek/Nim that referenced this pull request Jan 2, 2021

adapt to nim-lang#15559

5f0b60e

(cherry picked from commit 1996614dc58c5dd9943986329c3edc1900fd7167)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

explicit ID generation for easier IC #15559

explicit ID generation for easier IC #15559

Araq commented Oct 12, 2020 •

edited

Loading

disruptek commented Oct 12, 2020

Araq commented Oct 13, 2020

cooldome commented Oct 13, 2020

disruptek commented Oct 13, 2020

disruptek commented Oct 13, 2020

Araq commented Oct 13, 2020 •

edited

Loading

Araq commented Oct 13, 2020

disruptek commented Oct 13, 2020 •

edited

Loading

disruptek commented Oct 13, 2020

timotheecour commented Oct 14, 2020

disruptek commented Oct 14, 2020

timotheecour commented Oct 14, 2020 •

edited

Loading

disruptek commented Oct 14, 2020

Araq commented Oct 25, 2020

explicit ID generation for easier IC #15559

explicit ID generation for easier IC #15559

Conversation

Araq commented Oct 12, 2020 • edited Loading

disruptek commented Oct 12, 2020

Araq commented Oct 13, 2020

cooldome commented Oct 13, 2020

disruptek commented Oct 13, 2020

disruptek commented Oct 13, 2020

Araq commented Oct 13, 2020 • edited Loading

Frontend

Backend

Araq commented Oct 13, 2020

disruptek commented Oct 13, 2020 • edited Loading

disruptek commented Oct 13, 2020

timotheecour commented Oct 14, 2020

disruptek commented Oct 14, 2020

timotheecour commented Oct 14, 2020 • edited Loading

disruptek commented Oct 14, 2020

Araq commented Oct 25, 2020

Araq commented Oct 12, 2020 •

edited

Loading

Araq commented Oct 13, 2020 •

edited

Loading

disruptek commented Oct 13, 2020 •

edited

Loading

timotheecour commented Oct 14, 2020 •

edited

Loading