Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

explicit ID generation for easier IC #15559

Merged
merged 23 commits into from
Oct 25, 2020
Merged

explicit ID generation for easier IC #15559

merged 23 commits into from
Oct 25, 2020

Conversation

Araq
Copy link
Member

@Araq Araq commented Oct 12, 2020

Changing the ID generation mechanism for an easier IC-implementation.

@disruptek
Copy link
Contributor

This is pretty destructive, though, no?

@Araq
Copy link
Member Author

Araq commented Oct 13, 2020

If that's what it takes to make it all work out, it's totally acceptable. Arguably it's cleaner this way, dependencies are clearer.

@cooldome
Copy link
Member

I like it, it will be also very useful for future multithreading handling.
With --gc:arc -d:useMalloc working for compiler I plan to start looking at multithreading in sempass and codegen. Per module threading in sempass and per proc threading in codegen.

@disruptek
Copy link
Contributor

You're still planning on adding documentation to the new procs, right? 😉

@disruptek
Copy link
Contributor

The more I read this, the less I like. Why not just use a uuid? The code breaks mangling in pretty fundamental ways; it even breaks my refactor to "fix" the outstanding mangling problems.

In fact, it's not identity generation that is the problem, as far as I can tell. Can you frame the purpose of this PR and how it relates to IC?

@Araq
Copy link
Member Author

Araq commented Oct 13, 2020

I don't know yet if it simplifies IC all that much, but here is my reasoning:

IC is about tracking inter-module dependencies precisely and also about stable interfaces between modules. Both properties are handled well by changing the ID from a simple auto-increment number to a pair of numbers, (moduleId, itemId).

Frontend

When we serialize a module M to disk we want to store everything that belongs to M in a binary file -- that is a simple for entity: if entity.moduleId == currentModuleId check, previously it was a fragile traversal of the .owner chain. Entities that are referenced that have a different moduleId are "foreign" references which refer to a different module. We need to resolve foreign references lazily in order to be able to "cut" the set modules into the set of modules that are to be recompiled and the set that didn't change.

Backend

The backend works on a set of binary files, the backend is told to produce fresh C code for the modules with IDs x, y, z, it knows exactly which symbols and types are affected and does its job. No need for special logic for generic instantiations etc, it all works because we have a precise serialized graph structure.

@Araq
Copy link
Member Author

Araq commented Oct 13, 2020

Also notice how close the idgen: IdGenerator is to the "witness" idea that I proposed and how easy name mangling could be: $symbolname_$modulename_$itemid is stable, even though still less than ideal for readability and debugging support.

@disruptek
Copy link
Contributor

disruptek commented Oct 13, 2020

What would handle the requirement well would be a value that represents the interface. Then we can test if the interface is met by the new code. This sounds like a hash tree to me.

The problem of serializing the graph is already solved with frosty and the mangling branch already demonstrates finding source modules for generic instantiations. If that can be improved (somehow?) by this code, then that's a win, but so far, it feels like a minor one.

Stable IDs don't solve the mangling problem because the[y] are not sufficient to establish equivalence; witness proc and tuple types. They are "anonymous" and what is important to us about them is their signature in the current code -- another word for interface -- which we currently don't have a solution for, and which this change in particular does not address.

Since IC happens at top-level statements, to me the pain point is cloning the whole graph just to compile one statement. That's the bit I'd like a better solution for. A hash tree would let us skip the graph clone with the realization that no mutation will impact any interface.

I was trying to get @Clyybber interested in adding a generic Witness concept to the compiler but he didn't seem convinced. I think it's a pattern we need to exploit more generally...

@disruptek
Copy link
Contributor

If the owner traversal is fragile, shouldn't we

  1. remove owner, or
  2. fix owner

instead?

@timotheecour
Copy link
Member

a few points regarding IC and mangling:

  • C++ and D use a mangling based not on incremented id but on type information, however I'm not sure it's simpler/better since we can use a lookup table Table[string, PSym] for current scheme
  • I'd still prefer IC implemented similarly to nimsuggest, using a compiler as a service, which avoids the need for serialization; see for eg markClientsDirty and transitiveClosure to compute what needs to be marked as dirty and recompiled. One objection i heard (nimsuggest leaks) shouldn't be too hard to address (IIRC I found a leak in nimsuggest, need to search through my notes)
  • one un-tried idea I've had for backend wrt IC is using weak symbols (https://en.wikipedia.org/wiki/Weak_symbol); it'd be for re-linking after (say) a single symbol was recompiled; a weak symbol can be re-defined without leading to multiple definition error during linking; I haven't tried that yet though
  • in D, IIRC dmd uses custom code directly manipulating object files in the backend code, and with some simple hacking re-defining an symbol is possible in this case without leading to multiple definition link error nor using weak symbols; one can write some code that recompiles just 1 D module (say after some code change), and relink to a .a library file, and finally re-generate a binary without having to recompile the other files; it leads to fast recompilation

@disruptek
Copy link
Contributor

a few points regarding IC and mangling:

  • C++ and D use a mangling based not on incremented id but on type information, however I'm not sure it's simpler/better since we can use a lookup table Table[string, PSym] for current scheme

Sounds the same as the current scheme to me.

  • I'd still prefer IC implemented similarly to nimsuggest, using a compiler as a service, which avoids the need for serialization; see for eg markClientsDirty and transitiveClosure to compute what needs to be marked as dirty and recompiled. One objection i heard (nimsuggest leaks) shouldn't be too hard to address (IIRC I found a leak in nimsuggest, need to search through my notes)

Implementing a compiler server is comparatively trivial, though; I'd prefer to do this as a separate project later. Especially since some recent changes have broken my out-of-compiler compiler (dust).

  • one un-tried idea I've had for backend wrt IC is using weak symbols (https://en.wikipedia.org/wiki/Weak_symbol); it'd be for re-linking after (say) a single symbol was recompiled; a weak symbol can be re-defined without leading to multiple definition error during linking; I haven't tried that yet though

If I'm reading the article correctly, this buys us just one recompile. I would sooner use symbol versioning, but in neither case is the loss of portability acceptable, so I'd say these cannot be considered.

  • in D, IIRC dmd uses custom code directly manipulating object files in the backend code, and with some simple hacking re-defining an symbol is possible in this case without leading to multiple definition link error nor using weak symbols; one can write some code that recompiles just 1 D module (say after some code change), and relink to a .a library file, and finally re-generate a binary without having to recompile the other files; it leads to fast recompilation

This seems to be less performant than our current approach, which uses a much smaller scope; single statements means fewer dependencies. One thing that's interesting about the D approach is the idea of naming the libraries according to their interfaces so we can fragment the compiled cache arbitrarily and reuse stuff that isn't even in the cache. I don't think @Araq cares much about C compiler artifacts, though.

@timotheecour
Copy link
Member

timotheecour commented Oct 14, 2020

Sounds the same as the current scheme to me.

I don't see how; C++ and D use a reversible encoding eg _ZN9wikipedia7article8print_toERSo and _D4test4findFPxaiZPxa can be reversed via c++filt and dedemangle, without a lookup table. AFAIK nim needs ndi files or in memory lookup table, see writeMangledName(m.ndi, s, m.config). Note that in D the mangling is not so good and (at least till recently) had exponential growth wrt template instantiation depth, see https://dlang.org/blog/2017/12/20/ds-newfangled-name-mangling/

all in all, I'm not convinced that nim's current way is inferior though, and it generates smaller mangled symbols (at expense of requiring a lookup table to demangle/debug), and the question is whether IC can handle it (in particular deduping 2 mangled symbols after a edit-recompilation could be done via lookup table)

  • @Araq IIUC this PR breaks typetraits.getTypeId from new: typetraits.getTypeId #13305, can getTypeId be implemented after this PR? (it really was not clear to me that some hypothetical signatureHash(getTypeImpl(t)) would solve this; getTypeId was solving real problems in a very simple way

@disruptek
Copy link
Contributor

The question isn’t whether IC can reverse mangling because once you’ve spent any time reading the new mangler output, you no longer care about such concerns.

@Araq Araq changed the title WIP, experiment, do not merge explicit ID generation for easier IC Oct 24, 2020
@Araq
Copy link
Member Author

Araq commented Oct 25, 2020

Unrelated CI failures, merging now as other PRs depend on it.

@Araq Araq merged commit 2265955 into devel Oct 25, 2020
@Araq Araq deleted the araq-ic3 branch October 25, 2020 07:50
disruptek added a commit to disruptek/Nim that referenced this pull request Oct 25, 2020
(cherry picked from commit 1996614dc58c5dd9943986329c3edc1900fd7167)
disruptek added a commit to disruptek/Nim that referenced this pull request Jan 2, 2021
(cherry picked from commit 1996614dc58c5dd9943986329c3edc1900fd7167)
mildred pushed a commit to mildred/Nim that referenced this pull request Jan 11, 2021
* refactoring: idents don't need inheritance
* refactoring: adding an IdGenerator (part 1)
* refactoring: adding an IdGenerator (part 2)
* refactoring: adding an IdGenerator (part 3)
* refactoring: adding an IdGenerator (part 4)
* refactoring: adding an IdGenerator (part 5)
* refactoring: adding an IdGenerator (part 5)
* IdGenerator must be a ref type; hello world works again
* make bootstrapping work again
* progress: add back the 'exactReplica' ideas
* added back the missing exactReplica hacks
* make tcompilerapi work again
* make important packages green
* attempt to fix the build for 32 bit machines (probably need a better solution here)
irdassis pushed a commit to irdassis/Nim that referenced this pull request Mar 16, 2021
* refactoring: idents don't need inheritance
* refactoring: adding an IdGenerator (part 1)
* refactoring: adding an IdGenerator (part 2)
* refactoring: adding an IdGenerator (part 3)
* refactoring: adding an IdGenerator (part 4)
* refactoring: adding an IdGenerator (part 5)
* refactoring: adding an IdGenerator (part 5)
* IdGenerator must be a ref type; hello world works again
* make bootstrapping work again
* progress: add back the 'exactReplica' ideas
* added back the missing exactReplica hacks
* make tcompilerapi work again
* make important packages green
* attempt to fix the build for 32 bit machines (probably need a better solution here)
ardek66 pushed a commit to ardek66/Nim that referenced this pull request Mar 26, 2021
* refactoring: idents don't need inheritance
* refactoring: adding an IdGenerator (part 1)
* refactoring: adding an IdGenerator (part 2)
* refactoring: adding an IdGenerator (part 3)
* refactoring: adding an IdGenerator (part 4)
* refactoring: adding an IdGenerator (part 5)
* refactoring: adding an IdGenerator (part 5)
* IdGenerator must be a ref type; hello world works again
* make bootstrapping work again
* progress: add back the 'exactReplica' ideas
* added back the missing exactReplica hacks
* make tcompilerapi work again
* make important packages green
* attempt to fix the build for 32 bit machines (probably need a better solution here)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants