# [Hot Cold Splitting] Redesign the hot cold mapping data structure to allow fast lookup #1903
## Design Proposal

### Current Implementation

@cshung's initial prototype for hot/cold splitting in Crossgen2 introduces a new data structure mapping indices of cold runtime functions to indices of their hot counterparts in the runtime function table. This mapping is currently a flat list of (cold runtime function index, hot runtime function index) pairs.
### Problems

Because the mapping is a flat list of index pairs, finding a given function's counterpart requires searching through the list, which is not efficient for lookup.
### Proposal

We propose replacing this list with a bit vector containing one bit per runtime function table entry. At compile time, we set bit *i* to 1 if entry *i* of the runtime function table belongs to a split function (either its hot or its cold part), set it to 0 otherwise, and write the bit vector to the image.
At runtime, we support two operations over the bit vector: rank(*i*), the number of set bits at positions up to and including *i*, and select(*k*), the position of the *k*-th set bit.
Given a function index *i* whose bit is set, rank(*i*) = *r* tells us it is the *r*-th split entry in the table. Because all hot parts precede all cold parts in the same relative order, a hot part's cold counterpart is exactly *S* split entries later, where *S* is the number of split functions (half the total number of set bits). So a hot part at index *i* has its cold part at select(rank(*i*) + *S*); symmetrically, a cold part at index *j* has its hot part at select(rank(*j*) − *S*).
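To make the lookup concrete, below is a minimal C# sketch of the proposed mapping. The naive O(*n*) rank/select helpers are for clarity only (the whole point of the proposal is to make them constant-time), and all names are illustrative rather than taken from the prototype.

```csharp
static class HotColdLookup
{
    // bits[i] == true iff entry i of the runtime function table belongs to a
    // split function (either its hot part or its cold part).
    public static int GetColdIndex(bool[] bits, int hotIndex)
    {
        int splitCount = Rank1(bits, bits.Length - 1) / 2; // half the set bits are hot parts
        int r = Rank1(bits, hotIndex);                     // this hot part is the r-th split entry
        return Select1(bits, r + splitCount);              // its cold part is splitCount entries later
    }

    public static int GetHotIndex(bool[] bits, int coldIndex)
    {
        int splitCount = Rank1(bits, bits.Length - 1) / 2;
        int r = Rank1(bits, coldIndex);
        return Select1(bits, r - splitCount);
    }

    // Naive rank: number of set bits at positions <= i.
    static int Rank1(bool[] bits, int i)
    {
        int ones = 0;
        for (int j = 0; j <= i; j++)
            if (bits[j]) ones++;
        return ones;
    }

    // Naive select: position of the k-th set bit (1-based), or -1 if none.
    static int Select1(bool[] bits, int k)
    {
        for (int j = 0; j < bits.Length; j++)
            if (bits[j] && --k == 0) return j;
        return -1;
    }
}
```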
### Example Walkthrough

Suppose the runtime function table has 5 entries for functions 1, 2, and 3, in the following order: hot1, hot2, hot3, cold1, cold3.
Functions 1 and 3 are split, so indices 0, 2, 3, and 4 in the bit vector should be set to 1, giving the bit vector 10111. We also note that there are 4 set bits in total, so there are 4 / 2 = 2 split functions. We then calculate the rank and select tables as so:

| *i* | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| bit | 1 | 0 | 1 | 1 | 1 |
| rank(*i*) | 1 | 1 | 2 | 3 | 4 |

| *k* | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| select(*k*) | 0 | 2 | 3 | 4 |

Now, suppose we want to find the corresponding cold function to hot1, at index 0. rank(0) = 1, so hot1 is the 1st split entry, and its cold part is the (1 + 2) = 3rd split entry, at select(3) = 3. Index 3 is indeed cold1.
Let's try going the other way. Suppose we want to find the corresponding hot function to cold3, at index 4. rank(4) = 4, so its hot part is the (4 − 2) = 2nd split entry, at select(2) = 2. Index 2 is indeed hot3.
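Running the walkthrough through the sketch above reproduces these results:

```csharp
// Bit vector for the example: entries 0, 2, 3, and 4 are parts of split functions.
bool[] bits = { true, false, true, true, true };

Console.WriteLine(HotColdLookup.GetColdIndex(bits, 0)); // 3 (hot1 -> cold1)
Console.WriteLine(HotColdLookup.GetColdIndex(bits, 2)); // 4 (hot3 -> cold3)
Console.WriteLine(HotColdLookup.GetHotIndex(bits, 4));  // 2 (cold3 -> hot3)
```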
### Performance

For each function in the runtime function table, we write one bit to file. Thus, for a runtime function table with *N* entries, the bit vector adds only *N* bits (⌈*N*/8⌉ bytes) to the image; a table with 1,000,000 entries, for example, costs about 122 KB.

At runtime, additional space complexity will be introduced by generating rank and select tables (structures whose sizes are linear in the size of the bit vector). However, the initial cost of creating these tables will be quickly offset by the search algorithm's constant time complexity.

### Limitations

This design relies on the current ordering of the runtime function table: all hot parts must come first, with the cold parts following in the same relative order.
Upon approval, @EugenioPena and @amanasifkhalid will lead implementation. @jkotas @trylek @davidwrighton PTAL. Thank you!

---
Are there any alternative designs that you have considered and rejected? It would be useful to see sketches of a few alternatives and why this one was chosen as the winner. Have you looked into leveraging information in the existing unwind and gcinfo?

The lookups may be expensive if there is only a small number of hot/cold split functions, or if there are large clusters of functions that are not split. The lookups do not have to be super fast, but potentially scanning nearly all methods in the image may be too slow.
This is a fine assumption to make.

---
As far as succinct data structures go, I do not know of any others that enable this mapping behavior. I have briefly considered more traditional structures that could enable constant-time lookup on average, but the additional space they would require makes them impractical for writing to file.

I haven't looked into the unwind/GC info blobs in the runtime function table, but if hot/cold function mappings can be inferred from there, we could skip creating a new data structure altogether and get constant-time lookup in the runtime function table. I believe @cshung didn't find enough information there to do so, though.

To somewhat lessen the cost of lookups, we could cache the last search result -- this cache would be at most 8 bytes at runtime. Would scanning all methods to set up the bit vector and rank/select tables be too slow if it is only done once?

---
Is this equivalent to (1) eliminate all non-split hot functions from the list, then (2) for hot function *n* in the reduced list, compute its cold counterpart's index directly from *n*?

Related: does it require the hot and cold functions (in the "runtime function table") to be sorted equivalently? I.e., hot1/hot2/hot3/cold3/cold1 would be an illegal ordering? What are the constraints on the ordering of the runtime functions, and how does that relate to the constraints on the layout of the cold code section? The layout of the cold function fragments or cold functions might not matter (if they are truly cold). But if there is an ordering constraint implied by this design, I wonder if we need to be careful in case we want to eventually have, say, a "definitely cold" and "probably cold" region, depending on how certain we are of the "coldness" of a function or function fragment.

So, if the bit vector always had two bits for every function (one for hot, one for cold), then you wouldn't need to build up the rank/select data at runtime, right?

---
Your implementation sounds valid, but I don't think we can guarantee the new data structure's indices map 1:1 to the runtime function table's indices; for example, if a function is not split, it has no entry in the reduced list, so the reduced list's indices drift away from the runtime function table's.

Yes, this data structure relies on the current order of the runtime function table. If I understand correctly, the runtime function table is sorted by each function's beginning RVA, so the table reflects the order of the file. So hot functions come first in order, followed by their cold counterparts in the same order (i.e. if hot1 precedes hot2, then cold1 precedes cold2).
Sorry, I'm not sure if I follow this. What would each bit indicate for a function?

---
I realize I was neglecting the most important thing here, which is that we have a packed, sorted-by-RVA array of RUNTIME_FUNCTION entries, and we are actually mapping between the hot and cold RUNTIME_FUNCTION entries (correct?). So we can't pad the RUNTIME_FUNCTION table to match a "padded" mapping.

Note, however, that a function or hot/cold function part can have more than one RUNTIME_FUNCTION (see …).

---
Yes, that's correct.
Interesting, thank you for bringing this up. In the context of, say, unwinding, if we are in a cold RUNTIME_FUNCTION and need to get to its corresponding hot RUNTIME_FUNCTION, there will always be only one correct RUNTIME_FUNCTION to jump to, right? As long as we can make pairs of RUNTIME_FUNCTIONs (I apologize, as I likely meant RUNTIME_FUNCTION when previously saying just "function"), I think this approach should still work.

---
Andrew just clarified with me that we don't actually intend to create the rank/select arrays at runtime or compile time (those arrays were more for facilitating the explanation of the algorithm). Rather, we can do some operations on the bit vector itself to calculate rank/select, as explained by Andrew below:

#### Rank

To support rank, we divide the bit vector into chunks of equal size. For each chunk, we store the rank of the last bit of the previous chunk (for the first chunk, we can simply skip this, because rank starts at 0). For each bit, we can easily figure out which chunk it is in (because the chunk size is constant).

#### Select

To support select, we use a similar idea: we divide the bit vector into groups containing equal numbers of ones. Depending on the input, these groups will have the same number of ones but different numbers of zeros, so they have different lengths.
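As an illustration of the chunking idea (with a single level of chunks rather than chunks plus subchunks), here is a C# sketch assuming 64-bit chunks stored least-significant-bit first; the class and member names are hypothetical:

```csharp
using System.Numerics;

sealed class RankDirectory
{
    private readonly ulong[] _bits;      // the bit vector, 64 bits per chunk
    private readonly int[] _chunkRanks;  // number of ones strictly before each chunk

    public RankDirectory(ulong[] bits)
    {
        _bits = bits;
        _chunkRanks = new int[bits.Length];
        int running = 0;
        for (int i = 0; i < bits.Length; i++)
        {
            _chunkRanks[i] = running; // rank of the last bit of the previous chunk
            running += BitOperations.PopCount(bits[i]);
        }
    }

    // rank(pos): ones at positions [0, pos]. One table lookup plus one popcount.
    public int Rank1(int pos)
    {
        ulong mask = ~0UL >> (63 - pos % 64); // keep bits 0..(pos % 64) of the chunk
        return _chunkRanks[pos / 64] + BitOperations.PopCount(_bits[pos / 64] & mask);
    }
}
```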
#### Why do we do this?

These indexes provide us with constant lookup time. Every time we perform an operation, we simply determine which chunk/subchunk (or group/subgroup) the bit falls in and perform just a few table lookups for the solution.

---
The explanation above comes from this video.

---
I've been thinking about this for a bit and I believe there are several things to consider.
In looking at the algorithm, I see that generating the actual bit vector is trivial, and the logical rank/select operations look reasonable to use. The question becomes how to efficiently implement the rank and select algorithms, and whether or not they actually need to be constant-time, or if some sort of tree structure is acceptable instead.

#### Rank data structure

Looking at this, a chunking approach, where we chunk on 32-byte or 64-byte boundaries, makes sense. We can easily create the chunks, and encoding them in the file is, again, trivial (it's just another array parallel to the bit vector). However, the algorithm described above has this concept of subchunks, and I do not see significant value in carrying an extra structure for subchunks. Instead, I would take advantage of the hardware's population count instructions, which would allow (on a 64-bit platform) computing the subchunk rank in something like 24 cycles. Not all hardware we run on actually supports these instructions, so we would need a fallback routine, but their presence is common enough that we can assume the instructions exist.

#### Select data structure

This looks much more complicated and expensive to build, search, and maintain, and the benefit over doing a binary search through the chunk rank array seems minimal. I would suggest considering an algorithm that does the select operation by binary searching the rank chunk array to find the chunk where the select will succeed, and then implementing a bit-scanning algorithm to find the appropriate index.
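A sketch of that suggestion, written as an additional member of the hypothetical RankDirectory class above: binary-search the chunk rank array for the chunk containing the k-th set bit, then bit-scan within it.

```csharp
// select(k): position of the k-th set bit (1-based), or -1 if k is too large.
public int Select1(int k)
{
    // Find the first chunk whose cumulative popcount reaches k.
    int lo = 0, hi = _chunkRanks.Length - 1;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        int throughMid = _chunkRanks[mid] + BitOperations.PopCount(_bits[mid]);
        if (throughMid >= k) hi = mid; else lo = mid + 1;
    }

    // Scan the winning chunk for its (k - _chunkRanks[lo])-th set bit.
    int remaining = k - _chunkRanks[lo];
    for (int bit = 0; bit < 64; bit++)
        if ((_bits[lo] & (1UL << bit)) != 0 && --remaining == 0)
            return lo * 64 + bit;

    return -1;
}
```

---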
Thank you for the detailed feedback, @davidwrighton! I'll do my best to address your points:
I don't have a comprehensive answer for this yet, but I do have some preliminary metrics here that reveal how often the JIT splits code in various SuperPMI collections; note that this is with splitting of EH funclets turned on. On the low end, the JIT split ~14% of functions in the … collection.
You touch on this later, but I agree emitting the rank array to file should be trivial. I initially prioritized minimizing the amount of new data written to the R2R format, but if startup time is our top priority, then it makes sense to write the rank structure to file.
I like your idea of foregoing subchunks in favor of using the PopCount instruction. Just to clarify, do you mean using 32/64-bit chunk sizes, or 32/64-byte chunks?
I also agree the select data structure and algorithm may not be worth their complexity for our purposes. Binary search over the rank structure would be intuitive from a development standpoint, and would save us from creating and loading another data structure. I guess this raises the question of whether there are other data structures capable of binary search that are also worth our consideration, though I do not know of any that are as succinct. With your suggestions in mind, I imagine a workflow like this:

1. At compile time, emit the bit vector and its chunk rank array to the image.
2. At runtime, compute rank from the chunk rank array plus a population count within the chunk.
3. Implement select as a binary search over the chunk rank array, followed by a bit scan within the found chunk.
Thanks again for the feedback! I'm curious to hear what @cshung and @EugenioPena think.

---
@amanasifkhalid I would suggest 32- or 64-byte chunks. With the popcount instruction on x86, or the equivalent NEON instructions on Arm, it becomes extremely efficient to count that number of bits.

As far as more efficient layouts for binary searching go, we could use an Eytzinger layout for the rank array data, but then it becomes more complex to implement the rank algorithm: we want to use the sorted index for the lookup there, and translating from a sorted index into an Eytzinger layout index is non-trivial. My expectation is that we don't run these operations often enough for it to be worth the complexity. At a chunk size of 64 bytes, if there are 1,000,000 entries, then we have about 2000 chunks to search through, and the log base 2 is something like 11, which is probably reasonably cheap.

Alternatively, we could add a PGM index on top of the sorted array. That would be interesting, but again, I doubt that this data structure is hot enough for the complexity of a PGM index to really be worth it. See https://pgm.di.unipi.it/

I suspect that if we actually were to find value in one of these, it would be to address the cost of the lookup of a …

---
Right now, the data structure is simply a list of (cold runtime function index, hot runtime function index) pairs, which is not efficient for lookup. We need to redesign it so that lookups are fast.
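For illustration, the current shape amounts to something like the following, where every lookup has to scan the list; the loader function and names are hypothetical.

```csharp
// A flat list of (cold index, hot index) pairs into the runtime function table.
(int ColdIndex, int HotIndex)[] hotColdMap = LoadHotColdMap(); // hypothetical loader

int? FindHotPart(int coldIndex)
{
    foreach (var (cold, hot) in hotColdMap)
        if (cold == coldIndex)
            return hot; // O(n) scan per lookup
    return null;
}
```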