[SharedCache] All proposed shared cache improvements #2
Conversation
api/MetadataSerializable.hpp is removed in favor of including core/MetadataSerializable.hpp. Both headers defined types with the same name, leading to One Definition Rule violations and surprising behavior.

The serialization and deserialization contexts are now created on demand during serialization rather than being members of `MetadataSerializable`. This reduces the size of every serializable object by ~220 bytes. The context is passed explicitly as an argument to `Serialize` / `Deserialize`. As a result, `Serialize` / `Deserialize` can now be free functions rather than member functions.

Since `MetadataSerializable` is not used for dynamic dispatch, the virtual methods are removed and the class becomes a class template using CRTP. This allows delegating to the derived class's `Load` and `Store` methods without the size overhead of a vtable pointer in every serializable object.

These changes reduce the memory footprint of Binary Ninja after loading the macOS shared cache and a single dylib from it from 8.3GB to 4.6GB.
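A minimal sketch of the CRTP shape this describes (the context types and the `MemoryRegion` fields are illustrative stand-ins, not the plugin's actual definitions):

```cpp
#include <cstdint>

// Hypothetical stand-ins; the real contexts wrap the JSON writer/reader.
struct SerializationContext {};
struct DeserializationContext {};

// CRTP base: no virtual methods, so no vtable pointer in every object.
// Dispatch to the derived class is resolved at compile time.
template <typename Derived>
class MetadataSerializable {
public:
    void Serialize(SerializationContext& context) const {
        static_cast<const Derived*>(this)->Store(context);
    }
    void Deserialize(DeserializationContext& context) {
        static_cast<Derived*>(this)->Load(context);
    }
};

struct MemoryRegion : public MetadataSerializable<MemoryRegion> {
    uint64_t start = 0;
    uint64_t size = 0;

    void Store(SerializationContext&) const { /* write start and size */ }
    void Load(DeserializationContext&) { /* read start and size */ }
};

// Because the context is an explicit argument rather than per-object state,
// serialization can also be exposed as a free function.
template <typename T>
void Serialize(SerializationContext& context, const MetadataSerializable<T>& value) {
    value.Serialize(context);
}
```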
This ensures only one definition ends up in the final binary and makes compilation a little faster.
…o MetadataSerializable
Building up an in-memory representation of the JSON document is expensive in both CPU and memory. Instead of doing that, we can directly write the appropriate types.
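I don't know offhand which JSON library the plugin uses here, but with a rapidjson-style SAX writer the idea looks roughly like this (the function and field names are illustrative):

```cpp
#include <cstdint>
#include <string>

#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>

// Writes values straight into the output buffer as they are visited,
// instead of building a DOM (rapidjson::Document) and serializing that.
std::string SerializeRegion(uint64_t start, uint64_t size, const char* name) {
    rapidjson::StringBuffer buffer;
    rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);

    writer.StartObject();
    writer.Key("start");
    writer.Uint64(start);   // emitted directly; no intermediate Value nodes
    writer.Key("size");
    writer.Uint64(size);
    writer.Key("name");
    writer.String(name);
    writer.EndObject();

    return buffer.GetString();
}
```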
The flags field was being left uninitialized, which could result in the region being mishandled during later analysis.
1. Continue to serialize the `cputype` / `cpusubtype` fields of `mach_header_64` as unsigned, despite them being signed. This preserves compatibility with the existing metadata version.
2. Add the `Serialize` declaration for the special `std::pair<uint64_t, std::pair<uint64_t, uint64_t>>` overload to the header. This ensures it will be favored over the generic `std::pair<First, Second>` template function and preserves the serialization used with the existing metadata version (see the sketch below).
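The overload-resolution point in item 2 can be shown in a self-contained sketch (the context type and output are illustrative): a non-template overload that is visible at the call site is preferred over the generic template.

```cpp
#include <cstdint>
#include <iostream>
#include <utility>

struct SerializationContext {};  // hypothetical stand-in

// Generic fallback for any pair.
template <typename First, typename Second>
void Serialize(SerializationContext&, const std::pair<First, Second>&) {
    std::cout << "generic pair overload\n";
}

// Specific overload. Declaring it in the header makes it visible at every
// call site, so overload resolution prefers it over the template and the
// existing serialization format is preserved.
void Serialize(SerializationContext& context,
               const std::pair<uint64_t, std::pair<uint64_t, uint64_t>>& value) {
    (void)context; (void)value;
    std::cout << "special overload\n";
}

int main() {
    SerializationContext context;
    Serialize(context, std::make_pair(uint64_t(1), std::make_pair(uint64_t(2), uint64_t(3))));
    // Prints "special overload". Call sites that only saw the template
    // would silently pick the generic version and change the format.
    return 0;
}
```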
Copying the state from the cache into a new `SharedCache` object is done with a global lock held and is so expensive that it results in much of the shared cache analysis running on a single thread, with others blocked waiting to acquire the lock.

The cache now holds a `std::shared_ptr` to the state. New `SharedCache` objects take a reference to the cached state and only create their own copy of it the first time they perform an operation that would mutate it. The cached copy is never mutated, only replaced, so there is no danger of modifying the state out from under a `SharedCache` object. Since the copy happens at first mutation, it is performed without any global locks held, which avoids blocking other threads.

This cuts the initial load time of a macOS shared cache from 3 minutes to 70 seconds, and cuts the time taken to load and analyze AppKit from multiple hours to around 14 minutes.
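A sketch of the copy-on-first-mutation scheme (the `State` contents and method names are illustrative):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct State {
    std::vector<uint64_t> regions;  // stand-in for the real cached state
};

class SharedCache {
public:
    // Takes a reference to the shared, immutable snapshot: no copy is made
    // and no global lock needs to be held.
    explicit SharedCache(std::shared_ptr<const State> cached)
        : m_state(std::move(cached)) {}

    const State& GetState() const { return *m_state; }

    void AddRegion(uint64_t start) { WillMutateState().regions.push_back(start); }

private:
    // Called before any mutation. The shared snapshot is never modified in
    // place, so the deep copy can safely happen lazily, with no locks held.
    State& WillMutateState() {
        if (!m_localState) {
            m_localState = std::make_shared<State>(*m_state);
            m_state = m_localState;
        }
        return *m_localState;
    }

    std::shared_ptr<const State> m_state;  // shared or private snapshot
    std::shared_ptr<State> m_localState;   // private copy, created on demand
};
```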
1. Use moves where possible to avoid unnecessary copies.
2. Remove redundant work within SymbolTableModel::updateSymbols. It calls setFilter, which immediately clears and then repopulates m_symbols.
3. Use unordered_map rather than map in `VM`. It is faster, and the order isn't significant.
4. Avoid multiple accesses to the map within `VM` in the common cases.
5. Optimize the common case within SharedCache::HeaderForAddress.
6. Change the return type of SharedCache::HeaderForAddress to avoid copying SharedCacheMachOHeaders, a large type that is expensive to copy (see the sketch below).
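The return-type change in item 6 amounts to handing out a pointer to the cached object instead of a by-value copy. A minimal sketch under assumed types (`Headers` stands in for SharedCacheMachOHeaders):

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <unordered_map>

struct Headers { char bytes[4096]; };  // stand-in for a large, costly type

struct Cache {
    std::unordered_map<uint64_t, std::shared_ptr<Headers>> headers;

    // Before (sketch): returning by value copies the whole struct on every
    // successful lookup.
    std::optional<Headers> HeaderForAddressByValue(uint64_t address) const {
        auto it = headers.find(address);
        if (it == headers.end())
            return std::nullopt;
        return *it->second;  // ~4 KB copied here
    }

    // After (sketch): returning a pointer shares the cached object instead.
    std::shared_ptr<const Headers> HeaderForAddress(uint64_t address) const {
        auto it = headers.find(address);
        return it == headers.end() ? nullptr : it->second;
    }
};
```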
[immer](https://github.com/arximboldi/immer) provides persistent, immutable data structures such as vectors and maps. These data structures can be passed by value without copying any data, and they use structural sharing so that only the mutated subset of the data is copied when a structure is changed. immer is published under the Boost Software License, which should be compatible with its use in this context.

Using these data structures eliminates a lot of the unnecessary copying of the shared cache's state when retrieving it from the view cache and beginning to mutate it. Instead of copying all of the vectors and maps contained within the state, only the portions of the vectors or maps that are mutated end up being copied. The downside is that the APIs used when mutating are less ergonomic than the native C++ types.

The upside is that this cuts the time taken for the initial load and analysis of a macOS shared cache to around 45 seconds (from 70 seconds with the basic CoW implementation in Vector35#6129) and cuts the time taken to load and analyze AppKit from 14 minutes to around 8.5 minutes.
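A small sketch of what working with immer looks like in practice (the container contents are illustrative); every "mutation" returns a new value that shares structure with the old one:

```cpp
#include <cstdint>
#include <immer/map.hpp>
#include <immer/vector.hpp>

int main() {
    immer::vector<uint64_t> regions;
    regions = regions.push_back(0x1000);  // returns a new vector

    // Passing by value is cheap: both values share all of their storage.
    immer::vector<uint64_t> snapshot = regions;

    // "Mutation" yields a new vector; only the affected tree nodes are
    // copied, the rest is shared with `regions` and `snapshot`.
    immer::vector<uint64_t> updated = regions.set(0, 0x2000);

    immer::map<uint64_t, int> libraryIds;
    auto withEntry = libraryIds.set(0x1000, 42);  // the less ergonomic
                                                  // update style noted above
    (void)snapshot; (void)updated; (void)withEntry;
    return 0;
}
```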
The issue this commit fixes was causing `SharedCache::ParseAndApplySlideInfoForFile` to completely fail to work with v5 slide info, which had a lot of knock-on effects, e.g. lots of Objective-C analysis failing due to invalid pointers that hadn't been fixed up. `dyld_cache_slide_info5` has 4 bytes of padding before `value_add`. Whilst `value_add` is not actually read, `SharedCache::ParseAndApplySlideInfoForFile` reads at a location in the file based on the size of the `dyld_cache_slide_info5` structure. This being off by 4 bytes effectively broke v5 slide info fixups. With this fix many more Objective-C functions have names and a lot more `msgSend` calls are fixed up.
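A sketch of the layout issue (field names follow dyld's dyld_cache_format.h as I recall them; verify against the dyld sources):

```cpp
#include <cstdint>

struct dyld_cache_slide_info5 {
    uint32_t version;            // currently 5
    uint32_t page_size;
    uint32_t page_starts_count;
    // <-- the compiler inserts 4 bytes of padding here so that value_add
    //     is 8-byte aligned
    uint64_t value_add;
    // uint16_t page_starts[];   // page_starts_count entries follow
};

// If the struct is declared without accounting for the padding, its size
// comes out as 20 instead of 24 and the page starts are read from the
// wrong file offset: exactly the off-by-4 described above.
static_assert(sizeof(dyld_cache_slide_info5) == 24,
              "page_starts must be read starting at offset 24");
```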
`SharedCache::m_symbolInfos` isn't being serialized, but there is an attempt to deserialize it. This commit adds the code to serialize it. I slightly modified the format because I didn't really understand how it was expected to be serialized based on the deserialization code. The deserialization code looked wrong to me, but it's likely a misunderstanding on my part. I kept it similar to how `m_exportInfos` is serialized.
Prior to this commit, `DSCObjCProcessor::PostProcessObjCSections` never did anything because it didn't use the correct names to look up the Objective-C sections of the recently loaded library; the DSC never has sections with the names it was searching for. This commit passes the `baseName` (the name of the library that was loaded), which is what other Objective-C section processing code does. By combining the base name with the section names, it will now find and process them as intended. Previously this was resulting in a lot of Objective-C related information being missed.

There is, however, still the issue that the way this DSC plugin works means it only analyzes Objective-C sections once. This catches a lot of things, but there are a number of cases where other libraries need to be loaded first due to information being referenced in another library. For instance, errors like `Failed to determine base classname for category` can be caused by the class reference in the category pointing to a class outside of the loaded library. Once the library containing the class has been loaded, the section containing the category should be re-processed.
There seem to be a number of issues with slide info parsing. I decided the simplest fix was to largely mimic what `dyld` does. Part of this process was to remove creating a vector of page starts and processing them in a second loop below the one where they were read out. It wasn't clear to me why the original design separated this into two loops, and I think that separation was part of what was causing issues. By applying the rewrites in the same loop where the page starts are read out, it was much easier to mimic the code in `dyld`, which I assume has to be correct. So as long as my copying was correct, I believe this should work as intended. Not everything has been thoroughly tested, but I'm pretty confident v3 and v5 are now working as intended. v2 should be too, but it has seen less testing.
This change is mostly motivated by simplifying the code, but it also brings minor correctness and performance benefits.

1. The pointer returned by mmap is stored as a uint8_t* rather than void*, as that is how it is used. This reduces how often it needs to be cast to a different type before use.
2. Read methods for primitives delegate to a new Read template function that in turn delegates to the general-purpose `Read(void* dest, size_t address, size_t length)`. This improves the consistency of bounds checking and simplifies the code. The compiler is more than willing to inline this, so we get less repetition with no overhead.
3. ReadNullTermString now uses std::find to find the nul byte and directly constructs the string from that range of bytes (see the sketch below). This removes an unnecessary allocation that was previously being forced by the use of reserve followed by shrink_to_fit. It also avoids the repeated reallocation that occurred as longer strings grew past the reserved size while being built up a character at a time.
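A sketch of the std::find approach from item 3, assuming a flat mmapped buffer (the parameter names are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// The string is constructed directly from the [start, nul) range, so a
// single allocation of exactly the right size is performed.
std::string ReadNullTermString(const uint8_t* data, size_t length, size_t address) {
    if (address >= length)
        return {};
    const uint8_t* start = data + address;
    const uint8_t* end = std::find(start, data + length, uint8_t(0));
    return std::string(start, end);
}
```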
Wasn't sure where to put this information, but considering these are the majority of changes I'm testing the DSC plugin with, I thought I'd mention here some other areas where there are performance issues that I've found. I'm not an expert on C++ or sampling execution times, so I timed things by using
I think whilst the lock contention is a major issue, it might be being exacerbated by excessive calls to
This method serializes the entire in-memory state to a giant JSON string. With the macOS cache I've been testing with, this results in around 500MB of JSON data and takes around 400-500ms. I have been experimenting with replacing the JSON serialization with FlatBuffers, as that should be more efficient in terms of the size of the resulting data, the time it takes to generate it, and the temporary memory allocations during serialization. It cuts the data size and serialization time in half, but fundamentally it's always going to be slow to serialize this much state. The real solution is to not serialize this data as often as happens now. Sadly I don't really understand why it was implemented the way it is, so I'm not sure of the implications of changing that.
I also see that when loading an iOS shared cache. It barely shows up with a macOS shared cache. I'm not sure whether it's possible to make the locking finer grained to improve parallelism, since there's not much information about what data the locks protect and why. One thing I will look at is whether the work done with the lock held can be made faster.
That's a good point. A lot of work is done with the
Yeah, that's kind of the same conclusion I've come to. I'm guessing it's about saving state in the database using the metadata API. This would also be useful to plugin authors for accessing information about the DSC from the metadata. To be honest, it probably doesn't matter a huge amount (given other issues), but it does add up if you were to bulk load hundreds of images.
I've just been working on this and have a solution, although it's kind of messy due to my lack of C++ skills. Basically I had to implement a global lock (per DSC view) for each memory region. If a memory region is not loaded, then its respective global lock has to be acquired. The global state is updated to the latest copy, and then another check is done to see if the region was loaded whilst acquiring the lock. If it has been, then bail; otherwise grab the

I'm finding there are a number of
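The check-lock-recheck pattern being described here is double-checked locking; a generic sketch of its shape (the types and names are mine, not the plugin's):

```cpp
#include <atomic>
#include <mutex>

struct MemoryRegion {
    std::atomic<bool> loaded{false};
    std::mutex loadMutex;  // per-region lock: loading one region doesn't
                           // block loading of others
};

void EnsureRegionIsLoaded(MemoryRegion& region) {
    if (region.loaded.load(std::memory_order_acquire))
        return;  // fast path: already loaded, no lock taken

    std::lock_guard<std::mutex> lock(region.loadMutex);
    if (region.loaded.load(std::memory_order_relaxed))
        return;  // another thread loaded it while we waited for the lock

    // ... map the region, apply slide info, register it with the view ...

    region.loaded.store(true, std::memory_order_release);
}
```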
Are you loading a single additional image at a time, or multiple at once via the Python API? If the latter, could you please share a Python snippet so I can also take a look at the behavior in that case?
I'm just loading single images at a time using the UI |
I'm currently looking at
There are typically only a few dozen mappings, while there can be millions of pages. This reduces the time spent populating the mapping from region to file accessor, as well as its memory usage.
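A sketch of a per-mapping lookup structure of the kind this implies (the types and names are illustrative): a handful of entries searched by start address, rather than one entry per page:

```cpp
#include <cstdint>
#include <map>
#include <memory>

struct FileAccessor { /* stand-in for the mmapped-file accessor */ };

struct MappingEntry {
    uint64_t size = 0;
    std::shared_ptr<FileAccessor> accessor;
};

class AddressSpace {
public:
    void AddMapping(uint64_t start, uint64_t size,
                    std::shared_ptr<FileAccessor> accessor) {
        m_mappings[start] = MappingEntry{size, std::move(accessor)};
    }

    // Binary search over a few dozen entries instead of a table with an
    // entry per page.
    FileAccessor* AccessorForAddress(uint64_t address) const {
        auto it = m_mappings.upper_bound(address);
        if (it == m_mappings.begin())
            return nullptr;
        --it;  // last mapping starting at or before `address`
        if (address - it->first >= it->second.size)
            return nullptr;
        return it->second.accessor.get();
    }

private:
    std::map<uint64_t, MappingEntry> m_mappings;  // keyed by start address
};
```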
I'd missed that
It actually gets more wild. Using

For
Do you want me to do pull requests against your
Feel free to send a pull request. I'd love to see what you've come up with! I might end up rebasing it so it applies cleanly to `dev`.

I have personally been trying to stick to branches that apply cleanly to `dev`. It does mean a bunch of extra work, though. I develop the changes against this branch, then have to port them back to `dev`.
`BackingCache` now tracks the `dyld_cache_mapping_info` for its mappings so it has access to the memory protections for the region. This means it can avoid marking some regions as containing code when they don't, reducing the amount of analysis work that has to be done. Using `dyld_cache_mapping_info` also makes references to mappings easier to understand due to its named fields vs the nested `std::pair`s that were previously in use.
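`dyld_cache_mapping_info` carries the initial and maximum VM protections for a mapping, which is what makes the code/no-code decision possible. A sketch (the helper function is hypothetical; the struct follows dyld's dyld_cache_format.h):

```cpp
#include <cstdint>

// Per dyld's dyld_cache_format.h.
struct dyld_cache_mapping_info {
    uint64_t address;
    uint64_t size;
    uint64_t fileOffset;
    uint32_t maxProt;
    uint32_t initProt;
};

constexpr uint32_t kVMProtExecute = 0x4;  // VM_PROT_EXECUTE from <mach/vm_prot.h>

// Only mappings that are actually executable get flagged as containing
// code, so read-only data regions no longer trigger analysis work.
bool MappingContainsCode(const dyld_cache_mapping_info& mapping) {
    return (mapping.initProt & kVMProtExecute) != 0;
}
```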
Find the relative selector base address in the Objective-C optimization data pointed to by the shared cache header, rather than via `__objc_scoffs`, which is only present on iOS, and not for every iOS version that encodes selectors via direct offsets.

This also includes some related improvements:
1. Direct selectors get their own pointer type so they're rendered correctly in the view.
2. Method lists encoded as lists of lists are now handled.
3. The `dyld_cache_header` type added to the view is truncated to the length in the loaded cache. This ensures it is applied to the view.
4. A couple of methods that process method IMPs and selectors are updated to check whether the address is valid before attempting to process them. They would otherwise fail by throwing an exception, but checking for validity is quicker and makes exception breakpoints usable.
The existing view-specific state was stored in several global unordered maps. Many of these were accessed without locking, including `viewSpecificMutexes`, which is racy in the face of multiple threads.

View-specific state is now stored in a heap-allocated `ViewSpecificState` struct that is reference counted via `std::shared_ptr`. A static map holds a `std::weak_ptr` to each view-specific state, keyed by session id. `SharedCache` retrieves its view-specific state in its constructor. Since `ViewSpecificState` is reference counted, it is naturally deallocated when the last `SharedCache` instance that references it goes away. Its corresponding entry will remain in the static map, though since that entry only holds a `std::weak_ptr` rather than any state it will not use much memory. The next time view-specific state is retrieved, any expired entries are removed from the map.
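A sketch of that lifetime scheme (the struct contents and function name are illustrative):

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

struct ViewSpecificState {
    std::mutex stateMutex;
    // ... caches, progress tracking, etc. ...
};

std::shared_ptr<ViewSpecificState> ViewSpecificStateForId(uint64_t sessionId) {
    static std::mutex mapMutex;
    static std::unordered_map<uint64_t, std::weak_ptr<ViewSpecificState>> states;

    std::lock_guard<std::mutex> lock(mapMutex);

    // Reuse the state if a live SharedCache is still keeping it alive.
    if (auto it = states.find(sessionId); it != states.end()) {
        if (auto state = it->second.lock())
            return state;
    }

    // Housekeeping: drop entries whose state has already been deallocated.
    for (auto it = states.begin(); it != states.end();) {
        if (it->second.expired())
            it = states.erase(it);
        else
            ++it;
    }

    auto state = std::make_shared<ViewSpecificState>();
    states[sessionId] = state;
    return state;
}
```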
They're surprisingly expensive to look up.
This reverts commit 1539f3d.
Prior to this commit, `DSCObjCProcessor::PostProcessObjCSections` never did anything because it didn't use the correct names to look up the Objective-C sections of the recently loaded library; the DSC never has sections with the names it was searching for. This commit passes the `baseName` (the name of the library that was loaded), which is what other Objective-C section processing code does. By combining the base name with the section names, it will now find and process them as intended. Previously this was resulting in a lot of Objective-C related information being missed.

There is, however, still the issue that the way this DSC plugin works means it only analyzes Objective-C sections once. This catches a lot of things, but there are a number of cases where other libraries need to be loaded first due to information being referenced in another library. For instance, errors like `Failed to determine base classname for category` can be caused by the class reference in the category pointing to a class outside of the loaded library. Once the library containing the class has been loaded, the section containing the category should be re-processed.
…ding a library

This is useful when batch loading libraries to avoid extra processing (once the next commit has landed).
A problem with processing Objective-C sections at the time a library is loaded is that some of the references within those sections may refer to sections that have not been loaded yet. This results in things like selectors not being correctly typed and named. This commit provides a way for users to manually trigger Objective-C parsing of the sections of a specific library, or of all libraries, via the API. Combined with the previous commit, a user can use the API to batch load a number of libraries with Objective-C processing skipped for each one, and then run it across the entire DSC once they are all loaded. A further improvement would be to provide a way to trigger Objective-C processing through the UI.
I did some heap profiling, and it seems that when Binary Ninja is consuming 30GB of RAM due to loading 30 libraries, 90% of the RAM usage is coming from core Binary Ninja API allocations. So, as I think you mentioned before, we're at the mercy of Vector35 solving that issue. The biggest chunk of RAM usage from the DSC plugin is the symbol export lists, which will naturally consume a GB or two because of the number of strings, so I don't think anything worthwhile can be done there.

I also did some more performance testing. During image loading the bulk of the execution time is in

During the analysis phase most of the execution time is in
I think I'll personally hold off on trying to optimize anything further until we start seeing PRs merged upstream. As it stands it is now fast enough for my typical usage (opening a shared cache to poke at the couple of libraries I happen to be interested in at that moment, then throwing it all away). Once changes start to get merged and I'm not trying to work across a giant stack of changes like this, I might revisit things again, just because making things faster is a lot of fun.

I'll merge the remaining PRs we've each submitted upstream into this branch, and I'll continue to keep an eye out for any other improvements you make so I can merge them here. That will ensure that there's an easy way for anyone who's interested to get a working, fast shared cache plug-in.
…on', 'WeiN76LQh/fix-v5-slide-info', 'bdash/dsc-memory-region-flags' and 'bdash/dsc-persistent-data-structures' into dsc-proposed
…into dsc-proposed
Yes I agree, I'm pretty much in the same boat. I appreciate all the work you've done to make working with the DSC in Binary Ninja as performant as it is with this branch.
Not sure what happened between that comment and now, but I'm seeing pretty major performance regressions using this branch.
I've been consistently seeing times of 2-3 minutes for Foundation. The only time I've seen anything under 30 seconds was due to a bug resulting in a bunch of analysis work being skipped.
I think I know what's going on. Using
What I don't understand is why
Yeah, skipping
Yes I cleaned up that function a little bit as well. The
This PR exists to document the `dsc-proposed` branch. This branch aggregates a number of pending improvements related to the dyld shared cache plug-in from https://github.com/Vector35/binaryninja-api. It is intended to provide a single branch that can be built to take advantage of a number of significant bug fixes and performance improvements that have not yet been merged upstream.

This branch will periodically be rebased, and its contents will change as pull requests are merged upstream or new changes are made.
Included on this branch
Bug fixes:
- `SharedCache::m_symbolInfos` serialization fix (Vector35/binaryninja-api#6174 by WeiN76LQh)
- `DSCObjCProcessor::PostProcessObjCSections` fix and improved Objective-C processing (Vector35/binaryninja-api#6198 by WeiN76LQh)

Performance improvements:
Changes for which no pull requests have been created at this time:
Please base further improvements on `dev` rather than this branch when possible, to avoid problems when this branch is rebased. Pull requests should be submitted against the upstream repository unless you're fixing an issue introduced in a change that has not yet been merged.