[SharedCache] All proposed shared cache improvements #2

Open · wants to merge 31 commits into base: dev
Conversation

@bdash (Owner) commented Nov 22, 2024

This PR exists to document the dsc-proposed branch. This branch aggregates a number of pending improvements related to the dyld shared cache plug-in from https://github.com/Vector35/binaryninja-api. It is intended to provide a single branch that can be built to take advantage of a number of significant bug fixes and performance improvements that have not yet been merged upstream.

This branch will periodically be rebased and its contents will change as pull requests are merged upstream or new changes are made.

Included on this branch

Bug fixes:

Performance improvements:

Changes for which no pull requests have been created at this time:

Please base further improvements on dev rather than this branch when possible to avoid problems when this branch is rebased. Pull requests should be submitted against the upstream repository unless you're fixing an issue introduced in a change that has not yet been merged.

bdash and others added 16 commits November 16, 2024 21:56
api/MetadataSerializable.hpp is removed in favor of including
core/MetadataSerializable.hpp. Both headers defined types with the same
name, leading to One Definition Rule violations and surprising behavior.

The serialization and deserialization contexts are now created on
demand during serialization rather than being members of
`MetadataSerializable`. This reduces the size of every serializable
object by ~220 bytes.

The context is passed explicitly as an argument to `Serialize` /
`Deserialize`. As a result, `Serialize` / `Deserialize` can now be free
functions rather than member functions.

Since `MetadataSerializable` is not used for dynamic dispatch,
the virtual methods are removed and the class is updated to be a class
template using CRTP. This allows delegating to the derived class's
`Load` and `Store` methods without the additional size overhead of the
vtable pointer in every serializable object.

These changes reduce the memory footprint of Binary Ninja after loading
the macOS shared cache and loading a single dylib from it from 8.3GB to
4.6GB.
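A minimal sketch of the resulting shape. The names (`SerializationContext`, `CacheRegion`, the field set) are illustrative stand-ins, not the plugin's exact API:

```cpp
#include <cstdint>

struct SerializationContext;    // created on demand, not stored per object
struct DeserializationContext;

// CRTP base: no virtual methods, so no vtable pointer in every
// serializable object.
template <typename Derived>
struct MetadataSerializable {};

// Free functions that delegate to the derived class's Store/Load,
// with the context passed explicitly as an argument.
template <typename Derived>
void Serialize(SerializationContext& context, const MetadataSerializable<Derived>& value)
{
    static_cast<const Derived&>(value).Store(context);
}

template <typename Derived>
void Deserialize(DeserializationContext& context, MetadataSerializable<Derived>& value)
{
    static_cast<Derived&>(value).Load(context);
}

struct CacheRegion : MetadataSerializable<CacheRegion> {
    uint64_t start = 0;
    uint64_t size = 0;
    void Store(SerializationContext& context) const { /* write fields */ }
    void Load(DeserializationContext& context) { /* read fields */ }
};
```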
This ensures only one definition ends up in the final binary and makes compilation a little faster.
Building up an in-memory representation of the JSON document is expensive in both CPU and memory. Instead of doing that, we can write the appropriate types directly to the output.
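A sketch of the streaming approach using rapidjson's SAX-style `Writer`. Whether this matches the plugin's exact serializer is an assumption, and the field name is made up for illustration:

```cpp
#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>

#include <cstdint>
#include <string>
#include <vector>

// Write values straight into the output buffer instead of building a
// rapidjson::Document (an in-memory DOM) first and serializing it after.
std::string SerializeRegions(const std::vector<uint64_t>& regionStarts)
{
    rapidjson::StringBuffer buffer;
    rapidjson::Writer<rapidjson::StringBuffer> writer(buffer);
    writer.StartObject();
    writer.Key("regionStarts");  // illustrative field name
    writer.StartArray();
    for (uint64_t start : regionStarts)
        writer.Uint64(start);
    writer.EndArray();
    writer.EndObject();
    return buffer.GetString();
}
```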
The flags field was being left uninitialized, which could result in the
region being mishandled during later analysis.
1. Continue to serialize the `cputype` / `cpusubtype` fields of
   `mach_header_64` as unsigned, despite them being signed. This
   preserves compatibility with the existing metadata version.
2. Add the `Serialize` declaration for the special `std::pair<uint64_t,
   std::pair<uint64_t, uint64_t>>` overload to the header. This ensures
   it will be favored over the generic `std::pair<First, Second>`
   template function and preserves the serialization used with the
   existing metadata version.
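To illustrate point 2, a sketch of why the declaration needs to be visible in the header (names are illustrative):

```cpp
#include <cstdint>
#include <utility>

struct SerializationContext;

// Generic fallback for any pair.
template <typename First, typename Second>
void Serialize(SerializationContext& context, const std::pair<First, Second>& value);

// Non-template overload declared in the header. For this exact type,
// overload resolution prefers a non-template function over the template,
// so the encoding used by the existing metadata version is preserved.
void Serialize(SerializationContext& context,
    const std::pair<uint64_t, std::pair<uint64_t, uint64_t>>& value);
```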
Copying the state from the cache into a new `SharedCache` object is done
with a global lock held and is so expensive that it results in much of
the shared cache analysis running on a single thread, with others
blocked waiting to acquire the lock.

The cache now holds a `std::shared_ptr` to the state. New `SharedCache`
objects take a reference to the cached state and only create their own
copy of it the first time they perform an operation that would mutate
it. The cached copy is never mutated, only replaced, so there is no
danger of modifying the state out from under a `SharedCache` object.
Since the copy happens at first mutation, it is performed without any
global locks held. This avoids blocking other threads.

This cuts the initial load time of a macOS shared cache from 3 minutes
to 70 seconds, and cuts the time taken to load and analyze AppKit from
multiple hours to around 14 minutes.
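A minimal sketch of the scheme, assuming a simplified `State` and ignoring the view cache itself:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct State {
    std::vector<uint64_t> loadedRegions;  // stand-in for the real fields
};

class SharedCache {
public:
    // Takes a reference to the cached state: cheap, done with the global
    // lock held, but copies nothing.
    explicit SharedCache(std::shared_ptr<const State> cached)
        : m_state(std::move(cached)) {}

    const State& GetState() const { return *m_state; }

protected:
    // The first mutating operation copies the state, with no global lock
    // held. The cached copy is never mutated in place, only replaced, so
    // other SharedCache objects are unaffected.
    State& MutableState()
    {
        if (!m_ownedState) {
            m_ownedState = std::make_shared<State>(*m_state);
            m_state = m_ownedState;
        }
        return *m_ownedState;
    }

private:
    std::shared_ptr<const State> m_state;
    std::shared_ptr<State> m_ownedState;
};
```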
1. Use moves where possible to avoid unnecessary copies.
2. Remove redundant work within SymbolTableModel::updateSymbols. It
   calls setFilter which immediately clears then repopulates m_symbols.
3. Use unordered_map rather than map in `VM`. It is faster and the order
   isn't significant.
4. Avoid multiple accesses to the map with `VM` in the common cases.
5. Optimize the common case within SharedCache::HeaderForAddress.
6. Change return type of SharedCache::HeaderForAddress to avoid copying
   SharedCacheMachOHeaders. It is a large type that is expensive to
   copy.
[immer](https://github.com/arximboldi/immer) provides persistent,
immutable data structures such as vectors and maps. These data
structures support passing by value without copying any data, and use
structural sharing to copy only a subset of the data when a data
structure is mutated. immer is published under the Boost Software
License, which should be compatible with its use in this context.

Using these data structures eliminates a lot of the unnecessary copying
of the shared cache's state when retrieving it from the view cache and
beginning to mutate it. Instead of all of the vectors and maps contained
within the state being copied, only the portions of the vectors or maps
that are mutated end up being copied.

The downside is that the APIs used when mutating are less ergonomic than
using the native C++ types.

The upside is that this cuts the time taken for the initial load and
analysis of a macOS shared cache to around 45 seconds (from 70 seconds
with the basic CoW implementation in Vector35#6129) and cuts the time taken to
load and analyze AppKit from 14 minutes to around 8.5 minutes.
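A small standalone example of the immer semantics relied on here (the values are made up):

```cpp
#include <immer/map.hpp>
#include <immer/vector.hpp>

#include <cassert>
#include <cstdint>
#include <string>

int main()
{
    // "Mutations" return a new container; the original is untouched.
    // Unchanged portions are shared structurally rather than copied.
    immer::vector<std::string> v1;
    auto v2 = v1.push_back("libobjc.A.dylib");
    assert(v1.size() == 0 && v2.size() == 1);

    immer::map<uint64_t, std::string> m1;
    auto m2 = m1.set(0x180000000, "header region");
    assert(!m1.find(0x180000000));  // the old snapshot is unaffected
    assert(m2.find(0x180000000));

    // Copying a snapshot is O(1): a pointer copy plus reference counts.
    // This is what makes handing state between SharedCache objects cheap.
    auto snapshot = m2;
    (void)snapshot;
    return 0;
}
```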
The issue this commit fixes was causing `SharedCache::ParseAndApplySlideInfoForFile` to completely fail to work with v5 slide info, which had a lot of knock-on effects: lots of Objective-C analysis was failing due to invalid pointers that hadn't been fixed up.

`dyld_cache_slide_info5` has 4 bytes of padding before `value_add`. Whilst `value_add` is not actually read, `SharedCache::ParseAndApplySlideInfoForFile` reads from a location in the file based on the size of the `dyld_cache_slide_info5` structure. Being off by 4 bytes broke the v5 slide info fixups.

With this fix many more Objective-C functions have names and a lot more `msgSend` calls are fixed up.
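For reference, the layout in question, reproduced from memory of dyld's dyld_cache_format.h (treat the exact field set as illustrative; the point is the implicit padding):

```cpp
#include <cstddef>
#include <cstdint>

struct dyld_cache_slide_info5 {
    uint32_t version;  // currently 5
    uint32_t page_size;
    uint32_t page_starts_count;
    // 4 bytes of implicit padding here so value_add is 8-byte aligned
    uint64_t value_add;
    // uint16_t page_starts[] follows the struct in the file
};

static_assert(offsetof(dyld_cache_slide_info5, value_add) == 16);
static_assert(sizeof(dyld_cache_slide_info5) == 24);

// The page starts must be read at slideInfoOffset + sizeof(struct), i.e.
// +24. Treating the struct as 20 bytes shifts every page start by 4
// bytes and breaks all v5 fixups.
```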
`SharedCache::m_symbolInfos` isn't being serialized but there is an attempt to deserialize it. This commit adds in the code to serialize it.

I slightly modified the format because I didn't really understand how it was expected to be serialized based on the deserialization code. The deserialization code looked wrong to me, but it's likely a misunderstanding on my part. I kept it similar to how `m_exportInfos` is serialized.
Prior to this commit, `DSCObjCProcessor::PostProcessObjCSections` never did anything, because it didn't use the correct names to look up the Objective-C sections of the recently loaded library: the DSC never has sections with the names it was searching for.

This commit passes the `baseName` (the name of the library that was loaded), which is what the other Objective-C section processing code does. By combining the base name with the section names, it now finds and processes the sections as intended. Previously, a lot of Objective-C-related information was being missed.

There is, however, still an issue: the way this DSC plugin works means it only analyzes Objective-C sections once. This catches a lot of things, but there are a number of cases where other libraries need to be loaded first because information is referenced across libraries. For instance, errors like `Failed to determine base classname for category` can be caused by the category's class reference pointing to a class outside the loaded library. Once the library containing the class has been loaded, the section containing the category should be re-processed.
There seem to be a number of issues with slide info parsing. I decided the simplest fix was to largely mimic what `dyld` is doing.

Part of this process was removing the code that built a vector of page starts and then processed them in a second loop below the one where they were read out. The reason for the original decision to separate this into two loops wasn't clear to me, and I think it was part of what was causing issues.

By applying the rewrites in the same loop where the page starts are read out, it was much easier to mimic the code in `dyld`, which I assume has to be correct. So as long as my copying was correct, I believe this should work as intended.

Not everything has been thoroughly tested, but I'm pretty confident v3 and v5 now work as intended. v2 should too, but it has seen less testing.
This change is mostly motivated by simplifying the code, but it also
brings minor correctness and performance benefits.

1. The pointer returned by mmap is stored as a uint8_t* rather than
   void* as that is how it is used. This reduces how often it needs to
   be cast to a different type before it is used.
2. Read methods for primitives delegate to a new Read template function
   that in turn delegates to the general-purpose `Read(void* dest, size_t
   address, size_t length)`. This improves the consistency of bounds
   checking and simplifies the code. The compiler is more than willing
   to inline this so we get less repetition with no overhead.
3. ReadNullTermString now uses std::find to find the nul byte and
   directly constructs the string from that range of bytes. This removes
   an unnecessary allocation that was previously being forced by the use
   of reserve followed by shrink_to_fit. It also avoids the repeated
   reallocation that occurred as longer strings grew past the reserved
   size while being built up a character at a time.
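A condensed sketch of points 1-3. This is simplified: error handling and the real class's other responsibilities are omitted, and the names are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

class MMappedFileAccessor {
public:
    // General-purpose read: the single place where bounds checking happens.
    void Read(void* dest, size_t address, size_t length)
    {
        if (address + length < address || address + length > m_length)
            throw std::out_of_range("read past end of file");
        std::memcpy(dest, m_data + address, length);
    }

    // Primitive readers delegate here; the compiler inlines the whole
    // chain, so there is no overhead over open-coded reads.
    template <typename T>
    T Read(size_t address)
    {
        T value;
        Read(&value, address, sizeof(T));
        return value;
    }

    uint32_t ReadUInt32(size_t address) { return Read<uint32_t>(address); }

    // Find the nul byte and construct the string from the range in one
    // go, avoiding per-character appends and a reserve/shrink_to_fit dance.
    std::string ReadNullTermString(size_t address)
    {
        const uint8_t* begin = m_data + address;
        const uint8_t* end = std::find(begin, m_data + m_length, 0);
        return std::string(begin, end);
    }

private:
    uint8_t* m_data = nullptr;  // mmap result, stored as uint8_t*
    size_t m_length = 0;
};
```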
@WeiN76LQh commented:

Wasn't sure where to put this information, but considering these are the majority of changes I'm testing the DSC plugin with, I thought I'd mention some other areas where I've found performance issues. I'm not an expert on C++ or sampling execution times, so I timed things by wrapping sections of code with std::chrono::high_resolution_clock::now() to see how long they take (a rough sketch of this approach is at the end of this comment). I presume that's good enough to identify bottlenecks when we're talking about seconds of execution time. This is also with a debug build, so things may be running a little slower, but these are clearly still areas of concern.

  • In SharedCache::SaveToDSCView, the line auto data = AsMetadata(); consistently takes multiple seconds. This is called at least once when loading either a section or an image.
  • When using the Python API to load sections by address, in SharedCache::LoadSectionAtAddress the line auto id = m_dscView->BeginUndoActions(); often takes at least a second, and in some cases has taken tens of seconds. Subsequently, m_dscView->CommitUndoActions(id); usually takes around 100ms. There appear to be some underlying issues with undo actions when using Binary Ninja via the Python API.
  • By far the biggest performance issue is lock contention. During phase 1 of analysis, a huge amount of time is spent blocking analysis threads on the first line of SharedCache::LoadSectionAtAddress:
    std::unique_lock<std::mutex> lock(viewSpecificMutexes[m_dscView->GetFile()->GetSessionId()].viewOperationsThatInfluenceMetadataMutex);
    This lock contention is definitely what's causing DSC analysis to be less parallelised than it could be.
  • The line shortly after, auto vm = GetVMMap(); (in SharedCache::LoadSectionAtAddress), consistently takes hundreds of milliseconds. That is much less problematic on its own, but SharedCache::LoadSectionAtAddress is called a lot by phase 1 analysis threads, so it adds up.

I think that whilst the lock contention is a major issue, it may be exacerbated by excessive calls to SharedCache::LoadSectionAtAddress. I'm looking into this because loading one image causes 71 calls (not likely a consistent number, but a real-world example) to SharedCache::LoadSectionAtAddress, often with addresses that are very close together or identical. It's probable that no lock needs to be taken here, and a quick return can be done in the cases where the section is already loaded.
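For reference, the timing approach mentioned above amounts to something like this (the code under test is hypothetical):

```cpp
#include <chrono>
#include <cstdio>

int main()
{
    using Clock = std::chrono::high_resolution_clock;
    auto start = Clock::now();

    // ... section under test, e.g. the body of SharedCache::LoadSectionAtAddress ...

    // Coarse wall-clock timing; good enough to spot multi-second bottlenecks.
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start);
    std::printf("section took %lld ms\n", static_cast<long long>(elapsed.count()));
    return 0;
}
```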

@bdash (Owner, Author) commented Nov 23, 2024

> In SharedCache::SaveToDSCView, the line auto data = AsMetadata(); consistently takes multiple seconds. This is called at least once when loading either a section or an image.

This method serializes the entire in-memory state to a giant JSON string. With the macOS cache I've been testing, this results in around 500MB of JSON data and takes around 400-500ms.

I have been experimenting with replacing the JSON serialization with Flatbuffers, as that should be more efficient in terms of the size of the resulting data, the time it takes to generate, and the temporary memory allocations made during serialization. It cuts the data size and serialization time in half, but fundamentally it's always going to be slow to serialize this much state.

The real solution is not to serialize this data as often as the plugin currently does. Sadly, I don't really understand why it was implemented the way it is, so I'm not sure of the implications of changing that.

> By far the biggest performance issue is lock contention. During phase 1 of analysis, a huge amount of time is spent blocking analysis threads on the first line of SharedCache::LoadSectionAtAddress:

I also see that when loading an iOS shared cache. It barely shows up with a macOS shared cache.

I'm not sure whether it's possible to make the locking finer-grained to improve parallelism, since there's not much information about what data the locks protect and why.

One thing I will look at is whether the work done with the lock held can be made faster. SharedCache::GetVMMap looks to be the culprit with an iOS shared cache, and there's room for improvement there. I was working on improving it yesterday before I got sidetracked by Objective-C selectors misbehaving. It works for macOS shared caches, but not iOS at this point. I'll revisit this shortly.

> Often with addresses that are very close together or identical. It's probable that no lock needs to be taken here, and a quick return can be done in the cases where the section is already loaded.

That's a good point. A lot of work is done with the viewOperationsThatInfluenceMetadataMutex lock held prior to even determining whether there's anything to load. If viewOperationsThatInfluenceMetadataMutex only needs to be held while the operations that interact with BinaryView are being performed, a lot of that work could happen in parallel.

@WeiN76LQh commented Nov 23, 2024

> This method serializes the entire in-memory state to a giant JSON string. With the macOS cache I've been testing, this results in around 500MB of JSON data and takes around 400-500ms.
>
> I have been experimenting with replacing the JSON serialization with Flatbuffers, as that should be more efficient in terms of the size of the resulting data, the time it takes to generate, and the temporary memory allocations made during serialization. It cuts the data size and serialization time in half, but fundamentally it's always going to be slow to serialize this much state.
>
> The real solution is not to serialize this data as often as the plugin currently does. Sadly, I don't really understand why it was implemented the way it is, so I'm not sure of the implications of changing that.

Yeah, that's kind of the same conclusion I've come to. I'm guessing it's about saving state in the database using the metadata API. This would also be useful to plugin authors who want to access information about the DSC from the metadata. To be honest, it probably doesn't matter a huge amount (given the other issues), but it does add up if you were to bulk load hundreds of images.

> That's a good point. A lot of work is done with the viewOperationsThatInfluenceMetadataMutex lock held prior to even determining whether there's anything to load. If viewOperationsThatInfluenceMetadataMutex only needs to be held while the operations that interact with BinaryView are being performed, a lot of that work could happen in parallel.

I've just been working on this and have a solution, although it's kind of messy due to my lack of C++ skills. Basically, I had to implement a global lock (per DSC view) for each memory region. If a memory region is not loaded, its respective region lock has to be acquired. The global state is updated to the latest copy, and then another check is done to see whether the region was loaded while acquiring the lock. If it has been, bail; otherwise grab the viewOperationsThatInfluenceMetadataMutex lock and load the region whilst still holding the region lock (a sketch of the idea is below). This ensures only one thread attempts to load the region, but also means that once it's loaded, threads spend very little time in SharedCache::LoadSectionAtAddress. It does speed up initial loading of an image, but it seems a lot of phase 1 analysis is still being slowed down elsewhere.
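Roughly, the scheme looks like this. This is a simplified sketch with hypothetical names and structure, not the actual patch:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <unordered_set>

struct ViewState {
    std::mutex regionLocksMutex;
    std::unordered_map<uint64_t, std::mutex> regionLocks;  // keyed by region start
    std::mutex loadedRegionsMutex;
    std::unordered_set<uint64_t> loadedRegions;
    std::mutex viewOperationsThatInfluenceMetadataMutex;
};

bool IsLoaded(ViewState& state, uint64_t region)
{
    std::lock_guard<std::mutex> lock(state.loadedRegionsMutex);
    return state.loadedRegions.count(region) != 0;
}

void LoadSectionAtAddress(ViewState& state, uint64_t region)
{
    if (IsLoaded(state, region))
        return;  // fast path: no expensive lock taken at all

    std::mutex* regionLock;
    {
        std::lock_guard<std::mutex> lock(state.regionLocksMutex);
        regionLock = &state.regionLocks[region];  // one lock per region
    }

    std::lock_guard<std::mutex> regionGuard(*regionLock);
    if (IsLoaded(state, region))
        return;  // another thread loaded it while we waited

    // Only now take the broad metadata lock and do the real work.
    std::lock_guard<std::mutex> metadataGuard(state.viewOperationsThatInfluenceMetadataMutex);
    // ... load the region, apply fixups, define sections ...

    std::lock_guard<std::mutex> lock(state.loadedRegionsMutex);
    state.loadedRegions.insert(region);
}
```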

I'm finding there are a number of SharedCache functions being called frequently in phase 1 and 3 analysis that may also be problematic. SharedCache::FindSymbolAtAddrAndApplyToAddr seems to be a popular but slow path.

@bdash (Owner, Author) commented Nov 23, 2024

Are you loading a single additional image at a time, or multiple at once via the Python API? If the latter, could you please share a Python snippet so I can also take a look at the behavior in that case?

@WeiN76LQh commented:

> Are you loading a single additional image at a time, or multiple at once via the Python API? If the latter, could you please share a Python snippet so I can also take a look at the behavior in that case?

I'm just loading single images at a time using the UI.

@WeiN76LQh commented:

> I'm finding there are a number of SharedCache functions being called frequently in phase 1 and 3 analysis that may also be problematic. SharedCache::FindSymbolAtAddrAndApplyToAddr seems to be a popular but slow path.

I'm currently looking at SharedCache::FindSymbolAtAddrAndApplyToAddr. SharedCache::ParseExportTrie is called every time, and although the results are cached, the cache (State().exportInfos) is never actually used. I think I can get some improvements here by actually using the cache, along with some other minor code changes to avoid locks and repetitive code.

There are typically only a few dozen mappings, while there can be
millions of pages. This reduces the time spent populating the mapping
from region to file accessor, along with its memory usage.
@bdash (Owner, Author) commented Nov 24, 2024

I'd missed that FindSymbolAtAddrAndApplyToAddr doesn't consult exportInfos. Amazing!

@WeiN76LQh commented:

> I'd missed that FindSymbolAtAddrAndApplyToAddr doesn't consult exportInfos. Amazing!

It actually gets wilder. Using exportInfos only provided around a 10% performance improvement when loading the Foundation library. What actually turned out to be the issue with that function was the call to GetTypeLibrary. Caching the references to the type libraries provided over a 50% performance increase and dramatically improves multi-core utilisation (a sketch of the caching idea follows the numbers below).

For Foundation:

  • Analysis phase 1 went from ~450 seconds (with exportInfos optimisation) -> ~200 seconds
  • Analysis phase 3 went from ~60 seconds (with exportInfos optimisation) -> ~20 seconds
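The caching idea amounts to something like this. This is a hypothetical sketch; `TypeLibrary` and the lookup are stand-ins for the Binary Ninja API calls:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct TypeLibrary;  // stand-in for Binary Ninja's type library handle

// Look each type library up once and reuse the reference, instead of
// calling the expensive lookup for every symbol.
class TypeLibraryCache {
public:
    std::shared_ptr<TypeLibrary> Get(const std::string& name)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_cache.find(name);
        if (it != m_cache.end())
            return it->second;  // may be null if the lookup failed before
        auto library = LookUp(name);  // the expensive call, done once
        m_cache[name] = library;
        return library;
    }

private:
    std::shared_ptr<TypeLibrary> LookUp(const std::string& name)
    {
        // In the real code this would be the expensive call, e.g. a
        // GetTypeLibrary-style lookup on the view.
        (void)name;
        return nullptr;
    }

    std::mutex m_mutex;
    std::map<std::string, std::shared_ptr<TypeLibrary>> m_cache;
};
```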

@WeiN76LQh commented:

Do you want me to do pull requests against your dsc-proposed branch? At this point my changes are intertwined with yours, so if I do a pull request against the main dev branch, they aren't going to merge neatly with dsc-proposed without changes.

@bdash (Owner, Author) commented Nov 24, 2024

Feel free to send a pull request. I'd love to see what you've come up with!

I might end up rebasing it so it applies cleanly to dev and then merging it here, using your PR as the end state to see how to resolve the resulting merge conflicts.

I have personally been trying to stick to branches that apply cleanly to dev where there's not an explicit dependency on earlier work. This will hopefully avoid ending up with significant bug fixes or performance improvements being blocked on invasive changes such as the persistent data structures via immer.

It does mean a bunch of extra work, though. I develop the changes against this branch, then have to port them back to dev, and then deal with the merge conflicts when merging everything together here. I honestly don't know if it's worth the effort. Hopefully we'll see some of the PRs against upstream merged soon.

`BackingCache` now tracks the `dyld_cache_mapping_info` for its mappings
so it has access to the memory protections for the region. This means it
can avoid marking some regions as containing code when they don't,
reducing the amount of analysis work that has to be done.

Using `dyld_cache_mapping_info` also makes references to mappings easier
to understand due to its named fields vs the nested `std::pair`s that
were previously in use.
Find the relative selector base address in the Objective-C optimization
data pointed to by the shared cache header, rather than via
`__objc_scoffs`, which is only present on iOS, and not for every iOS
version that encodes selectors via direct offsets.

This also includes some related improvements:
1. Direct selectors get their own pointer type so they're rendered
   correctly in the view.
2. Method lists encoded as lists of lists are now handled.
3. The `dyld_cache_header` type added to the view is truncated to the
   length in the loaded cache. This ensures it is applied to the view.
4. A couple of methods that process method IMPs and selectors are
   updated to check whether the address is valid before attempting to
   process them. They would otherwise fail by throwing an exception, but
   checking validity up front is quicker and makes exception breakpoints
   usable.
The existing view-specific state was stored in several global unordered
maps. Many of these were accessed without locking, including
`viewSpecificMutexes`, which is racy in the face of multiple threads.

View-specific state is stored in a new heap-allocated
`ViewSpecificState` struct that is reference counted via
`std::shared_ptr`. A static map holds a `std::weak_ptr` to each
view-specific state, keyed by session id. `SharedCache` retrieves its
view-specific state during its constructor.

Since `ViewSpecificState` is reference counted it will naturally be
deallocated when the last `SharedCache` instance that references it goes
away. Its corresponding entry will remain in the static map, though
since it only holds a `std::weak_ptr` rather than any state it will not
use much memory. The next time view-specific state is retrieved any
expired entries will be removed from the map.
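A minimal sketch of the retrieval function described above (names and struct contents are illustrative):

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

struct ViewSpecificState {
    std::mutex viewOperationsThatInfluenceMetadataMutex;
    // ... caches, progress state, etc.
};

// One entry per session id. Holding only weak_ptrs means the state is
// freed when the last SharedCache referencing it goes away.
std::shared_ptr<ViewSpecificState> ViewSpecificStateForId(uint64_t sessionId)
{
    static std::mutex mapMutex;
    static std::unordered_map<uint64_t, std::weak_ptr<ViewSpecificState>> states;

    std::lock_guard<std::mutex> lock(mapMutex);

    if (auto existing = states[sessionId].lock())
        return existing;

    // Prune entries whose state has already been deallocated.
    for (auto it = states.begin(); it != states.end();) {
        if (it->second.expired() && it->first != sessionId)
            it = states.erase(it);
        else
            ++it;
    }

    auto state = std::make_shared<ViewSpecificState>();
    states[sessionId] = state;
    return state;
}
```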
They're surprisingly expensive to look up.
WeiN76LQh added 4 commits November 25, 2024 19:39
…ding a library

This is useful when batch loading libraries to avoid extra processing (once the next commit has landed).
A problem with processing Objective-C sections at the time a library is loaded is that some of the references within those sections may refer to sections that are not yet loaded. This results in things like selectors not being correctly typed and named.

This commit provides a way for users to manually trigger Objective-C parsing against the sections of a specific library, or of all libraries, via the API. Combined with the previous commit, a user can use the API to batch load a number of libraries, skip Objective-C processing for each one, and then run it across the entire DSC once they are all loaded.

Further improvement would be to provide a way to trigger Objective-C processing through the UI.
@WeiN76LQh commented:

I did some heap profiling, and it seems that when Binary Ninja is consuming 30GB of RAM after loading 30 libraries, 90% of the usage is coming from core Binary Ninja allocations rather than the plugin. So, as I think you mentioned before, we're at the mercy of Vector35 solving that issue. The biggest chunk of RAM usage from the DSC plugin is the symbol export lists, which will naturally consume a gigabyte or two because of the number of strings involved, so I don't think anything worthwhile can be done there.

I also did some more performance testing. During image loading, the bulk of the execution time is in SharedCache::LoadSectionAtAddress, with some in SharedCache::LoadImageWithInstallName. I think we've solved SharedCache::FindSymbolAtAddrAndApplyToAddr, which was also slowing loading. A few other functions have a small impact: SharedCache::SaveToDSCView, SharedCache::Store, and SharedCache::InitializeHeader.

During the analysis phase most of the execution time is in fixObjCCallTypes and to a lesser extent in SharedCacheWorkflow::FixupStubs.

@bdash (Owner, Author) commented Nov 26, 2024

I think I'll personally hold off on trying to optimize anything further until we start seeing PRs merged upstream. As it stands it is now fast enough for my typical usage (opening a shared cache to poke at the couple of libraries I happened to be interested in at that moment, then throwing it all away). Once changes start to get merged and I'm not trying to work across a giant stack of changes like this I might revisit things again, just because making things faster is a lot of fun.

I'll merge the remaining PRs we've each submitted upstream into this branch, and I'll continue to keep an eye out for any other improvements you make so I can merge them here. That will ensure that there's an easy way for anyone that's interested to get a working, fast shared cache plug-in.

@WeiN76LQh commented:

Yes, I agree; I'm pretty much in the same boat. I appreciate all the work you've done to make working with the DSC in Binary Ninja as performant as it is on this branch.

@WeiN76LQh commented:

Not sure what happened between that comment and now, but I'm seeing pretty major performance regressions on this branch. Foundation now takes more like 600 seconds to load on the latest commit. If I jump back to the commit I was on when I made that comment, I still get the numbers I stated.

@bdash (Owner, Author) commented Nov 28, 2024

I've been consistently seeing times of 2-3 minutes for Foundation. The only time I've seen anything under 30 seconds was due to a bug resulting in a bunch of analysis work being skipped.

@WeiN76LQh commented Nov 29, 2024

I think I know what's going on. Using BeginBulkModifySymbols in SharedCache::FindSymbolAtAddrAndApplyToAddr is proving to be very expensive: removing the BeginBulkModifySymbols and FinishBulkModifySymbols calls results in a 2.5x to 3x speed increase when loading Foundation.

m_dscView->BeginBulkModifySymbols();

What I don't understand is why BeginBulkModifySymbols is being used in that location in the first place. It seems like it's probably a mistake, as only one symbol is likely to be defined in that situation. My expectation is that you'd use that function when a block of code adds a number of symbols in one go.

@bdash (Owner, Author) commented Nov 29, 2024

Yeah, skipping BeginBulkModifySymbols when modifying only a single symbol was a decent speedup when I tested it, as was avoiding creating an undo action in cases where the code skips doing work.

@WeiN76LQh commented:

Yes, I cleaned up that function a little as well. The if/else if at the start also seems to duplicate work, so I think the else if branch can be removed entirely.
