Adjacency list optimizations #9444

lettertwo · 2023-12-15T00:03:51Z

EDIT: Added docs, too! They're in the PR, but easier to read on the branch: https://github.com/parcel-bundler/parcel/blob/adjacency-list-optimizations/docs/AdjacencyList.md

By making two adjustments to Parcel’s AdjacencyList:

- the memory footprint of Parcel’s three biggest graphs is reduced by ~52%
- writes are faster by ~5%

For a real world, very large app, this amounts to ~800MB reduction in size with no regression in startup, build, or shutdown times.

Background

AdjacencyList is already highly optimized for avoiding overhead in message passing. In Parcel, graphs are used by multiple threads, so this implementation of the AdjacencyList stores data external to the JS heap. This allows the data to be shared across threads without incurring the overhead of serializing and deserializing what is often a large number of edges. This had a big impact on Parcel’s runtime characteristics (see #6922).
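As a rough illustration of the idea of keeping edge data outside the JS heap, here is a minimal sketch that assumes a `SharedArrayBuffer` viewed through a `Uint32Array`. The names `EDGE_SIZE`, `createEdgeStore`, and `writeEdge`, as well as the two-field record layout, are hypothetical and not taken from the PR.

```javascript
// Hypothetical sketch: edge data lives in a SharedArrayBuffer rather than in
// JS objects, so another thread can view the same memory without copying.
const EDGE_SIZE = 2; // assumed record layout: [from, to]

function createEdgeStore(edgeCapacity) {
  const buffer = new SharedArrayBuffer(
    edgeCapacity * EDGE_SIZE * Uint32Array.BYTES_PER_ELEMENT,
  );
  return {buffer, edges: new Uint32Array(buffer)};
}

function writeEdge(edges, index, from, to) {
  edges[index * EDGE_SIZE] = from;
  edges[index * EDGE_SIZE + 1] = to;
}

const store = createEdgeStore(4);
writeEdge(store.edges, 0, 7, 9);

// A worker that receives `store.buffer` (e.g. via postMessage) would wrap it
// the same way; this second view stands in for that worker here.
const workerView = new Uint32Array(store.buffer);
```

Because both views are backed by the same buffer, a write through one is visible through the other without any serialization step.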

However, it’s not all roses; some suboptimal behaviors have been observed, particularly at scale:

- AdjacencyList uses a lot of memory
  - even though this memory is shared, it’s still surprisingly high for how efficiently the data is stored
- serialized data is very large on disk
  - in Parcel’s cache, AdjacencyList data makes up a sizable percentage of the total cache size

It turns out that these are two symptoms of the same disease: premature optimization.

Optimization 1: the load factor

Today’s version of AdjacencyList automatically resizes itself as nodes and edges are added. This resizing event occurs when the ratio of edges to capacity meets or exceeds a constant term known as the load factor.

The current implementation uses a load factor of 0.7, which is meant to trigger a resize sooner than absolutely necessary. The intended benefit of this optimization is that collisions in the hashmap are less likely due to there always being at least 30% capacity available.
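The resize trigger described above can be sketched as a single comparison. This is only an illustration; `shouldResize` and its parameters are hypothetical names, not the actual AdjacencyList API.

```javascript
// Hypothetical sketch of the resize trigger: resize once the ratio of edges
// to capacity meets or exceeds the load factor.
function shouldResize(edgeCount, capacity, loadFactor) {
  return edgeCount / capacity >= loadFactor;
}

// With a capacity of 1000, a load factor of 0.7 triggers at 700 edges,
// while a load factor of 1 waits until all 1000 slots are in use.
const resizesEarly = shouldResize(700, 1000, 0.7);
const resizesWhenFull = shouldResize(1000, 1000, 1);
const notYetFull = shouldResize(999, 1000, 1);
```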

In hindsight, this may have been premature. Maintaining that much excess capacity at scale is quite expensive, memory-wise. While it may yield some overall benefit in terms of amortizing the cost of collisions (as evident in the duration being shorter for really large graphs; see permutation B in the benchmarks), maximizing the load (a load factor of 1) has a big impact (~833 MB!) on memory footprint (and consequently, cache size).

Optimization 2: Right-sizing buckets

In the version of AdjacencyList shipped today, space is allocated to accommodate a bucket size of 2. The intention here was that we can avoid excessive resizing by allowing a higher number of collisions in the underlying hashmap before running out of space.

This optimization has also proven to be premature; it turns out that not allocating any extra space to accommodate collisions at all (a bucket size of 1), when combined with a load factor of 1, has roughly equivalent outcomes to the defaults, while still maintaining most of the size benefit of just adjusting the load factor alone. In fact, the benchmarks indicate that these two adjustments yield wins in both memory footprint and read/write/resize performance!
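To make the interaction between load factor and bucket size concrete, here is a back-of-the-envelope sketch. It assumes a layout of one hash slot per unit of capacity plus `bucketSize` edge records per unit of capacity; that layout, `ITEM_SIZE`, and `estimateBytes` are all assumptions for illustration, and the real allocation also depends on the resize curve.

```javascript
// Hypothetical sizing model, not the actual AdjacencyList layout.
const ITEM_SIZE = 4; // assumed number of 32-bit fields per edge record

function estimateBytes(edgeCount, loadFactor, bucketSize) {
  // Smallest capacity that keeps edgeCount / capacity at or under the load factor.
  const capacity = Math.ceil(edgeCount / loadFactor);
  const hashSlots = capacity;
  const itemSlots = capacity * bucketSize * ITEM_SIZE;
  return (hashSlots + itemSlots) * Uint32Array.BYTES_PER_ELEMENT;
}

const permutationA = estimateBytes(1000, 0.7, 2); // defaults
const permutationB = estimateBytes(1000, 1, 2); // load factor of 1
const permutationC = estimateBytes(1000, 1, 1); // load factor of 1, bucket size of 1
```

Under this model each adjustment shrinks the allocation for the same number of edges, which is the direction of the effect reported in the benchmarks; the exact ratios in the PR come from measurement, not from this model.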

The benchmarks

The approach for these benchmarks starts with instrumenting the AdjacencyList to record every write operation that is applied during a production build of a large real world app. The resulting recordings are then played back using a differently instrumented version of AdjacencyList that allows tweaking the parameters of the list’s allocation behaviors.
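The record-and-replay approach could look roughly like the sketch below. `TinyGraph`, `record`, and `replay` are hypothetical stand-ins; the actual instrumentation in the PR is not shown here and may work differently.

```javascript
// Minimal stand-in for a graph; the real AdjacencyList API may differ.
class TinyGraph {
  constructor() {
    this.edges = new Set();
  }
  addEdge(from, to, type = 1) {
    this.edges.add(`${from}->${to}:${type}`);
  }
  removeEdge(from, to, type = 1) {
    this.edges.delete(`${from}->${to}:${type}`);
  }
}

// Wrap a graph so that every write is appended to `log` before being applied.
function record(graph, log) {
  return {
    addEdge(from, to, type = 1) {
      log.push(['addEdge', from, to, type]);
      graph.addEdge(from, to, type);
    },
    removeEdge(from, to, type = 1) {
      log.push(['removeEdge', from, to, type]);
      graph.removeEdge(from, to, type);
    },
  };
}

// Apply a recorded log, in order, to any graph with the same write methods.
function replay(log, graph) {
  for (const [op, ...args] of log) {
    graph[op](...args);
  }
  return graph;
}

const log = [];
const original = new TinyGraph();
const recorded = record(original, log);
recorded.addEdge(0, 1);
recorded.addEdge(1, 2);
recorded.removeEdge(0, 1);

// The same log can be replayed against differently parameterized graphs.
const replayed = replay(log, new TinyGraph());
```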

The below charts show the impact of these changes on the AssetGraph, the BundleGraph, and the RequestGraph.

- Permutation A is the default parameters, which reflects what is in production today.
- Permutation B shows the effect of adjusting the load factor to 1.
- Permutation C shows the effect of combining a bucket size of 1 with the adjusted load factor.

| | AssetGraph | BundleGraph | RequestGraph |
|---|---|---|---|
| nodes | 521,958 (0 unconnected) | 481,920 (493,589 unconnected) | 1,381,969 (2,571,265 unconnected) |
| edges | 806,717 (0 deleted) | 6,139,973 (6,462 deleted) | 12,068,253 (0 deleted) |

Of particular interest in these results:

- the load (the ratio of data to capacity) jumps from below 50% across all 3 graphs to above 90%. This means that, with these changes, almost all of the allocated space is being used for all three graphs (whereas before, there was more than 50% going unused).
- the collision rate remains nearly identical with both changes applied.

Tweaking the resize curve

The AdjacencyList resizes the capacities for nodes and edges differently. For nodes, it simply doubles the capacity at each resize. For edges, it resizes more aggressively early on, then progressively less aggressively (the growth tapers off linearly) until an inflection point, after which it also just doubles the capacity at each resize.

These parameters are now exposed for tweaking, but in my testing so far, I haven’t found any combo that is strictly better than the current defaults, so I did not change them.
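One way to picture such a resize curve is a growth factor that is interpolated linearly between a maximum and a minimum as capacity approaches the inflection point. The constants and function names below are hypothetical values chosen for illustration, not the defaults from the PR.

```javascript
// Hypothetical parameters for illustration only.
const MAX_GROW_FACTOR = 8; // assumed growth factor for very small capacities
const MIN_GROW_FACTOR = 2; // doubling, used at and beyond the inflection point
const PEAK_CAPACITY = 2 ** 18; // assumed inflection point

// Growth factor falls linearly from MAX to MIN as capacity approaches PEAK.
function growFactor(capacity) {
  const t = Math.min(1, Math.max(0, capacity / PEAK_CAPACITY));
  return MAX_GROW_FACTOR + (MIN_GROW_FACTOR - MAX_GROW_FACTOR) * t;
}

function nextEdgeCapacity(capacity) {
  return Math.round(capacity * growFactor(capacity));
}

// Nodes simply double at each resize.
function nextNodeCapacity(capacity) {
  return capacity * 2;
}
```

Past `PEAK_CAPACITY` the edge curve and the node curve behave the same way, since the clamped factor stays at `MIN_GROW_FACTOR`.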

Previously, we were allocating extra space for 'buckets' to accommodate hash collisions, but this turns out to waste a lot of space in large graphs. Additionally, we are no longer allocating space for nodes ahead of time; now, the nodes array will grow on demand, as edges are added.

This unlocks the ability to resize without creating a new intermediary AdjacencyList.

The (incorrect) assumption was that there should be the same node record count after a resize of edges, but this is not necessarily the case; if there were deleted edges before the resize, then there may be node records that will also be deleted (by virtue of no longer having any edges connected to them) as part of the resize.
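The following sketch illustrates why the node record count can drop across an edge resize. It assumes one node record per (node, edge type) pair and models a resize as copying only the live edges; both are simplifying assumptions, and `nodeRecordKeys` and `resizeEdges` are hypothetical names.

```javascript
// Assumed model: a node record exists for each (node, edge type) pair that
// has at least one edge attached to it.
function nodeRecordKeys(edges) {
  const keys = new Set();
  for (const {from, to, type} of edges) {
    keys.add(`${from}:${type}`);
    keys.add(`${to}:${type}`);
  }
  return keys;
}

// Assumed model of a resize: only edges not marked deleted are copied over.
function resizeEdges(edges) {
  const liveEdges = edges.filter(edge => !edge.deleted);
  return {edges: liveEdges, nodeRecordCount: nodeRecordKeys(liveEdges).size};
}

const edges = [
  {from: 0, to: 1, type: 1, deleted: false},
  {from: 1, to: 2, type: 2, deleted: true},
];

// Records that existed for the deleted edge are still counted before the resize...
const before = nodeRecordKeys(edges).size;
// ...but are not recreated when only live edges are copied.
const after = resizeEdges(edges).nodeRecordCount;
```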

mattcompiles

This looks great bud. The write-up is excellent 👏 Just one comment about the lingering TODO.

packages/core/graph/src/AdjacencyList.js

lettertwo · 2023-12-22T20:46:01Z

I used this branch to do a full production build of a big project (the same project used to benchmark the changes), and compared it to a build of the same project using v2, and there were no differences in build output 🎉

There were differences in cache output:

| | AssetGraph | BundleGraph | RequestGraph |
|---|---|---|---|
| nodes | 447,509 (0 unconnected) | 646,833 (661,425 unconnected) | 1,234,145 (2,145,889 unconnected) |
| edges | 659,997 | 5,549,467 | 10,253,224 |

Duration here reflects the time spent in V8’s deserialize function (captured while loading the graphs from cache to dump their stats). Almost all of the cost belongs to the node properties, which are stored in regular JS arrays, not the AdjacencyList.

The size recorded here is just the size of the array buffers. On disk, this build saw a cumulative size reduction of the graphs from ~3.1GB to ~2.3GB, a savings of ~750MB.

Overall, there is no apparent regression in runtime performance, and the reduction in memory footprint is sizeable.

The largest graph deserialized the fastest! This is because, despite having many more nodes and edges than the other graphs, the RequestGraph has relatively few, and simpler, node properties.

devongovett

This is great!

devongovett · 2024-01-20T17:36:56Z

docs/AdjacencyList.md

+  1[Node 1] -- incoming  --> a[[edge a]]
+  1[Node 1] -- incomingReverse   --> a[[edge a]]
+  1[Node 1] -- outgoing --> c[[edge c]]
+  1[Node 1] -- outgoingReverse  --> c[[edge c]]


Should these reverse arrows be pointing the other way? They look the same as the non-reversed arrows right now. Did I misunderstand something?

lettertwo · 2024-01-30

Is this any better?

graph LR
  subgraph 0[Node 0]
    direction LR
    0o([outgoing]) --- 0oa[[a]] <--> 0ob[[b]] --- 0or([outgoingReverse])
  end
  subgraph 1[Node 1]
    direction LR
    1i([incoming]) --- 1ia[[a]] --- 1ir([incomingReverse])
    1o([outgoing]) --- 1oc[[c]] --- 1or([outgoingReverse])
  end
  subgraph 2[Node 2]
    direction LR
    2i([incoming]) --- 2ib[[b]] <--> 2ic[[c]] --- 2ir([incomingReverse])
  end


devongovett · 2024-01-20T17:45:28Z

docs/AdjacencyList.md

+  na41 -- first out --> ea31
+  na41 -- last out  --> ea31
+```
+


packages/core/graph/src/AdjacencyList.js

* upstream/v2: (22 commits)
  Add source map support to the inline-require optimizer (#9511)
  [Web Extension] Add content script world property to manifest schema validation (#9510)
  feat: add getCurrentPackageManager (#9505)
  Default Bundler Contributor Notes (#9488)
  rename parentAsset to root for msb config and remove unstable (#9486)
  Macro errors -> v2 (#9501)
  Statically evaluate constants referenced by macros (#9487)
  Multiple css bundles in Entry bundle groups issue (#9023)
  Fix macro issues (#9485)
  Bump follow-redirects from 1.14.7 to 1.15.4 (#9475)
  Revert more CI changes to centos job (#9472)
  Use lightningcss to implement CSS packager (#8492)
  Fixup CI again (#9471)
  Clippy and use napi's Either3 (#9047)
  Upgrade to eslint 8 (#8580)
  Add support for JS macros (#9299)
  Fixup REPL CI (#9467)
  Drop per-pipeline transformation cache (#9459)
  Upgrade some CI actions (#9466)
  REPL (#9365)
  ...

lettertwo added 11 commits December 13, 2023 16:31

Parameterize AdjacencyList

11ba839

refactor: extract edge linking behavior from addEdges method

99dcb84

This unlocks the ability to resize without creating a new intermediary AdjacencyList.

fix: improve map capacity overflow detection

ae960c7

fix: resizing computations

54f9e74

Remove loadFactor

15a4c43

fix: node resizing

beca69c

fix: avg collisions calculation

7902031

Rename capacity to initialCapacity

9169b43

fix tests

651a1db

mattcompiles reviewed Dec 18, 2023

packages/core/graph/src/AdjacencyList.js

lettertwo added 3 commits December 21, 2023 13:15

Enforce assumption that linked edge types must match

6f3a7c3

Add docs

6a66d4d

Merge branch 'v2' into adjacency-list-optimizations

868d1ad

lettertwo requested a review from mattcompiles December 22, 2023 20:46

lettertwo added 3 commits December 22, 2023 16:21

Update AdjacencyList.md

12fa072

Update AdjacencyList.md

d0a2667

Update AdjacencyList.md

2f09141

github-actions bot deployed to Preview January 8, 2024 23:57

devongovett approved these changes Jan 20, 2024

lettertwo added 3 commits January 30, 2024 17:36

Refactor link results to an enum

cfb444e

Update AdjacencyList.md

45c7d25

github-actions bot deployed to Preview January 30, 2024 23:14

Merge branch 'v2' into adjacency-list-optimizations

6f114ab

github-actions bot deployed to Preview February 1, 2024 18:18

Merge branch 'v2' into adjacency-list-optimizations

3932d45

github-actions bot deployed to Preview February 3, 2024 00:04

Merge branch 'v2' into adjacency-list-optimizations

20567ef

github-actions bot deployed to Preview February 13, 2024 21:26

lettertwo merged commit 2215d36 into v2 Feb 13, 2024
14 of 16 checks passed

lettertwo deleted the adjacency-list-optimizations branch February 13, 2024 23:38

