
Leo Meyerovich visit 2013-07-22

Jack Moffitt edited this page Jul 24, 2013 · 1 revision
  • pcwalton: we have three basic parallel traversals for layout: global width, bubble heights, ...
  • pcwalton: here's constraint solving: bubble widths (bottom up), assign widths (top down), assign heights (in order, switching to bottom up), display list building (top down). DOM goes to Flow and Boxes, and then to display list. flow tree leaves are render boxes.
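The three constraint-solving passes pcwalton lists can be sketched roughly like this. This is a toy model, not Servo's actual code: the struct and method names, and the fixed per-box line height, are all illustrative.

```rust
// Illustrative sketch of the three layout passes: bubble widths
// (bottom up), assign widths (top down), assign heights (bottom up).
struct Flow {
    children: Vec<Flow>,
    intrinsic_width: f32, // leaf content width
    pref_width: f32,      // computed bottom-up
    used_width: f32,      // computed top-down
    height: f32,          // computed bottom-up
}

impl Flow {
    fn new(intrinsic_width: f32, children: Vec<Flow>) -> Flow {
        Flow { children, intrinsic_width, pref_width: 0.0, used_width: 0.0, height: 0.0 }
    }

    // Pass 1 (bottom up): bubble preferred widths from leaves to root.
    fn bubble_widths(&mut self) {
        let mut w = self.intrinsic_width;
        for c in &mut self.children {
            c.bubble_widths();
            w = w.max(c.pref_width);
        }
        self.pref_width = w;
    }

    // Pass 2 (top down): assign used widths from the containing block.
    fn assign_widths(&mut self, available: f32) {
        self.used_width = self.pref_width.min(available);
        let w = self.used_width;
        for c in &mut self.children {
            c.assign_widths(w);
        }
    }

    // Pass 3 (bottom up): a box's height depends on its children's heights.
    // 10.0 stands in for a fixed per-box line height.
    fn assign_heights(&mut self) {
        let mut h = 10.0;
        for c in &mut self.children {
            c.assign_heights();
            h += c.height;
        }
        self.height = h;
    }
}
```

Because pass 1 and pass 3 only flow information upward and pass 2 only downward, each pass can in principle run over disjoint subtrees in parallel, which is the point of splitting layout this way.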
  • leo: that's sort of like our macro nodes.
  • pcwalton: there is also the float flow, which can be a child of blocks or inlines. we parallelize over flows, but we render over boxes. render boxes under an inline flow may get reshuffled due to line breaking. in gecko flows and boxes are the same type, so there are complicated traversals to support line breaking, etc.
  • leo: this type of transformation is terrible for SIMD, but works fine for multicore.
  • pcwalton: the flow tree is immutable after creation. the boxes may change, but the flow tree is immutable and this makes the parallelism much easier. the other thing before this is css style matching which computes the style.
  • leo: it's unclear to me which things you must do during styling and which during traversals for best parallelism.
  • pcwalton: incremental reflow is not done yet, but we plan to calculate style differently and nuke/rebuild subtrees or flood dirty bits.
  • leo: when would you need to recompute the subtree?
  • pcwalton: changing display property
  • eatkinson: webkit tries to reuse the subtree but has lots of bugs
  • pcwalton: changing inline elements to table cells doesn't work at all for example. display: none is the most common case.
  • pcwalton: the plan is to do what gecko does, but in a parallel setting. we have two dirty bits: frame is dirty and frame has dirty children. one naive approach is just to run the normal traversals but turn clean nodes into no-ops. hopefully work stealing will distribute the work onto the other cores.
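The two-dirty-bit scheme can be sketched as follows. The field and function names are hypothetical; `reflow_count` stands in for real layout work so the pruning is observable.

```rust
// Sketch of incremental reflow with two dirty bits: run the normal
// traversal, skip subtrees with no dirty descendants, and make clean
// nodes no-ops.
struct Node {
    dirty: bool,           // this node itself needs reflow
    has_dirty_child: bool, // some descendant needs reflow
    children: Vec<Node>,
    reflow_count: u32,     // stand-in for actual layout work done
}

fn reflow(node: &mut Node) {
    if !node.dirty && !node.has_dirty_child {
        return; // clean subtree: the whole traversal is a no-op here
    }
    if node.dirty {
        node.reflow_count += 1; // do the real layout work here
        node.dirty = false;
    }
    for c in &mut node.children {
        reflow(c);
    }
    node.has_dirty_child = false;
}
```

Under work stealing, the hope pcwalton describes is that the few non-pruned subtrees still spread across cores even though most nodes return immediately.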
  • leo: getting scheduling right for the incremental context is tricky.
  • pcwalton: the workload should be small.
  • eatkinson: constraint solving seems ok, but building/rebuilding the flow tree seems harder to do in parallel. (missed)
  • pcwalton: flowtree construction is currently in order but should be bottom up for parallelism.
  • leo: how the memory flows between selector matching and flow tree construction is very interesting. you don't want to be moving memory across cores, and i'm not sure how you'd do that in Rust.
  • leo: i think there are two sets of things that can happen. either selector matching is so big that you don't worry how it connects. or the other thing is as soon as you do style matching, fuse box creation and do it at the same time. (missed)
  • pcwalton: when it comes to constraint solving, we think we have a way to make all this safe (see email to dev-servo). the traversals call kernel functions which are constrained to be able to only look at their own children and not their parents or siblings... they can even mutate them if you want to. then we can guarantee data race freedom.
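The restriction pcwalton describes can be sketched by giving the traversal a kernel whose signature only hands it the current node (and, through it, the children). This is an illustrative model of the idea from the dev-servo email, not Servo's actual API.

```rust
// Sketch: a bottom-up traversal whose kernel can see and mutate only
// the current node and its children, never its parent or siblings.
// Two kernels running on disjoint subtrees therefore cannot race.
struct Flow {
    width: f32,
    children: Vec<Flow>,
}

// The kernel's signature is the safety argument: everything it can
// reach is below `node`.
fn bubble_widths_kernel(node: &mut Flow) {
    node.width = node.children.iter().map(|c| c.width).fold(node.width, f32::max);
}

fn bottom_up<F: Fn(&mut Flow)>(node: &mut Flow, kernel: &F) {
    for c in &mut node.children {
        bottom_up(c, kernel); // in parallel these could be spawned tasks
    }
    kernel(node); // children are done before the parent runs
}
```

A real parallel version would spawn the child recursions as tasks; the sequential loop above keeps the sketch self-contained while preserving the child-before-parent ordering the kernels rely on.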
  • leo: you might also be able to do a linear typing thing where you can guarantee (missed)
  • pcwalton: i think ultimately you need an attribute grammar to prove that it's actually correct. when it comes to floats, the plan for now is during the bottom up traversals all the computation is no-ops until you get to the root of the subtree, and then do an in-order traversal there.
  • eatkinson: the inorder traversals never have to cross a float boundary.
  • leo: what's your goal for scalability here?
  • pcwalton: it can be sublinear for now, but we want to get it all parallelized. if we want linear, we'd have to start compressing memory.
  • eatkinson: on wikipedia you wouldn't get any speedup.
  • pcwalton: there are clears aren't there?
  • eatkinson: the problem is there is always a bunch of stuff floated left and right. and we can't think of a way to guarantee that the floats don't go past a certain height.
  • pcwalton: the only other way to do it is speculation, and i'm concerned about that due to power.
  • leo: speculation should be ok, because it's 1-2x in the worst case.
  • pcwalton: i'm not sure how this approach would work with speculation.
  • leo: to me this just sounds alarm bells if wikipedia is not going to work.
  • pcwalton: i'm not sure how else we would do it.
  • leo: you might be able to do a prepass. i'm an optimist in that i think you'll figure out something, but the other question is do you need to? i'm curious why speculation is an undesirable thing?
  • pcwalton: it's just not the first thing i want to reach for. it's complicated and it can also backfire. i'm more interested in creating the framework for these strategies. most likely any parallelization of constraint solving is parallelizing the top down and bottom up traversals. it's possible that there will be so many floats we lose the parallelism. (missed)
  • pcwalton: shaping was the vast majority of layout time, until we implemented a per-word cache. you can run harfbuzz in parallel across text runs. it's still not great because it doesn't give you intra-textrun parallelism.
  • leo: that's where i'd use SIMD instead of multi-core. the other thing there is this really awesome paper by martin bernard on compression. most stuff is predictably similar and you can further compact things with predication. every time i compressed things further it would get faster.
  • pcwalton: if we wanted to do memory compression stuff, it might be best to approach it as a garbage collection problem. every so often you just compress.
  • leo: you can then combine that with data layout too.
  • pcwalton: it's like a compaction phase. you take the tree and compress all the nodes.
  • leo: let's say you have parallel layout, but you have sequential memory. but you save so much on bandwidth that it's worth it. sort of like style sharing but even more aggressive.
  • pcwalton: i think treating it as a gc problem is the way to handle it in a dynamic scenario.
  • leo: i like that.
  • pcwalton: does this sound sane?
  • leo: the only ambiguity is the data representation stuff. making sure data is where it needs to be so that you're not blocking on it.
  • leo: if you have a nice access pattern through the traversals, you could split the nodes. you could put the widths and heights in separate parts.
  • pcwalton: when you did compression what did you do?
  • leo: the big things that i did: tiling so that a subtree would be contiguous in memory.
  • pcwalton: that's something we can probably do.
  • leo: make that part of the bottom up construction phase if you can figure that out. i did do a lot of stuff like using offsets instead of pointers to save space. if you can eliminate all the local pointers, you'll save a big chunk of the representation.
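The offsets-instead-of-pointers idea can be sketched like this: flatten the tree into one vector, lay each node's children out as one contiguous run, and record that run as a small index range rather than a list of pointers. The layout and names are illustrative, not from any real engine.

```rust
// Sketch of a pointer-free flow tree: one flat vector, children
// referenced by a (first_child, child_count) index range instead of
// heap pointers, so a node is smaller and a tile is contiguous.
struct PackedNode {
    width: f32,
    first_child: u32, // index into `nodes`; meaningless if child_count == 0
    child_count: u32,
}

struct PackedTree {
    nodes: Vec<PackedNode>,
}

impl PackedTree {
    // Each node's children occupy one contiguous run of the vector,
    // so "the children of i" is just an index range.
    fn children(&self, i: usize) -> std::ops::Range<usize> {
        let n = &self.nodes[i];
        n.first_child as usize..(n.first_child as usize + n.child_count as usize)
    }
}
```

With 32-bit indices replacing 64-bit pointers (and no per-node allocation header), the per-node footprint shrinks considerably, which is the "big chunk of the representation" Leo is pointing at.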
  • pcwalton: i wonder what the distribution of floats to render boxes is.
  • leo: here's the trick that martin bernard did. he represents things as 2 bits: the first bit is whether speculation was correct, and the second bit ... (missed)
  • pcwalton: that's totally doable
  • leo: for me tiling and that were the big wins. i'm not sure if this would apply, but in bernard's stuff he would try to do value speculation as well. if you have objects where the objects look the same, then you can compress that. when i was doing stuff for vectorization, two interesting things popped out. the first is that you can actually do a structure-split data layout: instead of having a node with a bubble width and an assigned width, you'd have an array of bubble widths and an array of assigned widths. for data visualization this gave me a huge speedup. let's say you have a tree, and each node is if (x) then y else z. all of the branch predictors would work in the same way. the other part of this: creating the tiles became the bottleneck. i think you could do stuff like region allocation because you know the tile is only going to be so big.
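The structure-split layout Leo describes can be sketched as a struct-of-arrays: one vector per field, so a pass streams through only the data it touches. This is a deliberately simplified model (the width rule is just a clamp, and the names are hypothetical).

```rust
// Sketch of a struct-of-arrays flow representation: instead of each
// node owning a bubble width and an assigned width, keep one array per
// field. A pass then walks a single dense array, which is cache- and
// SIMD-friendly.
struct FlowSoA {
    bubble_widths: Vec<f32>,
    assigned_widths: Vec<f32>,
}

// Simplified assign-widths pass: clamp every bubble width to the
// available width. A real pass would follow the tree structure.
fn assign_widths(flows: &mut FlowSoA, available: f32) {
    for i in 0..flows.bubble_widths.len() {
        flows.assigned_widths[i] = flows.bubble_widths[i].min(available);
    }
}
```

Because every iteration runs the same straight-line code over adjacent memory, the branch predictor and auto-vectorizer both behave well, which matches the speedup Leo reports for data visualization.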
  • pcwalton: we don't know how big the subtrees will be.
  • leo: you could do it in two passes.
  • pcwalton: or you build it inefficiently and then fix it up in the gc passes.
  • leo: you want to have it done before constraint solving.
  • pcwalton: the first layout of the page is the most important one. it does the most work. the only exception is infinite scroll on twitter and pinterest.
  • leo: there are two nice things with tiling. you don't need perfect tiling; just make a new tile if you need another one. the other is if you could limit how much memory is touched in a pass you can do small traversals.
  • pcwalton: we worry about scheduling overhead for really small traversals.
  • leo: what i found in practice is that you can get 1-2ms traversals but getting faster than that is really hard.
  • leo: i couldn't get stuff faster than 2ms even with all my tricks.
  • pcwalton: i don't remember off the top of my head what kind of numbers we're looking at in servo right now. and most of the number is css selector matching.
  • leo: for css you might want to do multiple passes for tiling.
  • pcwalton: in the end what was the time split between constraint solving and css matching?
  • leo: selectors was dominating. i did selectors and cascading in two passes, and optimized the heck out of selectors.
  • leo: the trick for selectors you have three hashtables.
  • pcwalton: you should also have a bloom filter. the problem is they destroy your parallelism, but qualcomm had a very compressed bloom filter that ended up working well and you could just copy it to all of the cores.
  • leo: let's say you have your id selectors and class selectors. the id selectors might be interested in class bar and parent foo. class might be interested in .baz and bar. split the bloom filters. i think it had on the order of a 2x speedup. once you start thinking about memory things get weird. (discussion about memory layout)
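The ancestor bloom filter being discussed can be sketched with a toy 64-bit filter: as the tree is walked, each ancestor's tag, id, and classes are hashed in; a descendant selector like `.foo .bar` can then be rejected without walking up the tree whenever `.foo` is definitely not in the filter. This is a toy, not Qualcomm's compressed design or Servo's implementation.

```rust
// Toy ancestor bloom filter for selector matching. Two bits per key in
// a single u64 word. Misses are exact ("definitely not an ancestor");
// hits may be false positives and require a real ancestor walk.
struct TinyBloom(u64);

// FNV-1a, purely for the sketch.
fn hash(s: &str) -> u64 {
    s.bytes()
        .fold(0xcbf29ce484222325u64, |h, b| (h ^ b as u64).wrapping_mul(0x100000001b3))
}

impl TinyBloom {
    fn new() -> TinyBloom { TinyBloom(0) }

    // Called for each ancestor's tag name, id, and classes while
    // descending the tree.
    fn insert(&mut self, s: &str) {
        let h = hash(s);
        self.0 |= 1u64 << (h % 64);
        self.0 |= 1u64 << ((h >> 32) % 64);
    }

    // false => definitely absent; true => maybe present.
    fn may_contain(&self, s: &str) -> bool {
        let h = hash(s);
        self.0 & (1u64 << (h % 64)) != 0 && self.0 & (1u64 << ((h >> 32) % 64)) != 0
    }
}
```

Because the whole filter is one word, copying it to every core (as pcwalton describes) is trivial, which is what makes the trick compatible with parallel matching.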
  • leo: i can send you my work stealing benchmark. it simulated running work stealing on a dom tree. each time step each thread makes 1 node of progress.
  • pcwalton: not everything an author writes will be fast. we can get people to write things that are parallel. if we can get people to use clear more often then it can send a powerful message to web developers. another thing is that we're running script and layout in parallel. what happens when script needs to block on layout? what can we do? maybe we could run the GC since we have nothing better to do. why don't we just introduce async versions of these APIs? (talk about rendering)
  • leo: we got nice speedups by doing two passes on display list creation. one to compute offsets and one to allocate.
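The two-pass display-list build Leo mentions amounts to a count pass followed by an exclusive prefix sum: each subtree first reports how many display items it will emit, the prefix sum gives each subtree its write offset, and the second pass fills one preallocated buffer with no allocation or locking. A minimal sketch of the offset step (the function name is hypothetical):

```rust
// Exclusive prefix sum over per-subtree item counts: returns each
// subtree's starting offset into the shared display list, plus the
// total size to preallocate.
fn offsets_from_counts(counts: &[usize]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(counts.len());
    let mut total = 0;
    for &c in counts {
        offsets.push(total); // this subtree writes at [total, total + c)
        total += c;
    }
    (offsets, total)
}
```

Since every subtree then writes into a disjoint range of the same buffer, the fill pass can run over subtrees in parallel without synchronization.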