
reduce is a dog (even in jq>1.4) #618

Closed

pkoppstein opened this issue Nov 13, 2014 · 32 comments
@pkoppstein
Contributor

pkoppstein commented Nov 13, 2014

The following function uses "reduce" to emit the "equilibrium indices" of the input array. Equivalent versions written using recurse or foreach run nicely in jq>1.4 (i.e. the builds that have TCO and foreach), but in both jq 1.4 and jq>1.4, the version using reduce shows dramatically worse (apparently superlinear) performance in the length of the input array.

Here are some gross timings using jq>1.4 (i.e. based simply on "time jq ...") for the following query:

array(N;0) | equilibrium_indices | length
# N     u+s (seconds)
# 1e3    0.19
# 1e4    0.43
# 2e4    1.54
# 3e4    3.43
# 4e4    6.2
# 5e4   10.4
# 6e4   16.5

Since the recurse and foreach versions of this program are fine in jq>1.4, the problem is evidently with the implementation of "reduce". Hopefully a skilled eye will be able to identify it; if not, then maybe "reduce" should be implemented using "foreach"!

def equilibrium_indices:
  def indices(a):
    reduce range(0;a|length) as $i 
      # state: [ headsum, tailsum, answers]
      ( [ 0, add, [] ];
        .[0] as $h
        | (.[1] - a[$i]) as $t
        | (if $h == $t then .[2] + [$i] else .[2] end) as $ans
        | [ $h + a[$i], $t, $ans ] )
    | .[2];

  . as $in | indices($in);

def array(n;init): reduce range(0;n) as $i ([]; . + [init]);
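For readers unfamiliar with the algorithm: an equilibrium index is a position where the sum of the elements before it equals the sum of the elements after it. The prefix/suffix-sum scheme the jq reduce uses can be sketched in Python (an illustrative translation, not part of the original report):

```python
def equilibrium_indices(a):
    """Indices i where sum(a[:i]) == sum(a[i+1:]), mirroring the jq state
    [headsum, tailsum, answers]."""
    head, tail, answers = 0, sum(a), []
    for i, x in enumerate(a):
        tail -= x                # tailsum excludes a[i] itself
        if head == tail:
            answers.append(i)
        head += x                # headsum now includes a[i]
    return answers

print(equilibrium_indices([-7, 1, 5, 2, -4, 3, 0]))  # [3, 6]
```

Note that, unlike the jq reduce version, this accumulates `answers` by in-place mutation, so it never pays for a whole-array copy per append.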
@wtlangford
Contributor

Would you mind providing the foreach and recurse versions you made? I think I have an idea.

@pkoppstein
Contributor Author

def equilibrium_indices_foreach:
  def indices(a):
    foreach range(0;a|length) as $i 
      # state: [ headsum, tailsum, answers]
      ( [ 0, add, null ];
        .[0] as $h
        | (.[1] - a[$i]) as $t
        | [ $h + a[$i], $t, if $h == $t then $i else null end];
        if .[2] then .[2] else empty end);
  . as $in | indices($in);
def equilibrium_indices_recurse:
  def indices:
    # Input: [ array, index, length, headsum, tailsum, answer]
    recurse(.[0] as $a | .[1] as $i | .[2] as $mx | .[3] as $h
            | if $i == $mx then empty          # all done
              else (.[4] - $a[$i]) as $t
              | (if $h == $t then $i else null end) as $ans
              | [ $a, $i + 1, $mx, $h + $a[$i], $t, $ans ]
             end )
    | .[5] | select( . );
  [., 0, length, 0, add, null] | indices;

@wtlangford
Contributor

Well, those definitely confirm my suspicion. Looks like the problem with the reduce version is the .[2] + [$i]. I ran the reduce version through gprof to confirm, and for the 3e4 input we spend a LOT of time in jvp_array_write. jvp_array_write expands the array when there isn't enough space, so I split that out into its own function and profiled again. Turns out, for the 3e4 input, we expand an array 120025 times. That's a lot! Especially since each expansion requires a copy of the entire contents to the new location.

The foreach and recurse versions don't do any array appending themselves. As a result, they're both quite snappy!

I'm not sure what we can do to improve upon this. Perhaps we can find a way to speed up the array copy process, but that seems unlikely to me, given the way the types work internally. @nicowilliams?

@pkoppstein
Contributor Author

@wtlangford - Some time ago, when I was working on a (pretty nifty) programming language, this question of how far to extend an array was resolved basically by doubling. However, naively trying to increase new_length in jvp_array_write reveals dragons. (I'd be interested to know why.) On the other hand, wrapping recurse in [] does work nicely, so maybe it's time to retire the old implementation of reduce?

@wtlangford
Contributor

Keep in mind that the block of code you're looking at does a couple of things.
It decides on a new length (the index you want or the length of the array, whichever is larger) and then multiplies it by 1.5 (ARRAY_SIZE_ROUND_UP). I'm not sure exactly what change you went for there, but at a guess you did new_length = 2*jvp_array_size(*a);, which actually results in an array of triple the original size!
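For concreteness, the compounding wtlangford describes can be checked with a quick sketch (illustrative Python; the macro body ((n)*3)/2 is quoted later in this thread):

```python
def array_size_round_up(n):
    # Mirrors the quoted ARRAY_SIZE_ROUND_UP(n) macro: ((n)*3)/2, integer math.
    return (n * 3) // 2

size = 1000
# Intended behavior: round the required length up by 1.5x.
assert array_size_round_up(size) == 1500
# Pre-doubling the length first compounds the factors: 2 * 1.5 = 3x.
assert array_size_round_up(2 * size) == 3000
```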

I would say we definitely do not retire this implementation of reduce. While not necessarily the recommended call for reducing values into an array, it is still quite good at reducing into objects and numbers and so on.

That being said, I am looking for a way to make array duplication less expensive.

@pkoppstein
Contributor Author

So there are at least two distinct issues:

  1. Since jq does expand arrays by a factor of 1.5, why are so many allocations required? Getting to 1e6 elements shouldn't take more than about 35 expansions, so on average each iteration should require only a couple of allocations. Instead, we see dramatically worse performance.

  2. It seems to me that the key here is to avoid allocations in the first place, through optimization. Easier said than done, but suppose we eliminate the $ans variable and change the last clause in the reduce to:

        | [ $h + a[$i], $t, (if $h == $t then .[2] + [$i] else .[2] end) ] 

In this case, it is clear that the array at .[2] can be "recycled".
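The arithmetic behind point 1 checks out (a back-of-the-envelope sketch, assuming growth by a constant 1.5x factor each expansion):

```python
import math

# Growth steps needed to go from 1 element to 1e6 at 1.5x per step:
steps = math.ceil(math.log(1e6) / math.log(1.5))
print(steps)  # 35 -- consistent with the "no more than 35" estimate above

# Simulating the integer round-up ((n*3)/2) from a starting capacity of 2:
cap, expansions = 2, 0
while cap < 1_000_000:
    cap = (cap * 3) // 2
    expansions += 1
print(expansions)  # 33 -- nowhere near the 120025 expansions profiled above
```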

@wtlangford
Contributor

why are so many allocations required?

collect (the internal name for the [ ... ] operator) emits APPEND opcodes, which call through to jv_array_append(). At a guess, these are (for some reason) falling through to expanding allocations, though I'm not sure they should... I'll have to track it, but won't really get to it until some time this weekend.

it is clear that the array at .[2] can be "recycled".

I'm not sure what you mean by recycled...

@nicowilliams
Contributor

jvp_array_write() should probably increase the allocation size exponentially -- something like adding room for (current element count / 2) extra elements when growing the array.
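The policy Nico sketches can be modeled as follows (illustrative Python, not jq's actual jvp_array_write code; the floor of 8 is an assumption of the sketch):

```python
def grow_capacity(size):
    """Grow by half the current element count -- a 1.5x geometric policy."""
    return size + max(size // 2, 8)  # floor of 8 so tiny arrays don't crawl

# With geometric growth, the realloc count to reach 1e6 elements is small,
# and the total copying work stays linear in the final size.
cap, reallocs = 8, 0
while cap < 1_000_000:
    cap = grow_capacity(cap)
    reallocs += 1
print(reallocs)  # a few dozen, versus the ~120000 expansions profiled above
```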

@nicowilliams
Contributor

Oh, right:

#define ARRAY_SIZE_ROUND_UP(n) (((n)*3)/2)

1.5, right.

@nicowilliams
Contributor

BTW, the problem in the reduce version is a typo in your code. In the foreach version the second element of the state is .[1] - a[$i], and in the reduce version it's .[2] + [$i] (sometimes) -- you probably meant .[2] + a[$i].


@wtlangford
Contributor

Nope, the point there was to get an array of the indices that pass a test. [$i] was correct. Even so, it still would have incurred an array append, which was the overarching problem.


@nicowilliams
Contributor

Oh, I should stop trying to do two things at once.

@pkoppstein
Contributor Author

@wtlangford wrote:

I'm not sure what you mean by recycled...

Under certain circumstances, there is no need for an array to be copied (e.g. in the program '[][10000] = 1 | .[0] = 2 | add'). By "recycling" I basically mean taking advantage of such an opportunity.

I had at first assumed that the jq compiler was good at recycling in this sense, but I'm beginning to realize that there must be lots of succulent low-hanging recycling opportunities to be picked.

@nicowilliams
Contributor

jq recycles array and object values when they have only one reference. Always. The trick is to write jq code that yields that behavior so as to avoid copies.
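Nico's rule can be modeled with a toy sketch (illustrative Python, not jq's C internals; CowArray and its method names are invented for the example):

```python
class CowArray:
    """Toy model of copy-on-write with recycling: mutate in place only
    when exactly one reference holds the array; otherwise copy first."""

    def __init__(self, items=None):
        self.items = list(items or [])
        self.refs = 1                  # simulated reference count

    def retain(self):
        self.refs += 1
        return self

    def set(self, i, v):
        if self.refs == 1:             # sole owner: recycle in place
            target = self
        else:                          # shared: drop our ref and copy
            self.refs -= 1
            target = CowArray(self.items)
        target.items[i] = v
        return target

a = CowArray([0, 0, 0])
b = a.set(0, 1)            # only one reference, so the array is recycled
assert b is a
c = a.retain().set(1, 2)   # a is now shared, so set() must copy
assert c is not a and c.items == [1, 2, 0]
```

The performance bug discussed below is exactly the failure of the `refs == 1` fast path: stray extra references force the copying branch on every update.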

@nicowilliams
Contributor

Your example should definitely recycle that array.

@pkoppstein
Contributor Author

Hi, @nicowilliams, and welcome back to the party.

Based on my timings, it's hard to believe the array is being properly recycled in the way I think it should be:

def n: $n|tonumber;

def base:
  [][n] = 1 | length;

def test:
  [][n] = 1 | .[0] = 0 | length;

Results (u+s):

base # 1e8 => 2.43s
test # 1e8 => 7.1s

@nicowilliams
Contributor

I've tracked this down to assignment expanding to calls to _assign or _modify, which in turn cause two extra references to the input value (the array) to be held: one in the reduction state, and one on the stack, I think.

Using setpath instead of = avoids the extra references to the array in question.

I.e., this:

./jq --debug-trace --debug-dump-disasm -n '[][32] = 1 | setpath([0];0)'

goes fast (no reallocs), while this:

./jq --debug-dump-disasm --debug-trace -n '[][32] = 1 | .[0] = 0'

goes slow (reallocs once, in this case).

A fix would have to arrange to not have an internal var of any kind, using only the stack to keep track of the assignment reduction, and avoiding any extra DUPs. This will require bytecoding assignment, instead of jq-coding it.

In the meantime you have a workaround: use setpath.

@nicowilliams
Contributor

Oh, another possibility would be clearing the reduction state var prior to evaluating the reduction state update closure -- so, yes, indeed, reduce is arguably busted perf-wise, and there should be a fix.

@nicowilliams
Contributor

It's trickier than this. reduce uses LOADVN to avoid retaining a reference via the reduction state variable.

@pkoppstein
Contributor Author

@nicowilliams wrote:

Using setpath instead of = avoids the extra references to the array in question.

Excellent. I should have guessed the culprit was "=" -- in my experience, it usually is! Thanks.

In the meantime you have a workaround: use setpath.

A workaround for the "=" problem, but are you implying there's a workaround for the "reduce"-related issue as well?

@nicowilliams
Contributor

It'd be nice to fix this in 1.5, but I don't think we have the energy for it.

@nicowilliams
Contributor

@pkoppstein It turns out foreach is no better than reduce. Just look at the trace of .[3]=3 with s/reduce/foreach/ in _assign.

@pkoppstein
Contributor Author

Yes, but with foreach, there's an easy workaround. Is there one for reduce?

Meanwhile, thanks for giving this your attention. The exponential growth of required memory for "map/reduce" is something devoutly to be avoided.

@nicowilliams
Contributor

@pkoppstein Please re-post your foreach workaround, a complete test case.

BTW, I'm working with the hypothesis that stack_pop() is sometimes grabbing unnecessarily many references to the jv it's "popping", and that we need a new variant of it for use in the implementation of SUBEXP_END.

@nicowilliams
Contributor

Actually, the places where we need a stack_popn() seem to be: DUP and LOADVN, but while that fixes this problem, using a stack_popn() in DUP causes others. Adding a DUPN for use in the loop of the reduction's bytecode does make those other problems go away, so I think we're near a fix, but my understanding of why it works is... incomplete.

The problem is that we have a reference count leak. There is no corresponding memory leak because of the way stack unwinding works: we always free jv's when unwinding. But the reference count leak causes the optimized path of jq's copy-on-write behavior to not work in reduce because the input to the update expression has an extra reference. Actually, that value has an extra two references: one from LOADVN, and one from the DUP, both in the reduce update loop.

Those two leaks come from the fact that the value at the top of the stack isn't actually at the top when those opcodes are executed: they follow a FORK that pushes forkpoint. This means that the pop will not actually free the value and so it has to grab a reference (since otherwise we'll have a double-free when the stack unwinds). The obvious thing to do is to replace that jv with null, but it turns out that there are cases where the value on the stack that was intended-to-be-popped-but-couldn't-be... actually will be used again. This may denote a larger bug in the compiler: there must be missing DUPs here and there, because making stack_pop() have this behavior breaks a handful of tests, generally tests with [...] in them.

Therefore, for now, I'm going with the following fix:

  • add a stack_popn() that replaces with null the value on the stack when it could not be popped;
  • add a DUPN that is like DUP but uses stack_popn();
  • make LOADVN use stack_popn();
  • make gen_reduce() and gen_foreach() use DUPN instead of DUP in the body of the loop.

That works and passes all tests.

@pkoppstein
Contributor Author

@nicowilliams - Thanks for the fix!

For future reference, the "foreach" code is all there in my first two postings in this thread, but here's everything in one bundle:

# The index origin is 0 in jq.
def equilibrium_indices:
  def indices(a):
    foreach range(0;a|length) as $i 
      # state: [ headsum, tailsum, answers]
      ( [ 0, add, null ];
        .[0] as $h
        | (.[1] - a[$i]) as $t
        | [ $h + a[$i], $t, if $h == $t then $i else null end];
        if .[2] then .[2] else empty end);
  . as $in | indices($in);

def array(n;init): reduce range(0;n) as $i ([]; . + [init]);

def count(g): reduce g as $i (0; .+1);

##############################################  jq>1.4
count(array(1e6;0) | equilibrium_indices)     # 6.6s

$ time jq -n -f equilibrium_index.foreach.jq
1000000

real    0m6.471s
user    0m5.718s
sys     0m0.093s

@nicowilliams
Contributor

@pkoppstein Can you confirm that this fixes it for reduce? Also, thanks for the bug report! This was a very interesting bug indeed.

@nicowilliams
Contributor

(I myself confirmed it by observing a simple test case's debug trace with refcounts in the output.)

I do wonder how to write a test for this... I don't want to time something. I think I'll just compare output with debug tracing, though that seems brittle.

@nicowilliams
Contributor

OK, so here's the last word on this bug.

The problem was that FORK pushes a forkpoint and the DUP that follows the FORK for the loops in reduce and foreach wants to pop the value that was on the top of the stack before the FORK -- that value is now not on top of the stack, so it can't be popped.

The better fix for this issue is not to add DUPN as I did, but to add a variant of FORK that pops a value, pushes a forkpoint, then pushes the value popped earlier. Much simpler and easier to understand.

(A generalization would be to add an operand to FORK indicating how many values to pop and re-push after pushing the forkpoint, but since this operand today would only ever be zero or one, it's easier to add a new opcode instead.)

@leonid-s-usov
Contributor

Hi. Came here via the commit history, because I'm sniffing around the place where DUPN is added, and this instruction is standing in the way of generalising variable destructuring.

First of all, kudos for documenting this bug so well, because it was not at all clear why one would want to replace a value with null on the stack and thereby fix performance.

I've considered the possibility of making FORK pop and push, and that won't actually work. The problem is that FORK must provide the same top of the stack for the initial execution and for the backtrack jump. This is achieved by restoring the stack pointer to its value at the time the fork point was created.

I believe a cleaner fix to this issue would be to use a local var for $dot and LOADVN it after the fork. WDYT?

@leonid-s-usov
Contributor

Sorry, on second thought there is an even simpler fix to this, I guess. I noticed this improvement yesterday but didn't think it could save us much. The thing is, I don't see why FORK is needed there at all!

  block foreach = BLOCK(gen_op_simple(DUP),
                        init,
                        state_var,
                        gen_op_target(FORK, loop),
                        loop,
                        // At this point `foreach`'s input will be on
                        // top of the stack, and we don't want to output
                        // it, so we backtrack.
                        gen_op_simple(BACKTRACK));

You see, the first time FORK is executed it saves a fork point and proceeds to loop. When loop backtracks, FORK jumps to the next instruction, which is... a BACKTRACK! So, effectively, FORK just stops for a moment to backtrack again; it doesn't look like FORK is needed there at all.

The backtrack from loop would proceed up the stack in the same manner it does today, but without stopping at FORK for a moment. This solves the issue of the double reference -- which is not really a bug, but rather a vital property of the FORK instruction.

@leonid-s-usov
Contributor

leonid-s-usov commented Feb 10, 2020

For reduce we may benefit from a FORK1 instruction.

What is similar between reduce and foreach: neither wants the same value back after the FORK backtracks (as I showed above, foreach can optimise the fork away completely, but in its current state it just discards the input by backtracking). So the value is only used by the first filter of the fork.

What is different between reduce and foreach: reduce does require the FORK, because it wants to proceed further with the accumulated value. BUT that following instruction is a LOADVN, which simply discards the input -- it could just as well be a dup, that's it.

// reduce
  return BLOCK(gen_op_simple(DUP),
               init,
               res_var,
               gen_op_target(FORK, loop),
               loop,
               gen_op_bound(LOADVN, res_var));

Here are the two options I see that avoid the spooky DUPN:

  1. FORK1 which pops and pushes one value
  2. STOREV before FORK and LOADVN after fork - which effectively does the same as FORK1 above but in the currently available terms, and wasting a local var frame, and using a semi-magical LOADVN

P.S. I think that LOADVN should store a jv_invalid() into the frame var, just like the frame initialisation code does. This would mean "the value is consumed", and if one needs to store another value there it must be an explicit action. If one fails, it won't assert on stack restore; it would assert on LOADV or, potentially, backtrack.
