More flexible Hutchinson's implementation in Jax #14261

yaroslavvb · 2023-02-02T04:06:06Z

yaroslavvb
Feb 2, 2023

There's a flexible version of Hutchison's which recovers exact trace computation as a special case.
Instead of back-propagating 1 vector per example, we backpropagate k orthonormal vectors.
Using k=d produces exact result.
Proof: trace is rotationally invariant so this is the same as backpropagating identity matrix

This was motivated by discussion with @dpfau , where k=1 Hutchinson seemed not accurate enough
Below is a naively slow, yet correct implementation
How would you go about speeding this up?

k=1 case is should have the same complexity as 10 backward passes (1 hvp=5 backprops, use 2 samples)
k=d case should have the same complexity as jax.hessian

from jax import jvp, grad
from jax import numpy as jnp
import numpy as np
import math  # for fsum

def hvp(f, primals, tangents):
    """Hessian vector product"""
    return jvp(grad(f), primals, tangents)[1]


def random_ortho(d):
    """(d x d) sample from the Circular Real Ensemble (CRE)"""
    z = np.random.randn(d, d)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    ph = d / abs(d)
    return np.array(q * ph)


def hutchinsons(f, xs, d, k):
    """Estimates average Hessian trace using improved Hutchinson's estimator.

    f: R^d->R function to differentiate
    xs: batch of examples
    d: number of dimensions
    k: number of orthogonal vectors per pass, k=d gives exact result
    """

    s = 1 if k==d else 2   # number of stochastic samples to use per example
    trace = 0.
    for x in xs:
        for sample in range(s):
            vs = list(random_ortho(d))[:k]
            for v in vs:
                trace += v @ hvp(f, [x], [v])
    trace /= s * len(xs)    # average over examples and samples
    trace *= d / k          # bias correction
    return trace

d = 10    # dimensions
n = 10    # examples
coefs = 1/jnp.arange(1, d+1)
f = lambda x: (0.5*coefs*x*x).sum()
xs = jnp.ones((n, d))

def pp(s, val): print(f"{s:50s}: {val:2.6}")

print("correctness test")
np.testing.assert_almost_equal(math.fsum(coefs), hutchinsons(f, xs, d, k=d), decimal=5)

pp("Exact value of Hessian trace (native)", math.fsum(coefs))
pp("Exact value of Hessian trace (k=d)", hutchinsons(f, xs, d, k=d))
pp("Approx value of Hessian trace (k=d/2): ", hutchinsons(f, xs, d, k=d//2))
pp("Approx value of Hessian trace (k=2): ", hutchinsons(f, xs, d, k=2))
pp("Approx value of Hessian trace (k=1): ", hutchinsons(f, xs, d, k=1))

Answered by mattjj

Feb 2, 2023

Cool!

One quick thought is to replace some of those loops with calls to jax.vmap. For example, keeping only the loop over random sampling (because it's currently implemented with numpy):

def hutchinsons(f, xs, d, k):
    """Estimates average Hessian trace using improved Hutchinson's estimator.

    f: R^d->R function to differentiate
    xs: batch of examples
    d: number of dimensions
    k: number of orthogonal vectors per pass, k=d gives exact result
    """

    s = 1 if k==d else 2   # number of stochastic samples to use per example
    trace = 0.

    tr = lambda vs: jax.vmap(lambda x: jax.vmap(lambda v: v @ hvp(f, [x], [v]))(vs).sum())(xs).sum()
    for sample in range(s):
      t…

View full answer

mattjj · 2023-02-02T04:31:58Z

mattjj
Feb 2, 2023
Maintainer

Cool!

One quick thought is to replace some of those loops with calls to jax.vmap. For example, keeping only the loop over random sampling (because it's currently implemented with numpy):

def hutchinsons(f, xs, d, k):
    """Estimates average Hessian trace using improved Hutchinson's estimator.

    f: R^d->R function to differentiate
    xs: batch of examples
    d: number of dimensions
    k: number of orthogonal vectors per pass, k=d gives exact result
    """

    s = 1 if k==d else 2   # number of stochastic samples to use per example
    trace = 0.

    tr = lambda vs: jax.vmap(lambda x: jax.vmap(lambda v: v @ hvp(f, [x], [v]))(vs).sum())(xs).sum()
    for sample in range(s):
      trace += tr(random_ortho(d)[:k])

    trace /= s * len(xs)    # average over examples and samples
    trace *= d / k          # bias correction
    return trace

Also, we should apply jax.jit, though it might make this toy example slower from compile times. To jax.jit the whole thing, we'd want to replace random_ortho with a JAX-based implementation. We'd want to do that to vmap it too!

If we didn't use vmap to compute things all in one go, and if we for some reason didn't want to use jax.jit (and hence get its DCE powers), it may make sense to avoid recomputing the primal pass every time by using jax.linearize in place of jax.jvp.

At some point we'll be limited by memory and so we cant keep vmapping. It would then make sense to use xmap, which can actually do a combination of vmap, pmap, and sequential lax.map for us.

Is this the direction you had in mind?

2 replies

yaroslavvb Feb 2, 2023
Author

Ah, neat!

Actually new orthogonal draw per call is probably overkill, random examples already provide stochasticity so we could reuse random dirs.

What exactly should I be jitting here? The whole "hutchisons" function, or some part?

Simplified hutchison's below with timings, still passes correctness tests, but speed of exact trace is suspiciously fast compared to batched gradient

import jax
from jax import jvp, grad
from jax import numpy as jnp
import numpy as np
import math  # for fsum

import time
class timeit:
    def __init__(self, tag=""):
        self.tag = tag

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.end = time.perf_counter()
        interval_ms = 1000 * (self.end - self.start)
        print(f"{interval_ms:8.2f}   {self.tag}")


def hvp(f, primals, tangents):
    """Hessian vector product"""
    return jvp(grad(f), primals, tangents)[1]


def random_ortho(d):
    """(d x d) sample from the Circular Real Ensemble (CRE)"""
    z = np.random.randn(d, d)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    ph = d / abs(d)
    return np.array(q * ph)


def hutchinsons(f, xs, d, k, random_basis):
    """Estimates average Hessian trace using improved Hutchinson's estimator.

    f: R^d->R function to differentiate
    xs: batch of examples
    d: number of dimensions
    k: number of orthogonal vectors per pass, k=d gives exact result
    """

    tr = lambda vs: jax.vmap(lambda x: jax.vmap(lambda v: v @ hvp(f, [x], [v]))(vs).sum())(xs).sum()
    trace = tr(random_basis[:k]) / len(xs)
    trace *= d / k      # bias correction
    return trace


d = 500       # dimensions
n = 100       # examples
L = 100       # layers
coefs = 1/jnp.arange(1, d+1)

def g(x):
    for l in range(L):
        x = jnp.eye(d) @ x
    return x

f = lambda y: (0.5*coefs*y*y).sum()
F = lambda x: f(g(x))

rb = jnp.array(random_ortho(d))
xs = jnp.ones((n, d))

def pp(s, val): print(f"{s:50s}: {val:2.6}")

print("correctness test")
np.testing.assert_almost_equal(math.fsum(coefs), hutchinsons(F, xs, d, d, rb), decimal=5)

pp("Exact value of Hessian trace (native)", math.fsum(coefs))
pp("Exact value of Hessian trace (k=d)", hutchinsons(F, xs, d, d, rb))
pp("Approx value of Hessian trace (k=d/2): ", hutchinsons(F, xs, d, d//2, rb))
pp("Approx value of Hessian trace (k=2): ", hutchinsons(F, xs, d, 2, rb))
pp("Approx value of Hessian trace (k=1): ", hutchinsons(F, xs, d, 1, rb))


print('timing:')
grad_func = jax.grad(F)
with timeit('batch grad'):
    jax.vmap(grad_func)(xs).sum(axis=0)

with timeit('hutch k=1'):
    hutchinsons(F, xs, d, 1, rb)

with timeit('hutch k=5'):
    hutchinsons(F, xs, d, 5, rb)

with timeit('hutch k=d'):
    hutchinsons(F, xs, d, d, rb)

mattjj Feb 2, 2023
Maintainer

Oh yeah, when we time we need to account for async dispatch, so we should do stuff like hutchisons(...).block_until_ready().

What exactly should I be jitting here? The whole "hutchisons" function, or some part?

Probably the whole thing, like maybe

from functools import partial

@partial(jax.jit, static_argnames=['f', 'd', 'k'])
def hutchinsons(f, xs, d, k, random_basis):

yaroslavvb · 2023-02-13T23:17:06Z

yaroslavvb
Feb 13, 2023
Author

Thanks for the tips @mattjj . This approach seems faster for exact Hessian trace than jax.hessian, some timings in colab

correctness test for d= 1024
passed
Exact value of Hessian trace (native)             : 7.50918
Exact value of Hessian trace (Hutchinson)         : 7.50918
Approx value of Hessian trace (k=d/2):            : 7.54999
Approx value of Hessian trace (k=5):              : 6.94241
Approx value of Hessian trace (k=1):              : 6.45814
timing:
  388.59   batch grad
  104.00   hutch k=1
  120.32   hutch k=5
  972.05   hutch k=d
 1532.01   jax.hessian

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More flexible Hutchinson's implementation in Jax #14261

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

More flexible Hutchinson's implementation in Jax #14261

yaroslavvb Feb 2, 2023

Replies: 2 comments · 2 replies

mattjj Feb 2, 2023 Maintainer

yaroslavvb Feb 2, 2023 Author

mattjj Feb 2, 2023 Maintainer

yaroslavvb Feb 13, 2023 Author

yaroslavvb
Feb 2, 2023

Replies: 2 comments 2 replies

mattjj
Feb 2, 2023
Maintainer

yaroslavvb Feb 2, 2023
Author

mattjj Feb 2, 2023
Maintainer

yaroslavvb
Feb 13, 2023
Author