Simplification of update #5
Comments
Wow, thank you so much Robert, this is gold! I'm awestruck that you recognized the pdf of the time-decayed prior as this GB1 random variable's pdf! This is superb: I have often wished we could fix a Beta posterior at some time other than the time since last quiz, and now we have that, and what's more, with a very elegant posterior update for the success quizzes, and the possibility of an analytical posterior for failed quizzes too, via your hint about optimizing `beta` and `delta`.

I'm hopeful that we can let the time parameter of the prior remain the original half-life even after many successful reviews (when the halflife becomes very large), because (1) predicting the recall probability and (2) updating upon success ought to be very stable. We'll just have to confirm that (3) updating upon failure remains stable when over- and under-reviewing long-half-lived models (and short-half-lived ones too). I will certainly look at this more!

Thank you so much for taking the time to read through the derivation, and thank you so much more for finding these wonderful simplifications, for thinking about them so deeply, and for describing them so clearly. I am beyond grateful. Messy code below:

import numpy as np
import pandas
from scipy import optimize

def match_moments(m1, m2, prior, tnow, obs, algo='orig'):
    alpha0, beta0, t0 = prior
    delta = tnow / t0
    ab0 = np.array([alpha0, beta0])
    if obs:
        ab0[0] += delta
    else:
        # from looking at results
        ab0[1] += 1
    if algo == 'orig':
        def f(ab):
            alpha, beta = ab
            mom1, mom2, _ = genbeta1_moments(alpha, beta, delta)
            return np.array([mom1, mom2]) - np.array([m1, m2])
        return optimize.least_squares(f, ab0, bounds=((1.1, 1.1), (np.inf, np.inf)))
    if algo == 'bd':  # estimate beta, delta
        ab0 = np.append(ab0, delta)
        def f(x):
            beta, d = x
            mom1, mom2, _ = genbeta1_moments(ab0[0], beta, d)
            return np.array([mom1, mom2]) - np.array([m1, m2])
        ret = optimize.least_squares(f, [ab0[1], ab0[2]], bounds=((1.1, 0), (np.inf, np.inf)))
        ret['origx'] = ret['x'].copy()
        ret['x'] = np.array([ab0[0], ret['x'][0], ret['x'][1]])
        return ret
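# NOTE: `genbeta1_moments` and `posterior_moments` are defined in the notebook linked
# in this thread, not in this comment. As a hedged sketch of what `genbeta1_moments`
# computes — raw GB1 moments via the Wikipedia formula E[X^h] = B(p + h/a, q) / B(p, q),
# with a = 1/delta, b = 1, p = alpha, q = beta; the third return value is assumed here
# to be the variance, matching the `m1, m2, _` unpacking used above and below:
def genbeta1_moments(alpha, beta, delta):
    from scipy.special import betaln
    m1 = np.exp(betaln(alpha + delta, beta) - betaln(alpha, beta))
    m2 = np.exp(betaln(alpha + 2 * delta, beta) - betaln(alpha, beta))
    return m1, m2, m2 - m1**2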
rows = []
for alpha0 in [2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 40.0, 100.0, 200.0, 400.0]:
    for beta0 in [2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 40.0, 100.0, 200.0, 400.0]:
        for delta in [0.1, 0.2, 0.5, 0.9, 1.1, 1.5, 2.0, 3.0, 4.0, 5.0, 10.0]:
            prior = (alpha0, beta0, 1.0)
            m1, m2, _ = posterior_moments(prior, delta, False)
            res = match_moments(m1, m2, prior, delta, False)
            alpha1, beta1 = res.x
            res2 = match_moments(m1, m2, prior, delta, False, algo='bd')
            alpha2, beta2, delta2 = res2.x
            gb1_m1, gb1_m2, _ = genbeta1_moments(alpha1, beta1, delta)
            rows.append([alpha0, beta0, delta, m1, m2, alpha1, beta1, gb1_m1, gb1_m2, res.cost, alpha2, beta2, delta2, res2.cost])
df = pandas.DataFrame(data=rows, columns=['alpha0', 'beta0', 'delta', 'm1', 'm2', 'alpha1', 'beta1', 'gb1_m1', 'gb1_m2', 'cost', 'alpha2', 'beta2', 'delta2', 'cost2'])
print(df.cost.max())
print(df.cost2.max())
print(df[['alpha0', 'beta0', 'delta', 'alpha2', 'beta2', 'delta2', 'cost2']].to_string())

This prints out the entire table of original and final parameters, and the patterns seem very interesting. For example,
Note how the … For the high-alpha, low-beta case: … Again, …
Also, I'm very glad that my implementation of …
For clarification, I fixed a problem just in …
Ah, great! My first attempt at fitting the beta-deltas didn't fit as well in the worst-case scenarios, but this seems to work well. I pushed a new version of my notebook. Not much more than what we have already, but I did explore the idea of telescoping out to follow the half-life. In the meantime, I made a more robust half-life estimation routine that doesn't require you to specify min/max times; it will scan for an appropriate bracketing interval and should Just Work. Shifting the time constant out to the half-life and approximating out there with a pure Beta distribution has some appeal. The distribution out at the half-life is pretty close to a pure Beta distribution in most cases, only starting to diverge noticeably (but not significantly, in a practical sense) in the case of …
I took a more detailed look at how the two schemes we're weighing compare under over/under-review and pass/fail quizzes, and found that they agree quite well. (Code dump at bottom.) (I also found a negative result: given a GB1 distribution, I tried searching for another GB1 distribution with the same moments but with a different `delta`.) I'm satisfied that staying at (roughly) the same time constant is close enough in accuracy and behavior compared to telescoping to the new halflife.
I was hoping the exercise below, testing behavior at extreme cases, would suggest which of the two approaches is better, but they're both very good! I'm hoping to roll this into the library and push it in the next couple of days! The bulk of the work will be rewriting the math section 😂!

# -*- coding: utf-8 -*-
import numpy as np
import scipy.special as special
import scipy.optimize as optimize
import ebisu
from typing import Tuple


def failureMoments(model: Tuple[float, float, float], result: bool, tnow: float, num: int = 4):
    """Moments of the posterior on recall at time tnow upon quiz failure"""
    a, b, t0 = model
    t = tnow / t0
    from scipy.special import gammaln, logsumexp
    from numpy import log, exp
    s = [gammaln(a + n * t) - gammaln(a + b + n * t) for n in range(num + 2)]
    logt = log(t)
    marginal = logsumexp([s[0], s[1]], b=[1, -1])
    return [exp(logsumexp([s[n], s[n + 1]], b=[1, -1]) - marginal) for n in range(1, num + 1)]
def gb1Moments(a: float, b: float, p: float, q: float, num: int = 2):
    """Raw moments of GB1 via Wikipedia"""
    from scipy.special import betaln
    bpq = betaln(p, q)
    logb = np.log(b)
    return [np.exp(h * logb + betaln(p + h / a, q) - bpq) for h in np.arange(1.0, num + 1)]


def gb1ToBeta(gb1: Tuple[float, float, float, float, float]):
    """Convert a GB1 model (four GB1 parameters, and time) to a Beta model"""
    return (gb1[2], gb1[3], gb1[4] * gb1[0])
def updateViaGb1(prior: Tuple[float, float, float], result: bool, tnow: float):
    """Alternate to ebisu.updateRecall that returns several posterior Beta models"""
    (alpha, beta, t) = prior
    delta = tnow / t
    if result:
        gb1 = (1.0 / delta, 1.0, alpha + delta, beta, tnow)
    else:
        mom = np.array(failureMoments(prior, result, tnow, num=2))

        def f(bd):
            b, d = bd
            this = np.array(gb1Moments(1 / d, 1., alpha, b, num=2))
            return this - mom

        from scipy.optimize import least_squares
        res = least_squares(f, [beta, delta], bounds=((1.01, 0), (np.inf, np.inf)))
        # print("optimization cost", res.cost)
        newBeta, newDelta = res.x
        gb1 = (1 / newDelta, 1, alpha, newBeta, tnow)
    return dict(
        simple=gb1ToBeta(gb1),
        moved=gb1ToBeta(moveGb1(gb1, prior[2])),
        moved2=moveBeta(gb1ToBeta(gb1)))
def moveGb1(gb1: Tuple[float, float, float, float, float], priorT):
    """Given a GB1 model (4 parameters and time), find the closest GB1 distribution where alpha=beta.

    This will be at the halflife. Can produce terrible results when stressed. Why?"""
    mom = np.array(gb1Moments(*gb1[:-1], num=4))

    def f(aa):
        newA, newAlpha = np.exp(aa[0]), np.exp(aa[1])
        this = np.array(gb1Moments(newA, 1., newAlpha, newAlpha, num=4))
        return this - mom

    from scipy.optimize import least_squares
    res = least_squares(f, [np.log(gb1[0]), np.log((gb1[2] + gb1[3]) / 2)])
    # print("optimization cost", res.cost)
    print('movegb1res', np.exp(res.x), 'cost', res.cost)
    expx = np.exp(res.x)
    return [expx[0], gb1[1], expx[1], expx[1], gb1[4]]
def estimate_half_life(model, quantile=0.5):
    """Robust half-life (or quantile-life) estimation.

    Use a root-finding routine in log-delta space to find the delta that
    will cause the GB1 distribution to have a mean of the requested quantile.
    Because we are using well-behaved normalized deltas instead of times, and
    owing to the monotonicity of the expectation with respect to delta, we can
    quickly scan for a rough estimate of the scale of delta, then do a finishing
    optimization to get the right value.
    """
    alpha, beta, t0 = model
    Bab = special.beta(alpha, beta)

    def f(lndelta):
        mean = special.beta(alpha + np.exp(lndelta), beta) / Bab
        return mean - quantile

    # Scan for a bracket.
    bracket_width = 6.0
    blow = -bracket_width / 2.0
    bhigh = bracket_width / 2.0
    flow = f(blow)
    fhigh = f(bhigh)
    while flow > 0 and fhigh > 0:
        # Move the bracket up.
        blow = bhigh
        flow = fhigh
        bhigh += bracket_width
        fhigh = f(bhigh)
    while flow < 0 and fhigh < 0:
        # Move the bracket down.
        bhigh = blow
        fhigh = flow
        blow -= bracket_width
        flow = f(blow)
    assert flow > 0 and fhigh < 0
    sol = optimize.root_scalar(f, bracket=[blow, bhigh])
    t1 = np.exp(sol.root) * t0
    return t1
def moveBeta(model: Tuple[float, float, float]):
    """Given a Beta model (2 parameters and time), find the closest Beta model at the halflife.

    Works much better than `moveGb1`"""
    th = estimate_half_life(model, 0.5)
    m = ebisu.predictRecall(model, th)
    v = ebisu.predictRecallVar(model, th)
    alpha1, beta1 = ebisu._meanVarToBeta(m, v)
    return (alpha1, beta1, th)
if __name__ == '__main__':
    model = (4., 30., 1.)
    model = (40., 6., 1.)
    model = (4., 6., 1.)

    def simulation(model, result, tnow):
        gold = ebisu.updateRecall(model, result, tnow)
        newModel = updateViaGb1(model, result, tnow)
        print(newModel)

        t = np.linspace(.1, 25, 200)

        def trace(model):
            return np.vectorize(lambda t: ebisu.predictRecall(model, t))(t)

        import pylab as plt
        plt.style.use('ggplot')
        plt.ion()
        plt.figure()
        plt.semilogy(t, trace(model), linewidth=5, label='prior')
        plt.semilogy(t, trace(gold), linewidth=4, label='post')
        plt.semilogy(t, trace(newModel['simple']), '--', linewidth=3, label='via GB1')
        plt.semilogy(t, trace(newModel['moved']), linewidth=2, label='via GB1@HL')
        plt.semilogy(t, trace(newModel['moved2']), '--', linewidth=1, label='Beta@HL')
        plt.legend()
        plt.title('Quiz={}, T={}'.format(result, tnow))

    simulation(model, True, 100.)
    simulation(model, False, 100.)
    simulation(model, True, 30.)
    simulation(model, False, 30.)
    simulation(model, True, 3.)
    simulation(model, False, 3.)
    simulation(model, True, .01)
    simulation(model, False, .01)
"""
For `model = (4., 6., 1.)`,
- the "via GB1" model, which doesn't move the prior halflife (much) and
- the Beta@HL model, which telescopes the "via GB1" model to the halflife,
both agree very well for severe over/under-review, pass/fail quizzes.
Similarly for `model = (40, 6, 1)` and `(4., 30., 1.)`.
Telescoping is slightly less accurate extremely far in the future but...
""" |
Yeah, I don't think there's any benefit in that, per se. If you have a GB1 distribution, you can just evaluate its moments at its half-life and approximate the symmetric Beta there (i.e. …). I suspect that moment-matching becomes inaccurate (at least via least-squares numerical optimization) in the low-/high-… cases. Playing around with the formulae, I'm increasingly convinced that in the negative-recall case, GB1 is not formally conjugate to a Bernoulli trial. I think the best we can do is find a numerically approximate GB1 distribution.
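A quick sketch of why the failure update breaks GB1 conjugacy (using the GB1 density with `a = 1/delta`, `b = 1`, `p = alpha`, `q = beta`, as elsewhere in this thread):

$$
f(x) = \frac{1}{\delta\, B(\alpha,\beta)}\, x^{\alpha/\delta - 1}\,\bigl(1 - x^{1/\delta}\bigr)^{\beta - 1}, \qquad 0 < x < 1.
$$

The failed-quiz likelihood multiplies this by $(1 - x)$, which is neither a power of $x$ nor a power of $(1 - x^{1/\delta})$, so (except when $\delta = 1$, where it folds into $\beta \to \beta + 1$) the product cannot, in general, be rewritten as a GB1 density with adjusted parameters.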
Robert, would you be willing to release your code here under the same terms as ebisu? Yesterday I did experiment with just what you suggested: fitting a symmetric Beta to the posterior after a failed quiz. Most of the time it produced very similar results to the existing telescoping half-life approach that you'd already investigated. I've actually decided not to telescope the posterior to the half-life because of results like this: the plot below supposes you start with a high-… Stay tuned for the release drop!
Yes, unless otherwise noted, any code I have posted here or in my fork can be considered released under the same terms as fasiha/ebisu itself at the time of posting. As for explaining it, as far as I'm concerned, any absolute difference in probabilities less than 1% has no practical effect, especially when we are so close to 0% anyway. The difference between these ideal models and our "true" memory models is likely much, much larger.
Robert! It took me a month but I hope you'll agree that the result was worth it. It was the most serendipitous thing: as I was about to push the version matching what we'd last talked about, I was regenerating plots and saw that the half-life after updating a failed quiz was higher than the previous half-life, which I didn't expect. I pulled on that string and it turns out that you can put the failed quiz's posterior density function through another exponential decay to any other time horizon, not just the original time from the prior. Not only is that transformed posterior analytic, so are its moments, and they're gorgeous. I updated the doc but here's a screenshot:

I also added a version of your telescoping idea because, under very extreme under-review (which can happen because I use 15 minutes as the initial half-life for newly learned items and I sometimes don't have a chance to review them for a couple of days), some numerical instabilities can pop up when keeping the updated models at the same time horizon as the prior. I also didn't want users to have to be sophisticated about picking a time for the updated models. So I tweaked the half-life-finding function you wrote to optionally return early with the result of the coarse window, and if the proposed posterior (back at the prior's time horizon) is too lopsided, I rerun the update for this approximate half-life. I called this "rebalancing" in the code to emphasize that we're balancing alpha and beta, and maybe I was a bit foolish in using the coarse result instead of the finishing optimization (the rebalancing idea is literally two hours old)…

A huge, huge thank you for this! I took a break to work on other projects after we last spoke above, but for the last few nights I've been working on this full-time and am happy to publish Ebisu 1.0 to PyPI, acknowledging my debt to your keen insights and time, not just code!
PS. I tracked the failed-quiz posterior's half-life being greater than the prior's down to the fact that even the best-matching GB1 (found via numerical optimization) doesn't match the posterior that well. Using Monte Carlo, I was able to get very nice posterior Beta fits (whose half-lives were strictly less than or equal to the prior's) but couldn't get GB1s with that property. That's when I started playing with the posterior density itself and noticed I could transform it back to the prior's time horizon analytically, and then to any other time horizon I wanted.
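The screenshot of the analytic moments isn't reproduced above. As a hedged reconstruction (the function name and exact form below are my own, following the same setup as `failureMoments` in the code dump above, and not necessarily what shipped in Ebisu 1.0), the failed-quiz posterior's raw moments, decayed to an arbitrary horizon `t2`, would look something like:

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def failurePosteriorMoments(a, b, t0, tnow, t2, num=2):
    """Raw moments of recall probability at horizon t2, given a failed quiz at tnow.

    Assumes the posterior on the recall probability p at t0 is proportional to
    p**(a-1) * (1-p)**(b-1) * (1 - p**(tnow/t0)), and recall at t2 is p**(t2/t0).
    """
    delta, eps = tnow / t0, t2 / t0
    # log of Beta(a + k, b) up to a Gamma(b) constant that cancels in the ratios below
    logB = lambda k: gammaln(a + k) - gammaln(a + b + k)
    logDenom = logsumexp([logB(0), logB(delta)], b=[1, -1])
    return [
        np.exp(logsumexp([logB(n * eps), logB(n * eps + delta)], b=[1, -1]) - logDenom)
        for n in range(1, num + 1)
    ]
```

Moment-matching a Beta distribution to the first two of these (via `ebisu._meanVarToBeta`, as `moveBeta` does above) would then give a posterior Beta model at whatever horizon you choose.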
Playing with your formulae, I noticed that the update algorithm could be simplified significantly, at least in the positive recall case. First, I noticed that time-shifting the Beta distribution by `delta` gives a Generalized Beta Distribution of the First Kind (with `a = 1/delta; b = 1` in the terms of that page). Unfortunately, the literature seems to be sparse on that particular formulation (or I'm not using the right search terms), but the formula for moments provided on the page is quite useful. It does look like the GB1 distribution is conjugate for a Bernoulli trial, with an easy analytical formula for the positive-recall update and at least a good numerical moment-matching fit for the negative-recall case. I've made a notebook over here to explore the possibilities.

Notice that when you multiply the likelihood (`p`) into the prior, you can fold it into the term (`p^((alpha-delta)/delta) * p = p^((alpha+delta-delta)/delta)`), which is formally the same as if you had `alpha + delta` instead of `alpha`. So at least for the positive-recall case, you can just increment `alpha += delta` and leave the time part of the model alone. Of course, when `delta == 1.0`, where our prior is a pure Beta distribution, that collapses to the well-known Bayesian update for a Beta distribution and a Bernoulli trial. You can also work through the moment formula given on the Wikipedia page to verify that it reproduces your moments when you use `alpha + delta` in the Wikipedia formula (basically, the `n`th moment is `gamma[n+1]/gamma[1]`, in your shorthand). Basically, when you evolve the posterior distribution back to the original time constant, you get back another pure Beta distribution. Qualitatively, this update makes some sense. When `delta` is low, the prior is more heavily weighted towards 1, so a positive recall is less surprising and therefore updates less. When `delta` is high, the prior is more heavily weighted towards 0, so a positive recall is more surprising and updates more.

The negative-recall case is trickier, and I haven't made any headway on getting an analytical form. However, it is relatively easy to match the moments using numerical optimization by adjusting both `alpha` and `beta` in the GB1 distribution. Again, like the positive-recall update, the time constant is left alone. I do wonder, however, if maybe adjusting `beta` and `delta` (e.g. by way of adjusting the time constant) would lead to a concise analytical form. That would make a better formal correspondence to the negative-recall update in the pure Beta case.

If the time constant (i.e. the time at which the prior is a pure Beta) is kept fixed, that might mean that when the actual half-life has increased to many times the original value, you might get numerical instabilities. It might be a good idea to always move the time constant out to the estimated half-life. One scheme could be to do the GB1 update where the time constant remains fixed, then project that distribution out to the estimated half-life, find the moments there, and then moment-match a pure Beta distribution there.

What originally led me to investigate this was considering what would happen if I happened to review twice very quickly (my Anki history has some of these due to Anki's lapse algorithm). With Ebisu's current scheme of using a time constant corresponding to the last review interval, that created large numerical problems in my tests. The pure Beta approximation for such a small `delta` is likely not great: `t` would be very tiny, so a normal review after that would project that not-great approximation out to quite a high `delta`. I think that using the GB1 update, either alone or with my suggestion of telescoping out the time constants to match the estimated half-lives, would probably alleviate some of those issues.

Thank you for your attention and an interesting, well-thought-out algorithm.
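As a quick sanity check of the positive-recall update described above (not from the original comment; the parameter values below are hypothetical), here's a small Monte Carlo sketch comparing the weighted posterior mean of the time-`t0` recall probability against the mean of `Beta(alpha + delta, beta)`:

```python
import numpy as np

# Monte Carlo check: if p0 ~ Beta(alpha, beta) is the recall probability at t0 and a
# success is observed at tnow (likelihood p0**delta, with delta = tnow/t0), then the
# posterior on p0 should again be Beta(alpha + delta, beta).
alpha, beta, delta = 3.0, 4.0, 2.5  # hypothetical values
p0 = np.random.beta(alpha, beta, size=1_000_000)
weights = p0 ** delta  # likelihood of a successful quiz at tnow
mcMean = np.sum(weights * p0) / np.sum(weights)
print(mcMean, (alpha + delta) / (alpha + delta + beta))  # these should agree closely
```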