Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logging for estimate_msm not working #468

Closed
ksarnoff opened this issue Aug 12, 2023 · 7 comments
Closed

logging for estimate_msm not working #468

ksarnoff opened this issue Aug 12, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@ksarnoff
Copy link

Hi,

Bug description

I am using estimate_msm and want to log the output. The log file is being created, but I am not sure it is updating and I can't open it.

To reproduce

Here's my estimate_msm code:

res = em.estimate_msm(
    simulate_moments,
    empirical_moments,
    moments_cov,
    start_params,
    optimize_options="scipy_lbfgsb",
    logging="log.db",
    log_options={"fast_logging": True},
)

Screenshots/Error messages

After some time, log.db did get generated. I wanted to check it, so I ran estimagic dashboard log.db from the terminal. I got the following message:

Bokeh app running at: http://localhost:59219/
WARNING:tornado.access:404 GET /favicon.ico (::1) 1.69ms

and the page itself had the following message in the console:
Failed to load resource: the server responded with a status of 404 (Not Found).

I am not sure log.db is updating (my computer says no changes have occurred in the past hour). I saw in issue 431 that estimagic is only compatible with sqlalchemy 1.4, so I downgraded to that and it didn't fix it.

System

Mac OS 12.4
estimagic 0.4.6
sqlalchemy 1.4

Thanks,
Kim

@ksarnoff ksarnoff added the bug Something isn't working label Aug 12, 2023
@janosg
Copy link
Member

janosg commented Aug 13, 2023

Hi @ksarnoff,

Thank you for opening this issue. I will try to look into this soon. If you want, you can help me to localize the problem in the meantime.

To me it looks like this is actually a problem with the Dashboard and not with logging itself. Could you confirm that by trying out other means to read the log file? For example:

`em.criterion_plot("path/to/your/log.db")

or

em.OptimizeLogReader("path/to/your/log.db")

Both are described in more detail and with examples here

@ksarnoff
Copy link
Author

ksarnoff commented Aug 13, 2023

em.criterion_plot('log.db') gave:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
 in 
      1 import estimagic as em
----> 2 fig = em.criterion_plot("log.db")
      3 fig.show(renderer="png")

[/opt/anaconda3/lib/python3.8/site-packages/estimagic/visualization/history_plots.py](https://file+.vscode-resource.vscode-cdn.net/opt/anaconda3/lib/python3.8/site-packages/estimagic/visualization/history_plots.py) in criterion_plot(results, names, max_evaluations, template, palette, stack_multistart, monotone, show_exploration)
     88             )
     89         elif isinstance(res, (str, Path)):
---> 90             _data = _extract_plotting_data_from_database(
     91                 res, stack_multistart, show_exploration
     92             )

[/opt/anaconda3/lib/python3.8/site-packages/estimagic/visualization/history_plots.py](https://file+.vscode-resource.vscode-cdn.net/opt/anaconda3/lib/python3.8/site-packages/estimagic/visualization/history_plots.py) in _extract_plotting_data_from_database(res, stack_multistart, show_exploration)
    361     direction = _problem_table["direction"].tolist()[-1]
    362 
--> 363     history, local_histories, exploration = reader.read_multistart_history(direction)
    364 
    365     if stack_multistart and local_histories is not None:

[/opt/anaconda3/lib/python3.8/site-packages/estimagic/logging/read_log.py](https://file+.vscode-resource.vscode-cdn.net/opt/anaconda3/lib/python3.8/site-packages/estimagic/logging/read_log.py) in read_multistart_history(self, direction)
    132 
    133     def read_multistart_history(self, direction):
--> 134         out = _read_multistart_optimization_history(
    135             database=self._database,
...
--> 238     times -= times[0]
    239     history["runtime"] = times
    240
IndexError: index 0 is out of bounds for axis 0 with size 0

em.OptimizeLogReader("log.db") did return parameters for reader.read_start_params(), but for reader.read_iteration(-1) I also got an error:

IndexError                                Traceback (most recent call last)
 in 
      1 import estimagic as em
      2 reader = em.OptimizeLogReader("log.db")
----> 3 reader.read_iteration(-1)

[/opt/anaconda3/lib/python3.8/site-packages/estimagic/logging/read_log.py](https://file+.vscode-resource.vscode-cdn.net/opt/anaconda3/lib/python3.8/site-packages/estimagic/logging/read_log.py) in read_iteration(self, iteration)
    115 
    116     def read_iteration(self, iteration):
--> 117         out = _read_optimization_iteration(
    118             database=self._database,
    119             iteration=iteration,

[/opt/anaconda3/lib/python3.8/site-packages/estimagic/logging/read_log.py](https://file+.vscode-resource.vscode-cdn.net/opt/anaconda3/lib/python3.8/site-packages/estimagic/logging/read_log.py) in _read_optimization_iteration(database, iteration, params_treedef, registry)
    155             return_type="list_of_dicts",
    156         )
--> 157         highest_rowid = last_iteration[0]["rowid"]
    158 
    159         # iteration is negative here!

IndexError: list index out of range

@janosg
Copy link
Member

janosg commented Aug 21, 2023

Hi Kim,

Thanks for your patience.

I tried to replicate your problem with an example from the documentation. However, everything works fine there.

Does the following code work on your machine?

# imports
import estimagic as em
import numpy as np
import pandas as pd


# create random number generator
rng = np.random.default_rng(seed=0)

# simulate the data
def simulate_data(params, n_draws, rng):
    x = rng.normal(0, 1, size=n_draws)
    e = rng.normal(0, params.loc["sd", "value"], size=n_draws)
    y = params.loc["intercept", "value"] + params.loc["slope", "value"] * x + e
    return pd.DataFrame({"y": y, "x": x})

true_params = pd.DataFrame(
    data=[[2, -np.inf], [-1, -np.inf], [1, 1e-10]],
    columns=["value", "lower_bound"],
    index=["intercept", "slope", "sd"],
)

data = simulate_data(true_params, n_draws=100, rng=rng)

# calculate moments 
def calculate_moments(sample):
    moments = {
        "y_mean": sample["y"].mean(),
        "x_mean": sample["x"].mean(),
        "yx_mean": (sample["y"] * sample["x"]).mean(),
        "y_sqrd_mean": (sample["y"] ** 2).mean(),
        "x_sqrd_mean": (sample["x"] ** 2).mean(),
    }
    return pd.Series(moments)

empirical_moments = calculate_moments(data)

# calculate moments_cov
moments_cov = em.get_moments_cov(
    data, calculate_moments, bootstrap_kwargs={"n_draws": 5_000, "seed": 0}
)

# define simulation function
def simulate_moments(params, n_draws=10_000, seed=0):
    rng = np.random.default_rng(seed)
    sim_data = simulate_data(params, n_draws, rng)
    sim_moments = calculate_moments(sim_data)
    return sim_moments

# run estimation
start_params = true_params.assign(value=[100, 100, 100])

res = em.estimate_msm(
    simulate_moments,
    empirical_moments,
    moments_cov,
    start_params,
    optimize_options="scipy_lbfgsb",
    logging="log.db",
    log_options={"fast_logging": True, "if_table_exists": "replace"},
)

# create criterion plot
em.criterion_plot("log.db")

If so, it should produce a figure similar to this:

image

By now estimagic is compatible with sqlalchemy 2.x. I produced the above output using:

  • Python 3.10
  • estimagic 0.4.6
  • sqlalchemy 2.0.20

Even if the above example runs on your computer I would like to help you to find the cause of your problem.

@ksarnoff
Copy link
Author

That code does run correctly. I just re-ran my actual code for about 10 minutes and the log.db is not getting generated. Does it only generate once the optimization is done? I thought that wasn't the case, but if it is, it would explain the issue. Otherwise, I can give you the code I am actually feeding into estimate_msm if that's helpful.

@janosg
Copy link
Member

janosg commented Aug 21, 2023

Ok, that is good news, even though it makes the debugging harder.

The log file is generated after one evaluation of your objective function and then updated after each further evaluation.

It would be super helpful to have your code or a small self contained example that produces the issue for you. If you don't want to post it publicly, you can find my email on my github profile.

@janosg
Copy link
Member

janosg commented Aug 22, 2023

Hi Kim,

Thanks for sending your code.

My feeling is that you just did not wait long enough for something to be logged.

Estimagic logs the parameter vector and criterion value after each iteration. For the optimizer you chose ("scipy_lbfgsb"), one iteration means one evaluation of the criterion function and one evaluation of the gradient. Since you did not provide a closed form derivative, a numerical one is calculated which, for 25 parameters requires an additional 25 criterion evaluations. On my laptop, your criterion function takes approx. 5 minutes to evaluate, so you would see an update of the database approximately every 2 hours.

Switching to a gradient free optimizer (I tried "nlopt_neldermead") shows that the logging works perfectly. After about 10 minutes the database is created and then there is an update every 5 minutes. The criterion_plot also works.

If you don't have a closed form gradient, probably a gradient free optimizer (e.g. "nlopt_neldermead", "nlopt_bobyqa", "nag_pybobyqa") is a good choice. Since MSM problems are nonlinear least-squares problems you could also try a specialized optimizer like "nag_dfols" or "pounders".

I would also strongly recommend to try to speed up your simulate_moments function. With the current runtime, any optimization will take very long! After a brief look, my guess would be that you can make it at least 10x (but probably more like 100x) faster if you make it numba compatible and jit compile the loop over the rows of the dataset. Making it numba compatible here mainly means that you use numpy arrays instead of lists ad DataFrames.

Let me know if this helped and we can close the issue.

@ksarnoff
Copy link
Author

Yes, switching to a gradient free optimizer was much faster. The logging works totally fine. Thanks!

@janosg janosg closed this as completed Sep 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants