
(yet) Another stochastic L-BFGS implementation for interest #22

@mbhynes

Description

Hi Michael,

I wanted to share a stochastic first-order optimizer idea and implementation for L-BFGS, in case you might find it interesting or even practical in your research (congrats also on your PhD & new role at FB!).

The idea behind the algorithm is pretty simple, in brief:

  • when optimizing a stochastic empirical loss function, use bootstrap sampling over mini-batches of data to estimate the function value, its variance, and the sampling bias
  • run a first-order solver with a line search that treats the variance estimates from the bootstrapped function/gradient evaluations as numerical uncertainties, so that the line search looks for step sizes that satisfy the Wolfe conditions to within the uncertainty level (a rough sketch of both pieces follows this list)

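To make those two bullets concrete, here's a minimal numpy sketch of the bootstrap estimate and a noise-tolerant Wolfe check. The names (bootstrap_estimate, noisy_wolfe_ok) and the specific relaxation (a slack of one standard error on the sufficient-decrease condition) are illustrative assumptions for this note, not necessarily what boots does internally:

```python
import numpy as np

def bootstrap_estimate(per_example_losses, n_boot=64, rng=None):
    """Bootstrap the mini-batch mean loss; returns (point estimate, standard error)."""
    rng = rng if rng is not None else np.random.default_rng()
    losses = np.asarray(per_example_losses)
    n = losses.shape[0]
    # Resample the batch with replacement n_boot times and average each replicate
    idx = rng.integers(0, n, size=(n_boot, n))
    replicates = losses[idx].mean(axis=1)
    # The replicate mean estimates the loss; the replicate spread estimates its uncertainty
    return replicates.mean(), replicates.std(ddof=1)

def noisy_wolfe_ok(f0, g0, f_t, g_t, t, sigma_f, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions, relaxed by the bootstrap uncertainty sigma_f.

    f0, g0: loss and directional derivative at the current point
    f_t, g_t: loss and directional derivative at step size t
    """
    # Sufficient decrease, with a slack of sigma_f to absorb sampling noise
    armijo = f_t <= f0 + c1 * t * g0 + sigma_f
    # Curvature condition on the directional derivative
    curvature = g_t >= c2 * g0
    return armijo and curvature

# Toy usage: random per-example losses standing in for a mini-batch
losses = np.random.default_rng(0).random(128)
f_mean, f_se = bootstrap_estimate(losses)
```
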
I implemented the idea above in the following repository using TensorFlow:
https://github.com/mbhynes/boots

I admittedly haven't run particularly thorough experiments on this, but I did use a small convnet as a toy problem, and it looks like the bootstrapped L-BFGS algorithm can converge as fast as or faster than Adam up to a point, measured by the decrease in training loss vs. the number of function/gradient evaluations (i.e. data points accessed), e.g. here

There's no formal convergence criterion implemented, and empirically it looks like the optimizer can get within some radius of a minimum, but then just starts oscillating a bit without making substantive progress.

The real practical downside to all this is that neither TensorFlow nor PyTorch is particularly efficient at evaluating per-example gradients... so unfortunately, even if the trace above looks sorta neat, in practice you have to wait quite a while to get it from bootstrapping, compared to Adam (which I've just taken as roughly the best off-the-shelf algorithm), since the boots implementation above doesn't have an efficient way of evaluating the Jacobian matrix. (It wouldn't be quite so bad in a map-reduce framework where per-example gradients have to be materialized anyway, mind you...)
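For reference on the per-example-gradient point, here's a minimal TensorFlow sketch (the model, shapes, and data are placeholders, not the convnet from the experiment) of what materializing one gradient per example looks like:

```python
import tensorflow as tf

# Toy stand-in model and batch
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    # Per-example losses, shape (batch,) -- note: no reduction to a scalar
    per_example_loss = tf.keras.losses.mse(y, model(x))

# One gradient row per example: each entry has shape (batch,) + variable.shape,
# i.e. the full Jacobian is materialized rather than the usual summed gradient
per_example_grads = tape.jacobian(per_example_loss, model.trainable_variables)
```

Compare this with tape.gradient on the mean loss, which is a single backward pass; the Jacobian call is where the wall-clock cost of the bootstrapping comes from, since (depending on whether TensorFlow's vectorized pfor path applies) it can be far slower than one batched gradient.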

Anyway, hope it's of interest, and all the best!
