
(yet) Another stochastic L-BFGS implementation for interest #22

@mbhynes

Description

Hi Michael,

I wanted to share a stochastic first-order optimizer idea and implementation for L-BFGS, in case you might find it interesting or even practical in your research (congrats also on your PhD & new role at FB!).

The idea behind the algorithm is pretty simple, in brief:

  • when optimizing a stochastic empirical loss function, use bootstrap sampling over mini-batches of data to estimate the function value, its variance, and the sampling bias
  • run a first-order solver with a line search that treats the variance estimates from the bootstrapped function/gradient evaluations as numerical uncertainties, so that the line search looks for step sizes that satisfy the Wolfe conditions to within the uncertainty level (a rough sketch of both pieces follows this list)

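To make those two bullets concrete, here's a minimal numpy sketch of the bootstrap estimate and a noise-tolerant Wolfe check. The names (bootstrap_estimate, noisy_wolfe_ok) and the specific relaxation (a slack of one standard error on the sufficient-decrease condition) are illustrative assumptions for this note, not necessarily what boots does internally:

```python
import numpy as np

def bootstrap_estimate(per_example_losses, n_boot=64, rng=None):
    """Bootstrap the mini-batch mean loss; returns (point estimate, standard error)."""
    rng = rng if rng is not None else np.random.default_rng()
    losses = np.asarray(per_example_losses)
    n = losses.shape[0]
    # Resample the batch with replacement n_boot times and average each replicate
    idx = rng.integers(0, n, size=(n_boot, n))
    replicates = losses[idx].mean(axis=1)
    # The replicate mean estimates the loss; the replicate spread estimates its uncertainty
    return replicates.mean(), replicates.std(ddof=1)

def noisy_wolfe_ok(f0, g0, f_t, g_t, t, sigma_f, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions, relaxed by the bootstrap uncertainty sigma_f.

    f0, g0: loss and directional derivative at the current point
    f_t, g_t: loss and directional derivative at step size t
    """
    # Sufficient decrease, with a slack of sigma_f to absorb sampling noise
    armijo = f_t <= f0 + c1 * t * g0 + sigma_f
    # Curvature condition on the directional derivative
    curvature = g_t >= c2 * g0
    return armijo and curvature

# Toy usage: random per-example losses standing in for a mini-batch
losses = np.random.default_rng(0).random(128)
f_mean, f_se = bootstrap_estimate(losses)
```
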
I implemented the idea above in the following repository using TensorFlow:
https://github.com/mbhynes/boots

I admittedly haven't run particularly thorough experiments on this, but I did use a small convnet as a toy problem, and it looks like the bootstrapped L-BFGS algorithm can converge as fast as or faster than Adam up to a point, measured by the decrease in training loss vs. the number of function/gradient evaluations (i.e. data points accessed), e.g. here

There's no formal convergence criterion implemented, and empirically it looks like the optimizer can get within some radius of a minimum, but then just starts oscillating a bit without making substantive progress.

The real practical downside to all this is that neither TensorFlow nor PyTorch is particularly efficient at evaluating per-example gradients... so unfortunately, even if the trace above looks sorta neat, in practice you have to wait quite a while to get it from bootstrapping, compared to Adam (which I've just taken as roughly the best off-the-shelf algorithm), since the boots implementation above doesn't have an efficient way of evaluating the Jacobian matrix. (It wouldn't be quite so bad in a map-reduce framework where per-example gradients have to be materialized anyway, mind you...)
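For reference on the per-example-gradient point, here's a minimal TensorFlow sketch (the model, shapes, and data are placeholders, not the convnet from the experiment) of what materializing one gradient per example looks like:

```python
import tensorflow as tf

# Toy stand-in model and batch
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    # Per-example losses, shape (batch,) -- note: no reduction to a scalar
    per_example_loss = tf.keras.losses.mse(y, model(x))

# One gradient row per example: each entry has shape (batch,) + variable.shape,
# i.e. the full Jacobian is materialized rather than the usual summed gradient
per_example_grads = tape.jacobian(per_example_loss, model.trainable_variables)
```

Compare this with tape.gradient on the mean loss, which is a single backward pass; the Jacobian call is where the wall-clock cost of the bootstrapping comes from, since (depending on whether TensorFlow's vectorized pfor path applies) it can be far slower than one batched gradient.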

Anyway, hope it's of interest, and all the best!
