Variance Reduction for Gradient Compression

September 16, 2019, 2:40 PM - 3:20 PM

Location:

Center Hall

Rutgers University

Busch Campus Student Center

604 Bartholomew Rd

Piscataway, NJ

Peter Richtárik, University of Edinburgh

Over the past few years, various randomized gradient compression techniques (e.g., quantization, sparsification, sketching) have been proposed for reducing communication in distributed training of very large machine learning models. However, despite the high level of research activity in this area, surprisingly little is known about how such compression techniques should properly interact with first-order optimization algorithms. For instance, randomized compression increases the variance of the stochastic gradient estimator, and this has an adverse effect on convergence speed. While a number of variance-reduction techniques exist for taming the variance of stochastic gradients arising from sub-sampling in finite-sum optimization problems, no variance-reduction techniques exist for taming the variance introduced by gradient compression. Further, gradient compression techniques are invariably applied to unconstrained problems, and it is not known whether and how they could be applied to solve constrained or proximal problems. In this talk I will give positive resolutions to both of these problems. In particular, I will show how one can design fast variance-reduced proximal stochastic gradient descent methods in settings where the stochasticity comes from gradient compression.
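To give a concrete feel for the "compressed gradient differences" idea behind [2], here is a minimal, non-authoritative NumPy sketch. It is a toy under stated assumptions, not the algorithm from the papers: each worker compresses the difference between its current gradient and a locally maintained reference vector, so the compressed quantity (and hence the variance injected by compression) can shrink as training proceeds. The names rand_k and diana_style_step, and the parameters lr, alpha, and k, are hypothetical and chosen only for illustration.

```python
import numpy as np

def rand_k(v, k, rng):
    """Random-k sparsification: keep k random coordinates, rescaled so E[output] = v."""
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)  # rescaling keeps the compressor unbiased
    return out

def diana_style_step(x, h_refs, grads, lr, alpha, k, rng):
    """One illustrative step: workers compress gradient *differences* g_i - h_i,
    the server averages the reconstructed estimates, and each reference h_i is
    nudged toward the worker's gradient using the same compressed message."""
    deltas = [rand_k(g - h, k, rng) for g, h in zip(grads, h_refs)]
    g_hat = np.mean([h + d for h, d in zip(h_refs, deltas)], axis=0)
    x_new = x - lr * g_hat
    h_new = [h + alpha * d for h, d in zip(h_refs, deltas)]
    return x_new, h_new

# Toy usage: minimize f(x) = 0.5 * ||x||^2 split across 4 workers (each gradient is x).
rng = np.random.default_rng(0)
dim, n_workers = 10, 4
x = rng.standard_normal(dim)
h_refs = [np.zeros(dim) for _ in range(n_workers)]
for _ in range(200):
    grads = [x.copy() for _ in range(n_workers)]
    x, h_refs = diana_style_step(x, h_refs, grads, lr=0.5, alpha=0.5, k=3, rng=rng)
```

In this toy, as the references h_i track the workers' gradients, the differences being compressed shrink toward zero, and with them the extra variance that naive compression of the full gradients would keep injecting at every step.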

This talk is based on:

[1] Filip Hanzely, Konstantin Mishchenko and Peter Richtárik. SEGA: Variance reduction via gradient sketching. NeurIPS 2018.

[2] Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv:1901.09269.

[3] Konstantin Mishchenko, Filip Hanzely and Peter Richtárik. 99% of distributed optimization is a waste of time: the issue and how to fix it. arXiv:1901.09437.

[4] Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Peter Richtárik and Sebastian Stich. Stochastic distributed learning with gradient quantization and variance reduction. arXiv:1904.05115.