DIMACS/TRIPODS Workshop on Optimization and Machine Learning

The Power of Interpolation: Machine Learning without Loss Functions and Regularization

August 13, 2018, 4:30 PM - 5:00 PM

Location:

Iacocca Hall

Lehigh University

Bethlehem, PA

Mikhail Belkin, Ohio State University

A striking feature of modern supervised machine learning is its consistent use of techniques that nearly interpolate the data. Deep networks, often containing several orders of magnitude more parameters than data points, are trained to obtain near-zero error on the training set. Yet, at odds with most theory, they show excellent test performance. It has become accepted wisdom that these properties are special to deep networks and require nonconvex analysis to understand. In this talk I will show that classical (convex) kernel machines do, in fact, exhibit these unusual properties. Indeed, kernel machines explicitly constructed to interpolate the training data show excellent test performance. Our empirical and theoretical results indicate that we are unlikely to make progress on understanding deep learning until we develop a fundamental understanding of classical "shallow" kernel classifiers in the "modern" near-interpolated setting. Significantly, interpolating regimes lead to fast (exponential) convergence of SGD even with a fixed step size, providing a cue toward explaining the efficiency of SGD in neural networks. Moreover, in the quadratic case we are able to derive explicit bounds on the step size and convergence rates in terms of the mini-batch size. These bounds are tight and directly applicable to parameter selection, allowing us to construct very fast and accurate kernel machines, adaptive to both the parallel computing resource (e.g., a GPU) and the data. For example, training a kernel machine on the full ImageNet dataset takes under an hour on a single GPU; smaller datasets, such as MNIST, take seconds. Finally, I will conclude by discussing the advantages of interpolation and arguing that these recent findings, as well as older observations on interpolation/overfitting in AdaBoost and Random Forests, suggest a need to revisit high-dimensional inference.
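
To illustrate the central idea of a kernel machine built to interpolate the training data, here is a minimal sketch, not the speaker's code: the Gaussian kernel, bandwidth, and synthetic data below are assumptions chosen for concreteness. With no loss function and no regularization, the coefficients come from solving the linear system K alpha = y, so the predictor fits every training label exactly.

    # Minimal kernel interpolation sketch (illustrative; synthetic data).
    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_kernel(A, B, bandwidth=1.0):
        """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

    # Synthetic training set with noisy labels.
    n, d = 200, 5
    X = rng.standard_normal((n, d))
    y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n))

    # Interpolating kernel machine: solve K alpha = y directly
    # (no loss function, no regularization term).
    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K, y)

    train_pred = K @ alpha
    print("max training error:", np.max(np.abs(train_pred - y)))  # essentially zero

    # Predictions on new points reuse the same coefficients.
    X_test = rng.standard_normal((50, d))
    test_pred = gaussian_kernel(X_test, X) @ alpha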
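
The claim that fixed-step-size SGD converges exponentially fast in the interpolating regime can likewise be sketched on a noise-free (realizable) least-squares problem. The step size, problem sizes, and printing schedule here are illustrative assumptions, not the bounds derived in the talk.

    # Fixed-step-size SGD on an interpolating (zero-noise) least-squares problem.
    import numpy as np

    rng = np.random.default_rng(1)

    n, d = 1000, 5
    A = rng.standard_normal((n, d))
    w_star = rng.standard_normal(d)
    b = A @ w_star                       # targets are exactly realizable: interpolation holds

    w = np.zeros(d)
    step = 0.01                          # fixed step size (illustrative choice)
    for t in range(1, 3001):
        i = rng.integers(n)              # mini-batch of size 1
        grad = (A[i] @ w - b[i]) * A[i]  # gradient of 0.5 * (A[i] @ w - b[i]) ** 2
        w -= step * grad
        if t % 500 == 0:
            # The distance to the interpolating solution shrinks geometrically in t.
            print(t, np.linalg.norm(w - w_star))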