Table 3.2

Convergence rates for a (non-comprehensive) set of distributed optimization algorithms in the IID-data setting. We assume M devices participate in each iterations, and the loss functions are H-smooth, convex, and we have access to stochastic gradients with variance at most σ2. All rates are upper bounds on (3.1) after T iterations (potentially with some iterate averaging scheme)

MethodCommentsConvergence
Baselines
Mini-batch SGDBatch size KMO(HT+σTKM)
SGD(on one worker, no communication)O(HTK+σTK)
Baselines with accelerationa
A-mini-batch SGD [95], [97]Batch size KMO(HT2+σTKM)
A-SGD [97](on one worker, no communication)O(H(TK)2+σTK)
Parallel SGD/Fed-Avg/Local SGD
Yu et al. [98],b Stich [99]cGradient norm bounded by GO(HKMTG2σ2+σTKM)
Wang and Joshi [53],b Stich and Karimireddy [100] O(HMT+σTKM)
Other algorithms
SCAFFOLD [101]Control variates and two stepsizesO(HT+σTKM)
a

There are no accelerated fed-avg/local SGD variants so far.

b

These papers consider the smooth non-convex setting, we adapt here the results for our setting.

c

This paper considers the smooth strongly convex setting, we adapt here the results for our setting.

or Create an Account

Close Modal
Close Modal