Table 3.2

Convergence rates for a (non-comprehensive) set of distributed optimization algorithms in the IID-data setting. We assume that M devices participate in each iteration, that the loss functions are H-smooth and convex, and that we have access to stochastic gradients with variance at most σ². All rates are upper bounds on (3.1) after T iterations (potentially with some iterate averaging scheme).

Method	Comments	Convergence
Baselines
Mini-batch SGD	Batch size KM	$O\left(\frac{H}{T} + \frac{\sigma}{\sqrt{TKM}}\right)$
SGD	(on one worker, no communication)	$O\left(\frac{H}{TK} + \frac{\sigma}{\sqrt{TK}}\right)$
Baselines with acceleration^a
A-mini-batch SGD [95], [97]	Batch size KM	$O\left(\frac{H}{T^{2}} + \frac{\sigma}{\sqrt{TKM}}\right)$
A-SGD [97]	(on one worker, no communication)	$O\left(\frac{H}{(TK)^{2}} + \frac{\sigma}{\sqrt{TK}}\right)$
Parallel SGD/Fed-Avg/Local SGD
Yu et al. [98],^b Stich [99]^c	Gradient norm bounded by G	$O\left(\frac{HKM}{T} \frac{G^{2}}{\sigma^{2}} + \frac{\sigma}{\sqrt{TKM}}\right)$
Wang and Joshi [53],^b Stich and Karimireddy [100]		$O\left(\frac{HM}{T} + \frac{\sigma}{\sqrt{TKM}}\right)$
Other algorithms
SCAFFOLD [101]	Control variates and two stepsizes	$O\left(\frac{H}{T} + \frac{\sigma}{\sqrt{TKM}}\right)$

^a There are no accelerated Fed-Avg/Local SGD variants so far.

^b These papers consider the smooth non-convex setting; we adapt their results to our setting here.

^c This paper considers the smooth strongly convex setting; we adapt its results to our setting here.
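To compare the rows of the table numerically, the bounds can be evaluated for concrete values of H, σ, T, K, M, and G. The following is a minimal sketch, not part of the original table: the function name `rate_table` and the row labels are hypothetical, and every constant hidden by the O(·) notation is set to 1, so the outputs only indicate how the bounds scale relative to one another.

```python
import math


def rate_table(H, sigma, T, K, M, G):
    """Evaluate the bounds of Table 3.2 with all hidden constants set to 1.

    Assumption: the O(.) constants are replaced by 1, so only the relative
    scaling of the rows is meaningful, not the absolute values.
    """
    if min(H, sigma, T, K, M, G) <= 0:
        raise ValueError("all parameters must be positive")

    # Statistical terms: sigma / sqrt(number of stochastic gradients used).
    noise_all_workers = sigma / math.sqrt(T * K * M)
    noise_one_worker = sigma / math.sqrt(T * K)

    return {
        # Baselines
        "mini-batch SGD": H / T + noise_all_workers,
        "SGD (one worker)": H / (T * K) + noise_one_worker,
        # Baselines with acceleration
        "A-mini-batch SGD": H / T**2 + noise_all_workers,
        "A-SGD (one worker)": H / (T * K) ** 2 + noise_one_worker,
        # Parallel SGD / Fed-Avg / Local SGD
        "local SGD (bounded gradient)": (H * K * M / T) * (G**2 / sigma**2)
        + noise_all_workers,
        "local SGD": H * M / T + noise_all_workers,
        # Other algorithms
        "SCAFFOLD": H / T + noise_all_workers,
    }


if __name__ == "__main__":
    rates = rate_table(H=1.0, sigma=1.0, T=100, K=10, M=10, G=1.0)
    for name, value in sorted(rates.items(), key=lambda kv: kv[1]):
        print(f"{name:30s} {value:.5f}")
```

With constants ignored, the SCAFFOLD row coincides with mini-batch SGD, while the Local SGD rows carry an extra factor of M (or KM·G²/σ²) in the optimization term, so for M > 1 their bounds are never smaller than the mini-batch SGD bound.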
