Convergence rates for a (non-comprehensive) set of distributed optimization algorithms in the IID-data setting. We assume M devices participate in each iterations, and the loss functions are H-smooth, convex, and we have access to stochastic gradients with variance at most σ2. All rates are upper bounds on (3.1) after T iterations (potentially with some iterate averaging scheme)
| Method | Comments | Convergence |
|---|---|---|
| Baselines | ||
| Mini-batch SGD | Batch size KM | |
| SGD | (on one worker, no communication) | |
| Baselines with accelerationa | ||
| A-mini-batch SGD [95], [97] | Batch size KM | |
| A-SGD [97] | (on one worker, no communication) | |
| Parallel SGD/Fed-Avg/Local SGD | ||
| Yu et al. [98],b Stich [99]c | Gradient norm bounded by G | |
| Wang and Joshi [53],b Stich and Karimireddy [100] | ||
| Other algorithms | ||
| SCAFFOLD [101] | Control variates and two stepsizes | |
| Method | Comments | Convergence |
|---|---|---|
| Mini-batch SGD | Batch size | |
| SGD | (on one worker, no communication) | |
| A-mini-batch SGD [ | Batch size | |
| A-SGD [ | (on one worker, no communication) | |
| Yu | Gradient norm bounded by | |
| Wang and Joshi [ | ||
| SCAFFOLD [ | Control variates and two stepsizes | |
There are no accelerated fed-avg/local SGD variants so far.
These papers consider the smooth non-convex setting, we adapt here the results for our setting.
This paper considers the smooth strongly convex setting, we adapt here the results for our setting.
Sharing content requires targeting cookies to be enabled. Please update your cookie preferences to use this feature.