Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning

Dmitry Andreevich Bylinkin; Былинкин Дмитрий Андреевич; Kirill Danilovich Degtyarev; Дегтярев Кирилл Данилович; Aleksandr Nikolaevich Beznosikov; Безносиков Александр Николаевич

doi:10.4213/rm10206

Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning

作者: Bylinkin D.A.¹^,2, Degtyarev K.D.¹, Beznosikov A.N.³^,2^,4
隶属关系:
1. Moscow Institute of Physics and Technology
2. Institute for System Programming, Russian Academy of Sciences
3. Sber AI Lab
4. Innopolis University
期: 卷 79, 编号 6 (2024)
页面: 5-38
栏目: Articles
URL: https://journal-vniispk.ru/0042-1316/article/view/281939
DOI: https://doi.org/10.4213/rm10206
ID: 281939

如何引用文章

全文:

开放存取

##reader.subscriptionAccessGranted##
受限制的访问

订阅存取

详细
作者简介
参考
补充文件
统计

详细

Modern realities and trends in learning require more and more generalization ability of models, which leads to an increase in both models and training sample size. It is already difficult to solve such tasks in a single device mode. This is the reason why distributed and federated learning approaches are becoming more popular every day. Distributed computing involves communication between devices, which requires solving two key problems: efficiency and privacy. One of the most well-known approaches to combat communication costs is to exploit the similarity of local data. Both Hessian similarity and homogeneous gradients have been studied in the literature, but separately. In this paper we combine both of these assumptions in analyzing a new method that incorporates the ideas of using data similarity and clients sampling. Moreover, to address privacy concerns, we apply the technique of additional noise and analyze its impact on the convergence of the proposed method. The theory is confirmed by training on real datasets.Bibliography: 45 titles.

关键词

ASEG, distributed learning, federated learning, communication costs, Hessian similarity, homogeneous gradients, technique of additional noise

参考

M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, Li Zhang, “Deep learning with differential privacy”, Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (Vienna, 2016), ACM, New York, 2016, 308–318
M. Aitsam, “Differential privacy made easy”, 2022 International conference on emerging trends in electrical, control, and telecommunication engineering (ETECTE) (Lahore, 2022), IEEE, 2022, 1–7
D. Alistarh, D. Grubic, J. Li, R. Tomioka, M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding”, NIPS'17: Proceedings of the 31st international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 30, Curran Associates, Inc., Red Hook, NY, 2017, 1707–1718,
Z. Allen-Zhu, “Katyusha: The first direct acceleration of stochastic gradient methods”, J. Mach. Learn. Res., 18 (2018), 221, 51 pp.
Y. Arjevani, O. Shamir, “Communication complexity of distributed convex learning and optimization”, NIPS'15: Proceedings of the 28th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 28, MIT Press, Cambridge, MA, 2015, 1756–1764
B. Barazandeh, Tianjian Huang, G. Michailidis, “A decentralized adaptive momentum method for solving a class of min-max optimization problems”, Signal Process., 189 (2021), 108245, 23 pp.
A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2762, 38116–38133
A. Beznosikov, A. Gasnikov, “Compression and data similarity: combination of two techniques for communication-efficient solving of distributed variational inequalities”, Optimization and applications, Lecture Notes in Comput. Sci., 13781, Springer, Cham, 2022, 151–162
A. Beznosikov, S. Horvath, P. Richtarik, M. Safaryan, “On biased compression for distributed learning”, J. Mach. Learn. Res., 24 (2023), 276, 50 pp.
A. Beznosikov, G. Scutari, A. Rogozin, A. Gasnikov, “Distributed saddle-point problems under similarity”, NIPS'21: Proceedings of the 35th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 34, Curran Associates, Inc., Red Hook, NY, 2021, 625, 8172–8184
A. Beznosikov, M. Takac, A. Gasnikov, “Similarity, compression and local steps: three pillars of efficient communications for distributed variational inequalities”, NIPS'23: Proceedings of the 37th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 36, Curran Associates, Inc., Red Hook, NY, 2023, 1246, 28663–28677
Chih-Chung Chang, Chih-Jen Lin, “LIBSVM: A library for support vector machines”, ACM Trans. Intell. Syst. Technol. (TIST), 2:3 (2011), 27, 27 pp.
A. Defazio, F. Bach, S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”, NIPS'14: Proceedings of the 27th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 27, MIT Press, Cambridge, MA, 2014, 1646–1654,
Jiashi Feng, Huan Xu, Shie Mannor, Distributed robust learning, 2015 (v1 – 2014), 18 pp.
S. Gade, N. H. Vaidya, “Private optimization on networks”, 2018 Annual american control conference (ACC) (Milwaukee, WI, 2018), IEEE, 2018, 1402–1409
E. Gorbunov, F. Hanzely, P. Richtarik, “A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent”, Proceedings of the 23rd international conference on artificial intelligence and statistics (AISTATS 2020), Proc. Mach. Learn. Res. (PMLR), 108, 2020, 680–690
E. Gorbunov, F. Hanzely, P. Richtarik, “Local SGD: unified theory and new efficient methods”, Proceedings of the 24th international conference on artificial intelligence and statistics (AISTATS 2021), Proc. Mach. Learn. Res. (PMLR), 130, 2021, 3556–3564
T. M. Hehn, J. F. Kooij, F. A. Hamprecht, “End-to-end learning of decision trees and forests”, Int. J. Comput. Vis., 128:4 (2020), 997–1011
H. Hendrikx, Lin Xiao, S. Bubeck, F. Bach, L. Massoulie, “Statistically preconditioned accelerated gradient method for distributed optimization”, Proceedings of the 37th international conference on machine learning (ICML 2020), Proc. Mach. Learn. Res. (PMLR), 119, 2020, 4203–4227,
R. Johnson, Tong Zhang, “Accelerating stochastic gradient descent using predictive variance reduction”, NIPS'13: Proceedings of the 26th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 26, Curran Associates Inc., Red Hook, NY, 2013, 315–323,
A. Khaled, Chi Jin, Faster federated optimization under second-order similarity, 2023 (v1 – 2022), 44 pp.
A. Khaled, K. Mishchenko, P. Richtarik, “Tighter theory for local SGD on identical and heterogeneous data”, Proceedings of the 23rd international conference on artificial intelligence and statistics (AISTATS 2020), Proc. Mach. Learn. Res. (PMLR), 108, 2020, 4519–4529
A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, S. Stich, “A unified theory of decentralized SGD with changing topology and local updates”, Proceedings of the 37th international conference on machine learning (ICML 2020), Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5381–5393,
D. Kovalev, A. Beznosikov, E. Borodich, A. Gasnikov, G. Scutari, “Optimal gradient sliding and its application to optimal distributed optimization under similarity”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2427, 33494–33507,
D. Kovalev, S. Horvath, P. Richtarik, “Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop”, Proceedings of the 31st international conference on algorithmic learning theory (ALT 2020), Proc. Mach. Learn. Res. (PMLR), 117, 2020, 451–467
Zhize Li, Hongyan Bao, Xiangliang Zhang, P. Richtarik, “PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization”, Proceedings of the 38th international conference on machine learning (ICML 2021), Proc. Mach. Learn. Res. (PMLR), 139, 2021, 6286–6295
Dachao Lin, Yuze Han, Haishan Ye, Zhihua Zhang, “Stochastic distributed optimization under average second-order similarity: algorithms and analysis”, NIPS'23: Proceedings of the 37th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 36, Curran Associates, Inc., Red Hook, NY, 2023, 89, 1849–1862
Hongzhou Lin, J. Mairal, Z. Harchaoui, “A universal catalyst for first-order optimization”, NIPS'15: Proceedings of the 28th international conference on neural information processing systems, v. 2, Adv. Neural Inf. Process. Syst., 28, MIT Press, Cambridge, MA, 2015, 3384–3392,
B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data”, Proceedings of the 20th international conference on artificial intelligence and statistics (AISTATS 2017), Proc. Mach. Learn. Res. (PMLR), 54, 2017, 1273–1282
J. M. Moguerza, A. Muñoz, “Support vector machines with applications”, Statist. Sci., 21:3 (2006), 322–336
L. M. Nguyen, Jie Liu, K. Scheinberg, M. Takač, “SARAH: A novel method for machine learning problems using stochastic recursive gradient”, Proceedings of the 34th international conference on machine learning (ICML 2017), Proc. Mach. Learn. Res. (PMLR), 70, 2017, 2613–2621
E. Nozari, P. Tallapragada, J. Cortes, “Differentially private distributed convex optimization via functional perturbation”, IEEE Trans. Control Netw. Syst., 5:1 (2018), 395–408
S. S. Sannakki, A. A. Kulkarni, “A review paper on artificial neural network in cognitive science”, Int. J. Engrg. Techn., 2:2 (2016), 118–123,
S. Shalev-Shwartz, O. Shamir, N. Srebro, K. Sridharan, “Learnability, stability and uniform convergence”, J. Mach. Learn. Res., 11 (2010), 2635–2670
O. Shamir, N. Srebro, Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method”, Proceedings of the 31st international conference on machine learning (ICML 2014), Proc. Mach. Learn. Res. (PMLR), 32, 2014, 1000–1008
S. Sreenivasan, R. Cohen, E. Lopez, Z. Toroczkai, H. E. Stanley, Communication bottlenecks in scale-free networks, 2006, 5 pp.
S. U. Stich, Unified optimal analysis of the (stochastic) gradient method, 2019, 11 pp.
Ying Sun, G. Scutari, A. Daneshmand, “Distributed optimization based on gradient tracking revisited: enhancing convergence rate via surrogation”, SIAM J. Optim., 32:2 (2022), 354–385
Yг Tian, G. Scutari, Tianyu Cao, A. Gasnikov, “Acceleration in distributed optimization under similarity”, Proceedings of the 25th international conference on artificial intelligence and statistics (AISTATS 2022), Proc. Mach. Learn. Res. (PMLR), 151, 2022, 5721–5756
G. Tripepi, K. J. Jager, F. W. Dekker, C. Zoccali, “Linear and logistic regression analysis”, Kidney Int., 73:7 (2008), 806–810
S. Vaswani, I. Laradji, F. Kunstner, Si Yi Meng, M. Schmidt, S. Lacoste-Julien, Adaptive gradient methods converge faster with over-parameterization (but you should do a line-search), 2021 (v1 – 2020), 41 pp.
J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, J. S. Rellermeyer, “A survey on distributed machine learning”, ACM Comput. Surveys, 53:2 (2020), 30, 33 pp.
P. C. Weeraddana, C. Fischione, “On the privacy of optimization”, IFAC-PapersOnLine, 50:1 (2017), 9502–9508
B. E. Woodworth, K. K. Patel, N. Srebro, “Minibatch vs local SGD for heterogeneous distributed learning”, NIPS'20: Proceedings of the 34th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 33, Curran Associates, Inc., Red Hook, NY, 2020, 527, 6281–6292,
Xiao-Tong Yuan, Ping Li, “On convergence of distributed approximate Newton methods: globalization, sharper bounds and beyond”, J. Mach. Learn. Res., 21 (2020), 206, 51 pp.

补充文件

附件文件

动作

1. JATS XML

下载

用户名
密码
记住我

忘记您的密码?	注册

用户名
密码
记住我

忘记您的密码?	注册

Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning

全文:

详细

关键词

作者简介

Dmitry Bylinkin

Kirill Degtyarev

Aleksandr Beznosikov

参考

补充文件