Asynchronous ADMM
This article analyzes the 2023 paper "Distributed Optimization on Directed Graphs Based on Inexact ADMM with Partial Participation" by Yi and Freris, in which the authors propose an elegant integration of the gradient step from gradient descent with asynchronous ADMM to solve a convex consensus problem. Their arXiv e-print is available here.
Problem Setup
Our problem is to solve \( \min_x F(x) = \sum_{i \in [n]} f_i (x) \). Suppose we have \( n \in \mathbb{Z}_+ \) agents (computation units), where the \( i \)-th agent handles one part of \( F \), denoted by \( f_i \). The optimization variable on the \( i \)-th agent is denoted by \( x_i \). The goal of a distributed optimization algorithm is to find a minimizer of \( F \) while achieving consensus, i.e. \( x_1 = x_2 = \cdots = x_n \in \mathbb{R}^d \).
Let \( x \coloneqq [x_1^\top, x_2^\top, \cdots, x_n^\top]^\top \in \mathbb{R}^{nd} \); the original objective \( F \) can then be equivalently expressed as \( F(x) = \sum_{i \in [n]} f_i(x_i) \) under the constraint \( x_1 = x_2 = \cdots = x_n \). Defining the consensus set \( \mathcal{C} \coloneqq \{ x \mid x_1 = x_2 = \cdots = x_n \} \), this constraint can be written as \( x \in \mathcal{C} \). Define the indicator function \[ I_{\mathcal{C}}(x) \coloneqq \begin{cases} 0 & x \in \mathcal{C}, \\ \infty & \text{otherwise}, \end{cases} \] and we obtain an unconstrained problem: \[ \min F(x) + I_{\mathcal{C}}(x). \] To apply ADMM, we introduce another variable \( z \in \mathbb{R}^{nd} \) to split the problem into a smooth part and a nonsmooth part. The authors did not propose a name for the \( z \) variable, but it is often called the consensus variable in the distributed ADMM literature. With \( z \), the previous problem becomes a typical ADMM one: \[ \min F(x) + I_{\mathcal{C}}(z) \text{ subject to } x = z. \] It is worth noting that \( z \) is introduced not for modeling reasons, but to make the convergence proof more convenient; this splitting appears throughout the distributed ADMM literature.
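As a small aside, the proximal operator of \( I_{\mathcal{C}} \) is simply the Euclidean projection onto \( \mathcal{C} \), which replaces every agent's block with the average across agents. The following minimal sketch (not from the paper; the function name and dimensions are my own) illustrates this:

```python
import numpy as np

def project_consensus(x, n, d):
    """Project x in R^{nd} onto C = {x : x_1 = ... = x_n}.

    This is the proximal operator of the indicator I_C: each agent's
    d-dimensional block is replaced by the mean over all n blocks.
    """
    blocks = x.reshape(n, d)       # one row per agent
    avg = blocks.mean(axis=0)      # the consensus value
    return np.tile(avg, n)         # every agent receives the average

# Example with n = 2 agents and d = 2: blocks [1,2] and [3,4]
# both become their mean [2,3].
p = project_consensus(np.array([1.0, 2.0, 3.0, 4.0]), n=2, d=2)
```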
IPD Algorithm
The authors termed their algorithm IPD, for Inexact, Partial participation, Directed graph. Define the augmented Lagrangian by \[ L_{\rho}(x, z, y) \coloneqq F(x) + I_{\mathcal{C}}(z) + \langle y, x - z \rangle + \frac{\rho}{2} \Vert x - z \Vert^2; \] both primal variables \( x \) and \( z \) are then defined as the (possibly inexact) minimizers of the augmented Lagrangian.
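To make the definition concrete, here is a minimal numerical sketch of \( L_{\rho} \). The toy objective `F`, the consensus test, and all dimensions are made-up placeholders, not from the paper:

```python
import numpy as np

def augmented_lagrangian(x, z, y, rho, F, in_consensus):
    """L_rho(x, z, y) = F(x) + I_C(z) + <y, x - z> + (rho/2) * ||x - z||^2."""
    if not in_consensus(z):
        return np.inf                      # the indicator term I_C(z)
    r = x - z                              # primal residual x - z
    return F(x) + y @ r + 0.5 * rho * (r @ r)

F = lambda v: v @ v                        # toy smooth objective
in_C = lambda z: np.allclose(z, z[0])      # all blocks equal (d = 1 here)
val = augmented_lagrangian(np.array([1.0, 2.0]), np.array([1.5, 1.5]),
                           np.zeros(2), 1.0, F, in_C)
# F(x) = 5, the inner product vanishes (y = 0), and the penalty is 0.25
```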
First, consider the update of the primal variable \( x \). At the \( k \)-th iteration it is defined by \[ x^k \coloneqq \mathrm{argmin}_x L_{\rho}(x, z^{k - 1}, y^{k - 1}). \] Next, we need to split the computation across agents, following the spirit of asynchronous ADMM. The natural way to do so is for the \( i \)-th agent to estimate the latest values of \( x_{j \neq i} \) from its local copies \( x_{j, i} \). Let \( \hat{x}_i^k \in \mathbb{R}^{nd} \) be the \( i \)-th agent's estimate of \( x \) at the \( k \)-th iteration, and define \( \hat{z}_i^k \) and \( \hat{y}_i^k \) analogously; we would then expect an update of the form \[ \begin{aligned} x_i^k &= x_i^{k - 1} - \eta \nabla_{x_i} L_{\rho}(\hat{x}_i^{k - 1}, \hat{z}_i^{k - 1}, \hat{y}_i^{k - 1}) \\ &= x_i^{k - 1} - \eta \nabla_{x_i} \left( f_i(x_i^{k - 1}) + \langle \hat{y}_i^{k - 1}, \phi_i^{k - 1} \rangle + \frac{\rho}{2} \Vert \phi_i^{k - 1} \Vert^2 \right), \end{aligned} \] where \( \phi_i^{k - 1} \coloneqq \hat{x}_i^{k - 1} - \hat{z}_i^{k - 1} \in \mathbb{R}^{nd} \). However, the authors used a simplified version: \[ x_i^{k + 1} \coloneqq x_i^k - \eta \left( \nabla f_i(x_i^k) + y_i^k + \rho(x_i^k - z_i^k) \right). \] The authors did not explain explicitly how \( z_i \) and \( y_i \) are derived from the augmented Lagrangian, nor, more importantly, what they actually mean; for now, we can only take this as a design choice. In this formula, \( z_i \) and \( y_i \) are vectors matching the shape of \( x_i \), which saves storage when the algorithm runs on multiple machines, and the gradients are easier to compute than in the first update formula. The drawback is a logical gap: we do not know how \( z_i \) and \( y_i \) are obtained from the augmented Lagrangian.
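The simplified local update can be sketched in a few lines. The quadratic \( f_i(x_i) = \frac{1}{2} \Vert x_i - a_i \Vert^2 \) below is a toy choice of mine (so that \( \nabla f_i(x_i) = x_i - a_i \)), purely to make the gradient concrete:

```python
import numpy as np

def x_update(x_i, z_i, y_i, a_i, eta, rho):
    """One simplified local step:
    x_i^{k+1} = x_i^k - eta * (grad f_i(x_i^k) + y_i^k + rho * (x_i^k - z_i^k)),
    with the toy objective f_i(x_i) = 0.5 * ||x_i - a_i||^2.
    """
    grad = x_i - a_i                       # gradient of the toy f_i
    return x_i - eta * (grad + y_i + rho * (x_i - z_i))

# One step with x_i = 2, z_i = 1, y_i = 0, a_i = 0, eta = 0.1, rho = 1:
# the step direction is grad + y + rho*(x - z) = 2 + 0 + 1 = 3.
new_x = x_update(np.array([2.0]), np.array([1.0]), np.array([0.0]),
                 np.array([0.0]), eta=0.1, rho=1.0)
```

Note that only \( x_i \)-shaped vectors are touched, which is exactly the storage saving discussed above.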
As for the update of \( z_i \), as the authors have pointed out, the subproblem is convex and can therefore be solved in closed form: \[ z_i^{k + 1} = \frac{1}{n} \sum_{j \in [n]} \left( x_j^{k + 1} + \frac{y_j^k}{\rho} \right). \] But asynchronous ADMM does not guarantee that each \( x_j^{k + 1} \) is available when updating \( z_i^{k + 1} \), so an estimate is still needed here. Yi et al. used the same technique as Makhdoumi and Ozdaglar (2015), termed balancing weights (see Definition 1 therein): letting \( G \) be a (possibly directed) graph, the node weights \( w_i \) balance \( G \) if for every \( i \in [n] \) there holds \( w_i d_i^{\mathrm{out}} = \sum_{j \in N^{\mathrm{in}}(i)} w_j \), where \( N^{\mathrm{in}}(i) \) is the set of agents that can send information to \( i \) and \( d_i^{\mathrm{out}} \) is the number of agents \( i \) can send information to. When the weights balance \( G \), each agent's total incoming weight equals its total outgoing weight, which explains the term "balance". To state this more explicitly, let \( w_{ij} \) denote the weight with which agent \( i \) sends information to \( j \); then \( G \) is weight-balanced if \[ \sum_{j \in [n]} w_{ij} = \sum_{j \in [n]} w_{ji}. \] By observation, balancing is a looser condition than symmetry, i.e. \( w_{ij} = w_{ji} \): symmetric weights, as in undirected graphs, always balance the graph, but a directed graph can be balanced without being symmetric.
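The balance condition is easy to check numerically. The sketch below (my own illustration, not from either paper) verifies it on a directed 3-cycle, which is balanced but not symmetric, precisely the gap between the two conditions noted above; `w[i, j]` is the weight on the edge from agent \( i \) to agent \( j \):

```python
import numpy as np

def is_weight_balanced(w, tol=1e-9):
    """True iff every node's total outgoing weight equals its total incoming weight,
    i.e. sum_j w[i, j] == sum_j w[j, i] for all i."""
    out_w = w.sum(axis=1)                  # row sums: outgoing weight per node
    in_w = w.sum(axis=0)                   # column sums: incoming weight per node
    return bool(np.allclose(out_w, in_w, atol=tol))

# Directed cycle 0 -> 1 -> 2 -> 0 with unit weights:
# balanced, even though w[i, j] != w[j, i] on every edge.
w_cycle = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]])

# A single directed edge 0 -> 1 is not balanced: node 0 only sends,
# node 1 only receives.
w_edge = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
```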
References
- Yi, Dingran, and Nikolaos M. Freris. "Distributed Optimization on Directed Graphs Based on Inexact ADMM with Partial Participation." 2023 62nd IEEE Conference on Decision and Control (CDC). IEEE, 2023.
- Makhdoumi, Ali, and Asuman Ozdaglar. "Graph Balancing for Distributed Subgradient Methods over Directed Graphs." 2015 54th IEEE Conference on Decision and Control (CDC). IEEE, 2015, pp. 1364-1371.