ADMM — Alternating Direction Method of Multipliers


    Introduction

    Alternating direction method of multipliers (ADMM) is a distributed optimization method well known for its solid theoretical foundation and its capability to split a large problem into subproblems that can be solved concurrently. ADMM has been explored for decades, from the development of variants to proofs of the algorithm's global convergence. It has shown promise in solving large-scale convex optimization problems and has also been applied to nonconvex problems, some as complex as the training process of a deep neural network (DNN). In this article, we review some representative works in the field of ADMM, covering its theoretical groundings and practical usage, and conclude with a vision of how this method may develop in the future.

    In this note (and possibly upcoming notes related to this field), we use basic terminology from optimization such as convexity, decision variables, and objectives. For more details, please refer to standard references such as Boyd et al.'s Convex Optimization and other resources. Unless otherwise stated, the norms and inner products used in this note are Frobenius norms and inner products.

    Framework of ADMM

    Lagrangian and Dual Problems

    A constrained optimization problem with equality constraints can be converted into an unconstrained problem by constructing a Lagrangian function. Consider the following constrained problem: $$ \text{minimize } R(x) \text{ subject to } f_i(x) = 0, i = 1, 2, \cdots, N,$$ where \(x \in \mathbb{R}^n\) is an \(n\)-dimensional decision variable and \(R: \mathbb{R}^n \to \mathbb{R}\) is the objective function. We call the domain where the constraints are satisfied the feasible region, denoted by \( \mathcal{F} \coloneqq \{x \mid f_i(x) = 0, i = 1, 2, \cdots, N\} \). We also assume that the functions \( R \) and \( f_i \) for every \( i = 1, 2, \cdots, N \) are defined over the extended real space \( \bar{\mathbb{R}}^n \coloneqq \mathbb{R}^n \cup \{\infty, -\infty\} \). This is the primal problem that we are concerned with. To reduce the primal problem to an unconstrained one, we first derive the following Lagrangian: \[ \tilde{\mathcal{L}}(x; \lambda) \coloneqq R(x) + \sum_{i = 1}^N \langle \lambda_i, f_i(x) \rangle, \] where \(\lambda = (\lambda_i)_{i = 1}^N\) is a collection of Lagrange multipliers. It is straightforward to see that when all the constraints are satisfied, i.e., when \( x \) lies in the feasible region \( \mathcal{F} \) so that \(f_i(x) = 0\) for every \(i = 1, 2, \cdots, N\), the Lagrangian equals the objective function. In other words, restricted to the feasible region, minimizing the Lagrangian and minimizing the objective are the same problem.

    Using the Lagrangian, we can define another function, whose job is to "find the possible minima of the Lagrangian with respect to the primal variable \( x \), for different Lagrange multipliers." It is given by: \[ \tilde{L}(\lambda) \coloneqq \inf_{x \in \bar{\mathbb{R}}^n} \tilde{\mathcal{L}}(x; \lambda). \]

    Here \( \tilde{L} \) is called a dual function, and the Lagrange multipliers \( \lambda \), in the context of dual functions, are also known as dual variables. Dual functions are concave with respect to the dual variables, and their values never exceed the minimal value of the primal problem. We show the latter first: suppose the primal problem has a minimal point \( x^\ast \in \mathcal{F} \) with value \( p^\ast \coloneqq R(x^\ast) \). For every point \( x \in \mathcal{F} \), we have \( \tilde{\mathcal{L}}(x; \lambda) = R(x) \), and therefore \( \tilde{L}(\lambda) \le \tilde{\mathcal{L}}(x; \lambda) = R(x) \) for all \( x \in \mathcal{F} \); in particular, \( \tilde{L}(\lambda) \le p^\ast \). The fact that every value of the dual function is a lower bound on the primal optimal value is called weak duality. As the argument above shows, weak duality always holds.

    Next, we show that dual functions are also concave. When \( x \) is fixed, the Lagrangian is clearly affine with respect to the dual variables; however, this affinity can be broken by the infimum operation with respect to \( x \). To prove that dual functions are concave, we first note that the infimum of the sum of two functions is at least the sum of their infimums, and that the infimum commutes with nonnegative scaling. Let \( f, g \) be defined over \( x \in \mathcal{D} \). It holds that: \[ \begin{align} \forall x \in \mathcal{D}, \; &f(x) + g(x) \ge \inf_{x \in \mathcal{D}} f(x) + \inf_{x \in \mathcal{D}} g(x), \text{ hence } \inf_{x \in \mathcal{D}} \left( f(x) + g(x) \right) \ge \inf_{x \in \mathcal{D}} f(x) + \inf_{x \in \mathcal{D}} g(x), \\ &\inf_{x \in \mathcal{D}} \theta f(x) = \theta \inf_{x \in \mathcal{D}} f(x) \text{ for } \theta \ge 0. \end{align} \] Therefore, for \( \theta \in [0, 1] \), writing \( R(x) = \theta R(x) + (1 - \theta) R(x) \), it holds for dual functions that: \[ \begin{align} \tilde{L}(\theta \lambda + (1 - \theta) \tilde{\lambda}) &= \inf_{x \in \bar{\mathbb{R}}^n} \left( R(x) + \sum_{i = 1}^N \langle \theta \lambda_i + (1 - \theta) \tilde{\lambda}_i, f_i(x) \rangle \right) \\ &\ge \theta \inf_{x \in \bar{\mathbb{R}}^n} \left( R(x) + \sum_{i = 1}^N \langle \lambda_i, f_i(x) \rangle \right) + (1 - \theta) \inf_{x \in \bar{\mathbb{R}}^n} \left( R(x) + \sum_{i = 1}^N \langle \tilde{\lambda}_i, f_i(x) \rangle \right) \\ &= \theta \tilde{L}(\lambda) + (1 - \theta) \tilde{L}(\tilde{\lambda}), \end{align} \] which indicates that the dual function is concave.

    Since the dual function is concave, it always attains a maximum over the extended real space. Denote that maximum by \( d^\ast \). Then by weak duality, we have \( d^\ast \le p^\ast \le R(x) \) for every \( x \in \mathcal{F} \). When \( d^\ast < p^\ast \), we can never recover the optimal value of the objective by maximizing the dual function. However, there are cases where \( d^\ast = p^\ast \), known as strong duality, which holds if certain conditions are satisfied. Specifically, Slater proved in 1950 that for a convex problem, strong duality holds as long as a strictly feasible point exists; this requirement is known as Slater's condition. Under Slater's condition, the optimal value of the objective is exactly the maximum of the dual function, and maximizing the concave dual function is itself a convex problem. This is useful when the original constrained primal problem is hard to solve directly, while the maximum of the concave dual function can be found easily.
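
    As a small worked example (our own, not taken from the references), consider \( \text{minimize } x^2 \text{ subject to } x - 1 = 0 \), so that \( p^\ast = R(1) = 1 \). The Lagrangian is \( \tilde{\mathcal{L}}(x; \lambda) = x^2 + \lambda (x - 1) \), and minimizing over \( x \) (attained at \( x = -\lambda/2 \)) gives the dual function \[ \tilde{L}(\lambda) = \inf_x \left( x^2 + \lambda(x - 1) \right) = -\frac{\lambda^2}{4} - \lambda, \] a concave parabola maximized at \( \lambda^\ast = -2 \), where \( d^\ast = \tilde{L}(-2) = 1 = p^\ast \). The problem is convex with an affine constraint, so Slater's condition holds and strong duality is confirmed.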

    Augmented Lagrangian

    However, duality isn't useful when strong duality doesn't hold, or is at least hard to prove. Moreover, when minimizing the Lagrangian, things get tricky when the inner products \( \langle \lambda_i, f_i(x) \rangle \) take negative values. Suppose at a certain step of our optimization there exists an index \(\iota\) such that \( \langle \lambda_{\iota}, f_{\iota}(x) \rangle < -M \), where \(M \in \mathbb{R}_+ \) is sufficiently large. This drives the Lagrangian to a lower value (\(\tilde{\mathcal{L}} < R(x) + \sum_{i \neq \iota} \langle \lambda_i, f_i(x) \rangle - M\)), yet it also produces a larger constraint violation, since \(f_{\iota}\) is moving far away from 0, which does not match our expectation.

    To avoid these problems, we can introduce a term that forces the equality constraints to approach 0. This term could be any norm of the constraint; following the conventions of this field, we use the Frobenius norm \(\Vert \cdot \Vert \coloneqq \Vert \cdot \Vert_F\). In ADMM, we derive an augmented Lagrangian from the basic Lagrangian function, expressed by: \[ \mathcal{L}(x; \lambda, \rho) = R(x) + \sum_{i = 1}^N \left( \langle \lambda_i, f_i(x) \rangle + \frac{\rho_i}{2} \Vert f_i(x) \Vert^2 \right). \] We call the problem of minimizing the augmented Lagrangian \(\mathcal{L}\) the ADMM dual problem of the primal problem. The variables \( \lambda_i \) are still called dual variables, while the constants \( \rho_i \in \mathbb{R}_+ \) are penalty parameters. What makes the ADMM dual problem different from the common dual problem used in constrained optimization is that it handles the constraints more flexibly, so that they can be enforced in most cases. The objective of an ADMM dual problem equals the original objective whenever the decision point lies in the feasible region \( \mathcal{F} \), indicating that they share the same global optimal point.
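
    A tiny example (our own) illustrates what the penalty term buys us. Take \( R(x) = x \) with the single constraint \( x = 0 \). The plain Lagrangian \( \tilde{\mathcal{L}}(x; \lambda) = x + \lambda x = (1 + \lambda) x \) is unbounded below for every \( \lambda \neq -1 \), so minimizing it is hopeless for almost all dual variables. The augmented Lagrangian \[ \mathcal{L}(x; \lambda, \rho) = x + \lambda x + \frac{\rho}{2} x^2 \] is strongly convex in \( x \) for any \( \rho > 0 \), with minimizer \( x = -(1 + \lambda)/\rho \); the quadratic penalty both bounds the problem from below and pulls the minimizer toward the feasible point \( x = 0 \).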

    ADMM Algorithm

    To minimize the augmented Lagrangian, ADMM follows an iterative update strategy, shown as Algorithm 1:

    Algorithm 1: Alternating Direction Method of Multipliers
    Input: initial values \(x^0, \lambda_{i}^0\); penalty parameters \(\rho_i\);
    Output: optimal point \(x^{\ast}\) and optimal value \(R^\ast\);
    Initialize: Let \( k \gets 0 \), define \( \mathcal{L} \);
    1 While the stopping criterion is not met do
    2 Let \( k \gets k + 1 \);
    3 Update \( x^k \gets \mathrm{argmin}_x \mathcal{L}(x; \lambda^{k - 1}, \rho) \);
    4 Update \( \lambda_i^k \gets \lambda_i^{k - 1} + \rho_i f_i(x^k) \) for \( i = 1, 2, \cdots, N \);
    5 end while;
    6 Return \( x^\ast \gets x^k \), \( R^\ast \gets R(x^\ast) \);

    Algorithm 1 shows the basic framework of the ADMM algorithm. In problems involving more than one primal variable, the order in which the variables are updated should be carefully adjusted for the best performance. As for the dual variables, both theoretical and empirical evidence suggests that they should be updated after the primal variables to ensure global convergence of the algorithm, which we will discuss later. In practice, there are many types of problems to which ADMM is commonly applied, as we describe in the next sections.
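
    To make the loop concrete, here is a minimal sketch of Algorithm 1 in Python for a single equality constraint, using a general-purpose solver as the \(\mathrm{argmin}\) oracle. The quadratic objective and the constraint are our own toy choices, not from the references.

        import numpy as np
        from scipy.optimize import minimize

        # Toy problem: minimize R(x) = ||x - a||^2 subject to f1(x) = sum(x) - 1 = 0.
        a = np.array([2.0, -1.0, 0.5])
        R = lambda x: np.sum((x - a) ** 2)
        f1 = lambda x: np.sum(x) - 1.0

        rho = 1.0
        lam = 0.0            # dual variable (scalar, since f1 is scalar-valued)
        x = np.zeros(3)

        def aug_lagrangian(x, lam, rho):
            # L(x; lambda, rho) = R(x) + <lambda, f1(x)> + (rho/2) ||f1(x)||^2
            return R(x) + lam * f1(x) + 0.5 * rho * f1(x) ** 2

        for k in range(50):
            # x^k <- argmin_x L(x; lambda^{k-1}, rho), via a numerical solver
            x = minimize(lambda v: aug_lagrangian(v, lam, rho), x).x
            # lambda^k <- lambda^{k-1} + rho * f1(x^k)
            lam = lam + rho * f1(x)
            if abs(f1(x)) < 1e-6:   # stopping criterion: primal feasibility
                break

        print(x, R(x), f1(x))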

    Typical ADMM Problems

    Convex Objective

    The first type comprises problems whose objective has at least one strongly convex term and whose constraints are all linear. Such problems have been thoroughly studied in the existing literature. For this type of problem, the essence of applying ADMM is to split the primal objective into two convex terms; the constraints are linear functions represented by matrix products. Generally, we express these problems as: \[ \text{minimize } R(x, z) = f(x) + g(z) \text{ subject to } Ax + Bz = c, \] where \( f \) and \( g \) are scalar-valued convex functions (assumed strongly convex where the analysis requires it). The feasible region is defined by \( \mathcal{F} \coloneqq \{ (x, z) \mid Ax + Bz = c \} \). Based on the framework of ADMM, the following augmented Lagrangian can be derived: \[ \mathcal{L}(x, z; \lambda) \coloneqq f(x) + g(z) + \langle \lambda, Ax + Bz - c \rangle + \frac{\rho}{2} \Vert Ax + Bz - c \Vert^2. \] The ADMM algorithm is designed as follows:

    Algorithm 2: ADMM for Convex Objectives with Linear Constraints
    Input: initial values \(x^0, z^0, \lambda^0\); penalty parameter \(\rho\);
    Output: optimal point \(x^{\ast}, z^\ast\) and optimal value \(p^\ast\);
    Initialize: Let \( k \gets 0 \), define \( \mathcal{L} \);
    1 While the stopping criterion is not met do
    2 Let \( k \gets k + 1 \);
    3 Update \( x^k \gets \mathrm{argmin}_x \mathcal{L}(x, z^{k - 1}; \lambda^{k - 1}) \);
    4 Update \( z^k \gets \mathrm{argmin}_z \mathcal{L}(x^k, z; \lambda^{k - 1}) \);
    5 Update \( \lambda^k \gets \lambda^{k - 1} + \rho (Ax^k + Bz^k - c) \);
    6 end while;
    7 Return \( x^\ast \gets x^k \), \( z^\ast \gets z^k \), \( p^\ast \gets f(x^\ast) + g(z^\ast) \);
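
    As a concrete instance of Algorithm 2, the following sketch solves a lasso-style problem with \( f(x) = \frac{1}{2} \Vert Dx - b \Vert^2 \), \( g(z) = \gamma \Vert z \Vert_1 \) and the constraint \( x - z = 0 \) (i.e., \( A = I \), \( B = -I \), \( c = 0 \)). The data, sizes, and parameter values are our own choices for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        D = rng.standard_normal((30, 10))
        b = rng.standard_normal(30)
        gamma, rho = 0.5, 1.0

        x = np.zeros(10); z = np.zeros(10); lam = np.zeros(10)
        DtD, Dtb = D.T @ D, D.T @ b

        # Soft-thresholding: the proximal operator of t * ||.||_1.
        soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        for k in range(200):
            # x-update: minimize 0.5||Dx - b||^2 + <lam, x - z> + (rho/2)||x - z||^2
            x = np.linalg.solve(DtD + rho * np.eye(10), Dtb - lam + rho * z)
            # z-update: minimize gamma||z||_1 - <lam, z> + (rho/2)||x - z||^2
            z = soft(x + lam / rho, gamma / rho)
            # dual update: lam^k <- lam^{k-1} + rho (x^k - z^k)
            lam = lam + rho * (x - z)

        print(0.5 * np.linalg.norm(D @ x - b) ** 2 + gamma * np.abs(z).sum())

    Both subproblems here have closed-form solutions, which is one reason this splitting is popular: the x-update is a linear solve and the z-update is an elementwise soft-thresholding.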

    Now we provide a brief analysis of Algorithm 2. Since the objective functions \( f \) and \( g \) are assumed convex, 1) solutions to \( \mathrm{argmin}_x \mathcal{L}(x, z^{k - 1}; \lambda^{k - 1}) \) and \( \mathrm{argmin}_z \mathcal{L}(x^k, z; \lambda^{k - 1}) \) exist; 2) the optimal value of the primal problem \( p^\ast \) exists. To ensure that Algorithm 2 works, the following conditions have to be satisfied:

    • 1. The feasible region is not empty, i.e. there exists \( y = (x, z)^\top \) such that \( [A, B] y = c \).
    • 2. The solutions to the ADMM subproblems (\( x^k \gets \mathrm{argmin}_x \mathcal{L}(x, z^{k - 1}; \lambda^{k - 1}) \) and \( z^k \gets \mathrm{argmin}_z \mathcal{L}(x^k, z; \lambda^{k - 1}) \)) are unique.
    • 3. The norms of variables \( x \), \( z \) and \( \lambda \) are always bounded.
    • 4. Each of the variables \( x \), \( z \) and \( \lambda \) converges to a stationary point as \( k \to \infty \).

    These conditions provide general goals for proving that Algorithm 2 converges. Some of them can be satisfied by imposing proper assumptions. For example, in the work of Nishihara et al. (2015), the authors assume (their Assumption 3) that 1) matrix \( A \) is invertible, and 2) matrix \( B \) has full column rank. As for the first assumption, when matrix \( A \) is invertible, we can always find \( x = A^{-1} (c - Bz) \), so the feasible region is not empty; this ensures condition 1. In addition, this assumption implies that matrix \( A \) has full column rank, which turns the subproblem \( x^k \gets \mathrm{argmin}_x \mathcal{L}(x, z^{k - 1}; \lambda^{k - 1}) \) into a strongly convex one, and therefore yields condition 2 with respect to variable \( x \). Using the full column rank of \( B \) in the same way, we ensure condition 2 with respect to variable \( z \). Therefore, we only have to prove that conditions 3 and 4 hold under these assumptions.

    Nishihara et al. also analysed the convergence of an over-relaxed form of Algorithm 2 (Algorithm 2 in their paper) by treating the algorithm as a discrete-time dynamical system (their Theorem 6), which is quite inspiring. The over-relaxed form is derived from the augmented Lagrangian by transforming its last two terms: \[ \begin{align} \mathcal{L}_{2,3} &= \langle \lambda, Ax + Bz - c \rangle + \frac{\rho}{2} \Vert Ax + Bz - c \Vert^2 \\ &= \frac{\rho}{2} \left( \Vert r + u \Vert^2 - \Vert u \Vert^2 \right), \end{align} \] where \( r = Ax + Bz - c \) and \( u = \lambda / \rho \). The authors used integral quadratic constraints (IQCs) to analyse the convergence; readers could refer to resources such as the MIT lecture notes on IQCs for the details. In general, IQCs reduce the convergence analysis of a feedback dynamical system to verifying a linear matrix inequality. Let the state variable be represented by \( \xi^k \) and a stationary point by \( \xi^\ast \); if a positive definite matrix \( P \) and a convergence rate \( \tau \in (0, 1) \) exist such that the linear matrix inequality is feasible, applying IQCs leads to an inequality of the form \[ \Vert \xi^k - \xi^\ast \Vert_P \le \tau^k \Vert \xi^0 - \xi^\ast \Vert_P, \] where \( \Vert \cdot \Vert_P \) denotes the norm induced by \( P \). This inequality gives linear convergence of the state variables, and hence of the ADMM algorithm.

    In his thesis, Li provides another feasible proof of convergence of Algorithm 2. Li assumes that the functions \( f \) and \( g \) are both strongly convex, which ensures that the subproblems are solvable. Besides, Li also assumes that the classical Lagrangian \( \tilde{\mathcal{L}} \) has at least one saddle point; the author states that the existence of saddle points ensures strong duality. In the proof of convergence, the author shows that as \( k \to \infty \), 1) the residual \( r^k \coloneqq Ax^k + Bz^k - c \to 0 \), 2) the objective \( f(x^k) + g(z^k) \to p^\ast \) and 3) the dual variable \( \lambda^k \to \lambda^\ast \). Recall Lyapunov's direct method: given a dynamical system \( x^k = T(x^{k - 1}) \), if there exists a Lyapunov function \( V(x) \) such that 1) \( V(x) \ge 0 \), with \( V(x) = 0 \) iff \( x = x^\ast \), and 2) \( V(x^k) \le V(x^{k - 1}) \), with strict inequality whenever \( x^k \neq x^{k - 1} \), then the system is stable in the sense of Lyapunov. In Li's work, the Lyapunov function is defined by \[ V^k = V(z^k; \lambda^k) \coloneqq \frac{1}{\rho} \Vert \lambda^k - \lambda^\ast \Vert^2 + \rho \Vert B(z^k - z^\ast) \Vert^2. \] Li points out that as long as we can prove that the sequence \( \{V^k\} \) is monotonically nonincreasing, there exist bounds for both \( \{\lambda^k\} \) and \( \{Bz^k\} \). Specifically, the goal is to prove that \[ V^k + \rho \Vert r^k \Vert^2 + \rho \Vert B(z^k - z^{k - 1}) \Vert^2 \le V^{k - 1}, \] indicating that \( \{V^k\} \) is nonincreasing. Since \( V^k \) is also nonnegative, there exists \( V^\ast \ge 0 \) such that \( V^k \to V^\ast \) as \( k \to \infty \). Taking the sum from \( k = 1 \) to \( \infty \), there holds: \[ \sum_{k = 1}^{\infty} \rho (\Vert r^k \Vert^2 + \Vert B(z^k - z^{k - 1}) \Vert^2) \le \sum_{k = 1}^{\infty} (V^{k - 1} - V^k) = V^0 - V^\ast. \] Therefore, since the infinite series is bounded, we can derive that \( r^k \to 0 \) and \( B(z^k - z^{k - 1}) \to 0 \) as \( k \to \infty \). To prove the inequality, we first use the optimality condition of the strongly convex subproblem \( \mathrm{argmin}_x \mathcal{L}(x, z^{k - 1}; \lambda^{k - 1}) \). This yields \[ 0 \in \partial \mathcal{L}(x, z^{k - 1}; \lambda^{k - 1}) \big\vert_{x = x^k} = \partial f(x^k) + A^\top \lambda^{k - 1} + \rho A^\top (Ax^k + Bz^{k - 1} - c). \] Here we use transposes instead of inner products to align with the author's notation. Using the update formula of the dual variable \( \lambda \), namely \( \lambda^k \gets \lambda^{k - 1} + \rho(Ax^k + Bz^k - c) \), there holds: \[ 0 \in \partial f(x^k) + A^\top (\lambda^k - \rho B(z^k - z^{k - 1})). \] Therefore, \( x^k \) is the minimum of the strongly convex function \( f(x) + (\lambda^k - \rho B(z^k - z^{k - 1}))^\top Ax \). In particular, comparing with \( x^\ast \), there holds: \[ f(x^k) + (\lambda^k - \rho B(z^k - z^{k - 1}))^\top Ax^k \le f(x^\ast) + (\lambda^k - \rho B(z^k - z^{k - 1}))^\top Ax^\ast. \] Similarly, by computing the subgradients of the \( z \)-subproblem and viewing \( z^k \) as the minimizer of another convex function, we can derive \[ g(z^k) + (\lambda^k)^\top B z^k \le g(z^\ast) + (\lambda^k)^\top B z^\ast. \] Adding these two inequalities together and using \( Ax^\ast + Bz^\ast = c \), there holds: \[ p^k - p^\ast \le -(\lambda^k)^\top r^k - \rho(B(z^k - z^{k - 1}))^\top (-r^k + B(z^k - z^\ast)). \] Besides, Li proves that \( p^\ast - p^k \le (\lambda^\ast)^\top r^k \) by applying \( \tilde{\mathcal{L}}(x^\ast, z^\ast; \lambda^\ast) \le \tilde{\mathcal{L}}(x^k, z^k; \lambda^\ast) \), which follows from the assumption that \( \tilde{\mathcal{L}} \) has at least one saddle point. Adding these two inequalities together, there holds: \[ (\lambda^\ast - \lambda^k)^\top r^k - \rho(B(z^k - z^{k - 1}))^\top (-r^k + B(z^k - z^\ast)) \ge 0. \] First, substituting \( \lambda^k \) with \( \lambda^{k - 1} + \rho r^k \), we have that \[ (\lambda^\ast - \lambda^k)^\top r^k = (\lambda^\ast - \lambda^{k - 1})^\top r^k - \rho \Vert r^k \Vert^2. \] Multiplying by 2 and substituting \( r^k \) with \( (1 / \rho)(\lambda^k - \lambda^{k - 1}) \) where convenient, it holds that \[ \begin{align} 2 (\lambda^\ast - \lambda^k)^\top r^k &= \frac{2}{\rho} (\lambda^\ast - \lambda^{k - 1})^\top (\lambda^k - \lambda^{k - 1}) - 2 \rho \Vert r^k \Vert^2 \\ &= \frac{2}{\rho} (\lambda^\ast - \lambda^{k - 1})^\top (\lambda^k - \lambda^{k - 1}) - \rho \Vert r^k \Vert^2 - \frac{1}{\rho} \Vert \lambda^k - \lambda^{k - 1} \Vert^2 \\ &= \frac{1}{\rho} \left( \Vert \lambda^{k - 1} - \lambda^\ast \Vert^2 - \Vert \lambda^k - \lambda^\ast \Vert^2 \right) - \rho \Vert r^k \Vert^2. \end{align} \] The last equality holds because \[ \Vert a - b \Vert^2 - \Vert a - c \Vert^2 = - 2 \langle a - c, b - c \rangle + \Vert b - c \Vert^2. \] Applying this identity to the inequality, we have that \[ \begin{align} &\space (\lambda^\ast - \lambda^k)^\top r^k - \rho(B(z^k - z^{k - 1}))^\top (-r^k + B(z^k - z^\ast)) \\ &= \frac{1}{2 \rho} \left( \Vert \lambda^{k - 1} - \lambda^\ast \Vert^2 - \Vert \lambda^k - \lambda^\ast \Vert^2 \right) - \frac{\rho}{2} \Vert r^k \Vert^2 \\ &\space\space\space\space\space - \rho(B(z^k - z^{k - 1}))^\top (-r^k + B(z^k - z^{k - 1} + z^{k - 1} - z^\ast)) \\ &= - \frac{1}{2} \Big[ \frac{1}{\rho} \left( \Vert \lambda^k - \lambda^\ast \Vert^2 - \Vert \lambda^{k - 1} - \lambda^\ast \Vert^2 \right) + \rho \Vert r^k \Vert^2 \\ &\space\space\space\space\space\space\space\space\space\space\space\space - 2 \rho (B(z^k - z^{k - 1}))^\top r^k + 2 \rho \Vert B(z^k - z^{k - 1}) \Vert^2 + 2 \rho (B(z^k - z^{k - 1}))^\top (B(z^{k - 1} - z^\ast)) \Big] \\ &= - \frac{1}{2} \Big[ \frac{1}{\rho} \left( \Vert \lambda^k - \lambda^\ast \Vert^2 - \Vert \lambda^{k - 1} - \lambda^\ast \Vert^2 \right) - 2 \rho (B(z^k - z^{k - 1}))^\top r^k \\ &\space\space\space\space\space\space\space\space\space\space\space\space\space + 2 \rho \Vert B(z^k - z^{k - 1}) \Vert^2 + 2 \rho (B(z^k - z^{k - 1}))^\top (B(z^{k - 1} - z^\ast)) \\ &\space\space\space\space\space\space\space\space\space\space\space\space\space + \rho \left( \Vert r^k - B(z^k - z^{k - 1}) \Vert^2 + 2 (B(z^k - z^{k - 1}))^\top r^k - \Vert B(z^k - z^{k - 1}) \Vert^2 \right) \Big]\\ &= - \frac{1}{2} \Big[ \frac{1}{\rho} \left( \Vert \lambda^k - \lambda^\ast \Vert^2 - \Vert \lambda^{k - 1} - \lambda^\ast \Vert^2 \right) + \rho \Vert r^k - B(z^k - z^{k - 1}) \Vert^2 \\ &\space\space\space\space\space\space\space\space\space\space\space\space\space + \rho \Big( \Vert B(z^k - z^{k - 1}) \Vert^2 + 2 (B(z^k - z^{k - 1}))^\top (B(z^{k - 1} - z^\ast)) \Big) \Big] \\ &= - \frac{1}{2} \Big[ \frac{1}{\rho} \left( \Vert \lambda^k - \lambda^\ast \Vert^2 - \Vert \lambda^{k - 1} - \lambda^\ast \Vert^2 \right) + \rho \Vert r^k - B(z^k - z^{k - 1}) \Vert^2 \\ &\space\space\space\space\space\space\space\space\space\space\space\space\space + \rho \Vert B(z^k - z^\ast) \Vert^2 - \rho \Vert B(z^{k - 1} - z^\ast) \Vert^2 \Big] \ge 0. \\ \end{align} \] Based on the definition of the Lyapunov function \( V^k \), there holds: \[ V^k - V^{k - 1} = \frac{1}{\rho} \Vert \lambda^k - \lambda^\ast \Vert^2 + \rho \Vert B(z^k - z^\ast) \Vert^2 - \frac{1}{\rho} \Vert \lambda^{k - 1} - \lambda^\ast \Vert^2 - \rho \Vert B(z^{k - 1} - z^\ast) \Vert^2. \] Combining with the previous equations, there holds: \[ V^k + \rho \Vert r^k - B(z^k - z^{k - 1}) \Vert^2 \le V^{k - 1}. \] A further step, again using the optimality of \( z^k \) and \( z^{k - 1} \) in their respective subproblems together with \( \lambda^k - \lambda^{k - 1} = \rho r^k \), shows that \( (r^k)^\top B(z^k - z^{k - 1}) \le 0 \), so \( \Vert r^k - B(z^k - z^{k - 1}) \Vert^2 \ge \Vert r^k \Vert^2 + \Vert B(z^k - z^{k - 1}) \Vert^2 \) and the goal inequality follows; in particular, \( \{V^k\} \) is nonincreasing. Therefore, based on Lyapunov's direct method, the algorithm is stable in the sense of Lyapunov, which shows that the sequence generated by Algorithm 2 is at least not divergent. In addition, Li proves that the sequence \( \{ (x^k, z^k, \lambda^k) \} \) converges to a stationary point as \( k \to \infty \); readers could refer to Section 2.2.5 of the thesis for more details.
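
    As a quick numerical sanity check of this Lyapunov argument (our own sketch, not part of Li's proof), one can run the lasso instance from the earlier sketch to high accuracy to obtain surrogates for \( z^\ast \) and \( \lambda^\ast \), then verify that \( V^k \) is nonincreasing along the iterates. Here \( B = -I \), so \( \Vert B(z^k - z^\ast) \Vert = \Vert z^k - z^\ast \Vert \).

        import numpy as np

        # Reuses D, b, DtD, Dtb, gamma, rho and soft() from the lasso sketch above.
        def admm(n_iter, record=None):
            x = np.zeros(10); z = np.zeros(10); lam = np.zeros(10)
            for k in range(n_iter):
                x = np.linalg.solve(DtD + rho * np.eye(10), Dtb - lam + rho * z)
                z = soft(x + lam / rho, gamma / rho)
                lam = lam + rho * (x - z)
                if record is not None:
                    record.append((z.copy(), lam.copy()))
            return z, lam

        z_star, lam_star = admm(5000)     # high-accuracy surrogate for the optimum
        trace = []
        admm(100, record=trace)

        V = [np.sum((l - lam_star) ** 2) / rho + rho * np.sum((z - z_star) ** 2)
             for z, l in trace]
        # Up to floating-point noise, V^k should never increase.
        print(all(V[k] <= V[k - 1] + 1e-10 for k in range(1, len(V))))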

    Nonconvex Objective

    When the objective functions are convex or at least affine, the augmented Lagrangian is also convex; if the objective is strongly convex, or the matrices in the linear constraints have full column rank, the augmented Lagrangian is strongly convex. However, if the objective is not convex, the augmented Lagrangian may not be convex either, and we can no longer guarantee that the ADMM subproblems have unique solutions.

    Wang et al. (2019) dived into the case where the objective functions are nonconvex but satisfy certain criteria. They assume that the objective \( \phi \) is nonconvex but continuous, and that the constraints are represented by \[ \sum_{i = 0}^p A_i x_i + By = b. \] The authors argue that their proof holds when \( b \in \mathrm{Im}(B) \), and set \( b = 0 \) for brevity. The algorithm developed by Wang et al. is shown as Algorithm 3:

    Algorithm 3: Nonconvex ADMM
    Input: initial values \(\{x_i^0\}_{i = 0}^p, y^0, \lambda^0\); penalty parameter \(\rho\);
    Output: optimal point \(\{x_i^{\ast}\}_{i = 0}^p, y^\ast\) and optimal value \(p^\ast\);
    Initialize: Let \( k \gets 0 \), define \( \mathcal{L} \);
    1 While the stopping criterion is not met do
    2 Let \( k \gets k + 1 \);
    3 Update \( x_i^k \gets \mathrm{argmin}_{x_i} \mathcal{L}(x_0^k, \cdots, x_{i - 1}^k, x_i, x_{i + 1}^{k - 1}, \cdots, x_p^{k - 1}, y^{k - 1}; \lambda^{k - 1}) \); for \( i = 0, 1, \cdots, p \);
    4 Update \( y^k \gets \mathrm{argmin}_y \mathcal{L}(\{x_i^k\}_{i = 0}^p, y; \lambda^{k - 1}) \);
    5 Update \( \lambda^k \gets \lambda^{k - 1} + \rho (\sum_{i = 0}^p A_i x_i^k + By^k) \);
    6 end while;
    7 Return \( \{x_i^\ast\}_{i = 0}^p \gets \{x_i^k\}_{i = 0}^p \), \( y^\ast \gets y^k \), \( p^\ast \gets \phi(\{x_i^\ast\}_{i = 0}^p, y^\ast) \);
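
    The following scalar sketch instantiates Algorithm 3 with a single \( x \)-block (\( p = 0 \)); the double-well objective, the constraint \( x - y = 0 \) (so \( A_0 = 1 \), \( B = -1 \), \( b = 0 \in \mathrm{Im}(B) \)), and the penalty value are our own toy choices, not from the paper.

        import numpy as np
        from scipy.optimize import minimize

        # phi(x, y) = 0.25 * (x^2 - 1)^2 + 0.5 * y^2, nonconvex in x.
        phi_x = lambda x: 0.25 * (x ** 2 - 1) ** 2
        phi_y = lambda y: 0.5 * y ** 2
        rho = 4.0        # the nonconvex theory requires a sufficiently large penalty
        x, y, lam = 1.5, 0.0, 0.0

        def L(x, y, lam):
            r = x - y    # constraint residual A0 * x + B * y
            return phi_x(x) + phi_y(y) + lam * r + 0.5 * rho * r ** 2

        for k in range(100):
            x = minimize(lambda v: L(v[0], y, lam), [x]).x[0]   # x-block update
            y = minimize(lambda v: L(x, v[0], lam), [y]).x[0]   # y-update (last block)
            lam = lam + rho * (x - y)                           # dual update

        print(x, y, lam)   # expect x = y near a stationary point of phi on x = y

    Note that the subproblem solver only returns a local minimizer here; the theory of Wang et al. places its strongest assumptions on the last block \( y \), whose subproblem above is smooth and strongly convex.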

    In the proof of convergence, the authors define the following four properties, which together establish the convergence of Algorithm 3:

    • P1 (Boundedness) The sequence \( \{ \{x_i^k\}_{i = 0}^p, y^k, \lambda^k \} \) is bounded, and \( \mathcal{L}(\{x_i^k\}_{i = 0}^p, y^k, \lambda^k) \) is lower bounded.
    • P2 (Sufficient Descent) There exists a constant \( C_1(\rho) > 0 \) such that for all sufficiently large \( k \), \[ \mathcal{L}^{k - 1} - \mathcal{L}^k \ge C_1(\rho) \Big( \Vert B(y^k - y^{k - 1}) \Vert^2 + \sum_{i = 0}^p \Vert A_i (x_i^k - x_i^{k - 1}) \Vert^2 \Big), \] where \( \mathcal{L}^k \) is short for \( \mathcal{L}(\{x_i^k\}_{i = 0}^p, y^k, \lambda^k) \).
    • P3 (Subgradient Bound) There exist \( C_2(\rho) > 0 \) and a subgradient \( d^k \in \partial \mathcal{L}^k \) such that \[ \Vert d^k \Vert \le C_2(\rho) \Big( \Vert B(y^k - y^{k - 1}) \Vert + \sum_{i = 0}^p \Vert A_i(x_i^k - x_i^{k - 1}) \Vert \Big). \]
    • P4 (Limiting Continuity) Whenever a subsequence \( \{ \{ x_i^{k_s} \}_{i = 0}^p, y^{k_s}, \lambda^{k_s} \} \), \( s \in \mathbb{N} \), converges to a limit point \( (\{ x_i^{\ast} \}_{i = 0}^p, y^{\ast}, \lambda^{\ast}) \), it holds that \[ \lim_{s \to \infty} \mathcal{L}(\{ x_i^{k_s} \}_{i = 0}^p, y^{k_s}, \lambda^{k_s}) = \mathcal{L}(\{ x_i^{\ast} \}_{i = 0}^p, y^{\ast}, \lambda^{\ast}). \]

    After proving these four properties, the authors use the conclusions (Theorem 2.9) of Attouch et al. to establish the convergence of Algorithm 3. Here we omit the tedious proof and introduce only the necessary steps; readers could refer to the original paper for the detailed proof. Incidentally, we will see that the approach of Wang et al. establishes a solid grounding for future research in this area, especially ADMM applied to complicated problems: the representative works on using ADMM to train DNNs, discussed below, employ similar techniques.

    Now let's walk through the proof introduced by Wang et al. First, the authors establish an upper bound on the change of the dual variable \( \lambda \), controlled by the change of the primal variable \( y \): \[ \Vert \lambda^k - \lambda^{k - 1} \Vert \le C \Vert B (y^k - y^{k - 1}) \Vert, \] where \( C \) is a constant determined by matrix \( B \). The proof relies on a standard result from matrix analysis: for every \( x \in \mathrm{Im}(B) \), letting \( \lambda_{++} \) denote the smallest positive eigenvalue of the symmetric matrix \( B^\top B \), it holds that \( \lambda_{++}^{1/2} \Vert x \Vert \le \Vert B^\top x \Vert \). The bound on \( \Vert \lambda^k - \lambda^{k - 1} \Vert \) is crucial, because the change of the augmented Lagrangian caused by the dual update can be expressed in terms of the change of the dual variables.

    Next, the authors establish a lower bound on the descent of the augmented Lagrangian caused by the update of the primal variables, \[ \Delta \mathcal{L}(x_i^{k - 1} \to x_i^k) \coloneqq \mathcal{L}(x_{< i}^k, x_i^{k - 1}, x_{> i}^{k - 1}) - \mathcal{L}(x_{< i}^k, x_i^k, x_{> i}^{k - 1}), \] with \( y^{k - 1} \) and \( \lambda^{k - 1} \) held fixed and omitted from the notation. This lemma is easy to establish because the descent is guaranteed by the ADMM algorithm itself: when solving each subproblem, we update the primal variable with a minimizer of the augmented Lagrangian. Similarly, the descent of the augmented Lagrangian caused by the update of \( y \) can also be established.

    Now we arrive at the most difficult stage of proving convergence of ADMM with nonconvex objectives: bounding the change of the augmented Lagrangian caused by updating the dual variables. Note that updating the dual variables in ADMM does not make the augmented Lagrangian smaller; on the contrary, it causes the augmented Lagrangian to grow. Therefore, when proving the sufficient descent property (P2), we need to find a set of conditions such that \[ \mathcal{L}^{k - 1} - \mathcal{L}^k = \Delta \mathcal{L}_{\mathrm{primal}} - \Delta \mathcal{L}_{\mathrm{dual}} \ge 0, \] where \( \Delta \mathcal{L}_{\mathrm{primal}} \) is the total decrease produced by the primal updates and \( \Delta \mathcal{L}_{\mathrm{dual}} \) is the increase produced by the dual update. This process can be tough when there are many dual variables and the constraints are complicated; fortunately, here we have only one linear constraint with one dual variable. In their proof, the authors use a property of Lipschitz constants. Specifically, if \( H \) is the Lipschitz constant of the derivative of a function \( f \), then for any \( x, x_0 \) within the feasible region, it holds that \[ f(x) \le f(x_0) + \langle f^\prime(x_0), x - x_0 \rangle + \frac{H}{2} \Vert x - x_0 \Vert^2. \] This inequality can be proved by computing a line integral, as spelled out below. The property is especially useful for Lipschitz-differentiable functions, which is a common assumption in the field of optimization. Sometimes we apply a little trick when the inequality above is not the one we want: since \( f \) is \( H \)-Lipschitz differentiable, \( g = - f \) is as well, and applying the property to \( g \) yields \[ f(x_0) + \langle f^\prime (x_0), x - x_0 \rangle \le f(x) + \frac{H}{2} \Vert x - x_0 \Vert^2, \forall x, x_0 \in \mathcal{F}. \]
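
    For completeness, the line-integral argument is short enough to spell out: \[ \begin{align} f(x) - f(x_0) &= \int_0^1 \langle f^\prime(x_0 + t(x - x_0)), x - x_0 \rangle \, dt \\ &= \langle f^\prime(x_0), x - x_0 \rangle + \int_0^1 \langle f^\prime(x_0 + t(x - x_0)) - f^\prime(x_0), x - x_0 \rangle \, dt, \end{align} \] and by the Cauchy-Schwarz inequality together with the Lipschitz continuity of \( f^\prime \), the last integrand is at most \( H t \Vert x - x_0 \Vert^2 \); integrating over \( t \in [0, 1] \) produces exactly the \( \frac{H}{2} \Vert x - x_0 \Vert^2 \) term.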

    After establishing the lower bound on the descent of the augmented Lagrangian, we are able to 1) use the change of the primal variables to estimate the minimal descent of the augmented Lagrangian and 2) establish a monotonically decreasing sequence \( \{ \mathcal{L}^k \}_{k = 1}^{\infty} \). It is worth noting that almost all studies of the convergence of ADMM look for sufficient conditions under which the sequence of augmented Lagrangians \( \{ \mathcal{L}^k \}_{k = 1}^{\infty} \) is monotonically decreasing (except those using indirect methods, such as treating ADMM as a dynamical system and using IQCs or Lyapunov functions to prove convergence, as introduced before). This is indeed the easiest and most stable approach; however, since we can only find sufficient conditions for the sequence to be monotone, these conditions can be extremely complex and rarely guide users in tuning the parameters of the algorithm. Besides, such analyses do not reveal conditions under which the algorithm could still converge without monotone descent.

    After proving that the augmented Lagrangian decreases monotonically as \( k \to \infty \), the authors use the property of Lipschitz constants to prove that it is also lower bounded. Such a monotonically decreasing, lower-bounded sequence of augmented Lagrangians must converge. The boundedness of the primal and dual variables (P1) can then be easily established, followed by the sufficient descent property (P2) and the bound on the subgradients of the augmented Lagrangian with respect to these variables (P3). Proofs of the remaining properties can also be built on these conclusions.

    DNN Training

    So far we have studied ADMM applied to linear constraints. What if an optimization problem has nonlinear constraints? A typical occasion is when we use ADMM to train a deep neural network (DNN). Almost all DNNs have nonlinear activation functions (even ReLU is nonlinear unless all features are positive), and these functions make forward propagation nonlinear so that the models can fit more complex data. Suppose we have a DNN with \( N > 1 \) layers. Let \( x_0 \) denote the input, \( x_i \) the output of the \( i \)-th layer for \( i = 1, 2, \cdots, N \), and \( \hat{y} \) the ground truth. We also have a loss function, denoted by \( R(x_N; \hat{y}) \), which measures how far the model's prediction \( x_N \) is from the ground truth \( \hat{y} \).

    Take the simplest form of DNN as an example: a feedforward neural network consisting of linear layers, written mathematically as \[ \left\{ \begin{align} x_i &= \sigma(W_i x_{i - 1}), i = 1, 2, \cdots, N - 1 \\ x_N &= W_N x_{N - 1} \end{align} \right. \] where \( \sigma \) is an activation function, possibly nonlinear and nonconvex, and the final output layer has no activation. The bias \( b_i \) is omitted for brevity; whether or not a bias is included does not affect the proof of convergence. Training this network is equivalent to solving the following optimization problem: \[ \begin{align} &p^\ast = \min_{ \{ W_i \}_{i = 1}^N } R(x_N; \hat{y}) \\ &\text{subject to } \left\{ \begin{aligned} x_i &= \sigma(W_i x_{i - 1}), i = 1, 2, \cdots, N - 1 \\ x_N &= W_N x_{N - 1} \end{aligned} \right. \end{align} \] To apply ADMM, we first construct the augmented Lagrangian. Conventionally, ADMM assigns a dual variable to each of the constraints, yielding an augmented Lagrangian as follows: \[ \begin{align} \mathcal{L}(\{ x_i \}_{i = 1}^N; \{ \lambda_i \}_{i = 1}^N) &= R(x_N; \hat{y}) + \sum_{i = 1}^{N - 1} \Big( \langle \lambda_i, x_i - \sigma(W_i x_{i - 1}) \rangle + \frac{\rho_i}{2} \Vert x_i - \sigma(W_i x_{i - 1}) \Vert^2 \Big) \\ &\space\space\space\space\space + \langle \lambda_N, x_N - W_N x_{N - 1} \rangle + \frac{\rho_N}{2} \Vert x_N - W_N x_{N - 1} \Vert^2. \end{align} \] This is the standard version of the augmented Lagrangian, adopted by one of the works cited above. However, using so many dual variables introduces significant complexity when bounding the change of the augmented Lagrangian caused by the dual updates (as there are so many of them), and makes it even harder to tune the penalty parameters \( \rho_i \). Another option, adopted by the other work, reduces the augmented Lagrangian to a variant with only one dual variable, attached to the last constraint (i.e., the constraint of the output layer), and uses a set of squared norms to force the other constraints to be satisfied. This variant is given by: \[ \begin{align} \mathcal{L}(\{ x_i \}_{i = 1}^N; \lambda_N) &= R(x_N; \hat{y}) + (\nu / 2) \sum_{i = 1}^{N - 1} \Vert x_i - \sigma(W_i x_{i - 1}) \Vert^2 \\ &\space\space\space\space\space + \langle \lambda_N, x_N - W_N x_{N - 1} \rangle + \frac{\rho_N}{2} \Vert x_N - W_N x_{N - 1} \Vert^2, \end{align} \] where \( \nu > 0 \) is a hyperparameter. It should also be clarified that both works introduce intermediate variables, namely the inputs to the activation functions (i.e., the pre-activations), as extra constraints; we omit them here for brevity, since introducing such variables does not significantly reduce the complexity of the convergence proof. In addition, one of these works introduces a forward-backward update order that updates the variables twice per iteration, ensuring that the information of all layers is exchanged "completely", while the other utilizes the traditional update order, updating the variables only once.
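
    The sketch below evaluates the single-dual-variable augmented Lagrangian for a tiny three-layer network; all sizes, data, and parameter values are our own toy choices, meant only to show how the terms line up.

        import numpy as np

        rng = np.random.default_rng(0)
        sizes = [4, 8, 8, 2]                   # input, two hidden layers, output
        W = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
        x0 = rng.standard_normal((4, 16))      # a batch of 16 inputs
        y_hat = rng.standard_normal((2, 16))   # ground truth
        relu = lambda u: np.maximum(u, 0.0)
        nu, rho = 1.0, 1.0

        def aug_lagrangian(W, x, lam):
            # x = [x_1, ..., x_N] are layer-output variables; lam is the single
            # dual variable attached to the output-layer constraint.
            loss = 0.5 * np.sum((x[-1] - y_hat) ** 2)            # R(x_N; y_hat)
            pen = sum(np.sum((x[i] - relu(W[i] @ (x[i - 1] if i else x0))) ** 2)
                      for i in range(len(W) - 1))                # hidden-layer penalties
            r = x[-1] - W[-1] @ x[-2]                            # output-layer residual
            return loss + 0.5 * nu * pen + np.sum(lam * r) + 0.5 * rho * np.sum(r ** 2)

        # Initialize layer outputs by a forward pass, and the dual variable at zero.
        x, h = [], x0
        for i in range(2):
            h = relu(W[i] @ h); x.append(h)
        x.append(W[2] @ h)
        lam = np.zeros_like(x[-1])
        print(aug_lagrangian(W, x, lam))       # penalties vanish at a forward pass

    An ADMM training loop would then alternate \(\mathrm{argmin}\) updates of each \( W_i \) and \( x_i \) against this function, followed by the dual update on \( \lambda_N \).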

    During the training process, we need to update both the weight matrices \( W_i \) and the output variables \( x_i \). These are primal variables updated by solving ADMM subproblems, i.e., \( W_i^k \gets \mathrm{argmin}_{W_i} \mathcal{L} \) and \( x_i^k \gets \mathrm{argmin}_{x_i} \mathcal{L} \). When \( i = N \), the subproblems are convex and therefore have a single global minimal value (this does not ensure the uniqueness of the optimal points, though). Convexity is all we can claim for the \( i = N \) subproblem: \( W_N \) is itself the variable being updated, so it is hard to impose further properties on it. Things get even more complex for \( i < N \), where we cannot even ensure convexity of the subproblems.

    Therefore, to ensure that the subproblems can be solved, both works use a quadratic function to approximate the nonconvex subproblems. Quadratic functions have the best properties we can hope for: they are strongly convex, easy to solve, and we know exactly where their global minimum is located. One of the works introduces an iterative algorithm referred to as "backtracking": it estimates the quadratic coefficient of the approximating function and checks a condition guaranteeing that, under this approximation, the value of the augmented Lagrangian still decreases as the primal variables are updated. The other work uses matrix norms of the terms involved in each subproblem to estimate an upper bound on the Lipschitz constant of that subproblem, takes this upper bound as the quadratic coefficient, and uses the properties of Lipschitz constants to prove the convergence lemmas. The advantage of the latter approach is that it is not iterative, so the coefficients can be computed very quickly; however, the coefficients can sometimes be too large and hinder the update of the primal variables, reducing both accuracy and training speed.
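
    The following sketch captures the backtracking idea in isolation (our own simplification, not the exact procedure of either paper): replace a subproblem \(\mathrm{argmin}_W \mathcal{L}(W)\) by the minimizer of a quadratic surrogate around the current iterate, and grow the coefficient \( \theta \) until the surrogate truly upper-bounds \( \mathcal{L} \) at the trial point, which guarantees that \( \mathcal{L} \) does not increase.

        import numpy as np

        def backtracking_update(L, grad_L, W, theta=1.0, factor=2.0, max_tries=50):
            LW, g = L(W), grad_L(W)
            for _ in range(max_tries):
                W_new = W - g / theta          # minimizer of the quadratic surrogate
                # surrogate(W_new) = L(W) + <g, W_new - W> + (theta/2)||W_new - W||^2
                surrogate = (LW + np.sum(g * (W_new - W))
                             + 0.5 * theta * np.sum((W_new - W) ** 2))
                if L(W_new) <= surrogate:      # valid upper bound at the trial point,
                    return W_new, theta        # hence L(W_new) <= surrogate <= L(W)
                theta *= factor                # otherwise increase the coefficient
            return W, theta

        # Toy usage: L(W) = 0.25 * ||W||^4, whose gradient ||W||^2 * W is not
        # globally Lipschitz, which is precisely the case where backtracking helps.
        L = lambda W: 0.25 * np.sum(W ** 2) ** 2
        grad_L = lambda W: np.sum(W ** 2) * W
        W = np.ones((3, 3))
        for k in range(20):
            W, theta = backtracking_update(L, grad_L, W)
        print(L(W), theta)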

    The proof approach used in both works is similar to that of Wang et al.'s nonconvex analysis, which is why we believe the latter establishes a solid foundation for ADMM applied to complicated problems such as training a DNN. These two works have in turn had a significant impact on subsequent research, such as ADMM applied to LSTM training, which follows almost the same approach.

    To conclude this section, we have discussed the basics of ADMM applied to three typical problems: optimizing convex and nonconvex objectives under linear constraints, and optimizing problems with nonlinear constraints. The development of this field shows a clear pattern. In the early stage of proving that ADMM converges under certain criteria, researchers relied heavily on the mathematical properties of the updating process, borrowing conclusions from control theory (Li's Lyapunov-based proof follows this style as well, although it was proposed in 2024). When it comes to nonconvex objectives, researchers started to depend on the conclusions of Attouch et al., proving properties such as sufficient descent and subgradient boundedness. Once ADMM is applied to even more complicated problems such as training DNNs, researchers first approximate the nonconvex subproblems to make them solvable, and introduce extra machinery, such as the backtracking algorithm or Lipschitz-constant estimates, to establish the convergence theorems. In the next sections, we take a glance at the variants of ADMM, which differ from the aforementioned standard algorithms and carry unique features.

    ADMM Variants

    Apart from the standard ADMM algorithms discussed in the previous section, there is a wide range of ADMM variants, including distributed ADMM, stochastic ADMM, adaptive ADMM, accelerated ADMM, and so on. We review some of the most representative works in the following subsections.

    Distributed ADMM

    Distributed ADMM can be divided into two types, synchronous and asynchronous; the difference is whether a master worker centralizes the update of the variables. Consider an objective that is splittable: \[ f(x) = \sum_{i = 1}^N f_i(x). \] Minimizing \( f(x) \) is equivalent to solving the following problem, reformulated from the original one: \[ \text{minimize } \sum_{i = 1}^N f_i(x_i) \text{ subject to } x_i = z \text{ for } i = 1, 2, \cdots, N. \] This leads to the following augmented Lagrangian: \[ \mathcal{L} = \sum_{i = 1}^N f_i(x_i) + \sum_{i = 1}^N \Big( \langle \lambda_i, x_i - z \rangle + \frac{\rho_i}{2} \Vert x_i - z \Vert^2 \Big). \] Suppose we have \( N + 1 \) machines that can communicate with one another efficiently. The first \( N \) machines are responsible for solving \[ x_i^k \gets \mathrm{argmin}_{x_i} \Big( f_i(x_i) + \langle \lambda_i^{k - 1}, x_i - z^{k - 1} \rangle + \frac{\rho_i}{2} \Vert x_i - z^{k - 1} \Vert^2 \Big), \] while the last machine solves for \( z \) in a centralized manner: \[ z^k \gets \mathrm{argmin}_z \sum_{i = 1}^N \Big( \langle \lambda_i^{k - 1}, x_i^k - z \rangle + \frac{\rho_i}{2} \Vert x_i^k - z \Vert^2 \Big). \] Then, on the first \( N \) machines, the dual variables are updated in parallel: \[ \lambda_i^k \gets \lambda_i^{k - 1} + \rho_i \big( x_i^k - z^k \big). \]
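
    A minimal synchronous sketch of this consensus scheme follows; the quadratic local objectives \( f_i(x_i) = \frac{1}{2} \Vert x_i - a_i \Vert^2 \) (whose consensus optimum is the mean of the \( a_i \)) and the uniform penalty \( \rho_i = \rho \) are our own choices for illustration. The lines marked "parallel" are independent across \( i \) and could run on separate machines.

        import numpy as np

        rng = np.random.default_rng(0)
        N, d, rho = 5, 3, 1.0
        a = rng.standard_normal((N, d))     # local data held by each worker
        x = np.zeros((N, d)); lam = np.zeros((N, d)); z = np.zeros(d)

        for k in range(100):
            # parallel: x_i <- argmin f_i(x_i) + <lam_i, x_i - z> + (rho/2)||x_i - z||^2
            x = (a - lam + rho * z) / (1.0 + rho)
            # central:  z <- argmin sum_i (<lam_i, x_i - z> + (rho/2)||x_i - z||^2)
            z = (x + lam / rho).mean(axis=0)
            # parallel: dual updates
            lam = lam + rho * (x - z)

        print(z, a.mean(axis=0))            # the two should agree closely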

    References

    1. Boyd S., et al. Convex Optimization (Chinese edition). Tsinghua University Press (2013): selected chapters.
    2. Slater M. Lagrange multipliers revisited. In: Traces and Emergence of Nonlinear Programming. Basel: Springer Basel (2013): 293-306.
    3. Gabay D. and Mercier B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications 2.1 (1976): 17-40.
    4. Nishihara R., et al. A General Analysis of the Convergence of ADMM. International Conference on Machine Learning, PMLR (2015).
    5. MIT. Integral Quadratic Constraints. Lecture notes (2009).
    6. Li Z. Research on the ADMM Algorithm for Solving Convex Splitting Problems. Huazhong University of Science and Technology (2024).
    7. Wang Y., Yin W. and Zeng J. Global Convergence of ADMM in Nonconvex Nonsmooth Optimization. Journal of Scientific Computing 78 (2019): 29-63.
    8. Attouch H., Bolte J. and Svaiter B. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming 137 (2013): 91-129.
    9. Wang J., Yu F., et al. ADMM for Efficient Deep Learning with Global Convergence. ACM (2019).
    10. Zeng J., Lin S., et al. On ADMM in Deep Learning: Convergence and Saturation-Avoidance. Journal of Machine Learning Research 22 (2021).
    11. Liu S., Kong Z., Huang T., et al. An ADMM-LSTM framework for short-term load forecasting. Neural Networks 173 (2024).