## Introduction to Gradient Descent

Steepest descent follows from minimising the first-order Taylor upper bound
$$f(x+\alpha d) \leq f(x)+\alpha \nabla f(x)^{T} d+\alpha^{2} \frac{L}{2}|d|_{2}^{2}$$
Proof:
$$\begin{aligned} f(x+\alpha d) &=f(x)+\alpha \nabla f(x)^{T} d+\alpha\left[\int_{0}^{1}\left(\nabla f(x+\eta \alpha d)-\nabla f(x)\right) d \eta\right]^{T} d \\ & \leq f(x)+\alpha \nabla f(x)^{T} d+\alpha\left[\int_{0}^{1}\left|\nabla f(x+\eta \alpha d)-\nabla f(x)\right|_{2} d \eta\right]|d|_{2} \\ & \leq f(x)+\alpha \nabla f(x)^{T} d+\alpha|d|_{2} \int_{0}^{1} L \alpha|d|_{2}\, \eta\, d \eta = f(x)+\alpha \nabla f(x)^{T} d+\alpha^{2} \frac{L}{2}|d|_{2}^{2} \end{aligned}$$
The step length $\alpha_{k}=L^{-1}$ for $d=-\nabla f(x)$ follows by minimising the upper bound $f(x)+\alpha \nabla f(x)^{T} d+\alpha^{2} \frac{L}{2}|d|^{2}$ over $\alpha$: setting its derivative to zero, $\nabla f(x)^{T} d+\alpha L|d|^{2}=0$, and solving for $\alpha$.

Corollary 1 ($f$ sufficient decrease): Steepest descent, $x^{k+1}=x^{k}+\alpha_{k} \phi^{k}$ with $\phi^{k}=-\nabla f\left(x^{k}\right)$ and $\alpha_{k}=L^{-1}$, satisfies
$$f\left(x^{k+1}\right)=f\left(x^{k}-L^{-1} \nabla f\left(x^{k}\right)\right) \leq f\left(x^{k}\right)-\frac{1}{2 L}\left|\nabla f\left(x^{k}\right)\right|_{2}^{2}$$
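As a quick numerical sanity check of this sufficient-decrease bound, the sketch below runs one steepest-descent step on an assumed quadratic $f(x)=\frac{1}{2}x^{T}Ax$; the matrix $A$ and starting point are illustrative choices, not from the notes.

```python
import numpy as np

# Sanity check of the sufficient-decrease bound on f(x) = 0.5 x^T A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
L = np.linalg.eigvalsh(A).max()          # grad f(x) = A x is L-Lipschitz

def f(x):
    return 0.5 * x @ A @ x

x = np.array([1.0, -2.0])
g = A @ x                                # gradient at x
x_next = x - g / L                       # steepest-descent step, alpha = 1/L

# f(x^{k+1}) <= f(x^k) - ||grad f(x^k)||_2^2 / (2L)
assert f(x_next) <= f(x) - (g @ g) / (2 * L) + 1e-12
```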
Convergence of steepest descent:

• Sublinear with rate $\mathcal{O}\left(k^{-1 / 2}\right)$ for general $f$,
• Sublinear with rate $\mathcal{O}\left(k^{-1}\right)$ for $f$ convex,
• $Q$-linear with rate $1-\gamma / L$ for $\gamma$-strongly convex $f$.
Theorem 1 (SD convergence for $f$ general): If $f$ is bounded below by $\bar{f}$, then steepest descent with $\alpha_{k}=L^{-1}$ satisfies
$$\min_{0 \leq k \leq T-1}\left|\nabla f\left(x^{k}\right)\right|_{2} \leq\left(\frac{2 L\left(f\left(x^{0}\right)-\bar{f}\right)}{T}\right)^{1 / 2} \quad \forall T \geq 1$$

Proof: By Corollary 1,
$$\sum_{k=0}^{T-1}\left|\nabla f\left(x^{k}\right)\right|_{2}^{2} \leq 2 L \sum_{k=0}^{T-1}\left[f\left(x^{k}\right)-f\left(x^{k+1}\right)\right]=2 L\left[f\left(x^{0}\right)-f\left(x^{T}\right)\right] \leq 2 L\left[f\left(x^{0}\right)-\bar{f}\right]$$
and
$$\min_{0 \leq k \leq T-1}\left|\nabla f\left(x^{k}\right)\right| \leq\left(\frac{1}{T} \sum_{k=0}^{T-1}\left|\nabla f\left(x^{k}\right)\right|^{2}\right)^{1 / 2} \leq\left(\frac{2 L\left(f\left(x^{0}\right)-\bar{f}\right)}{T}\right)^{1 / 2}$$
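The bound of Theorem 1 can be checked numerically; the sketch below tracks the best gradient norm over $T$ iterations on an assumed quadratic test problem (bounded below by $\bar{f}=0$), which is an illustrative choice of mine.

```python
import numpy as np

# Check of Theorem 1: min_{0<=k<=T-1} ||grad f(x^k)||_2 is bounded by
# sqrt(2 L (f(x^0) - fbar) / T), for every horizon T.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
L = np.linalg.eigvalsh(A).max()
f = lambda x: 0.5 * x @ A @ x            # bounded below by fbar = 0
grad = lambda x: A @ x

x = np.array([1.0, -2.0])
f0, fbar = f(x), 0.0
min_gnorm = np.inf
for T in range(1, 51):
    min_gnorm = min(min_gnorm, np.linalg.norm(grad(x)))
    bound = np.sqrt(2 * L * (f0 - fbar) / T)
    assert min_gnorm <= bound + 1e-12    # the theorem's bound holds at every T
    x = x - grad(x) / L                  # steepest-descent step with alpha = 1/L
```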

Theorem 2 (SD convergence for $f$ convex): If $f$ is convex with minimiser $x^{*}$, then steepest descent with $\alpha_{k}=L^{-1}$ satisfies $f\left(x^{T}\right)-f\left(x^{*}\right) \leq \frac{L}{2 T}\left|x^{0}-x^{*}\right|_{2}^{2}$.

Proof: Convexity and the sufficient-decrease bound of Corollary 1 give
$$\sum_{k=0}^{T-1}\left[f\left(x^{k+1}\right)-f\left(x^{*}\right)\right] \leq \frac{L}{2} \sum_{k=0}^{T-1}\left(\left|x^{k}-x^{*}\right|_{2}^{2}-\left|x^{k+1}-x^{*}\right|_{2}^{2}\right) \leq \frac{L}{2}\left|x^{0}-x^{*}\right|_{2}^{2}$$
We then use that $f\left(x^{T}\right) \leq f\left(x^{k+1}\right)$ for $k \in[0, T-1]$ and the above bound to obtain
$$f\left(x^{T}\right)-f\left(x^{*}\right) \leq T^{-1} \sum_{k=0}^{T-1}\left(f\left(x^{k+1}\right)-f\left(x^{*}\right)\right) \leq \frac{L}{2 T}\left|x^{0}-x^{*}\right|_{2}^{2}$$
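A quick check of the $\mathcal{O}(k^{-1})$ rate on an assumed quadratic with minimiser $x^{*}=0$; this test problem is in fact strongly convex, so the convex-case bound must hold a fortiori.

```python
import numpy as np

# Check of f(x^T) - f(x^*) <= (L / 2T) ||x^0 - x^*||_2^2 with x^* = 0.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
L = np.linalg.eigvalsh(A).max()
f = lambda x: 0.5 * x @ A @ x            # f(x^*) = 0

x0 = np.array([1.0, -2.0])
x = x0.copy()
for T in range(1, 51):
    x = x - (A @ x) / L                  # after this line, x equals x^T
    assert f(x) <= L / (2 * T) * (x0 @ x0) + 1e-12
```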

Theorem 3 (SD convergence for $\gamma$-strongly convex $f$): If $f$ is $\gamma$-strongly convex then there exists a unique minimiser $x^{*}$ and $f\left(x^{k+1}\right)-f\left(x^{*}\right) \leq(1-\gamma / L)\left(f\left(x^{k}\right)-f\left(x^{*}\right)\right)$ for all $k$.

Proof: Strong convexity (Proposition 2) gives
$$f(y) \geq f\left(x^{k}\right)+\nabla f\left(x^{k}\right)^{T}\left(y-x^{k}\right)+\frac{\gamma}{2}\left|y-x^{k}\right|_{2}^{2}$$
Therefore $\min_{y} f(y)$ is bounded below by the minimum of the above lower bound, which is attained at $y^{*}=x^{k}-\gamma^{-1} \nabla f\left(x^{k}\right)$:
$$\begin{aligned} f\left(x^{*}\right) & \geq \min_{y} f(y) \geq f\left(x^{k}\right)+\nabla f\left(x^{k}\right)^{T}\left(y^{*}-x^{k}\right)+\frac{\gamma}{2}\left|y^{*}-x^{k}\right|_{2}^{2} \\ & = f\left(x^{k}\right)-\frac{1}{2 \gamma}\left|\nabla f\left(x^{k}\right)\right|_{2}^{2} \end{aligned}$$
and consequently $\left|\nabla f\left(x^{k}\right)\right|_{2}^{2} \geq 2 \gamma\left(f\left(x^{k}\right)-f\left(x^{*}\right)\right)$. Corollary 1 then gives
$$f\left(x^{k+1}\right)-f\left(x^{*}\right) \leq f\left(x^{k}\right)-f\left(x^{*}\right)-\frac{\gamma}{L}\left(f\left(x^{k}\right)-f\left(x^{*}\right)\right)=(1-\gamma / L)\left(f\left(x^{k}\right)-f\left(x^{*}\right)\right)$$
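The Q-linear contraction of Theorem 3 can be observed directly; the sketch below uses an assumed $\gamma$-strongly convex quadratic, whose constants $\gamma$ and $L$ are the extreme eigenvalues of its Hessian.

```python
import numpy as np

# Observing f(x^{k+1}) - f(x^*) <= (1 - gamma/L)(f(x^k) - f(x^*)).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
eigs = np.linalg.eigvalsh(A)             # ascending eigenvalues of the Hessian
gamma, L = eigs.min(), eigs.max()        # strong-convexity / smoothness constants
f = lambda x: 0.5 * x @ A @ x            # minimiser x^* = 0, f(x^*) = 0

x = np.array([1.0, -2.0])
rate = 1 - gamma / L
for _ in range(30):
    x_next = x - (A @ x) / L             # steepest-descent step, alpha = 1/L
    assert f(x_next) <= rate * f(x) + 1e-12
    x = x_next
```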

(P) $\min_{x} f(x)$ subject to $x \in \mathcal{F}$, with $f$ smooth and $\nabla f(x)$ $L$-Lipschitz continuous. Line-search methods iteratively update estimates $x^{k}$ of the minimiser,
$$x^{k+1}=x^{k}+\alpha_{k} \phi^{k}$$
with $\alpha_{k}$ the step length and $\phi^{k}$ the search direction. Steepest descent methods use some form of the negative gradient as the search direction, $\phi^{k}=-\nabla f\left(x^{k}\right)$, and can use step length $\alpha_{k}=L^{-1}$.
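The iteration above, with the steepest-descent direction and $\alpha_{k}=L^{-1}$, can be sketched in a few lines; the quadratic in the usage example is an illustrative assumption, not from the notes.

```python
import numpy as np

def steepest_descent(grad, x0, L, num_iters=100):
    """Line-search iteration x^{k+1} = x^k + alpha_k * phi^k with the
    steepest-descent direction phi^k = -grad f(x^k) and alpha_k = 1/L."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - grad(x) / L
    return x

# Usage on an assumed quadratic f(x) = 0.5 x^T A x, whose unique minimiser is 0:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
L = np.linalg.eigvalsh(A).max()          # Lipschitz constant of grad f(x) = A x
x_final = steepest_descent(lambda x: A @ x, [1.0, -2.0], L, num_iters=200)
assert np.linalg.norm(x_final) < 1e-8    # converged to the minimiser
```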


## Step Lengths in Gradient Descent

A simplified line-search method was proposed in [4]. At each iteration it runs a single line search on one component function to estimate the Lipschitz constant $L$, and then picks the theoretically best step length, e.g. $1/L$; the extra computation this adds is relatively acceptable.

[5] proposes a probabilistic line search. The basic idea is to evaluate one component function at each iterate, fit a Gaussian process to these values to obtain an estimate of the objective at each point, and then use that estimate to carry out the line search. I have not experimented with this method myself, but my impression is that the computational cost is still on the high side (besides computing a function value at every iteration, a Gaussian process has to be repeatedly evaluated and updated), and I am not sure it always converges.
[6] is about step-size estimation for SGD; the paper came out only a few months ago. The rough idea is to assume the step size follows some simple function, e.g. $C/k$ (since SGD requires a diminishing step size to guarantee convergence), where $k$ is the iteration index and $C$ is a constant to be estimated. At each iteration, an online gradient-descent update is used to refine the estimate of $C$. However, the experiments in this paper are extremely simple, so it is unclear how well the method really works.

1) In the general convex, continuously differentiable case, I like to use the Goldstein-Armijo rule to find the step length. This is similar to a line search, except that an exact line search seeks stepsize $=\arg \min_{h \geq 0} f\left(x_{k}-h \nabla f\left(x_{k}\right)\right)$.

2) In the general convex, continuously differentiable case, there is also a very good method called the BB rule (Barzilai-Borwein), which is likewise very fast; its drawback is that it cannot guarantee monotone decrease of the function value. I have tried using it together with 1) with fairly good results.
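A minimal backtracking sketch of the Armijo sufficient-decrease condition mentioned in 1); the parameters `c` and `beta` are conventional choices of mine, not taken from the text, and the quadratic in the usage example is likewise an assumption.

```python
import numpy as np

def armijo_step(f, grad_x, x, d, alpha0=1.0, c=1e-4, beta=0.5):
    """Shrink alpha until f(x + alpha d) <= f(x) + c * alpha * grad_x^T d."""
    alpha = alpha0
    fx = f(x)
    while f(x + alpha * d) > fx + c * alpha * (grad_x @ d):
        alpha *= beta                     # backtrack on the step length
    return alpha

# Usage on an assumed quadratic f(x) = 0.5 x^T A x:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, -2.0])
g = A @ x
alpha = armijo_step(f, g, x, -g)          # step along the steepest-descent direction
assert f(x - alpha * g) <= f(x)           # monotone decrease, unlike the BB rule
```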

## Convergence of Gradient Descent

Consider the unconstrained problem
$$\min f(x)$$
with $f$ twice differentiable, strongly convex and smooth:
$$\exists m>0: \nabla^{2} f(x)-m I \succeq 0, \qquad \exists M>0: M I-\nabla^{2} f(x) \succeq 0$$
Strong convexity gives
$$f(y) \geq f(x)+\nabla f^{T}(x)(y-x)+\frac{m}{2}|y-x|_{2}^{2} \quad (1)$$
Minimising the right-hand side over $y$ (at $y=x-m^{-1} \nabla f(x)$) and using $\min_{y} f(y)=p$ bounds the optimal value:
$$p+\frac{|\nabla f(x)|_{2}^{2}}{2 m} \geq f(x) \geq p$$

For the step $x^{k+1}=x^{k}-\alpha \nabla f\left(x^{k}\right)$, smoothness gives the upper bound
$$f\left(x^{k+1}\right) \leq U(\alpha)=f\left(x^{k}\right)-\alpha\left|\nabla f\left(x^{k}\right)\right|_{2}^{2}+\frac{M \alpha^{2}\left|\nabla f\left(x^{k}\right)\right|_{2}^{2}}{2}$$
Minimising $U(\alpha)$ over $\alpha$ gives $\alpha=1 / M$ and
$$f\left(x^{k+1}\right) \leq f\left(x^{k}\right)-\frac{\left|\nabla f\left(x^{k}\right)\right|_{2}^{2}}{2 M}$$
From the bound on $p$ above,
$$f\left(x^{k}\right)-\frac{\left|\nabla f\left(x^{k}\right)\right|_{2}^{2}}{2 m}-p \leq 0$$
Subtracting $p$ from the sufficient-decrease bound,
$$f\left(x^{k+1}\right)+\frac{\left|\nabla f\left(x^{k}\right)\right|_{2}^{2}}{2 M}-p \leq f\left(x^{k}\right)-p$$
and combining the two inequalities yields the linear contraction
$$f\left(x^{k+1}\right)-p \leq\left(1-\frac{m}{M}\right)\left(f\left(x^{k}\right)-p\right)$$
