Spherical Motion Dynamics

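This note describes how, when a normalized neural network is trained under the constraint of Weight Decay, the weight norm and the angular update converge at a linear rate to equilibrium values determined by the hyperparameters.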

Basic assumptions

Hypothesis 1. Stability

  • The learning rate is much smaller than 1: $\eta \ll 1$.
  • At equilibrium (that is, once the weight norm has converged), $\|w_{t}\|\simeq\|w_{t+1}\|$.

Hypothesis 2. Scale invariance

$$\forall k > 0,\quad \mathcal{L}(kw) = \mathcal{L}(w)$$

Two basic properties follow from this assumption.

Property 1. The weight vector is orthogonal to the gradient of the loss:

$$\left<w_t,\nabla_{w}\mathcal{L}(w_t) \right> = 0$$

Proof:
By scale invariance, $f(k) = \mathcal{L}(kw)$ is a constant function of $k$, so

$$\frac{\mathrm{d} \mathcal{L}(kw)}{\mathrm{d} k } =\sum_{i}\frac{\partial \mathcal{L}(kw)}{\partial (kw_i)}\frac{\mathrm{d}(kw_i)}{\mathrm{d}k} = \sum_i \frac{\partial \mathcal{L}(kw)}{\partial (kw_i)} \cdot w_i = 0$$

Setting $k = 1$ gives $\left<w,\nabla_w\mathcal{L}(w)\right> = 0$. This holds for every $w$, in particular for $w = w_t$.
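As a quick numerical sanity check of Property 1, the sketch below uses a hypothetical scale-invariant loss $\mathcal{L}(w) = (a\cdot w)^2/\|w\|^2$. This loss, the vector `a`, and the helper names `dot`, `loss`, and `grad` are illustrative assumptions and do not come from the text. It checks that rescaling the weights leaves the loss unchanged and that the inner product of the weights with the analytic gradient vanishes up to rounding.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def loss(w, a):
    """Hypothetical scale-invariant loss (a.w)^2 / ||w||^2."""
    return dot(a, w) ** 2 / dot(w, w)

def grad(w, a):
    """Analytic gradient of the loss above with respect to w."""
    aw, ww = dot(a, w), dot(w, w)
    return [2 * aw * ai / ww - 2 * aw ** 2 * wi / ww ** 2
            for ai, wi in zip(a, w)]

a = [1.0, -2.0, 0.5]   # arbitrary fixed direction (assumption)
w = [0.3, 0.7, -1.1]   # arbitrary nonzero weights (assumption)

# Scale invariance: L(3w) should equal L(w).
scale_gap = abs(loss([3.0 * wi for wi in w], a) - loss(w, a))
# Property 1: <w, grad L(w)> should be zero.
inner = dot(w, grad(w, a))
print(f"|L(3w) - L(w)| = {scale_gap:.2e}, <w, grad L(w)> = {inner:.2e}")
```

Both printed quantities should be at the level of floating-point rounding error.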

Corollary 1. Without Weight Decay, the weight norm never decreases, and it strictly increases whenever the gradient is nonzero.

Proof:

$$\begin{aligned} \|w_{t+1}\|^2&=\|w_t-\eta \nabla_w\mathcal{L}(w_t)\|^2\\ &= \|w_t\|^2 +\|\eta\nabla_w\mathcal{L}(w_t)\|^2 -2\eta\left<w_t,\nabla_w\mathcal{L}(w_t)\right> \\ &=\|w_t\|^2 +\eta^2\|\nabla_w\mathcal{L}(w_t)\|^2 \geq \|w_t\|^2 \end{aligned}$$

where the cross term vanishes by Property 1. This shows that, without Weight Decay, the weight norm can hardly converge within a finite number of gradient descent steps.
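The monotone growth in Corollary 1 can be observed numerically. The following is a minimal sketch, assuming the same hypothetical scale-invariant loss $(a\cdot w)^2/\|w\|^2$ as a stand-in for a normalized network. The loss, the starting point, and the function name `norms_without_weight_decay` are assumptions made for illustration.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def grad(w, a):
    """Gradient of the hypothetical loss (a.w)^2 / ||w||^2."""
    aw, ww = dot(a, w), dot(w, w)
    return [2 * aw * ai / ww - 2 * aw ** 2 * wi / ww ** 2
            for ai, wi in zip(a, w)]

def norms_without_weight_decay(w, a, eta, steps):
    """Run plain gradient descent (no weight decay) and record ||w_t||."""
    norms = [math.sqrt(dot(w, w))]
    for _ in range(steps):
        g = grad(w, a)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
        norms.append(math.sqrt(dot(w, w)))
    return norms

norms = norms_without_weight_decay(
    w=[0.3, 0.7, -1.1], a=[1.0, -2.0, 0.5], eta=0.1, steps=50
)
print(f"initial norm {norms[0]:.4f}, final norm {norms[-1]:.4f}")
```

The recorded norms should form a non-decreasing sequence, since each step adds $\eta^2\|g_t\|^2$ to the squared norm.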

Property 2. Gradient homogeneity

$$\nabla_{kw}\mathcal{L}(kw) = \frac{1}{k} \nabla_w \mathcal{L}(w)$$

Proof:

$$\nabla_w \mathcal{L}(w) = \nabla_w\mathcal{L}(kw) = \left(\frac{\partial \mathcal{L}(kw)}{\partial w_i}\right)_{i\leq n} = k\left(\frac{\partial \mathcal{L}(kw)}{\partial (kw_i)}\right)_{i\leq n} =k\nabla_{kw}\mathcal{L}(kw)$$

Dividing by $k$ gives the claim. Equivalently, writing $\nabla\mathcal{L}(\cdot)$ for the gradient evaluated at its argument,

$$\nabla \mathcal{L}(kw) = \frac{1}{k}\nabla\mathcal{L}(w)$$
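Property 2 can also be checked without an analytic gradient. The sketch below uses central finite differences on a hypothetical scale-invariant loss $(a\cdot w)^2/\|w\|^2$. The loss, the step size `h`, and the names `loss` and `numerical_grad` are assumptions for illustration. It compares the gradient at $kw$ with $1/k$ times the gradient at $w$.

```python
def loss(w, a):
    """Hypothetical scale-invariant loss (a.w)^2 / ||w||^2."""
    aw = sum(ai * wi for ai, wi in zip(a, w))
    ww = sum(wi * wi for wi in w)
    return aw * aw / ww

def numerical_grad(f, w, h=1e-6):
    """Central finite-difference gradient of f at the point w."""
    g = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        g.append((f(up) - f(down)) / (2 * h))
    return g

a = [1.0, -2.0, 0.5]   # arbitrary fixed direction (assumption)
w = [0.3, 0.7, -1.1]   # arbitrary nonzero weights (assumption)
k = 4.0                # arbitrary positive scale (assumption)

f = lambda v: loss(v, a)
g_w = numerical_grad(f, w)
g_kw = numerical_grad(f, [k * wi for wi in w])

# Property 2 predicts grad L(kw) = grad L(w) / k, component by component.
max_gap = max(abs(gk - gw / k) for gk, gw in zip(g_kw, g_w))
print(f"max |grad L(kw) - grad L(w)/k| = {max_gap:.2e}")
```

The gap should be limited only by the finite-difference error.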

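Objects of analysis

The main objects of analysis are the gradient used by the optimizer and the angle between the weight vectors before and after an update. We use the following definitions.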

Definition 1. Normalized gradient and normalized learning rate of SGD

This computation relies on the assumption that the norm of the weight vector has already converged, i.e. $\|w_{t+1}\| \simeq \|w_t\|$.

For plain stochastic gradient descent,

$$w_{t+1} = w_t - \eta \nabla_w\mathcal{L}(w_t):=w_t - \eta\cdot g_t$$

Take the normalized weight vector $\tilde{w}_t = \dfrac{w_t}{\|w_t\|}$. By Property 2 and the convergence of the norm,

$$\tilde{w}_{t+1} \simeq \tilde{w}_{t} - \frac{\eta}{\|w_t\|} \nabla_w \mathcal{L}(w_t) = \tilde{w}_t - \frac{\eta}{\|w_t\|^2} \nabla\mathcal{L}(\tilde{w}_t):= \tilde{w}_t -\frac{\eta}{\|w_t\|^2}\cdot \tilde{g}_t$$

Here $\tilde{g}_t = \nabla\mathcal{L}(\tilde{w}_t) = \|w_t\|\, g_t$ is the normalized gradient, and $\eta_{err} = \dfrac{\eta}{\|w_t\|^2}$ is the corrected learning rate.

Definition 2. Angular update

$$\Delta_t = \arccos \left(\frac{\left< w_t,w_{t+1}\right>}{\|w_t\|\|w_{t+1}\|}\right)$$

For a sufficiently small learning rate $\eta$ and $\|w_t\|\simeq \|w_{t+1}\|$, we have

$$\Delta_t \simeq \frac{\|\eta\cdot g_t\|}{\|w_t\|} = \frac{\eta}{\|w_t\|}\|\nabla\mathcal{L}(w_t)\|$$
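To illustrate Definition 2, the sketch below takes one plain SGD step with a hand-picked gradient that is orthogonal to the weights (as Property 1 requires). It compares the exact angle from the arccos formula with the small-step approximation $\eta\|g_t\|/\|w_t\|$. The concrete vectors and the learning rate are assumptions chosen only for illustration.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

eta = 0.1          # learning rate (assumption)
w = [3.0, 4.0]     # weights with ||w|| = 5 (assumption)
g = [-0.8, 0.6]    # unit-norm gradient chosen so that <w, g> = 0 (assumption)

# One plain SGD step without weight decay.
w_next = [wi - eta * gi for wi, gi in zip(w, g)]

# Exact angular update from the arccos definition.
cos_delta = dot(w, w_next) / math.sqrt(dot(w, w) * dot(w_next, w_next))
exact = math.acos(min(1.0, cos_delta))

# Small-step approximation eta * ||g|| / ||w||.
approx = eta * math.sqrt(dot(g, g)) / math.sqrt(dot(w, w))

print(f"exact = {exact:.8f}, approx = {approx:.8f}")
```

Because the gradient is orthogonal to the weights, the exact angle is the arctangent of the ratio $\eta\|g_t\|/\|w_t\|$, so the approximation overestimates it only by a cubic correction in that ratio.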

Core Concept

Stability of SGD

Theorem 1. Equilibrium of the weight norm under SGD

Consider SGD with Weight Decay (WD):

$$w_{t+1} = w_{t} - \eta(g_t+\lambda w_t)$$

Taking the squared norm, and using $\left<w_t, g_t\right> = 0$ (Property 1) and $\|g_t\| = \|\tilde{g}_t\| / \|w_t\|$ (Property 2), gives

$$\begin{aligned} \|w_{t+1}\|^2 &= \|w_t\|^2 +\eta^2 \|g_t+\lambda w_t\|^2 - 2\eta \left<w_t, g_t+\lambda w_t\right>\\ & = \|w_t\|^2 +\eta^2\|g_t\|^2 + \eta^2\lambda^2 \|w_t\|^2 -2\eta\lambda \|w_t\|^2\\ & = \left(1-\eta\lambda \right)^2\|w_t\|^2 +\frac{\eta^2\|\tilde{g}_t\|^2}{\|w_t\|^2} \end{aligned}$$

Let $x_t = \|w_t\|^2$. We need a lower bound $\|\tilde{g}_t\|^2> l$ so that the numerator becomes a constant independent of $t$. Since $(1-\eta\lambda)^2 \geq 1-2\eta\lambda$, this gives

$$x_{t+1} \geq (1-2\eta\lambda)x_t + \frac{\eta^2 l}{x_t}$$

For the recursion

$$x_{t+1} \geq Ax_t + \frac{B}{x_t}\quad 0<A<1,\ B>0$$

the positive fixed point satisfies

$$x^* = Ax^* + \frac{B}{x^*}$$

that is,

$$x^* = \sqrt{\frac{B}{1-A}}$$

For sufficiently large $t$ there are two cases.

  • $\forall t,\ x_t< x^*$

$$\begin{aligned} x^* -x_{t+1}&\leq x^* -Ax_t -\frac{B}{x_t}\\ & = x^* -Ax_t - \frac{(1-A)x^{*2}}{x_t}\\ & = x^* -x_t +(1-A)x_t -\frac{(1-A)x^{*2}}{x_t}\\ & = (x^*-x_t) \left(1-\frac{(1-A)(x_t+x^*)}{x_t}\right)\\ & = (x^*-x_t) \left(A -\frac{(1-A)x^*}{x_t} \right)\leq A(x^*-x_t) \end{aligned}$$

The last inequality uses $x^* - x_t > 0$. Therefore

$$x_t \geq x^* - A^{t-1}|x^*-x_1|$$

  • $\exists t,\ x_t> x^*$

$$x_t>x^*> x^* - A^{t-1}|x^*-x_1|$$

With $A = 1-2\eta\lambda$ and $B = \eta^2 l$, the fixed point for SGD is therefore

$$x^* = \sqrt{\frac{\eta l }{2\lambda}}$$

$$\|w_t\| = \sqrt[4]{\frac{\eta l }{2\lambda}}$$

and there is a sufficiently large $t_0$ such that for $t>t_0$

$$x_t > \sqrt{\frac{\eta l}{2\lambda}} - (1-2\eta\lambda)^{t-1}\left|\sqrt{\frac{\eta l}{2\lambda}} - x_1\right|$$
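The fixed point and the geometric lower bound can be illustrated by iterating the recursion with equality, $x_{t+1} = Ax_t + B/x_t$. The following is a minimal sketch. The values of `eta`, `lam`, `l`, the starting point, and the function name `iterate_norm_recursion` are assumptions chosen for illustration.

```python
import math

def iterate_norm_recursion(x0, A, B, steps):
    """Iterate x_{t+1} = A * x_t + B / x_t and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(A * x + B / x)
    return xs

eta, lam, l = 0.1, 0.05, 1.0            # hyperparameters and gradient bound (assumptions)
A, B = 1 - 2 * eta * lam, eta ** 2 * l  # coefficients of the recursion
x_star = math.sqrt(B / (1 - A))         # predicted fixed point sqrt(eta * l / (2 * lam))

xs = iterate_norm_recursion(x0=0.5, A=A, B=B, steps=2000)

# Check x_t >= x* - A^(t-1) |x* - x_1|, with xs[0] playing the role of x_1.
bound_holds = all(
    x >= x_star - A ** i * abs(x_star - xs[0]) - 1e-12
    for i, x in enumerate(xs)
)
print(f"x* = {x_star:.6f}, final x = {xs[-1]:.6f}, bound holds: {bound_holds}")
```

Starting below the fixed point, the trajectory should rise monotonically toward $x^*$ while staying above the geometric lower bound.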

Now consider the recursion

$$x_{t+1} = (1-2\eta\lambda)x_t+\frac{L_t\eta^2}{x_t}$$

where $L_t = \|\tilde{g}_t\|^2$. The true fixed point is determined by the conditional expectation

$$\mathbb{E}[x_{t+1}\big| x_t] = (1-2\eta\lambda)x_t+\frac{\eta^2\,\mathbb{E}[L_t\big| x_t]}{x_t}$$

This yields the second stability requirement: $\mathbb{E}[L_t\big| x_t] = L$, i.e. the expectation of $\|\tilde{g}_t\|^2$ is stable. Under this requirement we obtain

$$x^* = (1-2\eta\lambda )x^* +\frac{L\eta^2}{x^*}$$

$$x^* = \sqrt{\frac{\eta L}{2\lambda}}$$

Following the idea of best mean-square estimation, we try to show that the second moment of the deviation of the weight norm from $x^*$ converges linearly:

$$\begin{aligned} \mathbb{E}\left[(x_{t+1}-x^*)^2\big| x_t\right] &= \mathbb{E}\left[ \left(\left(1-2\eta\lambda -\frac{L\eta^2}{x_tx^*}\right)(x_t-x^*)+\eta^2\frac{L_t-L}{x_t}\right)^2\Big | x_t\right]\\ &= \left(1-2\eta\lambda-\frac{L\eta^2}{x_tx^*}\right)^2(x_t-x^*)^2 +2\eta^2\left(1-2\eta\lambda-\frac{L\eta^2}{x_tx^*}\right)(x_t-x^*)\,\mathbb{E}\left[\frac{L_t-L}{x_t}\Big|x_t\right] + \mathbb{E}\left[\left(\eta^2\frac{L_t-L}{x_t}\right)^2\Big| x_t\right]\\ &= \left(1-2\eta\lambda-\frac{L\eta^2}{x_tx^*}\right)^2(x_t-x^*)^2 + \frac{\eta^4}{x_t^2} \mathbb{E}[(L_t-L)^2\big| x_t] \end{aligned}$$

The cross term vanishes because $\mathbb{E}[L_t - L \mid x_t] = 0$. We now need the second central moment of the gradient norm to be stable, i.e. $\mathbb{E}[(L_t-L)^2\big| x_t] = V$; this is the third stability condition.

Under these three conditions, and using $x_t^2 \geq \dfrac{\eta l}{4\lambda}$ for sufficiently large $t$ (which follows from the lower bound on $x_t$ above), we always have

$$\begin{aligned} \mathbb{E}\left[(x_{t+1}-x^*)^2\big| x_t\right]&\leq (1-2\eta\lambda)^2 (x^*-x_t)^2+\frac{4\lambda V\eta^3}{l} \end{aligned}$$

For the second central moment of the weight norm over the whole trajectory, the law of total expectation gives

$$\begin{aligned} \mathbb{E}[(x_{t+1}-x^*)^2] & = \mathbb{E}[\mathbb{E}[(x_{t+1}-x^*)^2\big | x_t]]\\ &\leq (1-2\eta\lambda)^2 \mathbb{E}[(x^*-x_t)^2]+\frac{4\lambda V\eta^3}{l} \end{aligned}$$

For $t< t_0$ the expectation is always finite, so there exist finite constants $N, K$ such that

$$\mathbb{E}[(x_{t}-x^*)^2] < (1-2\eta\lambda )^{2t}N +K$$

This shows that, after some sufficiently large time, WD-SGD converges to the equilibrium at a linear rate, up to the noise floor $K$. The conditions under which WD-SGD reaches equilibrium are

$$\begin{dcases} \begin{aligned} &\text{(lower bound on the gradient norm)} &\quad \exists\, l>0 &\quad L_t>l \\[2mm] &\text{(stable mean and variance of the gradient norm)} &\quad \exists\, L,V>0 &\quad \mathbb{E}[L_t\mid x_t]=L,\ \mathbb{E}[(L_t-L)^2\mid x_t]=V \end{aligned} \end{dcases}$$
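The stochastic recursion for $x_t = \|w_t\|^2$ can be simulated directly to see the equilibrium $\sqrt{\eta L/(2\lambda)}$ and the residual noise floor. The sketch below assumes that $L_t$ is drawn independently and uniformly from an interval `[low, high]`, which satisfies the three stability conditions by construction. This noise model, the hyperparameter values, and the function name `simulate_noisy_norm` are assumptions and are not taken from the text.

```python
import math
import random

def simulate_noisy_norm(eta, lam, low, high, x0, steps, seed=0):
    """Iterate x_{t+1} = (1 - 2*eta*lam) * x_t + L_t * eta^2 / x_t,
    with L_t ~ Uniform(low, high) drawn independently at every step."""
    rng = random.Random(seed)
    x = x0
    xs = []
    for _ in range(steps):
        L_t = rng.uniform(low, high)
        x = (1 - 2 * eta * lam) * x + L_t * eta ** 2 / x
        xs.append(x)
    return xs

eta, lam = 0.1, 0.05               # learning rate and weight decay (assumptions)
low, high = 0.5, 1.5               # range of L_t, so l = 0.5 (assumption)
L = 0.5 * (low + high)             # mean of L_t
x_star = math.sqrt(eta * L / (2 * lam))

xs = simulate_noisy_norm(eta, lam, low, high, x0=4.0, steps=20000)

# Statistics over the tail of the trajectory, after the transient has died out.
tail = xs[-5000:]
tail_mean = sum(tail) / len(tail)
tail_mse = sum((x - x_star) ** 2 for x in tail) / len(tail)
print(f"x* = {x_star:.4f}, tail mean = {tail_mean:.4f}, tail MSE = {tail_mse:.2e}")
```

The tail mean should sit close to $x^*$, and the tail mean-square error should stay small but nonzero, which corresponds to the constant $K$ in the bound.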

Theorem 2. Angular update rate of SGD

Once the normalized gradient norm $\|\tilde{g}_t\|_2$ has stabilized, the angular update converges at a linear rate to a fixed value, i.e.

$$|\Delta_t-\sqrt{2\lambda\eta}|\leq C(1-\eta\lambda)^t$$

For the update

$$w_{t+1} = w_{t} - \eta(g_t+\lambda w_t)$$

Property 1 gives

$$\left<w_{t+1},w_t\right> = \|w_t\|^2 - \eta\lambda \|w_t\|^2 = (1-\eta\lambda )\|w_t\|^2$$

so that

$$\begin{aligned} \cos^2 \Delta_t &= \frac{\left<w_{t+1},w_t\right>^2}{\|w_t\|^2\|w_{t+1}\|^2}\\ &= \frac{(1-\eta\lambda)^2 \|w_t\|^4}{\|w_t\|^2\|w_{t+1}\|^2}\\ &= (1-\eta\lambda)^2\frac{\|w_t\|^2}{\|w_{t+1}\|^2}\\ &\sim(1-2\eta\lambda)\frac{\|w_t\|^2}{\|w_{t+1}\|^2} \end{aligned}$$

By the norm stability established in Theorem 1,

$$\left|\cos^2\Delta_t - 1+2\eta\lambda\right| \simeq (1-2\eta\lambda)\left|\frac{\|w_t\|^2}{\|w_{t+1}\|^2} - 1\right|= \mathcal{O}\left((1-4\eta\lambda)^t\right)$$

$$\begin{aligned} |\cos^2 \Delta_t - 1+2\eta\lambda| &= \left|\sin^2\Delta_t - 2\eta\lambda\right| \\ &\sim |\Delta_t^2-2\eta\lambda| \\ &= \left(\Delta_t+\sqrt{2\eta\lambda}\right)\left|\Delta_t - \sqrt{2\eta\lambda}\right| \end{aligned}$$

Since $\Delta_t+\sqrt{2\eta\lambda} \geq \sqrt{2\eta\lambda}$ is bounded away from zero, it follows that

$$|\Delta_t-\sqrt{2\eta\lambda}| = \mathcal{O}\left((1-4\eta\lambda)^t\right)$$
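The limiting angular update $\sqrt{2\eta\lambda}$ can be observed in a small simulation of SGD with Weight Decay. The sketch below replaces the true loss gradient with a synthetic one: a random direction orthogonal to $w_t$ (Property 1) whose norm scales as $1/\|w_t\|$ (Property 2), so that $\|\tilde{g}_t\|$ is constant. This gradient model, the hyperparameter values, and the names `synthetic_gradient` and `angular_updates` are assumptions made for illustration.

```python
import math
import random

def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def synthetic_gradient(w, rng, grad_scale=1.0):
    """Random gradient orthogonal to w with norm grad_scale / ||w||.
    This is a hypothetical stand-in for the gradient of a scale-invariant loss."""
    r = [rng.gauss(0.0, 1.0) for _ in w]
    coef = dot(r, w) / dot(w, w)
    tangent = [ri - coef * wi for ri, wi in zip(r, w)]  # remove the component along w
    scale = grad_scale / (norm(tangent) * norm(w))
    return [scale * ti for ti in tangent]

def angular_updates(eta, lam, dim=16, steps=3000, seed=0):
    """Run w <- w - eta * (g + lam * w) and record the angle of each step."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    deltas = []
    for _ in range(steps):
        g = synthetic_gradient(w, rng)
        w_next = [wi - eta * (gi + lam * wi) for wi, gi in zip(w, g)]
        cos_delta = dot(w, w_next) / (norm(w) * norm(w_next))
        deltas.append(math.acos(max(-1.0, min(1.0, cos_delta))))
        w = w_next
    return deltas

eta, lam = 0.1, 0.05                    # learning rate and weight decay (assumptions)
deltas = angular_updates(eta, lam)
predicted = math.sqrt(2 * eta * lam)    # equilibrium angular update from Theorem 2
print(f"first = {deltas[0]:.5f}, last = {deltas[-1]:.5f}, predicted = {predicted:.5f}")
```

Under this synthetic model the angular update should start away from the predicted value and settle near $\sqrt{2\eta\lambda}$, independently of the gradient scale.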