Gaussian Mixture Models Solved by EM — Maximum Likelihood Estimation of the Q Function (Part 8)
Prerequisite: Mixture Density Models Solved by EM — the Q Function
$$p(\boldsymbol{x} \mid \boldsymbol{\theta}_k) \rightarrow N(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
The derivation above yields the EM algorithm for Gaussian mixture models: in each iteration, first use the current parameters to compute the posterior probability $\gamma_k(\boldsymbol{x}_i)$ that each sample belongs to each Gaussian component (E step), then update the model parameters $\{(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\}_{k=1}^{K}$ (M step).
E step:
$$\gamma_k^{(t)}(\boldsymbol{x}_i) = \frac{\pi_k^{(t)}\, N(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum\limits_{j=1}^{K} \pi_j^{(t)}\, N(\boldsymbol{x}_i \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$$
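As a concrete illustration, the E step above can be sketched in NumPy. The multivariate normal density is written out by hand so the snippet stays self-contained; the function names `gaussian_pdf` and `e_step` are my own, not from the text:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density N(x | mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    # Batched quadratic form (x - mu)^T Sigma^{-1} (x - mu) for every row
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def e_step(X, pi, mu, Sigma):
    """Responsibilities gamma[i, k] = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)."""
    n, K = X.shape[0], len(pi)
    gamma = np.empty((n, K))
    for k in range(K):
        gamma[:, k] = pi[k] * gaussian_pdf(X, mu[k], Sigma[k])
    return gamma / gamma.sum(axis=1, keepdims=True)  # each row sums to 1
```

Each row of the returned matrix is a distribution over the $K$ components, matching the normalization in the formula.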
Observe that, by the definition of the expectation of a discrete random variable, the substituted expression is in fact the expectation of the original log-likelihood (the Q function):
$$Q(\theta \mid X, \theta^{(t)}) = \sum\limits_{j=1}^{n} \sum\limits_{k=1}^{K} \gamma(z_{jk}) \bigl( \ln \pi_k + \ln N(\boldsymbol{x}_j \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \bigr)$$
Here $\theta^{(t)}$ denotes the current model parameters. By Eq. (10), the $\gamma(z_{jk})$ above can be computed from the current model parameters alone and involves no latent variables.
Because a Gaussian mixture model is probabilistic, it cannot state with certainty that a sample belongs to a particular cluster. We therefore replace the latent variable $z_{ik}$ with the responsibility introduced earlier, $\gamma(z_{ik}) = P(z_{ik} = 1) = P(z_k = 1 \mid \boldsymbol{x}_i)$, i.e., the probability that sample $\boldsymbol{x}_i$ belongs to the $k$-th cluster; this acts as a soft version of the binary latent variable.
Substituting this into Eq. (18) and recalling $\theta = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$, we obtain
$$Q\bigl(\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}, \{\pi_k^{(t)}, \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\}_{k=1}^{K}\bigr) = \sum\limits_{i=1}^{n} \sum\limits_{k=1}^{K} \gamma_k^{(t)}(\boldsymbol{x}_i) \ln \bigl( \pi_k\, N(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \bigr) \tag{20}$$
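To make Eq. (20) concrete, here is a minimal NumPy sketch that evaluates the Q function given responsibilities and candidate parameters. The helper `log_gauss` is a hand-written log-density of my own, not something defined in the text:

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log-density ln N(x | mu, Sigma) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)) + quad)

def q_function(X, gamma, pi, mu, Sigma):
    """Q = sum_i sum_k gamma[i,k] * (ln pi_k + ln N(x_i | mu_k, Sigma_k)), as in Eq. (20)."""
    total = 0.0
    for k in range(len(pi)):
        total += np.sum(gamma[:, k] * (np.log(pi[k]) + log_gauss(X, mu[k], Sigma[k])))
    return total
```

During EM, `gamma` is held fixed at the values computed from $\theta^{(t)}$ while `pi`, `mu`, `Sigma` are the free variables being maximized over.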
With this expected log-likelihood (the Q function) in hand, we can maximize it to obtain the current best estimate of the model parameters.
To obtain cluster memberships for the observed data, we need an estimate of the parameter set $\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$.
M step:
$$\{\pi_k^{(t+1)}, \boldsymbol{\mu}_k^{(t+1)}, \boldsymbol{\Sigma}_k^{(t+1)}\}_{k=1}^{K} = \arg \max_{\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}} Q\bigl(\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}, \{\pi_k^{(t)}, \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\}_{k=1}^{K}\bigr) \tag{21}$$
This maximization is fairly direct. Differentiating with respect to $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k^{-1}$ gives:
$$\frac{\partial Q(\theta \mid X, \theta^{(t)})}{\partial \boldsymbol{\mu}_k} = \sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)\, \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x}_i - \boldsymbol{\mu}_k) = \boldsymbol{0}$$
$$\frac{\partial Q(\theta \mid X, \theta^{(t)})}{\partial \boldsymbol{\Sigma}_k^{-1}} = \frac{1}{2} \sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i) \bigl( \boldsymbol{\Sigma}_k - (\boldsymbol{x}_i - \boldsymbol{\mu}_k)(\boldsymbol{x}_i - \boldsymbol{\mu}_k)^\top \bigr) = \boldsymbol{0} \tag{22}$$
for $k = 1, \cdots, K$. Solving this system of equations yields:
$$\boldsymbol{\mu}_k^{(t+1)} = \frac{\sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)\, \boldsymbol{x}_i}{\sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)}$$
$$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{\sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)\, (\boldsymbol{x}_i - \boldsymbol{\mu}_k^{(t+1)})(\boldsymbol{x}_i - \boldsymbol{\mu}_k^{(t+1)})^\top}{\sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)} \tag{23}$$
for $k = 1, \cdots, K$; one can show that this stationary point is indeed a maximum.
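The closed-form updates in Eq. (23) translate directly into responsibility-weighted averages. A minimal NumPy sketch (the function name and argument conventions are my own):

```python
import numpy as np

def update_mu_sigma(X, gamma):
    """Weighted-mean and weighted-covariance updates of Eq. (23).

    X: (n, d) data matrix; gamma: (n, K) responsibility matrix.
    """
    n, d = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                # effective sample count per component
    mu = (gamma.T @ X) / Nk[:, None]      # weighted means, one row per component
    Sigma = np.empty((K, d, d))
    for k in range(K):
        diff = X - mu[k]
        # Weighted scatter matrix around the *new* mean, normalized by Nk
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return mu, Sigma
```

Note that the new mean $\boldsymbol{\mu}_k^{(t+1)}$ is computed first and then reused inside the covariance update, exactly as the equations prescribe.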
As for maximizing over $\pi_k$: to enforce the constraint $\sum\limits_{k=1}^{K} \pi_k = 1$, we introduce a Lagrange multiplier $\lambda$ and seek a stationary point of $Q(\theta \mid X, \theta^{(t)}) + \lambda \bigl(\sum\limits_{k=1}^{K} \pi_k - 1\bigr)$. Differentiating with respect to $\pi_k$ gives:
$$\frac{\partial}{\partial \pi_k} \Bigl( Q(\theta \mid X, \theta^{(t)}) + \lambda \bigl( \sum\limits_{j=1}^{K} \pi_j - 1 \bigr) \Bigr) = \sum\limits_{i=1}^{n} \frac{\gamma_k^{(t)}(\boldsymbol{x}_i)}{\pi_k} + \lambda = 0 \tag{24}$$
for $k = 1, \cdots, K$. Multiplying both sides of each of these $K$ equations by $\pi_k$ (assuming all $\pi_k$ are nonzero) and summing them gives:
$$\sum\limits_{i=1}^{n} \sum\limits_{k=1}^{K} \gamma_k^{(t)}(\boldsymbol{x}_i) + \lambda \sum\limits_{k=1}^{K} \pi_k = 0$$
Since $\sum\limits_{k=1}^{K} \gamma_k^{(t)}(\boldsymbol{x}_i) = 1$ for every $i$ and $\sum\limits_{k=1}^{K} \pi_k = 1$, we have
$$\lambda = -n \tag{25}$$
Substituting $\lambda = -n$ back into Eq. (24) and solving for $\pi_k$ gives
$$\pi_k^{(t+1)} = \frac{1}{n} \sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i) \tag{26}$$
for $k = 1, \cdots, K$. This process is repeated until the log-likelihood $L(\theta)$ shows no significant change. At that point, if desired, a cluster assignment can be made by assigning each point $\boldsymbol{x}_i$ to the component $k$ with the largest membership. Note that other hard assignment rules are possible; for example, the algorithm may refuse to assign a point to any component unless its largest membership exceeds a pre-specified threshold.
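Eq. (26) and the hard-assignment rule just described can be sketched as follows. The thresholded rejection is the optional variant mentioned above; function names are my own:

```python
import numpy as np

def update_pi(gamma):
    """Mixing weights as the average responsibility per component, Eq. (26)."""
    return gamma.mean(axis=0)

def assign_clusters(gamma, threshold=0.0):
    """Hard assignment: argmax responsibility per point.

    Points whose largest membership falls below the threshold are marked -1
    (rejected, i.e., not assigned to any component).
    """
    labels = gamma.argmax(axis=1)
    labels[gamma.max(axis=1) < threshold] = -1
    return labels
```

With the default `threshold=0.0` every point is assigned; raising the threshold reproduces the "reject uncertain points" behavior.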
The full EM procedure is summarized below.
Algorithm: clustering with EM and a Gaussian mixture model

- Initialize $K$, $\tau > 0$, and $\{\pi_k^{(0)}, \boldsymbol{\mu}_k^{(0)}, \boldsymbol{\Sigma}_k^{(0)}\}_{k=1}^{K}$
- repeat
  - E step: update the memberships
    $$\gamma_k^{(t)}(\boldsymbol{x}_i) = \frac{\pi_k^{(t)}\, N(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum\limits_{j=1}^{K} \pi_j^{(t)}\, N(\boldsymbol{x}_i \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$$
  - M step: re-estimate the model parameters
    $$\boldsymbol{\mu}_k^{(t+1)} = \frac{\sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)\, \boldsymbol{x}_i}{\sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)}$$
    $$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{\sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)\, (\boldsymbol{x}_i - \boldsymbol{\mu}_k^{(t+1)})(\boldsymbol{x}_i - \boldsymbol{\mu}_k^{(t+1)})^\top}{\sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)}$$
    $$\pi_k^{(t+1)} = \frac{1}{n} \sum\limits_{i=1}^{n} \gamma_k^{(t)}(\boldsymbol{x}_i)$$
  - Compute the log-likelihood:
    $$L(\{\pi_k^{(t+1)}, \boldsymbol{\mu}_k^{(t+1)}, \boldsymbol{\Sigma}_k^{(t+1)}\}_{k=1}^{K}) = \sum\limits_{i=1}^{n} \ln \Bigl( \sum\limits_{k=1}^{K} \pi_k^{(t+1)} N(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k^{(t+1)}, \boldsymbol{\Sigma}_k^{(t+1)}) \Bigr)$$
- until $\bigl| L(\{\pi_k^{(t+1)}, \boldsymbol{\mu}_k^{(t+1)}, \boldsymbol{\Sigma}_k^{(t+1)}\}_{k=1}^{K}) - L(\{\pi_k^{(t)}, \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\}_{k=1}^{K}) \bigr| < \tau$
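Putting the pieces together, the algorithm box above can be sketched as a single function. This is a minimal NumPy implementation under the stated stopping rule; the function name, the optional `mu_init` argument, and the initialization choices (uniform weights, identity covariances, means drawn from random samples) are my own assumptions, not prescribed by the text:

```python
import numpy as np

def gmm_em(X, K, tau=1e-6, max_iter=200, mu_init=None, seed=0):
    """Fit a K-component GMM by EM; stop when |L(t+1) - L(t)| < tau."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                     # uniform mixing weights
    mu = X[rng.choice(n, K, replace=False)] if mu_init is None else np.array(mu_init, float)
    Sigma = np.stack([np.eye(d)] * K)            # identity covariances
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step: weighted densities, then normalize rows into responsibilities
        dens = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma[k]), diff)
            dens[:, k] = pi[k] * np.exp(-0.5 * quad) / np.sqrt(
                (2 * np.pi) ** d * np.linalg.det(Sigma[k]))
        ll = np.log(dens.sum(axis=1)).sum()      # current log-likelihood L
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form updates, Eqs. (23) and (26)
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        if abs(ll - prev_ll) < tau:              # the "until" criterion
            break
        prev_ll = ll
    return pi, mu, Sigma, ll
```

In practice one would work with log-densities and a log-sum-exp to avoid underflow; the direct densities here keep the sketch close to the formulas in the text.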
Summary
- Mean $\boldsymbol{\mu}_k$: estimated as a responsibility-weighted average of the samples.
- Covariance matrix $\boldsymbol{\Sigma}_k$: estimated via the responsibility-weighted central second moment of the samples.
- Mixing coefficient $\pi_k$: determined by the average posterior probability that the samples belong to that component.