Hidden Markov Models

The Hidden Markov Model (HMM) deals with sequential data. Its basic assumptions are: each observation in the sequence is drawn from a conditional probability distribution that depends only on the state (also called the latent variable) at the current time step; the state itself changes over time; and the state transitions follow a Markov process.

Let \(Y_1, \dots, Y_T\) denote the observations at \(T\) time steps and \(X_1, \dots, X_T\) the corresponding states; for every \(i = 1, \dots, T\) we have \(Y_i \sim p(\cdot \mid X_i)\). The HMM usually assumes that both states and observations are discrete: there are \(N\) possible state values \(x_1, \dots, x_N\) and \(M\) possible observation values \(y_1, \dots, y_M\). Let the state transition matrix \(A\) be an \(N \times N\) matrix, the emission matrix \(B\) an \(N \times M\) matrix, and \(\pi\) the initial state probability vector.

An HMM is fully determined by \(\pi, A, B\), so it can be written as the triple \[ \lambda = (\pi, A, B) \]
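As a concrete illustration, a two-state HMM with three possible observation values can be written down directly as the triple \((\pi, A, B)\). This is a minimal sketch; all numbers below are made-up illustrative values, not taken from the text.

```python
import numpy as np

# A hypothetical HMM with N = 2 states and M = 3 observation values.
# Rows of A and B sum to 1; the numbers are arbitrary illustrative choices.
pi = np.array([0.6, 0.4])            # initial state distribution, shape (N,)
A = np.array([[0.7, 0.3],            # transition matrix, shape (N, N)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # emission matrix, shape (N, M)
              [0.1, 0.3, 0.6]])

lam = (pi, A, B)                     # the triple lambda = (pi, A, B)
```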

An HMM gives rise to three basic problems:

  1. The probability problem. Given the model \(\lambda = (\pi, A, B)\) and an observation sequence \(\Y = (y_{o_1}, \dots, y_{o_T})\), compute the probability that the model \(\lambda\) generates \(\Y\).
  2. The learning problem. Given an observation sequence \(\Y = (y_{o_1}, \dots, y_{o_T})\), estimate the model parameters \(\lambda = (\pi, A, B)\).
  3. The prediction problem (also called decoding). Given the model \(\lambda = (\pi, A, B)\) and an observation sequence \(\Y = (y_{o_1}, \dots, y_{o_T})\), find the state sequence \(\X\) that maximizes the conditional probability \(P(\X \mid \Y)\).

The probability problem

Brute-force solution

For the probability problem, the most direct idea is to enumerate all possible state sequences and then compute the probability of the observation sequence under each of them. A particular state sequence \(\X = (x_{i_1}, \dots, x_{i_T})\) occurs with probability \[ P(\X) = \pi_{i_1} a_{i_1, i_2} a_{i_2, i_3} \dots a_{i_{T-1}, i_T} \] Given this state sequence \(\X\), the observation sequence \(\Y\) occurs with probability \[ P(\Y | \X) = b_{i_1, o_1} b_{i_2, o_2} \dots b_{i_T, o_T} \] Their joint probability is \[ P(\X, \Y) = P(\X) \times P(\Y | \X) = \pi_{i_1} b_{i_1, o_1} a_{i_1, i_2} b_{i_2, o_2} \dots a_{i_{T-1}, i_T} b_{i_T, o_T} \] Summing this over all possible state sequences gives \[ P(\Y) = \sum_\X P(\X, \Y) = \sum_{i_1, i_2, \dots, i_T} \pi_{i_1} b_{i_1, o_1} a_{i_1, i_2} b_{i_2, o_2} \dots a_{i_{T-1}, i_T} b_{i_T, o_T} \] The sum has \(N^T\) terms and each term requires \(O(T)\) multiplications, so the overall complexity is \(O(T N^T)\), which is prohibitive.
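A minimal sketch of this brute-force computation, assuming the toy \(\pi, A, B\) from above and an observation sequence given as a list of observation indices `obs` (both the function name and the data are illustrative choices, not part of the original text):

```python
import itertools
import numpy as np

def brute_force_prob(pi, A, B, obs):
    """P(Y) by enumerating all N**T state sequences -- O(T * N**T)."""
    N, T = len(pi), len(obs)
    total = 0.0
    for states in itertools.product(range(N), repeat=T):
        p = pi[states[0]] * B[states[0], obs[0]]
        for t in range(1, T):
            p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
        total += p
    return total

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 2, 1]                      # indices into the M observation values
print(brute_force_prob(pi, A, B, obs))
```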

The forward algorithm

First we introduce the forward probability. Given an HMM \(\lambda\) and an observation sequence \(\Y = (y_{o_1}, \dots, y_{o_T})\), the forward probability is defined as the probability that the observations up to time \(t\) are the given \((y_{o_1}, \dots, y_{o_t})\) and the state at time \(t\) is \(x_i\), written \[ \alpha_t(i) = P(Y_1 = y_{o_1}, \dots, Y_t = y_{o_t}, X_t = x_i; \lambda) \] At the initial step, i.e. \(t = 1\), \[ \alpha_1(i) = \pi_i b_{i, o_1} \] For \(t = 2, \dots, T\), \[ \alpha_t(i) = \left[ \sum_{j=1}^N \alpha_{t-1}(j) a_{j, i} \right] b_{i, o_t} \] Finally, \[ P(\Y) = \sum_{i=1}^N \alpha_T(i) \] The forward algorithm exploits the sequential structure of the state process and stores intermediate results of the recursion, which is what saves computation. The most expensive part is the recursion step: each \(\alpha_t(i)\) is a sum of \(O(N)\) terms, and at every time step \(t\) there are \(N\) such sums, so the total cost over \(T\) time steps is \(O(N^2 T)\).
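The recursion translates almost line by line into code. The sketch below assumes the same toy parameters and observation-index sequence as before (illustrative values only); its result should agree with the brute-force computation up to floating-point error.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns (alpha, P(Y)) in O(N**2 * T) time."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_1(i) = pi_i * b_{i,o_1}
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # sum_j alpha_{t-1}(j) a_{j,i}, then emit
    return alpha, alpha[-1].sum()                     # P(Y) = sum_i alpha_T(i)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
alpha, prob = forward(pi, A, B, [0, 2, 1])
print(prob)
```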

The learning problem

Supervised learning

Supervised learning is the easier case: both \(\X\) and \(\Y\) are known, and we only need to estimate the model parameters \(\pi, A, B\).

Let \(A_{i, j}\) be the number of times the state transitions from \(i\) to \(j\) in the sample; then \(a_{i, j}\) is estimated by \[ \hat a_{i, j} = \frac{A_{i, j}} {\sum_{k=1}^N A_{i, k}} \] Let \(B_{i, j}\) be the number of times the state is \(i\) and the observation is \(j\); then \(b_{i, j}\) is estimated by \[ \hat b_{i, j} = \frac{B_{i, j}} {\sum_{k=1}^M B_{i, k}} \] As for the initial state vector \(\pi\), if the sample contains multiple sequences, it can likewise be estimated from the corresponding frequencies of the first state of each sequence.
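A minimal sketch of these frequency-count estimates, assuming the labeled data comes as parallel lists of state indices and observation indices (the function name and the data are made up for illustration):

```python
import numpy as np

def supervised_estimate(state_seqs, obs_seqs, N, M):
    """Estimate (pi, A, B) by counting over fully labeled sequences."""
    pi_cnt = np.zeros(N)
    A_cnt = np.zeros((N, N))
    B_cnt = np.zeros((N, M))
    for states, obs in zip(state_seqs, obs_seqs):
        pi_cnt[states[0]] += 1                       # first state of each chain
        for t in range(len(states) - 1):
            A_cnt[states[t], states[t + 1]] += 1     # transition i -> j
        for s, o in zip(states, obs):
            B_cnt[s, o] += 1                         # state i emits observation j
    pi_hat = pi_cnt / pi_cnt.sum()
    A_hat = A_cnt / A_cnt.sum(axis=1, keepdims=True)
    B_hat = B_cnt / B_cnt.sum(axis=1, keepdims=True)
    return pi_hat, A_hat, B_hat

# Two made-up labeled chains with N = 2 states and M = 3 observation values.
states = [[0, 0, 1, 1], [1, 0, 0, 1]]
obs = [[0, 1, 2, 2], [2, 0, 1, 2]]
print(supervised_estimate(states, obs, N=2, M=3))
```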

Unsupervised learning

In most cases the states, i.e. the latent variables \(\X\), cannot be observed. We can then use the EM algorithm, which finds good model parameters by optimizing a lower bound of the likelihood. The concrete EM procedure for HMMs was worked out by Baum and Welch, so the resulting algorithm is known as the Baum-Welch algorithm.

The likelihood and log-likelihood for this problem are \[ \begin{aligned} P(\Y; \lambda) &= \sum_{\X} P(\X, \Y; \lambda) \\ \log P(\Y; \lambda) &= \log \sum_{\X} P(\X, \Y; \lambda) \end{aligned} \] Following the EM algorithm, we introduce the posterior under the current estimate \(\lambda^t\) and apply Jensen's inequality: \[ \begin{aligned} \log P(\Y; \lambda) &= \log \E_{\X \sim P(\cdot | \Y; \lambda^t)} \frac{P(\X, \Y; \lambda)}{P(\X | \Y; \lambda^t)} \\ &\ge \E_{\X \sim P(\cdot | \Y; \lambda^t)} \log \frac{P(\X, \Y; \lambda)}{P(\X | \Y; \lambda^t)} \end{aligned} \] The denominator inside the logarithm does not depend on \(\lambda\), so maximizing this lower bound over \(\lambda\) is equivalent to maximizing \[ Q(\lambda^t, \lambda) \triangleq \E_{\X \sim P(\cdot | \Y; \lambda^t)} \log P(\X, \Y; \lambda) \]

In the next iteration, the new estimate \(\lambda^{t+1}\) is \[ \lambda^{t+1} = \arg \max_\lambda Q(\lambda^t, \lambda) \]

Furthermore, \[ \begin{aligned} &Q(\lambda^t, \lambda) = \E_{\X \sim P(\cdot | \Y; \lambda^t)} \log P(\X, \Y; \lambda) \\ &= \sum_\X P(\X | \Y; \lambda^t) \log P(\X, \Y; \lambda) \\ &= \sum_\X \frac{P(\X, \Y; \lambda^t)}{P(\Y; \lambda^t)} \log P(\X, \Y; \lambda) \\ &= \frac{\sum_\X P(\X, \Y; \lambda^t) \log P(\X, \Y; \lambda)}{P(\Y; \lambda^t)} \\ \end{aligned} \] The factor \(\frac{1}{P(\Y; \lambda^t)}\) does not depend on \(\lambda\), so \[ \lambda^{t+1} = \arg \max_\lambda \sum_\X P(\X, \Y; \lambda^t) \log P(\X, \Y; \lambda) \]

In what follows we write \(\X = (x_{i_1}, \dots, x_{i_T})\), but to keep the notation compact we will not expand this at every occurrence of \(\X\). Expanding the objective further, \[ \begin{aligned} &\sum_\X P(\X, \Y; \lambda^t) \log P(\X, \Y; \lambda) = \sum_\X P(\X, \Y; \lambda^t) \log \pi_{i_1} b_{i_1, o_1} a_{i_1, i_2} b_{i_2, o_2} \dots a_{i_{T-1}, i_T} b_{i_T, o_T} \\ &= \sum_{\X} P(\X, \Y; \lambda^t) \log \pi_{i_1} + \sum_{\X} P(\X, \Y; \lambda^t) \sum_{t=1}^{T-1} \log a_{i_t, i_{t+1}} + \sum_{\X} P(\X, \Y; \lambda^t) \sum_{t=1}^T \log b_{i_t, o_t} \end{aligned} \]

We can maximize each of the three terms separately in order to maximize the whole objective; a code sketch of the resulting Baum-Welch updates follows the list below.

  1. Note that \(\pi\) satisfies the constraint \(\sum_{j=1}^N \pi_j = 1\). Its Lagrangian is \[ \begin{aligned} L_\pi (\pi, \gamma) &= \sum_{\X} P(\X, \Y; \lambda^t) \log \pi_{i_1} + \gamma (\sum_{j=1}^N \pi_j - 1) \\ &= \sum_{j=1}^N P(\Y, i_1 = j; \lambda^t) \log \pi_{j} + \gamma (\sum_{j=1}^N \pi_j - 1) \end{aligned} \] Taking the derivative with respect to \(\pi_j\) and setting it to \(0\) gives, for every \(j = 1, \dots, N\), \[ \begin{align} \frac{P(\Y, i_1 = j; \lambda^t)}{\pi_j} + \gamma &= 0 \\ P(\Y, i_1 = j; \lambda^t) + \pi_j \gamma &= 0 \label{pi-eq} \end{align} \] Summing over \(j\) yields \[ \sum_{j=1}^N P(\Y, i_1 = j; \lambda^t) + \sum_{j=1}^N \pi_j \gamma = 0 \\ \gamma = -P(\Y; \lambda^t) \] Substituting back into \(\eqref{pi-eq}\) gives \[ \pi_j = \frac{P(\Y, i_1=j; \lambda^t)}{P(\Y; \lambda^t)} \]

  2. Note that \(A\) satisfies the constraints \(\forall j = 1, \dots, N, \sum_{k=1}^N a_{j, k} = 1\). Its Lagrangian is \[ \begin{aligned} L_A (A, \gamma) &= \sum_{\X} P(\X, \Y; \lambda^t) \sum_{t=1}^{T-1} \log a_{i_t, i_{t+1}} + \sum_{j=1}^N \gamma_j (\sum_{k=1}^N a_{j, k} - 1) \\ &= \sum_{j=1}^N \sum_{k=1}^N \sum_{t=1}^{T-1} P(\Y, i_t=j, i_{t+1}=k; \lambda^t) \log a_{j, k} + \sum_{j=1}^N \gamma_j (\sum_{k=1}^N a_{j, k} - 1) \end{aligned} \] Following the same steps as for \(\pi\), we obtain \[ a_{j, k} = \frac{\sum_{t=1}^{T-1} P(\Y, i_t=j, i_{t+1}=k; \lambda^t)}{\sum_{t=1}^{T-1} P(\Y, i_t=j; \lambda^t)} \]

  3. Note that \(B\) satisfies the constraints \(\forall j = 1, \dots, N, \sum_{k=1}^M b_{j, k} = 1\). Its Lagrangian is \[ \begin{aligned} L_B (B, \gamma) &= \sum_{\X} P(\X, \Y; \lambda^t) \sum_{t=1}^T \log b_{i_t, o_t} + \sum_{j=1}^N \gamma_j (\sum_{k=1}^M b_{j, k} - 1) \\ &= \sum_{j=1}^N \sum_{t=1}^T P(\Y, i_t=j; \lambda^t) \log b_{j, o_t} + \sum_{j=1}^N \gamma_j (\sum_{k=1}^M b_{j, k} - 1) \end{aligned} \] Following the same steps as for \(\pi\), we obtain \[ b_{j, k} = \frac{\sum_{t=1}^T P(\Y, i_t=j; \lambda^t) \mathbb{I}[y_{o_t} = y_k]}{\sum_{t=1}^T P(\Y, i_t=j; \lambda^t)} \]
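Putting the pieces together, the E-step computes the required posteriors via the forward and backward recursions, and the M-step applies the three closed-form updates derived above. The following is only a sketch: it handles a single observation sequence with unnormalized \(\alpha\) and \(\beta\) (so it is suitable only for short sequences; a practical version would rescale or work in log space), and all names, sizes, and data are illustrative assumptions.

```python
import numpy as np

def forward(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs, N):
    T = len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch(obs, N, M, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(N))
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    obs = np.asarray(obs)
    for _ in range(n_iter):
        # E-step: posteriors over states (gamma) and over transitions (xi).
        alpha, beta = forward(pi, A, B, obs), backward(A, B, obs, N)
        py = alpha[-1].sum()                                  # P(Y; lambda^t)
        gamma = alpha * beta / py                             # gamma[t, i] = P(X_t = i | Y)
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / py  # xi[t, i, j] = P(X_t=i, X_{t+1}=j | Y)
        # M-step: the closed-form updates for pi, A, B derived above.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

print(baum_welch([0, 1, 2, 2, 1, 0, 0, 1, 2, 2], N=2, M=3))
```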

The prediction problem

The standard algorithm for the prediction problem is the Viterbi algorithm, which uses dynamic programming to find the single best state path: at each time step it keeps, for every state, the probability of the most likely partial path ending in that state, and at the end it backtracks from the best final state.
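A minimal sketch of Viterbi decoding in log space, again assuming the toy parameters from earlier and an observation-index sequence (all values illustrative):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path via dynamic programming, in log space."""
    N, T = len(pi), len(obs)
    log_delta = np.zeros((T, N))         # best log-prob of a path ending in each state
    psi = np.zeros((T, N), dtype=int)    # backpointers (argmax predecessors)
    log_delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = log_delta[t - 1][:, None] + np.log(A)   # scores[j, i]: from state j into state i
        psi[t] = scores.argmax(axis=0)
        log_delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Backtrack from the best final state.
    path = [log_delta[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 2, 1, 1, 2]))
```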
