Variable Elimination

某个确定了的概率图，它的推断可以看作是一个关于所有变量的函数，我们要求的是这个函数的具体值是多少，从概率的角度上消除变量，其实就是做这个函数的边缘化 marginalization，
我们尽量争取每次计算(消除)的变量比较少，这样总的复杂度不会高。。
Elimination 似乎可以适用于所有结构的 graph 。

Directed Chain

假设我们的图的结构是这样的 $A\rightarrow B\rightarrow C\rightarrow D\rightarrow E$

$\begin{align} P(e) &= \sum_{a,b,c,d} p(a,b,c,d) \\ &= \sum_{a,b,c,d} P (a)P (b|a)P (c|b)P (d|c)P (e|d) \\ &=\sum_{d,c,b} P(c|b) P(d|c) P(e|d) \sum_a P(a) P(b|a) \\ &\text{this is an one variable elimination cost } k^2\\ &= \sum_{d,c,b} P(c|b) P(d|c) P(e|d) p(b) \\ &\ \cdots \\ &= \sum_d P(e|d) p(d) \end{align}$

复杂度 Complexity:

Eliminate 方法: costs $O(k^2 n)$ _这里面的 $k^2$ 表示迭代 k 次，每次计算概率也要 k_
Naive 方法: cost $O(k^n)$

Undirected Chain

假设我们的图是这样的无向图 $A - B - C - D - E$

$\begin{align} P(e) &= \sum_{a,b,c,d} \frac{1}{Z} \phi(b,a) \phi(c,b) \phi(d,c) \phi(e,d) \\ &\propto \sum_{a,b,c,d} \phi(b,a) \phi(c,b) \phi(d,c) \phi(e,d) \\ &= \sum_{a,b,c,d} \phi(c,b) \phi(d,c) \phi(e,d)\sum_a \phi(b,a) \\ &= \sum_{a,b,c,d} \phi(c,b) \phi(d,c) \phi(e,d) m_a(b) \\ &\ \cdots \\ &= m_d(e) \end{align}$

这里是无向图，原始的势函数运算成为 m 的结果不是概率，所以这里我们要 normalize 一下:

$P(e) = \frac{m_d(e)}{\sum_e m_d(e)}$

Graph Elimination

我们从一张图来看 Elimination 的每一次过程后，剩下的图的结构

对于一张 graph 首先我们确认 elimination 的 order
对于每一个待消除的变量，它会连接一些变量，我们将这些变量两两相连
消除待消除的变量

我们再来观察每次 Elimination 后，也就是边缘化操作后形成的函数，发现这些函数的变量在一起，刚好能组成这个概率图结构的 cliques 集合，如果我们考虑这些 cliques 组成的树，那么 elimination 操作其实就是在这颗树上进行 message passing。

Key insight 就是这些 message 其实是可以 reused 的，重复使用是指，当我们要进行多次 querying 的时候，信息的重复使用，所以我们希望设计好的算法，能够在 querying 的过程中保存下来这些信息，于是有了后面的 Sum-Product 算法。

Complexity of Variable Elimination

这里 $y$ 未知的，后面操作可能要消除的变量，$x$ 是正在消除的变量，$m$ 是当前要消除的乘子，sum-product 算法分为下面两部

Sum: $m_x(y_1,...,y_k) = \sum_x m^{\prime}_x (x,y_1,...,y_k)$
Product: $m_x^{\prime} (x,y_1,...,y_k) = \prod_{i=1}^k m_i(x,y_{c_i})$

乘起来就是一次 Elimination 的复杂度： $k\cdot | Var(X) |\cdot \prod_i |Var(Y_{C_i})|$
也就是当前变量的状态乘上，乘子也就是对应的 clique 的所有变量的状态叉乘

我们发现整个算法的复杂度取决于最大的最大的 clique，我们称这个 clique 的变量数量 k 为 Tree-width , 同时要注意的是，不同的 elimination order 的 Tree-Width 是不同的，找到最优的 order 是 np-hard 问题。。。

Belief Propagation

asd

Trees
- Two-pass Algorithm
Factor Trees
- Message Passing on Factor Graph
Non-Trees (General Graph)
- Junction Tree Algorithm

概率图中有向图模型其实是无向图的一种特例，从有向图到无向图的转换关系如下

Undirected Tree:
$p(x) = \frac{1}{Z}\left( \prod_{i\in V} \psi(x_i) \prod_{(i,j)\in E} \psi (x_i,x_j) \right)$
Directed Tree:
$p(x) = p(x_r) \prod_{(i,j)\in E} p(x_i|x_j)$
Equivalence:
$\psi (x_r) = p(x_r);\ \psi(x_i,x_j)=p(x_j|x_i);\ Z=1,\psi(x_i) = 1$

Trees

Elimination 操作可以看作是 message passing.

令 $m_{ji}(x_i)$ 当作是从 i 那里变量消除后生成的乘子，同时，这就是 $x_i$ 的函数:

$m_{ji}(x_i) = \sum_{x_j} \left( \psi(x_j)\psi(x_i,x_j) \prod_{k\in N(j)\setminus i} m_{kj}(x_j) \right)$

上面的公式可以理解为：从 $i$ 到 $j$ 的信息，只和传递信息箭头相反的范围内的那些节点相关

对于某个节点所对应的概率，我们可以这样表示：

$p(x_i)\propto \psi(x_i) \prod_{e\in N(i)} m_{ei}(x_i)$

可以看到计算 $p(x_i)$ 的时候， $m_{ij}(x_i)$ 会被重复的使用, 所以我们可以存储 $m$ 的值

树的 Elimination 来做 querying 算法的复杂度是 $O(NC)$ (where N=nodes, C=complexity of one complete passing/clique bottleneck). 但是使用了 two path 算法（因为是无向图所以每条边有两个方向）以后，复杂度就变成了: 2C, or $O(C)$

belief propagation is only valid on trees

Factor Trees

首先，我们定义一个变换，这个变换把一个图变成了一个新的图，变换后的图称为 Factor Graph,如果变换后刚好是一颗树，那么我们也可以称之为 Factor Tree。在新的 Factor Graph 中，每一个 factor (clique) 在图中表示一个节点 f, 下面是一个例子, 其中的一个性质就是 $f$ 节点只和 $x$ 节点相连，也就是说，x 的某个变量把它和其他变量的关系都托付给了 f 节点。

对于一个图，可能有好几种变换方式，我们希望变换后的结果就是一个树，和下面的 Example 3 一样。

另外变换后的图其实是一个二分图（bipartite），二分图每一侧都是一种类型的节点，所以信息传递策略于传统的方法有些不同。。有两种信息的传递方式

$\nu$ : from variables to factors（左图） $\nu_{is}(x_i) = \prod_{t\in N(i)\setminus s} \mu_{ti}(x_i)$
$\mu$ : from factors to variables（右图）
- _上面的 $\sum$ 操作是遍历变量的赋值，$\sum$ 操作下面的 x 可以看作是一个向量，遍历向量里面所有的赋值._

Factor Tree 算法只能够处理一些长得像树的概率图

Junction Trees

Junction tree data-structure for exact inference on general graphs

Algorithm

Moralization
Triangulation
Junction tree
Message Propagation

Moral Graph

因为我们要处理的是广泛结构的概率图模型，所以我们先把 BN 纳入到 MRF 的框架里面，这一步骤叫做 Moralization，我们知道 BN 中的 factor 是某些父变量对于指定变量的条件概率，我们不管哪些是条件变量，我们就把他们看成是一个整体的函数，我们的终极目的是生成一个 clique，clique 有要求是全联通的，于是我们就将这个变量的父节点两两配对相连，这样就形成了一个 clique，原来的 factor 就变成了势函数 potential。

在这里我们得到一个启发，就是增加一条边后，原本的 graph 是新的 graph 的一种特殊情况。

Triangulation

对于三角化的操作，我们可以先看后面两个操作的介绍再回来，因为这是为了解决后面问题的而诞生的一个步骤

问题就是 Local Consistency 不能导出 Global Consistency，只有在三角化后的图中

三角化以后的图是没有大于 4 个节点以上的环的，三角化的方法就是在大的环中添加额外边

Clique Tree

下面的推断可以知道，有向图条件概率乘积的表达形式，其实就是 clique tree 表达形式的一种特殊情况

$\begin{align} &P(X_1,X_2,X_3,X_4,X_5,X_6)\\ & = P(X_1)P(X_2)P(X_3 | X_1,X_2)P(X_4 | X_3)P(X_5 | X_3)P(X_6 | X_4,X_5) \\ & = P(X_1,X_2,X_3) \frac{P(X_3,X_4,X_5)}{P(X_3)} \frac{P(X_4,X_5,X_6)}{P(X_4,X_5)} \\ & = \psi(X_1,X_2,X_3) \frac{\psi(X_3,X_4,X_5)}{\phi(X_3)} \frac{\psi(X_4,X_5,X_6)}{\phi(X_4,X_5)} \end{align}$

General Form : 之所以下面是要除以 cliques 之间的交集 S ，是因为交集的信息可能出现了多次

$P(\mathbf{X}) = \frac{\prod_{c} \psi_c(\mathbf{X_c})}{\prod_{s} \phi_s(\mathbf{X_s})}$

Message Passing

传递方式有两种，这两种方法算出来的结果应该是一样的，这是我们做出的假设称为 Local Consistency

$P(S) = \sum_{V\setminus S} \psi(V) \qquad \qquad P(S) = \sum_{W\setminus S} \psi(W)$

下面的第一行是 forward update，第二行是 backward update，其中 $\frac{\phi_S^*}{\phi_S}$ 是通过 Local Consistency 得出的，是建立起矩形节点两边沟通的桥梁

上面是 clique tree 信息传递的方式，

Shafer-Shenoy algorithm

Appendix

General Variable Elimination

为了让计算机能够自动的处理各种各样结构的概率图的 Elimination，我们可以设计一种更加 general 的形式，但是这个形式的设定，主要还是为了进行计算机的运算的。。

Let $X$ be set of all random variables
Let $F$ denote the set of factors and then for each $\phi \in F$,$Scope[\phi] \in X$
There three type of variables in Elimination Model
- Let $Y\subset X$ be a set of query variables
- Let $Z = X - Y$ would be the set of variables to be eliminated
- Let $\mathcal{E}$ be the known variables, and $\bar{e}_i$ is the assignment

The core operation can be view as the form of, we can extend it to general form by import evidence potential

$\tau(Y) = \sum_z \prod_{\phi \in F} \phi$

The evidence potantial:
$\begin{align} \delta(\mathcal{E}) = \left \{ \begin{array}{ll} 1& if\ \mathcal{E_i} \equiv \bar{e}_i \\ 0 & if\ \mathcal{E_i} \neq \bar{e}_i \end{array} \right . \end{align}$
Total evidence potential:
$\begin{align} \delta(\mathbf{\mathcal{E}},\mathbf{\bar{e}})= \prod_{i\in I_{\mathcal{E}}} \delta (\mathcal{E}_i,\bar{e}_i) \end{align}$
Introducing evidence: $\tau(\mathbf{Y},\mathbf{\bar{e}}) = \sum_{z,e}\prod_{\phi \in F} \phi \times \delta(\mathbf{\mathcal{E}},\bar{\mathbf{e}})$

The elimination algorithm

… …

Reference

https://www.jianshu.com/p/f90100680749

PGM-Exact Inference

2018-08-14

PGM-Exact Inference

Variable Elimination

Directed Chain

Undirected Chain

Graph Elimination

Complexity of Variable Elimination

Belief Propagation

Trees

Factor Trees

Junction Trees

Moral Graph

Triangulation

Clique Tree

Message Passing

Appendix

General Variable Elimination

The elimination algorithm

Reference

PGM-Exact Inference

Variable Elimination

Directed Chain

Undirected Chain

Graph Elimination

Complexity of Variable Elimination

Belief Propagation

Trees

Factor Trees

Junction Trees

Moral Graph

Triangulation

Clique Tree

Message Passing

Appendix

General Variable Elimination

The elimination algorithm

Reference

thanks~

Factor Trees