Book Notes: Probability and Information Theory

## Probability and Information Theory

Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty as well as axioms for deriving new uncertain statements. In artificial intelligence applications, we use probability theory in two major ways:

  • The laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory;
  • We can use probability and statistics to theoretically analyze the behavior of proposed AI systems.

While probability theory allows us to make uncertain statements and to reason in the presence of uncertainty, information theory enables us to quantify the amount of uncertainty in a probability distribution.
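As a concrete example of that last point, the short sketch below (my own illustration, not from the book) computes the Shannon entropy of a discrete distribution, the standard information-theoretic measure of its uncertainty:

```python
import numpy as np

def shannon_entropy(p, base=2):
    """Entropy H(p) = -sum_i p_i * log(p_i) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with p_i = 0 contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

# A fair coin is maximally uncertain; a biased coin less so.
print(shannon_entropy([0.5, 0.5]))   # 1.0 bit
print(shannon_entropy([0.9, 0.1]))   # ~0.469 bits
```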

Book Notes: semi-supervised clustering methods

Note:
The theory sections are all taken from Zhou Zhihua's "watermelon book" (西瓜书), a true masterpiece.
The remaining content comes from open-source package documentation, open-source e-books, ipynb notebooks, and similar sources.

# Semi-Supervised Clustering

Clustering is a typical unsupervised learning task. In real-world clustering tasks, however, we can often obtain some additional supervision, and semi-supervised clustering lets us exploit this extra information to obtain better clustering results.

The additional supervision available in clustering tasks comes in roughly two forms:

  • Constraints on samples:

    Must-link: the pair of samples must belong to the same cluster.

    Cannot-link: the pair of samples must not belong to the same cluster.

  • Sample labels:

    The supervision comes from a small number of labeled samples.


## Constrained $k$-Means Algorithm (pseudo-code)

The constrained $k$-means algorithm (Constrained k-means) is the canonical way to use the first type of supervision. Given a sample set $D = \{x_1, x_2, \ldots, x_m\}$, a "must-link" relation set $\cal{M}$, and a "cannot-link" relation set $\cal{C}$, $(x_i, x_j) \in \cal{M}$ means that $x_i$ and $x_j$ must belong to the same cluster, while $(x_i, x_j) \in \cal{C}$ means that $x_i$ and $x_j$ must not belong to the same cluster. The algorithm is an extension of $k$-means: during clustering it ensures that these constraints are satisfied, and otherwise returns an error. Its inputs are listed below, followed by a code sketch of the main loop.


Input: sample set $D = \{x_1, x_2, \ldots, x_m\}$;

    must-link constraint set $\cal{M}$;

    cannot-link constraint set $\cal{C}$;
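
The remaining steps follow the standard $k$-means assign-then-update loop, with the extra check that each assignment respects $\cal{M}$ and $\cal{C}$. A minimal NumPy sketch of that idea (variable names and structure are my own, not the book's listing):

```python
import numpy as np

def constrained_kmeans(X, k, must_link, cannot_link, n_iter=100, seed=0):
    """Assign each sample to the nearest feasible centroid, i.e. one whose
    choice violates no must-link / cannot-link constraint, then update means.
    Raises an error when no feasible cluster exists, as in the pseudo-code."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # initialize prototypes
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        new_labels = np.full(len(X), -1)
        for i, x in enumerate(X):
            # partners of sample i under each kind of constraint
            ml = [b if a == i else a for a, b in must_link if i in (a, b)]
            cl = [b if a == i else a for a, b in cannot_link if i in (a, b)]
            # try candidate clusters from nearest centroid to farthest
            for c in np.argsort(((centers - x) ** 2).sum(axis=1)):
                ok = all(new_labels[j] in (-1, c) for j in ml) and \
                     all(new_labels[j] != c for j in cl)
                if ok:
                    new_labels[i] = c
                    break
            if new_labels[i] == -1:
                raise ValueError(f"no feasible cluster for sample {i}")
        # recompute each cluster mean (keep the old one if a cluster is empty)
        centers = np.array([X[new_labels == c].mean(axis=0)
                            if np.any(new_labels == c) else centers[c]
                            for c in range(k)])
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers
```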

Book Notes: Optimization algorithms

## 8.3 Basic Algorithms

Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function $f(x)$ by altering $x$. The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.

Suppose we have a function $y = f(x)$, where both $x$ and $y$ are real numbers. The derivative of this function is denoted as $f'(x)$ or as $\frac{dy}{dx}$. The derivative $f'(x)$ gives the slope of $f(x)$ at the point $x$. In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output: $f(x + \epsilon) \approx f(x) + \epsilon f'(x)$.
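
A quick numerical check of that first-order approximation, using the toy function $f(x) = x^2$ (my own example, not from the book):

```python
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x, eps = 3.0, 0.01
approx = f(x) + eps * f_prime(x)      # first-order Taylor approximation
print(f(x + eps), approx)             # 9.0601 vs 9.06 -- they agree up to O(eps^2)
```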

Book Notes: second order approximation (1)

## Approximate Second-Order Methods

For simplicity of exposition, the only objective function we examine is the empirical risk: $$ \tag{8.25} J(\theta) = \mathbb{E}_{\mathbf{x, y} \sim \hat{p}_{data}} [L(f(x; \theta), y)] = \frac{1}{m} \sum^m_{i=1} L(f(x^{(i)}; \theta), y^{(i)}). $$ In contrast to first-order methods, second-order methods make use of the second derivatives to improve optimization. The most widely used second-order method is Newton’s method.

### 8.6.1 Newton’s Method

Newton’s method is an optimization scheme based on using a second-order Taylor series expansion to approximate $J(\theta)$ near some point $\theta_0$, ignoring derivatives of higher order: $$ \tag{8.26} J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^{\mathsf{T}} \nabla_{\theta} J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^{\mathsf{T}} H (\theta - \theta_0), $$ where $H$ is the Hessian of $J$ with respect to $\theta$ evaluated at $\theta_0$. If we then solve for the critical point of this function, we obtain the Newton parameter update rule: $$ \tag{8.27} \theta^* = \theta_0 - H^{-1} \nabla_{\theta} J(\theta_0). $$ Thus for a locally quadratic function (with positive definite $H$), by rescaling the gradient by $H^{-1}$, Newton’s method jumps directly to the minimum (note that $\theta^*$ denotes the $\theta$ that minimizes $J(\theta)$). If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm associated with Newton’s method, given in algorithm 8.8.
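
As a concrete illustration, here is a minimal sketch of the iterated Newton update of equation (8.27) on a small convex toy objective; the objective, `newton_step`, and all variable names are my own illustration, not the book's algorithm 8.8.

```python
import numpy as np

def newton_step(grad, hess, theta):
    """One Newton update: theta* = theta - H^{-1} grad(theta)  (eq. 8.27)."""
    return theta - np.linalg.solve(hess(theta), grad(theta))

# Toy convex objective J(theta) = theta_1^4 + theta_2^2.  It is not quadratic,
# so the update is iterated instead of reaching the minimum in one step.
grad = lambda t: np.array([4 * t[0] ** 3, 2 * t[1]])
hess = lambda t: np.array([[12 * t[0] ** 2, 0.0], [0.0, 2.0]])

theta = np.array([1.5, -2.0])
for _ in range(10):
    theta = newton_step(grad, hess, theta)
print(theta)   # approaches the minimizer (0, 0)
```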

Book Notes: clustering methods

Note:
The theory sections are all taken from Zhou Zhihua's "watermelon book" (西瓜书), a true masterpiece.
The remaining content comes from open-source package documentation, open-source e-books, ipynb notebooks, and similar sources.

# Prototype-Based Clustering

Prototype clustering, also known as "prototype-based clustering", assumes that the clustering structure can be characterized by a set of prototypes; it is extremely common in real-world clustering tasks. Typically, the algorithm first initializes the prototypes and then iteratively updates and refines them. Different prototype representations and different update strategies yield different algorithms.

## $k$-Means Clustering Algorithm

Given a sample set $D = \{x_1, \ldots, x_m\}$, the "$k$-means" algorithm minimizes, over the cluster partition $\mathcal{C} = \{C_1, \ldots, C_k\}$, the squared error $$ \tag{9.24} E = \sum^k_{i=1} \sum_{x \in C_i} \|x - \mu_i\|^2_2, $$ where $\mu_i = {1 \over |C_i|} \sum_{x \in C_i} x$ is the mean vector of cluster $C_i$. Intuitively, equation (9.24) measures how tightly the samples in a cluster are packed around the cluster mean vector; the smaller $E$ is, the more similar the samples within each cluster are.
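A small helper that evaluates the squared error $E$ of equation (9.24) for a given partition might look like this (a sketch; `kmeans_sse` and its arguments are my own naming):

```python
import numpy as np

def kmeans_sse(X, labels, k):
    """Squared error E of eq. (9.24): sum over clusters of squared distances
    from each sample to its cluster mean vector."""
    E = 0.0
    for i in range(k):
        cluster = X[labels == i]
        mu_i = cluster.mean(axis=0)              # mean vector of cluster C_i
        E += ((cluster - mu_i) ** 2).sum()
    return E
```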

Minimizing equation (9.24) is not easy: finding its exact optimum requires examining every possible cluster partition of the sample set $D$, which is an NP-hard problem. The $k$-means algorithm therefore adopts a greedy strategy and approximately solves equation (9.24) by iterative optimization. The algorithm flow is given below: line 1 initializes the mean vectors; lines 4-8 and lines 9-16 update the current cluster partition and the mean vectors in turn; if the clustering result does not change after an update, line 18 returns the current cluster partition.


$k$-means algorithm flow


Input: sample set $D = \{x_1, \ldots, x_m\}$;
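
The rest of the listing follows the familiar assign-then-update loop described above; a minimal NumPy sketch of that loop (illustrative naming, not the book's exact pseudo-code):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: greedily alternate cluster assignment and mean updates
    until the partition stops changing (approximate minimizer of eq. 9.24)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]    # line 1: initialize means
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # assign each sample to its nearest mean vector
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # partition unchanged: stop
            break
        labels = new_labels
        # recompute each cluster's mean vector (keep old mean if cluster empty)
        mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                       else mu[i] for i in range(k)])
    return labels, mu
```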