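Contents

1. Regression
- 1.1 Mean squared error
- 1.2 L1 loss
- 1.3 Smooth L1 loss
2. Classification
- 2.1 Hinge loss
- 2.2 Binary cross-entropy
- 2.3 Cross-entropy
3. Similarity
- 3.1 Cosine similarity
- 3.2 Relative entropy
4. Combinations to avoid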

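1. Regression

1.1 Mean squared error

Mean squared error (MSE) is the most commonly used loss function for regression. In PyTorch it is provided by

torch.nn.MSELoss()

The module is instantiated first and then called on a prediction and a target of the same shape, as in loss(prediction, target).

Let $y_i$ be the values in the training set and $\hat{y}_i$ the predicted values. The mean squared error is:
$MSE=\frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2$
When to use:
Because MSE squares each error, a few outliers produce very large terms that dominate the loss. MSE is therefore not recommended when the data contains outliers.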

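Example:

import torch
import torch.nn as nn

loss = nn.MSELoss()  # default reduction='mean'
input = torch.randn(3, 5, requires_grad=True)  # predictions
target = torch.randn(3, 5)  # ground-truth values
output = loss(input, target)  # scalar tensor
print(output)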
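To connect the formula with the module, the following minimal sketch recomputes MSE by hand and compares it with nn.MSELoss. The helper name manual_mse and the sample values are illustrative assumptions, not part of PyTorch.

```python
import torch
import torch.nn as nn


def manual_mse(prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: mean of squared differences over all elements."""
    return ((target - prediction) ** 2).mean()


# Illustrative values chosen so the squared errors are 0.25, 0, 1 and 4.
prediction = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.tensor([[1.5, 2.0], [2.0, 6.0]])

builtin = nn.MSELoss()(prediction, target)
manual = manual_mse(prediction, target)
print(builtin.item(), manual.item())  # both equal (0.25 + 0 + 1 + 4) / 4
```

With the default reduction='mean', the average is taken over every element of the tensor, not per row.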

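1.2 L1 loss

In PyTorch the L1 loss is provided by

torch.nn.L1Loss()

The module is instantiated first and then called as loss(input, target).

Let $y_i$ be the values in the training set and $\hat{y}_i$ the predicted values. The L1 loss (mean absolute error) is:
$L1=\frac{1}{N} \sum_{i=1}^N \vert y_i-\hat{y}_i \vert$
When to use:
L1 loss grows only linearly with the error, so it is robust when the target variable contains many outliers. It is the recommended choice in that situation.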

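Example:

import torch
import torch.nn as nn

loss = nn.L1Loss()  # default reduction='mean'
input = torch.randn(3, 5, requires_grad=True)  # predictions
target = torch.randn(3, 5)  # ground-truth values
output = loss(input, target)  # scalar tensor
print(output)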
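The difference in outlier sensitivity between MSE and L1 can be illustrated with a small comparison. This is a sketch on made-up numbers: the predictions and targets below are assumptions chosen only to make the effect visible.

```python
import torch
import torch.nn as nn

prediction = torch.tensor([1.1, 1.9, 3.2, 3.8])
clean_target = torch.tensor([1.0, 2.0, 3.0, 4.0])
outlier_target = torch.tensor([1.0, 2.0, 3.0, 14.0])  # last target is an outlier

mse, l1 = nn.MSELoss(), nn.L1Loss()

mse_clean = mse(prediction, clean_target).item()
mse_outlier = mse(prediction, outlier_target).item()
l1_clean = l1(prediction, clean_target).item()
l1_outlier = l1(prediction, outlier_target).item()

# How much each loss grows once the single outlier is introduced.
print("MSE grows by a factor of", mse_outlier / mse_clean)
print("L1 grows by a factor of", l1_outlier / l1_clean)
```

A single outlier inflates MSE far more than L1, because the large error is squared in one case and only taken in absolute value in the other.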

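1.3 Smooth L1 loss

In PyTorch the smooth L1 loss is provided by

torch.nn.SmoothL1Loss()

The module is instantiated first and then called as loss(input, target).

Let $y_i$ be the values in the training set and $\hat{y}_i$ the predicted values. The smooth L1 loss is:
$loss(y,\hat{y})=\frac{1}{N}\sum_{i=1}^N z_i$ where
$z_i= \begin{cases} 0.5(y_i-\hat{y}_i)^2, & \vert y_i - \hat{y}_i \vert <1 \\ \vert y_i - \hat{y}_i \vert -0.5, & \text{otherwise} \end{cases}$
When to use:
Smooth L1 combines L1Loss and L2Loss (MSE) and keeps some of the advantages of each. It is quadratic for small errors and linear for large ones, so the gradient with respect to each prediction stays bounded, which helps avoid exploding gradients. It suits most regression problems, especially when the features contain large values.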

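Example:

import torch
import torch.nn as nn

loss = nn.SmoothL1Loss()  # default reduction='mean'
input = torch.randn(3, 5, requires_grad=True)  # predictions
target = torch.randn(3, 5)  # ground-truth values
output = loss(input, target)  # scalar tensor
print(output)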
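The piecewise definition and the bounded-gradient claim can both be checked numerically. The sketch below assumes the threshold of 1 used in the formula (which matches the default of nn.SmoothL1Loss); manual_smooth_l1 is a hypothetical helper and the tensors are illustrative.

```python
import torch
import torch.nn as nn


def manual_smooth_l1(prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: piecewise loss with the threshold fixed at 1."""
    diff = (target - prediction).abs()
    z = torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return z.mean()


prediction = torch.tensor([0.0, 0.5, 3.0])
target = torch.tensor([0.2, 0.0, 0.0])
manual = manual_smooth_l1(prediction, target)
builtin = nn.SmoothL1Loss()(prediction, target)
print(manual.item(), builtin.item())

# Gradient for a single very large error (100): bounded for smooth L1, not for MSE.
big = torch.tensor([100.0], requires_grad=True)
zero = torch.tensor([0.0])
nn.SmoothL1Loss()(big, zero).backward()
grad_smooth = big.grad.item()
big.grad = None  # reset before the second backward pass
nn.MSELoss()(big, zero).backward()
grad_mse = big.grad.item()
print(grad_smooth, grad_mse)
```

In the linear region the smooth L1 gradient has magnitude 1 regardless of how large the error is, while the MSE gradient scales with the error itself.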

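2. Classification

2.1 Hinge loss

The hinge loss targets binary classification and assumes the labels satisfy $y \in \{1,-1\}$. In PyTorch the hinge loss is provided by

torch.nn.HingeEmbeddingLoss(margin=1.0)

The module is called as loss(input, target), where input is the model output (a score, or a distance between a pair of samples) and target holds the true labels, 1 or -1.

Let $y_i$ be the true class of a sample and $\hat{y}_i$ the predicted score. The standard hinge loss can be written as:
$hinge(\hat{y}_i,y_i)=\max(0,margin-\hat{y}_iy_i)$
The easiest way to understand this function is case by case.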

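- $y_i >0,\hat{y}_i>0$: the sample is classified correctly, but low-confidence predictions are still penalized. For example, with a predicted score of $0.1$ and $margin = 1$ the penalty is $0.9$. The higher the confidence, the smaller the penalty.
- $y_i <0,\hat{y}_i<0$: the sample is also classified correctly, and low-confidence predictions are penalized in the same way.
- $y_i <0,\hat{y}_i>0$: the sample is clearly misclassified, so $\hat{y}_iy_i<0$ and the penalty is larger than $margin$.
- $y_i >0,\hat{y}_i<0$: the sample is clearly misclassified, so $\hat{y}_iy_i<0$ and the penalty is larger than $margin$.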

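The hinge loss used by PyTorch's HingeEmbeddingLoss has a different form, where $\hat{y}$ is the input and $y$ the label:
$hinge(y,\hat{y})= \begin{cases} \hat{y}, & y=1 \\ \max(0,margin-\hat{y}), & y=-1 \end{cases}$

Here the input is treated as a distance: for $y=1$ the loss is the distance itself, and for $y=-1$ only distances smaller than $margin$ are penalized. In the standard hinge form, when $y$ is the positive class a negative model output receives a large penalty, and a positive output inside $(0, margin)$ still receives a small one. In other words, hinge loss penalizes not only wrong predictions but also correct predictions made with low confidence; only high-confidence predictions reach zero loss. Intuitively, using hinge loss means looking for a decision boundary that classifies every data point both correctly and with high confidence.

When to use:
Binary classification with labels in $\{-1,1\}$.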

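Example:

import torch
import torch.nn as nn

loss_f = nn.HingeEmbeddingLoss()  # default margin=1.0, reduction='mean'
inputs = torch.tensor([[1., 0.8, 0.5]])  # model outputs
target = torch.tensor([[1, 1, -1]])  # labels, each 1 or -1
output = loss_f(inputs, target)
print(output)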
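The result of the example can be reproduced by hand. The sketch below assumes the piecewise form documented for HingeEmbeddingLoss (the input itself when the label is 1, and max(0, margin - input) when the label is -1); manual_hinge_embedding is a hypothetical helper name.

```python
import torch
import torch.nn as nn


def manual_hinge_embedding(x: torch.Tensor, y: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Hypothetical helper: x where y == 1, max(0, margin - x) where y == -1, then the mean."""
    per_element = torch.where(y == 1, x, torch.clamp(margin - x, min=0.0))
    return per_element.mean()


inputs = torch.tensor([[1.0, 0.8, 0.5]])
target = torch.tensor([[1.0, 1.0, -1.0]])

manual = manual_hinge_embedding(inputs, target)
builtin = nn.HingeEmbeddingLoss()(inputs, target)
# Per-element losses are 1.0, 0.8 and max(0, 1 - 0.5) = 0.5, so the mean is 2.3 / 3.
print(manual.item(), builtin.item())
```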

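2.2 Binary cross-entropy

The binary cross-entropy loss targets binary classification and assumes the labels satisfy $y \in \{1,0\}$. In PyTorch it is provided by

torch.nn.BCELoss()

The module is called as loss(input, target), where input holds predicted probabilities in $[0,1]$.

The per-sample loss is:
$-y\log(\hat{y})-(1-y) \log(1- \hat{y})$
where $y$ is the true label and $\hat{y}$ is the predicted probability.

When to use:
Binary classification with labels in $\{0,1\}$.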

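Example:

import torch
import torch.nn as nn

m = nn.Sigmoid()
loss = nn.BCELoss()
input = torch.randn(3, requires_grad=True)
# Draw three labels, each 0 or 1
target = torch.empty(3).random_(2)
# BCELoss expects probabilities in [0, 1], so a sigmoid layer maps the raw values into (0, 1)
output = loss(m(input), target)
print(output)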
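As a cross-check of the binary cross-entropy formula, the sketch below computes $-y\log(\hat{y})-(1-y)\log(1-\hat{y})$ by hand and compares it with nn.BCELoss. It also compares with nn.BCEWithLogitsLoss, which takes the raw scores and applies the sigmoid internally. The logits and labels are illustrative assumptions.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])  # raw model outputs (illustrative)
target = torch.tensor([1.0, 0.0, 1.0])   # labels in {0, 1}

prob = torch.sigmoid(logits)
# Per-sample binary cross-entropy, averaged over the batch.
manual = (-(target * torch.log(prob) + (1 - target) * torch.log(1 - prob))).mean()

bce = nn.BCELoss()(prob, target)
bce_with_logits = nn.BCEWithLogitsLoss()(logits, target)
print(manual.item(), bce.item(), bce_with_logits.item())
```

When the model produces raw scores, BCEWithLogitsLoss is usually the more numerically stable way to get the same value.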

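2.3 Cross-entropy

This cross-entropy loss handles multi-class classification; binary cross-entropy is a special case of it. Cross-entropy measures how close the actual output is to the expected output. In PyTorch it is provided by

torch.nn.CrossEntropyLoss()

The module is called as loss(input, target), where input holds the raw, unnormalized class scores (logits) and target holds the class indices.

The binary form of cross-entropy is
$-\sum({y} \log(\hat{y})+(1-{y}) \log(1-\hat{y}))$
To reduce computation, PyTorch uses the categorical form instead:
$-\sum_c y_c \log(\hat{y}_c)$
where $y$ is the one-hot true label and $\hat{y}$ is the softmax of the model output, so only the term for the true class is nonzero.
When to use:
Binary and multi-class classification.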

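Example:

import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)  # logits for 3 samples and 5 classes
target = torch.empty(3, dtype=torch.long).random_(5)  # class indices in [0, 5)
output = loss(input, target)
print(output)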
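CrossEntropyLoss works on raw scores, which is easy to miss. The following sketch shows that it matches a log-softmax followed by picking the negative log-probability of the true class, and that it equals nn.NLLLoss applied to log-probabilities. The logits and targets are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])  # 2 samples, 3 classes (illustrative)
target = torch.tensor([0, 2])             # true class index per sample

log_probs = F.log_softmax(logits, dim=1)
# Negative log-probability of the true class, averaged over the batch.
manual = -log_probs[torch.arange(logits.size(0)), target].mean()

builtin = nn.CrossEntropyLoss()(logits, target)
via_nll = nn.NLLLoss()(log_probs, target)
print(manual.item(), builtin.item(), via_nll.item())
```

Because the softmax is applied inside the loss, the model should output logits directly rather than probabilities.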

3. Similarity

3.1 Cosine similarity

In PyTorch the cosine similarity loss is provided by

torch.nn.CosineEmbeddingLoss(margin=0.0)

The module is called as loss(x1, x2, target). It is commonly used to evaluate how similar two vectors are: the higher their cosine, the more similar they are.

Let the inputs be $x_1, x_2$ and the label be $y$. The loss is:
$loss(x_1,x_2,y)= \begin{cases} 1-\cos(x_1,x_2), & y=1 \\ \max(0,\cos(x_1,x_2)-margin), & y=-1 \end{cases}$ where
$\cos(A,B)=\frac{A \cdot B}{\Vert A \Vert\Vert B \Vert}=\frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}}$

- When $y = 1$, the loss is $-\cos(x_1, x_2)$ shifted up by 1, so vectors pointing in the same direction have zero loss.
- When $y = -1$, the loss is cut off at $\cos(x_1, x_2) = margin$: only pairs whose similarity exceeds $margin$ are penalized, which measures how dissimilar the two vectors are.

PyTorch lets you set $margin$ yourself. Its value lies in $[-1, 1]$, and a $margin$ in $[0, 0.5]$ is the usual choice.
When to use:
Measuring how similar two vectors are, based on the angle between them.

Example:

import torch
import torch.nn as nn

loss = nn.CosineEmbeddingLoss()  # default margin=0.0
inputs_1 = torch.tensor([[0.3, 0.5, 0.7], [0.3, 0.5, 0.7]])
inputs_2 = torch.tensor([[0.1, 0.3, 0.5], [0.1, 0.3, 0.5]])
# One label per pair, so the target is 1-D with shape (2,)
target = torch.tensor([1, -1], dtype=torch.float)
output = loss(inputs_1, inputs_2, target)
print(output)
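The piecewise cosine loss can be recomputed directly from the cosine similarity. The sketch below uses a hypothetical helper, manual_cosine_embedding, and the same vectors as the example, with a 1-D target holding one label per pair.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def manual_cosine_embedding(x1: torch.Tensor, x2: torch.Tensor, y: torch.Tensor,
                            margin: float = 0.0) -> torch.Tensor:
    """Hypothetical helper: 1 - cos for y == 1, max(0, cos - margin) for y == -1, then the mean."""
    cos = F.cosine_similarity(x1, x2, dim=1)
    per_pair = torch.where(y == 1, 1 - cos, torch.clamp(cos - margin, min=0.0))
    return per_pair.mean()


x1 = torch.tensor([[0.3, 0.5, 0.7], [0.3, 0.5, 0.7]])
x2 = torch.tensor([[0.1, 0.3, 0.5], [0.1, 0.3, 0.5]])
y = torch.tensor([1.0, -1.0])  # first pair labelled similar, second dissimilar

manual = manual_cosine_embedding(x1, x2, y)
builtin = nn.CosineEmbeddingLoss()(x1, x2, y)
print(manual.item(), builtin.item())
```

Both pairs are identical here, so the two per-pair losses are $1-c$ and $c$ for the same cosine $c$, and their mean is 0.5 regardless of $c$ as long as $c$ is positive.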

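3.2 Relative entropy

Relative entropy, also called KL divergence, describes the difference between two probability distributions $P$ and $Q$ and is written $D(P \Vert Q)$. In information theory it measures the average number of extra bits needed to encode samples from P using a code based on Q. In machine learning it is used to measure how similar or close two distributions are.

In PyTorch the relative entropy API is

torch.nn.KLDivLoss()

The module is called as loss(input, target), where input holds the log-probabilities of $Q$ and target holds the probabilities of $P$.

The discrete relative entropy is:
$\begin{aligned} D(P \Vert Q)&=\sum_i P(i) \log\frac{P(i)}{Q(i)} \\ &= \sum_i P(i) [\log P(i)-\log Q(i)] \end{aligned}$
When to use:
Measuring the similarity of probability distributions.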

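Example:

import torch
import torch.nn as nn

inputs = torch.tensor([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])  # Q, the model distribution
target = torch.tensor([[0.9, 0.05, 0.05], [0.1, 0.7, 0.2]], dtype=torch.float)  # P, the target distribution
# KLDivLoss expects log-probabilities as input; 'batchmean' matches D(P || Q) averaged over the batch
loss = nn.KLDivLoss(reduction='batchmean')
output = loss(inputs.log(), target)
print(output)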
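To tie KLDivLoss back to the formula for $D(P \Vert Q)$, the sketch below evaluates the sum by hand and compares it with the module. It assumes the input is passed as log-probabilities and that reduction='batchmean' is used, so the result is the per-sample divergence averaged over the batch.

```python
import torch
import torch.nn as nn

p = torch.tensor([[0.9, 0.05, 0.05], [0.1, 0.7, 0.2]])  # target distribution P
q = torch.tensor([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])    # model distribution Q

# D(P || Q) per sample, then averaged over the batch.
manual = (p * (p.log() - q.log())).sum(dim=1).mean()

# The first argument is log Q, the second is P.
builtin = nn.KLDivLoss(reduction="batchmean")(q.log(), p)
print(manual.item(), builtin.item())
```

Passing plain probabilities instead of log-probabilities as the first argument gives a different number that is not a KL divergence.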

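4. Combinations to avoid

$MSE$ and the $sigmoid$ function do not work well together.

The $MSE$ formula is:
$MSE=\frac{1}{N}\sum_{i=1}^N (y_i-\hat{y}_i)^2$
When it is combined with a sigmoid output $\hat{y}_i=\sigma(wx_i+b)$, the partial derivative of the per-sample loss $Loss_i=(y_i-\hat{y}_i)^2$ is
$\frac{\partial Loss_i}{\partial w}=-2(y_i-\hat{y}_i) \sigma '(wx_i+b)x_i$ where
$\sigma '(wx_i+b)=\sigma (wx_i+b) (1-\sigma (wx_i+b))$
So whenever $\sigma (wx_i+b)$ is close to 0 or 1, the derivative is close to 0, even if the prediction is far from the target. This can make the model learn very slowly at the start of training, which is why the two are not a good match.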
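The vanishing gradient can be observed with a single weight. The sketch below is an illustration under simple assumptions: one sample with $x=1$, $y=1$, no bias term, and a deliberately bad initial weight so that the sigmoid saturates near 0. It compares the gradient produced by MSE with the one produced by binary cross-entropy; grad_w is a hypothetical helper.

```python
import torch
import torch.nn as nn

x = torch.tensor([1.0])
y = torch.tensor([1.0])


def grad_w(loss_fn: nn.Module, w_init: float) -> float:
    """Hypothetical helper: gradient of the loss w.r.t. w for y_hat = sigmoid(w * x)."""
    w = torch.tensor([w_init], requires_grad=True)
    y_hat = torch.sigmoid(w * x)
    loss_fn(y_hat, y).backward()
    return w.grad.item()


w_init = -8.0  # sigmoid(-8) is about 0.0003, so the prediction is badly wrong
grad_mse = grad_w(nn.MSELoss(), w_init)
grad_bce = grad_w(nn.BCELoss(), w_init)
print("MSE gradient:", grad_mse)
print("BCE gradient:", grad_bce)
```

With cross-entropy the $\sigma'$ factor cancels and the gradient reduces to $(\hat{y}-y)x$, so a badly wrong prediction still yields a large update, whereas the MSE gradient is scaled down by the saturated $\sigma'$.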