Discussion of Imbalance Problems (2022/05/26)

xiaoxingxing • May 28, 2022, 4:23 PM • Technical article


Imbalance in Continuous Distributions

Real-world data often exhibit imbalanced distributions, where certain target values have significantly fewer observations. 

Existing techniques for dealing with imbalanced data focus on targets with categorical indices, i.e., different classes.

However, many tasks involve continuous targets, where hard boundaries between classes do not exist.

This causes ambiguity when directly applying traditional imbalanced classification methods such as re-sampling and re-weighting.

Moreover, continuous labels inherently possess a meaningful distance between targets, which has implications for how we should interpret data imbalance.
\begin{figure}[h]
	\centering
	\includegraphics[width = 5in]{distance.png}
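	\caption{For continuous labels, the distance between target values is meaningful, and this distance guides how we should interpret the degree of data imbalance over the continuous range. In the figure, t1 and t2 have the same number of training samples, but because t1 lies in a high-density neighborhood and t2 in a low-density one, they do not suffer from the same degree of data imbalance.}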
	\label{distance}
\end{figure}

How to Address Imbalance

\begin{itemize}
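	\item Re-sampling: a simple, brute-force way to address class imbalance. It comes in two forms: over-sampling the minority classes, or under-sampling the majority classes. These classical methods tend to perform poorly. Over-sampling easily overfits to the minority classes and fails to learn robust, generalizable features, so generalization is often worse on highly imbalanced data; under-sampling discards a large amount of information from the majority classes and can even cause underfitting.
	\item Synthetic samples: to avoid repeatedly sampling identical examples, one option is to generate ``new'' data that resembles the minority samples. The crudest approach adds random Gaussian noise to minority-class samples as a form of data smoothing. A classic method of this kind is SMOTE: roughly, for a randomly chosen minority-class sample, it selects similar samples via K-nearest neighbors and creates new samples by linear interpolation between them.
	\item Re-weighting: as the name suggests, re-weighting assigns different weights to different classes (or even different samples), mainly by re-weighting the per-class loss to handle long-tailed distributions. The weights can be adaptive. There are many variants: weighting by the inverse of the class frequency, weighting by the ``effective'' number of samples, loss weighting that optimizes the classification margin according to sample counts, and so on.
	\item Transfer learning: the basic idea is to model majority-class and minority-class samples separately, and to transfer the information/representations/knowledge learned from the majority classes to the minority classes.
	\item Decoupling representation and classifier: recent work finds that decoupling feature learning from classifier learning, that is, splitting imbalanced learning into two stages with regular sampling during feature learning and balanced sampling during classifier learning, yields better long-tailed learning results.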
\end{itemize}
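
To make the re-weighting item concrete, here is a minimal sketch of two of the variants mentioned above: inverse-frequency weights and weights based on the ``effective'' number of samples, $(1-\beta^{n})/(1-\beta)$. The function names, the class counts, the value $\beta = 0.999$, and the convention of rescaling the weights to sum to the number of classes are all assumptions made for illustration, not details taken from the source.

```python
import numpy as np


def inverse_frequency_weights(counts):
    """Weight each class by 1 / n_c, rescaled to sum to the number of classes."""
    counts = np.asarray(counts, dtype=float)
    weights = 1.0 / counts
    return weights * len(counts) / weights.sum()


def effective_number_weights(counts, beta=0.999):
    """Weight each class by the inverse of its effective number of samples.

    The effective number is (1 - beta**n_c) / (1 - beta); beta is a
    hypothetical hyperparameter in [0, 1) chosen here for illustration.
    """
    counts = np.asarray(counts, dtype=float)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights * len(counts) / weights.sum()


# Hypothetical long-tailed class counts: one head class, two tail classes.
counts = [5000, 500, 50]
inv_w = inverse_frequency_weights(counts)
eff_w = effective_number_weights(counts)
print("inverse-frequency weights:", inv_w)
print("effective-number weights: ", eff_w)
```

Both schemes give the rarest class the largest weight, but the effective-number variant is less extreme, because the effective number saturates for very large classes.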

Imbalance in Continuous Distributions

	\begin{figure}[h]
	\centering
	\includegraphics[width = 5in]{presentation.png}
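	\caption{Comparison of the test error distribution (bottom) on two different datasets using the same training label distribution (top): (a) CIFAR-100, a classification task with a categorical label space. (b) IMDB-WIKI, a regression task with a continuous label space.}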
	\label{presentation}
	\end{figure}
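	Figure~\ref{presentation} shows that, for continuous labels, the empirical label density, i.e., the directly observed label density, does not accurately reflect the imbalance as seen by the model or neural network.
	Hence, in the continuous case, the empirical label density does not reflect the actual label density distribution.
	This is because data samples with nearby labels (e.g., images of people of similar age) are correlated with, or dependent on, one another.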
	\subsection{Label Distribution Smoothing}
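	The paper proposes Label Distribution Smoothing (LDS) to estimate the effective label density distribution in the continuous-label setting.
	The method borrows the idea of kernel density estimation from statistical learning to estimate the expected density in this setting.
	Specifically, given a continuous empirical label density distribution, LDS convolves it with a symmetric kernel function $k$ to obtain a kernel-smoothed version, called the effective label density, which directly accounts for the information overlap among data samples with nearby labels. A symmetric kernel is one that satisfies: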
	$$\mathrm{k}\left(y, y^{\prime}\right)=\mathrm{k}\left(y^{\prime}, y\right),
	\nabla_{y} \mathrm{k}\left(y, y^{\prime}\right)+\nabla_{y^{\prime}} \mathrm{k}\left(y^{\prime}, y\right)=0, \forall y, y^{\prime} \in \mathcal{Y}$$
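	As a quick check that is not part of the source, consider the Gaussian kernel with an assumed bandwidth $\sigma > 0$, a common choice used here purely for illustration:
	$$
	\mathrm{k}\left(y, y^{\prime}\right)=\exp \left(-\frac{\left(y-y^{\prime}\right)^{2}}{2 \sigma^{2}}\right)
	$$
	Symmetry holds because $\left(y-y^{\prime}\right)^{2}=\left(y^{\prime}-y\right)^{2}$. For the gradient condition,
	$$
	\nabla_{y} \mathrm{k}\left(y, y^{\prime}\right)=-\frac{y-y^{\prime}}{\sigma^{2}} \mathrm{k}\left(y, y^{\prime}\right), \quad
	\nabla_{y^{\prime}} \mathrm{k}\left(y^{\prime}, y\right)=-\frac{y^{\prime}-y}{\sigma^{2}} \mathrm{k}\left(y^{\prime}, y\right)=\frac{y-y^{\prime}}{\sigma^{2}} \mathrm{k}\left(y, y^{\prime}\right)
	$$
	so the two terms cancel and their sum is zero for all $y, y^{\prime} \in \mathcal{Y}$.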
	\begin{figure}[h]
		\centering
		\includegraphics[width = 5in]{labelsmooth.png}
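		\caption{Label Distribution Smoothing (LDS) convolves a symmetric kernel with the empirical label density to obtain the effective label density distribution.}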
		\label{labelsmooth}
	\end{figure}
	$$
	\tilde{p}\left(y^{\prime}\right) \triangleq \int_{\mathcal{Y}} \mathrm{k}\left(y, y^{\prime}\right) p(y) d y
	$$
	where $p(y)$ is the number of appearances of label $y$ in the training data, and $\tilde{p}(y^{\prime})$ is the effective density of label $y^{\prime}$.
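
The integral above can be approximated on binned labels by a discrete convolution. The following is a minimal sketch under several assumptions that do not come from the source: labels are already discretized into integer bins, the kernel is a truncated Gaussian normalized to sum to one, and the names `gaussian_kernel` and `effective_label_density` as well as the toy label counts are hypothetical.

```python
import numpy as np


def gaussian_kernel(half_width: int, sigma: float) -> np.ndarray:
    """Symmetric Gaussian kernel on integer offsets, normalized to sum to 1."""
    offsets = np.arange(-half_width, half_width + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()


def effective_label_density(labels, num_bins, half_width=2, sigma=1.0):
    """Return (empirical, effective) label densities over `num_bins` bins.

    `labels` must be non-negative integers smaller than `num_bins`.
    The effective density is the empirical count histogram convolved
    with a symmetric kernel (zero-padded at the boundaries).
    """
    empirical = np.bincount(labels, minlength=num_bins).astype(float)
    kernel = gaussian_kernel(half_width, sigma)
    effective = np.convolve(empirical, kernel, mode="same")
    return empirical, effective


# Toy data: bins 0-2 are densely populated, bin 7 is isolated and sparse.
labels = np.array([0] * 50 + [1] * 50 + [2] * 50 + [7] * 5)
empirical, effective = effective_label_density(labels, num_bins=10)
print("empirical:", empirical)
print("effective:", np.round(effective, 2))
```

In this toy example, bin 3 has no samples of its own but still receives non-zero effective density from its populated neighbors, while the isolated bin 7 stays sparse. One possible use of the smoothed density is as the basis for re-weighting, for instance with weights inversely proportional to it.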
	\subsection{Feature Distribution Smoothing}
	The paper still needs a closer read before this part can be summarized.

Open Question

	Does the imbalance problem described in this paper also apply to label distribution learning?
