The NB (Naive Bayes) Algorithm

Basic theory:
1. Conditional probability

P(AB)=P(A)P(B|A)                                                                                                             (1) 

P(A) is the probability that event A occurs;
P(B|A) is the probability that B occurs given that A has occurred;
P(AB) is the probability that events A and B both occur.

Example: consider the symbolic (nominal) dataset below.

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

Let A be the event that the day is sunny, i.e., outlook = sunny, and B the event that the humidity is high, i.e., humidity = high.

The dataset contains 14 days, of which 5 have outlook = sunny, so

P(A) = P(outlook = sunny) = 5 / 14

Among the 5 days with outlook = sunny, 3 have humidity = high, so

P(B|A) = P(humidity = high | outlook = sunny) = 3 / 5

By the conditional probability formula, the probability that a day is both sunny and highly humid is P(AB) = P(A)P(B|A) = 3 / 14.
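
To make the counting concrete, here is a minimal, self-contained Java sketch (the class name CondProbDemo and the hardcoded rows are illustrative only; the classifier implementation appears later):

public class CondProbDemo {
	public static void main(String[] args) {
		// Each row: outlook, temperature, humidity, windy, play (the 14 days above).
		String[][] data = {
				{ "sunny", "hot", "high", "FALSE", "no" }, { "sunny", "hot", "high", "TRUE", "no" },
				{ "overcast", "hot", "high", "FALSE", "yes" }, { "rainy", "mild", "high", "FALSE", "yes" },
				{ "rainy", "cool", "normal", "FALSE", "yes" }, { "rainy", "cool", "normal", "TRUE", "no" },
				{ "overcast", "cool", "normal", "TRUE", "yes" }, { "sunny", "mild", "high", "FALSE", "no" },
				{ "sunny", "cool", "normal", "FALSE", "yes" }, { "rainy", "mild", "normal", "FALSE", "yes" },
				{ "sunny", "mild", "normal", "TRUE", "yes" }, { "overcast", "mild", "high", "TRUE", "yes" },
				{ "overcast", "hot", "normal", "FALSE", "yes" }, { "rainy", "mild", "high", "TRUE", "no" } };
		int sunny = 0, sunnyAndHigh = 0;
		for (String[] day : data) {
			if (day[0].equals("sunny")) {
				sunny++;
				if (day[2].equals("high")) {
					sunnyAndHigh++;
				} // Of if
			} // Of if
		} // Of for day
		System.out.println("P(A)   = " + sunny + "/" + data.length);        // 5/14
		System.out.println("P(B|A) = " + sunnyAndHigh + "/" + sunny);       // 3/5
		System.out.println("P(AB)  = " + sunnyAndHigh + "/" + data.length); // 3/14
	}
}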

2. Laplacian smoothing

First, what is the zero-probability problem? When estimating the probability of an event, if the event never appears in the training set, its estimated probability comes out as 0.

How do we solve this? Laplacian smoothing exists precisely to solve the zero-probability problem.

Example: suppose there are three letter-only texts w1, w2, w3, and we count occurrences of the letter k in a given training sample. The counts are 0, 990, and 10, giving probabilities 0/1000, 990/1000, and 10/1000. The first is a zero-probability case. Applying Laplacian smoothing, we add 1 to each numerator and 3 to each denominator (there are three texts); the +3 in the denominator matches the +1 added to each of the three numerators, so the smoothed probabilities still sum to 1.

After smoothing we get:

(0 + 1)/ (1000 + 3) = 1 / 1003

(990+1) /(1000 + 3) = 991 / 1003

(10 + 1)/ (1000 + 3) = 11 / 1003
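
The same arithmetic as a minimal Java sketch (the class name LaplaceDemo and its helper smooth are illustrative only, not part of the classifier below):

public class LaplaceDemo {
	// Laplacian smoothing: add 1 to each count and add the number of
	// categories to the total, so the smoothed probabilities still sum to 1.
	static double smooth(int count, int total, int numCategories) {
		return (count + 1.0) / (total + numCategories);
	}

	public static void main(String[] args) {
		int[] counts = { 0, 990, 10 };
		int total = 1000;
		for (int count : counts) {
			// Prints 1/1003, 991/1003 and 11/1003 as decimals.
			System.out.println(smooth(count, total, counts.length));
		}
	}
}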

Algorithm idea:

Let X = x_{1} \wedge x_{2} \wedge x_{3} \wedge \dots \wedge x_{m} denote a combination of conditions, where m is the total number of conditions.

For example: outlook = sunny \wedge temperature = hot \wedge humidity = high \wedge windy = FALSE.

Let D_{i} denote a decision event, e.g., play = no in the dataset above. From Eq. (1):

P(D_{i}|X) = \frac{P(XD_{i})}{P(X)} = \frac{P(D_{i})P(X|D_{i})}{P(X)}                                                                                  (2)


Now we make a bold assumption: the conditions are mutually independent given the class D_{i}. Then

P(X|D_{i}) = P(x_{1}|D_{i})P(x_{2}|D_{i})...P(x_{m}|D_{i}) = \prod_{j = 1}^{m}P(x_{j}|D_{i})                                                 (3)

Combining Eqs. (2) and (3):

P(D_{i}|X) = \frac{P(XD_{i})}{P(X)} = \frac{P(D_{i})P(X|D_{i})}{P(X)} = \frac{P(D_{i}) \prod_{j = 1}^{m}P(x_{j}|D_{i})}{P(X)}                                    (4) 

The term P(x_{j}|D_{i}) in Eq. (4) is easy to obtain (e.g., the probability of outlook = sunny given play = no).

P(D_{i}) is also easy to obtain (e.g., the probability of play = no).

However, Eq. (4) cannot actually be evaluated, because the denominator P(X) cannot be computed. Fortunately, we only need to compare the probability of "playing under given weather conditions" with that of "not playing under the same conditions". The two expressions share the same denominator, so it suffices to compare the numerators.

Hence:

d(X) = \mathop{argmax}\limits_{1\leq i\leq k}P(D_{i}|X) = \mathop{argmax}\limits_{1\leq i\leq k}P(D_{i}) \prod_{j = 1}^{m}P(x_{j}|D_{i})                                                     (5)

where d(X) is the index of the predicted decision class.

argmax means we predict whichever class has the highest relative probability.

k is the number of decision classes; e.g., for play = {yes, no}, k = 2.

Example: with play = {yes, no} and k = 2, if P(D_{i}|X) is larger for i = 1 than for i = 2, then d(X) = 1, i.e., play = yes.
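
As a concrete instance of Eq. (5), take the first day of the dataset, X = (outlook=sunny \wedge temperature=hot \wedge humidity=high \wedge windy=FALSE), with probabilities read directly off the 14 rows above:

P(no)\prod_{j = 1}^{4}P(x_{j}|no) = \frac{5}{14} \cdot \frac{3}{5} \cdot \frac{2}{5} \cdot \frac{4}{5} \cdot \frac{2}{5} \approx 0.0274

P(yes)\prod_{j = 1}^{4}P(x_{j}|yes) = \frac{9}{14} \cdot \frac{2}{9} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{6}{9} \approx 0.0071

The "no" numerator is larger, so Eq. (5) predicts play = no, which matches the label of that day.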

However, Eq. (5) runs into the zero-probability problem at prediction time.

Example:

In the dataset above, P(outlook=overcast|play=no) = 0: on none of the "no" days is the weather overcast. So for a new overcast day, the estimated probability of not playing is 0.
Suppose further (hypothetically; it does not hold in this dataset) that P(temperature=hot|play=yes) = 0, i.e., no "play" day is hot. Then for a new hot day, the estimated probability of playing would be 0.
In that case, when outlook=overcast \wedge temperature=hot, the estimated probabilities of both playing and not playing are 0, and Eq. (5) cannot decide.

The root of the problem is the product P(D_{i}) \prod_{j = 1}^{m}P(x_{j}|D_{i}) in Eq. (5): if any single factor is 0, the whole product is 0. Laplacian smoothing solves this. The smoothed conditional probability is:

P^{L}(x_{j}|D_{i}) = \frac{nP(x_{j}D_{i})+1}{nP(D_{i})+v_{j}}                                                                                                (6)

Here n is the number of instances and v_{j} is the number of possible values of the j-th attribute. Since probabilities are less than 1, they must be multiplied by n to recover integer counts before Laplacian smoothing is applied: nP(x_{j}D_{i}) is the count of instances with value x_{j} and class D_{i}, and nP(D_{i}) is the count of instances in class D_{i}.
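
As a concrete instance of Eq. (6), take the zero case above: the count of (outlook = overcast, play = no) is 0, the class play = no contains nP(D_{i}) = 5 instances, and outlook takes v_{j} = 3 values, so

P^{L}(outlook=overcast|play=no) = (0 + 1) / (5 + 3) = 1/8

which is small but no longer 0.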

Smoothing P(D_{i}) in the same way gives:

P^{L}(D_{i}) = \frac{nP(D_{i}) +1}{n+k}                                                                                                          (7)

Again, k is the number of decision classes; for play = {yes, no}, k = 2.

Since every probability is less than 1, the product of many of them becomes very small and risks numerical underflow. Because argmax only asks which class attains the larger value, we can take logarithms without affecting the result. Finally:

d(X) = \mathop{argmax}\limits_{1\leq i\leq k}\left(\log P^{L}(D_{i}) + \sum_{j=1}^{m}\log P^{L}(x_{j}|D_{i})\right)                                                                 (8)
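A minimal, self-contained Java sketch of why the log trick matters (the class name LogTrickDemo and the 0.01 factors are illustrative only): multiplying many small probabilities underflows double precision, while summing their logarithms stays finite:

public class LogTrickDemo {
	public static void main(String[] args) {
		double product = 1.0;
		double logSum = 0.0;
		// 500 factors of 0.01: the raw product underflows to 0.0,
		// but the log-sum remains an ordinary finite number.
		for (int i = 0; i < 500; i++) {
			product *= 0.01;
			logSum += Math.log(0.01);
		}
		System.out.println("product = " + product); // 0.0 (underflow)
		System.out.println("log sum = " + logSum);  // about -2302.6
	}
}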

Input: none.

Output: P(D_{i}) for 1 <= i <= k and P^{L}(D_{i}) for 1 <= i <= k (k is the number of decision classes), plus the prediction accuracy.

Optimization objective: arguably none (Naive Bayes fits no parameters by optimization).

package knn5;
import java.io.FileReader;
import java.util.Arrays;
import weka.core.*;
public class NaiveBayes {

	private class GaussianParameters {
		double mu;
		double sigma;

		public GaussianParameters(double paraMu, double paraSigma) {
			mu = paraMu;
			sigma = paraSigma;
		}// Of the constructor

		public String toString() {
			return "(" + mu + ", " + sigma + ")";
		}// Of toString
	}// Of GaussianParameters

	Instances dataset;
	int numClasses;
	int numInstances;
	int numConditions;
	int[] predicts;
	double[] classDistribution;
	double[] classDistributionLaplacian;
	double[][][] conditionalCounts;
	double[][][] conditionalProbabilitiesLaplacian;
	GaussianParameters[][] gaussianParameters;
	int dataType;
	public static final int NOMINAL = 0;
	public static final int NUMERICAL = 1;
	
	public NaiveBayes(String paraFilename) {
		dataset = null;
		try {
			FileReader fileReader = new FileReader(paraFilename);
			dataset = new Instances(fileReader);
			fileReader.close();
		}catch(Exception ee) {
			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
			System.exit(0);
		}//Of try
		dataset.setClassIndex(dataset.numAttributes() - 1);
		numConditions = dataset.numAttributes() - 1;
		numInstances = dataset.numInstances();
		numClasses = dataset.attribute(numConditions).numValues();
	}// Of the constructor
	
	public void setDataType(int paraDataType) {
		dataType = paraDataType;
	}// Of setDataType
	
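	// Compute the class distribution P(D_i) and its Laplacian-smoothed version, Eq. (7).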
	public void calculateClassDistribution() { 
		classDistribution = new double[numClasses];
		classDistributionLaplacian = new double[numClasses];
		
		double[]tempCounts = new double[numClasses];
		for(int i = 0;i < numInstances; i++) {
			int tempClassValue = (int)dataset.instance(i).classValue();
		    tempCounts[tempClassValue]++;
		}//Of for i
		for(int i = 0; i < numClasses; i++) {
			classDistribution[i] = tempCounts[i]/numInstances;
			classDistributionLaplacian[i] = (tempCounts[i] + 1)/(numInstances + numClasses);
		}//Of for i
		System.out.println("Class distribution: " + Arrays.toString(classDistribution));
		System.out.println("Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
	}// Of calculateClassDistribution
	
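	// Count attribute-value occurrences per class, then compute the Laplacian-smoothed
	// conditional probabilities P^L(x_j|D_i) of Eq. (6).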
	public void calculateConditionalProbabilities() {
		conditionalCounts = new double[numClasses][numConditions][];
		conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];
		for(int i = 0; i < numClasses; i++) {
			for(int j = 0; j < numConditions; j++) {
				int tempNumValues = (int)dataset.attribute(j).numValues() ;
				conditionalCounts[i][j] = new double[tempNumValues];
				conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
			}//Of for j
		}//Of for i
		int[] tempClassCounts = new int[numClasses];
		for(int i = 0; i < numInstances; i++) {
			int tempClass = (int)dataset.instance(i).classValue();
			tempClassCounts[tempClass] ++;
			for(int j = 0; j < numConditions; j++) {
				int tempValue = (int)dataset.instance(i).value(j);
	            conditionalCounts[tempClass][j][tempValue]++;
			}//Of for j	
		}//Of for i
		
		for(int i = 0; i < numClasses; i++) {
			for(int j = 0; j < numConditions; j++) {
				int tempNumValues = (int)dataset.attribute(j).numValues() ;
				for(int k = 0; k < tempNumValues; k++) {
					conditionalProbabilitiesLaplacian[i][j][k] = (conditionalCounts[i][j][k] + 1)
							/ (tempClassCounts[i] + tempNumValues);
				}// Of for k
			} // Of for j
		} // Of for i
		System.out.println("Conditional probabilities: " + Arrays.deepToString(conditionalCounts));
	}// Of calculateConditionalProbabilities
	
	public void classify() {
		predicts = new int[numInstances];
		for(int i = 0; i < numInstances; i++) {
			predicts[i] = classify(dataset.instance(i));
		}//Of for i
	}//Of classify
	
	public int classify(Instance paraInstance) {
		if(dataType == NOMINAL) {
			return classifyNominal(paraInstance);			
		}else if(dataType == NUMERICAL) {
			return classifyNumerical(paraInstance);
		}//Of if
		return -1;
	}//Of classify
	
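	// Classify one nominal instance via Eq. (8): argmax over classes of
	// log P^L(D_i) + sum_j log P^L(x_j|D_i).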
	public int classifyNominal(Instance paraInstance) {
		double tempBiggest = -10000;
		int resultBestIndex = 0;
		for(int i = 0; i < numClasses; i++) {
			double tempClassProbabilityLaplacian = Math.log(classDistributionLaplacian[i]);
			double tempPseudoProbability = tempClassProbabilityLaplacian;
			for (int j = 0; j < numConditions; j++) {
				int tempAttributeValue = (int) paraInstance.value(j);

				tempPseudoProbability += Math.log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
			} // Of for j

			if (tempBiggest < tempPseudoProbability) {
				tempBiggest = tempPseudoProbability;
				resultBestIndex = i;
			} // Of if
		} // Of for i

		return resultBestIndex;
	}// Of classifyNominal

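	// Classify one numerical instance, assuming Gaussian class-conditional densities;
	// the density's constant factor is dropped since it does not affect the argmax.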
    public int classifyNumerical(Instance paraInstance) {
    	double tempBiggest = -10000;
		int resultBestIndex = 0;

		for (int i = 0; i < numClasses; i++) {
			double tempClassProbabilityLaplacian = Math.log(classDistributionLaplacian[i]);
			double tempPseudoProbability = tempClassProbabilityLaplacian;
			for (int j = 0; j < numConditions; j++) {
				double tempAttributeValue = paraInstance.value(j);
				double tempSigma = gaussianParameters[i][j].sigma;
				double tempMu = gaussianParameters[i][j].mu;

				tempPseudoProbability += -Math.log(tempSigma) - (tempAttributeValue - tempMu)
						* (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
			} // Of for j

			if (tempBiggest < tempPseudoProbability) {
				tempBiggest = tempPseudoProbability;
				resultBestIndex = i;
			} // Of if
		} // Of for i

		return resultBestIndex;
	}// Of classifyNumerical
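
	// Note: gaussianParameters is never filled in by the original post, so
	// classifyNumerical above would fail on numerical data. A minimal sketch of
	// the missing step (an assumed helper, not from the original): estimate the
	// per-class, per-attribute mean and standard deviation from the dataset.
	public void calculateGaussianParameters() {
		gaussianParameters = new GaussianParameters[numClasses][numConditions];
		for (int i = 0; i < numClasses; i++) {
			for (int j = 0; j < numConditions; j++) {
				double tempSum = 0;
				int tempCount = 0;
				for (int k = 0; k < numInstances; k++) {
					if ((int) dataset.instance(k).classValue() == i) {
						tempSum += dataset.instance(k).value(j);
						tempCount++;
					} // Of if
				} // Of for k
				double tempMu = tempSum / tempCount;

				double tempVariance = 0;
				for (int k = 0; k < numInstances; k++) {
					if ((int) dataset.instance(k).classValue() == i) {
						double tempDiff = dataset.instance(k).value(j) - tempMu;
						tempVariance += tempDiff * tempDiff;
					} // Of if
				} // Of for k

				gaussianParameters[i][j] = new GaussianParameters(tempMu, Math.sqrt(tempVariance / tempCount));
			} // Of for j
		} // Of for i
	}// Of calculateGaussianParameters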

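	// Training-set accuracy: the fraction of instances whose predicted class
	// equals the actual class.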
    public double computeAccuracy() {
		double tempCorrect = 0;
		for (int i = 0; i < numInstances; i++) {
			if (predicts[i] == (int) dataset.instance(i).classValue()) {
				tempCorrect++;
			} // Of if
		} // Of for i

		double resultAccuracy = tempCorrect / numInstances;
		return resultAccuracy;
	}// Of computeAccuracy

    public static void testNominal() {
		System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
		String tempFilename = "C:\\Users\\ASUS\\Desktop\\文件\\mushroom.arff";

		NaiveBayes tempLearner = new NaiveBayes(tempFilename);
		tempLearner.setDataType(NOMINAL);
		tempLearner.calculateClassDistribution();
		tempLearner.calculateConditionalProbabilities();
		tempLearner.classify();

		System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
	}// Of testNominal

	public static void main(String[] args) {
		testNominal();
	}// Of main
}// Of class NaiveBayes

Run screenshot: (image omitted)
