机器学习-03-NB算法+Laplacian平滑

学习来源:日撸 Java 三百行(51-60天,kNN 与 NB)_闵帆的博客-CSDN博客_knn算法java实现[0]

数据集

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

条件概率

P(AB)=P(A)P(B|A)                                                          (1)

其中,P(AB)表示事件A和B同时发生的概率;P(A)表示事件A发生的概率;P(B|A)表示在事件A发生的情况下,事件B发生的概率。

独立性假设

令x = x1∧x2∧…∧xn 表示一个条件组合,例如:outoutlook=sunny∧temperature=hot∧humidity=high∧windy=FALSE(对应数据集的第一行)。再令D_{i}表示play=yes/no的事件。假设对于x中每个条件都是独立的,所以:

P(x|D_{i})=P(x_{1}|D_{i})P(x_{2}|D_{i})...P(x_{n}|D_{i}) =\prod _{j=1}^{n}P(x_{j}|D_{i})                           (2)

各类别的的概率

根据式(1)得:

P(D_{i}|x) = \frac{P(xD_{i})}{P(x)} = \frac{P(D_{i})P(x|D_{i})}{P(x)}                                                (3)

因为对于每个D_{i}除以P(x)

d(i) = \max_{1\leq i\leq k} P(D_{i}|x) = \max_{1\leq i\leq k}P(D_{i})\prod _{j = 1}^{n}P(x_{j}|D_{i})                                (4)

并且由于P(D_{i})P(x|D_{i})都属于小数,乘积之后会导致值特别小,所以采用log的方法解决:

d(i) = \max_{1\leq i\leq k} P(D_{i}|x) = \max_{1\leq i\leq k}\log P(D_{i}) + \sum _{j = 1}^{m}\log P(x_{j}|D_{i})                      (5)

Laplacian平滑

由于在求P(x_{j}|D_{i})可能会是0,例如在数据集中可以发现在不出去玩时,天气不可能是多云,或出去玩时,温度不可能高。如果在预测的那一天这两个条件都满足,出不出去玩的概率都是0。所以为了解决这种问题就引入Laplacian平滑:

P^{L}(x_{j}|D_{i}) =\frac{nP(x_{j}D_{i}) + 1}{nP(D_{i}) + v_{j}}                                                (6)

其中,n代表属于i的类别个数,v_{j}代表j属性的属性值个数。 P(D_{i})还需要平滑:

P^{L}(D_{i}) = \frac{nP(D_{i}) + 1}{n + c}                                                        (7)

其中,c代表属性个数。最终的优化目标是:

d(i) = \max_{1\leq i\leq k}\log P^{L}(D_{i}) + \sum _{j = 1}^{m}\log P^{L}(x_{j}|D_{i})                             (8)

Code:

package 日撸Java300行_51_60;


import java.io.FileReader;
import java.util.Arrays;

import weka.core.*;

public class NaiveBayes {
	/**
	 ************************* 
	 * An inner class to store parameters.
	 ************************* 
	 */
	private class GaussianParamters {
		double mu;
		double sigma;

		public GaussianParamters(double paraMu, double paraSigma) {
			mu = paraMu;
			sigma = paraSigma;
		}// Of the constructor

		public String toString() {
			return "(" + mu + ", " + sigma + ")";
		}// Of toString
	}// Of GaussianParamters

	/**
	 * The data.
	 */
	Instances dataset;

	/**
	 * The number of classes. For binary classification it is 2.
	 */
	int numClasses;

	/**
	 * The number of instances.
	 */
	int numInstances;

	/**
	 * The number of conditional attributes.
	 */
	int numConditions;

	/**
	 * The prediction, including queried and predicted labels.
	 */
	int[] predicts;

	/**
	 * Class distribution.
	 */
	double[] classDistribution;

	/**
	 * Class distribution with Laplacian smooth.
	 */
	double[] classDistributionLaplacian;

	/**
	 * To calculate the conditional probabilities for all classes over all
	 * attributes on all values.
	 */
	double[][][] conditionalCounts;

	/**
	 * The conditional probabilities with Laplacian smooth.
	 */
	double[][][] conditionalProbabilitiesLaplacian;

	/**
	 * The Guassian parameters.
	 */
	GaussianParamters[][] gaussianParameters;

	/**
	 * Data type.
	 */
	int dataType;

	/**
	 * Nominal.
	 */
	public static final int NOMINAL = 0;

	/**
	 * Numerical.
	 */
	public static final int NUMERICAL = 1;

	/**
	 ********************
	 * The constructor.
	 * 
	 * @param paraFilename
	 *            The given file.
	 ********************
	 */
	public NaiveBayes(String paraFilename) {
		dataset = null;
		try {
			FileReader fileReader = new FileReader(paraFilename);
			dataset = new Instances(fileReader);
			fileReader.close();
		} catch (Exception ee) {
			System.out.println("Cannot read the file: " + paraFilename + "\r\n" + ee);
			System.exit(0);
		} // Of try

		dataset.setClassIndex(dataset.numAttributes() - 1);
		numConditions = dataset.numAttributes() - 1;
		numInstances = dataset.numInstances();
		numClasses = dataset.attribute(numConditions).numValues();
	}// Of the constructor

	/**
	 ********************
	 * Set the data type.
	 ********************
	 */
	public void setDataType(int paraDataType) {
		dataType = paraDataType;
	}// Of setDataType

	/**
	 ********************
	 * Calculate the class distribution with Laplacian smooth.
	 ********************
	 */
	public void calculateClassDistribution() {
		classDistribution = new double[numClasses];
		classDistributionLaplacian = new double[numClasses];

		double[] tempCounts = new double[numClasses];
		for (int i = 0; i < numInstances; i++) {
			int tempClassValue = (int) dataset.instance(i).classValue();
			tempCounts[tempClassValue]++;
		} // Of for i

		for (int i = 0; i < numClasses; i++) {
			classDistribution[i] = tempCounts[i] / numInstances;
			classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
		} // Of for i

		System.out.println("Class distribution: " + Arrays.toString(classDistribution));
		System.out.println(
				"Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
	}// Of calculateClassDistribution


	/**
	 ********************
	 * Classify all instances, the results are stored in predicts[].
	 ********************
	 */
	public void classify() {
		predicts = new int[numInstances];
		for (int i = 0; i < numInstances; i++) {
			predicts[i] = classify(dataset.instance(i));
		} // Of for i
	}// Of classify

	/**
	 ********************
	 * Classify an instances.
	 ********************
	 */
	public int classify(Instance paraInstance) {
		if (dataType == NOMINAL) {
			return classifyNominal(paraInstance);
		} else if (dataType == NUMERICAL) {
			return classifyNumerical(paraInstance);
		} // Of if

		return -1;
	}// Of classify

	/**
	 ********************
	 * Classify an instances with nominal data.
	 ********************
	 */
	public int classifyNominal(Instance paraInstance) {
		// Find the biggest one
		double tempBiggest = -10000;
		int resultBestIndex = 0;
		for (int i = 0; i < numClasses; i++) {
			double tempClassProbabilityLaplacian = Math.log(classDistributionLaplacian[i]);
			double tempPseudoProbability = tempClassProbabilityLaplacian;
			for (int j = 0; j < numConditions; j++) {
				int tempAttributeValue = (int) paraInstance.value(j);

				// Laplacian smooth.
				tempPseudoProbability += Math.log(conditionalCounts[i][j][tempAttributeValue])
				- tempClassProbabilityLaplacian;
			} // Of for j

			if (tempBiggest < tempPseudoProbability) {
				tempBiggest = tempPseudoProbability;
				resultBestIndex = i;
			} // Of if
		} // Of for i

		return resultBestIndex;
	}// Of classifyNominal


	/**
	 ********************
	 * Compute accuracy.
	 ********************
	 */
	public double computeAccuracy() {
		double tempCorrect = 0;
		for (int i = 0; i < numInstances; i++) {
			if (predicts[i] == (int) dataset.instance(i).classValue()) {
				tempCorrect++;
			} // Of if
		} // Of for i

		double resultAccuracy = tempCorrect / numInstances;
		return resultAccuracy;
	}// Of computeAccuracy

	/**
	 ************************* 
	 * Test nominal data.
	 ************************* 
	 */
	public static void testNominal() {
		System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
		String tempFilename = "D:/data/weather.arff";

		NaiveBayes tempLearner = new NaiveBayes(tempFilename);
		tempLearner.setDataType(NOMINAL);
		tempLearner.calculateClassDistribution();
		tempLearner.calculateConditionalProbabilities();
		tempLearner.classify();

		System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
	}// Of testNominal


	/**
	 ************************* 
	 * Test this class.
	 * 
	 * @param args
	 *            Not used now.
	 ************************* 
	 */
	public static void main(String[] args) {
		testNominal();
	}// Of main
}// Of class NaiveBayes

 

screenshot:

文章出处登录后可见!

已经登录?立即刷新

共计人评分,平均

到目前为止还没有投票!成为第一位评论此文章。

(0)
青葱年少的头像青葱年少普通用户
上一篇 2022年5月7日
下一篇 2022年5月7日

相关推荐