Machine Learning: Generative Models
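Kernel density estimation and quadratic discriminant analysis are both examples of generative models: each estimates the joint distribution of the features and the class label, and then classifies using that estimate.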
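To recap from the second section, quadratic discriminant analysis (QDA) posits that the class conditional densities are multivariate Gaussian. We use the observations from each class $c$ to estimate a mean $\widehat{\boldsymbol{\mu}}_c$ and a covariance matrix $\widehat{\Sigma}_c$ for that class. We also use sample proportions $\widehat{p}_c$ to estimate the class proportions. Given this approximation of the probability measure on $\mathcal{X} \times \mathcal{Y}$, we return the classifier $h(\mathbf{x}) = \operatorname{argmax}_c \widehat{p}_c f_c(\mathbf{x})$ (where $f_c$ is the multivariate normal density with mean $\widehat{\boldsymbol{\mu}}_c$ and covariance $\widehat{\Sigma}_c$).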
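The QDA recipe above can be sketched in a few lines of Julia. This is a minimal illustration rather than a reference implementation: the function names `mvn_density`, `fit_qda`, and `predict_qda` are hypothetical, the data are assumed to be stored as an n × p matrix with one observation per row, and only the standard libraries are used.

```julia
using LinearAlgebra, Statistics

# Multivariate normal density with mean μ and covariance Σ, evaluated at x.
function mvn_density(μ, Σ, x)
    d = length(μ)
    exp(-dot(x - μ, Σ \ (x - μ)) / 2) / sqrt((2π)^d * det(Σ))
end

# Fit QDA. X is an n × p matrix (one observation per row); y holds the n labels.
# Returns one named tuple per class with its proportion, mean, and covariance.
function fit_qda(X, y)
    map(sort(unique(y))) do c
        Xc = X[y .== c, :]                 # observations from class c
        (class = c,
         p = size(Xc, 1) / size(X, 1),     # sample proportion
         μ = vec(mean(Xc, dims = 1)),      # class mean
         Σ = cov(Xc))                      # class covariance matrix
    end
end

# Classify x by maximizing (class proportion) × (class conditional density).
function predict_qda(model, x)
    scores = [m.p * mvn_density(m.μ, m.Σ, x) for m in model]
    model[argmax(scores)].class
end
```

Calling `fit_qda(X, y)` and then `predict_qda(model, x)` returns the label of the class with the largest estimated value of the proportion times the density at `x`. Each class needs enough observations for its sample covariance matrix to be invertible.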
A common variation on this idea is to posit that the class conditional densities all have the same covariance matrix. Then observations from all of the classes can be pooled to estimate this common covariance matrix. We estimate the mean $\widehat{\boldsymbol{\mu}}_c$ of each class $c$, and then we average $(\mathbf{X}_i - \widehat{\boldsymbol{\mu}}_{Y_i})(\mathbf{X}_i - \widehat{\boldsymbol{\mu}}_{Y_i})^{\top}$ over all the sample points $(\mathbf{X}_i, Y_i)$. This approach is called linear discriminant analysis (LDA).
The advantage of LDA over QDA stems from the difficulty of estimating the $p(p+1)/2$ distinct entries of a $p \times p$ covariance matrix if the number of features $p$ is even moderately large. Pooling the classes allows us to marshal more observations in the service of this estimation task.
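The pooled estimate used by LDA can be sketched as follows. The function name `pooled_covariance` is hypothetical, and the data layout (an n × p matrix with one observation per row) is an assumption made for this example. Each observation is centered at the mean of its own class, and the resulting outer products are averaged over all n sample points.

```julia
using LinearAlgebra, Statistics

# Pooled covariance estimate for LDA.
# X is an n × p matrix (one observation per row); y holds the n class labels.
function pooled_covariance(X, y)
    n, p = size(X)
    S = zeros(p, p)
    for c in unique(y)
        Xc = X[y .== c, :]
        R = Xc .- mean(Xc, dims = 1)   # residuals about the class mean
        S += R' * R                    # sum of outer products of the residuals
    end
    S / n                              # average over all n sample points
end
```

Dividing by n matches the plain average described above. A common alternative divides by n minus the number of classes to remove the bias from estimating the class means.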
The terms quadratic and linear refer to the resulting decision boundaries: solution sets of equations of the form $p_1 f_1(\mathbf{x}) = p_2 f_2(\mathbf{x})$ are quadric hypersurfaces or hyperplanes if $p_1$ and $p_2$ are positive real numbers and $f_1$ and $f_2$ are distinct multivariate normal densities. If the covariances of $f_1$ and $f_2$ are equal, then the solution set is a hyperplane.
Exercise
Use the code cell below to confirm, for the given covariance matrix and mean vectors, that the solution set of $p_1 f_1(\mathbf{x}) = p_2 f_2(\mathbf{x})$ is indeed a plane in three-dimensional space. (Hint: call simplify on the expression returned in the last line.)
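using SymPy, LinearAlgebra   # LinearAlgebra provides det
# real symbolic variables (older SymPy.jl releases wrote `@vars x y z real=true`)
@syms x::real y::real z::real
p₁ = 1/5
p₂ = 2/5
# common covariance matrix, one row per line
Σ = [2 1 0
     1 1 0
     0 0 1]
# three-dimensional normal density; the normalizing constant cancels in the ratio below
f(μ, Σ, x) = 1/((2π)^(3/2) * sqrt(det(Σ))) * exp(-1/2 * (x-μ)' * inv(Σ) * (x-μ))
f([2,0,1], Σ, [x,y,z]) / f([1,1,-3], Σ, [x,y,z])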
Solution. After simplification, the last line returns $e^{2x - 3y + 4z + 5/2}$, so the set of points where this ratio is equal to $p_2/p_1 = 2$ is the solution set of $2x - 3y + 4z + 5/2 = \log 2$, which is a plane.
Although we used specific numbers in this example, it does illustrate the general point: the only quadratic term in the argument of the exponential in the formula for the multivariate normal density is $-\tfrac{1}{2}\mathbf{x}^{\top}\Sigma^{-1}\mathbf{x}$. Thus if we divide two such densities with the same $\Sigma$, the quadratic terms cancel, and the only remaining variables appear in the form of a linear combination in the exponent. When such an expression is set equal to a constant, the equation can be rearranged by dividing and taking logs to obtain a linear equation.
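The cancellation can be written out explicitly. The following is a sketch of the algebra, where the symbols $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ (the two means) and $\Sigma$ (the shared covariance) are notation chosen here for illustration:

```latex
\begin{align*}
\log \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})}
  &= -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)
     + \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) \\
  &= (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^{\top}\Sigma^{-1}\mathbf{x}
     - \tfrac{1}{2}\left(\boldsymbol{\mu}_1^{\top}\Sigma^{-1}\boldsymbol{\mu}_1
     - \boldsymbol{\mu}_2^{\top}\Sigma^{-1}\boldsymbol{\mu}_2\right).
\end{align*}
% Setting p_1 f_1(x) = p_2 f_2(x) is the same as requiring this log ratio to equal log(p_2 / p_1):
\[
(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^{\top}\Sigma^{-1}\mathbf{x}
  = \log\frac{p_2}{p_1}
  + \tfrac{1}{2}\left(\boldsymbol{\mu}_1^{\top}\Sigma^{-1}\boldsymbol{\mu}_1
  - \boldsymbol{\mu}_2^{\top}\Sigma^{-1}\boldsymbol{\mu}_2\right).
\]
```

With the means $(2,0,1)$ and $(1,1,-3)$ and the covariance matrix from the exercise, $\Sigma^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2) = (2,-3,4)$ and the constant term is $5/2$, so the log ratio is $2x - 3y + 4z + 5/2$.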
Naive Bayes
The naive Bayes approach to classification is to assume that the components of $\mathbf{X}$ are conditionally independent given $Y$. In the context of the flower example, this would mean assuming that blue-flower petal width and length are independent (which was true in that example), that red-flower petal width and length are independent (which was not true), and that green-flower petal width and length are independent (also not true).
To train a naive Bayes classifier, we use the observations from each class $c$ to estimate a density $\widehat{f}_{c,i}$ on $\mathbb{R}$ for each feature component $i$, and then we estimate
$$\widehat{f}_c(x_1, \ldots, x_p) = \widehat{f}_{c,1}(x_1) \cdots \widehat{f}_{c,p}(x_p),$$
in accordance with the conditional independence assumption. The method for estimating the univariate densities $\widehat{f}_{c,i}$ is up to the user; options include kernel density estimation and parametric estimation.
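As an illustration of the parametric option, here is a minimal sketch that fits a univariate normal density to each feature within each class and multiplies them. The choice of normal densities, the function names (`normal_density`, `fit_naive_bayes`, `class_density`, `predict_naive_bayes`), and the n × p data layout are all assumptions made for this example; a kernel density estimate could be substituted for `normal_density`.

```julia
using Statistics

# Univariate normal density with mean μ and standard deviation σ, evaluated at x.
normal_density(μ, σ, x) = exp(-((x - μ) / σ)^2 / 2) / (σ * sqrt(2π))

# Fit one normal density per (class, feature) pair.
# X is an n × p matrix (one observation per row); y holds the n class labels.
function fit_naive_bayes(X, y)
    map(sort(unique(y))) do c
        Xc = X[y .== c, :]
        (class = c,
         p = size(Xc, 1) / size(X, 1),    # sample proportion of class c
         μ = vec(mean(Xc, dims = 1)),     # per-feature means
         σ = vec(std(Xc, dims = 1)))      # per-feature standard deviations
    end
end

# Estimated class conditional density: the product of the univariate densities.
class_density(m, x) = prod(normal_density.(m.μ, m.σ, x))

# Predict the class maximizing (class proportion) × (estimated density).
predict_naive_bayes(model, x) =
    model[argmax([m.p * class_density(m, x) for m in model])].class
```

The product in `class_density` is exactly where the conditional independence assumption enters: no covariance between features is estimated.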
Exercise
Each scatter plot shows a set of sample points for a three-category classification problem. Match each data set to the best-suited model: Naive Bayes, LDA, QDA.
Solution. The correct order is (c), (a), (b), since the third plot shows class conditional densities which factor as a product of marginals, the first plot shows Gaussian class conditional densities with the same covariance matrix, and the second plot shows Gaussian class conditional densities with distinct covariance matrices.
Exercise
Consider a classification problem where the features $X_1$ and $X_2$ have the property that $X_1$ is uniformly distributed on $[0,1]$ and $X_2$ is equal to $1 - X_1$. Suppose further that the conditional distribution of $Y$ given $X_1$ and $X_2$ assigns probability mass 80% to class 1 and 20% to class 0 when the observation is left of the vertical line $x_1 = 1/2$, and assigns probability mass 75% to class 0 and 25% to class 1 when the observation is right of the vertical line $x_1 = 1/2$.
(a) Find the prediction function which minimizes the misclassification probability.
(b) Show that the naive Bayes assumption leads to the optimal prediction function, even though the relationship between $X_1$ and $X_2$ is modeled incorrectly.
Solution. (a) The classifier which minimizes the misclassification probability predicts class 1 for points in the northwest quadrant of the square (since the class-1 density is larger there), and class 0 for points in the southeast quadrant (since the class-0 density is larger there).
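(b) The probability of the event $\{Y = 1\}$ is
$$\mathbb{P}(Y = 1) = \tfrac{1}{2}(0.8) + \tfrac{1}{2}(0.25) = \tfrac{21}{40}.$$
Therefore, the conditional density of $X_1$ given $\{Y = 1\}$ is
$$f_{X_1 \mid Y = 1}(x_1) = \begin{cases} 0.8 \big/ \tfrac{21}{40} = \tfrac{32}{21} & \text{if } 0 \le x_1 < \tfrac{1}{2}, \\ 0.25 \big/ \tfrac{21}{40} = \tfrac{10}{21} & \text{if } \tfrac{1}{2} < x_1 \le 1. \end{cases}$$
Likewise, since $X_2 = 1 - X_1$, the conditional density of $X_2$ given $\{Y = 1\}$ is
$$f_{X_2 \mid Y = 1}(x_2) = \begin{cases} \tfrac{10}{21} & \text{if } 0 \le x_2 < \tfrac{1}{2}, \\ \tfrac{32}{21} & \text{if } \tfrac{1}{2} < x_2 \le 1. \end{cases}$$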
Under the (erroneous) assumption that $X_1$ and $X_2$ are conditionally independent given $Y$, we would get a joint conditional density function (given the event $\{Y = 1\}$) which is constant on each quadrant of the unit square, with value $\left(\tfrac{32}{21}\right)^2$ throughout the northwest quadrant, $\left(\tfrac{10}{21}\right)^2$ on the southeast quadrant, and $\tfrac{32}{21} \cdot \tfrac{10}{21}$ on each of the other two quadrants. To emphasize the distinction between the actual measures and the naive Bayes measure, here's a visualization of each:
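using Plots
ϵ = 0.015   # small vertical offset so the two class measures do not overlap
# left of x₁ = 1/2: class 1 (red, thick) carries most of the mass
plot([(0,1),(1/2,1/2)], linewidth = 3, color = :red, legend = false, ratio = 1, size = (400,400))
plot!([(0,1+ϵ),(1/2,1/2+ϵ)], linewidth = 1, color = :blue)
# right of x₁ = 1/2: class 0 (blue, thick) carries most of the mass
plot!([(1/2,1/2),(1,0)], linewidth = 1, color = :red)
plot!([(1/2,1/2+ϵ),(1,ϵ)], linewidth = 4, color = :blue)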
# naive Bayes estimate of the joint density of (X₁, X₂) given Y = 1
function bayes_density(x1,x2)
    (x1 < 0.5 ? 32/21 : 10/21) * (x2 < 0.5 ? 10/21 : 32/21)
end
heatmap(0:0.01:1, 0:0.01:1, bayes_density, color = cgrad([:blue, :red]), legend = false, ratio = 1, size = (400,400))
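Likewise, $\mathbb{P}(Y = 0) = \tfrac{19}{40}$, and the naive Bayes density of $(X_1, X_2)$ conditioned on $\{Y = 0\}$ is $\left(\tfrac{8}{19}\right)^2$ in the northwest quadrant of the square and $\left(\tfrac{30}{19}\right)^2$ in the southeast quadrant of the square. Since $\tfrac{21}{40}\left(\tfrac{32}{21}\right)^2 > \tfrac{19}{40}\left(\tfrac{8}{19}\right)^2$, the naive Bayes classifier predicts 1 in the northwest quadrant of the square. Likewise, since $\tfrac{21}{40}\left(\tfrac{10}{21}\right)^2 < \tfrac{19}{40}\left(\tfrac{30}{19}\right)^2$, it predicts 0 in the southeast quadrant.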
Therefore, despite modeling the relationship between the features incorrectly, the naive Bayes classifier does yield the optimal prediction function.
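The arithmetic in this solution can be checked with exact rational numbers. The sketch below encodes only the quantities stated in the exercise; the names `p1`, `p0`, `f1`, `f0`, and `naive_bayes_predict` are chosen here for illustration.

```julia
# Class proportions: X₁ is uniform on [0,1], and P(Y = 1 | X₁ = x₁) is
# 4/5 to the left of x₁ = 1/2 and 1/4 to the right of it.
p1 = (1//2) * (4//5) + (1//2) * (1//4)   # P(Y = 1)
p0 = 1 - p1                              # P(Y = 0)

# Conditional densities of X₁ given Y = 1 and given Y = 0.
f1(x) = (x < 1//2 ? 4//5 : 1//4) / p1
f0(x) = (x < 1//2 ? 1//5 : 3//4) / p0

# Since X₂ = 1 - X₁, the conditional density of X₂ at x₂ equals that of X₁ at 1 - x₂.
# Naive Bayes score: class proportion times the product of the marginal densities.
score1(x1, x2) = p1 * f1(x1) * f1(1 - x2)
score0(x1, x2) = p0 * f0(x1) * f0(1 - x2)

naive_bayes_predict(x1, x2) = score1(x1, x2) > score0(x1, x2) ? 1 : 0

naive_bayes_predict(1//4, 3//4)   # a point in the northwest quadrant
naive_bayes_predict(3//4, 1//4)   # a point in the southeast quadrant
```

The two calls at the end evaluate the classifier at one representative point per quadrant; because the naive Bayes densities are constant on each quadrant, one point per quadrant is enough.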