Välj ett av nyckelorden till vänster ...

Machine LearningA classification example: QDA

Lästid: ~20 min

The task in the exam scores example is to predict a quantitative random variable Y given the value of a random variable X. Another common machine learning task is to predict a random variable Y taking values in a discrete set of labels. For example, we might want to identify users of a web app as academic, business, or personal users. Or we might want to classify medical devices as faulty or sound. These are called classification problems.

Suppose that (X,Y) are random variables defined on the same probability space, and suppose that Y takes values in \{1,2,3\}. For example, suppose that we select a forest animal at random and let X be its weight and Y the kind of animal it is (where 1,2, and 3 correspond to \mathrm{squirrel},\mathrm{bear}, and \mathrm{fox}, respectively). Suppose that f is a function which is intended to predict the value of Y based on the value of X. Explain why the mean squared error \mathbb{E}[(Y - f(X))^2] is not a reasonable way to measure the accuracy of the prediction function.

Solution. The mean squared error penalizes a misclassification differently depending on how far apart the class labels are (for example, misclassifying a squirrel as a fox would be worse than misclassifying it as a bear).

As discussed in the first section, the most common way to judge a prediction function for a classification problem is the 0-1 loss, which applies a penalty of 1 for misclassification and 0 for correct classification:

\begin{align*} L(f) = \mathbb{E}[\mathbf{1}_{\\{Y \neq f(X)\\}}] = \mathbb{P}(Y \neq f(X)).\end{align*}

Since it is typically not meaningful to put the possible classifications in order along an axis, we usually represent a data point's classification graphically using the point's shape or color. This allows us to use all of the spatial dimensions in the figure for the X values, which is helpful if X is multidimensional.

Given a flower randomly selected from a field, let X_1 be its petal width in centimeters, X_2 its petal length in centimeters, and Y\in\{\mathrm{R},\mathrm{G},\mathrm{B}\} its color. Let

\begin{align*}\boldsymbol{\mu}_{\mathrm{R}} &= \left[\begin{smallmatrix} 9 \\ 5 \end{smallmatrix}\right] &\quad \boldsymbol{\mu}_{\mathrm{G}} &= \left[\begin{smallmatrix}4 \\ 10 \end{smallmatrix}\right] &\quad \boldsymbol{\mu}_{\mathrm{B}} &= \left[\begin{smallmatrix}7 \\ 9 \end{smallmatrix}\right] \\ A_{\mathrm{R}} &= \left[\begin{smallmatrix}1.5 & -1 \\ 0 & \hphantom{-}1 \\ \end{smallmatrix}\right] &\quad A_{\mathrm{G}} &= \left[\begin{smallmatrix}0.5 & 0.25 \\ 0 & 0.5 \\ \end{smallmatrix}\right] &\quad A_{\mathrm{B}} &= \left[\begin{smallmatrix}2 & 0 \\ 0 & 2 \\ \end{smallmatrix}\right].\end{align*}

Suppose that the joint distribution of X_1, X_2, and Y is defined as follows: for any A\subset \mathbb{R}^2 and color c \in \{\mathrm{R},\mathrm{G},\mathrm{B}\}, we have

\begin{align*} \mathbb{P}(A \times \{c\}) = p_c\int_{\mathbb{R}^2} f_c(x_1,x_2) \mathrm{d} x_1 \mathrm{d} x_2,\end{align*}

where (p_R,p_G,p_B) = (1/3,1/6,1/2) and f_c is the multivariate normal density with mean \mu_c and covariance matrix A_cA_c'. In other words, we can sample from the joint distribution of of X_1, X_2, and Y by sampling Y from {R, G, B} with probabilities 1/3, 1/6, and 1/2, respectively, and then generate (X_1, X_2) by calculating A_Y Z + \mu_Y, where Z is a vector of two standard normal random variables which are independent and independent of Y.

Three hundred observations from the distribution of (X_1, X_2, Y) are shown in the figure below.

Colors and petal dimensions of 300 randomly selected flowers.

Find the best predictor of Y given (X_1,X_2) = (x_1, x_2) (using the 0-1 loss function), and find a way to estimate that predictor using the given observations.

We can visualize the joint distribution of flower dimension and color by representing Y as a color (as in the first figure) or using the third spatial dimension (as in the second figure).

Solution. As in the regression example, we can do a decent job of classification with our eyes. If (X_1,X_2) is located where there are lots of green observations, we would predict its classification as green, and similarly for blue and red. Let's think about how to approach this task mathematically.

To start, let's proceed using our knowledge of the joint distribution of (X_1,X_2,Y). The predictor which has minimal misclassification probability is the one which maps (x_1,x_2) to the classification with maximal conditional probability given (x_1,x_2). For example, if the conditional distribution on \{\mathrm{R},\mathrm{G},\mathrm{B}\} given (x_1,x_2) were \{45\%, 30\%, 25\%\}, then we would guess a classification of for the point (x_1, x_2).

The conditional distribution of Y given (X_1,X_2) is given by

\begin{align*} m_{(X_1,X_2) = (x_1,x_2)}(c) = \frac{p_c f_{c}(x_1,x_2)}{\sum_{d\in \{\mathrm{R},\mathrm{G},\mathrm{B}\}}p_d f_{d}(x_1,x_2)}\end{align*}

for c \in \{\mathrm{R},\mathrm{G},\mathrm{B}\}; in other words, we compute the proportion of the probability density at the point (x_1, x_2) which comes from each color c.

Let's build a visualization for the optimal classifier for the flowers by coloring each point in the plane according to its classification. First, let's get 300 observations from the joint distribution of (X_1, X_2, Y):

using Plots, StatsBase, Distributions, Random
struct Flower
# density function for the normal distribution N
xgrid = 0:0.01:15
ygrid = 0:0.01:15
As = [[1.5 -1; 0 1],[1/2 1/4; 0 1/2], [2 0; 0 2]]
μs = [[9,5],[4,10],[7,9]]
Ns = [MvNormal(μ,A*A') for (μ,A) in zip(μs,As)]
p = ProbabilityWeights([1/3,1/6,1/2])
colors = ["red","green","blue"]
function randflower(μs,As)
    i = sample(p)
flowers = [randflower(μs,As) for i in 1:300]

Next, let's make a classifier and color all of the points in a fine-mesh grid according to their predicted classifications.

predict(x,p,Ns) = argmax([p[i]*pdf(Ns[i],x) for i in 1:3])
function classificationplot(flowers,p,Ns)
    rgb = [:red,:green,:blue]
    P = heatmap(xgrid,ygrid,(x,y) -> predict([x,y],p,Ns),
          fillcolor = cgrad(rgb), opacity = 0.4,
          aspect_ratio = 1, legend = false)
    for c in ["red","green","blue"]
        scatter!(P,[(F.X[1],F.X[2]) for F in flowers if F.color==c], color=c)
correct(flowers,p,Ns) = count(colors[predict(F.X,p,Ns)] == F.color for F in flowers)
classificationplot(flowers, p, Ns)

We see that the optimal classifier does get most of the points right, but not all of them. correct(flowers,p,Ns) returns 265, so the optimal classification accuracy is 265/300 ≈ 88% for this example. Now suppose we don't have access to the joint distribution of (X_1, X_2, Y), but we do have n observations \{(X_1^{(i)}, X_2^{(i)},Y^{(i)})\}_{i=1}^n therefrom. We can estimate \widehat{p}_c as the proportion of observed flowers of color c. We could estimate the conditional densities f_c using kernel density estimation, but in the interest of bringing in a new idea, let's fit a multivariate normal distribution to the observations of each color. Let's begin by approximating the mean of the distribution of red flowers, using the plug-in estimator:

\begin{align*}\widehat{\boldsymbol{\mu}}_{\mathrm{R}} = \frac{1}{|\mathcal{R}|}\sum_{i\in \mathcal{R}} \left[\begin{array}{c} X_1^{(i)} \\ X_2^{(i)} \end{array}\right], \end{align*}

Each point in the square is colored according to the optimal classifier's prediction for the given $(x\_1,x\_2)$ pair.

and similarly for the other two colors. This formula estimates \mathbb{E}[\mathbf{X}] by using the empirical distribution as a proxy for the underlying distribution. Likewise, we approximate the red covariance matrix as

\begin{align*} \widehat{\Sigma}_{\mathrm{R}} = \frac{1}{|\mathcal{R}|} \sum_{i\in \mathcal{R}}\left(\left[ \begin{array}{c} X_1^{(i)}\\ X_2^{(i)} \end{array}\right] - \widehat{\boldsymbol{\mu}}_{\mathrm{R}}\right) \left(\left[ \begin{array}{c} X_1^{(i)}\\ X_2^{(i)} \end{array}\right] - \widehat{\boldsymbol{\mu}}_{\mathrm{R}}\right)',\end{align*}

which evaluates the covariance matrix formula \mathbb{E}[(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}}) (\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}})'] with respect to the empirical distribution.

     function mvn_estimate(flowers,color)
    flowers_subset = [F.X for F in flowers if F.color == color]
    μ̂ = mean(flowers_subset)
    Σ̂ = mean([(X - μ̂)*(X - μ̂)' for X in flowers_subset])
colorcounts = countmap([F.color for F in flowers])
p̂ = [colorcounts[c]/length(flowers) for c in colors]
N̂s = [mvn_estimate(flowers,c) for c in colors]

The resulting plot looks very similar to the one we made for the optimal classifier, so this classifier does make the best prediction for most points (x_1,x_2).

Bruno Bruno