# Statistics: Introduction

Mathematical probability is about drawing conclusions about the outcomes of random experiments whose randomness is known and specified precisely. *Statistics* works in the opposite direction: the outcomes are observed, but the probability measure giving rise to those outcomes is unknown. The goal of statistics is to draw conclusions about probability distributions based on observations sampled from them.

For example, consider the eventual adult height of a particular newborn child. There are no pure mathematical considerations that would suggest a specific distribution for this height. Our best bet is to collect data on the heights of adults and try to **infer** a probability distribution which is compatible with the observed data. Suppose we measure the heights (in inches) of 50 randomly selected adults and get the following numbers:

    heights = [71.54, 66.62, 64.11, 62.72, 68.12, 69.07, 64.82, 61.92, 68.45, 66.3, 66.99, 62.2, 61.04, 63.31, 68.94, 66.27, 66.8, 71.7, 68.93, 66.65, 71.97, 60.27, 62.81, 70.64, 71.61, 65.51, 63.1, 66.21, 68.23, 72.32, 62.29, 63.12, 64.94, 71.89, 65.48, 63.66, 56.11, 65.63, 61.26, 65.12, 66.93, 68.51, 67.2, 71.57, 66.65, 59.77, 61.51, 63.25, 69.12, 64.98]

Each observation provides some evidence about where the probability mass of the height distribution is. We would expect that regions with many observations have more probability mass than regions with few observations, although we should not take this too literally: none of the 50 observations in the list above falls in the interval [70.7, 71.4], but it would not make sense to conclude that adults who are taller than 70.7 inches are necessarily also taller than 71.4 inches.
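The gap mentioned above can be checked directly. The following is a minimal sketch that repeats the `heights` vector so that it runs on its own; the names `in_gap` and `above_gap` are introduced here purely for illustration.

```julia
heights = [71.54, 66.62, 64.11, 62.72, 68.12, 69.07, 64.82, 61.92, 68.45, 66.3,
           66.99, 62.2, 61.04, 63.31, 68.94, 66.27, 66.8, 71.7, 68.93, 66.65,
           71.97, 60.27, 62.81, 70.64, 71.61, 65.51, 63.1, 66.21, 68.23, 72.32,
           62.29, 63.12, 64.94, 71.89, 65.48, 63.66, 56.11, 65.63, 61.26, 65.12,
           66.93, 68.51, 67.2, 71.57, 66.65, 59.77, 61.51, 63.25, 69.12, 64.98]

# Number of observations in the closed interval [70.7, 71.4]
in_gap = count(h -> 70.7 <= h <= 71.4, heights)

# Number of observations strictly above the gap, for comparison
above_gap = count(h -> h > 71.4, heights)

println("observations in [70.7, 71.4]: ", in_gap)
println("observations above 71.4:      ", above_gap)
```

The interval is empty even though several observations lie above it, which is why an empty region in a small sample should not be read as a region of zero probability.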

**Exercise**

Brainstorm at least two ways to come up with a plausible density function given a list of observations like the one given above.

## Nonparametric estimation

A simple way to obtain a probability distribution from a list of observations is to make a **histogram**:

    using Plots
    histogram(heights, nbins=12, label="", xlabel="height (inches)", ylabel="count")

You might think of a histogram as just a visualization of the data, but it does give an actual distribution: we consider the piecewise constant function whose graph consists of the tops of the histogram bars, and we divide it by the sum of the areas of the bars (to obtain a new function which integrates to 1):

    using Plots
    histogram(heights, nbins=12, label="", xlabel="height (inches)", ylabel="count", normed=true)
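The normalization described above can also be carried out by hand. The sketch below assumes equal-width bins spanning the range of the data; `histogram_density` is a hypothetical helper written for this illustration, and its bin edges need not match the ones a plotting library would choose.

```julia
heights = [71.54, 66.62, 64.11, 62.72, 68.12, 69.07, 64.82, 61.92, 68.45, 66.3,
           66.99, 62.2, 61.04, 63.31, 68.94, 66.27, 66.8, 71.7, 68.93, 66.65,
           71.97, 60.27, 62.81, 70.64, 71.61, 65.51, 63.1, 66.21, 68.23, 72.32,
           62.29, 63.12, 64.94, 71.89, 65.48, 63.66, 56.11, 65.63, 61.26, 65.12,
           66.93, 68.51, 67.2, 71.57, 66.65, 59.77, 61.51, 63.25, 69.12, 64.98]

# Hypothetical helper: bar heights of the histogram density with `nbins`
# equal-width bins on [lo, hi]. Assumes every data point lies in [lo, hi].
function histogram_density(data, nbins; lo=minimum(data), hi=maximum(data))
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for x in data
        # Index of the bin containing x; the right endpoint goes in the last bin
        k = min(floor(Int, (x - lo) / width) + 1, nbins)
        counts[k] += 1
    end
    # Dividing each count by (n * width) makes the total bar area equal to 1
    bars = counts ./ (length(data) * width)
    return bars, width
end

bars, width = histogram_density(heights, 12)
println("total area under the bars: ", sum(bars) * width)
```

Calling `histogram_density` with a different `nbins` gives a different piecewise constant density from the same data.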

The arbitrariness in the density function we obtain by normalizing the histogram is hardly disguised: we would have gotten a different result if we'd used a different number of bins, and we could have even decided to use bins of different widths. Nevertheless, the histogram density approximates the actual distribution quite well if we have a lot of data:

**Exercise**

Call the function `mysample` (defined in the code below) 10000 times and make a histogram of the resulting observations. Compare the histogram density to the actual density, and observe that the two are very close.

Note: you can evaluate the pdf of `N₁` at `x` using `pdf(N₁,x)`.

    using Plots, Distributions

    # Draw from N(3, 0.8) with probability 0.8 and from N(-1, 1) with probability 0.2
    function mysample()
        if rand() > 0.2
            3 + 0.8*randn()
        else
            -1 + randn()
        end
    end

    histogram([mysample() for _ in 1:10000], nbins=80, normed=true, label="histogram density")
    N₁ = Normal(3,0.8)
    N₂ = Normal(-1,1)
    # Uncomment the next line and replace DENSITYFUNCTIONHERE with the actual density:
    #actualdensity(x) = DENSITYFUNCTIONHERE
    plot!(-6:0.1:6, actualdensity, linewidth=3, label="actual density", legend=:topright)

*Solution.* The density function describing the distribution that `mysample` draws from is a linear combination of the two given Gaussian density functions, with weights 0.8 and 0.2:

`actualdensity(x) = 0.8pdf(N₁,x)+0.2pdf(N₂,x)`
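As a sanity check, a weighted combination of densities whose weights sum to 1 should itself integrate to 1. The sketch below verifies this numerically without any packages; `normalpdf` is a hand-written stand-in for `pdf(Normal(μ, σ), x)`, and the integration range and step size are arbitrary choices made for this illustration.

```julia
# Density of a normal distribution with mean μ and standard deviation σ
normalpdf(μ, σ, x) = exp(-(x - μ)^2 / (2σ^2)) / (σ * sqrt(2π))

# Mixture density: weight 0.8 on N(3, 0.8) and weight 0.2 on N(-1, 1)
actualdensity(x) = 0.8 * normalpdf(3, 0.8, x) + 0.2 * normalpdf(-1, 1, x)

# Riemann sum over [-10, 10]; the mass outside this range is negligible
dx = 0.001
total = sum(actualdensity(x) * dx for x in -10:dx:10)
println("approximate integral of the mixture density: ", total)
```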

## Parametric estimation

Another way to come up with a density function for some data is to assume that the density function belongs to a specific parametric family of densities, like the set of Gaussian distributions. Then we approximate the parameters using the data.

**Exercise**

Use the sliders to find the μ and σ values for which the normal distribution does the best job of fitting the data. (The meaning of the term "best" here is deliberately left to your discretion.) Compare your results to the values obtained using standard methods for this problem by entering your choices for μ and σ in the last line below.


Later in this course, we will discuss some approaches to choosing parameters optimally, and we'll leave behind the "eyeball-it" strategy we used in this exercise.
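For comparison with an eyeballed fit, one common choice of parameters is the sample mean and the sample standard deviation. The sketch below is offered under that assumption and is not necessarily the method the course develops later; the names `μ_hat`, `σ_hat`, `normalpdf`, and `fitted` are introduced here for illustration.

```julia
using Statistics

heights = [71.54, 66.62, 64.11, 62.72, 68.12, 69.07, 64.82, 61.92, 68.45, 66.3,
           66.99, 62.2, 61.04, 63.31, 68.94, 66.27, 66.8, 71.7, 68.93, 66.65,
           71.97, 60.27, 62.81, 70.64, 71.61, 65.51, 63.1, 66.21, 68.23, 72.32,
           62.29, 63.12, 64.94, 71.89, 65.48, 63.66, 56.11, 65.63, 61.26, 65.12,
           66.93, 68.51, 67.2, 71.57, 66.65, 59.77, 61.51, 63.25, 69.12, 64.98]

μ_hat = mean(heights)   # sample mean
σ_hat = std(heights)    # sample standard deviation (divides by n - 1)

# Density of a normal distribution with mean μ and standard deviation σ
normalpdf(μ, σ, x) = exp(-(x - μ)^2 / (2σ^2)) / (σ * sqrt(2π))

# Fitted Gaussian density using the estimated parameters
fitted(x) = normalpdf(μ_hat, σ_hat, x)

println("μ estimate: ", round(μ_hat, digits=2))
println("σ estimate: ", round(σ_hat, digits=2))
```

The function `fitted` can be plotted over the normalized histogram of `heights` to judge the fit visually.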

The histogram estimator is called a **nonparametric** estimator, in contrast to the parametric approach described in this section.

## Regression

Statistics is not limited to estimating the distribution of a single real-valued random variable like human height. Typically we want to have information about the *joint* distribution of such a variable with other variables whose values we are in a position to know. Such joint information allows us to make more accurate predictions, and that increased accuracy is usually critical for the business or research purposes that motivated the inquiry.

For example, if we're able to collect the heights of many adults together with the heights of each of their parents, then we can aim to understand the *conditional* expectation of a person's height, given the heights of their parents. Since we can measure the heights of a child's parents, we can use this information to make a better prediction for how tall the child will grow up to be. The problem of estimating the conditional expectation of one random variable given others is called **regression**.

In the next section, we will develop some intuitive techniques for estimating density functions for joint distributions. We'll close this section with an exercise involving the estimation of a *discrete* distribution.

**Exercise**

Consider a random variable that you know takes values in {0, 1, 2}. Suppose that 100 independent observations are made from the distribution of this random variable, and suppose they are the values given below. Propose an estimate of that distribution.

    observations = [ 0, 2, 2, 2, 2, 2, 0, 2, 2, 1, 0, 2, 2, 1, 0, 1, 0, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 0, 2, 0, 1, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 1, 0, 2 ]

*Solution.* Since 70% of the observations are 2's, we posit that the probability that the random variable equals 2 is 70%. Likewise, we estimate the probabilities that it equals 1 and 0 to be 13% and 17%, respectively.
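The proportions in the solution can be computed by counting. This is a minimal sketch that repeats the `observations` vector so that it runs on its own; the name `estimate` is introduced here for illustration.

```julia
observations = [0, 2, 2, 2, 2, 2, 0, 2, 2, 1, 0, 2, 2, 1, 0, 1, 0, 2, 1, 2,
                2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1,
                2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 0, 2, 0, 1, 2,
                0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 0, 0, 2,
                2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 1, 0, 2]

n = length(observations)

# Empirical proportion of each possible value 0, 1, 2
estimate = Dict(v => count(==(v), observations) / n for v in 0:2)

for v in 0:2
    println("estimated probability of ", v, ": ", estimate[v])
end
```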