MPO 581 Class number 7/27   Feb 9, 2011

Topic: The (rest of the) Most Interesting 5% of Statistics

Loose ends from last class:

Teams and homework leaders all assigned!?    (edit amongst yourselves)
Testable questions compilation taking shape: add to it, for course brownie points.


Today's material: The (rest of the) Most Interesting 5% of Statistics

Outline - without content

  1. What is a statistic, what is statistics?
  2. Subdividing the very large field of statistics (along 2 axes)
    1. parametric vs. non-parametric; exploratory to confirmatory (or beyond: predictive).
  3. Distributions: of Total Stuff, over Bins (infinitesimal bins, in the continuous case)
  4. Probability as the Total Stuff (always adds up to 1)
  5. Probability distributions

Outline again - with content

....

4. Probability as the Total Stuff (always adds up to 1)

Probability -- it always adds up to 1 over all Bins, because one reality exists!

5. Probability distributions

Probability distribution function or probability density function (PDF): the 1 of total probability is spread over one or more dimensions. Or we could say the Unity of probability is subdivided or decomposed into Bins or slices or pieces, like pie (or variance, or Error). A PDF is non-negative (PDF ≥ 0) everywhere. Its value is also called "likelihood" (as in MLE; search above). I think it makes more sense to think of likelihood as a probability density. The acronym "PDF" is nicely ambiguous.
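A minimal Python sketch of the two PDF properties above (the class demos use IDL; this is just an illustration): a standard normal density is non-negative everywhere, and its "Total Stuff" — density times bin width, summed over all Bins — comes out to 1.

```python
import numpy as np

# Fine grid of "Bins" and a standard normal density on it.
x = np.linspace(-8, 8, 2001)
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# PDF >= 0 everywhere.
assert np.all(pdf >= 0)

# Total Stuff: density * bin width, summed over all bins (trapezoid rule).
total = np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x))
print(total)   # very close to 1.0
```

The grid is truncated at ±8 standard deviations, so the sum misses only a vanishingly small tail.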

Cumulative distribution function (CDF): In probability theory and statistics, the cumulative distribution function (CDF), or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x. Intuitively, it is the "area so far" function of the probability distribution.
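The "area so far" intuition can be sketched directly (again in Python, as an illustration rather than the class IDL demo): accumulate PDF mass bin by bin and you get a function that never decreases, climbs from ~0 to ~1, and — for a symmetric distribution centered at 0 — passes through ~0.5 at x = 0.

```python
import numpy as np

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# "Area so far": running sum of density * bin width.
cdf = np.cumsum(pdf) * dx

assert np.all(np.diff(cdf) >= 0)   # a CDF is non-decreasing
print(cdf[-1])                     # approaches 1
print(np.interp(0.0, x, cdf))      # about 0.5 for a symmetric, zero-centered PDF
```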

PDFs are solutions of the equations governing stochastic processes. The PDF represents an equilibrium between "fluxes" in and out of "bins" of probability: processes that change other values of the Bin variable so that they fall into a given bin, balanced against processes that change values within the bin to other values that fall into other bins. Animations from an intuitive case like money exchanges might help: http://www.physics.umd.edu/~yakovenk/econophysics/. The Fokker-Planck equation is a profound thing to stare at some day; it is the time-dependent equation governing "probability flux divergence" that these animations are solutions of:

∂(probability)/∂t = -∇•(probability flux)
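A toy sketch of the flux idea, loosely after the money-exchange animations linked above (this is my own minimal version, not the econophysics group's code; agent count, step count, and transfer sizes are made up): random pairwise exchanges move individual agents between money Bins, while the Total Stuff — total money, the analogue of total probability — is conserved.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, steps = 1000, 100_000
money = np.full(n_agents, 100.0)   # everyone starts with the same amount

for _ in range(steps):
    i, j = rng.integers(n_agents, size=2)
    d = rng.uniform(0, 10)         # amount to transfer
    if i != j and money[i] >= d:
        money[i] -= d              # flux out of agent i's bin
        money[j] += d              # flux into agent j's bin

print(money.sum())   # conserved: still ~100 * 1000
print(money.std())   # spread has grown from 0 toward equilibrium
```

At equilibrium the histogram of `money` resembles an exponential (Boltzmann-Gibbs) distribution — the steady state where the fluxes in and out of every bin balance.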


Key PDFs worth knowing (The Most Interesting 5% of Statistics):

a) Profound, fundamental, yet simply arising from symmetries or other non-statements:


The Normal distribution embodies a null hypothesis about the Nature behind the population your data came from: an adding machine fed by a random number generator. A null hypothesis is basically the most boring (or conservative) possible interpretation of your data (heck, my random number generator could make that). Can you prove otherwise: that your data tell an interesting story about their source? That is the challenge of statistics to scientific claims.
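The "adding machine fed by a random number generator" can be run literally (a Python sketch; the 48-term sum and 100,000 trials are arbitrary choices): sum many independent uniform draws, and the totals come out approximately Normal — the Central Limit Theorem at work behind the null hypothesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Adding machine: 48 uniform random numbers per trial, 100,000 trials.
sums = rng.uniform(size=(100_000, 48)).sum(axis=1)

# Uniform(0,1) has mean 1/2 and variance 1/12, so the sums should have
# mean 48/2 = 24 and standard deviation sqrt(48/12) = 2.
print(sums.mean(), sums.std())
```

A histogram of `sums` looks like the familiar bell curve, even though no Gaussian was ever generated — only additions of boring uniform noise.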

b) What we would see in limited samples, if the above were true:



How to build distributions of Stuff over Bins
A histogram can be built as the derivative of the CDF of your Stuff. I find this an instructive line of reasoning and will demo it. IDL code here.
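The same line of reasoning can be sketched in Python (the class demo itself is in IDL): difference the empirical CDF across the bin edges — its discrete derivative — and you recover the histogram counts.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=10_000)
edges = np.linspace(-4, 4, 41)

# Empirical CDF at each bin edge: fraction of data <= that edge.
ecdf = np.searchsorted(np.sort(data), edges, side="right") / data.size

# Derivative of the CDF across the edges -> histogram counts per bin.
counts_from_cdf = np.diff(ecdf) * data.size
counts_builtin, _ = np.histogram(data, bins=edges)

print(np.allclose(counts_from_cdf, counts_builtin))   # the two agree
```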

But more typically, we just use a built-in histogram function, and work with the values in each bin: sum them, average them, etc.
Histogram (Wikipedia)
Histogram bin width optimization (MIT, has demo applet)

Open questions, assignments, and loose ends for next class:

Testable questions about today's material: