MPO 581 Class number 6/27   Feb 7, 2011

Topic: The Most Interesting 5% of Statistics

Loose ends from last class:

Teams and homework leaders all assigned!?    (edit amongst yourselves)
Testable questions compilation taking shape: add to it, for course brownie points.


Today's material: The Most Interesting 5% of Statistics

Outline - without content

  1. What is a statistic, what is statistics?
  2. Subdividing the very large field of statistics (in 2 axes)
    1. parametric vs. non-parametric; exploratory to confirmatory (or beyond: predictive).
  3. Distributions: of Total Stuff, over Bins (infinitesimal bins, in the continuous case)
  4. Probability as the Total Stuff (always adds up to 1)
  5. Probability distributions

Outline again - with content

  1. What is a statistic, what is statistics?

2. Subdividing the very large field of statistics (in 2 axes):

    parametric vs. non-parametric; exploratory to confirmatory (or beyond: predictive). Wikipedia:

    1. Parametric statistics is a branch of statistics that assumes that data have come from a type of probability distribution and makes inferences about the parameters of the distribution.[1] Most well-known elementary statistical methods are parametric.[2] Generally speaking, parametric methods make more assumptions than non-parametric methods.[3] If those extra assumptions are correct, parametric methods can produce more accurate and precise estimates. They are said to have more statistical power. However, if those assumptions are incorrect, parametric methods can be very misleading. For that reason they are often not considered robust. On the other hand, parametric formulae are often simpler to write down and faster to compute. In some cases, their simplicity makes up for their non-robustness, especially if care is taken to examine diagnostic statistics.[4] Because parametric statistics assume a probability distribution, they are not distribution-free.[5]
    2. Robust statistics provides an alternative approach to classical statistical methods. The motivation is to produce estimators that are not unduly affected by small departures from model assumptions. (I like this- BEM).
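The robustness point is easy to see in a few lines of standard-library Python. This is just an illustrative sketch with a made-up dataset: one wild outlier drags the mean (a classical, non-robust estimator) far away, while the median barely moves.

```python
import statistics

# A small, well-behaved made-up sample, then the same sample plus one
# wild outlier (e.g. a sensor glitch or transcription error).
clean = [1, 2, 3, 4, 5]
dirty = clean + [100]

clean_mean, clean_median = statistics.mean(clean), statistics.median(clean)
dirty_mean, dirty_median = statistics.mean(dirty), statistics.median(dirty)

# The mean is dragged from 3 up past 19 by a single bad value,
# while the median only moves from 3 to 3.5.
```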

Nonparametric: data-driven statistics

Parametric (analytic) statistics: uses probability theory assumptions

Exploratory, Descriptive: Characterizing the data
  • Data summarization
    • mean, variance = stdev^2, skew, kurtosis...
      • (moments)
    • Histogram
  • Probability distributions
    • implies processes
      • linear +/- additive <-> Gaussian or Normal
        • like momentum in gas molecules => exp(-v²/T)
      • A conserved positive quantity is exchanged symmetrically among interchangeable (arbitrarily labeled or defined) interacting subsystems <-> Exponential
        • like KE in gas molecules => exp(-KE/T) = exp(-v²/T)
      • Umm hey, those are the same...
↓  (has more and more)
  • Transform a dimension (independent variable or domain)
    • Variance is spread over d.o.f. 's
    • many cross-cuts of dofs possible
      • e.g. space, time, spacetime
      • partition by scale (regrid)
  • Transforms using specified basis functions
    • e.g. Fourier analysis (sines and cosines)
↓  (science ideas)
  • Transform values (e.g. value --> rank)
  • Transform to Gaussian so we can use theorems
    • or rank statistics (there is theory) or other
↓  (driving it)
  • Associations (correlation, covariance)
  • Information theory (mutual information, entropy)
    • data compression, communication engineering
↓  (and more)
  • Signal or event detection & isolation
    • e.g. composites
  • Extreme value theories (events)
  • Signal processing
  • "System identification"
↓  (sophistication)
  • Confidence or robustness checks
    • e.g. subdividing sample
    • data denial experiments
  • Formal hypothesis testing
    • e.g. t test p-value of 5% ==> "stat. significant"
      • (based on undersampling theory for Normal populations)
FORMAL SCIENCE

    • Monte Carlo resampling
      • (bootstrap, jackknife)
    • Monte Carlo synthetic data generation

Inferential or Confirmatory (and on to Predictive):
  • Statistical modeling or forecasting
  • Statistical forecast evaluation

APPLICATIONS
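Of the confirmatory techniques above, the bootstrap is simple enough to sketch in a few lines of standard-library Python. The sample data and resample count below are invented for illustration: resample the data with replacement many times, recompute the statistic each time, and read the sampling uncertainty off the spread of the results, with no parametric assumptions about the population.

```python
import random
import statistics

random.seed(0)  # reproducible illustration

data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9, 2.2, 3.6]  # made-up sample
n_boot = 1000

# Bootstrap: resample the data WITH replacement, same size as the original,
# and recompute the statistic of interest (here, the mean) each time.
boot_means = [
    statistics.mean(random.choices(data, k=len(data)))
    for _ in range(n_boot)
]

# The spread of boot_means estimates the sampling uncertainty of the mean.
boot_se = statistics.stdev(boot_means)
```

The same recipe works for any statistic (median, skew, a regression slope) just by swapping the function applied to each resample.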

For the most part we will concentrate on the left column: nonparametric data crunching. Learning parametric (analytic) probability and statistics theory is time-consuming, math-intensive, and off the point of the course. But it's important to get some exposure to the spirit and concepts. If you can internalize this way of thinking, even without all the gory details, you can "think it through" over a lifetime as you stare at diagrams in conferences, seminars, and journals. As you sample the world more and more through the happenstance of your own life, a statistical mind-set can even be a source of wisdom, or at least equanimity.
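The "data summarization" entries at the top of the left column (mean, variance, skew, kurtosis, histogram) can be computed directly from their definitions. A minimal standard-library sketch, using a small invented sample:

```python
import statistics
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # small symmetric made-up sample

n = len(data)
mean = statistics.mean(data)

# Central moments: m_k = (1/n) * sum((x - mean)^k)
m2 = sum((x - mean) ** 2 for x in data) / n   # population variance = stdev^2
m3 = sum((x - mean) ** 3 for x in data) / n
m4 = sum((x - mean) ** 4 for x in data) / n

skew = m3 / m2 ** 1.5     # 0 for a symmetric sample like this one
kurtosis = m4 / m2 ** 2   # 3 for a Gaussian; flatter samples come out lower

# A histogram is just counts per bin; here the values are already integers.
hist = Counter(data)
```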

3. Distributions: of Total Stuff, over Bins (infinitesimal bins, in the continuous case)

Distribution of Stuff over Bins rests on a few principles:
  1. Total Stuff should mean something. Distribution diagrams should always have area under curve (integral) ∝ Total Stuff.
  2. The Bin dimension (or domain of the distribution function, in continuous-function language) should mean something.
  3. A distribution function (of which a well-normalized histogram is an estimator) has units u[Stuff]/u[Bin], and is most clearly written as a derivative (since its integral, Total Stuff, is the meaningful quantity): like dN/dD for a number distribution over drop diameter (D) in cloud physics.
  4. The integral of a distribution function from -∞ up to the Bin value is the cumulative distribution function (CDF). The CDF is 0 at -∞ on the Bin axis, ramps up monotonically for positive-definite Total Stuff like Probability or Energy or Variance, and equals the Total Stuff at the +∞ end. It may wander up and down for more general Total Stuff, like Flux (covariance) distributed over scale (ogive plots; we will make them in the Fourier part of the course).
Example: size distribution of droplets. You may want the distribution of volume or mass (not number); and you may want it distributed over log-diameter (not diameter) to test for exponential form. Here's an example (1-page pdf from Rogers and Yau).
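Principle 3 can be sketched numerically. The bin edges and counts below are invented, loosely in the spirit of a drop-size spectrum: the distribution-function estimate is count per bin width (which is what makes unequal bins comparable), and integrating it recovers Total Stuff.

```python
# Made-up drop counts in diameter bins (diameters in micrometers).
edges = [0.0, 10.0, 20.0, 40.0, 80.0]   # bin edges in D (um), unequal widths
counts = [120, 60, 30, 6]               # drops counted in each bin

widths = [b - a for a, b in zip(edges, edges[1:])]

# Distribution function estimate dN/dD, units: counts per um.
dN_dD = [c / w for c, w in zip(counts, widths)]

# Integrating (summing value * width) recovers Total Stuff = total count.
total = sum(f * w for f, w in zip(dN_dD, widths))
```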

4. Probability as the Total Stuff (always adds up to 1)

Probability -- it always adds up to 1 over all Bins, because one reality exists!

5. Probability distributions

Probability distribution function or probability density function (PDF): The 1 of total probability is spread over one or more dimensions. Or we could say the Unity of probability is subdivided or decomposed into Bins or slices or pieces, like pie (or variance). A PDF is non-negative (PDF ≥ 0) everywhere. Its value is also called "likelihood" (as in MLE; search above).
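A quick numerical check of "the 1 of total probability spread over a dimension", sketched in standard-library Python: trapezoid-integrate the standard Gaussian PDF over a wide interval and verify that the PDF is non-negative and the area is 1.

```python
import math

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """Normal (Gaussian) probability density function."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Trapezoid rule over an interval wide enough that the tails are negligible.
a, b, n = -8.0, 8.0, 4000
h = (b - a) / n
xs = [a + i * h for i in range(n + 1)]
fs = [gauss_pdf(x) for x in xs]
area = h * (fs[0] / 2 + sum(fs[1:-1]) + fs[-1] / 2)

# PDF >= 0 everywhere, and the area under it (Total Stuff) is 1.
```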

Cumulative distribution function (CDF): In probability theory and statistics, the cumulative distribution function (CDF), or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x. Intuitively, it is the "area so far" function of the probability distribution.
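The "area so far" idea has a data-driven (nonparametric) counterpart, the empirical CDF: the fraction of the sample less than or equal to x. A minimal sketch with invented data:

```python
def ecdf(data, x):
    """Empirical CDF: fraction of the sample <= x ('probability so far')."""
    return sum(1 for v in data if v <= x) / len(data)

sample = [3, 1, 4, 1, 5, 9, 2, 6]  # made-up data

# The empirical CDF is 0 below the smallest value, 1 above the largest,
# and never decreases in between.
```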

PDFs are solutions to stochastic differential equations. The PDF represents an equilibrium between "fluxes" in and out of "bins" of probability: in other words, an equilibrium between processes that change other values of the Bin variable so that they fall into a given bin, minus processes that change values within the bin to other values which fall in other bins. Animations from an intuitive case like money exchanges might help: http://www.physics.umd.edu/~yakovenk/econophysics/. The Fokker-Planck equation is a profound thing to stare at some day; it is the time-dependent equation for the PDF of an SDE, governing the "probability flux divergence" that these animations are solutions of:

∂(probability)/∂t = -∇•(probability flux)
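The money-exchange animations at the link above can be imitated in a few lines. Everything in this sketch (agent count, exchange rule, step count) is invented for illustration: a conserved positive quantity is exchanged symmetrically between random pairs of interchangeable agents, and the equilibrium histogram approaches the exponential exp(-m/T) discussed earlier, for which the fraction of agents below the mean is 1 - 1/e ≈ 0.63.

```python
import random

random.seed(42)  # reproducible illustration

n_agents = 2000
money = [100.0] * n_agents   # conserved positive "Stuff", equal start

# Repeatedly pick a random pair and re-split their combined money at a
# uniformly random point: a symmetric exchange among interchangeable agents.
for _ in range(200_000):
    i, j = random.randrange(n_agents), random.randrange(n_agents)
    if i != j:
        pot = money[i] + money[j]
        r = random.random()
        money[i], money[j] = r * pot, (1 - r) * pot

total = sum(money)           # conserved: still n_agents * 100
mean_m = total / n_agents
frac_below_mean = sum(1 for m in money if m <= mean_m) / n_agents
# For an exponential equilibrium this fraction is 1 - 1/e ≈ 0.632,
# even though everyone started with exactly the same amount.
```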



Open questions, assignments, and loose ends for next class:

Testable questions about today's material: