*Stats for Data Science*: principles and objectives

These are some principles that I would like to follow in writing *Stats for Data Science*. I don’t always succeed, and it’s helpful to have others point out places where I’ve deviated from them.

Pedagogy

  1. It’s intro stats, so make repeated, reinforcing use of a small set of concepts and of statistical “objects.”
    • Example: graphics.
      • Just three kinds of glyphs used: data, interval, density
      • Frame is always response versus explanatory
    • Example: intervals
      • Describe single variables using an interval, just as we do for confidence and prediction intervals.
    • Example: Computing
      • Roles of variables denoted in the same way for all calculations: response ~ explanatory.
      • But … we’ve got two notations for passing data: data = and %>% (a sketch of both follows this list).
        • emphasize %>% when the output is the same kind of thing as the input: wrangling, graphics
        • emphasize data = otherwise.
    • Example: models
      • Avoid referring to the internal details. Instead, emphasize things you can do with models:
        • Evaluate (including prediction intervals)
        • Effect size
        • Bootstrap & cross-validate
      • Use same notation for all: response ~ explanatory
      • Two basic kinds of models: bounded and unbounded
      • Avoid misleading terms, e.g. I use “proportional combination” rather than “linear.”
  2. Use data for all graphics.
    • Lay conceptual objects on top of data
    • Tie statistical quantities to the variables, e.g. differences and effect sizes have units.
  3. Leave out legacy notation, methods, and concepts that are not needed
    • standard error, chi-squared, graphics modes developed for pen and paper
    • soft-pedal population. Our objective is never to estimate a population parameter; it is to make a meaningful prediction or to correctly anticipate the effect of an intervention.
  4. Be relevant in examples
    • No urns!
    • Climate change, health, …
    • Try to be authentic in the use of statistical methods
      • avoid poll examples unless the point is to talk about stratified sampling and adjustment
  5. Be relevant in methods
    • risk ratios (and a little on odds ratios); a risk-ratio sketch follows this list
    • methods must provide a role for covariates
    • adjustment, stratified sampling, etc.
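
To make the notation conventions concrete, here is a minimal R sketch of the two ways of passing data. The dplyr and ggformula packages are my assumed stand-ins for the wrangling and formula-based graphics layer; the principle itself doesn’t depend on that choice.

```r
library(dplyr)      # provides %>% and the wrangling verbs
library(ggformula)  # formula-interface graphics (an assumed toolkit, not a prescription)

# Wrangling and graphics: the output is the same kind of thing as the input,
# so the data flows in through %>%
mtcars %>%
  filter(cyl != 6) %>%
  gf_point(mpg ~ hp)                 # response ~ explanatory

# Modeling: the output is a different kind of object, so pass the data with data =
mod <- lm(mpg ~ hp, data = mtcars)   # same response ~ explanatory notation
```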
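
And a risk-ratio sketch that gives a covariate a role, using the built-in Titanic counts; the dplyr/tidyr wrangling is my own choice of tooling for illustration.

```r
library(dplyr)
library(tidyr)

titanic <- as.data.frame(Titanic)   # built-in counts: Class, Sex, Age, Survived, Freq

# Risk of survival by sex, then the overall risk ratio (Female vs. Male)
overall <- titanic %>%
  group_by(Sex) %>%
  summarise(risk = sum(Freq[Survived == "Yes"]) / sum(Freq))
with(overall, risk[Sex == "Female"] / risk[Sex == "Male"])

# The same risk ratio within each passenger class (the covariate)
titanic %>%
  group_by(Class, Sex) %>%
  summarise(risk = sum(Freq[Survived == "Yes"]) / sum(Freq), .groups = "drop") %>%
  pivot_wider(names_from = Sex, values_from = risk) %>%
  mutate(risk_ratio = Female / Male)
```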

Avoid questionable inference

  1. Put p-values and confidence intervals in their place.
    • Confidence intervals are always on effect sizes
  2. Avoid spurious precision.
    • the 1.96 multiplier (a bootstrap alternative is sketched below)
    • small-data techniques (e.g. the t distribution)
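
For concreteness, here is one way to put a confidence interval on an effect size without a 1.96 (or t) multiplier: a percentile bootstrap. The dataset and the particular effect size are illustrative choices on my part.

```r
# Effect size: difference in mean mpg between manual (am == 1) and
# automatic (am == 0) cars in mtcars. Units: miles per gallon.
effect_size <- function(d) mean(d$mpg[d$am == 1]) - mean(d$mpg[d$am == 0])

set.seed(101)
boot_reps <- replicate(5000, {
  resampled <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  effect_size(resampled)
})
quantile(boot_reps, c(0.025, 0.975))   # percentile confidence interval on the effect size
```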

Causality

  1. Don’t avoid it!
  2. Use sensible notation to talk about causal connections: DAGs (a DAG sketch follows this list).
  3. Two main objectives for use of models
    • Prediction: no need to worry about causation, just performance.
    • Intervention: causation is central.
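
A hypothetical causal network written as a DAG, just for concreteness. The dagitty package is one convenient way to write down and draw such graphs; both the package and the particular network are my choices for illustration.

```r
library(dagitty)

dag <- dagitty("dag {
  smoking -> health
  exercise -> health
  age -> smoking
  age -> exercise
  age -> health
}")
plot(graphLayout(dag))   # assign coordinates and draw the graph
```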

Incorporate prior knowledge

  1. Bayes early (a minimal sketch follows this list)
  2. Hypothetical causal networks
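
A minimal sketch of what “Bayes early” can look like computationally: prior knowledge about a proportion updated by data on a grid. The prior and the data are made up purely for illustration.

```r
p <- seq(0, 1, by = 0.01)                      # candidate values of a proportion
prior <- dbeta(p, 2, 2)                        # prior knowledge: values near 0.5 more plausible
likelihood <- dbinom(7, size = 10, prob = p)   # hypothetical data: 7 successes in 10 trials
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)        # normalize so the posterior sums to 1
plot(p, posterior, type = "l")                 # updated knowledge about the proportion
```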

Be modern

  1. Machine learning
  2. Graphics
  3. Bootstrapping and cross-validation as the basic inference techniques (a cross-validation sketch follows this list)
    • But there’s a chapter on “small data” to show there are situations where these don’t work.
  4. Data storage
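
Since cross-validation is treated as a basic inference technique, here is a bare-bones k-fold sketch in base R; the dataset and the model formula are illustrative choices.

```r
set.seed(202)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to a fold

errors <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  mod   <- lm(mpg ~ hp + wt, data = train)             # fit on the training folds
  mean((test$mpg - predict(mod, newdata = test))^2)    # error on the held-out fold
})
mean(errors)   # estimated out-of-sample mean squared error
```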

Emphasize judgement

And subjectivity where it’s important.

Show some of the routes to false discovery (so that people can keep them in mind when they read research findings and so that they can try to avoid them in their own work).
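
One such route, shown as a small simulation: run many tests on pure noise and something will come out “significant” by chance. The numbers are illustrative.

```r
set.seed(303)
p_values <- replicate(100, {
  x <- rnorm(30)   # explanatory variable: pure noise
  y <- rnorm(30)   # response: unrelated noise
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
sum(p_values < 0.05)   # about 5 "discoveries" expected even though nothing is there
```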

Take GAISE 2016 seriously