*Stats for Data Science*: principles and objectives

These are some principles that I would like to follow in writing *Stats for Data Science*. I don’t always succeed, and it’s helpful to have others point out places where I’ve deviated from them.

Pedagogy

  1. It’s intro stats, so make repeated, reinforcing use of a small set of concepts and of statistical “objects.”
    • Example: graphics.
      • Just three kinds of glyphs used: data, interval, density
      • Frame is always response versus explanatory
    • Example: intervals
      • Describe single variables using an interval, just as we do for confidence and prediction intervals.
    • Example: Computing
      • Roles of variables denoted in the same way for all calculations: response ~ explanatory.
      • But … we’ve got two notations for passing data: data = and %>% (a sketch of both follows this list).
        • emphasize %>% when the output is the same kind of thing as the input: wrangling, graphics
        • emphasize data = otherwise.
    • Example: models
      • Avoid referring to the internal details. Instead, emphasize things you can do with models:
        • Evaluate (including prediction intervals)
        • Effect size
        • Bootstrap & cross-validate
      • Use same notation for all: response ~ explanatory
      • Two basic kinds of models: bounded and unbounded
      • Avoid misleading terms, e.g. I use “proportional combination” rather than “linear.”
  2. Use data for all graphics.
    • Lay conceptual objects on top of data
    • Tie statistical quantities to the variables, e.g. differences and effect sizes have units.
  3. Leave out legacy notation, methods, and concepts that are not needed
    • standard error, chi-squared, graphics modes developed for pen and paper
    • soft-pedal population. Our objective is never to estimate a population parameter; it is to make a meaningful prediction or to correctly anticipate the effect of an intervention.
  4. Be relevant in examples
    • No urns!
    • Climate change, health, …
    • Try to be authentic in the use of statistical methods
      • avoid poll examples unless the point is to talk about stratified sampling and adjustment
  5. Be relevant in methods
    • risk ratios (and a little on odds ratios); a risk-ratio sketch follows this list
    • methods must provide a role for covariates
    • adjustment, stratified sampling, etc.
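
To make the notation conventions concrete, here is a minimal R sketch of the two ways of passing data. The dplyr and ggformula packages are my assumed stand-ins for the wrangling and formula-based graphics layer; the principle itself doesn’t depend on that choice.

```r
library(dplyr)      # provides %>% and the wrangling verbs
library(ggformula)  # formula-interface graphics (an assumed toolkit, not a prescription)

# Wrangling and graphics: the output is the same kind of thing as the input,
# so the data flows in through %>%
mtcars %>%
  filter(cyl != 6) %>%
  gf_point(mpg ~ hp)                 # response ~ explanatory

# Modeling: the output is a different kind of object, so pass the data with data =
mod <- lm(mpg ~ hp, data = mtcars)   # same response ~ explanatory notation
```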
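
And a risk-ratio sketch that gives a covariate a role, using the built-in Titanic counts; the dplyr/tidyr wrangling is my own choice of tooling for illustration.

```r
library(dplyr)
library(tidyr)

titanic <- as.data.frame(Titanic)   # built-in counts: Class, Sex, Age, Survived, Freq

# Risk of survival by sex, then the overall risk ratio (Female vs. Male)
overall <- titanic %>%
  group_by(Sex) %>%
  summarise(risk = sum(Freq[Survived == "Yes"]) / sum(Freq))
with(overall, risk[Sex == "Female"] / risk[Sex == "Male"])

# The same risk ratio within each passenger class (the covariate)
titanic %>%
  group_by(Class, Sex) %>%
  summarise(risk = sum(Freq[Survived == "Yes"]) / sum(Freq), .groups = "drop") %>%
  pivot_wider(names_from = Sex, values_from = risk) %>%
  mutate(risk_ratio = Female / Male)
```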

Avoid questionable inference

  1. Put p-values and confidence intervals in their place.
    • Confidence intervals are always on effect sizes
  2. Avoid spurious precision.
    • the 1.96 multiplier (a bootstrap alternative is sketched below)
    • small-data techniques (e.g. the t distribution)
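
For concreteness, here is one way to put a confidence interval on an effect size without a 1.96 (or t) multiplier: a percentile bootstrap. The dataset and the particular effect size are illustrative choices on my part.

```r
# Effect size: difference in mean mpg between manual (am == 1) and
# automatic (am == 0) cars in mtcars. Units: miles per gallon.
effect_size <- function(d) mean(d$mpg[d$am == 1]) - mean(d$mpg[d$am == 0])

set.seed(101)
boot_reps <- replicate(5000, {
  resampled <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  effect_size(resampled)
})
quantile(boot_reps, c(0.025, 0.975))   # percentile confidence interval on the effect size
```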

Causality

  1. Don’t avoid it!
  2. Use sensible notation to talk about causal connections: DAGs (a DAG sketch follows this list).
  3. Two main objectives for use of models
    • Prediction: no need to worry about causation, just performance.
    • Intervention: causation is central.
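
A hypothetical causal network written as a DAG, just for concreteness. The dagitty package is one convenient way to write down and draw such graphs; both the package and the particular network are my choices for illustration.

```r
library(dagitty)

dag <- dagitty("dag {
  smoking -> health
  exercise -> health
  age -> smoking
  age -> exercise
  age -> health
}")
plot(graphLayout(dag))   # assign coordinates and draw the graph
```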

Incorporate prior knowledge

  1. Bayes early (a minimal sketch follows this list)
  2. Hypothetical causal networks
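
A minimal sketch of what “Bayes early” can look like computationally: prior knowledge about a proportion updated by data on a grid. The prior and the data are made up purely for illustration.

```r
p <- seq(0, 1, by = 0.01)                      # candidate values of a proportion
prior <- dbeta(p, 2, 2)                        # prior knowledge: values near 0.5 more plausible
likelihood <- dbinom(7, size = 10, prob = p)   # hypothetical data: 7 successes in 10 trials
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)        # normalize so the posterior sums to 1
plot(p, posterior, type = "l")                 # updated knowledge about the proportion
```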

Be modern

  1. Machine learning
  2. Graphics
  3. Bootstrapping and cross-validation as the basic inference techniques (a cross-validation sketch follows this list)
    • But there’s a chapter on “small data” to show there are situations where these don’t work.
  4. Data storage
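
Since cross-validation is treated as a basic inference technique, here is a bare-bones k-fold sketch in base R; the dataset and the model formula are illustrative choices.

```r
set.seed(202)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to a fold

errors <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  mod   <- lm(mpg ~ hp + wt, data = train)             # fit on the training folds
  mean((test$mpg - predict(mod, newdata = test))^2)    # error on the held-out fold
})
mean(errors)   # estimated out-of-sample mean squared error
```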

Emphasize judgement

And subjectivity where it’s important.

Show some of the routes to false discovery (so that people can keep them in mind when they read research findings and so that they can try to avoid them in their own work).
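
One such route, shown as a small simulation: run many tests on pure noise and something will come out “significant” by chance. The numbers are illustrative.

```r
set.seed(303)
p_values <- replicate(100, {
  x <- rnorm(30)   # explanatory variable: pure noise
  y <- rnorm(30)   # response: unrelated noise
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
sum(p_values < 0.05)   # about 5 "discoveries" expected even though nothing is there
```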

Take GAISE 2016 seriously