Two kinds of explanatory variables
- Sometimes explanatory variables are categorical, sometimes quantitative.
- In stratification, we’ve dealt with the situation where all explanatory variables are categorical
- we only needed to be concerned with whether the response was quantitative or categorical
- quantitative: a summary interval. Why? Because our initial interest was prediction, not estimation.
- categorical: a probability for each outcome level
- Modeling functions provide a more general framework that
- encompasses stratification (see the sketch after this list)
- can work with quantitative explanatory variables
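A minimal sketch of the "encompasses stratification" point, in base R with the built-in mtcars data (the data set and variables are my illustrative choices, not part of the course materials): a model with one categorical explanatory variable reproduces the same group-wise means that stratification gives.

```r
# Stratification: the mean of mpg within each cyl group
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# The same result from a modeling function with a categorical input:
# the model's output for each cyl level is the group mean.
mod <- lm(mpg ~ factor(cyl), data = mtcars)
predict(mod, newdata = data.frame(cyl = c(4, 6, 8)))
```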
What should be the output of a modeling function?
Traditionally, we format the output as a number.
Instead, the output will depend on whether the response is a quantitative or a categorical variable, just as with stratification:
- quantitative response: output will be an interval. We can call this a prediction interval, but it’s more or less the same as a summary interval.
- categorical response: output will be a probability for each level. (Both kinds of output are sketched in code below.)
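A minimal sketch of the two kinds of output, again with mtcars as an illustrative assumption:

```r
# Quantitative response: a prediction interval for mpg at a given input setting
quant_mod <- lm(mpg ~ hp + wt, data = mtcars)
predict(quant_mod, newdata = data.frame(hp = 110, wt = 2.6),
        interval = "prediction", level = 0.95)

# Categorical response (two levels): a probability for the outcome level
cat_mod <- glm(am ~ hp + wt, data = mtcars, family = binomial)
predict(cat_mod, newdata = data.frame(hp = 110, wt = 2.6),
        type = "response")   # probability that am is 1
```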
How should we represent modeling functions?
Traditionally:
- a straight line: slope and intercept
- we rarely go on to have functions with multiple inputs
- graphics modes (color and faceting) make it feasible to present functions of four explanatory variables. (Four is possible, but difficult to interpret.)
- more generally, multiple coefficients
For data science:
- impractical to keep track of coefficients when there are multiple inputs.
- many model architectures do not have coefficients
- tree models & random forests
- support vector machines & neural networks
- classifiers such as k-nearest neighbors, linear & quadratic discriminant analysis
- we’re going to be using a computer, so represent a model as a software function: it takes inputs and produces outputs (sketched below)
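A minimal sketch of the software-function representation (mtcars again as the illustrative assumption): wrap a fitted model object in an ordinary R function whose arguments are the explanatory variables.

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)

# The model as a software function: takes inputs, produces outputs
model_fun <- function(hp, wt) {
  predict(fit, newdata = data.frame(hp = hp, wt = wt))
}

model_fun(hp = 110, wt = 2.6)          # evaluate at one input setting
model_fun(hp = c(90, 150), wt = 2.6)   # or at several
```

If the mosaic package is in use, makeFun() does this same wrapping automatically, but the hand-rolled version makes the idea explicit.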
Names for families of functions?
The two families that will do much of the heavy lifting are:
- general linear models
- generalized linear models
Problems:
- The names are almost identical and are not descriptive.
- The names suggest geometrical lines
My proposal:
- The basic technology of linear combinations of functions will be called “proportional combinations”
- The distinction between “general” and “generalized” will be presented as unbounded and bounded
- numerical response variable: use unbounded
- categorical response variable: the output will be a probability, so use bounded, since the output has to be between zero and one (see the sketch after this list)
- What about the other forms of GLMs? I’m not seeing a need for them in an early stats course, but if we decided differently we could use names like probability-output model or count-output model. And we have other technologies to draw on for multinomial outcomes, e.g. LDA, SVM, trees.
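A minimal sketch of the unbounded/bounded distinction (the coefficients are made up for illustration): both start from the same proportional combination; the bounded version passes it through the logistic function so the output lands between zero and one.

```r
# A proportional (linear) combination of two inputs; arbitrary coefficients
prop_comb <- function(x1, x2, b = c(0.5, 1.2, -0.8)) {
  b[1] + b[2] * x1 + b[3] * x2
}

prop_comb(3, 4)           # unbounded: any real number
plogis(prop_comb(3, 4))   # bounded: logistic squashing keeps it in (0, 1)
```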
The technology
It’s hard to know whether to package up the software to correspond exactly to the naming convention, e.g. bounded() and unbounded(). But provisionally, I’ll use:
- Unbounded: lm()
- Bounded: glm()
and importantly, we need a way to specify functions that will not be straight:
- not straight: ns(x, df), from the splines package (ns stands for “natural spline”); no reason to use polynomials. (A sketch follows.)
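A minimal sketch putting the three pieces together (mtcars again as the illustrative assumption):

```r
library(splines)

# Unbounded: lm(); the ns() term lets the function bend rather than stay straight
unbounded_mod <- lm(mpg ~ ns(hp, df = 3) + wt, data = mtcars)

# Bounded: glm() with family = binomial, so the output is a probability
bounded_mod <- glm(am ~ hp + wt, data = mtcars, family = binomial)

new_inputs <- data.frame(hp = c(90, 150, 250), wt = 2.6)
predict(unbounded_mod, newdata = new_inputs)                    # any real numbers
predict(bounded_mod, newdata = new_inputs, type = "response")   # between 0 and 1
```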
Operations on models
- Evaluate on a set of inputs
- Graph
- Effect size (which will be covered in another chapter)
- Support for inference (also in other chapters)
- model errors, e.g. sum of squared errors/residuals (see the sketch after this list)
- cross validation
- bootstrapping
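A minimal sketch of evaluation and model error, continuing the mtcars example (effect size, inference, cross validation, and bootstrapping belong to their own chapters):

```r
mod <- lm(mpg ~ hp + wt, data = mtcars)

# Evaluate the model on a set of inputs
predict(mod, newdata = data.frame(hp = c(100, 200), wt = c(2.5, 3.5)))

# Model error: sum of squared residuals
sum(residuals(mod)^2)
```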