Chap 9: Modeling functions

Two kinds of explanatory variables

  • Sometimes explanatory variables are categorical, sometimes quantitative.
  • In stratification, we’ve dealt with the situation where all explanatory variables are categorical
    • we only needed to be concerned whether the response was quantitative or categorical
      • quantitative: summary interval. Why?
        • our initial interest was prediction, not estimation
      • categorical: probability of each outcome level
  • Modeling functions provide a more general framework that
    • encompasses stratification
    • can work with quantitative explanatory variables
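A minimal sketch of how a modeling function encompasses stratification, assuming R and its built-in mtcars data (the data set and the variables mpg, cyl, and wt are illustrative choices, not from these notes). With only a categorical explanatory variable, the model output matches the within-stratum means; a quantitative explanatory variable needs no new machinery.

```r
# Illustrative data: built-in mtcars, with cyl treated as categorical.
cars <- transform(mtcars, cyl = factor(cyl))

# Stratification: mean of the response within each level of cyl.
strat_means <- tapply(cars$mpg, cars$cyl, mean)

# The same numbers from a modeling function with a categorical input.
mod_cat <- lm(mpg ~ cyl, data = cars)
strata <- data.frame(cyl = factor(levels(cars$cyl), levels = levels(cars$cyl)))
model_means <- predict(mod_cat, newdata = strata)

# A quantitative input, which stratification alone cannot handle.
mod_quant <- lm(mpg ~ wt, data = cars)
quant_out <- predict(mod_quant, newdata = data.frame(wt = c(2.5, 3.5)))
```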

What should be the output of a modeling function?

Traditionally, we format the output as a number.

Instead

The output will depend on whether the response is a quantitative or a categorical variable, just as it did with stratification.

  • quantitative response: output will be an interval. We can call this a prediction interval, but it’s more or less the same as a summary interval.
  • categorical response: output will be a probability for each level.
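The two output formats might look like this in R. This is a sketch only: it assumes the built-in mtcars data, uses the prediction interval from predict() as a stand-in for the summary interval, and the 95% level is an arbitrary choice.

```r
# Quantitative response: the output is an interval rather than a single number.
mod_mpg <- lm(mpg ~ wt, data = mtcars)
mpg_interval <- predict(mod_mpg, newdata = data.frame(wt = 3),
                        interval = "prediction", level = 0.95)
# mpg_interval has columns fit, lwr, upr.

# Categorical response: the output is a probability for each level.
# In mtcars, am is coded 0 = automatic, 1 = manual.
mod_am <- glm(am ~ wt, data = mtcars, family = binomial)
p_manual <- unname(predict(mod_am, newdata = data.frame(wt = 3),
                           type = "response"))
am_probs <- c(automatic = 1 - p_manual, manual = p_manual)
```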

How should we represent modeling functions?

Traditionally:

  • a straight line: slope and intercept
    • we rarely go on to have functions with multiple inputs
    • graphics modes (color and faceting) make it feasible to present functions of four explanatory variables. (Four is possible, but difficult to interpret.)
  • more generally, multiple coefficients

For data science:

  • impractical to keep track of coefficients when there are multiple inputs.
  • many model architectures do not have coefficients
    • tree models & random forests
    • support vector machines & neural networks
    • classifiers such as k-nearest neighbors, linear & quadratic discriminant analysis
  • we’re going to be using a computer, so represent models as a software function: takes inputs and produces outputs
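One way to picture "model as software function" is to wrap a fitted model so that it is called with inputs and returns outputs. The helper as_model_function() below is hypothetical (it is not part of R or of any package), and the mtcars variables are illustrative. The same wrapper works for an architecture with no coefficients to report, shown here with loess() as a stand-in.

```r
# Hypothetical helper: turn a fitted model into a plain function of its inputs.
as_model_function <- function(fit) {
  function(...) {
    inputs <- data.frame(...)            # named inputs become a data frame
    unname(predict(fit, newdata = inputs))
  }
}

# A model with coefficients ...
fit <- lm(mpg ~ wt + hp, data = mtcars)
mpg_fun <- as_model_function(fit)
one_car <- mpg_fun(wt = 3, hp = 110)
two_cars <- mpg_fun(wt = c(2.5, 3.5), hp = c(100, 150))

# ... and a smoother with no coefficients to keep track of.
smooth_fit <- loess(mpg ~ wt, data = mtcars)
smooth_fun <- as_model_function(smooth_fit)
smooth_out <- smooth_fun(wt = 3)
```

The caller never sees coefficients in either case, only inputs and outputs.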

Names for families of functions?

The two families that will do much of the heavy lifting are:

  • general linear models
  • generalized linear models

Problems:

  • The names are almost identical and are not descriptive.
  • The names suggest geometrical lines

My proposal:

  • The basic technology of linear combinations of functions will be called “proportional combinations.”
  • The distinction between “general” and “generalized” will be presented as unbounded and bounded
    • numerical response variable: use unbounded
    • categorical response variable: the output will be a probability, so use bounded (since the output has to be between zero and one)
  • What about the other forms of GLMs? I’m not seeing a need for them in an early stats course, but if we decide differently we could use names like probability-output model, count-output model. And we have other technologies to draw on for multinomial outcomes, e.g. LDA, SVM, trees
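A sketch of what "proportional combination" and "bounded" mean in practice, assuming R's lm() and glm() and the built-in mtcars data (both are illustrative choices). The unbounded output is a weighted sum of the model terms. The bounded output is the same kind of sum passed through the logistic function, which keeps it between zero and one.

```r
# Unbounded: the output is a proportional (linear) combination of model terms.
unbounded_fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(~ wt + hp, data = mtcars)    # columns: intercept, wt, hp
unbounded_out <- as.vector(X %*% coef(unbounded_fit))

# Bounded: the same kind of combination, squeezed into (0, 1) by plogis().
bounded_fit <- glm(am ~ wt, data = mtcars, family = binomial)
Xb <- model.matrix(~ wt, data = mtcars)
combination <- as.vector(Xb %*% coef(bounded_fit))  # can be any real number
bounded_out <- plogis(combination)                  # always a probability
```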

The technology

It’s hard to know whether to package up the software to correspond exactly to the naming convention, e.g. bounded() and unbounded(). But provisionally, I’ll use

  • Unbounded: lm()
  • Bounded: glm(), with family = binomial so that the output is a probability

and importantly, we need a way to specify functions that will not be straight:

  • not straight: ns(x, df), from the splines package. (ns stands for “natural spline”; df sets how flexible the curve can be.)
  • no reason to use polynomials.
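A short sketch contrasting a straight and a not-straight function, assuming the splines package that ships with R and the built-in mtcars data. The choice df = 3 is arbitrary and only for illustration.

```r
library(splines)

# Straight: the output changes at a constant rate with wt.
straight <- lm(mpg ~ wt, data = mtcars)

# Not straight: a natural spline lets the rate of change vary with wt.
curved <- lm(mpg ~ ns(wt, df = 3), data = mtcars)

# Evaluate both on an evenly spaced grid of inputs.
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
straight_out <- predict(straight, newdata = grid)
curved_out <- predict(curved, newdata = grid)
```

Because straight lines are a special case of natural splines, the curved model's sum of squared residuals on the training data can be no larger than the straight model's.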

Operations on models

  1. Evaluate on a set of inputs
  2. Graph
  3. Effect size (which will be covered in another chapter)
  4. Support for inference (also in other chapters)
    • model errors, e.g. sum of squared errors/residuals
    • cross validation
    • bootstrapping
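The evaluation and inference-support operations can be sketched in base R as follows. The model, the number of folds (4), the number of bootstrap trials (200), and the seed are all arbitrary illustrative choices; graphing and effect size are left out here.

```r
set.seed(101)  # arbitrary seed so the resampling is reproducible

fit <- lm(mpg ~ wt, data = mtcars)

# Evaluate on a set of inputs.
eval_out <- predict(fit, newdata = data.frame(wt = c(2, 3, 4)))

# Model error: sum of squared residuals on the training data.
sse_in_sample <- sum(residuals(fit)^2)

# Cross validation: fit on k - 1 folds, measure error on the held-out fold.
k <- 4
fold <- sample(rep(seq_len(k), length.out = nrow(mtcars)))
sse_cv <- 0
for (i in seq_len(k)) {
  train <- mtcars[fold != i, ]
  held_out <- mtcars[fold == i, ]
  fit_i <- lm(mpg ~ wt, data = train)
  errors <- held_out$mpg - predict(fit_i, newdata = held_out)
  sse_cv <- sse_cv + sum(errors^2)
}

# Bootstrapping: refit on resampled rows and watch how the output varies.
boot_out <- replicate(200, {
  rows <- sample(nrow(mtcars), replace = TRUE)
  fit_b <- lm(mpg ~ wt, data = mtcars[rows, ])
  unname(predict(fit_b, newdata = data.frame(wt = 3)))
})
boot_interval <- quantile(boot_out, c(0.025, 0.975))
```

Comparing sse_cv with sse_in_sample shows how much the training-data error understates the error on new cases.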