Chap 2: What to teach about data?

1. Concepts about “data”

  • Data is recorded facts. One definition is, “Anything that can (potentially) be stored on a computer.”
  • There are many formats for data: video, gene sequences, satellite images, …
  • The format that is most important in statistics is called a data frame, which contains “tidy data.”

2. Tidy data

  • A spreadsheet-like format
  • Rectangular, with rows and columns; a grid of cells.
  • The entries in the cells are called values.
  • The top row is used to give names to the columns.
  • The columns are called “variables.”
    • Every value in a given column must be the same kind of value.
      • Some columns have numerical values. We call these variables “quantitative” or, equivalently, “numerical.”
      • Some columns have words/characters/names. We call these variables “categorical”.
  • The rows are called … well, there are a lot of choices: “cases,” “tuples,” “instances,” “units of observation,” etc. We will simply call them rows.
  • Either quantitative or categorical variables might have some data missing. We use the symbol NA (“not available”) to mark such situations.
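
The properties listed above can be checked mechanically. The following is a minimal sketch in Python (standard library only), not part of the course software. The data are made up, the helper name variable_kind is hypothetical, and using None to stand in for NA is an assumption of the sketch.

```python
# A tiny, made-up data frame: one dict per row, one key per variable.
# None plays the role of NA (an assumption of this sketch).
rows = [
    {"name": "Ann", "height_cm": 162.0, "eye_color": "brown"},
    {"name": "Bo", "height_cm": None, "eye_color": "blue"},
    {"name": "Cy", "height_cm": 175.5, "eye_color": None},
]


def variable_kind(rows, column):
    """Classify a column by the kind of its non-missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative"
    if all(isinstance(v, str) for v in values):
        return "categorical"
    # A column that mixes numbers and words breaks the tidy-data rule.
    return "mixed (not tidy)"


# The kind of each variable, and how many values are missing in each.
kinds = {column: variable_kind(rows, column) for column in rows[0]}
missing = {column: sum(row[column] is None for row in rows) for column in rows[0]}

print(kinds)
print(missing)
```

Each dict key corresponds to a column name in the top row of the spreadsheet, and each dict corresponds to one row.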

3. How is tidy data different from a spreadsheet?

  • You might use a spreadsheet to store tidy data, but many things that people put into spreadsheets are not tidy data. These include formatting, titles, summary rows or columns, internal calculations such as pivot tables, and summary graphics.
  • When you do any work with a spreadsheet, you usually “open” it and make changes to the file. But with tidy data, you only change the file when you are adding (or correcting) rows.
    • You should not have the spreadsheet open when you are doing statistical work with the data contained in the spreadsheet.
    • Instead, you get the statistics software to read the spreadsheet and always work with the copy of the data created by the statistical software.
  • The first step in working with data in statistics software is to read the spreadsheet. In the statistics software, we refer to the tidy data with an unquoted name, e.g. NHANES rather than a filename like my_data.xlsx.
  • To keep data consistent, it’s best to store the spreadsheet on a system that allows access from all the people authorized to work with it, e.g. as a file on a web server.
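
The read-once, then work-with-a-named-copy workflow can be sketched as follows. This is an illustration in Python rather than in any particular statistics package. It assumes the data are stored as CSV (not .xlsx), uses an in-memory string in place of a real file, and the names read_tidy and animals are hypothetical. The four rows are the animal data from the tutorial.

```python
import csv
import io

# Stand-in for the contents of a file on disk or on a web server.
file_text = (
    "species,size,color\n"
    "A,large,reddish\n"
    "B,large,brownish\n"
    "B,small,brownish\n"
    "A,large,brownish\n"
)


def read_tidy(handle):
    """Read a CSV whose top row names the columns into a list of row dicts."""
    return list(csv.DictReader(handle))


# Read once. From here on, refer to the data by the name `animals`,
# not by a filename.
animals = read_tidy(io.StringIO(file_text))

# Work happens on the in-memory copy; the stored file is not modified.
large_only = [row for row in animals if row["size"] == "large"]

print(len(animals), "rows read;", len(large_only), "are large")
```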

4. The codebook

The “codebook” contains documentation about the data. At a minimum, it should tell you what the variables stand for, and what the different levels of a categorical variable indicate.

Other words for “codebook”: metadata, documentation, help file.

Activity 1: Using the Point Plot Little App, open the Births_2014 data and examine the codebook.

  1. What does the plurality variable stand for?
  2. What does cig_2 stand for?

5. Quantitative variables

  • The convention is to store them without units.
  • The numbers must all refer to the same units, e.g. don’t mix miles and km.
  • Missing data is represented by the symbol NA.
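
As an illustration of the same-units rule, here is a minimal Python sketch that converts mixed records to a single unit before storing them as bare numbers. The records and the helper to_km are made up for this example, and None again stands in for NA.

```python
KM_PER_MILE = 1.609344  # exact, by definition of the international mile

# Made-up raw records in which the units were mixed.
raw = [(3.0, "km"), (2.0, "mi"), (None, "km")]


def to_km(value, unit):
    """Convert one distance to kilometres; missing values stay missing."""
    if value is None:
        return None
    if unit == "km":
        return value
    if unit == "mi":
        return value * KM_PER_MILE
    raise ValueError(f"unknown unit: {unit!r}")


# The stored column holds numbers only; the unit (km) belongs in the codebook.
distance_km = [to_km(value, unit) for value, unit in raw]

print(distance_km)
```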

6. Categorical variables

  • Typically, there is a fixed set of possibilities for a categorical variable. These are called the levels of the variable.
  • Either the words used to represent the different levels make the meaning obvious, or the codebook should explain what’s what.
  • Missing data is represented by the symbol NA.
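
One way to check a categorical variable against its documented levels is sketched below in Python. The values are made up, the allowed levels are assumed to come from a codebook, and None stands in for NA.

```python
from collections import Counter

# Levels that the (hypothetical) codebook documents for the variable.
allowed_levels = {"large", "small"}

# Made-up values, including one NA and one stray spelling.
values = ["large", "large", "small", "large", None, "Large"]

# Count each level, leaving missing values out of the tally.
counts = Counter(v for v in values if v is not None)

# Anything not in the codebook is a candidate for cleaning.
unexpected = sorted(set(counts) - allowed_levels)
n_missing = sum(v is None for v in values)

print(dict(counts))
print("unexpected levels:", unexpected)
print("missing values:", n_missing)
```

A stray level such as "Large" is usually a data-entry inconsistency rather than a genuinely new category, which is one reason data often needs cleaning.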

7. Data often needs “cleaning”

8. Computing

Chapter 2 (Data) tutorial

Activity 1: To illustrate how stratification is used to build a classifier, consider this very simple, unrealistically small, made-up data frame listing observations of animals:

species  size   color
A        large  reddish
B        large  brownish
B        small  brownish
A        large  brownish

You are going to build classifiers using the data. The output of the classifier will be the probability that the species is A.

  1. Use just size as an explanatory variable. Since there are two levels for size, the classifier can take the form of a simple table giving, for each of the two sizes, the proportion of rows in which the species is A. Fill in the table to reflect the data.
    size   prop_of_A
    large
    small
  2. Repeat (1), but instead of “size”, use just “color” as the explanatory variable.
    color     prop_of_A
    reddish
    brownish
  3. Again build a classifier, but use both color and size as explanatory variables.
    color     size   prop_of_A
    reddish   large
    reddish   small
    brownish  large
    brownish  small
  4. Finally, build the “null model”, a no-input classifier. This means there is just one group, which has all four rows.
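
For checking hand-computed answers, the stratification in this activity can be sketched in Python as below. This is one possible implementation, not the course's own software. The function name prop_of_A mirrors the table header but is otherwise hypothetical.

```python
from collections import defaultdict

# The four rows of the made-up animal data frame.
animals = [
    {"species": "A", "size": "large", "color": "reddish"},
    {"species": "B", "size": "large", "color": "brownish"},
    {"species": "B", "size": "small", "color": "brownish"},
    {"species": "A", "size": "large", "color": "brownish"},
]


def prop_of_A(rows, explanatory):
    """Group rows by the explanatory variables; return the proportion of A in each group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        # The stratum is the combination of levels of the explanatory variables.
        key = tuple(row[name] for name in explanatory)
        totals[key] += 1
        hits[key] += row["species"] == "A"
    return {key: hits[key] / totals[key] for key in totals}


by_size = prop_of_A(animals, ["size"])
by_color = prop_of_A(animals, ["color"])
by_both = prop_of_A(animals, ["color", "size"])
null_model = prop_of_A(animals, [])  # no inputs: a single group with all rows

for table in (by_size, by_color, by_both, null_model):
    print(table)
```

A combination of levels that has no rows in the data does not appear in the result, because a proportion of zero rows is undefined. With no explanatory variables, every row falls into the single group keyed by the empty tuple.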