LECTURE & R LOG on lm and glm model-fitting functions
=========================================================
                                                              11/11/2016

The goal of this lecture is to introduce you to the syntax and computing
formulas underlying the model-fitting function lm and the associated
commands:   update, anova, model.matrix

The relevant R logs to look up in the Rlogs portion of the course web-page are:
     RlogF09.LinRegr.txt     RlogF09.GLM.txt     Factor.Log
Data sources are given there, with the "Bass" data located in:
     http://www.math.umd.edu/~slud/s430.old/Data/bass

lm SYNTAX
=========

> steamdat = matrix(c(10.98, 35.3, 11.13, 29.7, 12.51, 30.8,  8.40, 58.8,
       9.27, 61.4,  8.73, 71.3,  6.36, 74.4,  8.50, 76.7,  7.82, 70.7,
       9.14, 57.5,  8.24, 46.4, 12.19, 28.9, 11.88, 28.1,  9.57, 39.1,
      10.94, 46.8,  9.58, 48.5, 10.09, 59.3,  8.11, 70.0,  6.83, 70.0,
       8.88, 74.5,  7.68, 72.1,  8.47, 58.1,  8.86, 44.6, 10.36, 33.4,
      11.08, 28.6), ncol=2, byrow=T, dimnames=list(NULL, c("steamuse","temp")))
      ### 25 (y,x) pairs : Steam Use is the response

##  FOR SIMPLE LINEAR REGRESSION OF "steamuse" on "temp":

> plot(steamdat[,"temp"], steamdat[,"steamuse"])
> fit1 = lm( steamuse ~ temp, data=data.frame(steamdat) )
      ### Note: the data argument must be a data-frame, while "steamdat" is a matrix

> attributes(fit1)      ## an "lm" object, not exactly a list but defined like one
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

$class
[1] "lm"

> fit1[1:2]
$coefficients
(Intercept)        temp
13.62298927 -0.07982869

$residuals
          1           2           3           4           5           6
 0.17496361 -0.12207708  1.34573449 -0.52906210  0.54849250  0.79879656
          7           8           9          10          11          12
-1.32373449  0.99987151 -0.15910065  0.10716060 -1.67893790  0.87405997
         13          14          15          16          17          18
 0.50019701 -0.93168736  1.05299358 -0.17129764  1.20085225  0.07501926
         19          20          21          22          23          24
-1.20498074  1.20424838 -0.18734048 -0.51494219 -1.20262955 -0.59671091
         25
-0.25988864

> fit1        ### same as: print(fit1)

Call:
lm(formula = steamuse ~ temp, data = data.frame(steamdat))

Coefficients:
(Intercept)         temp
   13.62299     -0.07983

> print(summary(fit1))

Call:
lm(formula = steamuse ~ temp, data = data.frame(steamdat))

Residuals:
    Min      1Q  Median      3Q     Max
-1.6789 -0.5291 -0.1221  0.7988  1.3457

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.62299    0.58146  23.429  < 2e-16 ***
temp        -0.07983    0.01052  -7.586 1.05e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8901 on 23 degrees of freedom
Multiple R-squared:  0.7144,    Adjusted R-squared:  0.702
F-statistic: 57.54 on 1 and 23 DF,  p-value: 1.055e-07

### Now we make sure we know how to compute, and how to gain access inside "fit1"
##  to, all of the displayed quantities.
##  First, the coefficients and sigma^2 are method-of-moments estimates of the
##  unknown parameters (a, b, sig^2) in the model
##       y_i = a + b*x_i + eps_i ,   where the eps_i are iid with mean 0 and variance sig^2.
##  The matrix equation is  yvec = M %*% bvec + epsvec , with (M, yvec) observed
##  and bvec = c(a,b).

> yvec = steamdat[,1]
> M = cbind(Int=1, temp=steamdat[,2])
> bvec = solve(t(M) %*% M, t(M) %*% yvec)    ## least-squares coefficients
> bvec
            [,1]
Int  13.62298927
temp -0.07982869

### This is what you obtain by
> nlm(function(bv) sum((yvec - M %*% bv)^2), c(0,0), gradtol=1.e-10,
      steptol=1.e-10)$est
[1] 13.62298927 -0.07982869

### The estimated SD is  sigma = sqrt( sum of squared residuals / (n-2) )
> c( sqrt(sum((yvec - M %*% bvec)^2)/(length(yvec)-2)),  summary(fit1)$sigma )
[1] 0.8901245 0.8901245
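### A minimal extra sketch (added here; it assumes the objects yvec, M, bvec and
##  the estimated sigma from just above, and the names sighat, covb are introduced
##  only for illustration):  the estimated covariance matrix of the coefficient
##  vector is  sigma^2 * (M'M)^{-1} , and the square roots of its diagonal should
##  reproduce the "Std. Error" column of summary(fit1)$coef shown next.

> sighat = sqrt(sum((yvec - M %*% bvec)^2)/(length(yvec)-2))  ## same as summary(fit1)$sigma
> covb = sighat^2 * solve(t(M) %*% M)      ## estimated covariance matrix of bvec-hat
> sqrt(diag(covb))                         ## standard errors of (Intercept), temp
> bvec/sqrt(diag(covb))                    ## t values, as in the table below
> vcov(fit1)                               ## built-in version of the same covariance matrix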
### The information in summary(fit1) complements that in fit1
> summary(fit1)$coef
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 13.62298927  0.58146349 23.428795 1.496788e-17
temp        -0.07982869  0.01052358 -7.585697 1.054950e-07

> names(summary(fit1))
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

### df = n-p = length(yvec) - 2
> c(fit1$df, length(yvec)-2)
[1] 23 23

### "Residual Sum of Squares" = RSS =
> c(sum(fit1$resid^2), summary(fit1)$sigma^2 * fit1$df )
[1] 18.2234 18.2234

> plot(steamdat[,"temp"], steamdat[,"steamuse"])
  abline(fit1$coef[1], fit1$coef[2], col="red")
  points(steamdat[,"temp"], fit1$fitted, pch=18)

> sum(abs(model.matrix(fit1) - M))
[1] 0

### A first instance of the "anova" command:
> anova(fit1)
Analysis of Variance Table

Response: steamuse
          Df Sum Sq Mean Sq F value    Pr(>F)
temp       1 45.592  45.592  57.543 1.055e-07 ***
Residuals 23 18.223   0.792

### We already saw the RSS and Df.  The "sum of squares due to prediction" is
> c(anova(fit1)[1,2], sum((fit1$fitted - mean(yvec))^2))
[1] 45.5924 45.5924

### The p-value for significance of the "temp" coefficient depends on normal theory
### (the assumption that the eps_i are iid N(0,sig^2)) and is the same using either
### the t-test in the summary(fit1)$coef table above or the F-test, since F = t^2.

##========================================================================
##  Now continue with additional features of Multiple Regression
##========================================================================

##  The data example is now about motor-car miles-per-gallon (supplied in R "datasets")

> mtcars[1:5,]      #### 32 x 11 dataset
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
   ...

##  We initially try to explain y = mpg in terms of the x-variables cyl, wt, gear, carb

> fit2 = lm( mpg ~ cyl + wt + gear + carb, data=mtcars)

##  The syntax  y ~ x1 + x2 + ...  allows us to specify variables by name
##  as given in the data-frame data=mtcars.
##  The "+" signs only indicate that variables are to be included in the fit.
##  Later, when we modify one model to get another in "update", we will use "-"
##  to indicate previously included variables which are to be removed.
##  To specify a model without an intercept (which is usually not a good idea),
##  put "-1" into the model formula, e.g.  mpg ~ cyl + wt + gear + carb - 1

> summary(fit2)

Call:
lm(formula = mpg ~ cyl + wt + gear + carb, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9646 -1.4057 -0.3622  1.2373  5.7866

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  34.7435     6.6690   5.210 1.73e-05 ***
cyl          -1.1108     0.4965  -2.237  0.03374 *
wt           -2.7810     0.9012  -3.086  0.00465 **
gear          0.9302     1.2348   0.753  0.45780
carb         -0.8046     0.5379  -1.496  0.14633
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.537 on 27 degrees of freedom
Multiple R-squared:  0.8457,    Adjusted R-squared:  0.8228
F-statistic:    37 on 4 and 27 DF,  p-value: 1.371e-10
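### A minimal extra sketch (added here; the names n, p, TSS, RSS, Fstat are
##  introduced only for illustration):  the overall F-statistic "37 on 4 and 27 DF"
##  and its p-value reported by summary(fit2) can be reproduced from the residual
##  and total sums of squares.

> n = nrow(mtcars);  p = length(fit2$coef)        ## 32 observations, 5 coefficients
> TSS = sum((mtcars$mpg - mean(mtcars$mpg))^2)    ## total sum of squares about the mean
> RSS = sum(fit2$resid^2)                         ## residual sum of squares
> Fstat = ((TSS - RSS)/(p-1)) / (RSS/(n-p))
> c(Fstat, 1 - pf(Fstat, p-1, n-p))    ## should match the F-statistic and p-value above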
### Suppose that before going any further we want to change the model by
### removing the apparently unhelpful variables gear, carb and adding the
### interaction term cyl:wt.  We can do that easily by "update"ing,
### starting from the model-fitted object fit2.

> fit2$call$formula
mpg ~ cyl + wt + gear + carb

> fit2B = update(fit2, formula = . ~ . - gear - carb + cyl:wt)
### Here the .'s indicate parts of the formula that are specified in fit2
### NOTE that at this point, the calling formula is
> fit2B$call$formula
mpg ~ cyl + wt + cyl:wt
### and this could have been specified from the beginning also as
##       lm( mpg ~ cyl*wt, data = mtcars )

> fit2B$coef
(Intercept)         cyl          wt      cyl:wt
 54.3068062  -3.8032187  -8.6555590   0.8083947

##  The coefficient for cyl:wt is applied to cyl*wt in the sense of
##  ordinary multiplication.  To confirm this, look at
> M = model.matrix(fit2B);  yvec = mtcars$mpg
> M[1:5,]
                  (Intercept) cyl    wt cyl:wt
Mazda RX4                   1   6 2.620  15.72
Mazda RX4 Wag               1   6 2.875  17.25
Datsun 710                   1   4 2.320   9.28
Hornet 4 Drive              1   6 3.215  19.29
Hornet Sportabout           1   8 3.440  27.52

### This matrix is the M that we use in the matrix equation  yvec = M %*% bvec + eps ,
### and again we can view bvec as a least-squares solution.
##  As before, we could obtain these coefficients from first principles either
##  by least-squares or by Gaussian log-likelihood maximization:

> rbind( LSQ = c(solve( t(M) %*% M, t(M) %*% yvec)),
         argmin = nlm(function(x) -sum(log(dnorm(yvec, M %*% x, 1))),
                      c(mean(yvec), rep(0,3)), steptol=1.e-10, gradtol=1.e-10)$est)
           [,1]      [,2]      [,3]      [,4]
LSQ    54.30681 -3.803219 -8.655559 0.8083947
argmin 54.30681 -3.803219 -8.655559 0.8083947

### Next we look at "anova" again, which gives a useful breakdown of how the sum of
### squared residuals is reduced by increments due to one variable at a time
> anova(fit2B)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq  F value    Pr(>F)
cyl        1 817.71  817.71 145.8563 1.281e-12 ***
wt         1 117.16  117.16  20.8983 8.943e-05 ***
cyl:wt     1  34.20   34.20   6.0995   0.01988 *
Residuals 28 156.98    5.61
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

##  In this breakdown, we can get the sum of squared residuals also as
> c(sum((yvec - M %*% fit2B$coef)^2), sum(fit2B$resid^2))
[1] 156.9762 156.9762

##  and the sum of squares due to including wt after cyl is given as 117.16,
##  which could also be computed directly, "by hand", as
> sum( lm(mpg ~ cyl, data=mtcars)$resid^2 - lm(mpg ~ cyl + wt, data=mtcars)$resid^2 )
[1] 117.1623

### We can see that this model definitely made all coefficients more significant.
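### A minimal extra sketch (added here; the name fit2A is introduced only for
##  illustration):  "anova" can also compare two nested fitted models directly.
##  Since cyl:wt is the last term entered in fit2B, the F value for this
##  comparison should agree with the cyl:wt row of anova(fit2B) above
##  (F about 6.1, p about 0.02).

> fit2A = lm(mpg ~ cyl + wt, data=mtcars)   ## the smaller, nested model
> anova(fit2A, fit2B)                       ## F-test of H0: coefficient of cyl:wt = 0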
### In order to quantify the sense in which predictions improve, consider the
### "multiple R-squared", which can be computed in any of a number of ways:

> aov2 = anova(fit2)
> c(summary(fit2)$r.squared, 1 - sum(fit2$resid^2)/sum((yvec - mean(yvec))^2),
    1 - aov2[nrow(aov2),"Sum Sq"]/sum(aov2[,"Sum Sq"]))
[1] 0.845701 0.845701 0.845701

> summary(fit2B)$r.squared
[1] 0.8605954
### the improvement is not large, but is significant

#### You can see this in the following plot, where circles are actual mpg values,
#### red diamonds are predicted points from fit2 , and blue diamonds from fit2B

> plot(fit2$fitted, yvec, xlab="fit2 fitted values", ylab="observed mpg",
       main="Actual & predicted points in 'mtcars'")
  points(fit2$fitted, fit2$fitted,  col="red",  pch=18)
  points(fit2$fitted, fit2B$fitted, col="blue", pch=18)
  legend(12,32, legend=c("obs'd mpg","fit2 fitted","fit2B fitted"),
         pch=c(1,18,18), col=c("black","red","blue"))

### An "adjusted" R-squared that takes the number of fitted coefficients into account
### is usually preferred in reporting the quality of fit:
> c(summary(fit2)$adj.r.sq, summary(fit2B)$adj.r.sq)
[1] 0.8228419 0.8456592

##  and, e.g., the value for fit2 is calculated alternatively by
> 1 - ((length(yvec)-1)/(length(yvec)-length(fit2$coef))) *
      sum(fit2$resid^2)/sum((yvec - mean(yvec))^2)
[1] 0.8228419

############################################################
##  Some other aspects of model-formula syntax

##  In current versions of R, you can insert functions of the variables directly
##  into model formulas.  For example, suppose that in addition to wt and the
##  other variables in fit2B, we want to include a nonlinear term log(wt):
##        mpg ~ cyl*wt + log(wt)        ### this is OK
##  and this will also work for more complicated function expressions involving
##  existing data-frame columns, as long as the expression is not a sum of terms

> fit3 = update(fit2B, formula = .~. + log(cyl + wt^3))
> model.matrix(fit3)[1:5,]
                  (Intercept) cyl    wt log(cyl + wt^3) cyl:wt
Mazda RX4                   1   6 2.620        3.177417  15.72
Mazda RX4 Wag               1   6 2.875        3.393289  17.25
Datsun 710                   1   4 2.320        2.802582   9.28
Hornet 4 Drive              1   6 3.215        3.669466  19.29
Hornet Sportabout           1   8 3.440        3.885835  27.52

### But if your expression is a sum of terms, even within parentheses, R does not
### treat it as arithmetic: inside a formula the operators "+" and "^" keep their
### formula meanings (adding and crossing terms), so no new column is created.
> fit3 = update(fit2B, formula = .~. + (cyl^2 + wt^3))
> model.matrix(fit3)[1:5,]
                  (Intercept) cyl    wt cyl:wt
Mazda RX4                   1   6 2.620  15.72
Mazda RX4 Wag               1   6 2.875  17.25
Datsun 710                   1   4 2.320   9.28
Hornet 4 Drive              1   6 3.215  19.29
Hornet Sportabout           1   8 3.440  27.52

### ... unless you put the expression inside I(.)
> fit3 = update(fit2B, formula = .~. + I(cyl^2 + wt^3))
> model.matrix(fit3)[1:5,]
                  (Intercept) cyl    wt I(cyl^2 + wt^3) cyl:wt
Mazda RX4                   1   6 2.620        53.98473  15.72
Mazda RX4 Wag               1   6 2.875        59.76367  17.25
Datsun 710                   1   4 2.320        28.48717   9.28
Hornet 4 Drive              1   6 3.215        69.23096  19.29
Hornet Sportabout           1   8 3.440       104.70758  27.52
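### A minimal extra sketch (added here; the names fitQ, fitP are introduced only
##  for illustration):  I(.) is also the usual way to put polynomial terms into a
##  formula, and poly() is an alternative that uses an orthogonal-polynomial
##  coding of the same column space, so the fitted values agree.

> fitQ = lm(mpg ~ wt + I(wt^2), data=mtcars)    ## quadratic in wt, raw powers
> model.matrix(fitQ)[1:3,]                      ## contains columns wt and I(wt^2)
> fitP = lm(mpg ~ poly(wt, 2), data=mtcars)     ## same column space, orthogonal coding
> max(abs(fitQ$fitted - fitP$fitted))           ## should be essentially 0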
### Of course, you could always define new columns as brand-new R objects,
### in which case they can be used in the lm fit even though they are not
### included in the data-frame, e.g.

> newvar = mtcars$cyl^2 + mtcars$wt^3
> fit3 = update(fit2B, formula = .~. + newvar)
> model.matrix(fit3)[1:5,]
                  (Intercept) cyl    wt    newvar cyl:wt
Mazda RX4                   1   6 2.620  53.98473  15.72
Mazda RX4 Wag               1   6 2.875  59.76367  17.25
Datsun 710                   1   4 2.320  28.48717   9.28
Hornet 4 Drive              1   6 3.215  69.23096  19.29
Hornet Sportabout           1   8 3.440 104.70758  27.52

###  FACTOR AND DUMMY VARIABLES

### There is another way in which we conveniently recode variables in regression
### or in other generalized linear models in R, namely through "dummy variables",
### used when certain variables are viewed as category labels rather than as
### numerical values to be combined linearly.
### For example, in our "mtcars" dataset we might want to view "cyl", the number
### of cylinders, not as a quantitative variable but rather as a categorical one,
### with each subset of cars defined by cyl = 4, 6 or 8 getting its own
### intercept term.  In that case, we could define a model

> fit4 = lm( mpg ~ wt + factor(cyl), data=mtcars )
> fit4$coef
 (Intercept)           wt factor(cyl)6 factor(cyl)8
   33.990794    -3.205613    -4.255582    -6.070860

### or even an interaction-model
> fit4B = lm( mpg ~ wt*factor(cyl), data=mtcars)
> fit4B$coef
    (Intercept)              wt    factor(cyl)6    factor(cyl)8 wt:factor(cyl)6
      39.571196       -5.647025      -11.162351      -15.703167        2.866919
wt:factor(cyl)8
       3.454587

### To see why and how the numbers of coefficients have gotten so much larger,
### look at the model matrices:
> model.matrix(fit4)[1:5,]
                  (Intercept)    wt factor(cyl)6 factor(cyl)8
Mazda RX4                   1 2.620            1            0
Mazda RX4 Wag               1 2.875            1            0
Datsun 710                   1 2.320            0            0
Hornet 4 Drive              1 3.215            1            0
Hornet Sportabout           1 3.440            0            1

### This is the matrix M based on which "lm" does least-squares, and the model
### is still viewed as finding a best approximation to the response (mpg) vector
### yvec in terms of linear combinations of the columns of M .
### The meaning here is that we have "coded" the three intercept terms for the
### subgroups with the different numbers (4, 6, 8) of cylinders respectively into
### the coefficients (Intercept), factor(cyl)6, and factor(cyl)8, which is summarized in
> contrasts(factor(mtcars$cyl))
  6 8
4 0 0
6 1 0
8 0 1
##  This says that M puts 0's into the cyl6 and cyl8 columns when cyl=4, but
##  puts a 1 in the respective 6 or 8 column when cyl = 6 or 8.

### You could avoid the ugly variable-names factor(cyl)6, factor(cyl)8 by
### defining the column cyl within mtcars as a factor, and you could also
### control the recoding of columns, which can be done in different ways, e.g.:

> mtcar2 = mtcars
> mtcar2$cyl = factor(mtcar2$cyl)
> contrasts(mtcar2$cyl) = contr.helmert(3)
> contrasts(mtcar2$cyl)
  [,1] [,2]
4   -1   -1
6    1   -1
8    0    2

> fit4 = lm(mpg ~ wt*cyl, data=mtcar2)
> model.matrix(fit4)[1:6,]
                  (Intercept)    wt cyl1 cyl2 wt:cyl1 wt:cyl2
Mazda RX4                   1 2.620    1   -1   2.620  -2.620
Mazda RX4 Wag               1 2.875    1   -1   2.875  -2.875
Datsun 710                   1 2.320   -1   -1  -2.320  -2.320
Hornet 4 Drive              1 3.215    1   -1   3.215  -3.215
Hornet Sportabout           1 3.440    0    2   0.000   6.880
Valiant                     1 3.460    1   -1   3.460  -3.460

###############################################################
########  CONTINUATION:  glm , stepwise model fitting with step
########  using similar model formulas, update's and factors
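### A small preview sketch (added here; the particular logistic-regression model
##  below, with the 0/1 transmission indicator "am" as response and the name fit5,
##  is chosen only for illustration):  the same formula, factor, and update
##  machinery carries over to glm and to stepwise selection with step.

> fit5 = glm( am ~ wt, family=binomial, data=mtcars )   ## logistic regression
> summary(fit5)$coef            ## coefficient table analogous to the lm one
> step(fit2, trace=0)           ## stepwise (AIC-based) search starting from fit2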