LECTURE & R LOG on lm and glm model-fitting functions
=========================================================
                                                              11/11/2016

The goal of this lecture is to introduce you to the syntax and computing
formulas underlying the model-fitting function lm and the associated
commands:   update, anova, model.matrix

The relevant R logs to look up in the Rlogs portion of the course web-page are:
     RlogF09.LinRegr.txt     RlogF09.GLM.txt     Factor.Log
Data sources are given there, with the "Bass" data located in:
     http://www.math.umd.edu/~slud/s430.old/Data/bass

lm SYNTAX
=========

> steamdat = matrix(c(10.98, 35.3, 11.13, 29.7, 12.51, 30.8,  8.40, 58.8,
       9.27, 61.4,  8.73, 71.3,  6.36, 74.4,  8.50, 76.7,  7.82, 70.7,
       9.14, 57.5,  8.24, 46.4, 12.19, 28.9, 11.88, 28.1,  9.57, 39.1,
      10.94, 46.8,  9.58, 48.5, 10.09, 59.3,  8.11, 70.0,  6.83, 70.0,
       8.88, 74.5,  7.68, 72.1,  8.47, 58.1,  8.86, 44.6, 10.36, 33.4,
      11.08, 28.6), ncol=2, byrow=T, dimnames=list(NULL, c("steamuse","temp")))
      ### 25 (y,x) pairs : Steam Use is the response

##  FOR SIMPLE LINEAR REGRESSION OF "steamuse" on "temp":

> plot(steamdat[,"temp"], steamdat[,"steamuse"])
> fit1 = lm( steamuse ~ temp, data=data.frame(steamdat) )
      ### Note: the data argument must be a data-frame, while "steamdat" is a matrix

> attributes(fit1)      ## an "lm" object, not exactly a list but defined like one
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

$class
[1] "lm"

> fit1[1:2]
$coefficients
(Intercept)        temp
13.62298927 -0.07982869

$residuals
          1           2           3           4           5           6
 0.17496361 -0.12207708  1.34573449 -0.52906210  0.54849250  0.79879656
          7           8           9          10          11          12
-1.32373449  0.99987151 -0.15910065  0.10716060 -1.67893790  0.87405997
         13          14          15          16          17          18
 0.50019701 -0.93168736  1.05299358 -0.17129764  1.20085225  0.07501926
         19          20          21          22          23          24
-1.20498074  1.20424838 -0.18734048 -0.51494219 -1.20262955 -0.59671091
         25
-0.25988864

> fit1        ### same as: print(fit1)

Call:
lm(formula = steamuse ~ temp, data = data.frame(steamdat))

Coefficients:
(Intercept)         temp
   13.62299     -0.07983

> print(summary(fit1))

Call:
lm(formula = steamuse ~ temp, data = data.frame(steamdat))

Residuals:
    Min      1Q  Median      3Q     Max
-1.6789 -0.5291 -0.1221  0.7988  1.3457

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.62299    0.58146  23.429  < 2e-16 ***
temp        -0.07983    0.01052  -7.586 1.05e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8901 on 23 degrees of freedom
Multiple R-squared:  0.7144,    Adjusted R-squared:  0.702
F-statistic: 57.54 on 1 and 23 DF,  p-value: 1.055e-07

### Now we make sure we know how to compute, and how to gain access inside "fit1"
##  to, all of the displayed quantities.
##  First, the coefficients and sigma^2 are method-of-moments estimates of the
##  unknown parameters (a, b, sig^2) in the model
##       y_i = a + b*x_i + eps_i ,   where the eps_i are iid with mean 0 and variance sig^2.
##  The matrix equation is  yvec = M %*% bvec + epsvec , with (M, yvec) observed
##  and bvec = c(a,b).

> yvec = steamdat[,1]
> M = cbind(Int=1, temp=steamdat[,2])
> bvec = solve(t(M) %*% M, t(M) %*% yvec)    ## least-squares coefficients
> bvec
            [,1]
Int  13.62298927
temp -0.07982869

### This is what you obtain by
> nlm(function(bv) sum((yvec - M %*% bv)^2), c(0,0), gradtol=1.e-10,
      steptol=1.e-10)$est
[1] 13.62298927 -0.07982869

### The estimated SD is  sigma = sqrt( sum of squared residuals / (n-2) )
> c( sqrt(sum((yvec - M %*% bvec)^2)/(length(yvec)-2)),  summary(fit1)$sigma )
[1] 0.8901245 0.8901245
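### A minimal extra sketch (added here; it assumes the objects yvec, M, bvec and
##  the estimated sigma from just above, and the names sighat, covb are introduced
##  only for illustration):  the estimated covariance matrix of the coefficient
##  vector is  sigma^2 * (M'M)^{-1} , and the square roots of its diagonal should
##  reproduce the "Std. Error" column of summary(fit1)$coef shown next.

> sighat = sqrt(sum((yvec - M %*% bvec)^2)/(length(yvec)-2))  ## same as summary(fit1)$sigma
> covb = sighat^2 * solve(t(M) %*% M)      ## estimated covariance matrix of bvec-hat
> sqrt(diag(covb))                         ## standard errors of (Intercept), temp
> bvec/sqrt(diag(covb))                    ## t values, as in the table below
> vcov(fit1)                               ## built-in version of the same covariance matrix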
### The information in summary(fit1) complements that in fit1
> summary(fit1)$coef
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 13.62298927  0.58146349 23.428795 1.496788e-17
temp        -0.07982869  0.01052358 -7.585697 1.054950e-07

> names(summary(fit1))
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

### df = n-p = length(yvec) - 2
> c(fit1$df, length(yvec)-2)
[1] 23 23

### "Residual Sum of Squares" = RSS =
> c(sum(fit1$resid^2), summary(fit1)$sigma^2 * fit1$df )
[1] 18.2234 18.2234

> plot(steamdat[,"temp"], steamdat[,"steamuse"])
  abline(fit1$coef[1], fit1$coef[2], col="red")
  points(steamdat[,"temp"], fit1$fitted, pch=18)

> sum(abs(model.matrix(fit1) - M))
[1] 0

### A first instance of the "anova" command:
> anova(fit1)
Analysis of Variance Table

Response: steamuse
          Df Sum Sq Mean Sq F value    Pr(>F)
temp       1 45.592  45.592  57.543 1.055e-07 ***
Residuals 23 18.223   0.792

### We already saw the RSS and Df.  The "sum of squares due to prediction" is
> c(anova(fit1)[1,2], sum((fit1$fitted - mean(yvec))^2))
[1] 45.5924 45.5924

### The p-value for significance of the "temp" coefficient depends on normal theory
### (the assumption that the eps_i are iid N(0,sig^2)) and is the same using either
### the t-test in the summary(fit1)$coef table above or the F-test, since F = t^2.

##========================================================================
##  Now continue with additional features of Multiple Regression
##========================================================================

##  The data example is now about motor-car miles-per-gallon (supplied in R "datasets")

> mtcars[1:5,]      #### 32 x 11 dataset
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
   ...

##  We initially try to explain y = mpg in terms of the x-variables cyl, wt, gear, carb

> fit2 = lm( mpg ~ cyl + wt + gear + carb, data=mtcars)

##  The syntax  y ~ x1 + x2 + ...  allows us to specify variables by name
##  as given in the data-frame data=mtcars.
##  The "+" signs only indicate that variables are to be included in the fit.
##  Later, when we modify one model to get another in "update", we will use "-"
##  to indicate previously included variables which are to be removed.
##  To specify a model without an intercept (which is usually not a good idea),
##  put "-1" into the model formula, e.g.  mpg ~ cyl + wt + gear + carb - 1

> summary(fit2)

Call:
lm(formula = mpg ~ cyl + wt + gear + carb, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9646 -1.4057 -0.3622  1.2373  5.7866

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  34.7435     6.6690   5.210 1.73e-05 ***
cyl          -1.1108     0.4965  -2.237  0.03374 *
wt           -2.7810     0.9012  -3.086  0.00465 **
gear          0.9302     1.2348   0.753  0.45780
carb         -0.8046     0.5379  -1.496  0.14633
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.537 on 27 degrees of freedom
Multiple R-squared:  0.8457,    Adjusted R-squared:  0.8228
F-statistic:    37 on 4 and 27 DF,  p-value: 1.371e-10
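### A minimal extra sketch (added here; the names n, p, TSS, RSS, Fstat are
##  introduced only for illustration):  the overall F-statistic "37 on 4 and 27 DF"
##  and its p-value reported by summary(fit2) can be reproduced from the residual
##  and total sums of squares.

> n = nrow(mtcars);  p = length(fit2$coef)        ## 32 observations, 5 coefficients
> TSS = sum((mtcars$mpg - mean(mtcars$mpg))^2)    ## total sum of squares about the mean
> RSS = sum(fit2$resid^2)                         ## residual sum of squares
> Fstat = ((TSS - RSS)/(p-1)) / (RSS/(n-p))
> c(Fstat, 1 - pf(Fstat, p-1, n-p))    ## should match the F-statistic and p-value above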
### Suppose that before going any further we want to change the model by
### removing the apparently unhelpful variables gear, carb and adding the
### interaction term cyl:wt.  We can do that easily by "update"ing,
### starting from the model-fitted object fit2.

> fit2$call$formula
mpg ~ cyl + wt + gear + carb

> fit2B = update(fit2, formula = . ~ . - gear - carb + cyl:wt)
### Here the .'s indicate parts of the formula that are specified in fit2
### NOTE that at this point, the calling formula is
> fit2B$call$formula
mpg ~ cyl + wt + cyl:wt
### and this could have been specified from the beginning also as
##       lm( mpg ~ cyl*wt, data = mtcars )

> fit2B$coef
(Intercept)         cyl          wt      cyl:wt
 54.3068062  -3.8032187  -8.6555590   0.8083947

##  The coefficient for cyl:wt is applied to cyl*wt in the sense of
##  ordinary multiplication.  To confirm this, look at
> M = model.matrix(fit2B);  yvec = mtcars$mpg
> M[1:5,]
                  (Intercept) cyl    wt cyl:wt
Mazda RX4                   1   6 2.620  15.72
Mazda RX4 Wag               1   6 2.875  17.25
Datsun 710                   1   4 2.320   9.28
Hornet 4 Drive              1   6 3.215  19.29
Hornet Sportabout           1   8 3.440  27.52

### This matrix is the M that we use in the matrix equation  yvec = M %*% bvec + eps ,
### and again we can view bvec as a least-squares solution.
##  As before, we could obtain these coefficients from first principles either
##  by least-squares or by Gaussian log-likelihood maximization:

> rbind( LSQ = c(solve( t(M) %*% M, t(M) %*% yvec)),
         argmin = nlm(function(x) -sum(log(dnorm(yvec, M %*% x, 1))),
                      c(mean(yvec), rep(0,3)), steptol=1.e-10, gradtol=1.e-10)$est)
           [,1]      [,2]      [,3]      [,4]
LSQ    54.30681 -3.803219 -8.655559 0.8083947
argmin 54.30681 -3.803219 -8.655559 0.8083947

### Next we look at "anova" again, which gives a useful breakdown of how the sum of
### squared residuals is reduced by increments due to one variable at a time
> anova(fit2B)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq  F value    Pr(>F)
cyl        1 817.71  817.71 145.8563 1.281e-12 ***
wt         1 117.16  117.16  20.8983 8.943e-05 ***
cyl:wt     1  34.20   34.20   6.0995   0.01988 *
Residuals 28 156.98    5.61
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

##  In this breakdown, we can get the sum of squared residuals also as
> c(sum((yvec - M %*% fit2B$coef)^2), sum(fit2B$resid^2))
[1] 156.9762 156.9762

##  and the sum of squares due to including wt after cyl is given as 117.16,
##  which could also be computed directly, "by hand", as
> sum( lm(mpg ~ cyl, data=mtcars)$resid^2 - lm(mpg ~ cyl + wt, data=mtcars)$resid^2 )
[1] 117.1623

### We can see that this model definitely made all coefficients more significant.
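### A minimal extra sketch (added here; the name fit2A is introduced only for
##  illustration):  "anova" can also compare two nested fitted models directly.
##  Since cyl:wt is the last term entered in fit2B, the F value for this
##  comparison should agree with the cyl:wt row of anova(fit2B) above
##  (F about 6.1, p about 0.02).

> fit2A = lm(mpg ~ cyl + wt, data=mtcars)   ## the smaller, nested model
> anova(fit2A, fit2B)                       ## F-test of H0: coefficient of cyl:wt = 0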
### In order to quantify the sense in which predictions improve, consider the
### "multiple R-squared", which can be computed in any of a number of ways:

> aov2 = anova(fit2)
> c(summary(fit2)$r.squared, 1 - sum(fit2$resid^2)/sum((yvec - mean(yvec))^2),
    1 - aov2[nrow(aov2),"Sum Sq"]/sum(aov2[,"Sum Sq"]))
[1] 0.845701 0.845701 0.845701

> summary(fit2B)$r.squared
[1] 0.8605954
### the improvement is not large, but is significant

#### You can see this in the following plot, where circles are actual mpg values,
#### red diamonds are predicted points from fit2 , and blue diamonds from fit2B

> plot(fit2$fitted, yvec, xlab="fit2 fitted values", ylab="observed mpg",
       main="Actual & predicted points in 'mtcars'")
  points(fit2$fitted, fit2$fitted,  col="red",  pch=18)
  points(fit2$fitted, fit2B$fitted, col="blue", pch=18)
  legend(12,32, legend=c("obs'd mpg","fit2 fitted","fit2B fitted"),
         pch=c(1,18,18), col=c("black","red","blue"))

### An "adjusted" R-squared that takes the number of fitted coefficients into account
### is usually preferred in reporting the quality of fit:
> c(summary(fit2)$adj.r.sq, summary(fit2B)$adj.r.sq)
[1] 0.8228419 0.8456592

##  and, e.g., the value for fit2 is calculated alternatively by
> 1 - ((length(yvec)-1)/(length(yvec)-length(fit2$coef))) *
      sum(fit2$resid^2)/sum((yvec - mean(yvec))^2)
[1] 0.8228419

############################################################
##  Some other aspects of model-formula syntax

##  In current versions of R, you can insert functions of the variables directly
##  into model formulas.  For example, suppose that in addition to wt and the
##  other variables in fit2B, we want to include a nonlinear term log(wt):
##        mpg ~ cyl*wt + log(wt)        ### this is OK
##  and this will also work for more complicated function expressions involving
##  existing data-frame columns, as long as the expression is not a sum of terms

> fit3 = update(fit2B, formula = .~. + log(cyl + wt^3))
> model.matrix(fit3)[1:5,]
                  (Intercept) cyl    wt log(cyl + wt^3) cyl:wt
Mazda RX4                   1   6 2.620        3.177417  15.72
Mazda RX4 Wag               1   6 2.875        3.393289  17.25
Datsun 710                   1   4 2.320        2.802582   9.28
Hornet 4 Drive              1   6 3.215        3.669466  19.29
Hornet Sportabout           1   8 3.440        3.885835  27.52

### But if your expression is a sum of terms, even within parentheses, R does not
### treat it as arithmetic: inside a formula the operators "+" and "^" keep their
### formula meanings (adding and crossing terms), so no new column is created.
> fit3 = update(fit2B, formula = .~. + (cyl^2 + wt^3))
> model.matrix(fit3)[1:5,]
                  (Intercept) cyl    wt cyl:wt
Mazda RX4                   1   6 2.620  15.72
Mazda RX4 Wag               1   6 2.875  17.25
Datsun 710                   1   4 2.320   9.28
Hornet 4 Drive              1   6 3.215  19.29
Hornet Sportabout           1   8 3.440  27.52

### ... unless you put the expression inside I(.)
> fit3 = update(fit2B, formula = .~. + I(cyl^2 + wt^3))
> model.matrix(fit3)[1:5,]
                  (Intercept) cyl    wt I(cyl^2 + wt^3) cyl:wt
Mazda RX4                   1   6 2.620        53.98473  15.72
Mazda RX4 Wag               1   6 2.875        59.76367  17.25
Datsun 710                   1   4 2.320        28.48717   9.28
Hornet 4 Drive              1   6 3.215        69.23096  19.29
Hornet Sportabout           1   8 3.440       104.70758  27.52
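### A minimal extra sketch (added here; the names fitQ, fitP are introduced only
##  for illustration):  I(.) is also the usual way to put polynomial terms into a
##  formula, and poly() is an alternative that uses an orthogonal-polynomial
##  coding of the same column space, so the fitted values agree.

> fitQ = lm(mpg ~ wt + I(wt^2), data=mtcars)    ## quadratic in wt, raw powers
> model.matrix(fitQ)[1:3,]                      ## contains columns wt and I(wt^2)
> fitP = lm(mpg ~ poly(wt, 2), data=mtcars)     ## same column space, orthogonal coding
> max(abs(fitQ$fitted - fitP$fitted))           ## should be essentially 0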
### Of course, you could always define new columns as brand-new R objects,
### in which case they can be used in the lm fit even though they are not
### included in the data-frame, e.g.

> newvar = mtcars$cyl^2 + mtcars$wt^3
> fit3 = update(fit2B, formula = .~. + newvar)
> model.matrix(fit3)[1:5,]
                  (Intercept) cyl    wt    newvar cyl:wt
Mazda RX4                   1   6 2.620  53.98473  15.72
Mazda RX4 Wag               1   6 2.875  59.76367  17.25
Datsun 710                   1   4 2.320  28.48717   9.28
Hornet 4 Drive              1   6 3.215  69.23096  19.29
Hornet Sportabout           1   8 3.440 104.70758  27.52

###  FACTOR AND DUMMY VARIABLES

### There is another way in which we conveniently recode variables in regression
### or in other generalized linear models in R, namely through "dummy variables",
### used when certain variables are viewed as category labels rather than as
### numerical values to be combined linearly.
### For example, in our "mtcars" dataset we might want to view "cyl", the number
### of cylinders, not as a quantitative variable but rather as a categorical one,
### with each subset of cars defined by cyl = 4, 6 or 8 getting its own
### intercept term.  In that case, we could define a model

> fit4 = lm( mpg ~ wt + factor(cyl), data=mtcars )
> fit4$coef
 (Intercept)           wt factor(cyl)6 factor(cyl)8
   33.990794    -3.205613    -4.255582    -6.070860

### or even an interaction-model
> fit4B = lm( mpg ~ wt*factor(cyl), data=mtcars)
> fit4B$coef
    (Intercept)              wt    factor(cyl)6    factor(cyl)8 wt:factor(cyl)6
      39.571196       -5.647025      -11.162351      -15.703167        2.866919
wt:factor(cyl)8
       3.454587

### To see why and how the numbers of coefficients have gotten so much larger,
### look at the model matrices:
> model.matrix(fit4)[1:5,]
                  (Intercept)    wt factor(cyl)6 factor(cyl)8
Mazda RX4                   1 2.620            1            0
Mazda RX4 Wag               1 2.875            1            0
Datsun 710                   1 2.320            0            0
Hornet 4 Drive              1 3.215            1            0
Hornet Sportabout           1 3.440            0            1

### This is the matrix M based on which "lm" does least-squares, and the model
### is still viewed as finding a best approximation to the response (mpg) vector
### yvec in terms of linear combinations of the columns of M .
### The meaning here is that we have "coded" the three intercept terms for the
### subgroups with the different numbers (4, 6, 8) of cylinders respectively into
### the coefficients (Intercept), factor(cyl)6, and factor(cyl)8, which is summarized in
> contrasts(factor(mtcars$cyl))
  6 8
4 0 0
6 1 0
8 0 1
##  This says that M puts 0's into the cyl6 and cyl8 columns when cyl=4, but
##  puts a 1 in the respective 6 or 8 column when cyl = 6 or 8.

### You could avoid the ugly variable-names factor(cyl)6, factor(cyl)8 by
### defining the column cyl within mtcars as a factor, and you could also
### control the recoding of columns, which can be done in different ways, e.g.:

> mtcar2 = mtcars
> mtcar2$cyl = factor(mtcar2$cyl)
> contrasts(mtcar2$cyl) = contr.helmert(3)
> contrasts(mtcar2$cyl)
  [,1] [,2]
4   -1   -1
6    1   -1
8    0    2

> fit4 = lm(mpg ~ wt*cyl, data=mtcar2)
> model.matrix(fit4)[1:6,]
                  (Intercept)    wt cyl1 cyl2 wt:cyl1 wt:cyl2
Mazda RX4                   1 2.620    1   -1   2.620  -2.620
Mazda RX4 Wag               1 2.875    1   -1   2.875  -2.875
Datsun 710                   1 2.320   -1   -1  -2.320  -2.320
Hornet 4 Drive              1 3.215    1   -1   3.215  -3.215
Hornet Sportabout           1 3.440    0    2   0.000   6.880
Valiant                     1 3.460    1   -1   3.460  -3.460

###############################################################
########  CONTINUATION:  glm , stepwise model fitting with step
########  using similar model formulas, update's and factors
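### A small preview sketch (added here; the particular logistic-regression model
##  below, with the 0/1 transmission indicator "am" as response and the name fit5,
##  is chosen only for illustration):  the same formula, factor, and update
##  machinery carries over to glm and to stepwise selection with step.

> fit5 = glm( am ~ wt, family=binomial, data=mtcars )   ## logistic regression
> summary(fit5)$coef            ## coefficient table analogous to the lm one
> step(fit2, trace=0)           ## stepwise (AIC-based) search starting from fit2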