Interpreting Regression Equations

 

Consider the following data from the 1999 NFL season.

 

TO DIFF is the difference between turnovers captured and turnovers given up, i.e., the greater the number, the better for the team.  TO DIFF ranges from -17 for Atlanta to +21 for Kansas City.  The NFL season is 16 games long, and WINS ranges from 2 for Cleveland to 14 for Jacksonville.

 

 

Team              WINS      TO DIFF

Arizona              6          -13
Atlanta              5          -17
Baltimore            8            0
Buffalo             11           -6
Carolina             8           -5
Chicago              6           -4
Cincinnati           4           -5
Cleveland            2          -11
Dallas               8           10
Denver               6           -2
Detroit              8           10
Green Bay            8            5
Indianapolis        13           -5
Jacksonville        14           12
KC                   9           21
Miami                9           -6
Minnesota           10          -10
NE                   8           -2
NO                   3           -5
NYG                  7           -8
NYJ                  8           13
Oakland              8            4
Philadelphia         5            7
Pittsburgh           6            3
San Diego            8           -8
Seattle              9            3
San Fran             4          -12
St. Louis           13            5
TB                  11           -4
Tennessee           13           18
Washington          10           12

Average              8            0

 

 

Note that since each game has both a winner and a loser, the average number of WINS is 8.  Also, since a turnover captured by team A is one given up by team B, the average value for TO DIFF is 0.

 

The relation between WINS and TO DIFF is depicted in the graph below.  There is an obvious positive relationship between the two variables.

 

 



A linear least squares regression is one way of quantifying the relation between two variables such as these.  A regression equation estimated by a statistical package like SAS or a spreadsheet program such as Excel uses formulas which define a line through the data (hence the term linear).  The line has the property that the sum of the squared vertical distances between the line and the data points is minimized (hence the term least squares).

 

The least squares regression line through the data on turnovers and wins is depicted below.

 


 

The pinkish dots are the points on the estimated regression line through the data.  Note that it passes through the means of the data (0 turnovers, 8 wins).

 

Note also that if a team were to move from a TO DIFF of 0 to a TO DIFF of +15 (remember, takeaways minus giveaways), then the line implies wins would increase from 8 to about 10.  A TO DIFF of +15 in a 16 game season represents about 1 net turnover per game, and such a team wins about two more games in the season.  This suggests that turnovers are a non-trivial, perhaps important, part of NFL football.

The regression line in this case takes the form:

 

WINS = A + B*TODIFF

 

where A is an intercept and B is the slope coefficient.  The slope coefficient measures the incremental effect of TODIFF on WINS, i.e. ΔWINS/ΔTODIFF.  The statistical or spreadsheet package delivers estimates for the intercept and slope coefficients derived from least squares formulas.
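As a rough illustration of what the package is doing, the textbook least squares formulas for a single explanatory variable can be sketched in a few lines of Python.  The names DATA and least_squares are made up for this sketch; the numbers are the WINS and TO DIFF values from the table above.

```python
# (WINS, TO DIFF) for the 31 teams, copied from the table above.
DATA = {
    "Arizona": (6, -13), "Atlanta": (5, -17), "Baltimore": (8, 0),
    "Buffalo": (11, -6), "Carolina": (8, -5), "Chicago": (6, -4),
    "Cincinnati": (4, -5), "Cleveland": (2, -11), "Dallas": (8, 10),
    "Denver": (6, -2), "Detroit": (8, 10), "Green Bay": (8, 5),
    "Indianapolis": (13, -5), "Jacksonville": (14, 12), "KC": (9, 21),
    "Miami": (9, -6), "Minnesota": (10, -10), "NE": (8, -2),
    "NO": (3, -5), "NYG": (7, -8), "NYJ": (8, 13), "Oakland": (8, 4),
    "Philadelphia": (5, 7), "Pittsburgh": (6, 3), "San Diego": (8, -8),
    "Seattle": (9, 3), "San Fran": (4, -12), "St. Louis": (13, 5),
    "TB": (11, -4), "Tennessee": (13, 18), "Washington": (10, 12),
}


def least_squares(xs, ys):
    """Return (A, B) for the least squares line y = A + B*x.

    B = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    A = mean_y - B * mean_x, so the line passes through the means.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


wins = [w for w, _ in DATA.values()]
todiff = [t for _, t in DATA.values()]
A, B = least_squares(todiff, wins)
print(f"WINS = {A:.4f} + {B:.4f}*TODIFF")
```

Because the intercept is defined as the mean of WINS minus B times the mean of TODIFF, the fitted line always passes through the point of means, here (0, 8).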

 

 

The statistical tables presented in the papers we'll be reading in class have three important things to focus on.

 

1.  The magnitude of the coefficients in the estimated linear regression.

 

We will examine the magnitude of the coefficients to see how reasonable changes in a variable such as turnovers affect an outcome of interest, such as wins.

 

 

2.  The precision of the estimates (t-stat or t-ratio).

 

The t-ratios reported in tables are typically the coefficient estimate divided by its standard error.  The larger the standard error, the less precise the coefficient estimate.  Obviously, we prefer to make observations based on coefficients which are large relative to the imprecision in their estimate.  Hence, the larger the t-ratio, the more confident we are in making these observations.

 

Our rule of thumb is that coefficient estimates are reasonably precise when their t-ratios are 2.0 or greater in absolute value (a negative coefficient has a negative t-ratio).  There is some statistical theory behind this rule of thumb, but we don't need to go into that here.

 

 

3.  The explanatory power of the regression (R² or R-squared).

 

R² measures the fraction of the variation of the dependent variable that is explained by the regression equation.  R² thus lies between zero and one.

 

If R² = 1, then all data points would lie on the regression line.  For lesser values of R², the data points will be dispersed some distance from the line.  The better the fit of the line to the data, the higher the R².  We are thus more confident in making observations when R² is high (near 1) than low (near 0).

 

 


The output from an Excel spreadsheet regression of WINS on TODIFF is listed below.

 

 

SUMMARY OUTPUT

Regression Statistics
Multiple R             0.448752262
R Square               0.201378593
Adjusted R Square      0.173839923
Standard Error         2.716682416
Observations           31

ANOVA
               df      SS               MS
Regression      1      53.96946284      53.96946284
Residual       29      214.0305372      7.38036335
Total          30      268

               Coefficients     Standard Error     t Stat
Intercept      8                0.487930566        16.39577546
TO DIFF        0.140912436      0.052109169        2.704177382

 

 

From the column labeled coefficients, we see that A=8 and B=.1409.  Hence, our estimated equation is

 

WINS = 8 + 0.1409*TODIFF

 

The t-Stat column reports the t-ratios for each coefficient estimate.  They are both well beyond our cutoff point of 2.0, so we can have some confidence in making observations based on this table.
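The t Stat column can be reproduced by hand from the other two columns.  A minimal sketch in Python, using the coefficient estimates and standard errors from the Excel output above (the names estimates, CUTOFF and t_ratio are made up for this sketch):

```python
# (coefficient, standard error) pairs from the Excel output above.
estimates = {
    "Intercept": (8.0, 0.487930566),
    "TO DIFF": (0.140912436, 0.052109169),
}

CUTOFF = 2.0  # the rule-of-thumb threshold used in this handout


def t_ratio(coefficient, standard_error):
    """t-ratio: the coefficient estimate divided by its standard error."""
    return coefficient / standard_error


for name, (coef, se) in estimates.items():
    t = t_ratio(coef, se)
    verdict = "reasonably precise" if abs(t) >= CUTOFF else "imprecise"
    print(f"{name}: t = {t:.3f} ({verdict})")
```

Comparing abs(t) to the cutoff, rather than t itself, keeps the check valid for negative coefficients.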

 

The value for R² is .201, indicating 20% of the variance in WINS across NFL teams is "explained" by just TODIFF.  This is surprisingly large, but far from 1.
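The R Square figure can be recovered from the SS (sum of squares) column of the ANOVA table: it is the regression sum of squares as a share of the total.  A short Python check, with variable names chosen only for this sketch:

```python
import math

# Sums of squares from the ANOVA table above.
ss_regression = 53.96946284   # variation in WINS explained by the line
ss_residual = 214.0305372     # variation left unexplained
ss_total = 268.0              # total variation in WINS around its mean of 8

r_squared = ss_regression / ss_total
r_squared_alt = 1 - ss_residual / ss_total   # same quantity, computed from the residual
multiple_r = math.sqrt(r_squared)            # "Multiple R" in the Excel output

print(f"R Square   = {r_squared:.4f}")
print(f"Multiple R = {multiple_r:.4f}")
```

With only one explanatory variable, Multiple R is simply the absolute value of the correlation between WINS and TO DIFF.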

 

If we were to give a team 1 additional turnover per game, TODIFF would rise by 16 (given the 16 game season).

 

Since the coefficient estimate is a slope, we can then calculate the following:

 

ΔWINS = .1409*ΔTODIFF

 

or ΔWINS = .1409*16 = 2.25

for ΔTODIFF = 16

 

Hence, 1 additional turnover per game accounts for about 2 more wins in a 16 game NFL season.
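The same slope calculation can be written as a small Python function.  This is only a sketch: the function name delta_wins is made up, and it uses the unrounded slope from the Excel output, which gives a change of about 2.25 wins.

```python
SLOPE = 0.140912436   # TO DIFF coefficient from the Excel output
GAMES = 16            # length of the NFL season


def delta_wins(delta_todiff, slope=SLOPE):
    """Predicted change in WINS for a given change in TODIFF."""
    return slope * delta_todiff


extra_turnovers_per_game = 1
change = delta_wins(extra_turnovers_per_game * GAMES)
print(f"Change in WINS: {change:.2f}")
```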

 

One can add more variables to the equation, which puts us in the realm of multiple or multivariate (meaning more than 1 independent variable) regression.

 

In the case of NFL wins, the strength of the offense at gaining yards and of the defense at preventing opponents from gaining yards are obviously relevant.  If we estimate a linear regression using these factors in addition to turnovers we obtain the following (again, data from the 1999 season).  In the table OYDS represents average yards gained per game by a team's offense.  DYDS represents average yards allowed per game by a team's defense.

 

 

SUMMARY OUTPUT

Regression Statistics
Multiple R             0.727705427
R Square               0.529555188
Adjusted R Square      0.477283543
Standard Error         2.160925809
Observations           31

ANOVA
               df      SS               MS
Regression      3      141.9207905      47.30693017
Residual       27      126.0792095      4.669600352
Total          30      268

               Coefficients     Standard Error     t Stat
Intercept      10.98061036      5.898473127        1.861602168
TO DIFF        0.10322433       0.042604728        2.422837433
OYDS           0.032800104      0.011219634        2.923455698
DYDS           -0.041701242     0.013280327        -3.14007649

 

 

All coefficient estimates have t Stats in excess of 2 in absolute value (except the intercept, and we don't care much about that).

 

The magnitude of the coefficient estimate for TO DIFF (at .103) is smaller now that we take into account offensive and defensive prowess.

 

The coefficient for OYDS is .03.  So a team that increases its yards gained by 50 yards per game could expect to increase its win total by .03(50) = 1.5.

 

The coefficient for DYDS is -.04.  So a team that decreases its yards allowed by 50 yards per game could expect to increase its win total by -.04(-50) = 2.

 

The regression hints that, yard for yard, it is more important to deny your opponent yardage than to gain it yourself.
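The yard-for-yard comparison can be checked with the unrounded coefficients from the output above.  In this Python sketch the names COEF and delta_wins are made up; the unrounded figures come out near 1.64 and 2.09 wins rather than the rounded 1.5 and 2, but the ordering is the same.

```python
# Slope coefficients from the multiple regression output above.
COEF = {
    "TO DIFF": 0.10322433,
    "OYDS": 0.032800104,
    "DYDS": -0.041701242,
}


def delta_wins(changes):
    """Predicted change in WINS, holding unlisted variables fixed.

    `changes` maps a variable name to its change, e.g. {"OYDS": 50}.
    """
    return sum(COEF[name] * change for name, change in changes.items())


gain_offense = delta_wins({"OYDS": 50})    # gain 50 more yards per game
gain_defense = delta_wins({"DYDS": -50})   # allow 50 fewer yards per game

print(f"50 more yards gained:   {gain_offense:+.2f} wins")
print(f"50 fewer yards allowed: {gain_defense:+.2f} wins")
```

Each prediction changes one variable and holds the other two fixed, which is how a multiple regression coefficient is read.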

 

Looking at R², we find that adding these two factors raises the share of the variance in WINS explained by the regression from 20% to about 53%.  Not bad for a simple regression model.
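Adding variables can never lower R², which is why the Excel output also reports Adjusted R Square, a version that penalizes each extra explanatory variable.  The sketch below uses the standard adjustment formula (the function name is made up) and reproduces the adjusted figures in both tables.

```python
def adjusted_r_squared(r_squared, n_obs, n_vars):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    n_obs is the number of observations (n) and n_vars is the number of
    explanatory variables (k), not counting the intercept.
    """
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_vars - 1)


# R Square values and variable counts from the two Excel outputs.
simple = adjusted_r_squared(0.201378593, n_obs=31, n_vars=1)
multiple = adjusted_r_squared(0.529555188, n_obs=31, n_vars=3)

print(f"TO DIFF only:          {simple:.4f}")
print(f"TO DIFF, OYDS, DYDS:   {multiple:.4f}")
```

Even after the penalty, the three-variable model explains far more of the variation in WINS than TO DIFF alone.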