Interpreting Regression Equations
Consider the following data from the 1999 NFL season.
TO DIFF (written TODIFF in the equations below) is the difference between turnovers captured and turnovers given up, i.e. the greater the number, the better for the team. TODIFF ranges from -17 for Atlanta to +21 for Kansas City. The NFL season is 16 games long, and WINS ranges from 2 for Cleveland to 14 for Jacksonville.
| Team | WINS | TO DIFF |
|---|---|---|
| Arizona | 6 | -13 |
| Atlanta | 5 | -17 |
| Baltimore | 8 | 0 |
| Buffalo | 11 | -6 |
| Carolina | 8 | -5 |
| Chicago | 6 | -4 |
| Cincinnati | 4 | -5 |
| Cleveland | 2 | -11 |
| Dallas | 8 | 10 |
| Denver | 6 | -2 |
| Detroit | 8 | 10 |
| Green Bay | 8 | 5 |
| Indianapolis | 13 | -5 |
| Jacksonville | 14 | 12 |
| KC | 9 | 21 |
| Miami | 9 | -6 |
| Minnesota | 10 | -10 |
| NE | 8 | -2 |
| NO | 3 | -5 |
| NYG | 7 | -8 |
| NYJ | 8 | 13 |
| Oakland | 8 | 4 |
| Philadelphia | 5 | 7 |
| Pittsburgh | 6 | 3 |
| San Diego | 8 | -8 |
| Seattle | 9 | 3 |
| San Fran | 4 | -12 |
| St. Louis | 13 | 5 |
| TB | 11 | -4 |
| Tennessee | 13 | 18 |
| Washington | 10 | 12 |
| Average | 8 | 0 |
Note that since each game has both a winner and a loser, the average number of WINS is 8. Also, since a turnover captured by team A is one given up by team B, the average value for TO DIFF is 0.
The relation between WINS
and TO DIFF is depicted in the graph below.
There is an obvious positive relationship between the two variables.

A linear least squares regression is one way of quantifying the relation between two variables such as these. A regression equation estimated by a statistical package like SAS or a spreadsheet program such as Excel uses formulas which define a line through the data (hence the term linear) which has the property that the sum of the squared vertical distances between the line and the data points is minimized (hence the term least squares).
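To make the least squares property concrete, here is a minimal Python sketch. It is not the SAS or Excel procedure itself. It fits a line to the WINS and TO DIFF columns from the table above using the standard closed-form formulas, then checks that nudging the intercept or slope only makes the sum of squared distances larger. The function names `fit_line` and `sse` and the sizes of the nudges are illustrative choices, not part of the original notes.

```python
# WINS and TO DIFF for the 31 teams, in the same order as the table above.
WINS = [6, 5, 8, 11, 8, 6, 4, 2, 8, 6, 8, 8, 13, 14, 9, 9,
        10, 8, 3, 7, 8, 8, 5, 6, 8, 9, 4, 13, 11, 13, 10]
TODIFF = [-13, -17, 0, -6, -5, -4, -5, -11, 10, -2, 10, 5, -5, 12, 21, -6,
          -10, -2, -5, -8, 13, 4, 7, 3, -8, 3, -12, 5, -4, 18, 12]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar  # forces the line through the means
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared vertical distances between the data and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


a, b = fit_line(TODIFF, WINS)
best = sse(TODIFF, WINS, a, b)
print(f"intercept = {a:.4f}, slope = {b:.4f}, SSE = {best:.4f}")

# Illustrative nudges: every perturbed line has a larger sum of squares.
for da, db in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert sse(TODIFF, WINS, a + da, b + db) > best
```

The sum of squares printed here corresponds to the Residual SS entry in the Excel output reproduced later in these notes.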
The least squares regression
line through the data on turnovers and wins is depicted below.

The pinkish dots are the points on the estimated regression line through the data. Note that it passes through the means of the data (a TO DIFF of 0, 8 wins).
Note also that if a team were to move from a TO DIFF of 0 to +15 (remember, takeaways minus giveaways) then the line implies wins would increase from 8 to about 10. A TO DIFF of +15 in a 16 game season represents about 1 net turnover per game. Such a team wins about two more games in the season. This suggests that turnovers are a non-trivial, perhaps important, part of NFL football.
The regression line in this case takes the form:
WINS = A + B*TODIFF
where A is an intercept and B is the slope coefficient. The slope coefficient measures the incremental effect of TODIFF on WINS, i.e. ΔWINS/ΔTODIFF. The statistical or spreadsheet package delivers estimates for the intercept and slope coefficients derived from least squares formulas.
The statistical tables
presented in the papers we'll be reading in class have three important things
to focus on.
1. The magnitude of the coefficients in the
estimated linear regression.
We will examine the magnitude of the coefficients to see how reasonable changes in a variable such as turnovers affect an outcome of interest, such as wins.
2. The precision of the estimates (t-stat or
t-ratio).
The t-ratios reported in tables are typically the coefficient estimate divided by its standard error. The larger the standard error, the less precise the coefficient estimate. We prefer to make observations based on coefficients which are large relative to the imprecision in their estimate. Hence, the larger the t-ratio (in absolute value), the more confident we are in making these observations.
Our rule of thumb is that coefficient estimates are reasonably precise when their t-ratios are 2.0 or larger in absolute value. There is some statistical theory behind this rule of thumb, but we don't need to go into that here.
3. The explanatory power of the regression (R² or R-squared).
R² measures the share of the variation of the dependent variable that is explained by the regression equation. R² thus lies between zero and one (equivalently, between 0% and 100%).
If R² = 1, then all data points would lie on the regression line. For lesser values of R², the data points will be dispersed some distance from the line. The better the fit of the line to the data, the higher the R². We are thus more confident in making observations when R² is high (near 1) than low (near 0).
The output from an Excel
spreadsheet regression of WINS on TODIFF is listed below.
SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.448752262 |
| R Square | 0.201378593 |
| Adjusted R Square | 0.173839923 |
| Standard Error | 2.716682416 |
| Observations | 31 |

| ANOVA | df | SS | MS |
|---|---|---|---|
| Regression | 1 | 53.96946284 | 53.96946284 |
| Residual | 29 | 214.0305372 | 7.38036335 |
| Total | 30 | 268 | |

| | Coefficients | Standard Error | t Stat |
|---|---|---|---|
| Intercept | 8 | 0.487930566 | 16.39577546 |
| TO DIFF | 0.140912436 | 0.052109169 | 2.704177382 |
From the column labeled coefficients,
we see that A=8 and B=.1409. Hence, our
estimated equation is
WINS = 8 + 0.1409*TODIFF
The t-Stat column reports
the t-ratios for each coefficient estimate.
They are both well beyond our cutoff point of 2.0, so we can have some
confidence in making observations based on this table.
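As a quick check on how the t Stat column is built, the following sketch divides each coefficient by its standard error, using the figures copied from the Excel output above. The helper name `t_ratio` and the constant `CUTOFF` are illustrative names chosen here.

```python
def t_ratio(coefficient, standard_error):
    """t-ratio as described in these notes: estimate / standard error."""
    return coefficient / standard_error


# (coefficient, standard error) pairs copied from the Excel output above.
estimates = {
    "Intercept": (8.0, 0.487930566),
    "TO DIFF": (0.140912436, 0.052109169),
}

CUTOFF = 2.0  # the rule of thumb used in these notes

t_stats = {name: t_ratio(c, se) for name, (c, se) in estimates.items()}
for name, t in t_stats.items():
    # Compare in absolute value so negative coefficients are handled too.
    verdict = "reasonably precise" if abs(t) >= CUTOFF else "imprecise"
    print(f"{name}: t = {t:.3f} ({verdict})")
```

Both ratios reproduce the t Stat column to the digits shown, and both clear the cutoff of 2.0.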
The value for R² is .201, indicating that about 20% of the variance in WINS across NFL teams is "explained" by TODIFF alone. This is surprisingly large, but far from 1.
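The R Square figure can be recovered from the ANOVA portion of the same output. The sketch below uses the sums of squares and the Multiple R value copied from the table above. It assumes the usual decomposition in which the total sum of squares equals the regression (explained) part plus the residual (unexplained) part. The variable names are illustrative.

```python
# Sums of squares copied from the ANOVA portion of the Excel output above.
ss_regression = 53.96946284   # variation in WINS explained by the line
ss_residual = 214.0305372     # variation left unexplained
ss_total = 268.0              # total variation in WINS around its mean

# Explained and unexplained parts add up to the total (up to rounding).
assert abs(ss_regression + ss_residual - ss_total) < 1e-6

r_squared = ss_regression / ss_total
r_squared_alt = 1 - ss_residual / ss_total  # same number, other direction

# With a single regressor, R-squared is also the square of "Multiple R".
multiple_r = 0.448752262
print(f"R-squared = {r_squared:.4f}")
print(f"Multiple R squared = {multiple_r ** 2:.4f}")
```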
If we were to give a team 1 additional turnover per game, TODIFF would rise by 16 (given the 16 game season).
Since the coefficient estimate is a slope, we can then calculate the following:
ΔWINS = .1409*ΔTODIFF
or ΔWINS = .1409*16 = 2.25
for ΔTODIFF = 16
Hence, 1 additional turnover per game accounts for about 2 more wins in a 16 game NFL season.
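The same slope arithmetic can be written as a small helper, which makes it easy to try other changes in TODIFF. This is only a sketch using the rounded slope .1409 from the estimated equation. The function name `change_in_wins` is an illustrative choice.

```python
SLOPE = 0.1409  # coefficient on TODIFF from the estimated equation above
GAMES = 16      # length of the NFL season used in these notes


def change_in_wins(change_in_todiff, slope=SLOPE):
    """Predicted change in WINS for a given change in TODIFF."""
    return slope * change_in_todiff


# One additional net turnover per game over a full season.
extra_wins = change_in_wins(1 * GAMES)
print(f"Change in WINS for a change in TODIFF of {GAMES}: {extra_wins:.2f}")
```

With the rounded slope the product comes to about 2.25 wins, i.e. roughly 2 more wins per season.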
One can add more variables
to the equation, which puts us in the realm of multiple or multivariate
(meaning more than 1 independent variable) regression.
In the case of NFL wins, the strength of the offense at gaining yards and of the defense at preventing opponents from gaining yards are obviously relevant. If we estimate a linear regression using these factors in addition to turnovers we obtain the following (again, data from the 1999 season). In the table, OYDS represents average yards gained per game by a team's offense. DYDS represents average yards allowed per game by a team's defense.
SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.727705427 |
| R Square | 0.529555188 |
| Adjusted R Square | 0.477283543 |
| Standard Error | 2.160925809 |
| Observations | 31 |

| ANOVA | df | SS | MS |
|---|---|---|---|
| Regression | 3 | 141.9207905 | 47.30693017 |
| Residual | 27 | 126.0792095 | 4.669600352 |
| Total | 30 | 268 | |

| | Coefficients | Standard Error | t Stat |
|---|---|---|---|
| Intercept | 10.98061036 | 5.898473127 | 1.861602168 |
| TO DIFF | 0.10322433 | 0.042604728 | 2.422837433 |
| OYDS | 0.032800104 | 0.011219634 | 2.923455698 |
| DYDS | -0.041701242 | 0.013280327 | -3.14007649 |
All coefficient estimates have t Stats in excess of 2 in absolute value (except the intercept, and we don't care much about that).
The magnitude of the coefficient
estimate for TO DIFF (at .103) is smaller now that we take into account
offensive and defensive prowess.
The coefficient for OYDS is .03. So a team that increases its yards gained by 50 yards per game could expect to increase its win total by .03(50) = 1.5.
The coefficient for DYDS is -.04. So a team that decreases its yards allowed by 50 yards per game could expect to increase its win total by -.04(-50) = 2.
The regression hints that,
yard for yard, it is more important to deny your opponent yardage than to gain
it yourself.
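The offense versus defense comparison can be redone with the unrounded coefficients from the three-variable output above. The sketch below assumes, as the linear model does, that the other variables are held fixed while one changes. The function name `change_in_wins` and the dictionary layout are illustrative choices.

```python
# Slope coefficients copied from the three-variable Excel output above.
COEF = {
    "TO DIFF": 0.10322433,
    "OYDS": 0.032800104,
    "DYDS": -0.041701242,
}


def change_in_wins(changes):
    """Predicted change in WINS for the given changes in the regressors,
    holding every regressor not listed in `changes` fixed."""
    return sum(COEF[name] * delta for name, delta in changes.items())


gain_50_more = change_in_wins({"OYDS": 50})    # offense gains 50 more yards/game
allow_50_less = change_in_wins({"DYDS": -50})  # defense allows 50 fewer yards/game

print(f"+50 yards gained per game:  {gain_50_more:.2f} wins")
print(f"-50 yards allowed per game: {allow_50_less:.2f} wins")
```

With the unrounded coefficients the two figures come out near 1.64 and 2.09 rather than 1.5 and 2, but the ordering is the same: the defensive improvement is worth somewhat more.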
Looking at R², we find that adding these two factors raises the share of the variance of WINS explained by the regression above 50% (R² = .53, up from .20). Not bad for a simple regression model.
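For readers who want to compare the two regressions side by side, the sketch below recomputes R Square for each model from the ANOVA sums of squares reported above. It also computes the Adjusted R Square line that Excel prints, which these notes do not discuss. The adjustment formula used here is the standard textbook one and is an assumption about how Excel arrives at its figure, although it does reproduce the reported values. The function and variable names are illustrative.

```python
def r_squared(ss_regression, ss_total):
    """Share of the total variation explained by the regression."""
    return ss_regression / ss_total


def adjusted_r_squared(r2, n_obs, n_regressors):
    """Standard adjusted R-squared, which penalizes extra regressors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


N_OBS = 31        # observations in both Excel outputs
SS_TOTAL = 268.0  # total sum of squares in both ANOVA tables

# model name -> (regression sum of squares, number of regressors)
models = {
    "TO DIFF only": (53.96946284, 1),
    "TO DIFF + OYDS + DYDS": (141.9207905, 3),
}

results = {}
for name, (ss_reg, k) in models.items():
    r2 = r_squared(ss_reg, SS_TOTAL)
    results[name] = (r2, adjusted_r_squared(r2, N_OBS, k))
    print(f"{name}: R2 = {results[name][0]:.3f}, "
          f"adjusted R2 = {results[name][1]:.3f}")
```

Even after the penalty for the two extra variables, the three-variable model explains noticeably more of the variation in WINS than TO DIFF alone.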