10/16/97
AP Statistics - Test 2 - Answers
Unit 2: Topics 6-10
Mr. Coons


Some of the following were adapted from problems suggested by Kathleen Wong Nirei of Iolani School, Honolulu and Bill Harrington.


1. Circle the correct answer:

If a correlation coefficient is 0.80, then:

a. The explanatory variable is usually less than the response variable.

b. The explanatory variable is usually more than the response variable.

c. Below average values of the explanatory variable are more often associated with below average values of the response variable.

d. Below average values of the explanatory variable are more often associated with above average values of the response variable.

e. None of the above.


2. Circle the correct answer:

a. The closer a correlation coefficient is to 1 or -1, the more evidence there is of a causal relationship between the explanatory variable and the response variable.

b. The closer a correlation coefficient is to 0, the more evidence there is of a causal relationship between the explanatory variable and the response variable.

c. The closer the value of r^2 is to 1 or -1, the more evidence there is of a causal relationship between the explanatory variable and the response variable.

d. The closer the value of r^2 is to 0, the more evidence there is of a causal relationship between the explanatory variable and the response variable.

e. None of the above.


3. One of the following statements is better than the others. Circle that statement. VERY BRIEFLY explain why you did not choose each of the other statements:

When comparing the size of the residuals from two different models for the same data:

a. Use the range of each set of residuals as a basis for comparison.

From Graham Howarth: "The range is only the max minus the minimum residual. It tells you nothing about what is in between."

b. Use the mean of each set of residuals as a basis for comparison.

From Graham Howarth: "The mean of the residuals is always zero, no matter the model."

c. Use the sum of each set of residuals as a basis for comparison.

From Mike DiMella: "The sum of the residuals is always zero."

d. Use the standard deviation of each set of residuals as a basis for comparison.

From Joanna Sandman: "By using the standard deviations of residual you can examine the variability of error, lower variability is best, the others don't tell you about the variability."
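The contrast among these four summaries can be checked numerically. The following is a minimal sketch, not part of the original answers, that borrows the AIDS counts from problem 5 and compares two models for them: a flat line at the mean number of cases and the least squares line. The helper names `fit_line` and `summarize` are illustrative assumptions.

```python
from statistics import mean, stdev

# AIDS data from problem 5
years = [1982, 1983, 1984, 1985, 1986]
cases = [434, 1416, 3196, 6242, 10620]


def fit_line(xs, ys):
    """Return (slope, x_bar, y_bar) for the least squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx, x_bar, y_bar


def summarize(residuals):
    """The four candidate summaries from choices a through d."""
    return {
        "range": max(residuals) - min(residuals),
        "mean": mean(residuals),
        "sum": sum(residuals),
        "sd": stdev(residuals),
    }


slope, x_bar, y_bar = fit_line(years, cases)

# Model 1 (assumed for illustration): predict the mean for every year.
flat_residuals = [y - y_bar for y in cases]
# Model 2: the least squares line, written in centered form.
line_residuals = [y - (y_bar + slope * (x - x_bar)) for x, y in zip(years, cases)]

for name, res in [("flat", flat_residuals), ("line", line_residuals)]:
    print(name, {k: round(v, 1) for k, v in summarize(res).items()})
```

The sum and mean come out as essentially zero for both models, so they cannot separate the two. The standard deviation drops from about 4128 for the flat line to about 1080 for the least squares line.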



4. Below is a plot of the 1986 profits versus sales (each in tens of thousands of dollars) of 12 large US companies, the results of a least squares regression performed on a TI-83, and some other summary data. Note that some of the data with lower Sales values overlap on the graph.
 

 

a. Demonstrating your knowledge of the definition of r^2, explain what the value of r^2 means in the context of this problem.

From Jamie Holmes: r^2 [is] the proportion of variability in Profits that is explained using the LSR line by Sales. In other words, as sales increase, so do profits (on the whole) and r^2 explains how much of this increase in profit is accounted for by the increase in sales.

b. Annotate, i.e. fully add labels and lines, at any one point on the plot to help a reader understand what r^2 measures.

See plot above

c. The teacher who supplied this data set suggested that even though r^2 is close to one there is reason to doubt some of the interpolative predictive value of this model. He came to this conclusion with no further computation or residual analysis. Explain his reasoning.

From Jamie Holmes: "Although r^2 is very close to one the predictive value of the model is very low due to the fact that it... is greatly dependent on a single point (50056,6555). This point is an influential observation as its placement greatly affects the LSR line. The point has a small residual only because it pulls the LSR line towards itself. If this point were removed the r^2 value would be much lower."



5. Note: The data for this problem is stored in a program named AIDS which is available from Mr. Coons. Do NOT enter this data by hand.

Consider the following data on the number of AIDS cases reported in the US by state health departments between 1982 and 1986:

 Year              1982   1983   1984   1985    1986
 Number of Cases    434  1,416  3,196  6,242  10,620

a. Using year as the independent variable, state the value of and interpret the slope of the least squares regression line in the context of this data.

From Brian O'Connor: The value of the slope is telling us that for every year that passes we should expect about 2520 more AIDS cases.

b. State the value of and interpret the y-intercept of the regression line in the context of this data.

From Brian O'Connor: The y-intercept is -4,994,901.6. This value tells us that in the year 0 ... we would have expected to find about -4,994,902 cases of AIDS reported, though that obviously makes no sense in the context [of this problem].

c. Use the least squares regression line to predict the number of AIDS cases in the year 2000.

Using the table feature of your TI-83 (in ASK mode) or evaluating Y1(2000), the predicted number of cases in 2000 is about 44,698.
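For readers without a TI-83, the slope, y-intercept, and year-2000 prediction in parts a through c can be reproduced with the usual least squares formulas. This is a minimal sketch rather than the calculator's own routine; the function name `predict` is an assumption made for illustration.

```python
# AIDS data from the table above
years = [1982, 1983, 1984, 1985, 1986]
cases = [434, 1416, 3196, 6242, 10620]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(cases) / n

# Sums of squares and cross products about the means
sxx = sum((x - x_bar) ** 2 for x in years)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, cases))

slope = sxy / sxx                  # about 2519.8 cases per year
intercept = y_bar - slope * x_bar  # about -4,994,901.6


def predict(year):
    """Predicted number of cases from the least squares line."""
    return intercept + slope * year


print(round(slope, 1), round(intercept, 1), round(predict(2000)))
```

The three printed values agree with the answers quoted in parts a, b, and c.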

d. Assuming this data was an adequate and representative sample, how confident are you in the prediction you made in part c? Your answer must include conclusions from a residual analysis. Include a rough residual plot.

From Rachel Apfel: I am not very confident of this prediction (although the correlation coefficient shows a very strong positive association, 0.965, and r^2 is 0.93, a very large proportion) for 2 reasons: 1) The residual plot shows a pattern, signifying the relationship is not best shown with a linear model, and 2) the danger of extrapolation: The year 2000 is a value beyond what is contained in this data set so we have no way of knowing that this relationship will remain the same for values outside this data set.
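The figures in this answer can be checked with a short calculation. The sketch below, which is an illustration and not part of the original answer, computes r, r^2 (as 1 minus the ratio of residual variation to total variation), and the residuals of the linear model, standing in for the rough residual plot.

```python
from math import sqrt

years = [1982, 1983, 1984, 1985, 1986]
cases = [434, 1416, 3196, 6242, 10620]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(cases) / n
sxx = sum((x - x_bar) ** 2 for x in years)
syy = sum((y - y_bar) ** 2 for y in cases)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, cases))

slope = sxy / sxx
r = sxy / sqrt(sxx * syy)

# Residual = observed - predicted, with the line written in centered form
residuals = [y - (y_bar + slope * (x - x_bar)) for x, y in zip(years, cases)]
sse = sum(e ** 2 for e in residuals)
r_squared = 1 - sse / syy

for x, e in zip(years, residuals):
    print(x, round(e, 1))
print("r =", round(r, 3), "r^2 =", round(r_squared, 3))
```

The residuals run positive, negative, negative, negative, positive. That U shape is the pattern the answer refers to, and it appears even though r is about 0.965.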

 

e. State the equation of a quadratic model and compare it fully to your previous model. Include a rough plot(s).

 cases = 575.57(year)^2 - 2281347(year) + 2260600436

From Leah Temple: "The residual plot for the quadratic regression shows no pattern and its standard deviation [is] 87.8 as opposed to the linear regression's residual plot which showed a pattern and had a standard deviation of 1080.4. This indicates that the quadratic is a better model. Also the line seems to fit better as it is graphed."
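The quadratic coefficients and the residual standard deviation can be reproduced without a calculator. The sketch below is one possible approach and not necessarily how the TI-83 does it: it centers the years at 1984 (an assumption made here to keep the arithmetic well conditioned), solves the normal equations, and then expands the result back to a polynomial in the uncentered year. The helper `solve3` is hypothetical.

```python
from statistics import stdev

years = [1982, 1983, 1984, 1985, 1986]
cases = [434, 1416, 3196, 6242, 10620]

center = 1984  # assumed centering value (the mean year)
t = [x - center for x in years]


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    sol = [0.0, 0.0, 0.0]
    for r in reversed(range(3)):
        rest = sum(m[r][k] * sol[k] for k in range(r + 1, 3))
        sol[r] = (m[r][3] - rest) / m[r][r]
    return sol


# Normal equations for cases = c0 + c1*t + c2*t^2
normal = [[sum(v ** (i + j) for v in t) for j in range(3)] for i in range(3)]
rhs = [sum((v ** i) * y for v, y in zip(t, cases)) for i in range(3)]
c0, c1, c2 = solve3(normal, rhs)

# Expand back to a*(year)^2 + b*(year) + c
a = c2
b = c1 - 2 * center * c2
c = c0 - c1 * center + c2 * center ** 2

quad_residuals = [y - (c0 + c1 * v + c2 * v * v) for v, y in zip(t, cases)]
print(round(a, 2), round(b), round(c), round(stdev(quad_residuals), 1))
```

The expanded coefficients match the stated equation, and the residual standard deviation comes out near 87.8, against 1080.4 for the linear model.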

 

Quadratic:
1. No pattern
2. Low variability



6. a) State, without example, Simpson's Paradox.

From Lily Altstein: "Simpson's Paradox states that aggregate proportions of two variables can reverse the association or relations between ... subcategories within each variable. This means that when comparing two variables in terms of the other, the relation/association between the two is often different from that when comparing individual parts of the two variables with respect to the third."

b) Create a numerical example of Simpson's Paradox. Briefly point out how your example demonstrates this deceiving situation.

From Craig Lund: In Craigville there are 4 towns. 51.1% of the voters are Democrats and 48.9% are Republicans. When looking at the chart below you might think there were more Republican towns. The reason is that "Town 1 has such a high population that it makes up for the lower
 

           # people   %-Democrat   %-Republican
 Town 1       1,000          56%            44%
 Town 2          20           5%            95%
 Town 3         300          45%            55%
 Town 4         100          30%            70%
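The aggregate percentages quoted for Craigville follow from the table by weighting each town's percentage by its population. A minimal sketch of that arithmetic, assuming every voter is either a Democrat or a Republican (as the rows summing to 100% suggest):

```python
# (population, percent Democrat) for each town, taken from the table
towns = {
    "Town 1": (1000, 56),
    "Town 2": (20, 5),
    "Town 3": (300, 45),
    "Town 4": (100, 30),
}

total_people = sum(pop for pop, _ in towns.values())
democrats = sum(pop * pct / 100 for pop, pct in towns.values())

dem_share = 100 * democrats / total_people
rep_share = 100 - dem_share

# Towns where Democrats are under half of the voters
republican_towns = [name for name, (_, pct) in towns.items() if pct < 50]

print(f"{dem_share:.1f}% Democrat, {rep_share:.1f}% Republican")
print(len(republican_towns), "of", len(towns), "towns lean Republican")
```

Three of the four towns lean Republican, yet the 1,000 voters in Town 1 carry the overall total to 51.1% Democrat.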




7. Roughly plot a set of data that would have a coefficient of correlation, r, close to or equal to zero but which clearly has a simple function which would model it well.
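One data set of this kind, offered here as an illustrative assumption and not as the original answer plot, is a parabola sampled symmetrically about zero. The points lie exactly on y = x^2, yet the positive and negative halves cancel in the correlation calculation.

```python
from math import sqrt

# Hypothetical data: points on y = x^2, symmetric about x = 0
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


r = correlation(xs, ys)
print("r =", r)
```

Here r is zero even though a simple quadratic fits the points perfectly, because r only measures the strength of a linear association.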

 

 



-end-