One particularly important application of the things we’ve learned in this chapter about linear equations is the idea of fitting lines to data or, if we are feeling more fancy, linear regression. Real-world data is usually not exactly linear — reality is too messy for that. So, a reasonable thing to do is to figure out a linear equation that does the best job possible at getting as close as it can to as many points as it can. Then we can make reasonable predictions; they certainly won’t be perfect, but it’s better than nothing!
Thanh has an internship studying road salt usage in a northern metropolitan area. Road salt is used to melt ice and snow on paved streets. Because it can damage vegetation and influence both surface water (lakes) and ground water, and because it costs money to run the trucks that apply the salt, people are interested in the amount of road salt used.
One data set compares road salt usage per county. Thanh learned from county officials that road salt use varies widely from county to county, but, not surprisingly, it depends heavily on the length of road in the county. So, the variables are
\begin{align*}
L \amp= \text{ road length (lane miles) } \sim \text{ indep} \\
S \amp= \text{ road salt applied (tons per year) } \sim \text{ dep}
\end{align*}
A lane mile is the area of road one mile long and one lane wide. Now you know.
Thanh also learned that while road salt use is a function of lane miles, it is not proportional, as there are more complicated factors involved. Still, he would like to model road salt use as a function of road length. Here are the data for counties in the metro area; Thanh decided to look at all the data at once in a scatter plot to start getting ideas about how to build a model.
After looking at this scatter plot, Thanh realized that there’s no one line that goes through all the points. Darn! I guess he’ll have to do something more complicated.
The first thing Thanh decides to try is to make up new imaginary counties with different amounts of roads, and use real counties that are close to the made-up counties to figure out how much salt his imaginary county would use.
For instance, Thanh first imagined a new county, County X, that had 600 lane miles of road. In looking at the data, he finds two counties with close to 600 lane miles: County T and County A.
Based on this data, Thanh expects County X would use between 5,000 and 14,700 tons/year of road salt. Since 600 is closer to 510 than to 710, he starts with a guess of around 9,000 tons/year of road salt. Seems reasonable.
To improve this estimate, Thanh decides to use a linear model, hoping that will account for both road length influence and fixed factors. He begins by finding the slope.
\begin{align*}
\text{slope} \amp = \text{rate of change}
= \frac{\text{change dep}}{\text{change indep}}
= \frac{14{,}700\text{ tons/year}-5{,}000 \text{ tons/year}}{710\text{ lane miles}-510 \text{ lane miles}}\\
\amp = (14{,}700-5{,}000)\div(710-510)= 48.5 \text{ tons/year per lane mile}
\end{align*}
When Thanh was using the nearby points to estimate for Counties X and Y, it’s as if he were connecting the dots with line segments on the graph. Notice that the line that goes through 500 lane miles is decreasing, just like Thanh saw in his table.
Thanh thinks this “connect-the-dots” approach is a silly model: it’s too all over the place, and he suspects that it is too heavily influenced by individual county road-salting habits. He would like a way to get one line to use for everything, knowing full well that one line cannot possibly go through all of the data points.
One option would be to stick with the line he found through the points for Counties T and A:
\begin{equation*}
\textbf{T-A line:} \quad S = 48.5L-19{,}735\text{.}
\end{equation*}
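As a quick check on the T-A line, here is a minimal Python sketch that recomputes the slope and intercept from the two data points quoted earlier, (510 lane miles, 5,000 tons/year) and (710 lane miles, 14,700 tons/year). The function name `line_through` is made up for this illustration and does not come from any library.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two (L, S) points."""
    (l1, s1), (l2, s2) = p1, p2
    slope = (s2 - s1) / (l2 - l1)   # change in dep / change in indep
    intercept = s1 - slope * l1     # solve S = slope*L + intercept at p1
    return slope, intercept

# The two counties near 600 lane miles, as (lane miles, tons/year)
slope, intercept = line_through((510, 5000), (710, 14700))
print(slope, intercept)

# Estimate for the imaginary County X with 600 lane miles
county_x = slope * 600 + intercept
print(county_x)
```

The slope and intercept match the T-A line, and the estimate for County X comes out to 9,365 tons/year, not far from Thanh's first guess of around 9,000.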
He redraws the scatter plot to show that line. Because the intercept is negative, it doesn’t show up on his graph. The line seems to be too low at first and too high later. The problem is that this line is too steep (has too large a slope).
Thanh decides to try a line that is less steep. After drawing in a few lines, he decides to try the line between the points for Counties C and D instead, which has equation
\begin{equation*}
\textbf{C-D line:} \quad S = 20.26L-4{,}610\text{.}
\end{equation*}
Unfortunately this line seems too low. (Again the negative intercept isn’t visible.)
Neither of these lines came close to the point for County H on the far right, so Thanh considers one more line, this time through County H and County R, which has equation
\begin{equation*}
\textbf{H-R line:} \quad S = 8.71L+3{,}130\text{.}
\end{equation*}
This line has a positive intercept just above 3,000 tons/year, as you can see on the graph.
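To see how much the choice of line matters, the following Python sketch evaluates all three candidate lines at 600 lane miles, the size of imaginary County X. The dictionary layout and the helper name `predict` are assumptions made for this example; only the slopes and intercepts come from the text.

```python
# Candidate lines from the text, stored as (slope, intercept)
# for S = slope * L + intercept
candidates = {
    "T-A": (48.5, -19735),
    "C-D": (20.26, -4610),
    "H-R": (8.71, 3130),
}

def predict(line, lane_miles):
    """Road salt (tons/year) predicted by a line for a given road length."""
    slope, intercept = line
    return slope * lane_miles + intercept

# Round to the nearest ton so tiny floating-point errors do not show up
predictions = {name: round(predict(line, 600)) for name, line in candidates.items()}
for name, tons in predictions.items():
    print(f"{name}: {tons:,} tons/year")
```

The three estimates for County X range from 7,546 tons/year (C-D) to 9,365 tons/year (T-A), with the H-R line in between at 8,356 tons/year.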
Thanh thinks the H-R line is the most reasonable of the lines that he tried, but it makes him wonder how to decide if one line is better than another. Generally speaking, the best fitting line makes the gaps between the line and the data points as small as possible overall. (There is actually a much more official definition.) After using a little statistical software, Thanh determines that for this data set, the official best fitting line has equation
\begin{equation*}
\textbf{Best fitting line:} \quad S = 10.0L+2{,}741\text{.}
\end{equation*}
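The text does not say which method Thanh's software used. A common choice is least squares, which picks the slope and intercept that make the sum of the squared vertical gaps between the data and the line as small as possible. The Python sketch below assumes that method. Because the full county table is not reproduced here, the demonstration uses only the two data points quoted earlier, and the function names are invented for this example.

```python
def least_squares_line(xs, ys):
    """Slope and intercept minimizing the sum of squared vertical gaps."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_gaps(xs, ys, slope, intercept):
    """Total squared vertical distance between the data and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Only the two data points quoted in the text; the full table is not shown here.
lane_miles = [510, 710]
salt_tons = [5000, 14700]

slope, intercept = least_squares_line(lane_miles, salt_tons)
print(slope, intercept)
print(sum_squared_gaps(lane_miles, salt_tons, slope, intercept))
```

With just two points, least squares simply returns the line through them (slope 48.5, intercept -19,735, and a total squared gap of zero). With the full county data in place of these two points, the same kind of calculation is how software can arrive at a best fitting line like the one above.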
Thanh wants to add this line to his graph, so first he calculates a few values. While it’s true that any two points would do, he plays it safe and plots three points for imaginary new counties, being sure to use 0 in order to find the intercept.
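A table of values like this is easy to generate. In the Python sketch below, the road length 0 comes from the text, while 500 and 1,000 lane miles are assumed values chosen only for illustration, since Thanh's actual table is not shown here. The function name `best_fit` is also made up for this example.

```python
def best_fit(lane_miles):
    """Best fitting line from the text: S = 10.0 L + 2,741."""
    return 10.0 * lane_miles + 2741

# 0 gives the intercept; 500 and 1,000 are assumed values for illustration.
table = [(L, best_fit(L)) for L in (0, 500, 1000)]
for L, S in table:
    print(f"{L:>5} lane miles -> {S:>8,.0f} tons/year")
```

The first row is the intercept, 2,741 tons/year, and each additional 500 lane miles adds 5,000 tons/year, as the slope of 10.0 tons/year per lane mile says it should.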
He graphs this line and notices it is very similar to the H-R line, just a tiny bit higher and a tiny bit steeper. The points from the table are highlighted on the graph just to help you see how we graphed the line. Remember, those aren’t actual data points — they correspond to fictional counties that Thanh made up.
Thanh is bothered by the fact that County H seems to be off on its own. The largest city in this area is in County H. Between the budget crunch and the nature of the urban landscape, the city tends to use much less road salt than the surrounding areas. So County H really isn’t very typical at all. In statistics, this sort of value is known by the descriptive term outlier (as in, “it lies way out there.”)
Sure enough, this line is less steep than the T-A line, higher than the C-D line, and runs fairly close to most of the rest of the counties. Seems perfect.
For each line, state some reason why the fit is not good. (We know the line will not go through all, or even most, of the points, so that is not the problem. Instead look at slope/steepness, intercept/height, etc.)
Is it true that students who work part-time have lower grades? Does the number of hours matter? The table shows the grade point average (GPA) of ten students compared to the number of hours per week each student works at a part-time job. The variables we used are \(T\text{,}\) for the time worked at job (hours/week), and \(G\) for the GPA, on the usual scale of 0.0 to 4.0.
According to line B, what is the greatest number of hours a student should work if they want to maintain a 3.5 GPA? Solve an equation, then check on your graph.
Mia and Mandi opened a candy shop this January. The table shows their monthly sales profit. Except for some seasonal fluctuation, Mia and Mandi generally expect their profits to rise steadily while their business is getting established.
Write an equation for the line through March and July. Notice that you need to find the intercept this time. Add this line (#2) to your graph. This line is too steep.
Neither of these lines goes anywhere near the data for February, April, and May, because those are outliers. Any idea why those months had much higher candy sales than the other months?
The best-fitting line (ignoring the outlier) had equation \(S = 19.7L-2{,}905\text{.}\) Make a table of values for \(L=600\) and \(1{,}000\) lane miles and use these values to check the graph Thanh drew. (They are highlighted on the graph.)
Wild rice is a native plant that grows in lakes in the upper Midwest.
The table shows how the annual acreage of wild rice has varied with the average spring temperature in various years. The variables are \(T\) for the temperature measured in °F and \(W\) for the wild rice yield, measured in acres. In case you’re curious, the year is included as well, but it’s not one of the variables we’re interested in.
Make a scatter plot of the points. Make your graph as large as possible by starting your temperature axis at 35°F and your acreage axis at 1,000 acres.
Based on your line, what might you expect the acreage of wild rice to be in a year when the average temperature is 46°F? 40°F? Use your equation to answer the questions.
If you use the best fitting line, how would that change your estimate for the acreage of wild rice in a year when the average temperature is 46°F? 40°F?
The amount of garbage generated in the United States has increased steadily, from 88.1 million tons in 1960 to 254.2 million tons in 2006.
Earlier we used a linear model. But, in fact, the amount of garbage has not increased exactly linearly. The table shows data for select years, where \(Y\) measures years since 1960 and \(G\) is the amount of garbage (in millions of tons).
Draw in the line through the points from 2000 and 2006. Would this line predict that garbage will reach 300 million tons sooner or later than the previous prediction? Use the graph to explain.
My mechanic, Paye, believes that frequent oil changes reduce the amount of maintenance on a car. To prove his point, Paye showed me a table of customers with the number of yearly oil changes and the cost of their engine repairs.
Write the equation for that line. Use your equation to predict the cost of engine repairs for a customer who does no oil changes, and one who does 8 oil changes.
Draw the line (A) that goes through the points for “Remember the Titans” and “A League of Their Own”. Explain why this line does not fit the data well.
\begin{equation*}
C =3.6375+0.2854Y
\end{equation*}
and the best fitting line for beef is
\begin{equation*}
B=11.774+0.007Y
\end{equation*}
Set up and solve a system of linear equations to find the year when chicken and red meat consumption will likely be equal. How does this answer compare to your estimate?