Linear Regression in a Nutshell

Calculate the results of a linear regression without any programming skills! Choose a data set on the left side, an independent variable (X) and a dependent variable (Y). For example, use the Catholic data set to estimate whether a family's income (faminc8) has an impact on children's reading achievement (read12).

Let's start by looking at the data. You have selected the following two variables:

In data analysis, it's always a good idea to explore the data set at the beginning. Which variables were measured and on what scale? Above, you see the summary statistics for the independent and the dependent variable. Hopefully, this gives you an impression of how your variables are measured and distributed.

Look at a histogram of the dependent variable to get a clearer picture of the distribution:

And the independent variable:

In a linear regression, we estimate the effect of an independent variable X on a dependent variable Y. Often, we use several independent variables to predict Y, but let's stick to the simple model with two variables. The principles of a linear regression are the same.

So, what happens when we regress Y on X? We calculate the linear association between X and Y and try to fit a line to our data. We can use a scatter plot to check the linear association. Let's have a look at the chosen variables:
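If you are curious what "fitting a line" means in practice, here is a minimal sketch in Python. It assumes a tiny made-up data set (the values of `x` and `y` are hypothetical and not taken from any data set in this app) and uses the ordinary least-squares formulas for a simple regression. The helper name `fit_line` is only for illustration.

```python
from statistics import mean

# Hypothetical toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b1 + b2 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Co-variation of X and Y around their means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Variation of X around its mean.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
# prints: intercept = 2.20, slope = 0.60
```

The slope is the co-variation of X and Y divided by the variation of X, and the intercept is chosen so that the line passes through the point of means.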

What would you say? Is there a linear association between X and Y?

Irrespective of whether we saw a linear association, we may want to run a regression to illustrate the approach. You get the following output from your statistics software:

Can you interpret the results? Can you calculate how the predicted value changes if X increases by 1 unit?

Hint: $$y_i=\beta_1+\beta_2 \cdot x_i$$
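To see the hint at work, here is a small sketch in Python. The coefficient values are assumptions chosen for illustration (they are not the output of your regression), and `predict` is a hypothetical helper name. Replace the two numbers with the intercept and slope from your own output.

```python
def predict(x, beta_1, beta_2):
    """Predicted value from the hint: y = beta_1 + beta_2 * x."""
    return beta_1 + beta_2 * x


# Assumed coefficients for illustration only.
beta_1 = 2.2  # intercept: predicted Y when X is 0
beta_2 = 0.6  # slope: change in predicted Y per unit of X

y_at_3 = predict(3, beta_1, beta_2)
y_at_4 = predict(4, beta_1, beta_2)
change = y_at_4 - y_at_3

print(f"prediction at X = 3: {y_at_3:.2f}")
print(f"prediction at X = 4: {y_at_4:.2f}")
print(f"change for one more unit of X: {change:.2f}")
```

Whatever starting value of X you pick, the prediction changes by exactly the slope when X increases by 1 unit.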

Now that we know whether X and Y are associated, can you tell me how strong the effect is? Let's use the regression results and visualise the point estimates from the regression. Does X have a strong effect on Y?

Irrespective of statistical significance and effect size, one question remains: How well does X explain Y?

Most of the time, the prediction of the regression is not perfect, so we make an error. In the output, the error is displayed in red. It is the deviation between the predicted value (on the regression line) and the observed value. What would you say in your case? How well does X explain your outcome?
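The error for each observation can be computed by hand. The following sketch in Python assumes a tiny made-up data set and the least-squares coefficients that belong to it (intercept 2.2, slope 0.6). All numbers are hypothetical and only serve as an illustration.

```python
# Hypothetical toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares coefficients for this toy data (assumed as given here).
intercept, slope = 2.2, 0.6

# Predicted values lie on the regression line.
predicted = [intercept + slope * xi for xi in x]

# The error (residual) is observed minus predicted.
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Squaring keeps positive and negative errors from cancelling out.
rss = sum(r ** 2 for r in residuals)

for xi, yi, pi, r in zip(x, y, predicted, residuals):
    print(f"x = {xi:.0f}: observed {yi:.1f}, predicted {pi:.1f}, error {r:+.1f}")
print(f"sum of squared errors = {rss:.2f}")
```

Some errors are positive and some are negative. For a least-squares line with an intercept they sum to zero, which is why we look at the squared errors instead.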

You may say: we make an error, no big deal! Well, to understand whether this is a big deal, we should check several aspects. At the very least, you should know R-squared. It is an indicator that helps us assess how big the error is, or how well the model explains the outcome.

To understand R-squared, we have to think about the total variance of Y. Let's assume that X cannot explain Y at all. What would the corresponding regression line or graph look like?

You can see it in the output: the regression line would be flat, or constant. Regardless of the X value, we would predict the same Y value, namely the mean of Y. Thus, the blue lines in the graph above show you the total variance of Y, and since we assume that X cannot explain Y at all, that's the total error we could make.
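Here is a minimal sketch of this baseline in Python, assuming a small made-up set of Y values (hypothetical, not taken from any data set in this app). The flat line sits at the mean of Y, and the total error is the sum of squared deviations from that mean.

```python
from statistics import mean

# Hypothetical toy outcome values, invented for illustration only.
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# If X explains nothing, the best constant prediction is the mean of Y.
baseline = mean(y)

# Deviation of each observation from the flat line.
deviations = [yi - baseline for yi in y]

# Total sum of squares: the total error of the flat line.
tss = sum(d ** 2 for d in deviations)

print(f"flat line at Y = {baseline:.2f}")
print(f"total sum of squares = {tss:.2f}")
```

Note that X does not appear anywhere in this calculation, which is exactly the point: the baseline ignores X completely.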

We know better, don't we? We have already fitted a regression line and, based on the observed values, X explains Y to a certain extent. The next output shows the explained variance - the green area - the part of the variance of Y that can actually be explained by X:

So we know the total variance and the explained variance. Thus, we can assess the error by calculating R-squared, which is the proportion (%) of the variance of Y that is predictable based on the regression. The last bar plot shows all three variance components.
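The three components can be put together in a few lines. This sketch in Python assumes a tiny made-up data set (hypothetical values, not from any data set in this app) and fits the line with the ordinary least-squares formulas. The variable names `tss`, `ess` and `rss` are chosen here for the total, explained and unexplained sums of squares.

```python
from statistics import mean

# Hypothetical toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Fit the least-squares line y = intercept + slope * x.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * xi for xi in x]

# Total: observed values around the mean of Y (the flat line).
tss = sum((yi - y_bar) ** 2 for yi in y)
# Explained: predicted values around the mean of Y.
ess = sum((pi - y_bar) ** 2 for pi in predicted)
# Unexplained: observed values around the regression line.
rss = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

r_squared = ess / tss

print(f"total = {tss:.2f}, explained = {ess:.2f}, unexplained = {rss:.2f}")
print(f"R-squared = {r_squared:.2f}")
# prints: R-squared = 0.60
```

In this toy example, explained and unexplained add up to the total, so R-squared can equally be written as 1 minus the unexplained share. A value of 0.60 means that 60% of the variance of Y is accounted for by the regression.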

Now it's up to you! How well does your X explain your outcome?