This lesson reviews correlation and linear regression. At the end of this lesson, you will be able to:
  1. Define correlation
  2. Describe the Pearson correlation coefficient and the Spearman correlation coefficient
  3. Interpret the results of a correlation analysis
  4. Determine when to use a linear regression analysis
  5. Describe the importance of statistical controls
  6. Interpret the results of a linear regression

Terms that appear frequently throughout this lesson are defined below:

Term Definitions

Correlation: A statistical relationship that reflects the association between two variables.

Pearson correlation coefficient: A number (rp) that represents the strength of the association/correlation between two variables.

Spearman correlation coefficient: The nonparametric equivalent (rs) of the Pearson correlation coefficient.

Multiple regression: The use of two or more independent variables to predict a dependent variable.

Unstandardized regression coefficient (b): The change in the dependent variable associated with a 1-unit change in the independent variable when all other variables in the regression model are controlled for.

Standardized regression coefficient (beta): A coefficient computed after each variable is transformed to have a mean of 0 and a standard deviation of 1, so that the magnitudes of the coefficients can be compared with one another to determine which variable is most predictive of the outcome.

R2: The proportion of variance in the data accounted for by a regression model.

Statistical control: Separating out the effect of one independent variable from the remaining independent variables to reduce the effect of confounding.

Confounding variable: An extraneous variable that correlates with the dependent variable and with at least one independent variable.

I. Correlation

A correlation reflects the association between two variables. It tells us a) the strength of the relationship, on a scale of 0 to +/-1 with 0 indicating a lack of association and b) the direction of the relationship, either positive (i.e., as one variable increases the other increases) or negative (i.e., as one variable increases the other decreases).

Three charts demonstrating the strength and direction (positive or negative) of the relationship between two variables.

The Pearson correlation coefficient (also known as the Pearson product-moment correlation coefficient or simple correlation coefficient) provides a value, r, that indicates the strength and direction of the relationship between two variables.

Spearman’s coefficient is the nonparametric equivalent of the Pearson correlation coefficient and is typically applied to data that do not meet the assumption of normality.
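Both coefficients can be computed directly from their definitions. The following is a minimal pure-Python sketch (standard library only; the function names are ours): Pearson's r is computed from deviations about the means, and Spearman's rs is simply the Pearson correlation of the ranks, with tied values receiving their average rank.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Rank values from 1..n, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman_rs(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))
```

Because Spearman's coefficient depends only on ranks, a perfectly monotone but nonlinear relationship (e.g., y = x squared over positive x) yields rs = 1 while Pearson's r falls slightly below 1.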

 BEWARE! Correlation, not causality! A correlation indicates the strength of the relationship between two variables; it cannot tell us the nature of that relationship.

For Example

Large cities often see increases in violent crime and ice cream sales during the summer. Does ice cream cause violent crime? Not necessarily (whew!)…possible explanations include:

  1. Murder causes people to eat ice cream
  2. Buying ice cream causes murder
  3. The correlation is due to coincidence
  4. Both murders and ice cream sales are related to something else (warm weather)
 BEWARE! P-values! In correlation analysis, the p-value indicates how likely it is that the observed correlation occurred by chance. A smaller p-value does not mean that the relationship between the two variables is stronger, only that it is less likely to have occurred by chance. When a significant correlation is found, consideration should also be given to the r value (correlation coefficient), since this indicates the strength of the association.

II. Multiple regression

Multiple regression is used to predict an outcome based on two or more independent variables. Multiple regression expresses the relationship between the variables in the form of a linear equation. The equation for the regression line generally follows the form:

y = b0 + b1x1 + b2x2 + … + bnxn, where:

  • b0 = the y-intercept (also called the constant), or the value of the dependent variable when all independent variables have the value of 0
  • bn = the unstandardized regression coefficient, or the change in the dependent variable for every one unit change in the associated independent variable, controlling for all other independent variables in the equation
  • xn = the independent variable
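The coefficients b0 through bn are ordinary least squares estimates, chosen to minimize the squared prediction errors. As an illustration only (a pure-Python sketch that solves the normal equations directly; real statistical packages use more numerically stable algorithms), fitting a model with several independent variables might look like:

```python
def fit_multiple_regression(X, y):
    """Ordinary least squares for y = b0 + b1*x1 + ... + bn*xn.

    X is a list of observations, each a row of independent-variable
    values; returns [b0, b1, ..., bn].  Solves the normal equations
    (A'A) b = A'y with Gaussian elimination.
    """
    A = [[1.0] + list(row) for row in X]  # prepend the intercept column
    n = len(A[0])
    # Build the normal-equation system M b = v
    M = [[sum(A[k][i] * A[k][j] for k in range(len(A))) for j in range(n)]
         for i in range(n)]
    v = [sum(A[k][i] * y[k] for k in range(len(A))) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= f * M[col][c]
            v[r] -= f * v[col]
    # Back substitution
    b = [0.0] * n
    for i in reversed(range(n)):
        b[i] = (v[i] - sum(M[i][j] * b[j] for j in range(i + 1, n))) / M[i][i]
    return b
```

For example, data generated exactly as y = 2 + 3x1 - x2 returns coefficients very close to [2, 3, -1].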

A regression line drawn on a scatterplot that illustrates the relationship between age, weight, and years smoking.


Regression coefficients

The unstandardized regression coefficient is expressed in the measurement units (e.g., pounds, liters, years) of its independent variable. For example, in a single regression equation, one independent variable may be measured in years while another is measured in pounds. This makes comparisons of the predictive ability of the independent variables difficult, since the measurement units vary.

To make comparisons across independent variables, researchers often use standardized regression coefficients, or beta-weights. To calculate beta-weights, each variable is converted to have a mean of 0 and standard deviation of 1, resulting in coefficients that can be compared directly to one another. This enables the researcher to determine the relative effect of each independent variable in the regression model.
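If the unstandardized coefficients are already in hand, the beta-weights follow from a simple rescaling, beta_j = b_j × sd(x_j) / sd(y), which is equivalent to refitting the model after converting every variable to z-scores. A small sketch (the function name is ours):

```python
from statistics import stdev

def beta_weights(b_coeffs, X_columns, y):
    """Convert unstandardized slopes to standardized beta-weights.

    beta_j = b_j * sd(x_j) / sd(y) re-expresses each slope in
    standard-deviation units, so the magnitudes are directly
    comparable.  b_coeffs excludes the intercept (the intercept is 0
    after standardization); X_columns is one list per predictor.
    """
    sy = stdev(y)
    return [b * stdev(col) / sy for b, col in zip(b_coeffs, X_columns)]
```

A predictor whose slope is 1 unit, whose values spread half as widely as the outcome, gets a beta-weight of 0.5.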

Statistical controls

Regression coefficients provide information about how each independent variable impacts the dependent variable. One condition of interpreting a single regression coefficient is “controlling for all other variables.” This means that all other independent variables are held constant for calculations made about the effect of the independent variable on the dependent variable.

Confounders

Variables that distort the observed relationship between the dependent variable and independent variable of interest are called confounders. In regression, confounders should be entered into the model in order to “adjust” for those variables, since we know or believe that they impact the dependent variable.

R2

R2 (also called the coefficient of determination) is the percentage of variance in the data explained by your regression model. In other words, it is a measure of how closely the data from your sample fit the regression line.

  • 0% indicates that the model does not explain any of the variability in the data
  • 100% indicates that the model explains all of the variability in the data
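R2 is computed as 1 minus the ratio of the residual sum of squares (deviations of the observations from the model's predictions) to the total sum of squares (deviations from the mean). A minimal sketch:

```python
from statistics import mean

def r_squared(y_actual, y_predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ybar = mean(y_actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))
    ss_tot = sum((a - ybar) ** 2 for a in y_actual)
    return 1 - ss_res / ss_tot
```

A model that predicts every observation exactly gives R2 = 1; a model that predicts nothing beyond the mean of the outcome gives R2 = 0.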
 BEWARE! Assumptions of multiple regression include the following:

  • The dependent variable is a continuous variable
  • The relationship between the independent variable and dependent variable is linear
  • Continuous variables are normally distributed
  • Variance of the errors is the same across all levels of the independent variable (homoscedasticity)
  • The independent variables are not highly correlated with each other (collinearity or multicollinearity)

Correlation

Rule of Thumb – The following guidelines for interpreting a correlation coefficient (r) may be helpful:

r Value          Interpretation
+.70 or higher   Very strong positive relationship
+.40 to +.69     Strong positive relationship
+.30 to +.39     Moderate positive relationship
+.20 to +.29     Weak positive relationship
-.19 to +.19     No or negligible relationship
-.20 to -.29     Weak negative relationship
-.30 to -.39     Moderate negative relationship
-.40 to -.69     Strong negative relationship
-.70 or lower    Very strong negative relationship
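These cutoffs can be captured in a small helper function (illustrative only; the thresholds come from the rule of thumb above, and other textbooks draw the lines differently):

```python
def interpret_r(r):
    """Map a correlation coefficient to its rule-of-thumb label."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and +1")
    a = abs(r)
    if a < 0.20:
        return "No or negligible relationship"
    direction = "positive" if r > 0 else "negative"
    if a >= 0.70:
        strength = "Very strong"
    elif a >= 0.40:
        strength = "Strong"
    elif a >= 0.30:
        strength = "Moderate"
    else:
        strength = "Weak"
    return f"{strength} {direction} relationship"
```

For instance, the r = 0.367 reported between exercise self-efficacy and steps per day in Example 1 below falls in the "moderate positive" band.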

Example 1: Aging

Methods: Sixty participants were recruited from four continuing care retirement communities, providing services and approximately 400 apartments (some >1 resident) to adults aged 55 years and older. Means and standard deviations were calculated on all variables. All variables were normally distributed (mean skewness and kurtosis values greater than -1.0 and less than +1.0), with the exception of percent active time in high intensity activity which was positively skewed (skewness = 2.24) and leptokurtic (kurtosis = 5.59). Relationships between indices of stepping behavior (i.e., total steps/day, percent active time in moderate and high intensity activity) and self-efficacy for exercise, overcoming barriers self-efficacy, trait anxiety, and fear of falling were examined using Pearson’s product moment correlation.

Pearson Correlations for Various Aspects of Physical Activity Behavior

Variable                  Steps     Mod-PA    High-PA   Ex-SE     Barriers  Anxiety
Steps per day
Moderate-intensity PA     0.600**
High-intensity PA         0.459**   0.126
Exercise self-efficacy    0.367**   0.264     0.151
Barriers self-efficacy    0.265     0.077     0.298*†   0.617**
Trait anxiety             0.206     0.292*    -0.125    -0.171    -0.231
Fear of falling           0.080     0.120     -0.006    -0.163    -0.021    -0.001

PA = physical activity; Steps = steps per day; Ex-SE = exercise self-efficacy; Barriers = overcoming barriers self-efficacy
*Correlation is significant, P<0.05 (two-tailed)
**Correlation is significant, P<0.01 (two-tailed)
†Spearman correlation coefficient = 0.254, P = 0.089 (two-tailed)

Results: Correlations among total number of steps per day, percent of active time in moderate and high intensity physical activity, exercise self-efficacy, overcoming barriers self-efficacy, trait anxiety, and fear of falling are presented in Table 3. Exercise self-efficacy was the only variable that significantly correlated with total number of steps per day (r = 0.367, p = 0.009). Overcoming barriers self-efficacy was the only variable significantly correlated with percentage of time spent in high intensity physical activity (r = 0.298, p = 0.044). Fear of falling was not significantly related to total steps per day, moderate intensity physical activity, or high intensity physical activity behavior.



Regression

Example 2: Adherence

Methods:  Multiple regression was used to evaluate the relationship between adherence and glycemic control by means of a backward elimination process. Several patient-related and provider-related factors were used as covariates in the model for statistical adjustment, including original ODM regimen, baseline A1C, sex, age at index date, CDS, and medication burden, as well as provider sex, specialty, and years since medical school graduation.

Association between A1C and adherence, adjusted for baseline A1C and the ODM Regimen.


Results: Multiple regression was used to determine the relationship between adherence and A1C, controlling for therapeutic regimen and baseline A1C. An inverse relationship was observed between ODM adherence and A1C (see figure above), in which a 10% increase in index ODM adherence was associated with a 0.1% decrease in A1C (p = .0004).


Rozenfeld Y, Hunt JS, Plauschinat C, Wong KS. Oral antidiabetic medication adherence and glycemic control in managed care. Am J Manag Care. 2008; 14(2):71-5.

Example 3: Admissions

Methods: Performance on the HSRT was compared with data collected during the admissions process, including gender, presence of a four-year degree, undergraduate GPA, PCAT composite percentile rank, PCAT chemistry percentile rank, PCAT quantitative percentile rank, PCAT biology percentile rank, PCAT verbal percentile rank, and PCAT reading comprehension percentile rank. Multiple linear regression was conducted using SAS (SAS Institute, Inc., Cary, NC) to assess the relationship between predictor variables and HSRT scores after controlling for other variables.

Linear Regression Model Predicting HSRT Scores

Variable                            Unstandardized Regression Coefficient   P Value
Prior 4-year degree                 -0.52                                   0.15
Undergraduate GPA                    0.96                                   0.10
Reading comprehension PCAT score     0.05                                   <0.001
Verbal PCAT score                    0.06                                   <0.001
Quantitative PCAT score              0.05                                   <0.001
Chemistry PCAT score                -0.02                                   0.05
Biology PCAT score                  -0.01                                   0.38

GPA = grade point average
PCAT = Pharmacy College Admission Test

Results: The table above presents the results of multivariate analyses predicting HSRT scores. The full model explained 27.4% of the variance in HSRT scores. After controlling for other predictors, three variables were significantly associated with HSRT scores: scores on the reading comprehension, verbal, and quantitative sections of the PCAT.


Cox WC, Persky A, Blalock SJ. Correlation of the Health Sciences Reasoning Test with student admission variables. Am J Pharm Educ. 2013; 77(6): 118.