- Describe the basic steps of hypothesis testing:
  - Define null and alternative hypotheses
  - Define alpha
  - Select appropriate test
  - Run analysis
  - Draw conclusion
- Define type I and type II error
- Define power and its typical value (0.80)
- Define power analysis
- Describe the importance of sample size
Lesson 3: Inferential Statistics
Terms that appear frequently throughout this lesson are defined below:
Term | Definition |
---|---|
Null hypothesis | The opposite of the hypothesis proposed, typically that there is no difference, relationship, or effect |
Alternative hypothesis | The proposed hypothesis or idea, typically that there is a difference, relationship, or effect |
Alpha | Probability of incorrectly rejecting the null hypothesis |
Type I error | Detecting a difference when there isn’t one |
P-value | Probability of obtaining a statistic at least as extreme as the one observed if the null hypothesis is true |
Beta | Probability of incorrectly failing to reject the null hypothesis |
Type II error | Failing to detect a difference when there is one |
Power | Ability of a test to detect a difference when a difference exists |
Power analysis | Method for determining how large a sample size must be to detect a difference if in fact a difference exists |
Inferential statistics allows us to make inferences about the population based on the sample that we have studied. Inferential statistics relies on the use of hypothesis testing, which follows these basic steps:
Hypothesis testing
- Define the hypotheses. The null hypothesis is the opposite of the hypothesis proposed, typically the idea that a difference does not exist. The alternative hypothesis is the idea being researched, typically the idea that a difference does exist. Rejecting the null means that your statistical test has found a significant difference between the groups being researched.
- Define alpha, or the probability of incorrectly rejecting the null hypothesis. Most statistical analyses use an alpha level of 0.05, which means that there is a 1 in 20 chance of finding a difference when there isn’t one. Detecting a difference when there isn’t one is referred to as a Type I error. Failing to detect a difference when there is one is referred to as a Type II error; its probability is the beta level, or the probability of incorrectly failing to reject the null.
- Select a test. Based on the measurement scale of the data and other research design considerations, an appropriate statistical test should be selected. Basic statistical tests will be reviewed later in this module.
- Run the analysis. The test will provide a p-value, or the probability of obtaining a statistic at least as extreme as the one observed if the null hypothesis is true.
- Draw a conclusion. The p-value is compared to alpha. A statistically significant relationship (p-value ≤ alpha) means it was unlikely that the difference detected by the statistical test occurred by chance, enabling us to conclude that the independent variable is related to the dependent variable. Failure to find statistical significance (p-value > alpha) means that our observed data can be explained by chance.
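The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: it uses a two-sided one-sample z-test as a stand-in for whichever test suits the data, and the function name `z_test` and all of the numbers (sample mean, population mean, standard deviation, sample size) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, pop_mean, sigma, n, alpha=0.05):
    """Two-sided one-sample z-test; returns (z, p_value, reject_null)."""
    # Step 1: H0 says the true mean equals pop_mean; HA says it does not.
    # Step 2: alpha is passed in (0.05 by default).
    # Step 3: the z-test assumes a known population standard deviation (sigma).
    standard_error = sigma / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Step 4: p-value = probability of a statistic at least this extreme under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 5: reject H0 when the p-value is at or below alpha.
    return z, p_value, p_value <= alpha


# Hypothetical data: 25 subjects, sample mean 104, population mean 100, sigma 10.
z, p_value, reject_null = z_test(104.0, 100.0, sigma=10.0, n=25)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

With these made-up inputs the p-value falls just below 0.05, so the null hypothesis is rejected. A slightly smaller difference or sample would reverse that conclusion.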
 | Decision: Fail to reject null | Decision: Reject null |
---|---|---|
Null Hypothesis = True | CORRECT! | Type I error |
Null Hypothesis = False | Type II error | CORRECT! |
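The two error types in the table can be illustrated with a small simulation. The sketch below repeatedly draws samples and applies a two-sided z-test, once with the null hypothesis true and once with it false. The helper name `rejection_rate`, the means, the standard deviation, the sample size, and the number of trials are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_mean, null_mean=100.0, sigma=10.0, n=25,
                   alpha=0.05, trials=4000, seed=0):
    """Fraction of simulated studies in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    standard_error = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - null_mean) / standard_error
        if abs(z) >= critical_z:
            rejections += 1
    return rejections / trials


# Null hypothesis true: every rejection is a Type I error.
type_1_rate = rejection_rate(true_mean=100.0)
# Null hypothesis false (true mean is 105): every non-rejection is a Type II error.
type_2_rate = 1 - rejection_rate(true_mean=105.0, seed=1)
print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

Under these assumptions the Type I error rate should land near alpha (0.05) and the Type II error rate near 0.3, with the exact values depending on the random seed.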
Power to detect a difference
Power is the probability that a statistical test will correctly reject the null hypothesis. In other words, power is the ability of a test to detect a difference when a difference actually exists. Most researchers set power at 0.8, which means that 80% of the time, we will find statistical significance if a difference actually exists.
Power is directly related to sample size, and a power analysis can be used to determine how large a sample must be to detect a difference if in fact a difference exists. For two equal-sized groups compared on their means, the required sample size is:
$$n = \frac{2\delta^2\,(Z_\beta + Z_{\alpha/2})^2}{\text{difference}^2}$$
where
$n$ = sample size in each group (assumes equal-sized groups)
$\delta$ = standard deviation of the outcome variable
$Z_\beta$ represents the desired power (typically 0.84 for 80% power)
$Z_{\alpha/2}$ represents the level of statistical significance (typically 1.96 for a two-sided alpha of 0.05)
difference = effect size (the difference in means)
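As a rough illustration, the formula above can be evaluated directly. In the sketch below, the function name `sample_size_per_group` and the inputs (a standard deviation of 10 and differences of 5 and 2.5) are hypothetical. The z-values are computed from the standard normal distribution rather than typed in as 0.84 and 1.96.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sd, difference, alpha=0.05, power=0.80):
    """Per-group n for comparing two means, using the formula above."""
    z_alpha_half = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)                # about 0.84 for 80% power
    n = 2 * sd**2 * (z_beta + z_alpha_half) ** 2 / difference**2
    return ceil(n)  # round up to a whole participant


# Hypothetical inputs: standard deviation of 10, difference in means of 5.
n_per_group = sample_size_per_group(sd=10.0, difference=5.0)
# Halving the difference to detect roughly quadruples the required sample.
n_smaller_effect = sample_size_per_group(sd=10.0, difference=2.5)
print(n_per_group, n_smaller_effect)
```

With these inputs the calculation gives 63 participants per group for a 5-point difference and 252 per group for a 2.5-point difference. The required sample grows with the inverse square of the difference.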
Sample Size | Sample Mean | Population Mean | P value |
---|---|---|---|
4 | 110.0 | 100.0 | 0.05 |
25 | 104.0 | 100.0 | 0.05 |
64 | 102.5 | 100.0 | 0.05 |
100 | 102.0 | 100.0 | 0.05 |
400 | 101.0 | 100.0 | 0.05 |
2,500 | 100.4 | 100.0 | 0.05 |
10,000 | 100.2 | 100.0 | 0.05 |
Table adapted from Norman GR, Streiner DL. PDQ Statistics. BC Decker, INC: Philadelphia; 1986.
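As the table shows, the same p-value of 0.05 can come from a 10-point difference in a sample of 4 or a 0.2-point difference in a sample of 10,000. Again, a smaller p-value does not mean the detected difference is bigger or more important.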
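The pattern in the table can be checked with a short calculation. The table does not state the standard deviation, so the sketch below assumes a value of 10.2, chosen so that each row gives a z-statistic of about 1.96. It also assumes a two-sided one-sample z-test, which the source does not specify.

```python
from math import sqrt
from statistics import NormalDist

POPULATION_MEAN = 100.0
ASSUMED_SD = 10.2  # assumption: not stated in the table; chosen so that z is about 1.96

# (sample size, sample mean) pairs taken from the table above.
rows = [(4, 110.0), (25, 104.0), (64, 102.5), (100, 102.0),
        (400, 101.0), (2500, 100.4), (10000, 100.2)]

p_values = []
for n, sample_mean in rows:
    z = (sample_mean - POPULATION_MEAN) / (ASSUMED_SD / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    p_values.append(p)
    print(f"n = {n:>6}  difference = {sample_mean - POPULATION_MEAN:5.1f}  p = {p:.3f}")
```

Every row gives a p-value of about 0.05, because the difference shrinks in proportion to the square root of the sample size.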
Hypotheses
In the literature, the null hypothesis is often implied rather than clearly stated. Any table with p-values and any results written with a claim of significance are associated with a hypothesis, even if that hypothesis is not stated. When hypotheses are stated, they are sometimes accompanied by symbols H0 for the null and HA for the alternative.
Example 1
Classical hypothesis testing was used to determine whether the period-2 values were significantly different from those from period 1 by testing whether the study mean Ln(R) was significantly different from 0. The hypotheses were:
H0: µLn(R) = 0; alpha = 0.05
HA: µLn(R) not equal to 0 [Miyazawa, et al., 2002]
Example 2
Given the lack of empirical evidence linking student perceptions to actual outcomes in the general health sciences literature, we operated under a null hypothesis of no relationship (correlation) between any student perception indicators and the actual outcomes as measured by PCOA scores. [Naughton and Friesner, 2012]
Power
The following statements demonstrate the use of a power calculation to determine the sample size needed to detect an effect or difference:
Example 1
We needed 45 patients to have 95% power to reject the null hypothesis that the mean serum digoxin concentration was within 10% of the mean predicted digoxin concentration. Patients were recruited from two general practices and had been taking digoxin for at least four months. Exclusion criteria were dementia, low adherence to digoxin, and use of other medications known to interact to a clinically important extent with digoxin. [Kroese, et al., 2005]
Example 2
Power analysis calculation with an alpha of 0.05 and beta of 0.8, indicated that a minimum sample size of 85 was necessary to find significance associated with a 10-point difference in examination performance. Ninety-five students consented to participate in the study. [McLaughlin, et al., 2014]