Lesson 2: Descriptive Statistics

This lesson reviews descriptive statistics. At the end of this lesson, you will be able to:

Define population and sample
Differentiate between descriptive and inferential statistics
Define basic measures of central tendency (mean, median, mode) and dispersion (standard deviation, interquartile range, range)
Define frequency and probability
Describe the characteristics of a normal distribution and a binomial distribution
Describe what a 95% confidence interval represents

Terms that appear frequently throughout this lesson are defined below:

Term	Definition
Population	The entire collection of people, animals, cells, or other things from which we collect data
Parameter	A number that is calculated from an entire population
Sample	A subset or group drawn from the population
Statistic	A number or quantity that is calculated from a sample of data
Descriptive statistics	Statistics that describe the sample without attempting to generalize the results to other groups or the population
Inferential statistics	Statistics that infer the likelihood that the results can be generalized to the population
Measure of central tendency	A single value that attempts to describe the central position of a set of data
Mean	The average value
Median	The middle value
Mode	The most frequent value
Measure of dispersion/variation	A value that describes how the data are dispersed around the measure of central tendency, or the extent to which individual values differ from the mean, median, or mode
Standard deviation	On average, how much individual values differ from the mean; the square root of the variance
Variance	How far a set of numbers is spread out from the mean; the sum of the squared differences between each value and the mean, divided by the number of values minus one
Range	The difference between the largest and smallest value in the data set
Interquartile range	A measure of the “middle fifty” percent of the data set (the third quartile minus the first quartile); where the bulk of the values lie
Outlier	An observation point that is distant from other observations
Frequency	The number of times a value appears in the data set
Frequency distribution	A table or graph that illustrates how frequently each value appears in the data set
Normal distribution	A symmetric, bell-shaped distribution for a continuous variable; 68% of observations fall within 1 standard deviation of the mean, 95% fall within 2 standard deviations of the mean, and 99.7% fall within 3 standard deviations of the mean
Binomial distribution	The probability distribution for a binomial variable (i.e. a variable that has only two possible values) with fixed probabilities that add up to one
Confidence interval	A range of values, estimated from the sample, that will contain the population parameter a specified proportion of the time, typically either 95% or 99% of the time
Probability	The likelihood that an event will occur

A sample is the subset or group (e.g., individuals, cells, animals, machines) drawn from the population for data collection. Descriptive statistics describe the sample from which data are collected and communicate the results without attempting to generalize beyond the individuals included in the sample. The goal of inferential statistics is to determine the likelihood that the observed results can be generalized to the population.

Descriptive statistics represent large amounts of data in an aggregate form. There are three basic descriptive characteristics for a single variable:

Central tendency: An estimate of the “center” of a distribution of values for continuous variables.
1. The mean, or average, is calculated by adding up all the values and dividing by the number of the values.
2. The median is the value found in the exact middle of the ordered set of values. If there is an even number of values and the two middle values differ, the median is interpolated between them (typically as their average).
3. The mode is the most frequently occurring value.

Dispersion: Represents the spread of the data around the measure of central tendency.

1. The standard deviation (SD) shows the relation of the values to the mean and is most commonly reported with the mean. It is the square root of the variance, which is calculated by summing the squares of the difference between each value and the mean and dividing the sum of squares by the number of values minus 1. Large standard deviations suggest that the data are generally spread out far from the mean. Small standard deviations suggest that the values are tightly clustered around the mean.
2. The range is the difference between the highest value and the lowest value. Note that it is sensitive to outliers, as a single outlier can greatly exaggerate the range.
3. The interquartile range (IQR), also called the middle fifty or midspread, is the difference between the third quartile and the first quartile of the data (Q3 minus Q1). It is most commonly reported with the median, which is the corresponding measure of central tendency.
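
Written as formulas, the measures of dispersion described above look like this. The symbols are notation introduced here for illustration: x_i is each value, x-bar is the mean, n is the number of values, and Q1 and Q3 are the first and third quartiles.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad
s = \sqrt{s^2}

\text{Range} = x_{\max} - x_{\min}, \qquad
\text{IQR} = Q_3 - Q_1
```

For example, for the values 6, 6, 7, 5, 7, 5, 4, 6, 8, 6 the mean is 6.0 and the squared differences from the mean sum to 12, so the variance is 12/9 (about 1.33) and the standard deviation is about 1.2.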

Consider These Numbers	Mean, SD	Median, Range	Mode	IQR
2, 4, 5, 6, 7, 1, 3, 5, 4, 3, 2, 2	3.7, 1.8	3.5, 6	2	3
10, 14, 34, 25, 92, 34, 54, 20, 34	35.2, 25.0	34, 82	34	27
6, 6, 7, 5, 7, 5, 4, 6, 8, 6	6.0, 1.2	6, 4	6	2
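
The values in the table above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the function name `describe` is made up for illustration. Quartiles can be computed under several conventions, and the sketch assumes the module's default ("exclusive") method, which happens to reproduce the IQR column here. Other software may report a different IQR for the same data.

```python
import statistics

def describe(values):
    """Return the summary statistics shown in the table, rounded to 1 decimal."""
    # Default quartile method is "exclusive"; this is one convention among several.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": round(statistics.mean(values), 1),
        "sd": round(statistics.stdev(values), 1),  # sample SD (divides by n - 1)
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "mode": statistics.mode(values),
        "iqr": q3 - q1,
    }

rows = [
    [2, 4, 5, 6, 7, 1, 3, 5, 4, 3, 2, 2],
    [10, 14, 34, 25, 92, 34, 54, 20, 34],
    [6, 6, 7, 5, 7, 5, 4, 6, 8, 6],
]
for row in rows:
    print(describe(row))
```

Note how the single large value (92) in the second row inflates the range and the standard deviation far more than it affects the median or the mode.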

Frequency distribution: Summary of the frequency of individual values or ranges of values for a variable. Frequency distributions are usually represented in one of two ways: as a table or as a graph.
1. Normal distribution: The Gaussian or bell curve, in which 50% of the values fall above the mean and 50% fall below it, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations, about 99.7% fall within 3 standard deviations, and the mean = median = mode.
2. Binomial distribution: The probability distribution for a binomial variable that is tested in repeated trials (e.g., flipping a coin for heads/tails 100 times); fixed probabilities add up to one.
3. Confidence interval: Provides an estimated range of values in which the population parameter is likely to fall a specified proportion of the time, typically either 95% or 99% of the time. In other words, if the study were repeated 100 times and a 95% confidence interval were computed each time, about 95 of those intervals would be expected to contain the population parameter.
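
The three ideas above can be illustrated numerically. The following is a sketch, not a prescribed method: the function names are invented for illustration, the sample figures (mean 50, SD 10, n = 100) are hypothetical, and the confidence interval uses the large-sample normal approximation, which is only one of several ways to build an interval for a mean.

```python
import math
from statistics import NormalDist

def normal_coverage(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

def binomial_pmf(successes, trials, p):
    """Probability of exactly `successes` in `trials` independent trials."""
    return math.comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)

def mean_ci(sample_mean, sample_sd, n, level=0.95):
    """Normal-approximation confidence interval for a mean (assumes large n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))

# Probability of exactly 50 heads in 100 flips of a fair coin
print(round(binomial_pmf(50, 100, 0.5), 4))

# Hypothetical sample: mean 50, SD 10, n = 100
low, high = mean_ci(50, 10, 100)
print(round(low, 2), round(high, 2))
```

The coverage values come out at roughly 0.6827, 0.9545, and 0.9973, which is where the rounded 68%, 95%, and 99.7% figures come from. An exact 95% corresponds to about 1.96 standard deviations rather than 2.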

BEWARE! Approximately normal data. If the sample size is large enough, continuous data that are not normally distributed are sometimes analyzed with statistical tests that were developed for normally distributed data. Debate surrounds how large the sample size must be to relax the normality assumption, but some use the following rule of thumb:

If:	THEN:
n >= 100	It is always safe to relax the normality assumption
50 <= n < 100	It is almost always safe
30 <= n < 50	It is probably safe
n < 30	It is not safe

When descriptive statistics are computed, the following guidelines are generally followed:

OK TO COMPUTE	NOMINAL	ORDINAL	INTERVAL	RATIO
Frequency distribution	Yes	Yes	Yes	Yes
Mode	Yes	Yes (categorical)	Yes (uncommon)	Yes (uncommon)
Median, range	No	Yes (categorical)	Yes	Yes
Mean, standard deviation	No	No (categorical) Yes (continuous)	Yes	Yes

Example 1: Patient Characteristics

In the following table, mean and standard deviation are reported for continuous variables (e.g., age, number of medications per patient) and frequency for nominal (e.g., gender, ethnicity, race) and ordinal variables (e.g., number of past pharmacist-managed clinic visits):

Patient Demographics

Average age in years (SD)	56 (12)
Gender
Male	36 (50.7%)
Female	35 (49.3%)
Ethnicity
Hispanic or Latino	1 (1.4%)
Not Hispanic or Latino	70 (98.6%)
Race
American Indian/Alaska Native	3 (4.3%)
Asian	1 (1.4%)
Black/African American	26 (37.1%)
Native Hawaiian/Pacific Islander	0
White/Caucasian	40 (57.1%)
No. of past pharmacist-managed clinic visits
2	14 (19.7%)
3 – 4	21 (29.6%)
5 – 7	11 (15.5%)
8 – 10	13 (18.3%)
> 10	12 (16.9%)
Prescription drug coverage
Institution specific	32 (47.8%)
Medicaid/Medicare	21 (31.3%)
Private	6 (9.0%)
Cash/no third party	7 (10.4%)
Total average no. of medications per patient (SD)	8.1 (5.3)
Average no. of medications managed by pharmacist per patient (SD)	4.2 (3.6)
Pharmacist-managed disease state
Diabetes	34 (47.9%)
Warfarin (Coumadin) management	19 (26.7%)
High blood pressure	30 (42.2%)
Quitting smoking	3 (4.2%)
High cholesterol	42 (59.2%)
Other	3 (4.2%)*
Duration of pharmacist-managed disease state in years (SD)	6.5 (7.4)

*Total is greater than 100% due to patients having multiple pharmacist-managed disease states.
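
The count-and-percentage pairs in a table like this can be reproduced from the counts alone. The sketch below assumes that each percentage is the count divided by the sum of the counts in its group. That holds only when every patient falls into exactly one category, so it does not apply to the disease-state rows. The helper name `percent_table` is made up for illustration.

```python
def percent_table(counts):
    """Map each category to (count, percent of the group total), 1 decimal place."""
    total = sum(counts.values())
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.items()}

# Race counts from the table above (they sum to 70 patients)
race = {
    "American Indian/Alaska Native": 3,
    "Asian": 1,
    "Black/African American": 26,
    "Native Hawaiian/Pacific Islander": 0,
    "White/Caucasian": 40,
}
for name, (n, pct) in percent_table(race).items():
    print(f"{name}\t{n} ({pct}%)")
```

Recomputing percentages this way is a quick consistency check when reading or preparing a demographics table.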

Example 2: Survey Responses

The following table reports the interpolated median for ordinal data measured on a Likert scale survey. This table also reports the totals for each response, which provides the frequency distribution for each of the reasons:

Ranking of the Importance of the Reasons/Motivation to Study Pharmacy
(ranked on a 5-point Likert scale ranging from 1 = not important to 5 = very important)

Reasons	Number of Students Choosing Each Importance Rating					Interpolated Median
	1	2	3	4	5	
Interested in health and medicine	1	12	5	97	123	4.5
Felt that health-related disciplines are good professions	3	11	15	135	74	4.2
Felt that pharmacy is a good profession	7	23	27	133	48	4.0
Felt that pharmacy has a good job prospect	14	19	40	111	54	3.9
Did well at chemistry/biology in higher school certificate	30	28	32	95	53	3.8
Am a people person and wanted a career with high levels of patient contact	17	26	46	96	53	3.8
Felt that pharmacy would have a high income	26	35	49	86	42	3.6
Had family members who encouraged a health-related field of study	51	26	41	85	35	3.5
Felt that pharmacy provides a diversity of options	27	35	77	65	34	3.2
Wanted to own a pharmacy	62	34	60	53	29	2.9
Had family members who encouraged pharmacy training	80	23	48	56	31	2.8
Joined pharmacy as a gateway to dentistry/medicine	102	27	52	35	22	2.1
Joined pharmacy because I want to work for a pharmaceutical company	111	21	42	50	14	1.9
Joined pharmacy because I wanted to undertake research in medicine	112	26	49	37	14	1.8
Joined pharmacy because I want to join the government sector	112	27	57	34	8	1.8
Family members own a pharmacy	158	13	48	9	10	1.3
Joined pharmacy as my school friends were doing it	173	15	42	6	2	1.2
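
The lesson does not say how the interpolated medians in this table were calculated. One common approach for grouped ordinal data, shown here as an assumption, treats each rating k as the interval from k - 0.5 to k + 0.5 and interpolates within the interval that contains the middle response. This sketch reproduces the values in the table; the function name is hypothetical.

```python
def interpolated_median(counts, first_value=1):
    """Interpolated median for ratings first_value, first_value + 1, ...

    `counts[i]` is the number of responses for rating first_value + i.
    Each rating k is treated as the interval [k - 0.5, k + 0.5).
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one response")
    half = total / 2
    cumulative = 0  # responses in all ratings below the current one
    for offset, count in enumerate(counts):
        if count > 0 and cumulative + count >= half:
            lower_bound = first_value + offset - 0.5
            # Move into the interval in proportion to how far the middle
            # response sits within this rating's responses.
            return lower_bound + (half - cumulative) / count
        cumulative += count

# Counts for ratings 1 to 5, taken from three rows of the table above
print(round(interpolated_median([1, 12, 5, 97, 123]), 1))    # health and medicine
print(round(interpolated_median([102, 27, 52, 35, 22]), 1))  # gateway to dentistry/medicine
print(round(interpolated_median([173, 15, 42, 6, 2]), 1))    # school friends
```

Unlike the ordinary median, which could only take the values 1 to 5 (or a half step between them), the interpolated median separates reasons whose responses pile up in the same rating.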

Example 3: Periodontitis

The following figure represents a simple frequency distribution for the age of periodontitis patients, showing two peaks:

A box plot diagram is a more robust approach to graphically representing the frequency distribution, along with central tendency and dispersion. The box plot below shows age distribution by gender in patients with chronic periodontitis and aggressive periodontitis. The box encompasses the interquartile range and the black line represents the median. The median was lower for patients with aggressive periodontitis than those with chronic periodontitis. Higher variability of age among females with aggressive periodontitis and lower variability among males in the same group were observed. Several outliers were observed in the aggressive form, as indicated by the stars and circles (i.e., data points that did not fall within the whiskers):
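
To show how a box plot separates outliers from the whiskers, here is a minimal sketch. It assumes the common convention that whiskers extend at most 1.5 times the IQR beyond the quartiles; the lesson does not state which rule the figure used. The ages are hypothetical and are not taken from the periodontitis study.

```python
import statistics

def box_plot_summary(values, whisker=1.5):
    """Quartiles, IQR, and points lying outside the whisker fences."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # default quartile method
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical ages for illustration only
ages = [22, 24, 25, 26, 27, 28, 29, 30, 31, 58]
print(box_plot_summary(ages))
```

Here the age of 58 lies above the upper fence, so it would be drawn as a separate point rather than as part of the whisker.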

For more information

Table 1: Patient Demographics: Vincent AH, et al. Survey of pharmacist-managed primary care clinics using healthcare failure mode and effect analysis. Pharmacy Practice. 2013;11(4):196-202.
Table 4: Ranking the Importance: Shen G, et al. Course experiences, satisfaction and career intent of final year pre-registration Australian pharmacy students. Pharmacy Practice. 2014;12(2).
Figures 1 and 2: Benoist HM, et al. Profile of chronic and aggressive periodontitis among Senegalese. J Periodontal Implant Sci. 2011;41(6):279-284.