This lesson reviews descriptive statistics. At the end of this lesson, you will be able to:

  1. Define population and sample
  2. Differentiate between descriptive and inferential statistics
  3. Define basic measures of central tendency (mean, median, mode) and dispersion (standard deviation, interquartile range, range)
  4. Define frequency and probability
  5. Describe the characteristics of a normal distribution and a binomial distribution
  6. Describe what a 95% confidence interval represents

Terms that appear frequently throughout this lesson are defined below:

| Term | Definition |
|---|---|
| Population | The entire collection of people, animals, cells, or other things from which we collect data |
| Parameter | A number that is calculated from an entire population |
| Sample | A subset or group drawn from the population |
| Statistic | A number or quantity that is calculated from a sample of data |
| Descriptive statistics | Statistics that describe the sample without attempting to generalize the results to other groups or the population |
| Inferential statistics | Statistics that infer the likelihood that the results can be generalized to the population |
| Measure of central tendency | A single value that attempts to describe the central position of a set of data |
| Mean | The average value |
| Median | The middle value |
| Mode | The most frequent value |
| Measure of dispersion/variation | A value that describes how the data are dispersed around the measure of central tendency, or the extent to which individual values differ from the mean, median, or mode |
| Standard deviation | On average, how much individual values differ from the mean; the square root of the variance |
| Variance | How far a set of numbers is spread out from the mean; the sum of the squared differences between each value and the mean, divided by the number of values minus one |
| Range | The difference between the largest and smallest value in the data set |
| Interquartile range | A measure of the “middle fifty” in the data set, where the bulk of the values exist; the difference between the third quartile and the first quartile |
| Outlier | An observation point that is distant from other observations |
| Frequency | The number of times a value appears in the data set |
| Frequency distribution | A table or graph that illustrates how frequently each value appears in the data set |
| Normal distribution | A symmetric, bell-shaped distribution for a continuous variable; 68% of observations fall within 1 standard deviation of the mean, 95% fall within 2 standard deviations of the mean, and 99.7% fall within 3 standard deviations of the mean |
| Binomial distribution | The probability distribution for a binomial variable (i.e., a variable that has only two possible values) with fixed probabilities that add up to one |
| Confidence interval | A range of values, estimated from the sample, that will contain the population parameter (e.g., the population mean) a specified proportion of the time, typically either 95% or 99% of the time |
| Probability | The likelihood that an event will occur |
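
The percentages quoted for the normal distribution and the "probabilities add up to one" property of the binomial distribution can be checked numerically. The sketch below is illustrative only: the function names `normal_coverage` and `binomial_pmf` and the coin-flip numbers are assumptions made for this example, not part of the lesson.

```python
import math

def normal_coverage(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def binomial_pmf(successes: int, trials: int, p: float) -> float:
    """Probability of exactly `successes` successes in `trials` independent trials."""
    return math.comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)

for k in (1, 2, 3):
    print(f"within {k} SD: {normal_coverage(k):.1%}")

# Hypothetical example: 10 flips of a fair coin (p = 0.5 for heads, 1 - p = 0.5 for tails).
p_five_heads = binomial_pmf(5, 10, 0.5)
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(f"P(exactly 5 heads) = {p_five_heads:.3f}; all outcomes together = {total:.3f}")
```

The exact coverage within 2 standard deviations is slightly more than 95% (about 95.4%), which is why the 68/95/99.7 figures are usually described as approximate.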

A sample is the subset or group (e.g., individuals, cells, animals, machines) drawn from the population for data collection. Descriptive statistics describe the sample from which data are collected and communicate the results without attempting to generalize beyond the individuals included in the sample. The goal of inferential statistics is to determine the likelihood that the observed results can be generalized beyond the sample to the population from which it was drawn.

Diagram representing a sample as a subset of the population, with inferential statistics being the study of that sample's generalizability

Descriptive statistics represent large amounts of data in an aggregate form. There are three basic descriptive characteristics for a single variable:

  1. Central tendency: An estimate of the “center” of a distribution of values for continuous variables.
    1. The mean, or average, is calculated by adding up all the values and dividing by the number of the values.
    2. The median is the value found in the exact middle of the ordered set of values. If there is an even number of values and the two middle values differ, the median is interpolated, typically as the average of those two values.
    3. The mode is the most frequently occurring value.
  2. Dispersion: Represents the spread of the data around the measure of central tendency.
    1. The standard deviation (SD) shows the relation of the values to the mean and is most commonly reported with the mean. It is the square root of the variance, which is calculated by summing the squares of the difference between each value and the mean and dividing the sum of squares by the number of values minus 1. Large standard deviations suggest that the data are generally spread out far from the mean. Small standard deviations suggest that the values are tightly clustered around the mean.
    2. The range is the difference between the highest value and the lowest value. Note that it is sensitive to outliers, as an outlier can greatly exaggerate the range.
    3. The interquartile range (IQR), also called middle fifty or midspread, is the difference between the first quartile and the third quartile of the data. The median is the corresponding measure of central tendency.
    | Consider These Numbers | Mean, SD | Median, Range | Mode | IQR |
    |---|---|---|---|---|
    | 2, 4, 5, 6, 7, 1, 3, 5, 4, 3, 2, 2 | 3.7, 1.8 | 3.5, 6 | 2 | 3 |
    | 10, 14, 34, 25, 92, 34, 54, 20, 34 | 35.2, 25.0 | 34, 82 | 34 | 27 |
    | 6, 6, 7, 5, 7, 5, 4, 6, 8, 6 | 6.0, 1.2 | 6, 4 | 6 | 2 |
  3. Frequency distribution: Summary of the frequency of individual values or ranges of values for a variable. Frequency distributions are usually represented in one of two ways: as a table or as a graph.

    A bell curve, with standard deviations on the X-axis and corresponding percentages represented on the Y-axis. Definition provided below.

    1. Normal distribution: The Gaussian or bell curve, in which 50% of the values fall above the mean and 50% fall below it, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations, about 99.7% fall within 3 standard deviations, and the mean = median = mode.
    2. Binomial distribution: The probability distribution for a binomial variable that is tested in repeated trials (e.g., flipping a coin for heads/tails 100 times); fixed probabilities add up to one.
    3. Confidence Interval: Provides an estimated range of values in which the population parameter is likely to fall a specified proportion of the time, typically either 95% or 99% of the time. In other words, if the study were repeated 100 times and a 95% confidence interval were computed each time, about 95 of those intervals would be expected to contain the population parameter.
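
The "repeat the study many times" reading of a 95% confidence interval can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the population is taken to be normal with a hypothetical mean of 100 and SD of 15, each simulated study draws 30 values, and the interval is the usual t-based interval for a mean. None of these numbers come from the lesson.

```python
import math
import random
import statistics

N = 30          # sample size per simulated study (assumed)
T_CRIT = 2.045  # two-sided 95% critical value of Student's t for N - 1 = 29 degrees of freedom

def interval_contains_mean(rng, true_mean, true_sd):
    """Draw one sample, build its 95% CI for the mean, and check whether it covers true_mean."""
    sample = [rng.gauss(true_mean, true_sd) for _ in range(N)]
    sample_mean = statistics.mean(sample)
    margin = T_CRIT * statistics.stdev(sample) / math.sqrt(N)
    return sample_mean - margin <= true_mean <= sample_mean + margin

def coverage(repeats=2000, true_mean=100.0, true_sd=15.0, seed=0):
    """Fraction of repeated studies whose interval contains the population mean."""
    rng = random.Random(seed)
    hits = sum(interval_contains_mean(rng, true_mean, true_sd) for _ in range(repeats))
    return hits / repeats

print(f"{coverage():.1%} of the simulated intervals contain the population mean")
```

The printed fraction should land close to 95%. Note that it describes the long-run behavior of the procedure, not the probability that any single computed interval contains the parameter.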
  BEWARE! Approximately normal data. If the sample size is large enough, continuous data that are not normally distributed are sometimes analyzed with statistical tests that were developed for normally distributed data. Debate surrounds how large the sample size must be to relax the normality assumption, but some use the following rule of thumb:

| If | Then |
|---|---|
| n >= 100 | It is always safe to relax the normality assumption |
| 50 <= n < 100 | It is almost always safe |
| 30 <= n < 50 | It is probably safe |
| n < 30 | It is not safe |

When descriptive statistics are computed, the following guidelines are generally followed:

| OK TO COMPUTE | NOMINAL | ORDINAL | INTERVAL | RATIO |
|---|---|---|---|---|
| Frequency distribution | Yes | Yes | Yes | Yes |
| Mode | Yes | Yes (categorical) | Yes (uncommon) | Yes (uncommon) |
| Median, range | No | Yes (categorical) | Yes | Yes |
| Mean, standard deviation | No | No (categorical); Yes (continuous) | Yes | Yes |
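
As a worked illustration of these measures, the sketch below computes them for the three example data sets listed earlier under dispersion. The helper name `describe` is hypothetical. The quartiles are taken as the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd), which is an assumption about the convention used. Other software may use a different quartile rule and report a slightly different IQR.

```python
import statistics

def describe(values):
    """Return the basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Quartiles as medians of the lower and upper halves (assumed convention).
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return {
        "mean": round(statistics.mean(values), 1),
        "sd": round(statistics.stdev(values), 1),  # sample SD: divides by n - 1
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "mode": statistics.mode(values),
        "iqr": q3 - q1,
    }

data_sets = [
    [2, 4, 5, 6, 7, 1, 3, 5, 4, 3, 2, 2],
    [10, 14, 34, 25, 92, 34, 54, 20, 34],
    [6, 6, 7, 5, 7, 5, 4, 6, 8, 6],
]
for values in data_sets:
    print(describe(values))
```

In the second data set the single value 92 pulls the mean (35.2) above the median (34) and inflates both the SD and the range, while the IQR is less affected.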

Example 1: Patient Characteristics

In the following table, mean and standard deviation are reported for continuous variables (e.g., age, number of medications per patient) and frequency for nominal (e.g., gender, ethnicity, race) and ordinal variables (e.g., number of past pharmacist-managed clinic visits):

Patient Demographics

| Characteristic | Value |
|---|---|
| Average age in years (SD) | 56 (12) |
| Gender | |
|   Male | 36 (50.7%) |
|   Female | 35 (49.3%) |
| Ethnicity | |
|   Hispanic or Latino | 1 (1.4%) |
|   Not Hispanic or Latino | 70 (98.6%) |
| Race | |
|   American Indian/Alaska Native | 3 (4.3%) |
|   Asian | 1 (1.4%) |
|   Black/African American | 26 (37.1%) |
|   Native Hawaiian/Pacific Islander | 0 |
|   White/Caucasian | 40 (57.1%) |
| No. of past pharmacist-managed clinic visits | |
|   2 | 14 (19.7%) |
|   3 – 4 | 21 (29.6%) |
|   5 – 7 | 11 (15.5%) |
|   8 – 10 | 13 (18.3%) |
|   > 10 | 12 (16.9%) |
| Prescription drug coverage | |
|   Institution specific | 32 (47.8%) |
|   Medicaid/Medicare | 21 (31.3%) |
|   Private | 6 (9.0%) |
|   Cash/no third party | 7 (10.4%) |
| Total average no. of medications per patient (SD) | 8.1 (5.3) |
| Average no. of medications managed by pharmacist per patient (SD) | 4.2 (3.6) |
| Pharmacist-managed disease state | |
|   Diabetes | 34 (47.9%) |
|   Warfarin (Coumadin) management | 19 (26.7%) |
|   High blood pressure | 30 (42.2%) |
|   Quitting smoking | 3 (4.2%) |
|   High cholesterol | 42 (59.2%) |
|   Other | 3 (4.2%)* |
| Duration of pharmacist-managed disease state in years (SD) | 6.5 (7.4) |

*Total is greater than 100% due to patients having multiple pharmacist-managed disease states.
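
The "count (percent)" entries for nominal and ordinal variables are relative frequencies. A minimal sketch of that calculation, using the gender counts from the table, might look like this. The function name `percent_frequencies` is hypothetical, and the denominator is assumed to be the sum of the category counts, which only holds when each patient falls into exactly one category.

```python
def percent_frequencies(counts):
    """Convert category counts into percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}

gender = {"Male": 36, "Female": 35}
print(percent_frequencies(gender))  # {'Male': 50.7, 'Female': 49.3}
```

For a characteristic where one patient can appear in several categories, such as the disease states, the denominator would instead be the number of patients, so the percentages can add up to more than 100%.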

Example 2: Survey Responses

The following table reports the interpolated median for ordinal data measured on a Likert scale survey. This table also reports the number of students choosing each response, which provides the frequency distribution for each of the reasons:

Ranking of the Importance of the Reasons/Motivation to Study Pharmacy
(ranked on a 5-point Likert scale ranging from 1 = not important to 5 = very important; columns 1 to 5 give the number of students choosing each level of importance)

| Reasons | 1 | 2 | 3 | 4 | 5 | Interpolated Median |
|---|---|---|---|---|---|---|
| Interested in health and medicine | 1 | 12 | 5 | 97 | 123 | 4.5 |
| Felt that health-related disciplines are good professions | 3 | 11 | 15 | 135 | 74 | 4.2 |
| Felt that pharmacy is a good profession | 7 | 23 | 27 | 133 | 48 | 4.0 |
| Felt that pharmacy has a good job prospect | 14 | 19 | 40 | 111 | 54 | 3.9 |
| Did well at chemistry/biology in higher school certificate | 30 | 28 | 32 | 95 | 53 | 3.8 |
| Am a people person and wanted a career with high levels of patient contact | 17 | 26 | 46 | 96 | 53 | 3.8 |
| Felt that pharmacy would have a high income | 26 | 35 | 49 | 86 | 42 | 3.6 |
| Had family members who encouraged a health-related field of study | 51 | 26 | 41 | 85 | 35 | 3.5 |
| Felt that pharmacy provides a diversity of options | 27 | 35 | 77 | 65 | 34 | 3.2 |
| Wanted to own a pharmacy | 62 | 34 | 60 | 53 | 29 | 2.9 |
| Had family members who encouraged pharmacy training | 80 | 23 | 48 | 56 | 31 | 2.8 |
| Joined pharmacy as a gateway to dentistry/medicine | 102 | 27 | 52 | 35 | 22 | 2.1 |
| Joined pharmacy because I want to work for a pharmaceutical company | 111 | 21 | 42 | 50 | 14 | 1.9 |
| Joined pharmacy because I wanted to undertake research in medicine | 112 | 26 | 49 | 37 | 14 | 1.8 |
| Joined pharmacy because I want to join the government sector | 112 | 27 | 57 | 34 | 8 | 1.8 |
| Family members own a pharmacy | 158 | 13 | 48 | 9 | 10 | 1.3 |
| Joined pharmacy as my school friends were doing it | 173 | 15 | 42 | 6 | 2 | 1.2 |
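
The lesson does not state how the interpolated median was calculated. One common approach, assumed here, treats each Likert category as an interval of width 1 centered on its score (so a score of 4 covers 3.5 to 4.5) and interpolates within the category that contains the middle response. The sketch below, with the hypothetical helper `interpolated_median`, gives results that agree with the tabulated values after rounding to one decimal.

```python
def interpolated_median(counts, first_value=1, width=1.0):
    """Interpolated median for ordered categories.

    counts[i] is the number of responses with score first_value + i * width.
    """
    half = sum(counts) / 2
    cumulative = 0
    for offset, count in enumerate(counts):
        if count and cumulative + count >= half:
            # Lower boundary of the category that contains the middle response.
            lower_bound = first_value + offset * width - width / 2
            return lower_bound + (half - cumulative) / count * width
        cumulative += count
    raise ValueError("counts must contain at least one response")

# Counts for scores 1 to 5, taken from the first and last rows of the table.
print(round(interpolated_median([1, 12, 5, 97, 123]), 1))  # 4.5
print(round(interpolated_median([173, 15, 42, 6, 2]), 1))  # 1.2
```

Unlike the ordinary median, which can only take the values 1 to 5 here, the interpolated median separates reasons that share the same middle category, which makes the ranking in the table possible.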

Example 3: Periodontitis

The following figure represents a simple frequency distribution for the age of periodontitis patients, showing two peaks:

A frequency distribution for the age of periodontitis patients, showing two peaks.

A box plot diagram is a more robust approach to graphically representing the frequency distribution, along with central tendency and dispersion. The box plot below shows age distribution by gender in patients with chronic periodontitis and aggressive periodontitis. The box encompasses the interquartile range and the black line represents the median. The median was lower for patients with aggressive periodontitis than those with chronic periodontitis. Higher variability of age among females with aggressive periodontitis and lower variability among males in the same group were observed. Several outliers were observed in the aggressive form, as indicated by the stars and circles (i.e., data points that did not fall within the whiskers):

An example of a box plot diagram.
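
The quantities a box plot draws can be computed directly. The sketch below is illustrative only: the ages are hypothetical, the helper `box_plot_summary` is not from the lesson, and the whisker rule (flagging points more than 1.5 × IQR beyond the quartiles) is one common convention that is assumed here because the figure's own rule is not stated.

```python
import statistics

def box_plot_summary(values, fence=1.5):
    """Median, quartiles, whisker ends, and outliers under the fence * IQR rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Quartiles as medians of the lower and upper halves (assumed convention).
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    inside = [v for v in ordered if low <= v <= high]
    return {
        "median": statistics.median(ordered),
        "q1": q1,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),  # most extreme points that are not outliers
        "outliers": [v for v in ordered if v < low or v > high],
    }

ages = [22, 24, 25, 26, 27, 28, 29, 30, 31, 58]  # hypothetical patient ages
print(box_plot_summary(ages))
```

With these made-up ages, the box spans 25 to 30, the whiskers reach 22 and 31, and the single age of 58 is drawn as an outlier.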
