For help with statistics, Drs Andrea Giochini and Owen Tomlinson offer 1-to-1 stats help sessions.
You can find more information on the Medical Sciences Maths and Stats Support Engage (formerly Yammer) page.
The peer mentors will offer at least one statistics session in the year, and monthly drop-ins where you can ask us any questions (statistics or otherwise). To find the date of the next drop-in, check Instagram or send us an email.
Here are some notes that may help you in statistics.
These notes are here to complement the lectures you receive and should not be taken as a sole resource for revision.
If the figures don't show correctly, click here to download the information contained on this page as a pdf file.
Here is a link to a flowchart for choosing statistical tests (Exeter login required)
Fundamentals of Statistics Cheat Sheet
Definitions:
Population – the full set of units we are interested in
Sample – Subset of units that we experiment on or observe from which inferences about the population can be drawn.
P–Value – The probability of seeing a result as extreme as, or more extreme than, the one observed if nothing was going on (i.e. under the null hypothesis)
Descriptive statistic – describes data in a sample (fact)
Inferential statistic – draw inferences about the population from the sample (estimate)
P–Value – Strength of evidence against the null hypothesis
0.1 – Weak
0.05 – Moderate
0.01 – Strong
0.001 – Very strong
Types of Data:
Numerical – Expressed in numbers
Continuous – Theoretically can take any value
Count – Only takes integer values
Categorical – No inherent numerical value
Nominal – No inherent order
Ordinal – Inherent order
Descriptive Statistics
Categorical:
Frequency table
Percentages
Bar chart
Numerical:
Many possible values therefore tabulation is often not practical. Solutions include:
Grouping data
Data becomes categorical
Can’t see differences within groups
Histogram
Bin width = size of each group
Too many bins – the number of data points in each bin is too small, so the shape of the distribution cannot be seen.
Too few bins – the detail is lost
Density = proportion in bin/bin width. Allows comparison of histograms with different bin widths (at the expense of interpretability)
Total area of the bars = 1
Heights indicate relative frequency of observations in a bin
Roughly 50 observations are needed for a decent histogram
No gaps between bars
Normally distributed data has a kurtosis of 3.
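The density calculation above can be sketched in a few lines of Python. This is only an illustration: the data, the bin edges and the function name density_histogram are made up for the example.

```python
def density_histogram(values, edges):
    """Return the density of each bin: (proportion in bin) / (bin width)."""
    n = len(values)
    densities = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        is_last = i == len(edges) - 2
        # Bins are [lo, hi), except the last bin, which also includes its upper edge.
        count = sum(1 for v in values if lo <= v < hi or (is_last and v == hi))
        densities.append((count / n) / (hi - lo))
    return densities


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data
edges = [0, 2, 4, 10]                    # unequal bin widths on purpose

densities = density_histogram(values, edges)
widths = [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]

# Bar area = density x width; the areas of all bars add up to 1.
total_area = sum(d * w for d, w in zip(densities, widths))
print(densities, total_area)
```

Because each bar's area is the proportion of observations in that bin, histograms drawn with different bin widths can be compared directly.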
Summary statistics
Provide a quantitative description of the data.
Dispersion
Standard deviation – how far a typical value is from the mean
Interquartile range
Lower quartile is the point where 25% of the data is below
Upper quartile is the point where 25% of the data is above
Range
Location
Mean
Mode
Median
All three are equal if the data are normally distributed
Box and whisker plots
Easier to compare between groups
Outliers – points more than 1.5 × the interquartile range below the lower quartile or above the upper quartile
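As a rough sketch, the summary statistics and the box-plot outlier rule can be computed with Python's built-in statistics module. The sample below is made up, and different software packages interpolate quartiles slightly differently, so the quartiles may not match other tools exactly.

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up sample; 30 is unusually large

# Location
mean = statistics.mean(data)
median = statistics.median(data)

# Dispersion
sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
q1, q2, q3 = statistics.quantiles(data, n=4)  # lower quartile, median, upper quartile
iqr = q3 - q1
data_range = max(data) - min(data)

# Box-plot rule: outliers lie more than 1.5 x IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(mean, median, sd, iqr, data_range, outliers)
```

Note how the single large value pulls the mean (8.5) above the median (6.5), which is why the median and interquartile range are often preferred for skewed data.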
Inferential test - T-Test
Used to test whether the means in two groups of continuous data are different from one another. As an inferential test, the t-test shows whether differences observed in sample data are likely to reflect a difference in the population.
SEM (standard error of the mean) – Strongly related to the t-test
The standard deviation of the sample mean if we were to repeat our study over and over
= SD/√n
SEM increases with increasing variability
SEM decreases with increasing sample size
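A minimal sketch of the SEM formula, using made-up numbers, shows both effects. The function name sem is just a label chosen for this example.

```python
import math
import statistics


def sem(sample):
    """Standard error of the mean: SD / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


small = [4.0, 6.0, 8.0, 10.0]   # made-up sample, n = 4
large = small * 4               # same values repeated, n = 16
spread = [1.0, 5.0, 9.0, 13.0]  # n = 4 again, but more variable

print(sem(small))   # baseline
print(sem(large))   # smaller: larger sample size
print(sem(spread))  # larger: more variability
```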
Null Hypothesis – two groups have the same mean in the population
Assumptions:
Standard (unpaired, equal-variance) t-test:
Normal distribution
Equal variance across groups (SD²)
Independent data points
For data with different SDs a t-test assuming unequal variances can be used.
The paired t-test handles non-independent data
e.g. same people before/after or from the same group of people in a hospital
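The three versions of the t-test differ only in how the standard error is built. The sketch below computes the t statistic and degrees of freedom by hand on made-up data; the function names are hypothetical, and turning t into a p-value needs the t distribution, which statistics packages provide (for example scipy.stats.ttest_ind and scipy.stats.ttest_rel in Python).

```python
import math
import statistics


def pooled_t(a, b):
    """Standard two-sample t statistic, assuming equal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2


def welch_t(a, b):
    """Two-sample t statistic that does not assume equal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se2 = va / na + vb / nb
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2), df


def paired_t(before, after):
    """Paired t statistic: a one-sample test on the within-pair differences."""
    diffs = [y - x for x, y in zip(before, after)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))  # SEM of the differences
    return statistics.mean(diffs) / se, len(diffs) - 1


group_a = [5, 6, 7, 8, 9]  # made-up data
group_b = [3, 4, 5, 6, 7]
t_pooled, df_pooled = pooled_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)

before = [10, 12, 9, 11]   # made-up measurements on the same four people
after = [12, 13, 11, 14]
t_paired, df_paired = paired_t(before, after)

print(t_pooled, df_pooled, t_welch, df_welch, t_paired, df_paired)
```

In every case the statistic is a difference in means divided by a standard error, which is why the SEM is so closely tied to the t-test.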
ANOVA (ANalysis Of VAriance)
Looks at the data overall to see whether there are any differences between groups, rather than comparing each pair of groups. Only THEN should individual differences be investigated.
The group means indicate which groups are higher or lower, but these differences are not formally tested.
Post-hoc pairwise comparisons.
Made after overall assessments
T-tests are often used, but they fail to take into account that the same information is being used in multiple comparisons.
Alternatives include Tukey tests.
Assumptions
Normal distribution
Equal variance across groups
Independent data points (as for the t-test)
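As an illustration of the "overall" comparison, a one-way ANOVA compares the variability between group means with the variability within groups. The sketch below uses made-up data and a hypothetical function name; it returns the F statistic only, and a p-value would come from the F distribution (for example via scipy.stats.f_oneway).

```python
import statistics


def one_way_anova_f(groups):
    """Return the F statistic and its two degrees of freedom."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of individual values around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]  # made-up data, three groups
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

A large F says that at least one group differs, but not which one; that is the job of the post-hoc comparisons.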
Wilcoxon Rank Sum tests
Used instead of a t-test when data is not normally distributed.
Null hypothesis – If we select a value from each group at random the value from the first group will be larger than the value from the second group 50% of the time.
If the distributions in the two groups have the same shape, it can be used to see if there is a difference in medians.
Non-parametric
Makes few assumptions (only that data points are independent)
Less powerful
Alternatives
Paired data – Wilcoxon signed-rank test
2+ groups – Kruskal-Wallis test
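The null hypothesis above can be made concrete by counting pairs. The sketch below computes the Mann-Whitney U statistic, which is equivalent to the Wilcoxon rank sum statistic, on made-up data; the function name is hypothetical and no p-value is calculated.

```python
def mann_whitney_u(a, b):
    """Count the (a, b) pairs where the value from a is larger; ties count as half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1
            elif x == y:
                u += 0.5
    return u


group_a = [12, 15, 18, 20]  # made-up data
group_b = [10, 11, 14, 16]

u = mann_whitney_u(group_a, group_b)
# Proportion of pairs in which the value from group_a is the larger one.
prob_a_larger = u / (len(group_a) * len(group_b))
print(u, prob_a_larger)
```

Under the null hypothesis this proportion would be about 0.5; the further it is from 0.5, the stronger the evidence that one group tends to give larger values.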
Chi Squared
Comparison of categorical data
χ² = Σ (observed – expected)² / expected
Degrees of freedom = (number of columns – 1) × (number of rows – 1)
Assumptions
Independent data
Data can be described by binomial/multinomial distribution
Expected count (E) > 5 in each cell (Fisher’s exact test is an alternative when this fails)
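A short sketch of the calculation for a 2 × 2 table of made-up counts is given below. The expected count in each cell is (row total × column total) / grand total; the function name is hypothetical and the p-value step is left out (scipy.stats.chi2_contingency is one tool that provides it).

```python
def chi_squared(table):
    """Return the chi-squared statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            e = row_totals[i] * col_totals[j] / total
            expected_row.append(e)
            chi2 += (observed - e) ** 2 / e
        expected.append(expected_row)

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof, expected


table = [[30, 10],   # made-up counts: rows = groups, columns = outcome yes/no
         [20, 40]]
chi2, dof, expected = chi_squared(table)

# Check the assumption that every expected count is above 5.
assumption_ok = all(e > 5 for row in expected for e in row)
print(chi2, dof, expected, assumption_ok)
```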
Correlation
Two continuous variables
Pearson’s correlation coefficient (rho or ρ; often written r for a sample)
1 = perfect positive correlation
0 = no linear correlation
-1 = perfect negative correlation
Limitation
Certain non-linear relationships give a coefficient close to 0 even though the variables are clearly related
Context
In physics a value <0.8 might be considered weak
Medical research:
Rho value >0.5 – Strong
0.3–0.5 – Moderate
<0.3 – Weak
Assumptions
Independent data points
Relationship is linear
One variable is normally distributed, with constant variance, for any given value of the other
Alternative
Spearman’s rank correlation coefficient
Used when the last two assumptions above break down
Replaces values with their rank before calculating the correlation coefficient.
Still can’t cope with circular distribution
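The difference between the two coefficients can be seen on small made-up datasets. In the sketch below the helper names (pearson, ranks, spearman) are chosen for this example only.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's coefficient: Pearson's coefficient calculated on the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y_curve = [v ** 3 for v in x]  # always increasing, but not a straight line
r_curve, rs_curve = pearson(x, y_curve), spearman(x, y_curve)

x_sym = [-2, -1, 0, 1, 2]
y_u = [v ** 2 for v in x_sym]  # U-shaped: clearly related, but not monotonic
r_u, rs_u = pearson(x_sym, y_u), spearman(x_sym, y_u)

print(r_curve, rs_curve, r_u, rs_u)
```

For the curved but always-increasing data, Spearman's coefficient is 1 while Pearson's is below 1. For the U-shaped data both coefficients are 0, even though y is completely determined by x.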
Linear Regression
Captures how y depends upon x, rather than the strength of their association (correlation)
Normally given in the form y = bx + c
y = outcome (dependent variable)
b = Gradient (average change in y per unit change in x)
x = Exposure (independent variable)
c = Intercept (the value of y when x = 0)
Other values
P–value = a test against the null hypothesis that the slope = 0
R² = How much of the variability of y can be explained by the variability of x (equal to Rho² when there is a single x)
R² close to 0 indicates huge variability of data points about the regression line
R² = 1 means data points are on the regression line
What else can regression do?
Coding one category of a variable with 0s and the other with 1s lets regression do the job of a:
T-test
ANOVA
Logistic regression can do Chi²
Assumptions
Data points are independent
Outcome data is normally distributed for any given value of exposure (residuals are normally distributed)
A histogram of the residuals can be used to examine this.
Outcome data has constant variance (residuals are homoscedastic)
Linear relationship
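A least-squares fit of y = bx + c can be sketched by hand as below. The data and the function name fit_line are made up for illustration; the p-value for the slope is not calculated here (scipy.stats.linregress is one tool that reports it).

```python
def fit_line(x, y):
    """Least-squares fit of y = b*x + c; returns b, c, R-squared and the residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (v - my) for a, v in zip(x, y))
    b = sxy / sxx          # gradient
    c = my - b * mx        # intercept: value of y when x = 0
    residuals = [yi - (b * xi + c) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)     # variability left unexplained
    ss_tot = sum((yi - my) ** 2 for yi in y)    # total variability of y
    return b, c, 1 - ss_res / ss_tot, residuals


x = [1, 2, 3, 4, 5]
y_exact = [3, 5, 7, 9, 11]   # made-up data lying exactly on y = 2x + 1
y_noisy = [2, 6, 6, 10, 11]  # made-up data scattered about a line

b1, c1, r2_exact, _ = fit_line(x, y_exact)
b2, c2, r2_noisy, residuals = fit_line(x, y_noisy)

# Coding two groups as 0 and 1: the gradient is the difference in group means
# and the intercept is the mean of the group coded 0.
group = [0, 0, 0, 1, 1, 1]
outcome = [4, 5, 6, 7, 8, 9]
b3, c3, _, _ = fit_line(group, outcome)

print(b1, c1, r2_exact, b2, c2, r2_noisy, b3, c3)
```

The residuals returned by the function are the values that would be plotted in a histogram to check the normality assumption.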
How to choose a test (Summary)
Difference in means
2 groups – t-test/regression
3+ groups – ANOVA/regression
Difference in percentages
Chi²/logistic regression
Association between 2 continuous variables
Correlation coefficient/regression
Data Presentation
Figure/table legends
Axes (what each one shows)
Key
Significance
Statistics used
Type of trend line
n (number of observations)
Standard error bars