t-Test Calculator
Calculate t-statistic, degrees of freedom, and p-value for one-sample, two-sample (Welch), and paired t-tests. Includes Cohen's d effect size and confidence intervals.
How to Use This Calculator
- Enter your sample mean, hypothesized mean (μ₀), sample standard deviation, and sample size.
- Read the t-statistic, degrees of freedom, and p-values instantly.
- Use the One-Sample tab to test a mean against a benchmark.
- Use Two-Sample for comparing two independent groups (Welch correction applied).
- Use Paired for before-after or matched-pair designs.
- The Professional tab adds the Cohen's d effect size and a 95% confidence interval.
Formula
One-sample t: t = (x̄ − μ₀) / (s / √n)
Welch two-sample t: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
Paired t: t = d̄ / (s_d / √n)
Cohen's d: d = (x̄ − μ₀) / s
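As a minimal sketch, the formulas above translate directly into Python (the function names are my own, not part of the calculator):

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """One-sample t: t = (x̄ − μ₀) / (s / √n)."""
    return (xbar - mu0) / (s / math.sqrt(n))

def welch_t(xbar1, xbar2, s1, s2, n1, n2):
    """Welch two-sample t: t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)."""
    return (xbar1 - xbar2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

def paired_t(diffs):
    """Paired t computed from the list of within-pair differences."""
    n = len(diffs)
    dbar = sum(diffs) / n
    sd = math.sqrt(sum((d - dbar) ** 2 for d in diffs) / (n - 1))
    return dbar / (sd / math.sqrt(n))

def cohens_d(xbar, mu0, s):
    """Cohen's d: d = (x̄ − μ₀) / s."""
    return (xbar - mu0) / s
```

Note that the paired test is just a one-sample test applied to the differences, which is why it shares the d̄ / (s_d / √n) form.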
Example
Sample mean = 52, μ₀ = 50, s = 5, n = 25: t = (52−50)/(5/√25) = 2/1 = 2.00, df = 24, p ≈ 0.0569 (two-tailed). Not significant at α=0.05.
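The worked example can be reproduced end to end without a statistics library: compute t, then integrate the Student's t density numerically to get the two-tailed p-value. This is a sketch, not the calculator's actual implementation:

```python
import math

def t_pdf(x, df):
    # Student's t density; lgamma avoids overflow for large df
    logc = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi))
    return math.exp(logc) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=10_000):
    # P(|T| >= |t|) = 2 * (0.5 - ∫₀^|t| f(x) dx), trapezoidal rule
    t = abs(t)
    h = t / steps
    area = 0.5 * (t_pdf(0, df) + t_pdf(t, df))
    area += sum(t_pdf(i * h, df) for i in range(1, steps))
    area *= h
    return 2 * (0.5 - area)

t = (52 - 50) / (5 / math.sqrt(25))   # = 2.00, df = 24
p = two_tailed_p(t, 24)
print(round(t, 2), round(p, 4))       # t = 2.0, p ≈ 0.0569
```

Since 0.0569 > 0.05, the null hypothesis is not rejected, matching the example above.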
Frequently Asked Questions
- What is a t-test, and when should you use one? A t-test is a parametric statistical hypothesis test used to determine whether a statistically significant difference exists between means. William Sealy Gosset published the method in 1908 under the pseudonym 'Student,' hence 'Student's t-distribution.' Use a t-test when your data is continuous (interval or ratio scale), the population standard deviation is unknown (almost always the case in practice), and the sample size is small enough that the normal approximation is imprecise. The t-distribution has heavier tails than the normal distribution, which accounts for the extra uncertainty introduced by estimating the standard deviation from the sample; as the sample size grows, it converges to the standard normal. Apply a t-test when comparing a sample mean to a known value, comparing the means of two independent groups, or analyzing before-after measurements on the same subjects. When the population variance is known, a z-test is technically more appropriate, though for larger samples (n > 30) the difference is negligible in practice.
- How do one-sample, two-sample, and paired t-tests differ? These three variants address different research questions. A one-sample t-test compares a single sample mean to a known or hypothesized population mean — for example, testing whether a factory's output of 52 units per hour differs from the industry benchmark of 50. A two-sample (independent samples) t-test compares means from two separate, unrelated groups — for example, comparing exam scores of students taught by two different methods. The samples must be independent: knowing one observation from Group A tells you nothing about Group B. A paired t-test (also called a repeated measures or matched pairs t-test) is used when each observation in one group is linked to a corresponding observation in the other — for example, blood pressure measured before and after a drug on the same patients. Pairing removes subject-to-subject variability, making the paired test more powerful than the two-sample test when pairing is appropriate. Choosing the wrong test inflates Type I or Type II error rates.
- How should I interpret the p-value? The p-value is the probability of observing a t-statistic at least as extreme as the one computed, assuming the null hypothesis is true. It does NOT tell you the probability that the null hypothesis is correct, nor the probability your result was due to chance. A common threshold is α = 0.05: if p < 0.05, you reject the null hypothesis and declare the result statistically significant. A two-tailed test asks whether the mean differs in either direction; a one-tailed test asks only whether it is greater (or only whether it is less), and its p-value is half the two-tailed value when the effect lies in the predicted direction. In either case, compare the appropriate p-value to the full α. Always report the exact p-value rather than just 'significant' or 'not significant' — p-values of 0.049 and 0.051 are nearly identical in practical terms. Combine p-values with effect sizes (Cohen's d) and confidence intervals for a complete picture: a statistically significant result with a tiny effect size (d = 0.05) may have no practical importance whatsoever.
- What if my data is not normally distributed? The t-test assumes the population from which the sample is drawn is normally distributed (or that the sample size is large enough for the Central Limit Theorem to apply). If your data is clearly non-normal — highly skewed, heavy-tailed, or multimodal — and your sample size is small (n < 30), the t-test's p-values and confidence intervals may be inaccurate. Options when normality fails: (1) Transform your data — a log transformation often normalizes right-skewed data such as income or reaction times. (2) Use a non-parametric alternative — the Mann-Whitney U test replaces the two-sample t-test, the Wilcoxon signed-rank test replaces the paired t-test, and the sign test makes the fewest distributional assumptions of all. (3) Use a permutation test or bootstrap confidence interval, which make no distributional assumptions. (4) With larger samples (n > 30–50), the t-test is quite robust to non-normality due to the Central Limit Theorem. Always check normality with a Shapiro-Wilk test or Q-Q plot before applying a t-test to small samples.
- Why is Welch's t-test the default for two samples? Welch's t-test is the preferred two-sample test in virtually all modern statistical practice. The standard (Student's) two-sample t-test assumes equal variances in both groups — an assumption called homogeneity of variance or homoscedasticity. Welch's correction relaxes this assumption by computing adjusted degrees of freedom using the Satterthwaite approximation: df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]. This typically gives non-integer degrees of freedom. When variances ARE equal, Welch's test is only slightly less powerful than Student's. When variances are unequal, Student's test produces incorrect p-values (usually anti-conservative, meaning too many false positives). Statistical guidelines from journals including Nature and APA now recommend Welch's t-test as the default two-sample test, with an equal-variance test only as a sensitivity check. Use Levene's test to formally assess variance equality if needed.
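The Satterthwaite formula in the FAQ above can be sketched in a few lines of Python (the function name is my own):

```python
import math

def welch(xbar1, s1, n1, xbar2, s2, n2):
    """Welch t-statistic and Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2           # per-group variance of the mean
    t = (xbar1 - xbar2) / math.sqrt(v1 + v2)
    # df = (v1 + v2)^2 / [v1^2/(n1-1) + v2^2/(n2-1)] — typically non-integer
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Two groups with clearly unequal spread:
t, df = welch(10, 1, 10, 8, 4, 10)
print(round(t, 3), round(df, 2))   # df lands between min(n-1) = 9 and n1+n2-2 = 18
```

Note how df falls below the pooled-test value of n₁ + n₂ − 2 = 18: the unequal variances cost effective degrees of freedom, which is exactly the correction Welch's test applies.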
Sources & References
- Student (Gosset) 1908 — 'The Probable Error of a Mean' — Biometrika
- Fisher — Statistical Methods for Research Workers — Oliver and Boyd
- NIST/SEMATECH Engineering Statistics Handbook — t-Test — NIST
- Khan Academy — Significance Tests (t-statistic) — Khan Academy
- Stanford CS109 — Probability for Computer Scientists — Stanford University