Introduction to Hypothesis Testing
Hypothesis testing is the cornerstone of statistical inference, providing a rigorous framework for making decisions based on data. In business and analytics, we constantly face questions like "Did the marketing campaign increase sales?" or "Is the new product more reliable than the old one?" Hypothesis testing allows us to answer these questions scientifically by quantifying the evidence in our data and determining whether observed effects are real or simply due to random chance.
What is Hypothesis Testing?
Hypothesis testing is a statistical method for testing a claim or hypothesis about a population parameter using sample data. The core idea is simple: we make an assumption (hypothesis) about the population, collect sample data, and then determine whether the data provides sufficient evidence to reject that assumption. This approach lets us make informed decisions while controlling the probability of making errors, which is crucial in business contexts where wrong decisions can be costly.
For example, a pharmaceutical company claims their new drug reduces blood pressure by 10 points on average. We cannot test every patient in the world, so we take a sample of 100 patients, measure the effect, and use hypothesis testing to determine whether the observed effect in our sample provides convincing evidence for the company's claim. If the sample shows an average reduction of only 2 points, is that enough to reject the claim? Hypothesis testing provides the mathematical framework to answer this question objectively.
Hypothesis Testing
A statistical procedure for evaluating claims about population parameters using sample data. It involves formulating competing hypotheses (null and alternative), collecting data, calculating a test statistic, and making a decision based on the probability of observing the data if the null hypothesis were true.
Why it matters: Hypothesis testing converts subjective business questions into objective statistical conclusions, allowing data-driven decision making with quantified confidence levels.
The Null and Alternative Hypotheses
Every hypothesis test involves two competing hypotheses. The null hypothesis (H0) represents the status quo, the default assumption, or the claim of "no effect" or "no difference." It is what we assume to be true until the data convinces us otherwise. The alternative hypothesis (H1 or Ha) represents what we are trying to find evidence for - the claim of an effect, difference, or relationship.
Think of hypothesis testing like a criminal trial. The null hypothesis is "innocent until proven guilty" - the default assumption. The alternative hypothesis is "guilty." We collect evidence (data) and determine whether there is enough evidence to reject the presumption of innocence. If the evidence is strong enough, we reject the null hypothesis in favor of the alternative. If not, we fail to reject the null hypothesis (though we never "accept" or "prove" it).
# Examples of null and alternative hypotheses
# Example 1: Testing a new website design
# H0: Conversion rate (new) = Conversion rate (old)
# H1: Conversion rate (new) != Conversion rate (old)
# Example 2: Quality control
# H0: Mean product weight = 500g (claimed weight)
# H1: Mean product weight != 500g (differs from claim)
# Example 3: A/B testing email subject lines
# H0: Open rate (Subject A) = Open rate (Subject B)
# H1: Open rate (Subject A) > Open rate (Subject B) # One-sided test
# Example 4: Customer satisfaction
# H0: Mean satisfaction score <= 7.0 (meets target)
# H1: Mean satisfaction score > 7.0 (exceeds target)
P-Values and Significance Levels
The p-value is the probability of observing data as extreme as (or more extreme than) what we actually observed, assuming the null hypothesis is true. A small p-value suggests that our observed data would be very unlikely if H0 were true, providing evidence against the null hypothesis. Conversely, a large p-value suggests our data is consistent with the null hypothesis.
The significance level (alpha, α) is a threshold we set before conducting the test, typically 0.05 (5%). If the p-value is less than alpha, we reject the null hypothesis and conclude that the results are "statistically significant." If the p-value is greater than or equal to alpha, we fail to reject the null hypothesis. The significance level represents the maximum acceptable probability of making a Type I error (rejecting a true null hypothesis).
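The definition of a p-value can be made concrete with a short simulation: generate many experiments under H0 and count how often the result is at least as extreme as the one observed. Below is a minimal sketch for a coin-flip experiment (65 heads in 100 flips, the scenario used in this section); the seed and simulation count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n_flips, observed_heads, n_sims = 100, 65, 100_000

# Simulate n_sims experiments under H0 (fair coin, p = 0.5)
heads = rng.binomial(n_flips, 0.5, size=n_sims)

# Two-sided p-value: proportion of simulated experiments at least as far
# from the expected 50 heads as the observed 65 (i.e., <= 35 or >= 65)
deviation = abs(observed_heads - n_flips * 0.5)
p_sim = np.mean(np.abs(heads - n_flips * 0.5) >= deviation)

print(f"Simulated p-value: {p_sim:.4f}")
```

With 100,000 simulations this estimate lands close to the exact binomial p-value computed in the next example.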
# BINOMIAL TEST: TESTING COIN FAIRNESS
# Description: Uses exact binomial test to determine if observed proportions differ from expected.
# Binomial test is appropriate for binary outcomes (heads/tails, yes/no, success/failure).
# More accurate than normal approximation for small to medium sample sizes.
import scipy.stats as stats
import numpy as np
# Example: Testing if a casino coin is fair
# H0: p = 0.5 (coin is fair - 50% heads expected)
# H1: p != 0.5 (coin is biased - proportion differs from 50%)
# Observed data: After 100 flips, we got 65 heads
observed_heads = 65 # Actual number of heads observed
n_flips = 100 # Total number of coin flips
expected_prop = 0.5 # Expected proportion under H0 (fair coin)
print("Binomial Test: Coin Fairness Analysis")
print("=" * 50)
print(f"Coin flips: {n_flips}")
print(f"Observed heads: {observed_heads} ({observed_heads/n_flips*100:.1f}%)")
print(f"Expected heads: {expected_prop*n_flips:.0f} ({expected_prop*100:.0f}%)")
print(f"Deviation: {observed_heads - expected_prop*n_flips:+.0f} heads")
# Binomial test (exact test for proportions - no approximations)
# alternative='two-sided' tests for any difference (too many OR too few heads)
# Note: stats.binom_test was removed in SciPy 1.12; stats.binomtest is the current API
p_value = stats.binomtest(observed_heads, n_flips, expected_prop, alternative='two-sided').pvalue
print(f"\nP-value: {p_value:.4f}")
print(f" → Probability of getting {observed_heads} or more extreme if coin is fair")
# Make decision at standard significance level
alpha = 0.05 # 5% significance level
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p = {p_value:.4f} < {alpha})")
    print("Conclusion: Strong evidence that coin is BIASED")
    print("Action: Investigate coin for unfair properties")
else:
    print(f"\nDecision: Fail to reject H0 (p = {p_value:.4f} >= {alpha})")
    print("Conclusion: Insufficient evidence of bias")
    print("Result could occur by chance with a fair coin")
Type I and Type II Errors
Hypothesis testing involves two types of potential errors. A Type I error (false positive) occurs when we reject a true null hypothesis - we conclude there is an effect when there actually is not. The probability of a Type I error is controlled by the significance level alpha. A Type II error (false negative) occurs when we fail to reject a false null hypothesis - we miss detecting a real effect. The probability of a Type II error is denoted by beta (β), and statistical power is defined as 1 - β.
| | H0 is True | H0 is False |
|---|---|---|
| Reject H0 | Type I Error (α): False Positive | Correct Decision: True Positive |
| Fail to Reject H0 | Correct Decision: True Negative | Type II Error (β): False Negative |
The Hypothesis Testing Process
Conducting a hypothesis test follows a systematic five-step process. First, formulate hypotheses: clearly state your null hypothesis H0 and alternative hypothesis H1. Second, choose significance level: decide on alpha (typically 0.05) before seeing the data. Third, select and calculate test statistic: choose the appropriate test based on your data type and research question, then compute the test statistic from your sample data. Fourth, find the p-value: determine the probability of observing your test statistic (or more extreme) if H0 is true. Fifth, make a decision: compare p-value to alpha and either reject or fail to reject H0, then interpret the result in the context of your business question.
import scipy.stats as stats
import numpy as np
# Complete example: Testing if average customer spending increased
# Step 1: Formulate hypotheses
# H0: μ = $50 (no change in spending)
# H1: μ > $50 (spending increased)
# Step 2: Set significance level
alpha = 0.05
# Step 3: Collect sample data
# Sample of 40 customers, new average spending
sample_data = np.array([55, 48, 62, 51, 59, 47, 54, 58, 61, 49,
53, 56, 50, 57, 52, 60, 46, 55, 54, 58,
51, 59, 56, 53, 48, 62, 50, 57, 55, 49,
61, 52, 58, 54, 47, 60, 56, 51, 59, 53])
hypothesized_mean = 50
sample_mean = sample_data.mean()
sample_std = sample_data.std(ddof=1)
n = len(sample_data)
print(f"Sample mean: ${sample_mean:.2f}") # $54.32
print(f"Sample std: ${sample_std:.2f}") # $4.53
print(f"Sample size: {n}") # 40
# Step 4: Conduct one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample_data, hypothesized_mean)
# For a one-sided test (H1: μ > 50), halve the two-sided p-value.
# This is valid here because the t-statistic is positive (in the predicted direction);
# equivalently, pass alternative='greater' to ttest_1samp (SciPy >= 1.6).
p_value_one_sided = p_value / 2
print(f"\nT-statistic: {t_statistic:.4f}") # ≈ 6.14
print(f"P-value (one-sided): {p_value_one_sided:.6f}") # effectively 0
# Step 5: Make decision
print(f"\nSignificance level: {alpha}")
if p_value_one_sided < alpha:
    print("Decision: Reject H0")
    print("Conclusion: Strong evidence that average spending increased")
else:
    print("Decision: Fail to reject H0")
    print("Conclusion: Insufficient evidence of spending increase")
Practice Questions: Introduction to Hypothesis Testing
Test your understanding with these coding exercises.
Problem: A company claims their average delivery time is 3 days. You suspect it is higher. Write Python code to formulate the null and alternative hypotheses and test with sample data: [3.2, 3.5, 3.1, 3.8, 3.4, 3.9, 3.3, 3.6] days.
Show Solution
import scipy.stats as stats
import numpy as np
# Formulate hypotheses
print("H0: μ = 3 days (claimed average)")
print("H1: μ > 3 days (actually higher)")
# Sample data
delivery_times = np.array([3.2, 3.5, 3.1, 3.8, 3.4, 3.9, 3.3, 3.6])
claimed_mean = 3.0
# Calculate sample statistics
sample_mean = delivery_times.mean()
sample_std = delivery_times.std(ddof=1)
n = len(delivery_times)
print(f"\nSample mean: {sample_mean:.2f} days") # 3.48 days
print(f"Sample std: {sample_std:.2f} days") # 0.28 days
print(f"Sample size: {n}") # 8
# Conduct one-sample t-test (one-sided)
t_stat, p_value_two_sided = stats.ttest_1samp(delivery_times, claimed_mean)
p_value = p_value_two_sided / 2 # One-sided test (valid: t-statistic is positive, matching H1)
print(f"\nT-statistic: {t_stat:.4f}") # ≈ 4.77
print(f"P-value: {p_value:.4f}") # ≈ 0.001
# Decision at alpha = 0.05
alpha = 0.05
if p_value < alpha:
    print(f"\nReject H0 (p = {p_value:.4f} < {alpha})")
    print("Conclusion: Evidence that delivery times exceed 3 days")
else:
    print(f"\nFail to reject H0 (p = {p_value:.4f} >= {alpha})")
    print("Conclusion: No evidence times exceed 3 days")
Problem: You test whether a coin is fair by flipping it 50 times and getting 32 heads. Write code to calculate the p-value and interpret it at significance levels of 0.05 and 0.01.
Show Solution
import scipy.stats as stats
# Hypotheses
print("H0: p = 0.5 (fair coin)")
print("H1: p != 0.5 (biased coin)")
# Data
n_flips = 50
observed_heads = 32
expected_prob = 0.5
# Calculate p-value using binomial test
# (stats.binom_test was removed in SciPy 1.12; stats.binomtest is the current API)
p_value = stats.binomtest(observed_heads, n_flips, expected_prob,
                          alternative='two-sided').pvalue
print(f"\nObserved: {observed_heads} heads out of {n_flips} flips")
print(f"Expected under H0: {n_flips * expected_prob} heads")
print(f"P-value: {p_value:.4f}") # 0.0402
# Test at alpha = 0.05
alpha1 = 0.05
print(f"\n--- Test at α = {alpha1} ---")
if p_value < alpha1:
    print(f"Reject H0 (p = {p_value:.4f} < {alpha1})")
    print("Conclusion: Evidence coin is biased at 5% level")
else:
    print(f"Fail to reject H0 (p = {p_value:.4f} >= {alpha1})")
    print("Conclusion: No evidence of bias at 5% level")
# Test at alpha = 0.01
alpha2 = 0.01
print(f"\n--- Test at α = {alpha2} ---")
if p_value < alpha2:
    print(f"Reject H0 (p = {p_value:.4f} < {alpha2})")
    print("Conclusion: Evidence coin is biased at 1% level")
else:
    print(f"Fail to reject H0 (p = {p_value:.4f} >= {alpha2})")
    print("Conclusion: No evidence of bias at 1% level")
# Interpretation
print("\nInterpretation:")
print(f"There is a {p_value*100:.2f}% chance of seeing 32+ heads")
print("(or 18- heads) in 50 flips if coin is truly fair.")
print("This is unlikely enough to reject H0 at 5% level,")
print("but not at the more stringent 1% level.")
Problem: A quality control test checks if mean widget weight is 100g (H0: μ = 100). Generate sample data from different true means and calculate the proportion of Type I errors (when true mean = 100) and Type II errors (when true mean = 102) across 1000 simulations.
Show Solution
import numpy as np
import scipy.stats as stats
np.random.seed(42)
alpha = 0.05
n_simulations = 1000
sample_size = 30
null_mean = 100
std_dev = 5
# Scenario 1: H0 is TRUE (true mean = 100)
# Count how often we incorrectly reject H0 (Type I errors)
type1_errors = 0
for i in range(n_simulations):
    # Generate data where H0 is true
    sample = np.random.normal(null_mean, std_dev, sample_size)
    t_stat, p_value = stats.ttest_1samp(sample, null_mean)
    if p_value < alpha:
        type1_errors += 1
type1_rate = type1_errors / n_simulations
print(f"Type I Error Analysis (H0 is TRUE)")
print(f"True mean: {null_mean}g, Null hypothesis: μ = {null_mean}g")
print(f"Type I errors: {type1_errors} out of {n_simulations}")
print(f"Type I error rate: {type1_rate:.3f}")
print(f"Expected rate (α): {alpha}")
# Scenario 2: H0 is FALSE (true mean = 102)
# Count how often we fail to reject false H0 (Type II errors)
alternative_mean = 102
type2_errors = 0
for i in range(n_simulations):
    # Generate data where H0 is false
    sample = np.random.normal(alternative_mean, std_dev, sample_size)
    t_stat, p_value = stats.ttest_1samp(sample, null_mean)
    if p_value >= alpha: # Failed to reject false H0
        type2_errors += 1
type2_rate = type2_errors / n_simulations
power = 1 - type2_rate
print(f"\nType II Error Analysis (H0 is FALSE)")
print(f"True mean: {alternative_mean}g, Null hypothesis: μ = {null_mean}g")
print(f"Type II errors: {type2_errors} out of {n_simulations}")
print(f"Type II error rate (β): {type2_rate:.3f}")
print(f"Statistical power (1-β): {power:.3f}")
print("\nInterpretation:")
print(f"- When H0 is true, we incorrectly reject it ~{type1_rate*100:.1f}% of time")
print(f"- When H0 is false, we correctly detect it ~{power*100:.1f}% of time")
print(f"- Larger samples would increase power and reduce β")
Types of Hypothesis Tests
Choosing the right hypothesis test is critical for valid statistical inference. Different tests are designed for different data types, research questions, and assumptions. Using the wrong test can lead to incorrect conclusions and poor business decisions. In this section, we will explore how to select the appropriate test based on whether you are comparing means, proportions, or relationships, and whether your data meets certain distributional assumptions.
Parametric vs. Non-Parametric Tests
Parametric tests assume your data follows a specific probability distribution, typically the normal distribution. They test hypotheses about population parameters like means and standard deviations. Examples include t-tests, ANOVA, and Pearson correlation. Parametric tests are generally more powerful (better at detecting real effects) when their assumptions are met, but they can give misleading results if assumptions are violated.
Non-parametric tests make fewer assumptions about the underlying distribution and work with ranks or categories rather than raw values. They are more robust to outliers and non-normal distributions but typically have less statistical power. Examples include Mann-Whitney U test, Kruskal-Wallis test, and Spearman correlation. Use non-parametric tests when you have small samples, ordinal data, or clear violations of normality.
| Parametric Test | Non-Parametric Alternative | Use Case |
|---|---|---|
| Independent t-test | Mann-Whitney U test | Compare two independent groups |
| Paired t-test | Wilcoxon signed-rank test | Compare paired/related samples |
| One-way ANOVA | Kruskal-Wallis test | Compare 3+ independent groups |
| Pearson correlation | Spearman correlation | Measure association between variables |
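The first row of the table can be illustrated directly. The sketch below runs both tests on synthetic right-skewed (lognormal) data, where the t-test's normality assumption is doubtful; the seed, distribution parameters, and sample sizes are arbitrary choices:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)  # arbitrary seed

# Synthetic revenue-per-customer data: lognormal, hence right-skewed
group_a = rng.lognormal(mean=3.0, sigma=1.0, size=30)
group_b = rng.lognormal(mean=3.5, sigma=1.0, size=30)

# Parametric: independent t-test (assumes approximately normal data)
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# Non-parametric: Mann-Whitney U test (rank-based, no normality assumption)
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

print(f"t-test p-value:       {p_t:.4f}")
print(f"Mann-Whitney p-value: {p_u:.4f}")
```

On skewed data like this, the rank-based test is the safer choice; the two p-values can differ noticeably because outliers pull the t-test's means and standard deviations around.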
One-Tailed vs. Two-Tailed Tests
A two-tailed test (also called non-directional) tests for any difference from the null hypothesis value, whether higher or lower. The alternative hypothesis is H1: μ ≠ μ0. Use two-tailed tests when you want to detect any change, regardless of direction. For example, testing if a new process changes product quality (could improve or worsen).
A one-tailed test (directional) tests for a specific direction of difference. The alternative hypothesis is either H1: μ > μ0 (right-tailed) or H1: μ < μ0 (left-tailed). Use one-tailed tests only when you have a strong theoretical reason to predict the direction before seeing the data. For example, testing if a new drug lowers blood pressure (only interested in decrease, not increase).
# ONE-TAILED VS TWO-TAILED TEST: WEBSITE OPTIMIZATION\n# Description: Compares one-tailed and two-tailed tests for website load time optimization.\n# Shows how directionality affects p-values and statistical conclusions.\n# One-tailed tests are more powerful but only detect effects in one direction.\n\nimport scipy.stats as stats\nimport numpy as np\n\n# Business scenario: We optimized website code to reduce load times\n# Question: Did optimization reduce load times?\n\n# Example data: Website load times (seconds) measured before and after optimization\nbefore = np.array([2.3, 2.5, 2.1, 2.8, 2.4, 2.6, 2.2, 2.7, 2.5, 2.3]) # Original site\nafter = np.array([2.0, 2.1, 1.9, 2.2, 2.0, 2.1, 1.8, 2.3, 2.0, 1.9]) # After optimization\n\nprint(\"One-Tailed vs Two-Tailed Test: Website Load Time Analysis\")\nprint(\"=\" * 65)\nprint(f\"Before optimization: mean = {before.mean():.3f}s, std = {before.std(ddof=1):.3f}s\")\nprint(f\"After optimization: mean = {after.mean():.3f}s, std = {after.std(ddof=1):.3f}s\")\nprint(f\"Observed improvement: {before.mean() - after.mean():.3f}s faster ({((before.mean()-after.mean())/before.mean())*100:.1f}% faster)\")\n\n# === TWO-TAILED TEST ===\n# H0: μ_before = μ_after (optimization had no effect)\n# H1: μ_before ≠ μ_after (optimization changed load times - any direction)\nt_stat, p_value_two = stats.ttest_ind(before, after)\n\nprint(\"\\nTWO-TAILED TEST (H1: μ_before ≠ μ_after)\")\nprint(\"Tests for: ANY change in load times (increase OR decrease)\")\nprint(f\"P-value: {p_value_two:.4f}\") # Tests both directions\nif p_value_two < 0.05:\n print(\"Result: SIGNIFICANT - Load times changed\")\nelse:\n print(\"Result: NOT SIGNIFICANT\")\n\n# === ONE-TAILED TEST ===\n# H0: μ_before <= μ_after (optimization didn't help)\n# H1: μ_before > μ_after (optimization reduced load times)\n# For one-tailed, divide two-tailed p-value by 2\np_value_one = p_value_two / 2 # Divide by 2 for one-tail\n\nprint(\"\\nONE-TAILED TEST (H1: μ_before > 
μ_after)\")\nprint(\"Tests for: SPECIFIC direction - did times DECREASE?\")\nprint(f\"P-value: {p_value_one:.4f}\") # Stronger evidence for predicted direction\nif p_value_one < 0.05:\n print(\"Result: SIGNIFICANT - Load times DECREASED\")\nelse:\n print(\"Result: NOT SIGNIFICANT\")\n\n# Key takeaways\nprint(\"\\nInterpretation:\")\nprint(\"Two-tailed: Evidence that times changed (could be any direction)\")\nprint(\"One-tailed: Stronger evidence specifically for improvement\")\nprint(f\"\\nPower advantage: One-tailed p={p_value_one:.4f} vs Two-tailed p={p_value_two:.4f}\")
Test Selection Flowchart
Selecting the appropriate test depends on several factors: the type of data you have (continuous, categorical, ordinal), the number of groups you are comparing, whether samples are independent or paired, and whether assumptions like normality are met. Here is a systematic approach to test selection for common scenarios.
Comparing Groups (Continuous Data)
- One sample: One-sample t-test or z-test
- Two independent groups: Independent t-test
- Two paired groups: Paired t-test
- 3+ independent groups: One-way ANOVA
- 3+ paired groups: Repeated measures ANOVA
Analyzing Relationships (Categorical)
- Independence: Chi-square test of independence
- Goodness of fit: Chi-square goodness of fit
- One proportion: Binomial test or z-test
- Two proportions: Two-proportion z-test
- Multiple proportions: Chi-square test
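The chi-square test of independence listed above takes a contingency table of counts. A minimal sketch with made-up purchase counts by traffic source (the numbers are invented for illustration):

```python
import numpy as np
import scipy.stats as stats

# Hypothetical contingency table: purchases by traffic source
#                    Purchased  Did not purchase
observed = np.array([[30, 170],    # organic search visitors
                     [45, 155]])   # paid-ad visitors

# H0: purchasing is independent of traffic source
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-square: {chi2:.4f} (dof = {dof})")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: purchase rate depends on traffic source")
else:
    print("Fail to reject H0: no evidence of an association")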
Assumptions and Diagnostics
Most statistical tests require certain assumptions to be met for valid inference. For t-tests and ANOVA, key assumptions include: (1) Normality - data should be approximately normally distributed (check with histograms, Q-Q plots, or Shapiro-Wilk test); (2) Independence - observations should be independent of each other; (3) Homogeneity of variance - groups should have similar variances (Levene's test checks this).
import scipy.stats as stats
import numpy as np
# Sample data: Customer satisfaction scores from two stores
store_a = np.array([7.2, 8.1, 7.8, 8.5, 7.9, 8.0, 7.5, 8.2, 7.7, 8.3,
7.6, 8.4, 7.9, 8.1, 7.8, 8.0, 7.7, 8.2, 7.5, 8.1])
store_b = np.array([6.8, 7.2, 7.0, 7.5, 6.9, 7.1, 6.7, 7.3, 7.0, 7.4,
6.9, 7.2, 7.1, 7.3, 6.8, 7.0, 7.2, 7.1, 6.9, 7.4])
print("Checking Assumptions for Independent T-Test")
print("=" * 50)
# 1. Normality Test (Shapiro-Wilk)
stat_a, p_a = stats.shapiro(store_a)
stat_b, p_b = stats.shapiro(store_b)
print("\n1. Normality Check (Shapiro-Wilk test)")
print(f"Store A: W = {stat_a:.4f}, p = {p_a:.4f}")
print(f"Store B: W = {stat_b:.4f}, p = {p_b:.4f}")
if p_a > 0.05 and p_b > 0.05:
    print("✓ Both groups appear normally distributed")
else:
    print("✗ Normality violated - consider non-parametric test")
# 2. Homogeneity of Variance (Levene's test)
stat_lev, p_lev = stats.levene(store_a, store_b)
print(f"\n2. Equal Variance Check (Levene's test)")
print(f"Statistic = {stat_lev:.4f}, p = {p_lev:.4f}")
if p_lev > 0.05:
    print("✓ Equal variances assumed")
    equal_var = True
else:
    print("✗ Unequal variances - use Welch's t-test")
    equal_var = False
# 3. Conduct appropriate t-test
print("\n3. Conducting T-Test")
t_stat, p_value = stats.ttest_ind(store_a, store_b, equal_var=equal_var)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: Stores have different satisfaction scores")
else:
    print("Fail to reject H0: No evidence of difference")
Practice Questions: Types of Hypothesis Tests
Apply your knowledge of test selection.
Problem: A fitness app claims users lose at least 5 pounds in 30 days. You collect data from 25 users: [4.8, 5.2, 4.5, 6.1, 5.0, 4.9, 5.3, 5.5, 4.7, 5.8, 5.1, 4.6, 5.4, 5.2, 4.9, 5.7, 5.0, 5.3, 4.8, 5.6, 5.1, 4.9, 5.4, 5.2, 5.5]. Test if weight loss meets the claim using appropriate one-tailed test.
Show Solution
import scipy.stats as stats
import numpy as np
# Data
weight_loss = np.array([4.8, 5.2, 4.5, 6.1, 5.0, 4.9, 5.3, 5.5, 4.7, 5.8,
5.1, 4.6, 5.4, 5.2, 4.9, 5.7, 5.0, 5.3, 4.8, 5.6,
5.1, 4.9, 5.4, 5.2, 5.5])
claimed_loss = 5.0
# Hypotheses (testing if actual >= claimed)
print("H0: μ >= 5.0 pounds (claim is met)")
print("H1: μ < 5.0 pounds (claim not met)")
print("One-tailed test (left-tailed)")
# Calculate statistics
sample_mean = weight_loss.mean()
sample_std = weight_loss.std(ddof=1)
n = len(weight_loss)
print(f"\nSample mean: {sample_mean:.2f} pounds") # 5.14
print(f"Sample std: {sample_std:.2f} pounds") # 0.38
print(f"Sample size: {n}") # 25
# One-sample t-test
t_stat, p_value_two = stats.ttest_1samp(weight_loss, claimed_loss)
# Left-tailed p-value is P(T <= t_stat):
#   if t < 0, this is p_two/2; if t > 0, it is 1 - p_two/2 (greater than 0.5)
if t_stat < 0:
    p_value = p_value_two / 2
else:
    p_value = 1 - (p_value_two / 2)
print(f"\nT-statistic: {t_stat:.4f}") # ≈ 2.31
print(f"P-value (left-tailed): {p_value:.4f}") # ≈ 0.985
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nReject H0 (p < {alpha})")
    print("Conclusion: Claim is NOT supported by data")
else:
    print(f"\nFail to reject H0 (p >= {alpha})")
    print("Conclusion: No evidence against the claim of >= 5 lbs lost")
Problem: Before conducting an independent t-test comparing two marketing strategies, check normality and equal variance assumptions. Strategy A results: [23, 25, 22, 28, 24, 26, 23, 27, 25, 24]. Strategy B results: [20, 22, 21, 19, 23, 21, 20, 22, 21, 20]. Write code to test assumptions.
Show Solution
import scipy.stats as stats
import numpy as np
# Data
strategy_a = np.array([23, 25, 22, 28, 24, 26, 23, 27, 25, 24])
strategy_b = np.array([20, 22, 21, 19, 23, 21, 20, 22, 21, 20])
print("Assumption Testing for Independent T-Test")
print("=" * 50)
# Assumption 1: Normality (Shapiro-Wilk test)
print("\n1. Testing Normality (Shapiro-Wilk test)")
print("H0: Data is normally distributed")
stat_a, p_a = stats.shapiro(strategy_a)
stat_b, p_b = stats.shapiro(strategy_b)
print(f"\nStrategy A: W = {stat_a:.4f}, p-value = {p_a:.4f}")
if p_a > 0.05:
    print(" ✓ Normally distributed (fail to reject H0)")
else:
    print(" ✗ Not normally distributed (reject H0)")
print(f"\nStrategy B: W = {stat_b:.4f}, p-value = {p_b:.4f}")
if p_b > 0.05:
    print(" ✓ Normally distributed (fail to reject H0)")
else:
    print(" ✗ Not normally distributed (reject H0)")
# Assumption 2: Equal Variances (Levene's test)
print("\n2. Testing Equal Variances (Levene's test)")
print("H0: Variances are equal")
stat_lev, p_lev = stats.levene(strategy_a, strategy_b)
print(f"\nLevene statistic = {stat_lev:.4f}, p-value = {p_lev:.4f}")
if p_lev > 0.05:
    print(" ✓ Equal variances (fail to reject H0)")
    equal_var = True
else:
    print(" ✗ Unequal variances (reject H0)")
    print(" → Use Welch's t-test (equal_var=False)")
    equal_var = False
# Summary
print("\n" + "=" * 50)
print("SUMMARY:")
normality_ok = (p_a > 0.05) and (p_b > 0.05)
variance_ok = (p_lev > 0.05)
if normality_ok and variance_ok:
    print("✓ All assumptions met - standard t-test appropriate")
elif normality_ok and not variance_ok:
    print("✓ Normality OK, variances differ - use Welch's t-test")
else:
    print("✗ Normality violated - consider Mann-Whitney U test")
# Conduct appropriate test
print("\n3. Conducting T-Test")
t_stat, p_value = stats.ttest_ind(strategy_a, strategy_b,
equal_var=equal_var)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: Strategies differ significantly")
else:
    print("Conclusion: No significant difference")
Problem: You have three scenarios. For each, identify the correct test and implement it: (A) Compare conversion rates between two website designs (Design A: 45/200 converted, Design B: 38/200). (B) Test if dice is fair based on 60 rolls: [1:8, 2:12, 3:9, 4:11, 5:10, 6:10]. (C) Compare customer satisfaction (1-10 scale) across 3 regions: Region1=[7,8,9,8,7], Region2=[8,9,9,8,10], Region3=[6,7,6,7,8].
Show Solution
import scipy.stats as stats
import numpy as np
print("Scenario A: Comparing Two Proportions")
print("=" * 50)
# Two-proportion z-test
successes = np.array([45, 38])
trials = np.array([200, 200])
# Calculate proportions
p1 = successes[0] / trials[0]
p2 = successes[1] / trials[1]
# Pooled proportion
p_pool = successes.sum() / trials.sum()
# Standard error
se = np.sqrt(p_pool * (1 - p_pool) * (1/trials[0] + 1/trials[1]))
# Z-statistic
z = (p1 - p2) / se
# P-value (two-tailed)
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"Design A conversion: {p1:.3f} ({successes[0]}/{trials[0]})")
print(f"Design B conversion: {p2:.3f} ({successes[1]}/{trials[1]})")
print(f"Z-statistic: {z:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: Conversion rates differ significantly")
else:
    print("Conclusion: No significant difference")
print("\n\nScenario B: Testing Dice Fairness")
print("=" * 50)
# Chi-square goodness of fit test
observed = np.array([8, 12, 9, 11, 10, 10])
expected = np.array([10, 10, 10, 10, 10, 10]) # Fair dice
chi2, p_value = stats.chisquare(observed, expected)
print("H0: Dice is fair (all faces equally likely)")
print(f"Observed: {observed}")
print(f"Expected: {expected}")
print(f"Chi-square: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: Dice is NOT fair")
else:
    print("Conclusion: No evidence of unfairness")
print("\n\nScenario C: Comparing Three Regions")
print("=" * 50)
# One-way ANOVA
region1 = np.array([7, 8, 9, 8, 7])
region2 = np.array([8, 9, 9, 8, 10])
region3 = np.array([6, 7, 6, 7, 8])
f_stat, p_value = stats.f_oneway(region1, region2, region3)
print("Test: One-way ANOVA")
print(f"Region 1 mean: {region1.mean():.2f}")
print(f"Region 2 mean: {region2.mean():.2f}")
print(f"Region 3 mean: {region3.mean():.2f}")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: At least one region differs")
    print("Next step: Post-hoc tests to identify which pairs differ")
else:
    print("Conclusion: No significant differences among regions")
T-Tests
T-tests are among the most commonly used statistical tests in data analytics, designed for comparing means when sample sizes are small or population standard deviations are unknown. Named after William Sealy Gosset (who published under the pseudonym "Student"), the t-test accounts for additional uncertainty from estimating the population standard deviation from sample data. T-tests are workhorses of business analytics, used in A/B testing, quality control, market research, and countless other applications.
One-Sample T-Test
The one-sample t-test determines whether a sample mean differs significantly from a known or hypothesized population mean. Use this test when you have a single sample and want to compare it to a specific value. Common scenarios include testing if average product weight matches the labeled weight, checking if customer satisfaction meets a target score, or verifying if response times meet service level agreements.
The test statistic is t = (x̄ - μ0) / (s / √n), where x̄ is the sample mean, μ0 is the hypothesized population mean, s is the sample standard deviation, and n is the sample size. The t-statistic follows a t-distribution with n-1 degrees of freedom. Larger absolute t-values provide stronger evidence against the null hypothesis.
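The formula can be checked by computing t by hand and comparing it with scipy's result. The numbers below are arbitrary toy data, not from any example in this chapter:

```python
import numpy as np
import scipy.stats as stats

sample = np.array([50.2, 49.8, 50.1, 50.3, 49.9, 50.0, 50.2, 49.7])  # toy data
mu0 = 50.0  # hypothesized population mean

# Manual computation: t = (x̄ - μ0) / (s / √n)
x_bar = sample.mean()
s = sample.std(ddof=1)       # sample standard deviation
n = len(sample)
t_manual = (x_bar - mu0) / (s / np.sqrt(n))

# Two-sided p-value from the t-distribution with n-1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)

# Same test via scipy
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print(f"Manual: t = {t_manual:.6f}, p = {p_manual:.6f}")
print(f"SciPy:  t = {t_scipy:.6f}, p = {p_scipy:.6f}")  # matches the manual values
```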
# ONE-SAMPLE T-TEST: QUALITY CONTROL APPLICATION
# Description: Tests whether a sample mean differs from a known/hypothesized value.
# Use case: Manufacturing quality control to verify products meet specifications.
# This is a two-tailed test because deviations in either direction matter.
import scipy.stats as stats
import numpy as np
# Business scenario: Manufacturing bolts that should be exactly 50mm long
# We sample 16 bolts and measure them to verify production quality
# H0: μ = 50mm (production meets specification - no adjustment needed)
# H1: μ ≠ 50mm (production differs from spec - machine needs calibration)
bolt_lengths = np.array([50.2, 49.8, 50.1, 50.3, 49.9, 50.0, 50.2, 49.7,
50.4, 50.1, 49.9, 50.3, 50.0, 49.8, 50.2, 50.1]) # Sample measurements
target_length = 50.0 # Specification: What the length should be
alpha = 0.05 # Significance level: 5% risk of false alarm
# Calculate descriptive statistics from our sample
sample_mean = bolt_lengths.mean() # Average of our measurements
sample_std = bolt_lengths.std(ddof=1) # Standard deviation (ddof=1 for sample)
n = len(bolt_lengths) # Sample size
print("One-Sample T-Test: Bolt Length Quality Control")
print("=" * 55)
print(f"Target length: {target_length}mm (specification)")
print(f"Sample size: {n} bolts measured")
print(f"Sample mean: {sample_mean:.4f}mm (our observed average)")
print(f"Sample std: {sample_std:.4f}mm (variability in measurements)")
# Conduct one-sample t-test
# This compares our sample mean to the target value
# Returns: t-statistic (how many std errors away) and p-value
t_statistic, p_value = stats.ttest_1samp(bolt_lengths, target_length)
print(f"\nT-statistic: {t_statistic:.4f}")
print(f" → Positive means sample mean > target, negative means < target")
print(f" → Larger absolute value = stronger evidence of difference")
print(f"P-value: {p_value:.4f}")
print(f" → Probability of observing this difference if H0 is true")
print(f"Degrees of freedom: {n-1}")
# Make decision based on p-value
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("Conclusion: Bolt lengths differ significantly from 50mm specification")
    print(f"Average deviation: {sample_mean - target_length:.4f}mm")
    print("ACTION NEEDED: Adjust manufacturing equipment")
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("Conclusion: Bolt lengths meet specification (difference likely due to chance)")
    print("ACTION: No adjustment needed, production is on target")
# Calculate 95% confidence interval for the true mean
# This gives us a range where we're 95% confident the true mean lies
ci = stats.t.interval(0.95, n-1, sample_mean, sample_std/np.sqrt(n))
print(f"\n95% Confidence Interval: ({ci[0]:.4f}, {ci[1]:.4f})mm")
print(f" → We're 95% confident the true average bolt length is in this range")
print(f" → Target {target_length}mm is {'inside' if ci[0] <= target_length <= ci[1] else 'outside'} the interval")
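As a sanity check, the t-statistic and two-tailed p-value above can be reproduced directly from the formula t = (x̄ - μ₀) / (s / √n); a minimal sketch reusing the same bolt measurements:

```python
import numpy as np
import scipy.stats as stats

# Same sample as the quality-control example above
bolt_lengths = np.array([50.2, 49.8, 50.1, 50.3, 49.9, 50.0, 50.2, 49.7,
                         50.4, 50.1, 49.9, 50.3, 50.0, 49.8, 50.2, 50.1])
mu0 = 50.0
n = len(bolt_lengths)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual = (bolt_lengths.mean() - mu0) / (bolt_lengths.std(ddof=1) / np.sqrt(n))

# Two-tailed p-value: double the upper-tail area beyond |t| under t(n-1)
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Should match scipy's built-in one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(bolt_lengths, mu0)
print(t_manual, p_manual)  # same values as t_scipy, p_scipy
```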
Independent Samples T-Test
The independent samples t-test (also called two-sample t-test) compares means between two independent groups. Use this when you have two separate samples and want to know if they come from populations with different means. Examples include comparing sales between two regions, testing if a new drug performs better than placebo, or evaluating if one marketing strategy outperforms another.
There are two variants: Student's t-test assumes equal variances in both groups, while Welch's t-test does not require this assumption and is more robust when variances differ. Modern practice generally prefers Welch's t-test as the default. Python's ttest_ind() can perform either version depending on the equal_var parameter.
Independent T-Test Statistic
For equal variances: t = (x̄₁ - x̄₂) / √(s²pooled × (1/n₁ + 1/n₂))
For unequal variances (Welch): t = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂)
When to use: Comparing two independent groups (different subjects) on a continuous outcome variable. Groups must be independent - observations in one group cannot influence the other.
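A minimal sketch of the Welch computation, using two made-up samples with clearly different spreads, checked against ttest_ind(equal_var=False):

```python
import numpy as np
import scipy.stats as stats

# Hypothetical samples with visibly different variability
g1 = np.array([3.1, 3.5, 3.2, 3.8, 3.4, 3.6])
g2 = np.array([4.0, 5.2, 3.6, 5.8, 4.4, 6.1])

# Welch t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
v1 = g1.var(ddof=1) / len(g1)
v2 = g2.var(ddof=1) / len(g2)
t_manual = (g1.mean() - g2.mean()) / np.sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom (generally not a whole number)
df = (v1 + v2) ** 2 / (v1 ** 2 / (len(g1) - 1) + v2 ** 2 / (len(g2) - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(g1, g2, equal_var=False)
print(t_manual, p_manual)  # matches scipy's Welch result
```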
import scipy.stats as stats
import numpy as np
# Example: A/B test comparing two website designs
# H0: μ_A = μ_B (no difference in time spent)
# H1: μ_A ≠ μ_B (designs differ in engagement)
# Time spent on page (minutes)
design_a = np.array([3.2, 4.1, 3.8, 4.5, 3.9, 4.0, 3.7, 4.2, 3.5, 4.3,
3.6, 4.1, 3.9, 4.0, 3.8, 4.4, 3.7, 4.2, 3.9, 4.1])
design_b = np.array([4.2, 4.8, 4.5, 5.1, 4.6, 4.7, 4.3, 4.9, 4.4, 5.0,
4.5, 4.8, 4.6, 4.9, 4.7, 5.2, 4.4, 4.8, 4.6, 4.7])
print("Independent Samples T-Test: Website A/B Testing")
print("=" * 55)
# Calculate descriptive statistics
print(f"Design A: n={len(design_a)}, mean={design_a.mean():.2f}, std={design_a.std(ddof=1):.2f}")
print(f"Design B: n={len(design_b)}, mean={design_b.mean():.2f}, std={design_b.std(ddof=1):.2f}")
# Test for equal variances (Levene's test)
_, p_levene = stats.levene(design_a, design_b)
print(f"\nLevene's test p-value: {p_levene:.4f}")
if p_levene < 0.05:
    print("Variances differ - using Welch's t-test")
    equal_var = False
else:
    print("Equal variances - using Student's t-test")
    equal_var = True
# Conduct independent samples t-test
t_stat, p_value = stats.ttest_ind(design_a, design_b, equal_var=equal_var)
print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.6f}")
# Calculate effect size (Cohen's d)
pooled_std = np.sqrt(((len(design_a)-1)*design_a.std(ddof=1)**2 +
(len(design_b)-1)*design_b.std(ddof=1)**2) /
(len(design_a) + len(design_b) - 2))
cohens_d = (design_b.mean() - design_a.mean()) / pooled_std
print(f"Cohen's d: {cohens_d:.4f}")
print(f"Effect size: {'Small' if abs(cohens_d) < 0.5 else 'Medium' if abs(cohens_d) < 0.8 else 'Large'}")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("Conclusion: Designs differ significantly in engagement")
    print(f"Average difference: {design_b.mean() - design_a.mean():.2f} minutes (Design B higher)")
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("Conclusion: No significant difference between designs")
Paired Samples T-Test
The paired samples t-test (also called dependent t-test or repeated measures t-test) compares means from the same group measured twice - before and after an intervention, or under two different conditions. Because measurements come from the same subjects, we account for individual differences by analyzing the differences within each pair rather than comparing group means directly.
This test is more powerful than an independent samples t-test when data is naturally paired, because it controls for individual variability. Use paired t-tests for before-after studies, pre-post treatment designs, matched pairs, or any situation where the same subjects provide both measurements. The test essentially performs a one-sample t-test on the differences, testing if the mean difference is zero.
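That equivalence is easy to verify: a paired t-test gives exactly the same result as a one-sample t-test of the differences against zero. A minimal sketch with made-up before/after scores:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical before/after measurements for five subjects
before = np.array([10.0, 12.5, 9.8, 11.2, 10.7])
after = np.array([11.1, 13.0, 10.5, 11.0, 11.9])

# Paired t-test on the two measurement columns ...
t_rel, p_rel = stats.ttest_rel(after, before)

# ... is identical to a one-sample t-test on the differences vs. 0
t_one, p_one = stats.ttest_1samp(after - before, 0.0)
print(t_rel, p_rel)  # same values as t_one, p_one
```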
import scipy.stats as stats
import numpy as np
# Example: Training program effectiveness
# H0: μ_diff = 0 (no improvement)
# H1: μ_diff > 0 (scores improved after training)
employee_ids = range(1, 16)
before_training = np.array([72, 68, 75, 70, 73, 69, 71, 74, 70, 72,
68, 73, 71, 69, 74])
after_training = np.array([78, 75, 82, 76, 79, 74, 77, 81, 75, 78,
73, 80, 76, 74, 79])
print("Paired Samples T-Test: Training Program Effectiveness")
print("=" * 55)
# Calculate differences
differences = after_training - before_training
mean_diff = differences.mean()
std_diff = differences.std(ddof=1)
print(f"Sample size: {len(differences)} employees")
print(f"Mean improvement: {mean_diff:.2f} points")
print(f"Std of differences: {std_diff:.2f}")
print(f"\nBefore mean: {before_training.mean():.2f}")
print(f"After mean: {after_training.mean():.2f}")
# Conduct paired t-test
t_stat, p_value_two = stats.ttest_rel(after_training, before_training)
# One-tailed test (expecting improvement). Halving the two-tailed p-value
# is valid only because t_stat is positive (in the hypothesized direction);
# SciPy 1.6+ can compute this directly via alternative='greater'
p_value = p_value_two / 2
print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value (one-tailed): {p_value:.6f}")
# Individual results table
print("\nIndividual Results:")
print("Employee | Before | After | Difference")
print("-" * 42)
for i, emp_id in enumerate(employee_ids):
    print(f" {emp_id:2d} | {before_training[i]:2d} | {after_training[i]:2d} | {differences[i]:+2d}")
print("-" * 42)
improved = np.sum(differences > 0)
print(f"{improved}/{len(differences)} employees improved")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("Conclusion: Training significantly improved scores")
    # Calculate 95% CI for mean difference
    ci = stats.t.interval(0.95, len(differences)-1, mean_diff,
                          std_diff/np.sqrt(len(differences)))
    print(f"95% CI for improvement: ({ci[0]:.2f}, {ci[1]:.2f}) points")
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("Conclusion: No significant improvement from training")
Effect Size and Practical Significance
Statistical significance (p-value < 0.05) tells you whether an effect exists, but not how large or important it is. A tiny, trivial difference can be statistically significant with a large enough sample. Effect size quantifies the magnitude of the difference, helping you assess practical significance alongside statistical significance.
Cohen's d is the most common effect size measure for t-tests, calculated as the difference between means divided by the pooled standard deviation. Interpret Cohen's d as: small effect (d = 0.2), medium effect (d = 0.5), large effect (d = 0.8). A large effect size means the groups differ substantially, regardless of statistical significance. Always report effect sizes alongside p-values for complete information.
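As a sketch, Cohen's d is simple to compute directly; the helper name and the two small arrays below are made up for illustration:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Two hypothetical groups whose means differ by 2 points
a = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
b = np.array([12.0, 14.0, 13.0, 15.0, 11.0])
print(cohens_d(b, a))  # → roughly 1.26, a large effect
```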
import scipy.stats as stats
import numpy as np
# Demonstration: Statistical vs Practical Significance
# Scenario 1: Large sample, small effect (statistically significant but trivial)
np.random.seed(42)
group1_large = np.random.normal(100, 15, 500) # Mean=100, n=500
group2_large = np.random.normal(102, 15, 500) # Mean=102, n=500 (2 point diff)
t1, p1 = stats.ttest_ind(group1_large, group2_large)
d1 = (group2_large.mean() - group1_large.mean()) / \
np.sqrt(((len(group1_large)-1)*group1_large.std(ddof=1)**2 +
(len(group2_large)-1)*group2_large.std(ddof=1)**2) /
(len(group1_large) + len(group2_large) - 2))
print("Scenario 1: Large Sample, Small Effect")
print("=" * 50)
print(f"Group 1: n=500, mean={group1_large.mean():.2f}")
print(f"Group 2: n=500, mean={group2_large.mean():.2f}")
print(f"Difference: {group2_large.mean() - group1_large.mean():.2f}")
print(f"P-value: {p1:.6f}")
print(f"Cohen's d: {d1:.4f} (small effect)")
print("Result: Statistically significant but practically trivial\n")
# Scenario 2: Small sample, large effect (not significant but important)
group1_small = np.random.normal(100, 15, 10) # Mean=100, n=10
group2_small = np.random.normal(115, 15, 10) # Mean=115, n=10 (15 point diff)
t2, p2 = stats.ttest_ind(group1_small, group2_small)
d2 = (group2_small.mean() - group1_small.mean()) / \
np.sqrt(((len(group1_small)-1)*group1_small.std(ddof=1)**2 +
(len(group2_small)-1)*group2_small.std(ddof=1)**2) /
(len(group1_small) + len(group2_small) - 2))
print("Scenario 2: Small Sample, Large Effect")
print("=" * 50)
print(f"Group 1: n=10, mean={group1_small.mean():.2f}")
print(f"Group 2: n=10, mean={group2_small.mean():.2f}")
print(f"Difference: {group2_small.mean() - group1_small.mean():.2f}")
print(f"P-value: {p2:.6f}")
print(f"Cohen's d: {d2:.4f} (large effect)")
print("Result: Not statistically significant but large practical difference\n")
print("KEY LESSON:")
print("Always report both p-values AND effect sizes")
print("Statistical significance ≠ Practical importance")
Practice Questions: T-Tests
Apply t-tests to real-world scenarios.
Problem: A coffee shop claims their average latte contains 150mg of caffeine. Test this claim with sample measurements: [148, 152, 149, 153, 150, 147, 151, 149, 152, 150] mg.
Show Solution
import scipy.stats as stats
import numpy as np
# Data
caffeine = np.array([148, 152, 149, 153, 150, 147, 151, 149, 152, 150])
claimed_amount = 150
# Hypotheses
print("H0: μ = 150mg (claim is accurate)")
print("H1: μ ≠ 150mg (claim is inaccurate)")
# Sample statistics
sample_mean = caffeine.mean()
sample_std = caffeine.std(ddof=1)
n = len(caffeine)
print(f"\nSample size: {n}")
print(f"Sample mean: {sample_mean:.2f}mg")
print(f"Sample std: {sample_std:.2f}mg")
print(f"Claimed: {claimed_amount}mg")
# One-sample t-test
t_stat, p_value = stats.ttest_1samp(caffeine, claimed_amount)
print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nReject H0 (p < {alpha})")
    print("Claim is inaccurate")
else:
    print(f"\nFail to reject H0 (p >= {alpha})")
    print("Claim appears accurate")
# 95% CI
ci = stats.t.interval(0.95, n-1, sample_mean, sample_std/np.sqrt(n))
print(f"\n95% CI: ({ci[0]:.2f}, {ci[1]:.2f})mg")
print(f"Claimed value {claimed_amount}mg is {'inside' if ci[0] <= claimed_amount <= ci[1] else 'outside'} CI")
Problem: Compare customer satisfaction scores between two store locations and calculate Cohen's d. Store A: [7.2, 8.1, 7.5, 8.3, 7.8, 8.0, 7.6, 8.2]. Store B: [6.8, 7.3, 7.0, 7.5, 6.9, 7.2, 6.7, 7.4].
Show Solution
import scipy.stats as stats
import numpy as np
# Data
store_a = np.array([7.2, 8.1, 7.5, 8.3, 7.8, 8.0, 7.6, 8.2])
store_b = np.array([6.8, 7.3, 7.0, 7.5, 6.9, 7.2, 6.7, 7.4])
print("Independent T-Test: Store Comparison")
print("=" * 50)
# Descriptive statistics
print(f"Store A: n={len(store_a)}, mean={store_a.mean():.2f}, std={store_a.std(ddof=1):.2f}")
print(f"Store B: n={len(store_b)}, mean={store_b.mean():.2f}, std={store_b.std(ddof=1):.2f}")
print(f"Difference: {store_a.mean() - store_b.mean():.2f} points")
# Check equal variances
_, p_levene = stats.levene(store_a, store_b)
equal_var = p_levene >= 0.05
print(f"\nLevene's test: p={p_levene:.4f} - {'Equal' if equal_var else 'Unequal'} variances")
# Independent t-test
t_stat, p_value = stats.ttest_ind(store_a, store_b, equal_var=equal_var)
print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Calculate Cohen's d
pooled_std = np.sqrt(((len(store_a)-1)*store_a.std(ddof=1)**2 +
(len(store_b)-1)*store_b.std(ddof=1)**2) /
(len(store_a) + len(store_b) - 2))
cohens_d = (store_a.mean() - store_b.mean()) / pooled_std
print(f"\nEffect Size (Cohen's d): {cohens_d:.4f}")
if abs(cohens_d) < 0.2:
    effect = "Negligible"
elif abs(cohens_d) < 0.5:
    effect = "Small"
elif abs(cohens_d) < 0.8:
    effect = "Medium"
else:
    effect = "Large"
print(f"Interpretation: {effect} effect")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("Stores differ significantly in satisfaction")
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("No significant difference between stores")
# Practical interpretation
print("\nPractical Interpretation:")
print(f"Store A scores {store_a.mean() - store_b.mean():.2f} points higher on average")
print(f"This represents a {effect.lower()} effect size")
print("Consider investigating what Store A does differently")
Problem: A weight loss program tracked 12 participants before and after 8 weeks. Conduct paired t-test and analyze individual changes. Before: [185, 192, 178, 203, 188, 195, 181, 199, 187, 194, 183, 190]. After: [180, 186, 175, 195, 182, 188, 177, 192, 182, 187, 179, 184].
Show Solution
import scipy.stats as stats
import numpy as np
# Data
before = np.array([185, 192, 178, 203, 188, 195, 181, 199, 187, 194, 183, 190])
after = np.array([180, 186, 175, 195, 182, 188, 177, 192, 182, 187, 179, 184])
print("Paired T-Test: Weight Loss Program")
print("=" * 60)
# Calculate differences
differences = before - after # Positive = weight lost
mean_diff = differences.mean()
std_diff = differences.std(ddof=1)
print(f"Participants: {len(differences)}")
print(f"Mean weight loss: {mean_diff:.2f} lbs")
print(f"Std of differences: {std_diff:.2f} lbs")
print(f"Range: {differences.min():.1f} to {differences.max():.1f} lbs")
# Paired t-test (one-tailed: expecting weight loss)
# Halving the two-tailed p-value is valid only when t_stat falls
# in the hypothesized direction (positive here, since differences = before - after)
t_stat, p_value_two = stats.ttest_rel(before, after)
p_value = p_value_two / 2
print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value (one-tailed): {p_value:.6f}")
# Individual results
print("\nIndividual Results:")
print("Participant | Before | After | Change")
print("-" * 45)
for i in range(len(before)):
    change = differences[i]
    print(f" {i+1:2d} | {before[i]:3d} | {after[i]:3d} | {change:+5.1f}")
print("-" * 45)
successful = np.sum(differences > 0)
print(f"{successful}/{len(differences)} participants lost weight")
# Calculate 95% CI for mean difference
ci = stats.t.interval(0.95, len(differences)-1, mean_diff,
std_diff/np.sqrt(len(differences)))
print(f"\n95% CI for weight loss: ({ci[0]:.2f}, {ci[1]:.2f}) lbs")
# Effect size (Cohen's d for paired samples)
cohens_d = mean_diff / std_diff
print(f"Cohen's d: {cohens_d:.4f}")
print(f"Effect size: {'Small' if abs(cohens_d) < 0.5 else 'Medium' if abs(cohens_d) < 0.8 else 'Large'}")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("Program produces significant weight loss")
    print(f"Average loss: {mean_diff:.2f} ± {std_diff:.2f} lbs")
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("No significant evidence of weight loss")
# Additional analysis
print("\nProgram Effectiveness:")
percent_change = (mean_diff / before.mean()) * 100
print(f"Average percent change: {percent_change:.1f}%")
print(f"Success rate: {(successful/len(differences))*100:.0f}% of participants")
print(f"Minimum loss: {differences.min():.1f} lbs")
print(f"Maximum loss: {differences.max():.1f} lbs")
Chi-Square and ANOVA
While t-tests compare means between one or two groups, real-world analytics often involves categorical data or comparisons across three or more groups. Chi-square tests handle categorical variables, testing relationships between categories or goodness-of-fit to expected distributions. ANOVA (Analysis of Variance) extends t-tests to compare means across multiple groups simultaneously, controlling for Type I error inflation that occurs when performing multiple pairwise comparisons.
Chi-Square Test of Independence
The chi-square test of independence determines whether two categorical variables are related. Use this test when you have cross-tabulated data and want to know if the distribution of one variable depends on the other. Common applications include testing if customer purchase behavior relates to demographics, whether survey responses differ by region, or if product defect rates vary across production lines.
The test compares observed frequencies in each cell of a contingency table to expected frequencies under the assumption of independence. The chi-square statistic χ² = Σ((O - E)² / E) sums squared standardized differences across all cells. Larger χ² values indicate greater departure from independence. The test requires expected frequencies of at least 5 in each cell for validity.
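The χ² formula can be verified by hand on a small hypothetical table; note that chi2_contingency applies a continuity correction only for 2×2 tables, so for the 2×3 table below the manual value matches exactly:

```python
import numpy as np
import scipy.stats as stats

# Small hypothetical 2x3 contingency table (rows: group, columns: response)
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

# Expected counts under independence: (row total × column total) / grand total
row_tot = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_tot = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()
expected = row_tot * col_tot / n                # broadcasts to shape (2, 3)

# χ² = Σ (O - E)² / E over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2_scipy, p, dof, exp_scipy = stats.chi2_contingency(observed)
print(chi2_manual)  # → about 3.111, same as chi2_scipy
```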
import scipy.stats as stats
import numpy as np
import pandas as pd
# Example: Testing if customer satisfaction relates to product category
# H0: Satisfaction and category are independent
# H1: Satisfaction and category are related
# Create contingency table
data = {
'Electronics': [45, 30, 15], # High, Medium, Low satisfaction
'Clothing': [35, 40, 25],
'HomeGoods': [40, 35, 25],
'Sports': [50, 25, 15]
}
df = pd.DataFrame(data, index=['High', 'Medium', 'Low'])
print("Chi-Square Test of Independence")
print("=" * 60)
print("\nContingency Table (Observed Frequencies):")
print(df)
print(f"\nTotal observations: {df.sum().sum()}")
# Perform chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(df)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.6f}")
# Show expected frequencies
expected_df = pd.DataFrame(expected,
index=df.index,
columns=df.columns)
print("\nExpected Frequencies (under independence):")
print(expected_df.round(2))
# Check minimum expected frequency assumption
min_expected = expected.min()
print(f"\nMinimum expected frequency: {min_expected:.2f}")
if min_expected < 5:
    print("WARNING: Some expected frequencies < 5, results may be unreliable")
else:
    print("Assumption met: All expected frequencies >= 5")
# Calculate effect size (Cramér's V)
n = df.sum().sum()
min_dim = min(df.shape[0] - 1, df.shape[1] - 1)
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"\nCramér's V: {cramers_v:.4f}")
print(f"Effect size: {'Small' if cramers_v < 0.3 else 'Medium' if cramers_v < 0.5 else 'Large'}")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("Satisfaction IS related to product category")
    print("\nPost-hoc analysis: Examine which cells differ most")
    # Calculate standardized residuals
    residuals = (df.values - expected) / np.sqrt(expected)
    residuals_df = pd.DataFrame(residuals,
                                index=df.index,
                                columns=df.columns)
    print("\nStandardized Residuals (|residual| > 2 suggests significant cell):")
    print(residuals_df.round(2))
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("No evidence of relationship between satisfaction and category")
Chi-Square Goodness-of-Fit Test
The chi-square goodness-of-fit test compares an observed frequency distribution to an expected distribution. Use this test when you have categorical data from a single variable and want to test if it matches a theoretical distribution or expected proportions. Examples include testing if dice are fair, if survey responses match population demographics, or if sales are evenly distributed across days of the week.
import scipy.stats as stats
import numpy as np
# Example: Testing if website visits are evenly distributed across weekdays
# H0: Visits are uniformly distributed (equal across days)
# H1: Visits are not uniformly distributed
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
observed = np.array([520, 480, 550, 510, 540]) # Actual visits
print("Chi-Square Goodness-of-Fit Test")
print("=" * 60)
print("\nObserved Frequencies:")
for day, count in zip(days, observed):
    print(f"{day:12s}: {count:4d} visits")
total = observed.sum()
print(f"\nTotal visits: {total}")
# Expected frequencies (uniform distribution)
expected = np.full(len(observed), total / len(observed))
print("\nExpected Frequencies (uniform distribution):")
for day, count in zip(days, expected):
    print(f"{day:12s}: {count:6.1f} visits")
# Perform goodness-of-fit test
chi2, p_value = stats.chisquare(observed, expected)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")
print(f"P-value: {p_value:.6f}")
# Calculate deviations
print("\nDeviations from Expected:")
for day, obs, exp in zip(days, observed, expected):
    deviation = obs - exp
    percent = (deviation / exp) * 100
    print(f"{day:12s}: {deviation:+6.1f} ({percent:+5.1f}%)")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("Visits are NOT uniformly distributed")
    print("Consider day-specific marketing strategies")
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("No evidence against uniform distribution")
    print("Traffic appears consistent across weekdays")
Chi-Square vs T-Test
Use chi-square tests for categorical data (counts, frequencies, categories). Use t-tests for continuous numerical data (measurements, scores, amounts). Chi-square answers "Are these categorical variables related?" while t-tests answer "Do these groups have different means?"
One-Way ANOVA
One-way ANOVA (Analysis of Variance) tests whether means differ across three or more independent groups. While you could perform multiple t-tests, this inflates Type I error (false positives). ANOVA tests all groups simultaneously while controlling error rate. Use ANOVA to compare multiple treatments, test differences across geographic regions, or evaluate performance across several time periods.
ANOVA partitions total variance into between-group variance (differences among group means) and within-group variance (variability within each group). The F-statistic is the ratio of between-group to within-group variance. Large F-values suggest means differ more than expected by chance. ANOVA assumes normality, equal variances (homoscedasticity), and independent observations.
ANOVA Logic
ANOVA tests if group means differ by comparing two sources of variability:
Between-group variance: How much do group means differ from the overall mean?
Within-group variance: How much do individual observations vary within each group?
F-statistic = Between-group variance / Within-group variance
If groups truly differ, between-group variance will be large relative to within-group variance, producing a large F-statistic and small p-value.
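The F-statistic described above can be assembled by hand and checked against scipy's f_oneway; the three small groups below are made up for illustration:

```python
import numpy as np
import scipy.stats as stats

# Three hypothetical groups
groups = [np.array([4.0, 5.0, 6.0, 5.5]),
          np.array([6.5, 7.0, 8.0, 7.5]),
          np.array([5.0, 5.5, 6.5, 6.0])]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group mean square: spread of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Within-group mean square: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, p_manual)  # matches f_scipy, p_scipy
```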
import scipy.stats as stats
import numpy as np
# Example: Comparing sales performance across 4 sales regions
# H0: μ₁ = μ₂ = μ₃ = μ₄ (all regions have equal mean sales)
# H1: At least one region differs
region_1 = np.array([45, 48, 52, 47, 50, 49, 51, 46, 53, 48])
region_2 = np.array([52, 55, 58, 54, 56, 53, 57, 55, 59, 54])
region_3 = np.array([48, 51, 47, 50, 49, 52, 48, 51, 50, 49])
region_4 = np.array([55, 58, 60, 57, 59, 56, 61, 58, 57, 59])
print("One-Way ANOVA: Sales Performance Across Regions")
print("=" * 60)
# Descriptive statistics
regions = [region_1, region_2, region_3, region_4]
region_names = ['Region 1', 'Region 2', 'Region 3', 'Region 4']
print("\nDescriptive Statistics:")
print(f"{'Region':<15} {'n':<5} {'Mean':<10} {'Std':<10}")
print("-" * 45)
for name, data in zip(region_names, regions):
    print(f"{name:<15} {len(data):<5} {data.mean():<10.2f} {data.std(ddof=1):<10.2f}")
# Overall statistics
all_data = np.concatenate(regions)
print(f"\n{'Overall':<15} {len(all_data):<5} {all_data.mean():<10.2f} {all_data.std(ddof=1):<10.2f}")
# Perform one-way ANOVA
f_stat, p_value = stats.f_oneway(region_1, region_2, region_3, region_4)
print(f"\nF-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.8f}")
# Calculate effect size (eta-squared)
# eta² = SS_between / SS_total
grand_mean = all_data.mean()
ss_between = sum(len(group) * (group.mean() - grand_mean)**2 for group in regions)
ss_total = sum((x - grand_mean)**2 for x in all_data)
eta_squared = ss_between / ss_total
print(f"\nEffect size (eta²): {eta_squared:.4f}")
print(f"Interpretation: {eta_squared*100:.1f}% of variance explained by region")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("At least one region differs significantly in sales")
    print("\nRECOMMENDATION: Perform post-hoc tests to identify which regions differ")
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("No significant differences among regions")
# Check ANOVA assumptions
print("\n" + "=" * 60)
print("ASSUMPTION CHECKS:")
print("=" * 60)
# 1. Normality (Shapiro-Wilk test for each group)
print("\n1. Normality Test (Shapiro-Wilk) - p > 0.05 suggests normality:")
for name, data in zip(region_names, regions):
    _, p = stats.shapiro(data)
    status = "✓ Normal" if p > 0.05 else "✗ Non-normal"
    print(f" {name}: p={p:.4f} {status}")
# 2. Homogeneity of variances (Levene's test)
print("\n2. Homogeneity of Variances (Levene's test):")
_, p_levene = stats.levene(*regions)
print(f" P-value: {p_levene:.4f}")
if p_levene > 0.05:
print(" ✓ Equal variances assumption met")
else:
print(" ✗ Variances differ - consider Welch's ANOVA")
Post-Hoc Tests
When ANOVA indicates significant differences, it doesn't tell you which specific groups differ. Post-hoc tests perform pairwise comparisons while controlling for multiple testing. Tukey's HSD (Honestly Significant Difference) is the most common post-hoc test, comparing all possible pairs of groups while maintaining the overall significance level. A simpler, more conservative alternative is to run pairwise t-tests with a Bonferroni-corrected significance level, which is the approach shown in the code below.
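SciPy ships an implementation of Tukey's HSD (scipy.stats.tukey_hsd, available in SciPy 1.8 and newer); a minimal sketch on three hypothetical groups structured like the region data:

```python
import numpy as np
from scipy.stats import tukey_hsd  # requires SciPy 1.8+

# Three hypothetical groups (e.g., sales figures from three regions)
g1 = np.array([45, 48, 52, 47, 50, 49])
g2 = np.array([52, 55, 58, 54, 56, 53])
g3 = np.array([48, 51, 47, 50, 49, 52])

res = tukey_hsd(g1, g2, g3)

# res.pvalue[i, j] holds the adjusted p-value for comparing group i with group j
for i in range(3):
    for j in range(i + 1, 3):
        print(f"group {i+1} vs group {j+1}: p = {res.pvalue[i, j]:.4f}")
```

res.confidence_interval() additionally returns simultaneous confidence intervals for each pairwise mean difference.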
import scipy.stats as stats
import numpy as np
from itertools import combinations
# Continuing from ANOVA example above
regions = [region_1, region_2, region_3, region_4]
region_names = ['Region 1', 'Region 2', 'Region 3', 'Region 4']
# If ANOVA was significant, perform pairwise comparisons
print("\nPost-Hoc Analysis: Pairwise Comparisons")
print("=" * 60)
# Bonferroni correction: divide alpha by number of comparisons
n_comparisons = len(list(combinations(range(4), 2)))
alpha_bonferroni = 0.05 / n_comparisons
print(f"\nNumber of pairwise comparisons: {n_comparisons}")
print(f"Bonferroni-corrected alpha: {alpha_bonferroni:.4f}")
print("\nPairwise T-Tests:")
print(f"{'Comparison':<25} {'Mean Diff':<12} {'P-value':<12} {'Significant?'}")
print("-" * 70)
for i, j in combinations(range(len(regions)), 2):
    group1, group2 = regions[i], regions[j]
    name1, name2 = region_names[i], region_names[j]
    # Independent t-test
    t_stat, p_val = stats.ttest_ind(group1, group2)
    mean_diff = group1.mean() - group2.mean()
    significant = "Yes" if p_val < alpha_bonferroni else "No"
    comparison = f"{name1} vs {name2}"
    print(f"{comparison:<25} {mean_diff:<12.2f} {p_val:<12.6f} {significant}")
# Identify which region performed best
best_region = region_names[np.argmax([r.mean() for r in regions])]
print(f"\nBest performing region: {best_region}")
print(f"Mean sales: {regions[region_names.index(best_region)].mean():.2f}")
# Business recommendations
print("\nBusiness Recommendations:")
print("1. Study best practices from top-performing regions")
print("2. Provide additional training to underperforming regions")
print("3. Consider resource reallocation based on performance")
print("4. Investigate external factors (market conditions, competition)")
Practice Questions: Chi-Square and ANOVA
Test categorical relationships and compare multiple groups.
Problem: Test if customer complaints are evenly distributed across product categories. Observed: Electronics=45, Clothing=38, Home=42, Sports=35.
Show Solution
import scipy.stats as stats
import numpy as np
# Data
categories = ['Electronics', 'Clothing', 'Home', 'Sports']
observed = np.array([45, 38, 42, 35])
print("Chi-Square Goodness-of-Fit: Complaint Distribution")
print("=" * 55)
# Expected frequencies (equal distribution)
total = observed.sum()
expected = np.full(len(observed), total / len(observed))
print(f"\nTotal complaints: {total}")
print("\nObserved vs Expected:")
print(f"{'Category':<15} {'Observed':<12} {'Expected':<12} {'Difference'}")
print("-" * 55)
for cat, obs, exp in zip(categories, observed, expected):
    diff = obs - exp
    print(f"{cat:<15} {obs:<12} {exp:<12.2f} {diff:+.2f}")
# Perform test
chi2, p_value = stats.chisquare(observed, expected)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {len(observed) - 1}")
print(f"P-value: {p_value:.4f}")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nReject H0 (p < {alpha})")
    print("Complaints NOT evenly distributed")
    print(f"\nCategory with most complaints: {categories[observed.argmax()]}")
else:
    print(f"\nFail to reject H0 (p >= {alpha})")
    print("Complaints appear evenly distributed")
Problem: Compare productivity scores across 3 work shifts. Morning: [85, 88, 82, 90, 87]. Afternoon: [78, 81, 79, 83, 80]. Night: [72, 75, 70, 76, 73]. Check assumptions and interpret.
Show Solution
import scipy.stats as stats
import numpy as np
# Data
morning = np.array([85, 88, 82, 90, 87])
afternoon = np.array([78, 81, 79, 83, 80])
night = np.array([72, 75, 70, 76, 73])
print("One-Way ANOVA: Productivity Across Shifts")
print("=" * 55)
# Descriptive statistics
shifts = [morning, afternoon, night]
shift_names = ['Morning', 'Afternoon', 'Night']
print("\nDescriptive Statistics:")
for name, data in zip(shift_names, shifts):
    print(f"{name}: mean={data.mean():.2f}, std={data.std(ddof=1):.2f}, n={len(data)}")
# Check assumptions
print("\nAssumption Checks:")
print("-" * 55)
# 1. Normality
print("\n1. Normality (Shapiro-Wilk):")
all_normal = True
for name, data in zip(shift_names, shifts):
    _, p = stats.shapiro(data)
    status = "Normal" if p > 0.05 else "Not normal"
    print(f" {name}: p={p:.4f} ({status})")
    if p <= 0.05:
        all_normal = False
# 2. Homogeneity of variances
print("\n2. Equal Variances (Levene's test):")
_, p_levene = stats.levene(*shifts)
print(f" P-value: {p_levene:.4f}")
equal_var = p_levene > 0.05
print(f" {'Equal variances' if equal_var else 'Unequal variances'}")
# Perform ANOVA
print("\nANOVA Results:")
print("-" * 55)
f_stat, p_value = stats.f_oneway(*shifts)
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.6f}")
# Effect size
grand_mean = np.concatenate(shifts).mean()
ss_between = sum(len(s) * (s.mean() - grand_mean)**2 for s in shifts)
ss_total = sum((x - grand_mean)**2 for x in np.concatenate(shifts))
eta_sq = ss_between / ss_total
print(f"Eta-squared: {eta_sq:.4f} ({eta_sq*100:.1f}% variance explained)")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("Shifts differ significantly in productivity")
    # Pairwise comparisons with Bonferroni correction
    print("\nPost-hoc Pairwise Comparisons:")
    alpha_corr = 0.05 / 3  # 3 comparisons
    comparisons = [
        ('Morning', 'Afternoon', morning, afternoon),
        ('Morning', 'Night', morning, night),
        ('Afternoon', 'Night', afternoon, night)
    ]
    for name1, name2, data1, data2 in comparisons:
        _, p = stats.ttest_ind(data1, data2)
        sig = "***" if p < alpha_corr else "ns"
        diff = data1.mean() - data2.mean()
        print(f" {name1} vs {name2}: diff={diff:+.2f}, p={p:.4f} {sig}")
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("No significant differences among shifts")
Problem: Test if marketing channel effectiveness varies by age group. Create contingency table and calculate Cramér's V. Age groups (18-25, 26-40, 41+) vs Channels (Social, Email, Search) with conversions: [[25,15,10], [30,35,20], [15,25,35]].
Show Solution
import scipy.stats as stats
import numpy as np
import pandas as pd

# Data: rows = age groups, columns = channels
data = np.array([[25, 15, 10],   # 18-25
                 [30, 35, 20],   # 26-40
                 [15, 25, 35]])  # 41+
df = pd.DataFrame(data,
                  index=['18-25', '26-40', '41+'],
                  columns=['Social', 'Email', 'Search'])

print("Chi-Square Test of Independence")
print("Marketing Channel Effectiveness by Age Group")
print("=" * 60)
print("\nContingency Table (Conversions):")
print(df)
print(f"\nTotal conversions: {df.sum().sum()}")

# Row and column totals
print("\nMarginal Totals:")
print("\nBy Age Group:")
for idx in df.index:
    print(f"  {idx}: {df.loc[idx].sum()}")
print("\nBy Channel:")
for col in df.columns:
    print(f"  {col}: {df[col].sum()}")

# Perform chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(df)
print("\nChi-Square Test Results:")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.6f}")

# Expected frequencies under independence
expected_df = pd.DataFrame(expected,
                           index=df.index,
                           columns=df.columns)
print("\nExpected Frequencies:")
print(expected_df.round(2))

# Check assumptions
min_expected = expected.min()
print(f"\nMinimum expected frequency: {min_expected:.2f}")
print(f"Assumption {'met' if min_expected >= 5 else 'VIOLATED'}: all cells >= 5")

# Calculate Cramér's V
n = df.sum().sum()
min_dim = min(df.shape[0] - 1, df.shape[1] - 1)
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"\nEffect Size (Cramér's V): {cramers_v:.4f}")
if cramers_v < 0.1:
    effect = "Negligible"
elif cramers_v < 0.3:
    effect = "Small"
elif cramers_v < 0.5:
    effect = "Medium"
else:
    effect = "Large"
print(f"Interpretation: {effect} association")

# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H0 (p < {alpha})")
    print("Channel effectiveness DOES vary by age group")

    # Standardized residuals highlight which cells drive the association
    residuals = (df.values - expected) / np.sqrt(expected)
    residuals_df = pd.DataFrame(residuals,
                                index=df.index,
                                columns=df.columns)
    print("\nStandardized Residuals (|value| > 2 = significant cell):")
    print(residuals_df.round(2))

    # Interpret patterns
    print("\nPattern Analysis:")
    print("18-25: Strong preference for Social media")
    print("26-40: Balanced across all channels")
    print("41+: Strong preference for Search")
    print("\nMarketing Recommendations:")
    print("- Target younger audiences (18-25) via Social media")
    print("- Use Search advertising for older audiences (41+)")
    print("- Email works consistently across all age groups")
else:
    print(f"\nDecision: Fail to reject H0 (p >= {alpha})")
    print("No evidence that channel effectiveness varies by age")
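Recent SciPy versions (1.7 and later) also ship a built-in helper for this effect size in `scipy.stats.contingency`; a quick cross-check of the manual Cramér's V formula above:

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import association

data = np.array([[25, 15, 10],
                 [30, 35, 20],
                 [15, 25, 35]])

# Manual Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
chi2, p, dof, expected = stats.chi2_contingency(data)
v_manual = np.sqrt(chi2 / (data.sum() * (min(data.shape) - 1)))

# SciPy's built-in equivalent (SciPy >= 1.7)
v_scipy = association(data, method="cramer")
print(f"manual={v_manual:.4f}, scipy={v_scipy:.4f}")
```

Both approaches give the same value for tables larger than 2x2, where no continuity correction applies.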
Key Takeaways
Hypothesis Framework
Hypothesis testing provides a structured framework for making statistical decisions based on sample data, allowing us to quantify uncertainty and draw evidence-based conclusions
P-Value Interpretation
The p-value is the probability of observing results at least as extreme as yours, assuming the null hypothesis is true; lower p-values provide stronger evidence against the null hypothesis
Significance Level
The significance level (alpha) is the threshold for decision-making, typically set at 0.05, representing the acceptable probability of making a Type I error (false positive)
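This interpretation of alpha can be verified directly by simulation: when the null hypothesis is actually true, roughly 5% of tests should still reject it. A minimal sketch with synthetic data:

```python
import numpy as np
from scipy import stats

# When H0 is true (both samples drawn from the same population),
# the false-positive rate should land close to alpha by construction
rng = np.random.default_rng(0)
alpha = 0.05
n_sims = 2000
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(100, 15, 30)
    b = rng.normal(100, 15, 30)  # same distribution: H0 is true
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1
print(f"Type I error rate: {false_positives / n_sims:.3f} (target {alpha})")
```

The simulated rate hovers around 0.05, which is exactly what "acceptable probability of a false positive" means in practice.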
Test Selection
Choose the appropriate test based on your data type and research question: t-tests for means, chi-square for categorical relationships, and ANOVA for multiple group comparisons
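In SciPy these three choices map to three calls; a minimal sketch with synthetic data (the numbers are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
g1, g2, g3 = (rng.normal(50, 5, 40) for _ in range(3))

# Two group means -> independent-samples t-test
t, p_t = stats.ttest_ind(g1, g2)

# Three or more group means -> one-way ANOVA
f, p_f = stats.f_oneway(g1, g2, g3)

# Two categorical variables -> chi-square test of independence
table = np.array([[30, 20], [25, 35]])
chi2, p_c, dof, _ = stats.chi2_contingency(table)
print(f"t-test p={p_t:.3f}, ANOVA p={p_f:.3f}, chi-square p={p_c:.3f}")
```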
Type I vs Type II Errors
Type I error (false positive) occurs when rejecting a true null hypothesis, while Type II error (false negative) happens when failing to reject a false null hypothesis
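Unlike the Type I rate, which is fixed by alpha, the Type II rate (beta) depends on the true effect size and sample size; power is 1 - beta. A minimal simulation sketch with synthetic data:

```python
import numpy as np
from scipy import stats

# Here H0 is false: the populations really differ by 12 points
rng = np.random.default_rng(1)
alpha, n_sims = 0.05, 2000
misses = 0
for _ in range(n_sims):
    a = rng.normal(100, 15, 30)
    b = rng.normal(112, 15, 30)  # true difference of 12 (d = 0.8)
    _, p = stats.ttest_ind(a, b)
    if p >= alpha:  # failing to reject a false H0 is a Type II error
        misses += 1
beta = misses / n_sims
print(f"Type II error rate: {beta:.3f}, power: {1 - beta:.3f}")
```

Shrinking the true difference or the sample size in this sketch drives beta up, which is why underpowered studies so often "fail to find" real effects.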
Business Applications
Hypothesis testing drives critical business decisions from A/B testing website changes to quality control in manufacturing, marketing campaign effectiveness, and product development
Knowledge Check
Test your understanding of hypothesis testing:
What does the null hypothesis (H0) typically represent in hypothesis testing?
If a hypothesis test yields a p-value of 0.03 and the significance level is 0.05, what should you conclude?
What is a Type I error in hypothesis testing?
Which test would you use to compare the average sales performance between two different store locations?
What does ANOVA test for when comparing multiple groups?
A company tests a new website design using A/B testing. With p-value = 0.12 and alpha = 0.05, what should they conclude?