Assignment 6-A

Statistical Analysis Report

Apply your statistical analysis skills to analyze real-world healthcare data, perform hypothesis testing, and generate actionable insights that could improve patient outcomes.

6-8 hours
Advanced
100 Points
What You'll Practice
  • Descriptive statistics analysis
  • Hypothesis testing (t-test, ANOVA)
  • Chi-square tests
  • Confidence intervals
  • Statistical reporting

Assignment Overview

In this comprehensive assignment, you will work as a Healthcare Data Analyst at HealthFirst Medical Center. The hospital administration needs you to analyze patient data, lab results, and treatment outcomes to identify patterns that could improve healthcare quality and patient satisfaction.

Objectives
  • Calculate descriptive statistics for patient metrics
  • Perform hypothesis tests to compare groups
  • Construct confidence intervals
  • Conduct chi-square tests for categorical data
  • Analyze treatment effectiveness
  • Create a professional statistical report
Skills Tested
  • scipy.stats for hypothesis testing
  • NumPy for numerical calculations
  • Pandas for data manipulation
  • Statistical interpretation skills
  • Visualization of statistical results
  • Scientific writing and reporting
Deliverables
  • statistical_analysis.ipynb (notebook)
  • statistical_report.pdf
  • visualizations/ folder (6 charts)
  • README.md

The Scenario

📧 Email from Dr. Sarah Chen, Chief Medical Officer

"Welcome to HealthFirst Medical Center! We've been collecting patient data for the past year, and our board is asking tough questions about treatment effectiveness and resource allocation. We need your statistical expertise to provide evidence-based answers.

Specifically, I need you to investigate:

  1. Patient Demographics - What are the key characteristics of our patient population?
  2. Treatment Comparisons - Is Treatment B really more effective than Treatment A?
  3. Lab Result Patterns - Do certain patient groups show significantly different lab values?
  4. Recovery Time Analysis - What factors affect how quickly patients recover?

The board meets next month, and I need a comprehensive statistical report with clear visualizations and actionable recommendations. Please use proper statistical methods and clearly state your hypotheses, test results, and conclusions.

Looking forward to your analysis!"

- Dr. Sarah Chen, Chief Medical Officer

The Datasets

Download the three datasets below. These contain anonymized patient data from HealthFirst Medical Center.

healthfirst_patients.csv

Patient admission records including demographics, department, charges, vitals, and outcomes.

350 patients · Admission data · Clean dataset
Columns:
  • patient_id, admission_date, age, gender, department
  • admission_type, length_of_stay, total_charges, insurance_type
  • bp_systolic, heart_rate, satisfaction_score, readmitted_30day, discharge_disposition
healthfirst_lab_results.csv

Laboratory test results with values, normal ranges, and status indicators for various tests.

500 records · Lab data · Linked to patients
Columns:
  • result_id, patient_id, test_date, test_name
  • result_value, unit, normal_range_low, normal_range_high
  • status (Normal/Low/High), ordering_physician
healthfirst_treatments.csv

Treatment records including medications, procedures, dosages, and patient outcomes.

400 records · Treatment data · Outcome measures
Columns:
  • treatment_id, patient_id, treatment_date, treatment_type
  • treatment_name, dosage, frequency, duration
  • cost, outcome (Improved/Stable/Resolved), side_effects_reported

Requirements

Task 1 Descriptive Statistics Analysis (15 points)

In your statistical_analysis.ipynb notebook:
  1. Load and merge all three datasets:
    • Use patient_id to join the datasets
    • Handle any missing values appropriately
    • Document your data cleaning decisions
  2. Calculate summary statistics for numerical variables:
    • Mean, median, mode, standard deviation for age, length_of_stay, total_charges
    • Skewness and kurtosis for bp_systolic and heart_rate
    • Quartiles and interquartile range (IQR) for treatment costs
    • Identify outliers in total_charges using the IQR method
  3. Analyze categorical variable distributions:
    • Frequency tables for gender, department, admission_type, insurance_type
    • Cross-tabulations between department and discharge_disposition
    • Proportions and percentages for treatment outcomes
  4. Create visualization:
    • Histogram of age distribution with normal curve overlay
    • Box plots comparing total_charges across departments
    • Export as descriptive_stats.png
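
As a starting point for steps 1 and 2, here is a minimal sketch of a checked merge and the IQR outlier rule. The tiny tables and charge values are made up for illustration and stand in for the real CSVs, and `iqr_fences` is a hypothetical helper name, not part of any library.

```python
import pandas as pd
from scipy import stats

# Tiny made-up tables standing in for the real CSVs (illustrative values only).
patients = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "total_charges": [1200.0, 1150.0, 9000.0],
})
treatments = pd.DataFrame({
    "treatment_id": ["T1", "T2", "T3"],
    "patient_id": ["P001", "P001", "P002"],
    "cost": [300.0, 120.0, 80.0],
})

# One patient can have many treatments; validate= raises an error if
# patient_id is unexpectedly duplicated on the patients side.
merged = patients.merge(treatments, on="patient_id", how="left",
                        validate="one_to_many")


def iqr_fences(values, k=1.5):
    """Return (lower, upper) outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical charges; in the notebook use patients["total_charges"].
charges = pd.Series([1000, 1200, 1100, 1300, 1250, 1150, 9000],
                    dtype=float, name="total_charges")
lower, upper = iqr_fences(charges)
outliers = charges[(charges < lower) | (charges > upper)]

summary = {
    "mean": charges.mean(),
    "median": charges.median(),
    "std": charges.std(),                        # sample SD (ddof=1)
    "skewness": stats.skew(charges),
    "excess_kurtosis": stats.kurtosis(charges),  # 0 for a normal distribution
}
print(merged)
print(f"fences: ({lower}, {upper}); outliers: {outliers.tolist()}")
print(summary)
```

Note that a left merge repeats each patient's row once per treatment, so patient-level statistics such as mean total_charges are safer to compute on the patients table than on the merged one.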

Task 2 One-Way ANOVA Analysis (20 points)

Test: Does length of stay differ by hospital department?
  1. State your hypotheses:
    • H₀: μ₁ = μ₂ = μ₃ = ... (mean length of stay is equal across departments)
    • H₁: At least one department mean is different
  2. Check ANOVA assumptions:
    • Normality (Shapiro-Wilk test for each department group)
    • Homogeneity of variances (Levene's test)
    • Independence (discuss based on data collection)
  3. Perform the one-way ANOVA:
    • Use scipy.stats.f_oneway()
    • Report F-statistic and p-value
    • Calculate effect size (eta-squared: η²)
  4. Post-hoc analysis (if significant):
    • Tukey's HSD test for pairwise comparisons
    • Report which departments differ significantly
  5. Create visualization:
    • Box plots with individual data points by department
    • Include significance bars for post-hoc results
    • Export as anova_length_of_stay.png
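
The steps above could be sketched roughly as follows. The three department names and the randomly generated length-of-stay values are assumptions for illustration only. Eta-squared is computed by hand as SS_between / SS_total, and `scipy.stats.tukey_hsd` requires SciPy 1.8 or newer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic length_of_stay samples (days) for three hypothetical departments.
groups = {
    "Cardiology": rng.normal(6.0, 2.0, size=40),
    "Orthopedics": rng.normal(5.0, 2.0, size=40),
    "Neurology": rng.normal(7.5, 2.0, size=40),
}
samples = list(groups.values())

# Assumption checks: normality within each group, then equal variances.
shapiro_p = {name: stats.shapiro(x).pvalue for name, x in groups.items()}
levene_stat, levene_p = stats.levene(*samples)

# One-way ANOVA.
f_stat, p_value = stats.f_oneway(*samples)

# Effect size: eta-squared = SS_between / SS_total.
all_values = np.concatenate(samples)
grand_mean = all_values.mean()
ss_between = sum(len(x) * (x.mean() - grand_mean) ** 2 for x in samples)
ss_total = ((all_values - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

# Post-hoc pairwise comparisons; tukey.pvalue is a k-by-k matrix.
tukey = stats.tukey_hsd(*samples)

print(f"Shapiro p-values: {shapiro_p}")
print(f"Levene p = {levene_p:.3f}")
print(f"F = {f_stat:.2f}, p = {p_value:.4g}, eta^2 = {eta_squared:.3f}")
print(tukey)
```

If Levene's test or the Shapiro-Wilk tests suggest the assumptions do not hold, say so in the report and consider a non-parametric alternative alongside the ANOVA result.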

Task 3 Independent Samples T-Test (15 points)

Test: Is there a significant difference in satisfaction scores between male and female patients?
  1. State your hypotheses:
    • H₀: μ_male = μ_female (no difference in mean satisfaction scores)
    • H₁: μ_male ≠ μ_female (two-tailed test)
  2. Check t-test assumptions:
    • Normality of each group
    • Equal variances (Levene's test)
    • Use Welch's t-test if variances are unequal
  3. Perform the t-test:
    • Use scipy.stats.ttest_ind()
    • Report t-statistic, degrees of freedom, and p-value
    • Calculate effect size (Cohen's d)
  4. Construct 95% confidence interval:
    • CI for the difference in means
    • Interpret the interval in context
  5. Create visualization:
    • Overlapping distributions for both groups
    • Vertical lines showing means
    • Export as ttest_satisfaction.png
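
One way to approach steps 2 to 4 is sketched below with synthetic scores; the group means, spreads, and sizes are assumptions, not values from the dataset. The degrees of freedom and the confidence interval are computed by hand so the sketch does not depend on a particular SciPy version, and Cohen's d uses the pooled standard deviation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic satisfaction scores; replace with the real male/female columns.
male = rng.normal(7.2, 1.5, size=60)
female = rng.normal(7.6, 1.8, size=70)

# Levene's test decides between Student's (pooled) and Welch's t-test.
levene_p = stats.levene(male, female).pvalue
equal_var = bool(levene_p >= 0.05)
result = stats.ttest_ind(male, female, equal_var=equal_var)

n1, n2 = len(male), len(female)
v1, v2 = male.var(ddof=1), female.var(ddof=1)
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)

if equal_var:
    # Student's t-test: pooled standard error, n1 + n2 - 2 degrees of freedom.
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    dof = n1 + n2 - 2
else:
    # Welch's t-test: unpooled standard error, Welch-Satterthwaite df.
    se = np.sqrt(v1 / n1 + v2 / n2)
    dof = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )

# 95% CI for the difference in means (male minus female).
diff = male.mean() - female.mean()
t_crit = stats.t.ppf(0.975, dof)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

# Cohen's d with the pooled standard deviation.
cohens_d = diff / np.sqrt(pooled_var)

print(f"t = {result.statistic:.3f}, df = {dof:.1f}, p = {result.pvalue:.4f}")
print(f"diff = {diff:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"Cohen's d = {cohens_d:.3f}")
```

When interpreting the interval, state the direction explicitly: a negative difference here means the female group has the higher mean.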

Task 4 Chi-Square Test of Independence (15 points)

Test: Is there an association between insurance type and 30-day readmission?
  1. State your hypotheses:
    • H₀: Insurance type and readmission are independent
    • H₁: There is an association between insurance type and readmission
  2. Create contingency table:
    • Rows: Insurance type (Private, Medicare, Medicaid, Self-Pay)
    • Columns: Readmitted within 30 days (Yes, No)
    • Include row and column totals
  3. Calculate expected frequencies:
    • E = (row total × column total) / grand total
    • Verify all expected counts ≥ 5
  4. Perform chi-square test:
    • Use scipy.stats.chi2_contingency()
    • Report χ², degrees of freedom, and p-value
    • Calculate Cramér's V for effect size
  5. Create visualization:
    • Stacked or grouped bar chart
    • Heatmap of residuals (observed - expected)
    • Export as chisquare_insurance.png
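
A minimal sketch of steps 2 to 4 follows. The counts in `observed` are hypothetical and only illustrate the shape of the table; in the notebook the table would come from something like `pd.crosstab(df["insurance_type"], df["readmitted_30day"])`. Cramér's V is computed by hand as sqrt(χ² / (n · (min(rows, cols) − 1))).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical counts, for illustration only.
observed = pd.DataFrame(
    {"Yes": [12, 20, 18, 10], "No": [108, 80, 62, 40]},
    index=["Private", "Medicare", "Medicaid", "Self-Pay"],
)

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the table of expected frequencies under independence.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Assumption check: every expected count should be at least 5.
all_expected_ok = bool((expected >= 5).all())

# Effect size: Cramér's V.
n = observed.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

# Residuals for the heatmap: raw (observed - expected) and Pearson.
raw_residuals = observed - expected
pearson_residuals = raw_residuals / np.sqrt(expected)

print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p_value:.4f}")
print(f"Cramér's V = {cramers_v:.3f}, expected counts OK: {all_expected_ok}")
print(pearson_residuals.round(2))
```

Pearson residuals are often easier to read in a heatmap than raw differences because they are scaled by the expected count.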

Task 5 Correlation and Regression Analysis (15 points)

Analyze relationships between patient metrics and outcomes:
  1. Calculate correlation matrix:
    • Pearson correlation for numerical variables
    • Include: age, length_of_stay, total_charges, bp_systolic, heart_rate, satisfaction_score
    • Test significance of each correlation (p-values)
  2. Perform simple linear regression:
    • Predictor: length_of_stay
    • Outcome: total_charges
    • Report slope, intercept, R², and p-value
  3. Check regression assumptions:
    • Linearity (residual plot)
    • Normality of residuals
    • Homoscedasticity
  4. Create visualizations:
    • Correlation heatmap with annotations
    • Scatter plot with regression line and CI band
    • Export as correlation_analysis.png
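
The regression steps might look like the sketch below. The data are simulated with an assumed linear relationship (a made-up slope of 1500 per day plus noise), so the numbers are placeholders rather than results from the HealthFirst data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 200
# Simulated data with an assumed linear relationship plus noise.
length_of_stay = rng.integers(1, 15, size=n).astype(float)
total_charges = 2000 + 1500 * length_of_stay + rng.normal(0, 1000, size=n)
df = pd.DataFrame({"length_of_stay": length_of_stay,
                   "total_charges": total_charges})

# Pearson correlation with its p-value.
r, r_p = stats.pearsonr(df["length_of_stay"], df["total_charges"])

# Simple linear regression: total_charges ~ length_of_stay.
fit = stats.linregress(df["length_of_stay"], df["total_charges"])
r_squared = fit.rvalue ** 2

# Residuals for the assumption checks (linearity, normality, homoscedasticity).
predicted = fit.intercept + fit.slope * df["length_of_stay"]
residuals = df["total_charges"] - predicted
residual_normality_p = stats.shapiro(residuals).pvalue

print(f"r = {r:.3f} (p = {r_p:.3g})")
print(f"slope = {fit.slope:.1f}, intercept = {fit.intercept:.1f}, "
      f"R^2 = {r_squared:.3f}, p = {fit.pvalue:.3g}")
print(f"Shapiro-Wilk on residuals: p = {residual_normality_p:.3f}")
```

For the full correlation matrix, `df.corr()` gives the coefficients, while the p-values need a loop over column pairs with `stats.pearsonr`. Plotting `residuals` against `predicted` covers both the linearity and homoscedasticity checks.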

Task 6 Comprehensive Treatment Analysis (20 points)

Compare treatment outcomes across patient groups:
  1. Two-Way ANOVA:
    • Factors: Treatment type × Admission type (Emergency/Elective/Urgent)
    • Outcome: Treatment cost
    • Test for main effects and interaction
    • Report all F-statistics and p-values
  2. Calculate confidence intervals:
    • 95% CI for mean total_charges by department
    • 95% CI for proportion with side_effects_reported
    • Display intervals graphically
  3. Non-parametric alternative:
    • Kruskal-Wallis test for satisfaction_score by treatment outcome
    • Compare with ANOVA results
    • Discuss when to use non-parametric tests
  4. Create visualizations:
    • Interaction plot for two-way ANOVA
    • Forest plot showing CIs for all treatment types
    • Export as treatment_analysis.png
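
For steps 2 and 3, the sketch below shows a t-based interval for a mean, a normal-approximation (Wald) interval for a proportion, and a Kruskal-Wallis test. `mean_ci` and `proportion_ci` are hypothetical helper names, and all inputs are made up. The two-way ANOVA is left out because SciPy has no function for it; it usually needs an additional package such as statsmodels.

```python
import numpy as np
from scipy import stats


def mean_ci(values, confidence=0.95):
    """t-based confidence interval for a mean: returns (mean, low, high)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = stats.sem(values)  # standard error of the mean (ddof=1)
    low, high = stats.t.interval(confidence, len(values) - 1,
                                 loc=mean, scale=sem)
    return mean, low, high


def proportion_ci(successes, n, confidence=0.95):
    """Wald (normal-approximation) interval: returns (p_hat, low, high)."""
    p_hat = successes / n
    z = stats.norm.ppf((1 + confidence) / 2)
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin


# Hypothetical inputs: 40 of 400 treatment records with side effects reported.
p_hat, p_low, p_high = proportion_ci(40, 400)
m, m_low, m_high = mean_ci([4.0, 5.0, 6.0])

# Kruskal-Wallis on synthetic satisfaction scores by treatment outcome.
rng = np.random.default_rng(1)
improved = rng.normal(7.5, 1.5, size=50)
stable = rng.normal(7.0, 1.5, size=50)
resolved = rng.normal(8.0, 1.5, size=50)
h_stat, kw_p = stats.kruskal(improved, stable, resolved)

print(f"proportion = {p_hat:.3f}, 95% CI = ({p_low:.3f}, {p_high:.3f})")
print(f"mean = {m:.2f}, 95% CI = ({m_low:.2f}, {m_high:.2f})")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {kw_p:.4f}")
```

The Wald interval is the simplest choice but can behave poorly for small samples or proportions near 0 or 1; if that applies to your data, mention the limitation or use an alternative interval.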

Grading Rubric

Component | Points | Criteria
Descriptive Statistics | 15 | Complete summary statistics, proper interpretation, quality visualization
One-Way ANOVA | 20 | Correct assumptions testing, proper test execution, effect size calculated, post-hoc if needed
T-Test Analysis | 15 | Correct test selection, CI construction, effect size (Cohen's d), clear interpretation
Chi-Square Test | 15 | Proper contingency table, expected frequencies verified, Cramér's V calculated
Correlation Analysis | 15 | Correlation matrix complete, regression assumptions checked, quality heatmap
Treatment Analysis | 20 | Two-way ANOVA correct, CIs properly constructed, non-parametric comparison included
Total | 100 |
Deductions
  • -5 points: Missing hypothesis statements or incorrect test selection
  • -5 points: Assumptions not checked before parametric tests
  • -5 points: Missing effect sizes or confidence intervals
  • -5 points: Visualizations missing or low quality
  • -10 points: Incorrect statistical interpretations
Bonus Points (up to 10)
  • +3 points: Power analysis included for sample size justification
  • +3 points: Bootstrap confidence intervals as alternative method
  • +2 points: Multiple comparison correction (Bonferroni, FDR) applied
  • +2 points: Exceptionally clear statistical report suitable for non-technical audience
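
For the bootstrap and multiple-comparison bonuses, a NumPy-only sketch is shown below. It uses the percentile bootstrap and the Bonferroni adjustment; `bootstrap_mean_ci` and `bonferroni` are hypothetical helper names, and the sample and p-values are made up.

```python
import numpy as np


def bootstrap_mean_ci(values, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row of idx is one resample (with replacement) of the original data.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    boot_means = values[idx].mean(axis=1)
    alpha = 1 - confidence
    low, high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return low, high


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * len(p), 1.0)


# Made-up skewed sample, e.g. standing in for recovery times in days.
rng = np.random.default_rng(3)
sample = rng.exponential(scale=5.0, size=100)
boot_low, boot_high = bootstrap_mean_ci(sample)

adjusted = bonferroni([0.01, 0.04, 0.5])

print(f"sample mean = {sample.mean():.2f}, "
      f"bootstrap 95% CI = ({boot_low:.2f}, {boot_high:.2f})")
print(f"Bonferroni-adjusted p-values: {adjusted}")
```

The bootstrap interval makes no normality assumption, which is why it is a useful cross-check against the t-based interval for skewed variables.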

Submission

Create a public GitHub repository with the exact name shown below, add all required files, and submit through the submission portal.

github.com/<your-username>/healthfirst-analysis
Required Repository Structure:
healthfirst-analysis/
├── statistical_analysis.ipynb
├── statistical_report.pdf
├── visualizations/
│   ├── descriptive_stats.png
│   ├── anova_length_of_stay.png
│   ├── ttest_satisfaction.png
│   ├── chisquare_insurance.png
│   ├── correlation_analysis.png
│   └── treatment_analysis.png
├── data/
│   └── (downloaded CSV files)
└── README.md
Required Files Checklist:
  • statistical_analysis.ipynb
  • statistical_report.pdf
  • descriptive_stats.png
  • anova_length_of_stay.png
  • ttest_satisfaction.png
  • chisquare_insurance.png
  • correlation_analysis.png
  • treatment_analysis.png
  • README.md

All files are required. Submission will fail if any file is missing.

Pro Tips

Statistical Best Practices
  • Always check assumptions before using parametric tests
  • Report effect sizes along with p-values
  • Use exact p-values (p = 0.023) not just thresholds (p < 0.05)
  • Consider practical significance, not just statistical significance
Code Organization
  • Create helper functions for repeated calculations
  • Use meaningful variable names (e.g., ttest_result not t)
  • Add markdown cells explaining your reasoning
  • Include references to statistical formulas used
Report Writing
  • Structure: Introduction → Methods → Results → Discussion
  • Use tables for test statistics and p-values
  • Include confidence intervals in your conclusions
  • Write for a non-technical audience (the hospital board)
Common Mistakes
  • Using parametric tests on non-normal data without checking
  • Forgetting to correct for multiple comparisons
  • Interpreting correlation as causation
  • Not reporting the direction of differences (which group is higher?)
