Assignment 6: Statistical Analysis Report | Data Science Course

Assignment Overview

In this comprehensive assignment, you will work as a Healthcare Data Analyst at HealthFirst Medical Center. The hospital administration needs you to analyze patient data, lab results, and treatment outcomes to identify patterns that could improve healthcare quality and patient satisfaction.

Objectives

Calculate descriptive statistics for patient metrics
Perform hypothesis tests to compare groups
Construct confidence intervals
Conduct chi-square tests for categorical data
Analyze treatment effectiveness
Create a professional statistical report

Skills Tested

scipy.stats for hypothesis testing
NumPy for numerical calculations
Pandas for data manipulation
Statistical interpretation skills
Visualization of statistical results
Scientific writing and reporting

Deliverables

statistical_analysis.ipynb (notebook)
statistical_report.pdf
visualizations/ folder (6 charts)
README.md

The Scenario

📧 Email from Dr. Sarah Chen, Chief Medical Officer

"Welcome to HealthFirst Medical Center! We've been collecting patient data for the past year, and our board is asking tough questions about treatment effectiveness and resource allocation. We need your statistical expertise to provide evidence-based answers.

Specifically, I need you to investigate:

Patient Demographics - What are the key characteristics of our patient population?
Treatment Comparisons - Is Treatment B really more effective than Treatment A?
Lab Result Patterns - Do certain patient groups show significantly different lab values?
Recovery Time Analysis - What factors affect how quickly patients recover?

The board meets next month, and I need a comprehensive statistical report with clear visualizations and actionable recommendations. Please use proper statistical methods and clearly state your hypotheses, test results, and conclusions.

Looking forward to your analysis!"

- Dr. Sarah Chen, Chief Medical Officer

The Datasets

Download the three datasets below. These contain anonymized patient data from HealthFirst Medical Center.

healthfirst_patients.csv

Patient admission records including demographics, department, charges, vitals, and outcomes.

350 patients | Admission data | Clean dataset

Columns:

patient_id, admission_date, age, gender, department
admission_type, length_of_stay, total_charges, insurance_type
bp_systolic, heart_rate, satisfaction_score, readmitted_30day, discharge_disposition

healthfirst_lab_results.csv

Laboratory test results with values, normal ranges, and status indicators for various tests.

500 records | Lab data | Linked to patients

Columns:

result_id, patient_id, test_date, test_name
result_value, unit, normal_range_low, normal_range_high
status (Normal/Low/High), ordering_physician

healthfirst_treatments.csv

Treatment records including medications, procedures, dosages, and patient outcomes.

400 records | Treatment data | Outcome measures

Columns:

treatment_id, patient_id, treatment_date, treatment_type
treatment_name, dosage, frequency, duration
cost, outcome (Improved/Stable/Resolved), side_effects_reported

Requirements

Task 1 Descriptive Statistics Analysis (15 points)

In your statistical_analysis.ipynb notebook:

Load and merge all three datasets:
- Use patient_id to join the datasets
- Handle any missing values appropriately
- Document your data cleaning decisions
Calculate summary statistics for numerical variables:
- Mean, median, mode, standard deviation for age, length_of_stay, total_charges
- Skewness and kurtosis for bp_systolic and heart_rate
- Quartiles and interquartile range (IQR) for treatment costs
- Identify outliers in total_charges using the IQR method
Analyze categorical variable distributions:
- Frequency tables for gender, department, admission_type, insurance_type
- Cross-tabulations between department and discharge_disposition
- Proportions and percentages for treatment outcomes
Create visualizations (combined into one figure):
- Histogram of age distribution with normal curve overlay
- Box plots comparing total_charges across departments
- Export as descriptive_stats.png
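
The merge and IQR steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the two small DataFrames are hypothetical stand-ins for the real CSV files, and the helper name iqr_outlier_mask and the conventional multiplier k = 1.5 are assumptions. Because one patient can have several treatment rows, the sketch computes patient-level statistics on the patient table rather than on the merged table.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the real CSV files.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "total_charges": [1200, 1500, 1700, 1800, 2100, 2300, 2500, 2600, 50000],
})
treatments = pd.DataFrame({
    "patient_id": [1, 1, 2, 9],
    "cost": [300.0, 450.0, 120.0, 900.0],
})

# One patient can have many treatments. validate= raises an error if
# patient_id is unexpectedly duplicated on the patient side.
merged = patients.merge(treatments, on="patient_id", how="left",
                        validate="one_to_many")


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Patient-level statistics use the patients table, not the merged one,
# so patients with several treatments are not counted more than once.
charges = patients["total_charges"]
outliers = charges[iqr_outlier_mask(charges)]

print("rows after merge:", len(merged))
print("mean:", charges.mean(), "median:", charges.median())
print("std:", charges.std(ddof=1), "skewness:", charges.skew())
print("outliers:", outliers.tolist())
```

A left merge keeps patients without treatment records (their cost is NaN), which is one data cleaning decision worth documenting in the notebook.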

Task 2 One-Way ANOVA Analysis (20 points)

Test: Does length of stay differ by hospital department?

State your hypotheses:
- H₀: μ₁ = μ₂ = μ₃ = ... (mean length of stay is equal across departments)
- H₁: At least one department mean is different
Check ANOVA assumptions:
- Normality (Shapiro-Wilk test for each department group)
- Homogeneity of variances (Levene's test)
- Independence (discuss based on data collection)
Perform the one-way ANOVA:
- Use scipy.stats.f_oneway()
- Report F-statistic and p-value
- Calculate effect size (eta-squared: η²)
Post-hoc analysis (if significant):
- Tukey's HSD test for pairwise comparisons
- Report which departments differ significantly
Create visualization:
- Box plots with individual data points by department
- Include significance bars for post-hoc results
- Export as anova_length_of_stay.png
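
One possible shape for the assumption checks, the ANOVA itself, and the effect size is sketched below. The department names and the simulated length_of_stay values are hypothetical. Eta-squared is not returned by scipy.stats.f_oneway(), so the sketch computes it directly as the between-group sum of squares divided by the total sum of squares.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical length_of_stay samples for three departments (simulated).
groups = {
    "Cardiology": rng.normal(5.0, 1.5, 40),
    "Orthopedics": rng.normal(6.0, 1.5, 40),
    "Pediatrics": rng.normal(4.0, 1.5, 40),
}
samples = list(groups.values())

# Normality: Shapiro-Wilk test within each group.
for name, values in groups.items():
    w_stat, w_p = stats.shapiro(values)
    print(f"{name}: W = {w_stat:.3f}, p = {w_p:.3f}")

# Homogeneity of variances: Levene's test across groups.
levene_stat, levene_p = stats.levene(*samples)
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")

# One-way ANOVA.
f_stat, p_value = stats.f_oneway(*samples)

# Effect size: eta-squared = SS_between / SS_total.
all_values = np.concatenate(samples)
grand_mean = all_values.mean()
ss_between = sum(len(x) * (x.mean() - grand_mean) ** 2 for x in samples)
ss_total = ((all_values - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.3f}, p = {p_value:.4g}, eta-squared = {eta_squared:.3f}")
```

For the post-hoc step, recent SciPy releases provide scipy.stats.tukey_hsd(), which accepts the same group samples as positional arguments.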

Task 3 Independent Samples T-Test (15 points)

Test: Is there a significant difference in satisfaction scores between male and female patients?

State your hypotheses:
- H₀: μ_male = μ_female (no difference in mean satisfaction scores)
- H₁: μ_male ≠ μ_female (two-tailed test)
Check t-test assumptions:
- Normality of each group
- Equal variances (Levene's test)
- Use Welch's t-test if variances are unequal
Perform the t-test:
- Use scipy.stats.ttest_ind()
- Report t-statistic, degrees of freedom, and p-value
- Calculate effect size (Cohen's d)
Construct 95% confidence interval:
- CI for the difference in means
- Interpret the interval in context
Create visualization:
- Overlapping distributions for both groups
- Vertical lines showing means
- Export as ttest_satisfaction.png
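
The test, the confidence interval, and the effect size might be put together as in the sketch below, which assumes unequal variances and therefore uses Welch's version throughout. The satisfaction scores are simulated, not real data. The interval uses the Welch-Satterthwaite degrees of freedom, and Cohen's d uses the pooled standard deviation, which is one common convention among several.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical satisfaction scores for the two groups (simulated).
male = rng.normal(7.2, 1.4, 60)
female = rng.normal(7.6, 1.6, 70)

# Welch's t-test (equal_var=False does not assume equal variances).
result = stats.ttest_ind(male, female, equal_var=False)

n1, n2 = len(male), len(female)
v1, v2 = male.var(ddof=1), female.var(ddof=1)
diff = male.mean() - female.mean()

# Standard error and Welch-Satterthwaite degrees of freedom.
se = np.sqrt(v1 / n1 + v2 / n2)
df = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

# 95% confidence interval for the difference in means (male - female).
t_crit = stats.t.ppf(0.975, df)
ci = (diff - t_crit * se, diff + t_crit * se)

# Cohen's d with the pooled standard deviation.
pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
cohens_d = diff / pooled_sd

print(f"t = {result.statistic:.3f}, df = {df:.1f}, p = {result.pvalue:.4f}")
print(f"difference = {diff:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"Cohen's d = {cohens_d:.3f}")
```

The sign of the difference and of Cohen's d shows which group scored higher, which the report should state explicitly.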

Task 4 Chi-Square Test of Independence (15 points)

Test: Is there an association between insurance type and 30-day readmission?

State your hypotheses:
- H₀: Insurance type and readmission are independent
- H₁: There is an association between insurance type and readmission
Create contingency table:
- Rows: Insurance type (Private, Medicare, Medicaid, Self-Pay)
- Columns: Readmitted within 30 days (Yes, No)
- Include row and column totals
Calculate expected frequencies:
- E = (row total × column total) / grand total
- Verify all expected counts ≥ 5
Perform chi-square test:
- Use scipy.stats.chi2_contingency()
- Report χ², degrees of freedom, and p-value
- Calculate Cramér's V for effect size
Create visualization:
- Stacked or grouped bar chart
- Heatmap of residuals (observed - expected)
- Export as chisquare_insurance.png
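
A sketch of the chi-square workflow is shown below. The counts in the contingency table are invented for illustration and do not come from the dataset. Cramér's V is computed from its usual definition, V = sqrt(χ² / (n × (min(rows, columns) - 1))), since chi2_contingency() does not return it.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical contingency table of counts (not taken from the dataset).
observed = pd.DataFrame(
    {"Yes": [12, 25, 20, 8], "No": [108, 85, 60, 32]},
    index=["Private", "Medicare", "Medicaid", "Self-Pay"],
)

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under independence.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Check the rule of thumb that all expected counts are at least 5.
all_expected_ok = bool((expected >= 5).all())

# Effect size: Cramér's V.
n = observed.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

# Residuals (observed - expected) for the heatmap.
residuals = observed - expected

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
print(f"all expected counts >= 5: {all_expected_ok}")
print(f"Cramér's V = {cramers_v:.3f}")
print(residuals.round(2))
```

With real data, pd.crosstab(df["insurance_type"], df["readmitted_30day"], margins=True) would produce the table with row and column totals; the margins should be dropped before passing the table to the test.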

Task 5 Correlation and Regression Analysis (15 points)

Analyze relationships between patient metrics and outcomes:

Calculate correlation matrix:
- Pearson correlation for numerical variables
- Include: age, length_of_stay, total_charges, bp_systolic, heart_rate, satisfaction_score
- Test significance of each correlation (p-values)
Perform simple linear regression:
- Predictor: length_of_stay
- Outcome: total_charges
- Report slope, intercept, R², and p-value
Check regression assumptions:
- Linearity (residual plot)
- Normality of residuals
- Homoscedasticity
Create visualizations:
- Correlation heatmap with annotations
- Scatter plot with regression line and CI band
- Export as correlation_analysis.png
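
The regression step could look roughly like the sketch below. The data are simulated under an assumed linear relationship, so the coefficients mean nothing clinically. scipy.stats.linregress() returns the slope, intercept, correlation coefficient, and p-value in one call, and the residuals it implies can be reused for the assumption checks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: charges rise roughly linearly with length of stay.
length_of_stay = rng.integers(1, 15, 100).astype(float)
total_charges = 2000 + 1500 * length_of_stay + rng.normal(0, 2000, 100)

# Pearson correlation with its p-value.
r, r_p = stats.pearsonr(length_of_stay, total_charges)

# Simple linear regression: total_charges ~ length_of_stay.
fit = stats.linregress(length_of_stay, total_charges)
r_squared = fit.rvalue ** 2

# Residuals for the assumption checks.
predicted = fit.intercept + fit.slope * length_of_stay
residuals = total_charges - predicted

# Normality of residuals (Shapiro-Wilk).
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"r = {r:.3f} (p = {r_p:.4g})")
print(f"slope = {fit.slope:.1f}, intercept = {fit.intercept:.1f}")
print(f"R^2 = {r_squared:.3f}, p = {fit.pvalue:.4g}")
print(f"Shapiro-Wilk on residuals: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
```

Plotting residuals against predicted values covers both linearity and homoscedasticity: a curved pattern suggests non-linearity, and a funnel shape suggests non-constant variance.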

Task 6 Comprehensive Treatment Analysis (20 points)

Compare treatment outcomes across patient groups:

Two-Way ANOVA:
- Factors: Treatment type × Admission type (Emergency/Elective/Urgent)
- Outcome: Treatment cost
- Test for main effects and interaction
- Report all F-statistics and p-values
Calculate confidence intervals:
- 95% CI for mean total_charges by department
- 95% CI for proportion with side_effects_reported
- Display intervals graphically
Non-parametric alternative:
- Kruskal-Wallis test for satisfaction_score by treatment outcome
- Compare with the results of a one-way ANOVA on the same variables
- Discuss when to use non-parametric tests
Create visualizations:
- Interaction plot for two-way ANOVA
- Forest plot showing CIs for all treatment types
- Export as treatment_analysis.png
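
The confidence interval and non-parametric steps can be sketched as below. The helper names mean_ci and proportion_ci, the simulated scores, and the count of 60 side-effect reports out of 400 records are all assumptions made for illustration. The proportion interval is the simple normal-approximation (Wald) interval; other intervals exist and may behave better for small samples. The two-way ANOVA is not shown because SciPy has no function for it; it is typically fit with a separate package such as statsmodels.

```python
import numpy as np
from scipy import stats


def mean_ci(values, confidence=0.95):
    """t-based confidence interval for a mean."""
    x = np.asarray(values, dtype=float)
    se = x.std(ddof=1) / np.sqrt(x.size)
    t_crit = stats.t.ppf((1 + confidence) / 2, x.size - 1)
    return x.mean() - t_crit * se, x.mean() + t_crit * se


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = stats.norm.ppf((1 + confidence) / 2)
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


rng = np.random.default_rng(7)

# Hypothetical satisfaction scores grouped by treatment outcome (simulated).
scores = {
    "Improved": rng.normal(7.8, 1.2, 50),
    "Stable": rng.normal(7.1, 1.2, 50),
    "Resolved": rng.normal(8.2, 1.2, 50),
}

# Kruskal-Wallis test and the parametric one-way ANOVA for comparison.
h_stat, kw_p = stats.kruskal(*scores.values())
f_stat, anova_p = stats.f_oneway(*scores.values())

# Hypothetical count: 60 of 400 treatment records report side effects.
side_effect_ci = proportion_ci(60, 400)

print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {kw_p:.4g}")
print(f"One-way ANOVA:  F = {f_stat:.3f}, p = {anova_p:.4g}")
for name, values in scores.items():
    low, high = mean_ci(values)
    print(f"{name}: mean = {values.mean():.2f}, 95% CI = ({low:.2f}, {high:.2f})")
print(f"side effects: 95% CI = ({side_effect_ci[0]:.3f}, {side_effect_ci[1]:.3f})")
```

The per-group intervals printed at the end are the values a forest plot would display, one row per group.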

Grading Rubric

Component	Points	Criteria
Descriptive Statistics	15	Complete summary statistics, proper interpretation, quality visualization
One-Way ANOVA	20	Correct assumptions testing, proper test execution, effect size calculated, post-hoc if needed
T-Test Analysis	15	Correct test selection, CI construction, effect size (Cohen's d), clear interpretation
Chi-Square Test	15	Proper contingency table, expected frequencies verified, Cramér's V calculated
Correlation Analysis	15	Correlation matrix complete, regression assumptions checked, quality heatmap
Treatment Analysis	20	Two-way ANOVA correct, CIs properly constructed, non-parametric comparison included
Total	100

Deductions

-5 points: Missing hypothesis statements or incorrect test selection
-5 points: Assumptions not checked before parametric tests
-5 points: Missing effect sizes or confidence intervals
-5 points: Visualizations missing or low quality
-10 points: Incorrect statistical interpretations

Bonus Points (up to 10)

+3 points: Power analysis included for sample size justification
+3 points: Bootstrap confidence intervals as alternative method
+2 points: Multiple comparison correction (Bonferroni, FDR) applied
+2 points: Exceptionally clear statistical report suitable for non-technical audience
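
Two of the bonus items can be illustrated briefly. The sketch below shows a percentile bootstrap interval, which is one of several bootstrap variants, and a Bonferroni adjustment. The function name bootstrap_ci, the simulated length-of-stay values, and the four p-values are hypothetical.

```python
import numpy as np


def bootstrap_ci(values, stat=np.mean, n_resamples=5000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    # Each row of idx is one resample (with replacement) of the data.
    idx = rng.integers(0, x.size, size=(n_resamples, x.size))
    estimates = stat(x[idx], axis=1)
    alpha = 1 - confidence
    low, high = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Hypothetical right-skewed length-of-stay values (simulated).
rng = np.random.default_rng(3)
length_of_stay = rng.exponential(scale=5.0, size=120)

mean_low, mean_high = bootstrap_ci(length_of_stay)
median_low, median_high = bootstrap_ci(length_of_stay, stat=np.median)

# Bonferroni correction: multiply each p-value by the number of tests,
# capped at 1. The p-values here are made up for illustration.
p_values = np.array([0.004, 0.020, 0.030, 0.300])
bonferroni = np.minimum(p_values * p_values.size, 1.0)

print(f"bootstrap 95% CI for mean:   ({mean_low:.2f}, {mean_high:.2f})")
print(f"bootstrap 95% CI for median: ({median_low:.2f}, {median_high:.2f})")
print("Bonferroni-adjusted p-values:", bonferroni)
```

The bootstrap is useful here because it makes no normality assumption, so it can be reported next to the t-based interval for skewed variables.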

Submission

Create a public GitHub repository with the exact name shown below, add all required files, and submit through the submission portal.

github.com/<your-username>/healthfirst-analysis

Required Repository Structure:

healthfirst-analysis/
├── statistical_analysis.ipynb
├── statistical_report.pdf
├── visualizations/
│   ├── descriptive_stats.png
│   ├── anova_length_of_stay.png
│   ├── ttest_satisfaction.png
│   ├── chisquare_insurance.png
│   ├── correlation_analysis.png
│   └── treatment_analysis.png
├── data/
│   └── (downloaded CSV files)
└── README.md

Required Files Checklist:

statistical_analysis.ipynb
statistical_report.pdf
descriptive_stats.png
anova_length_of_stay.png
ttest_satisfaction.png
chisquare_insurance.png
correlation_analysis.png
treatment_analysis.png
README.md

All files are required. Submission will fail if any file is missing.

Pro Tips

Statistical Best Practices

Always check assumptions before using parametric tests
Report effect sizes along with p-values
Report exact p-values (p = 0.023), not just thresholds (p < 0.05)
Consider practical significance, not just statistical significance

Code Organization

Create helper functions for repeated calculations
Use meaningful variable names (e.g., ttest_result, not t)
Add markdown cells explaining your reasoning
Include references to statistical formulas used

Report Writing

Structure: Introduction → Methods → Results → Discussion
Use tables for test statistics and p-values
Include confidence intervals in your conclusions
Write for a non-technical audience (the hospital board)

Common Mistakes

Using parametric tests without first checking normality and the other assumptions
Forgetting to correct for multiple comparisons
Interpreting correlation as causation
Not reporting the direction of differences (which group is higher?)
