Module 6.1

Descriptive Statistics

Master the art of summarizing data using measures of central tendency, variability, and distribution shape. Learn to extract meaningful insights from raw data through statistical summaries.

40 min
Intermediate
Mathematical
What You'll Learn
  • Mean, median, and mode
  • Variance and standard deviation
  • Skewness and kurtosis
  • Quartiles and percentiles
  • Pandas describe() method
Contents
01

What is Descriptive Statistics?

Descriptive statistics is the branch of statistics that deals with summarizing and describing data. Unlike inferential statistics, which makes predictions about populations, descriptive statistics focuses on describing what the data actually shows, without drawing conclusions beyond it.

Key Concept

Descriptive vs Inferential Statistics

Descriptive Statistics summarizes and organizes data so it can be easily understood. It describes the sample you have.

Inferential Statistics uses sample data to make predictions or inferences about a larger population. It goes beyond the data at hand.

This module focuses on descriptive statistics - summarizing what we observe in our data.

The Three Pillars of Descriptive Statistics

Central Tendency

Measures that describe the center or typical value of a dataset: mean, median, and mode.

Variability

Measures that describe the spread or dispersion: range, variance, standard deviation, IQR.

Distribution Shape

Measures that describe the shape: skewness (asymmetry) and kurtosis (tailedness).

Getting Started with Data

Let's set up a sample dataset to explore these concepts:

import numpy as np
import pandas as pd
from scipy import stats

# Sample dataset: exam scores
scores = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73, 
          80, 84, 77, 93, 86, 79, 81, 90, 75, 87]

# Create a pandas Series for easier manipulation
exam_scores = pd.Series(scores, name="Exam Scores")

print("Sample Data:")
print(exam_scores.values)
print(f"\nNumber of observations: {len(exam_scores)}")
Tip: Throughout this lesson, we'll use NumPy for calculations, Pandas for data handling, and SciPy for advanced statistical functions.
02

Measures of Central Tendency

Central tendency measures describe the center of a dataset - the typical or representative value. The three main measures are mean, median, and mode, each useful in different situations.

Interactive: Mean vs Median with Outliers

Drag to Explore

See how adding an outlier affects the mean vs the median. With the base data [50, 55, 60, 65, 70, 70], the mean (61.67) and median (62.50) are similar.

Insight: Drag the outlier slider toward an extreme value and the mean shifts dramatically while the median stays stable. The median is resistant to extreme values!
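The same effect can be reproduced in a few lines of code (the outlier value 500 below is an arbitrary choice for illustration):

```python
import numpy as np

base = [50, 55, 60, 65, 70, 70]           # the demo's base data
with_outlier = [50, 55, 60, 65, 70, 500]  # last value replaced by an outlier

print(f"Base:    mean = {np.mean(base):.2f}, median = {np.median(base):.2f}")
print(f"Outlier: mean = {np.mean(with_outlier):.2f}, median = {np.median(with_outlier):.2f}")
# The mean jumps from 61.67 to 133.33; the median stays at 62.50
```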

Mean (Average)

The mean is the sum of all values divided by the number of values. It's the most common measure but is sensitive to outliers.

Mean = Σx / n

Sum of all values divided by count

# Calculating the mean
scores = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73, 
          80, 84, 77, 93, 86, 79, 81, 90, 75, 87]

# Using NumPy
mean_np = np.mean(scores)
print(f"Mean (NumPy): {mean_np}")

# Using Pandas
exam_scores = pd.Series(scores)
mean_pd = exam_scores.mean()
print(f"Mean (Pandas): {mean_pd}")

# Manual calculation
mean_manual = sum(scores) / len(scores)
print(f"Mean (Manual): {mean_manual}")

# Output: Mean = 83.05

Median (Middle Value)

The median is the middle value when data is sorted. If there's an even number of values, it's the average of the two middle values. The median is resistant to outliers.

# Calculating the median
median_np = np.median(scores)
print(f"Median (NumPy): {median_np}")

median_pd = exam_scores.median()
print(f"Median (Pandas): {median_pd}")

# Manual calculation
sorted_scores = sorted(scores)
n = len(sorted_scores)
if n % 2 == 0:
    median_manual = (sorted_scores[n//2 - 1] + sorted_scores[n//2]) / 2
else:
    median_manual = sorted_scores[n//2]
print(f"Median (Manual): {median_manual}")

# Output: Median = 83.0

Mode (Most Frequent Value)

The mode is the value that appears most frequently. A dataset can have no mode, one mode (unimodal), or multiple modes (bimodal/multimodal).

# Calculating the mode
from scipy import stats

# Dataset with a clear mode
grades = [85, 90, 85, 78, 92, 85, 88, 90, 85]

mode_scipy = stats.mode(grades, keepdims=True)
print(f"Mode: {mode_scipy.mode[0]}")
print(f"Count: {mode_scipy.count[0]}")

# Using Pandas
grades_series = pd.Series(grades)
mode_pd = grades_series.mode()
print(f"Mode (Pandas): {mode_pd.values}")

# Output: Mode = 85 (appears 4 times)
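A dataset can also be bimodal. Pandas' mode() returns every tied value, as in this made-up example where two values tie for most frequent:

```python
import pandas as pd

# Hypothetical commute times (minutes): 25 and 30 each appear three times
commute_times = [20, 25, 30, 25, 40, 30, 25, 30, 35]

modes = pd.Series(commute_times).mode()
print(f"Modes: {modes.tolist()}")  # [25, 30] - a bimodal dataset
```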

When to Use Each Measure

Mean
  • ✅ Symmetric distributions
  • ✅ Interval/ratio data
  • ❌ Avoid with outliers
  • ❌ Avoid with skewed data
Median
  • ✅ Skewed distributions
  • ✅ Data with outliers
  • ✅ Ordinal data
  • ✅ Income, house prices
Mode
  • ✅ Categorical data
  • ✅ Finding most common
  • ✅ Bimodal distributions
  • ❌ May not exist

Practice: Central Tendency

Task: Given employee salaries: salaries = [45000, 48000, 52000, 55000, 51000, 49000, 53000, 47000, 850000] (the last one is the CEO), calculate mean, median, and mode. Explain which measure best represents the "typical" employee salary and why.

Show Solution
import numpy as np
from scipy import stats

salaries = [45000, 48000, 52000, 55000, 51000, 49000, 53000, 47000, 850000]

mean_sal = np.mean(salaries)
median_sal = np.median(salaries)
mode_sal = stats.mode(salaries, keepdims=True).mode[0]

print(f"Mean: ${mean_sal:,.0f}")    # $138,889 - inflated by CEO
print(f"Median: ${median_sal:,.0f}") # $51,000 - typical employee
print(f"Mode: ${mode_sal:,.0f}")     # $45,000 - all values unique, so mode is uninformative

# ANSWER: Median ($51,000) best represents typical salary
# The CEO's $850,000 is an outlier that skews the mean

Task: A clothing store recorded t-shirt sizes sold: sizes = ['M', 'L', 'S', 'M', 'XL', 'M', 'L', 'M', 'S', 'M', 'L', 'M', 'XXL', 'L', 'M']. Find which size is most popular (mode) and calculate what percentage of sales it represents.

Show Solution
import pandas as pd

sizes = ['M', 'L', 'S', 'M', 'XL', 'M', 'L', 'M', 'S', 'M', 'L', 'M', 'XXL', 'L', 'M']
sizes_series = pd.Series(sizes)

# Find mode
mode = sizes_series.mode()[0]
mode_count = (sizes_series == mode).sum()
percentage = (mode_count / len(sizes_series)) * 100

print(f"Most popular size: {mode}")
print(f"Count: {mode_count} out of {len(sizes_series)}")
print(f"Percentage: {percentage:.1f}%")

# Output: M is the mode, sold 7 times (46.7%)
03

Measures of Variability

While central tendency tells us about the typical value, variability measures tell us how spread out the data is. Two datasets can have the same mean but very different spreads, making variability essential for understanding data.

Range

The simplest measure of spread - the difference between the maximum and minimum values. It's easy to calculate but highly sensitive to outliers.

scores = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73, 
          80, 84, 77, 93, 86, 79, 81, 90, 75, 87]

# Calculate range
data_range = np.max(scores) - np.min(scores)
print(f"Range: {data_range}")

# Or using built-in functions
data_range = max(scores) - min(scores)
print(f"Range: {data_range}")  # Output: 23

Variance

Variance measures the average squared deviation from the mean. It quantifies how far data points are from the center on average.

Variance = Σ(x - μ)² / n (Population)

Variance = Σ(x - x̄)² / (n-1) (Sample)

n-1 for sample (Bessel's correction) to reduce bias

# Population variance (divide by n)
var_pop = np.var(scores)
print(f"Population Variance: {var_pop:.2f}")

# Sample variance (divide by n-1) - use this for samples!
var_sample = np.var(scores, ddof=1)  # ddof=1 for sample
print(f"Sample Variance: {var_sample:.2f}")

# Pandas uses sample variance by default
var_pd = pd.Series(scores).var()
print(f"Pandas Variance: {var_pd:.2f}")
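To see why dividing by n-1 reduces bias, here's a quick simulation (not part of the lesson's exam data): draw many small samples from a population with known variance 1 and average each estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0, 1, size=(100_000, 5))  # 100,000 samples of size n=5

biased = np.var(samples, axis=1).mean()            # divide by n
unbiased = np.var(samples, axis=1, ddof=1).mean()  # divide by n-1

print(f"True variance:         1.000")
print(f"Average n-divisor:     {biased:.3f}")    # underestimates, ~0.8
print(f"Average (n-1)-divisor: {unbiased:.3f}")  # close to 1.0
```

With n=5, the n-divisor estimator averages about (n-1)/n = 0.8 of the true variance, which is exactly the bias Bessel's correction removes.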

Standard Deviation

Standard deviation is the square root of variance. It's in the same units as the data, making it more interpretable than variance.

# Population standard deviation
std_pop = np.std(scores)
print(f"Population Std Dev: {std_pop:.2f}")

# Sample standard deviation (most common)
std_sample = np.std(scores, ddof=1)
print(f"Sample Std Dev: {std_sample:.2f}")

# Pandas (sample std by default)
std_pd = pd.Series(scores).std()
print(f"Pandas Std Dev: {std_pd:.2f}")

# Interpretation
mean = np.mean(scores)
print(f"\nMean ± 1 Std: {mean:.1f} ± {std_sample:.1f}")
print(f"Range: [{mean - std_sample:.1f}, {mean + std_sample:.1f}]")
68-95-99.7 Rule: For normal distributions, ~68% of data falls within 1 std, ~95% within 2 std, and ~99.7% within 3 std of the mean.
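You can verify the rule empirically on simulated normal data (the loc/scale values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=100_000)  # simulated normal data

mean, std = data.mean(), data.std()
coverage = {}
for k in (1, 2, 3):
    coverage[k] = np.mean(np.abs(data - mean) <= k * std) * 100
    print(f"Within {k} std: {coverage[k]:.1f}%")
# Roughly 68.3%, 95.4%, 99.7%
```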

Interquartile Range (IQR)

IQR is the range of the middle 50% of data (Q3 - Q1). It's robust to outliers and commonly used in box plots.

# Calculate quartiles
q1 = np.percentile(scores, 25)
q3 = np.percentile(scores, 75)
iqr = q3 - q1

print(f"Q1 (25th percentile): {q1}")
print(f"Q3 (75th percentile): {q3}")
print(f"IQR: {iqr}")

# Using SciPy
from scipy.stats import iqr
iqr_scipy = iqr(scores)
print(f"IQR (SciPy): {iqr_scipy}")

# Outlier detection using IQR
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(f"\nOutlier bounds: [{lower_bound:.1f}, {upper_bound:.1f}]")

Coefficient of Variation (CV)

CV expresses standard deviation as a percentage of the mean, allowing comparison of variability across different scales.

# Coefficient of Variation
mean = np.mean(scores)
std = np.std(scores, ddof=1)
cv = (std / mean) * 100

print(f"Mean: {mean:.2f}")
print(f"Std Dev: {std:.2f}")
print(f"CV: {cv:.2f}%")

# Useful for comparing variability
# e.g., heights (CV ~5%) vs income (CV ~50-100%)

Practice: Variability

Task: Machine A produces parts with weights: [100.2, 99.8, 100.1, 99.9, 100.0] grams. Machine B produces parts with weights: [50.5, 49.2, 51.3, 48.8, 50.2] grams. Using Coefficient of Variation (CV), determine which machine is more consistent (lower relative variability).

Show Solution
import numpy as np

machine_a = [100.2, 99.8, 100.1, 99.9, 100.0]
machine_b = [50.5, 49.2, 51.3, 48.8, 50.2]

# Calculate CV = (std / mean) * 100
cv_a = (np.std(machine_a, ddof=1) / np.mean(machine_a)) * 100
cv_b = (np.std(machine_b, ddof=1) / np.mean(machine_b)) * 100

print(f"Machine A - Mean: {np.mean(machine_a):.2f}g, Std: {np.std(machine_a, ddof=1):.3f}g")
print(f"Machine A - CV: {cv_a:.2f}%")
print(f"\nMachine B - Mean: {np.mean(machine_b):.2f}g, Std: {np.std(machine_b, ddof=1):.3f}g")
print(f"Machine B - CV: {cv_b:.2f}%")

print(f"\nMachine A is more consistent (CV: {cv_a:.2f}% vs {cv_b:.2f}%)")
# Machine A has ~0.16% CV, Machine B has ~2.01% CV

Task: House prices in a neighborhood (in thousands): prices = [250, 275, 290, 310, 285, 295, 280, 265, 890, 305]. Use the IQR method to identify any outliers and explain if the $890K house should be flagged.

Show Solution
import numpy as np

prices = [250, 275, 290, 310, 285, 295, 280, 265, 890, 305]

q1 = np.percentile(prices, 25)
q3 = np.percentile(prices, 75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(f"Q1: ${q1:.0f}K, Q3: ${q3:.0f}K, IQR: ${iqr:.0f}K")
print(f"Lower bound: ${lower_bound:.0f}K")
print(f"Upper bound: ${upper_bound:.0f}K")

outliers = [p for p in prices if p < lower_bound or p > upper_bound]
print(f"\nOutliers: {outliers}")
# The $890K house is an outlier (above upper bound ~$342K)
04

Distribution Shape

Beyond center and spread, understanding the shape of a data distribution is crucial. Skewness and kurtosis describe asymmetry and tail behavior, helping identify the nature of your data and appropriate analytical methods.

Skewness

Skewness measures the asymmetry of a distribution. It tells you which direction the tail extends.

Negative Skew

Left tail longer
Mean < Median < Mode

Example: Easy test scores

Symmetric

Balanced tails
Mean ≈ Median ≈ Mode

Example: Heights, IQ

Positive Skew

Right tail longer
Mode < Median < Mean

Example: Income, house prices

from scipy.stats import skew

# Our exam scores
scores = [72, 85, 88, 91, 76, 82, 78, 95, 89, 73, 
          80, 84, 77, 93, 86, 79, 81, 90, 75, 87]

# Calculate skewness
skewness = skew(scores)
print(f"Skewness: {skewness:.3f}")

# Interpretation
if skewness > 0.5:
    print("Moderately to highly positively skewed")
elif skewness < -0.5:
    print("Moderately to highly negatively skewed")
else:
    print("Approximately symmetric")

# Example of positively skewed data (income-like)
income = [30000, 35000, 40000, 45000, 50000, 55000, 
          60000, 80000, 100000, 150000, 500000]
print(f"\nIncome skewness: {skew(income):.3f}")  # Positive
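The ordering rule for positive skew (Mode < Median < Mean) shows up directly in the income-like data above: the long right tail pulls the mean well above the median.

```python
import numpy as np

# Same income-like data as above
income = [30000, 35000, 40000, 45000, 50000, 55000,
          60000, 80000, 100000, 150000, 500000]

print(f"Median: {np.median(income):,.0f}")  # 55,000
print(f"Mean:   {np.mean(income):,.0f}")    # ~104,091 - pulled up by the tail
print(f"Mean > Median: {np.mean(income) > np.median(income)}")  # True
```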

Kurtosis

Kurtosis measures the "tailedness" of a distribution - how much data is in the tails versus the center compared to a normal distribution.

Platykurtic

Kurtosis < 0

Flat, light tails
Fewer outliers

Example: Uniform distribution

Mesokurtic

Kurtosis ≈ 0

Normal shape
Moderate tails

Example: Normal distribution

Leptokurtic

Kurtosis > 0

Peaked, heavy tails
More outliers

Example: Financial returns

from scipy.stats import kurtosis

# Calculate excess kurtosis (Fisher's definition)
# Normal distribution has excess kurtosis = 0
kurt = kurtosis(scores)
print(f"Excess Kurtosis: {kurt:.3f}")

# Interpretation
if kurt > 1:
    print("Leptokurtic - heavy tails, more outliers likely")
elif kurt < -1:
    print("Platykurtic - light tails, fewer outliers")
else:
    print("Approximately mesokurtic (normal-like)")

# Compare different distributions
normal_data = np.random.normal(0, 1, 1000)
uniform_data = np.random.uniform(-2, 2, 1000)
laplace_data = np.random.laplace(0, 1, 1000)

print(f"\nNormal kurtosis: {kurtosis(normal_data):.3f}")
print(f"Uniform kurtosis: {kurtosis(uniform_data):.3f}")
print(f"Laplace kurtosis: {kurtosis(laplace_data):.3f}")

Percentiles and Quartiles

Percentiles divide data into 100 equal parts. Quartiles are special percentiles (25th, 50th, 75th) that divide data into four equal parts.

# Calculate various percentiles
scores_series = pd.Series(scores)

# Quartiles
print("Quartiles:")
print(f"Q1 (25th): {scores_series.quantile(0.25)}")
print(f"Q2 (50th): {scores_series.quantile(0.50)}")  # Median
print(f"Q3 (75th): {scores_series.quantile(0.75)}")

# Any percentile
print(f"\n10th percentile: {np.percentile(scores, 10)}")
print(f"90th percentile: {np.percentile(scores, 90)}")

# Five-number summary
print(f"\nFive-number summary:")
print(f"Min: {scores_series.min()}")
print(f"Q1: {scores_series.quantile(0.25)}")
print(f"Median: {scores_series.median()}")
print(f"Q3: {scores_series.quantile(0.75)}")
print(f"Max: {scores_series.max()}")

Practice: Distribution Shape

Task: Given household incomes (in thousands): income = [35, 42, 48, 52, 55, 58, 62, 68, 75, 95, 120, 180, 250], calculate skewness and kurtosis. Based on the results, would you recommend a log transformation before using this data in a linear model?

Show Solution
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

income = [35, 42, 48, 52, 55, 58, 62, 68, 75, 95, 120, 180, 250]

skewness = skew(income)
kurt = kurtosis(income)
mean_val = np.mean(income)
median_val = np.median(income)

print(f"Mean: ${mean_val:.0f}K, Median: ${median_val:.0f}K")
print(f"Skewness: {skewness:.2f}")
print(f"Kurtosis: {kurt:.2f}")

# Apply log transform and compare
log_income = np.log(income)
print(f"\nAfter log transform:")
print(f"Skewness: {skew(log_income):.2f}")

# Skewness > 1 indicates strong right skew
# Log transformation IS recommended for linear models

Task: Exam scores: scores = [45, 52, 58, 62, 65, 68, 70, 72, 75, 78, 80, 82, 85, 88, 92, 95, 98]. Calculate the five-number summary, determine if the distribution is symmetric or skewed, and identify if any scores would be considered outliers using the 1.5×IQR rule.

Show Solution
import numpy as np
import pandas as pd

scores = [45, 52, 58, 62, 65, 68, 70, 72, 75, 78, 80, 82, 85, 88, 92, 95, 98]

# Five-number summary
five_num = {
    'Min': np.min(scores),
    'Q1': np.percentile(scores, 25),
    'Median': np.median(scores),
    'Q3': np.percentile(scores, 75),
    'Max': np.max(scores)
}
print("Five-Number Summary:")
for k, v in five_num.items():
    print(f"  {k}: {v}")

# Check symmetry: compare distances from the quartiles to the median
iqr = five_num['Q3'] - five_num['Q1']
q1_to_median = five_num['Median'] - five_num['Q1']
median_to_q3 = five_num['Q3'] - five_num['Median']

print(f"\nIQR: {iqr}")
print(f"Q1 to Median: {q1_to_median}")
print(f"Median to Q3: {median_to_q3}")

# Outlier bounds
lower_bound = five_num['Q1'] - 1.5 * iqr
upper_bound = five_num['Q3'] + 1.5 * iqr
outliers = [s for s in scores if s < lower_bound or s > upper_bound]

print(f"\nOutlier bounds: [{lower_bound:.1f}, {upper_bound:.1f}]")
print(f"Outliers: {outliers if outliers else 'None'}")
# Distribution is roughly symmetric, no outliers
05

Pandas Statistical Methods

Pandas provides a comprehensive suite of statistical methods built directly into DataFrames and Series. The describe() method gives you a complete statistical summary in one line, while individual methods offer precise control.

The describe() Method

The describe() method provides a complete statistical summary including count, mean, std, min, quartiles, and max.

import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'Age': [25, 32, 45, 28, 36, 42, 29, 51, 33, 38],
    'Salary': [45000, 52000, 78000, 48000, 62000, 71000, 51000, 85000, 55000, 65000],
    'Experience': [2, 5, 18, 3, 10, 15, 4, 22, 8, 12]
})

# Basic describe
print(df.describe())

# Output includes:
# - count: non-null values
# - mean: average
# - std: standard deviation
# - min: minimum
# - 25%: first quartile (Q1)
# - 50%: median (Q2)
# - 75%: third quartile (Q3)
# - max: maximum

Customizing describe()

You can customize describe() with percentiles and include categorical data.

# Custom percentiles
print(df.describe(percentiles=[.1, .25, .5, .75, .9]))

# Include categorical columns
df['Department'] = ['Sales', 'IT', 'HR', 'IT', 'Sales', 
                    'IT', 'HR', 'Sales', 'IT', 'HR']
print(df.describe(include='all'))

# Only categorical
print(df.describe(include=['object']))

# Only numeric
print(df.describe(include=[np.number]))

Individual Statistical Methods

Pandas provides individual methods for every statistical measure, applicable to both Series and DataFrames.

# Central tendency
print(f"Mean:\n{df[['Age', 'Salary']].mean()}")
print(f"\nMedian:\n{df[['Age', 'Salary']].median()}")
print(f"\nMode:\n{df['Department'].mode()}")

# Variability
print(f"\nVariance:\n{df[['Age', 'Salary']].var()}")
print(f"\nStd Dev:\n{df[['Age', 'Salary']].std()}")

# Range-based
print(f"\nMin:\n{df[['Age', 'Salary']].min()}")
print(f"\nMax:\n{df[['Age', 'Salary']].max()}")

# Quantiles
print(f"\nQ1:\n{df[['Age', 'Salary']].quantile(0.25)}")
print(f"\nQ3:\n{df[['Age', 'Salary']].quantile(0.75)}")

Grouped Statistics with groupby()

Combine groupby() with statistical methods for powerful segmented analysis.

# Statistics by group
print("Mean by Department:")
print(df.groupby('Department')[['Age', 'Salary', 'Experience']].mean())

# Multiple aggregations
print("\nMultiple stats by Department:")
print(df.groupby('Department')['Salary'].agg(['mean', 'median', 'std', 'min', 'max']))

# Named aggregations
summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    avg_age=('Age', 'mean'),
    headcount=('Age', 'count')
)
print("\nNamed aggregations:")
print(summary)

Correlation and Covariance

Understand relationships between variables with correlation and covariance matrices.

# Correlation matrix
print("Correlation Matrix:")
print(df[['Age', 'Salary', 'Experience']].corr())

# Covariance matrix
print("\nCovariance Matrix:")
print(df[['Age', 'Salary', 'Experience']].cov())

# Correlation between two columns
print(f"\nAge-Salary correlation: {df['Age'].corr(df['Salary']):.3f}")
print(f"Experience-Salary correlation: {df['Experience'].corr(df['Salary']):.3f}")

Additional Useful Methods

# Cumulative statistics
print(f"Cumulative Sum:\n{df['Salary'].cumsum()}")
print(f"\nCumulative Max:\n{df['Salary'].cummax()}")

# Rolling statistics (window-based)
print(f"\n3-period Rolling Mean:\n{df['Salary'].rolling(3).mean()}")

# Ranking
print(f"\nSalary Ranks:\n{df['Salary'].rank()}")

# Value counts (for categorical)
print(f"\nDepartment Counts:\n{df['Department'].value_counts()}")

# Unique values
print(f"\nNumber of unique departments: {df['Department'].nunique()}")

Practice: Pandas Statistics

Task: Given sales data with regions, calculate mean, median, std, min, and max sales for each region. Identify which region has the highest average sales and which has the most consistent performance (lowest CV).

Show Solution
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'North', 'West', 'North', 'East', 'West', 'North', 'East'],
    'Sales': [12500, 15800, 11200, 9800, 16200, 10500, 13100, 14900, 11000, 12800]
})

# Calculate multiple statistics per region
stats = df.groupby('Region')['Sales'].agg(['mean', 'median', 'std', 'min', 'max'])
print("Regional Statistics:")
print(stats)

# Calculate CV for consistency comparison
cv_by_region = df.groupby('Region')['Sales'].agg(lambda x: (x.std() / x.mean()) * 100)
print(f"\nCV by Region:")
print(cv_by_region)

print(f"\nHighest avg sales: {stats['mean'].idxmax()} (${stats['mean'].max():,.0f})")
print(f"Most consistent: {cv_by_region.idxmin()} (CV: {cv_by_region.min():.1f}%)")

Task: Given employee data with Experience (years), Salary, Performance Score (1-10), and Training Hours, create a correlation matrix and identify: (1) the strongest positive correlation, (2) any negative correlations, and (3) which factor is most correlated with Salary.

Show Solution
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'Experience': [2, 5, 8, 3, 12, 7, 4, 9, 6, 10],
    'Salary': [45000, 58000, 72000, 48000, 95000, 65000, 52000, 78000, 62000, 85000],
    'Performance': [7, 8, 7, 6, 9, 8, 7, 8, 9, 8],
    'Training_Hours': [40, 25, 15, 35, 10, 20, 30, 12, 22, 8]
})

# Full correlation matrix
corr_matrix = df.corr()
print("Correlation Matrix:")
print(corr_matrix.round(3))

# Find strongest positive correlation (excluding diagonal)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
corr_pairs = corr_matrix.where(mask).stack()

print(f"\nStrongest positive correlation: {corr_pairs.idxmax()}")
print(f"  Value: {corr_pairs.max():.3f}")

# Negative correlations
neg_corrs = corr_pairs[corr_pairs < 0]
print(f"\nNegative correlations:")
for idx, val in neg_corrs.items():
    print(f"  {idx}: {val:.3f}")

# Factor most correlated with Salary
salary_corr = corr_matrix['Salary'].drop('Salary').abs().sort_values(ascending=False)
print(f"\nFactors correlated with Salary:")
print(salary_corr)

Key Takeaways

Central Tendency Trio

Mean for symmetric data, Median for skewed data or outliers, Mode for categorical data. Always report the appropriate measure for your data type.

Variability Matters

Standard deviation measures typical distance from mean. IQR is robust to outliers. Use CV to compare variability across different scales.

Shape Indicators

Skewness reveals asymmetry (positive = right tail, negative = left tail). Kurtosis indicates tail weight and outlier propensity.

Pandas describe()

Use df.describe() for instant statistical summaries. Customize with percentiles and include parameters for comprehensive analysis.

Grouped Analysis

Combine groupby() with agg() for powerful segmented statistics. Use named aggregations for readable, multi-metric summaries.

Sample vs Population

Use ddof=1 for sample statistics (n-1 denominator). Pandas uses sample statistics by default; NumPy uses population unless specified.
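A quick check of the defaults described above:

```python
import numpy as np
import pandas as pd

data = [2, 4, 6, 8]

print(np.std(data))           # 2.236... - NumPy default: population (ddof=0)
print(np.std(data, ddof=1))   # 2.582... - sample std (n-1 denominator)
print(pd.Series(data).std())  # 2.582... - Pandas default is already ddof=1
```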

Knowledge Check

Test your understanding of descriptive statistics with this quick quiz.

Question 1 of 6

Which measure of central tendency is most resistant to outliers?

Question 2 of 6

For a positively skewed distribution, which relationship is true?

Question 3 of 6

What does IQR stand for and what does it measure?

Question 4 of 6

Which NumPy parameter gives sample standard deviation instead of population?

Question 5 of 6

Which Pandas method provides a summary of descriptive statistics?

Question 6 of 6

A distribution with heavy tails and more outliers than normal is called:
