Data Modeling Fundamentals

Data Modeling Concepts

Data modeling is the process of structuring data to support analytics and reporting. A well-designed data model enables fast queries, reduces redundancy, and provides a clear business context for analytical exploration. Unlike transactional databases optimized for OLTP (Online Transaction Processing), analytical models use OLAP (Online Analytical Processing) design principles.

Key Concept

What is Data Modeling?

Data modeling is the design and organization of data to optimize analytics and reporting. It involves determining table structure, relationships, and measure definitions that support business intelligence and decision-making.

Why it matters: A well-designed model supports fast analysis, reduces data redundancy, prevents calculation errors, and provides users with intuitive access to business metrics.

OLTP vs OLAP Design

Transactional databases prioritize write performance and data integrity, while analytical databases prioritize read performance and query speed. The following comparison shows how these designs differ:

OLTP (Transactional)

Normalized structure, many small inserts/updates, row-level operations, real-time consistency.

OLAP (Analytical)

Denormalized structure, aggregated data, complex queries, optimized for reads and reporting.

Core Components

Every analytical model consists of three main components: fact tables that store measurable business events, dimension tables that provide context, and relationships that connect them. Understanding these components is essential for building effective models.

# Example: Sales Model Structure
# Fact Table: Sales (transactional level)
# - SalesID, DateKey, CustomerKey, ProductKey
# - QuantitySold, SalesAmount, DiscountAmount

# Dimension Tables: Time, Customer, Product
# - DateKey: Date, Month, Quarter, Year, IsWeekend
# - CustomerKey: CustomerName, City, Country, Segment
# - ProductKey: ProductName, Category, SubCategory, Brand

Key Modeling Principles

Effective data models follow several key principles. They use consistent grain at the fact table level, maintain referential integrity through relationships, separate concerns between facts and dimensions, and optimize for common analytical queries.

# Example: Creating a Simple Power BI Model
# Step 1: Load data from sources
sales_data = pd.read_csv('sales.csv')
customers = pd.read_csv('customers.csv')
products = pd.read_csv('products.csv')

# Step 2: Structure as fact and dimensions
fact_sales = sales_data[['transaction_id', 'customer_id', 'product_id', 
                          'amount', 'quantity']]
dim_customer = customers[['customer_id', 'name', 'city', 'country']]
dim_product = products[['product_id', 'name', 'category', 'price']]

print(f"Fact table shape: {fact_sales.shape}")
print(f"Dimension tables: {dim_customer.shape}, {dim_product.shape}")

Pro Tip: Design your model to match user questions. If users ask "What was sales by region?", you need a region dimension. This user-centric approach ensures the model supports real business needs.

Practice: Core Concepts

Task: Explain the key differences between OLTP and OLAP database designs and provide examples of each.

Show Solution

# Answer: OLTP vs OLAP

# OLTP: Optimized for transactional operations (INSERT, UPDATE, DELETE)
# - Normalized structure to reduce redundancy
# - Fast write operations
# - Real-time consistency
# - Few rows per transaction
# Example: Banking system for deposits/withdrawals

# OLAP: Optimized for analytical queries (SELECT, aggregations)
# - Denormalized structure for query speed
# - Fast read operations
# - Historical data retention
# - Aggregated data for analysis
# Example: Sales reporting and business intelligence

# Comparison Table:
# Aspect         | OLTP              | OLAP
# Optimization   | Writes            | Reads
# Structure      | Normalized        | Denormalized
# Data Volume    | Current only      | Historical
# Query Type     | Simple, short      | Complex, long
# Users          | Many concurrent   | Few, power users

Task: Compare query performance and complexity between normalized and dimensional designs.

Show Solution

# Normalized Structure (Poor for Analytics)
# Many tables with complex joins required
SELECT SUM(amount)
FROM sales s
JOIN transactions t ON s.t_id = t.id
JOIN customers c ON t.c_id = c.id
JOIN products p ON s.p_id = p.id
WHERE c.country = 'USA' AND YEAR(t.date) = 2024;

# Star Schema / Dimensional (Better for Analytics)
# Few joins, simple structure, fast aggregations
SELECT SUM(sales_amount)
FROM fact_sales f
JOIN dim_customer dc ON f.customer_key = dc.customer_key
JOIN dim_product dp ON f.product_key = dp.product_key
WHERE dc.country = 'USA' AND YEAR(f.date) = 2024;

# Why dimensional is better:
# - Fewer joins = faster queries (5 joins vs 3 joins)
# - Fewer tables = simpler to understand
# - Denormalization = pre-calculated aggregations possible
# - Clear separation of concerns (facts vs dimensions)
# - Users can navigate intuitively
# - Better performance on large datasets

Task: Design a complete fact table for e-commerce with all necessary dimensions and measures. Explain the grain and justify your choices.

Show Solution

# E-Commerce Star Schema Design
# Grain: One row per order line item (not per order)

# FACT TABLE: fact_order_lines
fact_structure = {
    'Primary Key': ['order_line_key'],
    'Foreign Keys': [
        'order_key', 'customer_key', 'product_key', 
        'warehouse_key', 'date_key', 'promotion_key'
    ],
    'Measures': [
        'unit_price', 'quantity_ordered', 'discount_amount', 
        'tax_amount', 'total_amount', 'shipping_cost'
    ],
    'Degenerate Dimensions': [
        'order_number', 'line_number', 'sku'
    ]
}

# DIMENSION TABLES
dimensions = {
    'dim_date': ['date_key', 'date', 'month', 'quarter', 'year', 'day_of_week', 'is_holiday'],
    'dim_customer': ['customer_key', 'customer_id', 'name', 'segment', 'lifetime_value', 'country'],
    'dim_product': ['product_key', 'product_id', 'name', 'category', 'brand', 'supplier_key', 'price'],
    'dim_warehouse': ['warehouse_key', 'warehouse_id', 'location', 'region', 'capacity', 'manager'],
    'dim_promotion': ['promotion_key', 'promotion_id', 'type', 'discount_pct', 'start_date', 'end_date'],
    'dim_supplier': ['supplier_key', 'supplier_id', 'name', 'country', 'rating']
}

# Analysis Capabilities:
# ✓ Net sales by customer segment, product category, region
# ✓ Promotion effectiveness (impact on sales)
# ✓ Warehouse performance and efficiency
# ✓ Time-based trends and seasonality
# ✓ Supplier performance metrics
# ✓ Customer lifetime value analysis

Star Schema Design

The star schema is the most common approach to organizing analytical data. It consists of a central fact table surrounded by dimension tables, creating a structure that resembles a star when visualized. This design provides excellent query performance and intuitive business analysis.

Key Concept

Star Schema Structure

A star schema consists of one or more fact tables in the center, surrounded by dimension tables. Each dimension contains descriptive attributes, while the fact table contains foreign keys to dimensions and numerical measures.

Visual structure: Fact table at center with lines connecting to dimension tables like points on a star.

Fact Tables

Fact tables store measurable, quantifiable business events. Each row represents an occurrence at a specific grain or level of detail, such as one transaction, one daily total, or one sales order line item. Facts contain foreign keys to dimensions and numeric measures.

# Fact Table Characteristics
fact_table = {
    'name': 'fact_sales',
    'grain': 'one row per transaction',
    'foreign_keys': ['date_key', 'customer_key', 'product_key', 'store_key'],
    'measures': ['sales_amount', 'units_sold', 'discount', 'profit'],
    'properties': [
        'Many rows (millions/billions in large systems)',
        'Degenerate dimensions (order_number, invoice_number)',
        'Foreign keys only (no descriptive text)',
        'Numeric columns only in measures'
    ]
}

# Example SQL for a fact table
# CREATE TABLE fact_sales (
#   sales_key INT PRIMARY KEY,
#   date_key INT,
#   customer_key INT,
#   product_key INT,
#   sales_amount DECIMAL,
#   units_sold INT,
#   FOREIGN KEY (date_key) REFERENCES dim_date,
#   FOREIGN KEY (customer_key) REFERENCES dim_customer,
#   FOREIGN KEY (product_key) REFERENCES dim_product
# )

Dimension Tables

Dimension tables provide context for facts. They contain descriptive attributes that enable filtering, grouping, and analysis. Dimensions typically have fewer rows than fact tables and are slower to change compared to fact data.

# Dimension Table Characteristics
dimension_tables = {
    'dim_customer': {
        'primary_key': 'customer_key',
        'attributes': ['name', 'email', 'city', 'country', 'segment', 'lifetime_value'],
        'characteristics': ['Few rows (thousands)', 'Descriptive text', 'Changes slowly']
    },
    'dim_product': {
        'primary_key': 'product_key',
        'attributes': ['name', 'category', 'subcategory', 'brand', 'price', 'supplier'],
        'characteristics': ['Few rows (thousands)', 'Slowly changing', 'Multiple attributes']
    },
    'dim_store': {
        'primary_key': 'store_key',
        'attributes': ['store_name', 'city', 'region', 'opening_date', 'manager'],
        'characteristics': ['Few rows (hundreds)', 'Hierarchical', 'Static attributes']
    }
}

# Example dimension table in code
import pandas as pd

customers = pd.DataFrame({
    'customer_key': [1, 2, 3, 4, 5],
    'customer_name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'segment': ['Premium', 'Standard', 'Premium', 'Basic', 'Standard'],
    'country': ['USA', 'USA', 'Canada', 'UK', 'USA']
})

print(customers)
# Output shows dimension with attributes for analysis

Grain Definition

The grain is the level of detail of a fact table - what one row represents. Common grain levels include daily totals, individual transactions, or monthly aggregates. Defining grain clearly prevents calculation errors and data inconsistencies.

# Example: Different Grains of Sales Data
# Grain 1: Per Transaction
# One row per sales transaction
fact_sales_detail = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'date_key': [20240101, 20240101, 20240102, 20240102, 20240103],
    'customer_key': [101, 102, 101, 103, 102],
    'amount': [50, 75, 100, 125, 60]
})

# Grain 2: Per Day Per Customer
# One row per day per customer (aggregated)
fact_sales_daily = pd.DataFrame({
    'date_key': [20240101, 20240101, 20240102, 20240102, 20240103],
    'customer_key': [101, 102, 101, 103, 102],
    'total_amount': [50, 75, 100, 125, 60]
})

# Grain 3: Per Month Per Customer
# One row per month per customer
fact_sales_monthly = pd.DataFrame({
    'month': ['2024-01', '2024-01', '2024-01', '2024-02', '2024-02'],
    'customer_key': [101, 102, 103, 101, 102],
    'monthly_total': [150, 135, 125, 200, 180]
})

# Key principle:
# - Choose grain based on analytical needs
# - Maintain consistent grain within one fact table
# - Don't mix grains (e.g., daily and monthly)

Important: Never mix different grains in a single fact table. This leads to incorrect aggregations and confusing analysis results.

Practice: Star Schemas

Task: Identify and describe the two main table types in star schemas and explain their roles.

Show Solution

# Answer: Fact and Dimension Tables

# FACT TABLES:
# - Store measurable business events
# - Many rows (millions or billions in large systems)
# - Foreign keys to dimensions
# - Numeric measures (amounts, counts, rates)
# - Examples: sales, inventory, transactions

# DIMENSION TABLES:
# - Provide context and descriptive attributes
# - Few rows (thousands or less)
# - Support filtering and grouping
# - Text and categorical values
# - Examples: customer, product, date, location

# Visual Star Structure:
#           dim_customer
#                |
#   dim_product--+--fact_sales--dim_date
#                |
#           dim_store

# Key Relationship:
# One dimension row → Many fact rows (one-to-many)
# Example: One customer appears in many sales records

Task: Explain the problems with mixing grains and demonstrate why separate tables are better.

Show Solution

# PROBLEM: Mixing Transaction and Daily Data
# If you mix both grains in one fact table:

# Mixed Fact Table (BAD):
# |transaction_id|amount|is_daily_aggregate|
# |    NULL      | 100  |       0          |  <- transaction-level
# |    NULL      | 250  |       1          |  <- daily aggregate
# |    NULL      | 50   |       0          |  <- transaction-level

# Query Result is WRONG:
SELECT SUM(amount) FROM fact_sales
WHERE date = '2024-01-01'
# Result: 400 (includes duplicates and aggregates!)

# SOLUTION: Separate Tables by Grain

# Grain 1: Transaction Level
fact_sales_detail = {
    'sales_key', 'date_key', 'customer_key', 
    'amount', 'units_sold'  # One row per transaction
}

# Grain 2: Daily Level
fact_sales_daily = {
    'daily_key', 'date_key', 'customer_key',
    'total_amount', 'total_units'  # One row per day per customer
}

# Benefits:
# ✓ No double-counting or confusion
# ✓ Clear semantics (users know which table to use)
# ✓ Correct aggregations guaranteed
# ✓ Better query performance (daily data is smaller)

Task: Create a complete star schema for a college with fact and dimension tables. Define the grain, keys, and all dimensions needed for analysis.

Show Solution

# College Course Enrollment Star Schema

# FACT TABLE: fact_enrollments
# Grain: One row per student enrollment per course per semester

fact_enrollments = {
    'Primary Key': 'enrollment_key',
    'Foreign Keys': [
        'student_key',
        'course_key',
        'semester_key',
        'instructor_key',
        'department_key'
    ],
    'Measures': [
        'credit_hours',
        'tuition_amount',
        'final_grade_points',
        'attendance_percentage'
    ]
}

# DIMENSION TABLES

dim_student = {
    'Primary Key': 'student_key',
    'Attributes': [
        'student_id', 'student_name', 'major',
        'class_level', 'gpa', 'enrollment_status'
    ]
}

dim_course = {
    'Primary Key': 'course_key',
    'Attributes': [
        'course_id', 'course_name', 'credit_hours',
        'department', 'difficulty_level', 'capacity'
    ]
}

dim_semester = {
    'Primary Key': 'semester_key',
    'Attributes': [
        'semester_id', 'year', 'term',
        'start_date', 'end_date', 'is_current'
    ]
}

dim_instructor = {
    'Primary Key': 'instructor_key',
    'Attributes': [
        'instructor_id', 'name', 'department',
        'tenure_status', 'specialty'
    ]
}

dim_department = {
    'Primary Key': 'department_key',
    'Attributes': [
        'department_id', 'department_name',
        'college', 'budget_code'
    ]
}

# ANALYSIS ENABLED:
# ✓ Student performance by course and semester
# ✓ Course popularity and enrollment trends
# ✓ Instructor workload and effectiveness
# ✓ Department enrollment patterns
# ✓ Prerequisite validation and progression analysis

Relationships and Cardinality

Table relationships define how facts and dimensions connect. Understanding cardinality - the number of unique values in a relationship - is critical for avoiding incorrect aggregations and ensuring model accuracy. Power BI relationships can be one-to-one, one-to-many, many-to-one, or many-to-many.

Key Concept

Cardinality

Cardinality describes the relationship between two tables. It specifies how many rows in one table can match how many rows in another. In Power BI, cardinality options are 1-to-many (*:1), many-to-one (1:*), one-to-one (1:1), and many-to-many (*:*).

Why it matters: Incorrect cardinality causes formula errors, double-counting in aggregations, and incorrect filtered values in related tables.

One-to-Many Relationships

The most common relationship type in star schemas. One fact table row relates to many dimension rows, and one dimension row relates to one fact row. For example, one customer can have many sales orders, but each order belongs to exactly one customer.

# One-to-Many Example: Customer to Sales
# One customer can have multiple orders
# Each order belongs to one customer

# Dimension Table
dim_customer = pd.DataFrame({
    'customer_key': [1, 2, 3],
    'customer_name': ['Alice', 'Bob', 'Charlie'],
    'country': ['USA', 'USA', 'Canada']
})

# Fact Table (many sides)
fact_sales = pd.DataFrame({
    'sales_key': [101, 102, 103, 104, 105],
    'customer_key': [1, 1, 2, 3, 1],  # Alice appears 3 times
    'amount': [100, 50, 75, 200, 125]
})

# Relationship: customer_key in fact_sales → customer_key in dim_customer
# This is the STANDARD relationship in star schemas
# Flow of filtering: Dimension → Fact (one direction)

One-to-One Relationships

Less common in analytical models. Used when each row in one table matches exactly one row in another. For example, if each employee has exactly one unique ID card.

# One-to-One Example: Employee to Employee Details
# Each employee has exactly one set of personal details
# Each personal detail record belongs to exactly one employee

employee_main = pd.DataFrame({
    'employee_key': [1, 2, 3],
    'name': ['John', 'Jane', 'Jack'],
    'department_key': [10, 20, 10]
})

employee_details = pd.DataFrame({
    'employee_key': [1, 2, 3],
    'ssn': ['123-45-6789', '987-65-4321', '555-55-5555'],
    'hire_date': ['2020-01-15', '2021-03-22', '2022-06-10']
})

# Usually, it's better to combine these into one table
# Use 1:1 only when fields change at different rates

Many-to-Many Relationships

Complex relationships where multiple rows in one table can relate to multiple rows in another. Power BI supports many-to-many relationships through bridge tables or direct relationships with special handling.

# Many-to-Many Example: Students to Courses
# Many students can enroll in many courses
# Many courses can have many students

# Problem without bridge table:
# Can't directly connect student and course

# Solution 1: Bridge Table (Recommended)
bridge_enrollment = pd.DataFrame({
    'enrollment_key': [1, 2, 3, 4, 5],
    'student_key': [1, 1, 2, 2, 3],
    'course_key': [101, 102, 101, 103, 102],
    'grade': ['A', 'B', 'A', 'C', 'B']
})

# Relationships:
# dim_student (1) → bridge_enrollment (many)
# dim_course (1) → bridge_enrollment (many)
# fact_grades uses bridge_enrollment as fact table

# Solution 2: Direct Many-to-Many (Power BI Feature)
# Direct relationship between dim_student and dim_course
# Use with caution - can cause performance issues

Cross-Filter Direction

In Power BI, relationships have a filter direction. Single direction (default) filters from one table to another. Bidirectional filters flow both ways, which can impact performance.

# Cross-Filter Direction Examples

# Single Direction (Default): dim_customer → fact_sales
# Selecting a customer filters the related sales
# Selecting a sale does NOT filter the customer list
# Better for performance, prevents circular dependencies

# Bidirectional: Filters flow both ways
# Selecting a customer filters sales
# Selecting a sale filters the customer list
# More intuitive for some analyses
# But can cause performance issues and calculation errors

# In Power BI SQL:
# Single: dim_customer filters fact_sales
# SELECT SUM(amount) FROM fact_sales
# WHERE customer_key IN (selected customer)

# Bidirectional: Both tables filter each other
# More complex filter context, slower queries

Warning: Bidirectional relationships can create circular dependencies and performance problems. Use single-direction relationships by default and only switch to bidirectional when necessary.

Practice: Relationships

Task: Define cardinality and describe the four main types of relationships.

Show Solution

# Answer: Cardinality defines the multiplicity of a relationship

# FOUR MAIN TYPES:

# 1. One-to-Many (*:1) - MOST COMMON IN STAR SCHEMAS
# - One customer can have many orders
# - One order belongs to one customer
# Example: dim_customer → fact_sales

# 2. Many-to-One (1:*) - Same as 1:many, different perspective
# - Same concept, opposite direction view

# 3. One-to-One (1:1) - RARE IN ANALYTICS
# - One employee has one passport
# - One passport belongs to one employee
# Use only when fields change at different rates

# 4. Many-to-Many (*:*) - REQUIRES BRIDGE TABLE
# - Many students can take many courses
# - Many courses can have many students
# Solution: Create bridge/junction table

# IMPORTANCE:
# ✓ Incorrect cardinality causes formula errors
# ✓ Double-counting in aggregations
# ✓ Incorrect filtered values in related tables
# ✓ Model performance issues

Task: Explain the data quality issues with direct many-to-many relationships and demonstrate the bridge table solution.

Show Solution

# PROBLEM: Direct Many-to-Many Creates Duplicate Rows

# Student Table:
# |student_key|name   |
# |    1      |Alice  |
# |    2      |Bob    |

# Course Table:
# |course_key|name        |
# |   101    |Math        |
# |   102    |English     |

# Direct join without bridge: WRONG
# You get artificial cross-product (4 rows from 2+2)
# Alice appears once per course
# Bob appears once per course
# This creates duplicate records and ambiguity

# SOLUTION: Create Bridge Table (Enrollment)

enrollment_bridge = {
    'enrollment_key': [1, 2, 3, 4],
    'student_key': [1, 1, 2, 2],
    'course_key': [101, 102, 101, 102],
    'grade': ['A', 'B', 'B', 'A']
}

# Relationships become clear:
# dim_student (1) ---> enrollment_bridge (many)
# dim_course (1) ---> enrollment_bridge (many)
# Each enrollment row is unique and unambiguous

# Benefits:
# ✓ One row per actual relationship
# ✓ No artificial duplication
# ✓ Can store relationship attributes (grade)
# ✓ Proper aggregations and calculations

Task: Create a complete relationship model allowing simultaneous filtering by multiple dimensions. Specify relationship types, cardinality, and filter direction.

Show Solution

# Multi-Dimensional Star Schema with Proper Relationships

# SCHEMA STRUCTURE:
#           dim_customer
#                |
#   dim_product--+--fact_sales--dim_date
#                |
#           dim_store

# FACT TABLE: fact_sales
fact_sales = {
    'Primary Key': 'sales_key',
    'Foreign Keys': [
        'customer_key',    # Links to dim_customer
        'product_key',     # Links to dim_product
        'date_key',        # Links to dim_date
        'store_key'        # Links to dim_store
    ],
    'Measures': [
        'amount',
        'quantity',
        'discount'
    ]
}

# RELATIONSHIPS (All One-to-Many from dimensions to fact)
relationships = [
    {
        'Name': 'Customer to Sales',
        'From': 'dim_customer[customer_key]',
        'To': 'fact_sales[customer_key]',
        'Cardinality': 'One-to-Many (*:1)',
        'Filter Direction': 'Single (dimension → fact)',
        'Is Active': True
    },
    {
        'Name': 'Product to Sales',
        'From': 'dim_product[product_key]',
        'To': 'fact_sales[product_key]',
        'Cardinality': 'One-to-Many (*:1)',
        'Filter Direction': 'Single',
        'Is Active': True
    },
    {
        'Name': 'Date to Sales',
        'From': 'dim_date[date_key]',
        'To': 'fact_sales[date_key]',
        'Cardinality': 'One-to-Many (*:1)',
        'Filter Direction': 'Single',
        'Is Active': True
    },
    {
        'Name': 'Store to Sales',
        'From': 'dim_store[store_key]',
        'To': 'fact_sales[store_key]',
        'Cardinality': 'One-to-Many (*:1)',
        'Filter Direction': 'Single',
        'Is Active': True
    }
]

# ANALYSIS CAPABILITY:
# Users can simultaneously filter by:
# ✓ Customer (select Alice)
# ✓ Product (select Widget)
# ✓ Date (select 2024)
# ✓ Store (select NYC)
# Result: Sales of Widget to Alice in NYC during 2024

# BEST PRACTICES:
# ✓ All relationships are one-to-many (optimal)
# ✓ Single-direction filters (better performance)
# ✓ One active relationship per dimension
# ✓ All foreign keys point FROM fact TO dimensions

Advanced Modeling Techniques

Advanced modeling techniques optimize performance, handle complex business scenarios, and improve analysis flexibility. Role-playing dimensions, bridge tables, and separate date tables are essential patterns for real-world analytical models.

Role-Playing Dimensions

A role-playing dimension is a single dimension table that is referenced multiple times in a fact table with different meanings. The classic example is a date dimension that can represent order date, ship date, and due date. This is memory-efficient and ensures consistency across different date contexts.

# Role-Playing Dimension Example: Date Dimension Used Multiple Ways
# One dim_date table, multiple roles in fact_orders

# dim_date table (single copy)
dim_date = pd.DataFrame({
    'date_key': [20240101, 20240102, 20240103],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'month': ['January', 'January', 'January'],
    'quarter': ['Q1', 'Q1', 'Q1'],
    'year': [2024, 2024, 2024]
})

# fact_orders uses date_key multiple times
fact_orders = pd.DataFrame({
    'order_key': [1, 2, 3],
    'order_date_key': [20240101, 20240101, 20240102],
    'ship_date_key': [20240103, 20240104, 20240105],
    'due_date_key': [20240110, 20240110, 20240112],
    'customer_key': [101, 102, 103],
    'amount': [500, 750, 600]
})

# Relationships in Power BI:
# dim_date[date_key] → fact_orders[order_date_key] (Active)
# dim_date[date_key] → fact_orders[ship_date_key] (Inactive)
# dim_date[date_key] → fact_orders[due_date_key] (Inactive)

# Using USERELATIONSHIP to analyze by ship date:
# Analyze by Ship Date
# CALCULATE(SUM(fact_orders[amount]),
#   USERELATIONSHIP(fact_orders[ship_date_key], dim_date[date_key]))

Slowly Changing Dimensions

Dimensions change over time (customer address, product price, employee department). Slowly changing dimensions (SCD) are techniques for capturing these changes while maintaining historical accuracy in analysis.

# Type 2 SCD: Maintain Historical Records
# Create new dimension rows when attributes change
# Add version and date-range attributes

# dim_customer with SCD Type 2
dim_customer_scd = pd.DataFrame({
    'customer_key': [1, 2, 3, 1],  # Same customer, different record
    'customer_id': ['C001', 'C002', 'C003', 'C001'],
    'name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Alice Johnson'],
    'version': [1, 1, 1, 2],
    'effective_date': ['2023-01-01', '2023-01-01', '2023-01-01', '2024-03-15'],
    'end_date': ['2024-03-14', '9999-12-31', '9999-12-31', '9999-12-31'],
    'is_current': [False, True, True, True]
})

# Now, when analyzing historical data:
# Use version 1 for 2023 data (when Alice Smith was married)
# Use version 2 for 2024 data (when Alice Johnson after marriage)

# Query with SCD Type 2:
# SELECT SUM(amount), customer_name
# FROM fact_sales
# JOIN dim_customer USING (customer_key)
# WHERE effective_date <= transaction_date
#   AND end_date > transaction_date
# GROUP BY customer_name

Junk Dimensions

When you have many low-cardinality boolean or categorical attributes (is_rush_order, payment_method, delivery_type), combining them into a single junk dimension reduces fact table width and improves performance.

# Junk Dimension Example: Order Flags
# Instead of many small fact columns:
# fact_sales[is_rush], [is_paid], [is_discounted], etc.

# Create junk dimension:
dim_order_flags = pd.DataFrame({
    'flag_key': [1, 2, 3, 4, 5, 6, 7, 8],
    'is_rush_order': [0, 0, 0, 0, 1, 1, 1, 1],
    'payment_method': ['Credit', 'Credit', 'Debit', 'Debit', 'Credit', 'Credit', 'Debit', 'Debit'],
    'delivery_type': ['Standard', 'Express', 'Standard', 'Express', 'Standard', 'Express', 'Standard', 'Express'],
    'is_promo_code': [0, 0, 1, 1, 0, 0, 1, 1]
})

# Now fact table is simpler:
fact_sales = pd.DataFrame({
    'sales_key': [1, 2, 3, 4],
    'customer_key': [101, 102, 101, 103],
    'flag_key': [1, 3, 7, 2],  # References junk dimension
    'amount': [100, 250, 150, 300]
})

# Benefits:
# - Fewer columns in fact table
# - Faster inserts
# - Easier to analyze by combinations (rush + express + promo)
# - Reduces data redundancy

Conformed Dimensions

Conformed dimensions are reused across multiple fact tables, ensuring consistent analysis across different processes. For example, using the same customer and product dimensions across sales, returns, and exchanges.

# Conformed Dimensions Example
# Same dim_customer and dim_product used in multiple facts

# Fact Tables - Different Processes
fact_sales = pd.DataFrame({
    'sales_key': [1, 2],
    'customer_key': [101, 102],
    'product_key': [501, 502],
    'amount': [100, 250]
})

fact_returns = pd.DataFrame({
    'return_key': [1, 2],
    'customer_key': [101, 102],  # Same customer dimension
    'product_key': [501, 503],   # Same product dimension
    'amount': [-50, -75]
})

# Shared Dimension Tables
dim_customer = pd.DataFrame({
    'customer_key': [101, 102],
    'name': ['Alice', 'Bob'],
    'segment': ['Premium', 'Standard']
})

dim_product = pd.DataFrame({
    'product_key': [501, 502, 503],
    'name': ['Product A', 'Product B', 'Product C'],
    'category': ['Electronics', 'Electronics', 'Books']
})

# Benefits of conforming:
# - Consistent customer definitions across processes
# - Unified product hierarchy
# - Can analyze net sales (sales - returns)
# - Single version of truth for dimensions

Best Practice: Always conform your dimensions. A customer should have the same key and attributes whether appearing in sales, returns, or support fact tables.

Practice: Advanced Patterns

Task: Define role-playing dimensions, provide examples, and explain the implementation approach.

Show Solution

# Answer: Role-Playing Dimension Definition
# A single dimension table referenced multiple times with different meanings

# CLASSIC EXAMPLE: Date Dimension
# One dim_date table can represent:
# - Order Date (when customer placed order)
# - Ship Date (when order left warehouse)
# - Delivery Date (when customer received order)
# - Due Date (when customer must pay)

# Memory Efficient Approach:
# Instead of 4 separate date tables, use 1 table with multiple relationships

# fact_orders structure:
# |order_key|order_date_key|ship_date_key|delivery_date_key|customer_key|amount|
# |   1     |   20240101   |  20240103   |   20240108      |    101     | 500  |

# Power BI Relationships:
# 1. dim_date[date_key] → fact_orders[order_date_key] (ACTIVE)
# 2. dim_date[date_key] → fact_orders[ship_date_key] (INACTIVE)
# 3. dim_date[date_key] → fact_orders[delivery_date_key] (INACTIVE)

# Using USERELATIONSHIP for Analysis:
# -- Analyze by Ship Date (not default)
# Analyze by Ship Date
# CALCULATE(SUM(fact_orders[amount]),
#   USERELATIONSHIP(fact_orders[ship_date_key], dim_date[date_key]))

# OTHER EXAMPLES:
# - Employee dimension: Manager, Supervisor, Interviewer
# - Location dimension: Billing Address, Shipping Address, Headquarters
# - Date for: Period Date, Due Date, Completed Date

# BENEFITS:
# ✓ One copy of date data (memory efficient)
# ✓ Consistent definitions across roles
# ✓ Same attributes for all dates (month, quarter, year)
# ✓ Easier maintenance

Task: Define SCD Type 2, provide an example with version history, and explain the use cases.

Show Solution

# Answer: Type 2 SCD - Creates New Rows for Changes

# EXAMPLE: Product Price Changes Over Time

# dim_product with SCD Type 2:
# |product_key|product_id|name  |price|version|effective_date|end_date   |is_current|
# |    1      |  P001    |Widget| 10  |   1   | 2024-01-01   |2024-06-30 |  False   |
# |    2      |  P001    |Widget| 12  |   2   | 2024-07-01   |9999-12-31 |  True    |

# WHEN TO USE:
# ✓ Need to track history of dimension attributes
# ✓ Accuracy of historical analysis is important
# ✓ Want to know "what was the product price when sold?"
# ✓ Managing slowly changing attributes (not frequently)

# WHEN NOT TO USE:
# ✗ Dimension changes very frequently (daily, hourly)
# ✗ Don't need historical tracking
# ✗ Storage space is constrained
# ✗ Changes are cosmetic (formatting, capitalization)

# IMPLEMENTATION:
# 1. Add version column to dimension
# 2. Add effective_date and end_date columns
# 3. Add is_current flag for easier filtering
# 4. When attribute changes, insert new row with new version
# 5. Close old row with end_date

# Query with SCD Type 2:
# SELECT SUM(amount), product_name
# FROM fact_sales f
# JOIN dim_product d ON f.product_key = d.product_key
#   AND f.order_date BETWEEN d.effective_date AND d.end_date
# WHERE d.category = 'Electronics'
# GROUP BY product_name

# BENEFITS:
# ✓ Maintains complete history
# ✓ Accurate historical reporting
# ✓ Can answer "what changed and when?"
# ✓ Supports audit requirements

Task: Create a complete retail model with multiple fact tables, conformed dimensions, and advanced patterns. Include SCD, role-playing dimensions, and junk dimensions.

Show Solution

# COMPREHENSIVE RETAIL DATA MODEL

# ═════════════════════════════════════════════════════════════
# CONFORMED DIMENSIONS (Shared across multiple facts)
# ═════════════════════════════════════════════════════════════

conformed_dimensions = {
    'dim_customer': [
        'customer_key', 'customer_id', 'name', 'segment',
        'city', 'country', 'lifetime_value'
    ],
    'dim_product': [
        'product_key', 'product_id', 'name', 'category',
        'brand', 'supplier_key', 'price', 'version',
        'effective_date', 'end_date'  # SCD Type 2
    ],
    'dim_store': [
        'store_key', 'store_id', 'store_name', 'region',
        'manager', 'opening_date'
    ],
    'dim_date': [
        'date_key', 'date', 'month', 'quarter', 'year',
        'is_holiday', 'day_of_week'
    ],
    'dim_supplier': [
        'supplier_key', 'supplier_id', 'supplier_name',
        'country', 'rating'
    ]
}

# ═════════════════════════════════════════════════════════════
# FACT TABLES (Different processes, different grains)
# ═════════════════════════════════════════════════════════════

fact_tables = {
    'fact_sales': {
        'grain': 'One row per transaction line item',
        'keys': [
            'customer_key', 'product_key', 'store_key',
            'order_date_key', 'flag_key'  # junk dimension
        ],
        'measures': [
            'quantity', 'unit_price', 'discount', 'total_amount'
        ]
    },
    'fact_returns': {
        'grain': 'One row per returned item',
        'keys': [
            'customer_key', 'product_key', 'store_key',
            'return_date_key'
        ],
        'measures': [
            'quantity_returned', 'refund_amount', 'reason_code'
        ]
    },
    'fact_inventory': {
        'grain': 'One row per product per store per day',
        'keys': ['product_key', 'store_key', 'date_key'],
        'measures': [
            'quantity_on_hand', 'quantity_reserved', 'reorder_point'
        ]
    },
    'fact_supplier_orders': {
        'grain': 'One row per supplier order line',
        'keys': ['supplier_key', 'product_key', 'date_key'],
        'measures': [
            'quantity_ordered', 'cost', 'lead_time_days'
        ]
    }
}

# ═════════════════════════════════════════════════════════════
# SPECIALIZED DIMENSIONS
# ═════════════════════════════════════════════════════════════

specialized_dimensions = {
    'dim_return_reason': [
        'return_reason_key', 'reason', 'category', 'is_warranty'
    ],
    'dim_order_flags': {  # Junk Dimension
        'flag_key': 'composite key',
        'attributes': ['is_rush_order', 'is_promo', 'payment_method']
    }
}

# ═════════════════════════════════════════════════════════════
# ROLE-PLAYING DIMENSIONS
# ═════════════════════════════════════════════════════════════

# dim_date used in multiple roles:
role_playing_dates = {
    'fact_sales': {
        'order_date_key': 'ACTIVE relationship',
        'ship_date_key': 'INACTIVE relationship',
        'delivery_date_key': 'INACTIVE relationship'
    }
}

# ═════════════════════════════════════════════════════════════
# ANALYSIS CAPABILITIES
# ═════════════════════════════════════════════════════════════

analysis_capabilities = [
    'Net Sales (Sales - Returns) by product, store, date',
    'Inventory Turnover (COGS / Avg Inventory)',
    'Supplier Performance (on-time delivery, quality)',
    'Customer Segmentation (lifetime value, frequency)',
    'Category Performance with trends',
    'Regional comparison and benchmarking',
    'Return rate analysis by reason',
    'Promotion effectiveness analysis'
]

# ═════════════════════════════════════════════════════════════
# PERFORMANCE OPTIMIZATIONS
# ═════════════════════════════════════════════════════════════

optimizations = [
    'Create aggregate tables for common queries',
    'Use role-playing dimensions to avoid duplication',
    'Implement SCD Type 2 for product tracking',
    'Use junk dimension for order flags',
    'Index fact tables on foreign keys',
    'Maintain single-direction relationships',
    'Consider materialized views for complex queries'
]

Key Takeaways

Star Schema

Organize data with a central fact table and surrounding dimensions for optimal analytics performance

Relationships Matter

Define correct cardinality and directionality to ensure accurate aggregations and calculations

Fact Tables

Store measurable events with foreign keys to dimensions, containing granular business metrics

Dimensions

Describe the context of facts using attributes that enable filtering, grouping, and analysis

Performance Optimization

Use summarization tables, indexing, and efficient relationships to speed up query execution

Advanced Patterns

Apply role-playing dimensions and bridge tables for complex analytical scenarios

What You'll Learn

Contents

Data Modeling Concepts

What is Data Modeling?

OLTP vs OLAP Design

OLTP (Transactional)

OLAP (Analytical)

Core Components

Key Modeling Principles

Practice: Core Concepts

Star Schema Design

Star Schema Structure

Fact Tables

Dimension Tables

Grain Definition

Practice: Star Schemas

Relationships and Cardinality

Cardinality

One-to-Many Relationships

One-to-One Relationships

Many-to-Many Relationships

Cross-Filter Direction

Practice: Relationships

Advanced Modeling Techniques

Role-Playing Dimensions

Slowly Changing Dimensions

Junk Dimensions

Conformed Dimensions

Practice: Advanced Patterns

Key Takeaways

Star Schema

Relationships Matter

Fact Tables

Dimensions

Performance Optimization

Advanced Patterns

Knowledge Check

Interactive: Cardinality Explorer

ID	Name
1	Alice
2	Bob
3	Charlie

ID	Customer	Amount
101	1	$100
102	1	$50
103	2	$75
104	3	$200

Data Modeling Fundamentals

What You'll Learn

Contents

Data Modeling Concepts

What is Data Modeling?

OLTP vs OLAP Design

OLTP (Transactional)

OLAP (Analytical)

Core Components

Key Modeling Principles

Practice: Core Concepts

Easy What is the main difference between OLTP and OLAP databases?

Medium Why is dimensional modeling better for analytics than fully normalized tables?

Hard Design a fact table for an e-commerce system with dimensions

Star Schema Design

Star Schema Structure

Fact Tables

Dimension Tables

Grain Definition

Practice: Star Schemas

Easy What are the two main types of tables in a star schema?

Medium Why use separate fact tables for different grains instead of one combined table?

Hard Design a star schema for a college course enrollment system

Relationships and Cardinality

Cardinality

One-to-Many Relationships

One-to-One Relationships

Many-to-Many Relationships

Cross-Filter Direction

Practice: Relationships

Easy What is cardinality in a data model?

Medium Why does many-to-many relationship create problems without a bridge table?

Hard Design relationships for analyzing sales by customer AND product

Advanced Modeling Techniques

Role-Playing Dimensions

Slowly Changing Dimensions

Junk Dimensions

Conformed Dimensions

Practice: Advanced Patterns

Easy What is a role-playing dimension and give an example

Medium Explain Type 2 Slowly Changing Dimensions and when to use them

Hard Design comprehensive retail data model with conformed dimensions

Key Takeaways

Star Schema

Relationships Matter

Fact Tables

Dimensions

Performance Optimization

Advanced Patterns

Knowledge Check

Interactive: Cardinality Explorer