Introduction to Data Cleaning
Data cleaning is the process of detecting and correcting errors, inconsistencies, and inaccuracies in datasets. Studies show that analysts spend 60-80% of their time cleaning data before analysis. Mastering these techniques will dramatically improve your efficiency and ensure accurate insights.
Data Cleaning
The process of identifying and correcting (or removing) corrupt, inaccurate, incomplete, irrelevant, or improperly formatted data from a dataset to improve data quality and reliability for analysis.
Common Data Quality Issues
Real-world data rarely arrives in perfect condition. Understanding common issues helps you spot and fix them quickly:
| Issue Type | Example | Impact |
|---|---|---|
| Missing Values | Empty cells in customer phone numbers | Incomplete analysis, cannot contact customers |
| Duplicate Records | Same customer entered twice | Inflated counts, skewed averages |
| Inconsistent Formatting | "New York", "NY", "new york" | Difficult to group and analyze |
| Extra Spaces | " John Smith " (leading/trailing spaces) | Matching and sorting problems |
| Wrong Data Type | Numbers stored as text: "1000" | Calculations fail, sorting incorrect |
| Spelling Errors | "Californa" instead of "California" | Categories split incorrectly |
Data Cleaning Workflow
Follow this systematic approach to clean your data efficiently:
Inspect the Data
Review the dataset to identify obvious issues. Check for missing values, strange characters, inconsistent formats.
Remove Duplicates
Identify and eliminate duplicate records that could skew your analysis results.
Fix Structural Issues
Standardize formats, correct typos, remove extra spaces, and ensure consistent naming conventions.
Validate Data
Verify data types, check ranges, and ensure values make sense in their context.
Excel's Data Cleaning Toolkit
Excel provides a comprehensive set of tools for data cleaning. Here's your essential toolkit:
Find & Replace
Search and fix patterns across entire datasetsText Functions
TRIM, PROPER, UPPER, LOWER, CLEANRemove Duplicates
Eliminate redundant records automaticallyFilters
Isolate and examine specific data subsetsText to Columns
Split data into separate fieldsData Validation
Prevent errors before they occurPractice Questions: Introduction to Data Cleaning
Test your understanding with these hands-on exercises.
Scenario: You receive a customer dataset with the following records. Identify all data quality issues:
Name Email Phone City
John Smith john@email.com 555-1234 New York
Jane Doe JANE@EMAIL.COM 5551234 new york
John Smith john@email.com 555-1234 New York
Bob Johnson bob@email (555) 123-4567 NY
Sarah Lee sarah@email.com 555.1234 New york
Solution:
Issues identified:
- Extra spaces: " Jane Doe" has leading space, "Sarah Lee" has double space
- Inconsistent formatting: Emails in different cases (lowercase vs UPPERCASE)
- Duplicate record: John Smith appears twice with identical data
- Incomplete data: Bob's email is missing domain
- Format variations: Phone numbers in different formats
- Inconsistent city names: "New York", "new york", "NY" all represent same city
In the next sections, we'll learn how to fix each of these issues systematically using Excel's tools.
Task: You have a sales dataset with 10,000 rows containing: duplicate orders, missing customer names, inconsistent product codes, and extra spaces in all text fields. Rank these issues by cleaning priority and explain your reasoning.
Show Solution
Recommended priority order:
- Remove duplicate orders (HIGH): Duplicates directly affect revenue calculations and inventory counts. Fix first to avoid grossly incorrect totals.
- Standardize product codes (HIGH): Inconsistent codes prevent proper grouping and analysis. Essential for accurate reporting.
- Remove extra spaces (MEDIUM): Spaces cause matching issues but don't affect numerical calculations. Clean before doing text-based analysis.
- Fill missing customer names (LOW): Missing names are problematic but don't prevent order analysis. Can use customer ID for most analyses.
General rule: Prioritize issues that affect calculations and aggregations first, then handle formatting issues, finally address missing data that has workarounds.
Task: Create a data cleaning checklist for a new employee database that includes: names, email addresses, phone numbers, hire dates, and departments. List specific checks to perform for each field.
Show Solution
| Field | Cleaning Checks |
|---|---|
| Names |
✓ Remove extra spaces (TRIM) ✓ Standardize capitalization (PROPER) ✓ Check for special characters ✓ Verify no numbers in names |
|
✓ Standardize to lowercase ✓ Verify @ symbol present ✓ Check for domain (.com, .org, etc.) ✓ Remove duplicates |
|
| Phone |
✓ Standardize format (xxx-xxx-xxxx) ✓ Remove non-numeric characters ✓ Verify 10 digits ✓ Check for missing values |
| Hire Date |
✓ Verify date format consistency ✓ Check dates not in future ✓ Ensure dates after company founding ✓ Convert text dates to date format |
| Department |
✓ Standardize department names ✓ Fix spelling variations ✓ Create dropdown validation list ✓ Check against approved dept list |
Finding and Fixing Errors
Excel provides powerful tools to identify and correct data quality issues. From Find & Replace to conditional formatting, you can quickly spot problems and fix them systematically across large datasets.
Find & Replace Basics
The Find & Replace tool (Ctrl+H) is your first line of defense against data inconsistencies. It can fix thousands of errors in seconds.
Keyboard Shortcuts
| Ctrl + F | Open Find dialog |
| Ctrl + H | Open Find & Replace |
| Alt + A | Replace All |
| Alt + R | Replace (single) |
Common Options
- Match case: Distinguish between uppercase and lowercase
- Match entire cell: Find exact matches only
- Within: Search Sheet or Workbook
- Search: By Rows or By Columns
- Look in: Formulas, Values, or Comments
Using Wildcards
Wildcards enable pattern-based searching, making Find & Replace exponentially more powerful:
| Wildcard | Meaning | Example | Matches |
|---|---|---|---|
* |
Any number of characters | Find: data* |
"data", "database", "data science" |
? |
Single character | Find: sm?th |
"smith", "smyth" |
~ |
Literal wildcard character | Find: ~* |
Actual asterisk symbol "*" |
Practical Wildcard Examples:
Problem: Phone numbers in mixed formats
(555) 123-4567
555-123-4567
555.123.4567
Find & Replace Steps:
1. Find: (*)*(*)* Replace with: (leave empty)
- Removes all parentheses, hyphens, periods
2. Result: 5551234567
3. Find: (???)???-???? (use formula after cleaning)
- =TEXT(A2,"(000) 000-0000")
- Creates: (555) 123-4567
Problem: States in different formats
California, CA, Calif, Calif.
Solution - Replace each variation:
Find: California Replace: CA
Find: Calif Replace: CA
Find: Calif. Replace: CA
Tip: Check "Match entire cell" to avoid
replacing partial matches
Problem: Product codes with unwanted prefixes
OLD-12345, OLD-67890, OLD-11111
Solution:
Find: OLD- Replace: (leave empty)
Result: 12345, 67890, 11111
Or use wildcard for all prefixes:
Find: *- Replace: (leave empty)
Conditional Formatting for Error Detection
Conditional formatting visually highlights potential problems, making them easy to spot in large datasets:
Steps:
1. Select your data range
2. Home tab → Conditional Formatting
3. Highlight Cells Rules → Duplicate Values
4. Choose formatting (e.g., Red Fill)
Result: All duplicate entries highlighted in red
Advanced: Highlight duplicates EXCEPT first occurrence
Formula: =COUNTIF($A$2:$A2,$A2)>1
Apply to: $A$2:$A$100
Scenario: Age column should only contain values 18-65
Steps:
1. Select age column (B2:B100)
2. Conditional Formatting → New Rule
3. Use formula: =OR(B2<18, B2>65)
4. Set format: Red background
Result: Invalid ages highlighted automatically
Common validation formulas:
- Empty cells: =ISBLANK(A2)
- Non-numeric: =NOT(ISNUMBER(A2))
- Too long: =LEN(A2)>50
- Contains numbers in text: =SUMPRODUCT(--ISNUMBER(--MID(A2,ROW($1:$50),1)))>0
Practice Questions: Finding and Fixing Errors
Test your understanding with these hands-on exercises.
Dataset: You have email addresses with inconsistent domains that need standardization.
john@gmail.com
jane@GMAIL.COM
bob@gmail
sarah@gmail.co
mike@yahoo.com
Task: Standardize all Gmail addresses to lowercase "gmail.com" format.
Show Solution
Steps:
1. Find: GMAIL.COM Replace: gmail.com
(Fixes uppercase)
2. Find: @gmail Replace: @gmail.com
Options: Match entire cell = OFF
(Completes incomplete domains)
3. Find: @gmail.co Replace: @gmail.com
(Fixes common typo)
Result: All Gmail addresses standardized to @gmail.com
Challenge: Use wildcards to clean these product SKUs. Remove all prefix variations but keep the numeric code.
PROD-12345
SKU-67890
ITEM-11111
CODE-99999
Desired result: Just the numbers (12345, 67890, 11111, 99999)
Show Solution
Method 1: Single Wildcard Replace
Find: *-
Replace: (leave empty)
This removes everything before and including the hyphen.
Method 2: Formula Approach (More Flexible)
=RIGHT(A2,5)
or
=VALUE(RIGHT(A2,5)) -- Converts to number
Method 3: Text to Columns
1. Select column
2. Data → Text to Columns
3. Choose Delimited → Next
4. Select Other, enter: -
5. Finish
6. Delete first column (prefixes)
Each method works - choose based on whether you need to keep original data or if the pattern changes.
Scenario: Create a conditional formatting rule that highlights email addresses that might be invalid. An email should have:
- Exactly one @ symbol
- At least one character before @
- At least one dot after @
- At least two characters after the last dot
Solution:
Complex Formula Approach:
=OR(
LEN(A2)-LEN(SUBSTITUTE(A2,"@",""))<>1,
FIND("@",A2)<=1,
ISNUMBER(SEARCH("*@*.*",A2))=FALSE,
LEN(A2)-FIND(".",A2,FIND("@",A2))<2
)
Breakdown:
1. LEN(A2)-LEN(SUBSTITUTE(A2,"@",""))<>1
- Checks if there's exactly one @ symbol
2. FIND("@",A2)<=1
- Checks @ is not at start
3. ISNUMBER(SEARCH("*@*.*",A2))=FALSE
- Checks for dot after @ symbol
4. LEN(A2)-FIND(".",A2,FIND("@",A2))<2
- Checks at least 2 chars after last dot
Apply to range: A2:A100
Format: Red fill with dark red text
Simpler Alternative (Less Precise):
=NOT(AND(
LEN(A2)>5,
ISNUMBER(SEARCH("*@*.*",A2))
))
Tests basic structure only but easier to understand.
Text Cleaning Functions
Excel offers specialized functions for cleaning text data. These functions remove unwanted spaces, fix capitalization, eliminate non-printable characters, and standardize text formats for consistent analysis.
Essential Text Functions
Master these five functions to handle 90% of text cleaning tasks:
| Function | Purpose | Syntax | Example Input → Output |
|---|---|---|---|
| TRIM | Removes extra spaces | =TRIM(text) |
" John Smith " → "John Smith" |
| CLEAN | Removes non-printable characters | =CLEAN(text) |
"Data\n\rScience" → "DataScience" |
| PROPER | Capitalizes first letter of each word | =PROPER(text) |
"john SMITH" → "John Smith" |
| UPPER | Converts to uppercase | =UPPER(text) |
"New York" → "NEW YORK" |
| LOWER | Converts to lowercase | =LOWER(text) |
"New York" → "new york" |
TRIM Function Deep Dive
TRIM is your most-used text cleaning function. It removes leading spaces, trailing spaces, and reduces multiple spaces between words to single spaces.
Cell A2: " John Smith "
Formula: =TRIM(A2)
Result: "John Smith"
Before TRIM:
LEN(A2) = 17 characters -- Includes extra spaces
After TRIM:
LEN(TRIM(A2)) = 10 characters -- Clean text
Use Case: Cleaning imported data with spacing issues
Problem: Match names for VLOOKUP, but some have spaces
Data in A2: " John Smith "
Lookup table has: "John Smith"
Solution:
=VLOOKUP(TRIM(A2), $D$2:$E$100, 2, FALSE)
Result: Successful match after trimming spaces
Tip: Always TRIM before comparisons and lookups
CLEAN Function
CLEAN removes the first 32 non-printable characters in the ASCII set. Essential when importing data from databases or websites.
Problem: Data imported with line breaks (CHAR(10))
Cell A2 contains:
"Product: Laptop
Price: $999
Status: Available"
Formula: =CLEAN(A2)
Result: "Product: LaptopPrice: $999Status: Available"
Combined with TRIM:
=TRIM(CLEAN(A2))
Best Practice: Use both together for imported data:
=TRIM(CLEAN(A2))
Case Conversion Functions
Standardize text capitalization for consistency and accurate grouping.
Mixed case names need standardization:
A2: "JOHN SMITH" → =PROPER(A2) → "John Smith"
A3: "jane DOE" → =PROPER(A3) → "Jane Doe"
A4: "bob JOHNSON jr." → =PROPER(A4) → "Bob Johnson Jr."
Common Use Cases:
- Customer names in CRM systems
- Product names from multiple sources
- Address standardization
Caution: PROPER capitalizes after apostrophes:
"o'brien" → "O'Brien" ✓ Correct
"mcdonald" → "Mcdonald" ✗ Should be "McDonald"
Use UPPER for product codes (consistency):
A2: "sku-12345a" → =UPPER(A2) → "SKU-12345A"
Use LOWER for email addresses (standard format):
A2: "John@EMAIL.com" → =LOWER(A2) → "john@email.com"
Database matching example:
=VLOOKUP(UPPER(A2), ProductTable, 2, FALSE)
-- Ensures match regardless of input case
SUBSTITUTE Function
Replace specific text within a string. More precise than Find & Replace for formulas.
Syntax: =SUBSTITUTE(text, old_text, new_text, [instance])
Remove all hyphens from phone numbers:
A2: "555-123-4567"
=SUBSTITUTE(A2,"-","")
Result: "5551234567"
Replace multiple characters (nest functions):
A2: "(555) 123-4567"
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"(",""),")",""),"-","")
Result: "555 1234567"
Remove specific instance (3rd hyphen only):
A2: "2024-01-15"
=SUBSTITUTE(A2,"-","",2) -- Replaces 2nd hyphen
Result: "2024-0115"
Combining Functions for Power Cleaning
Stack multiple functions for comprehensive cleaning in a single formula:
Clean imported text completely:
=TRIM(CLEAN(PROPER(A2)))
What it does:
1. PROPER(A2) - Standardizes capitalization
2. CLEAN(...) - Removes non-printable characters
3. TRIM(...) - Removes extra spaces
Example:
A2: " jOHN SMITH\n "
Result: "John Smith"
Real-world application:
Import 1000 customer names → Apply once → Copy values → Delete helper column
Standardize email addresses:
=LOWER(TRIM(SUBSTITUTE(SUBSTITUTE(A2," ",""),"@","@")))
Steps:
1. Remove all spaces: SUBSTITUTE(A2," ","")
2. Trim (just in case): TRIM(...)
3. Convert to lowercase: LOWER(...)
Before: " John.Smith @EMAIL.COM "
After: "john.smith@email.com"
Add validation:
=IF(ISNUMBER(SEARCH("@",A2)),
LOWER(TRIM(A2)),
"Invalid Email"
)
Practice Questions: Text Cleaning Functions
Task: Clean this customer name list to proper case with no extra spaces.
A2: " JOHN SMITH "
A3: "jane DOE"
A4: "BOB johnson "
A5: " sarah LEE"
Show Solution
Solution:
Formula in B2:
=TRIM(PROPER(A2))
Copy down to B5
Results:
B2: "John Smith"
B3: "Jane Doe"
B4: "Bob Johnson"
B5: "Sarah Lee"
Next steps:
1. Copy B2:B5
2. Paste Values over A2:A5
3. Delete column B
Challenge: Convert all phone numbers to format: (XXX) XXX-XXXX
A2: "555-123-4567"
A3: "5551234567"
A4: "(555) 123-4567"
A5: "555.123.4567"
Show Solution
Solution:
Step 1: Strip all formatting (Column B)
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,
"-",""),"(",""),")",""),".","")
Result: All become "5551234567"
Step 2: Reformat (Column C)
="("&LEFT(B2,3)&") "&MID(B2,4,3)&"-"&RIGHT(B2,4)
Or use TEXT function if numeric:
=TEXT(VALUE(B2),"(000) 000-0000")
Final Results (C2:C5):
"(555) 123-4567"
"(555) 123-4567"
"(555) 123-4567"
"(555) 123-4567"
Alternative: Single mega-formula
=TEXT(VALUE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(
SUBSTITUTE(A2,"-",""),"(",""),")",""),".","")),
"(000) 000-0000")
Scenario: Data imported from a web form contains multiple issues. Create a single formula to clean it completely.
Issues in data:
- Leading/trailing spaces
- Non-printable characters (line breaks)
- Inconsistent capitalization
- Multiple spaces between words
- Non-breaking spaces (CHAR(160))
Sample A2: " JoHN SMITH\n "
(with CHAR(160) instead of normal space)
Show Solution
Solution:
Comprehensive cleaning formula:
=TRIM(CLEAN(PROPER(SUBSTITUTE(A2,CHAR(160)," "))))
Breaking it down from inside out:
1. SUBSTITUTE(A2,CHAR(160)," ")
- Replaces non-breaking spaces with normal spaces
2. PROPER(...)
- Standardizes to Title Case
3. CLEAN(...)
- Removes line breaks and non-printable chars
4. TRIM(...)
- Removes extra spaces
Test cases:
Input: " jOHN SMITH\n "
Output: "John Smith"
Input: "SARAH\nLEE "
Output: "Sarah Lee"
Input: " bob johnson "
Output: "Bob Johnson"
Production tip:
Create named range for formula:
Name: CleanText
Formula: =LAMBDA(text,TRIM(CLEAN(PROPER(
SUBSTITUTE(text,CHAR(160)," ")))))
Usage: =CleanText(A2)
Removing Duplicate Records
Duplicate data can skew analysis results and waste storage space. Learn how to identify, highlight, and remove duplicate records while preserving data integrity and unique information.
Identifying Duplicates
Before removing duplicates, you need to understand what makes a record a duplicate in your context.
Duplicate Record
A row in a dataset where one or more columns contain the same values as another row. Can be exact duplicates (all columns match) or partial duplicates (specific columns match).
| Duplicate Type | Definition | Example |
|---|---|---|
| Exact Duplicate | All column values identical | Two rows with same customer ID, name, email, phone |
| Partial Duplicate | Key columns match, others differ | Same email but different names (possible data entry error) |
| Case-Sensitive Duplicate | Same text, different case | "john@email.com" vs "John@email.com" |
Highlighting Duplicates
Use conditional formatting to visualize duplicates before removal:
Steps:
1. Select your data range (e.g., A2:A100)
2. Home tab → Conditional Formatting
3. Highlight Cells Rules → Duplicate Values
4. Choose formatting (Red Fill recommended)
5. Click OK
Result: All duplicate values highlighted in red
Best for: Quick visual check of single column
Limitation: Highlights all occurrences including first
Custom formula to highlight only repeat occurrences:
Select range: A2:A100
Conditional Formatting → New Rule → Use formula
Formula: =COUNTIF($A$2:$A2,$A2)>1
Format: Red background
How it works:
- For A2: Counts "A2" in range A2:A2 (count = 1) → Not highlighted
- For A3: If duplicate of A2, counts "A3" in A2:A3 (count = 2) → Highlighted
- For A4: Counts "A4" in A2:A4 → Highlights if duplicate
Result: First occurrence remains unmarked, duplicates highlighted
Remove Duplicates Tool
Excel's Remove Duplicates feature is powerful but irreversible. Always backup first!
Dataset: Customer list with duplicate entries
Name Email Phone
John Smith john@email.com 555-1234
Jane Doe jane@email.com 555-5678
John Smith john@email.com 555-1234 ← Duplicate
Bob Johnson bob@email.com 555-9012
Steps:
1. Select entire data range including headers
2. Data tab → Data Tools → Remove Duplicates
3. Check "My data has headers"
4. Select columns to check: All columns
5. Click OK
Result:
- Keeps first John Smith row
- Deletes second John Smith row
- Message: "1 duplicate values found and removed; 3 unique values remain"
Warning: This action cannot be undone!
Scenario: Remove duplicates based on email only
Customer_ID Name Email Phone
1001 John Smith john@email.com 555-1234
1002 J. Smith john@email.com 555-5678 ← Same email
1003 Jane Doe jane@email.com 555-9012
Problem: Same person entered twice with different names
Steps:
1. Select all data (A1:D3)
2. Data → Remove Duplicates
3. Uncheck all columns EXCEPT "Email"
4. Click OK
Result: Keeps row 1001, removes row 1002
Rationale: First occurrence preserved, duplicates removed
Use Case: Finding customers with multiple accounts
Advanced Duplicate Detection
Use formulas for more control over duplicate identification:
Add helper column to count occurrences:
A2: john@email.com
B2: =COUNTIF($A$2:$A$100,$A2)
Results:
- B2 = 1: Unique value
- B2 = 2: Appears twice (1 duplicate)
- B2 = 3: Appears three times (2 duplicates)
Filter B column > 1 to see all duplicates
Advanced: Flag first occurrence differently
C2: =IF(COUNTIF($A$2:$A2,$A2)>1,"Duplicate","First/Unique")
Results:
Row 2: "First/Unique" (first john@email.com)
Row 5: "Duplicate" (second john@email.com)
Find duplicates based on multiple columns:
A2: Name B2: Email C2: Duplicate?
John Smith john@email.com
Formula in C2:
=COUNTIFS($A$2:$A$100,$A2,$B$2:$B$100,$B2)>1
Returns TRUE if both name AND email match another row
Alternative: Create unique key
D2: =A2&"-"&B2
Result: "John Smith-john@email.com"
Then use COUNTIF on column D:
E2: =COUNTIF($D$2:$D$100,D2)>1
Duplicate Extraction Technique
Sometimes you want to keep duplicates for review rather than delete them:
Method: Use Advanced Filter
Step 1: Add helper column (D2)
=COUNTIF($A$2:$A$100,$A2)>1
Step 2: Apply filter
1. Select data range (A1:D100)
2. Data → Filter
3. Click filter arrow in column D
4. Check only TRUE values
5. Shows only duplicate rows
Step 3: Copy to new sheet
1. Select visible rows (Ctrl+Click to skip hidden)
2. Copy (Ctrl+C)
3. Create new sheet "Duplicates"
4. Paste (Ctrl+V)
Result: Original data intact, duplicates isolated for review
Best Practices for Duplicate Management
Do This
- Always create a backup before removing duplicates
- Use conditional formatting to preview before deletion
- Document your criteria for identifying duplicates
- Review a sample of flagged duplicates manually
- Keep a log of how many duplicates were removed
Avoid This
- Removing duplicates without understanding the cause
- Deleting duplicates based on wrong columns
- Assuming all duplicates are errors (some may be valid)
- Forgetting to check for case-sensitive duplicates
- Removing duplicates from unsorted data without review
Practice Questions: Removing Duplicates
Dataset: Identify what type of duplicates exist in this customer list.
ID Name Email City
101 John Smith john@email.com NYC
102 Jane Doe jane@email.com LA
103 John Smith john@email.com NYC ← ?
104 john smith john@email.com NYC ← ?
105 Jane Doe jane@gmail.com LA ← ?
Show Solution
Solution:
Duplicate Analysis:
- Row 103: Exact duplicate of row 101 (all columns match). Should be removed.
- Row 104: Case-sensitive duplicate of row 101. Same person with lowercase name. Should standardize with PROPER() then remove.
- Row 105: Partial duplicate of row 102. Same name, same city, but different email. Might be same person with two emails - requires investigation, not automatic removal.
Recommended Actions:
- Standardize names with PROPER() function
- Standardize emails with LOWER() function
- Remove exact duplicates (row 103)
- Flag row 105 for manual review
Task: Create a formula that identifies duplicate email addresses but ignores the first occurrence. Mark subsequent duplicates as "Duplicate", first occurrence as "Original", and unique values as "Unique".
Show Solution
Solution:
Formula in B2 (assuming emails in A2:A100):
=IF(COUNTIF($A$2:$A$100,$A2)=1,
"Unique",
IF(COUNTIF($A$2:$A2,$A2)=1,
"Original",
"Duplicate"
)
)
Logic Breakdown:
1. COUNTIF($A$2:$A$100,$A2)=1
- If email appears only once in entire range → "Unique"
2. COUNTIF($A$2:$A2,$A2)=1
- If email appears once in range from start to current row
- This is the first occurrence → "Original"
3. Otherwise → "Duplicate"
Example Results:
A2: john@email.com B2: Original
A3: jane@email.com B3: Unique
A4: john@email.com B4: Duplicate
A5: bob@email.com B5: Unique
A6: john@email.com B6: Duplicate
Now filter column B for "Duplicate" to review/remove
Scenario: You have a sales dataset where duplicates should be identified based on BOTH Customer ID AND Order Date (same customer, same day = duplicate order). Create a system to flag and count these complex duplicates.
A: Order_ID B: Customer_ID C: Order_Date D: Amount
1001 C123 2024-01-15 $500
1002 C456 2024-01-15 $300
1003 C123 2024-01-15 $500 ← Duplicate?
1004 C123 2024-01-16 $200 ← Not duplicate
Show Solution
Solution:
Method 1: Helper Column with Concatenation
Column E (Unique Key):
E2: =B2&"-"&TEXT(C2,"yyyy-mm-dd")
Result: "C123-2024-01-15"
Column F (Duplicate Flag):
F2: =COUNTIF($E$2:$E$100,E2)>1
Column G (Occurrence Number):
G2: =COUNTIF($E$2:$E2,E2)
Results:
Row 2: E2="C123-2024-01-15" F2=TRUE G2=1 (Original)
Row 3: E3="C456-2024-01-15" F3=FALSE G3=1 (Unique)
Row 4: E4="C123-2024-01-15" F4=TRUE G4=2 (Duplicate!)
Row 5: E5="C123-2024-01-16" F5=FALSE G5=1 (Unique)
Method 2: Single Formula (No Helper Column)
Column E (Duplicate Status):
=IF(COUNTIFS($B$2:$B$100,$B2,$C$2:$C$100,$C2)=1,
"Unique",
IF(COUNTIFS($B$2:$B2,$B2,$C$2:$C2,$C2)=1,
"First",
"Duplicate-"&COUNTIFS($B$2:$B2,$B2,$C$2:$C2,$C2)
)
)
Results:
Row 2: "First"
Row 3: "Unique"
Row 4: "Duplicate-2"
Row 5: "Unique"
To Remove Duplicates:
1. Sort by Customer_ID, then Order_Date
2. Use Remove Duplicates tool
3. Select only Customer_ID and Order_Date columns
4. First occurrence kept, others deleted
Alternative: Keep highest amount when duplicates exist
1. Sort by Customer_ID, Order_Date, Amount (descending)
2. Remove duplicates based on Customer_ID + Order_Date
3. Highest amount entry preserved
Handling Missing Values
Missing data is one of the most common data quality issues. How you handle missing values can significantly impact your analysis. Learn to identify gaps, decide when to fill or remove them, and implement best practices for maintaining data integrity.
Identifying Missing Data
Excel represents missing data in different ways, each requiring different handling:
| Type | Appearance | Excel Formula | Common Cause |
|---|---|---|---|
| Empty Cells | Blank cell (nothing visible) | =ISBLANK(A2) |
Data entry skipped, import errors |
| Zero Values | 0 | =A2=0 |
Placeholder, actual zero, or missing? |
| Text Placeholders | "N/A", "NULL", "Missing" | =A2="N/A" |
Database exports, manual entry |
| Formula Errors | #N/A, #DIV/0!, #VALUE! | =ISERROR(A2) |
Lookup failures, calculation issues |
| Spaces Only | Cell appears empty but has spaces | =AND(A2<>"",TRIM(A2)="") |
Copy-paste errors, web scraping |
Decision Framework: Fill or Remove?
Before taking action, understand what the missing data means:
When to Fill Missing Values
- Missing value is predictable (carry forward last known value)
- Can calculate from other columns (e.g., Total = Quantity × Price)
- Domain knowledge provides default (e.g., missing country = "USA" for US company)
- Small percentage missing (<5%) with safe assumptions
- Critical field needed for all records (Customer ID, Date)
When to Remove Records
- Missing critical information (e.g., missing sales amount in sales data)
- Large percentage missing (>30%) makes filling unreliable
- Random missing pattern (MCAR - Missing Completely at Random)
- Cannot fill without making dangerous assumptions
- Duplicate or test records with missing required fields
Techniques for Filling Missing Values
Method 1: Fill with Constant Value
Replace missing values with a specific value (0, "Unknown", average, etc.):
Task: Fill blank phone numbers with "Not Provided"
Steps:
1. Select phone number column
2. Press F5 → Special → Blanks → OK
3. Type: Not Provided
4. Press Ctrl+Enter (fills all selected blanks at once)
Result: All blank cells in selection now contain "Not Provided"
Formula in B2:
=IF(ISBLANK(A2),"Not Provided",A2)
Logic: If cell is blank, return "Not Provided", otherwise keep original value
Copy down for entire column
Method 2: Fill Down / Fill Up
Use surrounding values to fill gaps:
Keyboard Shortcut: Ctrl+D
Steps:
1. Select cell with value + empty cells below
2. Press Ctrl+D
Example:
Before: After Ctrl+D:
Sales Sales
[blank] Sales
[blank] Sales
[blank] Sales
Formula in A2:
=IF(ISBLANK(A2),B1,A2)
Then copy down
Use case: Filling hierarchical data where category repeats
Steps:
1. Select first date + blank cells
2. Home → Fill → Series
3. Choose: Date Unit: Day, Step value: 1
4. OK
Example:
Before: After:
1/1/2024 1/1/2024
[blank] 1/2/2024
[blank] 1/3/2024
[blank] 1/4/2024
Method 3: Calculate from Other Columns
Derive missing values using relationships in your data:
Scenario: Total column has blanks, but Quantity and Unit Price exist
Formula in D2:
=IF(ISBLANK(D2),B2*C2,D2)
Logic:
- If Total is blank → Calculate: Quantity × Unit Price
- If Total exists → Keep existing value
Example:
Quantity Price Total (Before) Total (After)
10 5.00 50.00 50.00
5 10.00 [blank] 50.00 (calculated)
20 3.00 60.00 60.00
Formula in C2:
=IF(ISBLANK(C2),YEAR(TODAY())-YEAR(B2),C2)
Before:
Birth Date Age
1990-05-15 [blank] → Calculates: 35 (as of 2026)
1985-03-20 40 → Keeps: 40
Scenario: Product IDs have prefixes (EL=Electronics, CL=Clothing)
Formula in D2:
=IF(ISBLANK(D2),
IF(LEFT(A2,2)="EL","Electronics",
IF(LEFT(A2,2)="CL","Clothing","Other")),
D2)
Example:
Product ID Category (Before) Category (After)
EL-001 [blank] Electronics
CL-200 [blank] Clothing
EL-055 Electronics Electronics
FU-100 [blank] Other
Method 4: Use VLOOKUP/XLOOKUP for Reference Data
Fill missing values by looking up information from reference tables:
Scenario: Customer Name missing, but Customer ID exists
Formula in B2:
=IF(ISBLANK(B2),
XLOOKUP(A2,CustomerTable[ID],CustomerTable[Name],"Not Found"),
B2)
Reference Table (CustomerTable):
ID Name
C001 John Smith
C002 Jane Doe
C003 Bob Johnson
Result:
ID Name (Before) Name (After)
C001 [blank] John Smith
C002 Jane Doe Jane Doe
C003 [blank] Bob Johnson
C999 [blank] Not Found
Formula in C2:
=IFERROR(
VLOOKUP(A2,PriceTable,2,FALSE),
"Price Not Found"
)
VLOOKUP Breakdown:
- A2: Product ID to look up
- PriceTable: Reference range with Product ID and Price
- 2: Return value from column 2 (Price column)
- FALSE: Exact match required
IFERROR wrapper: Returns "Price Not Found" if product doesn't exist
Removing Records with Missing Values
Method 1: Filter and Delete
Steps:
1. Select your data range (Ctrl+A)
2. Data tab → Filter
3. Click filter dropdown on column with missing data
4. Uncheck (Blanks)
5. Select all visible rows (rows with data)
6. Right-click row numbers → Delete Rows
7. Remove filter to see remaining data
Result: All rows with blank values in that column are removed
Method 2: Formula Flagging
Formula in E2 (helper column):
=IF(OR(ISBLANK(A2),ISBLANK(B2),ISBLANK(C2)),
"INCOMPLETE",
"COMPLETE")
Logic:
- Checks if ANY of columns A, B, or C is blank
- If blank found → "INCOMPLETE"
- If all filled → "COMPLETE"
Example:
A B C E (Status)
John Sales 50000 COMPLETE
Jane [blank] 60000 INCOMPLETE
Bob IT [blank] INCOMPLETE
Sarah HR 55000 COMPLETE
Next Step: Filter column E for "INCOMPLETE" to review before deleting
Method 3: Remove if Any Critical Field is Missing
Formula in F2 (decision column):
=IF(COUNTBLANK(A2:E2)>0,"DELETE","KEEP")
COUNTBLANK Explanation:
- Counts number of blank cells in range A2:E2
- If count > 0 (any blanks exist) → "DELETE"
- If count = 0 (all filled) → "KEEP"
Critical fields example: Customer ID, Order Date, Product, Quantity, Price
If any are blank, entire record is flagged for deletion
Example:
ID Date Product Qty Price Decision
101 2024-01-15 Widget 10 5.00 KEEP
102 2024-01-16 [blank] 5 10.00 DELETE
103 [blank] Gadget 20 3.00 DELETE
104 2024-01-17 Tool 15 8.00 KEEP
Workflow:
1. Add formula in helper column
2. Sort by Decision column
3. Delete all "DELETE" rows at once
4. Remove helper column
Practice Questions: Handling Missing Values
Dataset: Employee records with various missing values.
ID Name Department Salary Email
101 John Smith Sales 50000 john@company.com
102 Jane Doe 60000 jane@company.com
103 Bob Johnson Marketing bob@company.com
104 IT 55000
105 Sarah Lee Sales 52000 sarah@company.com
Task: For each missing value, determine whether to fill or remove. Explain your reasoning.
Show Solution
Analysis:
- Row 102 - Missing Department: FILL - Can look up from HR database using Employee ID, or ask employee
- Row 103 - Missing Salary: FILL with caution - Could use department average, but sensitive data requires verification
- Row 104 - Missing Name: REMOVE or INVESTIGATE - Critical identifier missing, likely test record or data corruption
- Row 104 - Missing Email: FILL - Can construct from naming convention (firstname@company.com) once name is found
Recommended Actions:
- Row 104: Investigate first (check ID 104 in source system). If truly missing name, delete record
- Row 102: Fill department from HR system or mark as "Unassigned" temporarily
- Row 103: Flag for manager review before filling salary
- Row 104: If name found, construct email using pattern
Scenario: Sales data where Total is missing for some records. Fill using Quantity × Price. If either Quantity or Price is missing, mark as "INCOMPLETE".
A: Order_ID B: Quantity C: Price D: Total
1001 10 5.00 50.00
1002 5 10.00
1003 7.50
1004 20 3.00 60.00
Show Solution
Formula in D2:
=IF(OR(ISBLANK(B2),ISBLANK(C2)),
"INCOMPLETE",
IF(ISBLANK(D2),B2*C2,D2)
)
Logic Breakdown:
1. First check if Quantity OR Price is blank
→ If either is missing, cannot calculate → "INCOMPLETE"
2. If both Quantity and Price exist:
→ Check if Total is blank
→ If blank, calculate: Quantity × Price
→ If not blank, keep existing Total
Results:
D2: 50.00 (already filled, keep it)
D3: 50.00 (calculated: 5 × 10.00)
D4: "INCOMPLETE" (missing Quantity)
D5: 60.00 (already filled, keep it)
Alternative - Strict version that only fills if missing:
=IF(ISBLANK(D2),
IF(OR(ISBLANK(B2),ISBLANK(C2)),"INCOMPLETE",B2*C2),
D2)
Challenge: You have 10,000 customer orders. 15% are missing State, 8% are missing Zip Code, 3% are missing both. Create a multi-step strategy to handle this.
Show Solution
Multi-Step Strategy:
Step 1: Assess Impact
// Count missing patterns
Missing State only: 12% (1,200 records)
Missing Zip only: 5% (500 records)
Missing both: 3% (300 records)
Complete: 80% (8,000 records)
Step 2: Fill Zip from State (if State exists)
// For records with State but no Zip, use most common Zip for that State
=IF(AND(NOT(ISBLANK(B2)),ISBLANK(C2)),
XLOOKUP(B2,StateLookup[State],StateLookup[MostCommonZip],"00000"),
C2)
This fills ~5% (500 records)
Step 3: Fill State from Zip (if Zip exists)
// For records with Zip but no State
=IF(AND(ISBLANK(B2),NOT(ISBLANK(C2))),
XLOOKUP(C2,ZipLookup[Zip],ZipLookup[State],"Unknown"),
B2)
This fills most of the 12% (State-only missing)
Step 4: Handle Both Missing
// Option A: Use shipping address from previous orders (if repeat customer)
=IF(AND(ISBLANK(B2),ISBLANK(C2)),
XLOOKUP(A2,HistoricalOrders[CustomerID],HistoricalOrders[State],"CHECK"),
B2)
// Option B: If no history and both missing, flag for manual review
=IF(AND(ISBLANK(B2),ISBLANK(C2)),"MANUAL REVIEW","OK")
Step 5: Document Changes
// Add audit column showing what was done
=IF(AND(OriginalState="",FinalState<>""),"State Filled from Zip",
IF(AND(OriginalZip="",FinalZip<>""),"Zip Filled from State",
IF(AND(OriginalState="",OriginalZip=""),"Manual Review Needed",
"No Changes")))
Final Result:
- ~93% records complete after automated filling
- ~3-5% flagged for manual review (missing both)
- ~2% may remain unresolvable (delete or mark as "Unknown")
- Full audit trail of all changes
Key Principle: Never silently fill data. Always mark filled values for review and maintain original data in separate columns until validated.
Text to Columns and Flash Fill
Transform unstructured text into organized data with Excel's Text to Columns and Flash Fill features. Split full names into first and last names, separate addresses into components, or extract specific patterns from messy text data with ease.
Text to Columns
Text to Columns splits data from one column into multiple columns based on delimiters (commas, spaces, tabs) or fixed widths.
Method 1: Delimited Text
Use when data is separated by specific characters:
Select Data and Open Tool
Select the column with text to split → Data tab → Text to Columns
Choose Delimited
Select "Delimited" → Next
Select Delimiters
Check delimiter(s): Tab, Semicolon, Comma, Space, or Other
Preview shows how data will split
Format Columns and Finish
Set data format for each column (General, Text, Date) → Finish
Example: Split Full Names
Original Data (Column A):
John Smith
Jane Mary Doe
Bob Johnson
Steps:
1. Select A2:A4
2. Data → Text to Columns
3. Delimited → Space
4. Finish
Result:
Column A Column B Column C
John Smith
Jane Mary Doe
Bob Johnson
Method 2: Fixed Width
Use when data has consistent character positions:
Example: Split Fixed-Width Data
Original Data (positions fixed):
ID0001JohnSmith Marketing
ID0002JaneJohnson Sales
ID0003BobLee IT
Positions:
0-5: ID
6-10: First Name
11-20: Last Name
21-30: Department
Steps:
1. Data → Text to Columns
2. Fixed Width
3. Click to create break lines at positions 6, 11, 21
4. Finish
Result: 4 clean columns
Flash Fill (Excel 2013+)
Flash Fill automatically detects patterns in your examples and fills remaining cells. It's like teaching Excel by showing examples rather than writing formulas.
How Flash Fill Works
Provide Examples
In an adjacent column, manually type 1-2 examples of the desired output
Trigger Flash Fill
Press Ctrl+E or Data → Flash Fill
Excel detects the pattern and fills remaining cells
Review and Accept
Check suggestions. If correct, accept. If not, add more examples and retry
Common Flash Fill Use Cases
Extract First Names
A: Full Name B: First Name
John Smith John (type this)
Jane Doe [Ctrl+E]
Bob Johnson [Auto-filled: Bob]
Create Emails
A: Name B: Email
John Smith john.smith@company.com
Jane Doe [Ctrl+E]
Bob Johnson [Auto: bob.johnson@company.com]
Format Dates
A: Date B: Formatted
01/15/2024 Jan 15, 2024
02/20/2024 [Ctrl+E]
03/10/2024 [Auto: Mar 10, 2024]
Extract Area Codes
A: Phone B: Area Code
(555) 123-4567 555
(212) 987-6543 [Ctrl+E]
(310) 555-1234 [Auto: 310]
Advanced Flash Fill Techniques
Multi-Column Patterns
Flash Fill can combine data from multiple columns:
A: First B: Last C: Dept D: Badge ID
John Smith Sales JS-SAL-001 (type example)
Jane Doe Marketing [Ctrl+E detects: JD-MAR-002]
Bob Lee IT [Auto: BL-IT-003]
Pattern: FirstInitial + LastInitial + "-" + DEPT + "-" + Number
Text to Columns vs Flash Fill
| Feature | Text to Columns | Flash Fill |
|---|---|---|
| Best For | Consistent delimiters or fixed widths | Complex patterns, transformations |
| Speed | Very fast, one-time setup | Requires examples, but flexible |
| Flexibility | Limited to splitting | Can split, combine, format, extract |
| Consistency | 100% reliable with clear delimiters | 90% accurate, may need verification |
| Formula Required | No | No (but not dynamic) |
| Updates Automatically | No (one-time conversion) | No (one-time conversion) |
Practice Questions: Text to Columns & Flash Fill
Task: Split this comma-separated data into separate columns.
A1: Name,Age,City,State
A2: John Smith,28,New York,NY
A3: Jane Doe,35,Los Angeles,CA
A4: Bob Johnson,42,Chicago,IL
Show Solution
Steps:
- Select A1:A4
- Data tab → Text to Columns
- Choose "Delimited" → Next
- Check "Comma" → Next
- Leave data format as "General" → Finish
Result:
A B C D
Name Age City State
John Smith 28 New York NY
Jane Doe 35 Los Angeles CA
Bob Johnson 42 Chicago IL
If your data has headers in first row, consider skipping A1 and starting from A2, then manually type headers.
Challenge: Use Flash Fill to generate company email addresses in format: firstname.lastname@company.com (all lowercase).
A: Full Name B: Email Address
John Smith
Jane DOE
Bob JOHNSON
Sarah Lee
Show Solution
Steps:
- In B2, manually type: john.smith@company.com
- In B3, manually type: jane.doe@company.com
- Select B4
- Press Ctrl+E (Flash Fill)
- Excel fills remaining cells with pattern
Result:
A: Full Name B: Email Address
John Smith john.smith@company.com
Jane DOE jane.doe@company.com
Bob JOHNSON bob.johnson@company.com
Sarah Lee sarah.lee@company.com
Why It Works:
- Flash Fill detects: split name at space
- Convert to lowercase
- Join with period
- Add domain @company.com
If Flash Fill doesn't work, add a third example or use formula: =LOWER(SUBSTITUTE(A2," ","."))&"@company.com"
Scenario: Parse full addresses into Street, City, State, Zip components. Data has inconsistent formatting.
Full Address:
123 Main St, New York, NY 10001
456 Oak Avenue, Los Angeles, CA, 90001
789 Pine Road, Chicago IL 60601
Challenge: Handle comma variations and extract all 4 components reliably.
Show Solution
Multi-Step Approach:
Step 1: Standardize First (Add Missing Commas)
// Find state abbreviations without comma before
Find: ([A-Z]{2}) (\d{5})
Replace: , $1 $2
Use Find & Replace with wildcards
Result: All addresses have consistent comma placement
Step 2: Text to Columns (Comma Delimited)
1. Select address column
2. Text to Columns → Delimited → Comma
3. Click Next → Finish
Result:
A: 123 Main St B: New York C: NY 10001
A: 456 Oak Avenue B: Los Angeles C: CA D: 90001
Step 3: Clean Up State and Zip
// Extract State (column D)
=LEFT(TRIM(C2),2)
// Extract Zip (column E)
=RIGHT(TRIM(C2),5)
Or use Flash Fill:
- Manually type "NY" in D2, "CA" in D3
- Ctrl+E to fill remaining states
- Repeat for Zip codes
Final Clean Result:
Street City State Zip
123 Main St New York NY 10001
456 Oak Avenue Los Angeles CA 90001
789 Pine Road Chicago IL 60601
Alternative: Power Query (Best for Large Datasets)
1. Data → Get Data → From Table/Range
2. Split Column by Delimiter (comma)
3. Split last column by space to separate State and Zip
4. Trim all columns
5. Close & Load
Best Practice: For complex, repetitive parsing tasks, Power Query is more reliable than Text to Columns because it's repeatable and handles variations better.
Power Query Basics for Data Cleaning
Power Query is Excel's most powerful data transformation tool. Unlike formulas that need to be reapplied, Power Query creates repeatable, automated workflows that can refresh with new data. Perfect for recurring data cleaning tasks and complex transformations.
Why Use Power Query?
Repeatable
Create once, apply to any new data with a single refresh. Saves hours on recurring tasks.
Non-Destructive
Original data stays untouched. All transformations happen in Power Query Editor.
Auditable
Every step is recorded. See exactly what transformations were applied and when.
Getting Started with Power Query
Load Data into Power Query
Method A: From Excel Table
1. Select your data range
2. Insert → Table (or Ctrl+T)
3. Data tab → Get Data → From Table/Range
Method B: From File
Data → Get Data → From File → From Excel/CSV/Text
Power Query Editor Opens
You'll see three main areas:
- Preview Area: Shows your data with transformations
- Applied Steps: Right panel - records every transformation
- Ribbon: Transformation tools organized by category
Apply Transformations
Each action adds a step to Applied Steps pane. You can:
- Click any step to see data at that point
- Delete steps (X icon)
- Reorder steps (drag and drop)
- Edit step settings (gear icon)
Load Back to Excel
Home tab → Close & Load → Choose destination:
- Table: Loads as Excel table (most common)
- PivotTable: Creates PivotTable from cleaned data
- Connection Only: Saves query without loading (for reuse)
Essential Power Query Transformations
1. Remove Columns
// Right-click column header
→ Remove Columns
or
→ Remove Other Columns (keeps selected, removes rest)
Use when: Simplifying dataset, removing unnecessary fields
2. Remove Duplicates
// Select columns that define uniqueness
Home → Remove Rows → Remove Duplicates
Use when: Cleaning customer lists, order records
Advantage over Excel: Can specify which columns to check
3. Trim and Clean Text
// Right-click text column
→ Transform → Trim (removes leading/trailing spaces)
→ Transform → Clean (removes non-printable characters)
→ Format → UPPERCASE / lowercase / Capitalize Each Word
Chain these: Trim → Clean → Proper Case in sequence
4. Replace Values
// Right-click column → Replace Values
From: NY
To: New York
Better than Find & Replace:
- Only affects selected column
- Case-sensitive option available
- Recorded as repeatable step
5. Split Columns
// Right-click column header → Split Column
Options:
- By Delimiter (comma, space, custom)
- By Number of Characters
- By Position (left, right, range)
Example: Split "LastName, FirstName"
Split by Delimiter → Comma → Each occurrence
6. Fill Down / Fill Up
// Right-click column → Fill → Down
Fills empty cells with value from above
Common use: Hierarchical data where category repeats
Category Product
Electronics [blank] → Electronics Laptop
[blank] [blank] → Electronics Phone
Furniture [blank] → Furniture Desk
7. Change Data Type
// Click column header icon → Choose type
Text, Whole Number, Decimal Number, Date, Currency
Critical for: Ensuring calculations work, proper sorting
Automatic detection: Power Query guesses type (check Applied Steps)
8. Filter Rows
// Click filter dropdown in column header
Options:
- Text Filters: Contains, Begins With, Ends With
- Number Filters: Greater Than, Between, Top N
- Date Filters: Year, Month, Date Range
Removes unwanted records before loading to Excel
Power Query Cleaning Workflow Example
Real-World Scenario: Monthly Sales Report Automation
Problem: Every month, you receive a messy CSV file with 10,000+ sales records that needs the same cleaning.
Power Query Solution:
Step 1: Load CSV
Data → From Text/CSV → Select file
Step 2: Promote First Row to Headers
Home → Use First Row as Headers
Step 3: Remove Unnecessary Columns
Select ID, Date, Product, Quantity, Price columns
→ Remove Other Columns
Step 4: Clean Product Names
Transform → Trim
Transform → Clean
Format → Capitalize Each Word
Step 5: Fix Date Format
Change Type → Date
Step 6: Remove Duplicates (based on ID)
Select ID column → Remove Duplicates
Step 7: Add Calculated Column (Total)
Add Column → Custom Column
Name: Total
Formula: [Quantity] * [Price]
Step 8: Filter Out Test Records
Filter Product column → Does Not Contain "TEST"
Step 9: Sort by Date
Home → Sort → Sort Descending
Step 10: Load to Excel
Close & Load → Table → New Worksheet
Next Month:
1. Save new CSV file with same name in same location
2. Right-click query in Queries & Connections pane
3. Click Refresh
All 10 steps automatically reapply to new data!
M Language (Behind the Scenes)
Power Query writes code called M Language for each step. You don't need to write it, but viewing helps understand transformations:
// View M Code: Right-click step → View Native Query
// Or: Home → Advanced Editor
Example M Code:
let
Source = Excel.CurrentWorkbook(){[Name="SalesData"]}[Content],
RemovedDuplicates = Table.Distinct(Source, {"ID"}),
TrimmedText = Table.TransformColumns(RemovedDuplicates,{{"Product", Text.Trim}}),
ChangedType = Table.TransformColumnTypes(TrimmedText,{{"Date", type date}})
in
ChangedType
Practice Questions: Power Query Basics
Task: Create a Power Query to clean customer data: remove duplicates, trim spaces, capitalize names, remove test records.
Sample Data (CustomerList table):
ID Name Email City
101 john smith john@email.com new york
102 JANE DOE jane@email.com LOS ANGELES
101 john smith john@email.com new york
999 TEST CUSTOMER test@email.com TEST CITY
Show Solution
Power Query Steps:
- Load Data: Data → From Table/Range → Select CustomerList
- Remove Duplicates: Home → Remove Rows → Remove Duplicates
- Trim All Text Columns:
- Select Name, Email, City columns (hold Ctrl)
- Transform → Format → Trim
- Capitalize Names:
- Select Name column
- Transform → Format → Capitalize Each Word
- Capitalize City:
- Select City column
- Transform → Format → Capitalize Each Word
- Remove Test Records:
- Click filter on Name column
- Text Filters → Does Not Contain → "TEST"
- Load: Home → Close & Load
Final Result:
ID Name Email City
101 John Smith john@email.com New York
102 Jane Doe jane@email.com Los Angeles
Applied Steps (should show):
- Source
- Removed Duplicates
- Trimmed Text
- Capitalized Each Word (Name)
- Capitalized Each Word (City)
- Filtered Rows
Challenge: Clean product data by splitting SKU codes and extracting components.
Data:
SKU Quantity
ELEC-LAP-001 10
ELEC-PHONE-002 25
FURN-DESK-003 5
FURN-CHAIR-004 15
Task:
- Split SKU into: Category, Type, ID
- Capitalize Category and Type properly
- Add calculated column: Value = Quantity × 100
Show Solution
Power Query Steps:
- Split SKU by Delimiter:
- Right-click SKU column
- Split Column → By Delimiter → Hyphen (-)
- Choose: At each occurrence of delimiter
- Rename Columns:
- Double-click headers to rename:
- SKU.1 → Category
- SKU.2 → Type
- SKU.3 → ID
- Capitalize Category and Type:
- Select Category and Type columns (Ctrl+Click)
- Transform → Format → Capitalize Each Word
- Add Calculated Column:
- Add Column → Custom Column
- Name: Value
- Formula:
[Quantity] * 100
- Reorder Columns (optional):
- Drag columns to desired order
- Close & Load
Final Result:
Category Type ID Quantity Value
Elec Lap 001 10 1000
Elec Phone 002 25 2500
Furn Desk 003 5 500
Furn Chair 004 15 1500
Scenario: Build a Power Query that processes monthly sales files automatically. Requirements:
- Import CSV from C:\Reports\MonthlySales.csv
- Remove columns except: Date, Product, Quantity, UnitPrice
- Fix date format issues
- Calculate Total = Quantity × UnitPrice
- Filter out records where Quantity = 0
- Group by Product and show total sales
- Sort by total sales descending
Show Solution
Complete Power Query Pipeline:
Step 1: Import CSV
Data → Get Data → From File → From Text/CSV
Navigate to C:\Reports\MonthlySales.csv
Load → Transform Data (opens Power Query Editor)
Step 2: Remove Unwanted Columns
Select: Date, Product, Quantity, UnitPrice
Right-click → Remove Other Columns
Step 3: Fix Date Column
Click Date column header icon → Choose "Date" type
If errors appear, check original format and use:
Transform → Date → Parse (choose format)
Step 4: Add Total Column
Add Column → Custom Column
Name: Total
Formula: [Quantity] * [UnitPrice]
Step 5: Filter Out Zero Quantity
Click filter on Quantity column
Number Filters → Greater Than → 0
Step 6: Group by Product
Home → Group By
Group by: Product
New column name: TotalSales
Operation: Sum
Column: Total
Step 7: Sort Descending
Click TotalSales column header dropdown
Sort Descending
Step 8: Rename Query (optional)
Right-click query name in Applied Steps
Rename: MonthlySalesReport
Step 9: Close & Load
Home → Close & Load → Table → New worksheet
Applied Steps Should Show:
- Source
- Promoted Headers
- Removed Other Columns
- Changed Type (Date)
- Added Custom (Total)
- Filtered Rows (Quantity > 0)
- Grouped Rows
- Sorted Rows
Next Month Refresh:
1. Replace MonthlySales.csv with new month's data (same filename, same location)
2. Right-click query in Queries & Connections pane
3. Click Refresh
All steps automatically reapply!
Advanced: Parameterize File Path
To make file path flexible:
1. Home → Manage Parameters → New Parameter
Name: FilePath
Type: Text
Current Value: C:\Reports\MonthlySales.csv
2. Edit Source step in Advanced Editor:
Replace hardcoded path with: FilePath parameter
Now you can change file location without editing query!
Result: A fully automated pipeline that transforms 30 minutes of manual work into 5 seconds of refresh time!
Key Takeaways
Clean Data Foundation
Data cleaning is essential before analysis - invest time upfront to ensure accurate insights
Find & Replace Mastery
Use wildcards and advanced options to find and fix patterns across entire datasets efficiently
Text Function Toolkit
TRIM, PROPER, UPPER, LOWER, and CLEAN functions standardize text and remove unwanted characters
Duplicate Detection
Remove Duplicates and conditional formatting help identify and eliminate redundant records
Data Validation
Set up validation rules to prevent errors from entering your dataset in the first place
Quality Assurance
Regular data quality checks ensure consistency and reliability throughout your analysis workflow
Knowledge Check
Quick Quiz
Test your understanding of Excel data cleaning techniques