Assignment Overview
In this assignment, you will build a complete NLP Analysis Pipeline for customer reviews. This comprehensive project requires you to apply ALL concepts from Module 4: text preprocessing, tokenization, text representation (TF-IDF), sentiment analysis, named entity recognition (NER), and text classification using scikit-learn.
Allowed libraries: nltk, spacy, textblob, sklearn, pandas, and json. No other external libraries are allowed.
Text Preprocessing (4.1)
Lowercasing, punctuation removal, stop words, tokenization, and TF-IDF vectorization
Sentiment Analysis (4.1)
Polarity detection, sentiment classification, and opinion mining
NER and Classification (4.1)
Entity extraction with SpaCy, text classification with Naive Bayes
The Scenario
ShopSmart E-Commerce Platform
You have been hired as an NLP Engineer at ShopSmart, an e-commerce platform that wants to automatically analyze customer reviews. The product manager has given you this task:
"We have thousands of customer reviews in JSON format. We need a Python system that can clean the text, detect sentiment, extract mentioned products and brands, classify reviews into categories, and generate actionable insights. Can you build this for us?"
Your Task
Create a Jupyter Notebook called nlp_pipeline.ipynb that implements a complete
NLP analysis pipeline. Your code must read reviews from a JSON file, preprocess text using
NLTK, analyze sentiment with TextBlob, extract entities with SpaCy, classify reviews using
TF-IDF and Naive Bayes, and generate an analysis report.
The Dataset
You will work with a customer reviews dataset. Create this file exactly as shown below:
File: reviews.json (Customer Reviews)
{
  "reviews": [
    {
      "id": "R001",
      "product": "Wireless Earbuds Pro",
      "category": "Electronics",
      "text": "Absolutely love these earbuds! The sound quality is amazing and battery lasts forever. Apple AirPods have competition now. Bought them in New York last Monday.",
      "rating": 5
    },
    {
      "id": "R002",
      "product": "Laptop Stand X200",
      "category": "Accessories",
      "text": "Terrible quality. The stand broke after just two weeks. Complete waste of $50. Do NOT buy from this seller. Amazon should remove this product.",
      "rating": 1
    },
    {
      "id": "R003",
      "product": "Mechanical Keyboard RGB",
      "category": "Electronics",
      "text": "Decent keyboard for the price. Keys feel good but the RGB software from Logitech is buggy. Works fine for gaming and typing in my San Francisco office.",
      "rating": 3
    },
    {
      "id": "R004",
      "product": "USB-C Hub 7-in-1",
      "category": "Accessories",
      "text": "Great product! Works perfectly with my MacBook Pro. All ports function as expected. Fast shipping from California. Highly recommend!",
      "rating": 5
    },
    {
      "id": "R005",
      "product": "Noise Cancelling Headphones",
      "category": "Electronics",
      "text": "The noise cancellation is okay but not as good as Sony or Bose. Comfortable for long flights though. Dr. Smith recommended these for focus work.",
      "rating": 3
    },
    {
      "id": "R006",
      "product": "Portable Charger 20000mAh",
      "category": "Accessories",
      "text": "Saved my life during my trip to Tokyo! Charged my iPhone and iPad multiple times. Samsung users will love this too. Worth every penny!",
      "rating": 5
    },
    {
      "id": "R007",
      "product": "Webcam HD 1080p",
      "category": "Electronics",
      "text": "Horrible webcam. Picture quality is grainy and the microphone picks up everything. Returned it immediately. Microsoft Teams calls were embarrassing.",
      "rating": 1
    },
    {
      "id": "R008",
      "product": "Monitor Arm Mount",
      "category": "Accessories",
      "text": "Solid build quality. Installation took about 30 minutes. Works great with my Dell monitor. The instructions from IKEA-style manual were clear.",
      "rating": 4
    },
    {
      "id": "R009",
      "product": "Wireless Mouse Ergonomic",
      "category": "Electronics",
      "text": "My wrist pain is gone after switching to this mouse! Dr. Johnson at the clinic recommended ergonomic devices. Best $40 I ever spent in Chicago.",
      "rating": 5
    },
    {
      "id": "R010",
      "product": "Cable Management Kit",
      "category": "Accessories",
      "text": "Does the job but nothing special. Cables stay organized on my desk. Would be nice if it came in more colors. Ordered in January 2024.",
      "rating": 3
    },
    {
      "id": "R011",
      "product": "Smart Speaker Mini",
      "category": "Electronics",
      "text": "Google Assistant works great! Love asking it questions while cooking. Sound quality is not amazing but good for a small room in London.",
      "rating": 4
    },
    {
      "id": "R012",
      "product": "Laptop Sleeve 15 inch",
      "category": "Accessories",
      "text": "Perfect fit for my HP laptop. Material feels premium. Zipper is smooth. Got it on sale for $20 from Amazon Prime Day. Very satisfied!",
      "rating": 5
    },
    {
      "id": "R013",
      "product": "Bluetooth Adapter 5.0",
      "category": "Electronics",
      "text": "Stopped working after a month. Customer service from the company was useless. Waste of money. Had to buy Intel adapter instead.",
      "rating": 1
    },
    {
      "id": "R014",
      "product": "Desk Organizer Wood",
      "category": "Accessories",
      "text": "Beautiful design! Looks great on my desk. Holds pens, phone, and small items. Handmade quality from local artisans in Portland.",
      "rating": 5
    },
    {
      "id": "R015",
      "product": "Gaming Mousepad XL",
      "category": "Electronics",
      "text": "Huge mousepad, covers my entire desk. Smooth surface for gaming. The Razer logo looks cool. Edges started fraying after 6 months though.",
      "rating": 3
    }
  ],
  "metadata": {
    "total_reviews": 15,
    "collection_date": "2024-12-15",
    "source": "ShopSmart Platform"
  }
}
Data Fields Explained
- id - Unique review identifier (string)
- product - Product name being reviewed (string)
- category - Product category: Electronics or Accessories (string)
- text - The actual review text with real-world messiness (string)
- rating - Customer rating from 1 (worst) to 5 (best) (integer)
Requirements
Your nlp_pipeline.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load Reviews from JSON
Create a function load_reviews(filename) that:
- Uses Python's json module with a context manager (with statement)
- Returns the list of review dictionaries from the JSON file
- Handles file-not-found errors gracefully
def load_reviews(filename):
    """Load review data from JSON file."""
    # Must use: with open(), json.load()
    # Return: list of review dictionaries
    pass
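One way the stub above can be fleshed out (returning an empty list on a missing file is one reasonable reading of "gracefully"; your error handling may differ):

```python
import json

def load_reviews(filename):
    """Load review data from JSON file."""
    try:
        # Context manager guarantees the file is closed even on error
        with open(filename, "r", encoding="utf-8") as f:
            data = json.load(f)
        return data["reviews"]
    except FileNotFoundError:
        print(f"Error: {filename} not found.")
        return []
```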
Text Preprocessing Pipeline
Create a function preprocess_text(text) that:
- Converts text to lowercase
- Removes punctuation using string.punctuation
- Tokenizes using nltk.word_tokenize()
- Removes stop words using nltk.corpus.stopwords
- Returns a list of cleaned tokens
def preprocess_text(text):
    """Complete text preprocessing pipeline."""
    # Must use: lower(), string.punctuation, word_tokenize(), stopwords
    # Return: list of cleaned tokens
    pass
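To make the four stages concrete, here is a stdlib-only illustration of the order of operations. Your submission must swap in nltk.word_tokenize() and nltk.corpus.stopwords; the SAMPLE_STOPWORDS set below is a tiny stand-in, not the real list:

```python
import string

# Tiny stand-in; replace with set(nltk.corpus.stopwords.words('english'))
SAMPLE_STOPWORDS = {"the", "a", "an", "is", "and", "my", "for", "to", "of"}

def preprocess_text_demo(text):
    """Stdlib-only illustration of the required pipeline stages."""
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. strip punctuation
    tokens = text.split()                                             # 3. tokenize (use nltk.word_tokenize!)
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]           # 4. drop stop words
```

Note the ordering: lowercase first so stop-word comparison works, strip punctuation before tokenizing so "earbuds!" becomes "earbuds".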
Sentiment Analysis
Create a function analyze_sentiment(text) that:
- Uses TextBlob to analyze the original (not preprocessed) text
- Returns a tuple: (polarity, subjectivity, sentiment_label)
- Sentiment label: "positive" if polarity > 0.1, "negative" if polarity < -0.1, else "neutral"
def analyze_sentiment(text):
    """Analyze sentiment using TextBlob."""
    # Must return: tuple (polarity, subjectivity, label)
    # Labels: "positive", "negative", "neutral"
    pass
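The thresholds are easy to get wrong at the boundaries (a polarity of exactly 0.1 is neutral, since the spec says strictly greater than), so it helps to isolate that logic in a small helper. The TextBlob call is a sketch with a lazy import, purely our structuring choice:

```python
def polarity_to_label(polarity):
    """Map a polarity score to the required label (note: 0.1 itself is neutral)."""
    if polarity > 0.1:
        return "positive"
    if polarity < -0.1:
        return "negative"
    return "neutral"

def analyze_sentiment(text):
    """Analyze sentiment using TextBlob on the ORIGINAL text."""
    from textblob import TextBlob  # lazy import; module-level is fine too
    sentiment = TextBlob(text).sentiment
    return (sentiment.polarity, sentiment.subjectivity,
            polarity_to_label(sentiment.polarity))
```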
Named Entity Recognition
Create a function extract_entities(text) that:
- Uses SpaCy's en_core_web_sm model
- Extracts all named entities from the text
- Returns a list of tuples: [(entity_text, entity_label), ...]
def extract_entities(text):
    """Extract named entities using SpaCy."""
    # Must use: spacy.load('en_core_web_sm')
    # Return: list of (entity_text, entity_label) tuples
    pass
TF-IDF Vectorization
Create a function create_tfidf_vectors(texts) that:
- Uses TfidfVectorizer from scikit-learn
- Fits and transforms the list of texts
- Returns the vectorizer and the TF-IDF matrix
def create_tfidf_vectors(texts):
    """Create TF-IDF vectors from texts."""
    # Must use: TfidfVectorizer from sklearn
    # Return: (vectorizer, tfidf_matrix)
    pass
Text Classification (Sentiment Prediction)
Create a function train_classifier(tfidf_matrix, labels) that:
- Uses MultinomialNB from scikit-learn
- Trains a Naive Bayes classifier on TF-IDF features
- Returns the trained classifier
def train_classifier(tfidf_matrix, labels):
    """Train a Naive Bayes text classifier."""
    # Must use: MultinomialNB from sklearn.naive_bayes
    # Return: trained classifier
    pass
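A minimal sketch; MultinomialNB accepts the sparse TF-IDF matrix directly, so no conversion is needed:

```python
from sklearn.naive_bayes import MultinomialNB

def train_classifier(tfidf_matrix, labels):
    """Fit MultinomialNB on TF-IDF features and string labels."""
    clf = MultinomialNB()
    clf.fit(tfidf_matrix, labels)
    return clf
```

To predict on new text, reuse the fitted vectorizer from create_tfidf_vectors: `clf.predict(vectorizer.transform(["new review text"]))`.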
Category Statistics
Create a function get_category_stats(reviews) that:
- Groups reviews by category (Electronics vs Accessories)
- Calculates average rating and average sentiment polarity per category
- Returns a dictionary with category statistics
def get_category_stats(reviews):
    """Calculate statistics per category."""
    # Return: {"Electronics": {"count": x, "avg_rating": y, "avg_sentiment": z}, ...}
    pass
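A pure-Python sketch. It assumes each review dict already carries a 'polarity' key, which main() attaches during the per-review loop; the .get default keeps it from crashing if sentiment has not run yet:

```python
def get_category_stats(reviews):
    """Calculate count, average rating, and average polarity per category."""
    totals = {}
    for r in reviews:
        cat = r["category"]
        t = totals.setdefault(cat, {"count": 0, "rating_sum": 0.0, "polarity_sum": 0.0})
        t["count"] += 1
        t["rating_sum"] += r["rating"]
        t["polarity_sum"] += r.get("polarity", 0.0)  # attached by main()
    return {
        cat: {
            "count": t["count"],
            "avg_rating": t["rating_sum"] / t["count"],
            "avg_sentiment": t["polarity_sum"] / t["count"],
        }
        for cat, t in totals.items()
    }
```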
Find Most Common Words
Create a function get_top_words(reviews, n=10) that:
- Preprocesses all review texts
- Counts word frequencies across all reviews
- Returns the top N most common words as a list of tuples: [(word, count), ...]
def get_top_words(reviews, n=10):
    """Find most common words across all reviews."""
    # Must use: collections.Counter or manual counting
    # Return: list of (word, count) tuples, sorted by count descending
    pass
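collections.Counter does the sorting for you via most_common(). The tokenize parameter below defaults to str.split only so this sketch runs standalone; in your submission, call your preprocess_text instead:

```python
from collections import Counter

def get_top_words(reviews, n=10, tokenize=str.split):
    """Count token frequencies across all reviews.

    Pass your preprocess_text as `tokenize` so counts use cleaned tokens.
    """
    counts = Counter()
    for review in reviews:
        counts.update(tokenize(review["text"]))
    return counts.most_common(n)  # already sorted by count, descending
```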
Extract All Mentioned Brands/Organizations
Create a function get_all_brands(reviews) that:
- Uses your extract_entities() function on each review
- Collects all entities labeled as ORG (organization)
- Returns a sorted list of unique brand/organization names
def get_all_brands(reviews):
    """Extract all mentioned brands/organizations."""
    # Must use: extract_entities() and filter for ORG label
    # Return: sorted list of unique organization names
    pass
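A set handles deduplication, and sorted() produces the final list. The entity_fn parameter is our addition so the sketch can be exercised without the SpaCy model; your submission can simply call extract_entities directly:

```python
def get_all_brands(reviews, entity_fn=None):
    """Collect unique ORG entities across all reviews, sorted alphabetically."""
    if entity_fn is None:
        entity_fn = extract_entities  # your SpaCy-backed function
    brands = set()
    for review in reviews:
        for ent_text, ent_label in entity_fn(review["text"]):
            if ent_label == "ORG":  # keep organizations only, not GPE/PERSON/DATE
                brands.add(ent_text)
    return sorted(brands)
```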
Generate Analysis Report
Create a function generate_report(reviews, filename) that:
- Writes a comprehensive text report to a file
- Includes: total reviews, category breakdown, sentiment distribution
- Lists: top 10 words, all extracted brands, and sample insights
- Uses a context manager for file writing
def generate_report(reviews, filename):
    """Generate a comprehensive NLP analysis report."""
    # Write to file using: with open(filename, 'w')
    # Include all statistics and insights
    pass
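A minimal skeleton showing the context-manager pattern and a couple of the required lines; your version must extend it with the category breakdown, top 10 words, brands, and insights listed above:

```python
def generate_report(reviews, filename):
    """Write a plain-text analysis report (skeleton; extend per the spec)."""
    positives = sum(1 for r in reviews if r.get("sentiment") == "positive")
    with open(filename, "w", encoding="utf-8") as f:
        f.write("ShopSmart NLP Analysis Report\n")
        f.write("=" * 30 + "\n")
        f.write(f"Total reviews: {len(reviews)}\n")
        f.write(f"Positive reviews: {positives}\n")
        # ...add category stats, sentiment distribution, top words, brands...
```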
Main Pipeline Function
Create a main() function that:
- Loads reviews from JSON
- Processes all reviews through the NLP pipeline
- Prints summary statistics to console
- Trains and tests the classifier (use rating as label: 4-5 = positive, 1-2 = negative, 3 = neutral)
- Generates the analysis report
def main():
    # Load data
    reviews = load_reviews("reviews.json")
    print(f"Loaded {len(reviews)} reviews")

    # Process each review
    for review in reviews:
        # Preprocess
        tokens = preprocess_text(review['text'])
        # Sentiment
        polarity, subjectivity, label = analyze_sentiment(review['text'])
        review['sentiment'] = label
        review['polarity'] = polarity
        # Entities
        entities = extract_entities(review['text'])
        review['entities'] = entities

    # Get statistics
    stats = get_category_stats(reviews)
    top_words = get_top_words(reviews)
    brands = get_all_brands(reviews)

    # Train the classifier (rating as label: 4-5 = positive, 1-2 = negative, 3 = neutral)
    labels = ["positive" if r['rating'] >= 4 else
              "negative" if r['rating'] <= 2 else "neutral"
              for r in reviews]
    vectorizer, tfidf_matrix = create_tfidf_vectors([r['text'] for r in reviews])
    classifier = train_classifier(tfidf_matrix, labels)

    # Print summary
    print(f"Categories: {list(stats.keys())}")
    print(f"Top words: {top_words[:5]}")
    print(f"Brands mentioned: {brands}")

    # Generate report
    generate_report(reviews, "analysis_report.txt")
    print("Report generated: analysis_report.txt")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
shopsmart-nlp-pipeline
Required Files
shopsmart-nlp-pipeline/
├── nlp_pipeline.ipynb # Your Jupyter Notebook with ALL 11 functions
├── reviews.json # Input dataset (as provided)
├── analysis_report.txt # Generated report from running your notebook
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Brief description of your NLP pipeline approach
- Any challenges faced and how you solved them
- Instructions to run your notebook (including required library installations)
Do Include
- All 11 functions implemented and working
- Docstrings for every function
- Comments explaining NLP concepts
- Generated analysis_report.txt file
- Proper use of NLTK, SpaCy, TextBlob, sklearn
- README.md with all required sections
Do Not Include
- Libraries not in the allowed list
- SpaCy model files (users will download)
- Any .pyc or __pycache__ files
- Code that does not run without errors
- Hardcoded outputs (we test with different data)
Note: your README must tell users to download the NLTK data (nltk.download('punkt'), nltk.download('stopwords')) and the SpaCy model (python -m spacy download en_core_web_sm).
Enter your GitHub username - we will verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Text Preprocessing | 40 | Correct implementation of lowercasing, punctuation removal, tokenization, and stop word removal |
| Sentiment Analysis | 40 | Proper use of TextBlob, correct polarity calculation, and accurate sentiment labeling |
| Named Entity Recognition | 40 | Correct use of SpaCy NER, proper entity extraction, and brand identification |
| TF-IDF and Classification | 50 | Proper TF-IDF vectorization and Naive Bayes classifier implementation |
| Statistics and Analysis | 40 | Correct category statistics, word frequency analysis, and report generation |
| Code Quality | 40 | Docstrings, comments, naming conventions, and clean organization |
| Total | 250 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Text Preprocessing (4.1)
Building a complete preprocessing pipeline with lowercasing, punctuation removal, tokenization, and stop word filtering
Sentiment Analysis (4.1)
Using TextBlob for polarity and subjectivity analysis, converting scores to human-readable sentiment labels
Named Entity Recognition (4.1)
Extracting people, organizations, locations, and dates from text using SpaCy's pre-trained NER models
Text Classification (4.1)
Building TF-IDF vectors and training Naive Bayes classifiers for automatic text categorization
Pro Tips
NLP Best Practices
- Always preprocess text before analysis
- Use original text for sentiment (not preprocessed)
- Load SpaCy model once, reuse for all texts
- Handle empty strings and edge cases
Library Setup
- Install: pip install nltk spacy textblob scikit-learn
- Download NLTK data in your notebook
- Download SpaCy model: en_core_web_sm
- Test imports before writing functions
Time Management
- Start with loading and preprocessing
- Test each function before moving on
- Build sentiment analysis second
- Save NER and classification for last
Common Mistakes
- Forgetting to download NLTK resources
- Using preprocessed text for sentiment
- Not handling missing/empty reviews
- Loading SpaCy model inside loops