Assignment Overview
In this assignment, you will build a complete NLP Analysis Pipeline for customer reviews. This comprehensive project requires you to apply ALL concepts from Module 4: text preprocessing, tokenization, text representation (TF-IDF), sentiment analysis, named entity recognition (NER), and text classification using scikit-learn.
Allowed libraries: nltk, spacy, textblob, sklearn, pandas, and json. No other external libraries are allowed.
Text Preprocessing (4.1)
Lowercasing, punctuation removal, stop words, tokenization, and TF-IDF vectorization
Sentiment Analysis (4.1)
Polarity detection, sentiment classification, and opinion mining
NER and Classification (4.1)
Entity extraction with SpaCy, text classification with Naive Bayes
The Scenario
ShopSmart E-Commerce Platform
You have been hired as an NLP Engineer at ShopSmart, an e-commerce platform that wants to automatically analyze customer reviews. The product manager has given you this task:
"We have thousands of customer reviews in JSON format. We need a Python system that can clean the text, detect sentiment, extract mentioned products and brands, classify reviews into categories, and generate actionable insights. Can you build this for us?"
Your Task
Create a Jupyter Notebook called nlp_pipeline.ipynb that implements a complete
NLP analysis pipeline. Your code must read reviews from a JSON file, preprocess text using
NLTK, analyze sentiment with TextBlob, extract entities with SpaCy, classify reviews using
TF-IDF and Naive Bayes, and generate an analysis report.
The Dataset
You will work with a customer reviews dataset. Create this file exactly as shown below:
File: reviews.json (Customer Reviews)
{
  "reviews": [
    {
      "id": "R001",
      "product": "Wireless Earbuds Pro",
      "category": "Electronics",
      "text": "Absolutely love these earbuds! The sound quality is amazing and battery lasts forever. Apple AirPods have competition now. Bought them in New York last Monday.",
      "rating": 5
    },
    {
      "id": "R002",
      "product": "Laptop Stand X200",
      "category": "Accessories",
      "text": "Terrible quality. The stand broke after just two weeks. Complete waste of $50. Do NOT buy from this seller. Amazon should remove this product.",
      "rating": 1
    },
    {
      "id": "R003",
      "product": "Mechanical Keyboard RGB",
      "category": "Electronics",
      "text": "Decent keyboard for the price. Keys feel good but the RGB software from Logitech is buggy. Works fine for gaming and typing in my San Francisco office.",
      "rating": 3
    },
    {
      "id": "R004",
      "product": "USB-C Hub 7-in-1",
      "category": "Accessories",
      "text": "Great product! Works perfectly with my MacBook Pro. All ports function as expected. Fast shipping from California. Highly recommend!",
      "rating": 5
    },
    {
      "id": "R005",
      "product": "Noise Cancelling Headphones",
      "category": "Electronics",
      "text": "The noise cancellation is okay but not as good as Sony or Bose. Comfortable for long flights though. Dr. Smith recommended these for focus work.",
      "rating": 3
    },
    {
      "id": "R006",
      "product": "Portable Charger 20000mAh",
      "category": "Accessories",
      "text": "Saved my life during my trip to Tokyo! Charged my iPhone and iPad multiple times. Samsung users will love this too. Worth every penny!",
      "rating": 5
    },
    {
      "id": "R007",
      "product": "Webcam HD 1080p",
      "category": "Electronics",
      "text": "Horrible webcam. Picture quality is grainy and the microphone picks up everything. Returned it immediately. Microsoft Teams calls were embarrassing.",
      "rating": 1
    },
    {
      "id": "R008",
      "product": "Monitor Arm Mount",
      "category": "Accessories",
      "text": "Solid build quality. Installation took about 30 minutes. Works great with my Dell monitor. The instructions from IKEA-style manual were clear.",
      "rating": 4
    },
    {
      "id": "R009",
      "product": "Wireless Mouse Ergonomic",
      "category": "Electronics",
      "text": "My wrist pain is gone after switching to this mouse! Dr. Johnson at the clinic recommended ergonomic devices. Best $40 I ever spent in Chicago.",
      "rating": 5
    },
    {
      "id": "R010",
      "product": "Cable Management Kit",
      "category": "Accessories",
      "text": "Does the job but nothing special. Cables stay organized on my desk. Would be nice if it came in more colors. Ordered in January 2024.",
      "rating": 3
    },
    {
      "id": "R011",
      "product": "Smart Speaker Mini",
      "category": "Electronics",
      "text": "Google Assistant works great! Love asking it questions while cooking. Sound quality is not amazing but good for a small room in London.",
      "rating": 4
    },
    {
      "id": "R012",
      "product": "Laptop Sleeve 15 inch",
      "category": "Accessories",
      "text": "Perfect fit for my HP laptop. Material feels premium. Zipper is smooth. Got it on sale for $20 from Amazon Prime Day. Very satisfied!",
      "rating": 5
    },
    {
      "id": "R013",
      "product": "Bluetooth Adapter 5.0",
      "category": "Electronics",
      "text": "Stopped working after a month. Customer service from the company was useless. Waste of money. Had to buy Intel adapter instead.",
      "rating": 1
    },
    {
      "id": "R014",
      "product": "Desk Organizer Wood",
      "category": "Accessories",
      "text": "Beautiful design! Looks great on my desk. Holds pens, phone, and small items. Handmade quality from local artisans in Portland.",
      "rating": 5
    },
    {
      "id": "R015",
      "product": "Gaming Mousepad XL",
      "category": "Electronics",
      "text": "Huge mousepad, covers my entire desk. Smooth surface for gaming. The Razer logo looks cool. Edges started fraying after 6 months though.",
      "rating": 3
    }
  ],
  "metadata": {
    "total_reviews": 15,
    "collection_date": "2024-12-15",
    "source": "ShopSmart Platform"
  }
}
Data Fields Explained
- id - Unique review identifier (string)
- product - Product name being reviewed (string)
- category - Product category: Electronics or Accessories (string)
- text - The actual review text with real-world messiness (string)
- rating - Customer rating from 1 (worst) to 5 (best) (integer)
Requirements
Your nlp_pipeline.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load Reviews from JSON
Create a function load_reviews(filename) that:
- Uses Python's json module with a context manager (with statement)
- Returns the list of review dictionaries from the JSON file
- Handles file-not-found errors gracefully
def load_reviews(filename):
    """Load review data from JSON file."""
    # Must use: with open(), json.load()
    # Return: list of review dictionaries
    pass
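One way the stub above can be fleshed out (returning an empty list on a missing file is one reasonable reading of "gracefully"; your error handling may differ):

```python
import json

def load_reviews(filename):
    """Load review data from JSON file."""
    try:
        # Context manager guarantees the file is closed even on error
        with open(filename, "r", encoding="utf-8") as f:
            data = json.load(f)
        return data["reviews"]
    except FileNotFoundError:
        print(f"Error: {filename} not found.")
        return []
```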
Text Preprocessing Pipeline
Create a function preprocess_text(text) that:
- Converts text to lowercase
- Removes punctuation using string.punctuation
- Tokenizes using nltk.word_tokenize()
- Removes stop words using nltk.corpus.stopwords
- Returns a list of cleaned tokens
def preprocess_text(text):
    """Complete text preprocessing pipeline."""
    # Must use: lower(), string.punctuation, word_tokenize(), stopwords
    # Return: list of cleaned tokens
    pass
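To make the four stages concrete, here is a stdlib-only illustration of the order of operations. Your submission must swap in nltk.word_tokenize() and nltk.corpus.stopwords; the SAMPLE_STOPWORDS set below is a tiny stand-in, not the real list:

```python
import string

# Tiny stand-in; replace with set(nltk.corpus.stopwords.words('english'))
SAMPLE_STOPWORDS = {"the", "a", "an", "is", "and", "my", "for", "to", "of"}

def preprocess_text_demo(text):
    """Stdlib-only illustration of the required pipeline stages."""
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. strip punctuation
    tokens = text.split()                                             # 3. tokenize (use nltk.word_tokenize!)
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]           # 4. drop stop words
```

Note the ordering: lowercase first so stop-word comparison works, strip punctuation before tokenizing so "earbuds!" becomes "earbuds".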
Sentiment Analysis
Create a function analyze_sentiment(text) that:
- Uses TextBlob to analyze the original (not preprocessed) text
- Returns a tuple: (polarity, subjectivity, sentiment_label)
- Sentiment label: "positive" if polarity > 0.1, "negative" if polarity < -0.1, else "neutral"
def analyze_sentiment(text):
    """Analyze sentiment using TextBlob."""
    # Must return: tuple (polarity, subjectivity, label)
    # Labels: "positive", "negative", "neutral"
    pass
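The thresholds are easy to get wrong at the boundaries (a polarity of exactly 0.1 is neutral, since the spec says strictly greater than), so it helps to isolate that logic in a small helper. The TextBlob call is a sketch with a lazy import, purely our structuring choice:

```python
def polarity_to_label(polarity):
    """Map a polarity score to the required label (note: 0.1 itself is neutral)."""
    if polarity > 0.1:
        return "positive"
    if polarity < -0.1:
        return "negative"
    return "neutral"

def analyze_sentiment(text):
    """Analyze sentiment using TextBlob on the ORIGINAL text."""
    from textblob import TextBlob  # lazy import; module-level is fine too
    sentiment = TextBlob(text).sentiment
    return (sentiment.polarity, sentiment.subjectivity,
            polarity_to_label(sentiment.polarity))
```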
Named Entity Recognition
Create a function extract_entities(text) that:
- Uses SpaCy's en_core_web_sm model
- Extracts all named entities from the text
- Returns a list of tuples: [(entity_text, entity_label), ...]
def extract_entities(text):
    """Extract named entities using SpaCy."""
    # Must use: spacy.load('en_core_web_sm')
    # Return: list of (entity_text, entity_label) tuples
    pass
TF-IDF Vectorization
Create a function create_tfidf_vectors(texts) that:
- Uses TfidfVectorizer from scikit-learn
- Fits and transforms the list of texts
- Returns the vectorizer and the TF-IDF matrix
def create_tfidf_vectors(texts):
    """Create TF-IDF vectors from texts."""
    # Must use: TfidfVectorizer from sklearn
    # Return: (vectorizer, tfidf_matrix)
    pass
Text Classification (Sentiment Prediction)
Create a function train_classifier(tfidf_matrix, labels) that:
- Uses MultinomialNB from scikit-learn
- Trains a Naive Bayes classifier on TF-IDF features
- Returns the trained classifier
def train_classifier(tfidf_matrix, labels):
    """Train a Naive Bayes text classifier."""
    # Must use: MultinomialNB from sklearn.naive_bayes
    # Return: trained classifier
    pass
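A minimal sketch; MultinomialNB accepts the sparse TF-IDF matrix directly, so no conversion is needed:

```python
from sklearn.naive_bayes import MultinomialNB

def train_classifier(tfidf_matrix, labels):
    """Fit MultinomialNB on TF-IDF features and string labels."""
    clf = MultinomialNB()
    clf.fit(tfidf_matrix, labels)
    return clf
```

To predict on new text, reuse the fitted vectorizer from create_tfidf_vectors: `clf.predict(vectorizer.transform(["new review text"]))`.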
Category Statistics
Create a function get_category_stats(reviews) that:
- Groups reviews by category (Electronics vs Accessories)
- Calculates average rating and average sentiment polarity per category
- Returns a dictionary with category statistics
def get_category_stats(reviews):
    """Calculate statistics per category."""
    # Return: {"Electronics": {"count": x, "avg_rating": y, "avg_sentiment": z}, ...}
    pass
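A pure-Python sketch. It assumes each review dict already carries a 'polarity' key, which main() attaches during the per-review loop; the .get default keeps it from crashing if sentiment has not run yet:

```python
def get_category_stats(reviews):
    """Calculate count, average rating, and average polarity per category."""
    totals = {}
    for r in reviews:
        cat = r["category"]
        t = totals.setdefault(cat, {"count": 0, "rating_sum": 0.0, "polarity_sum": 0.0})
        t["count"] += 1
        t["rating_sum"] += r["rating"]
        t["polarity_sum"] += r.get("polarity", 0.0)  # attached by main()
    return {
        cat: {
            "count": t["count"],
            "avg_rating": t["rating_sum"] / t["count"],
            "avg_sentiment": t["polarity_sum"] / t["count"],
        }
        for cat, t in totals.items()
    }
```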
Find Most Common Words
Create a function get_top_words(reviews, n=10) that:
- Preprocesses all review texts
- Counts word frequencies across all reviews
- Returns the top N most common words as a list of tuples: [(word, count), ...]
def get_top_words(reviews, n=10):
    """Find most common words across all reviews."""
    # Must use: collections.Counter or manual counting
    # Return: list of (word, count) tuples, sorted by count descending
    pass
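collections.Counter does the sorting for you via most_common(). The tokenize parameter below defaults to str.split only so this sketch runs standalone; in your submission, call your preprocess_text instead:

```python
from collections import Counter

def get_top_words(reviews, n=10, tokenize=str.split):
    """Count token frequencies across all reviews.

    Pass your preprocess_text as `tokenize` so counts use cleaned tokens.
    """
    counts = Counter()
    for review in reviews:
        counts.update(tokenize(review["text"]))
    return counts.most_common(n)  # already sorted by count, descending
```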
Extract All Mentioned Brands/Organizations
Create a function get_all_brands(reviews) that:
- Uses your extract_entities() function on each review
- Collects all entities labeled as ORG (organization)
- Returns a sorted list of unique brand/organization names
def get_all_brands(reviews):
    """Extract all mentioned brands/organizations."""
    # Must use: extract_entities() and filter for ORG label
    # Return: sorted list of unique organization names
    pass
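A set handles deduplication, and sorted() produces the final list. The entity_fn parameter is our addition so the sketch can be exercised without the SpaCy model; your submission can simply call extract_entities directly:

```python
def get_all_brands(reviews, entity_fn=None):
    """Collect unique ORG entities across all reviews, sorted alphabetically."""
    if entity_fn is None:
        entity_fn = extract_entities  # your SpaCy-backed function
    brands = set()
    for review in reviews:
        for ent_text, ent_label in entity_fn(review["text"]):
            if ent_label == "ORG":  # keep organizations only, not GPE/PERSON/DATE
                brands.add(ent_text)
    return sorted(brands)
```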
Generate Analysis Report
Create a function generate_report(reviews, filename) that:
- Writes a comprehensive text report to a file
- Includes: total reviews, category breakdown, sentiment distribution
- Lists: top 10 words, all extracted brands, and sample insights
- Uses a context manager for file writing
def generate_report(reviews, filename):
    """Generate a comprehensive NLP analysis report."""
    # Write to file using: with open(filename, 'w')
    # Include all statistics and insights
    pass
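A minimal skeleton showing the context-manager pattern and a couple of the required lines; your version must extend it with the category breakdown, top 10 words, brands, and insights listed above:

```python
def generate_report(reviews, filename):
    """Write a plain-text analysis report (skeleton; extend per the spec)."""
    positives = sum(1 for r in reviews if r.get("sentiment") == "positive")
    with open(filename, "w", encoding="utf-8") as f:
        f.write("ShopSmart NLP Analysis Report\n")
        f.write("=" * 30 + "\n")
        f.write(f"Total reviews: {len(reviews)}\n")
        f.write(f"Positive reviews: {positives}\n")
        # ...add category stats, sentiment distribution, top words, brands...
```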
Main Pipeline Function
Create a main() function that:
- Loads reviews from JSON
- Processes all reviews through the NLP pipeline
- Prints summary statistics to console
- Trains and tests the classifier (use rating as label: 4-5 = positive, 1-2 = negative, 3 = neutral)
- Generates the analysis report
def main():
    # Load data
    reviews = load_reviews("reviews.json")
    print(f"Loaded {len(reviews)} reviews")

    # Process each review
    for review in reviews:
        # Preprocess
        tokens = preprocess_text(review['text'])
        # Sentiment
        polarity, subjectivity, label = analyze_sentiment(review['text'])
        review['sentiment'] = label
        review['polarity'] = polarity
        # Entities
        entities = extract_entities(review['text'])
        review['entities'] = entities

    # Get statistics
    stats = get_category_stats(reviews)
    top_words = get_top_words(reviews)
    brands = get_all_brands(reviews)

    # Train the classifier (rating as label: 4-5 = positive, 1-2 = negative, 3 = neutral)
    labels = ["positive" if r['rating'] >= 4 else
              "negative" if r['rating'] <= 2 else "neutral"
              for r in reviews]
    vectorizer, tfidf_matrix = create_tfidf_vectors([r['text'] for r in reviews])
    classifier = train_classifier(tfidf_matrix, labels)

    # Print summary
    print(f"Categories: {list(stats.keys())}")
    print(f"Top words: {top_words[:5]}")
    print(f"Brands mentioned: {brands}")

    # Generate report
    generate_report(reviews, "analysis_report.txt")
    print("Report generated: analysis_report.txt")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
shopsmart-nlp-pipeline
Required Files
shopsmart-nlp-pipeline/
├── nlp_pipeline.ipynb # Your Jupyter Notebook with ALL 11 functions
├── reviews.json # Input dataset (as provided)
├── analysis_report.txt # Generated report from running your notebook
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Brief description of your NLP pipeline approach
- Any challenges faced and how you solved them
- Instructions to run your notebook (including required library installations)
Do Include
- All 11 functions implemented and working
- Docstrings for every function
- Comments explaining NLP concepts
- Generated analysis_report.txt file
- Proper use of NLTK, SpaCy, TextBlob, sklearn
- README.md with all required sections
Do Not Include
- Libraries not in the allowed list
- SpaCy model files (users will download)
- Any .pyc or __pycache__ files
- Code that does not run without errors
- Hardcoded outputs (we test with different data)
Note: your README must tell users to download the NLTK data (nltk.download('punkt'), nltk.download('stopwords')) and the SpaCy model (python -m spacy download en_core_web_sm).
Enter your GitHub username - we will verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Text Preprocessing | 40 | Correct implementation of lowercasing, punctuation removal, tokenization, and stop word removal |
| Sentiment Analysis | 40 | Proper use of TextBlob, correct polarity calculation, and accurate sentiment labeling |
| Named Entity Recognition | 40 | Correct use of SpaCy NER, proper entity extraction, and brand identification |
| TF-IDF and Classification | 50 | Proper TF-IDF vectorization and Naive Bayes classifier implementation |
| Statistics and Analysis | 40 | Correct category statistics, word frequency analysis, and report generation |
| Code Quality | 40 | Docstrings, comments, naming conventions, and clean organization |
| Total | 250 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Text Preprocessing (4.1)
Building a complete preprocessing pipeline with lowercasing, punctuation removal, tokenization, and stop word filtering
Sentiment Analysis (4.1)
Using TextBlob for polarity and subjectivity analysis, converting scores to human-readable sentiment labels
Named Entity Recognition (4.1)
Extracting people, organizations, locations, and dates from text using SpaCy's pre-trained NER models
Text Classification (4.1)
Building TF-IDF vectors and training Naive Bayes classifiers for automatic text categorization
Pro Tips
NLP Best Practices
- Always preprocess text before analysis
- Use original text for sentiment (not preprocessed)
- Load SpaCy model once, reuse for all texts
- Handle empty strings and edge cases
Library Setup
- Install: pip install nltk spacy textblob scikit-learn
- Download NLTK data in your notebook
- Download SpaCy model: en_core_web_sm
- Test imports before writing functions
Time Management
- Start with loading and preprocessing
- Test each function before moving on
- Build sentiment analysis second
- Save NER and classification for last
Common Mistakes
- Forgetting to download NLTK resources
- Using preprocessed text for sentiment
- Not handling missing/empty reviews
- Loading SpaCy model inside loops