Assignment 4-A

NLP Pipeline Project

Build a complete Natural Language Processing pipeline that analyzes customer reviews. Apply text preprocessing, sentiment analysis, named entity recognition, and text classification to extract insights from real-world text data.

8-10 hours
Challenging
250 Points
What You'll Practice
  • Build complete text preprocessing pipelines
  • Implement sentiment analysis with TextBlob
  • Extract entities using SpaCy NER
  • Train text classification models
  • Generate insights from text data
01

Assignment Overview

In this assignment, you will build a complete NLP Analysis Pipeline for customer reviews. This comprehensive project requires you to apply ALL concepts from Module 4: text preprocessing, tokenization, text representation (TF-IDF), sentiment analysis, named entity recognition (NER), and text classification using scikit-learn.

Allowed Libraries Only: You may use nltk, spacy, textblob, sklearn, pandas, and json. No other external libraries are allowed.
Skills Applied: This assignment tests your understanding of NLP Basics (Topic 4.1) from Module 4.
Text Preprocessing (4.1)

Lowercasing, punctuation removal, stop words, tokenization, and TF-IDF vectorization

Sentiment Analysis (4.1)

Polarity detection, sentiment classification, and opinion mining

NER and Classification (4.1)

Entity extraction with SpaCy, text classification with Naive Bayes

02

The Scenario

ShopSmart E-Commerce Platform

You have been hired as an NLP Engineer at ShopSmart, an e-commerce platform that wants to automatically analyze customer reviews. The product manager has given you this task:

"We have thousands of customer reviews in JSON format. We need a Python system that can clean the text, detect sentiment, extract mentioned products and brands, classify reviews into categories, and generate actionable insights. Can you build this for us?"

Your Task

Create a Jupyter Notebook called nlp_pipeline.ipynb that implements a complete NLP analysis pipeline. Your code must read reviews from a JSON file, preprocess text using NLTK, analyze sentiment with TextBlob, extract entities with SpaCy, classify reviews using TF-IDF and Naive Bayes, and generate an analysis report.

03

The Dataset

You will work with a customer reviews dataset. Create this file exactly as shown below:

File: reviews.json (Customer Reviews)

{
  "reviews": [
    {
      "id": "R001",
      "product": "Wireless Earbuds Pro",
      "category": "Electronics",
      "text": "Absolutely love these earbuds! The sound quality is amazing and battery lasts forever. Apple AirPods have competition now. Bought them in New York last Monday.",
      "rating": 5
    },
    {
      "id": "R002",
      "product": "Laptop Stand X200",
      "category": "Accessories",
      "text": "Terrible quality. The stand broke after just two weeks. Complete waste of $50. Do NOT buy from this seller. Amazon should remove this product.",
      "rating": 1
    },
    {
      "id": "R003",
      "product": "Mechanical Keyboard RGB",
      "category": "Electronics",
      "text": "Decent keyboard for the price. Keys feel good but the RGB software from Logitech is buggy. Works fine for gaming and typing in my San Francisco office.",
      "rating": 3
    },
    {
      "id": "R004",
      "product": "USB-C Hub 7-in-1",
      "category": "Accessories",
      "text": "Great product! Works perfectly with my MacBook Pro. All ports function as expected. Fast shipping from California. Highly recommend!",
      "rating": 5
    },
    {
      "id": "R005",
      "product": "Noise Cancelling Headphones",
      "category": "Electronics",
      "text": "The noise cancellation is okay but not as good as Sony or Bose. Comfortable for long flights though. Dr. Smith recommended these for focus work.",
      "rating": 3
    },
    {
      "id": "R006",
      "product": "Portable Charger 20000mAh",
      "category": "Accessories",
      "text": "Saved my life during my trip to Tokyo! Charged my iPhone and iPad multiple times. Samsung users will love this too. Worth every penny!",
      "rating": 5
    },
    {
      "id": "R007",
      "product": "Webcam HD 1080p",
      "category": "Electronics",
      "text": "Horrible webcam. Picture quality is grainy and the microphone picks up everything. Returned it immediately. Microsoft Teams calls were embarrassing.",
      "rating": 1
    },
    {
      "id": "R008",
      "product": "Monitor Arm Mount",
      "category": "Accessories",
      "text": "Solid build quality. Installation took about 30 minutes. Works great with my Dell monitor. The instructions from IKEA-style manual were clear.",
      "rating": 4
    },
    {
      "id": "R009",
      "product": "Wireless Mouse Ergonomic",
      "category": "Electronics",
      "text": "My wrist pain is gone after switching to this mouse! Dr. Johnson at the clinic recommended ergonomic devices. Best $40 I ever spent in Chicago.",
      "rating": 5
    },
    {
      "id": "R010",
      "product": "Cable Management Kit",
      "category": "Accessories",
      "text": "Does the job but nothing special. Cables stay organized on my desk. Would be nice if it came in more colors. Ordered in January 2024.",
      "rating": 3
    },
    {
      "id": "R011",
      "product": "Smart Speaker Mini",
      "category": "Electronics",
      "text": "Google Assistant works great! Love asking it questions while cooking. Sound quality is not amazing but good for a small room in London.",
      "rating": 4
    },
    {
      "id": "R012",
      "product": "Laptop Sleeve 15 inch",
      "category": "Accessories",
      "text": "Perfect fit for my HP laptop. Material feels premium. Zipper is smooth. Got it on sale for $20 from Amazon Prime Day. Very satisfied!",
      "rating": 5
    },
    {
      "id": "R013",
      "product": "Bluetooth Adapter 5.0",
      "category": "Electronics",
      "text": "Stopped working after a month. Customer service from the company was useless. Waste of money. Had to buy Intel adapter instead.",
      "rating": 1
    },
    {
      "id": "R014",
      "product": "Desk Organizer Wood",
      "category": "Accessories",
      "text": "Beautiful design! Looks great on my desk. Holds pens, phone, and small items. Handmade quality from local artisans in Portland.",
      "rating": 5
    },
    {
      "id": "R015",
      "product": "Gaming Mousepad XL",
      "category": "Electronics",
      "text": "Huge mousepad, covers my entire desk. Smooth surface for gaming. The Razer logo looks cool. Edges started fraying after 6 months though.",
      "rating": 3
    }
  ],
  "metadata": {
    "total_reviews": 15,
    "collection_date": "2024-12-15",
    "source": "ShopSmart Platform"
  }
}
Data Fields Explained
  • id - Unique review identifier (string)
  • product - Product name being reviewed (string)
  • category - Product category: Electronics or Accessories (string)
  • text - The actual review text with real-world messiness (string)
  • rating - Customer rating from 1 (worst) to 5 (best) (integer)
04

Requirements

Your nlp_pipeline.ipynb must implement ALL of the following functions. Each function is mandatory and will be tested individually.

1
Load Reviews from JSON

Create a function load_reviews(filename) that:

  • Uses Python's json module with context manager (with statement)
  • Returns the list of review dictionaries from the JSON file
  • Handles file not found errors gracefully
def load_reviews(filename):
    """Load review data from JSON file."""
    # Must use: with open(), json.load()
    # Return: list of review dictionaries
    pass
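One way this could be implemented (a sketch, not the only acceptable solution; the exact error message wording is a choice, and the function reads the top-level "reviews" key shown in the dataset above):

```python
import json

def load_reviews(filename):
    """Load review data from a JSON file; return an empty list if missing."""
    try:
        with open(filename, "r", encoding="utf-8") as f:
            data = json.load(f)
        return data["reviews"]  # reviews sit under the top-level "reviews" key
    except FileNotFoundError:
        print(f"Error: '{filename}' not found.")
        return []
```

Returning an empty list on failure lets the rest of the pipeline run without crashing, which is one reasonable interpretation of "handles errors gracefully".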
2
Text Preprocessing Pipeline

Create a function preprocess_text(text) that:

  • Converts text to lowercase
  • Removes punctuation using string.punctuation
  • Tokenizes using nltk.word_tokenize()
  • Removes stop words using nltk.corpus.stopwords
  • Returns a list of cleaned tokens
def preprocess_text(text):
    """Complete text preprocessing pipeline."""
    # Must use: lower(), string.punctuation, word_tokenize(), stopwords
    # Return: list of cleaned tokens
    pass
3
Sentiment Analysis

Create a function analyze_sentiment(text) that:

  • Uses TextBlob to analyze the original (not preprocessed) text
  • Returns a tuple: (polarity, subjectivity, sentiment_label)
  • Sentiment label: "positive" if polarity > 0.1, "negative" if < -0.1, else "neutral"
def analyze_sentiment(text):
    """Analyze sentiment using TextBlob."""
    # Must return: tuple (polarity, subjectivity, label)
    # Labels: "positive", "negative", "neutral"
    pass
4
Named Entity Recognition

Create a function extract_entities(text) that:

  • Uses SpaCy's en_core_web_sm model
  • Extracts all named entities from the text
  • Returns a list of tuples: [(entity_text, entity_label), ...]
def extract_entities(text):
    """Extract named entities using SpaCy."""
    # Must use: spacy.load('en_core_web_sm')
    # Return: list of (entity_text, entity_label) tuples
    pass
5
TF-IDF Vectorization

Create a function create_tfidf_vectors(texts) that:

  • Uses TfidfVectorizer from scikit-learn
  • Fits and transforms the list of texts
  • Returns the vectorizer and the TF-IDF matrix
def create_tfidf_vectors(texts):
    """Create TF-IDF vectors from texts."""
    # Must use: TfidfVectorizer from sklearn
    # Return: (vectorizer, tfidf_matrix)
    pass
6
Text Classification (Sentiment Prediction)

Create a function train_classifier(tfidf_matrix, labels) that:

  • Uses MultinomialNB from scikit-learn
  • Trains a Naive Bayes classifier on TF-IDF features
  • Returns the trained classifier
def train_classifier(tfidf_matrix, labels):
    """Train a Naive Bayes text classifier."""
    # Must use: MultinomialNB from sklearn.naive_bayes
    # Return: trained classifier
    pass
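A minimal sketch; MultinomialNB works directly on the sparse TF-IDF matrix:

```python
from sklearn.naive_bayes import MultinomialNB

def train_classifier(tfidf_matrix, labels):
    """Train a Naive Bayes classifier on TF-IDF features."""
    clf = MultinomialNB()
    clf.fit(tfidf_matrix, labels)
    return clf
```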
7
Category Statistics

Create a function get_category_stats(reviews) that:

  • Groups reviews by category (Electronics vs Accessories)
  • Calculates average rating and average sentiment polarity per category
  • Returns a dictionary with category statistics
def get_category_stats(reviews):
    """Calculate statistics per category."""
    # Return: {"Electronics": {"count": x, "avg_rating": y, "avg_sentiment": z}, ...}
    pass
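A pure-Python sketch; it assumes each review dict already carries the "polarity" value attached by the main pipeline (as in the `main()` example below), falling back to 0.0 when it is missing:

```python
def get_category_stats(reviews):
    """Calculate count, average rating, and average polarity per category."""
    stats = {}
    for r in reviews:
        bucket = stats.setdefault(
            r["category"], {"count": 0, "rating_sum": 0, "polarity_sum": 0.0}
        )
        bucket["count"] += 1
        bucket["rating_sum"] += r["rating"]
        bucket["polarity_sum"] += r.get("polarity", 0.0)
    return {
        cat: {
            "count": b["count"],
            "avg_rating": b["rating_sum"] / b["count"],
            "avg_sentiment": b["polarity_sum"] / b["count"],
        }
        for cat, b in stats.items()
    }
```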
8
Find Most Common Words

Create a function get_top_words(reviews, n=10) that:

  • Preprocesses all review texts
  • Counts word frequencies across all reviews
  • Returns the top N most common words as a list of tuples: [(word, count), ...]
def get_top_words(reviews, n=10):
    """Find most common words across all reviews."""
    # Must use: collections.Counter or manual counting
    # Return: list of (word, count) tuples, sorted by count descending
    pass
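A sketch using `collections.Counter`. To keep it self-contained, a simple lowercase-and-strip-punctuation step stands in for your `preprocess_text()`; in your notebook, call your real preprocessing function instead:

```python
import string
from collections import Counter

def get_top_words(reviews, n=10):
    """Find the n most common words across all reviews."""
    counts = Counter()
    for r in reviews:
        # Stand-in for preprocess_text(): lowercase, strip punctuation, split.
        text = r["text"].lower().translate(
            str.maketrans("", "", string.punctuation)
        )
        counts.update(text.split())
    return counts.most_common(n)  # already sorted by count, descending
```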
9
Extract All Mentioned Brands/Organizations

Create a function get_all_brands(reviews) that:

  • Uses your extract_entities() function on each review
  • Collects all entities labeled as ORG (organization)
  • Returns a sorted list of unique brand/organization names
def get_all_brands(reviews):
    """Extract all mentioned brands/organizations."""
    # Must use: extract_entities() and filter for ORG label
    # Return: sorted list of unique organization names
    pass
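The aggregation logic could look like this. A tiny hardcoded lookup stands in for the SpaCy-based `extract_entities()` from requirement 4 so the sketch runs on its own; in your notebook, use your real function:

```python
def extract_entities(text):
    # Placeholder for the SpaCy version from requirement 4: a hypothetical
    # hardcoded lookup so this sketch runs without the model installed.
    known = {"Apple": "ORG", "Sony": "ORG", "Tokyo": "GPE"}
    entities = []
    for word in text.split():
        clean = word.strip(".,!?")
        if clean in known:
            entities.append((clean, known[clean]))
    return entities

def get_all_brands(reviews):
    """Collect unique ORG entities across all reviews, sorted alphabetically."""
    brands = set()
    for review in reviews:
        for ent_text, ent_label in extract_entities(review["text"]):
            if ent_label == "ORG":
                brands.add(ent_text)
    return sorted(brands)
```

Using a set before sorting handles the "unique" requirement; brands mentioned in several reviews appear only once.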
10
Generate Analysis Report

Create a function generate_report(reviews, filename) that:

  • Writes a comprehensive text report to a file
  • Includes: total reviews, category breakdown, sentiment distribution
  • Lists: top 10 words, all extracted brands, and sample insights
  • Uses context manager for file writing
def generate_report(reviews, filename):
    """Generate a comprehensive NLP analysis report."""
    # Write to file using: with open(filename, 'w')
    # Include all statistics and insights
    pass
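A skeleton of the report writer (the exact headings and layout are up to you; this shows the structure and assumes each review dict carries the "sentiment" label attached earlier in the pipeline):

```python
def generate_report(reviews, filename):
    """Write a plain-text analysis report summarizing the reviews."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write("ShopSmart Review Analysis Report\n")
        f.write("=" * 40 + "\n\n")
        f.write(f"Total reviews: {len(reviews)}\n\n")
        f.write("Sentiment distribution:\n")
        for label in ("positive", "negative", "neutral"):
            count = sum(1 for r in reviews if r.get("sentiment") == label)
            f.write(f"  {label}: {count}\n")
        # ...then append category breakdown, top 10 words, brands, and
        # sample insights using the functions from requirements 7-9.
```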
11
Main Pipeline Function

Create a main() function that:

  • Loads reviews from JSON
  • Processes all reviews through the NLP pipeline
  • Prints summary statistics to console
  • Trains and tests the classifier (use rating as label: 4-5 = positive, 1-2 = negative, 3 = neutral)
  • Generates the analysis report
def main():
    # Load data
    reviews = load_reviews("reviews.json")
    print(f"Loaded {len(reviews)} reviews")
    
    # Process each review
    for review in reviews:
        # Preprocess
        tokens = preprocess_text(review['text'])
        
        # Sentiment
        polarity, subjectivity, label = analyze_sentiment(review['text'])
        review['sentiment'] = label
        review['polarity'] = polarity
        
        # Entities
        entities = extract_entities(review['text'])
        review['entities'] = entities
    
    # Get statistics
    stats = get_category_stats(reviews)
    top_words = get_top_words(reviews)
    brands = get_all_brands(reviews)
    
    # Print summary
    print(f"Categories: {list(stats.keys())}")
    print(f"Top words: {top_words[:5]}")
    print(f"Brands mentioned: {brands}")
    
    # Generate report
    generate_report(reviews, "analysis_report.txt")
    print("Report generated: analysis_report.txt")

if __name__ == "__main__":
    main()
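The `main()` skeleton above omits the classifier step. One way to realize the rating-to-label mapping it describes (4-5 = positive, 1-2 = negative, 3 = neutral) is a small helper, then a few lines wiring it to your TF-IDF and classifier functions; the helper name is a suggestion, not prescribed:

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to a sentiment label for training."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"

# Inside main(), after the per-review loop, something like:
#   texts = [r["text"] for r in reviews]
#   labels = [rating_to_label(r["rating"]) for r in reviews]
#   vectorizer, tfidf_matrix = create_tfidf_vectors(texts)
#   clf = train_classifier(tfidf_matrix, labels)
#   print(f"Training accuracy: {clf.score(tfidf_matrix, labels):.2f}")
```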
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
shopsmart-nlp-pipeline
github.com/<your-username>/shopsmart-nlp-pipeline
Required Files
shopsmart-nlp-pipeline/
├── nlp_pipeline.ipynb        # Your Jupyter Notebook with ALL 11 functions
├── reviews.json              # Input dataset (as provided)
├── analysis_report.txt       # Generated report from running your notebook
└── README.md                 # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Brief description of your NLP pipeline approach
  • Any challenges faced and how you solved them
  • Instructions to run your notebook (including required library installations)
Do Include
  • All 11 functions implemented and working
  • Docstrings for every function
  • Comments explaining NLP concepts
  • Generated analysis_report.txt file
  • Proper use of NLTK, SpaCy, TextBlob, sklearn
  • README.md with all required sections
Do Not Include
  • Libraries not in the allowed list
  • SpaCy model files (users will download)
  • Any .pyc or __pycache__ files
  • Code that does not run without errors
  • Hardcoded outputs (we test with different data)
Important: Before submitting, run all cells in your notebook to ensure it executes without errors. Make sure to download NLTK data (nltk.download('punkt'), nltk.download('stopwords')) and SpaCy model (python -m spacy download en_core_web_sm).
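The full setup sequence might look like this (run once before opening the notebook; package names are as published on PyPI):

```shell
pip install nltk spacy textblob scikit-learn pandas
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```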
Submit Your Assignment

Enter your GitHub username; we will verify your repository automatically.

06

Grading Rubric

Your assignment will be graded on the following criteria:

  • Text Preprocessing (40 pts) - Correct implementation of lowercasing, punctuation removal, tokenization, and stop word removal
  • Sentiment Analysis (40 pts) - Proper use of TextBlob, correct polarity calculation, and accurate sentiment labeling
  • Named Entity Recognition (40 pts) - Correct use of SpaCy NER, proper entity extraction, and brand identification
  • TF-IDF and Classification (50 pts) - Proper TF-IDF vectorization and Naive Bayes classifier implementation
  • Statistics and Analysis (40 pts) - Correct category statistics, word frequency analysis, and report generation
  • Code Quality (40 pts) - Docstrings, comments, naming conventions, and clean organization
  • Total: 250 points

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

What You Will Practice

Text Preprocessing (4.1)

Building a complete preprocessing pipeline with lowercasing, punctuation removal, tokenization, and stop word filtering

Sentiment Analysis (4.1)

Using TextBlob for polarity and subjectivity analysis, converting scores to human-readable sentiment labels

Named Entity Recognition (4.1)

Extracting people, organizations, locations, and dates from text using SpaCy's pre-trained NER models

Text Classification (4.1)

Building TF-IDF vectors and training Naive Bayes classifiers for automatic text categorization

08

Pro Tips

NLP Best Practices
  • Always preprocess text before analysis
  • Use original text for sentiment (not preprocessed)
  • Load SpaCy model once, reuse for all texts
  • Handle empty strings and edge cases
Library Setup
  • Install: pip install nltk spacy textblob scikit-learn pandas
  • Download NLTK data in your notebook
  • Download SpaCy model: en_core_web_sm
  • Test imports before writing functions
Time Management
  • Start with loading and preprocessing
  • Test each function before moving on
  • Build sentiment analysis second
  • Save NER and classification for last
Common Mistakes
  • Forgetting to download NLTK resources
  • Using preprocessed text for sentiment
  • Not handling missing/empty reviews
  • Loading SpaCy model inside loops
09

Pre-Submission Checklist

Code Requirements
  • All 11 functions implemented, with docstrings and comments
  • Notebook runs top to bottom without errors
  • analysis_report.txt generated by running your notebook
Repository Requirements
  • Public repository named shopsmart-nlp-pipeline
  • Contains nlp_pipeline.ipynb, reviews.json, analysis_report.txt, and README.md
  • README.md includes all required sections