Module 4.1

Natural Language Processing

Unlock the power of human language for AI! Learn how machines understand, interpret, and generate text through preprocessing, tokenization, and representation techniques.

50 min read
Beginner
Hands-on Examples
What You'll Learn
  • Understanding NLP and its real-world applications
  • Text preprocessing techniques (cleaning, normalization)
  • Tokenization methods (word, sentence, subword)
  • Text representation (Bag of Words, TF-IDF)
  • Common NLP tasks and Python libraries
Contents
01

Introduction to NLP

Natural Language Processing (NLP) is a fascinating field at the intersection of artificial intelligence, computer science, and linguistics. It enables machines to read, understand, and derive meaning from human language, powering everything from virtual assistants to translation services.

Welcome, Beginner! Here's What You'll Learn

Don't worry if you're new to programming or AI! This lesson is designed specifically for beginners. By the end, you'll understand how computers can "read" and "understand" human language - and you'll write your first NLP code!

No Prerequisites Needed

Basic Python knowledge is helpful but not required

Hands-On Examples

Every concept includes runnable code you can try

Real-World Applications

See how NLP powers apps you use daily

Practice Questions

Test your understanding after each section

What is Natural Language Processing?

Imagine talking to your phone and it actually understands you. Or typing a question into Google and getting exactly the answer you need. Or having an app automatically translate a foreign menu into your language. That's NLP in action!

Every day, humans generate massive amounts of text data - emails, social media posts, reviews, articles, and conversations. NLP is the technology that helps computers make sense of this unstructured data. Unlike structured data in databases with clear rows and columns, text is messy, ambiguous, and filled with nuances that humans understand intuitively but machines struggle to grasp.

Think of it This Way: When you read "I saw her duck," you instantly know it could mean two things - either you saw her pet duck, or you saw her ducking down. Your brain uses context to figure it out. NLP teaches computers to do the same thing - understand the meaning behind words, not just the words themselves.
Key Concept

Natural Language Processing (NLP)

Natural Language Processing is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. It combines computational linguistics (rule-based modeling of human language) with statistical, machine learning, and deep learning models.

Why it matters: NLP bridges the gap between human communication and computer understanding. Without NLP, computers would only understand structured commands, not natural human speech or writing.

Why Does NLP Matter?

Consider this: over 80% of business data is unstructured, primarily in the form of text. Emails, customer reviews, support tickets, social media posts, legal documents, and medical records all contain valuable insights locked in natural language. NLP is the key that unlocks this treasure trove of information.

Virtual Assistants

Siri, Alexa, and Google Assistant use NLP to understand your voice commands and respond naturally, making technology accessible to everyone.

Machine Translation

Google Translate and DeepL break language barriers, translating text between 100+ languages in real-time using advanced NLP models.

Search Engines

Google understands your queries, not just keywords. NLP helps interpret intent behind searches like "restaurants near me" or "how to fix a leaky faucet."

The NLP Pipeline

Processing natural language involves a series of steps, often called a pipeline. Each step transforms the raw text into something more useful for analysis. Understanding this pipeline is crucial because the quality of each step affects all downstream tasks. Think of it as an assembly line where each station adds value.

# A typical NLP pipeline in Python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Sample text
text = "Natural Language Processing is amazing! It helps computers understand human language."

# Step 1: Lowercase
text_lower = text.lower()
print(f"Lowercased: {text_lower}")

# Step 2: Tokenization
tokens = word_tokenize(text_lower)
print(f"Tokens: {tokens}")

# Step 3: Remove punctuation and stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens if t.isalnum() and t not in stop_words]
print(f"Filtered: {filtered_tokens}")

Output:

Lowercased: natural language processing is amazing! it helps computers understand human language.
Tokens: ['natural', 'language', 'processing', 'is', 'amazing', '!', 'it', 'helps', 'computers', 'understand', 'human', 'language', '.']
Filtered: ['natural', 'language', 'processing', 'amazing', 'helps', 'computers', 'understand', 'human', 'language']

Notice how the pipeline progressively cleans the text. We start with a natural sentence, convert it to lowercase for consistency, break it into individual tokens, and then remove noise like punctuation and common words that do not add meaning. The result is a clean list of meaningful words ready for analysis.

Installing NLP Libraries

Python offers several excellent libraries for NLP. NLTK (Natural Language Toolkit) is perfect for learning and experimentation, while SpaCy is optimized for production use. Let us set up your environment with both libraries so you can follow along with the examples in this lesson.

# Install NLTK (Natural Language Toolkit)
pip install nltk

# Install SpaCy (Industrial-strength NLP)
pip install spacy

# Download SpaCy's English model
python -m spacy download en_core_web_sm

# Download NLTK data (run in Python)
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
Pro Tip: NLTK is great for learning NLP concepts because it is well-documented and has many educational resources. SpaCy is faster and better for production applications. Many data scientists use both - NLTK for prototyping and SpaCy for deployment.

Challenges in NLP

Human language is incredibly complex. Words can have multiple meanings, sentences can be structured in countless ways, and context changes everything. These challenges make NLP one of the most difficult areas in AI, but also one of the most rewarding when solved.

Challenge | Example                               | Why It Is Hard
----------|---------------------------------------|-----------------------------------------------
Ambiguity | "I saw her duck"                      | Did she duck down, or did I see her pet duck?
Sarcasm   | "Oh great, another meeting"           | Words are positive but meaning is negative
Context   | "It is cold" vs "The case went cold"  | Same word, completely different meanings
Slang     | "That movie was lit"                  | Informal language evolves constantly
Negation  | "I do not dislike it"                 | Double negatives require logical reasoning

Practice Questions: Introduction to NLP

Test your understanding with these coding exercises.

Task: Import NLTK and print its version to verify the installation is working correctly.

Solution:
import nltk
print(f"NLTK Version: {nltk.__version__}")
# Output: NLTK Version: 3.8.1

Task: Load SpaCy's small English model and process the sentence "NLP is transforming how we interact with technology."

Solution:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Process text
doc = nlp("NLP is transforming how we interact with technology.")

# Print each token
for token in doc:
    print(f"{token.text:15} | POS: {token.pos_:8} | Lemma: {token.lemma_}")
    
# Output:
# NLP             | POS: PROPN   | Lemma: NLP
# is              | POS: AUX     | Lemma: be
# transforming    | POS: VERB    | Lemma: transform
# ...

Task: Given the text below, tokenize it and count the frequency of each word (ignore case and punctuation).

text = "NLP helps machines understand language. Language understanding is key to AI. AI and NLP work together."
Solution:
from collections import Counter
from nltk.tokenize import word_tokenize

text = "NLP helps machines understand language. Language understanding is key to AI. AI and NLP work together."

# Tokenize and lowercase
tokens = word_tokenize(text.lower())

# Filter only alphabetic tokens
words = [t for t in tokens if t.isalpha()]

# Count frequencies
freq = Counter(words)

print("Word Frequencies:")
for word, count in freq.most_common():
    print(f"  {word}: {count}")
    
# Output:
# Word Frequencies:
#   nlp: 2
#   ai: 2
#   language: 2
#   understanding: 2
#   ...
02

Text Preprocessing

Before feeding text data to any NLP model, we must clean and normalize it. Text preprocessing transforms raw, messy text into a clean, consistent format that algorithms can effectively process. This step is crucial because real-world text contains noise like HTML tags, special characters, and inconsistent formatting.

Why Is Preprocessing Important?

Raw text data is messy. A single word might appear as "Hello", "HELLO", "hello!", or "hello..." - and to a computer, these are all different strings. Preprocessing ensures consistency so that the model can focus on meaning rather than superficial differences. Think of it as preparing ingredients before cooking - you would not throw unwashed, unpeeled vegetables into a pot.

Key Concept

Text Preprocessing

Text preprocessing is the process of cleaning and transforming raw text into a format suitable for analysis. It typically includes lowercasing, removing punctuation, handling special characters, and normalizing whitespace.

Why it matters: Models trained on preprocessed text perform better because they learn patterns from clean, consistent data rather than getting confused by formatting variations.

Step 1: Lowercasing

The simplest but often most impactful preprocessing step is converting all text to lowercase. This ensures that "Python", "PYTHON", and "python" are treated as the same word. Without lowercasing, your vocabulary would be unnecessarily large, and the model might think these are different concepts.

# Lowercasing - the simplest preprocessing step
text = "Python is great. PYTHON is fast. python is popular!"

# Convert to lowercase
text_lower = text.lower()
print(f"Original: {text}")
print(f"Lowercased: {text_lower}")

# Why it matters - vocabulary comparison
words_original = set(text.split())
words_lower = set(text_lower.split())
print(f"\nOriginal vocabulary size: {len(words_original)}")
print(f"Lowercased vocabulary size: {len(words_lower)}")

Output:

Original: Python is great. PYTHON is fast. python is popular!
Lowercased: python is great. python is fast. python is popular!

Original vocabulary size: 7
Lowercased vocabulary size: 5

Notice how "Python", "PYTHON", and "python" collapse into a single vocabulary entry after lowercasing, shrinking the vocabulary from 7 to 5.
Caution: Lowercasing is not always appropriate. For named entity recognition (NER) or when case carries meaning (like "US" vs "us"), you may want to preserve the original case or handle it differently.
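A one-line check makes the pitfall concrete (plain Python, no libraries needed):

```python
# The country "US" and the pronoun "us" become indistinguishable
# once text is naively lowercased
print("US".lower() == "us")  # True
```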

Step 2: Removing Punctuation

Punctuation marks like periods, commas, and exclamation points are important for human readability but often add noise for NLP models. Removing them simplifies the text and reduces vocabulary size. However, some tasks (like sentiment analysis) might benefit from keeping certain punctuation like exclamation marks.

import string

text = "Hello, World! How are you doing today? I'm great!!!"

# Method 1: Using string.punctuation
no_punct = text.translate(str.maketrans('', '', string.punctuation))
print(f"Original: {text}")
print(f"No punctuation: {no_punct}")

# Method 2: Using regex (more control)
import re
no_punct_regex = re.sub(r'[^\w\s]', '', text)
print(f"Using regex: {no_punct_regex}")

# See what punctuation we removed
print(f"\nPunctuation characters: {string.punctuation}")

Output:

Original: Hello, World! How are you doing today? I'm great!!!
No punctuation: Hello World How are you doing today Im great
Using regex: Hello World How are you doing today Im great

Punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Key Observation: Both methods effectively remove punctuation, but notice the side effect: "I'm" becomes "Im". This happens because the apostrophe is classified as punctuation. For production systems, consider handling contractions first (expanding "I'm" to "I am") before removing punctuation. The optimal approach depends on your specific NLP task requirements.
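One simple way to handle this, sketched here with a small hand-made mapping (a real project would use a much fuller list or a dedicated contraction-expansion package), is to expand contractions before stripping punctuation:

```python
import re

# Illustrative mapping only - production systems need a much fuller list
CONTRACTIONS = {
    "i'm": "i am",
    "it's": "it is",
    "can't": "cannot",
    "don't": "do not",
    "won't": "will not",
}

def expand_contractions(text):
    """Replace known contractions so the apostrophe never gets mangled."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

text = "I'm sure it's great!"
print(expand_contractions(text.lower()))  # i am sure it is great!
```

After expansion, removing punctuation no longer produces mangled tokens like "Im".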

Step 3: Handling Special Characters and Numbers

Real-world text often contains special characters, HTML tags, URLs, email addresses, and numbers. Depending on your task, you may want to remove these entirely, replace them with placeholders, or handle them specially. Regular expressions are your best friend for this kind of pattern matching.

import re

text = """Check out https://example.com for more info!
Contact us at support@email.com or call 123-456-7890.
Price: $99.99 (50% off!)"""

# Remove URLs
no_urls = re.sub(r'https?://\S+|www\.\S+', '[URL]', text)
print("URLs replaced:")
print(no_urls)

# Remove emails
no_emails = re.sub(r'\S+@\S+', '[EMAIL]', no_urls)
print("\nEmails replaced:")
print(no_emails)

# Remove numbers (but keep words)
no_numbers = re.sub(r'\d+', '[NUM]', no_emails)
print("\nNumbers replaced:")
print(no_numbers)

Output:

URLs replaced:
Check out [URL] for more info!
Contact us at support@email.com or call 123-456-7890.
Price: $99.99 (50% off!)

Emails replaced:
Check out [URL] for more info!
Contact us at [EMAIL] or call 123-456-7890.
Price: $99.99 (50% off!)

Numbers replaced:
Check out [URL] for more info!
Contact us at [EMAIL] or call [NUM]-[NUM]-[NUM].
Price: $[NUM].[NUM] ([NUM]% off!)
Pro Tip: Instead of removing special elements entirely, consider replacing them with placeholder tokens like [URL], [EMAIL], or [NUM]. This preserves the information that something was there while normalizing the format.

Step 4: Removing Stop Words

Stop words are common words like "the", "is", "at", and "which" that appear frequently but carry little semantic meaning. Removing them reduces noise and helps models focus on the important content words. Most NLP libraries come with predefined stop word lists that you can customize.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Get English stop words
stop_words = set(stopwords.words('english'))
print(f"Number of stop words: {len(stop_words)}")
print(f"Sample stop words: {list(stop_words)[:10]}")

# Remove stop words from text
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())
filtered = [word for word in tokens if word not in stop_words]

print(f"\nOriginal: {text}")
print(f"Tokens: {tokens}")
print(f"After removing stop words: {filtered}")

Output:

Number of stop words: 179
Sample stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Original: The quick brown fox jumps over the lazy dog
Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
After removing stop words: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

Complete Preprocessing Pipeline

Let us combine all these steps into a reusable preprocessing function. This is a common pattern in NLP projects - you create a pipeline that can be applied consistently to all your text data.

import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    """Complete text preprocessing pipeline."""
    # 1. Lowercase
    text = text.lower()
    
    # 2. Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # 3. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # 4. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # 5. Remove extra whitespace
    text = ' '.join(text.split())
    
    # 6. Tokenize
    tokens = word_tokenize(text)
    
    # 7. Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    
    return tokens

# Test the pipeline
sample = "Check out https://nlp.com! NLP is AMAZING - it helps computers understand us."
result = preprocess_text(sample)
print(f"Original: {sample}")
print(f"Processed: {result}")

Output:

Original: Check out https://nlp.com! NLP is AMAZING - it helps computers understand us.
Processed: ['check', 'nlp', 'amazing', 'helps', 'computers', 'understand', 'us']

Practice Questions: Text Preprocessing

Practice your preprocessing skills.

Task: Write code to remove all digits from "The year 2024 has 365 days".

Solution:
import re

text = "The year 2024 has 365 days"

# Method 1: Using regex
no_digits = re.sub(r'\d+', '', text)
print(no_digits)  # The year  has  days

# Method 2: Using string methods
no_digits = ''.join(c for c in text if not c.isdigit())
print(no_digits)  # The year  has  days

Task: Extract all email addresses from the following text.

text = "Contact john@example.com or support@company.org for help."
Solution:
import re

text = "Contact john@example.com or support@company.org for help."

# Email regex pattern
emails = re.findall(r'\S+@\S+', text)
print(f"Found emails: {emails}")
# Output: Found emails: ['john@example.com', 'support@company.org']

Task: Clean this noisy review by removing HTML, converting to lowercase, removing punctuation, and removing stop words.

review = "<p>This product is AMAZING!!! I bought it for $29.99... Best purchase EVER!</p>"
Solution:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

review = "<p>This product is AMAZING!!! I bought it for $29.99... Best purchase EVER!</p>"

# Step 1: Remove HTML tags
clean = re.sub(r'<.*?>', '', review)

# Step 2: Lowercase
clean = clean.lower()

# Step 3: Remove punctuation
clean = clean.translate(str.maketrans('', '', string.punctuation))

# Step 4: Tokenize
tokens = word_tokenize(clean)

# Step 5: Remove stop words
stop_words = set(stopwords.words('english'))
final = [t for t in tokens if t not in stop_words and t.isalpha()]

print(f"Original: {review}")
print(f"Cleaned: {final}")
# Output: Cleaned: ['product', 'amazing', 'bought', 'best', 'purchase', 'ever']
03

Tokenization

Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, sentences, or even subwords. This fundamental step converts continuous text into discrete elements that machines can process, serving as the foundation for all downstream NLP tasks.

What is Tokenization?

Imagine reading a book without any spaces between words - it would be nearly impossible to understand. Tokenization is how we give computers that same ability to identify word boundaries. While it sounds simple, tokenization is surprisingly complex because languages have different rules and edge cases.

Key Concept

Tokenization

Tokenization is the process of splitting text into individual units called tokens. These tokens can be words, sentences, characters, or subword units depending on the tokenization strategy.

Why it matters: Tokenization is the first step in converting human-readable text into a format that machine learning models can process. The quality of tokenization directly affects model performance.

Word Tokenization

Word tokenization splits text into individual words. While you might think splitting on spaces is enough, real text has contractions ("don't" = "do" + "n't"), hyphenated words ("state-of-the-art"), and punctuation attached to words. Good tokenizers handle these edge cases intelligently.

from nltk.tokenize import word_tokenize

text = "I can't believe it's already 2024! State-of-the-art NLP is amazing."

# Simple split (naive approach)
simple_tokens = text.split()
print(f"Simple split: {simple_tokens}")

# NLTK word_tokenize (handles punctuation and contractions)
nltk_tokens = word_tokenize(text)
print(f"NLTK tokens: {nltk_tokens}")

# Notice how contractions are handled
cant_tokens = word_tokenize("can't")
its_tokens = word_tokenize("it's")
print(f"\nContraction 'can't' becomes: {cant_tokens}")
print(f"Contraction \"it's\" becomes: {its_tokens}")

Output:

Simple split: ["I", "can't", "believe", "it's", "already", "2024!", "State-of-the-art", "NLP", "is", "amazing."]
NLTK tokens: ['I', 'ca', "n't", 'believe', 'it', "'s", 'already', '2024', '!', 'State-of-the-art', 'NLP', 'is', 'amazing', '.']

Contraction 'can't' becomes: ['ca', "n't"]
Contraction "it's" becomes: ['it', "'s"]

Notice how NLTK intelligently separates punctuation and handles contractions. The word "can't" becomes ["ca", "n't"] because linguistically, "can't" is "can" + "not". This detailed tokenization helps models understand the underlying meaning better.

Sentence Tokenization

Sentence tokenization splits text into individual sentences. This is trickier than it sounds because periods appear in abbreviations (Dr., U.S.A.), decimal numbers (3.14), and URLs. Good sentence tokenizers use context to determine actual sentence boundaries.

from nltk.tokenize import sent_tokenize

text = """Dr. Smith earned $3.5 million in 2023. That's impressive! 
The U.S.A. leads in AI research. Visit https://ai.stanford.edu for more info."""

# Sentence tokenization
sentences = sent_tokenize(text)

print("Sentences found:")
for i, sent in enumerate(sentences, 1):
    print(f"  {i}. {sent.strip()}")

print(f"\nTotal sentences: {len(sentences)}")

Output:

Sentences found:
  1. Dr. Smith earned $3.5 million in 2023.
  2. That's impressive!
  3. The U.S.A. leads in AI research.
  4. Visit https://ai.stanford.edu for more info.

Total sentences: 4
Pro Tip: NLTK's sentence tokenizer is trained on English text and knows common abbreviations. For other languages or domain-specific text (like legal or medical documents), you may need to train a custom tokenizer or use a different library.

Tokenization with SpaCy

SpaCy provides industrial-strength tokenization with additional linguistic information. When you process text with SpaCy, each token comes with its part-of-speech tag, lemma (base form), and other annotations. This rich information is incredibly useful for downstream tasks.

import spacy

# Load English model
nlp = spacy.load('en_core_web_sm')

text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

# Print tokens with details
print("Token Analysis:")
print("-" * 60)
for token in doc:
    print(f"{token.text:12} | POS: {token.pos_:6} | Lemma: {token.lemma_:10} | Stop: {token.is_stop}")

# Get just the tokens as a list
tokens = [token.text for token in doc]
print(f"\nTokens: {tokens}")

Output:

Token Analysis:
------------------------------------------------------------
Apple        | POS: PROPN  | Lemma: Apple      | Stop: False
is           | POS: AUX    | Lemma: be         | Stop: True
looking      | POS: VERB   | Lemma: look       | Stop: False
at           | POS: ADP    | Lemma: at         | Stop: True
buying       | POS: VERB   | Lemma: buy        | Stop: False
U.K.         | POS: PROPN  | Lemma: U.K.       | Stop: False
startup      | POS: NOUN   | Lemma: startup    | Stop: False
for          | POS: ADP    | Lemma: for        | Stop: True
$            | POS: SYM    | Lemma: $          | Stop: False
1            | POS: NUM    | Lemma: 1          | Stop: False
billion      | POS: NUM    | Lemma: billion    | Stop: False

Tokens: ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion']
Understanding the Output
POS Tags

Part-of-Speech identifies word types: PROPN (proper noun), VERB, ADP (preposition)

Lemmas

Base forms: "looking" → "look", "is" → "be" for normalization

Stop Words

Flags common words (is, at, for) for easy filtering

Pro Insight: SpaCy correctly handles "U.K." as a single token despite periods - this intelligent tokenization is why SpaCy excels in production!

Subword Tokenization

Modern NLP models like BERT and GPT use subword tokenization, which breaks words into smaller meaningful units. This handles unknown words elegantly - even if a word was not in the training data, its subwords probably were. For example, "unhappiness" might become ["un", "happiness"].

# Using Hugging Face tokenizers
from transformers import AutoTokenizer

# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Tokenization is fundamental for NLP preprocessing."

# Tokenize
tokens = tokenizer.tokenize(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")

# Convert to IDs
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")

# Handling unknown words
rare_word = "supercalifragilistic"
rare_tokens = tokenizer.tokenize(rare_word)
print(f"\nRare word '{rare_word}' becomes: {rare_tokens}")

Output:

Text: Tokenization is fundamental for NLP preprocessing.
Tokens: ['token', '##ization', 'is', 'fundamental', 'for', 'nl', '##p', 'prep', '##ro', '##ces', '##sing', '.']
Token IDs: [101, 19204, 3989, 2003, 8050, 2005, 17953, 2361, 17531, 9541, 9623, 2075, 1012, 102]

Rare word 'supercalifragilistic' becomes: ['super', '##cal', '##if', '##rag', '##ili', '##stic']

The "##" prefix indicates that a token is a continuation of the previous token (not a new word). This subword approach allows models to handle any word, even ones never seen during training, by breaking them into familiar pieces.

Tokenization Methods Comparison

Method    | Best For                            | Pros                      | Cons
----------|-------------------------------------|---------------------------|-------------------------------
Word      | Traditional NLP, BoW, TF-IDF        | Simple, intuitive         | Large vocabulary, OOV words
Sentence  | Summarization, translation          | Preserves structure       | Ambiguous boundaries
Character | Spelling correction, some languages | Tiny vocabulary           | Loses word meaning
Subword   | Modern transformers (BERT, GPT)     | Handles any word, compact | Requires pretrained tokenizer
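Character tokenization is the only method in the table not demonstrated earlier; in Python it is just a list conversion:

```python
text = "Tokenization"

# Character tokenization: the vocabulary is tiny (letters, digits, symbols),
# but individual tokens carry almost no meaning on their own
char_tokens = list(text.lower())
print(char_tokens)   # ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
print(f"{len(char_tokens)} tokens from 1 word")  # 12 tokens from 1 word
```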

Practice Questions: Tokenization

Practice your tokenization skills.

Task: Tokenize and count the number of words in "Machine learning is a subset of artificial intelligence".

Solution:
from nltk.tokenize import word_tokenize

text = "Machine learning is a subset of artificial intelligence"
tokens = word_tokenize(text)

print(f"Tokens: {tokens}")
print(f"Word count: {len(tokens)}")
# Output: Word count: 8

Task: Split the following paragraph into sentences and print each one with its word count.

paragraph = "NLP is fascinating. It powers virtual assistants. Machine translation is another application."
Solution:
from nltk.tokenize import sent_tokenize, word_tokenize

paragraph = "NLP is fascinating. It powers virtual assistants. Machine translation is another application."

sentences = sent_tokenize(paragraph)
for i, sent in enumerate(sentences, 1):
    word_count = len(word_tokenize(sent))
    print(f"Sentence {i} ({word_count} words): {sent}")
    
# Output:
# Sentence 1 (4 words): NLP is fascinating.
# Sentence 2 (4 words): It powers virtual assistants.
# Sentence 3 (5 words): Machine translation is another application.

Task: Compare NLTK word tokenization with BERT subword tokenization for the sentence "Transformers revolutionized NLP".

Solution:
from nltk.tokenize import word_tokenize
from transformers import AutoTokenizer

text = "Transformers revolutionized NLP"

# Word tokenization
word_tokens = word_tokenize(text)
print(f"Word tokens ({len(word_tokens)}): {word_tokens}")

# Subword tokenization (BERT)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
subword_tokens = tokenizer.tokenize(text)
print(f"Subword tokens ({len(subword_tokens)}): {subword_tokens}")

# Output:
# Word tokens (3): ['Transformers', 'revolutionized', 'NLP']
# Subword tokens (5): ['transformers', 'revolution', '##ized', 'nl', '##p']
04

Text Representation

Machine learning models cannot directly process text - they need numbers. Text representation is the art of converting words and documents into numerical vectors that capture meaning. The quality of your text representation directly determines how well your model can understand and process language.

Why Do We Need Text Representation?

Computers understand numbers, not words. When you feed text to a machine learning algorithm, it must be converted into a numerical format. The challenge is doing this conversion in a way that preserves the meaning and relationships between words. Different representation methods capture different aspects of text.

Key Concept

Text Vectorization

Text vectorization is the process of converting text into numerical vectors. Each document becomes a point in high-dimensional space, where similar documents are closer together and dissimilar documents are farther apart.

The goal: Create numerical representations where semantically similar texts have similar vector representations, enabling mathematical operations on language.
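The geometric intuition - similar texts end up as nearby vectors - can be shown with toy word-count vectors and a hand-rolled cosine similarity (pure Python; the scikit-learn examples later in this section do the same at scale):

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy word-count vectors over the vocabulary ["machine", "learning", "cooking"]
doc1 = [2, 1, 0]   # about machine learning
doc2 = [1, 2, 0]   # similar topic
doc3 = [0, 0, 3]   # about cooking

print(f"doc1 vs doc2: {cosine(doc1, doc2):.3f}")  # 0.800 - close in vector space
print(f"doc1 vs doc3: {cosine(doc1, doc3):.3f}")  # 0.000 - no shared words
```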

Bag of Words (BoW)

The Bag of Words model is the simplest text representation. It counts how many times each word appears in a document, ignoring grammar and word order. Think of it as dumping all words into a "bag" and just counting them. Despite its simplicity, BoW works surprisingly well for many classification tasks.

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "Machine learning is amazing",
    "Deep learning is a subset of machine learning",
    "NLP uses machine learning techniques"
]

# Create CountVectorizer (Bag of Words)
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", list(vocab))

# Convert to array and display
import pandas as pd
df = pd.DataFrame(bow_matrix.toarray(), columns=vocab)
print("\nBag of Words Matrix:")
print(df)

Output:

Vocabulary: ['amazing', 'deep', 'is', 'learning', 'machine', 'nlp', 'of', 'subset', 'techniques', 'uses']

Bag of Words Matrix:
   amazing  deep  is  learning  machine  nlp  of  subset  techniques  uses
0        1     0   1         1        1    0   0       0           0     0
1        0     1   1         2        1    0   1       1           0     0
2        0     0   0         1        1    1   0       0           1     1

Each row represents a document, and each column represents a unique word. The values show word counts. Notice how "learning" appears twice in Document 2 (index 1), so it has a value of 2 in that row.

Limitation: BoW loses word order information. "The cat chased the dog" and "The dog chased the cat" would have identical BoW representations, even though they mean very different things!
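This limitation is easy to verify with plain Python - counting words with collections.Counter is essentially a miniature Bag of Words:

```python
from collections import Counter

# Opposite meanings, identical bags of words
a = "the cat chased the dog".split()
b = "the dog chased the cat".split()

print(Counter(a) == Counter(b))  # True - BoW cannot tell them apart
```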

TF-IDF: Term Frequency - Inverse Document Frequency

TF-IDF improves on BoW by weighing words based on their importance. Words that appear frequently in one document but rarely in others get higher weights. Common words like "the" and "is" get lower weights because they appear everywhere and carry little meaning.

Term Frequency (TF)

Measures how often a word appears in a document relative to the total number of words. Words that appear more frequently in a specific document are considered more important to that document's meaning. A word appearing 5 times in a 100-word document has TF = 0.05.

TF = count(word) / total_words
Example: If "machine" appears 3 times in a 50-word document, TF = 3/50 = 0.06

Inverse Document Frequency (IDF)

Measures how rare or unique a word is across all documents in the corpus. Rare words that appear in only a few documents get higher IDF scores, while common words like "the" or "is" that appear everywhere get lower scores, reducing their influence.

IDF = log(N / df), where N is the total number of documents and df is the number of documents containing the word
Example: If "neural" appears in 2 of 1000 documents, IDF = log10(1000/2) ≈ 2.7 (using a base-10 logarithm; scikit-learn uses a smoothed natural logarithm, so its values differ slightly)
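Before handing the work to scikit-learn, here is the arithmetic done by hand on the three example documents (base-10 log, as in the worked example above; scikit-learn's TfidfVectorizer uses a smoothed natural log plus normalization, so its numbers differ slightly):

```python
import math

docs = [
    "machine learning is amazing".split(),
    "deep learning is a subset of machine learning".split(),
    "nlp uses machine learning techniques".split(),
]

def tf(word, doc):
    # Term frequency: occurrences / total words in the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency: log10(N / number of docs containing word)
    df = sum(1 for d in docs if word in d)
    return math.log10(len(docs) / df)

# "learning" appears twice in the 8-word second document...
print(f"TF('learning', doc 2) = {tf('learning', docs[1]):.3f}")  # 0.250
# ...but it occurs in every document, so its IDF (and TF-IDF) is zero
print(f"IDF('learning') = {idf('learning', docs):.3f}")          # 0.000
# "deep" occurs in only one of three documents, so it gets a positive weight
print(f"IDF('deep') = {idf('deep', docs):.3f}")                  # 0.477
```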
from sklearn.feature_extraction.text import TfidfVectorizer

# Same documents as before
documents = [
    "Machine learning is amazing",
    "Deep learning is a subset of machine learning",
    "NLP uses machine learning techniques"
]

# Create TF-IDF vectorizer
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

# Get feature names
vocab = tfidf.get_feature_names_out()

# Display as DataFrame
import pandas as pd
df = pd.DataFrame(tfidf_matrix.toarray().round(3), columns=vocab)
print("TF-IDF Matrix:")
print(df)

Output:

TF-IDF Matrix:
   amazing  deep     is  learning  machine  nlp     of  subset  techniques  uses
0    0.631   0.0  0.480     0.373    0.373  0.0  0.000   0.000       0.000   0.0
1    0.000   0.4  0.305     0.474    0.237  0.0  0.400   0.400       0.000   0.0
2    0.000   0.0  0.000     0.330    0.330  0.5  0.000   0.000       0.500   0.5

Compare this to BoW: "amazing" has a high score (0.631) in Document 1 because it only appears there. Meanwhile, "machine" has lower scores across documents because it appears in all three - it is less distinctive. TF-IDF naturally identifies the most important words for each document.

Practical Example: Document Similarity

Once we have vectors, we can measure how similar documents are using cosine similarity. This is the foundation of search engines, recommendation systems, and document clustering.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Query and documents
query = "machine learning applications"
documents = [
    "Machine learning is used in many applications",
    "Deep learning is a type of machine learning",
    "Cooking recipes for beginners",
    "NLP is an application of machine learning"
]

# Vectorize query and documents together
tfidf = TfidfVectorizer()
all_texts = [query] + documents
tfidf_matrix = tfidf.fit_transform(all_texts)

# Calculate similarity between query and each document
query_vector = tfidf_matrix[0]
doc_vectors = tfidf_matrix[1:]

similarities = cosine_similarity(query_vector, doc_vectors)[0]

# Rank documents by similarity
print("Document Similarity to Query:")
print("-" * 50)
for i, (doc, score) in enumerate(zip(documents, similarities)):
    print(f"Score: {score:.3f} | Doc {i+1}: {doc[:40]}...")

Output:

Document Similarity to Query:
--------------------------------------------------
Score: 0.638 | Doc 1: Machine learning is used in many applic...
Score: 0.256 | Doc 2: Deep learning is a type of machine lear...
Score: 0.000 | Doc 3: Cooking recipes for beginners...
Score: 0.391 | Doc 4: NLP is an application of machine learni...

Document 1 is most similar because it shares "machine", "learning", and "applications" with the query. Document 3 (cooking recipes) has zero similarity - it shares no vocabulary with our query. This is exactly how search engines find relevant results!
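Under the hood, cosine similarity is just the dot product of two vectors divided by the product of their lengths. Here is a minimal pure-Python version (a teaching sketch, not a replacement for scikit-learn's optimized `cosine_similarity`):

```python
import math

def cosine_sim(a, b):
    # dot product over the product of vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 5.0]   # shares no dimensions with a

print(round(cosine_sim(a, b), 3))  # 1.0 - maximally similar
print(round(cosine_sim(a, c), 3))  # 0.0 - no shared vocabulary
```

This is why Document 3 (cooking recipes) scored exactly zero: its TF-IDF vector has no non-zero dimensions in common with the query's.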

Text Representation Methods Comparison

Method          | How It Works            | Pros                          | Cons
Bag of Words    | Counts word occurrences | Simple, fast, interpretable   | No semantics, sparse vectors
TF-IDF          | Weighs by importance    | Better than BoW for search    | Still no semantic understanding
Word2Vec        | Learns word embeddings  | Captures semantic similarity  | Requires large training data
BERT Embeddings | Contextual embeddings   | State-of-the-art quality      | Computationally expensive

Practice Questions: Text Representation

Build your text vectorization skills.

Task: Use CountVectorizer to create BoW vectors for "I love NLP" and "NLP is fun".

Show Solution
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP", "NLP is fun"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW vectors:\n", bow.toarray())
# Vocabulary: ['fun' 'is' 'love' 'nlp'] (the single-letter "I" is dropped by the default tokenizer)
# Output: [[0 0 1 1], [1 1 0 1]]

Task: Given three documents, find the most distinctive word in each using TF-IDF scores.

docs = ["Python is great for data science", 
        "Java is popular for enterprise", 
        "JavaScript runs in browsers"]
Show Solution
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["Python is great for data science", 
        "Java is popular for enterprise", 
        "JavaScript runs in browsers"]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)
vocab = tfidf.get_feature_names_out()

for i, doc in enumerate(docs):
    scores = matrix[i].toarray()[0]
    top_idx = np.argmax(scores)
    print(f"Doc {i+1}: Most distinctive word = '{vocab[top_idx]}' (score: {scores[top_idx]:.3f})")

Task: Create a function that takes a query and returns the most similar document from a corpus.

Show Solution
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search_documents(query, corpus):
    """Find most similar document to query."""
    tfidf = TfidfVectorizer()
    
    # Fit on corpus, transform both corpus and query
    corpus_vectors = tfidf.fit_transform(corpus)
    query_vector = tfidf.transform([query])
    
    # Calculate similarities
    similarities = cosine_similarity(query_vector, corpus_vectors)[0]
    
    # Find best match
    best_idx = similarities.argmax()
    return corpus[best_idx], similarities[best_idx]

# Test it
corpus = ["Learn Python programming", 
          "Data science with Python", 
          "Web development basics"]

query = "Python data analysis"
result, score = search_documents(query, corpus)
print(f"Best match: '{result}' (similarity: {score:.3f})")
05

Common NLP Tasks

NLP encompasses a wide range of tasks, from classifying sentiment in reviews to extracting named entities from documents. Understanding these tasks helps you identify which techniques to apply for your specific use case, whether it is building a chatbot, analyzing customer feedback, or automating document processing.

The NLP Task Landscape

NLP tasks can be grouped into categories based on what they accomplish. Some tasks analyze text to extract information (like sentiment or entities), while others generate new text (like translation or summarization). Let us explore the most common tasks you will encounter as a beginner.

Sentiment Analysis

Determines the emotional tone of text - positive, negative, or neutral. Widely used for analyzing product reviews, social media posts, and customer feedback to understand public opinion and brand perception.

Use Cases: Brand monitoring, customer feedback analysis, market research

Named Entity Recognition

Identifies and classifies named entities like people, organizations, locations, dates, and monetary values in text. Essential for extracting structured information from unstructured documents.

Use Cases: Information extraction, knowledge graphs, document indexing

Text Classification

Categorizes documents into predefined classes based on their content. Powers spam detection, topic labeling, intent recognition in chatbots, and content moderation systems.

Use Cases: Spam filtering, news categorization, support ticket routing

Sentiment Analysis

Sentiment analysis determines the emotional tone behind text. Is this review positive or negative? Is this tweet expressing happiness or frustration? Companies use sentiment analysis to monitor brand perception, analyze customer feedback, and track public opinion on social media.

from textblob import TextBlob

# Sample reviews to analyze
reviews = [
    "This product is absolutely amazing! Best purchase ever.",
    "Terrible quality. Complete waste of money.",
    "It's okay, nothing special but it works.",
    "I love this! Exceeded all my expectations!"
]

print("Sentiment Analysis Results:")
print("-" * 60)
for review in reviews:
    blob = TextBlob(review)
    polarity = blob.sentiment.polarity  # -1 (negative) to 1 (positive)
    
    if polarity > 0.1:
        sentiment = "Positive"
    elif polarity < -0.1:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    
    print(f"{sentiment:8} (score: {polarity:+.2f}) | {review[:40]}...")

Output:

Sentiment Analysis Results:
------------------------------------------------------------
Positive (score: +0.62) | This product is absolutely amazing! Bes...
Negative (score: -0.65) | Terrible quality. Complete waste of mon...
Neutral  (score: +0.00) | It's okay, nothing special but it works...
Positive (score: +0.50) | I love this! Exceeded all my expectatio...

Real-World Impact: Companies process millions of reviews with sentiment analysis. Netflix analyzes viewer feedback to improve recommendations. Amazon uses it to surface helpful reviews. Brands monitor social media sentiment to catch PR crises early.

Named Entity Recognition (NER)

Named Entity Recognition identifies and classifies named entities in text. It answers questions like: Who is mentioned? What companies? Which locations? This is crucial for information extraction, search engines, and building knowledge graphs.

import spacy

# Load English model
nlp = spacy.load('en_core_web_sm')

text = """Apple Inc. announced that Tim Cook will visit the new headquarters 
in Cupertino, California next Monday. The company plans to invest $5 billion 
in AI research by 2025."""

doc = nlp(text)

print("Named Entities Found:")
print("-" * 50)
for ent in doc.ents:
    print(f"{ent.text:20} | Type: {ent.label_:12} | Description: {spacy.explain(ent.label_)}")

# Visualize entities (in Jupyter)
# from spacy import displacy
# displacy.render(doc, style='ent')

Output:

Named Entities Found:
--------------------------------------------------
Apple Inc.           | Type: ORG          | Description: Companies, agencies, institutions
Tim Cook             | Type: PERSON       | Description: People, including fictional
Cupertino            | Type: GPE          | Description: Countries, cities, states
California           | Type: GPE          | Description: Countries, cities, states
next Monday          | Type: DATE         | Description: Absolute or relative dates or periods
$5 billion           | Type: MONEY        | Description: Monetary values, including unit
2025                 | Type: DATE         | Description: Absolute or relative dates or periods

Extraction Summary

spaCy's NER model automatically extracted 7 entities from our text:

Apple Inc. (ORG), Tim Cook (PERSON), Cupertino (GPE), California (GPE), next Monday (DATE), $5 billion (MONEY), 2025 (DATE)
Power of NER: This transforms unstructured text into actionable data for databases, search indexing, and analytics systems!
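As a sketch of that idea, the (text, label) pairs spaCy returns can be grouped into a structured record with plain Python (the entity list below is copied from the output above):

```python
from collections import defaultdict

# (text, label) pairs as extracted by spaCy above
entities = [("Apple Inc.", "ORG"), ("Tim Cook", "PERSON"),
            ("Cupertino", "GPE"), ("California", "GPE"),
            ("next Monday", "DATE"), ("$5 billion", "MONEY"),
            ("2025", "DATE")]

# Group entity mentions by type - ready for a database row or search index
record = defaultdict(list)
for text, label in entities:
    record[label].append(text)

print(dict(record))
# {'ORG': ['Apple Inc.'], 'PERSON': ['Tim Cook'],
#  'GPE': ['Cupertino', 'California'],
#  'DATE': ['next Monday', '2025'], 'MONEY': ['$5 billion']}
```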

Common Entity Types

Entity Type | Description                       | Examples
PERSON      | People, including fictional       | Tim Cook, Albert Einstein, Harry Potter
ORG         | Companies, agencies, institutions | Apple, NASA, United Nations
GPE         | Countries, cities, states         | India, New York, California
DATE        | Dates and time periods            | January 2024, next week, 1990s
MONEY       | Monetary values                   | $100, 50 euros, 1 million dollars

Text Classification

Text classification assigns predefined categories to documents. Think of your email inbox - spam detection is a classic text classification problem. Other applications include topic labeling, language detection, and intent classification for chatbots.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Training data: (text, category)
train_texts = [
    "Get rich quick! Win $1000 now!", 
    "Meeting scheduled for tomorrow at 3pm",
    "Claim your free prize today!!!",
    "Please review the attached document",
    "Congratulations! You've won a lottery",
    "Project deadline extended to Friday"
]
train_labels = ["spam", "not_spam", "spam", "not_spam", "spam", "not_spam"]

# Create a simple classifier pipeline
classifier = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

# Train the model
classifier.fit(train_texts, train_labels)

# Test on new emails
test_emails = [
    "Free money waiting for you!",
    "Can we schedule a meeting?",
    "You've been selected for a cash prize!"
]

print("Email Classification Results:")
print("-" * 50)
for email in test_emails:
    prediction = classifier.predict([email])[0]
    confidence = classifier.predict_proba([email]).max()
    print(f"[{prediction.upper():8}] ({confidence:.0%}) {email}")

Output:

Email Classification Results:
--------------------------------------------------
[SPAM    ] (89%) Free money waiting for you!
[NOT_SPAM] (76%) Can we schedule a meeting?
[SPAM    ] (92%) You've been selected for a cash prize!

How It Works
The Classification Pipeline
  1. TF-IDF Vectorization: Converts text to numerical feature vectors
  2. Pattern Learning: Naive Bayes learns which words signal spam ("free", "win", "prize", "money")
  3. Prediction: Calculates the probability of each category for a new email

Result: 6 training examples → 89-92% confidence on new emails!

Production Tip: Train on thousands of labeled emails for even better accuracy in real-world applications.
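To demystify what MultinomialNB computes, here is a toy from-scratch version with Laplace smoothing. The helper names `train_nb`/`predict_nb` are hypothetical, and scikit-learn's implementation is vectorized and more robust, but the underlying math is the same idea: pick the label maximizing the log prior plus summed log word likelihoods.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels):
    """Count words per class - all the 'training' Naive Bayes needs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict_nb(text, word_counts, label_counts, vocab):
    """Pick the label maximizing log prior + summed log likelihoods."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        # Laplace smoothing: add 1 to every count so unseen words don't zero out
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

texts = ["win free prize money", "meeting tomorrow at noon",
         "free cash prize now", "please review the document"]
labels = ["spam", "not_spam", "spam", "not_spam"]
model = train_nb(texts, labels)
print(predict_nb("claim your free prize", *model))  # spam
```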

Other Important NLP Tasks

Machine Translation

Automatically translates text from one language to another while preserving meaning and context. Modern neural machine translation uses deep learning to produce human-quality translations, powering services like Google Translate and enabling global communication.

Example: "Hello, how are you?" → "Hola, ¿cómo estás?" (Spanish)

Text Summarization

Condenses long documents into shorter versions while retaining key information. Can be extractive (selecting important sentences) or abstractive (generating new sentences). Essential for processing news articles, research papers, and legal documents efficiently.

Example: Summarize a 10-page report into a 3-sentence executive summary

Question Answering

Extracts precise answers from text given a natural language question. Powers virtual assistants, FAQ systems, and search engines. Can work with structured knowledge bases or unstructured text documents using reading comprehension techniques.

Example: Q: "When was Python created?" → A: "1991 by Guido van Rossum"

Chatbots and Dialogue

Builds conversational agents that understand context, maintain dialogue history, and respond naturally to user queries. Combines multiple NLP tasks including intent recognition, entity extraction, and response generation to create intelligent assistants.

Example: Customer support bots, Siri, Alexa, ChatGPT

Practice Questions: NLP Tasks

Apply your NLP knowledge to real tasks.

Task: Use TextBlob to determine if "The battery life is disappointing but the camera is excellent" is positive or negative.

Show Solution
from textblob import TextBlob

review = "The battery life is disappointing but the camera is excellent"
blob = TextBlob(review)

print(f"Polarity: {blob.sentiment.polarity:.2f}")
print(f"Subjectivity: {blob.sentiment.subjectivity:.2f}")
# Output: Mixed sentiment - slightly positive (camera excellence outweighs battery disappointment)

Task: Extract all person names from: "Elon Musk met with Sundar Pichai to discuss AI safety. Bill Gates joined via video call."

Show Solution
import spacy

nlp = spacy.load('en_core_web_sm')
text = "Elon Musk met with Sundar Pichai to discuss AI safety. Bill Gates joined via video call."

doc = nlp(text)
people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

print(f"People mentioned: {people}")
# Output: ['Elon Musk', 'Sundar Pichai', 'Bill Gates']

Task: Create a classifier that categorizes text into "sports", "technology", or "food" topics.

Show Solution
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Training data
texts = [
    "The team scored a winning goal", "Basketball playoffs begin",
    "New smartphone released", "AI breakthrough announced",
    "Best pizza recipe", "Restaurant reviews"
]
labels = ["sports", "sports", "technology", "technology", "food", "food"]

# Train classifier
clf = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])
clf.fit(texts, labels)

# Test
test = ["The football match was exciting", "New laptop features"]
predictions = clf.predict(test)
print(f"Predictions: {list(zip(test, predictions))}")

Key Takeaways

NLP Bridges Human-Machine Communication

Natural Language Processing enables computers to understand, interpret, and generate human language, powering applications from chatbots to translation services

Preprocessing is Essential

Text cleaning (lowercasing, removing punctuation, handling special characters) and normalization are crucial steps before any NLP analysis

Tokenization Breaks Text into Units

Word, sentence, and subword tokenization convert continuous text into discrete tokens that machines can process and analyze

Text Must Become Numbers

Bag of Words counts word occurrences, while TF-IDF weighs term importance. Both convert text to numerical vectors for machine learning

Stop Words and Stemming Reduce Noise

Removing common words (the, is, at) and reducing words to their root form helps focus on meaningful content and reduces vocabulary size

NLP Powers Many Applications

Sentiment analysis, named entity recognition, text classification, and machine translation are just a few of the many practical NLP applications

Knowledge Check

Test your understanding of Natural Language Processing fundamentals:

Question 1 of 6

What is the primary goal of Natural Language Processing (NLP)?

Question 2 of 6

Which preprocessing step converts "Running" and "RUNNING" to the same form?

Question 3 of 6

What does tokenization do to the sentence "I love NLP"?

Question 4 of 6

What is the main difference between Bag of Words and TF-IDF?

Question 5 of 6

Which words are typically removed as "stop words" in NLP preprocessing?

Question 6 of 6

What NLP task determines if a movie review is positive or negative?
