Project Overview
This capstone project brings together everything you have learned in the Java DSA course. You will work with real social network data containing 50 users, 100 connections, and 50 posts with engagement metrics. Your goal is to implement graph algorithms that power features like friend suggestions, connection paths, and community detection - the same algorithms used by LinkedIn, Facebook, and Twitter.
Graph Structure
Adjacency list with weighted edges
Friend Suggestions
BFS/DFS based recommendations
Connection Paths
Dijkstra's shortest path algorithm
Communities
Union-Find for group detection
Learning Objectives
Technical Skills
- Implement Graph using adjacency list representation
- Build BFS traversal for friend-of-friend recommendations
- Apply Dijkstra's algorithm for weighted shortest paths
- Use Union-Find data structure for community detection
- Calculate network metrics like degree centrality and influence
Problem-Solving Skills
- Analyze real-world social network patterns
- Understand graph algorithm time complexity
- Handle large-scale graph operations efficiently
- Design recommendation systems based on connections
- Test with actual social network interaction data
Problem Scenario
ConnectHub
You have been hired as a Backend Engineer at ConnectHub, a growing social networking platform. The company is struggling with user engagement - their current system shows random friend suggestions and users can't discover how they're connected to others. The product team needs you to build an intelligent recommendation engine using graph algorithms.
"Our users want to see 'People You May Know' based on mutual friends, not random suggestions. They also want to know 'how am I connected to this person?' like LinkedIn's connection path feature. Can you build a graph-based system that handles these features efficiently?"
Key Challenges to Solve
- Find friends-of-friends efficiently (2-hop connections)
- Rank suggestions by mutual friend count
- Filter already-connected users
- Handle edge cases gracefully
- Find shortest path between two users
- Consider connection strength as edge weight
- Display the chain of connections
- Handle disconnected users
- Identify groups of closely connected users
- Use Union-Find for efficient grouping
- Detect isolated communities
- Analyze community sizes and patterns
- Calculate degree centrality (connections)
- Identify most influential users
- Analyze network topology metrics
- Find bridge users between communities
The Dataset
You will work with a comprehensive social network dataset containing real-world interaction patterns. Download the CSV files containing users, connections, and posts data for graph analysis:
Dataset Download
Download the social network dataset files and save them to your project folder. The CSV files contain all necessary data for graph algorithms and network analysis.
Original Data Source
This project uses a curated Social Network Graph Dataset inspired by popular Kaggle datasets and real-world platforms like LinkedIn, Twitter, and Facebook. The dataset contains realistic user profiles, friendship connections with varying strengths, and engagement metrics. Perfect for learning graph algorithms and network analysis techniques used by major tech companies.
Dataset Schema
| Column | Type | Description |
|---|---|---|
user_id | String | Unique user identifier (U001-U050) |
username | String | Username handle for the profile |
full_name | String | User's full name |
email | String | User's email address |
join_date | Date | Account creation date (YYYY-MM-DD) |
location | String | User's city/region |
followers_count | Integer | Number of followers |
following_count | Integer | Number of users followed |
posts_count | Integer | Total posts published |
is_verified | Boolean | Account verification status |
profile_status | String | Active, Inactive, or Dormant |
| Column | Type | Description |
|---|---|---|
connection_id | String | Unique connection identifier (C001-C100) |
user_id | String | Source user (follower/connector) |
friend_id | String | Target user (followed/connected) |
connection_type | String | Friend, Follower, or Following |
connected_date | Date | Date connection was established |
interaction_count | Integer | Total interactions between users |
last_interaction | Date | Date of most recent interaction |
strength_score | Decimal | Connection strength (0.0-1.0), edge weight |
| Column | Type | Description |
|---|---|---|
post_id | String | Unique post identifier (P001-P050) |
user_id | String | User who created the post |
content | String | Post text/content |
post_date | Date | Date post was published (YYYY-MM-DD) |
likes_count | Integer | Number of likes received |
comments_count | Integer | Number of comments |
shares_count | Integer | Number of shares |
post_type | String | Text, Image, Video, or Article |
visibility | String | Public, Friends Only, or Private |
engagement_score | Decimal | Calculated engagement metric (0.0-100.0) |
Project Requirements
Your project must include all of the following components using Java, graph data structures, and algorithms. Structure your implementation with clear organization and comprehensive documentation.
Graph Data Structure & CSV Parsing
Data Loading:
- Parse users.csv and populate user profiles (User objects)
- Parse connections.csv and build adjacency list graph
- Parse posts.csv for engagement metrics (optional for recommendation scoring)
- Validate data integrity and handle malformed CSV rows
Graph Construction:
- Implement adjacency list using HashMap<String, List<Edge>>
- Store connection strength as edge weights (0.0-1.0)
- Support bidirectional friendship connections
- Efficient O(1) neighbor lookup and O(E) edge iteration
Data Classes:
- User.java - Store user profile with followers, following, posts count
- Connection.java - Edge representation with strength score
- SocialGraph.java - Main graph class with core operations
Friend Recommendations (BFS Algorithm)
Algorithm Implementation:
- Implement BFS to explore 2-hop connections from source user
- Level 1: Direct friends (visited, exclude from recommendations)
- Level 2: Friends-of-friends (candidates for recommendation)
- Skip self-loops and already-connected users
Ranking & Filtering:
- For each Level 2 user, count number of mutual friends
- Rank candidates by mutual friend count (descending)
- Return top N recommendations with scores
- Handle users with no friends (return empty list)
Performance Requirements:
- Time Complexity: O(V + E) for BFS traversal
- Space Complexity: O(V) for visited set and result storage
- Optimize with HashMap for constant-time lookups
Shortest Connection Path (Dijkstra's Algorithm)
Algorithm Implementation:
- Implement Dijkstra's algorithm with binary heap (PriorityQueue)
- Use connection strength as edge weight: cost = 1 - strength_score
- Stronger connections (higher strength) = shorter distances (preferred)
- Find optimal path maximizing total connection strength
Path Reconstruction:
- Track parent pointers during algorithm execution
- Reconstruct full path from source to destination
- Return path as list of User objects or IDs
- Include total distance/cost in result
Edge Cases:
- Handle disconnected users (no path exists)
- Handle direct connections (source == destination)
- Handle single-hop paths (direct friends)
Community Detection (Union-Find Algorithm)
Union-Find Implementation:
- Implement Disjoint Set Union (Union-Find) data structure
- Use int[] parent array to track set membership
- Use int[] rank array for union-by-rank optimization
- Implement path compression in find() method
Community Grouping:
- Union users connected by edges with strength_score ≥ 0.7 (strong connections)
- Identify distinct connected components (communities)
- Return mapping of users to community IDs
- Report community sizes and member lists
Optimizations:
- Path compression in find() - O(α(n)) amortized time
- Union by rank - maintain balanced tree structure
- Overall complexity: Near O(1) per operation
Bonus Features (+75 points)
Influence Analysis (+50 pts)
Calculate network influence metrics:
- Degree centrality for each user (connection count)
- Weighted degree (sum of connection strengths)
- Identify top 10 most influential users
- Find bridge users connecting multiple communities
- Betweenness centrality (times in shortest paths)
Network Visualization (+25 pts)
Generate visual representation:
- Export graph in DOT format (Graphviz)
- Color-code nodes by community membership
- Size nodes by degree centrality
- Label edges with connection strength
- Highlight bridge users and communities
Algorithm Complexity Reference
Understanding the time and space complexity of each algorithm is critical for performance and scalability. These are the expected complexities for your implementations.
- Time: O(V + E)
- Space: O(V)
- Why: Visit each vertex and edge once
With 50 users and 100 connections: ~150 operations
- Time: O((V+E) log V)
- Space: O(V)
- Why: Extract-min from heap V times
With 50 users and 100 connections: ~600 operations
- Time: O(α(n)) per op
- Space: O(V)
- Why: Path compression ≈ constant
With 50 users: ~50 near-constant ops
- Time: O(V + E)
- Space: O(V)
- Why: Stack depth + edges
With 50 users and 100 connections: ~150 operations
Data Structures
Choose the right data structure for each component. Efficiency matters when working with graphs!
HashMap<String, List<Edge>>
- Map user ID → list of neighbors
- O(1) neighbor access
- O(E) space for edges
PriorityQueue<Vertex>
- Extract min distance in O(log V)
- Used for Dijkstra's algorithm
- Custom Comparator by distance
int[] parent, int[] rank
- Path compression in find()
- Union by rank optimization
- Nearly O(1) per operation
HashMap, HashSet
- O(1) average lookups
- Visited tracking (BFS/DFS)
- User/connection storage
Submission Requirements
GitHub Repository
Create a public GitHub repository named social-network-graph with complete source code,
data files, test cases, and comprehensive documentation. The repository will be verified for
required files and code structure.
Required Repository Structure
README.md Must Include
Project Details
- Your full name and submission date
- Project overview and features
- Core algorithms implemented
- Bonus features (if any)
Technical Documentation
- Time/space complexity analysis
- Data structure explanations
- Compilation & execution instructions
- Sample output for each feature
Design & Implementation
- Design decisions and trade-offs
- Graph representation explanation
- Algorithm choice justification
- Edge case handling
Testing & Results
- Test cases with expected results
- Performance on dataset
- Screenshots of output
- Known limitations
Grading Rubric
Your project will be evaluated on correctness of implementation, code quality, and comprehensive documentation. Bonus features offer additional points for advanced work.
| Component | Points | Criteria |
|---|---|---|
| Graph Construction | 100 | Correct CSV parsing, adjacency list implementation, bidirectional edges, weight handling |
| Friend Recommendations (BFS) | 100 | Correct BFS traversal, 2-hop detection, ranking by mutual friends, proper exclusions |
| Shortest Path (Dijkstra's) | 100 | Correct algorithm implementation, priority queue usage, path reconstruction, edge weighting |
| Community Detection (Union-Find) | 100 | Correct Union-Find implementation, path compression, union by rank, threshold-based grouping |
| Code Quality | 50 | Clean code structure, meaningful variable names, proper comments, efficient implementation |
| Documentation & Testing | 50 | Comprehensive README, complexity analysis, test cases, sample output, design explanation |
| Bonus: Influence Analysis | +50 | Degree centrality calculation, top influencers, bridge user detection |
| Bonus: Network Visualization | +25 | DOT format export, community coloring, node sizing by influence |
| Total | 500 | Base score + up to +75 bonus points |