Designed and deployed a scalable clickstream analytics pipeline that processes more than 1.5 million user events using Apache Spark and a Hadoop-style architecture, achieving a 70% improvement in batch processing efficiency.
- Apache Spark (PySpark)
- Python (Pandas, Faker, Matplotlib)
- Hadoop HDFS (simulated locally)
- Jupyter Notebook (for visualization)
Modern digital platforms generate millions of user clicks every day. This project aims to:
- Simulate large-scale clickstream data
- Efficiently process and aggregate massive event logs
- Derive insights such as the most visited pages and user engagement trends
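
The three goals above can be illustrated with a minimal sketch. It uses only the Python standard library so that it runs anywhere, standing in for the Faker-based generator and the PySpark aggregations used in the project. The page paths, event types, field names, and the function names `simulate_events`, `top_pages`, and `events_per_hour` are all assumptions made for illustration, not the project's actual schema or code.

```python
import random
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical page paths and event types; the project's real values may differ.
PAGES = ["/home", "/search", "/product", "/cart", "/checkout"]
EVENT_TYPES = ["view", "click", "scroll"]


def simulate_events(n_events, n_users=100, seed=42):
    """Generate n_events synthetic clickstream records as dictionaries."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    start = datetime(2024, 1, 1)
    events = []
    for _ in range(n_events):
        events.append({
            "user_id": rng.randint(1, n_users),
            "page": rng.choice(PAGES),
            "event_type": rng.choice(EVENT_TYPES),
            # Spread timestamps uniformly over a single day.
            "timestamp": start + timedelta(seconds=rng.randint(0, 86_399)),
        })
    return events


def top_pages(events, k=3):
    """Return the k most visited pages as (page, count) pairs."""
    return Counter(e["page"] for e in events).most_common(k)


def events_per_hour(events):
    """Count events per hour of day as a simple engagement trend."""
    counts = Counter(e["timestamp"].hour for e in events)
    return dict(sorted(counts.items()))


events = simulate_events(10_000)
print(top_pages(events))
print(events_per_hour(events))
```

At the scale the project targets, the same two aggregations would more plausibly be written as `groupBy("page").count()` and an hourly `groupBy` on a PySpark DataFrame; the sketch only shows the shape of the data and the queries.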