End-to-end data engineering and analytics project built on Google Cloud Platform and Databricks, showcasing a complete pipeline from data ingestion to analysis, machine learning, and visualization.
Design and implement a scalable cloud-based data pipeline to process, analyze, and visualize data using modern data engineering and analytics tools.
- Google Cloud Platform (GCS, BigQuery, Cloud Shell)
- Databricks
- Apache Spark (Spark SQL, DataFrames)
- Spark MLlib
- Looker Studio
- SQL, Python (Jupyter Notebooks)

Cloud Setup
Created a Google Cloud Storage (GCS) bucket and configured project resources.
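
As a rough illustration of the bucket setup, the sketch below uses the `google-cloud-storage` Python client instead of the console or Cloud Shell, which is an assumption about tooling. The helper names (`is_valid_bucket_name`, `create_bucket`) and the `project_id`, `bucket_name`, and `location` values are hypothetical, and the name check covers only a simplified subset of the GCS naming rules.

```python
import re

# Simplified subset of the GCS bucket naming rules (assumption: 3-63
# characters; lowercase letters, digits, dashes, underscores, and dots;
# starting and ending with a letter or digit). The full rules have more cases.
_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the simplified check above."""
    return bool(_BUCKET_NAME.fullmatch(name))


def create_bucket(project_id: str, bucket_name: str, location: str = "US"):
    """Create a bucket. All three arguments are hypothetical placeholders."""
    if not is_valid_bucket_name(bucket_name):
        raise ValueError(f"invalid bucket name: {bucket_name!r}")
    # Imported lazily so the name check works without the client library.
    from google.cloud import storage

    client = storage.Client(project=project_id)
    return client.create_bucket(bucket_name, location=location)
```

Checking the name locally first gives a clearer error than a failed API call. The call itself needs valid Google Cloud credentials in the environment.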

Data Ingestion
Downloaded the dataset, uploaded it to GCS, and verified data integrity using Cloud Shell.
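
The upload and integrity check could look roughly like this in Python. The project ran the check in Cloud Shell; this sketch swaps in the `google-cloud-storage` client and compares the base64-encoded MD5 that GCS records for an object with one computed locally. Function names and arguments are hypothetical placeholders.

```python
import base64
import hashlib


def local_md5_base64(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the base64-encoded MD5 digest of a local file.

    GCS reports an object's MD5 in this same base64 form.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def upload_and_verify(bucket_name: str, local_path: str, object_name: str) -> bool:
    """Upload a file and report whether the remote MD5 matches the local one."""
    # Imported lazily so the hashing helper works without the client library.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)
    blob.reload()  # refresh object metadata, including md5_hash
    return blob.md5_hash == local_md5_base64(local_path)
```

A `False` result from `upload_and_verify` would suggest the object in the bucket differs from the local file and should be uploaded again.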

Data Manipulation & Querying
Imported data into BigQuery and executed analytical queries using:
- BigQuery Web Console
- Jupyter notebooks
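
A minimal sketch of the notebook side, assuming the dataset is a CSV file with a header row and that `google-cloud-bigquery` is installed. The table ID, the column name, and the helper names are hypothetical, and the aggregation only stands in for the project's actual queries, which are not listed here.

```python
def build_top_n_query(table_id: str, group_col: str, limit: int = 10) -> str:
    """Build a simple aggregation query.

    Identifiers are interpolated directly, so they must come from trusted
    code, not from user input.
    """
    return (
        f"SELECT {group_col}, COUNT(*) AS row_count\n"
        f"FROM `{table_id}`\n"
        f"GROUP BY {group_col}\n"
        f"ORDER BY row_count DESC\n"
        f"LIMIT {int(limit)}"
    )


def load_csv_from_gcs(gcs_uri: str, table_id: str) -> int:
    """Load a CSV file from GCS into a BigQuery table and return its row count."""
    from google.cloud import bigquery  # lazy import: needs the client library

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: the first row is a header
        autodetect=True,      # let BigQuery infer the schema
    )
    client.load_table_from_uri(gcs_uri, table_id, job_config=job_config).result()
    return client.get_table(table_id).num_rows


def run_query(sql: str) -> list:
    """Run a query and return the rows as plain dictionaries."""
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]
```

The text returned by `build_top_n_query` can also be pasted into the BigQuery Web Console, which keeps the console and notebook runs comparable.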

Distributed Data Analysis
Loaded data into Spark DataFrames on Databricks and replicated analytical queries using:
- Spark SQL
- DataFrame operations
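
To illustrate how one query can be written both ways, here is a hedged sketch. It assumes a DataFrame is already loaded and that `spark` is the session Databricks notebooks provide; the view name, the column name, and the function names are hypothetical, and the aggregation only stands in for the project's queries.

```python
def top_n_sql_text(view_name: str, group_col: str, n: int = 10) -> str:
    """Spark SQL text for the most frequent values of one column."""
    return (
        f"SELECT {group_col}, COUNT(*) AS row_count "
        f"FROM {view_name} "
        f"GROUP BY {group_col} "
        f"ORDER BY row_count DESC "
        f"LIMIT {int(n)}"
    )


def top_n_with_sql(spark, df, group_col: str, n: int = 10, view_name: str = "dataset"):
    """Run the aggregation through Spark SQL on a temporary view."""
    df.createOrReplaceTempView(view_name)
    return spark.sql(top_n_sql_text(view_name, group_col, n))


def top_n_with_dataframe(df, group_col: str, n: int = 10):
    """Express the same aggregation with DataFrame operations."""
    return (
        df.groupBy(group_col)
        .count()
        .withColumnRenamed("count", "row_count")  # match the SQL alias
        .orderBy("row_count", ascending=False)
        .limit(n)
    )
```

Both functions return a DataFrame with the same two columns, so their results can be compared side by side, for example with `.show()`.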

Data Enrichment
Applied a machine learning model using Spark MLlib to enhance the analysis.
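
The model is not named here, so the sketch below assumes k-means clustering over the numeric columns purely for illustration; the project's actual model may differ. `numeric_columns`, `add_cluster_column`, and the `k` and `seed` defaults are hypothetical.

```python
# Spark SQL type names treated as numeric features (assumption: decimal
# and boolean columns are left out to keep the sketch short).
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes) -> list:
    """Pick numeric column names from a list like DataFrame.dtypes."""
    return [name for name, dtype in dtypes if dtype in NUMERIC_TYPES]


def add_cluster_column(df, k: int = 3, seed: int = 42):
    """Return df with an extra 'cluster' column from a k-means model."""
    # Imported lazily so the column helper works without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    feature_cols = numeric_columns(df.dtypes)
    if not feature_cols:
        raise ValueError("no numeric columns to cluster on")

    # handleInvalid="skip" drops rows with nulls in the feature columns.
    assembler = VectorAssembler(
        inputCols=feature_cols, outputCol="features", handleInvalid="skip"
    )
    kmeans = KMeans(k=k, seed=seed, featuresCol="features", predictionCol="cluster")
    model = Pipeline(stages=[assembler, kmeans]).fit(df)
    return model.transform(df)
```

The added `cluster` column can then be used as a grouping key in further queries or as a dimension in a dashboard.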

Data Visualization
Built an interactive dashboard in Looker Studio to present insights (with optional visualization in Databricks).
The project demonstrates how cloud storage, distributed computing, machine learning, and visualization tools can be integrated into a unified data pipeline for real-world analytics use cases.
This repository highlights practical experience in building and managing cloud-based data pipelines, combining data engineering and data analysis skills in a scalable environment.