End-to-end data pipeline on Google Cloud and Databricks: data ingestion, storage, querying with BigQuery, Spark analysis, ML with Spark MLlib, and visualization with Looker Studio. Demonstrates full workflow from raw dataset to enriched insights.

giuleo129/cloudComputing

Cloud Data Engineering & Analytics Pipeline

An end-to-end data engineering and analytics project built on Google Cloud Platform and Databricks. It covers the complete pipeline: data ingestion, querying, distributed analysis, machine learning, and visualization.

Objective

Design and implement a scalable cloud-based data pipeline to process, analyze, and visualize data using modern data engineering and analytics tools.

Tools & Technologies

  • Google Cloud Platform (GCS, BigQuery, Cloud Shell)
  • Databricks
  • Apache Spark (Spark SQL, DataFrames)
  • Spark MLlib
  • Looker Studio
  • SQL, Python (Jupyter Notebooks)

Workflow

  • Cloud Setup
    Created a Google Cloud Storage (GCS) bucket and configured project resources.

  • Data Ingestion
    Downloaded the dataset, uploaded it to GCS, and verified data integrity using Cloud Shell.

  • Data Manipulation & Querying
    Imported data into BigQuery and executed analytical queries using:

    • BigQuery Web Console
    • Jupyter notebooks
  • Distributed Data Analysis
    Loaded data into Spark DataFrames on Databricks and replicated analytical queries using:

    • Spark SQL
    • DataFrame operations
  • Data Enrichment
    Applied a machine learning model with Spark MLlib and used its output to enrich the dataset for further analysis.

  • Data Visualization
    Built an interactive dashboard in Looker Studio to present insights (with optional visualization in Databricks).
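
The integrity check in the Data Ingestion step can be sketched as follows. This is a minimal illustration, not the method used in this repository: it assumes the check compares a local MD5 digest with the base64-encoded MD5 that GCS reports in object metadata (for example, the `Hash (md5)` line printed by `gsutil ls -L`). The function names and the file path are hypothetical.

```python
import base64
import hashlib


def md5_base64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a local file.

    The file is read in chunks so that large datasets do not need
    to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_remote_hash(path, remote_md5_base64):
    """Compare the local digest with the value copied from GCS metadata.

    `remote_md5_base64` is assumed to be the base64 MD5 string shown
    for the uploaded object; surrounding whitespace is ignored.
    """
    return md5_base64(path) == remote_md5_base64.strip()
```

A possible use is to call `matches_remote_hash("dataset.csv", value_from_gcs)` after the upload, where `dataset.csv` is a placeholder file name. Objects without an MD5 in their metadata (GCS composite objects expose only CRC32C) would need a different check.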

Outcome

The project demonstrates how cloud storage, distributed computing, machine learning, and visualization tools can be integrated into a unified data pipeline for real-world analytics use cases.

Key Takeaway

This repository highlights practical experience in building and managing cloud-based data pipelines, combining data engineering and data analysis skills in a scalable environment.
