Skip to content

ChearinL/SpotiFlow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🎧 SpotiFlow: End-to-End K-pop Artist Data Pipeline from Spotify API

SpotiFlow is an end-to-end data pipeline project that extracts K-pop artist data from the Spotify API, processes it using Airflow, stores the data in both PostgreSQL and ClickHouse, and visualizes insights using Metabase. This project demonstrates a full-stack data engineering workflow including ETL orchestration, data warehousing, containerization, and dashboarding.


🚀 Project Architecture

              +----------------+                     
              | Spotify API    |                     
              +----------------+                     
                      |                               
                  (Extract)                          
                      |                               
              +----------------+                     
              | Airflow (DAG)  |                     
              +----------------+                     
                |     |     |                       
            Extract  Transform  Load               
                |     |     |                       
       +--------+     |     +------------+          
       |              ↓                  ↓          
+----------------+        +-----------------------+
| PostgreSQL     |        | ClickHouse            |
| (raw data)     |        | (for analytics)       |
+----------------+        +-----------------------+
                                     |                
                               +-----------+          
                               | Metabase  |          
                               +-----------+          

📦 Tech Stack

Domain Stack
Language Python 3.11
Data Source Spotify Web API
Workflow Orchestration Apache Airflow
Storage PostgreSQL (OLTP), ClickHouse (OLAP)
Visualization Metabase
Infrastructure Docker, Docker Compose
Others requests, dotenv, psycopg2, clickhouse-driver

🔧 Setup Instructions

1. Clone the repository

git clone https://github.com/your-username/spotiflow.git
cd spotiflow

2. Set up environment variables

Create a .env file in the project root with the following contents:

SPOTIFY_CLIENT_ID=your_spotify_client_id
SPOTIFY_CLIENT_SECRET=your_spotify_client_secret

3. Start the containers

docker-compose up --build -d

This will spin up:

  • Airflow (Webserver + Scheduler)
  • PostgreSQL
  • ClickHouse
  • Metabase

Note: For Apple Silicon (M1/M2), platform: linux/amd64 is used to ensure compatibility.

4. Initialize Airflow

Inside the container:

docker exec -it airflow-webserver bash
airflow db init
airflow users create --username admin --password admin --firstname Chaerin --lastname Lee --role Admin --email test@example.com

Then visit: http://localhost:8080


💠 Airflow DAGs

  • spotify_etl_dag.py: Main DAG that runs three tasks:
    • extract_task: Calls Spotify API and fetches K-pop artist data
    • load_task: Saves to PostgreSQL and migrates to ClickHouse
    • clickhouse_task: (Optional) Transformations or aggregations

You can trigger the DAG manually or set it to run on a schedule (@daily).


📊 Metabase Dashboard

Once Metabase is up:

  1. Visit http://localhost:3000

  2. Add a new database → Select ClickHouse

  3. Set up connection:

    • Host: clickhouse
    • Port: 8123
    • Username: default
    • Password: (leave blank)
    • Database: default
  4. Browse Data → spotify_artists table

  5. Create charts:

    • Popularity vs Followers (scatter)
    • Genre-wise average popularity
    • Trends over fetched_at (time-series)

📈 Sample Queries

-- Top 10 K-pop artists by followers
SELECT name, followers, popularity
FROM spotify_artists
ORDER BY followers DESC
LIMIT 10;

-- Average popularity per genre
SELECT genres, avg(popularity) AS avg_popularity
FROM spotify_artists
GROUP BY genres
ORDER BY avg_popularity DESC;

🧠 Lessons Learned

  • Mastered Spotify API authentication and data extraction
  • Gained hands-on experience with Airflow DAGs and scheduling
  • Handled ClickHouse integration and batch insert optimizations
  • Troubleshot Docker container compatibility issues on ARM architecture
  • Learned to visualize data effectively using Metabase

✨ Future Improvements

  • Support more Spotify entities (Albums, Tracks, Playlists)
  • Enable full CI/CD workflow for Airflow DAG deployments
  • Add real-time data streaming (e.g., via Kafka)
  • Improve dashboards with filters, drill-downs, and dynamic widgets

📄 License

MIT License


🙋‍️ Author

Chaerin Lee

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages