SpotiFlow is an end-to-end data pipeline project that extracts K-pop artist data from the Spotify API, processes it using Airflow, stores the data in both PostgreSQL and ClickHouse, and visualizes insights using Metabase. This project demonstrates a full-stack data engineering workflow including ETL orchestration, data warehousing, containerization, and dashboarding.
+----------------+
| Spotify API |
+----------------+
|
(Extract)
|
+----------------+
| Airflow (DAG) |
+----------------+
| | |
Extract Transform Load
| | |
+--------+ | +------------+
| ↓ ↓
+----------------+ +-----------------------+
| PostgreSQL | | ClickHouse |
| (raw data) | | (for analytics) |
+----------------+ +-----------------------+
|
+-----------+
| Metabase |
+-----------+
| Domain | Stack |
|---|---|
| Language | Python 3.11 |
| Data Source | Spotify Web API |
| Workflow Orchestration | Apache Airflow |
| Storage | PostgreSQL (OLTP), ClickHouse (OLAP) |
| Visualization | Metabase |
| Infrastructure | Docker, Docker Compose |
| Others | requests, dotenv, psycopg2, clickhouse-driver |
git clone https://github.com/your-username/spotiflow.git
cd spotiflowCreate a .env file in the project root with the following contents:
SPOTIFY_CLIENT_ID=your_spotify_client_id
SPOTIFY_CLIENT_SECRET=your_spotify_client_secretdocker-compose up --build -dThis will spin up:
- Airflow (Webserver + Scheduler)
- PostgreSQL
- ClickHouse
- Metabase
Note: For Apple Silicon (M1/M2),
platform: linux/amd64is used to ensure compatibility.
Inside the container:
docker exec -it airflow-webserver bash
airflow db init
airflow users create --username admin --password admin --firstname Chaerin --lastname Lee --role Admin --email test@example.comThen visit: http://localhost:8080
spotify_etl_dag.py: Main DAG that runs three tasks:extract_task: Calls Spotify API and fetches K-pop artist dataload_task: Saves to PostgreSQL and migrates to ClickHouseclickhouse_task: (Optional) Transformations or aggregations
You can trigger the DAG manually or set it to run on a schedule (@daily).
Once Metabase is up:
-
Visit http://localhost:3000
-
Add a new database → Select ClickHouse
-
Set up connection:
- Host:
clickhouse - Port:
8123 - Username:
default - Password: (leave blank)
- Database:
default
- Host:
-
Browse Data →
spotify_artiststable -
Create charts:
- Popularity vs Followers (scatter)
- Genre-wise average popularity
- Trends over
fetched_at(time-series)
-- Top 10 K-pop artists by followers
SELECT name, followers, popularity
FROM spotify_artists
ORDER BY followers DESC
LIMIT 10;
-- Average popularity per genre
SELECT genres, avg(popularity) AS avg_popularity
FROM spotify_artists
GROUP BY genres
ORDER BY avg_popularity DESC;- Mastered Spotify API authentication and data extraction
- Gained hands-on experience with Airflow DAGs and scheduling
- Handled ClickHouse integration and batch insert optimizations
- Troubleshot Docker container compatibility issues on ARM architecture
- Learned to visualize data effectively using Metabase
- Support more Spotify entities (Albums, Tracks, Playlists)
- Enable full CI/CD workflow for Airflow DAG deployments
- Add real-time data streaming (e.g., via Kafka)
- Improve dashboards with filters, drill-downs, and dynamic widgets
MIT License