Heart Disease Analysis & Risk Factor Prediction

An end-to-end data analysis project focused on identifying significant risk factors for heart disease from a complex clinical dataset. The project involves intensive data cleaning, exploratory data analysis, and statistical analysis, with the final insights presented in a fully interactive Power BI dashboard.

📊 Final Power BI Dashboard

1. 📖 Project Overview

This project provides a comprehensive analysis of the UCI Heart Disease dataset, which aggregates clinical data from four different hospitals. The primary goal is to transform this raw, multi-source data into actionable insights. The workflow covers the entire data analysis lifecycle: from intensive data cleaning and preprocessing to exploratory data analysis (EDA) and statistical risk factor identification. The final deliverable is a dynamic Power BI dashboard designed to help healthcare professionals understand and explore the key drivers of heart disease.

2. 🎯 Business Problem

Cardiovascular diseases are a leading cause of mortality worldwide. For healthcare organizations, the ability to identify high-risk patient profiles is critical for deploying preventive strategies, optimizing resource allocation, and improving patient outcomes. This project addresses this need by analyzing clinical data to determine the most significant predictors of heart disease, thereby providing a foundation for data-informed clinical decision-making.

3. 💾 Data Source

The dataset comes from the UCI Machine Learning Repository and combines patient records from four medical centers: Cleveland, Hungary, Switzerland, and VA Long Beach.

Raw Dataset: heart_disease_uci.csv
Cleaned Dataset: cleaned_heart_disease.csv

The raw dataset contains 920 records and 16 columns, including various clinical attributes and a target variable indicating the presence of heart disease.

4. 🛠️ Tech Stack & Libraries

Programming Language: Python
Core Libraries: Pandas, NumPy, Matplotlib, Seaborn, SciPy
BI & Visualization: Microsoft Power BI
Development Environment: Jupyter Notebook

5. ⚙️ Project Workflow & Methodology

The analysis was conducted in a systematic, multi-stage process, detailed below:

Step 1: Data Acquisition & Initial Inspection

The raw dataset (heart_disease_uci.csv) was loaded into a Pandas DataFrame.
Initial data understanding was performed using .head(), .shape, .columns, .info(), and .describe(include='all') to assess the data's structure, identify missing values, and understand the initial data types.
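
The inspection calls listed above can be sketched as follows. This is a minimal sketch rather than the notebook's actual code: the three inline rows are made-up stand-ins for heart_disease_uci.csv so the snippet runs on its own, and passing `na_values=["?"]` to treat `?` as missing at load time is an assumption.

```python
import io
import pandas as pd

# Made-up stand-in rows; in the project this would be
# pd.read_csv("heart_disease_uci.csv", na_values=["?"]).
raw_csv = io.StringIO(
    "id,age,sex,dataset,trestbps,chol,num\n"
    "1,63,Male,Cleveland,145,233,0\n"
    "2,67,Male,Cleveland,160,?,2\n"
    "3,41,Female,Hungary,130,204,0\n"
)
df = pd.read_csv(raw_csv, na_values=["?"])

print(df.head())                      # first rows
print(df.shape)                       # (rows, columns)
print(df.columns.tolist())            # column names
df.info()                             # dtypes and non-null counts (prints directly)
print(df.describe(include="all"))     # summary stats for numeric and text columns

missing_per_column = df.isna().sum()  # count of missing values per column
print(missing_per_column)
```

Counting missing values per column with `isna().sum()` gives a quick view of which columns need imputation and which rows may need to be dropped.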

Step 2: Data Cleaning & Preprocessing

This was the most critical phase, involving several steps to handle the raw data's inconsistencies:

Duplicate Removal: Checked for and removed any duplicate rows using .drop_duplicates().
Column Standardization: All column names were converted to lowercase and stripped of leading/trailing spaces for consistency.
Missing Value Imputation:
- Identified columns with significant missing data (trestbps, chol, slope, ca, thal, etc.).
- Handled missing values (?, NaN, blanks) using median imputation for key numerical features (ca, thalch, oldpeak) and mode imputation for categorical features (thal, cp, slope).
- Rows with null values in critical columns (trestbps, chol) were dropped to ensure data integrity.
Data Type Correction: Corrected data types for several columns from object to numeric using pd.to_numeric after cleaning.
Feature Engineering & Encoding:
- Categorical text data was mapped to numerical formats (e.g., sex: 'Male'/'Female' to 1/0; fbs and exang: 'True'/'False' to 1/0).
- The multi-class target variable num (ranging from 0-4) was binarized into a target column where 0 represents 'No Heart Disease' and 1 represents 'Presence of Heart Disease'.
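
The cleaning steps above could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's code: `clean_heart_data` is a hypothetical helper, the list of numeric columns is assumed, and the `raw` rows are made up to exercise each step.

```python
import numpy as np
import pandas as pd


def clean_heart_data(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that mirrors the cleaning steps described above."""
    df = df.copy()
    # Standardize column names: lowercase, no surrounding whitespace.
    df.columns = df.columns.str.strip().str.lower()
    # Treat '?' and blank strings as missing, then drop duplicate rows.
    df = df.replace({"?": np.nan, "": np.nan}).drop_duplicates()

    # Convert numeric columns that were read as text (assumed column list).
    for col in ["trestbps", "chol", "thalch", "oldpeak", "ca"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Median imputation for numeric features, mode for categorical ones.
    for col in ["ca", "thalch", "oldpeak"]:
        df[col] = df[col].fillna(df[col].median())
    for col in ["thal", "cp", "slope"]:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Drop rows still missing the critical measurements.
    df = df.dropna(subset=["trestbps", "chol"])

    # Encode text categories as 1/0 and binarize the 0-4 diagnosis.
    df["sex"] = df["sex"].map({"Male": 1, "Female": 0})
    for col in ["fbs", "exang"]:
        df[col] = df[col].astype(str).str.strip().map({"True": 1, "False": 0})
    df["target"] = (df["num"] > 0).astype(int)
    return df.reset_index(drop=True)


# Made-up rows: one duplicate, several '?' or blank cells, one missing trestbps.
raw = pd.DataFrame({
    " Age ": [63, 63, 41, 57, 62],
    "Sex": ["Male", "Male", "Female", "Male", "Female"],
    "cp": ["typical angina", "typical angina", None, "asymptomatic", "asymptomatic"],
    "trestbps": ["145", "145", "130", "?", "140"],
    "chol": ["233", "233", "204", "192", "268"],
    "fbs": ["True", "True", "False", "False", "False"],
    "thalch": ["150", "150", "?", "148", "160"],
    "exang": ["False", "False", "False", "True", "True"],
    "oldpeak": ["2.3", "2.3", "1.4", "", "3.6"],
    "slope": ["downsloping", "downsloping", "upsloping", None, "flat"],
    "ca": ["0", "0", "?", "1", "2"],
    "thal": ["fixed defect", "fixed defect", "normal", None, "reversable defect"],
    "num": [0, 0, 0, 3, 3],
})
clean = clean_heart_data(raw)
print(clean)
```

Imputing before dropping rows means the medians and modes are computed from all deduplicated rows, including those later removed for missing trestbps or chol. Whether the notebook uses this order is not stated, so treat it as one possible choice.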

Step 3: Exploratory Data Analysis (EDA)

Univariate Analysis: Visualized the distribution of the target variable (Disease Frequency) and key numerical features (age, trestbps, chol, etc.) using histograms and count plots.
Bivariate Analysis: Analyzed the relationship between categorical features and the heart disease outcome using sns.countplot with a hue for the target variable.
Correlation Analysis: A heatmap was generated using sns.heatmap to visualize the correlation matrix of all numerical features, helping to identify linear relationships and potential multicollinearity.

Step 4: Statistical Risk Factor Analysis

A dual approach was used to quantify the importance of each feature:
- Numerical Features: Pearson correlation coefficient was calculated between each numerical feature and the binary target variable.
- Categorical Features: The Chi-Square test of independence (scipy.stats.chi2_contingency) was conducted to determine the statistical significance of the association between each categorical feature and the target.
The results were compiled and ranked to produce an ordered list of the features most strongly associated with heart disease.
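
The dual approach can be sketched as below. The data is synthetic and generated only so the snippet runs on its own, so its numbers are not the project's results. The column choices and the layout of the `results` table are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pearsonr

# Synthetic data for illustration only; not the project's dataset.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "target": target,
    # Built to correlate with the target.
    "oldpeak": target * 1.0 + rng.normal(0, 1, size=n),
    # Built to be unrelated to the target.
    "chol": rng.normal(240, 40, size=n),
    # Category frequencies differ between the two target groups.
    "cp": np.where(
        target == 1,
        rng.choice(["asymptomatic", "non-anginal"], size=n, p=[0.8, 0.2]),
        rng.choice(["asymptomatic", "non-anginal"], size=n, p=[0.3, 0.7]),
    ),
})

rows = []

# Numerical features: Pearson correlation with the binary target.
for col in ["oldpeak", "chol"]:
    r, p = pearsonr(df[col], df["target"])
    rows.append({"feature": col, "test": "Pearson r", "value": r, "p_value": p})

# Categorical features: Chi-Square test on the feature-by-target contingency table.
for col in ["cp"]:
    table = pd.crosstab(df[col], df["target"])
    chi2, p, dof, expected = chi2_contingency(table)
    rows.append({"feature": col, "test": "Chi-square", "value": chi2, "p_value": p})

results = pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)
print(results)
```

When one of the two variables is binary, the Pearson coefficient is the same quantity as the point-biserial correlation. A correlation coefficient and a Chi-Square statistic are on different scales, which is why this sketch sorts by p-value instead of mixing the two.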

Step 5: Data Export for Business Intelligence

The fully cleaned, processed, and engineered DataFrame was saved as cleaned_heart_disease.csv, making it ready for ingestion into Power BI.
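
A minimal sketch of the export step, assuming `DataFrame.to_csv` is used. The two rows are made up, and the file is written to a temporary directory here only so the example does not touch the project folder; in the project the path would simply be cleaned_heart_disease.csv.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the cleaned DataFrame.
cleaned = pd.DataFrame({"age": [63, 41], "sex": [1, 0], "target": [0, 1]})

# index=False keeps the pandas row index out of the file, so Power BI
# does not pick up an extra unnamed column.
out_path = os.path.join(tempfile.gettempdir(), "cleaned_heart_disease.csv")
cleaned.to_csv(out_path, index=False)

# Read the file back to confirm the export round-trips.
reloaded = pd.read_csv(out_path)
print(reloaded)
```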

6. 📊 Key Findings: Risk Factor Analysis

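The statistical analysis produced a ranking of heart disease predictors. The table below lists the numerically encoded features first, ordered by the absolute value of their Pearson correlation with the target, followed by the categorical features with their Chi-Square p-values. A p-value measures statistical significance rather than effect size, so the two groups are not directly comparable.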

| Feature | Association Type | Association Value | Strength |
| --- | --- | --- | --- |
| exang | Correlation | 0.491 | Strong |
| oldpeak | Correlation | 0.416 | Strong |
| thalch | Correlation | -0.396 | Moderate |
| sex | Correlation | 0.299 | Moderate |
| age | Correlation | 0.288 | Moderate |
| ca | Correlation | 0.210 | Moderate |
| trestbps | Correlation | 0.149 | Weak |
| chol | Correlation | -0.137 | Weak |
| fbs | Correlation | 0.120 | Weak |
| cp | Chi2 p-value | < 0.001 | Strong |
| dataset | Chi2 p-value | < 0.001 | Strong |
| thal | Chi2 p-value | < 0.001 | Strong |
| slope | Chi2 p-value | < 0.001 | Strong |
| restecg | Chi2 p-value | 0.002 | Strong |

7. 💡 Actionable Recommendations

Based on the findings, the following actions are recommended for healthcare providers:

Prioritize High-Risk Markers: Patients presenting with exercise-induced angina (exang), asymptomatic chest pain (cp), or abnormal thallium stress test (thal) results should be prioritized for further cardiovascular screening.
Enhance Diagnostic Protocols: Incorporate oldpeak and ST slope measurements from stress tests as primary indicators in risk assessment models.
Targeted Screening Programs: Implement targeted screening initiatives for the male population, particularly as they enter the 50+ age bracket, where the prevalence of heart disease increases sharply.

8. 🚀 How to Use

Prerequisites

Python 3.8+
Jupyter Notebook or JupyterLab
Microsoft Power BI Desktop

Installation & Setup

Clone the repository:

git clone https://github.com/your-username/heart-disease-analysis.git

Navigate to the project directory:
```
cd heart-disease-analysis
```

Install dependencies:

pip install pandas numpy matplotlib seaborn scipy

Run the Analysis:
- Launch Jupyter Notebook.
- Open and run all cells in heart_disease_analysis.ipynb. This will generate the cleaned_heart_disease.csv.
View the Dashboard:
- Open the project's .pbix file in Power BI Desktop.
- If prompted, update the data source to point to the newly generated cleaned_heart_disease.csv.

9. 📂 File Structure

├── heart_disease_analysis.ipynb    # Main Jupyter Notebook with all the analysis
├── heart_disease_uci.csv           # Raw input data
├── cleaned_heart_disease.csv       # Processed data for Power BI
├── Heart Disease Dashboard.pbix    # Power BI dashboard file (placeholder)
└── README.md                       # Project documentation

10. 📞 Contact

Lakshita Pagaria

Email: lakshitapagaria@gmail.com
LinkedIn: https://www.linkedin.com/in/lakshita-pagaria/
