An end-to-end data analysis project focused on identifying significant risk factors for heart disease from a complex clinical dataset. The project involves intensive data cleaning, exploratory data analysis, and statistical analysis, with the final insights presented in a fully interactive Power BI dashboard.
- 📖 Project Overview
- 🎯 Business Problem
- 💾 Data Source
- 🛠️ Tech Stack & Libraries
- ⚙️ Project Workflow & Methodology
- 📊 Key Findings: Risk Factor Analysis
- 💡 Actionable Recommendations
- 🚀 Getting Started
- 📂 File Structure
- 📞 Contact
This project provides a comprehensive analysis of the UCI Heart Disease dataset, which aggregates clinical data from four different hospitals. The primary goal is to transform this raw, multi-source data into actionable insights. The workflow covers the entire data analysis lifecycle: from intensive data cleaning and preprocessing to exploratory data analysis (EDA) and statistical risk factor identification. The final deliverable is a dynamic Power BI dashboard designed to help healthcare professionals understand and explore the key drivers of heart disease.
Cardiovascular diseases are a leading cause of mortality worldwide. For healthcare organizations, the ability to identify high-risk patient profiles is critical for deploying preventive strategies, optimizing resource allocation, and improving patient outcomes. This project addresses this need by analyzing clinical data to determine the most significant predictors of heart disease, thereby providing a foundation for data-informed clinical decision-making.
The dataset used is a comprehensive collection of patient records from the UCI Machine Learning Repository, combining data from four medical centers: Cleveland, Hungary, Switzerland, and the V.A. Long Beach.
- Raw Dataset: `heart_disease_uci.csv`
- Cleaned Dataset: `cleaned_heart_disease.csv`
The raw dataset contains 920 records and 16 columns, including various clinical attributes and a target variable indicating the presence of heart disease.
- Programming Language: Python
- Core Libraries: Pandas, NumPy, Matplotlib, Seaborn, SciPy
- BI & Visualization: Microsoft Power BI
- Development Environment: Jupyter Notebook
The analysis was conducted in a systematic, multi-stage process, detailed below:
- The raw dataset (`heart_disease_uci.csv`) was loaded into a Pandas DataFrame.
- Initial data understanding was performed using `.head()`, `.shape`, `.columns`, `.info()`, and `.describe(include='all')` to assess the data's structure, identify missing values, and understand the initial data types.
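The inspection step above can be sketched as follows. So that it runs on its own, the sketch reads a tiny inline sample instead of `heart_disease_uci.csv`; the rows are made up for illustration, and the column names are an assumption about the raw file.

```python
import io

import pandas as pd

# Illustrative stand-in for heart_disease_uci.csv (made-up rows, assumed column names).
# In the notebook this would be: df = pd.read_csv("heart_disease_uci.csv")
SAMPLE_CSV = """id,age,sex,dataset,cp,trestbps,chol,num
1,63,Male,Cleveland,typical angina,145,233,0
2,67,Male,Cleveland,asymptomatic,160,286,2
3,41,Female,Hungary,atypical angina,130,?,0
4,56,Male,Switzerland,asymptomatic,,0,1
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())                   # first rows
print(df.shape)                    # (rows, columns)
print(list(df.columns))            # column names
df.info()                          # dtypes and non-null counts
print(df.describe(include="all"))  # summary statistics for every column

# Missing values per column. "?" placeholders are still plain strings at this
# point, so they are counted separately from true NaN values.
missing = df.isna().sum()
placeholders = (df == "?").sum()
print(missing[missing > 0])
print(placeholders[placeholders > 0])
```

Counting `?` separately matters because a column that contains it is read as text, so `.isna()` alone would under-report the missing data.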
Data cleaning and preprocessing was the most critical phase, involving several steps to handle the raw data's inconsistencies:
- Duplicate Removal: Checked for and removed any duplicate rows using `.drop_duplicates()`.
- Column Standardization: All column names were converted to lowercase and stripped of leading/trailing spaces for consistency.
- Missing Value Imputation:
  - Identified columns with significant missing data (`trestbps`, `chol`, `slope`, `ca`, `thal`, etc.).
  - Handled missing values (`?`, `NaN`, blanks) using median imputation for key numerical features (`ca`, `thalch`, `oldpeak`) and mode imputation for categorical features (`thal`, `cp`, `slope`).
  - Rows with null values in critical columns (`trestbps`, `chol`) were dropped to ensure data integrity.
- Data Type Correction: Corrected data types for several columns from `object` to numeric using `pd.to_numeric` after cleaning.
- Feature Engineering & Encoding:
  - Categorical text data was mapped to numerical formats (e.g., `sex`: 'Male'/'Female' to `1`/`0`; `fbs` and `exang`: 'True'/'False' to `1`/`0`).
  - The multi-class target variable `num` (ranging from 0-4) was binarized into a `target` column where `0` represents 'No Heart Disease' and `1` represents 'Presence of Heart Disease'.
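A minimal sketch of the cleaning steps above, not the notebook's exact code. The rows are made up, only a subset of columns is shown, and numeric conversion is done before imputation here so that medians can be computed; the notebook may order these steps differently.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the raw file; column names are assumed.
raw = pd.DataFrame({
    " Age ": [63, 67, 67, 41, 56, 50],
    "Sex": ["Male", "Male", "Male", "Female", "Male", "Female"],
    "trestbps": ["145", "160", "160", "130", "?", "120"],
    "chol": ["233", "286", "286", "204", "236", "219"],
    "exang": ["False", "True", "True", "False", "True", "False"],
    "ca": ["0", "3", "3", "?", "1", "0"],
    "thal": ["normal", "reversable defect", "reversable defect", None, "normal", "normal"],
    "num": [0, 2, 2, 0, 1, 3],
})

# 1. Remove exact duplicate rows.
df = raw.drop_duplicates().copy()

# 2. Standardize column names: strip whitespace, lowercase.
df.columns = df.columns.str.strip().str.lower()

# 3. Treat "?" and blank strings as missing.
df = df.replace(["?", ""], np.nan)

# 4. Convert numeric columns that were read as text.
numeric_cols = ["age", "trestbps", "chol", "ca"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# 5. Impute: median for numeric features, mode for categorical features.
df["ca"] = df["ca"].fillna(df["ca"].median())
df["thal"] = df["thal"].fillna(df["thal"].mode().iloc[0])

# 6. Drop rows that are missing critical measurements.
df = df.dropna(subset=["trestbps", "chol"]).copy()

# 7. Encode text categories as 0/1. astype(str) lets the mapping work
#    whether the raw values are booleans or the strings "True"/"False".
df["sex"] = df["sex"].map({"Male": 1, "Female": 0})
df["exang"] = df["exang"].astype(str).map({"True": 1, "False": 0})

# 8. Binarize the 0-4 severity score into a disease / no-disease target.
df["target"] = (df["num"] > 0).astype(int)

print(df)
```

Any value outside a mapping dictionary becomes `NaN` after `.map`, so it is worth checking `df.isna().sum()` after the encoding step.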
- Univariate Analysis: Visualized the distribution of the target variable (Disease Frequency) and key numerical features (`age`, `trestbps`, `chol`, etc.) using histograms and count plots.
- Bivariate Analysis: Analyzed the relationship between categorical features and the heart disease outcome using `sns.countplot` with a `hue` for the target variable.
- Correlation Analysis: A heatmap was generated using `sns.heatmap` to visualize the correlation matrix of all numerical features, helping to identify linear relationships and potential multicollinearity.
- A dual approach was used to quantify the importance of each feature:
  - Numerical Features: Pearson correlation coefficient was calculated between each numerical feature and the binary `target` variable.
  - Categorical Features: The Chi-Square test of independence (`scipy.stats.chi2_contingency`) was conducted to determine the statistical significance of the association between each categorical feature and the `target`.
- The results were compiled and ranked to create a definitive list of the most impactful risk factors.
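The dual approach above can be sketched as below, assuming a cleaned data frame with a binary `target` column. The sample rows and the feature lists are made up for illustration, and with so few rows the chi-square p-values are not meaningful; the point is only the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up, already-cleaned rows for illustration only.
df = pd.DataFrame({
    "age": [63, 67, 41, 56, 50, 58, 45, 61, 39, 70, 52, 48],
    "oldpeak": [2.3, 1.5, 0.0, 0.8, 0.0, 2.6, 0.2, 1.4, 0.0, 3.1, 0.4, 1.0],
    "cp": ["asymptomatic", "asymptomatic", "atypical angina", "non-anginal",
           "atypical angina", "asymptomatic", "non-anginal", "asymptomatic",
           "atypical angina", "asymptomatic", "non-anginal", "asymptomatic"],
    "slope": ["flat", "flat", "upsloping", "upsloping", "upsloping", "flat",
              "upsloping", "flat", "upsloping", "flat", "flat", "upsloping"],
    "target": [1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1],
})

numeric_features = ["age", "oldpeak"]
categorical_features = ["cp", "slope"]

# Pearson correlation of each numeric feature with the binary target.
correlations = df[numeric_features].corrwith(df["target"])

# Chi-square test of independence between each categorical feature and the target.
p_values = {}
for col in categorical_features:
    contingency = pd.crosstab(df[col], df["target"])
    chi2, p, dof, expected = chi2_contingency(contingency)
    p_values[col] = p

# Rank numeric features by absolute correlation and categorical ones by p-value.
ranked_numeric = correlations.abs().sort_values(ascending=False)
ranked_categorical = pd.Series(p_values).sort_values()

print(ranked_numeric)
print(ranked_categorical)
```

Sorting on the absolute correlation keeps strongly negative predictors, such as `thalch` in the findings table, near the top of the ranking.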
- The fully cleaned, processed, and engineered DataFrame was saved as `cleaned_heart_disease.csv`, making it ready for ingestion into Power BI.
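A short sketch of the export step. It writes to a temporary directory so that running it has no side effects; in the notebook the file would presumably be written next to the notebook itself. The sample frame is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned DataFrame.
df = pd.DataFrame({"age": [63, 41], "sex": [1, 0], "target": [1, 0]})

# Temporary location used only for this sketch.
out_path = Path(tempfile.mkdtemp()) / "cleaned_heart_disease.csv"

# index=False keeps the pandas row index out of the file, so Power BI
# does not pick up an extra unnamed column.
df.to_csv(out_path, index=False)

# Read the file back to confirm the round trip preserved the columns.
reloaded = pd.read_csv(out_path)
print(reloaded)
```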
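The statistical analysis produced a ranking of heart disease predictors. The table below lists the features by association with `target`: Pearson correlation for numerical features and Chi-Square p-value for categorical features. A p-value reflects statistical significance rather than effect size, so the "Strong" label on the categorical rows indicates strong evidence of an association, not its magnitude.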
| Feature | Association Type | Association Value | Strength |
|---|---|---|---|
| exang | Correlation | 0.491 | Strong |
| oldpeak | Correlation | 0.416 | Strong |
| thalch | Correlation | -0.396 | Moderate |
| sex | Correlation | 0.299 | Moderate |
| age | Correlation | 0.288 | Moderate |
| ca | Correlation | 0.210 | Moderate |
| trestbps | Correlation | 0.149 | Weak |
| chol | Correlation | -0.137 | Weak |
| fbs | Correlation | 0.120 | Weak |
| cp | Chi2 p-value | < 0.001 | Strong |
| dataset | Chi2 p-value | < 0.001 | Strong |
| thal | Chi2 p-value | < 0.001 | Strong |
| slope | Chi2 p-value | < 0.001 | Strong |
| restecg | Chi2 p-value | 0.002 | Strong |
Based on the findings, the following actions are recommended for healthcare providers:
- Prioritize High-Risk Markers: Patients presenting with exercise-induced angina (`exang`), the asymptomatic chest pain type (`cp`), or abnormal thallium stress test results (`thal`) should be prioritized for further cardiovascular screening.
- Enhance Diagnostic Protocols: Incorporate `oldpeak` and ST `slope` measurements from stress tests as primary indicators in risk assessment models.
- Targeted Screening Programs: Implement targeted screening initiatives for the male population, particularly as they enter the 50+ age bracket, where the prevalence of heart disease increases sharply.
- Python 3.8+
- Jupyter Notebook or JupyterLab
- Microsoft Power BI Desktop
- Clone the repository: `git clone https://github.com/your-username/heart-disease-analysis.git`
- Navigate to the project directory: `cd heart-disease-analysis`
- Install dependencies: `pip install pandas numpy matplotlib seaborn scipy`
- Run the analysis:
  - Launch Jupyter Notebook.
  - Open and run all cells in `heart_disease_analysis.ipynb`. This generates `cleaned_heart_disease.csv`.
- View the dashboard:
  - Open the project's `.pbix` file in Power BI Desktop.
  - If prompted, update the data source to point to the newly generated `cleaned_heart_disease.csv`.
├── heart_disease_analysis.ipynb      # Main Jupyter Notebook with all the analysis
├── heart_disease_uci.csv             # Raw input data
├── cleaned_heart_disease.csv         # Processed data for Power BI
├── Heart Disease Dashboard.pbix      # Power BI dashboard file (placeholder)
└── README.md                         # Project documentation
Your Name
- Email: lakshitapagaria@gmail.com
- LinkedIn: https://www.linkedin.com/in/lakshita-pagaria/
