This project implements an Extract, Transform, Load (ETL) pipeline to process data from multiple file formats and prepare it for database loading.
This script aims to:
- Read CSV, JSON, and XML file types
- Extract the required data from different file types
- Transform data to the required format
- Save the transformed data in a ready-to-load format, which can be loaded into an RDBMS
Features:
- Multi-format support: Handles CSV, JSON, and XML files
- Data transformation: Converts data to a standardized format
- Logging: Records ETL operations with timestamps
- Output generation: Writes the transformed data in CSV format
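
To illustrate the multi-format extraction described above, here is a minimal sketch that uses only the Python standard library. It is not the code in etl.py, which relies on pandas and may be organized differently. The function names, the `FIELDS` tuple, the one-object-per-line JSON layout, and the XML layout (one record element per person with `name`, `height`, and `weight` children) are all assumptions made for this example.

```python
import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed column names, taken from the output fields listed in this README.
FIELDS = ("name", "height", "weight")


def extract_csv(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return [{k: row[k] for k in FIELDS} for row in csv.DictReader(f)]


def extract_json(path):
    """Read a JSON file, assuming one JSON object per line."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                rows.append({k: record[k] for k in FIELDS})
    return rows


def extract_xml(path):
    """Read an XML file, assuming each child of the root is one record."""
    root = ET.parse(path).getroot()
    return [{k: record.findtext(k) for k in FIELDS} for record in root]


# Map each supported file extension to its reader.
EXTRACTORS = {".csv": extract_csv, ".json": extract_json, ".xml": extract_xml}


def extract(path):
    """Pick the reader based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported file type: {suffix}")
    return EXTRACTORS[suffix](path)
```

Under these assumptions, `extract("source1.csv")` would return one dict per row. CSV and XML values come back as strings while JSON values keep their JSON types, so a later transformation step would need to convert them to numbers.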
Requirements:
- Python 3.7+
- pandas library
- Virtual environment (recommended)
Installation:
- Clone the repository:
    git clone https://github.com/Alex-stack-cell/etl.git
    cd etl
- Extract the source data:
    unzip source.zip
  This will extract the following files:
    source1.csv, source1.json, source1.xml
    source2.csv, source2.json, source2.xml
    source3.csv, source3.json, source3.xml
- Create and activate a virtual environment:
    python3 -m venv venv
    source venv/bin/activate   # On macOS/Linux
    venv\Scripts\activate      # On Windows
- Install dependencies:
    pip install pandas
Usage:
- Ensure the virtual environment is activated:
    source venv/bin/activate
- Run the ETL script:
    python3 etl.py
- Check the output:
  - Transformed data will be saved to transformed_data.csv
  - Logs will be written to log_file.txt
Project structure:
etl/
├── etl.py # Main ETL script
├── source.zip # Compressed source data files
├── transformed_data.csv # Output transformed data
├── log_file.txt # ETL operation logs
├── .gitignore # Git ignore rules
└── README.md # This file
Note: After extracting source.zip, you'll have access to:
    source1.csv, source1.json, source1.xml
    source2.csv, source2.json, source2.xml
    source3.csv, source3.json, source3.xml
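
As a rough sketch of how the extracted files might be discovered and grouped by format before processing, the helper below scans a directory for files named `source*`. The function name `group_sources` and the glob pattern are illustrative assumptions, not part of etl.py.

```python
from collections import defaultdict
from pathlib import Path

# File extensions the pipeline is described as supporting.
SUPPORTED = {".csv", ".json", ".xml"}


def group_sources(directory="."):
    """Return a dict mapping each supported extension to its sorted file names."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob("source*")):
        # source.zip also matches the pattern, so filter on the extension.
        if path.suffix in SUPPORTED:
            groups[path.suffix].append(path.name)
    return dict(groups)
```

Grouping by extension keeps the per-format readers separate, so each group can be passed to the reader that matches its format.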
The script generates:
- transformed_data.csv: Clean, standardized data ready for database loading
- log_file.txt: Detailed log of all ETL operations with timestamps
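
A timestamped log like the one described could be produced with a small helper along these lines. This is a sketch only: the name `log_progress`, the timestamp format, and the comma-separated line layout are assumptions, and the actual format written by etl.py may differ.

```python
from datetime import datetime


def log_progress(message, log_file="log_file.txt"):
    """Append one line of the form '<timestamp>,<message>' to the log file."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(f"{timestamp},{message}\n")
```

Calling `log_progress("Extract phase started")` before each phase and a matching call after it would leave a record of when each ETL step began and ended.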
The transformed data includes:
- name: Person's name
- height: Height in meters
- weight: Weight in kilograms
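
The unit standardization could look roughly like the sketch below. This README does not state the units used in the source files, so the example assumes heights in inches and weights in pounds. That assumption, the function name `transform`, and the rounding to two decimals are illustrative only. The conversion factors themselves are exact by definition: 1 inch = 0.0254 m and 1 pound = 0.45359237 kg.

```python
INCH_TO_M = 0.0254       # exact: 1 inch = 0.0254 meters
LB_TO_KG = 0.45359237    # exact: 1 pound = 0.45359237 kilograms


def transform(rows):
    """Convert height to meters and weight to kilograms.

    Assumes each input row has 'name', 'height' (inches) and
    'weight' (pounds); numeric values may be strings or numbers.
    """
    transformed = []
    for row in rows:
        transformed.append({
            "name": row["name"],
            "height": round(float(row["height"]) * INCH_TO_M, 2),
            "weight": round(float(row["weight"]) * LB_TO_KG, 2),
        })
    return transformed
```

Converting through `float` first lets the same function accept string values read from CSV or XML as well as numbers read from JSON.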
To contribute:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is part of the IBM Python for Data Engineering course.
Author: Alex Stack Cell