ETL Data Processing Project

Overview

This project implements an Extract, Transform, Load (ETL) pipeline to process data from multiple file formats and prepare it for database loading.

Goal

This script aims to:

- Read CSV, JSON, and XML files
- Extract the required data from each file type
- Transform the data to the required format
- Save the transformed data in a ready-to-load format that can be loaded into an RDBMS
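
The extract step above can be sketched roughly as follows. This is a minimal illustration, not the code in etl.py: it uses only the standard library (the project itself lists pandas as a dependency), and it assumes that the JSON files hold one object per line and that each XML record has `name`, `height`, and `weight` child elements. The function names are hypothetical.

```python
import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# Column names taken from the Data Format section of this README.
FIELDS = ("name", "height", "weight")


def extract_csv(path):
    """Return one dict per CSV row, keyed by the header line."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


def extract_json(path):
    """Return one dict per line. Assumption: one JSON object per line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def extract_xml(path):
    """Return one dict per child of the root element.

    Assumption: each record has <name>, <height> and <weight> children.
    """
    root = ET.parse(path).getroot()
    return [{field: record.findtext(field) for field in FIELDS} for record in root]


EXTRACTORS = {".csv": extract_csv, ".json": extract_json, ".xml": extract_xml}


def extract_all(directory="."):
    """Collect records from every source*.csv/.json/.xml file in a directory."""
    records = []
    for path in sorted(Path(directory).glob("source*")):
        extractor = EXTRACTORS.get(path.suffix.lower())
        if extractor is not None:  # skips source.zip and other file types
            records.extend(extractor(path))
    return records
```

Values come back as read from each file (strings for CSV and XML), so numeric conversion is left to the transform step.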

Features

- Multi-format support: Handles CSV, JSON, and XML files
- Data transformation: Converts data to a standardized format
- Logging: Records each ETL operation with a timestamp
- Output generation: Writes the transformed data in CSV format
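
As an illustration of the logging feature, here is a minimal sketch built on Python's standard `logging` module. The helper name `make_logger`, the logger name `etl`, and the `timestamp,message` line layout are assumptions made for this example; the actual format written by etl.py may differ.

```python
import logging


def make_logger(log_path="log_file.txt"):
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("etl")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching a second handler on repeat calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        # Assumed line layout: "<timestamp>,<message>"
        handler.setFormatter(
            logging.Formatter("%(asctime)s,%(message)s", datefmt="%Y-%m-%d %H:%M:%S")
        )
        logger.addHandler(handler)
    return logger
```

Calling `make_logger().info("Extract phase started")` would then append one timestamped line to log_file.txt.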

Prerequisites

- Python 3.7+
- pandas library
- Virtual environment (recommended)

Installation

Clone the repository:

```
git clone https://github.com/Alex-stack-cell/etl.git
cd etl
```

Extract the source data:
```
unzip source.zip
```
This will extract the following files:
- source1.csv, source1.json, source1.xml
- source2.csv, source2.json, source2.xml
- source3.csv, source3.json, source3.xml

Create and activate virtual environment:

```
python3 -m venv venv
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows
```

Install dependencies:
```
pip install pandas
```

Usage

Ensure the virtual environment is activated:
```
source venv/bin/activate
```
Run the ETL script:
```
python3 etl.py
```
Check the output:
- Transformed data will be saved to transformed_data.csv
- Logs will be written to log_file.txt

File Structure

```
etl/
├── etl.py                 # Main ETL script
├── source.zip             # Compressed source data files
├── transformed_data.csv   # Output transformed data
├── log_file.txt           # ETL operation logs
├── .gitignore             # Git ignore rules
└── README.md              # This file
```

Note: After extracting source.zip, you'll have access to:

- source1.csv, source1.json, source1.xml
- source2.csv, source2.json, source2.xml
- source3.csv, source3.json, source3.xml

Output

The script generates:

- transformed_data.csv: Clean, standardized data ready for database loading
- log_file.txt: Detailed log of all ETL operations with timestamps
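
To show what "ready for database loading" could look like in practice, here is a hedged sketch that loads the output CSV into SQLite with the standard library. The database file `people.db`, the table name `people`, and the function name are hypothetical. The sketch also assumes that transformed_data.csv has a header row with `name`, `height`, and `weight` columns.

```python
import csv
import sqlite3


def load_csv_to_sqlite(csv_path="transformed_data.csv", db_path="people.db"):
    """Insert every CSV row into a 'people' table and return the row count."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [
            (row["name"], float(row["height"]), float(row["weight"]))
            for row in csv.DictReader(handle)
        ]

    connection = sqlite3.connect(db_path)
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "CREATE TABLE IF NOT EXISTS people (name TEXT, height REAL, weight REAL)"
            )
            connection.executemany(
                "INSERT INTO people (name, height, weight) VALUES (?, ?, ?)", rows
            )
    finally:
        connection.close()
    return len(rows)
```

SQLite is used here only because it needs no server. The same CSV could be loaded into any other RDBMS with that system's own import tooling.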

Data Format

The transformed data includes:

- name: Person's name
- height: Height in meters
- weight: Weight in kilograms
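
The README does not state the units used in the source files. Purely as an illustration, the sketch below assumes heights arrive in inches and weights in pounds, and converts them to the meters and kilograms listed above. The function name and the rounding to two decimals are also assumptions.

```python
# Exact conversion factors.
INCH_TO_METER = 0.0254
POUND_TO_KILOGRAM = 0.45359237


def transform(records):
    """Convert height and weight to meters and kilograms.

    Assumption: input heights are in inches and weights in pounds.
    """
    transformed = []
    for record in records:
        transformed.append(
            {
                "name": record["name"],
                "height": round(float(record["height"]) * INCH_TO_METER, 2),
                "weight": round(float(record["weight"]) * POUND_TO_KILOGRAM, 2),
            }
        )
    return transformed
```

If the source files use different units, only the two constants need to change.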

Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request

License

This project is part of the IBM Python for Data Engineering course.

Author

Alex Stack Cell
