This repository contains a real-time pipeline that ingests data from a live stream of events and deploys a streaming application to aggregate those events in real time.
The web application feeds event data into Amazon Kinesis Data Streams; the raw data is delivered to an S3 bucket by Amazon Kinesis Data Firehose, and the aggregated data is stored in a subfolder on Amazon S3 (see diagram below).
- Clone this repository.
- Set up an AWS account and download your access key and secret key from AWS IAM.
- Configure the AWS CLI in your terminal by running `aws configure` and entering your credentials.
- Navigate to `IAC` in your terminal and run `terraform init` and `terraform apply` to set up the AWS infrastructure.
Step 1: Navigate to Amazon Kinesis Data Streams, select `raw_data_stream`, and click "Process data in real time"
Step 2: Create an Apache Flink - Studio Notebook
Step 3: Select a database (this was already created by Terraform)
Step 4: Run the notebook and click Open Apache Zeppelin
Step 5: In the notebook, run the `CREATE TABLE raw_data_table` statement from the `sql_queries` file in the repo
Step 6: Run the `producer.py` file to produce data
Step 7: Run the aggregation query from the `sql_queries` file in the repo
Step 8: Run the `CREATE TABLE aggregated_data_table` and `INSERT INTO aggregated_data_table` statements from the `sql_queries` file
Step 9: Create a Firehose delivery stream to connect to S3, specifying the source and the directory structure
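The actual statements for steps 5, 7, and 8 live in the repo's `sql_queries` file. As a hedged illustration only, a Kinesis-backed Flink table plus a windowed aggregation in Zeppelin typically look like the sketch below — the column names, stream names, window size, and region are assumptions, not the repo's actual schema:

```sql
%flink.ssql

-- Step 5: source table over the raw Kinesis stream (schema assumed)
CREATE TABLE raw_data_table (
    event_id        STRING,
    user_id         STRING,
    event_type      STRING,
    event_timestamp TIMESTAMP(3),
    WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kinesis',
    'stream' = 'raw_data_stream',
    'aws.region' = 'us-east-1',        -- adjust to your region
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json'
);

-- Step 8: sink table for the aggregates (stream name assumed)
CREATE TABLE aggregated_data_table (
    event_type   STRING,
    window_start TIMESTAMP(3),
    event_count  BIGINT
) WITH (
    'connector' = 'kinesis',
    'stream' = 'aggregated_data_stream',
    'aws.region' = 'us-east-1',
    'format' = 'json'
);

-- Steps 7-8: count events per type over one-minute tumbling windows
INSERT INTO aggregated_data_table
SELECT
    event_type,
    TUMBLE_START(event_timestamp, INTERVAL '1' MINUTE),
    COUNT(*)
FROM raw_data_table
GROUP BY event_type, TUMBLE(event_timestamp, INTERVAL '1' MINUTE);
```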
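For step 6, the repo's `producer.py` is the source of truth; the following is a minimal sketch of what a Kinesis producer built on `boto3` looks like, assuming the `raw_data_stream` created by Terraform and an illustrative event schema:

```python
import json
import time
import uuid

STREAM_NAME = "raw_data_stream"  # stream created by the Terraform config


def build_event(user_id: str, event_type: str) -> dict:
    """Assemble one event record; the field names here are illustrative."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "event_type": event_type,
        "event_timestamp": int(time.time() * 1000),  # epoch millis
    }


def send_event(event: dict) -> None:
    """JSON-encode one event and put it onto the Kinesis stream."""
    import boto3  # requires credentials configured via `aws configure`

    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],  # keeps a user's events on one shard
    )


# usage (needs AWS credentials and the stream to exist):
#   for i in range(100):
#       send_event(build_event(f"user-{i % 3}", "click"))
```

Partitioning by `user_id` is one common choice; any key that spreads traffic evenly across shards works.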

