Before running the program you need to: (1) install the gcloud CLI from Google Cloud, (2) initialize it, and (3) set up application default credentials.
On Debian-based systems, first add the Google Cloud SDK distribution URI as a package source:
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.listImport the public key
```sh
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
```
Update the package list:
```sh
sudo apt update
```
and install the SDK:
```sh
sudo apt install google-cloud-sdk
```
To initialize it, run:
```sh
gcloud init
```
Follow the steps: you will be prompted to log in with your Google account from the browser and to select an existing project or create a new one.
Run the command:
```sh
gcloud auth application-default login
```
and add `GOOGLE_APPLICATION_CREDENTIALS=/home/{username}/.config/gcloud/application_default_credentials.json` to your environment variables.
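One way to make the variable persistent is to export it from your shell profile; a minimal sketch, assuming bash (adjust `{username}` and the init file to your setup):
```sh
# Append to ~/.bashrc (or ~/.zshrc); replace {username} with your user name
export GOOGLE_APPLICATION_CREDENTIALS=/home/{username}/.config/gcloud/application_default_credentials.json
```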
Create a virtual environment (recommended) with `python -m venv .venv` and activate it with `source .venv/bin/activate` (on Windows, `.\.venv\Scripts\activate`). Then install the requirements with `pip install -r requirements.txt`, as shown below.
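For reference, the full sequence on Linux/macOS:
```sh
python -m venv .venv             # create the virtual environment
source .venv/bin/activate        # on Windows: .\.venv\Scripts\activate
pip install -r requirements.txt  # install the project dependencies
```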
To get a list of available commands, run:
```sh
python -m spark_data_analysis --help
```
Each question can be run separately; for example, the following command runs the first two questions:
```sh
python -m spark_data_analysis questions -n 1 2
```
To run all questions, use the following command:
```sh
python -m spark_data_analysis questions -all
```
You can choose to run either the questions or the Spark Streaming application. To run the Spark Streaming demo, use the following command:
```sh
python -m spark_data_analysis -p 1 streaming
```
`-p 1` tells the program to use only one part of the dataset.
To run the Spark Streaming application you need Kafka running; for this you can use the docker-compose.yml file provided in the docker directory. Starting from Kafka 3.3, ZooKeeper is no longer required to run a Kafka broker, so we omit it from our compose file. We also use the native image of Kafka, which is faster, though less feature-rich.
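For reference, a minimal single-node KRaft setup could look like the sketch below; the actual file in the docker directory may differ, and the image tag here is an assumption:
```yaml
# Hypothetical sketch of a single-node KRaft broker using the native image;
# the compose file shipped in the docker directory may differ.
services:
  kafka:
    image: apache/kafka-native:3.8.0  # GraalVM native image; defaults to a single-node KRaft broker
    ports:
      - "9092:9092"                   # broker reachable on localhost:9092
```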
You can start the services manually with the following command:
```sh
docker compose up -d
```
By default, however, the demo starts the services for you automatically unless you specify the `--no-docker` flag:
```sh
python -m spark_data_analysis streaming --no-docker
```
To compile the report, use the following command:
```sh
python -m spark_data_analysis --report
```